Chapter 18 Reinforcement Learning
Hands-On Machine Learning with Scikit-Learn, Keras & Tensorflow 2nd Edition by A. Geron
Reinforcement Learning has been around since the 1950s, producing many interesting applications over the years, particularly in games (e.g., TD-Gammon, a Backgammon-playing program) and in machine control, but seldom making the headline news.
But a revolution took place in 2013, when researchers from a British startup called DeepMind demonstrated a system that could learn to play just about any Atari game from scratch (https://homl.info/dqn), eventually outperforming humans (https://homl.info/dqn2) in most of them, using only raw pixels as inputs and without any prior knowledge of the rules of the games.
This was the first of a series of amazing feats, culminating in March 2016 with the victory of their system AlphaGo against Lee Sedol, a legendary professional player of the game of Go, and in May 2017 against Ke Jie, the world champion.
No program had ever come close to beating a master of this game, let alone the world champion.
Today the whole field of RL is boiling with new ideas, with a wide range of applications.
So how did DeepMind achieve all this?
In this chapter we will first explain what Reinforcement Learning is and what it's good at, then present two of the most important techniques in Deep Reinforcement Learning: policy gradients and deep Q-networks (DQNs), including a discussion of Markov decision processes (MDPs).
We will use these techniques to train models to balance a pole on a moving cart; then I'll introduce the TF-Agents library, which provides state-of-the-art algorithms that greatly simplify building powerful RL systems, and we will use the library to train an agent to play Breakout, the famous Atari game.
I'll close the chapter by taking a look at some of the latest advances in the field.
Learning to Optimize Rewards
In Reinforcement Learning, a software agent makes observations and takes actions within an environment, and in return it receives rewards.
Its objective is to learn to act in a way that will maximize its expected rewards over time.
If you don't mind a bit of anthropomorphism, you can think of positive rewards as pleasure, and negative rewards as pain (the term "reward" is a bit misleading in this case).
In short, the agent acts in the environment and learns by trial and error to maximize its pleasure and minimize its pain.
This is quite a broad setting, which can apply to a wide variety of tasks.
Here are a few examples (see Figure 18-1):
a. The agent can be the program controlling a robot.
In this case, the environment is the real world, the agent observes the environment through a set of sensors such as cameras and touch sensors, and its actions consist of sending signals to activate motors.
It may be programmed to get positive rewards whenever it approaches the target destination, and negative rewards whenever it wastes time or goes in the wrong direction.
b. The agent can be the program controlling Ms. Pac-Man.
In this case, the environment is a simulation of the Atari game, the actions are the nine possible joystick positions (upper left, down, center, and so on), the observations are screenshots, and the rewards are just the game points.
c. Similarly, the agent can be the program playing a board game such as Go.
d. The agent does not have to control a physically (or virtually) moving thing.
For example, it can be a smart thermostat, getting positive rewards whenever it is close to the target temperature and saves energy, and negative rewards when humans need to tweak the temperature, so the agent must learn to anticipate human needs.
e. The agent can observe stock market prices and decide how much to buy or sell every second.
Rewards are obviously the monetary gains and losses.
Note that there may not be any positive rewards at all; for example, the agent may move around in a maze, getting a negative reward at every time step, so it had better find the exit as quickly as possible!
There are many other examples of tasks to which Reinforcement Learning is well suited, such as self-driving cars, recommender systems, placing ads on a web page, or controlling where an image classification system should focus its attention.
The algorithm a software agent uses to determine its actions is called its policy.
The policy could be a neural network taking observations as inputs and outputting the action to take (see Figure 18-2).
The policy can be any algorithm you can think of, and it does not have to be deterministic.
In fact, in some cases it does not even have to observe the environment!
For example, consider a robotic vacuum cleaner whose reward is the amount of dust it picks up in 30 minutes.
Its policy could be to move forward with some probability p every second, or randomly rotate left or right with probability 1 - p.
The rotation angle would be a random angle between -r and +r.
Since this policy involves some randomness, it is called a stochastic policy.
The robot will have an erratic trajectory, which guarantees that it will eventually get to any place it can reach and pick up all the dust.
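To make this concrete, here is a minimal sketch of such a stochastic policy in Python (the action names and default parameter values are illustrative, not from the book):

```python
import random

def vacuum_policy(p=0.7, r=45.0):
    """Stochastic policy: move forward with probability p,
    otherwise rotate by a random angle in [-r, +r] degrees."""
    if random.random() < p:
        return ("forward", 0.0)
    return ("rotate", random.uniform(-r, r))

# Sample a few actions; the sequence is random by design
for _ in range(3):
    print(vacuum_policy())
```

Note that the policy has no inputs at all: it never observes the environment, yet it can still be evaluated by the reward it collects.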
The question is, how much dust will it pick up in 30 minutes?
How would you train such a robot?
There are just two policy parameters you can tweak: the probability p and the angle range r.
One possible learning algorithm could be to try out many different values for these parameters, and pick the combination that performs best (see Figure 18-3).
This is an example of policy search, in this case using a brute force approach.
When the policy space is too large (which is generally the case), finding a good set of parameters this way is like searching for a needle in a gigantic haystack.
Another way to explore the policy space is to use genetic algorithms.
For example, you could randomly create a first generation of 100 policies and try them out, then "kill" the 80 worst policies and make the 20 survivors produce 4 offspring each.
An offspring is a copy of its parent plus some random variation.
The surviving policies plus their offspring together constitute the second generation.
You can continue to iterate through generations this way until you find a good policy.
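That generational loop can be sketched as follows; the fitness function here is a hypothetical stand-in (a real one would run each policy in the environment and measure the dust collected), and the mutation scales are illustrative:

```python
import random

def fitness(policy):
    # Stand-in for "dust picked up in 30 minutes"; a real implementation
    # would simulate the robot. Hypothetical optimum at p=0.7, r=30.
    p, r = policy
    return -(p - 0.7) ** 2 - ((r - 30.0) / 100.0) ** 2

def mutate(policy):
    # An offspring is a copy of its parent plus some random variation
    p, r = policy
    return (min(max(p + random.gauss(0, 0.05), 0.0), 1.0),
            max(r + random.gauss(0, 5.0), 0.0))

# First generation: 100 random policies (p, r)
population = [(random.random(), random.uniform(0.0, 180.0)) for _ in range(100)]
for generation in range(10):
    population.sort(key=fitness, reverse=True)
    survivors = population[:20]                      # "kill" the 80 worst
    offspring = [mutate(s) for s in survivors for _ in range(4)]
    population = survivors + offspring               # next generation: 100 policies

best = max(population, key=fitness)
```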
Yet another approach is to use optimization techniques, by evaluating the gradients of the rewards with regard to the policy parameters, then tweaking these parameters by following the gradients toward higher rewards.
We will discuss this approach, called policy gradients (PG), in more detail later in this chapter.
Going back to the vacuum cleaner robot, you could slightly increase p and evaluate whether doing so increases the amount of dust picked up by the robot in 30 minutes; if it does, then increase p some more, or else reduce p.
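That trial-and-adjust idea amounts to a greedy one-parameter search; here is a minimal sketch, where the evaluate function is a hypothetical stand-in for measuring the dust collected:

```python
def hill_climb(evaluate, p=0.5, step=0.05, iters=50):
    """Greedily adjust p: keep moving while the score improves;
    otherwise reverse direction and shrink the step."""
    best_score = evaluate(p)
    for _ in range(iters):
        candidate = min(max(p + step, 0.0), 1.0)   # keep p a valid probability
        score = evaluate(candidate)
        if score > best_score:
            p, best_score = candidate, score
        else:
            step = -step / 2                        # overshot: back up more finely
    return p

# Hypothetical score function with an optimum at p = 0.7
best_p = hill_climb(lambda p: -(p - 0.7) ** 2)
```

Unlike true policy gradients, this uses no gradient information, only before/after comparisons, so it scales poorly beyond a few parameters.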
We will implement a popular PG algorithm using TensorFlow, but before we do, we need to create an environment for the agent to live in - so it's time to introduce OpenAI Gym.
Introduction to OpenAI Gym
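Assuming OpenAI Gym is installed (pip install gym), creating an environment looks like the following sketch; note that older Gym versions return just the observation from reset(), while newer ones return an (observation, info) tuple, so the code handles both:

```python
import gym

# Create the CartPole environment: a cart that must balance a pole
env = gym.make("CartPole-v1")

# reset() initializes the environment and returns the first observation
result = env.reset()
obs = result[0] if isinstance(result, tuple) else result
print(obs)  # a 1D NumPy array of four floats
```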
Here, we've created a CartPole environment.
This is a 2D simulation in which a cart can be accelerated left or right in order to balance a pole placed on top of it (see Figure 18-4).
You can get the list of all available environments by running gym.envs.registry.all().
After the environment is created, you must initialize it using the reset() method.
This returns the first observation.
Observations depend on the type of environment.
For the CartPole environment, each observation is a 1D NumPy array containing four floats: these floats represent the cart's horizontal position (0.0 = center), its velocity (positive means right), the angle of the pole (0.0 = vertical), and its angular velocity (positive means clockwise).
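The next section refers to a hardcoded policy; a simple example for CartPole, using the observation layout just described, is one that accelerates toward the side the pole is leaning:

```python
def basic_policy(obs):
    # obs = [position, velocity, angle, angular velocity]
    angle = obs[2]
    return 0 if angle < 0 else 1  # push left if the pole leans left, else right
```

This naive policy keeps the pole up for a while, but it tends to oscillate more and more until the cart drifts off screen, which motivates learning a better policy.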
Neural Network Policies
Let's create a neural network policy.
Just like with the policy we hardcoded earlier, this neural network will take an observation as input, and it will output the action to be executed.
More precisely, it will estimate a probability for each action, and then we will select an action randomly, according to the estimated probabilities (see Figure 18-5).
In the case of the CartPole environment, there are just two possible actions (left or right), so we only need one output neuron.
It will output the probability p of action 0 (left), and of course the probability of action 1 (right) will be 1 - p.
For example, if it outputs 0.7, then we will pick action 0 with 70% probability, or action 1 with 30% probability.
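The book builds this policy with tf.keras; to show the idea without heavy dependencies, here is an equivalent sketch in plain NumPy (the weights are random and untrained, and the hidden layer size is illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)

# Tiny policy network: 4 observation inputs -> 5 hidden units -> 1 output
W1 = rng.normal(scale=0.1, size=(4, 5)); b1 = np.zeros(5)
W2 = rng.normal(scale=0.1, size=(5, 1)); b2 = np.zeros(1)

def policy_proba(obs):
    """Probability p of taking action 0 (push left)."""
    hidden = np.maximum(0.0, obs @ W1 + b1)    # ReLU hidden layer
    logit = hidden @ W2 + b2
    return 1.0 / (1.0 + np.exp(-logit[0]))     # sigmoid output

def select_action(obs):
    """Sample action 0 with probability p, action 1 with probability 1 - p."""
    return 0 if rng.random() < policy_proba(obs) else 1
```

Sampling from the output probability, rather than always taking the most likely action, is what gives the agent its exploration behavior, as discussed next.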
You may wonder why we are picking a random action based on the probabilities given by the neural network, rather than just picking the action with the highest score.
This approach lets the agent find the right balance between exploring new actions and exploiting the actions that are known to work well.
Here's an analogy: suppose you go to a restaurant for the first time, and all the dishes look equally appealing, so you randomly pick one.
If it turns out to be good, you can increase the probability that you'll order it next time, but you shouldn't increase that probability up to 100%, or else you will never try out the other dishes, some of which may be even better than the one you tried.
Also note that in this particular environment, the past actions and observations can safely be ignored, since each observation contains the environment's full state.
If there were some hidden state, then you might need to consider past actions and observations as well.
For example, if the environment only revealed the position of the cart but not velocity, you would have to consider not only the current observation but also the previous observation in order to estimate the current velocity.
Another example is when the observations are noisy; in that case, you generally want to use the past few observations to estimate the most likely current state.
The CartPole problem is thus as simple as can be; the observations are noise-free, and they contain the environment's full state.
Evaluating Actions: The Credit Assignment Problem
Markov Decision Processes
Temporal Difference Learning
Approximate Q-Learning and Deep Q-Learning
Implementing Deep Q-Learning
Deep Q-Learning Variants
Fixed Q-Value Targets
Prioritized Experience Replay
The TF-Agents Library
Environment Wrappers and Atari Preprocessing
Creating the Deep Q-Network
Creating the DQN Agent
Creating the Replay Buffer and the Corresponding Observer
Creating Training Metrics
Creating the Collect Driver
Creating the Dataset
Creating the Training Loop
Overview of Some Popular RL Algorithms