Learning & Planning
Lecture 8: Integrating Learning and Planning
A more conceptual lecture.
- Introduction
- Model based RL
- Integrated Architectures
- Simulation Based Search
Model-Based RL
- Lecture 7: Learn policy directly from experience
- Lecture 4-6: Learn value function directly from experience
- This lecture: learn model directly from experience
- The model describes an agent's understanding of the environment.
- Use planning to construct a value function or policy
- Integrate learning and planning into a single architecture
Taxonomy
- Model-free RL
- No model
- Learn value function / policy function from experience
- Model-based RL
- Learn a model from experience
- Plan value function / policy function from the model. Use the model to look ahead, i.e. to think through future states and rewards without acting in the real environment.
The learning diagram for Model-based RL is as follows:
graph LR; A("value/policy") -- acting --> B("experience"); B("experience") -- model learning --> C("model"); C("model") -- planning --> A("value/policy");
Advantages of Model-Based RL
- Can efficiently learn model by supervised learning methods
- Consider a game like chess, where the model (the rules of the game) is quite simple
- But the value function is very hard to learn, because moving one piece can dramatically change the value of a position
- So in some cases, the model is a more compact and useful representation of the problem
- The model simply predicts the next state and reward given the current state and action, so the environment (or game engine) acts as the teacher providing the targets
- Can reason about model uncertainty
Disadvantage of Model-Based RL:
- First learn a model, then construct a value function, which gives us two sources of approximation error
What is a Model?
A model $\mathcal{M}_\eta$ is a representation of an MDP $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R} \rangle$, parametrized by $\eta$.
- Assume the state space $\mathcal{S}$ and action space $\mathcal{A}$ are known
- So a model $\mathcal{M}_\eta = \langle \mathcal{P}_\eta, \mathcal{R}_\eta \rangle$ represents the state transitions $\mathcal{P}_\eta \approx \mathcal{P}$ and rewards $\mathcal{R}_\eta \approx \mathcal{R}$
- And we can use the model to sample the next state and reward: $S_{t+1} \sim \mathcal{P}_\eta(S_{t+1} \mid S_t, A_t)$, $R_{t+1} = \mathcal{R}_\eta(R_{t+1} \mid S_t, A_t)$
- Note that we just need to advance one time step to start learning
- It is typical to assume conditional independence between state transitions and rewards, i.e. $\mathbb{P}[S_{t+1}, R_{t+1} \mid S_t, A_t] = \mathbb{P}[S_{t+1} \mid S_t, A_t]\, \mathbb{P}[R_{t+1} \mid S_t, A_t]$
Model Learning
The goal is to estimate the model $\mathcal{M}_\eta$ from experience $\{S_1, A_1, R_2, \dots, S_T\}$
- Observe that this is just a supervised learning problem: each transition provides a training example $S_t, A_t \to R_{t+1}, S_{t+1}$
- Learning $s, a \to r$ is a regression problem
- Learning $s, a \to s'$ is a density estimation problem
- Pick our favourite loss function, e.g. MSE, KL divergence
- Pick parameters $\eta$ that minimize the empirical loss
Examples of models:
- Table lookup model
- Linear expectation model (a minimal sketch follows this list)
- Linear Gaussian model
- Gaussian process model
- Deep belief network model
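As a minimal sketch of the linear expectation model from the list above (not code from the lecture): fit the expected next state and reward as linear functions of the state-action features by least squares, i.e. minimising an MSE loss. The function names and array shapes are illustrative assumptions.

```python
import numpy as np

# Sketch: linear expectation model. Predict E[s'] and E[r] as linear
# functions of (s, a), fit by least squares on logged transitions.
def fit_linear_model(states, actions, rewards, next_states):
    """states: (N, ds), actions: (N, da), rewards: (N,), next_states: (N, ds)."""
    X = np.hstack([states, actions, np.ones((len(states), 1))])  # features + bias
    W_s, *_ = np.linalg.lstsq(X, next_states, rcond=None)        # transition parameters
    w_r, *_ = np.linalg.lstsq(X, rewards, rcond=None)            # reward parameters
    return W_s, w_r

def predict(W_s, w_r, s, a):
    x = np.concatenate([s, a, [1.0]])
    return x @ W_s, float(x @ w_r)   # expected next state, expected reward
```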
Table Lookup Model
Our model is an explicit MDP, i.e. estimates $\hat{\mathcal{P}}$ and $\hat{\mathcal{R}}$ of the true transition and reward functions.
- We simply count the number of visits $N(s, a)$ to each state-action pair and record the empirical transition probabilities and mean rewards: $\hat{\mathcal{P}}^a_{ss'} = \frac{1}{N(s, a)} \sum_{t=1}^{T} \mathbf{1}(S_t = s, A_t = a, S_{t+1} = s')$ and $\hat{\mathcal{R}}^a_{s} = \frac{1}{N(s, a)} \sum_{t=1}^{T} \mathbf{1}(S_t = s, A_t = a)\, R_{t+1}$
- An alternative is to do this in a non-parametric way:
- At each time step $t$, record the experience tuple $\langle S_t, A_t, R_{t+1}, S_{t+1} \rangle$
- To sample from the model, each time we are in state $s$ and take action $a$, we randomly pick a recorded tuple matching $\langle s, a, \cdot, \cdot \rangle$ (see the sketch below)
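Below is a minimal sketch of a table lookup model in the non-parametric style just described: store every observed $\langle s, a, r, s' \rangle$ tuple, which gives us $N(s, a)$, the empirical transition probabilities and mean rewards, and a way to sample transitions. The class and method names are illustrative, not from the lecture.

```python
import random
from collections import defaultdict

# Sketch of a table-lookup model: keep every observed (r, s') outcome per
# (s, a) pair, so we can estimate probabilities and sample transitions.
class TableLookupModel:
    def __init__(self):
        self.outcomes = defaultdict(list)   # (s, a) -> list of (r, s') seen

    def update(self, s, a, r, s_next):
        self.outcomes[(s, a)].append((r, s_next))

    def n(self, s, a):
        return len(self.outcomes[(s, a)])   # N(s, a)

    def transition_prob(self, s, a, s_next):
        hits = sum(1 for _, sn in self.outcomes[(s, a)] if sn == s_next)
        return hits / self.n(s, a)

    def expected_reward(self, s, a):
        return sum(r for r, _ in self.outcomes[(s, a)]) / self.n(s, a)

    def sample(self, s, a):
        # Non-parametric sampling: pick a recorded tuple matching (s, a) at random.
        return random.choice(self.outcomes[(s, a)])
```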
AB Example
Suppose we have eight episodes of experience (each line below is one episode):
A, 0, B, 0
B, 1
B, 1
B, 1
B, 1
B, 1
B, 1
B, 0
From this data, we learn a model using the one-step supervised learning method described above:
graph LR; A("A") -- , 100% --> B("B"); B("B") -- , 75% --> C("END"); B("B") -- , 25% --> D("END");
Planning with a Model
Given a model $\mathcal{M}_\eta = \langle \mathcal{P}_\eta, \mathcal{R}_\eta \rangle$, we now want to solve the MDP $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}_\eta, \mathcal{R}_\eta \rangle$. We can use our favourite planning algorithm:
- Value iteration
- Policy iteration
- Tree search
One of the simplest approaches, but also one of the most powerful, is sample-based planning. The idea is to use the model only to generate samples. We sample experience from the model: $S_{t+1} \sim \mathcal{P}_\eta(S_{t+1} \mid S_t, A_t)$, $R_{t+1} = \mathcal{R}_\eta(R_{t+1} \mid S_t, A_t)$
We can then apply model-free RL to the samples, e.g. using
- Monte Carlo control
- Sarsa
- Q-learning, etc.
Sample-based planning methods are often more efficient.
- Sampling from the model is efficient because it concentrates computation on high-probability events, compared to full-width lookahead, which must consider every possible successor. A sketch of sample-based planning with Q-learning follows.
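Here is a minimal sketch of sample-based planning, reusing the table lookup model sketched earlier: draw imagined transitions from the model and apply ordinary Q-learning updates to them, never touching the real environment. The function signature and the `terminal` marker are illustrative assumptions.

```python
import random

# Sketch: sample-based planning. Use the model only to generate samples,
# then run model-free Q-learning on those samples.
def plan_with_samples(model, q, actions, n_updates, alpha=0.1, gamma=1.0, terminal="END"):
    """model: TableLookupModel from above; q: dict mapping (s, a) -> value."""
    observed = list(model.outcomes.keys())       # (s, a) pairs seen in real experience
    for _ in range(n_updates):
        s, a = random.choice(observed)           # start from any previously seen pair
        r, s_next = model.sample(s, a)           # imagined transition from the model
        if s_next == terminal:
            target = r                           # no bootstrap from terminal states
        else:
            target = r + gamma * max(q.get((s_next, b), 0.0) for b in actions)
        q_sa = q.get((s, a), 0.0)
        q[(s, a)] = q_sa + alpha * (target - q_sa)   # standard Q-learning update
```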
Back To AB Example
After we have built the model, we can sample experience from it, e.g.
B, 1
B, 0
B, 1
A, 0, B, 1
B, 1
A, 0, B, 1
B, 1
B, 0
Based on this sampled data, we apply Monte Carlo learning and end up with $V(A) = 1$, $V(B) = 0.75$.
Planning with an Inaccurate Model
Given an imperfect model $\langle \mathcal{P}_\eta, \mathcal{R}_\eta \rangle \neq \langle \mathcal{P}, \mathcal{R} \rangle$, we need to remember that the performance of model-based RL is limited to the optimal policy for the approximate MDP $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}_\eta, \mathcal{R}_\eta \rangle$.
- i.e. model-based RL is only as good as the estimated model
- When the model is inaccurate, the planning process will compute a suboptimal policy
Solutions:
- When the model is wrong, use model-free RL
- Reason explicitly about model uncertainty
Integrated Architectures
Bring together the best of model-based and model-free architectures. We consider two sources of experience:
- Real experience, sampled from the environment (the true MDP): $S' \sim \mathcal{P}^a_{ss'}$, $R = \mathcal{R}^a_s$
- Simulated experience, sampled from the model (the approximate MDP): $S' \sim \mathcal{P}_\eta(S' \mid S, A)$, $R = \mathcal{R}_\eta(R \mid S, A)$
Some taxonomy:
- Model-free RL:
- No model
- Learn value function (or policy) from real experience
- Model-based RL
- Learn a model from real experience
- Plan value function (or policy) from simulated experience
- Dyna
- Learn a model from real experience
- Learn and plan value function (or policy) from real and simulated experience
The Dyna architecture looks like this:
graph LR; A("value/policy") -- acting --> B("experience"); B("experience") -- model learning --> C("model"); C("model") -- planning --> A("value/policy"); B("experience") -- direct RL --> A("value/policy");
Dyna-Q Algorithm
- Initialize $Q(s, a)$ and $Model(s, a)$ for all $s \in \mathcal{S}$ and $a \in \mathcal{A}(s)$
- Do forever:
- Step 1: $S \leftarrow$ current (non-terminal) state
- Step 2: $A \leftarrow \epsilon\text{-greedy}(S, Q)$
- Step 3: Execute action $A$; observe reward $R$ and state $S'$
- Step 4: $Q(S, A) \leftarrow Q(S, A) + \alpha \left[ R + \gamma \max_a Q(S', a) - Q(S, A) \right]$
- Step 5: $Model(S, A) \leftarrow R, S'$ (assuming a deterministic environment)
- Step 6: Repeat $n$ times: pick a random previously observed state $S$ and a random action $A$ previously taken in $S$; set $R, S' \leftarrow Model(S, A)$; update $Q(S, A) \leftarrow Q(S, A) + \alpha \left[ R + \gamma \max_a Q(S', a) - Q(S, A) \right]$
Note that:
- Step 4 is the standard Q-learning update step
- Step 5 updates the model by supervised learning (in the tabular, deterministic case this is simply recording the observed reward and next state; more generally, e.g. an SGD update)
- Step 6 is the thinking/planning step, where we "imagine" transitions using our model and update $Q$ $n$ times, without actually moving our agent in the real world. A minimal code sketch of the full loop follows.
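A minimal sketch of tabular Dyna-Q following the steps above, assuming a deterministic environment exposed through a hypothetical `env` with `reset()` and `step(a)` returning `(next_state, reward, done)`; that interface and the hyperparameters are assumptions for illustration, not part of the lecture.

```python
import random

# Sketch: tabular Dyna-Q. Real experience drives both a Q-learning update
# and a model update; n extra planning updates use the model only.
def dyna_q(env, actions, episodes=100, n_planning=10, alpha=0.1, gamma=0.95, eps=0.1):
    Q = {}        # (s, a) -> value, default 0
    model = {}    # (s, a) -> (r, s') as last observed

    def q(s, a):
        return Q.get((s, a), 0.0)

    def eps_greedy(s):
        if random.random() < eps:
            return random.choice(actions)
        return max(actions, key=lambda a: q(s, a))

    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = eps_greedy(s)                              # steps 1-2: pick action
            s2, r, done = env.step(a)                      # step 3: act in the real world
            Q[(s, a)] = q(s, a) + alpha * (r + gamma * max(q(s2, b) for b in actions) - q(s, a))  # step 4
            model[(s, a)] = (r, s2)                        # step 5: model learning
            for _ in range(n_planning):                    # step 6: planning with the model
                ps, pa = random.choice(list(model))        # previously observed (s, a)
                pr, ps2 = model[(ps, pa)]
                Q[(ps, pa)] = q(ps, pa) + alpha * (pr + gamma * max(q(ps2, b) for b in actions) - q(ps, pa))
            s = s2
    return Q
```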
Experiments show that planning significantly speeds up convergence, requiring far fewer real-world exploration steps to converge. So we are squeezing much more information out of what we have explored so far.
A variation of Dyna-Q is Dyna-Q+, which adds an exploration bonus for state-action pairs that have not been tried for a long time, encouraging exploration.
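Sketched below, the only change Dyna-Q+ makes to the planning step is a bonus added to the imagined reward, growing with the time $\tau$ since the pair was last tried (the $\kappa\sqrt{\tau}$ bonus from Sutton and Barto); `last_tried` and `kappa` are illustrative names.

```python
import math

# Sketch: Dyna-Q+ planning reward. tau is the number of real time steps since
# (s, a) was last tried; the bonus kappa * sqrt(tau) encourages revisiting
# long-untried pairs during planning.
def planning_reward(model, last_tried, current_step, s, a, kappa=0.001):
    r, s_next = model[(s, a)]
    tau = current_step - last_tried[(s, a)]
    return r + kappa * math.sqrt(tau), s_next
```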
Simulation Based Search
Two key ideas: sampling and forward search.
Forward search algorithms select the best action by lookahead. A search tree is built with the current state at the root. We then use a model of the MDP to look ahead.
So we do not need to solve the entire MDP, just the sub-MDP starting from the current position. Solving the entire MDP is a waste of time. Note that this is in contrast to the Dyna-Q algorithm, where the "thinking" step starts by randomly visiting a previously observed state.
So we can simulate episodes of experience starting from the current state using our model, and apply model-free RL to the simulated episodes.
Simulation based search:
- Simulate episodes of experience from now (the current state $s_t$) with the model: $\{s_t, A_t^k, R_{t+1}^k, \dots, S_T^k\}_{k=1}^K \sim \mathcal{M}_\nu, \pi$
- Apply model-free RL to the simulated episodes
- If we use Monte Carlo control, we get Monte Carlo search
- If we use Sarsa for control, we get TD search
Simple Monte Carlo Search
- Given a model $\mathcal{M}_\nu$ and a simulation policy $\pi$ (how we pick actions in our imagination)
- For each action $a \in \mathcal{A}$:
- Simulate $K$ episodes from the current real state $s_t$ using the model: $\{s_t, a, R_{t+1}^k, S_{t+1}^k, A_{t+1}^k, \dots, S_T^k\}_{k=1}^K \sim \mathcal{M}_\nu, \pi$
- Note that after the first action $a$, we follow the simulation policy $\pi$ for all future actions
- Evaluate each action by its mean return (Monte Carlo evaluation): $Q(s_t, a) = \frac{1}{K} \sum_{k=1}^{K} G_t^k \xrightarrow{P} q_\pi(s_t, a)$
- Select the current real action with maximum value: $a_t = \arg\max_{a \in \mathcal{A}} Q(s_t, a)$ (see the sketch below)
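A minimal sketch of simple Monte Carlo search as just described, assuming a hypothetical `model.sample(s, a)` that returns `(reward, next_state, done)` and a fixed `simulation_policy(s)`; both names are assumptions for illustration.

```python
# Sketch: simple Monte Carlo search. Evaluate each candidate first action by
# the mean return of K simulated episodes, then act greedily for real.
def mc_search(model, simulation_policy, s_t, actions, K=100, gamma=1.0):
    def rollout(s, a):
        """Simulate one episode from (s, a), following the simulation policy afterwards."""
        total, discount = 0.0, 1.0
        r, s, done = model.sample(s, a)
        total += discount * r
        while not done:
            discount *= gamma
            r, s, done = model.sample(s, simulation_policy(s))
            total += discount * r
        return total

    # Q(s_t, a): mean simulated return for each candidate action.
    q = {a: sum(rollout(s_t, a) for _ in range(K)) / K for a in actions}
    return max(q, key=q.get)        # pick the real action with the highest estimate
```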
Monte Carlo Tree Search (Evaluation)
MCTS differs in that we allow the policy to improve within our simulation (i.e. policy is not stationary within our simulation runs).
- Given a model $\mathcal{M}_\nu$
- Simulate $K$ episodes from the current real state $s_t$ using the current simulation policy $\pi$: $\{s_t, A_t^k, R_{t+1}^k, S_{t+1}^k, \dots, S_T^k\}_{k=1}^K \sim \mathcal{M}_\nu, \pi$
- Build a search tree containing the visited states and actions from the above simulations
- Evaluate state-action pairs by the mean return of episodes that pass through $(s, a)$ (i.e. Monte Carlo evaluation): $Q(s, a) = \frac{1}{N(s, a)} \sum_{k=1}^{K} \sum_{u=t}^{T} \mathbf{1}(S_u = s, A_u = a)\, G_u \xrightarrow{P} q_\pi(s, a)$
- Note that:
- $N(s, a)$ is the number of times we visited the pair $(s, a)$ during our simulations.
- We are assuming our model is good enough so that we can rely on the simulated returns
- We are not really learning a persistent Q function. At each step, we run simulations to get fresh estimations of the Q values
- After the search is finished, select the current real action with maximum value in the search tree: $a_t = \arg\max_{a \in \mathcal{A}} Q(s_t, a)$
In MCTS, the simulation policy actually improves.
- Each simulation comprises two phases (in-tree, out of tree)
- Tree policy (improves): pick actions to maximize $Q(s, a)$, using the statistics stored in the search tree for the node and its children
- Default policy (fixed): in our simulation, when we run beyond the frontier of the search tree, we will pick actions randomly
- Repeat (each simulation)
- Evaluate state-action pairs $Q(s, a)$ by Monte Carlo evaluation
- Improve the tree policy, e.g. by $\epsilon$-greedy$(Q)$
- Essentially, this is Monte Carlo control applied to simulated experience
- This method converges on the optimal search tree, i.e. $Q(s, a) \to q_*(s, a)$. A minimal sketch of this search loop follows the list.
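Here is a minimal sketch of that search loop: an $\epsilon$-greedy tree policy inside the tree, a uniformly random default policy beyond its frontier, one node added to the tree per simulation, and Monte Carlo backups of the returns. It assumes the same hypothetical `model.sample(s, a) -> (reward, next_state, done)` interface as the Monte Carlo search sketch above.

```python
import random
from collections import defaultdict

# Sketch: Monte Carlo tree search with an epsilon-greedy tree policy and a
# random default policy. Q and N are keyed by (state, action).
def mcts(model, root, actions, n_simulations=1000, eps=0.1, gamma=1.0):
    N = defaultdict(int)       # N(s, a): visit counts
    Q = defaultdict(float)     # Q(s, a): mean-return estimates
    tree = {root}              # states currently in the search tree

    def tree_policy(s):
        if random.random() < eps:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(n_simulations):
        s, done, expanded = root, False, False
        path, rewards = [], []
        while not done:
            if s in tree:
                a = tree_policy(s)                        # in-tree: pick greedily w.r.t. Q
                path.append((s, a, len(rewards)))         # remember the step index
            else:
                if not expanded:
                    tree.add(s)                           # grow the tree by one node
                    expanded = True
                a = random.choice(actions)                # default policy beyond the frontier
            r, s, done = model.sample(s, a)
            rewards.append(r)
        G, returns = 0.0, []
        for r in reversed(rewards):                       # compute returns G_t backwards
            G = r + gamma * G
            returns.append(G)
        returns.reverse()
        for (ps, pa, t) in path:                          # Monte Carlo backup for in-tree pairs
            N[(ps, pa)] += 1
            Q[(ps, pa)] += (returns[t] - Q[(ps, pa)]) / N[(ps, pa)]
    return max(actions, key=lambda a: Q[(root, a)])       # best real action from the root
```

The tree policy improves automatically as the $Q$ estimates sharpen, which is the Monte Carlo control aspect described above.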
Case Study: Game of Go
Position evaluation in Go:
- How good is a position $s$?
- Reward function (undiscounted): $R_t = 0$ for all non-terminal steps $t < T$; at the final step, $R_T = 1$ if Black wins and $R_T = 0$ if White wins
- The policy $\pi = \langle \pi_B, \pi_W \rangle$ selects moves for both players
- The value function (how good is position $s$): $v_\pi(s) = \mathbb{E}_\pi[R_T \mid S = s] = \mathbb{P}[\text{Black wins} \mid S = s]$
How does simple Monte Carlo evaluation work?
- Suppose we start with a certain board configuration $s$
- We simulate many games starting from $s$, with both players following the current policy
- The value estimate is then the fraction of simulated games in which Black wins: $V(s) = \frac{1}{K} \sum_{k=1}^{K} \mathbf{1}(\text{Black wins in rollout } k)$ (see the sketch below)
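A minimal sketch of this rollout evaluation, assuming a hypothetical `game` object with `legal_moves(s)`, `play(s, move)`, `is_over(s)` and `black_wins(s)`; the interface is illustrative and not a real Go engine.

```python
import random

# Sketch: Monte Carlo position evaluation for Go. Roll out many random games
# from position s and return the fraction won by Black.
def evaluate_position(game, s, n_rollouts=1000):
    wins = 0
    for _ in range(n_rollouts):
        state = s
        while not game.is_over(state):
            move = random.choice(game.legal_moves(state))  # uniform rollout policy for both players
            state = game.play(state, move)
        wins += game.black_wins(state)                      # terminal reward: 1 if Black wins, else 0
    return wins / n_rollouts                                # estimate of v(s) = P[Black wins | s]
```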