Model Free Control
All lectures have been building up to this point: optimizing a problem where we do not have access to the underlying MDP. For such problems, either we do not know the underlying MDP, or it is too big to use directly (e.g. the game of Go).
On-Policy vs Off-Policy Learning
- On-policy learning is "learning on the job". Learn about policy $\pi$ from experience sampled from $\pi$ itself.
- Off-policy learning is learning by observing others. Learn about policy $\pi$ from experience sampled from another policy $\mu$, e.g. another robot or a human.
Start with the simpler case, which is on policy learning. The basic framework is generalized policy iteration (recap), which alternates between:
- Policy evaluation: estimate $v_\pi$
- Policy improvement: generate a new policy $\pi' \ge \pi$ (e.g. greedy policy improvement)
Naive case: policy iteration with Monte-Carlo evaluation. Basically, we use MC policy evaluation to update our value function, and then do greedy policy improvement. Would this work?
- No. The main problem is that previously, when we had access to the underlying MDP, we could do greedy policy improvement because we had access to the transition dynamics. Specifically, when we do policy improvement, we want to compute: $\pi'(s) = \arg\max_{a \in \mathcal{A}} \left( \mathcal{R}_s^a + \mathcal{P}_{ss'}^a V(s') \right)$
- However, in model-free control we do not have access to $\mathcal{P}_{ss'}^a$, meaning that we do not know the probabilities that determine which state $s'$ we end up in given action $a$. So there is no clear way to do greedy policy improvement if we only have an estimate of $V(s)$.
- To deal with this issue, we can do greedy policy improvement over $Q(s, a)$ instead, which is model-free. Then we can simply take: $\pi'(s) = \arg\max_{a \in \mathcal{A}} Q(s, a)$
So now we do generalized policy iteration with the action-value function.
- Start with an initial $Q$ and policy $\pi$
- Update the action-value function: $Q \approx q_\pi$
- Greedily update the policy: $\pi = \text{greedy}(Q)$
However, we still have another problem, which is the exploration issue. If we act greedily all the time, there is no guarantee that we will explore all states and thus find the optimal policy.
Toy Example: Greedy Action Selection
Choose between two doors; each bullet below is one successive trial:
- Open the left door: reward 0.
- Open the right door: reward +1.
- Open the right door: reward +3.
- Open the right door: reward +2.
- ...
The greedy policy will lock us onto the right door forever, so we will never know whether the left door actually has a higher mean return.
$\epsilon$-Greedy Exploration
The simplest idea for ensuring continual exploration.
- Try all $m$ actions with non-zero probability
- With probability $1 - \epsilon$, choose the greedy action
- With probability $\epsilon$, choose an action uniformly at random
$$
\pi(a|s) =
\begin{cases}
\frac{\epsilon}{m} + 1 - \epsilon & \text{if } a = \arg\max_{a' \in \mathcal{A}} Q(s, a') \\
\frac{\epsilon}{m} & \text{otherwise}
\end{cases}
$$
Note that $\frac{\epsilon}{m}$ is added in the first case as well, since the action chosen at random can also be the greedy action.
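To make this concrete, here is a minimal sketch of $\epsilon$-greedy action selection over a tabular Q; the `Q[state, action]` array layout and the `rng` argument are assumptions of this example, not anything from the lecture.

```python
import numpy as np

def epsilon_greedy_action(Q, state, epsilon, rng):
    """Pick an action epsilon-greedily w.r.t. a tabular Q[state, action] array."""
    n_actions = Q.shape[1]
    if rng.random() < epsilon:
        # Explore: uniform random action. This can also pick the greedy action,
        # which is why the greedy action ends up with probability 1 - eps + eps/m.
        return int(rng.integers(n_actions))
    # Exploit: greedy action w.r.t. the current Q estimate.
    return int(np.argmax(Q[state]))

# Example usage (hypothetical 5-state, 3-action problem):
# rng = np.random.default_rng(0)
# Q = np.zeros((5, 3))
# a = epsilon_greedy_action(Q, state=0, epsilon=0.1, rng=rng)
```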
The $\epsilon$-greedy policy is important because there is a theorem assuring us that we do indeed get a policy improvement on every step.

Theorem. For any $\epsilon$-greedy policy $\pi$, the $\epsilon$-greedy policy $\pi'$ with respect to $q_\pi$ is an improvement, i.e. $v_{\pi'}(s) \ge v_\pi(s)$.

Proof.
$$
\begin{aligned}
q_\pi(s, \pi'(s)) &= \sum_{a \in \mathcal{A}} \pi'(a|s)\, q_\pi(s, a) \\
&= \frac{\epsilon}{m} \sum_{a \in \mathcal{A}} q_\pi(s, a) + (1 - \epsilon) \max_{a \in \mathcal{A}} q_\pi(s, a) \\
&\ge \frac{\epsilon}{m} \sum_{a \in \mathcal{A}} q_\pi(s, a) + (1 - \epsilon) \sum_{a \in \mathcal{A}} \frac{\pi(a|s) - \frac{\epsilon}{m}}{1 - \epsilon}\, q_\pi(s, a) \\
&= \sum_{a \in \mathcal{A}} \pi(a|s)\, q_\pi(s, a) = v_\pi(s)
\end{aligned}
$$
Therefore, from the policy improvement theorem, $v_{\pi'}(s) \ge v_\pi(s)$.

The key step in the proof is the transition from line 2 to line 3. The idea is that the maximum q-value (obtained by choosing the greedy action) is greater than or equal to any weighted average of the $q_\pi(s, a)$. Hence we choose a clever weighted average such that we end up with $v_\pi(s)$ in line 4.

Note that it is indeed a weighted average for the following reason: $\pi(a|s)$ must sum to 1 over all actions, as it is a valid policy, and since there are $m$ unique actions, subtracting $\frac{\epsilon}{m}$ from each term removes $\epsilon$ in total (multiplying the constant by $m$). So the weights $\frac{\pi(a|s) - \epsilon/m}{1 - \epsilon}$ sum to 1.
This builds on an idea we encountered earlier: we do not need to fully evaluate the policy before we do a greedy improvement. In the context of Monte-Carlo policy evaluation, in the extreme case we can update the policy after every single episode, instead of gathering many episodes first.
How can we guarantee that we find the optimal policy $\pi_*$? We need to ensure that our algorithm balances two things: (i) suitably explore all options, and (ii) ensure that, at the end, we converge to a greedy policy.
This leads us to GLIE, which is a property that we want our algorithm to have.
Definition. Greedy in the Limit with Infinite Exploration (GLIE).
- All state-action pairs are explored infinitely many times, i.e. $\lim_{k \to \infty} N_k(s, a) = \infty$
- The policy converges to a greedy policy, i.e. $\lim_{k \to \infty} \pi_k(a|s) = \mathbf{1}\!\left(a = \arg\max_{a' \in \mathcal{A}} Q_k(s, a')\right)$
One simple way to get GLIE is to use $\epsilon$-greedy exploration with a decaying schedule for $\epsilon$, e.g. $\epsilon_k = \frac{1}{k}$.
GLIE Monte Carlo Control
This brings us to GLIE Monte Carlo control.
Algorithm. GLIE Monte-Carlo Control.
- Sample the $k$-th episode using the current policy $\pi$: $\{S_1, A_1, R_2, \ldots, S_T\} \sim \pi$
- For each state $S_t$ and action $A_t$ in the episode, update:
  $N(S_t, A_t) \leftarrow N(S_t, A_t) + 1$,
  $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \frac{1}{N(S_t, A_t)} \left( G_t - Q(S_t, A_t) \right)$
- Improve the policy based on the new action-value function: $\epsilon \leftarrow \frac{1}{k}$, $\pi \leftarrow \epsilon\text{-greedy}(Q)$
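A minimal sketch of GLIE Monte-Carlo control for a small tabular problem. The `env` object and its `reset()` / `step()` interface (returning `(next_state, reward, done, info)`) are assumptions of this sketch; the counts `N`, the incremental-mean update, and the $\epsilon_k = 1/k$ schedule follow the algorithm above.

```python
import numpy as np
from collections import defaultdict

def glie_mc_control(env, n_episodes, gamma=1.0, seed=0):
    """GLIE Monte-Carlo control with an every-visit update and epsilon_k = 1/k."""
    rng = np.random.default_rng(seed)
    n_actions = env.action_space.n
    Q = defaultdict(lambda: np.zeros(n_actions))
    N = defaultdict(lambda: np.zeros(n_actions))

    for k in range(1, n_episodes + 1):
        epsilon = 1.0 / k  # GLIE schedule: exploration decays over episodes

        # Sample the k-th episode with the current epsilon-greedy policy.
        episode, state, done = [], env.reset(), False
        while not done:
            if rng.random() < epsilon:
                action = int(rng.integers(n_actions))
            else:
                action = int(np.argmax(Q[state]))
            next_state, reward, done, _ = env.step(action)
            episode.append((state, action, reward))
            state = next_state

        # Walk backwards through the episode, accumulating the return G_t and
        # updating Q towards it with step size 1 / N(S_t, A_t).
        G = 0.0
        for state, action, reward in reversed(episode):
            G = reward + gamma * G
            N[state][action] += 1
            Q[state][action] += (G - Q[state][action]) / N[state][action]

    return Q
```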
MC vs TD Control
- TD learning has several advantages over MC:
- Lower variance
- Online
- Can deal with incomplete sequences
- Natural idea: use TD instead of MC in our control loop
- Apply TD to Q(S, A)
- Use $\epsilon$-greedy policy improvement
- Update every time step
- This is probably the most well-known RL algorithm (Sarsa)
Sarsa policy evaluation update step:
$$Q(S, A) \leftarrow Q(S, A) + \alpha \left( R + \gamma Q(S', A') - Q(S, A) \right)$$
Note that we are updating the Q-value for one single state-action pair $(S, A)$. We take action $A$ in state $S$ and observe reward $R$, which we use to update the Q-value. In addition, we sample a next action $A'$ at the resulting state $S'$, and we bootstrap on $Q(S', A')$ to update $Q(S, A)$ as well. So it corresponds to a one-step lookahead in TD.
So the on-policy control with Sarsa algorithm is as follows. For every time step:
- Policy evaluation with Sarsa: $Q \approx q_\pi$
- Policy improvement: $\epsilon$-greedy policy improvement
Algorithm. Sarsa algorithm for on-policy control.
- Initialize $Q(s, a)$ arbitrarily, for all $s \in \mathcal{S}, a \in \mathcal{A}$
- Repeat (for each episode):
  - Initialize $S$
  - Choose $A$ from $S$ using a policy derived from $Q$ (e.g. $\epsilon$-greedy)
  - Repeat (for each step of episode):
    - Take action $A$, observe $R$, $S'$
    - Choose $A'$ from $S'$ using a policy derived from $Q$ (e.g. $\epsilon$-greedy)
    - Update $Q(S, A) \leftarrow Q(S, A) + \alpha \left[ R + \gamma Q(S', A') - Q(S, A) \right]$
    - $S \leftarrow S'$, $A \leftarrow A'$
  - Until $S$ is terminal
Note that this is a fundamentally on-policy algorithm, because the $A'$ that we sample and use to bootstrap is also the next action we actually take (and $S'$ the next state we actually end up in).
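A minimal tabular Sarsa(0) sketch, again under an assumed `gym`-style interface (`env.reset()`, `env.step()` returning a 4-tuple); the $\epsilon$-greedy helper mirrors the earlier snippet.

```python
import numpy as np

def sarsa(env, n_episodes, alpha=0.1, gamma=1.0, epsilon=0.1, seed=0):
    """Tabular Sarsa(0): on-policy TD control with an epsilon-greedy behaviour policy."""
    rng = np.random.default_rng(seed)
    n_states, n_actions = env.observation_space.n, env.action_space.n
    Q = np.zeros((n_states, n_actions))

    def epsilon_greedy(s):
        if rng.random() < epsilon:
            return int(rng.integers(n_actions))
        return int(np.argmax(Q[s]))

    for _ in range(n_episodes):
        s = env.reset()
        a = epsilon_greedy(s)
        done = False
        while not done:
            s_next, r, done, _ = env.step(a)
            a_next = epsilon_greedy(s_next)
            # On-policy: bootstrap with Q(S', A') for the action we will actually take next.
            target = r + (0.0 if done else gamma * Q[s_next, a_next])
            Q[s, a] += alpha * (target - Q[s, a])
            s, a = s_next, a_next
    return Q
```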
Theorem. Sarsa converges to the optimal action-value function, $Q(s, a) \to q_*(s, a)$, under the following conditions:
- GLIE sequence of policies $\pi_t(a|s)$
- Robbins-Monro sequence of step sizes $\alpha_t$: $\sum_{t=1}^{\infty} \alpha_t = \infty$ and $\sum_{t=1}^{\infty} \alpha_t^2 < \infty$
$n$-step Sarsa
As before, we saw that $n$-step algorithms get the best of both worlds between MC and TD. So we do the same here.
Consider the following $n$-step returns for $n = 1, 2, \ldots, \infty$:
- $n = 1$ (Sarsa): $q_t^{(1)} = R_{t+1} + \gamma Q(S_{t+1}, A_{t+1})$
- $n = 2$: $q_t^{(2)} = R_{t+1} + \gamma R_{t+2} + \gamma^2 Q(S_{t+2}, A_{t+2})$
- $n = \infty$ (MC): $q_t^{(\infty)} = R_{t+1} + \gamma R_{t+2} + \ldots + \gamma^{T-t-1} R_T$
Define the $n$-step Q-return: $q_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \ldots + \gamma^{n-1} R_{t+n} + \gamma^n Q(S_{t+n}, A_{t+n})$
$n$-step Sarsa updates $Q(s, a)$ towards the $n$-step Q-return: $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left( q_t^{(n)} - Q(S_t, A_t) \right)$
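A small sketch that computes $q_t^{(n)}$ from a recorded trajectory and a tabular Q; the trajectory layout (`rewards[k]` holding $R_{k+1}$, `states`/`actions` holding $S_k, A_k$) is an assumption of this example.

```python
def n_step_q_return(rewards, states, actions, Q, t, n, gamma):
    """q_t^(n) = R_{t+1} + ... + gamma^{n-1} R_{t+n} + gamma^n Q(S_{t+n}, A_{t+n}).

    If the episode terminates before step t+n, this reduces to the plain
    Monte-Carlo return from time t.
    """
    T = len(rewards)  # number of transitions in the recorded episode
    G, discount = 0.0, 1.0
    for k in range(t, min(t + n, T)):
        G += discount * rewards[k]   # adds gamma^(k-t) * R_{k+1}
        discount *= gamma
    if t + n < T:
        # Bootstrap with gamma^n * Q(S_{t+n}, A_{t+n}) if the episode is long enough.
        G += discount * Q[states[t + n], actions[t + n]]
    return G
```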
Forward View Sarsa($\lambda$)
As before, we saw that the $n$-step return itself is noisy and performance is sensitive to the hyperparameter choice of $n$ (and the step size $\alpha$). So the better way is to average the returns over all $n$-steps.
- The $q^\lambda$ return combines all $n$-step Q-returns $q_t^{(n)}$
- Using weight $(1 - \lambda)\lambda^{n-1}$, we have: $q_t^{\lambda} = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} q_t^{(n)}$
- And the forward-view Sarsa($\lambda$) update is: $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left( q_t^{\lambda} - Q(S_t, A_t) \right)$
Backward View Sarsa($\lambda$)
Recall that we used eligibility traces to construct the backward view of TD($\lambda$), since the forward-view algorithm is not online: we need to wait until the end of the episode to do the update.
- Just like TD($\lambda$), we use eligibility traces in an online algorithm
- But Sarsa($\lambda$) has one eligibility trace for each state-action pair, instead of one for every state:
  $E_0(s, a) = 0$,
  $E_t(s, a) = \gamma \lambda E_{t-1}(s, a) + \mathbf{1}(S_t = s, A_t = a)$
- $Q(s, a)$ is updated for every state $s$ and action $a$
- In proportion to the TD-error $\delta_t$ and the eligibility trace $E_t(s, a)$:
  $\delta_t = R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)$,
  $Q(s, a) \leftarrow Q(s, a) + \alpha \delta_t E_t(s, a)$
Algorithm. Sarsa($\lambda$), on-policy control.
- Initialize $Q(s, a)$ arbitrarily, for all $s \in \mathcal{S}, a \in \mathcal{A}$
- Repeat (for each episode):
  - $E(s, a) = 0$, for all $s \in \mathcal{S}, a \in \mathcal{A}$
  - Initialize $S$, $A$
  - Repeat (for each step of episode):
    - Take action $A$, observe $R$, $S'$
    - Choose $A'$ from $S'$, using policy derived from $Q$ (e.g. $\epsilon$-greedy)
    - $\delta \leftarrow R + \gamma Q(S', A') - Q(S, A)$
    - $E(S, A) \leftarrow E(S, A) + 1$
    - For all $s \in \mathcal{S}, a \in \mathcal{A}$:
      - $Q(s, a) \leftarrow Q(s, a) + \alpha \delta E(s, a)$
      - $E(s, a) \leftarrow \gamma \lambda E(s, a)$
    - $S \leftarrow S'$, $A \leftarrow A'$
  - Until $S$ is terminal
Note that for a given step we have a single value of $\delta$, which is our TD error, but we propagate it to all $(s, a)$ pairs in proportion to their eligibility traces, as potentially every pair could have contributed to that error.
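A minimal tabular backward-view Sarsa($\lambda$) sketch with accumulating eligibility traces, again under the assumed `gym`-style interface.

```python
import numpy as np

def sarsa_lambda(env, n_episodes, alpha=0.1, gamma=1.0, lam=0.9, epsilon=0.1, seed=0):
    """Tabular backward-view Sarsa(lambda) with accumulating eligibility traces."""
    rng = np.random.default_rng(seed)
    n_states, n_actions = env.observation_space.n, env.action_space.n
    Q = np.zeros((n_states, n_actions))

    def epsilon_greedy(s):
        if rng.random() < epsilon:
            return int(rng.integers(n_actions))
        return int(np.argmax(Q[s]))

    for _ in range(n_episodes):
        E = np.zeros_like(Q)  # one eligibility trace per (state, action) pair
        s = env.reset()
        a = epsilon_greedy(s)
        done = False
        while not done:
            s_next, r, done, _ = env.step(a)
            a_next = epsilon_greedy(s_next)
            delta = r + (0.0 if done else gamma * Q[s_next, a_next]) - Q[s, a]
            E[s, a] += 1.0            # accumulating trace for the visited pair
            Q += alpha * delta * E    # propagate the single TD error to every pair
            E *= gamma * lam          # decay all traces
            s, a = s_next, a_next
    return Q
```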
Off-Policy Learning
So far we have been looking at on-policy learning. However, it is often useful to do off-policy learning: evaluate a target policy $\pi(a|s)$ to compute $v_\pi(s)$ or $q_\pi(s, a)$, while we follow a behaviour policy $\mu(a|s)$. Of course, in this case $\pi \ne \mu$.
Why is off policy learning useful?
- We can learn from observing humans or other agents
- We can re-use experience that was previously generated from old policies $\pi_1, \pi_2, \ldots, \pi_{t-1}$, possibly in a batched manner
- We can learn about the optimal policy while following an exploratory policy
- We can learn about multiple policies while following one policy
The first mechanism is importance sampling. The main idea is to estimate an expectation under one distribution by re-weighting samples drawn from a different distribution:
$$\mathbb{E}_{X \sim P}[f(X)] = \sum P(X) f(X) = \sum Q(X) \frac{P(X)}{Q(X)} f(X) = \mathbb{E}_{X \sim Q}\!\left[ \frac{P(X)}{Q(X)} f(X) \right]$$
We can apply importance sampling to Monte-Carlo learning for off-policy Monte-Carlo evaluation:
- We use returns generated from the behaviour policy $\mu$ to evaluate the target policy $\pi$
- Then we weight the return $G_t$ according to the ratio of probabilities between the two policies
- We need to apply the correction at every time step along the whole episode, because the change in policy affects every time step:
  $G_t^{\pi/\mu} = \frac{\pi(A_t|S_t)}{\mu(A_t|S_t)} \frac{\pi(A_{t+1}|S_{t+1})}{\mu(A_{t+1}|S_{t+1})} \cdots \frac{\pi(A_T|S_T)}{\mu(A_T|S_T)} G_t$
- And then update the value towards the corrected return: $V(S_t) \leftarrow V(S_t) + \alpha \left( G_t^{\pi/\mu} - V(S_t) \right)$
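A sketch of the corrected return for the start state of an episode, assuming hypothetical callables `pi_prob(a, s)` and `mu_prob(a, s)` that return $\pi(a|s)$ and $\mu(a|s)$.

```python
def is_corrected_return(episode, pi_prob, mu_prob, gamma=1.0):
    """Importance-sampling-corrected Monte-Carlo return G_0^{pi/mu}.

    `episode` is a list of (state, action, reward) tuples generated by the
    behaviour policy mu.
    """
    G, discount, rho = 0.0, 1.0, 1.0
    for state, action, reward in episode:
        rho *= pi_prob(action, state) / mu_prob(action, state)  # product of ratios over the episode
        G += discount * reward                                   # ordinary discounted return
        discount *= gamma
    # The MC update would then be V(S_0) <- V(S_0) + alpha * (rho * G - V(S_0)).
    return rho * G
```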
While off-policy MC learning is theoretically sound, there are major problems that make it useless in practice:
- Importance sampling dramatically increases variance: we are adjusting at every time step, and the cumulative effect of these corrections over the whole episode makes our estimate of $V$ vary wildly
- We also cannot use this adjustment if $\mu$ is zero where $\pi$ is non-zero
So we have to combine bootstrapping with importance sampling, which lets us adjust the probabilities for just one time step. This gives importance sampling for off-policy TD:
- We use TD targets generated from $\mu$ to evaluate $\pi$
- For TD(0), we weight the TD target $R + \gamma V(S')$ by the importance sampling ratio
- This means we only need a single importance sampling correction:
  $V(S_t) \leftarrow V(S_t) + \alpha \left( \frac{\pi(A_t|S_t)}{\mu(A_t|S_t)} \left( R_{t+1} + \gamma V(S_{t+1}) \right) - V(S_t) \right)$
- This has much lower variance than MC importance sampling, and can work well if $\pi$ and $\mu$ do not differ by too much over a single step
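The corresponding single-step correction for off-policy TD(0), using the same hypothetical `pi_prob` / `mu_prob` helpers and a tabular `V`.

```python
def off_policy_td0_update(V, s, a, r, s_next, done, pi_prob, mu_prob, alpha=0.1, gamma=1.0):
    """One off-policy TD(0) update: only a single importance-sampling ratio is needed."""
    rho = pi_prob(a, s) / mu_prob(a, s)                    # pi(A_t|S_t) / mu(A_t|S_t)
    target = rho * (r + (0.0 if done else gamma * V[s_next]))
    V[s] += alpha * (target - V[s])
    return V
```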
As we have seen, importance sampling leads to large variance. The best solution is known as Q-learning, which is specific to TD(0), i.e. the Sarsa(0) setting:
- It does not require any importance sampling
- It allows off-policy learning of action-values
Recall that $\mu$ is the behaviour policy that our agent is actually following, and $\pi$ is the target policy that we want to learn about. The main idea is that in our Sarsa(0)-style update step, we update the Q-value towards the value of an action chosen by the target policy $\pi$, but allow our agent to continue following the behaviour policy $\mu$.
This allows the agent to explore the environment using $\mu$, but learn the action-value function of $\pi$. Specifically:
- We choose each next action for the agent using the behaviour policy: $A_{t+1} \sim \mu(\cdot | S_{t+1})$
- But we use an alternative successor action from the target policy in our Q-value update: $A' \sim \pi(\cdot | S_{t+1})$
- So we update $Q(S_t, A_t)$ towards the value of the alternative action:
  $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left( R_{t+1} + \gamma Q(S_{t+1}, A') - Q(S_t, A_t) \right)$
Note importantly that we are using $Q(S_{t+1}, A')$ with $A' \sim \pi$ in the update target above, instead of $A_{t+1} \sim \mu$. This is what allows us to learn off-policy.
Q-Learning (or SARSA-MAX)
A special case of Q-learning is when the target policy $\pi$ is greedy with respect to $Q(s, a)$. This is usually what people refer to as Q-learning.
We allow both behaviour and target policies to improve:
- The target policy $\pi$ is greedy with respect to $Q(s, a)$, i.e. $\pi(S_{t+1}) = \arg\max_{a'} Q(S_{t+1}, a')$
- The behaviour policy $\mu$ is $\epsilon$-greedy with respect to the same $Q(s, a)$
- The learning target inside the Q-update then simplifies as follows:
$$
\begin{aligned}
& R_{t+1} + \gamma Q(S_{t+1}, A') \\
&= R_{t+1} + \gamma Q\!\left(S_{t+1}, \arg\max_{a'} Q(S_{t+1}, a')\right) \\
&= R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a')
\end{aligned}
$$
Note that since we are following a greedy target policy, the action chosen is the Q-maximizing one (line 2). And since we are choosing the Q-maximizing action, we get the maximum Q-value over all possible actions (line 3). This simplifies the equation quite a bit, and it now resembles the Bellman optimality equation.
This leads us to the well-known Q-learning algorithm, which David calls Sarsa-max. The Q-update is:
$$Q(S, A) \leftarrow Q(S, A) + \alpha \left( R + \gamma \max_{a'} Q(S', a') - Q(S, A) \right)$$
There is a theorem telling us that the Q-learning control algorithm converges to the optimal action-value function, i.e. $Q(s, a) \to q_*(s, a)$.
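A minimal tabular Q-learning sketch under the same assumed `gym`-style interface: the behaviour policy is $\epsilon$-greedy, while the bootstrap uses the greedy max over successor actions.

```python
import numpy as np

def q_learning(env, n_episodes, alpha=0.1, gamma=1.0, epsilon=0.1, seed=0):
    """Tabular Q-learning: epsilon-greedy behaviour, greedy (max) bootstrap target."""
    rng = np.random.default_rng(seed)
    n_states, n_actions = env.observation_space.n, env.action_space.n
    Q = np.zeros((n_states, n_actions))

    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            # Behaviour policy mu: epsilon-greedy w.r.t. the current Q.
            if rng.random() < epsilon:
                a = int(rng.integers(n_actions))
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done, _ = env.step(a)
            # Target policy pi: greedy, so bootstrap with max_a' Q(S', a').
            target = r + (0.0 if done else gamma * np.max(Q[s_next]))
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```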
To wrap up, here is a classification of some algorithms we have so far:
| | Full Backup (Dynamic Programming) | Sample Backup (Temporal Difference) |
|---|---|---|
| Bellman Expectation Equation for $v_\pi(s)$ | Iterative Policy Evaluation | TD Learning |
| Bellman Expectation Equation for $q_\pi(s, a)$ | Q-Policy Iteration | Sarsa |
| Bellman Optimality Equation for $q_*(s, a)$ | Q-Value Iteration | Q-Learning |