Model Free Control
All lectures have been building up to this point: optimizing a problem where we do not have access to the underlying MDP. For such problems, either we do not know the underlying MDP, or it is too big to use directly (e.g. the game of Go).
On-Policy vs Off-Policy Learning
- On-policy learning is "learning on the job". Learn about policy $\pi$ from experience sampled from $\pi$ itself.
- Off-policy learning is learning by observing others. Learn about policy $\pi$ from experience sampled from another policy $\mu$, e.g. another robot or a human.
Start with the simpler case, which is on policy learning. The basic framework is generalized policy iteration (recap), which alternates between:
- Policy evaluation: estimate $v_\pi$
- Policy improvement: generate a new policy $\pi' \ge \pi$ (e.g. greedy policy improvement)
Naive case: policy iteration with Monte-Carlo evaluation. Basically, we use MC policy evaluation to update our value function, and then do greedy policy improvement. Would this work?
- No. The main problem is that previously, when we had access to the underlying MDP, we could do greedy policy improvement because we had access to the transition dynamics. Specifically, when we do policy improvement, we want to compute: $\pi'(s) = \arg\max_{a \in \mathcal{A}} \left( \mathcal{R}_s^a + \mathcal{P}_{ss'}^a V(s') \right)$
- However, in model-free control we do not have access to $\mathcal{P}_{ss'}^a$, meaning that we do not know the probabilities that determine which state $s'$ we end up in given action $a$. So there is no clear way to do greedy policy improvement if we only have an estimate of $V(s)$.
- To deal with this issue, we can do greedy policy improvement over $Q(s, a)$ instead, which is model-free. Then we can simply take: $\pi'(s) = \arg\max_{a \in \mathcal{A}} Q(s, a)$
So now we do generalized policy iteration with the action-value function.
- Start with an initial $Q$ and policy $\pi$
- Update the action-value function: $Q \approx q_\pi$
- Greedily update the policy: $\pi = \text{greedy}(Q)$
However, we still have another problem, which is the exploration issue. If we act greedily all the time, there is no guarantee that we will explore all states and thus find the optimal policy.
Toy Example: Greedy Action Selection
Choose between two doors; each bullet below is one successive trial:
- Open the left door: reward 0.
- Open the right door: reward +1.
- Open the right door: reward +3.
- Open the right door: reward +2.
- ...
The greedy policy will lock us onto the right door forever, so we will never know whether the left door actually has a higher mean return.
$\epsilon$-Greedy Exploration
The simplest idea for ensuring continual exploration.
- Try all $m$ actions with non-zero probability
- With probability $1 - \epsilon$, choose the greedy action
- With probability $\epsilon$, choose an action uniformly at random
$$
\pi(a|s) =
\begin{cases}
\frac{\epsilon}{m} + 1 - \epsilon & \text{if } a = \arg\max_{a' \in \mathcal{A}} Q(s, a') \\
\frac{\epsilon}{m} & \text{otherwise}
\end{cases}
$$
Note that $\frac{\epsilon}{m}$ is added in the first case as well, since the action chosen at random can also be the greedy action.
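To make this concrete, here is a minimal sketch of $\epsilon$-greedy action selection over a tabular Q; the `Q[state, action]` array layout and the `rng` argument are assumptions of this example, not anything from the lecture.

```python
import numpy as np

def epsilon_greedy_action(Q, state, epsilon, rng):
    """Pick an action epsilon-greedily w.r.t. a tabular Q[state, action] array."""
    n_actions = Q.shape[1]
    if rng.random() < epsilon:
        # Explore: uniform random action. This can also pick the greedy action,
        # which is why the greedy action ends up with probability 1 - eps + eps/m.
        return int(rng.integers(n_actions))
    # Exploit: greedy action w.r.t. the current Q estimate.
    return int(np.argmax(Q[state]))

# Example usage (hypothetical 5-state, 3-action problem):
# rng = np.random.default_rng(0)
# Q = np.zeros((5, 3))
# a = epsilon_greedy_action(Q, state=0, epsilon=0.1, rng=rng)
```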
The $\epsilon$-greedy policy is important because there is a theorem assuring us that we do indeed get a policy improvement on every step.

Theorem. For any $\epsilon$-greedy policy $\pi$, the $\epsilon$-greedy policy $\pi'$ with respect to $q_\pi$ is an improvement, i.e. $v_{\pi'}(s) \ge v_\pi(s)$.

Proof.
$$
\begin{aligned}
q_\pi(s, \pi'(s)) &= \sum_{a \in \mathcal{A}} \pi'(a|s)\, q_\pi(s, a) \\
&= \frac{\epsilon}{m} \sum_{a \in \mathcal{A}} q_\pi(s, a) + (1 - \epsilon) \max_{a \in \mathcal{A}} q_\pi(s, a) \\
&\ge \frac{\epsilon}{m} \sum_{a \in \mathcal{A}} q_\pi(s, a) + (1 - \epsilon) \sum_{a \in \mathcal{A}} \frac{\pi(a|s) - \frac{\epsilon}{m}}{1 - \epsilon}\, q_\pi(s, a) \\
&= \sum_{a \in \mathcal{A}} \pi(a|s)\, q_\pi(s, a) = v_\pi(s)
\end{aligned}
$$
Therefore, from the policy improvement theorem, $v_{\pi'}(s) \ge v_\pi(s)$.

The key step in the proof is the transition from line 2 to line 3. The idea is that the maximum q-value (obtained by choosing the greedy action) is greater than or equal to any weighted average of the $q_\pi(s, a)$. Hence we choose a clever weighted average such that we end up with $v_\pi(s)$ in line 4.

Note that it is indeed a weighted average for the following reason: $\pi(a|s)$ must sum to 1 over all actions, as it is a valid policy, and since there are $m$ unique actions, subtracting $\frac{\epsilon}{m}$ from each term removes $\epsilon$ in total (multiplying the constant by $m$). So the weights $\frac{\pi(a|s) - \epsilon/m}{1 - \epsilon}$ sum to 1.
This builds on an idea we encountered earlier: we do not need to fully evaluate the policy before we do a greedy improvement. In the context of Monte-Carlo policy evaluation, in the extreme case we can update the policy after every single episode, instead of gathering many episodes first.
How can we guarantee that we find the optimal policy $\pi_*$? We need to ensure that our algorithm balances two things: (i) suitably explore all options, and (ii) ensure that, at the end, we converge to a greedy policy.
This leads us to GLIE, which is a property that we want our algorithm to have.
Definition. Greedy in the Limit with Infinite Exploration (GLIE).
- All state-action pairs are explored infinitely many times, i.e. $\lim_{k \to \infty} N_k(s, a) = \infty$
- The policy converges to a greedy policy, i.e. $\lim_{k \to \infty} \pi_k(a|s) = \mathbf{1}\!\left(a = \arg\max_{a' \in \mathcal{A}} Q_k(s, a')\right)$
One simple way to get GLIE is to use $\epsilon$-greedy exploration with a decaying schedule for $\epsilon$, e.g. $\epsilon_k = \frac{1}{k}$.
GLIE Monte Carlo Control
This brings us to GLIE Monte Carlo control.
Algorithm. GLIE Monte-Carlo Control.
- Sample the $k$-th episode using the current policy $\pi$: $\{S_1, A_1, R_2, \ldots, S_T\} \sim \pi$
- For each state $S_t$ and action $A_t$ in the episode, update:
  $N(S_t, A_t) \leftarrow N(S_t, A_t) + 1$,
  $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \frac{1}{N(S_t, A_t)} \left( G_t - Q(S_t, A_t) \right)$
- Improve the policy based on the new action-value function: $\epsilon \leftarrow \frac{1}{k}$, $\pi \leftarrow \epsilon\text{-greedy}(Q)$
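A minimal sketch of GLIE Monte-Carlo control for a small tabular problem. The `env` object and its `reset()` / `step()` interface (returning `(next_state, reward, done, info)`) are assumptions of this sketch; the counts `N`, the incremental-mean update, and the $\epsilon_k = 1/k$ schedule follow the algorithm above.

```python
import numpy as np
from collections import defaultdict

def glie_mc_control(env, n_episodes, gamma=1.0, seed=0):
    """GLIE Monte-Carlo control with an every-visit update and epsilon_k = 1/k."""
    rng = np.random.default_rng(seed)
    n_actions = env.action_space.n
    Q = defaultdict(lambda: np.zeros(n_actions))
    N = defaultdict(lambda: np.zeros(n_actions))

    for k in range(1, n_episodes + 1):
        epsilon = 1.0 / k  # GLIE schedule: exploration decays over episodes

        # Sample the k-th episode with the current epsilon-greedy policy.
        episode, state, done = [], env.reset(), False
        while not done:
            if rng.random() < epsilon:
                action = int(rng.integers(n_actions))
            else:
                action = int(np.argmax(Q[state]))
            next_state, reward, done, _ = env.step(action)
            episode.append((state, action, reward))
            state = next_state

        # Walk backwards through the episode, accumulating the return G_t and
        # updating Q towards it with step size 1 / N(S_t, A_t).
        G = 0.0
        for state, action, reward in reversed(episode):
            G = reward + gamma * G
            N[state][action] += 1
            Q[state][action] += (G - Q[state][action]) / N[state][action]

    return Q
```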
MC vs TD Control
- TD learning has several advantages over MC:
- Lower variance
- Online
- Can deal with incomplete sequences
- Natural idea: use TD instead of MC in our control loop
- Apply TD to Q(S, A)
- Use $\epsilon$-greedy policy improvement
- Update every time step
- This is probably the most well-known RL algorithm (Sarsa)
Sarsa policy evaluation update step:
$$Q(S, A) \leftarrow Q(S, A) + \alpha \left( R + \gamma Q(S', A') - Q(S, A) \right)$$
Note that we are updating the Q-value for one single state-action pair $(S, A)$. We take action $A$ in state $S$ and observe reward $R$, which we use to update the Q-value. In addition, we sample a next action $A'$ at the resulting state $S'$, and we bootstrap on $Q(S', A')$ to update $Q(S, A)$ as well. So it corresponds to a one-step lookahead in TD.
So the on-policy control with Sarsa algorithm is as follows. For every time step:
- Policy evaluation with Sarsa: $Q \approx q_\pi$
- Policy improvement: $\epsilon$-greedy policy improvement
Algorithm. Sarsa algorithm for on-policy control.
- Initialize $Q(s, a)$ arbitrarily, for all $s \in \mathcal{S}, a \in \mathcal{A}$
- Repeat (for each episode):
  - Initialize $S$
  - Choose $A$ from $S$ using a policy derived from $Q$ (e.g. $\epsilon$-greedy)
  - Repeat (for each step of episode):
    - Take action $A$, observe $R$, $S'$
    - Choose $A'$ from $S'$ using a policy derived from $Q$ (e.g. $\epsilon$-greedy)
    - Update $Q(S, A) \leftarrow Q(S, A) + \alpha \left[ R + \gamma Q(S', A') - Q(S, A) \right]$
    - $S \leftarrow S'$, $A \leftarrow A'$
  - Until $S$ is terminal
Note that this is a fundamentally on-policy algorithm, because the $A'$ that we sample and use to bootstrap is also the next action we actually take (and $S'$ the next state we actually end up in).
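A minimal tabular Sarsa(0) sketch, again under an assumed `gym`-style interface (`env.reset()`, `env.step()` returning a 4-tuple); the $\epsilon$-greedy helper mirrors the earlier snippet.

```python
import numpy as np

def sarsa(env, n_episodes, alpha=0.1, gamma=1.0, epsilon=0.1, seed=0):
    """Tabular Sarsa(0): on-policy TD control with an epsilon-greedy behaviour policy."""
    rng = np.random.default_rng(seed)
    n_states, n_actions = env.observation_space.n, env.action_space.n
    Q = np.zeros((n_states, n_actions))

    def epsilon_greedy(s):
        if rng.random() < epsilon:
            return int(rng.integers(n_actions))
        return int(np.argmax(Q[s]))

    for _ in range(n_episodes):
        s = env.reset()
        a = epsilon_greedy(s)
        done = False
        while not done:
            s_next, r, done, _ = env.step(a)
            a_next = epsilon_greedy(s_next)
            # On-policy: bootstrap with Q(S', A') for the action we will actually take next.
            target = r + (0.0 if done else gamma * Q[s_next, a_next])
            Q[s, a] += alpha * (target - Q[s, a])
            s, a = s_next, a_next
    return Q
```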
Theorem. Sarsa converges to the optimal action-value function, $Q(s, a) \to q_*(s, a)$, under the following conditions:
- GLIE sequence of policies $\pi_t(a|s)$
- Robbins-Monro sequence of step sizes $\alpha_t$: $\sum_{t=1}^{\infty} \alpha_t = \infty$ and $\sum_{t=1}^{\infty} \alpha_t^2 < \infty$
$n$-step Sarsa
As before, we saw that $n$-step algorithms get the best of both worlds between MC and TD. So we do the same here.
Consider the following $n$-step returns for $n = 1, 2, \ldots, \infty$:
- $n = 1$ (Sarsa): $q_t^{(1)} = R_{t+1} + \gamma Q(S_{t+1}, A_{t+1})$
- $n = 2$: $q_t^{(2)} = R_{t+1} + \gamma R_{t+2} + \gamma^2 Q(S_{t+2}, A_{t+2})$
- $n = \infty$ (MC): $q_t^{(\infty)} = R_{t+1} + \gamma R_{t+2} + \ldots + \gamma^{T-t-1} R_T$
Define the $n$-step Q-return: $q_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \ldots + \gamma^{n-1} R_{t+n} + \gamma^n Q(S_{t+n}, A_{t+n})$
$n$-step Sarsa updates $Q(s, a)$ towards the $n$-step Q-return: $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left( q_t^{(n)} - Q(S_t, A_t) \right)$
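A small sketch that computes $q_t^{(n)}$ from a recorded trajectory and a tabular Q; the trajectory layout (`rewards[k]` holding $R_{k+1}$, `states`/`actions` holding $S_k, A_k$) is an assumption of this example.

```python
def n_step_q_return(rewards, states, actions, Q, t, n, gamma):
    """q_t^(n) = R_{t+1} + ... + gamma^{n-1} R_{t+n} + gamma^n Q(S_{t+n}, A_{t+n}).

    If the episode terminates before step t+n, this reduces to the plain
    Monte-Carlo return from time t.
    """
    T = len(rewards)  # number of transitions in the recorded episode
    G, discount = 0.0, 1.0
    for k in range(t, min(t + n, T)):
        G += discount * rewards[k]   # adds gamma^(k-t) * R_{k+1}
        discount *= gamma
    if t + n < T:
        # Bootstrap with gamma^n * Q(S_{t+n}, A_{t+n}) if the episode is long enough.
        G += discount * Q[states[t + n], actions[t + n]]
    return G
```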
Forward View Sarsa($\lambda$)
As before, we saw that the $n$-step return itself is noisy and performance is sensitive to the hyperparameter choice of $n$ (and the step size $\alpha$). So the better way is to average the returns over all $n$-steps.
- The $q^\lambda$ return combines all $n$-step Q-returns $q_t^{(n)}$
- Using weight $(1 - \lambda)\lambda^{n-1}$, we have: $q_t^{\lambda} = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} q_t^{(n)}$
- And the forward-view Sarsa($\lambda$) update is: $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left( q_t^{\lambda} - Q(S_t, A_t) \right)$
Backward View Sarsa($\lambda$)
Recall that we used eligibility traces to construct the backward view of TD($\lambda$), since the forward-view algorithm is not online: we need to wait until the end of the episode to do the update.
- Just like TD($\lambda$), we use eligibility traces in an online algorithm
- But Sarsa($\lambda$) has one eligibility trace for each state-action pair, instead of one for every state:
  $E_0(s, a) = 0$,
  $E_t(s, a) = \gamma \lambda E_{t-1}(s, a) + \mathbf{1}(S_t = s, A_t = a)$
- $Q(s, a)$ is updated for every state $s$ and action $a$
- In proportion to the TD-error $\delta_t$ and the eligibility trace $E_t(s, a)$:
  $\delta_t = R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)$,
  $Q(s, a) \leftarrow Q(s, a) + \alpha \delta_t E_t(s, a)$
Algorithm. Sarsa($\lambda$), on-policy control.
- Initialize $Q(s, a)$ arbitrarily, for all $s \in \mathcal{S}, a \in \mathcal{A}$
- Repeat (for each episode):
  - $E(s, a) = 0$, for all $s \in \mathcal{S}, a \in \mathcal{A}$
  - Initialize $S$, $A$
  - Repeat (for each step of episode):
    - Take action $A$, observe $R$, $S'$
    - Choose $A'$ from $S'$, using policy derived from $Q$ (e.g. $\epsilon$-greedy)
    - $\delta \leftarrow R + \gamma Q(S', A') - Q(S, A)$
    - $E(S, A) \leftarrow E(S, A) + 1$
    - For all $s \in \mathcal{S}, a \in \mathcal{A}$:
      - $Q(s, a) \leftarrow Q(s, a) + \alpha \delta E(s, a)$
      - $E(s, a) \leftarrow \gamma \lambda E(s, a)$
    - $S \leftarrow S'$, $A \leftarrow A'$
  - Until $S$ is terminal
Note that for a given step we have a single value of $\delta$, which is our TD error, but we propagate it to all $(s, a)$ pairs in proportion to their eligibility traces, as potentially every pair could have contributed to that error.
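A minimal tabular backward-view Sarsa($\lambda$) sketch with accumulating eligibility traces, again under the assumed `gym`-style interface.

```python
import numpy as np

def sarsa_lambda(env, n_episodes, alpha=0.1, gamma=1.0, lam=0.9, epsilon=0.1, seed=0):
    """Tabular backward-view Sarsa(lambda) with accumulating eligibility traces."""
    rng = np.random.default_rng(seed)
    n_states, n_actions = env.observation_space.n, env.action_space.n
    Q = np.zeros((n_states, n_actions))

    def epsilon_greedy(s):
        if rng.random() < epsilon:
            return int(rng.integers(n_actions))
        return int(np.argmax(Q[s]))

    for _ in range(n_episodes):
        E = np.zeros_like(Q)  # one eligibility trace per (state, action) pair
        s = env.reset()
        a = epsilon_greedy(s)
        done = False
        while not done:
            s_next, r, done, _ = env.step(a)
            a_next = epsilon_greedy(s_next)
            delta = r + (0.0 if done else gamma * Q[s_next, a_next]) - Q[s, a]
            E[s, a] += 1.0            # accumulating trace for the visited pair
            Q += alpha * delta * E    # propagate the single TD error to every pair
            E *= gamma * lam          # decay all traces
            s, a = s_next, a_next
    return Q
```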
Off-Policy Learning
So far we have been looking at on-policy learning. However, it is often useful to do off-policy learning: evaluate a target policy $\pi(a|s)$ to compute $v_\pi(s)$ or $q_\pi(s, a)$, while we follow a behaviour policy $\mu(a|s)$. Of course, in this case $\pi \ne \mu$.
Why is off policy learning useful?
- We can learn from observing humans or other agents
- We can re-use experience that was previously generated from old policies $\pi_1, \pi_2, \ldots, \pi_{t-1}$, possibly in a batched manner
- We can learn about the optimal policy while following an exploratory policy
- We can learn about multiple policies while following one policy
The first mechanism is importance sampling. The main idea is to estimate an expectation under one distribution by re-weighting samples drawn from a different distribution:
$$\mathbb{E}_{X \sim P}[f(X)] = \sum P(X) f(X) = \sum Q(X) \frac{P(X)}{Q(X)} f(X) = \mathbb{E}_{X \sim Q}\!\left[ \frac{P(X)}{Q(X)} f(X) \right]$$
We can apply importance sampling to Monte-Carlo learning for off-policy Monte-Carlo evaluation:
- We use returns generated from the behaviour policy $\mu$ to evaluate the target policy $\pi$
- Then we weight the return $G_t$ according to the ratio of probabilities between the two policies
- We need to apply the correction at every time step along the whole episode, because the change in policy affects every time step:
  $G_t^{\pi/\mu} = \frac{\pi(A_t|S_t)}{\mu(A_t|S_t)} \frac{\pi(A_{t+1}|S_{t+1})}{\mu(A_{t+1}|S_{t+1})} \cdots \frac{\pi(A_T|S_T)}{\mu(A_T|S_T)} G_t$
- And then update the value towards the corrected return: $V(S_t) \leftarrow V(S_t) + \alpha \left( G_t^{\pi/\mu} - V(S_t) \right)$
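A sketch of the corrected return for the start state of an episode, assuming hypothetical callables `pi_prob(a, s)` and `mu_prob(a, s)` that return $\pi(a|s)$ and $\mu(a|s)$.

```python
def is_corrected_return(episode, pi_prob, mu_prob, gamma=1.0):
    """Importance-sampling-corrected Monte-Carlo return G_0^{pi/mu}.

    `episode` is a list of (state, action, reward) tuples generated by the
    behaviour policy mu.
    """
    G, discount, rho = 0.0, 1.0, 1.0
    for state, action, reward in episode:
        rho *= pi_prob(action, state) / mu_prob(action, state)  # product of ratios over the episode
        G += discount * reward                                   # ordinary discounted return
        discount *= gamma
    # The MC update would then be V(S_0) <- V(S_0) + alpha * (rho * G - V(S_0)).
    return rho * G
```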
While off-policy MC learning is theoretically sound, there are major problems that make it useless in practice:
- Importance sampling dramatically increases variance: we are adjusting at every time step, and the cumulative effect of these corrections over the whole episode makes our estimate of $V$ vary wildly
- We also cannot use this adjustment if $\mu$ is zero where $\pi$ is non-zero
So we have to combine bootstrapping with importance sampling, which lets us adjust the probabilities for just one time step. This gives importance sampling for off-policy TD:
- We use TD targets generated from $\mu$ to evaluate $\pi$
- For TD(0), we weight the TD target $R + \gamma V(S')$ by the importance sampling ratio
- This means we only need a single importance sampling correction:
  $V(S_t) \leftarrow V(S_t) + \alpha \left( \frac{\pi(A_t|S_t)}{\mu(A_t|S_t)} \left( R_{t+1} + \gamma V(S_{t+1}) \right) - V(S_t) \right)$
- This has much lower variance than MC importance sampling, and can work well if $\pi$ and $\mu$ do not differ by too much over a single step
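The corresponding single-step correction for off-policy TD(0), using the same hypothetical `pi_prob` / `mu_prob` helpers and a tabular `V`.

```python
def off_policy_td0_update(V, s, a, r, s_next, done, pi_prob, mu_prob, alpha=0.1, gamma=1.0):
    """One off-policy TD(0) update: only a single importance-sampling ratio is needed."""
    rho = pi_prob(a, s) / mu_prob(a, s)                    # pi(A_t|S_t) / mu(A_t|S_t)
    target = rho * (r + (0.0 if done else gamma * V[s_next]))
    V[s] += alpha * (target - V[s])
    return V
```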
As we have seen, importance sampling leads to large variance. The best solution is known as Q-learning, which is specific to TD(0), i.e. the Sarsa(0) setting:
- It does not require any importance sampling
- It allows off-policy learning of action-values
Recall that $\mu$ is the behaviour policy that our agent is actually following, and $\pi$ is the target policy that we want to learn about. The main idea is that in our Sarsa(0)-style update step, we update the Q-value towards the value of an action chosen by the target policy $\pi$, but allow our agent to continue following the behaviour policy $\mu$.
This allows the agent to explore the environment using $\mu$, but learn the action-value function of $\pi$. Specifically:
- We choose each next action for the agent using the behaviour policy: $A_{t+1} \sim \mu(\cdot | S_{t+1})$
- But we use an alternative successor action from the target policy in our Q-value update: $A' \sim \pi(\cdot | S_{t+1})$
- So we update $Q(S_t, A_t)$ towards the value of the alternative action:
  $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left( R_{t+1} + \gamma Q(S_{t+1}, A') - Q(S_t, A_t) \right)$
Note importantly that we are using $Q(S_{t+1}, A')$ with $A' \sim \pi$ in the update target above, instead of $A_{t+1} \sim \mu$. This is what allows us to learn off-policy.
Q-Learning (or SARSA-MAX)
A special case of Q-learning is when the target policy $\pi$ is greedy with respect to $Q(s, a)$. This is usually what people refer to as Q-learning.
We allow both behaviour and target policies to improve:
- The target policy $\pi$ is greedy with respect to $Q(s, a)$, i.e. $\pi(S_{t+1}) = \arg\max_{a'} Q(S_{t+1}, a')$
- The behaviour policy $\mu$ is $\epsilon$-greedy with respect to the same $Q(s, a)$
- The learning target inside the Q-update then simplifies as follows:
$$
\begin{aligned}
& R_{t+1} + \gamma Q(S_{t+1}, A') \\
&= R_{t+1} + \gamma Q\!\left(S_{t+1}, \arg\max_{a'} Q(S_{t+1}, a')\right) \\
&= R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a')
\end{aligned}
$$
Note that since we are following a greedy target policy, the action chosen is the Q-maximizing one (line 2). And since we are choosing the Q-maximizing action, we get the maximum Q-value over all possible actions (line 3). This simplifies the equation quite a bit, and it now resembles the Bellman optimality equation.
This leads us to the well-known Q-learning algorithm, which David calls Sarsa-max. The Q-update is:
$$Q(S, A) \leftarrow Q(S, A) + \alpha \left( R + \gamma \max_{a'} Q(S', a') - Q(S, A) \right)$$
There is a theorem telling us that the Q-learning control algorithm converges to the optimal action-value function, i.e. $Q(s, a) \to q_*(s, a)$.
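A minimal tabular Q-learning sketch under the same assumed `gym`-style interface: the behaviour policy is $\epsilon$-greedy, while the bootstrap uses the greedy max over successor actions.

```python
import numpy as np

def q_learning(env, n_episodes, alpha=0.1, gamma=1.0, epsilon=0.1, seed=0):
    """Tabular Q-learning: epsilon-greedy behaviour, greedy (max) bootstrap target."""
    rng = np.random.default_rng(seed)
    n_states, n_actions = env.observation_space.n, env.action_space.n
    Q = np.zeros((n_states, n_actions))

    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            # Behaviour policy mu: epsilon-greedy w.r.t. the current Q.
            if rng.random() < epsilon:
                a = int(rng.integers(n_actions))
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done, _ = env.step(a)
            # Target policy pi: greedy, so bootstrap with max_a' Q(S', a').
            target = r + (0.0 if done else gamma * np.max(Q[s_next]))
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```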
To wrap up, here is a classification of some algorithms we have so far:
| | Full Backup (Dynamic Programming) | Sample Backup (Temporal Difference) |
|---|---|---|
| Bellman Expectation Equation for $v_\pi(s)$ | Iterative Policy Evaluation | TD Learning |
| Bellman Expectation Equation for $q_\pi(s, a)$ | Q-Policy Iteration | Sarsa |
| Bellman Optimality Equation for $q_*(s, a)$ | Q-Value Iteration | Q-Learning |