Lecture 4: Model Free Prediction

Methods for the setting where no one tells us the environment's dynamics, in contrast to the planning setting from before.

Introduction

  • Lecture 3: planning by dynamic programming, in which we solve a known MDP
  • Lecture 4: Model-free prediction, in which we estimate the value function of an unknown MDP
    • Unknown in the sense that we do not have access to the environment, only the interactions that the agent has with the environment
  • Lecture 5: Model-free control (or optimization), in which we optimize the value function of an unknown MDP

Monte Carlo Reinforcement Learning

MC methods learn directly from episodes of experience.

  • It is model-free: there is no knowledge of MDP transitions / rewards.
  • It learns only from complete episodes (no bootstrapping).
  • MC uses the simplest idea: the value of a state is the mean return observed from that state over many episodes.
  • Downside: MC can only be applied to episodic MDPs
    • Episodic meaning that all episodes must terminate, so that each one yields a return value

The goal of MC Policy Evaluation is to learn $v_\pi$ from episodes of experience under policy $\pi$:

  • An episode: $S_1, A_1, R_2, S_2, A_2, \ldots, S_T \sim \pi$

Recall that:

  • The return is the total discounted reward: $G_t = R_{t+1} + \gamma R_{t+2} + \ldots + \gamma^{T-1} R_T$
  • And the value function is the expected return given that we start at a given state: $v_\pi(s) = \mathbb{E}_\pi\left[G_t \mid S_t = s\right]$

So the whole idea of MC policy evaluation is to replace the expected return with the empirical mean return observed over many episodes.

There are two main methods of performing this:

  • First visit MC policy evaluation
  • Every visit MC policy evaluation

Method 1: First visit MC policy evaluation. The algorithm for evaluating a given state $s$ is:

  • At the first time step $t$ where state $s$ is visited in a given episode:
    • Increment counter $N(s) \leftarrow N(s) + 1$. $N(s)$ is the count of episodes in which $s$ was visited.
    • Increment total return $S(s) \leftarrow S(s) + G_t$
    • Value is estimated by the mean return $V(s) = S(s) / N(s)$
  • By the law of large numbers, $V(s) \to v_\pi(s)$ as $N(s) \to \infty$

Note that $G_t$ above is the total discounted return from time step $t$ onwards.

Method 2: Every visit MC policy evaluation. The algorithm is identical to first visit, the only difference being that we perform the increment step every time we visit state $s$ in an episode (a sketch of both variants follows).
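
Below is a minimal Python sketch of both variants, assuming each episode is supplied as a list of (state, reward) pairs by some external sampler (the sampler itself, and this episode format, are assumptions for illustration):

```python
from collections import defaultdict

def mc_policy_evaluation(episodes, gamma=1.0, first_visit=True):
    """Estimate V(s) as the mean return observed from s.

    `episodes` is an iterable of episodes; each episode is a list of
    (state, reward) pairs, where `reward` is the reward received after
    leaving `state`. This input format is an assumption for illustration.
    """
    returns_sum = defaultdict(float)   # S(s): cumulative return
    visit_count = defaultdict(int)     # N(s): number of (first) visits

    for episode in episodes:
        # Compute the return G_t at every time step, working backwards.
        G = 0.0
        returns = [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            _, reward = episode[t]
            G = reward + gamma * G
            returns[t] = G

        seen = set()
        for t, (state, _) in enumerate(episode):
            if first_visit and state in seen:
                continue  # first-visit MC: only the first occurrence counts
            seen.add(state)
            visit_count[state] += 1
            returns_sum[state] += returns[t]

    # V(s) = S(s) / N(s)
    return {s: returns_sum[s] / visit_count[s] for s in visit_count}
```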

BlackJack example

  • There are 200 unique states:
    • Current sum (12 to 21). If the sum is 11 or below, the action is automatically to twist, so those states are excluded.
    • Dealer's showing card (ace to 10)
    • Do I have a "usable" ace?
  • Actions:
    • stick: stop receiving cards
    • twist: take another card.
  • Reward for stick:
    • +1 if our sum > dealer's sum
    • 0 if our sum = dealer's sum
    • -1 if our sum < dealer's sum
  • Reward for twist:
    • -1 if our sum > 21
    • 0 otherwise

We can use the MC policy evaluation algorithm to play 10,000 episodes of blackjack and compute the value function of each state under a given policy. For example, a naive policy is to stick if our sum is 20 or more, and twist otherwise (sketched below).
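
As a hedged sketch, the naive policy can be written as a simple function over the state tuple described above; plugging it into a blackjack episode sampler (not shown) together with the MC evaluation routine sketched earlier would give the value estimates:

```python
STICK, TWIST = 0, 1  # action encoding is an assumption for illustration

def naive_policy(state):
    """Stick on 20 or 21, otherwise twist.

    `state` is assumed to be a (player_sum, dealer_showing, usable_ace)
    tuple, matching the 200-state description above.
    """
    player_sum, _dealer_showing, _usable_ace = state
    return STICK if player_sum >= 20 else TWIST
```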

Incremental Mean

The mean $\mu_k$ of a sequence $x_1, x_2, \ldots$ can be computed incrementally:

$$\mu_k = \frac{1}{k}\sum_{j=1}^{k} x_j = \frac{1}{k}\left(x_k + (k-1)\mu_{k-1}\right) = \mu_{k-1} + \frac{1}{k}\left(x_k - \mu_{k-1}\right)$$

The last form shows that at each step we just need to adjust the running mean by a small quantity: the difference $x_k - \mu_{k-1}$ between the newly observed value and the current mean, scaled by $1/k$. This is analogous to a gradient update.

So we apply this incremental mean algorithm to Monte Carlo updates. Recall that the value function is the mean return over episodes. Hence we can change the above MC algorithm to an incremental mean update:

  • After observing a given episode $S_1, A_1, R_2, \ldots, S_T$:
    • For each state $S_t$ with return $G_t$: $N(S_t) \leftarrow N(S_t) + 1$ and $V(S_t) \leftarrow V(S_t) + \frac{1}{N(S_t)}\left(G_t - V(S_t)\right)$
    • We may even replace the running count $\frac{1}{N(S_t)}$ with a fixed step size $\alpha$: $V(S_t) \leftarrow V(S_t) + \alpha\left(G_t - V(S_t)\right)$. This is the usual approach in non-stationary problems. It allows us to avoid keeping track of old episodes and just keep updating $V$, as sketched below.
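
A minimal sketch of the constant step-size (incremental) MC update, assuming the same (state, reward)-pair episode format as in the earlier sketch:

```python
def incremental_mc_update(V, episode, alpha=0.1, gamma=1.0):
    """Every-visit MC with a fixed step size: V(S_t) += alpha * (G_t - V(S_t)).

    `V` is a dict mapping states to value estimates, updated in place.
    The episode format (a list of (state, reward) pairs) is an assumption.
    """
    G = 0.0
    # Walk the episode backwards so the return G_t is available at each step.
    for state, reward in reversed(episode):
        G = reward + gamma * G
        v = V.get(state, 0.0)
        V[state] = v + alpha * (G - v)
    return V
```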

Temporal Difference Learning

TD methods are different from MC methods, in that we do not need to wait for full episodes to learn.

  • TD methods, like MC methods, learn directly from episodes of experience
  • TD methods, like MC methods, are also model-free
  • TD learns from incomplete episodes using bootstrapping
  • TD updates a guess towards a guess

Goal remains the same: learn $v_\pi$ online from experience under policy $\pi$.

Simplest temporal difference learning algorithm: TD(0).

  • Update value $V(S_t)$ toward the estimated return $R_{t+1} + \gamma V(S_{t+1})$: $V(S_t) \leftarrow V(S_t) + \alpha\left(R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\right)$
  • $R_{t+1} + \gamma V(S_{t+1})$ is called the TD target; $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$ is called the TD error
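
A minimal TD(0) sketch, assuming transitions arrive one at a time as (state, reward, next_state, done) tuples (this interface is an assumption for illustration):

```python
def td0_update(V, state, reward, next_state, done, alpha=0.1, gamma=1.0):
    """One TD(0) update: V(S_t) += alpha * (R_{t+1} + gamma * V(S_{t+1}) - V(S_t)).

    `V` is a dict of value estimates, updated in place. The value of a
    terminal state is taken to be zero.
    """
    v_next = 0.0 if done else V.get(next_state, 0.0)
    td_target = reward + gamma * v_next       # R_{t+1} + gamma * V(S_{t+1})
    td_error = td_target - V.get(state, 0.0)  # delta_t
    V[state] = V.get(state, 0.0) + alpha * td_error
    return td_error
```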

Contrast this with the incremental every-visit Monte Carlo update which we saw earlier:

  • Update value $V(S_t)$ toward the actual return $G_t$: $V(S_t) \leftarrow V(S_t) + \alpha\left(G_t - V(S_t)\right)$

Car Driving Example

An analogy for understanding the difference between MC and TD methods. Imagine we are driving home. At the start, we expect the journey to take 30 mins. Then it starts raining, so we update our prediction to 40 mins. And so on. Eventually, the journey actually takes 43 mins.

  • With the MC method, we must wait until the journey is complete, and only then update every estimate along the way toward the actual outcome of 43 mins
  • With the TD method, we can immediately update each estimate toward the next prediction, at every step of the way

Bias Variance Trade-off

There is a bias variance trade-off between choosing MC or TD method for policy evaluation.

  • The return $G_t = R_{t+1} + \gamma R_{t+2} + \ldots + \gamma^{T-1} R_T$ is an unbiased estimate of $v_\pi(S_t)$
  • The oracle TD target $R_{t+1} + \gamma v_\pi(S_{t+1})$ is also an unbiased estimate of $v_\pi(S_t)$
    • We know this from the Bellman expectation equation
    • But it requires access to the oracle $v_\pi$, which we do not have
  • The TD target $R_{t+1} + \gamma V(S_{t+1})$ is a biased estimate of $v_\pi(S_t)$
    • This is because $V$ is our current estimate of the value function, which can be wildly wrong
  • Observe that the TD target has much lower variance than the return:
    • The return depends on many random actions, transitions and rewards over the entire run of the episode
    • The TD target depends on only one random action, transition and reward
      • The value function estimate $V(S_{t+1})$ itself is a deterministic function of the next state

So to summarize:

  • MC has high variance and zero bias
    • So it has good convergence properties, even with function approximation (covered later)
    • It is not very sensitive to the initial value
    • Very simple to understand and use
  • TD has low variance but some bias
    • Usually it is much more efficient than MC
    • TD(0) can be proven to converge to $v_\pi(s)$ with a table-lookup representation
    • But with function approximation, convergence is not always guaranteed
    • It is more sensitive to the initial value

What is function approximation? This will be covered later on. But in general, we have been treating $V(s)$ as a lookup table with one entry per state. This is not feasible for problems with large state spaces, hence we need to learn a function that approximates $v_\pi(s)$ for all states.

MC vs TD empirical example

  • TD generally converges faster than MC
  • But if the step size $\alpha$ is too large, TD may not fully converge, as it will oscillate

So far we have seen that both MC and TD converge as the number of episodes goes to infinity.

  • That is, $V(s) \to v_\pi(s)$ as the number of episodes $\to \infty$
  • But what if we only have a limited number of episodes to learn from?
  • For example, what if we are repeatedly sampling from the same finite batch of episodes $1, \ldots, K$?

AB Example

A simple example to illustrate difference between MC and TD in the finite data case. Suppose we have 6 episodes:

  • A, 0, B, 0
  • B, 1
  • B, 1
  • B, 1
  • B, 1
  • B, 0

What would we think $V(A)$ and $V(B)$ are?

  • If we use MC, then $V(A) = 0$ and $V(B) = 2/3$. $V(A) = 0$ because we only encountered one episode involving state $A$, and the return from $A$ was $0$.
  • If we use TD, then $V(A) = V(B) = 2/3$. $V(A) = 2/3$ because we observed a 100% probability of transiting from $A$ to $B$ with zero reward, so the value of $A$ (without discounting) is the same as the value of $B$ due to bootstrapping.

In more precise terms:

  • MC converges to the solution which minimizes the mean squared error
    • i.e. it minimizes the divergence from the observed returns, $\sum_{k=1}^{K}\sum_{t=1}^{T_k}\left(G_t^k - V(s_t^k)\right)^2$
    • In the AB example above, this sets $V(A) = 0$
  • TD(0) converges to the solution of the maximum-likelihood Markov model
    • i.e. it converges to the value function of the MDP that best fits the observed data
    • In the AB example, the fitted model has $P(B \mid A) = 1$ with zero reward on the $A \to B$ transition, so $V(A) = V(B) = 2/3$
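
A small sketch that makes the comparison concrete: it computes the MC estimates directly from the empirical returns, and the TD-style answer by reasoning through the maximum-likelihood Markov model fitted to the same six episodes (the episode encoding is an assumption):

```python
from collections import defaultdict

# The six episodes, each a list of (state, reward) pairs (undiscounted).
episodes = [
    [("A", 0), ("B", 0)],
    [("B", 1)], [("B", 1)], [("B", 1)], [("B", 1)],
    [("B", 0)],
]

# --- Monte Carlo: mean observed return from each state ---
returns = defaultdict(list)
for ep in episodes:
    G = 0
    per_step = []
    for state, reward in reversed(ep):
        G += reward
        per_step.append((state, G))
    for state, G in per_step:
        returns[state].append(G)
mc_values = {s: sum(gs) / len(gs) for s, gs in returns.items()}
print("MC:", mc_values)   # V(A) = 0, V(B) = 2/3

# --- Maximum-likelihood Markov model (what TD converges to) ---
# Empirically, A always transitions to B with reward 0, and B terminates
# with mean reward 4/6, so V(B) = 2/3 and V(A) = 0 + V(B) = 2/3.
v_B = sum(r for ep in episodes for s, r in ep if s == "B") / sum(
    1 for ep in episodes for s, _ in ep if s == "B")
v_A = 0 + v_B   # zero reward on A -> B, no discounting
print("TD (ML model):", {"A": v_A, "B": v_B})
```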

In summary:

  • TD exploits the Markov property, so it is usually more efficient in Markov environments, where we can rely on the states to encode the relevant information
  • MC does not exploit the Markov property, so it is usually more effective in non-Markov environments, e.g. under partial observability

So far we have looked at 3 types of backup:

  • Monte Carlo Backup: we sample one entire trajectory / episode from the agent's interactions with the environment till termination
  • Temporal Difference Backup: we sample one step lookahead and then update parameters
  • Dynamic Programming Backup: we look ahead one step, but because we have a model of the environment, we can compute the expectation over all possible next states.

This gives us two dimensions to categorize our algorithms:

  • Bootstrapping: the update involves an estimate (e.g. our value function)
    • MC does not bootstrap
    • DP bootstraps
    • TD bootstraps
  • Sampling: we use sampling instead of a full-width expectation / search
    • MC samples
    • DP does not sample
    • TD samples

TD Lambda

TD Lambda is a generalization of the above trade-off. We let the TD target look $n$ steps into the future before updating. If we look forward all the way to the end of the episode, it becomes Monte Carlo learning.

Specifically, for $n = 1, 2, \ldots, \infty$, the $n$-step returns are:

  • $n = 1$ (TD(0)): $G_t^{(1)} = R_{t+1} + \gamma V(S_{t+1})$
  • $n = 2$: $G_t^{(2)} = R_{t+1} + \gamma R_{t+2} + \gamma^2 V(S_{t+2})$
  • $n = \infty$ (MC): $G_t^{(\infty)} = R_{t+1} + \gamma R_{t+2} + \ldots + \gamma^{T-1} R_T$. We can see this corresponds to the MC update, without any use of the value function at all

So the $n$-step return is $G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \ldots + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n})$.

And the $n$-step TD learning update is $V(S_t) \leftarrow V(S_t) + \alpha\left(G_t^{(n)} - V(S_t)\right)$.
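
A minimal sketch of computing the $n$-step return from a recorded trajectory, assuming the trajectory is given as parallel lists of states and rewards (a hypothetical interface, where rewards[t] is the reward received after leaving states[t]):

```python
def n_step_return(rewards, states, V, t, n, gamma=1.0):
    """G_t^(n) = R_{t+1} + ... + gamma^(n-1) * R_{t+n} + gamma^n * V(S_{t+n}).

    `rewards[k]` is the reward received after leaving `states[k]`. If the
    episode terminates within n steps, this falls back to the MC return.
    """
    G = 0.0
    T = len(rewards)
    for k in range(min(n, T - t)):
        G += (gamma ** k) * rewards[t + k]
    if t + n < len(states):  # bootstrap only if S_{t+n} was actually reached
        G += (gamma ** n) * V.get(states[t + n], 0.0)
    return G
```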

What is the best $n$? It is a highly sensitive parameter that depends on the problem. Hence a proposal is made to average $n$-step returns over different $n$. For example, we could average the 2-step and 4-step returns: $\frac{1}{2}G_t^{(2)} + \frac{1}{2}G_t^{(4)}$. This averaging makes the algorithm much more robust to the choice of $n$ and the step size $\alpha$.

The common way to perform a weighted average of all the $n$-step returns is to use exponential decay, such that returns with a longer look-ahead window are weighted less. This algorithm is called TD($\lambda$). Specifically, the $\lambda$-return is:

$$G_t^{\lambda} = (1 - \lambda)\sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}$$

Note that for an episode terminating at time $T$, the weight given to the final, actual return $G_t$ is $\lambda^{T-t-1}$, i.e. the sum of the geometric weights $(1 - \lambda)\lambda^{n-1}$ for all $n \geq T - t$. It makes sense to put more weight on the final, actual return.

This leads directly to forward-view TD($\lambda$), where we sample complete trajectories and update the value function toward the $\lambda$-return: $V(S_t) \leftarrow V(S_t) + \alpha\left(G_t^{\lambda} - V(S_t)\right)$

Now, the forward view has a shortcoming: since $G_t^{\lambda}$ depends on $n$-step returns all the way to the end of the episode, we need to wait until the episode is complete before we can update the value function. Thus it suffers from the same downside as the MC update, where we cannot update the value function immediately after each step.
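
A sketch of the forward-view update for one finite episode, reusing the hypothetical `n_step_return` helper from the previous sketch (so it shares the same assumed trajectory format):

```python
def lambda_return(rewards, states, V, t, lam, gamma=1.0):
    """G_t^lambda = (1 - lam) * sum_n lam^(n-1) * G_t^(n), with the remaining
    geometric weight lam^(T-t-1) assigned to the final (Monte Carlo) return."""
    T = len(rewards)
    G_lambda = 0.0
    for n in range(1, T - t):  # intermediate n-step returns
        G_lambda += (1 - lam) * (lam ** (n - 1)) * n_step_return(
            rewards, states, V, t, n, gamma)
    # The tail weight goes to the full return, i.e. n = T - t.
    G_lambda += (lam ** (T - t - 1)) * n_step_return(
        rewards, states, V, t, T - t, gamma)
    return G_lambda

def forward_td_lambda(V, rewards, states, lam, alpha=0.1, gamma=1.0):
    """Offline forward-view TD(lambda): update each visited state toward its
    lambda-return once the episode is complete."""
    for t in range(len(rewards)):
        G = lambda_return(rewards, states, V, t, lam, gamma)
        s = states[t]
        V[s] = V.get(s, 0.0) + alpha * (G - V.get(s, 0.0))
    return V
```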

Backward View TD Lambda

One key idea is eligibility traces. In deciding to assign credit to past events for a current reward, there are generally two intuitive heuristics to use:

  • Frequency heuristic: assign credit to most frequent recent states
  • Recency heuristic: assign credit to most recent states

The eligibility trace combines both heuristics in a simple formula:

$$E_0(s) = 0, \qquad E_t(s) = \gamma \lambda E_{t-1}(s) + \mathbf{1}(S_t = s)$$

The eligibility trace gives us a weight $E_t(s)$ at each time step $t$ for each state $s$. This weight tells us how much credit we should assign to $s$ for the TD error observed at the current time step.

The Backward View TD Lambda uses this idea:

  • Keep an eligibility trace $E_t(s)$ for every state $s$
  • Update value $V(s)$ for every state $s$ in proportion to the TD error $\delta_t$ and the eligibility trace $E_t(s)$: $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$, and $V(s) \leftarrow V(s) + \alpha \delta_t E_t(s)$

Observe that $\delta_t$ is just our TD(0) error with a single-step look-ahead. Thus we can see that when $\lambda = 0$, only the current state is updated, since $E_t(s) = \mathbf{1}(S_t = s)$. This results in the TD(0) update: $V(S_t) \leftarrow V(S_t) + \alpha \delta_t$.

On the other extreme, when $\lambda = 1$, credit is deferred until the end of the episode: summing the updates over a full episode, the discounted TD errors telescope into the MC error $G_t - V(S_t)$ (this is not obvious from the formula alone, but follows by expanding the sum). Thus, for offline updates, it is equivalent to the MC update.

In fact, there is a theorem that the sum of offline updates is identical for the forward view and the backward view of TD($\lambda$). This is nice because the backward view with eligibility traces is easy to implement: we never need to look into the future, we just keep the eligibility traces up to date at each time step and apply the update to all states at every step, as sketched below.
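
A minimal backward-view TD($\lambda$) sketch for a single episode, again assuming transitions are supplied as (state, reward, next_state, done) tuples (a hypothetical interface):

```python
from collections import defaultdict

def backward_td_lambda_episode(V, transitions, lam, alpha=0.1, gamma=1.0):
    """Backward-view TD(lambda) over one episode.

    `transitions` is an iterable of (state, reward, next_state, done) tuples;
    `V` is a dict of value estimates, updated in place.
    """
    E = defaultdict(float)  # eligibility traces, reset at the episode start
    for state, reward, next_state, done in transitions:
        v_next = 0.0 if done else V.get(next_state, 0.0)
        delta = reward + gamma * v_next - V.get(state, 0.0)  # TD error
        # Decay all traces, then bump the trace of the current state.
        for s in E:
            E[s] *= gamma * lam
        E[state] += 1.0
        # Update every state in proportion to its eligibility.
        for s, e in E.items():
            V[s] = V.get(s, 0.0) + alpha * delta * e
    return V
```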