Lecture 4: Model Free Prediction

Methods for the setting where no one tells us the environment's dynamics, in contrast to the planning setting from before.

Introduction

  • Lecture 3: planning by dynamic programming, in which we solve a known MDP
  • Lecture 4: Model-free prediction, in which we estimate the value function of an unknown MDP
    • Unknown in the sense that we do not have access to the environment, only the interactions that the agent has with the environment
  • Lecture 5: Model-free control (or optimization), in which we optimize the value function of an unknown MDP

Monte Carlo Reinforcement Learning

MC methods learn directly from episodes of experience.

  • It is model-free: there is no knowledge of MDP transitions / rewards.
  • It learns only from complete episodes (no bootstrapping).
  • MC uses the simplest idea: the value of a state is the mean return observed from that state over many episodes.
  • Downside: MC can only be applied to episodic MDPs
    • Episodic meaning that all episodes must terminate, so that each one yields a return value

The goal of MC Policy Evaluation is to learn $v_\pi$ from episodes of experience under policy $\pi$:

  • An episode: $S_1, A_1, R_2, S_2, A_2, \ldots, S_T \sim \pi$

Recall that:

  • The return is the total discounted reward: $G_t = R_{t+1} + \gamma R_{t+2} + \ldots + \gamma^{T-1} R_T$
  • And the value function is the expected return given that we start at a given state: $v_\pi(s) = \mathbb{E}_\pi\left[G_t \mid S_t = s\right]$

So the whole idea of MC policy evaluation is to replace the expected return with the empirical mean return observed over many episodes.

There are two main methods of performing this:

  • First visit MC policy evaluation
  • Every visit MC policy evaluation

Method 1: First visit MC policy evaluation. The algorithm for evaluating a given state $s$ is:

  • At the first time step $t$ where state $s$ is visited in a given episode:
    • Increment counter $N(s) \leftarrow N(s) + 1$. $N(s)$ is the count of episodes in which $s$ was visited.
    • Increment total return $S(s) \leftarrow S(s) + G_t$
    • Value is estimated by the mean return $V(s) = S(s) / N(s)$
  • By the law of large numbers, $V(s) \to v_\pi(s)$ as $N(s) \to \infty$

Note that $G_t$ above is the total discounted return from time step $t$ onwards.

Method 2: Every visit MC policy evaluation. The algorithm is identical to first visit, the only difference being that we perform the increment step every time we visit state $s$ in an episode (a sketch of both variants follows).
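
Below is a minimal Python sketch of both variants, assuming each episode is supplied as a list of (state, reward) pairs by some external sampler (the sampler itself, and this episode format, are assumptions for illustration):

```python
from collections import defaultdict

def mc_policy_evaluation(episodes, gamma=1.0, first_visit=True):
    """Estimate V(s) as the mean return observed from s.

    `episodes` is an iterable of episodes; each episode is a list of
    (state, reward) pairs, where `reward` is the reward received after
    leaving `state`. This input format is an assumption for illustration.
    """
    returns_sum = defaultdict(float)   # S(s): cumulative return
    visit_count = defaultdict(int)     # N(s): number of (first) visits

    for episode in episodes:
        # Compute the return G_t at every time step, working backwards.
        G = 0.0
        returns = [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            _, reward = episode[t]
            G = reward + gamma * G
            returns[t] = G

        seen = set()
        for t, (state, _) in enumerate(episode):
            if first_visit and state in seen:
                continue  # first-visit MC: only the first occurrence counts
            seen.add(state)
            visit_count[state] += 1
            returns_sum[state] += returns[t]

    # V(s) = S(s) / N(s)
    return {s: returns_sum[s] / visit_count[s] for s in visit_count}
```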

BlackJack example

  • There are 200 unique states:
    • Current sum (12 to 21). If the sum is 11 or below, the action is automatically to twist, so those states are excluded.
    • Dealer's showing card (ace to 10)
    • Do I have a "usable" ace?
  • Actions:
    • stick: stop receiving cards
    • twist: take another card.
  • Reward for stick:
    • +1 if our sum > dealer's sum
    • 0 if our sum = dealer's sum
    • -1 if our sum < dealer's sum
  • Reward for twist:
    • -1 if our sum > 21
    • 0 otherwise

We can use the MC policy evaluation algorithm to play 10,000 episodes of blackjack and compute the value function of each state under a given policy. For example, a naive policy is to stick if our sum is 20 or more, and twist otherwise (sketched below).
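
As a hedged sketch, the naive policy can be written as a simple function over the state tuple described above; plugging it into a blackjack episode sampler (not shown) together with the MC evaluation routine sketched earlier would give the value estimates:

```python
STICK, TWIST = 0, 1  # action encoding is an assumption for illustration

def naive_policy(state):
    """Stick on 20 or 21, otherwise twist.

    `state` is assumed to be a (player_sum, dealer_showing, usable_ace)
    tuple, matching the 200-state description above.
    """
    player_sum, _dealer_showing, _usable_ace = state
    return STICK if player_sum >= 20 else TWIST
```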

Incremental Mean

The mean $\mu_k$ of a sequence $x_1, x_2, \ldots$ can be computed incrementally:

$$\mu_k = \frac{1}{k}\sum_{j=1}^{k} x_j = \frac{1}{k}\left(x_k + (k-1)\mu_{k-1}\right) = \mu_{k-1} + \frac{1}{k}\left(x_k - \mu_{k-1}\right)$$

The last form shows that at each step we just need to adjust the running mean by a small quantity: the difference $x_k - \mu_{k-1}$ between the newly observed value and the current mean, scaled by $1/k$. This is analogous to a gradient update.

So we apply this incremental mean algorithm to Monte Carlo updates. Recall that the value function is the mean return over episodes. Hence we can change the above MC algorithm to an incremental mean update:

  • After observing a given episode $S_1, A_1, R_2, \ldots, S_T$:
    • For each state $S_t$ with return $G_t$: $N(S_t) \leftarrow N(S_t) + 1$ and $V(S_t) \leftarrow V(S_t) + \frac{1}{N(S_t)}\left(G_t - V(S_t)\right)$
    • We may even replace the running count $\frac{1}{N(S_t)}$ with a fixed step size $\alpha$: $V(S_t) \leftarrow V(S_t) + \alpha\left(G_t - V(S_t)\right)$. This is the usual approach in non-stationary problems. It allows us to avoid keeping track of old episodes and just keep updating $V$, as sketched below.
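
A minimal sketch of the constant step-size (incremental) MC update, assuming the same (state, reward)-pair episode format as in the earlier sketch:

```python
def incremental_mc_update(V, episode, alpha=0.1, gamma=1.0):
    """Every-visit MC with a fixed step size: V(S_t) += alpha * (G_t - V(S_t)).

    `V` is a dict mapping states to value estimates, updated in place.
    The episode format (a list of (state, reward) pairs) is an assumption.
    """
    G = 0.0
    # Walk the episode backwards so the return G_t is available at each step.
    for state, reward in reversed(episode):
        G = reward + gamma * G
        v = V.get(state, 0.0)
        V[state] = v + alpha * (G - v)
    return V
```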

Temporal Difference Learning

TD methods are different from MC methods, in that we do not need to wait for full episodes to learn.

  • TD methods, like MC methods, learn directly from episodes of experience
  • TD methods, like MC methods, are also model-free
  • TD learns from incomplete episodes using bootstrapping
  • TD updates a guess towards a guess

Goal remains the same: learn $v_\pi$ online from experience under policy $\pi$.

Simplest temporal difference learning algorithm: TD(0).

  • Update value $V(S_t)$ toward the estimated return $R_{t+1} + \gamma V(S_{t+1})$: $V(S_t) \leftarrow V(S_t) + \alpha\left(R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\right)$
  • $R_{t+1} + \gamma V(S_{t+1})$ is called the TD target; $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$ is called the TD error
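
A minimal TD(0) sketch, assuming transitions arrive one at a time as (state, reward, next_state, done) tuples (this interface is an assumption for illustration):

```python
def td0_update(V, state, reward, next_state, done, alpha=0.1, gamma=1.0):
    """One TD(0) update: V(S_t) += alpha * (R_{t+1} + gamma * V(S_{t+1}) - V(S_t)).

    `V` is a dict of value estimates, updated in place. The value of a
    terminal state is taken to be zero.
    """
    v_next = 0.0 if done else V.get(next_state, 0.0)
    td_target = reward + gamma * v_next       # R_{t+1} + gamma * V(S_{t+1})
    td_error = td_target - V.get(state, 0.0)  # delta_t
    V[state] = V.get(state, 0.0) + alpha * td_error
    return td_error
```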

Contrast this with the incremental every-visit Monte Carlo update which we saw earlier:

  • Update value $V(S_t)$ toward the actual return $G_t$: $V(S_t) \leftarrow V(S_t) + \alpha\left(G_t - V(S_t)\right)$

Car Driving Example

An analogy for understanding the difference between MC and TD methods. Imagine we are driving home. At the start, we expect the journey to take 30 mins. Then it starts raining, so we update our prediction to 40 mins. And so on. Eventually, the journey actually takes 43 mins.

  • With the MC method, we must wait until the journey is complete, and only then update every estimate along the way toward the actual outcome of 43 mins
  • With the TD method, we can immediately update each estimate toward the next prediction, at every step of the way

Bias Variance Trade-off

There is a bias variance trade-off between choosing MC or TD method for policy evaluation.

  • The return $G_t = R_{t+1} + \gamma R_{t+2} + \ldots + \gamma^{T-1} R_T$ is an unbiased estimate of $v_\pi(S_t)$
  • The oracle TD target $R_{t+1} + \gamma v_\pi(S_{t+1})$ is also an unbiased estimate of $v_\pi(S_t)$
    • We know this from the Bellman expectation equation
    • But it requires access to the oracle $v_\pi$, which we do not have
  • The TD target $R_{t+1} + \gamma V(S_{t+1})$ is a biased estimate of $v_\pi(S_t)$
    • This is because $V$ is our current estimate of the value function, which can be wildly wrong
  • Observe that the TD target has much lower variance than the return:
    • The return depends on many random actions, transitions and rewards over the entire run of the episode
    • The TD target depends on only one random action, transition and reward
      • The value function estimate $V(S_{t+1})$ itself is a deterministic function of the next state

So to summarize:

  • MC has high variance and zero bias
    • So it has good convergence properties, even with function approximation (covered later)
    • It is not very sensitive to the initial value
    • Very simple to understand and use
  • TD has low variance but some bias
    • Usually it is much more efficient than MC
    • TD(0) can be proven to converge to $v_\pi(s)$ with a table-lookup representation
    • But with function approximation, convergence is not always guaranteed
    • It is more sensitive to the initial value

What is function approximation? This will be covered later on. But in general, we have been treating $V(s)$ as a lookup table with one entry per state. This is not feasible for problems with large state spaces, hence we need to learn a function that approximates $v_\pi(s)$ for all states.

MC vs TD empirical example

  • TD generally converges faster than MC
  • But if the step size $\alpha$ is too large, TD may not fully converge, as it will oscillate

So far we have seen that both MC and TD converge as the number of episodes goes to infinity.

  • That is, $V(s) \to v_\pi(s)$ as the number of episodes $\to \infty$
  • But what if we only have a limited number of episodes to learn from?
  • For example, what if we are repeatedly sampling from the same finite batch of episodes $1, \ldots, K$?

AB Example

A simple example to illustrate difference between MC and TD in the finite data case. Suppose we have 6 episodes:

  • A, 0, B, 0
  • B, 1
  • B, 1
  • B, 1
  • B, 1
  • B, 0

What would we think $V(A)$ and $V(B)$ are?

  • If we use MC, then $V(A) = 0$ and $V(B) = 2/3$. $V(A) = 0$ because we only encountered one episode involving state $A$, and the return from $A$ was $0$.
  • If we use TD, then $V(A) = V(B) = 2/3$. $V(A) = 2/3$ because we observed a 100% probability of transiting from $A$ to $B$ with zero reward, so the value of $A$ (without discounting) is the same as the value of $B$ due to bootstrapping.

In more precise terms:

  • MC converges to the solution which minimizes the mean squared error
    • i.e. it minimizes the divergence from the observed returns, $\sum_{k=1}^{K}\sum_{t=1}^{T_k}\left(G_t^k - V(s_t^k)\right)^2$
    • In the AB example above, this sets $V(A) = 0$
  • TD(0) converges to the solution of the maximum-likelihood Markov model
    • i.e. it converges to the value function of the MDP that best fits the observed data
    • In the AB example, the fitted model has $P(B \mid A) = 1$ with zero reward on the $A \to B$ transition, so $V(A) = V(B) = 2/3$
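
A small sketch that makes the comparison concrete: it computes the MC estimates directly from the empirical returns, and the TD-style answer by reasoning through the maximum-likelihood Markov model fitted to the same six episodes (the episode encoding is an assumption):

```python
from collections import defaultdict

# The six episodes, each a list of (state, reward) pairs (undiscounted).
episodes = [
    [("A", 0), ("B", 0)],
    [("B", 1)], [("B", 1)], [("B", 1)], [("B", 1)],
    [("B", 0)],
]

# --- Monte Carlo: mean observed return from each state ---
returns = defaultdict(list)
for ep in episodes:
    G = 0
    per_step = []
    for state, reward in reversed(ep):
        G += reward
        per_step.append((state, G))
    for state, G in per_step:
        returns[state].append(G)
mc_values = {s: sum(gs) / len(gs) for s, gs in returns.items()}
print("MC:", mc_values)   # V(A) = 0, V(B) = 2/3

# --- Maximum-likelihood Markov model (what TD converges to) ---
# Empirically, A always transitions to B with reward 0, and B terminates
# with mean reward 4/6, so V(B) = 2/3 and V(A) = 0 + V(B) = 2/3.
v_B = sum(r for ep in episodes for s, r in ep if s == "B") / sum(
    1 for ep in episodes for s, _ in ep if s == "B")
v_A = 0 + v_B   # zero reward on A -> B, no discounting
print("TD (ML model):", {"A": v_A, "B": v_B})
```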

In summary:

  • TD exploits the Markov property, so it is usually more efficient in Markov environments, where we can rely on the states to encode the relevant information
  • MC does not exploit the Markov property, so it is usually more effective in non-Markov environments, e.g. under partial observability

So far we have looked at 3 types of backup:

  • Monte Carlo Backup: we sample one entire trajectory / episode from the agent's interactions with the environment till termination
  • Temporal Difference Backup: we sample one step lookahead and then update parameters
  • Dynamic Programming Backup: we look ahead one step, but because we have a model of the environment, we can compute the expectation over all possible next states.

This gives us two dimensions to categorize our algorithms:

  • Bootstrapping: the update involves an estimate (e.g. our value function)
    • MC does not bootstrap
    • DP bootstraps
    • TD bootstraps
  • Sampling: we use sampling instead of a full-width expectation / search
    • MC samples
    • DP does not sample
    • TD samples

TD Lambda

TD Lambda is a generalization of the above trade-off. We let the TD target look $n$ steps into the future before updating. If we look forward all the way to the end of the episode, it becomes Monte Carlo learning.

Specifically, for $n = 1, 2, \ldots, \infty$, the $n$-step returns are:

  • $n = 1$ (TD(0)): $G_t^{(1)} = R_{t+1} + \gamma V(S_{t+1})$
  • $n = 2$: $G_t^{(2)} = R_{t+1} + \gamma R_{t+2} + \gamma^2 V(S_{t+2})$
  • $n = \infty$ (MC): $G_t^{(\infty)} = R_{t+1} + \gamma R_{t+2} + \ldots + \gamma^{T-1} R_T$. We can see this corresponds to the MC update, without any use of the value function at all

So the $n$-step return is $G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \ldots + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n})$.

And the $n$-step TD learning update is $V(S_t) \leftarrow V(S_t) + \alpha\left(G_t^{(n)} - V(S_t)\right)$.
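
A minimal sketch of computing the $n$-step return from a recorded trajectory, assuming the trajectory is given as parallel lists of states and rewards (a hypothetical interface, where rewards[t] is the reward received after leaving states[t]):

```python
def n_step_return(rewards, states, V, t, n, gamma=1.0):
    """G_t^(n) = R_{t+1} + ... + gamma^(n-1) * R_{t+n} + gamma^n * V(S_{t+n}).

    `rewards[k]` is the reward received after leaving `states[k]`. If the
    episode terminates within n steps, this falls back to the MC return.
    """
    G = 0.0
    T = len(rewards)
    for k in range(min(n, T - t)):
        G += (gamma ** k) * rewards[t + k]
    if t + n < len(states):  # bootstrap only if S_{t+n} was actually reached
        G += (gamma ** n) * V.get(states[t + n], 0.0)
    return G
```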

What is the best $n$? It is a highly sensitive parameter that depends on the problem. Hence a proposal is made to average $n$-step returns over different $n$. For example, we could average the 2-step and 4-step returns: $\frac{1}{2}G_t^{(2)} + \frac{1}{2}G_t^{(4)}$. This averaging makes the algorithm much more robust to the choice of $n$ and the step size $\alpha$.

The common way to perform a weighted average of all the $n$-step returns is to use exponential decay, such that returns with a longer look-ahead window are weighted less. This algorithm is called TD($\lambda$). Specifically, the $\lambda$-return is:

$$G_t^{\lambda} = (1 - \lambda)\sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}$$

Note that for an episode terminating at time $T$, the weight given to the final, actual return $G_t$ is $\lambda^{T-t-1}$, i.e. the sum of the geometric weights $(1 - \lambda)\lambda^{n-1}$ for all $n \geq T - t$. It makes sense to put more weight on the final, actual return.

This leads directly to forward-view TD($\lambda$), where we sample complete trajectories and update the value function toward the $\lambda$-return: $V(S_t) \leftarrow V(S_t) + \alpha\left(G_t^{\lambda} - V(S_t)\right)$

Now, the forward view has a shortcoming: since $G_t^{\lambda}$ depends on $n$-step returns all the way to the end of the episode, we need to wait until the episode is complete before we can update the value function. Thus it suffers from the same downside as the MC update, where we cannot update the value function immediately after each step.
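
A sketch of the forward-view update for one finite episode, reusing the hypothetical `n_step_return` helper from the previous sketch (so it shares the same assumed trajectory format):

```python
def lambda_return(rewards, states, V, t, lam, gamma=1.0):
    """G_t^lambda = (1 - lam) * sum_n lam^(n-1) * G_t^(n), with the remaining
    geometric weight lam^(T-t-1) assigned to the final (Monte Carlo) return."""
    T = len(rewards)
    G_lambda = 0.0
    for n in range(1, T - t):  # intermediate n-step returns
        G_lambda += (1 - lam) * (lam ** (n - 1)) * n_step_return(
            rewards, states, V, t, n, gamma)
    # The tail weight goes to the full return, i.e. n = T - t.
    G_lambda += (lam ** (T - t - 1)) * n_step_return(
        rewards, states, V, t, T - t, gamma)
    return G_lambda

def forward_td_lambda(V, rewards, states, lam, alpha=0.1, gamma=1.0):
    """Offline forward-view TD(lambda): update each visited state toward its
    lambda-return once the episode is complete."""
    for t in range(len(rewards)):
        G = lambda_return(rewards, states, V, t, lam, gamma)
        s = states[t]
        V[s] = V.get(s, 0.0) + alpha * (G - V.get(s, 0.0))
    return V
```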

Backward View TD Lambda

One key idea is eligibility traces. In deciding to assign credit to past events for a current reward, there are generally two intuitive heuristics to use:

  • Frequency heuristic: assign credit to most frequent recent states
  • Recency heuristic: assign credit to most recent states

The eligibility trace combines both heuristics in a simple formula:

$$E_0(s) = 0, \qquad E_t(s) = \gamma \lambda E_{t-1}(s) + \mathbf{1}(S_t = s)$$

The eligibility trace gives us a weight $E_t(s)$ at each time step $t$ for each state $s$. This weight tells us how much credit we should assign to $s$ for the TD error observed at the current time step.

The Backward View TD Lambda uses this idea:

  • Keep an eligibility trace $E_t(s)$ for every state $s$
  • Update value $V(s)$ for every state $s$ in proportion to the TD error $\delta_t$ and the eligibility trace $E_t(s)$: $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$, and $V(s) \leftarrow V(s) + \alpha \delta_t E_t(s)$

Observe that $\delta_t$ is just our TD(0) error with a single-step look-ahead. Thus we can see that when $\lambda = 0$, only the current state is updated, since $E_t(s) = \mathbf{1}(S_t = s)$. This results in the TD(0) update: $V(S_t) \leftarrow V(S_t) + \alpha \delta_t$.

On the other extreme, when $\lambda = 1$, credit is deferred until the end of the episode: summing the updates over a full episode, the discounted TD errors telescope into the MC error $G_t - V(S_t)$ (this is not obvious from the formula alone, but follows by expanding the sum). Thus, for offline updates, it is equivalent to the MC update.

In fact, there is a theorem that the sum of offline updates is identical for the forward view and the backward view of TD($\lambda$). This is nice because the backward view with eligibility traces is easy to implement: we never need to look into the future, we just keep the eligibility traces up to date at each time step and apply the update to all states at every step, as sketched below.
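
A minimal backward-view TD($\lambda$) sketch for a single episode, again assuming transitions are supplied as (state, reward, next_state, done) tuples (a hypothetical interface):

```python
from collections import defaultdict

def backward_td_lambda_episode(V, transitions, lam, alpha=0.1, gamma=1.0):
    """Backward-view TD(lambda) over one episode.

    `transitions` is an iterable of (state, reward, next_state, done) tuples;
    `V` is a dict of value estimates, updated in place.
    """
    E = defaultdict(float)  # eligibility traces, reset at the episode start
    for state, reward, next_state, done in transitions:
        v_next = 0.0 if done else V.get(next_state, 0.0)
        delta = reward + gamma * v_next - V.get(state, 0.0)  # TD error
        # Decay all traces, then bump the trace of the current state.
        for s in E:
            E[s] *= gamma * lam
        E[state] += 1.0
        # Update every state in proportion to its eligibility.
        for s, e in E.items():
            V[s] = V.get(s, 0.0) + alpha * delta * e
    return V
```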