Lecture 6: Value Function Approximation
This lecture looks at approximating value functions (with linear models or neural networks) to overcome the problem of large state and action spaces.
RL often encounters large problems:
- Backgammon: $10^{20}$ states
- Go: $10^{170}$ states
- Helicopter: continuous state space
We want to do policy evaluation and control efficiently in large state spaces. So far, we have represented $V(s)$ or $Q(s, a)$ with a lookup table:
- Every state $s$ has an entry $V(s)$
- Every state-action pair $(s, a)$ has an entry $Q(s, a)$
This is a problem for large MDPs:
- Too many states or actions to store in memory
- It is too slow or data inefficient to learn the value of each state individually
Solution for large MDPs:
- Estimate the value function with function approximation using parameters $\mathbf{w}$: $\hat{v}(s, \mathbf{w}) \approx v_\pi(s)$ or $\hat{q}(s, a, \mathbf{w}) \approx q_\pi(s, a)$
- Generalizes from seen states to unseen states
- Update the parameters $\mathbf{w}$ of our function using MC or TD learning
Types of value function approximation (different architectures):
- Feed a given state $s$ into a network with parameters $\mathbf{w}$, which spits out $\hat{v}(s, \mathbf{w})$, our value estimate for being in state $s$
- Have a neural network which takes in a state-action pair $(s, a)$ and spits out the Q-value $\hat{q}(s, a, \mathbf{w})$
- Sometimes, it is more efficient to have a neural network that takes in a single state $s$ and spits out Q-values for every possible action in a single forward pass, i.e. we get $\hat{q}(s, a_1, \mathbf{w}), \ldots, \hat{q}(s, a_m, \mathbf{w})$
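As a rough illustration of the third architecture, here is a minimal numpy sketch of a network that maps one state vector to a vector of Q-values, one per action, in a single forward pass. The layer sizes, initialization, and names are illustrative assumptions, not anything from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_q_network(state_dim, n_actions, hidden=64):
    """One hidden layer; sizes and initialization are arbitrary choices."""
    return {
        "W1": rng.normal(0.0, 0.1, (hidden, state_dim)),
        "b1": np.zeros(hidden),
        "W2": rng.normal(0.0, 0.1, (n_actions, hidden)),
        "b2": np.zeros(n_actions),
    }

def q_values(params, state):
    """Single forward pass: returns a vector with one Q-value per action."""
    h = np.tanh(params["W1"] @ state + params["b1"])
    return params["W2"] @ h + params["b2"]

# Usage: q = q_values(params, s); the greedy action is q.argmax().
```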
Which function approximator should we use? We focus on differentiable function approximators that we can easily optimize, i.e. linear combinations of features and neural networks. Furthermore, we want a training method that suits non-iid, non-stationary data, so this is not quite standard supervised learning.
Incremental Methods
Gradient Descent
Starting with gradient descent.
- Let $J(\mathbf{w})$ be a differentiable function of parameter vector $\mathbf{w}$
- Define the gradient of $J(\mathbf{w})$ to be the vector $\nabla_{\mathbf{w}} J(\mathbf{w})$ whose $i$-th component is $\frac{\partial J(\mathbf{w})}{\partial w_i}$
- To find a local minimum of $J(\mathbf{w})$, we adjust the parameters in the negative gradient direction: $\Delta\mathbf{w} = -\frac{1}{2}\alpha \nabla_{\mathbf{w}} J(\mathbf{w})$
Goal: find the parameter vector $\mathbf{w}$ minimizing the mean squared error between the approximate value function $\hat{v}(S, \mathbf{w})$ and the true oracle value function $v_\pi(S)$ (assuming for now that we know the oracle): $J(\mathbf{w}) = \mathbb{E}_\pi\big[(v_\pi(S) - \hat{v}(S, \mathbf{w}))^2\big]$
Gradient descent finds a local minimum: $\Delta\mathbf{w} = -\frac{1}{2}\alpha \nabla_{\mathbf{w}} J(\mathbf{w}) = \alpha\, \mathbb{E}_\pi\big[(v_\pi(S) - \hat{v}(S, \mathbf{w}))\, \nabla_{\mathbf{w}} \hat{v}(S, \mathbf{w})\big]$
Stochastic gradient descent samples the gradient: $\Delta\mathbf{w} = \alpha\, (v_\pi(S) - \hat{v}(S, \mathbf{w}))\, \nabla_{\mathbf{w}} \hat{v}(S, \mathbf{w})$
The nice thing about SGD is that the expected update is equal to the full gradient update, and it still works when the data arrives online from a non-stationary distribution.
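A minimal sketch of the stochastic update, assuming for now that we can query the oracle value $v_\pi(s)$ as `v_target` (later we replace it with MC or TD targets); `v_hat` and `grad_v_hat` stand in for whatever differentiable approximator we pick.

```python
def sgd_step(w, s, v_target, v_hat, grad_v_hat, alpha=0.01):
    """One SGD step on the sampled squared error (v_target - v_hat(s, w))^2.

    v_hat(s, w)      -> scalar value estimate
    grad_v_hat(s, w) -> gradient of that estimate w.r.t. w (same shape as w)
    """
    error = v_target - v_hat(s, w)               # prediction error
    return w + alpha * error * grad_v_hat(s, w)  # move along the sampled gradient
```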
Feature Vectors
To represent a state, we use a feature vector $\mathbf{x}(S) = \big(x_1(S), \ldots, x_n(S)\big)^\top$.
For example, the features (numeric) could be:
- Distance of robot to landmarks
- Trends in the stock market
- Piece and pawn configuration on a chess board
Linear Value Function Approximation
Let us represent the value function by a linear combination of features (i.e. just a dot product between two vectors): $\hat{v}(S, \mathbf{w}) = \mathbf{x}(S)^\top \mathbf{w} = \sum_{j=1}^{n} x_j(S)\, w_j$
The nice thing is that the objective is quadratic in the parameters $\mathbf{w}$, so it is a convex optimization problem, i.e. SGD converges to the global optimum: $J(\mathbf{w}) = \mathbb{E}_\pi\big[(v_\pi(S) - \mathbf{x}(S)^\top \mathbf{w})^2\big]$
The gradient update is really simple: $\nabla_{\mathbf{w}} \hat{v}(S, \mathbf{w}) = \mathbf{x}(S)$, so $\Delta\mathbf{w} = \alpha\, (v_\pi(S) - \hat{v}(S, \mathbf{w}))\, \mathbf{x}(S)$
Note that we are just substituting the simple expression $\nabla_{\mathbf{w}} \hat{v}(S, \mathbf{w}) = \mathbf{x}(S)$ into the general formula above. The update may be interpreted as step-size $\times$ prediction error $\times$ feature value: $\Delta\mathbf{w} = \underbrace{\alpha}_{\text{step size}} \times \underbrace{(v_\pi(S) - \hat{v}(S, \mathbf{w}))}_{\text{prediction error}} \times \underbrace{\mathbf{x}(S)}_{\text{feature value}}$. Intuitively, this means that features which correlate strongly with the prediction error receive large gradient updates.
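In code, the linear case is about as simple as it gets; a sketch, assuming `x` and `w` are numpy arrays of the same length.

```python
def linear_v_hat(w, x):
    """Linear value estimate: dot product of feature vector and weights."""
    return x @ w

def linear_update(w, x, v_target, alpha=0.01):
    """Step-size * prediction error * feature vector (the gradient is just x)."""
    return w + alpha * (v_target - x @ w) * x
```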
We can think of table lookup as a special case of linear value function approximation. Suppose we use table-lookup (one-hot) features as follows: $\mathbf{x}^{\text{table}}(S) = \big(\mathbf{1}(S = s_1), \ldots, \mathbf{1}(S = s_n)\big)^\top$
And suppose we have a parameter vector $\mathbf{w}$ of size $n$, such that we have one parameter for each state. Then we have: $\hat{v}(S, \mathbf{w}) = \mathbf{x}^{\text{table}}(S)^\top \mathbf{w} = \sum_{j=1}^{n} \mathbf{1}(S = s_j)\, w_j$
And we can see that this reduces to a table lookup, where the parameter $w_j$ represents the state value of state $s_j$.
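A quick numeric check of this reduction, assuming a toy problem with four states indexed 0 to 3: the one-hot feature vector picks out exactly one weight, so each $w_j$ behaves like a table entry.

```python
import numpy as np

n_states = 4
w = np.array([1.0, -0.5, 2.0, 0.0])   # one parameter per state

def one_hot(s, n=n_states):
    x = np.zeros(n)
    x[s] = 1.0
    return x

# The linear estimate x(s) . w simply reads off w[s]: a table lookup.
assert all(one_hot(s) @ w == w[s] for s in range(n_states))
```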
Estimating the Oracle
So far, we have assumed that the true oracle value function $v_\pi(s)$ is available, but in RL there is no true label, only rewards. So in practice, we need to substitute a target for $v_\pi(s)$:
- For MC, the target is the return $G_t$: $\Delta\mathbf{w} = \alpha\, (G_t - \hat{v}(S_t, \mathbf{w}))\, \nabla_{\mathbf{w}} \hat{v}(S_t, \mathbf{w})$
- For TD(0), the target is the TD target $R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w})$: $\Delta\mathbf{w} = \alpha\, (R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}) - \hat{v}(S_t, \mathbf{w}))\, \nabla_{\mathbf{w}} \hat{v}(S_t, \mathbf{w})$
- For TD($\lambda$), the target is the $\lambda$-return $G_t^\lambda$: $\Delta\mathbf{w} = \alpha\, (G_t^\lambda - \hat{v}(S_t, \mathbf{w}))\, \nabla_{\mathbf{w}} \hat{v}(S_t, \mathbf{w})$
Monte Carlo with Value Function Approximation
We can think of our algorithm as supervised learning.
- Treat the return $G_t$ as an unbiased, noisy sample of the true value $v_\pi(S_t)$
- We are therefore applying supervised learning to the "training data": $\langle S_1, G_1 \rangle, \langle S_2, G_2 \rangle, \ldots, \langle S_T, G_T \rangle$
- For example, using linear MC policy evaluation (see the sketch after this list): $\Delta\mathbf{w} = \alpha\, (G_t - \hat{v}(S_t, \mathbf{w}))\, \mathbf{x}(S_t)$
- MC evaluation converges to a local optimum even when using non-linear value function approximation
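A minimal sketch of linear MC policy evaluation, assuming an episode has already been collected under the policy as a list of (feature vector, reward) pairs; the helper name and interface are illustrative.

```python
def mc_evaluate(w, episode, alpha=0.01, gamma=1.0):
    """Linear MC policy evaluation on one collected episode.

    episode: list of (x_t, r_{t+1}) pairs, where x_t is the feature vector
    of S_t (a numpy array) and r_{t+1} is the reward that followed.
    """
    G = 0.0
    for x, r in reversed(episode):        # walk backwards to accumulate returns
        G = r + gamma * G                 # return G_t from this state onwards
        w = w + alpha * (G - x @ w) * x   # regress v_hat(S_t, w) towards G_t
    return w
```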
TD with Value Function Approximation
The same applies to TD learning, but now the target is a biased estimate:
- The TD target $R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w})$ is a biased sample of the true value $v_\pi(S_t)$, because it bootstraps from our own (approximate) value function
- We can still apply supervised learning to the "training data": $\langle S_1, R_2 + \gamma \hat{v}(S_2, \mathbf{w}) \rangle, \langle S_2, R_3 + \gamma \hat{v}(S_3, \mathbf{w}) \rangle, \ldots, \langle S_{T-1}, R_T \rangle$
- For example, using linear TD(0) (sketched in code after the note below): $\Delta\mathbf{w} = \alpha\, (R + \gamma \hat{v}(S', \mathbf{w}) - \hat{v}(S, \mathbf{w}))\, \mathbf{x}(S)$
- There is a theorem showing that for linear TD(0), despite the bias, we will always converge (close) to the global optimum
Note: there is a slight inconsistency in the above formula once we start introducing bootstrapped approximations of the return. Recall that when we used the oracle $v_\pi$ as the target and took the derivative, only $\hat{v}(S, \mathbf{w})$ entered the derivative, because the oracle value is a constant with respect to $\mathbf{w}$.
However, once we substitute $\hat{v}$ itself for the oracle in the target, we should technically differentiate through that term as well. As it turns out, this is not a good idea and does not lead to convergence; treating the target as a constant (a "semi-gradient" update) is what works in practice, and there is theoretical analysis to justify it.
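A sketch of one linear semi-gradient TD(0) step on a single transition, assuming numpy feature vectors; note that the bootstrapped target is deliberately treated as a constant, as discussed above.

```python
def td0_update(w, x, r, x_next, done, alpha=0.01, gamma=0.99):
    """One linear semi-gradient TD(0) step on a transition (x, r, x_next).

    The bootstrapped target r + gamma * x_next.w is treated as a constant:
    we do not differentiate through it.
    """
    target = r if done else r + gamma * (x_next @ w)
    return w + alpha * (target - x @ w) * x
```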
TD($\lambda$) with Value Function Approximation
And again, we can do the same with TD($\lambda$), since the $\lambda$-return $G_t^\lambda$ is also a biased sample of the true value $v_\pi(s)$:
- The training data is now: $\langle S_1, G_1^\lambda \rangle, \langle S_2, G_2^\lambda \rangle, \ldots, \langle S_{T-1}, G_{T-1}^\lambda \rangle$
- The forward view linear TD($\lambda$) update is: $\Delta\mathbf{w} = \alpha\, (G_t^\lambda - \hat{v}(S_t, \mathbf{w}))\, \mathbf{x}(S_t)$
- The backward view linear TD($\lambda$) update is (sketched in code below): $\delta_t = R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}) - \hat{v}(S_t, \mathbf{w})$, $\; E_t = \gamma \lambda E_{t-1} + \mathbf{x}(S_t)$, $\; \Delta\mathbf{w} = \alpha\, \delta_t E_t$
- There is a theorem to show that the forward view and backward view of linear TD($\lambda$) are equivalent.
For the backward view, notice that the eligibility trace is now updated using the gradient with respect to the parameter vector, namely $\nabla_{\mathbf{w}} \hat{v}(S_t, \mathbf{w})$, which has the same dimensionality as $\mathbf{w}$. More precisely, the eligibility trace is a decaying accumulation of past gradients; in the linear case, this is an accumulation of the feature vectors $\mathbf{x}(S_t)$.
It is a bit unintuitive why the accumulated gradient is the right eligibility trace, but I suppose it falls out of the equivalence theorem between the forward and backward views. Perhaps we can just think of it as "the features we see most often build up the highest eligibility".
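A sketch of one backward view linear TD($\lambda$) step, assuming numpy arrays; the eligibility trace `e` has the same shape as `w` and accumulates past feature vectors.

```python
def td_lambda_step(w, e, x, r, x_next, done,
                   alpha=0.01, gamma=0.99, lam=0.9):
    """One backward view linear TD(lambda) step.

    e is the eligibility trace: a decaying sum of past feature vectors
    (past gradients, in the linear case), with the same shape as w.
    """
    e = gamma * lam * e + x                          # accumulate the trace
    target = r if done else r + gamma * (x_next @ w)
    delta = target - x @ w                           # TD error
    w = w + alpha * delta * e                        # credit recently seen features
    return w, e
```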
Control with Value Function Approximation
- Start with some random parameter vector $\mathbf{w}$
- Set the policy by acting greedily (in practice $\epsilon$-greedily) with respect to the current approximate value function
- Do approximate policy evaluation, then improve the policy again, and repeat
First, we need to redo everything with respect to the action-value function instead of the state-value function in order to run this algorithm. The steps are:
- Approximate the action-value function: $\hat{q}(S, A, \mathbf{w}) \approx q_\pi(S, A)$
- Minimize the mean squared error between the approximate action-value function $\hat{q}(S, A, \mathbf{w})$ and the true oracle action-value function $q_\pi(S, A)$: $J(\mathbf{w}) = \mathbb{E}_\pi\big[(q_\pi(S, A) - \hat{q}(S, A, \mathbf{w}))^2\big]$
- Use SGD to find a local minimum: $\Delta\mathbf{w} = \alpha\, (q_\pi(S, A) - \hat{q}(S, A, \mathbf{w}))\, \nabla_{\mathbf{w}} \hat{q}(S, A, \mathbf{w})$
- Again, we represent the state and action by a feature vector: $\mathbf{x}(S, A) = \big(x_1(S, A), \ldots, x_n(S, A)\big)^\top$
- Represent the action-value function by a linear combination of features: $\hat{q}(S, A, \mathbf{w}) = \mathbf{x}(S, A)^\top \mathbf{w} = \sum_{j=1}^{n} x_j(S, A)\, w_j$
- Do an SGD update: $\Delta\mathbf{w} = \alpha\, (q_\pi(S, A) - \hat{q}(S, A, \mathbf{w}))\, \mathbf{x}(S, A)$
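A sketch of the building blocks for control with a linear $\hat{q}$, assuming a feature vector has been computed for each action of the current state; the $\epsilon$-greedy helper here is an illustrative implementation, not code from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def q_hat(w, x_sa):
    """Linear action-value estimate for one (state, action) feature vector."""
    return x_sa @ w

def epsilon_greedy(w, x_sa_per_action, epsilon=0.1):
    """Pick a random action with probability epsilon, otherwise the greedy one.

    x_sa_per_action: list of feature vectors x(S, a), one entry per action.
    """
    if rng.random() < epsilon:
        return int(rng.integers(len(x_sa_per_action)))
    return int(np.argmax([q_hat(w, x) for x in x_sa_per_action]))
```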
Incremental Control Algorithms
Like prediction, we need to substitute a target for the unknown oracle $q_\pi(S, A)$. We swap out $q_\pi(S_t, A_t)$ for an approximate target:
- For MC, the target is the return $G_t$: $\Delta\mathbf{w} = \alpha\, (G_t - \hat{q}(S_t, A_t, \mathbf{w}))\, \nabla_{\mathbf{w}} \hat{q}(S_t, A_t, \mathbf{w})$
- For TD(0), the target is the TD target $R_{t+1} + \gamma \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w})$: $\Delta\mathbf{w} = \alpha\, (R_{t+1} + \gamma \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}) - \hat{q}(S_t, A_t, \mathbf{w}))\, \nabla_{\mathbf{w}} \hat{q}(S_t, A_t, \mathbf{w})$
- For forward view TD($\lambda$), the target is the action-value $\lambda$-return $q_t^\lambda$: $\Delta\mathbf{w} = \alpha\, (q_t^\lambda - \hat{q}(S_t, A_t, \mathbf{w}))\, \nabla_{\mathbf{w}} \hat{q}(S_t, A_t, \mathbf{w})$
- For backward view TD($\lambda$), the equivalent update is (see the sketch after this list): $\delta_t = R_{t+1} + \gamma \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}) - \hat{q}(S_t, A_t, \mathbf{w})$, $\; E_t = \gamma \lambda E_{t-1} + \nabla_{\mathbf{w}} \hat{q}(S_t, A_t, \mathbf{w})$, $\; \Delta\mathbf{w} = \alpha\, \delta_t E_t$
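Putting the pieces together, here is a sketch of one episode of backward view SARSA($\lambda$) with linear features. The `env` interface (`reset()` returning a state, `step(a)` returning `(next_state, reward, done)`), the `features(s, a)` function, and the reuse of the `epsilon_greedy` helper above are all assumptions made for illustration.

```python
import numpy as np

def sarsa_lambda_episode(w, env, features, n_actions,
                         alpha=0.01, gamma=0.99, lam=0.9, epsilon=0.1):
    """One episode of backward view SARSA(lambda) with linear features.

    Assumes a gym-like env, a features(s, a) function returning a numpy
    array, and the epsilon_greedy helper from the sketch above; all of
    these are placeholder assumptions.
    """
    e = np.zeros_like(w)                              # eligibility trace
    s = env.reset()
    a = epsilon_greedy(w, [features(s, b) for b in range(n_actions)], epsilon)
    done = False
    while not done:
        s_next, r, done = env.step(a)
        x = features(s, a)
        if done:
            delta = r - x @ w                         # no bootstrap at the end
        else:
            a_next = epsilon_greedy(
                w, [features(s_next, b) for b in range(n_actions)], epsilon)
            delta = r + gamma * (features(s_next, a_next) @ w) - x @ w
        e = gamma * lam * e + x                       # accumulate the trace
        w = w + alpha * delta * e                     # semi-gradient SARSA(lambda) update
        if not done:
            s, a = s_next, a_next
    return w
```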
Should we bootstrap? Empirically, across many examples, we almost always see that:
- MC takes too many steps because its variance is too high
- TD(0) always gives a large efficiency gain compared to MC
- There is always some $\lambda$ between 0 and 1 which does better than TD(0)