Lecture 6: Value Function Approximation
This lecture looks at approximating value functions with parameterized function approximators, such as linear models and neural networks, to overcome the problem of large state-action spaces.
RL often encounters large problems:
- Backgammon: $10^{20}$ states
- Go: $10^{170}$ states
- Helicopter: continuous state space
We want to do policy evaluation and control efficiently in large state spaces. So far, we have represented $V(s)$ or $Q(s, a)$ with a lookup table:
- Every state $s$ has an entry $V(s)$
- Every state-action pair $(s, a)$ has an entry $Q(s, a)$
This is a problem for large MDPs:
- Too many states or actions to store in memory
- It is too slow or data inefficient to learn the value of each state individually
Solution for large MDPs:
- Estimate the value function with function approximation using parameters $\mathbf{w}$: $\hat{v}(s, \mathbf{w}) \approx v_\pi(s)$ or $\hat{q}(s, a, \mathbf{w}) \approx q_\pi(s, a)$
- Generalizes from seen states to unseen states
- Update the parameters $\mathbf{w}$ of our function using MC or TD learning
Types of value function approximation (different architectures):
- Feed a given state $s$ into a function approximator with parameters $\mathbf{w}$. The network outputs $\hat{v}(s, \mathbf{w})$, our estimated value of being in state $s$
- Have a network $\hat{q}(s, a, \mathbf{w})$, which takes in a state-action pair and outputs the Q-value
- Sometimes it is more efficient to have a network that takes in a single state $s$ and outputs the Q-values for every possible action in a single forward pass, i.e. we get $\hat{q}(s, a_1, \mathbf{w}), \dots, \hat{q}(s, a_m, \mathbf{w})$ (see the sketch below)
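Here is a minimal sketch (assuming PyTorch and a discrete action space) contrasting the last two architectures: an "action-in" network that scores one state-action pair per forward pass, and an "action-out" network that returns Q-values for all actions at once. All layer sizes and class names are illustrative.

```python
import torch
import torch.nn as nn

class QNetActionIn(nn.Module):
    """q_hat(s, a, w): one forward pass per state-action pair."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1)).squeeze(-1)

class QNetActionOut(nn.Module):
    """q_hat(s, ., w): one forward pass gives Q-values for all actions."""
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions))

    def forward(self, state):
        return self.net(state)  # shape: (batch, n_actions)
```

The action-out form is the one used by DQN later in this lecture, since a single forward pass is enough to pick the greedy action.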
Which function approximator should we use? We focus on differentiable function approximators that we can easily optimize, i.e. linear combinations of features and neural networks. Furthermore, we need a training method that can handle non-iid, non-stationary data, so this is not standard supervised learning.
Incremental Methods
Gradient Descent
Starting with gradient descent.
- Let $J(\mathbf{w})$ be a differentiable function of the parameter vector $\mathbf{w}$
- Define the gradient of $J(\mathbf{w})$ to be the vector $\nabla_{\mathbf{w}} J(\mathbf{w})$ whose $i$-th component is $\frac{\partial J(\mathbf{w})}{\partial w_i}$
- To find a local minimum of $J(\mathbf{w})$, we adjust the parameters in the negative gradient direction: $\Delta \mathbf{w} = -\frac{1}{2} \alpha \nabla_{\mathbf{w}} J(\mathbf{w})$
Goal: find the parameter vector $\mathbf{w}$ minimizing the mean-squared error between the approximate value function $\hat{v}(s, \mathbf{w})$ and the true oracle value function $v_\pi(s)$ (assuming for now that we know the oracle):
$$J(\mathbf{w}) = \mathbb{E}_\pi\left[\big(v_\pi(S) - \hat{v}(S, \mathbf{w})\big)^2\right]$$
Gradient descent finds a local minimum:
$$\Delta \mathbf{w} = -\frac{1}{2} \alpha \nabla_{\mathbf{w}} J(\mathbf{w}) = \alpha\, \mathbb{E}_\pi\left[\big(v_\pi(S) - \hat{v}(S, \mathbf{w})\big) \nabla_{\mathbf{w}} \hat{v}(S, \mathbf{w})\right]$$
Stochastic gradient descent samples the gradient:
$$\Delta \mathbf{w} = \alpha \big(v_\pi(S) - \hat{v}(S, \mathbf{w})\big) \nabla_{\mathbf{w}} \hat{v}(S, \mathbf{w})$$
The nice thing about SGD is that it still converges in a non-stationary setting, and the expected update is equal to the full gradient update.
Feature Vectors
To represent a state, we use a feature vector:
$$\mathbf{x}(S) = \big(x_1(S), \dots, x_n(S)\big)^\top$$
For example, the features (numeric) could be:
- Distance of robot to landmarks
- Trends in the stock market
- Configuration of pieces on a chess board
Linear Value Function Approximation
Let us represent the value function using a linear combination of features (i.e. just a dot product between two vectors):
$$\hat{v}(S, \mathbf{w}) = \mathbf{x}(S)^\top \mathbf{w} = \sum_{j=1}^{n} x_j(S)\, w_j$$
The nice thing is that the objective is quadratic in the parameters $\mathbf{w}$, so it is a convex optimization problem, i.e. SGD will converge on the global optimum:
$$J(\mathbf{w}) = \mathbb{E}_\pi\left[\big(v_\pi(S) - \mathbf{x}(S)^\top \mathbf{w}\big)^2\right]$$
The gradient update is really simple:
$$\nabla_{\mathbf{w}} \hat{v}(S, \mathbf{w}) = \mathbf{x}(S), \qquad \Delta \mathbf{w} = \alpha \big(v_\pi(S) - \hat{v}(S, \mathbf{w})\big)\, \mathbf{x}(S)$$
Note that we are just substituting the simple expression for $\nabla_{\mathbf{w}} \hat{v}(S, \mathbf{w})$ into the general formula above. The update may be interpreted as: update = step-size × prediction error × feature value. Intuitively, features that are strongly correlated with the prediction error receive large gradient updates.
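A minimal sketch of this linear SGD step in NumPy; the oracle value `v_true` is assumed to be given for now (we will replace it with a target shortly).

```python
import numpy as np

def linear_v(x, w):
    """v_hat(S, w) = x(S) . w for a linear approximator."""
    return x @ w

def sgd_step(w, x, v_true, alpha=0.1):
    """One update: w <- w + alpha * (v_true - v_hat) * x(S)."""
    error = v_true - linear_v(x, w)   # prediction error
    return w + alpha * error * x      # step-size * error * feature value

# Toy usage: 4 features, one sample with a made-up oracle value.
w = np.zeros(4)
x = np.array([1.0, 0.5, 0.0, 2.0])
w = sgd_step(w, x, v_true=3.0)
```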
We can think of table lookup as a special case of linear value function approximation. Suppose we use one-hot table-lookup features:
$$\mathbf{x}^{\text{table}}(S) = \big(\mathbf{1}(S = s_1), \dots, \mathbf{1}(S = s_n)\big)^\top$$
And suppose we have a parameter vector $\mathbf{w}$ of size $n$, such that we have one parameter for each state. Then we have:
$$\hat{v}(S, \mathbf{w}) = \big(\mathbf{1}(S = s_1), \dots, \mathbf{1}(S = s_n)\big) \cdot (w_1, \dots, w_n)^\top$$
And we can see that this reduces to a table lookup where the parameter $w_k$ represents the state value of each state $s_k$.
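A tiny illustration (for a hypothetical 5-state MDP) that one-hot features make the linear SGD update behave exactly like a table update:

```python
import numpy as np

n_states = 5
w = np.zeros(n_states)              # one parameter per state

def one_hot(s, n=n_states):
    x = np.zeros(n)
    x[s] = 1.0
    return x

s, target, alpha = 2, 4.0, 0.1
x = one_hot(s)
w += alpha * (target - x @ w) * x   # linear SGD step
print(w)                            # only w[s] changed, exactly like a table entry
```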
Estimating the Oracle
So far, we have assumed that the true oracle value function $v_\pi(s)$ is available, but in RL there is no true label, only rewards. So in practice, we need to substitute a target for $v_\pi(s)$:
- For MC, the target is the return $G_t$:
$$\Delta \mathbf{w} = \alpha \big(G_t - \hat{v}(S_t, \mathbf{w})\big) \nabla_{\mathbf{w}} \hat{v}(S_t, \mathbf{w})$$
- For TD(0), the target is the TD target $R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w})$:
$$\Delta \mathbf{w} = \alpha \big(R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}) - \hat{v}(S_t, \mathbf{w})\big) \nabla_{\mathbf{w}} \hat{v}(S_t, \mathbf{w})$$
- For TD($\lambda$), the target is the $\lambda$-return $G_t^\lambda$:
$$\Delta \mathbf{w} = \alpha \big(G_t^\lambda - \hat{v}(S_t, \mathbf{w})\big) \nabla_{\mathbf{w}} \hat{v}(S_t, \mathbf{w})$$
Monte Carlo with Value Function Approximation
We can think of our algorithm as supervised learning.
- Treat the return $G_t$ as an unbiased, noisy sample of the true value $v_\pi(S_t)$
- We are therefore applying supervised learning to the "training data": $\langle S_1, G_1 \rangle, \langle S_2, G_2 \rangle, \dots, \langle S_T, G_T \rangle$
- For example, using linear Monte-Carlo policy evaluation (see the sketch after this list): $\Delta \mathbf{w} = \alpha \big(G_t - \hat{v}(S_t, \mathbf{w})\big)\, \mathbf{x}(S_t)$
- MC evaluation converges to a local optimum, even when using non-linear value function approximation
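Below is a minimal sketch of linear MC policy evaluation on a hypothetical random-walk chain (states 0-4, episodes terminate off either end, reward 1 only on the right). The environment, features, and constants are all illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, alpha, gamma = 5, 0.05, 1.0
w = np.zeros(n_states)

def features(s):                       # one-hot features for simplicity
    x = np.zeros(n_states); x[s] = 1.0
    return x

def run_episode(start=2):
    """Uniform random walk; reward 1 for stepping off the right end."""
    s, traj = start, []
    while 0 <= s < n_states:
        s_next = s + rng.choice([-1, 1])
        r = 1.0 if s_next == n_states else 0.0
        traj.append((s, r))
        s = s_next
    return traj

for _ in range(2000):
    G = 0.0
    for s, r in reversed(run_episode()):                  # returns computed backwards
        G = r + gamma * G
        w += alpha * (G - features(s) @ w) * features(s)  # update towards G_t
print(w)                                                  # approximates v_pi
```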
TD with Value Function Approximation
The same applies to TD learning, but now we have a biased estimate:
- The TD target $R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w})$ is a biased sample of the true value $v_\pi(S_t)$ - it's biased because it uses our own (imperfect) value function estimate
- We can still apply supervised learning to the "training data": $\langle S_1, R_2 + \gamma \hat{v}(S_2, \mathbf{w}) \rangle, \langle S_2, R_3 + \gamma \hat{v}(S_3, \mathbf{w}) \rangle, \dots, \langle S_{T-1}, R_T \rangle$
- For example, using linear TD(0): $\Delta \mathbf{w} = \alpha \big(R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}) - \hat{v}(S_t, \mathbf{w})\big)\, \mathbf{x}(S_t)$
- There is a theorem showing that for linear TD(0), despite the bias, we always converge (close) to the global optimum
Note: There is a slight inconsistency in the above formula once we start introducing bootstrapped approximations of the return. Recall that when we used the oracle $v_\pi(S_t)$ as the target and took the derivative, only $\hat{v}(S_t, \mathbf{w})$ entered the derivative, because we treated the oracle value as a constant.
However, once we substitute $\hat{v}(S_{t+1}, \mathbf{w})$ itself for the oracle, we should technically include that term in the derivative as well. As it turns out, this is not a good idea and does not lead to convergence; there is theoretical analysis justifying the "semi-gradient" form above, in which the target is treated as a constant (see the sketch below).
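A minimal sketch of linear semi-gradient TD(0) on the same hypothetical random-walk chain as before; note that no gradient flows through the bootstrapped term $\hat{v}(S_{t+1}, \mathbf{w})$.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, alpha, gamma = 5, 0.05, 1.0
w = np.zeros(n_states)

def features(s):
    x = np.zeros(n_states); x[s] = 1.0
    return x

for _ in range(2000):
    s = 2
    while 0 <= s < n_states:
        s_next = s + rng.choice([-1, 1])
        r = 1.0 if s_next == n_states else 0.0
        # Bootstrapped value is 0 beyond the terminal boundary.
        v_next = features(s_next) @ w if 0 <= s_next < n_states else 0.0
        td_target = r + gamma * v_next           # treated as a constant (semi-gradient)
        w += alpha * (td_target - features(s) @ w) * features(s)
        s = s_next
print(w)
```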
TD() with Value Function Approximation
And again, we can do the same with TD($\lambda$), since the $\lambda$-return $G_t^\lambda$ is also a biased sample of the true value $v_\pi(S_t)$:
- The training data is now: $\langle S_1, G_1^\lambda \rangle, \langle S_2, G_2^\lambda \rangle, \dots, \langle S_{T-1}, G_{T-1}^\lambda \rangle$
- The forward-view linear TD($\lambda$) update is: $\Delta \mathbf{w} = \alpha \big(G_t^\lambda - \hat{v}(S_t, \mathbf{w})\big)\, \mathbf{x}(S_t)$
- The backward-view linear TD($\lambda$) update is:
$$\delta_t = R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}) - \hat{v}(S_t, \mathbf{w}), \qquad E_t = \gamma \lambda E_{t-1} + \mathbf{x}(S_t), \qquad \Delta \mathbf{w} = \alpha\, \delta_t\, E_t$$
- There is a theorem showing that the forward-view and backward-view linear TD($\lambda$) are equivalent.
For the backward view, notice that the eligibility trace is now updated using the gradient with respect to the parameter vector, namely $\nabla_{\mathbf{w}} \hat{v}(S_t, \mathbf{w})$, which has the same dimensionality as $\mathbf{w}$. More precisely, the eligibility trace is a decaying accumulation of past gradients; in the linear case, this is an accumulation of the feature vectors $\mathbf{x}(S_t)$.
It is a bit unintuitive why we use the accumulated gradient as the eligibility trace, but it is justified by the equivalence theorem between the forward and backward views. A useful intuition is that the features we see most often build up the highest eligibility, as in the sketch below.
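A minimal sketch of backward-view linear TD($\lambda$) with an accumulating eligibility trace, again on the hypothetical random-walk chain:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, alpha, gamma, lam = 5, 0.05, 1.0, 0.8
w = np.zeros(n_states)

def features(s):
    x = np.zeros(n_states); x[s] = 1.0
    return x

for _ in range(2000):
    s, trace = 2, np.zeros(n_states)                   # reset the trace each episode
    while 0 <= s < n_states:
        s_next = s + rng.choice([-1, 1])
        r = 1.0 if s_next == n_states else 0.0
        v_next = features(s_next) @ w if 0 <= s_next < n_states else 0.0
        delta = r + gamma * v_next - features(s) @ w   # TD error
        trace = gamma * lam * trace + features(s)      # decaying sum of past gradients
        w += alpha * delta * trace                     # credit recently-visited features
        s = s_next
print(w)
```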
Control with Value Function Approximation
- Start with some random parameter vector $\mathbf{w}$
- Act according to a policy that is $\epsilon$-greedy with respect to $\hat{q}(\cdot, \cdot, \mathbf{w})$
- Do approximate policy evaluation of that policy
First, we need to redo everything with respect to the action-value function instead of the state-value function in order to run this algorithm. The steps are:
- Approximate the action-value function: $\hat{q}(S, A, \mathbf{w}) \approx q_\pi(S, A)$
- Minimize the mean-squared error between the approximate action-value function $\hat{q}(S, A, \mathbf{w})$ and the true oracle action-value function $q_\pi(S, A)$:
$$J(\mathbf{w}) = \mathbb{E}_\pi\left[\big(q_\pi(S, A) - \hat{q}(S, A, \mathbf{w})\big)^2\right]$$
- Use SGD to find a local minimum:
$$\Delta \mathbf{w} = \alpha \big(q_\pi(S, A) - \hat{q}(S, A, \mathbf{w})\big) \nabla_{\mathbf{w}} \hat{q}(S, A, \mathbf{w})$$
- Again, we represent the state and action by a feature vector: $\mathbf{x}(S, A) = \big(x_1(S, A), \dots, x_n(S, A)\big)^\top$
- Represent the action-value function by a linear combination of features: $\hat{q}(S, A, \mathbf{w}) = \mathbf{x}(S, A)^\top \mathbf{w}$
- Do an SGD update: $\Delta \mathbf{w} = \alpha \big(q_\pi(S, A) - \hat{q}(S, A, \mathbf{w})\big)\, \mathbf{x}(S, A)$
Incremental Control Algorithms
Like prediction, we need to substitute a target for the unknown oracle $q_\pi(S, A)$. We swap out $q_\pi(S_t, A_t)$ for an approximate target (a runnable sketch follows this list):
- For MC, the target is the return $G_t$: $\Delta \mathbf{w} = \alpha \big(G_t - \hat{q}(S_t, A_t, \mathbf{w})\big) \nabla_{\mathbf{w}} \hat{q}(S_t, A_t, \mathbf{w})$
- For TD(0), the target is the TD target $R_{t+1} + \gamma \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w})$: $\Delta \mathbf{w} = \alpha \big(R_{t+1} + \gamma \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}) - \hat{q}(S_t, A_t, \mathbf{w})\big) \nabla_{\mathbf{w}} \hat{q}(S_t, A_t, \mathbf{w})$
- For forward-view TD($\lambda$), the target is the action-value $\lambda$-return $q_t^\lambda$: $\Delta \mathbf{w} = \alpha \big(q_t^\lambda - \hat{q}(S_t, A_t, \mathbf{w})\big) \nabla_{\mathbf{w}} \hat{q}(S_t, A_t, \mathbf{w})$
- For backward-view TD($\lambda$), the equivalent update is:
$$\delta_t = R_{t+1} + \gamma \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}) - \hat{q}(S_t, A_t, \mathbf{w}), \qquad E_t = \gamma \lambda E_{t-1} + \nabla_{\mathbf{w}} \hat{q}(S_t, A_t, \mathbf{w}), \qquad \Delta \mathbf{w} = \alpha\, \delta_t\, E_t$$
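A minimal sketch of control with semi-gradient SARSA(0) and a linear $\hat{q}$, on a hypothetical 1-D corridor where action 1 moves right, action 0 moves left, and reaching the right end gives reward 1. The environment, features, and constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 6, 2
alpha, gamma, eps = 0.1, 0.95, 0.1
w = np.zeros(n_states * n_actions)

def features(s, a):                     # one-hot over (state, action) pairs
    x = np.zeros(n_states * n_actions)
    x[s * n_actions + a] = 1.0
    return x

def q(s, a):
    return features(s, a) @ w

def eps_greedy(s):
    if rng.random() < eps:
        return int(rng.integers(n_actions))
    return int(np.argmax([q(s, a) for a in range(n_actions)]))

def step(s, a):                         # corridor dynamics: 1 = right, 0 = left
    s_next = min(max(s + (1 if a == 1 else -1), 0), n_states - 1)
    done = s_next == n_states - 1
    return s_next, (1.0 if done else 0.0), done

for _ in range(500):
    s, a, done = 0, eps_greedy(0), False
    while not done:
        s_next, r, done = step(s, a)
        a_next = eps_greedy(s_next)
        target = r if done else r + gamma * q(s_next, a_next)   # SARSA TD target
        w += alpha * (target - q(s, a)) * features(s, a)
        s, a = s_next, a_next

print(q(0, 1), q(0, 0))   # moving right from the start should score higher
```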
Should we bootstrap? Empirically, across many examples, we almost always observe that:
- MC takes too many steps because its variance is too high
- TD(0) gives a large efficiency gain compared to MC
- There is usually some value of $\lambda$ in between that is better still than TD(0)
Batch Methods
Motivation:
- Gradient descent is simple and appealing
- But it is not sample efficient (we throw a sample away as soon as we use it once)
- Batch methods seek to find the best fitting value function, given the agent's experience ("training data")
Least Squares Prediction
The problem becomes the following:
- Given our value function approximation $\hat{v}(s, \mathbf{w}) \approx v_\pi(s)$
- And experience $\mathcal{D}$ consisting of $\langle$state, value$\rangle$ pairs: $\mathcal{D} = \{\langle s_1, v_1^\pi \rangle, \langle s_2, v_2^\pi \rangle, \dots, \langle s_T, v_T^\pi \rangle\}$
- Find the parameters $\mathbf{w}$ that give the best fitting value function
Least-squares algorithms simply try to find the $\mathbf{w}$ that minimizes the sum-of-squares error between $\hat{v}(s_t, \mathbf{w})$ and the target values $v_t^\pi$:
$$LS(\mathbf{w}) = \sum_{t=1}^{T} \big(v_t^\pi - \hat{v}(s_t, \mathbf{w})\big)^2 = \mathbb{E}_{\mathcal{D}}\left[\big(v^\pi - \hat{v}(s, \mathbf{w})\big)^2\right]$$
SGD with Experience Replay
It turns out there is a really easy way to find the least squares solution, using experience replay. The idea is to just keep using the data over and over again, instead of throwing away every sample after each update.
Given experience consisting of $\langle$state, value$\rangle$ pairs $\mathcal{D} = \{\langle s_1, v_1^\pi \rangle, \dots, \langle s_T, v_T^\pi \rangle\}$, repeat:
- Sample a state-value pair from the experience: $\langle s, v^\pi \rangle \sim \mathcal{D}$
- Apply an SGD update: $\Delta \mathbf{w} = \alpha \big(v^\pi - \hat{v}(s, \mathbf{w})\big) \nabla_{\mathbf{w}} \hat{v}(s, \mathbf{w})$
It can be shown that this converges to the least-squares solution: $\mathbf{w}^\pi = \arg\min_{\mathbf{w}} LS(\mathbf{w})$
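A minimal sketch of SGD with experience replay for least-squares prediction, assuming we already have a dataset of (feature vector, target value) pairs; the synthetic data is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def replay_sgd(X, v_targets, alpha=0.01, n_updates=20_000):
    """X: (T, n) feature matrix, v_targets: (T,) target values."""
    T, n = X.shape
    w = np.zeros(n)
    for _ in range(n_updates):
        t = rng.integers(T)                  # sample <state, value> from experience
        x, v = X[t], v_targets[t]
        w += alpha * (v - x @ w) * x         # the same SGD update, on reused data
    return w

# Toy usage: targets generated from a hypothetical true weight vector.
X = rng.normal(size=(200, 4))
true_w = np.array([1.0, -2.0, 0.5, 3.0])
v_targets = X @ true_w + 0.1 * rng.normal(size=200)
print(replay_sgd(X, v_targets))              # approaches the least-squares solution
```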
Experience Replay in Deep Q-Networks (DQN)
DQN (for Atari games) uses experience replay and fixed Q-targets:
- Take actions according to an $\epsilon$-greedy policy
- Store transitions $(s_t, a_t, r_{t+1}, s_{t+1})$ in replay memory $\mathcal{D}$
- Sample a random mini-batch of transitions $(s, a, r, s')$ from $\mathcal{D}$
  - A small batch size (e.g. 64) is sufficient
- Maintain two neural networks that estimate Q-values:
- The old reference neural network is frozen periodically and used to compute the targets
- Call its parameters $w^-$
- The actual neural network we are training has parameters $w$
- Compute Q-learning targets with respect to the old, fixed parameters $w^-$
- Optimize the MSE between the Q-network and the Q-learning targets:
$$\mathcal{L}(w) = \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}}\left[\left(r + \gamma \max_{a'} Q(s', a'; w^-) - Q(s, a; w)\right)^2\right]$$
- This is essentially Q-learning with a one-step lookahead, but using the reference network instead of the network currently being trained
- The success of this method depends on the stability of training:
- Experience replay helps to stabilize training as it randomly samples from past experience instead of getting batches of highly correlated data
- Fixed Q-targets - fixing the reference neural network helps to stabilize the targets and thus training
- The neural network is just a large convolutional neural network
- Input state is a stack of raw pixels from last 4 frames
- The output is $Q(s, a)$ for 18 joystick/button positions
- The reward is the change in score for that step
- Applied to a large number of Atari games
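Below is a minimal sketch (assuming PyTorch) of the core DQN update: a replay memory, a frozen target network with parameters $w^-$, and mini-batch Q-learning targets. The small MLP stands in for the convolutional network and the random transitions stand in for an Atari emulator.

```python
import random
from collections import deque

import torch
import torch.nn as nn

state_dim, n_actions, gamma, batch_size = 8, 4, 0.99, 64

def make_net():
    return nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                         nn.Linear(64, n_actions))

q_net = make_net()                            # parameters w (being trained)
target_net = make_net()                       # parameters w^- (frozen reference)
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay = deque(maxlen=10_000)

# Fill the replay memory with dummy transitions (s, a, r, s', done).
for _ in range(1_000):
    replay.append((torch.randn(state_dim), random.randrange(n_actions),
                   random.random(), torch.randn(state_dim), False))

for step in range(500):
    batch = random.sample(list(replay), batch_size)        # random mini-batch
    s, a, r, s2, done = zip(*batch)
    s, s2 = torch.stack(s), torch.stack(s2)
    a = torch.tensor(a)
    r = torch.tensor(r, dtype=torch.float32)
    done = torch.tensor(done, dtype=torch.float32)

    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)    # Q(s, a; w)
    with torch.no_grad():                                   # targets use frozen w^-
        target = r + gamma * (1 - done) * target_net(s2).max(dim=1).values

    loss = nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if step % 100 == 0:                                     # periodically refresh w^-
        target_net.load_state_dict(q_net.state_dict())
```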
Linear Least Squares Prediction
Experience replay finds the least squares solution, but it takes many iterations. If we use a linear value function approximation, we can solve the least squares solution directly.
At the minimum of $LS(\mathbf{w})$, the expected update must be zero:
$$\mathbb{E}_{\mathcal{D}}[\Delta \mathbf{w}] = 0 \quad \Longrightarrow \quad \alpha \sum_{t=1}^{T} \mathbf{x}(s_t) \big(v_t^\pi - \mathbf{x}(s_t)^\top \mathbf{w}\big) = 0$$
Solving for $\mathbf{w}$:
$$\mathbf{w} = \left(\sum_{t=1}^{T} \mathbf{x}(s_t)\, \mathbf{x}(s_t)^\top\right)^{-1} \sum_{t=1}^{T} \mathbf{x}(s_t)\, v_t^\pi$$
- Note that the matrix inverse is performed on an $n \times n$ matrix, where $n$ is the size of the feature/parameter vector. Hence, if the number of parameters is small, the $O(n^3)$ cost of the direct solution is acceptable
- Using the Sherman-Morrison formula to maintain the inverse incrementally, the cost per update is reduced to $O(n^2)$
Linear least squares prediction algorithms actually have better convergence properties.
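A minimal sketch of this direct linear least-squares solution, given feature vectors and target values; the small regularization term is my own practical safeguard, not part of the derivation above.

```python
import numpy as np

def linear_least_squares(X, v_targets, reg=1e-6):
    """X: (T, n) feature matrix, v_targets: (T,) targets.
    Solves (sum_t x_t x_t^T) w = sum_t x_t v_t directly."""
    n = X.shape[1]
    A = X.T @ X + reg * np.eye(n)    # n x n matrix: O(n^3) to solve once
    b = X.T @ v_targets
    return np.linalg.solve(A, b)

# Toy usage with the same synthetic data idea as the replay example.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
true_w = np.array([1.0, -2.0, 0.5, 3.0])
v_targets = X @ true_w + 0.1 * rng.normal(size=200)
print(linear_least_squares(X, v_targets))    # matches the SGD-with-replay solution
```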
Least Squares Policy Iteration
- Policy evaluation is done using least squares Q-learning (linear or otherwise)
- Policy improvement is done using greedy policy improvement, as usual