Lecture 6: Value Function Approximation
This lecture looks at approximating value functions with parameterized function approximators, such as linear models and neural networks, to overcome the problem of large state-action spaces.
RL often encounters large problems:
- Backgammon: $10^{20}$ states
- Go: $10^{170}$ states
- Helicopter: continuous state space
We want to do policy evaluation and control efficiently in large state spaces. So far, we have represented $V(s)$ or $Q(s, a)$ with a lookup table:
- Every state $s$ has an entry $V(s)$
- Every state-action pair $(s, a)$ has an entry $Q(s, a)$
This is a problem for large MDPs:
- Too many states or actions to store in memory
- It is too slow or data inefficient to learn the value of each state individually
Solution for large MDPs:
- Estimate the value function with function approximation using parameters $\mathbf{w}$: $\hat{v}(s, \mathbf{w}) \approx v_\pi(s)$ or $\hat{q}(s, a, \mathbf{w}) \approx q_\pi(s, a)$
- Generalizes from seen states to unseen states
- Update the parameters $\mathbf{w}$ of our function using MC or TD learning
Types of value function approximation (different architectures):
- Feed a given state $s$ into a function approximator with parameters $\mathbf{w}$. The network outputs $\hat{v}(s, \mathbf{w})$, our estimated value of being in state $s$
- Have a network $\hat{q}(s, a, \mathbf{w})$, which takes in a state-action pair and outputs the Q-value
- Sometimes it is more efficient to have a network that takes in a single state $s$ and outputs the Q-values for every possible action in a single forward pass, i.e. we get $\hat{q}(s, a_1, \mathbf{w}), \dots, \hat{q}(s, a_m, \mathbf{w})$ (see the sketch below)
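Here is a minimal sketch (assuming PyTorch and a discrete action space) contrasting the last two architectures: an "action-in" network that scores one state-action pair per forward pass, and an "action-out" network that returns Q-values for all actions at once. All layer sizes and class names are illustrative.

```python
import torch
import torch.nn as nn

class QNetActionIn(nn.Module):
    """q_hat(s, a, w): one forward pass per state-action pair."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1)).squeeze(-1)

class QNetActionOut(nn.Module):
    """q_hat(s, ., w): one forward pass gives Q-values for all actions."""
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions))

    def forward(self, state):
        return self.net(state)  # shape: (batch, n_actions)
```

The action-out form is the one used by DQN later in this lecture, since a single forward pass is enough to pick the greedy action.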
Which function approximator should we use? We focus on differentiable function approximators that we can easily optimize, i.e. linear combinations of features and neural networks. Furthermore, we need a training method that can handle non-iid, non-stationary data, so this is not standard supervised learning.
Incremental Methods
Gradient Descent
Starting with gradient descent.
- Let $J(\mathbf{w})$ be a differentiable function of the parameter vector $\mathbf{w}$
- Define the gradient of $J(\mathbf{w})$ to be the vector $\nabla_{\mathbf{w}} J(\mathbf{w})$ whose $i$-th component is $\frac{\partial J(\mathbf{w})}{\partial w_i}$
- To find a local minimum of $J(\mathbf{w})$, we adjust the parameters in the negative gradient direction: $\Delta \mathbf{w} = -\frac{1}{2} \alpha \nabla_{\mathbf{w}} J(\mathbf{w})$
Goal: find the parameter vector $\mathbf{w}$ minimizing the mean-squared error between the approximate value function $\hat{v}(s, \mathbf{w})$ and the true oracle value function $v_\pi(s)$ (assuming for now that we know the oracle):
$$J(\mathbf{w}) = \mathbb{E}_\pi\left[\big(v_\pi(S) - \hat{v}(S, \mathbf{w})\big)^2\right]$$
Gradient descent finds a local minimum:
$$\Delta \mathbf{w} = -\frac{1}{2} \alpha \nabla_{\mathbf{w}} J(\mathbf{w}) = \alpha\, \mathbb{E}_\pi\left[\big(v_\pi(S) - \hat{v}(S, \mathbf{w})\big) \nabla_{\mathbf{w}} \hat{v}(S, \mathbf{w})\right]$$
Stochastic gradient descent samples the gradient:
$$\Delta \mathbf{w} = \alpha \big(v_\pi(S) - \hat{v}(S, \mathbf{w})\big) \nabla_{\mathbf{w}} \hat{v}(S, \mathbf{w})$$
The nice thing about SGD is that it still converges in a non-stationary setting, and the expected update is equal to the full gradient update.
Feature Vectors
To represent a state, we use a feature vector:
$$\mathbf{x}(S) = \big(x_1(S), \dots, x_n(S)\big)^\top$$
For example, the features (numeric) could be:
- Distance of robot to landmarks
- Trends in the stock market
- Configuration of pieces on a chess board
Linear Value Function Approximation
Let us represent the value function using a linear combination of features (i.e. just a dot product between two vectors):
$$\hat{v}(S, \mathbf{w}) = \mathbf{x}(S)^\top \mathbf{w} = \sum_{j=1}^{n} x_j(S)\, w_j$$
The nice thing is that the objective is quadratic in the parameters $\mathbf{w}$, so it is a convex optimization problem, i.e. SGD will converge on the global optimum:
$$J(\mathbf{w}) = \mathbb{E}_\pi\left[\big(v_\pi(S) - \mathbf{x}(S)^\top \mathbf{w}\big)^2\right]$$
The gradient update is really simple:
$$\nabla_{\mathbf{w}} \hat{v}(S, \mathbf{w}) = \mathbf{x}(S), \qquad \Delta \mathbf{w} = \alpha \big(v_\pi(S) - \hat{v}(S, \mathbf{w})\big)\, \mathbf{x}(S)$$
Note that we are just substituting the simple expression for $\nabla_{\mathbf{w}} \hat{v}(S, \mathbf{w})$ into the general formula above. The update may be interpreted as: update = step-size × prediction error × feature value. Intuitively, features that are strongly correlated with the prediction error receive large gradient updates.
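A minimal sketch of this linear SGD step in NumPy; the oracle value `v_true` is assumed to be given for now (we will replace it with a target shortly).

```python
import numpy as np

def linear_v(x, w):
    """v_hat(S, w) = x(S) . w for a linear approximator."""
    return x @ w

def sgd_step(w, x, v_true, alpha=0.1):
    """One update: w <- w + alpha * (v_true - v_hat) * x(S)."""
    error = v_true - linear_v(x, w)   # prediction error
    return w + alpha * error * x      # step-size * error * feature value

# Toy usage: 4 features, one sample with a made-up oracle value.
w = np.zeros(4)
x = np.array([1.0, 0.5, 0.0, 2.0])
w = sgd_step(w, x, v_true=3.0)
```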
We can think of table lookup as a special case of linear value function approximation. Suppose we use one-hot table-lookup features:
$$\mathbf{x}^{\text{table}}(S) = \big(\mathbf{1}(S = s_1), \dots, \mathbf{1}(S = s_n)\big)^\top$$
And suppose we have a parameter vector $\mathbf{w}$ of size $n$, such that we have one parameter for each state. Then we have:
$$\hat{v}(S, \mathbf{w}) = \big(\mathbf{1}(S = s_1), \dots, \mathbf{1}(S = s_n)\big) \cdot (w_1, \dots, w_n)^\top$$
And we can see that this reduces to a table lookup where the parameter $w_k$ represents the state value of each state $s_k$.
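A tiny illustration (for a hypothetical 5-state MDP) that one-hot features make the linear SGD update behave exactly like a table update:

```python
import numpy as np

n_states = 5
w = np.zeros(n_states)              # one parameter per state

def one_hot(s, n=n_states):
    x = np.zeros(n)
    x[s] = 1.0
    return x

s, target, alpha = 2, 4.0, 0.1
x = one_hot(s)
w += alpha * (target - x @ w) * x   # linear SGD step
print(w)                            # only w[s] changed, exactly like a table entry
```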
Estimating the Oracle
So far, we have assumed that the true oracle value function $v_\pi(s)$ is available, but in RL there is no true label, only rewards. So in practice, we need to substitute a target for $v_\pi(s)$:
- For MC, the target is the return $G_t$:
$$\Delta \mathbf{w} = \alpha \big(G_t - \hat{v}(S_t, \mathbf{w})\big) \nabla_{\mathbf{w}} \hat{v}(S_t, \mathbf{w})$$
- For TD(0), the target is the TD target $R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w})$:
$$\Delta \mathbf{w} = \alpha \big(R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}) - \hat{v}(S_t, \mathbf{w})\big) \nabla_{\mathbf{w}} \hat{v}(S_t, \mathbf{w})$$
- For TD($\lambda$), the target is the $\lambda$-return $G_t^\lambda$:
$$\Delta \mathbf{w} = \alpha \big(G_t^\lambda - \hat{v}(S_t, \mathbf{w})\big) \nabla_{\mathbf{w}} \hat{v}(S_t, \mathbf{w})$$
Monte Carlo with Value Function Approximation
We can think of our algorithm as supervised learning.
- Treat the return $G_t$ as an unbiased, noisy sample of the true value $v_\pi(S_t)$
- We are therefore applying supervised learning to the "training data": $\langle S_1, G_1 \rangle, \langle S_2, G_2 \rangle, \dots, \langle S_T, G_T \rangle$
- For example, using linear Monte-Carlo policy evaluation (see the sketch after this list): $\Delta \mathbf{w} = \alpha \big(G_t - \hat{v}(S_t, \mathbf{w})\big)\, \mathbf{x}(S_t)$
- MC evaluation converges to a local optimum, even when using non-linear value function approximation
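Below is a minimal sketch of linear MC policy evaluation on a hypothetical random-walk chain (states 0-4, episodes terminate off either end, reward 1 only on the right). The environment, features, and constants are all illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, alpha, gamma = 5, 0.05, 1.0
w = np.zeros(n_states)

def features(s):                       # one-hot features for simplicity
    x = np.zeros(n_states); x[s] = 1.0
    return x

def run_episode(start=2):
    """Uniform random walk; reward 1 for stepping off the right end."""
    s, traj = start, []
    while 0 <= s < n_states:
        s_next = s + rng.choice([-1, 1])
        r = 1.0 if s_next == n_states else 0.0
        traj.append((s, r))
        s = s_next
    return traj

for _ in range(2000):
    G = 0.0
    for s, r in reversed(run_episode()):                  # returns computed backwards
        G = r + gamma * G
        w += alpha * (G - features(s) @ w) * features(s)  # update towards G_t
print(w)                                                  # approximates v_pi
```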
TD with Value Function Approximation
The same applies to TD learning, but now we have a biased estimate:
- The TD target $R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w})$ is a biased sample of the true value $v_\pi(S_t)$ - it's biased because it uses our own (imperfect) value function estimate
- We can still apply supervised learning to the "training data": $\langle S_1, R_2 + \gamma \hat{v}(S_2, \mathbf{w}) \rangle, \langle S_2, R_3 + \gamma \hat{v}(S_3, \mathbf{w}) \rangle, \dots, \langle S_{T-1}, R_T \rangle$
- For example, using linear TD(0): $\Delta \mathbf{w} = \alpha \big(R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}) - \hat{v}(S_t, \mathbf{w})\big)\, \mathbf{x}(S_t)$
- There is a theorem showing that for linear TD(0), despite the bias, we always converge (close) to the global optimum
Note: There is a slight inconsistency in the above formula once we start introducing bootstrapped approximations of the return. Recall that when we used the oracle $v_\pi(S_t)$ as the target and took the derivative, only $\hat{v}(S_t, \mathbf{w})$ entered the derivative, because we treated the oracle value as a constant.
However, once we substitute $\hat{v}(S_{t+1}, \mathbf{w})$ itself for the oracle, we should technically include that term in the derivative as well. As it turns out, this is not a good idea and does not lead to convergence; there is theoretical analysis justifying the "semi-gradient" form above, in which the target is treated as a constant (see the sketch below).
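A minimal sketch of linear semi-gradient TD(0) on the same hypothetical random-walk chain as before; note that no gradient flows through the bootstrapped term $\hat{v}(S_{t+1}, \mathbf{w})$.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, alpha, gamma = 5, 0.05, 1.0
w = np.zeros(n_states)

def features(s):
    x = np.zeros(n_states); x[s] = 1.0
    return x

for _ in range(2000):
    s = 2
    while 0 <= s < n_states:
        s_next = s + rng.choice([-1, 1])
        r = 1.0 if s_next == n_states else 0.0
        # Bootstrapped value is 0 beyond the terminal boundary.
        v_next = features(s_next) @ w if 0 <= s_next < n_states else 0.0
        td_target = r + gamma * v_next           # treated as a constant (semi-gradient)
        w += alpha * (td_target - features(s) @ w) * features(s)
        s = s_next
print(w)
```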
TD() with Value Function Approximation
And again, we can do the same with TD($\lambda$), since the $\lambda$-return $G_t^\lambda$ is also a biased sample of the true value $v_\pi(S_t)$:
- The training data is now: $\langle S_1, G_1^\lambda \rangle, \langle S_2, G_2^\lambda \rangle, \dots, \langle S_{T-1}, G_{T-1}^\lambda \rangle$
- The forward-view linear TD($\lambda$) update is: $\Delta \mathbf{w} = \alpha \big(G_t^\lambda - \hat{v}(S_t, \mathbf{w})\big)\, \mathbf{x}(S_t)$
- The backward-view linear TD($\lambda$) update is:
$$\delta_t = R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}) - \hat{v}(S_t, \mathbf{w}), \qquad E_t = \gamma \lambda E_{t-1} + \mathbf{x}(S_t), \qquad \Delta \mathbf{w} = \alpha\, \delta_t\, E_t$$
- There is a theorem showing that the forward-view and backward-view linear TD($\lambda$) are equivalent.
For the backward view, notice that the eligibility trace is now updated using the gradient with respect to the parameter vector, namely $\nabla_{\mathbf{w}} \hat{v}(S_t, \mathbf{w})$, which has the same dimensionality as $\mathbf{w}$. More precisely, the eligibility trace is a decaying accumulation of past gradients; in the linear case, this is an accumulation of the feature vectors $\mathbf{x}(S_t)$.
It is a bit unintuitive why we use the accumulated gradient as the eligibility trace, but it is justified by the equivalence theorem between the forward and backward views. A useful intuition is that the features we see most often build up the highest eligibility, as in the sketch below.
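A minimal sketch of backward-view linear TD($\lambda$) with an accumulating eligibility trace, again on the hypothetical random-walk chain:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, alpha, gamma, lam = 5, 0.05, 1.0, 0.8
w = np.zeros(n_states)

def features(s):
    x = np.zeros(n_states); x[s] = 1.0
    return x

for _ in range(2000):
    s, trace = 2, np.zeros(n_states)                   # reset the trace each episode
    while 0 <= s < n_states:
        s_next = s + rng.choice([-1, 1])
        r = 1.0 if s_next == n_states else 0.0
        v_next = features(s_next) @ w if 0 <= s_next < n_states else 0.0
        delta = r + gamma * v_next - features(s) @ w   # TD error
        trace = gamma * lam * trace + features(s)      # decaying sum of past gradients
        w += alpha * delta * trace                     # credit recently-visited features
        s = s_next
print(w)
```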
Control with Value Function Approximation
- Start with some random parameter vector $\mathbf{w}$
- Act according to a policy that is $\epsilon$-greedy with respect to $\hat{q}(\cdot, \cdot, \mathbf{w})$
- Do approximate policy evaluation of that policy
First, we need to redo everything with respect to the action-value function instead of the state-value function in order to run this algorithm. The steps are:
- Approximate the action-value function: $\hat{q}(S, A, \mathbf{w}) \approx q_\pi(S, A)$
- Minimize the mean-squared error between the approximate action-value function $\hat{q}(S, A, \mathbf{w})$ and the true oracle action-value function $q_\pi(S, A)$:
$$J(\mathbf{w}) = \mathbb{E}_\pi\left[\big(q_\pi(S, A) - \hat{q}(S, A, \mathbf{w})\big)^2\right]$$
- Use SGD to find a local minimum:
$$\Delta \mathbf{w} = \alpha \big(q_\pi(S, A) - \hat{q}(S, A, \mathbf{w})\big) \nabla_{\mathbf{w}} \hat{q}(S, A, \mathbf{w})$$
- Again, we represent the state and action by a feature vector: $\mathbf{x}(S, A) = \big(x_1(S, A), \dots, x_n(S, A)\big)^\top$
- Represent the action-value function by a linear combination of features: $\hat{q}(S, A, \mathbf{w}) = \mathbf{x}(S, A)^\top \mathbf{w}$
- Do an SGD update: $\Delta \mathbf{w} = \alpha \big(q_\pi(S, A) - \hat{q}(S, A, \mathbf{w})\big)\, \mathbf{x}(S, A)$
Incremental Control Algorithms
Like prediction, we need to substitute a target for the unknown oracle $q_\pi(S, A)$. We swap out $q_\pi(S_t, A_t)$ for an approximate target (a runnable sketch follows this list):
- For MC, the target is the return $G_t$: $\Delta \mathbf{w} = \alpha \big(G_t - \hat{q}(S_t, A_t, \mathbf{w})\big) \nabla_{\mathbf{w}} \hat{q}(S_t, A_t, \mathbf{w})$
- For TD(0), the target is the TD target $R_{t+1} + \gamma \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w})$: $\Delta \mathbf{w} = \alpha \big(R_{t+1} + \gamma \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}) - \hat{q}(S_t, A_t, \mathbf{w})\big) \nabla_{\mathbf{w}} \hat{q}(S_t, A_t, \mathbf{w})$
- For forward-view TD($\lambda$), the target is the action-value $\lambda$-return $q_t^\lambda$: $\Delta \mathbf{w} = \alpha \big(q_t^\lambda - \hat{q}(S_t, A_t, \mathbf{w})\big) \nabla_{\mathbf{w}} \hat{q}(S_t, A_t, \mathbf{w})$
- For backward-view TD($\lambda$), the equivalent update is:
$$\delta_t = R_{t+1} + \gamma \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}) - \hat{q}(S_t, A_t, \mathbf{w}), \qquad E_t = \gamma \lambda E_{t-1} + \nabla_{\mathbf{w}} \hat{q}(S_t, A_t, \mathbf{w}), \qquad \Delta \mathbf{w} = \alpha\, \delta_t\, E_t$$
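A minimal sketch of control with semi-gradient SARSA(0) and a linear $\hat{q}$, on a hypothetical 1-D corridor where action 1 moves right, action 0 moves left, and reaching the right end gives reward 1. The environment, features, and constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 6, 2
alpha, gamma, eps = 0.1, 0.95, 0.1
w = np.zeros(n_states * n_actions)

def features(s, a):                     # one-hot over (state, action) pairs
    x = np.zeros(n_states * n_actions)
    x[s * n_actions + a] = 1.0
    return x

def q(s, a):
    return features(s, a) @ w

def eps_greedy(s):
    if rng.random() < eps:
        return int(rng.integers(n_actions))
    return int(np.argmax([q(s, a) for a in range(n_actions)]))

def step(s, a):                         # corridor dynamics: 1 = right, 0 = left
    s_next = min(max(s + (1 if a == 1 else -1), 0), n_states - 1)
    done = s_next == n_states - 1
    return s_next, (1.0 if done else 0.0), done

for _ in range(500):
    s, a, done = 0, eps_greedy(0), False
    while not done:
        s_next, r, done = step(s, a)
        a_next = eps_greedy(s_next)
        target = r if done else r + gamma * q(s_next, a_next)   # SARSA TD target
        w += alpha * (target - q(s, a)) * features(s, a)
        s, a = s_next, a_next

print(q(0, 1), q(0, 0))   # moving right from the start should score higher
```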
Should we bootstrap? Empirically, across many examples, we almost always observe that:
- MC takes too many steps because its variance is too high
- TD(0) gives a large efficiency gain compared to MC
- There is usually some value of $\lambda$ in between that is better still than TD(0)
Batch Methods
Motivation:
- Gradient descent is simple and appealing
- But it is not sample efficient (we throw a sample away as soon as we use it once)
- Batch methods seek to find the best fitting value function, given the agent's experience ("training data")
Least Squares Prediction
The problem becomes the following:
- Given our value function approximation $\hat{v}(s, \mathbf{w}) \approx v_\pi(s)$
- And experience $\mathcal{D}$ consisting of $\langle$state, value$\rangle$ pairs: $\mathcal{D} = \{\langle s_1, v_1^\pi \rangle, \langle s_2, v_2^\pi \rangle, \dots, \langle s_T, v_T^\pi \rangle\}$
- Find the parameters $\mathbf{w}$ that give the best fitting value function
Least-squares algorithms simply try to find the $\mathbf{w}$ that minimizes the sum-of-squares error between $\hat{v}(s_t, \mathbf{w})$ and the target values $v_t^\pi$:
$$LS(\mathbf{w}) = \sum_{t=1}^{T} \big(v_t^\pi - \hat{v}(s_t, \mathbf{w})\big)^2 = \mathbb{E}_{\mathcal{D}}\left[\big(v^\pi - \hat{v}(s, \mathbf{w})\big)^2\right]$$
SGD with Experience Replay
It turns out there is a really easy way to find the least squares solution, using experience replay. The idea is to just keep using the data over and over again, instead of throwing away every sample after each update.
Given experience consisting of $\langle$state, value$\rangle$ pairs $\mathcal{D} = \{\langle s_1, v_1^\pi \rangle, \dots, \langle s_T, v_T^\pi \rangle\}$, repeat:
- Sample a state-value pair from the experience: $\langle s, v^\pi \rangle \sim \mathcal{D}$
- Apply an SGD update: $\Delta \mathbf{w} = \alpha \big(v^\pi - \hat{v}(s, \mathbf{w})\big) \nabla_{\mathbf{w}} \hat{v}(s, \mathbf{w})$
It can be shown that this converges to the least-squares solution: $\mathbf{w}^\pi = \arg\min_{\mathbf{w}} LS(\mathbf{w})$
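A minimal sketch of SGD with experience replay for least-squares prediction, assuming we already have a dataset of (feature vector, target value) pairs; the synthetic data is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def replay_sgd(X, v_targets, alpha=0.01, n_updates=20_000):
    """X: (T, n) feature matrix, v_targets: (T,) target values."""
    T, n = X.shape
    w = np.zeros(n)
    for _ in range(n_updates):
        t = rng.integers(T)                  # sample <state, value> from experience
        x, v = X[t], v_targets[t]
        w += alpha * (v - x @ w) * x         # the same SGD update, on reused data
    return w

# Toy usage: targets generated from a hypothetical true weight vector.
X = rng.normal(size=(200, 4))
true_w = np.array([1.0, -2.0, 0.5, 3.0])
v_targets = X @ true_w + 0.1 * rng.normal(size=200)
print(replay_sgd(X, v_targets))              # approaches the least-squares solution
```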
Experience Replay in Deep Q-Networks (DQN)
DQN (for Atari games) uses experience replay and fixed Q-targets:
- Take actions according to an $\epsilon$-greedy policy
- Store transitions $(s_t, a_t, r_{t+1}, s_{t+1})$ in replay memory $\mathcal{D}$
- Sample a random mini-batch of transitions $(s, a, r, s')$ from $\mathcal{D}$
  - A small batch size (e.g. 64) is sufficient
- Maintain two neural networks that estimate Q-values:
- The old reference neural network is frozen periodically and used to compute the targets
- Call its parameters $w^-$
- The actual neural network we are training has parameters $w$
- Compute Q-learning targets with respect to the old, fixed parameters $w^-$
- Optimize the MSE between the Q-network and the Q-learning targets:
$$\mathcal{L}(w) = \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}}\left[\left(r + \gamma \max_{a'} Q(s', a'; w^-) - Q(s, a; w)\right)^2\right]$$
- This is essentially Q-learning with a one-step lookahead, but using the reference network instead of the network currently being trained
- The success of this method depends on the stability of training:
- Experience replay helps to stabilize training as it randomly samples from past experience instead of getting batches of highly correlated data
- Fixed Q-targets - fixing the reference neural network helps to stabilize the targets and thus training
- The neural network is just a large convolutional neural network
- Input state is a stack of raw pixels from last 4 frames
- The output is $Q(s, a)$ for 18 joystick/button positions
- The reward is the change in score for that step
- Applied to a large number of Atari games
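Below is a minimal sketch (assuming PyTorch) of the core DQN update: a replay memory, a frozen target network with parameters $w^-$, and mini-batch Q-learning targets. The small MLP stands in for the convolutional network and the random transitions stand in for an Atari emulator.

```python
import random
from collections import deque

import torch
import torch.nn as nn

state_dim, n_actions, gamma, batch_size = 8, 4, 0.99, 64

def make_net():
    return nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                         nn.Linear(64, n_actions))

q_net = make_net()                            # parameters w (being trained)
target_net = make_net()                       # parameters w^- (frozen reference)
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay = deque(maxlen=10_000)

# Fill the replay memory with dummy transitions (s, a, r, s', done).
for _ in range(1_000):
    replay.append((torch.randn(state_dim), random.randrange(n_actions),
                   random.random(), torch.randn(state_dim), False))

for step in range(500):
    batch = random.sample(list(replay), batch_size)        # random mini-batch
    s, a, r, s2, done = zip(*batch)
    s, s2 = torch.stack(s), torch.stack(s2)
    a = torch.tensor(a)
    r = torch.tensor(r, dtype=torch.float32)
    done = torch.tensor(done, dtype=torch.float32)

    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)    # Q(s, a; w)
    with torch.no_grad():                                   # targets use frozen w^-
        target = r + gamma * (1 - done) * target_net(s2).max(dim=1).values

    loss = nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if step % 100 == 0:                                     # periodically refresh w^-
        target_net.load_state_dict(q_net.state_dict())
```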
Linear Least Squares Prediction
Experience replay finds the least squares solution, but it takes many iterations. If we use a linear value function approximation, we can solve the least squares solution directly.
At the minimum of $LS(\mathbf{w})$, the expected update must be zero:
$$\mathbb{E}_{\mathcal{D}}[\Delta \mathbf{w}] = 0 \quad \Longrightarrow \quad \alpha \sum_{t=1}^{T} \mathbf{x}(s_t) \big(v_t^\pi - \mathbf{x}(s_t)^\top \mathbf{w}\big) = 0$$
Solving for $\mathbf{w}$:
$$\mathbf{w} = \left(\sum_{t=1}^{T} \mathbf{x}(s_t)\, \mathbf{x}(s_t)^\top\right)^{-1} \sum_{t=1}^{T} \mathbf{x}(s_t)\, v_t^\pi$$
- Note that the matrix inverse is performed on an $n \times n$ matrix, where $n$ is the size of the feature/parameter vector. Hence, if the number of parameters is small, the $O(n^3)$ cost of the direct solution is acceptable
- Using the Sherman-Morrison formula to maintain the inverse incrementally, the cost per update is reduced to $O(n^2)$
Linear least squares prediction algorithms actually have better convergence properties.
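A minimal sketch of this direct linear least-squares solution, given feature vectors and target values; the small regularization term is my own practical safeguard, not part of the derivation above.

```python
import numpy as np

def linear_least_squares(X, v_targets, reg=1e-6):
    """X: (T, n) feature matrix, v_targets: (T,) targets.
    Solves (sum_t x_t x_t^T) w = sum_t x_t v_t directly."""
    n = X.shape[1]
    A = X.T @ X + reg * np.eye(n)    # n x n matrix: O(n^3) to solve once
    b = X.T @ v_targets
    return np.linalg.solve(A, b)

# Toy usage with the same synthetic data idea as the replay example.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
true_w = np.array([1.0, -2.0, 0.5, 3.0])
v_targets = X @ true_w + 0.1 * rng.normal(size=200)
print(linear_least_squares(X, v_targets))    # matches the SGD-with-replay solution
```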
Least Squares Policy Iteration
- Policy evaluation is done using least squares Q-learning (linear or otherwise)
- Policy improvement is done using greedy policy improvement, as usual