Policy Gradient
Look at methods that update the policy directly, instead of working with value / action-value functions.
- In the last lecture we approximated the value or action-value function using parameters $w$: $V_w(s) \approx V^\pi(s)$, $Q_w(s, a) \approx Q^\pi(s, a)$
- We then generated a policy directly from the value function, e.g. using an $\epsilon$-greedy algorithm
- In this lecture, we will directly parametrise the policy: $\pi_\theta(s, a) = \mathbb{P}[a \mid s, \theta]$
- Again, we will focus on model-free reinforcement learning
Value-based vs Policy-based RL
- Value Based
- Learn the value function
- Policy is implicit from the value function (e.g. $\epsilon$-greedy)
- Policy Based
- No value function
- Learn policy directly
- Actor-Critic
- Learn a value function
- Also learn a policy
Advantages of Policy-based RL
- Better convergence properties than value-based methods
- Value-based methods sometimes swing or chatter around the optimum and do not converge
- Effective in high-dimensional or continuous action spaces
- We do not need to compute a max over actions of the Q-values
- E.g. if the action space is continuous, the maximization is not at all trivial and may be prohibitive
- This may be the main impediment to value-based RL
- Can learn stochastic policies
Disadvantages of Policy-based RL
- Typically converges to a local rather than global optimum
- Evaluating a policy is typically inefficient and high variance
Why might we want to have a stochastic policy?
- e.g. Rock paper scissors
- Having a deterministic policy is easily exploited
- A uniform random policy is optimal
A stochastic policy is also necessary in the case of state aliasing, where our state representation cannot fully differentiate states from each other. In this case we no longer have a Markov Decision Process, and there may not exist a deterministic policy that is optimal. That is why we need a stochastic policy.
Policy Objective Functions
Our goal in policy gradient is to find the best parameters $\theta$ for a parametrised policy $\pi_\theta(s, a)$. But how do we measure the quality of a given policy?
There are 3 ways of measuring:
- In episodic environments we can use the start value: $J_1(\theta) = V^{\pi_\theta}(s_1) = \mathbb{E}_{\pi_\theta}[v_1]$
- In continuing environments we can use the average value: $J_{avV}(\theta) = \sum_s d^{\pi_\theta}(s) V^{\pi_\theta}(s)$
- Or the average reward per time step: $J_{avR}(\theta) = \sum_s d^{\pi_\theta}(s) \sum_a \pi_\theta(s, a) \mathcal{R}_s^a$
In the above, $d^{\pi_\theta}(s)$ is the stationary distribution of the Markov chain for $\pi_\theta$. It tells us the proportion of time we spend in each state, and so it provides the weighting required to get the average value or reward.
Policy Optimization
Policy-based reinforcement learning is an optimisation problem: find the $\theta$ that maximises $J(\theta)$
- Gradient-free methods:
- Hill climbing
- Simplex / amoeba / Nelder-Mead
- Genetic algorithms
- Gradient methods are almost always more efficient:
- Gradient descent
- Conjugate gradient
- Quasi-Newton
Finite Difference Policy Gradient
- Let $J(\theta)$ be any policy objective function
- Policy gradient algorithms search for a local maximum in $J(\theta)$ by ascending the gradient of the objective with respect to the policy parameters: $\Delta\theta = \alpha \nabla_\theta J(\theta)$
- Where $\nabla_\theta J(\theta)$ is the policy gradient (a vector of partial derivatives along each dimension), $\nabla_\theta J(\theta) = \left( \frac{\partial J(\theta)}{\partial \theta_1}, \ldots, \frac{\partial J(\theta)}{\partial \theta_n} \right)^\top$, and $\alpha$ is a step-size parameter
The simplest way to compute the policy gradient is to use finite differences:
- For each dimension $k \in [1, n]$:
- Estimate the $k$th partial derivative of the objective function with respect to $\theta$
- By perturbing $\theta$ by a small amount $\epsilon$ in the $k$th dimension: $\frac{\partial J(\theta)}{\partial \theta_k} \approx \frac{J(\theta + \epsilon u_k) - J(\theta)}{\epsilon}$
- Where $u_k$ is a unit vector with 1 in the $k$th component and 0 elsewhere
- This is not the most efficient algorithm, as it requires $n$ evaluations (one per dimension) to compute a single gradient step
- It is simple, noisy and inefficient, but sometimes works (a small sketch follows after this list)
- Works for arbitrary policies, even if the policy is not differentiable
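As a concrete illustration, here is a minimal NumPy sketch of the finite-difference estimator, assuming a hypothetical `evaluate_policy(theta)` function that returns a (possibly noisy) estimate of $J(\theta)$, e.g. the mean return over a few rollouts; the names are illustrative, not from the lecture.

```python
import numpy as np

def finite_difference_gradient(evaluate_policy, theta, eps=1e-2):
    """Estimate dJ/d(theta_k) by perturbing each parameter dimension in turn."""
    grad = np.zeros_like(theta)
    j_base = evaluate_policy(theta)          # J(theta), estimated once
    for k in range(len(theta)):
        u_k = np.zeros_like(theta)
        u_k[k] = 1.0                         # unit vector along dimension k
        grad[k] = (evaluate_policy(theta + eps * u_k) - j_base) / eps
    return grad

# One gradient-ascent step would then be: theta = theta + alpha * grad
```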
Likelihood Ratios
We now want to compute the policy gradient analytically, assuming:
- The policy $\pi_\theta$ is differentiable whenever it is non-zero; and
- We know the gradient $\nabla_\theta \pi_\theta(s, a)$
- Likelihood ratio methods exploit the following identity (call it the log trick): $\nabla_\theta \pi_\theta(s, a) = \pi_\theta(s, a) \frac{\nabla_\theta \pi_\theta(s, a)}{\pi_\theta(s, a)} = \pi_\theta(s, a) \nabla_\theta \log \pi_\theta(s, a)$
- Note that we use the simple identity $\nabla_\theta \log \pi_\theta(s, a) = \frac{\nabla_\theta \pi_\theta(s, a)}{\pi_\theta(s, a)}$
- The new formulation is nicer to work with because we have $\pi_\theta(s, a)$ on the left, which, when summed or integrated over, gives us an expectation under our policy
- This allows us to sample trajectories from the data and compute the gradient at each step
- The score function is the quantity $\nabla_\theta \log \pi_\theta(s, a)$
Linear Softmax Policy
Use the softmax policy as a simple running example:
- Weight actions using a linear combination of features: $\phi(s, a)^\top \theta$
- The probability of an action is then proportional to the exponentiated weight: $\pi_\theta(s, a) \propto e^{\phi(s, a)^\top \theta}$
- The score function is then: $\nabla_\theta \log \pi_\theta(s, a) = \phi(s, a) - \mathbb{E}_{\pi_\theta}[\phi(s, \cdot)]$
- Note that we are omitting the derivation of the second term of the score function, which is a bit more involved, as it requires differentiating the normalisation factor (not shown above). A small sketch of this policy and its score follows below.
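Here is a minimal NumPy sketch of the linear softmax policy and its score function, assuming a hypothetical feature function `phi(s, a)` that returns the feature vector $\phi(s, a)$; the names are illustrative.

```python
import numpy as np

def softmax_policy(theta, phi, s, actions):
    """pi_theta(s, a) proportional to exp(phi(s, a) . theta)."""
    logits = np.array([phi(s, a) @ theta for a in actions])
    logits -= logits.max()                       # for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

def softmax_score(theta, phi, s, a, actions):
    """Score: grad_theta log pi_theta(s, a) = phi(s, a) - E_pi[phi(s, .)]."""
    probs = softmax_policy(theta, phi, s, actions)
    expected_phi = sum(p * phi(s, b) for p, b in zip(probs, actions))
    return phi(s, a) - expected_phi
```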
Gaussian Policy
In continuous action spaces, a Gaussian policy is natural
- Let the mean of the Gaussian be a linear combination of state features: $\mu(s) = \phi(s)^\top \theta$
- The variance may be fixed at $\sigma^2$ or parametrised
- The policy is Gaussian (recall that we are in a continuous action space, so the action is a real-valued vector): $a \sim \mathcal{N}(\mu(s), \sigma^2)$
- The score function is then: $\nabla_\theta \log \pi_\theta(s, a) = \frac{(a - \mu(s)) \phi(s)}{\sigma^2}$
- We can derive this score function by writing down the PDF of the Gaussian distribution for $\pi_\theta(s, a)$ and taking the derivative of its log with respect to $\theta$ (a small sketch follows below)
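And a corresponding sketch for the Gaussian policy with fixed variance, again assuming a hypothetical state feature function `phi(s)`.

```python
import numpy as np

def gaussian_action(theta, phi, s, sigma, rng):
    """Sample a ~ N(mu(s), sigma^2) with mu(s) = phi(s) . theta."""
    mu = phi(s) @ theta
    return rng.normal(mu, sigma)

def gaussian_score(theta, phi, s, a, sigma):
    """Score: grad_theta log pi_theta(s, a) = (a - mu(s)) * phi(s) / sigma^2."""
    mu = phi(s) @ theta
    return (a - mu) * phi(s) / sigma ** 2
```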
Policy Gradient Theorem: One-Step MDPs
Consider a simple class of one-step MDPs to simplify the math
- Start in a state $s \sim d(s)$
- Terminate after one time step with reward $r = \mathcal{R}_{s,a}$
- This is a sort of contextual bandit
Use likelihood ratios to compute the policy gradient
- First we pick our objective function, which is just the expected reward (averaged over our start state and the action we choose): $J(\theta) = \mathbb{E}_{\pi_\theta}[r] = \sum_{s \in \mathcal{S}} d(s) \sum_{a \in \mathcal{A}} \pi_\theta(s, a) \mathcal{R}_{s,a}$
- Then we take the derivative to do gradient ascent: $\nabla_\theta J(\theta) = \sum_{s \in \mathcal{S}} d(s) \sum_{a \in \mathcal{A}} \pi_\theta(s, a) \nabla_\theta \log \pi_\theta(s, a) \mathcal{R}_{s,a} = \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s, a) \, r]$
- Note that when taking the gradient of $\pi_\theta(s, a)$, we use the log trick to rewrite $\nabla_\theta \pi_\theta(s, a)$ as $\pi_\theta(s, a) \nabla_\theta \log \pi_\theta(s, a)$, and the result becomes an expectation again because we recover $\pi_\theta(s, a)$ outside of the gradient. This shows the power of the log trick (a small numerical check follows below).
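Below is a small numerical sanity check of the likelihood-ratio gradient on a one-step MDP with a single state (a bandit): the sampled estimate of $\mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(a) \, r]$ should agree with a finite-difference estimate of $\nabla_\theta \mathbb{E}[r]$. The reward values and parameters below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
rewards = np.array([1.0, 3.0, 0.5])          # R_a for three actions (made up)
theta = np.array([0.2, -0.1, 0.4])           # one logit parameter per action

def pi(theta):
    p = np.exp(theta - theta.max())
    return p / p.sum()

def expected_reward(theta):
    return pi(theta) @ rewards

# Likelihood-ratio (score function) estimate from samples; for a softmax over
# logits, the score of action a is one_hot(a) - pi(theta).
n = 200_000
actions = rng.choice(3, size=n, p=pi(theta))
scores = np.eye(3)[actions] - pi(theta)
lr_grad = (scores * rewards[actions, None]).mean(axis=0)

# Central finite-difference estimate of grad E[r] for comparison.
eps = 1e-5
fd_grad = np.array([
    (expected_reward(theta + eps * np.eye(3)[k])
     - expected_reward(theta - eps * np.eye(3)[k])) / (2 * eps)
    for k in range(3)
])

print(lr_grad, fd_grad)   # the two estimates should agree closely
```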
Policy Gradient Theorem
But we don't just want to handle one-step MDPs; we want to generalise to multi-step MDPs
- It turns out that we just need to replace the instantaneous reward $r$ with the long-term action value $Q^{\pi_\theta}(s, a)$ (I suppose this means we need to model $Q$ as well)
- Regardless of whether we use the (i) start state objective, (ii) average value objective or (iii) average reward objective, the policy gradient theorem holds
Theorem. Policy Gradient Theorem.
For any differentiable policy $\pi_\theta(s, a)$, and for any of the policy objective functions $J = J_1$, $J_{avR}$, or $\frac{1}{1 - \gamma} J_{avV}$, the policy gradient is: $\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s, a) \, Q^{\pi_\theta}(s, a)]$
Monte Carlo Policy Gradient (REINFORCE)
The policy gradient theorem gives rise to a simple Monte Carlo policy gradient algorithm for finding the optimal policy:
REINFORCE Algorithm.
- Initialize $\theta$ randomly
- For each episode $\{s_1, a_1, r_2, \ldots, s_{T-1}, a_{T-1}, r_T\} \sim \pi_\theta$ do:
    - For $t = 1$ to $T - 1$ do:
        - $\theta \leftarrow \theta + \alpha \nabla_\theta \log \pi_\theta(s_t, a_t) v_t$
- Return $\theta$
Note that:
- We are doing Monte Carlo, i.e. we wait until the end of the episode before we go back and update the parameters for each time step.
- We are doing SGD (using a stochastic sample of the gradient), so there is no expectation term
- We use the return $v_t$ as an unbiased sample of $Q^{\pi_\theta}(s_t, a_t)$. Recall that $v_t$ is the total discounted reward from time step $t$ until termination.
- This is the simplest and oldest policy gradient algorithm (a small sketch follows below).
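Here is a minimal sketch of REINFORCE with a linear softmax policy. The environment is assumed to expose a gym-style interface where `env.reset()` returns a state and `env.step(a)` returns `(next_state, reward, done)`, and `phi(s, a)` is a hypothetical feature function; all of these names are illustrative, not part of the lecture.

```python
import numpy as np

def reinforce(env, phi, n_actions, n_features,
              alpha=0.01, gamma=0.99, n_episodes=1000, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros(n_features)

    def probs(s):
        logits = np.array([phi(s, a) @ theta for a in range(n_actions)])
        logits -= logits.max()
        p = np.exp(logits)
        return p / p.sum()

    for _ in range(n_episodes):
        # Roll out one full episode (Monte Carlo: update only at the end).
        s, trajectory, done = env.reset(), [], False
        while not done:
            a = rng.choice(n_actions, p=probs(s))
            s_next, r, done = env.step(a)
            trajectory.append((s, a, r))
            s = s_next

        # Walk backwards to accumulate the return v_t, then take one
        # gradient step per time step: theta <- theta + alpha * score * v_t.
        g = 0.0
        for s, a, r in reversed(trajectory):
            g = r + gamma * g
            p = probs(s)
            score = phi(s, a) - sum(p[b] * phi(s, b) for b in range(n_actions))
            theta += alpha * score * g
    return theta
```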
Empirically, policy gradient methods have a nice learning curve without the jittery behaviour of value-based methods. However, Monte Carlo methods can take a very long time (millions of steps) to converge due to high variance.
Actor Critic Policy Gradient
The main problem with Monte Carlo policy gradient is the high variance of the return $v_t$: sometimes we get no reward, sometimes we get a high reward.
The idea is thus to use a critic to estimate the action-value function: $Q_w(s, a) \approx Q^{\pi_\theta}(s, a)$
The name critic refers to the value function, which simply "watches" and evaluates the value of an action, whilst the actor is the policy itself which decides how we should act.
We maintain two sets of parameters:
- Critic: updates the action-value function parameters $w$
- Actor: updates the policy parameters $\theta$, in the direction suggested by the critic
Actor-critic algorithms follow an approximate policy gradient: $\nabla_\theta J(\theta) \approx \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s, a) \, Q_w(s, a)]$, with update $\Delta\theta = \alpha \nabla_\theta \log \pi_\theta(s, a) \, Q_w(s, a)$
Notice that we just replace the true $Q^{\pi_\theta}(s, a)$ with $Q_w(s, a)$, the value function of the critic model.
Estimating the Action-Value Function of the Critic
The critic is solving a familiar problem: policy evaluation. How good is policy $\pi_\theta$ for the current parameters $\theta$? We have explored this problem previously, i.e.:
- Monte Carlo policy evaluation
- Temporal-difference learning, e.g. TD($\lambda$)
- or least squares policy evaluation
This leads us to a simple actor-critic algorithm:
- Critic: use linear value function approximation $Q_w(s, a) = \phi(s, a)^\top w$
- Update $w$ using linear TD(0)
- Actor: update $\theta$ using the policy gradient
Q Actor-Critic (QAC) Algorithm.
- Initialize $s$, $\theta$
- Sample $a \sim \pi_\theta$
- For each step do:
    - Sample reward $r = \mathcal{R}_s^a$; sample transition $s' \sim \mathcal{P}_s^a$
    - Sample action $a' \sim \pi_\theta(s', a')$
    - $\delta = r + \gamma Q_w(s', a') - Q_w(s, a)$
    - $\theta \leftarrow \theta + \alpha \nabla_\theta \log \pi_\theta(s, a) Q_w(s, a)$
    - $w \leftarrow w + \beta \delta \phi(s, a)$
    - $a \leftarrow a'$, $s \leftarrow s'$
Note that the $\delta$ and $w$ lines together are simply the TD(0) update of the critic, and the $\theta$ line is simply the policy gradient step for the actor (a Python sketch of the full loop follows below).
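A minimal sketch of the QAC loop above, with a linear critic $Q_w(s, a) = \phi(s, a)^\top w$ and a linear softmax actor. As before, `env` (gym-style `reset`/`step`) and `phi` are hypothetical stand-ins.

```python
import numpy as np

def q_actor_critic(env, phi, n_actions, n_features,
                   alpha=0.01, beta=0.1, gamma=0.99, n_steps=10000, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros(n_features)      # actor parameters
    w = np.zeros(n_features)          # critic parameters

    def probs(s):
        logits = np.array([phi(s, b) @ theta for b in range(n_actions)])
        logits -= logits.max()
        p = np.exp(logits)
        return p / p.sum()

    s = env.reset()
    a = rng.choice(n_actions, p=probs(s))
    for _ in range(n_steps):
        s_next, r, done = env.step(a)
        a_next = rng.choice(n_actions, p=probs(s_next))

        # Critic TD(0) error: delta = r + gamma * Q_w(s', a') - Q_w(s, a)
        q_sa = phi(s, a) @ w
        q_next = 0.0 if done else phi(s_next, a_next) @ w
        delta = r + gamma * q_next - q_sa

        # Actor: policy gradient step using the critic's Q estimate.
        p = probs(s)
        score = phi(s, a) - sum(p[b] * phi(s, b) for b in range(n_actions))
        theta += alpha * score * q_sa

        # Critic: linear TD(0) update, w <- w + beta * delta * phi(s, a).
        w += beta * delta * phi(s, a)

        if done:
            s = env.reset()
            a = rng.choice(n_actions, p=probs(s))
        else:
            s, a = s_next, a_next
    return theta, w
```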
Bias in Actor Critic Algorithms
The problem is that approximating the policy gradient (replacing $Q^{\pi_\theta}$ with $Q_w$) can introduce bias
- A biased policy gradient may not find the right solution
- Luckily, if we choose the value function approximation carefully, we can avoid introducing any bias
- i.e. we can still follow the exact policy gradient (see Compatible Function Approximation theorem)
Reducing Variance using a Baseline
Here we move on to tricks to improve the training algorithm. The baseline method is the best known trick.
The main idea is to subtract a baseline function $B(s)$ from the policy gradient, so that we can reduce variance without changing the expectation.
The reason this works is that, since $B(s)$ does not depend on the action $a$, its expectation when plugged into the policy gradient is $0$, like so: $\mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s, a) B(s)] = \sum_s d^{\pi_\theta}(s) \sum_a \nabla_\theta \pi_\theta(s, a) B(s) = \sum_s d^{\pi_\theta}(s) B(s) \nabla_\theta \sum_a \pi_\theta(s, a) = 0$
Note that in the second step, since $\pi_\theta(s, a)$ is a probability distribution over actions, the right-most sum $\sum_a \pi_\theta(s, a)$ equals $1$, which is a constant. Hence its gradient is $0$, and the whole expectation resolves to $0$.
A good and popular choice of baseline is the state value function $B(s) = V^{\pi_\theta}(s)$.
- So we can rewrite the policy gradient using the advantage function $A^{\pi_\theta}(s, a)$: $A^{\pi_\theta}(s, a) = Q^{\pi_\theta}(s, a) - V^{\pi_\theta}(s)$, and $\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s, a) \, A^{\pi_\theta}(s, a)]$
- Intuitively, the advantage function tells us the additional benefit of an action over the baseline value of being in that state
Estimating the Advantage Function
The advantage function can significantly reduce the variance of the policy gradient. The naive way is to estimate both $V^{\pi_\theta}(s)$ and $Q^{\pi_\theta}(s, a)$ separately with two sets of parameters, i.e. $V_v(s) \approx V^{\pi_\theta}(s)$, $Q_w(s, a) \approx Q^{\pi_\theta}(s, a)$, and $A(s, a) = Q_w(s, a) - V_v(s)$.
Then, we use TD methods to update both value functions. However, this is not efficient in terms of parameters.
The better way is to observe that the TD error is an unbiased estimate of the advantage function. Therefore, we can plug in the TD error in our policy gradient update instead.
- Recall that the TD error is $\delta^{\pi_\theta} = r + \gamma V^{\pi_\theta}(s') - V^{\pi_\theta}(s)$
- And this is an unbiased estimate of the advantage function: $\mathbb{E}_{\pi_\theta}[\delta^{\pi_\theta} \mid s, a] = \mathbb{E}_{\pi_\theta}[r + \gamma V^{\pi_\theta}(s') \mid s, a] - V^{\pi_\theta}(s) = Q^{\pi_\theta}(s, a) - V^{\pi_\theta}(s) = A^{\pi_\theta}(s, a)$
- Note that we have not made any approximations above. We just showed that the expectation of the TD(0) error when following policy $\pi_\theta$ is exactly the advantage function. Interestingly, computing the TD error does not require estimating the $Q$ function, only the $V$ function. This gives us a simpler update
- We can thus use the TD error to compute the policy gradient: $\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s, a) \, \delta^{\pi_\theta}]$
- In practice, since we do not have the true value function $V^{\pi_\theta}(s)$, we use an approximate TD error with an estimated value function $V_v(s)$: $\delta_v = r + \gamma V_v(s') - V_v(s)$
- This approach only requires one set of critic parameters $v$ (a small sketch follows after this list)
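Here is a minimal sketch of a single TD-error actor-critic update, with a linear state-value critic $V_v(s) = \phi(s)^\top v$; `phi_s` and `score` are hypothetical stand-ins for the state feature function and the policy score function (e.g. the softmax score above).

```python
import numpy as np

def td_actor_critic_step(theta, v, phi_s, score, s, a, r, s_next, done,
                         alpha=0.01, beta=0.1, gamma=0.99):
    """One online update of actor (theta) and critic (v) from a single transition."""
    v_s = phi_s(s) @ v
    v_next = 0.0 if done else phi_s(s_next) @ v

    # TD error: a sampled estimate of the advantage A(s, a).
    delta = r + gamma * v_next - v_s

    theta = theta + alpha * score(s, a) * delta   # actor: approximate policy gradient
    v = v + beta * delta * phi_s(s)               # critic: TD(0) update of V_v
    return theta, v
```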
Improving the Actor Updates
Recall that we can estimate the value function from targets at different time scales to trade-off bias and variance:
- For MC, the target is the return $v_t$
- For TD(0), the target is the TD target $r + \gamma V(s_{t+1})$
- For forward-view TD($\lambda$), the target is the lambda return $v_t^\lambda$
- For backward-view TD($\lambda$), we use eligibility traces
Similarly, we can estimate the policy gradient for the actor at many time scales. The main quantity we want to estimate is the advantage weighted by the score function, $\mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s, a) \, A^{\pi_\theta}(s, a)]$
- MC policy gradient uses the error from the complete return: $\Delta\theta = \alpha (v_t - V_v(s_t)) \nabla_\theta \log \pi_\theta(s_t, a_t)$
- Actor-critic policy gradient uses the one-step TD error: $\Delta\theta = \alpha (r + \gamma V_v(s_{t+1}) - V_v(s_t)) \nabla_\theta \log \pi_\theta(s_t, a_t)$
- Forward-view TD($\lambda$) uses the lambda-return target: $\Delta\theta = \alpha (v_t^\lambda - V_v(s_t)) \nabla_\theta \log \pi_\theta(s_t, a_t)$
- Backward-view TD($\lambda$) uses eligibility traces. Note that the eligibility update now uses the score function $\nabla_\theta \log \pi_\theta(s_t, a_t)$ instead of $\phi(s_t)$, which was the feature vector: $e_t = \gamma \lambda e_{t-1} + \nabla_\theta \log \pi_\theta(s_t, a_t)$ and $\Delta\theta = \alpha \delta e_t$
- This update can be performed online, unlike MC policy gradient (a small sketch follows after this list)
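A minimal sketch of the backward-view TD($\lambda$) actor update, keeping an eligibility trace over score vectors rather than feature vectors; `score_sa` is the score $\nabla_\theta \log \pi_\theta(s_t, a_t)$ and `delta` is the TD error computed by the critic (names illustrative).

```python
import numpy as np

def td_lambda_actor_step(theta, e, score_sa, delta,
                         alpha=0.01, gamma=0.99, lam=0.9):
    """Online actor update: decay the trace, add the current score, step along delta * e."""
    e = gamma * lam * e + score_sa
    theta = theta + alpha * delta * e
    return theta, e
```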
Natural Policy Gradient
The natural policy gradient is a parametrisation-independent approach. It finds the ascent direction that is closest to the vanilla gradient when changing the policy by a small, fixed amount: $\nabla_\theta^{nat} \pi_\theta(s, a) = G_\theta^{-1} \nabla_\theta J(\theta)$
- Where $G_\theta$ is the Fisher information matrix: $G_\theta = \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s, a) \nabla_\theta \log \pi_\theta(s, a)^\top]$
- Notice that this obviates the need for a critic, as $G_\theta$ is based on the actor itself (a small sketch follows below)
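A minimal sketch of one natural gradient step: estimate the Fisher information matrix $G_\theta$ from a batch of sampled score vectors and precondition the vanilla gradient with its inverse. The regularisation term is an illustrative practical detail, not part of the lecture.

```python
import numpy as np

def natural_gradient_step(theta, scores, vanilla_grad, alpha=0.01, reg=1e-3):
    """scores: list of sampled score vectors grad log pi_theta(s, a)."""
    # G_theta = E[score score^T], estimated from the batch.
    G = np.mean([np.outer(sc, sc) for sc in scores], axis=0)
    G += reg * np.eye(len(theta))                 # regularise for invertibility
    nat_grad = np.linalg.solve(G, vanilla_grad)   # G^{-1} grad J(theta)
    return theta + alpha * nat_grad
```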
Summary
The policy gradient has many equivalent forms:
- $\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s, a) \, v_t]$ (REINFORCE)
- $\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s, a) \, Q_w(s, a)]$ (Q Actor-Critic)
- $\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s, a) \, A_w(s, a)]$ (Advantage Actor-Critic)
- $\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s, a) \, \delta]$ (TD Actor-Critic)
- $\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s, a) \, \delta e]$ (TD($\lambda$) Actor-Critic)
Each formulation leads to a different SGD algorithm. We can learn the critic using policy evaluation (e.g. MC or TD learning) to estimate $Q^{\pi_\theta}(s, a)$ or $V^{\pi_\theta}(s)$.