Policy Gradient
Look at methods that update the policy directly, instead of working with value / action-value functions.
- In the last lecture we approximated the value or action-value function using parameters $w$: $V_w(s) \approx V^\pi(s)$, $Q_w(s, a) \approx Q^\pi(s, a)$
- We then generated a policy directly from the value function, e.g. using an $\epsilon$-greedy algorithm
- In this lecture, we will directly parametrise the policy: $\pi_\theta(s, a) = \mathbb{P}[a \mid s, \theta]$
- Again, we will focus on model-free reinforcement learning
Value-based vs Policy-based RL
- Value Based
- Learn the value function
- Policy is implicit from the value function (e.g. $\epsilon$-greedy)
- Policy Based
- No value function
- Learn policy directly
- Actor-Critic
- Learn a value function
- Also learn a policy
Advantages of Policy-based RL
- Better convergence properties than value-based methods
- Value-based methods sometimes swing or chatter around the optimum and do not converge
- Effective in high-dimensional or continuous action spaces
- We do not need to compute a max over actions of the Q-values
- E.g. if the action space is continuous, the maximization is not at all trivial and may be prohibitive
- This may be the main impediment to value-based RL
- Can learn stochastic policies
Disadvantages of Policy-based RL
- Typically converges to a local rather than global optimum
- Evaluating a policy is typically inefficient and high variance
Why might we want to have a stochastic policy?
- e.g. Rock paper scissors
- Having a deterministic policy is easily exploited
- A uniform random policy is optimal
A stochastic policy is also necessary in the case of state aliasing, where our state representation cannot fully differentiate states from each other. In this case we no longer have a Markov Decision Process, and there may not exist a deterministic policy that is optimal. That is why we need a stochastic policy.
Policy Objective Functions
Our goal in policy gradient is to find the best parameters $\theta$ for a parametrised policy $\pi_\theta(s, a)$. But how do we measure the quality of a given policy?
There are 3 ways of measuring:
- In episodic environments we can use the start value: $J_1(\theta) = V^{\pi_\theta}(s_1) = \mathbb{E}_{\pi_\theta}[v_1]$
- In continuing environments we can use the average value: $J_{avV}(\theta) = \sum_s d^{\pi_\theta}(s) V^{\pi_\theta}(s)$
- Or the average reward per time step: $J_{avR}(\theta) = \sum_s d^{\pi_\theta}(s) \sum_a \pi_\theta(s, a) \mathcal{R}_s^a$
In the above, $d^{\pi_\theta}(s)$ is the stationary distribution of the Markov chain for $\pi_\theta$. It tells us the proportion of time we spend in each state, and so it provides the weighting required to get the average value or reward.
Policy Optimization
Policy-based reinforcement learning is an optimisation problem: find the $\theta$ that maximises $J(\theta)$
- Gradient-free methods:
- Hill climbing
- Simplex / amoeba / Nelder-Mead
- Genetic algorithms
- Gradient methods are almost always more efficient:
- Gradient descent
- Conjugate gradient
- Quasi-Newton
Finite Difference Policy Gradient
- Let $J(\theta)$ be any policy objective function
- Policy gradient algorithms search for a local maximum in $J(\theta)$ by ascending the gradient of the objective with respect to the policy parameters: $\Delta\theta = \alpha \nabla_\theta J(\theta)$
- Where $\nabla_\theta J(\theta)$ is the policy gradient (a vector of partial derivatives along each dimension), $\nabla_\theta J(\theta) = \left( \frac{\partial J(\theta)}{\partial \theta_1}, \ldots, \frac{\partial J(\theta)}{\partial \theta_n} \right)^\top$, and $\alpha$ is a step-size parameter
The simplest way to compute the policy gradient is to use finite differences:
- For each dimension $k \in [1, n]$:
- Estimate the $k$th partial derivative of the objective function with respect to $\theta$
- By perturbing $\theta$ by a small amount $\epsilon$ in the $k$th dimension: $\frac{\partial J(\theta)}{\partial \theta_k} \approx \frac{J(\theta + \epsilon u_k) - J(\theta)}{\epsilon}$
- Where $u_k$ is a unit vector with 1 in the $k$th component and 0 elsewhere
- This is not the most efficient algorithm, as it requires $n$ evaluations (one per dimension) to compute a single gradient step
- It is simple, noisy and inefficient, but sometimes works (a small sketch follows after this list)
- Works for arbitrary policies, even if the policy is not differentiable
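As a concrete illustration, here is a minimal NumPy sketch of the finite-difference estimator, assuming a hypothetical `evaluate_policy(theta)` function that returns a (possibly noisy) estimate of $J(\theta)$, e.g. the mean return over a few rollouts; the names are illustrative, not from the lecture.

```python
import numpy as np

def finite_difference_gradient(evaluate_policy, theta, eps=1e-2):
    """Estimate dJ/d(theta_k) by perturbing each parameter dimension in turn."""
    grad = np.zeros_like(theta)
    j_base = evaluate_policy(theta)          # J(theta), estimated once
    for k in range(len(theta)):
        u_k = np.zeros_like(theta)
        u_k[k] = 1.0                         # unit vector along dimension k
        grad[k] = (evaluate_policy(theta + eps * u_k) - j_base) / eps
    return grad

# One gradient-ascent step would then be: theta = theta + alpha * grad
```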
Likelihood Ratios
We now want to compute the policy gradient analytically, assuming:
- The policy $\pi_\theta$ is differentiable whenever it is non-zero; and
- We know the gradient $\nabla_\theta \pi_\theta(s, a)$
- Likelihood ratio methods exploit the following identity (call it the log trick): $\nabla_\theta \pi_\theta(s, a) = \pi_\theta(s, a) \frac{\nabla_\theta \pi_\theta(s, a)}{\pi_\theta(s, a)} = \pi_\theta(s, a) \nabla_\theta \log \pi_\theta(s, a)$
- Note that we use the simple identity $\nabla_\theta \log \pi_\theta(s, a) = \frac{\nabla_\theta \pi_\theta(s, a)}{\pi_\theta(s, a)}$
- The new formulation is nicer to work with because we have $\pi_\theta(s, a)$ on the left, which, when summed or integrated over, gives us an expectation under our policy
- This allows us to sample trajectories from the data and compute the gradient at each step
- The score function is the quantity $\nabla_\theta \log \pi_\theta(s, a)$
Linear Softmax Policy
Use the softmax policy as a simple running example:
- Weight actions using a linear combination of features: $\phi(s, a)^\top \theta$
- The probability of an action is then proportional to the exponentiated weight: $\pi_\theta(s, a) \propto e^{\phi(s, a)^\top \theta}$
- The score function is then: $\nabla_\theta \log \pi_\theta(s, a) = \phi(s, a) - \mathbb{E}_{\pi_\theta}[\phi(s, \cdot)]$
- Note that we are omitting the derivation of the second term of the score function, which is a bit more involved, as it requires differentiating the normalisation factor (not shown above). A small sketch of this policy and its score follows below.
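Here is a minimal NumPy sketch of the linear softmax policy and its score function, assuming a hypothetical feature function `phi(s, a)` that returns the feature vector $\phi(s, a)$; the names are illustrative.

```python
import numpy as np

def softmax_policy(theta, phi, s, actions):
    """pi_theta(s, a) proportional to exp(phi(s, a) . theta)."""
    logits = np.array([phi(s, a) @ theta for a in actions])
    logits -= logits.max()                       # for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

def softmax_score(theta, phi, s, a, actions):
    """Score: grad_theta log pi_theta(s, a) = phi(s, a) - E_pi[phi(s, .)]."""
    probs = softmax_policy(theta, phi, s, actions)
    expected_phi = sum(p * phi(s, b) for p, b in zip(probs, actions))
    return phi(s, a) - expected_phi
```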
Gaussian Policy
In continuous action spaces, a Gaussian policy is natural
- Let the mean of the Gaussian be a linear combination of state features: $\mu(s) = \phi(s)^\top \theta$
- The variance may be fixed at $\sigma^2$ or parametrised
- The policy is Gaussian (recall that we are in a continuous action space, so the action is a real-valued vector): $a \sim \mathcal{N}(\mu(s), \sigma^2)$
- The score function is then: $\nabla_\theta \log \pi_\theta(s, a) = \frac{(a - \mu(s)) \phi(s)}{\sigma^2}$
- We can derive this score function by writing down the PDF of the Gaussian distribution for $\pi_\theta(s, a)$ and taking the derivative of its log with respect to $\theta$ (a small sketch follows below)
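And a corresponding sketch for the Gaussian policy with fixed variance, again assuming a hypothetical state feature function `phi(s)`.

```python
import numpy as np

def gaussian_action(theta, phi, s, sigma, rng):
    """Sample a ~ N(mu(s), sigma^2) with mu(s) = phi(s) . theta."""
    mu = phi(s) @ theta
    return rng.normal(mu, sigma)

def gaussian_score(theta, phi, s, a, sigma):
    """Score: grad_theta log pi_theta(s, a) = (a - mu(s)) * phi(s) / sigma^2."""
    mu = phi(s) @ theta
    return (a - mu) * phi(s) / sigma ** 2
```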
Policy Gradient Theorem: One-Step MDPs
Consider a simple class of one-step MDPs to simplify the math
- Start in a state $s \sim d(s)$
- Terminate after one time step with reward $r = \mathcal{R}_{s,a}$
- This is a sort of contextual bandit
Use likelihood ratios to compute the policy gradient
- First we pick our objective function, which is just the expected reward (averaged over our start state and the action we choose): $J(\theta) = \mathbb{E}_{\pi_\theta}[r] = \sum_{s \in \mathcal{S}} d(s) \sum_{a \in \mathcal{A}} \pi_\theta(s, a) \mathcal{R}_{s,a}$
- Then we take the derivative to do gradient ascent: $\nabla_\theta J(\theta) = \sum_{s \in \mathcal{S}} d(s) \sum_{a \in \mathcal{A}} \pi_\theta(s, a) \nabla_\theta \log \pi_\theta(s, a) \mathcal{R}_{s,a} = \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s, a) \, r]$
- Note that when taking the gradient of $\pi_\theta(s, a)$, we use the log trick to rewrite $\nabla_\theta \pi_\theta(s, a)$ as $\pi_\theta(s, a) \nabla_\theta \log \pi_\theta(s, a)$, and the result becomes an expectation again because we recover $\pi_\theta(s, a)$ outside of the gradient. This shows the power of the log trick (a small numerical check follows below).
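Below is a small numerical sanity check of the likelihood-ratio gradient on a one-step MDP with a single state (a bandit): the sampled estimate of $\mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(a) \, r]$ should agree with a finite-difference estimate of $\nabla_\theta \mathbb{E}[r]$. The reward values and parameters below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
rewards = np.array([1.0, 3.0, 0.5])          # R_a for three actions (made up)
theta = np.array([0.2, -0.1, 0.4])           # one logit parameter per action

def pi(theta):
    p = np.exp(theta - theta.max())
    return p / p.sum()

def expected_reward(theta):
    return pi(theta) @ rewards

# Likelihood-ratio (score function) estimate from samples; for a softmax over
# logits, the score of action a is one_hot(a) - pi(theta).
n = 200_000
actions = rng.choice(3, size=n, p=pi(theta))
scores = np.eye(3)[actions] - pi(theta)
lr_grad = (scores * rewards[actions, None]).mean(axis=0)

# Central finite-difference estimate of grad E[r] for comparison.
eps = 1e-5
fd_grad = np.array([
    (expected_reward(theta + eps * np.eye(3)[k])
     - expected_reward(theta - eps * np.eye(3)[k])) / (2 * eps)
    for k in range(3)
])

print(lr_grad, fd_grad)   # the two estimates should agree closely
```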
Policy Gradient Theorem
But we don't just want to handle one-step MDPs; we want to generalise to multi-step MDPs
- It turns out that we just need to replace the instantaneous reward $r$ with the long-term action value $Q^{\pi_\theta}(s, a)$ (I suppose this means we need to model $Q$ as well)
- Regardless of whether we use the (i) start state objective, (ii) average value objective or (iii) average reward objective, the policy gradient theorem holds
Theorem. Policy Gradient Theorem.
For any differentiable policy $\pi_\theta(s, a)$, and for any of the policy objective functions $J = J_1$, $J_{avR}$, or $\frac{1}{1 - \gamma} J_{avV}$, the policy gradient is: $\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s, a) \, Q^{\pi_\theta}(s, a)]$
Monte Carlo Policy Gradient (REINFORCE)
The policy gradient theorem gives rise to a simple Monte Carlo policy gradient algorithm for finding the optimal policy:
REINFORCE Algorithm.
- Initialize $\theta$ randomly
- For each episode $\{s_1, a_1, r_2, \ldots, s_{T-1}, a_{T-1}, r_T\} \sim \pi_\theta$ do:
    - For $t = 1$ to $T - 1$ do:
        - $\theta \leftarrow \theta + \alpha \nabla_\theta \log \pi_\theta(s_t, a_t) v_t$
- Return $\theta$
Note that:
- We are doing Monte Carlo, i.e. we wait until the end of the episode before we go back and update the parameters for each time step.
- We are doing SGD (using a stochastic sample of the gradient), so there is no expectation term
- We use the return $v_t$ as an unbiased sample of $Q^{\pi_\theta}(s_t, a_t)$. Recall that $v_t$ is the total discounted reward from time step $t$ until termination.
- This is the simplest and oldest policy gradient algorithm (a small sketch follows below).
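Here is a minimal sketch of REINFORCE with a linear softmax policy. The environment is assumed to expose a gym-style interface where `env.reset()` returns a state and `env.step(a)` returns `(next_state, reward, done)`, and `phi(s, a)` is a hypothetical feature function; all of these names are illustrative, not part of the lecture.

```python
import numpy as np

def reinforce(env, phi, n_actions, n_features,
              alpha=0.01, gamma=0.99, n_episodes=1000, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros(n_features)

    def probs(s):
        logits = np.array([phi(s, a) @ theta for a in range(n_actions)])
        logits -= logits.max()
        p = np.exp(logits)
        return p / p.sum()

    for _ in range(n_episodes):
        # Roll out one full episode (Monte Carlo: update only at the end).
        s, trajectory, done = env.reset(), [], False
        while not done:
            a = rng.choice(n_actions, p=probs(s))
            s_next, r, done = env.step(a)
            trajectory.append((s, a, r))
            s = s_next

        # Walk backwards to accumulate the return v_t, then take one
        # gradient step per time step: theta <- theta + alpha * score * v_t.
        g = 0.0
        for s, a, r in reversed(trajectory):
            g = r + gamma * g
            p = probs(s)
            score = phi(s, a) - sum(p[b] * phi(s, b) for b in range(n_actions))
            theta += alpha * score * g
    return theta
```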
Empirically, policy gradient methods have a nice learning curve without the jittery behaviour of value-based methods. However, Monte Carlo methods can take a very long time (millions of steps) to converge due to high variance.
Actor Critic Policy Gradient
The main problem with Monte Carlo policy gradient is the high variance of the return $v_t$: sometimes we get no reward, sometimes we get a high reward.
The idea is thus to use a critic to estimate the action-value function: $Q_w(s, a) \approx Q^{\pi_\theta}(s, a)$
The name critic refers to the value function, which simply "watches" and evaluates the value of an action, whilst the actor is the policy itself which decides how we should act.
We maintain two sets of parameters:
- Critic: updates the action-value function parameters $w$
- Actor: updates the policy parameters $\theta$, in the direction suggested by the critic
Actor-critic algorithms follow an approximate policy gradient: $\nabla_\theta J(\theta) \approx \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s, a) \, Q_w(s, a)]$, with update $\Delta\theta = \alpha \nabla_\theta \log \pi_\theta(s, a) \, Q_w(s, a)$
Notice that we just replace the true $Q^{\pi_\theta}(s, a)$ with $Q_w(s, a)$, the value function of the critic model.
Estimating the Action-Value Function of the Critic
The critic is solving a familiar problem: policy evaluation. How good is policy $\pi_\theta$ for the current parameters $\theta$? We have explored this problem previously, i.e.:
- Monte Carlo policy evaluation
- Temporal-difference learning, e.g. TD($\lambda$)
- or least squares policy evaluation
This leads us to a simple actor-critic algorithm:
- Critic: use linear value function approximation $Q_w(s, a) = \phi(s, a)^\top w$
- Update $w$ using linear TD(0)
- Actor: update $\theta$ using the policy gradient
Q Actor-Critic (QAC) Algorithm.
- Initialize $s$, $\theta$
- Sample $a \sim \pi_\theta$
- For each step do:
    - Sample reward $r = \mathcal{R}_s^a$; sample transition $s' \sim \mathcal{P}_s^a$
    - Sample action $a' \sim \pi_\theta(s', a')$
    - $\delta = r + \gamma Q_w(s', a') - Q_w(s, a)$
    - $\theta \leftarrow \theta + \alpha \nabla_\theta \log \pi_\theta(s, a) Q_w(s, a)$
    - $w \leftarrow w + \beta \delta \phi(s, a)$
    - $a \leftarrow a'$, $s \leftarrow s'$
Note that the $\delta$ and $w$ lines together are simply the TD(0) update of the critic, and the $\theta$ line is simply the policy gradient step for the actor (a Python sketch of the full loop follows below).
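A minimal sketch of the QAC loop above, with a linear critic $Q_w(s, a) = \phi(s, a)^\top w$ and a linear softmax actor. As before, `env` (gym-style `reset`/`step`) and `phi` are hypothetical stand-ins.

```python
import numpy as np

def q_actor_critic(env, phi, n_actions, n_features,
                   alpha=0.01, beta=0.1, gamma=0.99, n_steps=10000, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros(n_features)      # actor parameters
    w = np.zeros(n_features)          # critic parameters

    def probs(s):
        logits = np.array([phi(s, b) @ theta for b in range(n_actions)])
        logits -= logits.max()
        p = np.exp(logits)
        return p / p.sum()

    s = env.reset()
    a = rng.choice(n_actions, p=probs(s))
    for _ in range(n_steps):
        s_next, r, done = env.step(a)
        a_next = rng.choice(n_actions, p=probs(s_next))

        # Critic TD(0) error: delta = r + gamma * Q_w(s', a') - Q_w(s, a)
        q_sa = phi(s, a) @ w
        q_next = 0.0 if done else phi(s_next, a_next) @ w
        delta = r + gamma * q_next - q_sa

        # Actor: policy gradient step using the critic's Q estimate.
        p = probs(s)
        score = phi(s, a) - sum(p[b] * phi(s, b) for b in range(n_actions))
        theta += alpha * score * q_sa

        # Critic: linear TD(0) update, w <- w + beta * delta * phi(s, a).
        w += beta * delta * phi(s, a)

        if done:
            s = env.reset()
            a = rng.choice(n_actions, p=probs(s))
        else:
            s, a = s_next, a_next
    return theta, w
```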
Bias in Actor Critic Algorithms
The problem is that approximating the policy gradient (replacing $Q^{\pi_\theta}$ with $Q_w$) can introduce bias
- A biased policy gradient may not find the right solution
- Luckily, if we choose the value function approximation carefully, we can avoid introducing any bias
- i.e. we can still follow the exact policy gradient (see Compatible Function Approximation theorem)
Reducing Variance using a Baseline
Here we move on to tricks to improve the training algorithm. The baseline method is the best known trick.
The main idea is to subtract a baseline function $B(s)$ from the policy gradient, so that we can reduce variance without changing the expectation.
The reason this works is that, since $B(s)$ does not depend on the action $a$, its expectation when plugged into the policy gradient is $0$, like so: $\mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s, a) B(s)] = \sum_s d^{\pi_\theta}(s) \sum_a \nabla_\theta \pi_\theta(s, a) B(s) = \sum_s d^{\pi_\theta}(s) B(s) \nabla_\theta \sum_a \pi_\theta(s, a) = 0$
Note that in the second step, since $\pi_\theta(s, a)$ is a probability distribution over actions, the right-most sum $\sum_a \pi_\theta(s, a)$ equals $1$, which is a constant. Hence its gradient is $0$, and the whole expectation resolves to $0$.
A good and popular choice of baseline is the state value function $B(s) = V^{\pi_\theta}(s)$.
- So we can rewrite the policy gradient using the advantage function $A^{\pi_\theta}(s, a)$: $A^{\pi_\theta}(s, a) = Q^{\pi_\theta}(s, a) - V^{\pi_\theta}(s)$, and $\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s, a) \, A^{\pi_\theta}(s, a)]$
- Intuitively, the advantage function tells us the additional benefit of an action over the baseline value of being in that state
Estimating the Advantage Function
The advantage function can significantly reduce the variance of the policy gradient. The naive way is to estimate both $V^{\pi_\theta}(s)$ and $Q^{\pi_\theta}(s, a)$ separately with two sets of parameters, i.e. $V_v(s) \approx V^{\pi_\theta}(s)$, $Q_w(s, a) \approx Q^{\pi_\theta}(s, a)$, and $A(s, a) = Q_w(s, a) - V_v(s)$.
Then, we use TD methods to update both value functions. However, this is not efficient in terms of parameters.
The better way is to observe that the TD error is an unbiased estimate of the advantage function. Therefore, we can plug in the TD error in our policy gradient update instead.
- Recall that the TD error is $\delta^{\pi_\theta} = r + \gamma V^{\pi_\theta}(s') - V^{\pi_\theta}(s)$
- And this is an unbiased estimate of the advantage function: $\mathbb{E}_{\pi_\theta}[\delta^{\pi_\theta} \mid s, a] = \mathbb{E}_{\pi_\theta}[r + \gamma V^{\pi_\theta}(s') \mid s, a] - V^{\pi_\theta}(s) = Q^{\pi_\theta}(s, a) - V^{\pi_\theta}(s) = A^{\pi_\theta}(s, a)$
- Note that we have not made any approximations above. We just showed that the expectation of the TD(0) error when following policy $\pi_\theta$ is exactly the advantage function. Interestingly, computing the TD error does not require estimating the $Q$ function, only the $V$ function. This gives us a simpler update
- We can thus use the TD error to compute the policy gradient: $\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s, a) \, \delta^{\pi_\theta}]$
- In practice, since we do not have the true value function $V^{\pi_\theta}(s)$, we use an approximate TD error with an estimated value function $V_v(s)$: $\delta_v = r + \gamma V_v(s') - V_v(s)$
- This approach only requires one set of critic parameters $v$ (a small sketch follows after this list)
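Here is a minimal sketch of a single TD-error actor-critic update, with a linear state-value critic $V_v(s) = \phi(s)^\top v$; `phi_s` and `score` are hypothetical stand-ins for the state feature function and the policy score function (e.g. the softmax score above).

```python
import numpy as np

def td_actor_critic_step(theta, v, phi_s, score, s, a, r, s_next, done,
                         alpha=0.01, beta=0.1, gamma=0.99):
    """One online update of actor (theta) and critic (v) from a single transition."""
    v_s = phi_s(s) @ v
    v_next = 0.0 if done else phi_s(s_next) @ v

    # TD error: a sampled estimate of the advantage A(s, a).
    delta = r + gamma * v_next - v_s

    theta = theta + alpha * score(s, a) * delta   # actor: approximate policy gradient
    v = v + beta * delta * phi_s(s)               # critic: TD(0) update of V_v
    return theta, v
```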
Improving the Actor Updates
Recall that we can estimate the value function from targets at different time scales to trade-off bias and variance:
- For MC, the target is the return $v_t$
- For TD(0), the target is the TD target $r + \gamma V(s_{t+1})$
- For forward-view TD($\lambda$), the target is the lambda return $v_t^\lambda$
- For backward-view TD($\lambda$), we use eligibility traces
Similarly, we can estimate the policy gradient for the actor at many time scales. The main quantity we want to estimate is the advantage weighted by the score function, $\mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s, a) \, A^{\pi_\theta}(s, a)]$
- MC policy gradient uses the error from the complete return: $\Delta\theta = \alpha (v_t - V_v(s_t)) \nabla_\theta \log \pi_\theta(s_t, a_t)$
- Actor-critic policy gradient uses the one-step TD error: $\Delta\theta = \alpha (r + \gamma V_v(s_{t+1}) - V_v(s_t)) \nabla_\theta \log \pi_\theta(s_t, a_t)$
- Forward-view TD($\lambda$) uses the lambda-return target: $\Delta\theta = \alpha (v_t^\lambda - V_v(s_t)) \nabla_\theta \log \pi_\theta(s_t, a_t)$
- Backward-view TD($\lambda$) uses eligibility traces. Note that the eligibility update now uses the score function $\nabla_\theta \log \pi_\theta(s_t, a_t)$ instead of $\phi(s_t)$, which was the feature vector: $e_t = \gamma \lambda e_{t-1} + \nabla_\theta \log \pi_\theta(s_t, a_t)$ and $\Delta\theta = \alpha \delta e_t$
- This update can be performed online, unlike MC policy gradient (a small sketch follows after this list)
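A minimal sketch of the backward-view TD($\lambda$) actor update, keeping an eligibility trace over score vectors rather than feature vectors; `score_sa` is the score $\nabla_\theta \log \pi_\theta(s_t, a_t)$ and `delta` is the TD error computed by the critic (names illustrative).

```python
import numpy as np

def td_lambda_actor_step(theta, e, score_sa, delta,
                         alpha=0.01, gamma=0.99, lam=0.9):
    """Online actor update: decay the trace, add the current score, step along delta * e."""
    e = gamma * lam * e + score_sa
    theta = theta + alpha * delta * e
    return theta, e
```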
Natural Policy Gradient
The natural policy gradient is a parametrisation-independent approach. It finds the ascent direction that is closest to the vanilla gradient when changing the policy by a small, fixed amount: $\nabla_\theta^{nat} \pi_\theta(s, a) = G_\theta^{-1} \nabla_\theta J(\theta)$
- Where $G_\theta$ is the Fisher information matrix: $G_\theta = \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s, a) \nabla_\theta \log \pi_\theta(s, a)^\top]$
- Notice that this obviates the need for a critic, as $G_\theta$ is based on the actor itself (a small sketch follows below)
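A minimal sketch of one natural gradient step: estimate the Fisher information matrix $G_\theta$ from a batch of sampled score vectors and precondition the vanilla gradient with its inverse. The regularisation term is an illustrative practical detail, not part of the lecture.

```python
import numpy as np

def natural_gradient_step(theta, scores, vanilla_grad, alpha=0.01, reg=1e-3):
    """scores: list of sampled score vectors grad log pi_theta(s, a)."""
    # G_theta = E[score score^T], estimated from the batch.
    G = np.mean([np.outer(sc, sc) for sc in scores], axis=0)
    G += reg * np.eye(len(theta))                 # regularise for invertibility
    nat_grad = np.linalg.solve(G, vanilla_grad)   # G^{-1} grad J(theta)
    return theta + alpha * nat_grad
```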
Summary
The policy gradient has many equivalent forms:
- $\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s, a) \, v_t]$ (REINFORCE)
- $\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s, a) \, Q_w(s, a)]$ (Q Actor-Critic)
- $\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s, a) \, A_w(s, a)]$ (Advantage Actor-Critic)
- $\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s, a) \, \delta]$ (TD Actor-Critic)
- $\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s, a) \, \delta e]$ (TD($\lambda$) Actor-Critic)
Each formulation leads to a different SGD algorithm. We can learn the critic using policy evaluation (e.g. MC or TD learning) to estimate $Q^{\pi_\theta}(s, a)$ or $V^{\pi_\theta}(s)$.