Lecture 7: Policy Gradient

This lecture looks at methods that update the policy directly, instead of working with value or action-value functions.

  • In the last lecture we approximated the value or action-value function using parameters $\theta$: $V_\theta(s) \approx V^\pi(s)$, $Q_\theta(s, a) \approx Q^\pi(s, a)$
  • We generated a policy directly from the value function, e.g. using an $\epsilon$-greedy algorithm
  • In this lecture, we will directly parametrise the policy: $\pi_\theta(s, a) = \mathbb{P}[a \mid s, \theta]$ (a minimal sketch follows this list)
  • Again, we will focus on model-free reinforcement learning
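
As a concrete illustration, here is a minimal sketch of a linearly parametrised softmax policy over a discrete action set; the feature function, the parameter vector theta and all sizes are assumptions made for this example, not something given in the lecture.

    import numpy as np

    def features(state, action, n_actions=3, n_features=4):
        """Hypothetical state-action feature vector x(s, a)."""
        x = np.zeros(n_actions * n_features)
        x[action * n_features:(action + 1) * n_features] = state
        return x

    def softmax_policy(theta, state, n_actions=3):
        """pi_theta(s, a) proportional to exp(theta . x(s, a))."""
        prefs = np.array([theta @ features(state, a, n_actions) for a in range(n_actions)])
        prefs -= prefs.max()                        # subtract max for numerical stability
        probs = np.exp(prefs) / np.exp(prefs).sum()
        return probs

    # Sample an action from the stochastic policy for a random state.
    rng = np.random.default_rng(0)
    theta = rng.normal(size=3 * 4)
    state = rng.normal(size=4)
    action = rng.choice(3, p=softmax_policy(theta, state))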

Value-based vs Policy-based RL

  • Value Based
    • Learn the value function
    • Policy is implicit, derived from the value function (e.g. $\epsilon$-greedy); the sketch after this list contrasts this with an explicit policy
  • Policy Based
    • No value function
    • Learn policy directly
  • Actor-Critic
    • Learn a value function
    • Also learn a policy
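
To make the contrast concrete, the sketch below picks an action both ways for a single state: $\epsilon$-greedy over a learned Q-function (policy implicit) versus sampling from an explicitly parametrised policy. The Q-values and action preferences are random placeholders, purely for illustration.

    import numpy as np

    rng = np.random.default_rng(1)
    n_actions = 3

    # Value-based: policy is implicit, read off a learned Q-function via epsilon-greedy.
    Q = rng.normal(size=n_actions)            # placeholder for learned Q(s, .) at one state
    epsilon = 0.1
    if rng.random() < epsilon:
        action_value_based = rng.integers(n_actions)   # explore
    else:
        action_value_based = int(np.argmax(Q))         # exploit: greedy w.r.t. Q

    # Policy-based: policy is explicit, sample directly from pi_theta(s, .).
    prefs = rng.normal(size=n_actions)        # placeholder action preferences h_theta(s, .)
    pi = np.exp(prefs - prefs.max())
    pi /= pi.sum()
    action_policy_based = rng.choice(n_actions, p=pi)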

Advantages of Policy-based RL

  • Better convergence properties than value-based methods
  • Effective in high-dimensional or continuous action spaces
    • We do not need to compute a max over Q-values across actions
    • If the action space is continuous, the maximization is not straightforward at all (see the Gaussian policy sketch after this list)
  • Can learn stochastic policies
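
For a continuous action space there is no max over Q-values to take; the policy itself can instead be a parametrised distribution that we sample from. A minimal sketch, assuming a Gaussian policy with a linear mean and a fixed standard deviation (the parameters and state features are invented for the example):

    import numpy as np

    rng = np.random.default_rng(2)

    def gaussian_policy_sample(theta, state, sigma=0.5):
        """Sample a continuous action a ~ N(mu, sigma^2) with mean mu = theta . state."""
        mu = theta @ state
        return rng.normal(mu, sigma)

    theta = rng.normal(size=4)    # assumed policy parameters
    state = rng.normal(size=4)    # assumed state features
    action = gaussian_policy_sample(theta, state)   # no argmax over actions required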

Disadvantages of Policy-based RL

  • Typically converges to a local rather than global optimum
  • Evaluating a policy is typically inefficient and has high variance (illustrated in the sketch after this list)
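
The variance point can be seen even in a toy setting. The sketch below (purely illustrative; the one-step bandit rewards are invented) estimates a fixed stochastic policy's value from small batches of Monte Carlo rollouts and shows how widely the estimates scatter around the true value of 0.5.

    import numpy as np

    rng = np.random.default_rng(3)
    pi = np.array([0.5, 0.5])             # fixed stochastic policy over two actions (assumed)
    reward_means = np.array([1.0, 0.0])   # hypothetical expected reward of each action

    def mc_value_estimate(n_episodes):
        """Monte Carlo estimate of the policy's value from n_episodes one-step rollouts."""
        actions = rng.choice(2, size=n_episodes, p=pi)
        rewards = rng.normal(reward_means[actions], 1.0)   # noisy observed returns
        return rewards.mean()

    print([round(mc_value_estimate(10), 2) for _ in range(5)])   # estimates vary a lot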

Why might we want to have a stochastic policy?

  • e.g. rock-paper-scissors
  • A deterministic policy is easily exploited by the opponent
  • A uniform random policy is optimal (a small numerical check follows)
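
A quick numerical check of the rock-paper-scissors point (an illustration, not part of the lecture): against a best-responding opponent any deterministic policy loses every round, while the uniform random policy has expected payoff 0 no matter what the opponent plays.

    import numpy as np

    # Payoff matrix for the row player; rows/columns are (rock, paper, scissors),
    # entries are +1 for a win, 0 for a draw, -1 for a loss.
    A = np.array([[ 0, -1,  1],
                  [ 1,  0, -1],
                  [-1,  1,  0]])

    # Deterministic policy (always rock) vs the opponent's best response (paper).
    deterministic = np.array([1.0, 0.0, 0.0])
    best_response = np.eye(3)[np.argmin(deterministic @ A)]   # opponent exploits us
    print(deterministic @ A @ best_response)                  # -1.0: we lose every round

    # Uniform random policy: expected payoff is 0 against every pure opponent strategy.
    uniform = np.ones(3) / 3
    for opponent in np.eye(3):
        print(uniform @ A @ opponent)                         # 0.0 each time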