Policy Gradient
We look at methods that update the policy directly, instead of working with value or action-value functions.
- In the last lecture we approximated the value or action-value function using parameters $\theta$: $V_\theta(s) \approx V^\pi(s)$ and $Q_\theta(s, a) \approx Q^\pi(s, a)$
- We generated a policy directly from the value function, e.g. using the $\epsilon$-greedy algorithm
- In this lecture, we will directly parametrise the policy: $\pi_\theta(s, a) = \mathbb{P}[a \mid s, \theta]$ (a minimal sketch follows this list)
- Again, we will focus on model-free reinforcement learning
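As a concrete illustration of a directly parametrised policy, here is a minimal sketch of a softmax (Gibbs) policy over a small discrete action set; the feature function `phi` and the parameter shapes are assumptions chosen purely for illustration.

```python
import numpy as np

# Minimal sketch of a directly parametrised policy pi_theta(a | s):
# a softmax (Gibbs) policy over discrete actions with linear features.
# The feature map phi and the shapes below are illustrative assumptions.

def phi(state, action, n_actions):
    """One-hot action indicator scaled by a scalar state (illustrative features)."""
    features = np.zeros(n_actions)
    features[action] = state
    return features

def softmax_policy(theta, state, n_actions):
    """Return pi_theta(a | s) for every action a."""
    prefs = np.array([theta @ phi(state, a, n_actions) for a in range(n_actions)])
    prefs -= prefs.max()                 # subtract max for numerical stability
    probs = np.exp(prefs)
    return probs / probs.sum()

theta = np.zeros(3)                      # policy parameters (3 actions here)
print(softmax_policy(theta, state=1.0, n_actions=3))  # uniform before any learning
```

With $\theta = 0$ all action preferences are equal, so the policy starts uniform; learning then adjusts $\theta$ to shift probability towards better actions.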
Value-based vs Policy-based RL
- Value Based
- Learn the value function
- Policy is implicit from the value function (e.g. $\epsilon$-greedy)
- Policy Based
- No value function
- Learn the policy directly (contrasted with value-based action selection in the sketch after this list)
- Actor-Critic
- Learn a value function
- Also learn a policy
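The distinction between the first two approaches shows up already in how an action is selected. The sketch below is a hypothetical contrast: value-based selection derives a policy implicitly from Q-values via $\epsilon$-greedy, while policy-based selection samples from an explicitly parametrised distribution. The Q-values and probabilities shown are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Value-based: the policy is implicit, derived from learned Q-values (epsilon-greedy).
def epsilon_greedy_action(q_values, epsilon=0.1):
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))   # explore: random action
    return int(np.argmax(q_values))               # exploit: greedy action

# Policy-based: the policy is explicit; its action probabilities are what we learn.
def sample_from_policy(action_probs):
    return int(rng.choice(len(action_probs), p=action_probs))

q = np.array([0.2, 1.5, -0.3])   # hypothetical Q(s, .) for a single state
pi = np.array([0.1, 0.7, 0.2])   # hypothetical pi_theta(. | s) for the same state
print(epsilon_greedy_action(q), sample_from_policy(pi))
```

An actor-critic method keeps both pieces: it learns a value function (the critic) and uses it to improve an explicitly parametrised policy (the actor).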
Advantages of Policy-based RL
- Better convergence properties than value-based methods
- Effective in high-dimensional or continuous action spaces
- We do not need to compute a max over Q-values to pick an action
- If the action space is continuous, this maximization is not straightforward at all (see the sketch after this list)
- Can learn stochastic policies
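To make the continuous-action point concrete, here is a minimal sketch (under assumed features and parameter shapes) of a Gaussian policy whose mean is linear in the state features; sampling an action requires no maximization over Q-values.

```python
import numpy as np

rng = np.random.default_rng(0)

# With continuous actions, deriving a policy from Q(s, a) means solving
# argmax_a Q(s, a) at every step, itself a hard optimization problem.
# A parametrised stochastic policy sidesteps this: sample directly from
# a Gaussian whose mean is linear in assumed state features phi(s).

def gaussian_policy_action(theta, state_features, sigma=0.5):
    """Sample a continuous action a ~ N(theta . phi(s), sigma^2)."""
    mean = float(theta @ state_features)
    return rng.normal(mean, sigma)

theta = np.array([0.3, -0.1])                  # policy parameters (illustrative)
phi_s = np.array([1.0, 2.0])                   # hypothetical features phi(s)
print(gaussian_policy_action(theta, phi_s))    # no max over Q-values required
```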
Disadvantages of Policy-based RL
- Typically converges to a local rather than global optimum
- Evaluating a policy is typically inefficient and suffers from high variance
Why might we want to have a stochastic policy?
- e.g. rock-paper-scissors
- A deterministic policy is easily exploited by the opponent
- A uniform random policy is optimal (illustrated in the sketch below)
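A small worked example of this claim: the sketch below computes a policy's expected payoff against an opponent who best-responds to it. Any deterministic policy loses every round against its best response, while the uniform random policy cannot be exploited.

```python
import numpy as np

# Payoff matrix for the row player in rock-paper-scissors: +1 win, 0 draw, -1 loss.
# Rows are our action, columns the opponent's action, order (rock, paper, scissors).
PAYOFF = np.array([[ 0, -1,  1],
                   [ 1,  0, -1],
                   [-1,  1,  0]])

def worst_case_value(policy):
    """Expected payoff of `policy` against an opponent who best-responds to it."""
    expected_payoff_per_opponent_action = policy @ PAYOFF
    return expected_payoff_per_opponent_action.min()

always_rock = np.array([1.0, 0.0, 0.0])   # deterministic policy
uniform = np.ones(3) / 3                  # uniform random policy

print(worst_case_value(always_rock))      # -1.0: exploited on every round
print(worst_case_value(uniform))          #  0.0: cannot be exploited
```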