Lecture 2: Markov Decision Processes

Markov Decision Processes formally describe an environment for Reinforcement Learning.

  • The environment is fully observable, i.e. the current state fully characterizes the process
  • Almost all RL problems can be characterized as an MDP
  • Even continuous problems, e.g. those in optimal control, can be framed this way
  • Partially observable cases can be formulated as MDPs
  • Bandits are MDPs with one state

The Markov Property is central to MDPs. "The future is independent of the past given the present."
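
Formally, this is usually stated as: a state $S_t$ is Markov if and only if

$$\mathbb{P}[S_{t+1} \mid S_t] = \mathbb{P}[S_{t+1} \mid S_1, \ldots, S_t]$$

i.e. the current state is a sufficient statistic of the history.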

State Transition Matrix. For a Markov state $s$ and successor state $s'$, the state transition probability is defined as:

$$\mathcal{P}_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s]$$

The state transition matrix $\mathcal{P}$ defines transition probabilities from all states $s$ to all successor states $s'$.

Each row of the transition matrix sums to $1$.
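
Written out for $n$ states (one row per current state, one column per successor state), this is:

$$\mathcal{P} = \begin{bmatrix} \mathcal{P}_{11} & \cdots & \mathcal{P}_{1n} \\ \vdots & \ddots & \vdots \\ \mathcal{P}_{n1} & \cdots & \mathcal{P}_{nn} \end{bmatrix}$$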

Markov Process. A Markov Process is a memoryless random process, i.e. a sequence of random states $S_1, S_2, \ldots$ with the Markov Property.

Definition. A Markov Process (or Markov Chain) is a tuple $\langle \mathcal{S}, \mathcal{P} \rangle$, where:
(i) $\mathcal{S}$ is a (finite) set of states
(ii) $\mathcal{P}$ is a state transition probability matrix
(iii) $\mathcal{P}_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s]$

Example of a Markov Process. A student can transition from Class 1 to Class 2 to Class 3, then to Pass, Sleep, or the Pub, based on transition probabilities. We can sample episodes from the Markov chain, e.g. one episode may be C1 C2 C3 Pass Sleep.

The transition probability matrix may look something like the sketch below. Note that Sleep is the terminal state, so its self-transition probability is 1.0.
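
As a concrete illustration, here is a minimal NumPy sketch of such a transition matrix and of sampling episodes from it. The specific probabilities are assumed values in the spirit of the student example, not quoted from the lecture.

```python
import numpy as np

# States of the student Markov chain (Sleep is terminal).
states = ["C1", "C2", "C3", "Pass", "Pub", "FB", "Sleep"]

# Assumed transition probabilities; each row is a distribution over successors.
#                C1   C2   C3   Pass Pub  FB   Sleep
P = np.array([
    [0.0, 0.5, 0.0, 0.0, 0.0, 0.5, 0.0],   # C1
    [0.0, 0.0, 0.8, 0.0, 0.0, 0.0, 0.2],   # C2
    [0.0, 0.0, 0.0, 0.6, 0.4, 0.0, 0.0],   # C3
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],   # Pass
    [0.2, 0.4, 0.4, 0.0, 0.0, 0.0, 0.0],   # Pub
    [0.1, 0.0, 0.0, 0.0, 0.0, 0.9, 0.0],   # FB
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],   # Sleep (terminal)
])
assert np.allclose(P.sum(axis=1), 1.0)  # each row sums to 1

def sample_episode(start="C1", seed=0):
    """Sample one episode by following the transition probabilities until Sleep."""
    rng = np.random.default_rng(seed)
    s = states.index(start)
    episode = [states[s]]
    while states[s] != "Sleep":
        s = rng.choice(len(states), p=P[s])
        episode.append(states[s])
    return episode

print(sample_episode())  # one random walk through the chain, ending in Sleep
```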

A Markov reward process is a Markov chain with values.

Definition. A Markov reward process is a tuple $\langle \mathcal{S}, \mathcal{P}, \mathcal{R}, \gamma \rangle$:

  • $\mathcal{S}$ is a finite set of states
  • $\mathcal{P}$ is a state transition probability matrix, $\mathcal{P}_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s]$
  • $\mathcal{R}$ is a reward function, $\mathcal{R}_s = \mathbb{E}[R_{t+1} \mid S_t = s]$
  • $\gamma$ is a discount factor, $\gamma \in [0, 1]$

Definition. The return $G_t$ is the total discounted reward from time step $t$:

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \ldots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$

There is no expectation because $G_t$ is one sample run of the Markov reward process. We'll take the expectation later to get the expected return over many sample runs.

  • Note that the discount factor $\gamma$ is the present value of future rewards. $\gamma = 0$ implies maximally short-sighted evaluation and $\gamma = 1$ implies maximally far-sighted evaluation.
  • The value of receiving reward $R$ after $k + 1$ time steps is $\gamma^k R$
  • This setup values immediate reward above delayed reward

Most Markov reward and decision processes are discounted. Why?

  • We do not have a perfect model, so expected future rewards are more uncertain; hence we put higher weight on immediate rewards
  • It avoids infinite returns in cyclic Markov processes
  • If the reward is financial, immediate rewards earn more interest than delayed rewards
  • Animal/human behaviour shows a preference for immediate rewards
  • It is sometimes possible to use undiscounted Markov reward processes ($\gamma = 1$), e.g. if all sequences terminate

The value function $v(s)$ gives the long-term value of state $s$.

Definition. The state value function $v(s)$ of an MRP is the expected return starting from state $s$:

$$v(s) = \mathbb{E}[G_t \mid S_t = s]$$

How do we compute the state value function? One way is to sample returns from the MRP, e.g. starting from $S_1 = C1$ with $\gamma = \frac{1}{2}$:

  • C1 C2 C3 Pass Sleep: $G_1 = -2.25$
  • C1 FB FB C1 C2 Sleep: $G_1 = -3.125$
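
A short check of those numbers in code, assuming per-state rewards for the student MRP (C1, C2, C3 = -2, Pub = +1, FB = -1, Pass = +10, Sleep = 0); the reward values are an assumption here since they are not spelled out above.

```python
# Discounted return G_t = R_{t+1} + gamma*R_{t+2} + ... for one sampled episode.
def discounted_return(rewards, gamma):
    return sum(gamma**k * r for k, r in enumerate(rewards))

# Assumed rewards for the student MRP (not stated explicitly above).
R = {"C1": -2, "C2": -2, "C3": -2, "Pass": 10, "Pub": 1, "FB": -1, "Sleep": 0}
gamma = 0.5

ep1 = ["C1", "C2", "C3", "Pass", "Sleep"]
ep2 = ["C1", "FB", "FB", "C1", "C2", "Sleep"]

# Rewards collected for each non-terminal state visited along the episode.
print(discounted_return([R[s] for s in ep1[:-1]], gamma))  # -2.25
print(discounted_return([R[s] for s in ep2[:-1]], gamma))  # -3.125
```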

Consider if we set $\gamma = 0$. Then the value function is $v(s) = \mathcal{R}_s$, i.e. the value is just the immediate reward.

Now the important Bellman equation for MRPs:

$$\begin{aligned}
v(s) &= \mathbb{E}[G_t \mid S_t = s] \\
&= \mathbb{E}[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \ldots \mid S_t = s] \\
&= \mathbb{E}[R_{t+1} + \gamma (R_{t+2} + \gamma R_{t+3} + \ldots) \mid S_t = s] \\
&= \mathbb{E}[R_{t+1} + \gamma G_{t+1} \mid S_t = s] \\
&= \mathbb{E}[R_{t+1} + \gamma v(S_{t+1}) \mid S_t = s]
\end{aligned}$$

It essentially tells us that the value function can be decomposed into two parts:

  • Immediate reward $R_{t+1}$
  • Discounted value of the successor state $\gamma v(S_{t+1})$

  • Note that in the last line, the argument inside $v(\cdot)$ is the random variable $S_{t+1}$, to express the fact that the state at time $t+1$ is random.
  • Note that both $G_{t+1}$ and $v(S_{t+1})$ are random variables, expressing the return and the value at whatever state we happen to be in at time step $t+1$.
  • $\mathbb{E}[G_{t+1} \mid S_t = s]$ becomes $\mathbb{E}[v(S_{t+1}) \mid S_t = s]$ due to the law of iterated expectations: $\mathbb{E}[G_{t+1} \mid S_t = s] = \mathbb{E}[\,\mathbb{E}[G_{t+1} \mid S_{t+1}]\, \mid S_t = s]$, and the inner expectation is exactly $v(S_{t+1})$.

To dig into the Bellman equation a bit more, use a one-step lookahead search. We start at state $s$, look ahead one step, and average over the probabilities of where we can end up at the next time step. Hence we get:

$$v(s) = \mathcal{R}_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'} v(s')$$

We can use the Bellman equation to verify whether a value function is correct: the value at each state should equal the immediate reward plus the discounted, probability-weighted values of all possible successor states.

The Bellman equation can be expressed concisely using matrices:

$$v = \mathcal{R} + \gamma \mathcal{P} v$$

where $v$ and $\mathcal{R}$ are column vectors with one entry per state. The Bellman equation is a linear equation and can be solved directly using matrix inversion:

$$v = (I - \gamma \mathcal{P})^{-1} \mathcal{R}$$
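
A minimal NumPy sketch of the direct solve, reusing the transition matrix P and state list from the sampling sketch above, with the same assumed rewards as a vector and an arbitrary gamma of 0.9:

```python
import numpy as np

# Assumed per-state rewards, in the same state order as P above.
R = np.array([-2.0, -2.0, -2.0, 10.0, 1.0, -1.0, 0.0])
gamma = 0.9

# v = (I - gamma * P)^(-1) R -- solve the linear system rather than forming
# the inverse explicitly, which is cheaper and numerically more stable.
v = np.linalg.solve(np.eye(len(R)) - gamma * P, R)
print(dict(zip(states, np.round(v, 2))))
```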

The complexity of the matrix inversion is $O(n^3)$ for $n$ states, which is not feasible for a large number of states. There are many iterative methods which are more efficient:

  • Dynamic programming
  • Monte Carlo evaluation
  • Temporal Difference learning

Markov Decision Process

So far we have only built up the building blocks; the MDP is what we really use. A Markov Decision Process (MDP) is a Markov reward process with decisions (actions). It is an environment in which all states are Markov.

Definition. A Markov Decision Process is a tuple $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$:

  • $\mathcal{S}$ is a finite set of states
  • $\mathcal{A}$ is a finite set of actions
  • $\mathcal{P}$ is a state transition probability matrix with $\mathcal{P}^a_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s, A_t = a]$
  • $\mathcal{R}$ is a reward function, $\mathcal{R}^a_s = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$
  • $\gamma$ is a discount factor, $\gamma \in [0, 1]$

Note that the transition probabilities and the reward function now also depend on an action, which gives us some agency: we can choose actions to influence the rewards and values.

Definition. A policy $\pi$ is a distribution over actions given states:

$$\pi(a \mid s) = \mathbb{P}[A_t = a \mid S_t = s]$$

A policy fully defines the behaviour of an agent. Some properties of a policy:

  • It only depends on the current state (not the history)
  • The policy does not depend on the time step (i.e. stationary)
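
As a concrete (hypothetical) illustration, a stochastic policy over a tiny state and action set can be represented as a mapping from state to a distribution over actions:

```python
import random

# Hypothetical example: pi(a|s) stored as a nested dict of probabilities.
pi = {
    "s0": {"study": 0.7, "relax": 0.3},
    "s1": {"study": 0.4, "relax": 0.6},
}

def sample_action(state, rng=random.Random(0)):
    """Sample an action from pi(. | state)."""
    actions, probs = zip(*pi[state].items())
    return rng.choices(actions, weights=probs, k=1)[0]

print(sample_action("s0"))  # 'study' with probability 0.7, 'relax' with 0.3
```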

Restricting ourselves to such stationary, memoryless policies loses nothing: because of the Markov property, the current state captures all the information needed to make the optimal decision.

Definition. The state-value function $v_\pi(s)$ of an MDP is the expected return starting from state $s$, and then following policy $\pi$:

$$v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$$

Definition. The action-value function $q_\pi(s, a)$ of an MDP is the expected return starting from state $s$, taking action $a$, and then following policy $\pi$:

$$q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$$

Bellman Expectation Equation. The state-value function can again be decomposed into immediate reward plus the discounted value of the successor state:

$$v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s]$$

Similarly, we can do so for the action-value function by inserting the chosen action:

$$q_\pi(s, a) = \mathbb{E}_\pi[R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a]$$

From a given state $s$, we have a value attached to that state, i.e. $v_\pi(s)$. From this state, there are some possible actions to take, and the policy $\pi$ determines the probability distribution over which action is taken. With each action comes an action-value $q_\pi(s, a)$. Hence we have:

$$v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \, q_\pi(s, a)$$

Another way to look at it: we start having already chosen a particular action. Having chosen the action, the environment determines the state we end up in (based on the transition probability matrix $\mathcal{P}$). Hence we have:

$$q_\pi(s, a) = \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} v_\pi(s')$$

Now we can stitch these two perspectives together. Starting from a particular state, we write $v_\pi$ in terms of $q_\pi$, then write $q_\pi$ in terms of $v_\pi$ again. This gives a recursive relationship of $v_\pi$ in terms of itself and allows us to solve the equation.

The Bellman expectation equation for $v_\pi$ is thus:

$$v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \left( \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} v_\pi(s') \right)$$

The math is expressing a simple idea: the value at a particular state is the weighted sum of the values of all possible actions we could take under the current policy $\pi$. The value of each action is in turn determined by the reward function $\mathcal{R}$ and the transition probabilities $\mathcal{P}$ that determine which state we end up in after taking that action.

Similarly, we can do the same by starting at an action instead of a state. The Bellman expectation equation for $q_\pi$ is thus:

$$q_\pi(s, a) = \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} \sum_{a' \in \mathcal{A}} \pi(a' \mid s') \, q_\pi(s', a')$$
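
To make these equations concrete, here is a minimal policy-evaluation sketch for a small made-up MDP (all state/action names, probabilities, and rewards are assumptions for illustration): it repeatedly applies the Bellman expectation backup for $v_\pi$, then recovers $q_\pi$ from the converged $v_\pi$.

```python
# Hypothetical MDP: P[s][a] = [(prob, next_state), ...], R[s][a] = expected reward.
P = {
    "s0": {"study": [(0.9, "s1"), (0.1, "s0")], "relax": [(1.0, "s0")]},
    "s1": {"study": [(1.0, "s1")], "relax": [(0.5, "s0"), (0.5, "s1")]},
}
R = {
    "s0": {"study": -2.0, "relax": 1.0},
    "s1": {"study": 10.0, "relax": 0.0},
}
pi = {  # a fixed stochastic policy pi(a|s)
    "s0": {"study": 0.5, "relax": 0.5},
    "s1": {"study": 0.5, "relax": 0.5},
}
gamma = 0.9

# Iterative policy evaluation:
#   v(s) <- sum_a pi(a|s) [ R(s,a) + gamma * sum_s' P(s'|s,a) v(s') ]
v = {s: 0.0 for s in P}
for _ in range(1000):
    v = {
        s: sum(
            pi[s][a] * (R[s][a] + gamma * sum(p * v[s2] for p, s2 in P[s][a]))
            for a in P[s]
        )
        for s in P
    }

# Recover q_pi from v_pi: q(s,a) = R(s,a) + gamma * sum_s' P(s'|s,a) v(s').
q = {
    s: {a: R[s][a] + gamma * sum(p * v[s2] for p, s2 in P[s][a]) for a in P[s]}
    for s in P
}
print(v, q)
```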

Optimal Value Function

So far we have been defining the dynamic process of the MDP, but have not tried solving the optimization problem. We will turn to this now.

Definition. The optimal state-value function $v_*(s)$ is the maximum value function over all policies:

$$v_*(s) = \max_\pi v_\pi(s)$$

The optimal action-value function $q_*(s, a)$ is the maximum action-value function over all policies:

$$q_*(s, a) = \max_\pi q_\pi(s, a)$$

The MDP problem is solved once we find the optimal value function. We thus need algorithms to systematically find $v_*$ and $q_*$.

Define a partial ordering over policies:

$$\pi \geq \pi' \iff v_\pi(s) \geq v_{\pi'}(s), \ \forall s$$

Theorem. For any Markov Decision Process:

  • There exists an optimal policy $\pi_*$ that is better than or equal to all other policies, i.e. $\pi_* \geq \pi, \ \forall \pi$
  • All optimal policies achieve the optimal value function, i.e. $v_{\pi_*}(s) = v_*(s)$
  • All optimal policies achieve the optimal action-value function, i.e. $q_{\pi_*}(s, a) = q_*(s, a)$

How do we find the optimal policy? An optimal policy can be found trivially by maximizing over $q_*(s, a)$, if we knew it: we always pick the action with the highest action value. Hence if we have $q_*$, we immediately have $\pi_*$:

$$\pi_*(a \mid s) = \begin{cases} 1 & \text{if } a = \arg\max_{a \in \mathcal{A}} q_*(s, a) \\ 0 & \text{otherwise} \end{cases}$$
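
In code, with $q_*$ stored as a nested dict (hypothetical names and values), greedy extraction is one line per state:

```python
# q_star[s][a] holds the optimal action values (hypothetical example values).
q_star = {"s0": {"study": 7.4, "relax": 0.0}, "s1": {"study": 10.0, "relax": 1.9}}

# Deterministic optimal policy: pick the argmax action in each state.
pi_star = {s: max(actions, key=actions.get) for s, actions in q_star.items()}
print(pi_star)  # {'s0': 'study', 's1': 'study'}
```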

Intuitively, we find the optimal policy by starting at the end (the terminal Sleep state in the student example) and iteratively looking backward. This is the same intuition behind the Bellman optimality equations.

The optimal value of being in a state is the value of the best action we can take in that state:

$$v_*(s) = \max_a q_*(s, a)$$

Note that we use $q_*$ instead of a generic $q_\pi$ because we are choosing from the optimal action-value function.

The optimal value of an action is the immediate reward plus the weighted sum of optimal values of the states we can end up in after taking that action. Note that in this step we do not get to choose anything: the transition probabilities determine which state we end up in after taking action $a$:

$$q_*(s, a) = \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} v_*(s')$$

Finally, stitching these two equations together, we get the Bellman optimality equation for $v_*$:

$$v_*(s) = \max_a \left( \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} v_*(s') \right)$$

How do we solve the Bellman optimality equation? It is now non-linear due to the max operator, so we cannot solve it with matrix inversion as before. There is no closed-form solution in general, but there are many iterative solution methods (a small value-iteration sketch follows the list):

  • Value iteration
  • Policy iteration
  • Q-learning
  • Sarsa
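
Here is a minimal value-iteration sketch for a generic finite MDP, assuming the dynamics are given as dictionaries P[s][a] (a list of (probability, next state) pairs) and R[s][a]; the tiny MDP used is a made-up example, not one from the lecture.

```python
# Value iteration: repeatedly apply the Bellman optimality backup
#   v(s) <- max_a [ R(s,a) + gamma * sum_s' P(s'|s,a) v(s') ]
# until the value function stops changing.

# Hypothetical 2-state MDP: P[s][a] = [(prob, next_state), ...], R[s][a] = reward.
P = {
    "s0": {"stay": [(1.0, "s0")], "go": [(0.8, "s1"), (0.2, "s0")]},
    "s1": {"stay": [(1.0, "s1")], "go": [(1.0, "s0")]},
}
R = {
    "s0": {"stay": 0.0, "go": -1.0},
    "s1": {"stay": 2.0, "go": 0.0},
}
gamma = 0.9

v = {s: 0.0 for s in P}                      # initialise v(s) = 0
while True:
    delta = 0.0
    for s in P:
        backup = max(
            R[s][a] + gamma * sum(p * v[s2] for p, s2 in P[s][a])
            for a in P[s]
        )
        delta = max(delta, abs(backup - v[s]))
        v[s] = backup
    if delta < 1e-8:                         # converged
        break

# Greedy policy extraction from the converged optimal values.
pi = {
    s: max(P[s], key=lambda a: R[s][a] + gamma * sum(p * v[s2] for p, s2 in P[s][a]))
    for s in P
}
print(v, pi)
```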

Intuition. The core idea behind the Bellman equations is to break down a complex sequential decision problem into a series of simpler, recursive steps. Imagine we are at a particular point in time and in a particular state. The Bellman equations tell us that if we can assume that we will act optimally for all future steps after this action, then the problem of finding the best current action becomes trivial: we simply choose the action that yields the highest expected value (based on assuming future optimality).

To actually start unravelling the equations and solving them, we start from the termination point of a process (where the assumption of future optimality trivially holds) and work backwards.