Reinforcement learning: Markov Decision Process

In the previous blog, we covered the basic terminology used in reinforcement learning. Now we are going to look at the basic mathematics and rules behind reinforcement learning, i.e. the Markov Decision Process (MDP).

Markov Decision Processes (MDPs) are mathematical frameworks for modeling decision-making problems in which an agent takes actions to maximize a reward signal. This is where MDPs connect with reinforcement learning, because in reinforcement learning we also want to maximize the reward. In this blog post, we’ll take a closer look at what MDPs are, how they are constructed, and how they can be solved. But before moving on to MDPs, we need to look at their fundamentals, the Markov Property and the Markov Chain, on which MDPs are built.

Markov Property

The Markov property is a fundamental concept in Markov Decision Processes (MDPs). It states that the future is independent of the past given the present. In other words, the future state of a system depends only on the current state and the actions taken, and not on any previous states or actions.

Formally, the Markov property can be expressed as follows:

For any state s and any time step t, the probability distribution over the state at time t+1, given the entire history of states and actions up to time t, is equal to the probability distribution over the state at time t+1 given only the state at time t.
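In symbols, if S_t denotes the state at time t, this is often written as:

P[ S_{t+1} | S_t ] = P[ S_{t+1} | S_1, S_2, ..., S_t ]

That is, conditioning on the full history gives the same distribution over the next state as conditioning on the current state alone.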

This property makes MDPs well-suited for modeling decision-making problems where the future is uncertain, but the uncertainty can be reduced by taking action and observing the results.

The Markov property is a key requirement for MDPs because it allows us to model the decision-making process in a way that is computationally tractable. By assuming the Markov property, we can simplify the problem of finding an optimal policy by considering only the current state and the immediate rewards and transitions, rather than the entire history of the system. This lets us use algorithms like value iteration and policy iteration to solve the MDP efficiently. Now let’s take a look at the Markov Chain.

Markov Chain

A Markov Chain is a mathematical model for a sequence of events in which the probability of each event depends only on the state of the previous event. It is a type of Markov process, named after the Russian mathematician Andrey Markov.

A Markov Chain is defined as a tuple (S, P), where:

  • S is a finite set of states, representing the possible configurations of the system.
  • P is a transition probability matrix, which maps each state to a distribution over the next states.

The transition probabilities in a Markov Chain specify the probabilities of transitioning from one state to another state in a single time step. For example, consider a simple Markov Chain in which there are two states, A and B. The transition probability matrix might look like this:

          A    B
P =  A [ 0.8  0.2 ]
     B [ 0.1  0.9 ]

This transition probability matrix says that if the system is in state A, there is an 80% chance that it will remain in state A and a 20% chance that it will transition to state B. If the system is in state B, there is a 10% chance that it will transition to state A and a 90% chance that it will remain in state B.
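To make this concrete, here is a minimal Python sketch that samples a trajectory from this two-state chain (the use of numpy, the 0/1 indexing of states A and B, and the helper name simulate are just illustrative choices, not part of any particular library):

import numpy as np

# Transition probability matrix for the two-state chain above.
# Row 0 corresponds to state A, row 1 to state B; each row sums to 1.
P = np.array([
    [0.8, 0.2],   # from A: 80% stay in A, 20% move to B
    [0.1, 0.9],   # from B: 10% move to A, 90% stay in B
])

states = ["A", "B"]
rng = np.random.default_rng(seed=0)

def simulate(start=0, steps=10):
    # The next state is sampled using only the current state (Markov property).
    s = start
    trajectory = [states[s]]
    for _ in range(steps):
        s = rng.choice(len(states), p=P[s])
        trajectory.append(states[s])
    return trajectory

print(simulate())   # e.g. ['A', 'A', 'A', 'B', 'B', ...]

Notice that the simulation never needs to remember earlier states: the row P[s] for the current state is all the information required to pick the next one.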

Markov Decision Process

A reinforcement learning task that satisfies the Markov property is called a Markov decision process, or MDP. If the state and action spaces are finite, then it is called a finite Markov decision process (finite MDP). Finite MDPs are particularly important to the theory of reinforcement learning.

A Markov Decision Process is a tuple (S, A, P, R, γ), where:

  • S is a finite set of states, representing the possible configurations of the system.
  • A is a finite set of actions, representing the possible decisions that the agent can make.
  • P is a transition probability function, which maps each state and action to a distribution over the next states.
  • R is a reward function, which maps each state and action to a real-valued reward.
  • γ is a discount factor, which determines the relative importance of immediate rewards versus future rewards.

The goal of an MDP is to find a policy, which is a function that maps each state to an action, such that the expected cumulative reward of the policy is maximized. In other words, the agent wants to find a way to behave that maximizes the total reward that it will receive over time.
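With the discount factor γ, this cumulative reward (usually called the return) starting from time step t is commonly written as:

G_t = R_{t+1} + γ R_{t+2} + γ^2 R_{t+3} + ... = Σ_{k=0}^{∞} γ^k R_{t+k+1}

and the agent’s goal is to find a policy π that maximizes the expected value of G_t from every state.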

Let’s consider a simple example of an MDP: a grid world. The grid world is a rectangular grid of cells, some of which are obstacles, and some of which are open. The agent starts in the upper-left cell and moves one step at a time in any of the four cardinal directions (up, down, left, or right). The goal of the agent is to reach the lower-right cell, avoiding obstacles along the way.

We can define the state space S as the set of cells the agent can occupy. The layout of the grid is fixed, so the state is simply the agent’s current position, which gives at most rows × columns possible states (fewer if some cells are blocked).

The action space A is simply the set of four cardinal directions. The transition probability function P maps each state and action to a distribution over the next states. In this grid world, the transitions are deterministic, so P(s’ | s, a) is either 0 or 1, depending on whether taking action a in state s leads to state s’.

The reward function R maps each state and action to a real-valued reward. In the grid world, the agent receives a reward of -1 for each step it takes, except for reaching the lower-right cell, which has a reward of +100.

The discount factor γ determines the relative importance of immediate rewards versus future rewards. In the grid world, γ is typically set to 1, which means that the agent values all rewards equally, regardless of when they are received.
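To tie the pieces of the tuple (S, A, P, R, γ) together, here is a rough Python sketch of how such a grid world could be encoded. The -1 step reward, the +100 goal reward, and γ = 1 follow the description above; the specific grid layout, the rule that blocked moves leave the agent in place, and all names are illustrative assumptions:

# A rough encoding of the grid world MDP described above.
# '.' marks an open cell and '#' an obstacle (this particular layout is made up).
GRID = [
    "....",
    ".#..",
    "..#.",
    "....",
]
ROWS, COLS = len(GRID), len(GRID[0])

# S: the agent's possible positions; A: the four cardinal moves.
STATES = [(r, c) for r in range(ROWS) for c in range(COLS) if GRID[r][c] == "."]
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
START = (0, 0)                 # upper-left cell
GOAL = (ROWS - 1, COLS - 1)    # lower-right cell
GAMMA = 1.0                    # undiscounted, as described above

def step(state, action):
    # Deterministic dynamics: P(s' | s, a) is 1 for exactly one s' and 0 otherwise.
    dr, dc = ACTIONS[action]
    next_state = (state[0] + dr, state[1] + dc)
    # Assumption: moving off the grid or into an obstacle leaves the agent in place.
    if next_state not in STATES:
        next_state = state
    # R: -1 per step, +100 for reaching the goal.
    reward = 100 if next_state == GOAL else -1
    return next_state, reward

print(step(START, "right"))   # ((0, 1), -1)

A solver such as value iteration or policy iteration would only need this step function (or the equivalent P and R tables) plus GAMMA to compute an optimal policy.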

Example: The Grid World

The Markov Decision Process (MDP) and the Markov Chain are both linked by the Markov Property, which states that the future is independent of the past given the present.

An example of an MDP is a simple problem of an agent navigating in a grid world. The state of the system is represented by the position of the agent in the grid, and the actions represent the moves the agent can make (e.g. up, down, left, right). The transition model defines the probabilities of transitioning from one state to another given an action, and the reward function assigns a numerical reward to each state-action pair.

The states in this MDP form a Markov Chain because the future state of the agent depends only on the current state and the action taken, and not on any previous states or actions. The Markov Property guarantees exactly this, and it is precisely what the MDP models.

For example, consider the following grid world:

+---+---+---+
| A | B | C |
+---+---+---+
| D | E | F |
+---+---+---+

Let’s assume that the agent starts in state A and has two actions available: move right or move down. The transition model defines the probabilities of moving from one state to another given an action; for example, if the agent moves right from state A, it transitions to state B with probability 1. The reward function assigns a numerical reward to each state-action pair; for example, the reward for moving from state A to state B is +10, and the reward for moving from state A to state D is +5.
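Written out as plain Python dictionaries, the parts of this example that are specified above might look like this (only the transitions and rewards mentioned in the text are filled in; the remaining entries would be defined analogously):

# Deterministic transition model for the transitions mentioned above:
# P[(state, action)] gives the next state (reached with probability 1).
P = {
    ("A", "right"): "B",
    ("A", "down"): "D",
}

# Reward function for the same state-action pairs.
R = {
    ("A", "right"): 10,   # moving from A to B
    ("A", "down"): 5,     # moving from A to D
}

state, action = "A", "right"
print(P[(state, action)], R[(state, action)])   # B 10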

In this small example, too, the states form a Markov Chain: the next state of the agent depends only on the current state and the action taken, not on how the agent arrived there, which is exactly the Markov Property that the MDP relies on.

In summary, the Markov Decision Process (MDP) is a mathematical framework for modeling decision-making problems where an agent interacts with an environment, while the Markov Chain is a mathematical model for a sequence of events in which the probability of each event depends only on the state of the previous event. The MDP and the Markov Chain are linked by the Markov Property, which states that the future is independent of the past given the present.

In the next blog, we will discuss value iteration and policy iteration, the algorithms that actually solve MDP problems.
