Quick Notes on Reinforcement Learning


In this series of blog posts, I intend to write my notes as I go through Richard S. Sutton’s excellent Reinforcement Learning: An Introduction (1).

I will try to formalise the maths behind it a little bit, mainly because I would like to use it as a useful personal reference to the main concepts in RL. I will probably add a few remarks about a possible implementation as I go on.

Relationship between agent and environment

Context and assumptions

The goal of reinforcement learning is to select the best actions availables to an agent as it goes through a series of states in an environment. In this post, we will only consider discrete time steps.

The most important hypothesis we make is the Markov property:

At each time step, the next state of the agent depends only on the current state and the current action taken. It cannot depend on the history of the states visited by the agent.

This property is essential to make our problems tractable, and often holds true in practice (to a reasonable approximation).

With this assumption, we can define the relationship between agent and environment as a Markov Decision Process (MDP).

A Markov Decision Process is a tuple \((\mathcal{S}, \mathcal{A}, \mathcal{R}, p)\) where:

  • \(\mathcal{S}\) is a set of states,

  • \(\mathcal{A}\) is an application mapping each state \(s \in \mathcal{S}\) to a set \(\mathcal{A}(s)\) of possible actions for this state. In this post, we will often simplify by using \(\mathcal{A}\) as a set, assuming that all actions are possible for each state,

  • \(\mathcal{R} \subset \mathbb{R}\) is a set of rewards,

  • and \(p\) is a function representing the dynamics of the MDP:

    \[\begin{align} p &: \mathcal{S} \times \mathcal{R} \times \mathcal{S} \times \mathcal{A} \mapsto [0,1] \\ p(s', r \;|\; s, a) &:= \mathbb{P}(S_t=s', R_t=r \;|\; S_{t-1}=s, A_{t-1}=a), \end{align} \]

    such that \[ \forall s \in \mathcal{S}, \forall a \in \mathcal{A},\quad \sum_{s', r} p(s', r \;|\; s, a) = 1. \]

The function \(p\) represents the probability of transitioning to the state \(s'\) and getting a reward \(r\) when the agent is at state \(s\) and chooses action \(a\).

We will also use occasionally the state-transition probabilities:

\[\begin{align} p &: \mathcal{S} \times \mathcal{S} \times \mathcal{A} \mapsto [0,1] \\ p(s' \;|\; s, a) &:= \mathbb{P}(S_t=s' \;|\; S_{t-1}=s, A_{t-1}=a) \\ &= \sum_r p(s', r \;|\; s, a). \end{align} \]

Rewarding the agent

The expected reward of a state-action pair is the function

\[\begin{align} r &: \mathcal{S} \times \mathcal{A} \mapsto \mathbb{R} \\ r(s,a) &:= \mathbb{E}[R_t \;|\; S_{t-1}=s, A_{t-1}=a] \\ &= \sum_r r \sum_{s'} p(s', r \;|\; s, a). \end{align} \]

The discounted return is the sum of all future rewards, with a multiplicative factor to give more weights to more immediate rewards: \[ G_t := \sum_{k=t+1}^T \gamma^{k-t-1} R_k, \] where \(T\) can be infinite or \(\gamma\) can be 1, but not both.

Deciding what to do: policies

Defining our policy and its value

A policy is a way for the agent to choose the next action to perform.

A policy is a function \(\pi\) defined as

\[\begin{align} \pi &: \mathcal{A} \times \mathcal{S} \mapsto [0,1] \\ \pi(a \;|\; s) &:= \mathbb{P}(A_t=a \;|\; S_t=s). \end{align} \]

In order to compare policies, we need to associate values to them.

The state-value function of a policy \(\pi\) is

\[\begin{align} v_{\pi} &: \mathcal{S} \mapsto \mathbb{R} \\ v_{\pi}(s) &:= \text{expected return when starting in $s$ and following $\pi$} \\ v_{\pi}(s) &:= \mathbb{E}_{\pi}\left[ G_t \;|\; S_t=s\right] \\ v_{\pi}(s) &= \mathbb{E}_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \;|\; S_t=s\right] \end{align} \]

We can also compute the value starting from a state \(s\) by also taking into account the action taken \(a\).

The action-value function of a policy \(\pi\) is

\[\begin{align} q_{\pi} &: \mathcal{S} \times \mathcal{A} \mapsto \mathbb{R} \\ q_{\pi}(s,a) &:= \text{expected return when starting from $s$, taking action $a$, and following $\pi$} \\ q_{\pi}(s,a) &:= \mathbb{E}_{\pi}\left[ G_t \;|\; S_t=s, A_t=a \right] \\ q_{\pi}(s,a) &= \mathbb{E}_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \;|\; S_t=s, A_t=a\right] \end{align} \]

The quest for the optimal policy


  1. R. S. Sutton and A. G. Barto, Reinforcement learning: an introduction, Second edition. Cambridge, MA: The MIT Press, 2018.