# A Brief Review of Reinforcement Learning

11 Nov 2017

Reinforcement Learning is a mathematical framework for experience-driven autonomous learning. An **RL agent** interacts with its **environment** and, upon observing the consequences of its **actions**, can learn to alter its own behaviour in response to the **rewards** received. The goal of the agent is to learn a **policy** that maximizes the expected **return** (cumulative, discounted reward).

## Background

At each time step $t$, the agent receives a state $s_t$ from the state space $\mathcal{S}$ and selects an action $a_t$ from the action space $\mathcal{A}$, following a policy $\pi(a_t|s_t)$. The policy is the agent's behavior: a mapping from a state $s_t$ to an action $a_t$. Upon taking the **action $a_t$**, the agent receives a scalar **reward $r_t$** and transitions to the **next state $s_{t+1}$**.

The transition to the next state as well as the reward are governed by the environment dynamics (or a model). The **state transition probability $p(s_{t+1}|s_t, a_t)$** is used to describe the environment dynamics.

The **return $R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$** is the discounted, accumulated reward with the discount factor $\gamma \in (0, 1]$. The agent aims to maximize the expectation of such a long-term return from each state.

A value function is a prediction of the expected, accumulated, discounted future reward. In other words, a value function is the expected **return**. A value function measures how good each state, or state-action pair, is. The state value $v_\pi(s) = \mathbb{E}[R_t \mid s_t = s]$ is the expected return for following policy $\pi$ from state $s$.

$v_\pi(s)$ decomposes into the Bellman equation:

$$v_\pi(s) = \sum_a \pi(a|s) \sum_{s', r} p(s', r \mid s, a)\,\big[r + \gamma v_\pi(s')\big]$$

We are assuming a stochastic policy in the above equation, which is why we sum over all actions. $p(s', r \mid s, a)$ is the state transition probability for the given environment, and $r + \gamma v_\pi(s')$ is the one-step return for taking action $a$ from state $s$.

The optimal state value $v_*(s) = \max_\pi v_\pi(s)$ is the maximum state value achievable by any policy for state $s$.

The action value $q_\pi(s, a) = \mathbb{E}[R_t \mid s_t = s, a_t = a]$ is the expected return for selecting action $a$ in state $s$ and then following policy $\pi$.

The optimal action value function $q_*(s, a) = \max_\pi q_\pi(s, a)$ is the maximum action value achievable by any policy for state $s$ and action $a$.

## Value Function Algorithms

The goal of these algorithms is to come up with an action value function $Q(s, a)$ or a state value function $V(s)$. An action value function can be converted into an **$\epsilon$-greedy policy**: act greedily with respect to $Q$ with probability $1 - \epsilon$, and uniformly at random with probability $\epsilon$.

A state value function can be converted into a policy by greedily selecting, at each state $s$, the action whose successor state has the highest value $V(s')$ (this requires a model of the transitions).
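As a minimal sketch of the action-value case, an $\epsilon$-greedy policy can be derived from a tabular $Q$ like this (the Q table, state names, and $\epsilon$ value here are purely illustrative):

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """With probability epsilon explore uniformly; otherwise act greedily on Q."""
    if random.random() < epsilon:
        return random.choice(actions)
    # exploit: pick the action with the highest Q value for this state
    return max(actions, key=lambda a: Q[(state, a)])

# Hypothetical Q table over one state and two actions
Q = {("s0", "left"): 0.2, ("s0", "right"): 0.8}
action = epsilon_greedy(Q, "s0", ["left", "right"], epsilon=0.0)
# -> "right" (fully greedy, since epsilon = 0)
```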

### TD Learning

TD learning learns the value function directly from experience using TD errors. TD learning is a prediction problem. The update rule is $V(s_t) \leftarrow V(s_t) + \alpha\,[r_{t+1} + \gamma V(s_{t+1}) - V(s_t)]$, where $\alpha$ is the **learning rate** and $r_{t+1} + \gamma V(s_{t+1}) - V(s_t)$ is the **TD error**. More precisely, this is called TD(0) learning, where 0 indicates the one-step return.

**Input**: the policy $\pi$ to be evaluated

**Output**: value function $V$

```
Initialize V arbitrarily
for each episode do
    initialize state s
    for each step of episode, while s is not terminal do
        a <- action given by policy pi for s
        take action a, observe r, s'
        V(s) <- V(s) + alpha * [r + gamma * V(s') - V(s)]
        s <- s'
```
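As a concrete sketch, TD(0) can evaluate a uniformly random policy on the classic random-walk chain (a hypothetical toy task: states in a row, episodes end off either side, reward 1 only for exiting on the right; all names and parameters below are illustrative):

```python
import random

def td0_random_walk(n_states=5, episodes=500, alpha=0.1, gamma=1.0, seed=0):
    """TD(0) evaluation of a uniformly random policy on a chain of states
    0..n_states-1; the only reward is 1 for stepping off the right end."""
    rng = random.Random(seed)
    V = [0.0] * n_states                     # initialize V arbitrarily
    for _ in range(episodes):
        s = n_states // 2                    # start in the middle
        while 0 <= s < n_states:
            s2 = s + rng.choice([-1, 1])     # the random policy's action
            r = 1.0 if s2 == n_states else 0.0
            v_next = V[s2] if 0 <= s2 < n_states else 0.0  # V(terminal) = 0
            V[s] += alpha * (r + gamma * v_next - V[s])    # TD(0) update
            s = s2                           # s <- s'
    return V

values = td0_random_walk()
# the true values for this chain are (i + 1) / 6 for states i = 0..4
```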

### SARSA

SARSA (state, action, reward, state, action) is an **on-policy** algorithm to find the optimal policy. The same policy is being used for both state space exploration and the value function update.

**Output**: action value function $Q$

```
Initialize Q arbitrarily
for each episode do
    initialize state s
    a <- action for s derived from Q, e.g. ϵ-greedy
    for each step of episode, while s is not terminal do
        take action a, observe r, s'
        a' <- action for s' derived from Q, e.g. ϵ-greedy
        Q(s,a) <- Q(s,a) + alpha * [r + gamma * Q(s',a') - Q(s,a)]
        s <- s'; a <- a'
```
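A runnable sketch of SARSA on a hypothetical toy corridor (states 0..n-1, actions -1/+1, reward 1 for stepping into the rightmost state; the environment and parameters are illustrative, not from the original post):

```python
import random

def sarsa_corridor(n=5, episodes=300, alpha=0.5, gamma=0.9, eps=0.1, seed=0):
    """SARSA on a corridor: states 0..n-1, actions -1/+1, wall on the left,
    reward 1 for reaching the rightmost (terminal) state."""
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in range(n) for a in (-1, 1)}

    def policy(s):                           # ϵ-greedy with respect to Q
        if rng.random() < eps:
            return rng.choice((-1, 1))
        return max((-1, 1), key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s, a = 0, policy(0)
        while True:
            s2 = min(max(s + a, 0), n - 1)   # clamp to the corridor
            done = s2 == n - 1
            r = 1.0 if done else 0.0
            a2 = policy(s2)                  # next action from the SAME policy
            target = r if done else r + gamma * Q[(s2, a2)]
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            if done:
                break
            s, a = s2, a2                    # on-policy: commit to a'
    return Q

Q = sarsa_corridor()
```

After training, the greedy action along the corridor should be +1 (toward the reward).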

### Q-Learning

Q-learning is an **off-policy** algorithm to find the optimal policy. It is off-policy because the update rule bootstraps from the greedy action that maximizes the Q value, which differs from the behavior policy used to collect experience (e.g. ϵ-greedy).

**Output**: action value function $Q$

```
Initialize Q arbitrarily
for each episode do
    initialize state s
    for each step of episode, while s is not terminal do
        a <- action for s derived from Q, e.g. ϵ-greedy
        take action a, observe r, s'
        a' <- argmax_b Q(s', b)
        Q(s,a) <- Q(s,a) + alpha * [r + gamma * Q(s',a') - Q(s,a)]
        s <- s'
```
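The same hypothetical corridor makes the off-policy difference concrete: the behavior is still ϵ-greedy, but the update bootstraps from the greedy max (environment and parameters are illustrative):

```python
import random

def q_learning_corridor(n=5, episodes=300, alpha=0.5, gamma=0.9, eps=0.1, seed=0):
    """Q-learning on a corridor: states 0..n-1, actions -1/+1,
    reward 1 for reaching the rightmost (terminal) state."""
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in range(n) for a in (-1, 1)}
    for _ in range(episodes):
        s = 0
        while s != n - 1:
            # behavior policy: ϵ-greedy
            if rng.random() < eps:
                a = rng.choice((-1, 1))
            else:
                a = max((-1, 1), key=lambda x: Q[(s, x)])
            s2 = min(max(s + a, 0), n - 1)
            r = 1.0 if s2 == n - 1 else 0.0
            # target policy: greedy max over next actions, regardless of
            # what the behavior policy will actually take next
            best_next = 0.0 if s2 == n - 1 else max(Q[(s2, -1)], Q[(s2, 1)])
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s2
    return Q

Q = q_learning_corridor()
# optimal values here are gamma^k for k steps to the goal, e.g. Q[(0, 1)] ≈ 0.9^3
```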

## Policy Gradient Algorithms

Policy gradient algorithms directly optimize the function we care about: the policy. Value function based algorithms, on the other hand, arrive at the final policy indirectly, through learned value estimates.

Policy gradient algorithms are naturally online (on-policy) methods, while value-based algorithms are better suited to off-policy training. Off-policy refers to using different policies for exploration (i.e. how you collect your data) and for the update (i.e. the value function update); off-policy methods can potentially do a better job of exploring the state space.

While value function based algorithms are **more sample-efficient**, since off-policy training lets them reuse collected experience, policy gradient algorithms are generally **more stable** to train.

Pieter Abbeel has an excellent video lecture explaining the theoretical background behind policy gradient methods.

The goal is to maximize the utility function $U(\theta) = \mathbb{E}\big[\sum_{t=0}^{H} R(s_t, a_t); \pi_\theta\big] = \sum_\tau P(\tau; \theta) R(\tau)$: the return of each trajectory $\tau$ weighted by the probability of its occurrence. The final gradient estimate turns out to be:

$$\nabla_\theta U(\theta) \approx \frac{1}{m} \sum_{i=1}^{m} \nabla_\theta \log P(\tau^{(i)}; \theta)\, R(\tau^{(i)}) = \frac{1}{m} \sum_{i=1}^{m} \left( \sum_{t=0}^{H} \nabla_\theta \log \pi_\theta(a_t^{(i)} \mid s_t^{(i)}) \right) R(\tau^{(i)})$$
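This likelihood-ratio estimator can be sanity-checked numerically on a hypothetical one-step problem where the true gradient is available in closed form (the problem, names, and parameters below are all illustrative):

```python
import math
import random

def pg_gradient_estimate(theta=0.5, m=200000, seed=0):
    """Likelihood-ratio (score function) gradient estimate for a one-step
    'bandit': pi(a=1) = sigmoid(theta), reward 1 for a=1 and 0 for a=0.
    The utility is U(theta) = sigmoid(theta), so the true gradient is
    sigmoid(theta) * (1 - sigmoid(theta))."""
    rng = random.Random(seed)
    p = 1.0 / (1.0 + math.exp(-theta))        # pi(a=1)
    total = 0.0
    for _ in range(m):
        a = 1 if rng.random() < p else 0      # sample a trajectory (one action)
        r = float(a)                          # return of the trajectory
        grad_log = (1 - p) if a == 1 else -p  # d/dtheta log pi(a)
        total += grad_log * r                 # score-function term
    return total / m

est = pg_gradient_estimate()
p = 1.0 / (1.0 + math.exp(-0.5))
true_grad = p * (1 - p)                       # analytic gradient of U(theta)
```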

### REINFORCE

The vanilla REINFORCE algorithm is the simplest policy gradient algorithm out there. However, REINFORCE with baseline is a more popular variant: adding a baseline reduces the variance of the gradient estimate during training. The following pseudocode uses the value function as a baseline.

**Input**: a differentiable policy $\pi(a|s, \theta)$ and a value function $v(s, w)$ to be used as the baseline

**Parameters**: step sizes $\alpha > 0$ and $\beta > 0$

**Output**: policy $\pi(a|s, \theta)$

```
loop forever do
    generate an episode s[0], a[0], r[1], ..., s[T-1], a[T-1], r[T], following π(.|., θ)
    for each step t of episode, t = 0, ..., T-1 do
        Gt <- return from step t
        delta <- Gt - v(s[t], w)
        w <- w + beta * delta * ∇_w v(s[t], w)
        θ <- θ + alpha * gamma^t * delta * ∇_θ log π(a[t]|s[t], θ)
```
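A dependency-free sketch of the same idea on a hypothetical two-armed bandit (one-step episodes, so $G_t$ is just the reward, and the baseline collapses to a single learned scalar; arm payoffs and step sizes are illustrative):

```python
import math
import random

def reinforce_bandit(episodes=2000, alpha=0.1, beta=0.1, seed=0):
    """REINFORCE with a learned baseline on a 2-armed bandit:
    arm 1 pays 1.0, arm 0 pays 0.2; pi(a=1) = sigmoid(theta)."""
    rng = random.Random(seed)
    theta, w = 0.0, 0.0                       # policy and baseline parameters
    for _ in range(episodes):
        p = 1.0 / (1.0 + math.exp(-theta))    # pi(a=1)
        a = 1 if rng.random() < p else 0
        G = 1.0 if a == 1 else 0.2            # the return G_t (one step)
        delta = G - w                         # delta <- G_t - v(s, w)
        w += beta * delta                     # baseline update (∇_w v = 1 here)
        grad_log = (1 - p) if a == 1 else -p  # ∇_θ log π(a|θ)
        theta += alpha * delta * grad_log     # policy gradient step
    return theta

theta = reinforce_bandit()
# the learned policy should strongly prefer arm 1 (theta well above 0)
```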

The following GIF shows the REINFORCE policy learned on the CartPole-v0 environment of OpenAI gym.