A Brief Review of Reinforcement Learning

Reinforcement Learning is a mathematical framework for experience-driven autonomous learning. An RL agent interacts with its environment and, upon observing the consequences of its actions, can learn to alter its own behaviour in response to the rewards received. The goal of the agent is to learn a policy that maximizes the expected return (cumulative, discounted reward).

Background

At each time step $t$, the agent receives a state $s_t$ from the state space $\mathcal{S}$ and selects an action $a_t$ from the action space $\mathcal{A}$, following a policy $\pi(a_t \mid s_t)$. The policy is the agent's behaviour: a mapping from a state $s_t$ to an action $a_t$. Upon taking the action $a_t$, the agent receives a scalar reward $r_{t+1}$ and transitions to the next state $s_{t+1}$.

The transition to the next state, as well as the reward, is governed by the environment dynamics (or a model). The state transition probability $\mathcal{P}(s_{t+1} \mid s_t, a_t)$ describes the environment dynamics.

The return $R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$ is the discounted, accumulated reward, with the discount factor $\gamma \in (0, 1]$. The agent aims to maximize the expectation of this long-term return from each state.

A value function is a prediction of the expected, accumulated, discounted, future reward. In other words, a value function is the expected return. A value function measures how good each state, or state-action pair, is. The state value $v_\pi(s) = \mathbb{E}[R_t \mid s_t = s]$ is the expected return for following policy $\pi$ from state $s$.

$v_\pi(s)$ decomposes into the Bellman equation:

$$v_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s'} \mathcal{P}(s' \mid s, a) \left[ r(s, a, s') + \gamma\, v_\pi(s') \right]$$

We are assuming a stochastic policy $\pi(a \mid s)$ in the above equation, which is why we sum over all actions. $\mathcal{P}(s' \mid s, a)$ is the state transition probability for the given environment, and $r(s, a, s') + \gamma\, v_\pi(s')$ is the return for taking action $a$ from state $s$ and then following $\pi$.

The optimal state value $v_*(s) = \max_{\pi} v_\pi(s)$ is the maximum state value achievable by any policy for state $s$.

The action value $q_\pi(s, a)$ is the expected return for selecting action $a$ in state $s$ and then following policy $\pi$.

The optimal action value function $q_*(s, a) = \max_{\pi} q_\pi(s, a)$ is the maximum action value achievable by any policy for state $s$ and action $a$.
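
For reference, the optimal value functions are related to each other and satisfy the Bellman optimality equations, written here in the same notation as above:

$$v_*(s) = \max_a q_*(s, a), \qquad q_*(s, a) = \sum_{s'} \mathcal{P}(s' \mid s, a) \left[ r(s, a, s') + \gamma \max_{a'} q_*(s', a') \right]$$

Q-learning, covered below, can be seen as a sample-based, iterative way of solving the second equation.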


Value Function Algorithms

The goal of these algorithms is to come up with an action value function $q(s, a)$ or a state value function $v(s)$. An action value function can be converted into an $\epsilon$-greedy policy: with probability $1 - \epsilon$, select the action with the highest $q(s, a)$; with probability $\epsilon$, select a random action.

A state value function $v(s)$ can be converted into a policy by greedily selecting, in each state, the action that leads to the successor state $s'$ with the highest value $v(s')$, which requires a model of the environment dynamics.
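
As a minimal sketch of the $\epsilon$-greedy conversion, assuming a tabular action value function stored as a NumPy array Q of shape (num_states, num_actions) (the names here are illustrative):

import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy_action(Q, s, epsilon=0.1):
    # With probability epsilon, explore: pick a uniformly random action.
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    # Otherwise, exploit: pick the action with the highest estimated value q(s, a).
    return int(np.argmax(Q[s]))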

TD Learning

TD learning learns the value function from experience using TD errors. TD learning is a prediction problem. The update rule for TD learning is $V(s_t) \leftarrow V(s_t) + \alpha \left[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \right]$, where $\alpha$ is the learning rate and $\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)$ is the TD error. More precisely, this is called TD(0) learning, where 0 indicates a one-step return.

Input: the policy π to be evaluated
Output: value function V

Initialize V arbitrarily
for each episode do
    initialize state s
    for each step of episode, while state s is not terminal do
        a <- action given by policy π for s
        take action a, observe r, s'
        V(s) <- V(s) + alpha * [r + gamma * V(s') - V(s)]
        s <- s'
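
A minimal tabular TD(0) sketch in Python, assuming a gym-style environment whose reset() returns a discrete state and whose step(a) returns (next_state, reward, done, info), and a policy given as a plain function from states to actions (this interface is an assumption, not prescribed by the text):

from collections import defaultdict

def td0_evaluation(env, policy, num_episodes=1000, alpha=0.1, gamma=0.99):
    V = defaultdict(float)              # value estimates, initialized to 0
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)               # action given by the policy being evaluated
            s_next, r, done, _ = env.step(a)
            # TD error: r + gamma * V(s') - V(s); no bootstrapping past a terminal state.
            td_error = r + gamma * V[s_next] * (not done) - V[s]
            V[s] += alpha * td_error
            s = s_next
    return V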

SARSA

SARSA (state, action, reward, state, action) is an on-policy algorithm for finding the optimal policy. The same policy is used both for exploring the state space and for updating the value function.

Output: action value function Q

Initialize Q arbitrarily
for each episode do
    initialize state s
    a <- action for s derived from Q, e.g. ϵ-greedy
    for each step of episode, while state s is not terminal do
        take action a, observe r, s'
        a' <- action for s' derived from Q, e.g. ϵ-greedy
        Q(s,a) <- Q(s,a) + alpha * [r + gamma * Q(s',a') - Q(s,a)]
        s <- s'; a <- a'
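
The corresponding tabular update as a small Python function (a sketch; Q is assumed to be a NumPy array indexed as Q[state, action], and the next action a_next must come from the same ϵ-greedy behaviour policy, which is what makes SARSA on-policy):

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    # On-policy target: bootstrap with the action a_next actually selected
    # by the behaviour policy in s_next.
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])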

Q-Learning

Q-Learning is an off-policy algorithm for finding the optimal policy. It is off-policy because the update rule greedily uses the action that maximizes the Q value, which differs from the (ϵ-greedy) behaviour policy used to select actions and explore.

Output: action value function Q

Initialize Q arbitrarily
for each episode do
    initialize state s
    for each step of episode, while state s is not terminal do
        a <- action for s derived from Q, e.g. ϵ-greedy
        take action a, observe r, s'
        a' <- argmax_b Q(s',b)
        Q(s,a) <- Q(s,a) + alpha * [r + gamma * Q(s',a') - Q(s,a)]
        s <- s'
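
The tabular Q-learning update differs from SARSA in a single line: it bootstraps with the greedy action in s_next, regardless of which action the behaviour policy will actually take next (same illustrative Q-array convention as above):

import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    # Off-policy target: bootstrap with the best action in s_next,
    # even though the ϵ-greedy exploration policy may pick a different one.
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])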

Policy Gradient Algorithms

Policy gradient algorithms directly optimize the function (the policy) we care about. Value function based algorithms, on the other hand, arrive at the final solution indirectly, by first estimating a value function.

Policy gradient algorithms are more naturally used as online (on-policy) methods, while value function algorithms are better suited to off-policy use. Off-policy refers to using different policies for exploration (i.e., how you collect your data) and for the update (i.e., the value function update); off-policy methods can potentially do a better job of exploring the state space.

While value function based algorithms tend to be more sample efficient, since they can reuse data collected by a different exploration policy, policy gradient algorithms tend to be more stable during training.

This is an excellent video where Pieter Abbeel explains the theoretical background behind Policy Gradient Methods.

The goal is to maximize the utility function $U(\theta)$: the return of a trajectory $\tau$, weighted by the probability of that trajectory occurring under the policy $\pi_\theta$, i.e. $U(\theta) = \sum_{\tau} P(\tau; \theta)\, R(\tau)$. The final gradient turns out to be:

$$\nabla_\theta U(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ R(\tau) \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right]$$
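
This follows from the likelihood-ratio (score function) trick; a compressed derivation in the same notation:

$$\nabla_\theta U(\theta) = \sum_{\tau} \nabla_\theta P(\tau; \theta)\, R(\tau) = \sum_{\tau} P(\tau; \theta)\, \nabla_\theta \log P(\tau; \theta)\, R(\tau) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ R(\tau)\, \nabla_\theta \log P(\tau; \theta) \right]$$

Since $\log P(\tau; \theta) = \sum_t \log \mathcal{P}(s_{t+1} \mid s_t, a_t) + \sum_t \log \pi_\theta(a_t \mid s_t)$ and the dynamics terms do not depend on $\theta$, only the $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$ terms survive, giving the expression above.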


REINFORCE

The vanilla REINFORCE algorithm is the simplest policy gradient algorithm out there. However, REINFORCE with a baseline is a more popular version: adding a baseline reduces the variance of the gradient estimate during training. The following pseudocode uses the state value function v(s,w) as the baseline.

Input: a differentiable policy π(a|s,θ) and a differentiable state value function v(s,w) as the baseline
Parameters: step sizes alpha > 0, beta > 0
Output: policy π(a|s,θ)

Initialize policy parameters θ and baseline weights w arbitrarily
loop forever (for each episode) do
    generate an episode s[0], a[0], r[1],...,s[T-1], a[T-1], r[T], following π(.|.,θ)
    for each step t of episode 0,...,T-1 do
        Gt <- return from step t
        delta <- Gt - v(s[t],w)
        w <- w + beta * delta * ∇_w v(s[t],w)
        θ <- θ + alpha * gamma^t * delta * ∇_θ ln π(a[t]|s[t],θ)
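
A minimal PyTorch sketch of one episode's REINFORCE-with-baseline update, given the per-step log-probabilities, baseline values, and rewards collected while running the policy (the function and argument names are illustrative; following common practice, the gamma^t factor from the pseudocode is dropped and the baseline is fit with a squared error loss):

import torch
import torch.nn.functional as F

def reinforce_with_baseline_update(log_probs, values, rewards,
                                   policy_opt, value_opt, gamma=0.99):
    # log_probs: list of log π(a_t | s_t, θ) tensors (computed with gradients enabled)
    # values:    list of baseline estimates v(s_t, w) tensors (computed with gradients enabled)
    # rewards:   list of scalar rewards r_{t+1} observed during the episode

    # Compute the discounted return G_t for every time step, working backwards.
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.tensor(returns, dtype=torch.float32)

    log_probs = torch.stack(log_probs)
    values = torch.cat([v.reshape(1) for v in values])

    # delta_t = G_t - v(s_t, w): the baseline-corrected return (advantage estimate).
    deltas = returns - values.detach()

    # Policy objective: ascend sum_t delta_t * log π(a_t | s_t, θ) by minimizing its negative.
    policy_loss = -(deltas * log_probs).sum()
    # Baseline objective: regress v(s_t, w) toward the observed return G_t.
    value_loss = F.mse_loss(values, returns)

    policy_opt.zero_grad()
    value_opt.zero_grad()
    (policy_loss + value_loss).backward()
    policy_opt.step()
    value_opt.step()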

The following GIF shows the REINFORCE policy learned on the CartPole-v0 env of OpenAI gym.