A Brief Review of Reinforcement Learning

11 Nov 2017
Reinforcement Learning is a mathematical framework for experience-driven autonomous learning. An RL agent interacts with its environment and, upon observing the consequences of its actions, can learn to alter its own behaviour in response to the rewards received. The goal of the agent is to learn a policy that maximizes the expected return (cumulative, discounted reward).
- Value Function Algorithms
- Policy Gradient Algorithms
At each time step $t$, the agent receives a state $s_t$ from the state space $\mathcal{S}$ and selects an action $a_t$ from the action space $\mathcal{A}$, following a policy $\pi(a_t \mid s_t)$. The policy is the agent's behavior: a mapping from a state $s_t$ to an action $a_t$. Upon taking the action $a_t$, the agent receives a scalar reward $r_t$ and transitions to the next state $s_{t+1}$.
The transition to the next state, as well as the reward, is governed by the environment dynamics (or a model). The state transition probability $P(s_{t+1} \mid s_t, a_t)$ describes the environment dynamics.
The return $R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$ is the discounted, accumulated reward with the discount factor $\gamma \in (0, 1]$. The agent aims to maximize the expectation of this long-term return from each state.
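As a quick illustration, the return of a finite episode can be computed by folding the rewards from the end; a minimal sketch (the helper name `discounted_return` is my own, not from the post):

```python
# Computing the discounted return R_t = sum_k gamma^k * r_{t+k} for a finite
# episode by folding rewards from the end (helper name is illustrative).
def discounted_return(rewards, gamma=0.99):
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g  # each step contributes r + gamma * (future return)
    return g
```

For rewards `[1, 1, 1]` and $\gamma = 0.5$ this gives $1 + 0.5 + 0.25 = 1.75$.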
A value function is a prediction of the expected, accumulated, discounted future reward; in other words, a value function is the expected return. A value function measures how good each state, or state-action pair, is. The state value $V^{\pi}(s) = \mathbb{E}[R_t \mid s_t = s]$ is the expected return for following policy $\pi$ from state $s$.
$V^{\pi}(s)$ decomposes into the Bellman equation:

$$V^{\pi}(s) = \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a) \left[ r + \gamma V^{\pi}(s') \right]$$
We are assuming a stochastic policy in the above equation; therefore, we sum over all actions. $P(s' \mid s, a)$ is the state transition probability for the given environment, and $r + \gamma V^{\pi}(s')$ is the return for taking action $a$ from state $s$.
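Applying this Bellman backup repeatedly is one way to evaluate a policy when the dynamics are known. A minimal sketch, assuming a tiny made-up 2-state MDP (the matrices `P`, `R`, and `pi` below are illustrative):

```python
import numpy as np

# Iterative policy evaluation via Bellman backups on a tiny made-up MDP.
n_states, n_actions = 2, 2
P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # P[s, a, s'] = transition probability
              [[0.0, 1.0], [0.5, 0.5]]])
R = np.array([[1.0, 0.0],                 # R[s, a] = expected immediate reward
              [0.0, 2.0]])
pi = np.array([[0.5, 0.5],                # pi[s, a] = stochastic policy pi(a|s)
               [0.5, 0.5]])
gamma = 0.9

V = np.zeros(n_states)
for _ in range(500):  # the backup is a contraction, so V converges
    # V(s) = sum_a pi(a|s) * (R(s,a) + gamma * sum_s' P(s'|s,a) * V(s'))
    V = np.einsum('sa,sa->s', pi, R + gamma * np.einsum('sap,p->sa', P, V))
```

After convergence, `V` satisfies the Bellman equation: one more backup leaves it unchanged.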
An optimal state value $V^{*}(s) = \max_{\pi} V^{\pi}(s)$ is the maximum state value achievable by any policy for state $s$.
The action value $Q^{\pi}(s, a) = \mathbb{E}[R_t \mid s_t = s, a_t = a]$ is the expected return for selecting action $a$ in state $s$ and then following policy $\pi$.
The optimal action value function $Q^{*}(s, a) = \max_{\pi} Q^{\pi}(s, a)$ is the maximum action value achievable by any policy for state $s$ and action $a$.
Value Function Algorithms
The goal of these algorithms is to come up with an Action Value Function $Q(s, a)$ or a State Value Function $V(s)$. An action value function can be converted into an $\epsilon$-greedy policy.
A state value function $V(s)$ can be converted into a policy by greedily selecting, in each state $s$, the action that leads to the successor state with the highest value $V(s')$.
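A minimal sketch of the $Q$-to-policy conversion (the helper name `epsilon_greedy` and its signature are my own):

```python
import numpy as np

# Converting a tabular action value function Q into an ϵ-greedy policy.
def epsilon_greedy(Q, state, epsilon=0.1, rng=None):
    rng = rng or np.random.default_rng(0)  # fixed seed for reproducibility in this sketch
    n_actions = Q.shape[1]
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))  # explore: uniform random action
    return int(np.argmax(Q[state]))          # exploit: best-known action
```

With `epsilon=0` this is the purely greedy policy; raising `epsilon` trades exploitation for exploration.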
TD learning learns the value function from experiences using TD errors. TD learning is a prediction problem. The update rule is $V(s_t) \leftarrow V(s_t) + \alpha \, [r_t + \gamma V(s_{t+1}) - V(s_t)]$, where $\alpha$ is the learning rate and $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ is the TD error. More accurately, this is called TD(0) learning, where 0 indicates the one-step return.
Input: the policy $\pi$ to be evaluated
Output: value function $V$

```
Initialize V arbitrarily
for each episode do
    initialize state s
    for each step of episode, while state s is not terminal do
        a <- action given by policy π for s
        take action a, observe r, s'
        V(s) <- V(s) + alpha * [r + gamma * V(s') - V(s)]
        s <- s'
```
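The pseudocode above can be sketched in Python; the 1-D random-walk environment below is made up for illustration (states 0 and 6 are terminal, reward 1 for exiting on the right):

```python
import numpy as np

# TD(0) evaluation of a fixed random policy on a toy 1-D random walk.
def td0_evaluate(n_states=5, episodes=2000, alpha=0.1, gamma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    V = np.zeros(n_states + 2)       # V[0] and V[n_states+1] stay 0 (terminal)
    for _ in range(episodes):
        s = (n_states + 1) // 2      # start in the middle
        while s not in (0, n_states + 1):
            s_next = s + rng.choice([-1, 1])          # the policy: move randomly
            r = 1.0 if s_next == n_states + 1 else 0.0
            V[s] += alpha * (r + gamma * V[s_next] - V[s])  # TD(0) update
            s = s_next
    return V
```

For this walk the true values of the interior states are $1/6, 2/6, \ldots, 5/6$, and the estimates approach them as episodes accumulate.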
SARSA (state, action, reward, state, action) is an on-policy algorithm to find the optimal policy: the same policy is used both for state space exploration and for the value function update.
Output: action value function $Q$

```
Initialize Q arbitrarily
for each episode do
    initialize state s
    a <- action for s derived from Q, e.g. ϵ-greedy
    for each step of episode, while state s is not terminal do
        take action a, observe r, s'
        a' <- action for s' derived from Q, e.g. ϵ-greedy
        Q(s,a) <- Q(s,a) + alpha * [r + gamma * Q(s',a') - Q(s,a)]
        s <- s'; a <- a'
```
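A runnable sketch of SARSA on a made-up deterministic chain (action 0 moves left, action 1 moves right, reward 1 at the right end; greedy ties are broken randomly so early exploration isn't biased left):

```python
import numpy as np

# SARSA on a toy chain: states 0..n-1; reaching state n-1 ends the episode
# with reward 1. Q[n-1] is never updated, so it stays 0 (terminal value).
def sarsa_chain(n=6, episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((n, 2))

    def eps_greedy(s):
        if rng.random() < epsilon:
            return int(rng.integers(2))
        best = np.flatnonzero(Q[s] == Q[s].max())
        return int(rng.choice(best))          # break ties randomly

    for _ in range(episodes):
        s, a = 0, eps_greedy(0)
        while s != n - 1:
            s2 = max(0, s - 1) if a == 0 else s + 1
            r = 1.0 if s2 == n - 1 else 0.0
            a2 = eps_greedy(s2)               # on-policy: next action from the same policy
            Q[s, a] += alpha * (r + gamma * Q[s2, a2] - Q[s, a])
            s, a = s2, a2
    return Q
```

After training, the greedy policy derived from `Q` moves right in every non-terminal state.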
Q-Learning is an off-policy algorithm to find the optimal policy. It's an off-policy method because the update rule greedily uses the action that maximizes the Q value, which can differ from the action the behaviour policy (ϵ-greedy exploration) actually takes.
Output: action value function $Q$

```
Initialize Q arbitrarily
for each episode do
    initialize state s
    for each step of episode, while state s is not terminal do
        a <- action for s derived from Q, e.g. ϵ-greedy
        take action a, observe r, s'
        Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_b Q(s',b) - Q(s,a)]
        s <- s'
```
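The same toy chain can be solved with Q-learning; note that the update bootstraps from the greedy value `max_b Q(s', b)` regardless of which action the ϵ-greedy behaviour policy takes next:

```python
import numpy as np

# Q-learning on a made-up chain: behaviour is ϵ-greedy, but the update
# bootstraps from the greedy value max_b Q(s', b) (off-policy).
def q_learning_chain(n=6, episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((n, 2))
    for _ in range(episodes):
        s = 0
        while s != n - 1:
            if rng.random() < epsilon:
                a = int(rng.integers(2))  # explore
            else:                          # exploit, breaking ties randomly
                a = int(rng.choice(np.flatnonzero(Q[s] == Q[s].max())))
            s2 = max(0, s - 1) if a == 0 else s + 1
            r = 1.0 if s2 == n - 1 else 0.0
            Q[s, a] += alpha * (r + gamma * np.max(Q[s2]) - Q[s, a])  # Q[terminal] stays 0
            s = s2
    return Q
```

With $\gamma = 0.9$, the optimal value of going right from state 0 is $0.9^4 \approx 0.656$, and Q-learning recovers it despite the exploratory behaviour policy.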
Policy Gradient Algorithms
Policy gradient algorithms directly optimize the function (policy) we care about. Value function based algorithms, on the other hand, come to the final solution indirectly.
Policy gradient algorithms are more naturally online (on-policy) methods, while value function algorithms are better suited to off-policy learning. Off-policy refers to using different policies for exploration (i.e. how you collect your data) and for the update (i.e. the value function update); off-policy methods can potentially do a better job of exploring the state space.
While value function based algorithms are more sample-efficient, since they can reuse collected data and explore the state space more freely, policy gradient algorithms are generally more stable.
The goal is to maximize the utility function $U(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]$: the return of a trajectory $\tau$ weighted by the probability of its occurrence under the policy $\pi_\theta$. The final gradient turns out to be:

$$\nabla_\theta U(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, R(\tau) \right]$$
The vanilla REINFORCE algorithm is the simplest policy gradient algorithm out there. However, REINFORCE with a baseline is a more popular version: adding a baseline reduces the variance during training. The following pseudocode uses the value function as a baseline.
Parameters: step sizes $\alpha > 0$, $\beta > 0$

```
Initialize policy parameters θ and value function weights w
for true do
    generate an episode s[0], a[0], r[1], ..., s[T-1], a[T-1], r[T], following π(.|., θ)
    for each step t of episode 0, ..., T-1 do
        G <- return from step t
        delta <- G - v(s[t], w)
        w <- w + beta * delta * ∇w v(s[t], w)
        θ <- θ + alpha * γ^t * delta * ∇θ log π(a[t] | s[t], θ)
```
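A runnable sketch of REINFORCE with a learned baseline on a made-up 2-armed bandit (episodes are one step long, so the return $G_t$ is just the immediate reward and $\gamma^t = 1$):

```python
import numpy as np

# REINFORCE with a learned baseline on a made-up 2-armed bandit. The
# baseline w is a single scalar: the value of the only state.
def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

def reinforce_bandit(episodes=3000, alpha=0.1, beta=0.1, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros(2)                    # policy logits
    w = 0.0                                # scalar baseline
    means = np.array([0.0, 1.0])           # arm 1 pays more on average
    for _ in range(episodes):
        p = softmax(theta)
        a = int(rng.choice(2, p=p))
        g = means[a] + rng.normal(0.0, 0.1)  # return of this one-step episode
        delta = g - w                         # delta = G - v(s, w)
        w += beta * delta                     # baseline (value) update
        grad_log = -p                         # grad of log softmax: onehot(a) - p
        grad_log[a] += 1.0
        theta += alpha * delta * grad_log     # policy gradient step
    return theta
```

Training shifts nearly all of the policy's probability mass onto the better-paying arm.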
The following GIF shows the REINFORCE policy learned on the CartPole-v0 environment of OpenAI Gym.