Understanding Expected SARSA in Reinforcement Learning
Introduction to Expected SARSA
This section offers a brief overview of Expected SARSA, along with a comparative analysis of SARSA, Expected SARSA, and Q-Learning.
Reinforcement Learning Essentials
Reinforcement Learning (RL) focuses on enabling an agent to determine the best control strategy for a sequential decision-making challenge within an environment. The goal is to maximize its long-term rewards through continuous interactions with that environment. Each interaction modifies the environment's state, leading to a numerical reward provided by the environment. This concept is elaborated in "Reinforcement Learning: An Introduction" by Richard S. Sutton and Andrew G. Barto.
Both SARSA and Expected SARSA, along with Q-Learning, fall under the category of Temporal Difference (TD) algorithms. TD methods learn directly from experiences gained by interacting with the environment without requiring a model of its dynamics. Whenever an agent takes an action, the feedback it receives is utilized to update estimates of its state-action value function, which predicts the long-term discounted rewards associated with specific actions in various states. TD learning is fully incremental, allowing updates before knowing the final results.
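To make the incremental update concrete, here is a minimal tabular sketch in Python. The table size, the step size alpha, and the discount gamma are illustrative assumptions rather than values from the text.

```python
import numpy as np

# Tabular state-action values: Q[s, a] estimates the discounted return of
# taking action a in state s and then following the current policy.
n_states, n_actions = 16, 4   # illustrative sizes, not from the text
Q = np.zeros((n_states, n_actions))

def td_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """One-step temporal-difference update of the state-action value.

    The target r + gamma * Q[s_next, a_next] bootstraps from the current
    estimate at the next step, so the update can be applied after every
    single transition, without a model of the environment and without
    waiting for the episode to finish.
    """
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q
```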
Understanding Expected SARSA
Expected SARSA is an adaptation of the classic SARSA algorithm, a well-known on-policy temporal-difference method for model-free reinforcement learning. Unlike SARSA, Expected SARSA can be run either on-policy or off-policy.
- On-Policy vs. Off-Policy: In an off-policy context, similar to Q-Learning, there exist two distinct policies: the behavior policy, which explores the environment to gather diverse samples, and the target policy, which is optimized based on that exploration. Conversely, in the on-policy scenario, as seen with traditional SARSA, the behavior and target policies are the same, with the target policy being refined iteratively as it governs the agent's behavior.
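As a small illustration of the two roles, the sketch below assumes a tabular Q array and an ε of 0.1 chosen only for demonstration: an on-policy method such as SARSA both acts and updates with the ε-greedy function, while an off-policy method such as Q-Learning acts with it but updates toward the greedy one.

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy_action(Q, s, epsilon=0.1):
    """Behavior policy: explore with probability epsilon, otherwise exploit."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))   # exploratory random action
    return int(np.argmax(Q[s]))                # current best action

def greedy_action(Q, s):
    """Greedy target policy, the one an off-policy learner improves toward."""
    return int(np.argmax(Q[s]))
```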
Like SARSA, Expected SARSA is built on the Bellman equation, whereas Q-Learning is built on the Bellman optimality equation. In SARSA, the stochasticity of the shared behavior/target policy adds variance to the updates, because the next action used in the update target is sampled non-deterministically.
In contrast, Expected SARSA uses the expected state-action value of the subsequent state, averaging over all possible actions weighted by their probability under the policy. This removes the sampling variance that policy stochasticity would otherwise add to the target, since the update relies on the expectation rather than on a single sampled action.
Put differently, where SARSA bootstraps from one randomly sampled action in the next state, Expected SARSA computes the expected value under the policy in that state. It therefore moves deterministically in the direction that SARSA moves in expectation, which is where its name comes from.
Expected SARSA is more computationally expensive per update than SARSA, since it sums over all actions, but using the expectation lowers the variance of the updates.
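A hedged sketch of the two update targets, assuming a tabular Q array and an ε-greedy policy (the values of ε and γ are illustrative): SARSA bootstraps from one sampled next action, while Expected SARSA averages over all next actions weighted by their probability under the policy.

```python
import numpy as np

def sarsa_target(Q, r, s_next, a_next, gamma=0.99):
    """SARSA: bootstrap from the single action actually sampled in s_next."""
    return r + gamma * Q[s_next, a_next]

def expected_sarsa_target(Q, r, s_next, epsilon=0.1, gamma=0.99):
    """Expected SARSA: bootstrap from the policy-weighted average over actions.

    The probabilities below assume an epsilon-greedy policy; any explicit
    policy distribution pi(a | s_next) could be substituted.
    """
    n_actions = Q.shape[1]
    probs = np.full(n_actions, epsilon / n_actions)
    probs[np.argmax(Q[s_next])] += 1.0 - epsilon
    return r + gamma * float(np.dot(probs, Q[s_next]))
```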
Comparative Analysis of SARSA, Expected SARSA, and Q-Learning
- Sampling Methods: SARSA samples the next action from the policy and bootstraps from that single sample, Q-Learning maximizes over the actions available in the next state, and Expected SARSA computes the expectation over those actions weighted by their probability under the current policy; the sketch after this list contrasts the three targets.
- Optimal Policy Convergence: SARSA converges to an optimal policy provided all state-action pairs are visited infinitely often and the policy converges in the limit to the greedy policy (for example, ε-greedy with ε decaying to zero). Because Q-Learning's target policy is greedy, its Q values converge to the optimal Q* under the same visitation condition. Expected SARSA offers a lower-variance alternative to SARSA; with a greedy target policy it reduces exactly to Q-Learning, and with the behavior policy as its target it acts as an on-policy counterpart of Q-Learning.
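The following sketch puts the three one-step targets side by side for a tiny, made-up Q table; the numbers, ε, and γ are arbitrary and only for illustration.

```python
import numpy as np

def td_targets(Q, r, s_next, a_next, epsilon=0.1, gamma=0.99):
    """One-step targets of the three methods, assuming an epsilon-greedy policy."""
    n_actions = Q.shape[1]
    probs = np.full(n_actions, epsilon / n_actions)
    probs[np.argmax(Q[s_next])] += 1.0 - epsilon
    return {
        "sarsa": r + gamma * Q[s_next, a_next],            # sampled next action
        "expected_sarsa": r + gamma * probs @ Q[s_next],   # expectation under the policy
        "q_learning": r + gamma * Q[s_next].max(),         # max over next actions
    }

# Arbitrary two-state, two-action example, purely for illustration.
Q = np.array([[0.0, 0.0],
              [1.0, 3.0]])
print(td_targets(Q, r=1.0, s_next=1, a_next=0, epsilon=0.1, gamma=0.9))
# sarsa: 1.9, expected_sarsa: ~3.61, q_learning: 3.7
```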
Performance Metrics
When performance is measured online, SARSA tends to outperform Q-Learning. In reported comparisons, Expected SARSA matches or exceeds SARSA across the tested step-size values, with the largest gains in tasks where the policy is highly exploratory. The gap between Expected SARSA and SARSA narrows when most of the randomness comes from the environment rather than from the policy.
An interesting consequence is that Expected SARSA can also outperform Q-Learning online when the ε-soft behavior policy derived from Q*(s, a) is not the optimal ε-soft policy: Expected SARSA optimizes the ε-soft policy it actually follows, whereas Q-Learning optimizes for a greedy policy it never executes.
A Helpful Analogy
To differentiate between SARSA, Q-Learning, and Expected SARSA, consider the analogy of traveling from one location to another, where various routes are available, and you need to estimate travel time:
- SARSA estimates the travel time by sampling a single route at random, according to how you usually choose.
- Q-Learning estimates the time assuming you always take the fastest route.
- Expected SARSA takes a weighted average of the travel times of all possible routes, weighted by how likely you are to take each one.
Conclusion
Expected SARSA is a refinement of the SARSA algorithm aimed at reducing update variance relative to SARSA. It uses the expected state-action value of the next state, averaging over all possible actions weighted by their probability under the current policy. Expected SARSA can be run as either an on-policy or an off-policy algorithm and, like SARSA, is grounded in the Bellman equation.
References
Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press.