SARSA (State-Action-Reward-State-Action) is an on-policy temporal-difference reinforcement learning algorithm. Like Q-learning, it learns an action-value function Q(s, a), but with a key difference: SARSA updates Q-values based on the action actually taken in the next state, rather than on the best possible (greedy) action.
The update rule for SARSA is:

Q(s, a) ← Q(s, a) + α * [R(s, a) + γ * Q(s', a') − Q(s, a)]

Where:
- α is the learning rate
- γ is the discount factor
- R(s, a) is the reward received for taking action a in state s
- s' is the next state
- a' is the next action actually taken by the agent (not necessarily the one that maximizes the Q-value)

This makes SARSA an on-policy method: it updates the Q-values based on the policy the agent is actually following, including the actions it chooses while exploring.
For example, if the agent's exploratory policy selects a suboptimal action in the next state, SARSA updates toward that action's Q-value rather than toward the best possible one.
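To make the update rule concrete, here is a minimal tabular SARSA sketch in Python. The 5-state chain environment, the ε-greedy behavior policy, and all hyperparameter values (α, γ, ε) are illustrative assumptions, not taken from the text; the key line is the update, which bootstraps on Q(s', a') for the action the agent will actually take next.

```python
import numpy as np

# Hypothetical 5-state chain environment (assumption, for illustration only):
# states 0..4, action 0 = move left, action 1 = move right,
# reward +1 for reaching the terminal state 4.
n_states, n_actions = 5, 2
alpha, gamma, epsilon = 0.1, 0.99, 0.1   # learning rate, discount factor, exploration rate

Q = np.zeros((n_states, n_actions))

def step(s, a):
    """Hypothetical chain dynamics: action 1 moves right, action 0 moves left."""
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    reward = 1.0 if s_next == n_states - 1 else 0.0
    done = s_next == n_states - 1
    return s_next, reward, done

def epsilon_greedy(s):
    """Behavior policy: explore with probability epsilon, otherwise act greedily."""
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[s]))

for episode in range(200):
    s = 0
    a = epsilon_greedy(s)                 # first action chosen by the current policy
    done = False
    while not done:
        s_next, r, done = step(s, a)
        a_next = epsilon_greedy(s_next)   # the action actually taken next (on-policy)
        # SARSA update: bootstrap on Q(s', a'), the action the agent will really take,
        # using only the immediate reward at terminal transitions.
        target = r + (0.0 if done else gamma * Q[s_next, a_next])
        Q[s, a] += alpha * (target - Q[s, a])
        s, a = s_next, a_next

print(Q)
```

Replacing `Q[s_next, a_next]` in the target with `Q[s_next].max()` would turn this into Q-learning, which is exactly the on-policy versus off-policy distinction described above.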