In reinforcement learning, on-policy and off-policy methods differ in whether the policy being learned (the target policy) is the same as the policy that generates the data the agent learns from (the behavior policy).
On-policy methods learn the value of the policy the agent is currently following: the target and behavior policies are the same, so the agent updates its estimates using data generated by its own (typically exploratory) policy. SARSA is an example; its update bootstraps from the action the current policy actually selects in the next state, so the agent's exploration directly shapes what it learns.
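As a minimal sketch (assuming a hypothetical tabular setting where Q is a NumPy array indexed by state and action, and alpha, gamma, and eps are illustrative hyperparameters), the SARSA update could look like this:

```python
import numpy as np

def epsilon_greedy(Q, state, eps, rng):
    # Epsilon-greedy policy: in SARSA this is both the behavior policy
    # and the policy whose value is being learned.
    if rng.random() < eps:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[state]))

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    # On-policy TD target: bootstrap from a_next, the action the current
    # policy actually chose in s_next (exploratory or not).
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])
```

Because a_next comes from the same epsilon-greedy policy the agent is following, the learned values reflect that policy, exploration included.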
Off-policy methods, on the other hand, learn the value of a target policy (often the optimal one) independently of the agent's current behavior. This lets the agent learn from data generated by a different behavior policy, such as exploratory actions or previously collected experience. Q-learning is an example: its update bootstraps from the greedy (maximum-value) action in the next state, regardless of which action the behavior policy actually takes, so it can estimate the optimal policy while behaving exploratorily.
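For contrast, a comparable sketch of the Q-learning update (same hypothetical tabular setup as above) replaces the sampled next action with a max over actions:

```python
def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    # Off-policy TD target: bootstrap from the greedy action in s_next,
    # no matter which action the behavior policy will actually take there.
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
```

The max makes the update independent of the behavior policy's choice in s_next, which is what allows the training data to come from exploratory actions or previously collected experience while the estimates still track the optimal policy.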