TD(0) and TD(λ) are both temporal-difference methods used in reinforcement learning for estimating value functions. The primary distinction between them lies in how far credit for an observed reward is propagated back through the states the agent has visited. TD(0) updates only the value of the most recently visited state, while TD(λ) uses a more sophisticated mechanism that spreads each update across the sequence of recently visited states.
In TD(0), also known as one-step temporal-difference learning, the value of a state is updated immediately after the agent takes an action and observes the resulting reward and next state. The update rule is straightforward: the value of the current state is nudged toward the observed reward plus the discounted estimated value of the next state (the one-step TD target). For example, if you are training an agent to play chess, then after every move the algorithm adjusts the value of the previous position based only on the immediate reward and the estimated value of the new position. This makes TD(0) cheap per step but limited, because each reward directly affects only one state's estimate, which can slow learning when the consequences of an action unfold over many moves.
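As a minimal sketch of that update, assuming a tabular value function stored in a NumPy array, with an assumed learning rate `alpha` and discount factor `gamma` (both illustrative, not from the original text):

```python
import numpy as np

def td0_update(V, state, reward, next_state, alpha=0.1, gamma=0.99):
    """One TD(0) update: move V[state] toward the one-step target
    reward + gamma * V[next_state]."""
    td_error = reward + gamma * V[next_state] - V[state]
    V[state] += alpha * td_error
    return V

# Toy usage: a 5-state problem with values initialized to zero.
V = np.zeros(5)
V = td0_update(V, state=2, reward=1.0, next_state=3)
```

Note that only `V[state]` changes on each step; earlier states are left untouched until they are visited again.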
TD(λ), on the other hand, uses a mechanism called eligibility traces. Each state visited during an episode carries a trace, a fading memory of how recently it was visited, and every TD error is applied not just to the most recent state but to all states in proportion to their traces. The parameter λ (lambda) controls how quickly this memory fades. With λ set to 0.9, for instance, states visited several steps earlier still receive a meaningful share of each update, though with diminishing weight; with λ = 0, the method reduces to TD(0). As a result, TD(λ) can learn faster by spreading the credit (or blame) for a reward across the states that led to it. In problems where actions have long-term consequences, such as board games or multi-step decision-making tasks, TD(λ) therefore often outperforms TD(0) because it captures those delayed effects more effectively.
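Here is a corresponding sketch using accumulating eligibility traces, again assuming a tabular value function and illustrative values for `alpha`, `gamma`, and `lam`; the traces decay by gamma * lambda each step, which is what gives older states their diminishing weight:

```python
import numpy as np

def td_lambda_update(V, traces, state, reward, next_state,
                     alpha=0.1, gamma=0.99, lam=0.9):
    """One TD(lambda) step with accumulating eligibility traces:
    the same one-step TD error is applied to every state in proportion
    to its trace, and all traces then decay by gamma * lambda."""
    td_error = reward + gamma * V[next_state] - V[state]
    traces[state] += 1.0            # mark the current state as eligible
    V += alpha * td_error * traces  # credit/blame all recently visited states
    traces *= gamma * lam           # fade the memory of older states
    return V, traces

# Toy usage: 5 states; traces are typically reset to zero at the start of each episode.
V = np.zeros(5)
traces = np.zeros(5)
V, traces = td_lambda_update(V, traces, state=2, reward=1.0, next_state=3)
```

Compared with the TD(0) sketch above, the only extra bookkeeping is the `traces` vector, yet it lets a single TD error update every recently visited state at once.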