Temporal Difference (TD) learning in reinforcement learning (RL) is a method for estimating the value of a state or action without needing a model of the environment. TD learning combines ideas from dynamic programming and Monte Carlo methods: like Monte Carlo methods, it learns directly from raw experience, and like dynamic programming, it updates estimates without waiting for a final outcome or terminal state by bootstrapping from other estimates. The agent updates its value estimates based on the difference between consecutive predictions (hence the term "temporal difference").
In TD learning, the agent updates its value estimates after each step, even before the final outcome is known. It does this by comparing the predicted value of the current state to the reward actually received plus the estimated value of the next state. The difference between these two quantities, often called the TD error, is used to adjust the estimate, as in the sketch below.
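The following is a minimal illustrative sketch of this one-step update (often called TD(0)), not a full agent. The function name `td0_update`, the dictionary-based value table `V`, and the specific values of the learning rate `alpha` and discount factor `gamma` are assumptions chosen for the example.

```python
def td0_update(V, state, reward, next_state, alpha=0.1, gamma=0.99):
    """Apply one TD(0) update to the value table V (a dict mapping state -> value)."""
    # TD target: the reward actually received plus the discounted
    # current estimate of the next state's value.
    target = reward + gamma * V[next_state]
    # TD error: difference between the target and the current prediction.
    td_error = target - V[state]
    # Move the estimate a small step (alpha) toward the target.
    V[state] += alpha * td_error
    return td_error


# Example usage with two hypothetical states:
V = {"s1": 0.0, "s2": 0.5}
td0_update(V, state="s1", reward=1.0, next_state="s2")
```

Because the target uses the estimated value of the next state rather than the full return, the agent can update after every transition instead of waiting for the episode to end.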
TD learning is effective because it lets the agent learn from partial sequences of interaction, which makes it well suited to tasks with long episodes or delayed rewards. A widely used TD algorithm is Q-learning, in which action-value (Q) estimates are updated iteratively from the temporal difference between the current estimate and the reward plus the discounted value of the best action in the next state.
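As a hedged sketch of that idea, the tabular Q-learning update below mirrors the TD(0) update but bootstraps from the maximum Q-value over actions in the next state. The function name `q_learning_update`, the nested-dictionary layout of `Q`, and the example states and actions are assumptions for illustration.

```python
from collections import defaultdict


def q_learning_update(Q, state, action, reward, next_state, actions,
                      alpha=0.1, gamma=0.99):
    """Apply one Q-learning update to a tabular Q function (dict of dicts)."""
    # Greedy estimate of the next state's value: best action currently known.
    best_next = max(Q[next_state][a] for a in actions)
    # Temporal-difference target and error for the (state, action) pair.
    target = reward + gamma * best_next
    td_error = target - Q[state][action]
    Q[state][action] += alpha * td_error
    return td_error


# Example usage with hypothetical states and actions:
Q = defaultdict(lambda: defaultdict(float))
q_learning_update(Q, state="s1", action="left", reward=1.0,
                  next_state="s2", actions=["left", "right"])
```

Taking the maximum over next-state actions is what makes Q-learning learn about the greedy policy even while the agent explores with a different behavior policy.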