Policy gradients and Q-learning are two distinct approaches in reinforcement learning that differ in how they learn an optimal policy.
Q-learning is a value-based method that estimates the value of state-action pairs through a Q-function. Its greedy policy selects the action with the highest Q-value in each state, and the Q-values are updated toward the reward plus the discounted maximum Q-value of the next state (the Bellman target). Q-learning is typically used with discrete action spaces and is off-policy, so it can converge to an optimal policy even while gathering experience with an exploratory behavior policy (e.g., epsilon-greedy).
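To make the update concrete, here is a minimal sketch of tabular Q-learning. The environment interface (`env.reset()` and `env.step(action)` returning `(next_state, reward, done)`) and all hyperparameters are illustrative assumptions, not a specific library's API.

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    # Q-table: one estimated value per (state, action) pair.
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        state = env.reset()           # assumed interface
        done = False
        while not done:
            # Epsilon-greedy exploration over the current Q estimates.
            if np.random.rand() < epsilon:
                action = np.random.randint(n_actions)
            else:
                action = int(np.argmax(Q[state]))
            next_state, reward, done = env.step(action)  # assumed interface
            # Off-policy update toward the Bellman target:
            # r + gamma * max_a' Q(s', a').
            target = reward + (0.0 if done else gamma * np.max(Q[next_state]))
            Q[state, action] += alpha * (target - Q[state, action])
            state = next_state
    return Q
```

Note that the update bootstraps from the maximum Q-value of the next state regardless of which action the exploratory policy actually takes next, which is what makes the method off-policy.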
Policy gradient methods, on the other hand, are policy-based. Instead of learning the value of state-action pairs, they directly learn a parameterized policy by optimizing a performance objective (typically the expected return). This makes them a natural fit for continuous or high-dimensional action spaces. Unlike Q-learning, which picks the action with the highest Q-value, policy gradient methods sample actions from the learned policy distribution and then adjust the policy parameters in the direction that increases the expected return of the observed trajectories.
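The sketch below illustrates the simplest policy gradient estimator (REINFORCE) for a discrete action space, using PyTorch. The environment interface, network sizes, and the single-episode update are illustrative assumptions rather than a particular library's API.

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(),
                                 nn.Linear(64, n_actions))

    def forward(self, obs):
        # Return a categorical distribution over actions (the policy).
        return torch.distributions.Categorical(logits=self.net(obs))

def reinforce_episode(env, policy, optimizer, gamma=0.99):
    log_probs, rewards = [], []
    obs, done = env.reset(), False          # assumed interface
    while not done:
        dist = policy(torch.as_tensor(obs, dtype=torch.float32))
        action = dist.sample()               # sample from the policy distribution
        log_probs.append(dist.log_prob(action))
        obs, reward, done = env.step(action.item())  # assumed interface
        rewards.append(reward)

    # Discounted return G_t for each time step of the episode.
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.tensor(returns)

    # Gradient ascent on expected return: maximize sum_t log pi(a_t|s_t) * G_t,
    # implemented as minimizing the negative of that quantity.
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

In practice this vanilla estimator has high variance, so implementations usually subtract a baseline (e.g., a learned value function) or use more advanced variants such as actor-critic or PPO, but the core idea of weighting log-probabilities of sampled actions by observed returns is the same.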