Policy-based methods in reinforcement learning learn the policy directly: a mapping from states to actions (or, for stochastic policies, to a probability distribution over actions). Rather than estimating the value of state-action pairs, the agent learns a policy that maximizes the expected cumulative reward over time.
In policy-based methods, the agent typically represents the policy with a parameterized function such as a neural network and updates it from feedback received from the environment. Policy gradient methods, such as REINFORCE and Proximal Policy Optimization (PPO), compute the gradient of the expected return with respect to the policy parameters and then update those parameters in the direction that increases the likelihood of actions that led to higher returns.
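To make the update rule concrete, below is a minimal REINFORCE sketch in PyTorch, assuming a small discrete-action Gymnasium environment (CartPole-v1); the network size, learning rate, and episode count are illustrative choices, not tuned values.

```python
# Minimal REINFORCE sketch (assumes torch and gymnasium are installed;
# hyperparameters are illustrative, not tuned).
import torch
import torch.nn as nn
import gymnasium as gym

env = gym.make("CartPole-v1")
obs_dim = env.observation_space.shape[0]
n_actions = env.action_space.n

# Parameterized policy: maps a state to a distribution over actions.
policy = nn.Sequential(
    nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions)
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)
gamma = 0.99

for episode in range(500):
    obs, _ = env.reset()
    log_probs, rewards = [], []
    done = False
    while not done:
        logits = policy(torch.as_tensor(obs, dtype=torch.float32))
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        obs, reward, terminated, truncated, _ = env.step(action.item())
        rewards.append(reward)
        done = terminated or truncated

    # Discounted return G_t for each time step of the episode.
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    returns = torch.as_tensor(returns, dtype=torch.float32)

    # Policy gradient step: increase the log-probability of each action
    # in proportion to the return that followed it.
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

PPO builds on the same idea but clips how far each update can move the policy, which makes training considerably more stable in practice.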
These methods are particularly useful for continuous action spaces, where value-based methods like Q-learning struggle because they require maximizing over the action set at every step. However, policy-based methods can suffer from high variance in their gradient estimates, since the Monte Carlo returns that weight the updates vary widely between episodes, and they typically require variance-reduction techniques (such as subtracting a baseline) along with careful tuning and optimization.
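For the continuous-action case, a common choice is a Gaussian policy: the network outputs the mean of the action distribution and a learned standard deviation. The sketch below illustrates this; the `GaussianPolicy` class name, layer sizes, and the state-independent log-std are illustrative assumptions rather than a fixed recipe.

```python
# Sketch of a Gaussian policy head for a continuous action space.
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int):
        super().__init__()
        self.mean = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim)
        )
        # State-independent log standard deviation, a common simple choice.
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs: torch.Tensor) -> torch.distributions.Normal:
        return torch.distributions.Normal(self.mean(obs), self.log_std.exp())

policy = GaussianPolicy(obs_dim=3, act_dim=1)
dist = policy(torch.randn(3))
action = dist.sample()            # continuous action drawn from the policy
log_prob = dist.log_prob(action)  # used in the policy-gradient update
```

The same log-probability-weighted update from REINFORCE applies unchanged; only the action distribution differs, which is why policy gradients extend so naturally to continuous control.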