Yongjin Lee

Proximal Policy Optimization (PPO) is one of the most successful deep reinforcement learning methods, largely owing to its use of a clipped loss for the actor. While the clipped actor loss has been studied extensively, its counterpart for the critic has received far less attention. This study provides a comprehensive analysis of the behavior of the clipped critic loss and reveals that it is misaligned with the trust region principle. Drawing on this analysis, we propose a refined variant that aligns closely with the trust region principle.
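For readers unfamiliar with the object under study, the following is a minimal sketch of the clipped critic loss as it commonly appears in open-source PPO implementations. It is an assumption about the baseline form, not the refined variant proposed here. The names `clipped_critic_loss`, `v_new`, `v_old`, `ret`, and `eps` are illustrative and do not come from the paper.

```python
def clip(x, lo, hi):
    """Clamp a scalar x to the interval [lo, hi]."""
    return max(lo, min(hi, x))


def clipped_critic_loss(v_new, v_old, ret, eps=0.2):
    """Per-sample clipped value loss, in the form commonly used with PPO.

    v_new: value predicted by the critic being updated
    v_old: value predicted by the critic before the update
    ret:   regression target (e.g. an empirical return)
    eps:   clipping range (hypothetical default, chosen for illustration)
    """
    # Restrict the new prediction to lie within eps of the old prediction.
    v_clipped = v_old + clip(v_new - v_old, -eps, eps)
    # Squared error of the raw and of the clipped prediction.
    loss_unclipped = (v_new - ret) ** 2
    loss_clipped = (v_clipped - ret) ** 2
    # Take the larger (more pessimistic) of the two errors.
    return max(loss_unclipped, loss_clipped)
```

In this form, whenever the clipped term is the larger of the two and the prediction lies outside the clipping range, the loss is constant in `v_new`, so that sample contributes no gradient to the critic.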