Renxing Li -

In this paper, a novel reinforcement learning (RL) approach, continuous dynamic policy programming (CDPP) is proposed to tackle the issues of both learning stability and sample efficiency in the current RL methods with continuous actions. The proposed method naturally extends the relative entropy regularization from the value function-based framework to the actor-critic (AC) framework of deep deterministic policy gradient (DDPG) to stabilize the learning process in continuous action space. It tackles the intractable softmax operation over continuous actions in the critic by Monte Carlo estimation and explores the practical advantages of the Mellowmax operator. A Boltzmann sampling policy is proposed to guide the exploration of actor following the relative entropy regularized critic. Evaluated by several benchmark tasks, the proposed method clearly illustrates the positive impact of the relative entropy regularization including efficient exploration behavior and stable policy update in RL with continuous action space and successfully outperforms the related baseline approach in both sample efficiency and learning stability.