Zhiwei Shang
Public Documents (2)
Relative Entropy Regularized Sample Efficient Reinforcement Learning with Continuous...
Zhiwei Shang
and 4 more
June 29, 2022
In this paper, a novel reinforcement learning (RL) approach, continuous dynamic policy programming (CDPP), is proposed to tackle the issues of both learning stability and sample efficiency in current RL methods with continuous actions. The proposed method naturally extends relative entropy regularization from the value-function-based framework to the actor-critic (AC) framework of deep deterministic policy gradient (DDPG) to stabilize learning in continuous action spaces. It handles the intractable softmax operation over continuous actions in the critic by Monte Carlo estimation and exploits the practical advantages of the Mellowmax operator. A Boltzmann sampling policy is proposed to guide the actor's exploration according to the relative entropy regularized critic. Evaluated on several benchmark tasks, the proposed method clearly illustrates the positive impact of relative entropy regularization, including efficient exploration and stable policy updates in RL with continuous action spaces, and outperforms the related baseline approach in both sample efficiency and learning stability.
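As a rough illustration of the Monte Carlo treatment of the softmax/Mellowmax over continuous actions mentioned in the abstract, the sketch below estimates the Mellowmax value and samples from a Boltzmann policy using uniformly drawn actions. This is not the authors' implementation: the critic callable q_fn, the action bounds, the temperature omega, and the sample count are assumed placeholders.

```python
# Minimal sketch (assumptions noted above), using only NumPy.
import numpy as np

def mellowmax_mc(q_fn, state, action_low, action_high, omega=5.0, n_samples=64):
    """Monte Carlo estimate of the Mellowmax value at `state`:
    mm_omega(s) ~= (1/omega) * log( mean_i exp(omega * Q(s, a_i)) ),
    with actions a_i drawn uniformly from the bounded action space."""
    rng = np.random.default_rng()
    actions = rng.uniform(action_low, action_high,
                          size=(n_samples, len(action_low)))
    q_values = np.array([q_fn(state, a) for a in actions])   # Q(s, a_i)
    scaled = omega * q_values
    m = scaled.max()                                          # log-sum-exp trick
    return (m + np.log(np.mean(np.exp(scaled - m)))) / omega

def boltzmann_sample(q_fn, state, action_low, action_high, omega=5.0, n_samples=64):
    """Draw one action from a Boltzmann policy pi(a|s) proportional to
    exp(omega * Q(s, a)), approximated over Monte Carlo action samples."""
    rng = np.random.default_rng()
    actions = rng.uniform(action_low, action_high,
                          size=(n_samples, len(action_low)))
    q_values = np.array([q_fn(state, a) for a in actions])
    logits = omega * q_values
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return actions[rng.choice(n_samples, p=probs)]
```

The log-sum-exp stabilization matters here because large temperatures can otherwise overflow the exponential; as omega grows the estimate approaches a hard max over the sampled actions.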
Efficient Distributional Reinforcement Learning with Kullback-Leibler Divergence Regu...
Renxing Li
and 5 more
May 02, 2022
In this article, we address the issues of stability and data efficiency in reinforcement learning (RL). A novel RL approach, Kullback-Leibler divergence-regularized distributional RL (KLC51), is proposed to integrate the advantages of the stability of distributional RL and the data efficiency of Kullback-Leibler (KL) divergence-regularized RL in one framework. KLC51 derives the Bellman equation and the TD errors regularized by KL divergence from a distributional perspective and explores approximation strategies for properly mapping the corresponding Boltzmann softmax term into distributions. Evaluated on several benchmark tasks of varying complexity, the proposed method clearly illustrates the positive effect of KL divergence regularization on distributional RL, including distinctive exploration behavior and smooth value function updates, and demonstrates significant superiority in both learning stability and data efficiency compared with the related baseline approaches.
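For intuition on the Boltzmann softmax term that the KL regularization introduces into the Bellman backup, the sketch below shows a KL-regularized target for discrete actions, where the hard max over next-state values is replaced by a softmax weighted by the previous policy. This is a simplified scalar-value illustration, not the authors' KLC51 code: the arrays q_next and pi_prev, the temperature beta, and the omission of the C51-style projection onto a fixed return support are all assumptions.

```python
# Minimal sketch of a KL-regularized Bellman target (assumptions noted above).
import numpy as np

def kl_regularized_target(reward, q_next, pi_prev, gamma=0.99, beta=10.0):
    """Compute r + gamma * (1/beta) * log( sum_a pi_prev(a) * exp(beta * Q(s', a)) ).
    As beta -> infinity this recovers the usual max-backup; a finite beta keeps
    the update close (in KL) to the previous policy, smoothing value changes."""
    scaled = beta * q_next
    m = scaled.max()                                  # log-sum-exp stabilization
    soft_value = (m + np.log(np.sum(pi_prev * np.exp(scaled - m)))) / beta
    return reward + gamma * soft_value

def boltzmann_policy(q_values, pi_prev, beta=10.0):
    """Policy induced by the KL penalty: pi(a|s) proportional to
    pi_prev(a|s) * exp(beta * Q(s, a))."""
    logits = beta * q_values + np.log(pi_prev + 1e-12)
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()
```

In a distributional variant such as the one the abstract describes, the scalar target above would instead be formed over return distributions and projected back onto the fixed support of the categorical value distribution.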