Survey on Deep Reinforcement Learning Methods Based on Sample Efficiency Optimization
Corresponding author:

Lü Shuai (吕帅), lus@jlu.edu.cn

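Funding:

National Key Research and Development Program of China (2017YFB1003103); National Natural Science Foundation of China (61300049); Natural Science Foundation of Jilin Province (20180101053JC)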


    Abstract:

    Deep reinforcement learning combines the representational power of deep learning with the decision-making ability of reinforcement learning, and its strong results on complex control tasks have attracted intense research interest. Taking the use of the Bellman equation as the criterion, this paper divides model-free deep reinforcement learning methods into Q-value function methods and policy gradient methods, and introduces each class in terms of model construction, optimization history, and evaluation. It then discusses the problem of low sample efficiency in deep reinforcement learning and, based on the model characteristics of the two classes, shows that overestimation in Q-value function methods and the unbiased-sampling constraint in policy gradient methods are the main factors limiting their respective sample efficiency. From the two perspectives of enhancing exploration efficiency and improving sample utilization, the paper summarizes feasible optimization methods drawn from recent research hotspots and trends, analyzes the advantages and remaining problems of these methods, and compares their scope of application and optimization effect. Finally, it proposes three directions for future research: improving the generality of sample efficiency optimization methods, exploring the transfer of optimization mechanisms between the two classes of methods, and strengthening theoretical completeness.

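Cite this article

Zhang Junwei, Lü Shuai, Zhang Zhenghao, Yu Jiayu, Gong Xiaoyu. Survey on Deep Reinforcement Learning Methods Based on Sample Efficiency Optimization. Journal of Software (软件学报),,():0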

History
  • Received: 2020-11-11
  • Last revised: 2021-01-18
  • Accepted:
  • Published online: 2021-08-02