Journal of Software, 2013, 24(11): 2667-2675

Shaping Reward Learning Approach from Passive Samples
QIAN Yu, YU Yang, ZHOU Zhi-Hua
(National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China)
Received:April 06, 2013    Revised:July 17, 2013
Abstract: Reinforcement learning (RL) addresses long-term reward maximization by learning correct short-term decisions from previous experience. Reward shaping, which replaces the actual environmental reward with a simpler, easier-to-learn reward function (the shaping reward function), has been shown to be an effective way to guide and accelerate reinforcement learning. However, a shaping reward is usually built from either domain knowledge or demonstrations of an optimal policy, and both require the costly participation of human experts. This work investigates whether an effective shaping reward can be learned automatically during the RL process itself. RL algorithms commonly sample a large number of trajectories while learning. These passive (self-generated) samples, though many of them are failed attempts, may provide useful information for building a shaping reward function. A new policy-invariance condition for reward shaping is introduced as a more effective way to handle noisy examples, and on this basis the RFPotential approach is proposed to learn a shaping reward efficiently from massive self-generated examples. Empirical studies on several RL algorithms and domains show that RFPotential can accelerate the RL process.
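The abstract does not spell out the new policy-invariance condition or the internals of RFPotential, so the sketch below does not attempt either. As background only, it illustrates the classic potential-based form of reward shaping, a well-known sufficient condition for leaving the optimal policy unchanged: the shaping term is gamma * phi(s') - phi(s). The chain environment, the hand-written potential `phi`, and all function names are assumptions made for illustration.

```python
def phi(state, goal=4):
    """Hypothetical hand-written potential: negative distance to the goal."""
    return -abs(goal - state)


def shaped_reward(reward, state, next_state, gamma):
    """Environment reward plus the potential-based term gamma*phi(s') - phi(s)."""
    return reward + gamma * phi(next_state) - phi(state)


def discounted_return(rewards, gamma):
    """Sum of gamma^t * r_t over one trajectory."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


def shape_trajectory(states, rewards, gamma):
    """Apply the shaping term to every transition (s_t, s_{t+1}, r_t)."""
    return [
        shaped_reward(r, s, s_next, gamma)
        for s, s_next, r in zip(states, states[1:], rewards)
    ]


# Assumed toy trajectory on a 1-D chain: walk from state 0 to the goal at 4,
# receiving reward 1 only on the final step.
gamma = 0.9
states = [0, 1, 2, 3, 4]
rewards = [0.0, 0.0, 0.0, 1.0]

original = discounted_return(rewards, gamma)
shaped = discounted_return(shape_trajectory(states, rewards, gamma), gamma)

# The shaping terms telescope, so the two returns differ only by a quantity
# that depends on the start and end states, not on the actions in between.
offset = (gamma ** len(rewards)) * phi(states[-1]) - phi(states[0])
```

In this sketch the shaped return equals the original return plus `offset`, which is why shaping of this form changes how quickly an agent receives informative feedback without changing which policy is optimal. A learned potential, as the paper proposes, would take the place of the hand-written `phi`.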
Foundation items: Natural Science Foundation of Jiangsu Province (BK2012303); Baidu open research project (181315P00651)
Reference text:

QIAN Yu, YU Yang, ZHOU Zhi-Hua. Shaping Reward Learning Approach from Passive Samples. Journal of Software, 2013, 24(11): 2667-2675