Fuzzy-mapping-entropy-driven Safety Monitoring Method for Reinforcement Learning Systems

CLC number: TP311

Fund Project: National Natural Science Foundation of China (92582104)

    Abstract:

    Deep reinforcement learning has achieved strong results on many complex tasks, but its policies still lack real-time safety guarantees in dynamic, high-dimensional environments. Deployment therefore calls for a safety monitoring mechanism that can evaluate and correct agent decisions in real time. Existing data-driven black-box monitoring methods focus on discrete or binary decisions and are hard to transfer directly to continuous action spaces. To address this, the study proposes a fuzzy-mapping-entropy-driven safety monitoring framework that is built solely from state, action, and cost data and requires no environment model. The method first fits a Gaussian mixture model (GMM) to offline-collected safe trajectories, assigning states to clusters by hard partition and actions to clusters by soft membership, and introduces fuzzy mapping entropy to adaptively determine the optimal number of action clusters while balancing uniformity against model complexity. Fuzzy logic rules are then built in the Mamdani framework, and the cluster centers are jointly fine-tuned with a residual network and an adversarial discriminator so that generated actions lie closer to the real safe distribution. In the online phase, the monitor computes a cluster consistency measure for each pending state-action pair from the GMM posterior probabilities. If the measure falls below a threshold, fuzzy inference generates a smooth, safe replacement action, so the action is corrected before a risk materializes. The framework is evaluated on three navigation tasks in Safety-Gymnasium with PPO-Lag, TRPO-Lag, and CPPO-PID policies. The results show that it significantly reduces cumulative safety cost with almost no loss in, or even a slight gain in, task return while maintaining a high warning coverage rate, which supports its effectiveness and practicality in continuous-action scenarios.
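The online phase described in the abstract can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration, since the abstract does not give formulas: the hypothetical parameter tables `STATE_GMM`, `ACTION_GMM`, and `RULES`, the diagonal covariances, the consistency measure (agreement between the rule memberships of the state's cluster and the action's GMM posterior), the threshold of 0.3, and the use of a membership-weighted average of cluster centers in place of full Mamdani inference and defuzzification. The fuzzy mapping entropy and the adversarial fine-tuning step are not modeled.

```python
import math

# Hypothetical fitted parameters (illustrative values, not taken from the paper).
STATE_GMM = {
    "weights": [0.5, 0.5],
    "means": [[0.0, 0.0], [5.0, 5.0]],
    "variances": [[1.0, 1.0], [1.0, 1.0]],  # diagonal covariances (assumption)
}
ACTION_GMM = {
    "weights": [0.5, 0.5],
    "means": [[0.2, 0.0], [1.0, 0.5]],
    "variances": [[0.05, 0.05], [0.05, 0.05]],
}
# RULES[s][k]: assumed membership of action cluster k given state cluster s.
RULES = [[0.9, 0.1], [0.2, 0.8]]


def gaussian_pdf(x, mean, var):
    """Density of a diagonal-covariance Gaussian at point x."""
    density = 1.0
    for xi, mi, vi in zip(x, mean, var):
        density *= math.exp(-0.5 * (xi - mi) ** 2 / vi) / math.sqrt(2.0 * math.pi * vi)
    return density


def posterior(x, gmm):
    """GMM posterior p(cluster | x) for every mixture component."""
    joint = [
        w * gaussian_pdf(x, m, v)
        for w, m, v in zip(gmm["weights"], gmm["means"], gmm["variances"])
    ]
    total = sum(joint)
    if total == 0.0:  # numerical underflow: fall back to a uniform posterior
        return [1.0 / len(joint)] * len(joint)
    return [j / total for j in joint]


def monitor(state, action, threshold=0.3):
    """Return (action_to_execute, consistency, replaced)."""
    # Hard partition: the state belongs to its most probable cluster.
    state_post = posterior(state, STATE_GMM)
    s = max(range(len(state_post)), key=state_post.__getitem__)
    memberships = RULES[s]

    # Soft membership: posterior of the proposed action over action clusters.
    action_post = posterior(action, ACTION_GMM)

    # Assumed consistency measure: overlap between rule and action posterior.
    consistency = sum(m * p for m, p in zip(memberships, action_post))
    if consistency >= threshold:
        return list(action), consistency, False

    # Simplified replacement: membership-weighted average of cluster centers.
    total = sum(memberships)
    safe_action = [
        sum(m * c[d] for m, c in zip(memberships, ACTION_GMM["means"])) / total
        for d in range(len(action))
    ]
    return safe_action, consistency, True
```

Calling `monitor([0.1, -0.2], [1.0, 0.5])` in this toy setup flags the action, because the state falls in the first state cluster while the action sits on the second action cluster's center, and returns a blend of the two centers weighted toward the first. An action near the first center, such as `[0.2, 0.0]`, passes through unchanged. Because a posterior always sums to one, a measure of this kind cannot by itself flag actions far from every cluster; a likelihood floor would be one possible addition.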

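Cite this article

Yang Min, Zhou Ziyuan, Li Xiaofeng, Liu Guanjun. Fuzzy-mapping-entropy-driven Safety Monitoring Method for Reinforcement Learning Systems. Journal of Software, 2026, 37(9): 1-20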

History
  • Received: 2025-09-08
  • Last revised: 2025-10-28
  • Published online: 2025-12-24
  • Publication date: 2026-09-06