Alert Clustering Method Based on Graph Representation Learning for Online Service Systems

CLC number: TP311

    Abstract:

    In large-scale online service systems, intricate dependencies among components often cause a single fault to trigger a large number of correlated alerts, producing an alert storm. Alert storms increase the workload of on-call engineers and make fault diagnosis and root cause analysis harder. To address this issue, this study proposes Alert-CM, an alert clustering method based on graph representation learning that groups alerts caused by the same fault, thereby reducing engineers' workload. In alert management, an alert is typically aggregated from several kinds of underlying system data present at the time of the fault, such as the applications involved in the anomaly, metrics, logs, alert rules, and emergency scenarios. Alert-CM assumes that alerts triggered by the same fault are usually closely related at the level of this underlying system data, and that the core underlying system data best represents the abnormal system state behind an alert. Following this idea, Alert-CM builds a fine-grained system data dependency graph from alert-related configuration data, and abstracts and maps the dependencies between alerts and graph nodes to expand the alert feature space. On top of this graph, Alert-CM builds a graph neural network model for graph representation learning, which learns how strongly each piece of core underlying system data contributes to an alert and outputs an accurate vector representation of each alert. Finally, Alert-CM clusters the alerts with the DBSCAN algorithm, using the learned representations. Alert-CM is evaluated on a real-world industrial dataset, with a focus on clustering effectiveness and real-time efficiency. The results show that Alert-CM significantly outperforms traditional alert aggregation methods on the alert clustering task: it achieves an NMI of 0.901 and an ARI of 0.645, improvements of 31.7% and 153.9%, respectively, over the average of existing methods. Alert-CM also performs well on online real-time clustering.
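
    The final clustering and scoring steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy vectors stand in for the representations a graph neural network would produce, and the cosine distance, the `eps` and `min_samples` values, and all function names are assumptions made for illustration. The sketch runs a plain DBSCAN over the vectors and then scores the result against hypothetical fault labels with the adjusted Rand index (ARI), one of the two metrics the abstract reports.

```python
import math
from collections import Counter

NOISE = -1  # label for alerts that belong to no cluster


def cosine_distance(u, v):
    """Cosine distance between two non-zero vectors (assumed metric)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def region_query(vectors, i, eps):
    """Indices of all points within eps of point i (including i itself)."""
    return [j for j, v in enumerate(vectors)
            if cosine_distance(vectors[i], v) <= eps]


def dbscan(vectors, eps, min_samples):
    """Plain DBSCAN; returns one cluster label per vector, NOISE for outliers."""
    labels = [None] * len(vectors)
    cluster_id = -1
    for i in range(len(vectors)):
        if labels[i] is not None:
            continue
        neighbors = region_query(vectors, i, eps)
        if len(neighbors) < min_samples:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster_id += 1
        labels[i] = cluster_id
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point: join, do not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(vectors, j, eps)
            if len(j_neighbors) >= min_samples:  # core point: keep expanding
                queue.extend(j_neighbors)
    return labels


def adjusted_rand_index(truth, pred):
    """ARI computed from the contingency table of two labelings."""
    n = len(truth)
    sum_cells = sum(math.comb(c, 2) for c in Counter(zip(truth, pred)).values())
    sum_rows = sum(math.comb(c, 2) for c in Counter(truth).values())
    sum_cols = sum(math.comb(c, 2) for c in Counter(pred).values())
    expected = sum_rows * sum_cols / math.comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


# Toy alert vectors (hypothetical): two faults plus one unrelated alert.
alert_vectors = [
    [1.0, 0.1, 0.0], [0.9, 0.2, 0.0], [1.0, 0.0, 0.1],  # fault A
    [0.1, 1.0, 0.0], [0.0, 0.9, 0.1], [0.2, 1.0, 0.0],  # fault B
    [0.0, 0.0, 1.0],                                    # isolated alert
]
fault_labels = [0, 0, 0, 1, 1, 1, 2]  # hypothetical ground truth

labels = dbscan(alert_vectors, eps=0.1, min_samples=2)
ari = adjusted_rand_index(fault_labels, labels)
print(labels, ari)
```

    DBSCAN suits this setting because the number of concurrent faults is not known in advance and isolated alerts can be left as noise rather than forced into a cluster. In the sketch the isolated alert is labeled as noise, which ARI treats as its own group.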

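Cite this article

陈淼,张弼铖,张晨曦,彭鑫,杨定裕,李伟,钱泽林,吴哲顺,欧嘉煜,钟坚锐. Alert Clustering Method Based on Graph Representation Learning for Online Service Systems. Journal of Software (软件学报),,():1-21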

History
  • Received: 2024-12-10
  • Last revised: 2025-06-05
  • Published online: 2026-04-01