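Guo Husheng (郭虎升), Zhang Aijuan (张爱娟), Wang Wenjian (王文剑). Concept drift detection method based on online performance test. Journal of Software (软件学报), 2020, 31(4):0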
Concept Drift Detection Method Based on Online Performance Test (基于在线性能测试的概念漂移检测方法)
Received: 2019-03-08  Revised: 2019-07-11
DOI:10.13328/j.cnki.jos.005917
Keywords: streaming data; concept drift; cross checking; effective fluctuation point; consistent fluctuation point; concept drift point
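Funding: National Natural Science Foundation of China (61503229, 61673249, U1805263); Scientific Research Foundation for Returned Overseas Scholars of Shanxi Province (2016-004)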
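Authors and affiliations:
Guo Husheng (郭虎升): School of Computer and Information Technology, Shanxi University, Taiyuan, Shanxi 030006; Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education (Shanxi University), Taiyuan, Shanxi 030006
Zhang Aijuan (张爱娟): School of Computer and Information Technology, Shanxi University, Taiyuan, Shanxi 030006
Wang Wenjian (王文剑): Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education (Shanxi University), Taiyuan, Shanxi 030006; E-mail: wjwang@sxu.edu.cn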
Abstract:
Concept drift is a common problem in dynamic streaming data mining. However, false concept drift, caused by mixed-in noise or by training samples that are too small, produces an effect similar to that of real concept drift, namely unstable fluctuation in the model's online test performance. The two are therefore easily confused, which leads to false alarms of concept drift. To address this confusion between real and false concept drift in streaming data, a concept drift detection method based on online performance test (CDPT) is proposed. CDPT evenly divides the most recently acquired data into groups and performs online learning on each sub-dataset separately. It records the classification accuracy vector obtained by training and testing on each sub-dataset, computes the accuracy drop between adjacent learning time units, and obtains the effective fluctuation points according to a threshold on the decline in test accuracy. The effective fluctuation points from the different groups are then integrated by cross checking, which eliminates the detection interference caused by model instability due to small training samples during online learning of streaming data, and the consistent fluctuation points are obtained according to the consistency of the accuracy fluctuations. Finally, by tracking the classification accuracy of online learning, the change in test accuracy at the neighborhood reference points of each consistent fluctuation point is obtained, and the magnitude of the accuracy decline and the convergence behavior at these reference points are compared to identify the true concept drift points among the consistent fluctuation points.

The experimental results show that CDPT effectively identifies real concept drift occurring during online learning of streaming data, avoids the negative impact of overly small training samples or noise in the streaming data on the detection results, and improves the generalization performance of the model.
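
The three stages described in the abstract (effective fluctuation points, consistent fluctuation points, true concept drift points) can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes the per-group accuracy vectors have already been produced by online learning, and every name and parameter here (`drop_threshold`, `min_agree`, `window`) is hypothetical. In particular, the final check, which treats a drop as real drift only if the mean accuracy stays lower over a short window after the point, is an assumed stand-in for the neighborhood reference point comparison, whose exact rule the abstract does not give.

```python
from typing import Dict, List, Sequence


def effective_fluctuation_points(acc: Sequence[float],
                                 drop_threshold: float) -> List[int]:
    """Time units t where accuracy fell by more than drop_threshold vs. t-1."""
    return [t for t in range(1, len(acc))
            if acc[t - 1] - acc[t] > drop_threshold]


def consistent_fluctuation_points(group_accs: Sequence[Sequence[float]],
                                  drop_threshold: float,
                                  min_agree: int) -> List[int]:
    """Cross-check the groups: keep points flagged by at least min_agree groups.

    The agreement rule (a simple vote count) is an assumption of this sketch.
    """
    votes: Dict[int, int] = {}
    for acc in group_accs:
        for t in effective_fluctuation_points(acc, drop_threshold):
            votes[t] = votes.get(t, 0) + 1
    return sorted(t for t, count in votes.items() if count >= min_agree)


def is_true_drift(mean_acc: Sequence[float], t: int,
                  window: int, drop_threshold: float) -> bool:
    """Assumed rule: the drop at t is real drift if the mean accuracy over the
    window starting at t stays below the mean over the window before t by more
    than drop_threshold (a transient dip caused by noise recovers quickly)."""
    before = mean_acc[max(0, t - window):t]
    after = mean_acc[t:t + window]
    if not before or len(after) < window:
        return False  # not enough reference points to decide
    return sum(before) / len(before) - sum(after) / len(after) > drop_threshold


def detect_drift_points(group_accs: Sequence[Sequence[float]],
                        drop_threshold: float = 0.15,
                        min_agree: int = 2,
                        window: int = 3) -> List[int]:
    """Run the three stages on per-group accuracy vectors."""
    n = min(len(acc) for acc in group_accs)
    # Mean accuracy across groups at each time unit, used for the final check.
    mean_acc = [sum(acc[t] for acc in group_accs) / len(group_accs)
                for t in range(n)]
    candidates = consistent_fluctuation_points(group_accs, drop_threshold,
                                               min_agree)
    return [t for t in candidates
            if is_true_drift(mean_acc, t, window, drop_threshold)]
```

In this sketch a one-step dip that two groups happen to share still passes the cross check, but it is rejected in the last stage because the accuracy recovers immediately, whereas a drop that persists across the window is reported as drift.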
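Sponsors: Institute of Software, Chinese Academy of Sciences; China Computer Federation
Editorial office: Tel. +86-10-62562563, E-mail: jos@iscas.ac.cn
Copyright Institute of Software, Chinese Academy of Sciences, Journal of Software (软件学报). All Rights Reserved.
The journal's full-text database is copyrighted. Reproduction without permission is prohibited, and the journal reserves the right to pursue legal liability.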