Journal of Software:2009.20(7):1756-1767

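Clustering Multiple Data Streams Based on Correlation Analysis
屠莉 (TU Li), 陈崚 (CHEN Ling), 邹凌君 (ZOU Lingjun)
(College of Information Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, Jiangsu 210093; Department of Computer Science and Engineering, Yangzhou University, Yangzhou, Jiangsu 225009; State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, Jiangsu 210093)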
Received: August 11, 2007    Revised: July 02, 2008
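> Chinese abstract (translated): This paper proposes a clustering algorithm for multiple data streams based on correlation analysis. The algorithm quickly compresses the raw data of multiple streams into a statistical synopsis. From these synopses, correlation coefficients can be computed incrementally to measure the similarity between streams. A modified k-means algorithm is proposed to generate the clustering results; it adjusts the number of clusters dynamically and in real time, and promptly detects evolving changes in the data streams. The algorithm is also applied to the clustering on demand (COD) problem, so that users can query clustering results over an arbitrary time interval. A sound time-segment partitioning scheme is proposed so that any user-specified time interval can be composed from these time segments. Experimental results on both synthetic and real data show that the algorithm achieves better clustering quality, speed, and stability than other methods and reflects changes in the data streams in real time.
Chinese keywords (translated): clustering; data stream; correlation analysis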
Abstract: This paper proposes a compression scheme that quickly compresses the raw data from multiple streams into a compact synopsis. The synopsis allows the correlation coefficients to be reconstructed incrementally without accessing the raw data. A modified k-means algorithm is developed to generate clustering results and to adjust the number of clusters dynamically in real time, so as to detect evolving changes in the data streams. Finally, the framework is extended to support clustering on demand (COD), where a user can query for clustering results over an arbitrary time horizon. A theoretically sound time-segment partitioning scheme is developed so that any requested time horizon can be covered by a combination of those time segments. Experimental results on synthetic and real data sets show that the algorithm has higher clustering quality, speed, and stability than other methods and can detect evolving changes in the data streams in real time.
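The idea of a synopsis that yields correlation coefficients without the raw data can be illustrated with a minimal sketch. It assumes the synopsis is the set of running sums needed for the Pearson correlation of two streams; the abstract does not specify the paper's actual synopsis layout, so the class name `PairSynopsis` and its fields are hypothetical. Because the sums are additive, synopses of adjacent time segments can be merged, which is one plausible way a requested time horizon could be assembled from stored segments.

```python
import math
from dataclasses import dataclass


@dataclass
class PairSynopsis:
    """Hypothetical synopsis: running sums for two streams x and y."""
    n: int = 0        # number of points seen
    sx: float = 0.0   # sum of x
    sy: float = 0.0   # sum of y
    sxx: float = 0.0  # sum of x squared
    syy: float = 0.0  # sum of y squared
    sxy: float = 0.0  # sum of x * y

    def update(self, x: float, y: float) -> None:
        """Absorb one new pair of readings in O(1) time and space."""
        self.n += 1
        self.sx += x
        self.sy += y
        self.sxx += x * x
        self.syy += y * y
        self.sxy += x * y

    def merge(self, other: "PairSynopsis") -> "PairSynopsis":
        """Combine the synopses of two disjoint time segments."""
        return PairSynopsis(
            self.n + other.n,
            self.sx + other.sx,
            self.sy + other.sy,
            self.sxx + other.sxx,
            self.syy + other.syy,
            self.sxy + other.sxy,
        )

    def correlation(self) -> float:
        """Pearson correlation from the sums; 0.0 if a stream is constant."""
        vx = self.n * self.sxx - self.sx ** 2
        vy = self.n * self.syy - self.sy ** 2
        if vx <= 0.0 or vy <= 0.0:
            return 0.0
        return (self.n * self.sxy - self.sx * self.sy) / math.sqrt(vx * vy)


# Two time segments of the same pair of streams, summarized separately.
first, second = PairSynopsis(), PairSynopsis()
for x, y in [(1.0, 2.0), (2.0, 4.0)]:
    first.update(x, y)
for x, y in [(3.0, 6.0), (4.0, 8.0)]:
    second.update(x, y)

# Querying the combined horizon needs only the two synopses.
combined = first.merge(second)
print(combined.correlation())
```

A clustering step could then use a quantity such as `1 - correlation` as the distance between streams; that choice is likewise an assumption here, not something the abstract states.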
Foundation items: Supported by the National Natural Science Foundation of China under Grant Nos. 60673060, 60773103; the Natural Science Foundation of Jiangsu Province of China under Grant No. BK2008206
Reference text:

TU Li, CHEN Ling, ZOU Lingjun. Clustering Multiple Data Streams Based on Correlation Analysis. Journal of Software, 2009, 20(7): 1756-1767