Journal of Software:2012.23(6):1578-1587

Research of Automatic Topic Detection Based on Incremental Clustering
ZHANG Xiao-Ming (张小明), LI Zhou-Jun (李舟军), CHAO Wen-Han (巢文涵)
(Department of Computer Science and Engineering, Beihang University, Beijing 100191, China)
Received:August 07, 2009    Revised:September 01, 2011
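> Abstract (translated from the Chinese): With the rapid growth of information on the Internet, collecting and organizing relevant information has become increasingly difficult. Topic detection and tracking (TDT) is a research direction proposed to address this problem. Topic detection is one of the important research tasks in TDT; its main goal is to cluster together stories that discuss the same topic. Although topic detection has been studied for many years, constantly changing Web information makes it more challenging. This paper proposes an automatic topic detection method based on incremental clustering, which aims to improve the efficiency of topic detection and to detect automatically the number of topics in a document collection. The method computes feature weights with an improved weighting algorithm, raises document clustering accuracy by adaptively refining text features that have strong topic-discriminating power, and uses BIC during clustering to determine the number of topic categories. It also exploits the continuity of topics to pre-cluster documents and thereby speeds up topic detection. Experimental results on the TDT-4 corpus show that the method substantially improves both the efficiency and the accuracy of topic detection.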
Abstract: With the exponential growth of information on the Internet, it has become increasingly difficult to find and organize relevant material. Topic detection and tracking (TDT) is a research area addressing this problem. Topic detection, one of the basic tasks of TDT, is the problem of grouping stories by the topics they discuss. This paper proposes a new topic detection method (TPIC) based on an incremental clustering algorithm. The method aims for high accuracy and for the ability to estimate the true number of topics in the document corpus. A term reweighting algorithm is used to cluster the corpus accurately and efficiently, and a self-refinement process that identifies discriminative features is proposed to improve clustering performance. Furthermore, the “aging” nature of topics is used to precluster stories, and the Bayesian information criterion (BIC) is used to estimate the true number of topics. Experimental results on the Linguistic Data Consortium (LDC) TDT-4 dataset show that the proposed model improves both efficiency and accuracy compared to other models.
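As a rough illustration of what incremental clustering of stories means, the following is a minimal single-pass sketch. It is not the TPIC method described in the abstract. It assumes raw term counts in place of the paper's term reweighting, a fixed similarity threshold in place of BIC-based estimation of the number of topics, and no preclustering based on topic aging. The names `cosine`, `incremental_cluster`, and `threshold` are hypothetical and chosen only for this example.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def incremental_cluster(stories, threshold=0.3):
    """Single-pass clustering (illustrative assumption, not TPIC).

    Each story joins the most similar existing topic, or starts a new
    topic when no similarity reaches `threshold`.
    """
    centroids = []  # summed term counts, one Counter per topic
    labels = []     # topic index assigned to each story, in arrival order
    for text in stories:
        vec = Counter(text.lower().split())
        sims = [cosine(vec, c) for c in centroids]
        best = max(range(len(sims)), key=sims.__getitem__, default=None)
        if best is not None and sims[best] >= threshold:
            centroids[best].update(vec)      # fold the story into the topic
            labels.append(best)
        else:
            centroids.append(Counter(vec))   # open a new topic
            labels.append(len(centroids) - 1)
    return labels


stories = [
    "earthquake hits coastal city",
    "election results announced today",
    "rescue teams reach earthquake city",
    "candidates dispute election results",
]
labels = incremental_cluster(stories)
```

In this sketch the number of topics falls out of the threshold, so a poorly chosen threshold splits or merges topics. The abstract states that the paper instead uses BIC to estimate that number.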
Foundation items: National Natural Science Foundation of China (61170189, 61003111); Doctoral Program Foundation of the Ministry of Education of China (20101102120016); State Key Laboratory Foundation (SKLSDE-2011ZX-03)
Reference text:

ZHANG Xiao-Ming, LI Zhou-Jun, CHAO Wen-Han. Research of Automatic Topic Detection Based on Incremental Clustering. Journal of Software (软件学报), 2012, 23(6): 1578-1587