Journal of Software, 2013, 24(11): 2571-2583

Clustering-Based PU Active Text Classification Method
LIU Lu, PENG Tao, ZUO Wan-Li, DAI Yao-Kang
(College of Computer Science and Technology, Jilin University, Changchun 130012, China; Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, USA; Key Laboratory of Symbol Computation and Knowledge Engineering (Jilin University), Ministry of Education, Changchun 130012, China)
Received: February 28, 2013; Revised: July 16, 2013
Abstract: Text classification is a key problem in information retrieval. Extracting more reliable negative examples and building accurate, efficient classifiers are the two central problems in PU (positive and unlabeled) text classification. However, many existing methods extract only a small number of reliable negative examples, which limits the quality of the resulting classifiers. This paper proposes a clustering-based, semi-supervised method for PU text classification, enhanced by SVM active learning, that addresses both steps. In contrast to traditional extraction methods, the approach combines clustering with the observation that positive and negative documents should share as few feature terms as possible: it removes as many probable positive examples as possible from the unlabeled set, which yields more reliable negative examples. To build the classifier, an improved Rocchio method is combined with SVM active learning, and the term weighting scheme TFIPNDF (term frequency inverse positive-negative document frequency, an improved TFIDF) is used for feature extraction. Experiments on three datasets (RCV1, Reuters-21578, and 20 Newsgroups) show that the clustering-based method extracts more reliable negative examples than the baseline algorithms while keeping the error rate low, and that introducing SVM active learning also significantly improves classification accuracy.
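The abstract does not spell out the extraction procedure, so the following is only a minimal sketch of the general idea (cluster the unlabeled set, then discard clusters that look like the positive class). It assumes plain term-frequency vectors, a toy k-means with cosine similarity seeded by the first `k` documents, and a hypothetical threshold `max_similarity`; all function names and parameter values are illustrative and do not come from the paper. It does not implement TFIPNDF, the improved Rocchio method, or SVM active learning.

```python
import math
from collections import Counter


def tf_vector(doc):
    """Bag-of-words term-frequency vector (assumption: whitespace tokens)."""
    return Counter(doc.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(vectors):
    """Mean of a non-empty list of sparse term vectors."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return Counter({t: c / len(vectors) for t, c in total.items()})


def kmeans(vectors, k, iterations=10):
    """Toy k-means; the first k vectors are used as deterministic seeds."""
    centers = [Counter(v) for v in vectors[:k]]
    assignment = []
    for _ in range(iterations):
        # Assign each document to its most similar cluster center.
        assignment = [
            max(range(k), key=lambda j: cosine(v, centers[j])) for v in vectors
        ]
        # Recompute each center from its current members.
        for j in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == j]
            if members:
                centers[j] = centroid(members)
    return assignment, centers


def reliable_negatives(positive_docs, unlabeled_docs, k=2, max_similarity=0.2):
    """Keep unlabeled documents whose cluster shares few terms with the positives.

    max_similarity is a hypothetical cut-off, not a value from the paper.
    """
    pos_center = centroid([tf_vector(d) for d in positive_docs])
    vectors = [tf_vector(d) for d in unlabeled_docs]
    assignment, centers = kmeans(vectors, k)
    # Clusters that resemble the positive class are treated as probable
    # positives and removed; the remainder are kept as reliable negatives.
    kept = {j for j in range(k) if cosine(centers[j], pos_center) < max_similarity}
    return [d for d, a in zip(unlabeled_docs, assignment) if a in kept]
```

Called with a few finance-themed positive documents and an unlabeled mix of finance and sports documents, `reliable_negatives` would be expected to return only the sports documents, since their cluster shares no terms with the positive centroid. Filtering at the cluster level rather than per document is a design choice made here for brevity.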
Foundation items: National Natural Science Foundation of China (60903098, 60973040)
Reference text:

LIU Lu, PENG Tao, ZUO Wan-Li, DAI Yao-Kang. Clustering-Based PU Active Text Classification Method. Journal of Software, 2013, 24(11): 2571-2583