| P.O.Box 8718, Beijing 100080, China | Journal of Software, Sept. 2004,15(9):1393-1404 |
| E-mail: jos@iscas.ac.cn | ISSN 1000-9825, CODEN RUXUEW, CN 11-2560/TP |
| http://www.jos.org.cn | Copyright © 2004 by The Editorial Department of Journal of Software |
基于多粒度树模型的Web站点描述及挖掘算法
田永鸿, 黄铁军, 高 文
田永鸿1, 黄铁军1,2, 高 文1,2,3
1(中国科学院 计算技术研究所,北京 100080)
2(中国科学院 研究生院,北京 100039)
3(哈尔滨工业大学 计算机科学与工程系,黑龙江 哈尔滨 150001)
作者简介: 田永鸿(1975-),男,四川德阳人,博士生,主要研究领域为数据挖掘,多媒体分析与检索;黄铁军(1970-),男,博士,研究员,主要研究领域为数字媒体,数字图书馆,智能交互,模式识别,图像处理;高文(1956-),男,博士,教授,博士生导师,主要研究领域为多模式用户接口,多媒体技术,人工智能.
联系人: 田永鸿 Phn: +86-10-82649529, Fax: +86-10-82649298, E-mail: yhtian@jdl.ac.cn, http://www.jdl.ac.cn
Received
2003-06-02; Accepted
2003-07-08
Abstract
With the exponential growth of both the amount and the diversity of the web information, web site mining is highly desirable for automatically discovering and classifying topic-specific web resources from the World Wide Web. Nevertheless, existing web site mining methods have not yet handled adequately how to make use of all the correlative contextual semantic clues and how to denoise the content of web sites effectually so as to obtain a better classification accuracy. This paper circumstantiates three issues to be solved for designing an effective and efficient web site mining algorithm, i.e., the sampling size, the analysis granularity, and the representation structure of web sites. On the basis, this paper proposes a novel multiscale tree representation model of web sites, and presents a multiscale web site mining approach that contains an HMT-based two-phase classification algorithm, a context-based interscale fusion algorithm, a two-stage text-based denoising procedure, and an entropy-base pruning strategy. The proposed model and algorithms may be used as a starting-point for further investigating some related issues of web sites, such as query optimization of multiple sites and web usage mining. Experiments also show that the approach achieves in average 16% improvement in classification accuracy and 34.5% reduction in processing time over the baseline system.
Tian YH, Huang TJ, Gao W. A Web site representation and mining algorithm using the multiscale tree model.
Journal of Software, 2004,15(9):1393~1404.
http://www.jos.org.cn/1000-9825/15/1393.htm
摘要
随着Web所拥有的信息量和信息种类的急剧增长,Web站点挖掘对于自动实现特定主题的Web资源发现和分类具有重要的意义.然而现有的Web站点分类或挖掘算法在利用上下文语义信息、去除噪声信息以进一步提高分类准确率等方面还缺乏深入研究.从站点的采样尺寸、分析粒度和描述结构3个方面分析了设计高效的Web站点挖掘算法所需要解决的问题.在此基础上,提出了一种新的Web站点多粒度树描述模型,并描述了包括基于隐Markov树的两阶段分类算法、粒度间上下文融合算法、两阶段去噪程序以及基于熵的动态剪枝策略在内的多粒度Web站点挖掘算法.站点的多粒度描述方法及挖掘算法为多站点查询优化、Web效用挖掘等的深入研究奠定了基础.实验表明,该算法相对于基线系统平均可以提高16%的分类准确率,并减少了34.5%的处理时间.
基金项目:Supported by the "Knowledge Innovation Initiative" of the Chinese Academy of Sciences under Grant No.Kgcxz-103 (中国科学院知识创新工程)
References:
[1] Ester M, Kriegel HP, Schubert M. Web site mining: A new way to spot competitors, customers and suppliers in the world wide web. In: Hand D, ed. Proc. of the SIGKDD 2002. Edmonton: ACM Press, 2002. 249~258.
[2] Chakrabarti S, Joshi M, Tawde V. Enhanced topic distillation using text, markup tags, and hyperlinks. In: Kraft DH, ed. Proc. of the 24th ACM-SIGIR Conf. on Research and Development in Information Retrieval. New Orleans: ACM Press, 2001. 208~216.
[3] Chakrabarti S. Integrating the document object model with hyperlinks for enhanced topic distillation and information extraction. In: Shen VY, ed. Proc. of the WWW 2001. Hong Kong: ACM Press, 2001. 211~220.
[4] Pierre JM. On the automated classification of web sites. Computer and Information Science, 2001,6(001).
[5] Terveen L, Hill W, Amento B. Constructing, organizing, and visualizing collections of topically related web resources. ACM Trans. on Computer-Human Interaction, 1999,6(1):67~94.
[6] Crouse MS, Nowak RD, Baraniuk RG. Wavelet-Based statistical signal processing using hidden Markov models. IEEE Trans. on Signal Processing, 1998,46(4):886~902.
[7] Li J, Gray RM. Context-Based multiscale classification of document images using wavelet coefficient distributions. IEEE Trans. on Image Processing, 2000,9(9):1604~1616.
[8] Chakrabarti S, Berg M, van den Dom B. Focused crawling: A new approach to topic-specific web resource discovery. Computer Networks, 1999,31(11-16):1623~1640.
[9] Minh ND. Fast approximation of Kullback-Leibler distance for dependence trees and hidden Markov model. IEEE Signal Processing Letters, 2003,10(4):115~118.
[10] Diligenti M, Gori M, Maggini M, Scarselli F. Classification of HTML documents by hidden tree-Markov models. In: Tombre K, et al, eds. Proc. of the Int'l Conf. on Document Analysis and Recognition (ICDAR 2001). Los Vaqueros: IEEE Computer Society Press, 2001. 849~853.
[11] Han JW, Chang KC. Data mining for web intelligence. IEEE Computer, 2002,35(11):64~70.
[12] Attardi G, Gulli A, Sebastiani F. Automatic Web page categorization by link and context analysis. In: Hutchison C, Lanzarone G, eds. Proc. of the THAI'99, European Symp. on Telematics, Hypermedia and Artificial Intelligence. Varese, 1999. 105~119. http://faure.isti.cnr.it/~fabrizio/Publications/THAI99.pdf
[13] Cheng H, Bouman CA. Multiscale Bayesian segmentation using a trainable context model. IEEE Trans. on Image Processing, 2001,10(4):511~525.
[13] [14] Chakrabarti S, Dom B, Indyk P. Enhanced hypertext categorization using hyperlinks. In: Tiwary A, Franklin M, eds. Proc. of the SIGMOD'98. Seattle: ACMP Press, 1998. 307~318.
[15] Ye Z, Lu CC. Unsupervised multiscale classification using wavelet-domain hidden Markov tree model. In: Proc. of the 2002 IEEE Int'l Conf. on Acoustics, Speech, and Signal Processing (ICASSP 2002). 2002. 4188~4191. http://www.cs.kent.edu/~zye/research.htm
[16] Choi H, Baraniuk R. Multiscale image segmentation using wavelet-domain hidden Markov models. IEEE Trans. on Image Processing. 2001,10(9):1309~1321.
[17] Romberg J, Choi H, Baraniuk R. Bayesian tree-structured image modeling using wavelet domain hidden Markov models. IEEE Trans. on Image Processing, 2001,10(7):1056~1068.
[17] [17] [18] Gotoh Y. Hochberg M, Silverman H. Efficient training algorithms for HMMs using incremental estimation. IEEE Trans. on Speech and Audio Processing, 1998,6(6):539~548.
[19] Fan G, Xia XG. Multiscale texture segmentation using hybrid contextual labeling tree. In: Proc. of the IEEE Int'l Conf. on Image Processing (ICIP 2000). Vancouver: IEEE Computer Society, 2000. 576~579. http://www.vcipl.okstate.edu/Publications/icip2000b.pdf
[20] Bertino E, Guerrini G, Mesiti M. Measuring the structural similarity among XML documents and DTDs. Technical Report, DISI-TR-02-02, Department of Computer and Information Sciences, University of Genova, 2001. http://www.disi.unige.it/person/MesitiM