P.O.Box 8718, Beijing 100080, China Journal of Software,  February  2008,19(2):267-274
E-mail: jos@iscas.ac.cn ISSN 1000-9825,  CODEN RUXUEW,  CN 11-2560/TP
http://www.jos.org.cn  Copyright © 2008 by Journal of Software

Classification of Deep Web Databases Based on the Context of Web Pages

Ma Jun, Song Ling, Han Xiaohui, Yan Po



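(School of Computer Science and Technology, Shandong University, Jinan 250101, Shandong, China)
Author biographies: Ma Jun (1956-), male, born in Wenshang, Shandong, Ph.D., professor, doctoral supervisor, CCF senior member; his research areas are information retrieval, parallel computing, and algorithm analysis and design. Song Ling (1969-), female, Ph.D. candidate, associate professor; her research area is information retrieval. Han Xiaohui (1983-), male, Ph.D. candidate; his research area is information retrieval. Yan Po (1985-), female, master's student; her research area is information retrieval.
Corresponding author:
Ma Jun  Phn: +86-531-88391528, Fax: +86-531-88392498, E-mail: majun@sdu.edu.cn, http://ir.sdu.edu.cn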
Received 2007-08-31; Accepted 2007-11-19

Abstract
This paper discusses new techniques for improving the classification precision of deep Web databases: using the content text of the HTML pages that contain the database entry forms as context, and unifying the database attribute labels. An algorithm for locating the content text in an HTML page is developed based on multiple statistical characteristics of the page's text blocks. The unification step replaces attribute labels that are semantically close with a delegate label. The domain and language knowledge found in the learning samples is represented as hierarchical fuzzy sets, and a unification algorithm is proposed based on this representation. Building on this preprocessing, a k-NN (k nearest neighbors) algorithm is given for deep Web database classification, in which the semantic distance between two databases is computed from both the distance between the content texts of their HTML pages and the distance between the database forms embedded in those pages. Classification experiments compare the algorithm with and without the preprocessing in terms of precision, recall, and F1.
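The k-NN step can be illustrated with a minimal sketch. It is not the authors' implementation: the cosine distance over word counts, the mixing weight `alpha`, the record layout (`text`, `form`, `label`), and the sample data are all assumptions made for illustration, since the abstract does not say how the two distances are computed or combined.

```python
import math
from collections import Counter


def cosine_distance(a: Counter, b: Counter) -> float:
    """Return 1 - cosine similarity of two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return 1.0 if norm == 0 else 1.0 - dot / norm


def db_distance(x: dict, y: dict, alpha: float = 0.5) -> float:
    """Combine page-text distance and form-label distance.

    alpha is a hypothetical mixing weight, not a value from the paper.
    """
    text_d = cosine_distance(x["text"], y["text"])
    form_d = cosine_distance(x["form"], y["form"])
    return alpha * text_d + (1.0 - alpha) * form_d


def knn_classify(query: dict, samples: list, k: int = 3, alpha: float = 0.5) -> str:
    """Label the query by majority vote among its k nearest samples."""
    nearest = sorted(samples, key=lambda s: db_distance(query, s, alpha))[:k]
    votes = Counter(s["label"] for s in nearest)
    return votes.most_common(1)[0][0]


def bag(words: str) -> Counter:
    """Build a bag-of-words vector from a whitespace-separated string."""
    return Counter(words.lower().split())


# Invented toy data: page content text plus (already unified) form labels.
samples = [
    {"text": bag("book author novel"), "form": bag("title author isbn"), "label": "books"},
    {"text": bag("book publisher"), "form": bag("title author"), "label": "books"},
    {"text": bag("flight airline"), "form": bag("from to date"), "label": "airfares"},
    {"text": bag("flight ticket"), "form": bag("departure destination"), "label": "airfares"},
]
query = {"text": bag("book reading"), "form": bag("title isbn")}

print(knn_classify(query, samples, k=3))
```

Setting `alpha` to 0 in this sketch ignores the page text and classifies on the form labels alone, which gives a simple form-only baseline to compare against.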

Ma J, Song L, Han XH, Yan P. Classification of deep Web databases based on the context of Web pages. Journal of Software, 2008,19(2):267-274.
DOI: 10.3724/SP.J.1001.2008.00267
http://www.jos.org.cn/1000-9825/19/267.htm


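Abstract (translated from the Chinese)
Several new techniques for improving the accuracy of Deep Web database classification are discussed, including using the content text of HTML pages as the context for understanding database content, and a process for unifying the attribute labels of database forms. The algorithm for discovering the content text in a page is based on multiple statistical features of the page's text blocks. Unifying database attribute labels means replacing synonymous labels with a representative label. Hierarchical fuzzy sets are used to represent the domain and language knowledge discovered from the given learning examples, and a label unification algorithm based on this knowledge is given. Based on this preprocessing, a k-NN (k nearest neighbors) classification algorithm for Deep Web databases is given, in which the semantic distance between databases combines the semantic distance between the database forms and that between the content texts of the pages containing the forms. The classification experiments compare the algorithm's results on unpreprocessed pages and on preprocessed pages in terms of classification precision, recall, and the combined F1 measure.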
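The label unification step can be sketched as follows. This is a flat simplification rather than the paper's hierarchical fuzzy-set algorithm: each representative label maps to a set of synonyms with membership degrees, and a label is replaced by the representative in which its membership is highest. The membership values, the `threshold` parameter, and the function names are hypothetical.

```python
def unify_label(label: str, fuzzy_sets: dict, threshold: float = 0.5) -> str:
    """Replace a label with the delegate whose fuzzy set it best belongs to.

    Labels with no membership at or above the threshold are returned
    unchanged (lowercased and stripped).
    """
    key = label.strip().lower()
    best, best_degree = key, 0.0
    for delegate, members in fuzzy_sets.items():
        degree = members.get(key, 0.0)
        if degree >= threshold and degree > best_degree:
            best, best_degree = delegate, degree
    return best


def unify_labels(labels: list, fuzzy_sets: dict, threshold: float = 0.5) -> list:
    """Unify every attribute label of one database form."""
    return [unify_label(label, fuzzy_sets, threshold) for label in labels]


# Invented membership degrees, for illustration only.
fuzzy_sets = {
    "author": {"author": 1.0, "writer": 0.9, "by": 0.6},
    "title": {"title": 1.0, "book title": 0.95, "name": 0.7},
    "departure": {"departure city": 1.0, "leaving from": 0.9, "from": 0.8},
}

print(unify_labels(["Writer", "Book Title", "ISBN"], fuzzy_sets))
```

After unification, two forms that use different wording for the same attribute share the same label, so a label-based distance between the forms reflects their meaning rather than their phrasing.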

Foundation items: Supported by the Specialized Research Fund for the Doctoral Program of Higher Education of China under Grant No.20070422107; the Key Science-Technology Project of Shandong Province of China under Grant No.2007GG10001002
