P.O.Box 8718, Beijing 100080, China Journal of Software,  February  2008,19(2):224-236
E-mail: jos@iscas.ac.cn ISSN 1000-9825,  CODEN RUXUEW,  CN 11-2560/TP
http://www.jos.org.cn  Copyright © 2008 by Journal of Software

An Attributes Correlation Based Approach for Estimating Size of Web Databases

Ling Yanyan, Meng Xiaofeng, Liu Wei

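(School of Information, Renmin University of China, Beijing 100872, China)
Author biographies: Ling Yanyan (1985-), female, born in Huangshan, Anhui, master's student; main research area: Deep Web data integration. Meng Xiaofeng (1964-), male, Ph.D., professor, doctoral supervisor, CCF senior member; main research areas: Web data integration, XML data management, mobile data management. Liu Wei (1976-), male, Ph.D. candidate; main research areas: Deep Web data integration, Web data extraction.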
Corresponding author:
Meng Xiaofeng  Phn: +86-10-62519453, E-mail: xfmeng@ruc.edu.cn, http://idke.ruc.edu.cn/xfmeng/
Received 2007-09-03; Accepted 2007-10-19

Abstract
This paper proposes a word-frequency-based approach to estimating the size of a Web database. The approach obtains a random sample on a chosen attribute by analyzing the correlations among the textual attributes in the query interface. It then estimates the size of the Web database by submitting probing queries, generated from the top-k most frequent words, to that query interface. Experiments on several real-world databases show that the approach is effective and estimates the size of Web databases with high accuracy.

Ling YY, Meng XF, Liu W. An attributes correlation based approach for estimating size of Web databases. Journal of Software, 2008,19(2):224-236.
DOI: 10.3724/SP.J.1001.2008.00224
http://www.jos.org.cn/1000-9825/19/224.htm


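Abstract (Chinese)
This paper proposes a word-frequency-based method for estimating the size of a Web database. A set of random samples on a given attribute is obtained by analyzing the correlations between attributes in the Web database's query interface; probing queries formed from the top-k most frequent words are then submitted on that attribute to estimate the total number of records in the Web database. Experiments on several real Web databases show that the method estimates the size of a Web database accurately.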
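
The probing step outlined in the abstract can be illustrated with a minimal sketch. The abstract does not give the estimator itself, so everything here is an assumption made for illustration: the sample is taken to be a list of text values for one attribute, a word's frequency is the fraction of sample records containing it, each probing query yields one estimate (reported hit count divided by that fraction), and the per-word estimates are averaged. The names `top_k_words`, `estimate_size`, and `count_hits` are hypothetical and do not come from the paper.

```python
from __future__ import annotations

from collections import Counter
from statistics import mean
from typing import Callable, Sequence


def top_k_words(sample: Sequence[str], k: int) -> list[tuple[str, float]]:
    """Return the k most frequent words with the fraction of sample records containing each."""
    records = [set(value.lower().split()) for value in sample]
    if not records:
        raise ValueError("sample must not be empty")
    # Count each word once per record (document frequency, not raw term frequency).
    counts = Counter(word for words in records for word in words)
    return [(word, n / len(records)) for word, n in counts.most_common(k)]


def estimate_size(sample: Sequence[str],
                  count_hits: Callable[[str], int],
                  k: int = 3) -> float:
    """Average the per-word estimates hits(word) / sample_frequency(word).

    count_hits is a hypothetical stand-in for submitting a probing query to the
    query interface and reading the number of matching records it reports.
    """
    estimates = [count_hits(word) / freq for word, freq in top_k_words(sample, k)]
    return mean(estimates)


# Synthetic stand-in for a Web database; a real one would only be reachable
# through its query interface.
database = [f"web data {i}" if i % 2 == 0 else f"deep web {i}" for i in range(1000)]
sample = database[:100]  # assumed to be representative of the attribute


def simulated_hits(word: str) -> int:
    return sum(word in record.split() for record in database)


print(estimate_size(sample, simulated_hits, k=3))
```

The estimate is only as good as the sample: if the sampled word frequencies differ from those in the full database, each per-word estimate is biased accordingly, which is presumably why the approach puts effort into obtaining a random sample through attribute correlation.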

Funding: Supported by the National Natural Science Foundation of China under Grant No.60573091; the National High-Tech Research and Development Plan of China (863 Program) under Grant No.2007AA01Z155; the Program for New Century Excellent Talents in University of China; the Beijing Natural Science Foundation of China under Grant No.4073035
