| P.O.Box 8718, Beijing 100080, China | Journal of Software, February 2008,19(2):179-193 |
| E-mail: jos@iscas.ac.cn | ISSN 1000-9825, CODEN RUXUEW, CN 11-2560/TP |
| http://www.jos.org.cn | Copyright © 2008 by Journal of Software |
A Graph-Based Approach for Web Database Sampling
Liu Wei, Meng Xiaofeng, Ling Yanyan
Abstract
A flood of information is hidden behind Web-based query interfaces with specific query capabilities, which makes it difficult to capture the characteristics of a Web database, such as its topics and its update frequency. This poses a great challenge for Deep Web data integration. To address this problem, this paper proposes WDB-Sampler, a graph-based approach to Web database sampling that incrementally obtains sample records from a Web database through its query interface: a number of samples are obtained for the current query, and one of them is transformed into the next query. An important characteristic of the approach is that it adapts to different kinds of attributes on query interfaces. Extensive experiments on locally simulated Web databases and on real Web databases show that the approach obtains high-quality samples from a Web database at low cost.
Liu W, Meng XF, Ling YY. A graph-based approach for Web database sampling. Journal of Software, 2008, 19(2):179-193.
DOI: 10.3724/SP.J.1001.2008.00179
http://www.jos.org.cn/1000-9825/19/179.htm
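Abstract (translated from the Chinese)
In Web databases, a vast amount of information is hidden behind query interfaces with specific query capabilities, so the characteristics of a Web database's content, such as the distribution of topics and the frequency of updates, cannot be observed directly. This poses a great challenge for Deep Web data integration. To address this problem, a graph-model-based Web database sampling method is proposed. It incrementally obtains approximately random samples from a Web database through its query interface: each query retrieves a certain number of sample records, and the sample records already stored locally are used to generate the next query. An important feature of the method is that it is not restricted by the form in which attributes appear on the query interface, which makes it a general Web database sampling method. Extensive experiments on local simulations and on real Web databases show that the method obtains high-quality samples at low cost.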
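The incremental loop described in the abstract (issue a query, store the returned records locally, turn one stored record into the next query) can be illustrated with a minimal Python sketch. This is not the WDB-Sampler algorithm itself: the abstract does not specify the graph model or the query-selection strategy, so the sketch simply picks a stored record and one of its attribute values uniformly at random. The names `query` and `incremental_sample`, the toy `database`, the `top_k` result limit, and the exact-match query semantics are all assumptions made for illustration.

```python
import random


def query(database, attribute, value, top_k):
    """Hypothetical query interface: exact match on one attribute,
    returning at most top_k records (as a result page would)."""
    matches = [r for r in database if r[attribute] == value]
    return matches[:top_k]


def incremental_sample(database, seed_query, rounds, top_k, rng):
    """Collect records over several rounds. Each round stores the
    returned records locally, then derives the next query from one
    locally stored record (assumed here: a uniform random choice)."""
    samples = {}  # record id -> record, the local sample store
    attribute, value = seed_query
    for _ in range(rounds):
        for record in query(database, attribute, value, top_k):
            samples[record["id"]] = record
        if not samples:
            break  # the seed query matched nothing; no record to continue from
        record = rng.choice(list(samples.values()))
        attribute = rng.choice([a for a in record if a != "id"])
        value = record[attribute]
    return list(samples.values())


# Toy stand-in for a Web database (assumed data, for illustration only).
database = [
    {"id": 1, "author": "Smith", "year": 2004, "topic": "databases"},
    {"id": 2, "author": "Smith", "year": 2005, "topic": "networks"},
    {"id": 3, "author": "Jones", "year": 2005, "topic": "databases"},
    {"id": 4, "author": "Jones", "year": 2006, "topic": "graphics"},
    {"id": 5, "author": "Brown", "year": 2006, "topic": "networks"},
]

samples = incremental_sample(
    database, ("topic", "databases"), rounds=10, top_k=2, rng=random.Random(0)
)
print(sorted(r["id"] for r in samples))
```

Because each new query is built from an attribute value of a record already retrieved, records that share attribute values are reachable from one another, which is the intuition behind viewing the database as a graph. How closely the resulting sample approximates a random one depends on the query-selection strategy, which this sketch does not attempt to reproduce.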
Funding: Supported by the National Natural Science Foundation of China under Grant No.60573091; the National High-Tech Research and Development Plan of China (863 Program) under Grant No.2007AA01Z155; the Beijing Natural Science Foundation of China under Grant No.4073035; the Program for New Century Excellent Talents in University of China