P.O.Box 8718, Beijing 100080, China Journal of Software   February 2008,19(2):246-256
E-mail: jos@iscas.ac.cn ISSN 1000-9825,  CODEN RUXUEW,  CN 11-2560/TP
http://www.jos.org.cn  Copyright © 2008 by Journal of Software

Using Classifiers to Find Domain-Specific Online Databases Automatically

WANG Hui, LIU Yan-Wei, ZUO Wan-Li 



WANG Hui, LIU Yan-Wei, ZUO Wan-Li
(College of Computer Science and Technology, Jilin University, Changchun 130012, China)
Author information: WANG Hui was born in 1972. He received his Ph.D. degree from Jilin University. His research area is Web information mining. LIU Yan-Wei was born in 1983. He is a graduate student at Jilin University. His research area is Web information mining. ZUO Wan-Li was born in 1957. He is a professor and doctoral supervisor at Jilin University and a CCF senior member. His research areas are databases, data mining, and Web search engines.
Corresponding author:
WANG Hui, Phone: +86-431-85166492, E-mail: whui05@yahoo.com.cn
Received 2007-08-02; Accepted 2007-11-06

Abstract
In the hidden Web domain, general-purpose search engines (e.g., Google and Yahoo) have shortcomings. Each covers less than one-third of the data stored in document databases, and, unlike on the surface Web, combining them adds little because they cover roughly the same data. The hidden Web is a highly important information source, since the content provided by many hidden Web sites is often of very high quality. This paper proposes a three-step framework to automatically identify domain-specific hidden Web entries. The query interfaces obtained in this way can then be integrated into a unified interface through which users submit queries. Eight large-scale experiments demonstrate that the technique finds domain-specific hidden Web entries accurately and efficiently.

Wang H, Liu YW, Zuo WL. Using classifiers to find domain-specific online databases automatically. Journal of Software, 2008,19(2):246-256.
DOI: 10.3724/SP.J.1001.2008.00246
http://www.jos.org.cn/1000-9825/19/246.htm


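Abstract (translated from Chinese)
In deep Web research, general-purpose search engines (such as Google and Yahoo) have many shortcomings: the amount of data each of them covers is less than 1/3 of the total data on the deep Web, and, unlike on the surface Web, the amount of data covered barely changes when several search engines are combined. Many deep Web sites provide large amounts of high-quality information, and the deep Web is gradually becoming a most important information resource. A three-classifier framework is proposed for automatically identifying domain-specific deep Web entries. Once the query interfaces are obtained, they can be integrated, and a unified interface is then presented to users to make it easier for them to query information. Eight groups of large-scale experiments verify that the proposed method can discover domain-specific deep Web entries accurately and efficiently.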

Foundation item: Supported by the National Natural Science Foundation of China under Grant No.60373099; the Science and Technology Development Program of Jilin Province of China under Grant No.20070533

