| P.O.Box 8718, Beijing 100080, China | Journal of Software February 2008,19(2):209-223 |
| E-mail: jos@iscas.ac.cn | ISSN 1000-9825, CODEN RUXUEW, CN 11-2560/TP |
| http://www.jos.org.cn | Copyright © 2008 by Journal of Software |
Automatic Data Extraction from Template-Generated Web Pages
YANG Shao-Hua1,2, LIN Hai-Lüe1,2, HAN Yan-Bo1
1(Research Center for Grid and Service Computing, Institute of Computing Technology, The Chinese Academy of Sciences, Beijing 100080, China)
2(Graduate University, The Chinese Academy of Sciences, Beijing 100049, China)
Author information: YANG Shao-Hua was born in 1981. He is a Ph.D. candidate at the Institute of Computing Technology, the Chinese Academy of Sciences. His current research areas are Web mining and service-oriented computing.
LIN Hai-Lüe was born in 1982. He is a Ph.D. candidate at the Institute of Computing Technology, the Chinese Academy of Sciences. His current research areas are Web information retrieval and service-oriented computing.
HAN Yan-Bo was born in 1962. He is a professor and doctoral supervisor at the Institute of Computing Technology, the Chinese Academy of Sciences, and a CCF senior member. His research areas are software integration and service grid.
Corresponding author: YANG Shao-Hua, Tel: +86-10-62600955, Fax: +86-10-62600900, E-mail: yangshaohua@software.ict.ac.cn
Received 2007-09-07; Accepted 2007-11-29
Abstract
A substantial fraction of the Web consists of pages that are dynamically generated from a common template populated with data from a database, such as product description pages on e-commerce sites. This work aims to automatically detect the template behind such pages and extract the embedded data (e.g., product name and price). The template detection problem is formalized, and the underlying structure of template-generated pages is analyzed. A template detection approach is presented, and the detected templates are used to extract data from instance pages. Unlike many existing approaches, this approach applies to both "list pages" and "detail pages". Experimental results on two large third-party test beds show that the approach can achieve high extraction accuracy.
Yang SH, Lin HL, Han YB. Automatic data extraction from template-generated Web pages. Journal of Software, 2008,19(2):209-223.
DOI: 10.3724/SP.J.1001.2008.00209
http://www.jos.org.cn/1000-9825/19/209.htm
Funding: Supported by the National Basic Research Program of China (973 Program) under Grant No.2007CB310804; the National Natural Science Foundation of China (Major Research Plan) under Grant No.60573117; the National High-Tech Research and Development Plan of China (863 Program) under Grant No.2006AA01A106