| P.O.Box 8718, Beijing 100080, China | Journal of Software, February 2008,19(2):275-290 |
| E-mail: jos@iscas.ac.cn | ISSN 1000-9825, CODEN RUXUEW, CN 11-2560/TP |
| http://www.jos.org.cn | Copyright © 2008 by Journal of Software |
Collecting and Storing Web Archive Based on Page Block
Song Jie (宋杰), Wang Daling (王大玲), Bao Yubin (鲍玉斌), Shen Derong (申德荣)
Abstract
This paper proposes a page-block-based approach to collecting and storing Web archives. It describes in detail the algorithms for layout-based page partitioning, topic extraction from blocks, version comparison, and incremental storage. A prototype system was implemented and tested to verify the approach. Theoretical analysis and experiments show that the approach is well suited to Web archive management and provides a valuable data resource for Web-archive-based query, search, data mining, and knowledge discovery applications.
Song J, Wang DL, Bao YB, Shen DR. Collecting and storing Web archive based on page block.
Journal of Software, 2008,19(2):275-290.
DOI: 10.3724/SP.J.1001.2008.00275
http://www.jos.org.cn/1000-9825/19/275.htm
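Chinese abstract
This paper proposes an approach to collecting and storing Web pages based on page blocks, and describes in detail how the approach performs layout-based page partitioning, block topic extraction, version and difference comparison, and incremental storage. A prototype Web archiving system was implemented, and the proposed algorithms were tested in detail. Theory and experiments show that the proposed page-block-based approach to collecting and storing Web archives adapts well to Web archive management and provides a useful data resource for applications built on Web archives, such as query, search, knowledge discovery, and data mining.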
Foundation item: Supported by the National Natural Science Foundation of China under Grant Nos. 60573090 and 60673139