Journal of Software, 2003, 14(5): 976-983

A Method to Query Document Database by Content and Structure
WANG Xiao-Ling, WEN Ji-Rong, LUAN Jin-Feng, MA Wei-Ying, DONG Yi-Sheng
(Department of Computer Science and Engineering, Southeast University, Nanjing, Jiangsu 210096; Microsoft Research Asia, Beijing 100080)
Received: April 4, 2002    Revised: October 17, 2002
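> Abstract (translated from Chinese): Documents have a logical structure; titles, chapters, sections, and paragraphs are part of a document's inherent logic. Different users have different retrieval needs, and how a retrieval system can return meaningful information has long been a central research problem. Combining document structure and content, this paper proposes a new method of computing similarity for the retrieval of structured documents. The method supports retrieval of document content at multiple granularities, from words and phrases to paragraphs or sections. A question answering system was implemented on top of this method, with Microsoft's Encarta encyclopedia as the test set; experimental comparison with traditional methods shows that the article fragments retrieved this way are more reasonable and more effective.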
Abstract: Structured documents are made up of a few logical components, such as titles, sections, subsections, and paragraphs. The components of each structured document can be represented by an ordered tree model, which can also be viewed as a hierarchy of concepts. To meet users' requirements for more precise and concentrated search results, retrieval techniques should allow the user to retrieve document components of varying granularity. This paper presents a method to query a document database by content and structure. The key idea is to construct a more comprehensive similarity function by taking advantage of the inherent hierarchical structure of documents. This work combines Information Retrieval techniques, semi-structured data query, and proximate search for documents. The proposed method is evaluated on the Encarta encyclopedia document set, and the experimental results show that it can provide more accurate and focused answers than traditional document retrieval methods.
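
The abstract does not give the similarity function itself, so the following is only a minimal sketch of the general idea: components form an ordered tree, every component is scored, and the answer may be a paragraph, a section, or the whole article. The `Component` class, the term-overlap `content_score`, the `decay` parameter, and the sample document are all assumptions made for illustration, not the paper's definitions.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Component:
    """One logical component (article, section, paragraph) of an ordered tree."""
    label: str
    text: str = ""
    children: list[Component] = field(default_factory=list)


def content_score(text: str, query: set[str]) -> float:
    """Fraction of query terms that occur in the component's own text."""
    words = set(text.lower().split())
    return len(query & words) / len(query) if query else 0.0


def rank_components(root: Component, query_text: str, decay: float = 0.5):
    """Score every component and return (score, label) pairs, best first.

    ASSUMPTION: a component's score is its own content score plus a decayed
    average of its children's scores. This only illustrates how structure can
    feed into similarity; it is not the function defined in the paper.
    """
    query = set(query_text.lower().split())
    results: list[tuple[float, str]] = []

    def visit(node: Component) -> float:
        own = content_score(node.text, query)
        child_scores = [visit(child) for child in node.children]
        inherited = (
            decay * sum(child_scores) / len(child_scores) if child_scores else 0.0
        )
        score = own + inherited
        results.append((score, node.label))
        return score

    visit(root)
    return sorted(results, key=lambda pair: pair[0], reverse=True)


# Hypothetical document; own text of inner nodes stands in for their titles.
doc = Component("article: Volcano", "volcano", [
    Component("section: Eruptions", "eruptions", [
        Component("para 1", "lava flows from the volcano during eruptions"),
        Component("para 2", "ash clouds can disrupt air travel"),
    ]),
    Component("section: Geography", "volcano locations", [
        Component("para 3", "many volcanoes lie along plate boundaries"),
    ]),
])

ranked = rank_components(doc, "volcano lava")
for score, label in ranked[:3]:
    print(f"{score:.4f}  {label}")
```

In this toy run the single paragraph that contains both query terms ranks above the whole article, which is the kind of focused, fine-grained answer the abstract argues for. A real system would replace the term-overlap score with a proper Information Retrieval weighting.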
Reference text:

WANG Xiao-Ling, WEN Ji-Rong, LUAN Jin-Feng, MA Wei-Ying, DONG Yi-Sheng. A Method to Query Document Database by Content and Structure. Journal of Software, 2003, 14(5): 976-983