Journal of Software, 2017, 28(12): 3241-3256

Non-Cooperative Deep Web Data Source Selection Based on Subject and Probability Model
DENG Song (邓松), WAN Chang-Xuan (万常选)
(School of Software & Communication Engineering, Jiangxi University of Finance and Economics, Nanchang 330013, China; Jiangxi Key Laboratory of Data and Knowledge Engineering (Jiangxi University of Finance and Economics), Nanchang 330013, China; School of Information and Technology, Jiangxi University of Finance and Economics, Nanchang 330013, China; Jiangxi Key Laboratory of Data and Knowledge Engineering (Jiangxi University of Finance and Economics), Nanchang 330013, China)
Received: October 12, 2016    Revised: March 21, 2017
Abstract: In deep Web data integration, users want high-quality results from querying only a few data sources, which makes data source selection a core technology. To meet integrated retrieval needs based on relevance and diversity, this paper proposes a deep Web data source selection method suited to summaries built from small-scale sampled documents. The method first measures the relevance between each data source and the user query, then considers the diversity of the data provided by the candidate sources. To discriminate relevant data sources more accurately, a data source summary based on hierarchical subjects is constructed and a deviation probability model of subject content relevance is introduced into it; a method for building this deviation probability model from manual feedback and a probability-analysis-based measure of data source relevance are also given. To increase the diversity of the selection results, diversity-oriented directed edges are built in the hierarchical subject summary, and an evaluation method for data source diversity is given. Finally, data source selection based on relevance and diversity is cast as a combinatorial optimization problem, and a selection strategy based on an optimization function is proposed. Experimental results show that the method achieves high selection accuracy when data sources are selected from a small number of sampled documents.
Keywords: deep Web; data source selection; subject; probability model; TextRank
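
The abstract does not state the optimization function, so the following is only a minimal sketch of how a relevance-plus-diversity selection can be treated as a combinatorial optimization problem and approximated greedily. It is not the authors' algorithm. The function name `select_sources`, the `trade_off` weight, and the use of newly covered subjects as the diversity term are all assumptions made for illustration.

```python
from typing import Dict, List, Set


def select_sources(
    relevance: Dict[str, float],
    topics: Dict[str, Set[str]],
    k: int,
    trade_off: float = 0.5,
) -> List[str]:
    """Greedily pick up to k sources, balancing relevance and subject coverage.

    relevance: hypothetical relevance score of each source to the query.
    topics:    hypothetical set of subjects each source covers.
    trade_off: weight on relevance; (1 - trade_off) weights diversity.
    """
    selected: List[str] = []
    covered: Set[str] = set()
    # Normalise the diversity term by the number of distinct subjects.
    n_topics = max(len(set().union(*topics.values())), 1)
    candidates = set(relevance)

    while candidates and len(selected) < k:
        def gain(source: str) -> float:
            # Diversity gain: share of subjects not yet covered by the selection.
            new_topics = topics.get(source, set()) - covered
            return (trade_off * relevance[source]
                    + (1 - trade_off) * len(new_topics) / n_topics)

        # Sorting first makes tie-breaking deterministic.
        best = max(sorted(candidates), key=gain)
        selected.append(best)
        covered |= topics.get(best, set())
        candidates.remove(best)

    return selected


relevance = {"A": 0.9, "B": 0.85, "C": 0.6}
topics = {"A": {"sports", "news"}, "B": {"sports", "news"}, "C": {"finance"}}
print(select_sources(relevance, topics, k=2))
```

With these made-up scores, source B is nearly as relevant as A but covers the same subjects, so the sketch picks A and then C; setting `trade_off=1.0` ignores diversity and picks A and B instead.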
Foundation items: National Natural Science Foundation of China (61462037, 61562032, 61173146, 61363039, 61363010); Natural Science Foundation of Jiangxi Province of China (20152ACB20003); Science and Technology Landing Plan of Colleges in Jiangxi Province of China (KJLD12022, KJLD14035)
Reference text:

DENG Song, WAN Chang-Xuan. Non-Cooperative Deep Web Data Source Selection Based on Subject and Probability Model. Journal of Software, 2017, 28(12): 3241-3256