Optimizing Top-k Similarity Join Algorithm
Author:
Affiliation:

Clc Number:

Fund Project:

National Natural Science Foundation of China (61370205); Shanghai Municipal Natural Science Foundation (13ZR1400800); The Fundamental Research Funds for the Central Universities

  • Article
  • |
  • Figures
  • |
  • Metrics
  • |
  • Reference
  • |
  • Related
  • |
  • Cited by
  • |
  • Materials
  • |
  • Comments
    Abstract:

    Similarity join is widely used in data cleaning, data integration and the detection of near duplicate Web pages. Existing similarity join algorithms fall into two categories:Threshold-based similarity join and Top-k similarity join. Top-k similarity join is suitable for applications in which the threshold is unknown in advance. The most efficient Top-k similarity join algorithm is Top-k-join, which is proposed by Xiao et al. In order to resolve the performance problemsof Topk-join, a novel Top-k similarity join algorithm Opt-join is proposed in this paper. By integrating the token batch processing technique into the existing event-driven framework, Opt-join reduces the cost of processing the prefix events. In addition, Opt-joinreduces the cost in hash lookup by switching the positions of the hash lookup and filtering operations. The correctness of the new algorithm is proved. Experimental results show that 1.28x-3.09xspeed-up is achieved by Opt-join compared with Topk-join. More importantly, with the increase of the record length or the k value, Opt-join surpasses Topk-join by a larger margin.

    Reference
    Related
    Cited by
Get Citation

王洪亚,杨利宏,刘晓强. Top-k相似连接算法性能优化.软件学报,2016,27(12):3051-3066

Copy
Share
Article Metrics
  • Abstract:
  • PDF:
  • HTML:
  • Cited by:
History
  • Received:June 11,2015
  • Revised:September 08,2015
  • Adopted:
  • Online: January 18,2016
  • Published:
You are the firstVisitors
Copyright: Institute of Software, Chinese Academy of Sciences Beijing ICP No. 05046678-4
Address:4# South Fourth Street, Zhong Guan Cun, Beijing 100190,Postal Code:100190
Phone:010-62562563 Fax:010-62562533 Email:jos@iscas.ac.cn
Technical Support:Beijing Qinyun Technology Development Co., Ltd.

Beijing Public Network Security No. 11040202500063