Parallel Top-k Similarity Join Algorithm on Probabilistic Data Based on Earth Mover’s Distance
DOI:
Author:
Affiliation:

Clc Number:

Fund Project:

  • Article
  • |
  • Figures
  • |
  • Metrics
  • |
  • Reference
  • |
  • Related
  • |
  • Cited by
  • |
  • Materials
  • |
  • Comments
    Abstract:

    Many new data applications, such as wireless sensor networks, and traditional data applications that process images produce massive probabilistic data. In probabilistic data management, the Top-k similarity join operation returns the most similar k pairs of probabilistic data and thus has important value. Histogram is one of the most frequently-used data models for representing probabilistic data. Earth Mover's Distance (EMD) is more robust in quantifying the similarities between histogram-represented probabilistic data. However, EMD computation has a cubic time complexity, which brings great challenges to the EMD-based Top-k similarity joins. Based on the MapReduce framework, this paper utilizes the good properties of EMD's dual problem and proposes two EMD-based Top-k similarity join algorithms on massive probabilistic dataset. A baseline solution named Top-k BNLJ algorithm is first developed on the idea of block-nested-loop join, and the novel Top-k DLPJ algorithm is then built on the improved idea of data locality preserving data partition strategy which significantly reduces the amount of data transferred during MapReduce jobs. Extensive experiments on large real-world datasets, with millions of probabilistic data, have verified the efficiency, effectiveness and scalability of Top-k DLPJ algorithm.

    Reference
    Related
    Cited by
Get Citation

雷斌,许嘉,谷峪,于戈.概率数据上基于EMD距离的并行Top-k相似性连接算法.软件学报,2013,24(S2):188-199

Copy
Share
Article Metrics
  • Abstract:
  • PDF:
  • HTML:
  • Cited by:
History
  • Received:March 15,2013
  • Revised:July 11,2013
  • Adopted:
  • Online: January 02,2014
  • Published:
You are the firstVisitors
Copyright: Institute of Software, Chinese Academy of Sciences Beijing ICP No. 05046678-4
Address:4# South Fourth Street, Zhong Guan Cun, Beijing 100190,Postal Code:100190
Phone:010-62562563 Fax:010-62562533 Email:jos@iscas.ac.cn
Technical Support:Beijing Qinyun Technology Development Co., Ltd.

Beijing Public Network Security No. 11040202500063