Inconsistency Detection in Distributed Big Data

doi:10.13328/j.cnki.jos.005052


Volume 27, Issue 8, 2016, pp. 2068-2085. DOI:10.13328/j.cnki.jos.005052


Fund Project: National Program on Key Basic Research Project of China (973) (2012CB316203); National Natural Science Foundation of China (61472321, 61332006, 61502390); National High-Tech R&D Program of China (863) (2015AA015307); Basic Research Fund of Northwestern Polytechnical University of China (3102014JSJ0005, 3102014JSJ0013)


Abstract:

Data in relational databases may be inconsistent, and violations of functional dependencies are a major data quality problem. Finding inconsistent data in a relational database therefore requires detecting functional dependency violations. In a centralized database, violations are easy to detect with SQL-based techniques, although detection efficiency is not high. Checking for inconsistencies in a distributed database is far more challenging: data shipment must be taken into account, and distributing the detection tasks is itself a difficult problem. These problems become more prominent with big data. This paper proposes a novel approach for detecting violations of a single functional dependency in distributed big data, and provides a cost model for inconsistency detection. To reduce data shipment and response time, the distributed data are preprocessed based on equivalence classes. Because the inconsistency detection problem is NP-hard, so that an optimal solution cannot be found in polynomial time unless P = NP, this work provides a 3/2-approximation algorithm. An approach for detecting violations of multiple functional dependencies in distributed big data is then developed based on minimal set cover theory. It detects violations of multiple functional dependencies in parallel after a single traversal of the data, and it incorporates load balancing into the detection process. Experiments on real-world and generated datasets show that, compared with previous detection methods and a naive method based on the Hadoop framework, the presented approach is more efficient and scales to big data.
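To make the notion of a functional dependency violation concrete, the following is a minimal single-machine sketch, not the paper's distributed algorithm. It groups tuples into equivalence classes on the left-hand-side attributes and reports every class whose members disagree on the right-hand side. The function name `find_fd_violations` and the sample `zip -> city` dependency are assumptions made for illustration only.

```python
from collections import defaultdict


def find_fd_violations(rows, lhs, rhs):
    """Return the equivalence classes (keyed by LHS values) that violate lhs -> rhs.

    Hypothetical helper for illustration; rows are dicts mapping attribute
    names to values, lhs and rhs are lists of attribute names.
    """
    # Partition the tuples into equivalence classes on the LHS attributes.
    classes = defaultdict(list)
    for row in rows:
        key = tuple(row[attr] for attr in lhs)
        classes[key].append(row)

    # A class violates the dependency if it holds more than one distinct RHS value.
    violations = {}
    for key, members in classes.items():
        rhs_values = {tuple(m[attr] for attr in rhs) for m in members}
        if len(rhs_values) > 1:
            violations[key] = members
    return violations


# Sample data (invented): the first two tuples agree on zip but not on city.
rows = [
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "Newark"},
    {"zip": "94105", "city": "San Francisco"},
]
violations = find_fd_violations(rows, lhs=["zip"], rhs=["city"])
```

In a distributed setting the tuples of one equivalence class may reside on different sites, which is why data shipment and the assignment of detection tasks become the dominant costs discussed in the abstract.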


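Citation:

Li Weibang, Li Zhanhuai, Chen Qun, Yang Jingying, Jiang Tao. Inconsistency Detection in Distributed Big Data. Journal of Software (软件学报), 2016, 27(8): 2068-2085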


History

Received: August 7, 2015
Revised: February 23, 2016
Online: March 16, 2016

