›› 2016,Vol. 31 ›› Issue (4): 741-758.doi: 10.1007/s11390-016-1660-4

所属专题: Data Management and Data Mining Computer Networks and Distributed Computing

• Special Section on Selected Paper from NPC 2011 • 上一篇    下一篇

一种分布式环境下基于内容相关数据的一致性修复方法

Yue-Feng Du, Student Member, CCF, Member, ACM, De-Rong Shen, Senior Member, CCF, Member, ACM, IEEE Tie-Zheng Nie, Yue Kou, Member, CCF, ACM, and Ge Yu, Fellow, CCF, Member, ACM, IEEE   

  1. College of Information Science and Engineering, Northeastern University, Shenyang 110819, China
  • 收稿日期:2015-12-07 修回日期:2016-04-22 出版日期:2016-07-05 发布日期:2016-07-05
  • 作者简介:Yue-Feng Du is a currently Ph.D. candidate in the College of Information Science & Engineering, Northeastern University, Shenyang, where he received his M.S. degree in computer software and theory in 2012. His research interests include data quality and data integration.
  • 基金资助:

    This research was supported by the National Basic Research 973 Program of China under Grant No. 2012CB316201, the National Natural Science Foundation of China under Grant Nos. 61033007 and 61472070, and the Fundamental Research Funds for the Central Universities of China under Grant No. N150408001-3.

Content-Related Repairing of Inconsistencies in Distributed Data

Yue-Feng Du, Student Member, CCF, Member, ACM, De-Rong Shen, Senior Member, CCF, Member, ACM, IEEE Tie-Zheng Nie, Yue Kou, Member, CCF, ACM, and Ge Yu, Fellow, CCF, Member, ACM, IEEE   

  1. College of Information Science and Engineering, Northeastern University, Shenyang 110819, China
  • Received:2015-12-07 Revised:2016-04-22 Online:2016-07-05 Published:2016-07-05
  • About author:Yue-Feng Du is a currently Ph.D. candidate in the College of Information Science & Engineering, Northeastern University, Shenyang, where he received his M.S. degree in computer software and theory in 2012. His research interests include data quality and data integration.
  • Supported by:

    This research was supported by the National Basic Research 973 Program of China under Grant No. 2012CB316201, the National Natural Science Foundation of China under Grant Nos. 61033007 and 61472070, and the Fundamental Research Funds for the Central Universities of China under Grant No. N150408001-3.

条件函数依赖(CFDs)是一种关键的数据一致性分析技术。由于CFDs没有考虑数据之间的内容关联性,所以在检测的过程中可能会遗漏部分潜在的一致性错误。内容相关的条件函数依赖(CCFDs)是一种特殊的CFDs,CCFDs通过合并CFDs并利用关联数据找出数据中的潜在错误。对于数据一致性清洗而言,数据检测和数据修复是相互作用的。一方面,检测捕捉冲突数据;另一方面,修复更正其中的错误。但是,修复的过程中也可能会产生新的不一致数据。另外,在现实生活中的数据通常是松散的并且分布存储在不同的数据源上。这样会增加清洗过程的传输代价。基于此,本文的工作就是解决分布式环境下内容关联数据之间的一致性修复问题,进而提出了一种修复框架。框架中包含检测和修复两个方法,并且迭代地执行这两个过程。检测方法会返回发生冲突的CCFDs规则并从中确定需要优先处理的冲突规则。本文提出了一种修复代价模型,并且证明了最小代价修复问题是一个NP完全问题。所以本文采用一种启发式方法对数据进行修复。为了提高修复效率和准确率,本文提出了两种优化结构:a)差别值,一种代替真实值进行数据传输的数据结构;b)规则序列,一种决定修复顺序的数据结构。最后,通过在两组真实数据上进行实验,证明了本方法比CFDs方法更为有效。

Abstract: Conditional functional dependencies (CFDs) are a critical technique for detecting inconsistencies while they may ignore some potential inconsistencies without considering the content relationship of data. Content-related conditional functional dependencies (CCFDs) are a type of special CFDs, which combine content-related CFDs and detect potential inconsistencies by putting content-related data together. In the process of cleaning inconsistencies, detection and repairing are interactive:1) detection catches inconsistencies, 2) repairing corrects caught inconsistencies while may bring new inconsistencies. Besides, data are often fragmented and distributed into multiple sites. It consequently costs expensive shipment for inconsistencies cleaning. In this paper, our aim is to repair inconsistencies in distributed content-related data. We propose a framework consisting of an inconsistencies detection method and an inconsistencies repairing method, which work iteratively. The detection method marks the violated CCFDs for computing the inconsistencies which should be repaired preferentially. Based on the repairing-cost model presented in this paper, we prove that the minimum-cost repairing using CCFDs is NP-complete. Therefore, the repairing method heuristically repairs the inconsistencies with minimum cost. To improve the efficiency and accuracy of repairing, we propose distinct values and rules sequences. Distinct values make less data shipments than real data for communication. Rules sequences determine appropriate repairing sequences to avoid some incorrect repairs. Our solution is proved to be more effective than CFDs by empirical evaluation on two real-life datasets.

[1] Fan W, Geerts F. Foundations of Data Quality Management. Morgan & Claypool, 2012.

[2] Fan W, Geerts F, Jia X, Kementsietsidis A. Conditional functional dependencies for capturing data inconsistencies. ACM Transactions on Database Systems (TODS), 2008, 33(2):Article No. 6

[3] Papenbrock T, Ehrlich J, Marten J, Neubert T, Rudolph J P, Schönberg M, Zwiener J, Naumann F. Functional dependency discovery:An experimental evaluation of seven algorithms. Proceedings of the VLDB Endowment, 2015, 8(10):1082-1093.

[4] Cong G, Fan W, Geerts F, Jia X, Ma S. Improving data quality:Consistency and accuracy. In Proc. the 33rd International Conference on Very Large Data Bases, Sept. 2007, pp.315-326.

[5] Alwan A A, Ibrahim H, Udzir N I. Improved integrity constraints checking in distributed databases by exploiting local checking. Journal of Computer Science and Technology, 2009, 24(4):665-674.

[6] Du Y, Shen D, Nie T, Kou Y, Yu G. Discovering condition-combined functional dependency rules. In Proc. the 16th APWeb, Sept. 2014, pp.247-257.

[7] Fan W, Li J, Tang N, Yu W. Incremental detection of inconsistencies in distributed data. IEEE Transactions on Knowledge and Data Engineering, 2014, 26(6):1367-1383.

[8] Fan W, Ma S, Tang N, Yu W. Interaction between record matching and data repairing. Journal of Data and Information Quality (JDIQ), 2014, 4(4):Article No. 16.

[9] Christen P. A survey of indexing techniques for scalable record linkage and deduplication. IEEE Transactions on Knowledge and Data Engineering, 2012, 24(9):1537-1555.

[10] Li X, Dong X L, Lyons K, Meng W, Srivastava D. Truth finding on the deep web:Is the problem solved? Proceedings of the VLDB Endowment, 2012, 6(2):97-108.

[11] Bohannon P, Fan W, Flaster M, Rastogi R. A cost-based model and effective heuristic for repairing constraints by value modification. In Proc. the 31st ACM SIGMOD International Conference on Management of Data, June 2005, pp.143-154.

[12] Wu A H, Tan Z J, Wang W. Annotation based query answer over inconsistent database. Journal of Computer Science and Technology, 2012, 25(3):469-481.

[13] Chiang F, Miller R J. A unified model for data and constraint repair. In Proc. the 27th International Conference on Data Engineering, April 2011, pp.446-457.

[14] Maher M J. Constrained dependencies. Theoretical Computer Science, 1997, 173(1):113-149.

[15] Bravo L, Fan W, Ma S. Extending dependencies with conditions. In Proc. the 33rd International Conference on Very Large Data Bases, Sept. 2007, pp.243-254.

[16] Chu X, Ilyas I F, Papotti P. Discovering denial constraints. Proceedings of the VLDB Endowment, 2013, 6(13):1498-1509.

[17] Wang J, Tang N. Towards dependable data repairing with fixing rules. In Proc. the 40th ACM SIGMOD International Conference on Management of Data, June 2014, pp.457-468.

[18] Wu L, Yuan L, You J. Survey of large-scale data management systems for big data applications. Journal of Computer Science and Technology, 2015, 30(1):163-183.

[19] Chen Q, Tan Z, He C, Sha C, Wang W. Repairing functional dependency violations in distributed data. In Proc. the 20th Database Systems for Advanced Applications, April 2015, pp.441-457.

[20] Chiang F, Miller R J. Discovering data quality rules. Proceedings of the VLDB Endowment, 2008, 1(1):1166-1177.

[21] Fan W, Geerts F, Ma S, Müller H. Detecting inconsistencies in distributed data. In Proc. the 26th International Conference on Data Engineering (ICDE), March 2010, pp.64-75.

[22] Dallachiesa M, Ebaid A, Eldawy A, Elmagarmid A, Ilyas I F, Ouzzani M, Tang N. NADEEF:A commodity data cleaning system. In Proc. the 39th ACM SIGMOD International Conference on Management of Data, June 2013, pp.541-552.

[23] Khayyat Z, Ilyas I F, Jindal A, Madden S, Ouzzani M, Papotti P, Quiané-Ruiz J A, Tang N, Yin S. BigDansing:A system for big data cleansing. In Proc. the 2015 ACM SIGMOD International Conference on Management of Data, May 2015, pp.1215-1230.

[24] Chomicki J, Marcinkowski J. Minimal-change integrity maintenance using tuple deletions. Information and Computation, 2005, 197(1/2):90-121.

[25] Cormen T H, Leiserson C E, Rivest R L, Stein C. Introduction to Algorithms. MIT Press and McGraw-Hill, 2001.
No related articles found!
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
[1] 沈一栋;. A Fixpoint Semantics for Stratified Databases[J]. , 1993, 8(2): 12 -21 .
[2] 马军; 马绍汉;. An O(k~2n~2) Algorithm to Find a k-Partition in a k-Connected Graph[J]. , 1994, 9(1): 86 -91 .
[3] 唐志敏; 夏培肃;. A Maximum Time Difference Pipelined Arithmetic Unit Based on CMOS Gate Array[J]. , 1995, 10(2): 97 -103 .
[4] 陈意云;. Head Boundedness of Nonterminating Rewritings[J]. , 1995, 10(3): 281 -284 .
[5] 高虹;. Transformation List for SGML Application[J]. , 1995, 10(5): 455 -462 .
[6] . WWAN和WLAN间的一种高效的垂直切换触发算法[J]. , 2007, 22(1): 114 -120 .
[7] . 软件工程中的开发项目工期模型[J]. , 2007, 22(3): 348 -357 .
[8] . 有效优化多个子空间Skyline查询[J]. , 2008, 23(1): 103 -111 .
[9] . 暂缺[J]. , 2008, 23(4 ): 481 -496 .
[10] 黄永勤, 李宏亮, 谢向辉, 钱磊, 郝子宇, 过锋, 张昆. 面向高性能计算机体系结构设计的系统级并行模拟平台ArchSim[J]. , 2009, 24(5): 901 -912 .
版权所有 © 《计算机科学技术学报》编辑部
本系统由北京玛格泰克科技发展有限公司设计开发 技术支持:support@magtech.com.cn
总访问量: