|
›› 2016,Vol. 31 ›› Issue (4): 741-758.doi: 10.1007/s11390-016-1660-4
所属专题: Data Management and Data Mining; Computer Networks and Distributed Computing
• Special Section on Selected Paper from NPC 2011 • 上一篇 下一篇
Yue-Feng Du, Student Member, CCF, Member, ACM, De-Rong Shen, Senior Member, CCF, Member, ACM, IEEE Tie-Zheng Nie, Yue Kou, Member, CCF, ACM, and Ge Yu, Fellow, CCF, Member, ACM, IEEE
Yue-Feng Du, Student Member, CCF, Member, ACM, De-Rong Shen, Senior Member, CCF, Member, ACM, IEEE Tie-Zheng Nie, Yue Kou, Member, CCF, ACM, and Ge Yu, Fellow, CCF, Member, ACM, IEEE
条件函数依赖(CFDs)是一种关键的数据一致性分析技术。由于CFDs没有考虑数据之间的内容关联性,所以在检测的过程中可能会遗漏部分潜在的一致性错误。内容相关的条件函数依赖(CCFDs)是一种特殊的CFDs,CCFDs通过合并CFDs并利用关联数据找出数据中的潜在错误。对于数据一致性清洗而言,数据检测和数据修复是相互作用的。一方面,检测捕捉冲突数据;另一方面,修复更正其中的错误。但是,修复的过程中也可能会产生新的不一致数据。另外,在现实生活中的数据通常是松散的并且分布存储在不同的数据源上。这样会增加清洗过程的传输代价。基于此,本文的工作就是解决分布式环境下内容关联数据之间的一致性修复问题,进而提出了一种修复框架。框架中包含检测和修复两个方法,并且迭代地执行这两个过程。检测方法会返回发生冲突的CCFDs规则并从中确定需要优先处理的冲突规则。本文提出了一种修复代价模型,并且证明了最小代价修复问题是一个NP完全问题。所以本文采用一种启发式方法对数据进行修复。为了提高修复效率和准确率,本文提出了两种优化结构:a)差别值,一种代替真实值进行数据传输的数据结构;b)规则序列,一种决定修复顺序的数据结构。最后,通过在两组真实数据上进行实验,证明了本方法比CFDs方法更为有效。
[1] Fan W, Geerts F. Foundations of Data Quality Management. Morgan & Claypool, 2012.[2] Fan W, Geerts F, Jia X, Kementsietsidis A. Conditional functional dependencies for capturing data inconsistencies. ACM Transactions on Database Systems (TODS), 2008, 33(2):Article No. 6[3] Papenbrock T, Ehrlich J, Marten J, Neubert T, Rudolph J P, Schönberg M, Zwiener J, Naumann F. Functional dependency discovery:An experimental evaluation of seven algorithms. Proceedings of the VLDB Endowment, 2015, 8(10):1082-1093.[4] Cong G, Fan W, Geerts F, Jia X, Ma S. Improving data quality:Consistency and accuracy. In Proc. the 33rd International Conference on Very Large Data Bases, Sept. 2007, pp.315-326.[5] Alwan A A, Ibrahim H, Udzir N I. Improved integrity constraints checking in distributed databases by exploiting local checking. Journal of Computer Science and Technology, 2009, 24(4):665-674.[6] Du Y, Shen D, Nie T, Kou Y, Yu G. Discovering condition-combined functional dependency rules. In Proc. the 16th APWeb, Sept. 2014, pp.247-257.[7] Fan W, Li J, Tang N, Yu W. Incremental detection of inconsistencies in distributed data. IEEE Transactions on Knowledge and Data Engineering, 2014, 26(6):1367-1383.[8] Fan W, Ma S, Tang N, Yu W. Interaction between record matching and data repairing. Journal of Data and Information Quality (JDIQ), 2014, 4(4):Article No. 16.[9] Christen P. A survey of indexing techniques for scalable record linkage and deduplication. IEEE Transactions on Knowledge and Data Engineering, 2012, 24(9):1537-1555.[10] Li X, Dong X L, Lyons K, Meng W, Srivastava D. Truth finding on the deep web:Is the problem solved? Proceedings of the VLDB Endowment, 2012, 6(2):97-108.[11] Bohannon P, Fan W, Flaster M, Rastogi R. A cost-based model and effective heuristic for repairing constraints by value modification. In Proc. the 31st ACM SIGMOD International Conference on Management of Data, June 2005, pp.143-154.[12] Wu A H, Tan Z J, Wang W. Annotation based query answer over inconsistent database. Journal of Computer Science and Technology, 2012, 25(3):469-481.[13] Chiang F, Miller R J. A unified model for data and constraint repair. In Proc. the 27th International Conference on Data Engineering, April 2011, pp.446-457.[14] Maher M J. Constrained dependencies. Theoretical Computer Science, 1997, 173(1):113-149.[15] Bravo L, Fan W, Ma S. Extending dependencies with conditions. In Proc. the 33rd International Conference on Very Large Data Bases, Sept. 2007, pp.243-254.[16] Chu X, Ilyas I F, Papotti P. Discovering denial constraints. Proceedings of the VLDB Endowment, 2013, 6(13):1498-1509.[17] Wang J, Tang N. Towards dependable data repairing with fixing rules. In Proc. the 40th ACM SIGMOD International Conference on Management of Data, June 2014, pp.457-468.[18] Wu L, Yuan L, You J. Survey of large-scale data management systems for big data applications. Journal of Computer Science and Technology, 2015, 30(1):163-183.[19] Chen Q, Tan Z, He C, Sha C, Wang W. Repairing functional dependency violations in distributed data. In Proc. the 20th Database Systems for Advanced Applications, April 2015, pp.441-457.[20] Chiang F, Miller R J. Discovering data quality rules. Proceedings of the VLDB Endowment, 2008, 1(1):1166-1177.[21] Fan W, Geerts F, Ma S, Müller H. Detecting inconsistencies in distributed data. In Proc. the 26th International Conference on Data Engineering (ICDE), March 2010, pp.64-75.[22] Dallachiesa M, Ebaid A, Eldawy A, Elmagarmid A, Ilyas I F, Ouzzani M, Tang N. NADEEF:A commodity data cleaning system. In Proc. the 39th ACM SIGMOD International Conference on Management of Data, June 2013, pp.541-552.[23] Khayyat Z, Ilyas I F, Jindal A, Madden S, Ouzzani M, Papotti P, Quiané-Ruiz J A, Tang N, Yin S. BigDansing:A system for big data cleansing. In Proc. the 2015 ACM SIGMOD International Conference on Management of Data, May 2015, pp.1215-1230.[24] Chomicki J, Marcinkowski J. Minimal-change integrity maintenance using tuple deletions. Information and Computation, 2005, 197(1/2):90-121.[25] Cormen T H, Leiserson C E, Rivest R L, Stein C. Introduction to Algorithms. MIT Press and McGraw-Hill, 2001. |
No related articles found! |
|
版权所有 © 《计算机科学技术学报》编辑部 本系统由北京玛格泰克科技发展有限公司设计开发 技术支持:support@magtech.com.cn 总访问量: |