一种分布式环境下基于内容相关数据的一致性修复方法
Content-Related Repairing of Inconsistencies in Distributed Data
-
摘要: 条件函数依赖(CFDs)是一种关键的数据一致性分析技术。由于CFDs没有考虑数据之间的内容关联性,所以在检测的过程中可能会遗漏部分潜在的一致性错误。内容相关的条件函数依赖(CCFDs)是一种特殊的CFDs,CCFDs通过合并CFDs并利用关联数据找出数据中的潜在错误。对于数据一致性清洗而言,数据检测和数据修复是相互作用的。一方面,检测捕捉冲突数据;另一方面,修复更正其中的错误。但是,修复的过程中也可能会产生新的不一致数据。另外,在现实生活中的数据通常是松散的并且分布存储在不同的数据源上。这样会增加清洗过程的传输代价。基于此,本文的工作就是解决分布式环境下内容关联数据之间的一致性修复问题,进而提出了一种修复框架。框架中包含检测和修复两个方法,并且迭代地执行这两个过程。检测方法会返回发生冲突的CCFDs规则并从中确定需要优先处理的冲突规则。本文提出了一种修复代价模型,并且证明了最小代价修复问题是一个NP完全问题。所以本文采用一种启发式方法对数据进行修复。为了提高修复效率和准确率,本文提出了两种优化结构:a)差别值,一种代替真实值进行数据传输的数据结构;b)规则序列,一种决定修复顺序的数据结构。最后,通过在两组真实数据上进行实验,证明了本方法比CFDs方法更为有效。Abstract: Conditional functional dependencies (CFDs) are a critical technique for detecting inconsistencies while they may ignore some potential inconsistencies without considering the content relationship of data. Content-related conditional functional dependencies (CCFDs) are a type of special CFDs, which combine content-related CFDs and detect potential inconsistencies by putting content-related data together. In the process of cleaning inconsistencies, detection and repairing are interactive:1) detection catches inconsistencies, 2) repairing corrects caught inconsistencies while may bring new inconsistencies. Besides, data are often fragmented and distributed into multiple sites. It consequently costs expensive shipment for inconsistencies cleaning. In this paper, our aim is to repair inconsistencies in distributed content-related data. We propose a framework consisting of an inconsistencies detection method and an inconsistencies repairing method, which work iteratively. The detection method marks the violated CCFDs for computing the inconsistencies which should be repaired preferentially. Based on the repairing-cost model presented in this paper, we prove that the minimum-cost repairing using CCFDs is NP-complete. Therefore, the repairing method heuristically repairs the inconsistencies with minimum cost. To improve the efficiency and accuracy of repairing, we propose distinct values and rules sequences. Distinct values make less data shipments than real data for communication. Rules sequences determine appropriate repairing sequences to avoid some incorrect repairs. Our solution is proved to be more effective than CFDs by empirical evaluation on two real-life datasets.