Determining the Real Data Completeness of a Relational Dataset
Abstract: Low data quality is a serious problem in the era of big data: it can severely reduce the usability of data, mislead or bias querying, analysis, and mining, and lead to huge losses. Incomplete data is common among low-quality data, so it is necessary to determine the data completeness of a dataset to guide follow-up operations on it. Little existing work focuses on the completeness of a dataset, and the work that does treats all missing values as unknown. In this paper, we study how to determine the real data completeness of a relational dataset. By taking advantage of given functional dependencies, we aim to determine some missing attribute values from other tuples and thereby identify the attribute cells that are truly missing. We propose a data completeness model, formalize the problem of determining the real data completeness of a relational dataset, and give a lower bound on the time complexity of this problem. We propose two optimal algorithms that determine the data completeness of a dataset in different cases. We empirically show the effectiveness and scalability of our algorithms on both real-world and synthetic data.
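To make the core idea concrete, the following is a minimal sketch of how functional dependencies can recover some missing cells from other tuples, leaving only the truly missing ones. It is not the paper's algorithm and makes no claim to its optimality. The helper names (`infer_missing`, `completeness`), the sample relation, the dependencies `zip -> city` and `city -> state`, and the use of the fraction of non-missing cells as a completeness measure are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Row = Dict[str, Optional[str]]      # None marks a missing cell
FD = Tuple[Sequence[str], str]      # (left-hand attributes, right-hand attribute)


def infer_missing(rows: List[Row], fds: List[FD]) -> List[Row]:
    """Return a copy of rows in which a missing cell is filled whenever
    an FD and another tuple with the same left-hand values determine it."""
    filled = [dict(r) for r in rows]
    changed = True
    while changed:  # repeat: a newly filled cell may enable another FD
        changed = False
        for lhs, rhs in fds:
            # Map each fully known left-hand value combination to a known right-hand value.
            known: Dict[Tuple[Optional[str], ...], str] = {}
            for r in filled:
                key = tuple(r[a] for a in lhs)
                if None not in key and r[rhs] is not None:
                    known.setdefault(key, r[rhs])
            # Fill cells whose value is determined by some other tuple.
            for r in filled:
                key = tuple(r[a] for a in lhs)
                if r[rhs] is None and None not in key and key in known:
                    r[rhs] = known[key]
                    changed = True
    return filled


def completeness(rows: List[Row]) -> float:
    """Fraction of cells that are not missing (an assumed, simplified measure)."""
    cells = [v for r in rows for v in r.values()]
    return sum(v is not None for v in cells) / len(cells)


# Hypothetical sample relation and dependencies, for illustration only.
rows: List[Row] = [
    {"zip": "10001", "city": "New York", "state": "NY"},
    {"zip": "10001", "city": None, "state": None},
    {"zip": None, "city": "Boston", "state": None},
    {"zip": "02108", "city": "Boston", "state": "MA"},
]
fds: List[FD] = [(["zip"], "city"), (["city"], "state")]

filled = infer_missing(rows, fds)
print(completeness(rows), completeness(filled))
```

In this sample, four of the twelve cells are missing at first. Three of them are determined by other tuples through the two dependencies, so only the `zip` of the third tuple remains truly missing. The sketch assumes the data is consistent with the dependencies; it takes the first known value for each left-hand combination and does not detect conflicts.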