We use cookies to improve your experience with our site.

Indexed in:

SCIE, EI, Scopus, INSPEC, DBLP, CSCD, etc.

Submission System
(Author / Reviewer / Editor)
Yong-Nan Liu, Jian-Zhong Li, Zhao-Nian Zou. Determining the Real Data Completeness of a Relational Dataset[J]. Journal of Computer Science and Technology, 2016, 31(4): 720-740. DOI: 10.1007/s11390-016-1659-x
Citation: Yong-Nan Liu, Jian-Zhong Li, Zhao-Nian Zou. Determining the Real Data Completeness of a Relational Dataset[J]. Journal of Computer Science and Technology, 2016, 31(4): 720-740. DOI: 10.1007/s11390-016-1659-x

Determining the Real Data Completeness of a Relational Dataset

Funds: The work was supported by the National Basic Research 973 Program of China under Grant No. 2011CB036202 and the National Natural Science Foundation of China under Grant No. 61532015.
More Information
  • Author Bio:

    Yong-Nan Liu received his B.S. and M.S. degrees in computer science and technology from Harbin Institute of Technology, Harbin, in 2011 and 2013, respectively. Currently he is a Ph.D. candidate of Harbin Institute of Technology, Harbin. His research interests include data quality and data mining.

  • Received Date: February 24, 2016
  • Revised Date: May 16, 2016
  • Published Date: July 04, 2016
  • Low quality of data is a serious problem in the new era of big data, which can severely reduce the usability of data, mislead or bias the querying, analyzing and mining, and leads to huge loss. Incomplete data is common in low quality data, and it is necessary to determine the data completeness of a dataset to provide hints for follow-up operations on it. Little existing work focuses on the completeness of a dataset, and such work views all missing values as unknown values. In this paper, we study how to determine real data completeness of a relational dataset. By taking advantage of given functional dependencies, we aim to determine some missing attribute values by other tuples and capture the really missing attribute cells. We propose a data completeness model, formalize the problem of determining the real data completeness of a relational dataset, and give a lower bound of the time complexity of this problem. Two optimal algorithms to determine the data completeness of a dataset for different cases are proposed. We empirically show the effectiveness and the scalability of our algorithms on both real-world data and synthetic data.
  • [1]
    Rahm E, Do H H. Data cleaning:Problems and current approaches. IEEE Data Eng. Bull., 2000, 23(4):3-13.
    [2]
    Eckerson W W. Data warehousing special report:Data quality and the bottom line. Application Development Trends, 2002, (5):1-9.
    [3]
    Poleto F Z, Singer J M, Paulino C D. Missing data mechanisms and their implications on the analysis of categorical data. Statistics and Computing, 2011, 21(1):31-43.
    [4]
    Chen K, Chen H, Conway N, Hellerstein J M, Parikh T S. Usher:Improving data quality with dynamic forms. IEEE Transactions on Knowledge and Data Engineering, 2011, 23(8):1138-1153.
    [5]
    Arocena P C, Glavic B, Miller R J. Value invention in data exchange. In Proc. the 2013 International Conference on Management of Data, June 2013, pp.157-168.
    [6]
    Dong X L, Gabrilovich E, Murphy K, Dang V, Horn W, Lugaresi C, Sun S, Zhang W. Knowledge-based trust:Estimating the trustworthiness of web sources. Proceedings of the VLDB Endowment, 2015, 8(9):938-949.
    [7]
    Wolf G, Khatri H, Chokshi B, Fan J, Chen Y, Kambhampati S. Query processing over incomplete autonomous databases. In Proc. the 33rd International Conference on Very Large Data Bases, Sept. 2007, pp.651-662.
    [8]
    Motro A. Integrity=Validity + Completeness. ACM Transactions on Database Systems, 1989, 14(4):480-502.
    [9]
    Yang K, Li J, Wang C. Missing values estimation in microarray data with partial least squares regression. In Proc. the 6th International Conference on Computational Science, May 2006, pp.662-669.
    [10]
    Beskales G, Ilyas I F, Golab L. Sampling the repairs of functional dependency violations under hard constraints. Proceedings of the VLDB Endowment, 2010, 3(1/2):197-207.
    [11]
    Li P, Dong X, Maurino A, Srivastava D. Linking temporal records. Proceedings of the VLDB Endowment, 2011, 4(11):956-967.
    [12]
    Motro A, Rakov I. Not all answers are equally good:Estimating the quality of database answers. In Proc. the 1997 Flexible Query Answering Systems, June 1997, pp.1-21.
    [13]
    Naumann F, Freytag J C, Leser U. Completeness of integrated information sources. Information Systems, 2004, 29(7):583-615.
    [14]
    Biswas J, Naumann F, Qiu Q. Assessing the completeness of sensor data. In Proc. the 11th International Conference on Database Systems for Advanced Applications, April 2006, pp.717-732.
    [15]
    Levy A Y. Obtaining complete answers from incomplete databases. In Proc. the 22nd International Conference on Very Large Data Bases, Sept. 1996, pp.402-412.
    [16]
    Razniewski S, Nutt W. Completeness of queries over incomplete databases. Proceedings of the VLDB Endowment, 2011, 4(11):749-760.
    [17]
    Fan W, Geerts F. Capturing missing tuples and missing values. In Proc. the 29th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, June 2010, pp.169-178.
    [18]
    Fan W, Geerts F. Relative information completeness. ACM Transactions on Database Systems, 2010, 35(4):Article No. 27.
    [19]
    Prokoshyna N, Szlichta J, Chiang F, Miller R J, Srivastava D. Combining quantitative and logical data cleaning. Proceedings of the VLDB Endowment, 2015, 9(4):300-311.
    [20]
    Abiteboul S, Hull R, Vianu V. Foundations of Databases:The Logical Level (1st edition). Addison-Wesley Longman Publishing Co., Inc., 1995.
    [21]
    Silberschatz A, Korth H, Sudarshan S. Database System Concepts (4th edition). McGraw-Hill Education, 2001.
    [22]
    Cheng S, Li J. Sampling based (epsilon, delta)-approximate aggregation algorithm in sensor networks. In Proc. the 29th International Conference on Distributed Computing Systems, June 2009, pp.273-280.
    [23]
    Khalefa M E, Mokbel M F, Levandoski J J. Skyline query processing for incomplete data. In Proc. the 24th International Conference on Data Engineering, April 2008, pp.556-565.
    [24]
    Salloum M, Dong X L, Srivastava D, Tsotras V J. Online ordering of overlapping data sources. Proceedings of the VLDB Endowment, 2013, 7(3):133-144.
    [25]
    Zhao B, Rubinstein B I, Gemmell J, Han J. A Bayesian approach to discovering truth from conflicting sources for data integration. Proceedings of the VLDB Endowment, 2012, 5(6):550-561.
    [26]
    Dong X L, Saha B, Srivastava D. Less is more:Selecting sources wisely for integration. In Proc. the 39th International Conference on Very Large Data Bases, August 2013, pp.37-48.
  • Related Articles

    [1]Amin Nikanjam, Adel Rahmani. Exploiting Bivariate Dependencies to Speedup Structure Learning in Bayesian Optimization Algorithm[J]. Journal of Computer Science and Technology, 2012, 27(5): 1077-1090. DOI: 10.1007/s11390-012-1285-1
    [2]Ken-Li Li, Ren-Fa Li, Qing-Hua. Optimal Parallel Algorithm for the Knapsack Problem Without Memory Conflicts[J]. Journal of Computer Science and Technology, 2004, 19(6).
    [3]SONG Jianping, HOU Zifeng, SHI Yuntao. An Optimal Multicast Algorithm for Cube-Connected Cycles[J]. Journal of Computer Science and Technology, 2000, 15(6): 572-583.
    [4]WU Jigang, JI Yongchang, CHEN Guoliang. An Optimal Online Algorithm for Half Plane Intersection[J]. Journal of Computer Science and Technology, 2000, 15(3): 295-299.
    [5]Zhi Lihong. Optimal Algorithm for Algebraic Factoring[J]. Journal of Computer Science and Technology, 1997, 12(1): 1-9.
    [6]Gao Qingshi. A Unified O(log N) and Optimal Sorting Vector Algorithm[J]. Journal of Computer Science and Technology, 1995, 10(5): 470-475.
    [7]Gao Qingshi, Liu Zhiyong. K-Dimensional Optimal Parallel Algorithm for the Solution of a General Class of Recurrence Equations[J]. Journal of Computer Science and Technology, 1995, 10(5): 417-424.
    [8]Yan Yong. An Optimal Algorithm for Solving Collision Distance Between Convex Polygons in Plane[J]. Journal of Computer Science and Technology, 1993, 8(4): 81-87.
    [9]Huang Kaiyuan. Three-Valued Diagnosable Systems: Diagnosability, Optimal Design and Fault Identification Algorithm[J]. Journal of Computer Science and Technology, 1987, 2(2): 99-114.
    [10]Li Layuan. A Routing Algorithm for Distributed Optimal Double Loop Computer Networks[J]. Journal of Computer Science and Technology, 1987, 2(2): 92-98.

Catalog

    Article views (35) PDF downloads (990) Cited by()
    Related

    /

    DownLoad:  Full-Size Img  PowerPoint
    Return
    Return