›› 2016, Vol. 31 ›› Issue (4): 720-740.doi: 10.1007/s11390-016-1659-x

Special Issue: Data Management and Data Mining

• Special Section on Data Management and Data Mining 2016 • Previous Articles     Next Articles

Determining the Real Data Completeness of a Relational Dataset

Yong-Nan Liu(刘永楠), Jian-Zhong Li(李建中), Fellow, CCF, Member, ACM, and Zhao-Nian Zou(邹兆年), Member, CCF, ACM, IEEE   

  1. School of Computer Science and Engineering, Harbin Institute of Technology, Harbin 150001, China
  • Received:2016-02-25 Revised:2016-05-17 Online:2016-07-05 Published:2016-07-05
  • About author:Yong-Nan Liu received his B.S. and M.S. degrees in computer science and technology from Harbin Institute of Technology, Harbin, in 2011 and 2013, respectively. Currently he is a Ph.D. candidate of Harbin Institute of Technology, Harbin. His research interests include data quality and data mining.
  • Supported by:

    The work was supported by the National Basic Research 973 Program of China under Grant No. 2011CB036202 and the National Natural Science Foundation of China under Grant No. 61532015.

Low quality of data is a serious problem in the new era of big data, which can severely reduce the usability of data, mislead or bias the querying, analyzing and mining, and leads to huge loss. Incomplete data is common in low quality data, and it is necessary to determine the data completeness of a dataset to provide hints for follow-up operations on it. Little existing work focuses on the completeness of a dataset, and such work views all missing values as unknown values. In this paper, we study how to determine real data completeness of a relational dataset. By taking advantage of given functional dependencies, we aim to determine some missing attribute values by other tuples and capture the really missing attribute cells. We propose a data completeness model, formalize the problem of determining the real data completeness of a relational dataset, and give a lower bound of the time complexity of this problem. Two optimal algorithms to determine the data completeness of a dataset for different cases are proposed. We empirically show the effectiveness and the scalability of our algorithms on both real-world data and synthetic data.

[1] Rahm E, Do H H. Data cleaning:Problems and current approaches. IEEE Data Eng. Bull., 2000, 23(4):3-13.

[2] Eckerson W W. Data warehousing special report:Data quality and the bottom line. Application Development Trends, 2002, (5):1-9.

[3] Poleto F Z, Singer J M, Paulino C D. Missing data mechanisms and their implications on the analysis of categorical data. Statistics and Computing, 2011, 21(1):31-43.

[4] Chen K, Chen H, Conway N, Hellerstein J M, Parikh T S. Usher:Improving data quality with dynamic forms. IEEE Transactions on Knowledge and Data Engineering, 2011, 23(8):1138-1153.

[5] Arocena P C, Glavic B, Miller R J. Value invention in data exchange. In Proc. the 2013 International Conference on Management of Data, June 2013, pp.157-168.

[6] Dong X L, Gabrilovich E, Murphy K, Dang V, Horn W, Lugaresi C, Sun S, Zhang W. Knowledge-based trust:Estimating the trustworthiness of web sources. Proceedings of the VLDB Endowment, 2015, 8(9):938-949.

[7] Wolf G, Khatri H, Chokshi B, Fan J, Chen Y, Kambhampati S. Query processing over incomplete autonomous databases. In Proc. the 33rd International Conference on Very Large Data Bases, Sept. 2007, pp.651-662.

[8] Motro A. Integrity=Validity + Completeness. ACM Transactions on Database Systems, 1989, 14(4):480-502.

[9] Yang K, Li J, Wang C. Missing values estimation in microarray data with partial least squares regression. In Proc. the 6th International Conference on Computational Science, May 2006, pp.662-669.

[10] Beskales G, Ilyas I F, Golab L. Sampling the repairs of functional dependency violations under hard constraints. Proceedings of the VLDB Endowment, 2010, 3(1/2):197-207.

[11] Li P, Dong X, Maurino A, Srivastava D. Linking temporal records. Proceedings of the VLDB Endowment, 2011, 4(11):956-967.

[12] Motro A, Rakov I. Not all answers are equally good:Estimating the quality of database answers. In Proc. the 1997 Flexible Query Answering Systems, June 1997, pp.1-21.

[13] Naumann F, Freytag J C, Leser U. Completeness of integrated information sources. Information Systems, 2004, 29(7):583-615.

[14] Biswas J, Naumann F, Qiu Q. Assessing the completeness of sensor data. In Proc. the 11th International Conference on Database Systems for Advanced Applications, April 2006, pp.717-732.

[15] Levy A Y. Obtaining complete answers from incomplete databases. In Proc. the 22nd International Conference on Very Large Data Bases, Sept. 1996, pp.402-412.

[16] Razniewski S, Nutt W. Completeness of queries over incomplete databases. Proceedings of the VLDB Endowment, 2011, 4(11):749-760.

[17] Fan W, Geerts F. Capturing missing tuples and missing values. In Proc. the 29th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, June 2010, pp.169-178.

[18] Fan W, Geerts F. Relative information completeness. ACM Transactions on Database Systems, 2010, 35(4):Article No. 27.

[19] Prokoshyna N, Szlichta J, Chiang F, Miller R J, Srivastava D. Combining quantitative and logical data cleaning. Proceedings of the VLDB Endowment, 2015, 9(4):300-311.

[20] Abiteboul S, Hull R, Vianu V. Foundations of Databases:The Logical Level (1st edition). Addison-Wesley Longman Publishing Co., Inc., 1995.

[21] Silberschatz A, Korth H, Sudarshan S. Database System Concepts (4th edition). McGraw-Hill Education, 2001.

[22] Cheng S, Li J. Sampling based (epsilon, delta)-approximate aggregation algorithm in sensor networks. In Proc. the 29th International Conference on Distributed Computing Systems, June 2009, pp.273-280.

[23] Khalefa M E, Mokbel M F, Levandoski J J. Skyline query processing for incomplete data. In Proc. the 24th International Conference on Data Engineering, April 2008, pp.556-565.

[24] Salloum M, Dong X L, Srivastava D, Tsotras V J. Online ordering of overlapping data sources. Proceedings of the VLDB Endowment, 2013, 7(3):133-144.

[25] Zhao B, Rubinstein B I, Gemmell J, Han J. A Bayesian approach to discovering truth from conflicting sources for data integration. Proceedings of the VLDB Endowment, 2012, 5(6):550-561.

[26] Dong X L, Saha B, Srivastava D. Less is more:Selecting sources wisely for integration. In Proc. the 39th International Conference on Very Large Data Bases, August 2013, pp.37-48.
No related articles found!
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!

ISSN 1000-9000(Print)

         1860-4749(Online)
CN 11-2296/TP

Home
Editorial Board
Author Guidelines
Subscription
Journal of Computer Science and Technology
Institute of Computing Technology, Chinese Academy of Sciences
P.O. Box 2704, Beijing 100190 P.R. China
Tel.:86-10-62610746
E-mail: jcst@ict.ac.cn
 
  Copyright ©2015 JCST, All Rights Reserved