We use cookies to improve your experience with our site.

Indexed in:

SCIE, EI, Scopus, INSPEC, DBLP, CSCD, etc.

Submission System
(Author / Reviewer / Editor)
An-Zhen Zhang, Jian-Zhong Li, Hong Gao. Interval Estimation for Aggregate Queries on Incomplete Data[J]. Journal of Computer Science and Technology, 2019, 34(6): 1203-1216. DOI: 10.1007/s11390-019-1970-4
Citation: An-Zhen Zhang, Jian-Zhong Li, Hong Gao. Interval Estimation for Aggregate Queries on Incomplete Data[J]. Journal of Computer Science and Technology, 2019, 34(6): 1203-1216. DOI: 10.1007/s11390-019-1970-4

Interval Estimation for Aggregate Queries on Incomplete Data

Funds: This work was supported by the Young Scientists Fund of the National Natural Science Foundation of China under Grant No. 61702344.
More Information
  • Author Bio:

    An-Zhen Zhang is a lecturer at Shenyang Aerospace University, Shenyang. She received her Ph.D. degree in computer software and theory from Harbin Institute of Technology, Harbin, in 2019. Her research interests include data quality and cloud computing.

  • Received Date: December 25, 2018
  • Revised Date: September 11, 2019
  • Published Date: November 15, 2019
  • Incomplete data has been a longstanding issue in the database community, and the subject is yet poorly handled by both theories and practices. One common way to cope with missing values is to complete their imputation (filling in) as a preprocessing step before analyses. Unfortunately, not a single imputation method could impute all missing values correctly in all cases. Users could hardly trust the query result on such complete data without any confidence guarantee. In this paper, we propose to directly estimate the aggregate query result on incomplete data, rather than to impute the missing values. An interval estimation, composed of the upper and the lower bound of aggregate query results among all possible interpretations of missing values, is presented to the end users. The ground-truth aggregate result is guaranteed to be among the interval. We believe that decision support applications could benefit significantly from the estimation, since they can tolerate inexact answers, as long as there are clearly defined semantics and guarantees associated with the results. Our main techniques are parameter-free and do not assume prior knowledge about the distribution and missingness mechanisms. Experimental results are consistent with the theoretical results and suggest that the estimation is invaluable to better assess the results of aggregate queries on incomplete data.
  • [1]
    Osborne J W. Best Practices in Data Cleaning:A Complete Guide to Everything You Need to Do Before and After Collecting Your Data (1st edition). SAGE Publications, Inc., 2012.
    [2]
    Rahm E, Do H H. Data cleaning:Problems and current approaches. IEEE Data Eng. Bull., 2000, 23(4):3-13.
    [3]
    Little R J, Rubin D B. Statistical Analysis with Missing Data (2nd Edition, Kindle Edition). Wiley-Interscience, 2014.
    [4]
    Zhang A, Wang J, Li J, Gao H. Aggregate query processing on incomplete data. In Proc. the 2nd International Joint Conference on Web and Big Data, July 2018, pp.286-294.
    [5]
    Jr W L. On semantic issues connected with incomplete information databases. ACM Trans. Database Syst., 1979, 4(3):262-296.
    [6]
    Reiter R. On closed world data bases. In Proc. the 1977 Symposium on Logic and Data Bases, November 1977, pp.55-76.
    [7]
    Codd E. Extending the database relational model to capture more meaning. ACM Trans. Database Syst., 1979, 4(4):397-434.
    [8]
    Lakshminarayan K, Harp S A, Samad T. Imputation of missing data in industrial databases. Appl. Intell., 1999, 11(3):259-275.
    [9]
    Mayfield C, Neville J, Prabhakar S. ERACER:A database approach for statistical inference and data cleaning. In Proc. the ACM SIGMOD International Conference on Management of Data, June 2010, pp.75-86.
    [10]
    Abiteboul S, Hull R, Vianu V. Foundations of Databases. Addison-Wesley, 1995.
    [11]
    Grahne G. The Problem of Incomplete Information in Relational Databases. Springer, 1991.
    [12]
    Imielinski T, Jr W L. Incomplete information in relational databases. J. ACM, 1984, 31(4):761-791.
    [13]
    van der Meyden R. Logical approaches to incomplete information:A survey. In Logics for Databases and Information Systems, Chomicki J, Saake G (eds.), Springer, 1998, pp.307-356.
    [14]
    Codd E F. Understanding relations (Installment #6). FDT-Bulletin of ACM SIGMOD, 1975, 7(1):1-4.
    [15]
    Date C J. Database in Depth Relational Theory for Practitioners. O'Reilly, 2005.
    [16]
    Date C. A critique of Claude Rubinson's paper nulls, threevalued logic, and ambiguity in SQL:Critiquing date's critique. SIGMOD Record, 2008, 37(3):20-22.
    [17]
    Date C J, Darwen H. A Guide to SQL Standard (4th edition). Addison-Wesley, 1997.
    [18]
    Grant J. Null values in a relational data base. Inf. Process. Lett., 1977, 6(5):156-157.
    [19]
    Abiteboul S, Kanellakis P C, Grahne G. On the representation and querying of sets of possible worlds. Theor. Comput. Sci., 1991, 78(1):158-187.
    [20]
    Sarle W S. Prediction with missing inputs. In Proc. the 4th Joint Conference on Information Sciences, October 1998, pp.399-402.
    [21]
    Feelders A. Handling missing data in trees:Surrogate splits or statistical imputation? In Proc. the 3rd European Conference on Principles of Data Mining and Knowledge Discovery, September 1999, pp.329-334.
    [22]
    Sande I G. Hot-deck imputation procedures. Incomplete Data in Sample Surveys, 1983, 3:339-349.
    [23]
    Buck S F. A method of estimation of missing values in multivariate data suitable for use with an electronic computer. Journal of the Royal Statistical Society. Series B (Methodological), 1960, 22(2):302-306.
    [24]
    Duda R O, Hart P E. Pattern Classification and Scene Analysis (1st edition). Wiley, 1973.
    [25]
    Ghahramani Z, Jordan M I. Mixture models for learning from incomplete data. Computational Learning Theory and Natural Learning Systems, 1997, 4:67-85.
    [26]
    van Buuren S, Mulligen E V, Brand J P L. Routine multiple imputation in statistical databases. In Proc. the 7th International Working Conference on Scientific and Statistical Database Management, September 1994, pp.74-78.
    [27]
    Rubin D B. Multiple imputation after 18+ years. Journal of the American statistical Association, 1996, 91(434):473-489.
    [28]
    Li K H. Imputation using Markov chains. Journal of Statistical Computation and Simulation, 1988, 30(1):57-79.
    [29]
    Rubin D B. Multiple Imputation for Nonresponse in Surveys. John Wiley & Sons, 2004.
    [30]
    Schafer J L. Analysis of Incomplete Multivariate Data (1st edition). Chapman and Hall/CRC, 1997.
  • Related Articles

    [1]Zhi-Xin Qi, Hong-Zhi Wang, An-Jie Wang. Impacts of Dirty Data on Classification and Clustering Models: An Experimental Evaluation[J]. Journal of Computer Science and Technology, 2021, 36(4): 806-821. DOI: 10.1007/s11390-021-1344-6
    [2]Yong-Nan Liu, Jian-Zhong Li, Zhao-Nian Zou. Determining the Real Data Completeness of a Relational Dataset[J]. Journal of Computer Science and Technology, 2016, 31(4): 720-740. DOI: 10.1007/s11390-016-1659-x
    [3]Hui Li, Di Wu, Gao-Xiang Li, Yi-Hao Ke, Wen-Jie Liu, Yuan-Huan Zheng, Xiao-La Lin. Enhancing Telco Service Quality with Big Data Enabled Churn Analysis: Infrastructure, Model, and Deployment[J]. Journal of Computer Science and Technology, 2015, 30(6): 1201-1214. DOI: 10.1007/s11390-015-1594-2
    [4]Wenfei Fan, Jin-Peng Huai. Querying Big Data: Bridging Theory and Practice[J]. Journal of Computer Science and Technology, 2014, 29(5): 849-869. DOI: 10.1007/s11390-014-1473-2
    [5]Ai-Hua Wu, Zi-Jing Tan, Wei Wang. Annotation Based Query Answer over Inconsistent Database[J]. Journal of Computer Science and Technology, 2010, 25(3): 469-481.
    [6]Taghi M. Khoshgoftaar, Pierre Rebours. Improving Software Quality Prediction by Noise Filtering Techniques[J]. Journal of Computer Science and Technology, 2007, 22(3): 387-396.
    [7]ZHOU Aoying, JIN Wen, ZHOU Shuigeng, QIAN Weining, TIAN Zenping. Incremental Mining of the Schema of Semistructured Data[J]. Journal of Computer Science and Technology, 2000, 15(3): 241-248.
    [8]Xu Meihe, Tang Zesheng. Surface Reconstruction for Cross Sectional Data[J]. Journal of Computer Science and Technology, 1996, 11(5): 471-479.
    [9]Zheng Weidong, Lin Zongkai, Guo Yuchai. A Data Manager for Engineering Applications[J]. Journal of Computer Science and Technology, 1993, 8(4): 21-30.
    [10]Zhang Yan, He Jichao. Data Dependencies in Database with Incomplete Information[J]. Journal of Computer Science and Technology, 1988, 3(2): 131-138.
  • Others

  • Cited by

    Periodical cited type(2)

    1. Rischan Mafrur, Mohamed A. Sharaf, Guido Zuccon. Database and Expert Systems Applications. Lecture Notes in Computer Science, DOI:10.1007/978-3-030-59003-1_8
    2. Mohamed A. Sharaf, Nishi Kochunni, Heba Helal, et al. CoDAQ: Congressional-Based Data Imputation for Aggregate Queries. 2025 IEEE International Conference on Big Data and Smart Computing (BigComp), DOI:10.1109/BigComp64353.2025.00060

    Other cited types(0)

Catalog

    Article views (45) PDF downloads (10) Cited by(2)
    Related

    /

    DownLoad:  Full-Size Img  PowerPoint
    Return
    Return