Interval Estimation for Aggregate Queries on Incomplete Data
-
Abstract
Incomplete data has been a longstanding issue in the database community, and the subject is yet poorly handled by both theories and practices. One common way to cope with missing values is to complete their imputation (filling in) as a preprocessing step before analyses. Unfortunately, not a single imputation method could impute all missing values correctly in all cases. Users could hardly trust the query result on such complete data without any confidence guarantee. In this paper, we propose to directly estimate the aggregate query result on incomplete data, rather than to impute the missing values. An interval estimation, composed of the upper and the lower bound of aggregate query results among all possible interpretations of missing values, is presented to the end users. The ground-truth aggregate result is guaranteed to be among the interval. We believe that decision support applications could benefit significantly from the estimation, since they can tolerate inexact answers, as long as there are clearly defined semantics and guarantees associated with the results. Our main techniques are parameter-free and do not assume prior knowledge about the distribution and missingness mechanisms. Experimental results are consistent with the theoretical results and suggest that the estimation is invaluable to better assess the results of aggregate queries on incomplete data.
-
-