CrowdOLA: A Crowdsourcing-Based Online Aggregation Query System for Duplicate Data

doi:10.1007/s11390-018-1824-5

CrowdOLA: Online Aggregation on Duplicate Data Powered by Crowdsourcing

Abstract

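Abstract (Chinese version): In recent years, the demand for interactive analysis of big data has grown steadily. Online aggregation query processing can quickly sketch the overall shape of the data and avoid excessively long query waits, and has therefore attracted wide attention in academia. When the data contains duplicate tuples, running online aggregation directly produces incorrect query results. Traditional online aggregation techniques assume uniform sampling; when the data contains duplicates, however, duplicated tuples are more likely to be sampled, which violates that assumption. This paper proposes CrowdOLA, a system for online aggregation queries over duplicate data. CrowdOLA continuously draws samples from the original dataset in units of blocks, removes duplicate tuples from the sample using crowdsourcing-based entity resolution, and then derives an unbiased estimator of the overall aggregate result from the cleaned sample. Extensive experiments on real and synthetic datasets confirm the efficiency and accuracy of CrowdOLA.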

Abstract: Recently, there has been an increasing need for interactive, human-driven analysis of large volumes of data. Online aggregation (OLA), which provides a quick sketch of massive data instead of a long wait for the final accurate query result, has drawn significant research attention. However, running OLA directly on duplicate data leads to incorrect query answers, since sampling from duplicate records overrepresents the duplicated data in the sample. This violates the prerequisite of uniform sampling that most statistical theories rely on. In this paper, we propose CrowdOLA, a novel framework that integrates online aggregation processing with deduplication. Instead of cleaning the whole dataset, CrowdOLA continuously retrieves block-level samples from the dataset and employs a crowd-based entity resolution approach to detect duplicates in the sample in a pay-as-you-go fashion. After the sample is cleaned, an unbiased estimator corrects the bias introduced by the duplication. We evaluate CrowdOLA on both real-world and synthetic workloads. Experimental results show that CrowdOLA provides a good balance between efficiency and accuracy.
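
The bias described in the abstract can be illustrated with a small simulation. The sketch below is not CrowdOLA's estimator, which the abstract does not spell out. It is a minimal illustration under strong assumptions: records are drawn uniformly with replacement (no block-level sampling), and the number of copies of each entity is assumed to be known exactly, standing in for the output of entity resolution. The function names `make_records`, `naive_sum_estimate`, and `dedup_aware_sum_estimate` and the toy data are hypothetical.

```python
import random
from collections import Counter


def make_records(values, dup_counts):
    """Expand distinct entities into records; entity i appears dup_counts[i] times."""
    records = []
    for entity_id, (value, copies) in enumerate(zip(values, dup_counts)):
        records.extend([(entity_id, value)] * copies)
    return records


def naive_sum_estimate(sample, total_records):
    """Scale the sample sum up to the table size, ignoring duplication."""
    return total_records / len(sample) * sum(v for _, v in sample)


def dedup_aware_sum_estimate(sample, total_records, dup_count):
    """Weight each sampled record by 1 / (number of copies of its entity).

    Assumption: dup_count is exact. In a real system it would have to come
    from entity resolution and could itself be uncertain.
    """
    weighted = sum(v / dup_count[e] for e, v in sample)
    return total_records / len(sample) * weighted


# Toy data (hypothetical): four distinct entities, two of them duplicated.
values = [10.0, 20.0, 30.0, 40.0]
dup_counts = [1, 3, 1, 5]
records = make_records(values, dup_counts)
dup_count = Counter(e for e, _ in records)
true_sum = sum(values)  # SUM over distinct entities

# Average both estimators over many small uniform samples.
rng = random.Random(0)
trials, sample_size = 20000, 5
naive_total = dedup_total = 0.0
for _ in range(trials):
    sample = [rng.choice(records) for _ in range(sample_size)]
    naive_total += naive_sum_estimate(sample, len(records))
    dedup_total += dedup_aware_sum_estimate(sample, len(records), dup_count)

naive_avg = naive_total / trials
dedup_avg = dedup_total / trials
print(f"true SUM over distinct entities: {true_sum}")
print(f"naive estimator, average:        {naive_avg:.1f}")
print(f"dedup-aware estimator, average:  {dedup_avg:.1f}")
```

For this toy data the naive estimator targets the sum over all records (10 + 3·20 + 30 + 5·40 = 300) rather than the sum over distinct entities (100), because an entity with more copies is proportionally more likely to be sampled. Dividing each sampled value by its entity's copy count cancels that extra sampling probability, so the weighted estimator is unbiased for the distinct-entity sum under the stated assumptions.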
