CrowdOLA: Online Aggregation on Duplicate Data Powered by Crowdsourcing
Journal of Computer Science and Technology
Journal of Computer Science and Technology 2018, Vol. 33 Issue (2) :366-379    DOI: 10.1007/s11390-018-1824-5
Data Management and Data Mining
An-Zhen Zhang, Jian-Zhong Li, Fellow, CCF, Member, ACM, Hong Gao, Senior Member, CCF, Member, ACM, Yu-Biao Chen, Heng-Zhao Ma, Mohamed Jaward Bah
School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China

Abstract: Recently, there has been an increasing need for interactive, human-driven analysis of large volumes of data. Online aggregation (OLA), which provides a quick sketch of massive data well before the final accurate query result is available, has drawn significant research attention. However, directly applying OLA to duplicate data leads to incorrect query answers, since sampling from duplicate records overrepresents the duplicated data in the sample. This violates the uniform-distribution prerequisite of most statistical theories. In this paper, we propose CrowdOLA, a novel framework that integrates online aggregation processing with deduplication. Instead of cleaning the whole dataset, CrowdOLA continuously retrieves block-level samples from the dataset and employs a crowd-based entity resolution approach to detect duplicates in the sample in a pay-as-you-go fashion. After the sample is cleaned, an unbiased estimator is provided to correct the bias introduced by the duplication. We evaluate CrowdOLA on both real-world and synthetic workloads. Experimental results show that CrowdOLA provides a good balance between efficiency and accuracy.
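
The overrepresentation problem described in the abstract can be illustrated with a small sketch. This is not the estimator from the paper. It assumes that the duplicate count of every sampled record is already known (a stand-in for the crowd-based entity resolution step) and applies a simple inverse-count weighting to a SUM query. The entity values, the function names, and the sample size are all hypothetical.

```python
import random


def build_table(entities):
    """Expand (value, dup_count) entities into one record per stored copy."""
    return [(value, d) for value, d in entities for _ in range(d)]


def naive_sum_estimate(sample, table_size):
    """Scale the sample mean by the table size, ignoring duplicates."""
    return table_size * sum(value for value, _ in sample) / len(sample)


def dedup_sum_estimate(sample, table_size):
    """Weight each sampled record by 1/d, where d is the number of copies
    of its entity in the table (assumed known here)."""
    return table_size * sum(value / d for value, d in sample) / len(sample)


# Hypothetical data: the third entity is stored four times.
entities = [(10.0, 1), (20.0, 1), (100.0, 4)]
table = build_table(entities)
true_sum = sum(value for value, _ in entities)  # sum over distinct entities

# Uniform sampling with replacement over records, not over entities.
rng = random.Random(0)
sample = [rng.choice(table) for _ in range(20000)]

naive = naive_sum_estimate(sample, len(table))
corrected = dedup_sum_estimate(sample, len(table))
print(f"true={true_sum:.1f} naive={naive:.1f} corrected={corrected:.1f}")
```

Because a record of an entity with d copies is d times as likely to be drawn, the naive estimate converges to the sum over all stored records rather than the sum over distinct entities. Dividing each sampled value by d cancels that extra weight in expectation.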
Keywords: online aggregation, entity resolution, crowdsourcing, cloud computing
Received: 2017-02-26

Fund: This work was supported by the National Natural Science Foundation of China under Grant Nos. 61502121, 61472099, and 61602129.

About author: An-Zhen Zhang received her B.S. degree in computer science and technology from Harbin Institute of Technology, Harbin, in 2013. She is currently a Ph.D. candidate at Harbin Institute of Technology, Harbin. Her research interests include data quality and cloud computing.
Cite this article:   
An-Zhen Zhang, Jian-Zhong Li, Hong Gao, Yu-Biao Chen, Heng-Zhao Ma, Mohamed Jaward Bah. CrowdOLA: Online Aggregation on Duplicate Data Powered by Crowdsourcing[J]. Journal of Computer Science and Technology, 2018, 33(2): 366-379.
URL:  
http://jcst.ict.ac.cn:8080/jcst/EN/10.1007/s11390-018-1824-5