2018, Vol. 33, Issue (2): 366-379. doi: 10.1007/s11390-018-1824-5

Special Issue: Data Management and Data Mining


CrowdOLA: Online Aggregation on Duplicate Data Powered by Crowdsourcing

An-Zhen Zhang, Jian-Zhong Li, Fellow, CCF, Member, ACM, Hong Gao, Senior Member, CCF, Member, ACM, Yu-Biao Chen, Heng-Zhao Ma, Mohamed Jaward Bah   

  1. School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China
  • Received: 2017-02-26 Revised: 2018-01-29 Online: 2018-03-05 Published: 2018-03-05
  • Contact: 10.1007/s11390-018-1824-5
  • About author: An-Zhen Zhang received her B.S. degree in computer science and technology from Harbin Institute of Technology, Harbin, in 2013. She is currently a Ph.D. candidate at Harbin Institute of Technology, Harbin. Her research interests include data quality and cloud computing.
  • Supported by: This work was supported by the National Natural Science Foundation of China under Grant Nos. 61502121, 61472099, and 61602129.

Recently, there has been an increasing need for interactive, human-driven analysis of large volumes of data. Online aggregation (OLA), which provides a quick sketch of massive data long before the final accurate query result is available, has drawn significant research attention. However, running OLA directly on duplicate data leads to incorrect query answers, since sampling from duplicate records overrepresents the duplicated data in the sample. This violates the uniform-distribution prerequisite of most statistical theories. In this paper, we propose CrowdOLA, a novel framework for integrating online aggregation processing with deduplication. Instead of cleaning the whole dataset, CrowdOLA continuously retrieves block-level samples from the dataset and employs a crowd-based entity resolution approach to detect duplicates in the sample in a pay-as-you-go fashion. After the sample is cleaned, an unbiased estimator corrects the bias introduced by the duplication. We evaluate CrowdOLA on both real-world and synthetic workloads. Experimental results show that CrowdOLA provides a good balance between efficiency and accuracy.
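
The overrepresentation effect described in the abstract can be illustrated with a small toy example. The sketch below is not the CrowdOLA estimator, whose definition is not given on this page. It assumes a generic inverse-duplicate-count reweighting (a ratio estimator that is consistent rather than strictly unbiased), and it assumes that each record's duplicate count is already known, whereas the paper obtains duplicate information through crowd-based entity resolution. The table contents and all names (`ENTITIES`, `DUP_COUNTS`, `naive_avg`, `reweighted_avg`, `running_estimates`) are hypothetical.

```python
import random

# Hypothetical toy table: three real-world entities, one of which ("C")
# appears four times in the dirty data. Duplicate counts are assumed to be
# known here; in practice they would come from entity resolution.
ENTITIES = {"A": 10.0, "B": 20.0, "C": 90.0}
DUP_COUNTS = {"A": 1, "B": 1, "C": 4}


def build_records():
    """Expand entities into dirty records of (entity_id, value, dup_count)."""
    return [
        (e, ENTITIES[e], DUP_COUNTS[e])
        for e in ENTITIES
        for _ in range(DUP_COUNTS[e])
    ]


def naive_avg(sample):
    """Plain record-level mean; duplicated entities are counted repeatedly."""
    return sum(v for _, v, _ in sample) / len(sample)


def reweighted_avg(sample):
    """Ratio estimator of the entity-level mean.

    Each record is weighted by 1/dup_count, so an entity that is d times
    more likely to be sampled contributes 1/d per sampled copy.
    """
    num = sum(v / d for _, v, d in sample)
    den = sum(1.0 / d for _, _, d in sample)
    return num / den


def running_estimates(records, block_size, seed=0):
    """Consume shuffled records block by block, as an online aggregation would.

    Returns a list of (records_seen, naive_estimate, reweighted_estimate).
    """
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    seen, out = [], []
    for start in range(0, len(shuffled), block_size):
        seen.extend(shuffled[start:start + block_size])
        out.append((len(seen), naive_avg(seen), reweighted_avg(seen)))
    return out


if __name__ == "__main__":
    for n, naive, fixed in running_estimates(build_records(), block_size=2):
        print(f"seen={n} naive={naive:.2f} reweighted={fixed:.2f}")
```

In this toy table the entity-level average is 40, while the record-level average over all six dirty records is 65, because entity "C" is counted four times. Once every record has been consumed, the reweighted estimate equals 40 and the naive one stays at 65. Intermediate estimates depend on the shuffle order.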


ISSN 1000-9000(Print)

CN 11-2296/TP

Journal of Computer Science and Technology
Institute of Computing Technology, Chinese Academy of Sciences