›› 2018,Vol. 33 ›› Issue (2): 366-379.doi: 10.1007/s11390-018-1824-5

所属专题: Data Management and Data Mining

• Data Management and Data Mining • 上一篇    下一篇

CrowdOLA:重复数据上基于众包的在线聚集查询系统

An-Zhen Zhang, Jian-Zhong Li, Fellow, CCF, Member, ACM, Hong Gao, Senior Member, CCF, Member, ACM, Yu-Biao Chen, Heng-Zhao Ma, Mohamed Jaward Bah   

  1. School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China
  • 收稿日期:2017-02-26 修回日期:2018-01-29 出版日期:2018-03-05 发布日期:2018-03-05
  • 作者简介:An-Zhen Zhang received her B.S. degree in computer science and technology from Harbin Institute of Technology, Harbin, in 2013. Currently she is a Ph.D. candidate of Harbin Institute of Technology, Harbin. Her research interests include data quality and cloud computing
  • 基金资助:

    This work was supported by the National Natural Science Foundation of China under Grant Nos. 61502121, 61472099, and 61602129.

CrowdOLA: Online Aggregation on Duplicate Data Powered by Crowdsourcing

An-Zhen Zhang, Jian-Zhong Li, Fellow, CCF, Member, ACM, Hong Gao, Senior Member, CCF, Member, ACM, Yu-Biao Chen, Heng-Zhao Ma, Mohamed Jaward Bah   

  1. School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China
  • Received:2017-02-26 Revised:2018-01-29 Online:2018-03-05 Published:2018-03-05
  • Contact: 10.1007/s11390-018-1824-5
  • About author:An-Zhen Zhang received her B.S. degree in computer science and technology from Harbin Institute of Technology, Harbin, in 2013. Currently she is a Ph.D. candidate of Harbin Institute of Technology, Harbin. Her research interests include data quality and cloud computing
  • Supported by:

    This work was supported by the National Natural Science Foundation of China under Grant Nos. 61502121, 61472099, and 61602129.

近年来,大数据交互式分析的需求与日俱增。在线聚集查询处理能够快速描绘数据概貌,避免过长时间的查询等待,从而引起学术界广泛关注。当数据中有重复元组时,直接进行在线聚集查询分析将导致错误的查询结果。传统的在线聚集技术基于均匀采样的前提,然而当数据有重复时,重复的元组被采样的概率增大,从而违背均匀采样。本文提出了一个适用于重复数据上在线聚集查询的系统,CrowdOLA。CrowdOLA从原始数据集上不断以块为单位抽取样本,在得到的样本上利用基于众包的实体识别技术去除重复元组,然后根据清洗后的样本给出总体聚集结果的无偏估计统计量。真实数据集和模拟数据集上的大量实验证实了CrowdOLA的高效性和准确性。

Abstract: Recently there is an increasing need for interactive human-driven analysis on large volumes of data. Online aggregation (OLA), which provides a quick sketch of massive data before a long wait of the final accurate query result, has drawn significant research attention. However, the direct processing of OLA on duplicate data will lead to incorrect query answers, since sampling from duplicate records leads to an over representation of the duplicate data in the sample. This violates the prerequisite of uniform distributions in most statistical theories. In this paper, we propose CrowdOLA, a novel framework for integrating online aggregation processing with deduplication. Instead of cleaning the whole dataset, CrowdOLA retrieves block-level samples continuously from the dataset, and employs a crowd-based entity resolution approach to detect duplicates in the sample in a pay-as-you-go fashion. After cleaning the sample, an unbiased estimator is provided to address the error bias that is introduced by the duplication. We evaluate CrowdOLA on both real-world and synthetic workloads. Experimental results show that CrowdOLA provides a good balance between efficiency and accuracy.

[1] Hellerstein J M, Haas P J, Wang H J. Online aggregation. In Proc. ACM SIGMOD Int. Conf. Management of Data, May 1997, pp.171-182.

[2] Doulkeridis C, Nørvåg K. A survey of large-scale analytical query processing in MapReduce. VLDB J., 2014, 23(3):355-380.

[3] Elmagarmid A K, Ipeirotis P G, Verykios V S. Duplicate record detection:A survey. IEEE Trans. Knowl. Data Eng., 2007, 19(1):1-16.

[4] Charikar M, Chaudhuri S, Motwani R, Narasayya V R. Towards estimation error guarantees for distinct values. In Proc. ACM SIGMOD Int. Conf. Management of Data, May 2000, pp.268-279.

[5] Wang J, Krishnan S, Franklin M J, Goldberg K, Kraska T, Milo T. A sample-and-clean framework for fast and accurate query processing on dirty data. In Proc. ACM SIGMOD Int. Conf. Management of Data, June 2014, pp.469-480.

[6] Haas P J. Large-sample and deterministic confidence intervals for online aggregation. In Proc. the 9th Int. Conf. Scientific and Statistical Database Management, August 1997, pp.51-63.

[7] Haas P J, Hellerstein J M. Ripple joins for online aggregation. In Proc. ACM SIGMOD Int. Conf. Management of Data, June 1999, pp.287-298.

[8] Jermaine C, Dobra A, Arumugam S, Joshi S, Pol A. A disk-based join with probabilistic guarantees. In Proc. ACM SIGMOD Int. Conf. Management of Data, June 2005, pp.563-574.

[9] Luo G, Ellmann C J, Haas P J, Naughton J F. A scalable hash ripple join algorithm. In Proc. ACM SIGMOD Int. Conf. Management of Data, June 2002, pp.252-262.

[10] Condie T, Conway N, Alvaro P, Hellerstein J M, Gerth J, Talbot J, Elmeleegy K, Sears R. Online aggregation and continuous query support in MapReduce. In Proc. ACM SIGMOD Int. Conf. Management of Data, June 2010, pp.1115-1118.

[11] Shi Y, Meng X, Wang F, Gan Y. You can stop early with COLA:Online processing of aggregate queries in the cloud. In Proc. the 21st Int. Conf. Information and Knowledge Management, October 2012, pp.1223-1232.

[12] Pansare N, Borkar V R, Jermaine C, Condie T. Online aggregation for large MapReduce jobs. PVLDB, 2011, 4(11):1135-1145.

[13] Zeng K, Agarwal S, Stoica I. iOLAP:Managing uncertainty for efficient incremental OLAP. In Proc. ACM SIGMOD Int. Conf. Management of Data, July 2016, pp.1347-1361.

[14] Köpcke H, Rahm E. Frameworks for entity matching:A comparison. Data Knowl. Eng., 2010, 69(2):197-210.

[15] Hernández M A, Stolfo S J. The merge/purge problem for large databases. In Proc. ACM SIGMOD Int. Conf. Management of Data, May 1995, pp.127-138.

[16] McCallum A, Nigam K, Ungar L H. Efficient clustering of high-dimensional data sets with application to reference matching. In Proc. ACM SIGMOD Int. Conf. Management of Data, August 2000, pp.169-178.

[17] Ananthakrishna R, Chaudhuri S, Ganti V. Eliminating fuzzy duplicates in data warehouses. In Proc. the 28th Int. Conf. Very Large Data Bases, August 2002, pp.586-597.

[18] Bhattacharya I, Getoor L. Collective entity resolution in relational data. TKDD, 2007, 1(1):5.

[19] Altowim Y, Kalashnikov D V, Mehrotra S. Progressive approach to relational entity resolution. PVLDB, 2014, 7(11):999-1010.

[20] Whang S E, Marmaros D, Garcia-Molina H. Pay-as-yougo entity resolution. IEEE Trans. Knowl. Data Eng., 2013, 25(5):1111-1124.

[21] Gruenheid A, Dong X L, Srivastava D. Incremental record linkage. PVLDB, 2014, 7(9):697-708.

[22] Whang S E, Garcia-Molina H. Incremental entity resolution on rules and data. VLDB J., 2014, 23(1):77-102.

[23] Li G, Wang J, Zheng Y, Franklin M J. Crowdsourced data management:A survey. In Proc. the 33rd IEEE Int. Conf. Data Engineering, April 2017, pp.39-40.

[24] Zheng Y, Cheng R, Maniu S, Mo L. On optimality of jury selection in crowdsourcing. In Proc. the 18th Int. Conf. Extending Database Technology, March 2015, pp.193-204.

[25] Zheng Y, Li G, Li Y, Shan C, Cheng R. Truth inference in crowdsourcing:Is the problem solved? PVLDB, 2017, 10(5):541-552.

[26] Zheng Y, Li G, Cheng R. DOCS:Domain-aware crowdsourcing system. PVLDB, 2016, 10(4):361-372.

[27] Zheng Y, Wang J, Li G, Cheng R, Feng J. QASCA:A quality-aware task assignment system for crowdsourcing applications. In Proc. ACM SIGMOD Int. Conf. Management of Data, May 31-June 4, 2015, pp.1031-1046.

[28] Xiong H, Zhang D, Chen G, Wang L, Gauthier V, Barnes L E. iCrowd:Near-optimal task allocation for piggyback crowdsensing. IEEE Trans. Mob. Comput., 2016, 15(8):2010-2022.

[29] Hu H, Zheng Y, Bao Z, Li G, Feng J, Cheng R. Crowdsourced POI labelling:Location-aware result inference and task assignment. In Proc. the 32nd IEEE Int. Conf. Data Engineering, May 2016, pp.61-72.
No related articles found!
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
[1] 练林; 张一立; 唐常杰;. A Non-Recursive Algorithm Computing Set Expressions[J]. , 1988, 3(4): 310 -316 .
[2] 徐殿祥; 郑国梁;. Towards a Declarative Semantics of Inheritance with Exceptions[J]. , 1996, 11(1): 61 -71 .
[3] 陈阳军;. Counting and Topological Order[J]. , 1997, 12(6): 497 -509 .
[4] . 使用多分类器和上下文限制的阿拉伯文字识别[J]. , 2005, 20(3): 402 -410 .
[5] . Globus Toolkit 4: 用于面向服务系统的软件[J]. , 2006, 21(4): 513 -520 .
[6] . 基于模型等价归约的析取逻辑程序语义比较[J]. , 2007, 22(4): 562 -568 .
[7] Xiao Sun(孙晓), De-Gen Huang(黄德根), Senior Member, CCF, Hai-Yu Song(宋海玉) and Fu-Ji Ren(任福继), Member, IEEE. 基于一种隐藏变量区别模型与全局特征的中文新词识别[J]. , 2011, 26(1): 14 -24 .
[8] Min-Yi Guo, Zi-Li Shao, Edwin Hsing-Mean Sha. [J]. , 2011, 26(3): 373 -374 .
[9] Peter Szolovits. [J]. , 2011, 26(4): 625 -631 .
[10] Feng Wang (王锋) Member, CCF, ACM, Can-Qun Yang (杨灿群), Yun-Fei Du (杜云飞), Ju. 使用GPU加速的千万亿次超级计算机的Linpack程序优化[J]. , 2011, 26(5): 854 -865 .
版权所有 © 《计算机科学技术学报》编辑部
本系统由北京玛格泰克科技发展有限公司设计开发 技术支持:support@magtech.com.cn
总访问量: