CrowdOLA: Online Aggregation on Duplicate Data Powered by Crowdsourcing
Journal of Computer Science and Technology
Journal of Computer Science and Technology, 2018, Vol. 33, Issue 2: 366-379. DOI: 10.1007/s11390-018-1824-5
Data Management and Data Mining
CrowdOLA: Online Aggregation on Duplicate Data Powered by Crowdsourcing
An-Zhen Zhang, Jian-Zhong Li (Fellow, CCF; Member, ACM), Hong Gao (Senior Member, CCF; Member, ACM), Yu-Biao Chen, Heng-Zhao Ma, Mohamed Jaward Bah
School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China

Abstract: Recently, there has been an increasing need for interactive, human-driven analysis of large volumes of data. Online aggregation (OLA), which provides a quick sketch of massive data and spares the user a long wait for the final, exact query result, has drawn significant research attention. However, running OLA directly on duplicate data leads to incorrect query answers: a record that is stored several times is more likely to be sampled, so duplicated data is overrepresented in the sample. This violates the uniform-sampling prerequisite on which most statistical theories rely. In this paper, we propose CrowdOLA, a novel framework that integrates online aggregation processing with deduplication. Instead of cleaning the whole dataset, CrowdOLA continuously retrieves block-level samples from the dataset and employs a crowd-based entity resolution approach to detect duplicates in the sample in a pay-as-you-go fashion. After the sample is cleaned, an unbiased estimator is provided to correct the bias introduced by the duplication. We evaluate CrowdOLA on both real-world and synthetic workloads. Experimental results show that CrowdOLA provides a good balance between efficiency and accuracy.
Keywords: online aggregation, entity resolution, crowdsourcing, cloud computing
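
The bias described in the abstract can be illustrated with a small worked example. The sketch below is not the estimator from the paper; it is a generic inverse-duplicate-count reweighting, and it assumes that the number of copies of each sampled entity is known (the hypothetical `dup_count` table stands in for the output of entity resolution). The toy table and all function names are made up for illustration. The expectation is computed exactly by enumerating every equally likely sample of three rows.

```python
from fractions import Fraction
from itertools import combinations

# Toy table of (entity_id, value); entity "a" is stored three times, "c" twice.
# These rows are invented for illustration only.
records = [("a", 10), ("a", 10), ("a", 10), ("b", 40), ("c", 20), ("c", 20)]


def true_distinct_sum(rows):
    """SUM over distinct entities, i.e. the answer on fully cleaned data."""
    return sum(dict(rows).values())


def naive_estimate(sample, total_rows):
    """Scale the sample sum up to the table size, ignoring duplicates."""
    return Fraction(total_rows, len(sample)) * sum(v for _, v in sample)


def reweighted_estimate(sample, total_rows, dup_count):
    """Down-weight each sampled row by its entity's duplicate count.

    Assumption: dup_count is known for every sampled entity.
    """
    weighted = sum(Fraction(v, dup_count[e]) for e, v in sample)
    return Fraction(total_rows, len(sample)) * weighted


def expected_value(estimator, rows, n):
    """Exact expectation over all equally likely size-n samples (no replacement)."""
    samples = list(combinations(rows, n))
    return sum(estimator(s) for s in samples) / len(samples)


# Duplicate count per entity, computed here from the full toy table.
dup_count = {}
for entity, _ in records:
    dup_count[entity] = dup_count.get(entity, 0) + 1

N = len(records)
truth = true_distinct_sum(records)
naive_mean = expected_value(lambda s: naive_estimate(s, N), records, 3)
fixed_mean = expected_value(
    lambda s: reweighted_estimate(s, N, dup_count), records, 3
)

print("true SUM over distinct entities:", truth)
print("expected naive estimate:", naive_mean)
print("expected reweighted estimate:", fixed_mean)
```

On this toy table the naive estimator averages to 110, the sum over all stored rows, while the true answer over distinct entities is 70; the reweighted estimator averages to exactly 70. In practice the duplicate counts are not available up front, which is the gap that sample-level entity resolution is meant to close.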
Received: 2017-02-26
Funding: This work was supported by the National Natural Science Foundation of China under Grant Nos. 61502121, 61472099, and 61602129.

About the author: An-Zhen Zhang received her B.S. degree in computer science and technology from Harbin Institute of Technology, Harbin, in 2013. She is currently a Ph.D. candidate at Harbin Institute of Technology, Harbin. Her research interests include data quality and cloud computing.
Cite this article:
An-Zhen Zhang, Jian-Zhong Li, Hong Gao, Yu-Biao Chen, Heng-Zhao Ma, Mohamed Jaward Bah. CrowdOLA: Online Aggregation on Duplicate Data Powered by Crowdsourcing[J]. Journal of Computer Science and Technology, 2018, 33(2): 366-379.
Link to this article:
http://jcst.ict.ac.cn:8080/jcst/CN/10.1007/s11390-018-1824-5