|
›› 2018,Vol. 33 ›› Issue (2): 366-379.doi: 10.1007/s11390-018-1824-5
所属专题: Data Management and Data Mining
• Data Management and Data Mining • 上一篇 下一篇
An-Zhen Zhang, Jian-Zhong Li, Fellow, CCF, Member, ACM, Hong Gao, Senior Member, CCF, Member, ACM, Yu-Biao Chen, Heng-Zhao Ma, Mohamed Jaward Bah
An-Zhen Zhang, Jian-Zhong Li, Fellow, CCF, Member, ACM, Hong Gao, Senior Member, CCF, Member, ACM, Yu-Biao Chen, Heng-Zhao Ma, Mohamed Jaward Bah
近年来,大数据交互式分析的需求与日俱增。在线聚集查询处理能够快速描绘数据概貌,避免过长时间的查询等待,从而引起学术界广泛关注。当数据中有重复元组时,直接进行在线聚集查询分析将导致错误的查询结果。传统的在线聚集技术基于均匀采样的前提,然而当数据有重复时,重复的元组被采样的概率增大,从而违背均匀采样。本文提出了一个适用于重复数据上在线聚集查询的系统,CrowdOLA。CrowdOLA从原始数据集上不断以块为单位抽取样本,在得到的样本上利用基于众包的实体识别技术去除重复元组,然后根据清洗后的样本给出总体聚集结果的无偏估计统计量。真实数据集和模拟数据集上的大量实验证实了CrowdOLA的高效性和准确性。
[1] Hellerstein J M, Haas P J, Wang H J. Online aggregation. In Proc. ACM SIGMOD Int. Conf. Management of Data, May 1997, pp.171-182.[2] Doulkeridis C, Nørvåg K. A survey of large-scale analytical query processing in MapReduce. VLDB J., 2014, 23(3):355-380.[3] Elmagarmid A K, Ipeirotis P G, Verykios V S. Duplicate record detection:A survey. IEEE Trans. Knowl. Data Eng., 2007, 19(1):1-16.[4] Charikar M, Chaudhuri S, Motwani R, Narasayya V R. Towards estimation error guarantees for distinct values. In Proc. ACM SIGMOD Int. Conf. Management of Data, May 2000, pp.268-279.[5] Wang J, Krishnan S, Franklin M J, Goldberg K, Kraska T, Milo T. A sample-and-clean framework for fast and accurate query processing on dirty data. In Proc. ACM SIGMOD Int. Conf. Management of Data, June 2014, pp.469-480.[6] Haas P J. Large-sample and deterministic confidence intervals for online aggregation. In Proc. the 9th Int. Conf. Scientific and Statistical Database Management, August 1997, pp.51-63.[7] Haas P J, Hellerstein J M. Ripple joins for online aggregation. In Proc. ACM SIGMOD Int. Conf. Management of Data, June 1999, pp.287-298.[8] Jermaine C, Dobra A, Arumugam S, Joshi S, Pol A. A disk-based join with probabilistic guarantees. In Proc. ACM SIGMOD Int. Conf. Management of Data, June 2005, pp.563-574.[9] Luo G, Ellmann C J, Haas P J, Naughton J F. A scalable hash ripple join algorithm. In Proc. ACM SIGMOD Int. Conf. Management of Data, June 2002, pp.252-262.[10] Condie T, Conway N, Alvaro P, Hellerstein J M, Gerth J, Talbot J, Elmeleegy K, Sears R. Online aggregation and continuous query support in MapReduce. In Proc. ACM SIGMOD Int. Conf. Management of Data, June 2010, pp.1115-1118.[11] Shi Y, Meng X, Wang F, Gan Y. You can stop early with COLA:Online processing of aggregate queries in the cloud. In Proc. the 21st Int. Conf. Information and Knowledge Management, October 2012, pp.1223-1232.[12] Pansare N, Borkar V R, Jermaine C, Condie T. Online aggregation for large MapReduce jobs. PVLDB, 2011, 4(11):1135-1145.[13] Zeng K, Agarwal S, Stoica I. iOLAP:Managing uncertainty for efficient incremental OLAP. In Proc. ACM SIGMOD Int. Conf. Management of Data, July 2016, pp.1347-1361.[14] Köpcke H, Rahm E. Frameworks for entity matching:A comparison. Data Knowl. Eng., 2010, 69(2):197-210.[15] Hernández M A, Stolfo S J. The merge/purge problem for large databases. In Proc. ACM SIGMOD Int. Conf. Management of Data, May 1995, pp.127-138.[16] McCallum A, Nigam K, Ungar L H. Efficient clustering of high-dimensional data sets with application to reference matching. In Proc. ACM SIGMOD Int. Conf. Management of Data, August 2000, pp.169-178.[17] Ananthakrishna R, Chaudhuri S, Ganti V. Eliminating fuzzy duplicates in data warehouses. In Proc. the 28th Int. Conf. Very Large Data Bases, August 2002, pp.586-597.[18] Bhattacharya I, Getoor L. Collective entity resolution in relational data. TKDD, 2007, 1(1):5.[19] Altowim Y, Kalashnikov D V, Mehrotra S. Progressive approach to relational entity resolution. PVLDB, 2014, 7(11):999-1010.[20] Whang S E, Marmaros D, Garcia-Molina H. Pay-as-yougo entity resolution. IEEE Trans. Knowl. Data Eng., 2013, 25(5):1111-1124.[21] Gruenheid A, Dong X L, Srivastava D. Incremental record linkage. PVLDB, 2014, 7(9):697-708.[22] Whang S E, Garcia-Molina H. Incremental entity resolution on rules and data. VLDB J., 2014, 23(1):77-102.[23] Li G, Wang J, Zheng Y, Franklin M J. Crowdsourced data management:A survey. In Proc. the 33rd IEEE Int. Conf. Data Engineering, April 2017, pp.39-40.[24] Zheng Y, Cheng R, Maniu S, Mo L. On optimality of jury selection in crowdsourcing. In Proc. the 18th Int. Conf. Extending Database Technology, March 2015, pp.193-204.[25] Zheng Y, Li G, Li Y, Shan C, Cheng R. Truth inference in crowdsourcing:Is the problem solved? PVLDB, 2017, 10(5):541-552.[26] Zheng Y, Li G, Cheng R. DOCS:Domain-aware crowdsourcing system. PVLDB, 2016, 10(4):361-372.[27] Zheng Y, Wang J, Li G, Cheng R, Feng J. QASCA:A quality-aware task assignment system for crowdsourcing applications. In Proc. ACM SIGMOD Int. Conf. Management of Data, May 31-June 4, 2015, pp.1031-1046.[28] Xiong H, Zhang D, Chen G, Wang L, Gauthier V, Barnes L E. iCrowd:Near-optimal task allocation for piggyback crowdsensing. IEEE Trans. Mob. Comput., 2016, 15(8):2010-2022.[29] Hu H, Zheng Y, Bao Z, Li G, Feng J, Cheng R. Crowdsourced POI labelling:Location-aware result inference and task assignment. In Proc. the 32nd IEEE Int. Conf. Data Engineering, May 2016, pp.61-72. |
No related articles found! |
|
版权所有 © 《计算机科学技术学报》编辑部 本系统由北京玛格泰克科技发展有限公司设计开发 技术支持:support@magtech.com.cn 总访问量: |