›› 2013,Vol. 28 ›› Issue (6): 989-1011.doi: 10.1007/s11390-013-1393-6

所属专题: Data Management and Data Mining

• Special Section on Selected Paper from NPC 2011 • 上一篇    下一篇

云计算环境下基于划分及共享采样的在线聚集机制研究

Yu-Xiang Wang (王宇翔), Jun-Zhou Luo (罗军舟), Senior Member, CCF, Member, ACM, IEEE Ai-Bo Song* (宋爱波), Member, CCF, ACM, and Fang Dong (东方), Member, CCF, ACM   

  • 收稿日期:2012-12-01 修回日期:2013-05-07 出版日期:2013-11-05 发布日期:2013-11-05
  • 作者简介:Yu-Xiang Wang received the B.S. degree in software engineering from Tianjin University, China, in 2008. He is currently a Ph.D. student in the School of Computer Science and Engineering, Southeast University, Nanjing. His current research interests include cloud computing, big data processing and query optimization.

Partition-Based Online Aggregation with Shared Sampling in the Cloud

Yu-Xiang Wang (王宇翔), Jun-Zhou Luo (罗军舟), Senior Member, CCF, Member, ACM, IEEE Ai-Bo Song* (宋爱波), Member, CCF, ACM, and Fang Dong (东方), Member, CCF, ACM   

  1. School of Computer Science and Engineering, Southeast University, Nanjing 211189, China
  • Received:2012-12-01 Revised:2013-05-07 Online:2013-11-05 Published:2013-11-05
  • About author:Yu-Xiang Wang received the B.S. degree in software engineering from Tianjin University, China, in 2008. He is currently a Ph.D. student in the School of Computer Science and Engineering, Southeast University, Nanjing. His current research interests include cloud computing, big data processing and query optimization.
  • Supported by:

    This work is supported by the National Basic Research 973 Program of China under Grant No. 2010CB328104, the National Natural Science Foundation of China under Grant Nos. 61070161, 61202449, 61320106007, the National High Technology Research and Development 863 Program of China under Grant No. 2013AA013503, the Specialized Research Fund for the Doctoral Program of Higher Education of China under Grant No. 20110092130002, the Jiangsu Provincial Key Laboratory of Network and Information Security under Grant No. BM2003201, the Key Laboratory of Computer Network and Information Integration of Ministry of Education of China under Grant No. 93K-9, and the Shanghai Key Laboratory of Scalable Computing and Systems of China under Grant No. 2010DS680095.

在线聚集(Online Aggregation)是一种基于采样理论的近似查询技术,通过迭代的方式对原始数据集进行多轮采样,并以置信区间形式向用户快速返回近似查询结果。在线聚集已广泛部署在基于MapReduce框架的云计算系统中以支持大数据处理与分析,这种近似查询技术无需扫描完整数据集即可获得符合查询精度的近似查询结果,可以大幅降低云环境中的计算与经济开销。然而,在现有的MapReduce架构下部署在线聚集系统存在着两个重要问题,一定程度上抑制了在线聚集的性能优势:一方面,现有云架构下的在线聚集并没有针对倾斜数据分布设计相应的优化策略,导致部分数据块采样效率较低,影响了样本质量从而导致单位时间内的精度损失,延长了执行时间;另一方面,现有云架构下的在线聚集针对不同查询请求采用独立执行策略,忽略查询间的共享可能,导致大量冗余的I/O开销,影响整体执行性能。鉴于以上问题,本文设计并实现OLACloud系统,以支持倾斜数据分布及大规模并发查询处理,进而提高云环境下在线聚集执行性能。针对第一个问题,本文提出基于内容敏感的重划分机制及公平数据块放置策略,在保证底层计算资源上的存储与计算负载均衡的前提下提高数据块采样效率。针对第二个问题,本文提出共享采样策略以发现并利用多查询任务间的共享机遇,减少冗余的I/O开销。最后,基于Hadoop平台实现OLACloud系统,并基于TPC-H基准测试包生成具有倾斜数据分布的测试数据集,在此基础上设计并实施性能验证试验。实验结果表明OLACloud系统在面对倾斜数据分布及并发查询任务时具有明显的性能优势。

Abstract: Online aggregation is an attractive sampling-based technology to response aggregation queries by an estimate to the final result, with the confidence interval becoming tighter over time. It has been built into a MapReduce-based cloud system for big data analytics, which allows users to monitor the query progress, and save money by killing the computation early once sufficient accuracy has been obtained. However, there are several limitations that restrict the performance of online aggregation generated from the gap between the current mechanism of MapReduce paradigm and the requirements of online aggregation, such as: 1) the low sampling efficiency due to the lack of consideration of skewed data distribution for online aggregation in MapReduce, and 2) the large redundant I/O cost of online aggregation caused by the independent job execution mechanism of MapReduce. In this paper, we present OLACloud, a MapReduce-based cloud system to well support online aggregation for different data distributions and large-scale concurrent query processing. We propose a content-aware repartition method with a fair-allocation block placement strategy to increase the sampling efficiency and guarantee the storage and computation load balancing simultaneously. We also develop a shared sampling method to share the sampling opportunities among multiple queries to reduce redundant I/O cost. We also implement OLACloud in Hadoop, and conduct an extensive experimental study on the TPC-H benchmark for skewed data distribution. Our results demonstrate the efficiency and effectiveness of OLACloud.

[1] Herodotou H, Lim H, Luo G et al. Starfish: A self-tuning system for big data analytics. In Proc. the 15th CIDR, Apr. 2011, pp.261-272.

[2] Wu S, Ooi B C, Tan K L. Continuous sampling for online aggregation over multiple queries. In Proc. the 2010 International Conference on Management of Data (SIGMOD), June 2010, pp.651-662.

[3] Chaudhuri S, Das G, Datar M et al. Overcoming limitations of sampling for aggregation queries. In Proc. the 17th Int. Conf. Data Engineering, Apr. 2001, pp.534-544.

[4] Laptev N, Zeng K, Zaniolo C. Early accurate results for advanced analytics on MapReduce. PVLDB, 2012, 5(10): 10281039.

[5] Hellerstein J M, Haas P J, Wang H J. Online aggregation. ACM SIGMOD Record., 1997, 26(2): 171-182.

[6] Dean J, Ghemawat S. MapReduce: Simplified data processing on large clusters. Communications of the ACM, 2008, 51(1): 107-113.

[7] Borkar V, Carey M, Grover R et al. Hyracks: A flexible and extensible foundation for data-intensive computing. In Proc. the 27th International Conference on Data Engineering, Apr. 2011, pp.1151-1162.

[8] Pansare N, Borkar V R, Jermaine C et al. Online aggregation for large MapReduce jobs. PVLDB, 2011, 4(11): 1135-1145.

[9] Böose J H, Andrzejak A, Höogqvist M. Beyond online aggregation: Parallel and incremental data mining with online mapreduce. In Proc. MDAC, Apr. 2010, Article No.3.

[10] Condie T, Conway N, Alvaro P et al. Online aggregation and continuous query support in MapReduce. In Proc. the 2010 International Conference on Management of Data, June 2010, pp.1115-1118.

[11] Shi Y, Meng X,Wang F et al. You can stop early with COLA: Online processing of aggregate queries in the cloud. In Proc. the 21st ACM International Conference on Information and Knowledgy Management, Oct. 29-Nov. 2, 2012, pp.1223-1232.

[12] Grover R, Carey M J. Extending MapReduce for efficient predicate-based sampling. In Proc. the 28th International Conference on Data Engineering, Apr. 2012, pp.486-497.

[13] Wang Y, Luo J, Song A, Jin J H, Dong F. Improving online aggregation performance for skewed data distribution. In Proc. Database Systems for Advanced Applications, Apr. 2012, pp.18-32.

[14] Chaudhuri S, Das G, Srivastava U. Effective use of blocklevel sampling in statistics estimation. In Proc. the 2004 International Conference on Management of Data, June 2004, pp.287-298.

[15] Jacobs A. The pathologies of big data. Communications of the ACM, 2009, 52(8): 36-44.

[16] Soroush E, Balazinska M, Wang D. Arraystore: A storage manager for complex parallel array processing. In Proc. the 2011 International Conference on Management of Data, June 2011, pp.253-264.

[17] Eltabakh M Y, Tian Y, Ozcan F et al. CoHadoop: Flexible data placement and its exploitation in Hadoop. PVLDB, 2011, 4(9): 575-585.

[18] Nykiel T, Potamias M, Mishra C et al. MRShare: Sharing across multiple queries in MapReduce. PVLDB, 2010, 3(1/2): 494-505.

[19] Wu S, Jiang S, Ooi B C et al. Distributed online aggregations. PVLDB, 2009, 2(1): 443-454.

[20] Zaharia M, Borthakur D, Sen Sarma J et al. Delay scheduling: A simple technique for achieving locality and fairness in cluster scheduling. In Proc. the 5th European Conference on Computer System, Apr. 2010, pp.265-278.

[21] Chaudhuri S, Narasyrra V. Program for tpc-d data generation with skew. Technical Report, ftp://ftp.research.microsoft.com/pub/user./viveknar/tpcdskew, Dec. 2012.

[22] Haas P J. Large-sample and deterministic confidence intervals for online aggregation. In Proc. the 9th International Conference on Scientific and Statistical Database Management, Aug. 1997, pp.51-62.

[23] Haas P J, Hellerstein J M. Ripple joins for online aggregation. ACM SIGMOD Record, 1999, 28(2): 287-298.

[24] Luo G, Ellmann C J, Haas P J et al. A scalable hash ripple join algorithm. In Proc. the 2002 International Conference on Management of Data, June 2002, pp.252-262.
No related articles found!
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
[1] 姚殊; 张钹;. Situated Learning of a Behavior-Based Mobile Robot Path Planner[J]. , 1995, 10(4): 375 -379 .
[2] 胡事民;. A Subdivision Scheme for Rational Triangular Bézier Surfaces[J]. , 1996, 11(1): 9 -16 .
[3] 伊波; 陶先平; G.Cioni; A.Colagrossi;. Intuitive Minimal Abduction in Sequent Calculi[J]. , 1998, 13(3): 209 -219 .
[4] . 完成可确定性XML查询优化[J]. , 2005, 20(3): 357 -366 .
[5] 范宝峡, 杨梁, 王江嵋, 王茹, 肖斌, 徐英, 刘动, 赵继业. 1GHz四核龙芯三号微处理器的物理实现[J]. , 2010, 25(2): 192 -199 .
[6] Hua Huang, Xiang-Wang Ma. 基于非负矩阵分解的正面和半正面人脸漫画合成[J]. , 2010, 25(6): 1282 -1292 .
[7] Wei Jiang (姜伟), Tian Wu (吴甜), Song-Lin Hu (虎嵩林), Senior Member, CCF, and . 质量敏感的自动服务组合:基于图论的视角[J]. , 2011, 26(5): 837 -853 .
[8] Wen-Yong Zhao, Shao-Lin Chen, Yuan Zheng, and Si-Long Peng. 基于冗余球面调和框架的凸Lambert物体的光照估计[J]. , 2013, 28(3): 454 -467 .
[9] Cheng-Lin Fan, Jun Luo, Wen-Cheng Wang, Fa-Rong Zhong, Binhai Zhu. 关于有色点集的一些距离问题[J]. , 2014, 29(5): 879 -886 .
[10] Chong Cao, Hai-Zhou Ai. 用户参与的人脸相似度学习[J]. , 2015, 30(3): 499 -510 .
版权所有 © 《计算机科学技术学报》编辑部
本系统由北京玛格泰克科技发展有限公司设计开发 技术支持:support@magtech.com.cn
总访问量: