Research on an Online Aggregation Mechanism Based on Partitioning and Shared Sampling in Cloud Computing Environments

Yu-Xiang Wang (王宇翔); Jun-Zhou Luo (罗军舟); Ai-Bo Song (宋爱波); Fang Dong (东方)

doi:10.1007/s11390-013-1393-6

Partition-Based Online Aggregation with Shared Sampling in the Cloud

Abstract

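Abstract (Chinese version): Online aggregation is an approximate query technique based on sampling theory. It samples the original data set iteratively over multiple rounds and quickly returns approximate query results to the user in the form of confidence intervals. Online aggregation has been widely deployed in MapReduce-based cloud computing systems to support big data processing and analysis. Because it can obtain an approximate result that meets the required query accuracy without scanning the complete data set, it can substantially reduce computational and monetary costs in the cloud. However, deploying an online aggregation system on the existing MapReduce architecture raises two important problems that, to some extent, suppress the performance advantage of online aggregation. First, online aggregation under existing cloud architectures has no optimization strategy designed for skewed data distributions, so some data blocks are sampled inefficiently; this degrades sample quality, loses accuracy per unit time, and prolongs execution time. Second, online aggregation under existing cloud architectures executes different query requests independently and ignores sharing opportunities among queries, which incurs a large amount of redundant I/O and harms overall execution performance. To address these problems, this paper designs and implements the OLACloud system, which supports skewed data distributions and large-scale concurrent query processing and thereby improves the execution performance of online aggregation in the cloud. For the first problem, we propose a content-aware repartitioning mechanism and a fair data block placement strategy, which improve block sampling efficiency while keeping storage and computation load balanced across the underlying computing resources. For the second problem, we propose a shared sampling strategy that discovers and exploits sharing opportunities among multiple query tasks to reduce redundant I/O. Finally, we implement OLACloud on the Hadoop platform, generate test data sets with skewed distributions using the TPC-H benchmark suite, and on this basis design and carry out performance evaluation experiments. The experimental results show that OLACloud has a clear performance advantage when facing skewed data distributions and concurrent query tasks.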

Abstract: Online aggregation is an attractive sampling-based technique that answers aggregation queries with an estimate of the final result, together with a confidence interval that tightens over time. It has been built into a MapReduce-based cloud system for big data analytics, which allows users to monitor query progress and to save money by killing the computation early once sufficient accuracy has been obtained. However, several limitations restrict the performance of online aggregation. They stem from the gap between the current mechanism of the MapReduce paradigm and the requirements of online aggregation, and include: 1) low sampling efficiency, because online aggregation in MapReduce does not take skewed data distributions into account, and 2) large redundant I/O cost, caused by the independent job execution mechanism of MapReduce. In this paper, we present OLACloud, a MapReduce-based cloud system that supports online aggregation for different data distributions and large-scale concurrent query processing. We propose a content-aware repartition method with a fair-allocation block placement strategy to increase sampling efficiency while guaranteeing storage and computation load balancing. We also develop a shared sampling method that shares sampling opportunities among multiple queries to reduce redundant I/O cost. Finally, we implement OLACloud in Hadoop and conduct an extensive experimental study on the TPC-H benchmark with skewed data distributions. Our results demonstrate the efficiency and effectiveness of OLACloud.
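The two ideas named in the abstract, an estimate whose confidence interval tightens as more samples arrive and a single sampling pass shared by several queries, can be illustrated with a minimal in-memory sketch. It is not the OLACloud implementation: the names `RunningMean` and `shared_online_avg`, the dictionary-based rows, the normal-approximation interval with z = 1.96, and the stopping rule on the interval half-width are all assumptions chosen for illustration, and nothing here models HDFS blocks, MapReduce jobs, or the repartitioning strategy.

```python
import math
import random
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# A hypothetical query: (row filter, value extractor) for an AVG aggregate.
Query = Tuple[Callable[[dict], bool], Callable[[dict], float]]


@dataclass
class RunningMean:
    """Welford-style running mean and variance for one query."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the current mean

    def update(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def half_width(self, z: float = 1.96) -> float:
        """Approximate (central limit theorem) confidence-interval half-width."""
        if self.n < 2:
            return math.inf
        variance = self.m2 / (self.n - 1)
        return z * math.sqrt(variance / self.n)


def shared_online_avg(
    rows: List[dict],
    queries: Dict[str, Query],
    batch_size: int = 100,
    target: float = 0.5,
    seed: int = 0,
) -> List[Tuple[int, Dict[str, Tuple[float, float]]]]:
    """Return one snapshot per round: (rows read, {query: (estimate, half-width)}).

    Rows are visited in a random order (sampling without replacement).
    Each sampled row is read once and offered to every query, so the
    sampling cost is shared. Sampling stops once every query's interval
    half-width is at most `target`, or when the data is exhausted.
    """
    rng = random.Random(seed)
    order = list(range(len(rows)))
    rng.shuffle(order)

    stats = {name: RunningMean() for name in queries}
    snapshots = []
    for start in range(0, len(order), batch_size):
        for i in order[start:start + batch_size]:
            row = rows[i]  # one read of the row ...
            for name, (matches, value) in queries.items():
                if matches(row):  # ... serves all queries
                    stats[name].update(value(row))
        rows_read = min(start + batch_size, len(order))
        snapshot = {name: (s.mean, s.half_width()) for name, s in stats.items()}
        snapshots.append((rows_read, snapshot))
        if all(hw <= target for _, hw in snapshot.values()):
            break  # every query is accurate enough; stop early
    return snapshots
```

Each snapshot corresponds to what a user would see after one sampling round. The interval ignores the finite-population correction, so it is somewhat conservative when a large fraction of the data has already been read.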
