|
›› 2013,Vol. 28 ›› Issue (6): 989-1011.doi: 10.1007/s11390-013-1393-6
所属专题: Data Management and Data Mining
• Special Section on Selected Paper from NPC 2011 • 上一篇 下一篇
Yu-Xiang Wang (王宇翔), Jun-Zhou Luo (罗军舟), Senior Member, CCF, Member, ACM, IEEE Ai-Bo Song* (宋爱波), Member, CCF, ACM, and Fang Dong (东方), Member, CCF, ACM
Yu-Xiang Wang (王宇翔), Jun-Zhou Luo (罗军舟), Senior Member, CCF, Member, ACM, IEEE Ai-Bo Song* (宋爱波), Member, CCF, ACM, and Fang Dong (东方), Member, CCF, ACM
在线聚集(Online Aggregation)是一种基于采样理论的近似查询技术,通过迭代的方式对原始数据集进行多轮采样,并以置信区间形式向用户快速返回近似查询结果。在线聚集已广泛部署在基于MapReduce框架的云计算系统中以支持大数据处理与分析,这种近似查询技术无需扫描完整数据集即可获得符合查询精度的近似查询结果,可以大幅降低云环境中的计算与经济开销。然而,在现有的MapReduce架构下部署在线聚集系统存在着两个重要问题,一定程度上抑制了在线聚集的性能优势:一方面,现有云架构下的在线聚集并没有针对倾斜数据分布设计相应的优化策略,导致部分数据块采样效率较低,影响了样本质量从而导致单位时间内的精度损失,延长了执行时间;另一方面,现有云架构下的在线聚集针对不同查询请求采用独立执行策略,忽略查询间的共享可能,导致大量冗余的I/O开销,影响整体执行性能。鉴于以上问题,本文设计并实现OLACloud系统,以支持倾斜数据分布及大规模并发查询处理,进而提高云环境下在线聚集执行性能。针对第一个问题,本文提出基于内容敏感的重划分机制及公平数据块放置策略,在保证底层计算资源上的存储与计算负载均衡的前提下提高数据块采样效率。针对第二个问题,本文提出共享采样策略以发现并利用多查询任务间的共享机遇,减少冗余的I/O开销。最后,基于Hadoop平台实现OLACloud系统,并基于TPC-H基准测试包生成具有倾斜数据分布的测试数据集,在此基础上设计并实施性能验证试验。实验结果表明OLACloud系统在面对倾斜数据分布及并发查询任务时具有明显的性能优势。
[1] Herodotou H, Lim H, Luo G et al. Starfish: A self-tuning system for big data analytics. In Proc. the 15th CIDR, Apr. 2011, pp.261-272.[2] Wu S, Ooi B C, Tan K L. Continuous sampling for online aggregation over multiple queries. In Proc. the 2010 International Conference on Management of Data (SIGMOD), June 2010, pp.651-662.[3] Chaudhuri S, Das G, Datar M et al. Overcoming limitations of sampling for aggregation queries. In Proc. the 17th Int. Conf. Data Engineering, Apr. 2001, pp.534-544.[4] Laptev N, Zeng K, Zaniolo C. Early accurate results for advanced analytics on MapReduce. PVLDB, 2012, 5(10): 10281039.[5] Hellerstein J M, Haas P J, Wang H J. Online aggregation. ACM SIGMOD Record., 1997, 26(2): 171-182.[6] Dean J, Ghemawat S. MapReduce: Simplified data processing on large clusters. Communications of the ACM, 2008, 51(1): 107-113.[7] Borkar V, Carey M, Grover R et al. Hyracks: A flexible and extensible foundation for data-intensive computing. In Proc. the 27th International Conference on Data Engineering, Apr. 2011, pp.1151-1162.[8] Pansare N, Borkar V R, Jermaine C et al. Online aggregation for large MapReduce jobs. PVLDB, 2011, 4(11): 1135-1145.[9] Böose J H, Andrzejak A, Höogqvist M. Beyond online aggregation: Parallel and incremental data mining with online mapreduce. In Proc. MDAC, Apr. 2010, Article No.3.[10] Condie T, Conway N, Alvaro P et al. Online aggregation and continuous query support in MapReduce. In Proc. the 2010 International Conference on Management of Data, June 2010, pp.1115-1118.[11] Shi Y, Meng X,Wang F et al. You can stop early with COLA: Online processing of aggregate queries in the cloud. In Proc. the 21st ACM International Conference on Information and Knowledgy Management, Oct. 29-Nov. 2, 2012, pp.1223-1232.[12] Grover R, Carey M J. Extending MapReduce for efficient predicate-based sampling. In Proc. the 28th International Conference on Data Engineering, Apr. 2012, pp.486-497.[13] Wang Y, Luo J, Song A, Jin J H, Dong F. Improving online aggregation performance for skewed data distribution. In Proc. Database Systems for Advanced Applications, Apr. 2012, pp.18-32.[14] Chaudhuri S, Das G, Srivastava U. Effective use of blocklevel sampling in statistics estimation. In Proc. the 2004 International Conference on Management of Data, June 2004, pp.287-298.[15] Jacobs A. The pathologies of big data. Communications of the ACM, 2009, 52(8): 36-44.[16] Soroush E, Balazinska M, Wang D. Arraystore: A storage manager for complex parallel array processing. In Proc. the 2011 International Conference on Management of Data, June 2011, pp.253-264.[17] Eltabakh M Y, Tian Y, Ozcan F et al. CoHadoop: Flexible data placement and its exploitation in Hadoop. PVLDB, 2011, 4(9): 575-585.[18] Nykiel T, Potamias M, Mishra C et al. MRShare: Sharing across multiple queries in MapReduce. PVLDB, 2010, 3(1/2): 494-505.[19] Wu S, Jiang S, Ooi B C et al. Distributed online aggregations. PVLDB, 2009, 2(1): 443-454.[20] Zaharia M, Borthakur D, Sen Sarma J et al. Delay scheduling: A simple technique for achieving locality and fairness in cluster scheduling. In Proc. the 5th European Conference on Computer System, Apr. 2010, pp.265-278.[21] Chaudhuri S, Narasyrra V. Program for tpc-d data generation with skew. Technical Report, ftp://ftp.research.microsoft.com/pub/user./viveknar/tpcdskew, Dec. 2012.[22] Haas P J. Large-sample and deterministic confidence intervals for online aggregation. In Proc. the 9th International Conference on Scientific and Statistical Database Management, Aug. 1997, pp.51-62.[23] Haas P J, Hellerstein J M. Ripple joins for online aggregation. ACM SIGMOD Record, 1999, 28(2): 287-298.[24] Luo G, Ellmann C J, Haas P J et al. A scalable hash ripple join algorithm. In Proc. the 2002 International Conference on Management of Data, June 2002, pp.252-262. |
No related articles found! |
|
版权所有 © 《计算机科学技术学报》编辑部 本系统由北京玛格泰克科技发展有限公司设计开发 技术支持:support@magtech.com.cn 总访问量: |