|
›› 2013,Vol. 28 ›› Issue (6): 973-988.doi: 10.1007/s11390-013-1392-7
所属专题: Data Management and Data Mining
• Special Section on Selected Paper from NPC 2011 • 上一篇 下一篇
Ying-Jie Shi1 (史英杰), Xiao-Feng Meng1 (孟小峰), Senior Member, CCF, Member, ACM, IEEE Fusheng Wang2, 3, and Yan-Tao Gan1 (干艳桃)
Ying-Jie Shi1 (史英杰), Xiao-Feng Meng1 (孟小峰), Senior Member, CCF, Member, ACM, IEEE Fusheng Wang2, 3, and Yan-Tao Gan1 (干艳桃)
随着云数据管理系统的不断发展,很多开源的云数据管理系统如HBase、Hive、Pig和Cassandra等引起了工业界和学术界的广泛关注。与传统具有完善优化技术的关系数据库相比,云数据管理系统发展时间较短,在性能方面有很大的改进空间。在传统的关系数据库中,对于数据分布的总结性信息和统计信息对查询优化起着重要的作用,而直方图正是最有效和最常用的数据分布和统计信息总结方法。在云数据管理系统中,直方图在提高数据存取性能等方面有着重要的作用。首先,直方图为选择最有效的查询执行计划提供了重要参考信息。云计算中大部分查询通过MapReduce框架实现,对于一个给定的查询,往往存在多种基于MapReduce的实现策略。如何根据数据的分布情况选择合适的执行计划是查询优化中的一个重要问题,而直方图则能提供参考信息。其次,直方图提供的基本统计信息可用于负载均衡。负载均衡是影响云计算中查询性能的一个重要因素,因为云计算中的查询往往并行执行。在MapReduce处理框架中,map任务的输出结果根据其键值被哈希分区到多个reduce任务中。如果在该过程中数据倾斜严重,就会导致reduce任务负载不均衡,从而降低查询性能。基于分区键值构建直方图可以有效的辅助设计哈希函数来保证负载均衡。第三,在进行连接的过程中,体现数据分布的直方图有助于减少数据的传输代价,而传输代价是云计算中的一项重要资源。基于连接键值构建的直方图能够帮助避免将不满足连接谓词的数据传输到同一个节点上。除此以外,直方图信息还有助于优化云计算系统中的运行参数设置,在提高用户反馈质量、优化任务调度和资源分配等方面也起着重要作用。基于大规模的海量数据构建完全精确的直方图代价很大,因此对直方图进行估计更具有实际意义。然而由于云计算环境中数据组织和处理模式的特殊性,使得该问题的解决面临很大的挑战。文章提出了对云计算中数据的直方图进行估计的技术HEDC++(An Extended Histogram Estimator for Data in the Cloud),对等宽直方图和等深直方图两种最常用的直方图类型分别提出了相应的估计方法。HEDC++基于改进的MapReduce处理框架提出了直方图估计的数据处理流程,并设计了相应的采样机制保证采样的效率和估计结果的准确性。在Hadoop平台上对HEDC++进行了广泛的实验,实验结果表明HEDC++能够有效的对不同分布的云数据提供直方图估计,其性能优于已有工作。
[1] Thusoo A, Sarma J, Jain N et al. Hive: A warehousing solution over a Map-Reduce framework. In Proc. the 35th Conference of Very Large Databases (VLDB2009), August 2009, pp.1626-1629.[2] Olston C, Reed B, Srivastava U et al. Pig latin: A notso-foreign language for data processing. In Proc. the ACM Int. Conf. Management of Data (SIGMOD2008), June 2008, pp.1099-1110.[3] Abadi D J. Data management in the cloud: Limitations and opportunities. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 2009, 32(1): 3-12.[4] Dean J, Ghemawat S. MapReduce: Simplified data processing on large clusters. In Proc. the 6th Symposium on Opearting Systems Design and Implementation (OSDI2004), December 2004, pp.137-150.[5] Blanas S, Patel J, Ercegovac V et al. A comparison of join algorithms for log processing in MapReduce. In Proc. the ACM Int. Conf. Management of Data (SIGMOD2010), June 2010, pp.975-986.[6] Okcan A, Riedewald M. Processing theta-joins using MapReduce. In Proc. the ACM International Conference on Management of Data (SIGMOD2011), June 2011, pp.949-960.[7] Shi Y J, Meng X F, Wang F S et al. HEDC: A histogram estimator for data in the cloud. In Proc. the 4th Int. Workshop on Cloud Data Management (CloudDB2012), Oct. 29-Nov. 2, 2012, pp.51-58.[8] Poosala V, Ioannidis Y E, Haas P J, Shekita E J. Improved histograms for selectivity estimation of range predicates. In Proc. the ACM International Conference on Management of Data (SIGMOD1996), June 1996, pp.294-305.[9] Ioannidis Y E. The history of histograms (abridged). In Proc. the 29th Conference of Very Large Databases (VLDB2003), September 2003, pp.19-30.[10] Piatetsky-Shapiro G, Connell C. Accurate estimation of the number of tuples satisfying a condition. In Proc. the ACM International Conference on Management of Data (SIGMOD1984), June 1984, pp.256-276.[11] Gibbons P B, Matias Y, Poosala V. Fast incremental maintenance of approximate histograms. ACM Transactions on Database Systems, 2002, 27(3): 261-298.[12] Chaudhuri S, Motwani R, Narasayya V. Random sampling for histogram construction: How much is enough? In Proc. ACM International Conference on Management of Data (SIGMOD1998), June 1998, pp.436-447.[13] Chaudhuri S, Motwani R, Narasayya V. Using random sampling for histogram construction. Technical Report, Microsoft, http://citeseerx.ist.psu.edu/showciting?cid=467221, 1997.[14] Chaudhuri S, Das G, Srivastava U. Effective use of block-level sampling in statistics estimation. In Proc. ACM International Conference on Management of Data (SIGMOD2004), June 2004, pp.287-298.[15] Jestes J, Yi K, Li F F. Building wavelet histograms on large data in MapReduce. In Proc. the 37th International Conference of Very Large Databases (VLDB2011), August 29September 3, 2011, pp.109-120.[16] Mousavi H, Zaniolo C. Fast and accurate computation of equi-depth histograms over data streams. In Proc. the 14th International Conference on Extending Database Technology (EDBT2011), March 2011, pp.69-80.[17] Cochran W G. Sampling Techniques. John Wiley and Sons, 1977.[18] Francisco C A, Fuller W A. Quantile estimation with a complex survey design. The Annals of Statistics, 1991, 19(1): 454-469.[19] Woodruff R S. Confidence intervals for medians and other position measures. Journal of the American Statistical Association, 1952, 47(260): 635-646. |
No related articles found! |
|
版权所有 © 《计算机科学技术学报》编辑部 本系统由北京玛格泰克科技发展有限公司设计开发 技术支持:support@magtech.com.cn 总访问量: |