›› 2013,Vol. 28 ›› Issue (6): 973-988.doi: 10.1007/s11390-013-1392-7

所属专题: Data Management and Data Mining

• Special Section on Selected Paper from NPC 2011 • 上一篇    下一篇


Ying-Jie Shi1 (史英杰), Xiao-Feng Meng1 (孟小峰), Senior Member, CCF, Member, ACM, IEEE Fusheng Wang2, 3, and Yan-Tao Gan1 (干艳桃)   

  • 收稿日期:2012-12-02 修回日期:2013-05-03 出版日期:2013-11-05 发布日期:2013-11-05
  • 作者简介:Ying-Jie Shi received the B.S. degree from Shandong University, Jinan, in 2005, and M.S. degree from Huazhong University of Science and Technology, Wuhan, in 2007, both in computer science and technology. She is currently a Ph.D. candidate of Renmin University of China, Beijing. Her research interests include cloud data management and online aggregations of big data.

HEDC++:An Extended Histogram Estimator for Data in the Cloud

Ying-Jie Shi1 (史英杰), Xiao-Feng Meng1 (孟小峰), Senior Member, CCF, Member, ACM, IEEE Fusheng Wang2, 3, and Yan-Tao Gan1 (干艳桃)   

  1. 1 School of Information, Renmin University of China, Beijing 100872, China;
    2 Department of Biomedical Informatics, Emory University, Atlanta 30322, U.S.A.;
    3 Department of Mathematics and Computer Science, Emory University, Atlanta 30322, U.S.A.
  • Received:2012-12-02 Revised:2013-05-03 Online:2013-11-05 Published:2013-11-05
  • Supported by:

    This research was partially supported by the National Natural Science Foundation of China under Grant Nos. 61070055, 91024032, 91124001, the Fundamental Research Funds for the Central Universities of China, the Research Funds of Renmin University of China under Grant No. 11XNL010, and the National High Technology Research and Development 863 Program of China under Grant Nos. 2012AA010701, 2013AA013204.

随着云数据管理系统的不断发展,很多开源的云数据管理系统如HBase、Hive、Pig和Cassandra等引起了工业界和学术界的广泛关注。与传统具有完善优化技术的关系数据库相比,云数据管理系统发展时间较短,在性能方面有很大的改进空间。在传统的关系数据库中,对于数据分布的总结性信息和统计信息对查询优化起着重要的作用,而直方图正是最有效和最常用的数据分布和统计信息总结方法。在云数据管理系统中,直方图在提高数据存取性能等方面有着重要的作用。首先,直方图为选择最有效的查询执行计划提供了重要参考信息。云计算中大部分查询通过MapReduce框架实现,对于一个给定的查询,往往存在多种基于MapReduce的实现策略。如何根据数据的分布情况选择合适的执行计划是查询优化中的一个重要问题,而直方图则能提供参考信息。其次,直方图提供的基本统计信息可用于负载均衡。负载均衡是影响云计算中查询性能的一个重要因素,因为云计算中的查询往往并行执行。在MapReduce处理框架中,map任务的输出结果根据其键值被哈希分区到多个reduce任务中。如果在该过程中数据倾斜严重,就会导致reduce任务负载不均衡,从而降低查询性能。基于分区键值构建直方图可以有效的辅助设计哈希函数来保证负载均衡。第三,在进行连接的过程中,体现数据分布的直方图有助于减少数据的传输代价,而传输代价是云计算中的一项重要资源。基于连接键值构建的直方图能够帮助避免将不满足连接谓词的数据传输到同一个节点上。除此以外,直方图信息还有助于优化云计算系统中的运行参数设置,在提高用户反馈质量、优化任务调度和资源分配等方面也起着重要作用。基于大规模的海量数据构建完全精确的直方图代价很大,因此对直方图进行估计更具有实际意义。然而由于云计算环境中数据组织和处理模式的特殊性,使得该问题的解决面临很大的挑战。文章提出了对云计算中数据的直方图进行估计的技术HEDC++(An Extended Histogram Estimator for Data in the Cloud),对等宽直方图和等深直方图两种最常用的直方图类型分别提出了相应的估计方法。HEDC++基于改进的MapReduce处理框架提出了直方图估计的数据处理流程,并设计了相应的采样机制保证采样的效率和估计结果的准确性。在Hadoop平台上对HEDC++进行了广泛的实验,实验结果表明HEDC++能够有效的对不同分布的云数据提供直方图估计,其性能优于已有工作。

Abstract: With increasing popularity of cloud-based data management, improving the performance of queries in the cloud is an urgent issue to solve. Summary of data distribution and statistical information has been commonly used in traditional databases to support query optimization, and histograms are of particular interest. Naturally, histograms could be used to support query optimization and efficient utilization of computing resources in the cloud. Histograms could provide helpful reference information for generating optimal query plans, and generate basic statistics useful for guaranteeing the load balance of query processing in the cloud. Since it is too expensive to construct an exact histogram on massive data, building an approximate histogram is a more feasible solution. This problem, however, is challenging to solve in the cloud environment because of the special data organization and processing mode in the cloud. In this paper, we present HEDC++, an extended histogram estimator for data in the cloud, which provides efficient approximation approaches for both equi-width and equi-depth histograms. We design the histogram estimate workflow based on an extended MapReduce framework, and propose novel sampling mechanisms to leverage the sampling efficiency and estimate accuracy. We experimentally validate our techniques on Hadoop and the results demonstrate that HEDC++ can provide promising histogram estimate for massive data in the cloud.

[1] Thusoo A, Sarma J, Jain N et al. Hive: A warehousing solution over a Map-Reduce framework. In Proc. the 35th Conference of Very Large Databases (VLDB2009), August 2009, pp.1626-1629.

[2] Olston C, Reed B, Srivastava U et al. Pig latin: A notso-foreign language for data processing. In Proc. the ACM Int. Conf. Management of Data (SIGMOD2008), June 2008, pp.1099-1110.

[3] Abadi D J. Data management in the cloud: Limitations and opportunities. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 2009, 32(1): 3-12.

[4] Dean J, Ghemawat S. MapReduce: Simplified data processing on large clusters. In Proc. the 6th Symposium on Opearting Systems Design and Implementation (OSDI2004), December 2004, pp.137-150.

[5] Blanas S, Patel J, Ercegovac V et al. A comparison of join algorithms for log processing in MapReduce. In Proc. the ACM Int. Conf. Management of Data (SIGMOD2010), June 2010, pp.975-986.

[6] Okcan A, Riedewald M. Processing theta-joins using MapReduce. In Proc. the ACM International Conference on Management of Data (SIGMOD2011), June 2011, pp.949-960.

[7] Shi Y J, Meng X F, Wang F S et al. HEDC: A histogram estimator for data in the cloud. In Proc. the 4th Int. Workshop on Cloud Data Management (CloudDB2012), Oct. 29-Nov. 2, 2012, pp.51-58.

[8] Poosala V, Ioannidis Y E, Haas P J, Shekita E J. Improved histograms for selectivity estimation of range predicates. In Proc. the ACM International Conference on Management of Data (SIGMOD1996), June 1996, pp.294-305.

[9] Ioannidis Y E. The history of histograms (abridged). In Proc. the 29th Conference of Very Large Databases (VLDB2003), September 2003, pp.19-30.

[10] Piatetsky-Shapiro G, Connell C. Accurate estimation of the number of tuples satisfying a condition. In Proc. the ACM International Conference on Management of Data (SIGMOD1984), June 1984, pp.256-276.

[11] Gibbons P B, Matias Y, Poosala V. Fast incremental maintenance of approximate histograms. ACM Transactions on Database Systems, 2002, 27(3): 261-298.

[12] Chaudhuri S, Motwani R, Narasayya V. Random sampling for histogram construction: How much is enough? In Proc. ACM International Conference on Management of Data (SIGMOD1998), June 1998, pp.436-447.

[13] Chaudhuri S, Motwani R, Narasayya V. Using random sampling for histogram construction. Technical Report, Microsoft, http://citeseerx.ist.psu.edu/showciting?cid=467221, 1997.

[14] Chaudhuri S, Das G, Srivastava U. Effective use of block-level sampling in statistics estimation. In Proc. ACM International Conference on Management of Data (SIGMOD2004), June 2004, pp.287-298.

[15] Jestes J, Yi K, Li F F. Building wavelet histograms on large data in MapReduce. In Proc. the 37th International Conference of Very Large Databases (VLDB2011), August 29September 3, 2011, pp.109-120.

[16] Mousavi H, Zaniolo C. Fast and accurate computation of equi-depth histograms over data streams. In Proc. the 14th International Conference on Extending Database Technology (EDBT2011), March 2011, pp.69-80.

[17] Cochran W G. Sampling Techniques. John Wiley and Sons, 1977.

[18] Francisco C A, Fuller W A. Quantile estimation with a complex survey design. The Annals of Statistics, 1991, 19(1): 454-469.

[19] Woodruff R S. Confidence intervals for medians and other position measures. Journal of the American Statistical Association, 1952, 47(260): 635-646.
No related articles found!
Full text



[1] 刘明业; 洪恩宇;. Some Covering Problems and Their Solutions in Automatic Logic Synthesis Systems[J]. , 1986, 1(2): 83 -92 .
[2] 陈世华;. On the Structure of (Weak) Inverses of an (Weakly) Invertible Finite Automaton[J]. , 1986, 1(3): 92 -100 .
[3] 高庆狮; 张祥; 杨树范; 陈树清;. Vector Computer 757[J]. , 1986, 1(3): 1 -14 .
[4] 陈肇雄; 高庆狮;. A Substitution Based Model for the Implementation of PROLOG——The Design and Implementation of LPROLOG[J]. , 1986, 1(4): 17 -26 .
[5] 黄河燕;. A Parallel Implementation Model of HPARLOG[J]. , 1986, 1(4): 27 -38 .
[6] 闵应骅; 韩智德;. A Built-in Test Pattern Generator[J]. , 1986, 1(4): 62 -74 .
[7] 唐同诰; 招兆铿;. Stack Method in Program Semantics[J]. , 1987, 2(1): 51 -63 .
[8] 闵应骅;. Easy Test Generation PLAs[J]. , 1987, 2(1): 72 -80 .
[9] 张钹; 张铃;. Statistical Heuristic Search[J]. , 1987, 2(1): 1 -11 .
[10] 朱鸿;. Some Mathematical Properties of the Functional Programming Language FP[J]. , 1987, 2(3): 202 -216 .
版权所有 © 《计算机科学技术学报》编辑部
本系统由北京玛格泰克科技发展有限公司设计开发 技术支持:support@magtech.com.cn