Journal of Computer Science and Technology ›› 2020, Vol. 35 ›› Issue (2): 453-467. doi: 10.1007/s11390-020-9702-3
Yun-Cong Zhang1, Xiao-Yang Wang1,2,*, Cong Wang1, Yao Xu1, Jian-Wei Zhang1, Xiao-Dong Lin1, Guang-Yu Sun2, Member, CCF, IEEE, Gong-Lin Zheng1, Shan-Hui Yin1, Xian-Jin Ye1, Li Li1, Zhan Song1, Dong-Dong Miao1
With the dramatic growth in data volume, distributed computing has become the standard approach to processing massive datasets in data centers. Although many computing models and dataset abstractions have been proposed, the vast majority adopt a single-level data partitioning model. As a result, developers often have to implement multi-level partitioning operations by hand, which both increases development effort and can make programs hard to maintain. To address this problem, we propose the NDD programming model, which greatly reduces the burden of data partitioning on programmers.
The NDD model allows users to partition a dataset multiple times, forming a multi-level partition structure: for example, first into coarse groups by color, then into sub-groups by shape. Besides further subdividing a dataset, NDD also supports merging the lower-level sub-groups back together, returning to the partition structure one level above.
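The two-level partitioning and re-merging described above can be sketched in plain Python. This is only an illustration of the idea, not the Bigflow/NDD API; `partition_by` and `flatten_one_level` are hypothetical helper names introduced here for clarity:

```python
from collections import defaultdict

# Toy records: each item has a color and a shape attribute.
items = [
    {"color": "red",  "shape": "circle", "value": 1},
    {"color": "red",  "shape": "square", "value": 2},
    {"color": "blue", "shape": "circle", "value": 3},
    {"color": "blue", "shape": "circle", "value": 4},
]

def partition_by(records, key):
    """Group a flat list of records into {key_value: [records]}."""
    groups = defaultdict(list)
    for r in records:
        groups[r[key]].append(r)
    return dict(groups)

# Level 1: partition by color; level 2: partition each color group by shape.
nested = {color: partition_by(group, "shape")
          for color, group in partition_by(items, "color").items()}

def flatten_one_level(nested_groups):
    """Merge the innermost sub-groups, returning to the one-level structure."""
    return {outer: [r for inner in inners.values() for r in inner]
            for outer, inners in nested_groups.items()}

by_color_again = flatten_one_level(nested)
```

In a single-level model such as plain RDDs or MapReduce, the inner `partition_by` step would have to be re-implemented manually inside each group; the point of NDD is to make this nesting, and the inverse merge, first-class operations that the framework can also optimize.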
Based on the NDD model, we implemented Bigflow, an open-source programming framework. It can be viewed as a general optimization layer over existing distributed systems, providing many automatic optimization strategies. We ran Bigflow on top of Spark and Hadoop and observed performance improvements of 30% and 50%, respectively, on multi-level partitioning benchmarks, along with a substantial reduction in code size compared with Hadoop. Bigflow has been deployed in Baidu's data centers, processing on the order of 3 PB of data per day. According to user feedback, Bigflow greatly eases their development burden while delivering very good performance for their applications.
NDD works well on batch distributed computing engines such as Spark and Hadoop. We are currently exploring how to apply the same ideas to streaming engines such as Flink, so as to offer users the more convenient multi-level partitioning model without sacrificing performance.