计算机科学技术学报 ›› 2019,Vol. 34 ›› Issue (5): 1096-1108.doi: 10.1007/s11390-019-1962-4

所属专题: Computer Networks and Distributed Computing

• • 上一篇    下一篇

异构系统上基于加权出度的细粒度并行程序任务调度

Han Lin, Ming-Fan Li, Cheng-Fan Jia, Jun-Nan Liu, Hong An, Senior Member, CCF, ACM, IEEE   

  1. School of Computer Science and Technology, University of Science and Technology of China, Hefei 230026, China
  • 收稿日期:2018-11-08 修回日期:2019-07-04 出版日期:2019-08-31 发布日期:2019-08-31
  • 作者简介:Han Lin received his B.S. degree in computer science and technology from University of Science and Technology of China, Hefei, in 2013. He is now a Ph.D. candidate in the School of Computer Science and Technology, University of Science and Technology of China, Hefei. His research interests include dataflow runtime, artificial intelligence computing system, and computer architecture.
  • 基金资助:
    The work is supported by the National Key Research and Development Program of China under Grant No. 2016YFB1000403.

Degree-of-Node Task Scheduling of Fine-Grained Parallel Programs on Heterogeneous Systems

Han Lin, Ming-Fan Li, Cheng-Fan Jia, Jun-Nan Liu, Hong An, Senior Member, CCF, ACM, IEEE   

  1. School of Computer Science and Technology, University of Science and Technology of China, Hefei 230026, China
  • Received:2018-11-08 Revised:2019-07-04 Online:2019-08-31 Published:2019-08-31
  • About author:Han Lin received his B.S. degree in computer science and technology from University of Science and Technology of China, Hefei, in 2013. He is now a Ph.D. candidate in the School of Computer Science and Technology, University of Science and Technology of China, Hefei. His research interests include dataflow runtime, artificial intelligence computing system, and computer architecture.
  • Supported by:
    The work is supported by the National Key Research and Development Program of China under Grant No. 2016YFB1000403.

处理器专有化已经成为现代处理器工业的重要发展趋势,在半导体时代的未来十余年,这将很可能仍是主流趋势。随着异构系统中多样性的增加,如何在具备多种异构处理器的系统上高效地组织计算成为一个颇具挑战的问题,也将成为未来异构计算的常态。在本文中,我们分析了一系列在异构系统上目前最先进的任务调度算法,并提出了在异构系统上调度细粒度并行程序的新观点。本文的主要创新在于:1.简化了基于有向无环图的细粒度并行程序调度问题中任务优先级的计算方法,这不仅减小了任务选择阶段的时间复杂度,也使得本文的调度算法可以处理动态有向无环图的调度问题;2.在处理器选择阶段,充分考虑了通信链路对性能的影响,构建了一种新颖的通信模型,使得任务调度更加高效。上述所列两点的实现有赖于对以数据流程序执行模型为代表的细粒度并行问题所进行的探索,并且通过一组选取的基准测试程序进行了实验验证。在人工合成以及真实应用的有向无环图上所进行的实验展现出了非常优秀的性能。在调度长度比和并行效率两个指标上,本文提出的基于节点加权出度的任务调度算法比所有文中评估的目前最优的启发式算法都明显表现更好。

关键词: 任务调度, 异构系统, 并行计算

Abstract: Processor specialization has become the development trend of modern processor industry. It is quite possible that this will still be the main-stream in the next decades of semiconductor era. As the diversity of heterogeneous systems grows, organizing computation efficiently on systems with multiple kinds of heterogeneous processors is a challenging problem and will be a normality. In this paper, we analyze some state-of-the-art task scheduling algorithms of heterogeneous computing systems and propose a Degree of Node First (DONF) algorithm for task scheduling of fine-grained parallel programs on heterogeneous systems. The major innovations of DONF include:1) simplifying task priority calculation for directed acyclic graph (DAG) based fine-grained parallel programs which not only reduces the complexity of task selection but also enables the algorithm to solve the scheduling problem for dynamic DAGs; 2) building a novel communication model in the processor selection phase that makes the task scheduling much more efficient. They are achieved by exploring finegrained parallelism via a dataflow program execution model, and validated through experimental results with a selected set of benchmarks. The results on synthesized and real-world application DAGs show a very good performance. The proposed DONF algorithm significantly outperforms all the evaluated state-of-the-art heuristic algorithms in terms of scheduling length ratio (SLR) and efficiency.

Key words: task scheduling, heterogeneous system, performance, parallel program

[1] Suetterlein J. DARTS:A runtime based on the Codelet execution model[Master Thesis]. University of Delaware, 2014.
[2] Qu P, Yan J, Zhang Y H, Gao G R. Parallel turing machine, a proposal. Journal of Computer Science and Technology, 2017, 32(2):269-285.
[3] Valiant L G. A bridging model for parallel computation. Communications of the ACM, 1990, 33(8):103-111.
[4] Zuckerman S, Suetterlein J, Knauerhase R, Gao G R. Using a "codelet" program execution model for exascale machines:Position paper. In Proc. the 1st International Workshop on Adaptive Self-Tuning Computing Systems for the Exaflop Era, June 2011, pp.64-69.
[5] Suettlerlein J, Zuckerman S, Gao G R. An implementation of the Codelet model. In Proc. the 19th Int. Conf. Parallel Processing, August 2013, pp.633-644.
[6] Ullman J D. NP-complete scheduling problems. Journal of Computer and System Sciences, 1975, 10(3):384-393.
[7] Lenstra J K, Kan A H G R. Computational complexity of discrete optimization problems. Annals of Discrete Mathematics, 1979, 4:121-140.
[8] Garey M R, Johnson D S. Computers and Intractability:A Guide to the Theory of NP-Completeness (1st edition). W.H. Freeman, 1979.
[9] Arabnejad H, Barbosa J G. List scheduling algorithm for heterogeneous systems by an optimistic cost table. IEEE Trans. Parallel and Distributed Systems, 2014, 25(3):682-694.
[10] Dennis J B, Fosseen J B, Linderman J P. Data flow schemas. In Proc. Int. Symp. Theoretical Programming, Aug. 1972, pp.187-216.
[11] Dennis J B. First version of a data flow procedure language. In Proc. Programming Symposium, April 1974, pp.362-376.
[12] Dennis J B. Data flow supercomputers. IEEE Computer, 1980, 13(11):48-56.
[13] Arvind A, Gostelow K P, Plouffe W. Indeterminacy, monitors, and dataflow. ACM SIGOPS Operating Systems Review, 1977, 11(5):159-169.
[14] Arvind A, Kathail V. A multiple processor data flow machine that supports generalized procedures. In Proc. the 8th Symp. Computer Architecture, May 1981, pp.291-302.
[15] Watson I, Gurd J. A practical data flow computer. IEEE Computer, 1982, 15(2):51-57.
[16] Hum H H J, Maquelin O, Theobald K B et al. A design study of the EARTH multiprocessor. In Proc. the IFIP WG10.3 Working Conf. Parallel Architectures and Compilation Techniques, June 1995, pp.59-68.
[17] Farquhar W G, Evripidou P. DART:A data-driven processor architecture for real-time computing. In Proc. the IFIP WG10. 3. Working Conference on Architectures and Compilation Techniques for Fine and Medium Grain Parallelism, January 1993, pp.141-152.
[18] Evripidou P. Thread synchronization unit (TSU):A building block for high performance computers. In Proc. the 1997 Int. Symp. High Performance Computing, Nov. 1997, pp.107-118.
[19] Topcuouglu H, Hariri S, Wu M Y. Performance-effective and low-complexity task scheduling for heterogeneous computing. IEEE Trans. Parallel Distrib. Syst., 2002, 13(3):260-274.
[20] Bittencourt L F, Sakellariou R, Madeira E R M. DAG scheduling using a lookahead variant of the heterogeneous earliest finish time algorithm. In Proc. the 18th Euromicro Int. Conf. Parallel, Distributed and Network-Based Processing, February 2010, pp.27-34.
[21] Munir E U, Mohsin S, Hussain A, Nisar M W, Ali S. SDBATS:A novel algorithm for task scheduling in heterogeneous computing systems. In Proc. the 27th Int. Symp. Parallel and Distributed Processing Symposium Workshops & PhD Forum, May 2013, pp.43-53.
[22] Wang G, Wang Y X, Liu H, Guo H. HSIP:A novel task scheduling algorithm for heterogeneous computing. Sci. Program., 2016, 2016:Article No. 3676149.
[23] Yelick K, Bonachea D, Chen W Y et al. Productivity and performance using partitioned global address space languages. In Proc. the 2007 International Workshop on Parallel Symbolic Computation, July 2007, pp.24-32.
[24] Murray D G, McSherry F, Isaacs R et al. Naiad:A timely dataflow system. In Proc. the 24th ACM SIGOPS Symp. Operating Systems Principles, Nov. 2013, pp.439-455.
[25] Daoud M I, Kharma N. A high performance algorithm for static task scheduling in heterogeneous distributed computing systems. Journal of Parallel and Distributed Computing, 2008, 68(4):399-409.
[26] Blumofe R D, Joerg C F, Kuszmaul B C et al. Cilk:An efficient multithreaded runtime system. Journal of Parallel and Distributed Computing, 1996, 37(1):55-69.
[27] Blumofe R D, Frigo M, Joerg C F et al. DAG-consistent distributed shared memory. In Proc. the 10th International Parallel Processing Symposium, April 1996, pp.132-141.
[28] Augonnet C, Thibault S, Namyst R, Wacrenier P A. StarPU:A unified platform for task scheduling on heterogeneous multicore architectures. Concurrency and Computation:Practice and Experience, 2011, 23(2):187-198.
[29] Dongarra J, Heroux M A, Luszczek P. HPCG benchmark:A new metric for ranking high performance computing systems. Technical Report, Electrical Engineering and Computer Science Department, Knoxville, Tennessee, 2015. https://www.hpcg-benchmark.org/pubs/index.html, May 2019.
[30] Su Z C, Chen J S, Lin H et al. A dataflow-based runtime support on a 100P actual system. In Proc. the 2017 IEEE Int. Symp. Parallel and Distributed Processing with Applications and the 2017 IEEE Int. Conf. Ubiquitous Computing and Communications, December 2017, pp.599-606.
[31] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. In Proc. the 3rd Int. Conf. Learning Representations, May 2015, Article No. 4.
[32] Krizhevsky A, Sutskever I, Hinton G E. ImageNet classification with deep convolutional neural networks. In Proc. the 26th Annual Conference on Neural Information Processing Systems, December 2012, pp.1106-1114.
[33] Sun Y, Liang D, Wang X G et al. DeepID3:Face recognition with very deep neural networks. arXiv:1502.00873, 2015. https://arxiv.org/abs/1502.00873, May 2019.
[34] Dai J F, He K M, Sun J. Convolutional feature masking for joint object and stuff segmentation. In Proc. the 2015 IEEE Conference on Computer Vision and Pattern Recognition, June 2015, pp.3992-4000.
[35] Long J, Shelhamer E, Darrell T. Fully convolutional networks for semantic segmentation. In Proc. the 2015 IEEE Conference on Computer Vision and Pattern Recognition, June 2015, pp.3431-3440.
[36] Ma C, Huang J B, Yang X K et al. Hierarchical convolutional features for visual tracking. In Proc. the 2015 IEEE Int. Conf. Computer Vision, December 2015, pp.3074-3082.
[1] Ying-Lin Zhao, Jian-Lei Yang, Wei-Sheng Zhao, Aida Todri-Sanial, Yuan-Qing Cheng. 考虑供电噪声和散热约束的三维同构片上多核系统任务调度方案研究[J]. 计算机科学技术学报, 2018, 33(5): 966-983.
[2] Dong-Fang Zhou, Jian-Xi Fan, Cheng-Kuan Lin, Bao-Lei Cheng, Jing-Ya Zhou, Zhao Liu. 交换交叉立方体中最优路径的嵌入[J]. , 2017, 32(3): 618-629.
[3] Peng Qu, Jin Yan, You-Hui Zhang, Guang R. Gao. 一个并行图灵机的提议[J]. , 2017, 32(2): 269-285.
[4] Xiao-Hu Yan, Fa-Zhi He, Yi-Lin Chen. 基于野草扰动粒子群算法的新型软硬件划分方法[J]. , 2017, 32(2): 340-355.
[5] Kai Zhang, Feng Chen, Xiaoning Ding, Yin Huai, Rubao Lee, Tian Luo, Kaibo Wang. Hetero-DB: 基于异构计算和存储资源的下一代数据库系统[J]. , 2015, 30(4): 657-678.
[6] Rong Chen, Jia-Xin Shi, Hai-Bo Chen, Bin-Yu Zang. 大数据学习场景中面向二分图的分布式图划分算法[J]. , 2015, 30(1): 20-29.
[7] Juan Fang, Zhen-Yu Leng, Si-Tong Liu, Zhi-Cheng Yao, Xiu-Feng Sui. 面向GPU-CPU异构体系结构的异构片上网络设计空间探索[J]. , 2015, 30(1): 74-83.
[8] Zheng Cao, Xiao-Li Liu, ACM Qiang Li, Xiao-Bing Liu, ACM Zhan Wang, Xue-Jun An. 面向异构计算的服务器内互连架构[J]. , 2014, 29(6): 976-988.
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
[1] 张钹; 张铃;. Statistical Heuristic Search[J]. , 1987, 2(1): 1 -11 .
[2] 孟力明; 徐晓飞; 常会友; 陈光熙; 胡铭曾; 李生;. A Tree-Structured Database Machine for Large Relational Database Systems[J]. , 1987, 2(4): 265 -275 .
[3] 林琦; 夏培肃;. The Design and Implementation of a Very Fast Experimental Pipelining Computer[J]. , 1988, 3(1): 1 -6 .
[4] 孙成政; 慈云桂;. A New Method for Describing the AND-OR-Parallel Execution of Logic Programs[J]. , 1988, 3(2): 102 -112 .
[5] 张钹; 张恬; 张建伟; 张铃;. Motion Planning for Robots with Topological Dimension Reduction Method[J]. , 1990, 5(1): 1 -16 .
[6] 黄维康; F.Lombardi;. Repairing VLSI/WSI Redundant Memories with Minimum Cost[J]. , 1990, 5(2): 187 -196 .
[7] 王鼎兴; 郑纬民; 杜晓黎; 郭毅可;. On the Execution Mechanisms of Parallel Graph Reduction[J]. , 1990, 5(4): 333 -346 .
[8] 蔡士杰; 张福炎;. A Fast Algorithm for Polygon Operations[J]. , 1991, 6(1): 91 -96 .
[9] 周权; 魏道政;. A Complete Critical Path Algorithm for Test Generation of Combinational Circuits[J]. , 1991, 6(1): 74 -82 .
[10] 赵靓海; 刘慎权;. An Environment for Rapid Prototyping of Interactive Systems[J]. , 1991, 6(2): 135 -144 .
版权所有 © 《计算机科学技术学报》编辑部
本系统由北京玛格泰克科技发展有限公司设计开发 技术支持:support@magtech.com.cn
总访问量: