|
计算机科学技术学报 ›› 2021,Vol. 36 ›› Issue (5): 1051-1070.doi: 10.1007/s11390-021-1242-y
所属专题: Computer Architecture and Systems; Computer Networks and Distributed Computing
Yu-Wei Wu, Student Member, CCF, Qing-Gang Wang, Student Member, CCF Long Zheng*, Member, CCF, ACM, IEEE, Xiao-Fei Liao, Senior Member, CCF, Member, IEEE Hai Jin, Fellow, CCF, IEEE, Member, ACM, Wen-Bin Jiang, Member, CCF, ACM, IEEE Ran Zheng, Member, CCF, ACM, IEEE, and Kan Hu, Member, CCF, ACM, IEEE
Yu-Wei Wu, Student Member, CCF, Qing-Gang Wang, Student Member, CCF Long Zheng*, Member, CCF, ACM, IEEE, Xiao-Fei Liao, Senior Member, CCF, Member, IEEE Hai Jin, Fellow, CCF, IEEE, Member, ACM, Wen-Bin Jiang, Member, CCF, ACM, IEEE Ran Zheng, Member, CCF, ACM, IEEE, and Kan Hu, Member, CCF, ACM, IEEE
1、研究背景(context):随着现实世界图的快速增长,图规模很容易超过加速器的片上存储容量,在单个图加速器上处理大型图就变得很困难。在FPGA集群上进行图加速非常必要和重要。现在有许多云服务商(例如,亚马逊,微软和百度)在其数据中心部署FPGA集群,为加速大规模图计算提供了机会。
2、目的(Objective):但是将现有的单FPGA图加速器扩展到数据中心的多FPGA图计算系统的两个主要挑战:首先,由于现有的单FPGA图加速器配备有量身定制的编程模型,运行时系统和通信运行时,因此很难通过重用基础架构来生产新的分布式加速器;其次,当在数据中心中运行的分布式图加速器不能灵活地支持图划分策略,并且未考虑环型互连方案的特殊性时,会有很多不必要的通信开销。
3、方法(Method):本文提出了一个通信库FDGLib,该库可以轻松地将现有的基于FPGA的图加速器扩展到数据中心中形成一个分布式版本的图加速系统,并且无需花费太多的硬件工程工作就可以实现拓展。FDGLib提供了6个只需比较少量的代码修改即可轻松使用并集成到多种图加速器中的API。除此之外,FDGLib还考虑到数据中心中环形的FPGA互连方式,有针对性地设计了相应的图分区和布局方案,进一步提高了通信效率。
4、结果(Result&Findings):本文将FDGLib连接到最先进的图加速器AccuGraph中,并在类似于Microsoft Catapult数据中心的集群上进行了实验。研究结果表明,分布式AccuGraph可以比最新的基于FPGA和CPU的分布式图系统快2.32倍和4.77倍(即ForeGraph和Gemini),并且比起它们具有更好的可扩展性。
5、结论(Conclusions):本文提出了一种通信库FDGLib,它能与现有的单FPGA图加速器集成,并生成分布式图加速系统,用于数据中心的大规模图分析。其关键组件是一个负责在每次迭代中进行顶点值通信的通信控制,以及一个通过考虑环形互连方案的特性来最小化通信开销的通信优化器。随着图规模和数据中心规模的持续增长,如何高效支持图加速还有待继续深入研究。
[1] Quick L, Wilkinson P, Hardcastle D. Using Pregel-like large scale graph processing frameworks for social network analysis. In Proc. the 2012 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, Aug. 2012, pp.457-463. DOI:10.1109/ASONAM.2012.254. [2] Aridhi S, Montresor A, Velegrakis Y. BLADYG:A novel block-centric framework for the analysis of large dynamic graphs. In Proc. the ACM Workshop on High Performance Graph Processing, May 2016, pp.39-42. DOI:10.1016/j.bdr.2017.05.003. [3] Wang Y, Davidson A, Pan Y, Wu Y, Riffel A, Owens J D. Gunrock:A high-performance graph processing library on the GPU. In Proc. the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Mar. 2016, Article No. 11. DOI:10.1145/2851141.2851145. [4] Warnke-Sommer J, Ali H. Graph mining for next generation sequencing:Leveraging the assembly graph for biological insights. BMC Genomics, 2016, 17(1):Article No. 340. DOI:10.1186/s12864-016-2678-2. [5] Dai G, Chi Y, Wang Y, Yang H. FPGP:Graph processing framework on FPGA-A case study of breadth-first search. In Proc. the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Feb. 2016, pp.105-110. DOI:10.1145/2847263.2847339. [6] Engelhardt N, So H K H. GraVF:A vertex-centric distributed graph processing framework on FPGAs. In Proc. the 26th International Conference on Field Programmable Logic and Applications, Aug. 29-Sept. 2, 2016. DOI:10.1109/FPL.2016.7577360. [7] Nurvitadhi E, Weisz G, Wang Y, Hurkat S, Nguyen M, Hoe J C, Martínez J, Guestrin C. GraphGen:An FPGA framework for vertex-centric graph computation. In Proc. the 22nd IEEE Annual International Symposium on FieldProgrammable Custom Computing Machines, May 2014, pp.25-28. DOI:10.1109/FCCM.2014.15. [8] Oguntebi T, Olukotun K. GraphOps:A dataflow library for graph analytics acceleration. In Proc. the 2016 ACM/SIGDA International Symposium on FieldProgrammable Gate Arrays, Feb. 2016, pp.111-117. DOI:10.1145/2847263.2847337. [9] Zhou S, Chelmis C, Prasanna V K. High-throughput and energy-efficient graph processing on FPGA. In Proc. the 24th IEEE Annual International Symposium on FieldProgrammable Custom Computing Machines, May 2016, pp.103-110. DOI:10.1109/FCCM.2016.35. [10] Yao P, Zheng L, Liao X, Jin H, He B. An efficient graph accelerator with parallel data conflict management. In Proc. the 27th International Conference on Parallel Architectures and Compilation Techniques, Nov. 2018, Article No. 8. DOI:10.1145/3243176.3243201. [11] Yang C, Zheng L, Gui C, Jin H. Efficient FPGA-based graph processing with hybrid pull-push computational model. Frontiers Comput. Sci., 2020, 14(4):Article No. 144102. DOI:10.1007/s11704-019-9020-5. [12] Lv X, Xiao W, Zhang Y, Liao X, Jin H, Hua Q. An effective framework for asynchronous incremental graph processing. Frontiers Comput. Sci., 2019, 13(3):539-551. DOI:10.1007/s11704-018-7443-z. [13] Jin H, Yao P, Liao X. Towards dataflow based graph processing. Science China Information Sciences, 2017, 60(12):Article No. 126102. DOI:10.1007/s11432-017-9226-8. [14] Li Z, Ding Z. Distributed optimization on unbalanced graphs via continuous-time methods. Science China Information Sciences, 2018, 61(12):Article No. 129204. DOI:10.1007/s11432-018-9502-1. [15] Ahn J, Hong S, Yoo S, Mutlu O, Choi K. A scalable processing-in-memory accelerator for parallel graph processing. In Proc. the 42nd Annual International Symposium on Computer Architecture, Jun. 2015, pp.105-117. DOI:10.1145/2749469.2750386. [16] McSherry F, Isard M, Murray D G. Scalability! But at what COST? In Proc. the 15th USENIX Conference on Hot Topics in Operating Systems, May 2015. [17] Dai G, Huang T, Chi Y, Xu N, Wang Y, Yang H. ForeGraph:Exploring large-scale graph processing on multiFPGA architecture. In Proc. the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Feb. 2017, pp.217-226. DOI:0.1145/3020078.3021739. [18] Dathathri R, Gill G, Hoang L, Dang H, Brooks A, Dryden N, Snir M, Pingali K. Gluon:A communication-optimizing substrate for distributed heterogeneous graph analytics. In Proc. the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation, Jun. 2018, pp.752-768. DOI:10.1145/3192366.3192404. [19] Satish N, Sundaram N, Patwary M M A, Seo J, Park J, Hassaan M A, Sengupta S, Yin Z, Dubey P. Navigating the maze of graph analytics frameworks using massive graph datasets. In Proc. the 2014 ACM SIGMOD International Conference on Management of Data, Jun. 2014, pp.979-990. DOI:10.1145/2588555.2610518. [20] Khorasani F, Gupta R, Bhuyan L N. Scalable SIMDefficient graph processing on GPUs. In Proc. the 2015 International Conference on Parallel Architectures and Compilation Techniques, Oct. 2015, pp.39-50. DOI:10.1109/PACT.2015.15. [21] Fu H, Liao J, Yang J et al. The Sunway TaihuLight supercomputer:System and applications. Science China Information Sciences, 2016, 59(7):Article No. 072001. DOI:10.1007/s11432-016-5588-7. [22] Zhang F, Zheng L, Liao X, Lv X, Jin H, Xiao J. An effective 2-dimension graph partitioning for work stealing assisted graph processing on multi-FPGAs. IEEE Transactions on Big Data, DOI:10.1109/TBDATA.2020.3035090. [23] Engelhardt N, So H K H. GraVF-M:Graph processing system generation for multi-FPGA platforms. ACM Transactions on Reconfigurable Technology and Systems, 2019, 12(4):Article No. 21. DOI:10.1145/3357596. [24] Shao Z, Li R, Hu D, Liao X, Jin H. Improving performance of graph processing on FPGA-DRAM platform by twolevel vertex caching. In Proc. the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Feb. 2019, pp.320-329. DOI:10.1145/3289602.3293900. [25] Zhou S, Kannan R, Prasanna V K, Seetharaman G, Wu Q. HitGraph:High-throughput graph processing framework on FPGA. IEEE Transactions on Parallel and Distributed Systems, 2019, 30(10):2249-2264. DOI:10.1109/TPDS.2019.2910068. [26] Wang Q, Zheng L, Zhao J, Liao X, Jin H, Xue J. A conflictfree scheduler for high-performance graph processing on multi-pipeline FPGAs. ACM Transactions on Architecture and Code Optimization, 2020, 17(2):Article No. 14. DOI:10.1145/3390523. [27] Putnam A, Caulfield A M, Chung E S et al. A reconfigurable fabric for accelerating large-scale datacenter services. In Proc. the 41st ACM/IEEE International Symposium on Computer Architecture, Jun. 2014, pp.13-24. DOI:10.110-9/ISCA.2014.6853195. [28] Caulfield A M, Chung E S, Putnam A et al. Configurable clouds. IEEE Micro, 2017, 37(3):52-61. DOI:10.1109/MM.2017.51. [29] Zhou S, Prasanna V K. Accelerating graph analytics on CPU-FPGA heterogeneous platform. In Proc. the 29th International Symposium on Computer Architecture and High Performance Computing, Oct. 2017, pp.137-144. DOI:10.1109/SBAC-PAD.2017.25. [30] Zhu X, Chen W, Zheng W, Ma X. Gemini:A computationcentric distributed graph processing system. In Proc. the 12th USENIX Symposium on Operating Systems Design and Implementation, Nov. 2016, pp.301-316. [31] Malewicz G, Austern M H, Bik A, Dehnert J C, Horn I, Leiser N, Czajkowski G. Pregel:A system for large-scale graph processing. In Proc. the 2010 ACM SIGMOD International Conference on Management of Data, Jun. 2010, pp.135-146. DOI:10.1145/1807167.1807184. [32] Eskandari N, Tarafdar N, Ly-Ma D, Chow P. A modular heterogeneous stack for deploying FPGAs and CPUs in the data center. In Proc. the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Feb. 2019, pp.262-271. DOI:10.1145/3289602.3293909. [33] Gill G, Dathathri R, Hoang L, Pingali K. A study of partitioning policies for graph analytics on large-scale distributed platforms. Proceedings of the VLDB Endowment, 2018, 12(4):321-334. DOI:10.14778/3297753.3297754. [34] Zilberman N, Bracha G, Schzukin G. Stardust:Divide and conquer in the data center network. In Proc. the 16th USENIX Symposium on Networked Systems Design and Implementation, Feb. 2019, pp.141-160. [35] Snir M, Gropp W, Otto S, Huss-Lederman S, Dongarra J, Walker D. MPI-The Complete Reference:Volume 1, the MPI Core (2nd edition). MIT Publishers, 1998. [36] Gonzalez J E, Low Y, Gu H, Bickson D, Guestrin C. PowerGraph:Distributed graph-parallel computation on natural graphs. In Proc. the 10th USENIX Symposium on Operating Systems Design and Implementation, Oct. 2012, pp.17-30. [37] Boman E G, Devine K D, Rajamanickam S. Scalable matrix computations on large scale-free graphs using 2D graph partitioning. In Proc. the International Conference on High Performance Computing, Networking, Storage and Analysis, Nov. 2013. DOI:10.1145/2503210.2503293. [38] Slota G M, Rajamanickam S, Devine K D, Madduri K. Partitioning trillion-edge graphs in minutes. In Proc. the 2017 IEEE International Parallel and Distributed Processing Symposium, May 29-June 2, 2017, pp.646-655. DOI:10.1109/IPDPS.2017.95. [39] Chen R, Shi J, Chen Y, Zang B, Guan H, Chen H. PowerLyra:Differentiated graph computation and partitioning on skewed graphs. ACM Transactions on Parallel Computing, 2018, 5(3):Article No. 13. DOI:10.1145/3298989. [40] Leskovec J, Krevl A. SNAP datasets:Stanford large network dataset collection, 2014. https://snap.stanford.edu/data, May 2021. [41] Gonzalez J E, Xin R S, Dave A, Crankshaw D, Franklin M J, Stoica I. GraphX:Graph processing in a distributed dataflow framework. In Proc. the 11th USENIX Symposium on Operating Systems Design and Implementation, Oct. 2014, pp.599-613. [42] Chi Y, Dai G, Wang Y, Sun G, Li G, Yang H. NXgraph:An efficient graph processing system on a single machine. In Proc. the 32nd IEEE International Conference on Data Engineering, May 2016, pp.409-420. DOI:10.1109/ICDE.2016.7498258. [43] Zhu X, Han W, Chen W. GridGraph:Large-scale graph processing on a single machine using 2-level hierarchical partitioning. In Proc. the 2015 USENIX Annual Technical Conference, Jul. 2015, pp.375-386. [44] Roy A, Mihailovic I, Zwaenepoel W. X-Stream:Edgecentric graph processing using streaming partitions. In Proc. the 24th ACM Symposium on Operating Systems Principles, Nov. 2013, pp.472-488. DOI:10.1145/2517-349.2522740. [45] Shun J, Blelloch G E. Ligra:A lightweight graph processing framework for shared memory. In Proc. the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Feb. 2013, pp.135-146. DOI:10.1145/2442516.2442530. [46] DeLorimier M, Kapre N, Mehta N, Rizzo D, Eslick I, Rubin R, Uribe T E, Knight T F, DeHon A. GraphStep:A system architecture for sparse-graph algorithms. In Proc. the 14th Annual IEEE Symposium on Field-Programmable Custom Computing Machines, Apr. 2006, pp.143-151. DOI:10.1109/FCCM.2006.45. [47] Betkaoui B,Wang Y, Thomas D B, Luk W. Parallel FPGAbased all pairs shortest paths for sparse networks:A human brain connectome case study. In Proc. the 22nd International Conference on Field Programmable Logic and Applications, Aug. 2012, pp.99-104. DOI:10.1109/FPL.2012.6339247. |
[1] | Songjie Niu, Dongyan Zhou. SOOP:支持二阶随机游走的高效的分布式图计算[J]. 计算机科学技术学报, 2021, 36(5): 985-1001. |
[2] | Ding-Huang Hu, De-Zun Dong, Yang Bai, Shan Huang, Ze-Jia Zhou, Zi-Hao Wei, Xiang-Ke Liao. Harmonia:数据中心中使ECN和信用预约流量收敛的拥塞控制[J]. 计算机科学技术学报, 2021, 36(5): 1071-1086. |
[3] | Songjie Niu, Shimin Chen. TransGPerf:利用迁移学习建模分布式图计算性能[J]. 计算机科学技术学报, 2021, 36(4): 778-791. |
[4] | Gui-Juan Wang, Cheng-Kuan Lin, Jian-Xi Fan, Jing-Ya Zhou, Bao-Lei Cheng. 在考虑各种故障元素的情况下BCube网络的容错哈密顿性质和容错哈密顿连通性[J]. 计算机科学技术学报, 2020, 35(5): 1064-1083. |
[5] | Shu-Quan Wang, Lei Wang, Yu Deng, Zhi-Jie Yang, Sha-Sha Guo, Zi-Yang Kang, Yu-Feng Guo, Wei-Xia Xu. SIES:在FPGA上的脉冲卷积神经网络推理引擎的新型实现[J]. 计算机科学技术学报, 2020, 35(2): 475-489. |
[6] | Peng Zhao, Chen Ding, Lei Liu, Jiping Yu, Wentao Han, Xiao-Bing Feng. Cacheap:图计算可移植协同输入/输出优化[J]. 计算机科学技术学报, 2019, 34(3): 690-706. |
[7] | Chuang-Yi Gui, Long Zheng, Bingsheng He, Cheng Liu, Xin-Yu Chen, Xiao-Fei Liao, Hai Jin. 图计算加速器综述[J]. 计算机科学技术学报, 2019, 34(2): 339-371. |
[8] | Song Liu, Yuan-Zhen Cui, Nian-Jun Zou, Wen-Hao Zhu, Dong Zhang, Wei-Guo Wu. 一种面向DOACROSS循环的并行策略研究[J]. 计算机科学技术学报, 2019, 34(2): 456-475. |
[9] | Shuai-Bing Lu, Jie Wu, Huan-Yang Zheng, Zhi-Yi Fang. 基于软管模型下的数据中心最大弹性资源调度算法[J]. 计算机科学技术学报, 2019, 34(1): 185-206. |
[10] | Xiao-Dong Dong, Sheng Chen, Lai-Ping Zhao, Xiao-Bo Zhou, Heng Qi, Ke-Qiu Li. 多请求,少开销:分层计价模式下的数据中心间不确定流量传输[J]. 计算机科学技术学报, 2018, 33(6): 1152-1163. |
[11] | Xi Wang, Jian-Xi Fan, Cheng-Kuan Lin, Jing-Ya Zhou, Zhao Liu. BCDC:一种高性能的以服务器为中心的数据中心网络[J]. , 2018, 33(2): 400-416. |
[12] | Fa-Qiang Sun, Gui-Hai Yan, Xin He, Hua-Wei Li, Yin-He Han. 基于等效资源配置的数据中心效能优化方法研究[J]. , 2018, 33(1): 131-144. |
[13] | Xue-Kai Du, Zhi-Hui Lu, Qiang Duan, Jie Wu, Cheng-Rong Wu. 在多租户数据中心中针对安全服务的负载自适应的流量转向和转发方案[J]. , 2017, 32(6): 1265-1278. |
[14] | Yi-Hong Gao, Hua-Dong Ma, Wu Liu. 视频数据中心中最小化摄像机流调度的最小开销[J]. , 2017, 32(3): 555-570. |
[15] | Tao Jiang, Rui Hou, Jian-Bo Dong, Lin Chai, Sally A. McKee, Bin Tian, Li-Xin Zhang, Ning-Hui Sun. 基于数据中心互连的新型内存架构[J]. , 2015, 30(1): 97-109. |
|
版权所有 © 《计算机科学技术学报》编辑部 本系统由北京玛格泰克科技发展有限公司设计开发 技术支持:support@magtech.com.cn 总访问量: |