Journal of Computer Science and Technology ›› 2023, Vol. 38 ›› Issue (1): 115-127. DOI: 10.1007/s11390-023-2875-9

Special Section: Computer Architecture and Systems


A Memory Access Optimization Method for Graph Neural Network Accelerators

  


GShuttle: Optimizing Memory Access Efficiency for Graph Convolutional Neural Network Accelerators

Jia-Jun Li1 (李家军), Ke Wang2 (王可), Hao Zheng3 (郑皓), and Ahmed Louri4, Fellow, IEEE        

  1. School of Astronautics, Beihang University, Beijing 100191, China
    2. Department of Electrical and Computer Engineering, University of North Carolina at Charlotte, Charlotte, NC 28223, U.S.A.
    3. Department of Electrical and Computer Engineering, University of Central Florida, Orlando, FL 32816, U.S.A.
    4. Department of Electrical and Computer Engineering, George Washington University, Washington, DC 20052, U.S.A.
  • Received:2022-09-29 Revised:2022-10-29 Accepted:2023-01-01 Online:2023-02-28 Published:2023-02-28
  • Contact: Jia-Jun Li E-mail:jiajunli@buaa.edu.cn
  • About author:Jia-Jun Li received his B.E. degree from the Department of Automation, Tsinghua University, Beijing, in 2013. He received his Ph.D. degree from the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, in 2019. From 2019 to 2021, he was a postdoctoral researcher with the Department of Electrical and Computer Engineering, George Washington University, Washington. He is currently an associate professor with the School of Astronautics, Beihang University, Beijing. His current research interests include machine learning and heterogeneous computer architecture.
  • Supported by:
    This work was supported by the National Science Foundation of USA under Grant Nos. CCF-2131946, CCF-1953980, and CCF-1702980. Part of this work was conducted when Prof. Jia-Jun Li was a post-doctoral researcher at the HPCAT Laboratory, George Washington University.

1. Research Background
Graph convolutional neural networks (GCNs) are an important approach to extending deep learning to graph data. However, graphs often contain a very large number of nodes whose connectivity varies widely, so underlying computing platforms usually cannot process GCNs efficiently. Recently, researchers have proposed dedicated GCN accelerators to improve the performance and energy efficiency of GCN processing. None of these designs, however, has systematically studied how loop optimization techniques affect accelerator energy efficiency through memory accesses. They adopt only a limited set of loop optimization techniques and can hardly exploit data reuse effectively, which increases the number of memory accesses. Since memory accesses consume far more energy than computation, this substantially degrades accelerator energy efficiency.

The goal of this work is to build a GCN acceleration framework that systematically explores memory access optimization techniques for GCNs and maximizes memory access efficiency, thereby improving the energy efficiency of GCN accelerators.
2. Research Methods
This work proposes GShuttle, a memory access optimization framework for GCNs. GShuttle uses two algorithms, a greedy search (GShuttle-GS) and pruned search space sweeping (GShuttle-PSSS), to determine the optimal design variables for an accelerator under given constraints. We built a C++-based accelerator simulator to evaluate how effectively the proposed methods reduce memory accesses, and validated them on five graph datasets.
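To give a concrete flavor of the greedy strategy, the following is a minimal C++ sketch in the spirit of GShuttle-GS. The tile-size variables (tm, tn, tk), the buffer-capacity check, and the DRAM-access cost model are simplified assumptions introduced here for illustration only; they are not the exact design variables or analytical model used in the paper.

    #include <cstdint>
    #include <iostream>

    // Candidate design point: tile sizes along the three dimensions of a tiled
    // M x K x N matrix multiplication.
    struct Design { int tm, tn, tk; };

    // Illustrative on-chip buffer constraint: the three operand tiles must fit
    // in the SRAM budget (counted in words).
    bool fits(const Design& d, int64_t sramWords) {
        int64_t words = (int64_t)d.tm * d.tk    // input tile
                      + (int64_t)d.tk * d.tn    // weight tile
                      + (int64_t)d.tm * d.tn;   // output tile
        return words <= sramWords;
    }

    // Illustrative cost model: DRAM words moved when each operand is re-fetched
    // once per tile of the loop it does not share.
    int64_t dramAccesses(const Design& d, int64_t M, int64_t K, int64_t N) {
        int64_t tilesM = (M + d.tm - 1) / d.tm;
        int64_t tilesN = (N + d.tn - 1) / d.tn;
        int64_t tilesK = (K + d.tk - 1) / d.tk;
        return M * K * tilesN        // input matrix re-read per output-column tile
             + K * N * tilesM        // weight matrix re-read per output-row tile
             + 2 * M * N * tilesK;   // partial sums spilled/reloaded per K tile
    }

    // Greedy search: repeatedly grow whichever tile dimension yields the largest
    // reduction in modeled DRAM accesses while still fitting in the buffer.
    Design greedySearch(int64_t M, int64_t K, int64_t N, int64_t sramWords) {
        Design best{1, 1, 1};
        bool improved = true;
        while (improved) {
            improved = false;
            Design cand = best;
            for (int dim = 0; dim < 3; ++dim) {
                Design next = best;
                if (dim == 0)      next.tm += 1;
                else if (dim == 1) next.tn += 1;
                else               next.tk += 1;
                if (fits(next, sramWords) &&
                    dramAccesses(next, M, K, N) < dramAccesses(cand, M, K, N)) {
                    cand = next;
                    improved = true;
                }
            }
            best = cand;
        }
        return best;
    }

    int main() {
        // Example sizes loosely based on the Cora dataset (2708 nodes, 1433 input
        // features, 16 hidden features) and a hypothetical 64K-word SRAM budget.
        Design d = greedySearch(2708, 1433, 16, 64 * 1024);
        std::cout << "Tm=" << d.tm << " Tn=" << d.tn << " Tk=" << d.tk << "\n";
        return 0;
    }

GShuttle-PSSS, by contrast, would prune and then sweep the remaining candidate space exhaustively rather than climbing one step at a time; the constraint check and cost model play the same role in both strategies.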

3. Experiments and Results
Experimental results show that GShuttle improves off-chip memory access efficiency by up to 70% over prior work.

4. Conclusions and Future Work
Most of the energy consumed by a GCN accelerator goes to memory accesses, so improving memory access efficiency is the key to improving accelerator energy efficiency. This paper proposes GShuttle, a GCN acceleration framework that formally defines the memory access optimization problem for GCN accelerators and presents two algorithms to solve it. Both algorithms quickly and efficiently find near-optimal design variables for GCN accelerators. Experimental results show that GShuttle significantly improves the memory access efficiency of the accelerator. Moreover, because the proposed memory access optimizations do not conflict with existing computation optimizations, GShuttle could also be applied to other GCN accelerators, such as HyGCN and AWB-GCN, to improve their memory access efficiency.
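Schematically, a formalization of this kind can be written as a constrained minimization over the accelerator's design variables. The notation below is an illustrative sketch under simplified assumptions, not the paper's exact formulation:

\[
\begin{aligned}
\min_{T} \quad & \#\mathrm{DRAM}(T) \\
\text{s.t.} \quad & \mathrm{Buf}_{\mathrm{in}}(T) + \mathrm{Buf}_{\mathrm{wgt}}(T) + \mathrm{Buf}_{\mathrm{out}}(T) \le C_{\mathrm{SRAM}},
\end{aligned}
\]

where T collects the design variables (e.g., tile sizes and loop order), #DRAM(T) is the analytically estimated number of off-chip accesses implied by T, the Buf terms are the on-chip buffer footprints of the input, weight, and output tiles, and C_SRAM is the SRAM capacity. GShuttle-GS and GShuttle-PSSS are two strategies for searching this constrained space.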


Abstract: Graph convolutional neural networks (GCNs) have emerged as an effective approach to extending deep learning for graph data analytics, but they are computationally challenging given the irregular graphs and the large number of nodes in a graph. GCNs involve chain sparse-dense matrix multiplications with six loops, which results in a large design space for GCN accelerators. Prior work on GCN acceleration either employs limited loop optimization techniques, or determines the design variables based on random sampling, which can hardly exploit data reuse efficiently, thus degrading system efficiency. To overcome this limitation, this paper proposes GShuttle, a GCN acceleration scheme that maximizes memory access efficiency to achieve high performance and energy efficiency. GShuttle systematically explores loop optimization techniques for GCN acceleration, and quantitatively analyzes the design objectives (e.g., required DRAM accesses and SRAM accesses) by analytical calculation based on multiple design variables. GShuttle further employs two approaches, pruned search space sweeping and greedy search, to find the optimal design variables under certain design constraints. We demonstrated the efficacy of GShuttle by evaluation on five widely used graph datasets. The experimental simulations show that GShuttle reduces the number of DRAM accesses by a factor of 1.5 and saves energy by a factor of 1.7 compared with the state-of-the-art approaches.
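To make the six-loop structure mentioned above concrete: a GCN layer reduces to two chained matrix multiplications, O = A(XW), each of which is a three-level loop nest. The dense C++ sketch below ignores the sparse storage of A and X as well as any tiling, and is only meant to show the loop structure whose ordering, tiling, and fusion a GCN accelerator's design space covers; the dimension names are assumptions:

    #include <vector>

    // Dense sketch of one GCN layer, O = A * (X * W).
    // A: N x N (normalized adjacency), X: N x F (features), W: F x C (weights),
    // all stored row-major. Real accelerators keep A and X in sparse formats and
    // tile these loops; this sketch only exposes the six-loop structure.
    void gcnLayer(int N, int F, int C,
                  const std::vector<float>& A,
                  const std::vector<float>& X,
                  const std::vector<float>& W,
                  std::vector<float>& O) {
        std::vector<float> Z(static_cast<size_t>(N) * C, 0.0f);

        // Combination, Z = X * W: loops 1-3.
        for (int n = 0; n < N; ++n)
            for (int c = 0; c < C; ++c)
                for (int f = 0; f < F; ++f)
                    Z[n * C + c] += X[n * F + f] * W[f * C + c];

        // Aggregation, O = A * Z: loops 4-6.
        O.assign(static_cast<size_t>(N) * C, 0.0f);
        for (int i = 0; i < N; ++i)
            for (int c = 0; c < C; ++c)
                for (int j = 0; j < N; ++j)
                    O[i * C + c] += A[i * N + j] * Z[j * C + c];
    }

Interchanging, tiling, and partially fusing these two nests changes which operands stay resident in the on-chip buffers, and hence the number of DRAM and SRAM accesses that a scheme like GShuttle models analytically.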

Key words: graph convolutional neural network, memory access, neural network accelerator

<table class="reference-tab" style="background-color:#FFFFFF;width:914.104px;color:#333333;font-family:Calibri, Arial, 微软雅黑, "font-size:16px;"> <tbody> <tr class="document-box" id="b1"> <td valign="top" class="td1"> [1] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Jiang W W, Luo J Y. Graph neural network for traffic forecasting: A survey. arXiv: 2101.11174, 2021. <a href="https://arxiv.org/abs/2101.11174">https://arxiv.org/abs/2101.11174</a>, Dec. 2022. </div> </td> </tr> <tr class="document-box" id="b2"> <td valign="top" class="td1"> [2] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Shi W J, Rajkumar R. Point-GNN: Graph neural network for 3D object detection in a point cloud. In <i>Proc</i>. <i>the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition</i>, Jun. 2020, pp.1711–1719. DOI: <a href="http://dx.doi.org/10.1109/CVPR42600.2020.00178">10.1109/CVPR42600.2020.00178</a>. </div> </td> </tr> <tr class="document-box" id="b3"> <td valign="top" class="td1"> [3] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Wee C Y, Liu C Q, Lee A, Poh J S, Ji H, Qiu A Q, The Alzheimers Disease Neuroimage Initiative. Cortical graph neural network for AD and MCI diagnosis and transfer learning across populations. <i>NeuroImage</i>: <i>Clinical</i>, 2019, 23: 101929. DOI: <a href="http://dx.doi.org/10.1016/j.nicl.2019.101929">10.1016/j.nicl.2019.101929</a>. </div> </td> </tr> <tr class="document-box" id="b4"> <td valign="top" class="td1"> [4] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Zhang Z W, Cui P, Zhu W W. Deep learning on graphs: A survey. <i>IEEE Trans</i>. <i>Knowledge and Data Engineering</i>, 2022, 34(1): 249–270. DOI: <a href="http://dx.doi.org/10.1109/TKDE.2020.2981333">10.1109/TKDE.2020.2981333</a>. </div> </td> </tr> <tr class="document-box" id="b5"> <td valign="top" class="td1"> [5] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Yang H X. AliGraph: A comprehensive graph neural network platform. In <i>Proc</i>. <i>the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining</i>, Jul. 2019, pp.3165–3166. DOI: <a href="http://dx.doi.org/10.1145/3292500.3340404">10.1145/3292500.3340404</a>. </div> </td> </tr> <tr class="document-box" id="b6"> <td valign="top" class="td1"> [6] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Lerer A, Wu L, Shen J, Lacroix T, Wehrstedt L, Bose A, Peysakhovich A. PyTorch-BigGraph: A large-scale graph embedding system. arXiv: 1903.12287, 2019. <a href="https://arxiv.org/abs/1903.12287">https://arxiv.org/abs/1903.12287</a>, Dec. 2022. </div> </td> </tr> <tr class="document-box" id="b7"> <td valign="top" class="td1"> [7] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Yan M Y, Chen Z D, Deng L, Ye X C, Zhang Z M, Fan D R, Xie Y. Characterizing and understanding GCNs on GPU. <i>IEEE Computer Architecture Letters</i>, 2020, 19(1): 22–25. DOI: <a href="http://dx.doi.org/10.1109/LCA.2020.2970395">10.1109/LCA.2020.2970395</a>. </div> </td> </tr> <tr class="document-box" id="b8"> <td valign="top" class="td1"> [8] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Zhang Z H, Leng J W, Ma L X, Miao Y S, Li C, Guo M Y. Architectural implications of graph neural networks. <i>IEEE Computer Architecture Letters</i>, 2020, 19(1): 59–62. 
DOI: <a href="http://dx.doi.org/10.1109/LCA.2020.2988991">10.1109/LCA.2020.2988991</a>. </div> </td> </tr> <tr class="document-box" id="b9"> <td valign="top" class="td1"> [9] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Geng T, Li A, Shi R B, Wu C S, Wang T Q, Li Y F, Haghi P, Tumeo A, Che S, Reinhardt S, Herbordt M C. AWB-GCN: A graph convolutional network accelerator with runtime workload rebalancing. In <i>Proc</i>. <i>the 53rd Annual IEEE/ACM International Symposium on Microarchitecture</i> (<i>MICRO</i>), Oct. 2020, pp.922–936. DOI: <a href="http://dx.doi.org/10.1109/MICRO50266.2020.00079">10.1109/MICRO50266.2020.00079</a>. </div> </td> </tr> <tr class="document-box" id="b10"> <td valign="top" class="td1"> [10] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Ma Y F, Cao Y, Vrudhula S, Seo J S. Optimizing loop operation and dataflow in FPGA acceleration of deep convolutional neural networks. In <i>Proc</i>. <i>the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays</i>, Feb. 2017, pp.45–54. DOI: <a href="http://dx.doi.org/10.1145/3020078.3021736">10.1145/3020078.3021736</a>. </div> </td> </tr> <tr class="document-box" id="b11"> <td valign="top" class="td1"> [11] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Yan M Y, Deng L, Hu X, Liang L, Feng, Y J, Ye X C, Zhang Z M, Fan D R, Xie Y. HyGCN: A GCN accelerator with hybrid architecture. In <i>Proc</i>. <i>the</i> <i>2020 IEEE International Symposium on High Performance Computer Architecture</i> (<i>HPCA</i>), Feb. 2020, pp.15–29. DOI: <a href="http://dx.doi.org/10.1109/HPCA47549.2020.00012">10.1109/HPCA47549.2020.00012</a>. </div> </td> </tr> <tr class="document-box" id="b12"> <td valign="top" class="td1"> [12] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Li J J, Louri A, Karanth A, Bunescu R. GCNAX: A flexible and energy-efficient accelerator for graph convolutional neural networks. In <i>Proc</i>. <i>the</i> <i>2021 IEEE International Symposium on High-Performance Computer Architecture</i> (<i>HPCA</i>), Mar. 2021, pp.775–788. DOI: <a href="http://dx.doi.org/10.1109/HPCA51647.2021.00070">10.1109/HPCA51647.2021.00070</a>. </div> </td> </tr> <tr class="document-box" id="b13"> <td valign="top" class="td1"> [13] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Galal S, Horowitz M. Energy-efficient floating-point unit design. <i>IEEE Transactions on Computers</i>, 2011, 60(7): 913-922. </div> </td> </tr> <tr class="document-box" id="b14"> <td valign="top" class="td1"> [14] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Kipf T N, Welling M. Semi-supervised classification with graph convolutional networks. arXiv: 1609.02907, 2016. <a href="https://arxiv.org/abs/1609.02907">https://arxiv.org/abs/1609.02907</a>, Dec. 2022. </div> </td> </tr> <tr class="document-box" id="b15"> <td valign="top" class="td1"> [15] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Hamilton W L, Ying R, Leskovec J. Inductive representation learning on large graphs. In <i>Proc</i>. <i>the 31st International Conference on Neural Information Processing Systems</i>, Dec. 2017, pp.1024–1034. </div> </td> </tr> <tr class="document-box" id="b16"> <td valign="top" class="td1"> [16] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Xu K, Hu W H, Leskovec J, Jegelka S. 
How powerful are graph neural networks? arXiv: 1810.00826, 2018. <a href="https://arxiv.org/abs/1810.00826">https://arxiv.org/abs/1810.00826</a>, Dec. 2022. </div> </td> </tr> <tr class="document-box" id="b17"> <td valign="top" class="td1"> [17] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Allen J R, Kennedy K. Automatic loop interchange. In <i>Proc</i>. <i>the 1984 SIGPLAN Symposium on Compiler Construction</i>, Jun. 1984, pp.233–246. DOI: <a href="http://dx.doi.org/10.1145/502874.502897">10.1145/502874.502897</a>. </div> </td> </tr> <tr class="document-box" id="b18"> <td valign="top" class="td1"> [18] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Zhang C, Li P, Sun G Y et al. Optimizing FPGA-based accelerator design for deep convolutional neural networks. In <i>Proc</i>. <i>the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays</i>, Feb. 2015, pp.161–170. DOI: <a href="http://dx.doi.org/10.1145/2684746.2689060">10.1145/2684746.2689060</a>. </div> </td> </tr> <tr class="document-box" id="b19"> <td valign="top" class="td1"> [19] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Pugh W. Uniform techniques for loop optimization. In <i>Proc</i>. <i>the 5th International Conference on Supercomputing</i>, Jun. 1991, pp.341–352. DOI: <a href="http://dx.doi.org/10.1145/109025.109108">10.1145/109025.109108</a>. </div> </td> </tr> <tr class="document-box" id="b20"> <td valign="top" class="td1"> [20] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Pal S, Beaumont J, Park D H, Amarnath A, Feng S Y, Chakrabarti C, Kim H S, Blaauw D, Mudge T, Dreslinski R. OuterSPACE: An outer product based sparse matrix multiplication accelerator. In <i>Proc</i>. <i>the 2018 IEEE International Symposium on High Performance Computer Architecture</i> (<i>HPCA</i>), Feb. 2018, pp.724–736. DOI: <a href="http://dx.doi.org/10.1109/HPCA.2018.00067">10.1109/HPCA.2018.00067</a>. </div> </td> </tr> <tr class="document-box" id="b21"> <td valign="top" class="td1"> [21] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Nie J. Memory-driven data-flow optimization for neural processing accelerators [Ph.D. Thesis]. Princeton University, 2020. <a href="https://www.proquest.com/openview/41fe23f43fd65cafaa8c2e051aed4059/1?pq-origsite=gscholar&cbl=18750&diss=y">https://www.proquest.com/openview/41fe23f43fd65cafaa8c2e051aed4059/1?pq-origsite=gscholar&cbl=18750&diss=y</a>, Jan. 2023. </div> </td> </tr> <tr class="document-box" id="b22"> <td valign="top" class="td1"> [22] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Sen P, Namata G, Bilgic M et al. Collective classification in network data. <i>AI Magazine</i>, 2008, 29(3): 93-106. DOI: <a href="https://ojs.aaai.org/aimagazine/index.php/aimagazine/article/view/2157">10.1609/aimag.v29i3.2157</a>. </div> </td> </tr> <tr class="document-box" id="b23"> <td valign="top" class="td1"> [23] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Carlson A, Betteridge J, Kisiel B et al. Toward an architecture for never-ending language learning. In <i>Proc. the 34th AAAI Conference on Artificial Intelligence</i>, July 2010, pp.1306-1313. </div> </td> </tr> <tr class="document-box" id="b24"> <td valign="top" class="td1"> [24] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Auten A, Tomei M, Kumar R. 
Hardware acceleration of graph neural networks. In <i>Proc</i>. <i>the</i> <i>57th ACM/IEEE Design Automation Conference</i> (<i>DAC</i>), Jul. 2020. DOI: <a href="http://dx.doi.org/10.1109/DAC18072.2020.9218751">10.1109/DAC18072.2020.9218751</a>. </div> </td> </tr> <tr class="document-box" id="b25"> <td valign="top" class="td1"> [25] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Liang S W, Wang Y, Liu C et al. EnGN: A high-throughput and energy-efficient accelerator for large graph neural networks. <i>IEEE Trans</i>. <i>Computers</i>, 2021, 70(9): 1511–1525. DOI: <a href="http://dx.doi.org/10.1109/TC.2020.3014632">10.1109/TC.2020.3014632</a>. </div> </td> </tr> <tr class="document-box" id="b26"> <td valign="top" class="td1"> [26] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Kiningham K, Re C, Levis P. GRIP: A graph neural network accelerator architecture. arXiv: 2007.13828, 2020. <a href="https://arxiv.org/abs/2007.13828v1">https://arxiv.org/abs/2007.13828v1</a>, Dec. 2022. </div> </td> </tr> <tr class="document-box" id="b27"> <td valign="top" class="td1"> [27] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Zeng H Q, Prasanna V. GraphACT: Accelerating GCN training on CPU-FPGA heterogeneous platforms. In <i>Proc</i>. <i>the 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays</i>, Feb. 2020, pp.255–265. DOI: <a href="http://dx.doi.org/10.1145/3373087.3375312">10.1145/3373087.3375312</a>. </div> </td> </tr> <tr class="document-box" id="b28"> <td valign="top" class="td1"> [28] </td> <td class="td2"> <div class="reference-en" style="margin:0px;padding:0px;"> Shi F, Jin A Y, Zhu S C. VersaGNN: A versatile accelerator for graph neural networks. arXiv: 2105.01280, 2021. <a href="https://arxiv.org/abs/2105.01280">https://arxiv.org/abs/2105.01280</a>, Dec. 2022. </div> </td> </tr> </tbody> </table>