Journal of Computer Science and Technology ›› 2023, Vol. 38 ›› Issue (1): 115-127. DOI: 10.1007/s11390-023-2875-9

Special Issue: Computer Architecture and Systems

• Special Issue in Honor of Professor Kai Hwang’s 80th Birthday •

GShuttle: Optimizing Memory Access Efficiency for Graph Convolutional Neural Network Accelerators

Jia-Jun Li1 (李家军), Ke Wang2 (王可), Hao Zheng3 (郑皓), and Ahmed Louri4, Fellow, IEEE        

  1. School of Astronautics, Beihang University, Beijing 100191, China
    2. Department of Electrical and Computer Engineering, University of North Carolina at Charlotte, Charlotte, NC 28223, U.S.A.
    3. Department of Electrical and Computer Engineering, University of Central Florida, Orlando, FL 32816, U.S.A.
    4. Department of Electrical and Computer Engineering, George Washington University, Washington, DC 20052, U.S.A.
  • Received: 2022-09-29  Revised: 2022-10-29  Accepted: 2023-01-01  Online: 2023-02-28  Published: 2023-02-28
  • Contact: Jia-Jun Li  E-mail: jiajunli@buaa.edu.cn
  • About author: Jia-Jun Li received his B.E. degree from the Department of Automation, Tsinghua University, Beijing, in 2013. He received his Ph.D. degree from the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, in 2019. From 2019 to 2021, he was a postdoctoral researcher with the Department of Electrical and Computer Engineering, George Washington University, Washington. He is currently an associate professor with the School of Astronautics, Beihang University, Beijing. His current research interests include machine learning and heterogeneous computer architecture.
  • Supported by:
    This work was supported by the National Science Foundation of USA under Grant Nos. CCF-2131946, CCF-1953980, and CCF-1702980. Part of this work was conducted when Prof. Jia-Jun Li was a postdoctoral researcher at the HPCAT Laboratory, George Washington University.

Graph convolutional neural networks (GCNs) have emerged as an effective approach to extending deep learning to graph data analytics, but they are computationally challenging because real-world graphs are irregular and contain large numbers of nodes. A GCN layer involves chained sparse-dense matrix multiplications spanning six loops, which creates a large design space for GCN accelerators. Prior work on GCN acceleration either employs a limited set of loop optimization techniques or determines the design variables by random sampling; neither exploits data reuse effectively, and system efficiency suffers as a result. To overcome this limitation, this paper proposes GShuttle, a GCN acceleration scheme that maximizes memory access efficiency to achieve high performance and energy efficiency. GShuttle systematically explores loop optimization techniques for GCN acceleration and quantitatively analyzes the design objectives (e.g., the required DRAM and SRAM accesses) through analytical calculation over multiple design variables. GShuttle further employs two approaches, pruned search space sweeping and greedy search, to find the optimal design variables under given design constraints. We demonstrate the efficacy of GShuttle on five widely used graph datasets: our simulations show that GShuttle reduces DRAM accesses by a factor of 1.5 and energy consumption by a factor of 1.7 compared with state-of-the-art approaches.
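
To make the six-loop structure concrete, the following is a minimal sketch (ours, not the paper's implementation) of one GCN layer X' = ReLU(A_hat · X · W) written as two chained, tiled matrix products. The function name gcn_layer_chain, the dense NumPy operands, and the single shared tile parameter are illustrative assumptions; a real accelerator stores A_hat in sparse form and exposes a separate tile size and loop order for each loop.

```python
# Illustrative sketch only: one GCN layer as two chained, tiled matrix
# products, making the six-loop nest from the abstract explicit.
import numpy as np

def gcn_layer_chain(A_hat, X, W, tile=64):
    """Compute ReLU(A_hat @ (X @ W)) with explicit tiled loops.

    A_hat: (N, N) normalized adjacency (dense here for clarity)
    X:     (N, F) node features
    W:     (F, G) layer weights
    tile:  one tile size standing in for the per-loop design variables
    """
    N, F = X.shape
    G = W.shape[1]

    # Loops 1-3: B = X @ W (computed first, since G is usually much
    # smaller than F, making the subsequent aggregation cheaper).
    B = np.zeros((N, G))
    for i0 in range(0, N, tile):          # loop 1: row tiles of X
        for j0 in range(0, G, tile):      # loop 2: column tiles of W
            for k0 in range(0, F, tile):  # loop 3: reduction tiles
                B[i0:i0+tile, j0:j0+tile] += (
                    X[i0:i0+tile, k0:k0+tile] @ W[k0:k0+tile, j0:j0+tile])

    # Loops 4-6: Y = A_hat @ B (the sparse-dense product on real graphs).
    Y = np.zeros((N, G))
    for i0 in range(0, N, tile):          # loop 4: row tiles of A_hat
        for j0 in range(0, G, tile):      # loop 5: column tiles of B
            for k0 in range(0, N, tile):  # loop 6: reduction tiles
                Y[i0:i0+tile, j0:j0+tile] += (
                    A_hat[i0:i0+tile, k0:k0+tile] @ B[k0:k0+tile, j0:j0+tile])

    return np.maximum(Y, 0.0)  # ReLU
```

The abstract describes the design-variable search only at a high level; the hypothetical sketch below shows the general shape of a greedy search of the kind it names: repeatedly double one tile dimension, keep a toy on-chip footprint within SRAM capacity, and accept the move that most reduces a toy DRAM-traffic model. The names greedy_tile_search and footprint and both cost models are our assumptions, not GShuttle's actual formulas.

```python
def greedy_tile_search(dims, sram_capacity, dram_cost):
    """Hypothetical greedy search over tile sizes for a triple loop nest.

    dims:          {'i': I, 'j': J, 'k': K} loop trip counts
    sram_capacity: on-chip buffer budget, in words
    dram_cost:     callable mapping a tile assignment to modeled DRAM accesses
    """
    tiles = {d: 1 for d in dims}  # start from the smallest legal tiles

    def footprint(t):
        # Toy buffer model: one tile each of the two operands and the result.
        return t['i'] * t['k'] + t['k'] * t['j'] + t['i'] * t['j']

    improved = True
    while improved:
        improved = False
        best = None
        for d in dims:  # try doubling each tile dimension in turn
            trial = dict(tiles)
            trial[d] = min(trial[d] * 2, dims[d])
            if trial[d] == tiles[d] or footprint(trial) > sram_capacity:
                continue  # no growth possible, or tile set no longer fits
            cost = dram_cost(trial)
            if best is None or cost < best[0]:
                best = (cost, trial)
        if best is not None and best[0] < dram_cost(tiles):
            tiles, improved = best[1], True  # accept the best strict improvement
    return tiles

# Toy usage: a traffic model where larger i/j tiles mean fewer passes over data.
dims = {'i': 4096, 'j': 128, 'k': 512}
toy_dram = lambda t: (dims['i'] / t['i']) * (dims['j'] / t['j']) * dims['k']
print(greedy_tile_search(dims, 16384, toy_dram))
```

In GShuttle itself, the cost models are derived analytically from the loop structure rather than assumed, which is what lets both the pruned search space sweep and the greedy search evaluate candidate design points cheaply.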

Key words: graph convolutional neural network; memory access; neural network accelerator

Related Articles

[1] Adam Weingram, Yuke Li, Hao Qi, Darren Ng, Liuyao Dai, and Xiaoyi Lu. xCCL: A Survey of Industry-Led Collective Communication Libraries for Deep Learning [J]. Journal of Computer Science and Technology, 2023, 38(1): 166-195.
[2] Xiao-Bing Chen, Hao Qi, Shao-Hui Peng, Yi-Min Zhuang, Tian Zhi, and Yun-Ji Chen. Tetris: A Heuristic Static Memory Management Framework for Uniform Memory Multicore Neural Network Accelerators [J]. Journal of Computer Science and Technology, 2022, 37(6): 1255-1270.
[3] Yi-Xiao Gao, Chen Tian, Wei Chen, Duo-Xing Li, Jian Yan, Yuan-Yuan Gong, Bing-Quan Wang, Tao Wu, Lei Han, Fa-Zhi Qi, Shan Zeng, Wan-Chun Dou, and Gui-Hai Chen. Analyzing and Optimizing Packet Corruption in RDMA Network [J]. Journal of Computer Science and Technology, 2022, 37(4): 743-762.
[4] Jason Liu, Pedro Espina, and Xian-He Sun. A Study on Modeling and Optimization of Memory Systems [J]. Journal of Computer Science and Technology, 2021, 36(1): 71-89.
[5] Osamu Tatebe, Shukuko Moriwake, and Yoshihiro Oyama. Gfarm/BB—Gfarm File System for Node-Local Burst Buffer [J]. Journal of Computer Science and Technology, 2020, 35(1): 61-71.
[6] Yang Hong, Yang Zheng, Fan Yang, Bin-Yu Zang, Hai-Bing Guan, and Hai-Bo Chen. Scaling out NUMA-Aware Applications with RDMA-Based Distributed Shared Memory [J]. Journal of Computer Science and Technology, 2019, 34(1): 94-112.
[7] Yu-Hang Liu and Xian-He Sun. Reevaluating Data Stall Time with the Consideration of Data Access Concurrency [J]. Journal of Computer Science and Technology, 2015, 30(2): 227-245.
[8] Xiang Gao, Yun-Ji Chen, Huan-Dong Wang, Dan Tang, and Wei-Wu Hu. System Architecture of Godson-3 Multi-Core Processors [J]. Journal of Computer Science and Technology, 2010, 25(2): 181-191.
