Skyway: 面向图计算的多通路体系结构与细粒度数据组织方式

邹沫; 张明喆; 王茹嘉; 孙贤和; 叶笑春; 范东睿; 唐志敏

doi:10.1007/s11390-023-2939-x

Skyway: 面向图计算的多通路体系结构与细粒度数据组织方式

Skyway: Accelerate Graph Applications with a Dual-Path Architecture and Fine-Grained Data Management

摘要

摘要:
研究背景 图计算已经成为许多人工智能和大数据的关键组成部分，然而图计算因为其大量不规则访存请求和多种访存模式并存而访存效率极低，限制了图计算的性能，需要针对其访存特点设计针对性的更高效率的存储结构。
目的本文面向图计算，从片上缓存与片下存储两个角度分别提出针对性的优化方案，能够提高内存带宽利用率，并最终提高图计算的执行速度。
方法本文提出了Skyway，包括片上多路径缓存结构PBuf和片下数据感知型硬件结构DRow。PBuf能够准确区分图计算中的不同访存类型并对局部性较差的访存请求提供快速通路与细粒度的存储单元，提高资源利用率。DRow以缓存的思想感知特定数据并提供细粒度保护与快速返回通路，减少不同类型访存之间的相互干扰。
结果为了验证Skyway的有效性，本文在模拟器Zsim与DRAMsim3上完成代码实现，并在常用的35个图计算测试用例上进行实验，通过实验评估发现相比较于最先进的图计算硬件加速方案，Skyway提高了23%的性能。
结论本文深入分析了图计算的访存特征，针对不同类型的访存请求设计针对性的硬件存储结构，能够为特定请求提供快速通路并降低不同请求在存储结构内的相互干扰。本文提出的Skyway设计能够显著提高内存带宽利用率与图计算的执行性能，但需要进一步研究来动态调整通路的选择，以适应输入图的结构特点。

Abstract: Graph processing is a vital component of many AI and big data applications. However, due to its poor locality and complex data access patterns, graph processing is also a known performance killer of AI and big data applications. In this work, we propose to enhance graph processing applications by leveraging fine-grained memory access patterns with a dual-path architecture on top of existing software-based graph optimizations. We first identify that memory accesses to the offset, edge, and state array have distinct locality and impact on performance. We then introduce the Skyway architecture, which consists of two primary components: 1) a dedicated direct data path between the core and memory to transfer state array elements efficiently, and 2) a data-type aware fine-grained memory-side row buffer hardware for both the newly designed direct data path and the regular memory hierarchy data path. The proposed Skyway architecture is able to improve the overall performance by reducing the memory access interference and improving data access efficiency with a minimal overhead. We evaluate Skyway on a set of diverse algorithms using large real-world graphs. On a simulated four-core system, Skyway improves the performance by 23% on average over the best-performing graph-specialized hardware optimizations.

HTML全文

参考文献()

施引文献

资源附件()