FDGLib: A Communication Library for Efficient Large-Scale Graph Processing in FPGA-Accelerated Data Centers
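Summary: 1. Context: With the rapid growth of real-world graphs, graph sizes easily exceed the on-chip storage capacity of an accelerator, which makes it difficult to process large graphs on a single graph accelerator. Graph acceleration on FPGA clusters is therefore both necessary and important. Many cloud providers (e.g., Amazon, Microsoft, and Baidu) now deploy FPGA clusters in their data centers, which creates an opportunity to accelerate large-scale graph processing.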
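2. Objective: Extending existing single-FPGA graph accelerators to a multi-FPGA graph processing system in a data center faces two main challenges. First, because each existing single-FPGA graph accelerator comes with its own tailored programming model, runtime system, and communication runtime, it is hard to produce a new distributed accelerator by reusing that infrastructure. Second, when a distributed graph accelerator running in a data center cannot flexibly support graph partitioning strategies and ignores the particularities of the torus interconnect, it incurs a large amount of unnecessary communication overhead.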
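3. Method: This paper presents FDGLib, a communication library that can easily scale out existing FPGA-based graph accelerators into a distributed graph acceleration system in a data center without much hardware engineering effort. FDGLib provides six APIs that can be used and integrated into a variety of graph accelerators with only a small amount of code modification. In addition, FDGLib takes the torus-based FPGA interconnect of data centers into account and provides graph partitioning and placement schemes designed for it, which further improves communication efficiency.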
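4. Results and Findings: We interface FDGLib with AccuGraph, a state-of-the-art graph accelerator, and run experiments on a cluster resembling a Microsoft Catapult data center. The results show that the distributed AccuGraph is 2.32x faster than ForeGraph, a state-of-the-art distributed FPGA-based graph system, and 4.77x faster than Gemini, a state-of-the-art distributed CPU-based graph system, and that it scales better than both.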
5. Conclusions: This paper presents FDGLib, a communication library that integrates with existing single-FPGA graph accelerators to produce distributed graph acceleration systems for large-scale graph analytics in data centers. Its key components are a communication controller, which exchanges vertex values in every iteration, and a communication optimizer, which minimizes communication overhead by taking the characteristics of the torus interconnect into account. As graphs and data centers continue to grow, how to support graph acceleration efficiently remains a subject for further study.

Abstract: With the rapid growth of real-world graphs, the size of which can easily exceed the on-chip (board) storage capacity of an accelerator, processing large-scale graphs on a single Field Programmable Gate Array (FPGA) becomes difficult. Multi-FPGA acceleration is therefore both necessary and important. Many cloud providers (e.g., Amazon, Microsoft, and Baidu) now expose FPGAs to users in their data centers, providing opportunities to accelerate large-scale graph processing. In this paper, we present a communication library, called FDGLib, which can easily scale out any existing single FPGA-based graph accelerator to a distributed version in a data center with minimal hardware engineering effort. FDGLib provides six APIs that can be easily used and integrated into any FPGA-based graph accelerator with only a few lines of code modification. Considering the torus-based FPGA interconnection in data centers, FDGLib also improves communication efficiency using simple yet effective torus-friendly graph partition and placement schemes. We interface FDGLib with AccuGraph, a state-of-the-art graph accelerator. Our results on a 32-node Microsoft Catapult-like data center show that the distributed AccuGraph can be 2.32x and 4.77x faster than ForeGraph, a state-of-the-art distributed FPGA-based graph accelerator, and Gemini, a distributed CPU-based graph system, respectively, with better scalability.
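
To give a rough sense of why placement matters on a torus interconnect, the following minimal Python sketch computes hop distances with wraparound links and compares the hop-weighted traffic of two placements. It is not FDGLib's actual partition or placement scheme, which the abstract does not detail. The 4 x 8 layout for 32 nodes, the traffic volumes, and the names `torus_hops` and `placement_cost` are all assumptions made for illustration.

```python
def torus_hops(a, b, rows, cols):
    """Minimal hop count between nodes a and b, given as (row, col), on a 2D torus."""
    dr = abs(a[0] - b[0])
    dc = abs(a[1] - b[1])
    # Wraparound links let a message take the shorter way around each ring.
    return min(dr, rows - dr) + min(dc, cols - dc)


def placement_cost(traffic, placement, rows, cols):
    """Hop-weighted traffic: sum of volume * hops over all partition pairs.

    traffic[(p, q)] is a hypothetical number of vertex updates sent from
    partition p to partition q; placement[p] is the torus coordinate of p.
    """
    return sum(
        vol * torus_hops(placement[p], placement[q], rows, cols)
        for (p, q), vol in traffic.items()
    )


# Assumed 4 x 8 torus (32 nodes) and a toy traffic pattern (illustrative only).
ROWS, COLS = 4, 8
traffic = {(0, 1): 100, (1, 0): 100, (0, 2): 5, (2, 0): 5}

far = {0: (0, 0), 1: (2, 4), 2: (0, 1)}   # heavy pair placed far apart
near = {0: (0, 0), 1: (0, 1), 2: (2, 4)}  # heavy pair placed on adjacent nodes

print(placement_cost(traffic, far, ROWS, COLS))   # 1210
print(placement_cost(traffic, near, ROWS, COLS))  # 260
```

In this toy case, moving the two heavily communicating partitions onto adjacent nodes cuts the hop-weighted traffic from 1210 to 260, which is the kind of saving a torus-aware placement aims for.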