TLP-LDPC: Three-Level Parallel FPGA Architecture for Fast Prototyping of LDPC Decoder Using High-Level Synthesis

Yi-Fan Zhang (张一凡); Lei Sun (孙磊); Qiang Cao (曹强)

doi:10.1007/s11390-022-1499-9

Abstract

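Summary: Low-density parity-check (LDPC) codes offer error-correction performance close to the Shannon limit and are widely used in communication and storage. Dedicated hardware designed in an RTL language on an FPGA can accelerate the computationally complex LDPC decoding process and improve overall performance, but RTL development is time-consuming even for hardware experts. High-level synthesis (HLS) allows FPGA prototypes to be designed quickly from C/C++ code, which shortens the hardware development cycle. However, using HLS to build a high-throughput design that fully exploits the characteristics of both the algorithm and the hardware platform still poses substantial flexibility and performance challenges.
Many existing FPGA implementations of LDPC decoders require every parameter to be assigned carefully, taking into account both the details of the LDPC algorithm and its potential parallelism in order to extract maximum performance from the hardware. Such architectures are monolithic and inflexible, and their throughput is often limited. Even when such a design is implemented with HLS, the tools lack constructs that automatically and efficiently map the LDPC algorithm onto a large-scale hardware design, so deliberate architectural design is still needed to obtain a high-performance LDPC decoder.
This paper proposes TLP-LDPC, a three-level parallel FPGA architecture for fast prototyping of high-performance LDPC decoders. The key idea of TLP-LDPC is to design an efficient basic unit that exploits the specific algorithm and hardware, and to achieve high throughput and system-level scalability at the upper levels. To this end, it proposes a basic decoding unit with internal parallelism, a multi-unit decoding core based on a coarse-grained pipeline, and a multi-core decoder that improves overall performance and scales the design.
The implementation reaches a measured maximum decoding throughput of 9.63 Gbps, exceeding existing HLS-based FPGA LDPC decoders and far exceeding CPU- and GPU-based LDPC decoders. Because each level of the proposed architecture can be optimized relatively independently, adopting a better algorithm or adjusting the upper-level parallel design can yield still higher decoding throughput and hardware efficiency.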

Abstract: Low-density parity-check (LDPC) codes, with their excellent error-correction capabilities, have been widely used in both data communication and storage to construct reliable cyber-physical systems that are resilient to real-world noise. Fast prototyping of a field-programmable gate array (FPGA)-based decoder is essential to achieve high decoding performance while accelerating the development process. This paper proposes a three-level parallel architecture, TLP-LDPC, that achieves high throughput by fully exploiting the characteristics of both LDPC and the underlying hardware while scaling effectively to large FPGA platforms. The architecture contains a low-level decoding unit, a mid-level multi-unit decoding core, and a high-level multi-core decoder. The low-level decoding unit is a basic LDPC computation component that combines the features of the LDPC algorithm with the specific structures of the FPGA (e.g., look-up tables, LUTs) and eliminates potential data conflicts. The mid-level decoding core integrates input/output and multiple decoding units in a well-balanced pipeline. The high-level multi-core architecture makes full use of board-level resources to improve overall throughput. We develop LDPC C++ code with dedicated pragmas and use high-level synthesis (HLS) tools to implement the TLP-LDPC architecture. Experimental results show that TLP-LDPC achieves 9.63 Gbps end-to-end decoding throughput on a Xilinx Alveo U50 platform, 3.9x higher than existing HLS-based FPGA implementations.
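
The three levels described in the abstract can be pictured with a minimal C++ sketch. It is an illustration under stated assumptions, not the authors' code: the abstract does not name the decoding algorithm, so a min-sum check-node update is assumed for the decoding unit; the sizes `kDegree`, `kUnits`, and `kCores` and the function names are hypothetical; and the `#pragma HLS UNROLL` lines only mark where an HLS tool might be asked to parallelize, since the paper's actual pragmas are not given here. A standard C++ compiler ignores these pragmas and runs everything sequentially.

```cpp
#include <array>
#include <cassert>
#include <cstdlib>
#include <limits>

// Hypothetical sizes, chosen only for illustration.
constexpr int kDegree = 4;  // messages entering one check node (assumed)
constexpr int kUnits  = 2;  // decoding units per core (assumed)
constexpr int kCores  = 2;  // decoding cores per decoder (assumed)

using Llrs  = std::array<int, kDegree>;   // integer log-likelihood ratios
using Frame = std::array<Llrs, kUnits>;   // work handled by one core

// Level 1: decoding unit. Assumed min-sum check-node update: each output is
// the sign product and the minimum magnitude of all the *other* inputs.
Llrs decoding_unit(const Llrs& in) {
    int min1 = std::numeric_limits<int>::max();  // smallest magnitude
    int min2 = std::numeric_limits<int>::max();  // second smallest magnitude
    int min_idx = -1;                            // position of min1
    int sign = 1;                                // sign product of all inputs
    for (int i = 0; i < kDegree; ++i) {
#pragma HLS UNROLL
        int mag = std::abs(in[i]);
        if (in[i] < 0) sign = -sign;
        if (mag < min1) { min2 = min1; min1 = mag; min_idx = i; }
        else if (mag < min2) { min2 = mag; }
    }
    assert(min_idx >= 0);
    Llrs out{};
    for (int i = 0; i < kDegree; ++i) {
#pragma HLS UNROLL
        int mag = (i == min_idx) ? min2 : min1;  // exclude own magnitude
        int s = (in[i] < 0) ? -sign : sign;      // exclude own sign
        out[i] = s * mag;
    }
    return out;
}

// Level 2: decoding core. Load, compute, and store stages that a hardware
// design could overlap as a coarse-grained pipeline; here they run in order.
Frame decoding_core(const Frame& input) {
    Frame buffer = input;                  // stage 1: load
    Frame result{};
    for (int u = 0; u < kUnits; ++u) {     // stage 2: one unit per entry
#pragma HLS UNROLL
        result[u] = decoding_unit(buffer[u]);
    }
    return result;                         // stage 3: store
}

// Level 3: multi-core decoder. Independent frames go to independent cores.
std::array<Frame, kCores> decoder(const std::array<Frame, kCores>& frames) {
    std::array<Frame, kCores> out{};
    for (int c = 0; c < kCores; ++c) {
#pragma HLS UNROLL
        out[c] = decoding_core(frames[c]);
    }
    return out;
}
```

For example, calling `decoding_unit` on the inputs `{3, -1, 4, -2}` returns `{1, -2, 1, -1}`. Because each level only calls the one below it, the unit's algorithm or the values of `kUnits` and `kCores` can be changed without touching the other levels, which mirrors the abstract's point that the levels can be optimized relatively independently.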
