

A Pipelining Loop Optimization Method for Dataflow Architecture


Abstract: With the coming of the exascale supercomputing era, power efficiency has become the most important obstacle to building an exascale system. Dataflow architectures have a native advantage in achieving high power efficiency for scientific applications; however, state-of-the-art dataflow architectures fail to exploit the high parallelism available in loops. To address this issue, we propose a pipelining loop optimization method (PLO), which makes iterations of a loop flow through the processing element (PE) array of a dataflow accelerator. The method consists of two techniques: architecture-assisted hardware iteration and instruction-assisted software iteration. In the hardware iteration execution model, an on-chip loop controller generates loop indexes, reducing the complexity of the computing kernel and laying a good foundation for pipelined execution. In the software iteration execution model, additional loop instructions are introduced to resolve inter-iteration dependences. Together, these two techniques increase the average number of instructions ready to execute per cycle, keeping the floating-point units busy. Simulation results show that the proposed method outperforms the static and dynamic loop execution models in floating-point efficiency by 2.45x and 1.1x on average, respectively, while the hardware cost of the two techniques remains acceptable.
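To make the two execution models in the abstract concrete, the following minimal C sketch is purely illustrative (it is not the paper's ISA or hardware design, and the function names daxpy_like and prefix_sum are hypothetical). The first loop has independent iterations, so its index bookkeeping could be handed to an on-chip loop controller while successive iterations are pipelined through the PE array (hardware iteration); the second loop carries a dependence across iterations, the case the abstract describes as handled by dedicated loop instructions (software iteration).

#include <stdio.h>

#define N 8

/* Hardware-iteration candidate: iterations are independent, so a loop
 * controller could stream the index i into the PE array and let several
 * iterations flow through the pipeline at the same time. */
static void daxpy_like(double a, const double *x, double *y)
{
    for (int i = 0; i < N; i++)        /* index generation could be offloaded */
        y[i] = a * x[i] + y[i];        /* loop body mapped onto the dataflow graph */
}

/* Software-iteration candidate: each iteration reads the result of the
 * previous one, so pipelining needs explicit handling of the value
 * carried between iterations. */
static void prefix_sum(const double *x, double *s)
{
    s[0] = x[0];
    for (int i = 1; i < N; i++)
        s[i] = s[i - 1] + x[i];        /* loop-carried dependence on s[i-1] */
}

int main(void)
{
    double x[N] = {1, 2, 3, 4, 5, 6, 7, 8};
    double y[N] = {0};
    double s[N];

    daxpy_like(2.0, x, y);
    prefix_sum(x, s);
    printf("y[7] = %g, s[7] = %g\n", y[7], s[7]);
    return 0;
}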

     
