Accelerating Data Transfer in Dataflow Architectures Through a Look-Ahead Acknowledgment Mechanism
Abstract: 1. Context:
Compared with the control-flow execution model, dataflow processors can better exploit execution parallelism for high-performance scientific computing applications. Such applications are characterized by high parallelism and heavy computation, and can be represented as dataflow graphs. Each processing element (PE) of a dataflow processor executes a portion of the overall dataflow graph. Within each PE, an instruction can be issued to the execution unit only when all of its source operands are ready and the instruction buffers of all downstream PEs can accept the destination operands. To improve execution efficiency, dataflow processors usually process different data blocks continuously and in parallel; the different data blocks are referred to as different operand spaces, and each complete pass over one operand space is called one context iteration. Previously proposed mechanisms include the tagged-token matching table, credit-based flow control, and handshake mechanisms. The matching-table approach not only increases design complexity, area cost, and datapath latency, but also increases the number of redundant message packets in the dataflow network. Although credit-based flow control and handshake mechanisms incur smaller area and power overheads, they are not conducive to improving data-transfer efficiency or the utilization of computational resources. Hence, an efficient instruction-triggering mechanism is crucial to the performance of a dataflow system.
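The firing condition described above can be sketched as follows. This is a minimal illustrative model with names of our own choosing (`Instruction`, `operands_ready`, `dest_buffers`), not the paper's hardware design:

```python
from dataclasses import dataclass


@dataclass
class Instruction:
    # Hypothetical model of one dataflow instruction inside a PE.
    operands_ready: list   # one flag per source operand
    dest_buffers: list     # free-slot counts of downstream instruction buffers

    def can_fire(self) -> bool:
        # Classic dataflow firing rule: every source operand must be
        # present, AND every downstream instruction buffer must have a
        # free slot to accept the destination operand.
        return all(self.operands_ready) and all(s > 0 for s in self.dest_buffers)


# Example: one operand is still missing, so the instruction cannot fire.
insn = Instruction(operands_ready=[True, False], dest_buffers=[2, 1])
print(insn.can_fire())
```

Under a handshake scheme, the second half of this condition is only satisfied after an acknowledgment returns from the downstream PE, which is the latency the proposed mechanism targets.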
2. Objective:
Design an instruction-triggering mechanism in which instructions are re-triggered by the data of each iteration and wait to be issued again, and which can effectively distinguish data belonging to different operand spaces, guaranteeing that once data from different operand spaces flows into the dataflow processing array, no matching misalignment occurs across iteration spaces.
3. Method:
We propose an instruction-triggering mechanism that sends acknowledgment messages ahead of time. It adopts a constrained form of aggressive execution to send acknowledgments early without causing pipeline flushes and the resulting performance loss. Specifically, a prediction-based scheme selects the instructions that can be acknowledged ahead of time, the number of look-ahead cycles is determined by the hop distance between nodes, and a pre-issue queue in hardware keeps the look-ahead within a safe range.
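As a rough illustration of the idea, the look-ahead decision might be modeled as below. All names, the queue depth, and the one-cycle-per-hop assumption are our own simplifications for exposition, not the paper's hardware parameters:

```python
from collections import deque


def lookahead_cycles(hop_distance: int) -> int:
    # Assumption: an acknowledgment needs one cycle per network hop to
    # travel back, so it can safely be sent that many cycles early.
    return hop_distance


class PreIssueQueue:
    """Bounds how many speculatively acknowledged instructions may be in
    flight at once, keeping the look-ahead within a safe range."""

    def __init__(self, depth: int = 4):  # depth is an assumed parameter
        self.depth = depth
        self.in_flight = deque()

    def try_ack_ahead(self, insn_id: int, hop_distance: int) -> bool:
        # Acknowledge ahead only if the queue has room; otherwise the
        # instruction falls back to the normal handshake.
        if len(self.in_flight) < self.depth:
            self.in_flight.append((insn_id, lookahead_cycles(hop_distance)))
            return True
        return False

    def retire(self) -> None:
        # A speculatively acknowledged instruction completed, freeing a slot.
        if self.in_flight:
            self.in_flight.popleft()
```

The bounded queue is what makes the aggressive acknowledgment "constrained": speculation never outruns the hardware's ability to absorb a mispredicted early acknowledgment.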
4. Result & Findings:
Experimental results show that the speculative-execution mechanism effectively accelerates data transfer and improves the utilization of functional units, so the performance of the speculative model is on average 23.3% higher than that of the conventional model. The average execution time of the speculative model is 17.4% lower than that of the conventional model. In addition, the average power efficiency of the speculative model is 22.4% higher than that of the conventional model.
5. Conclusions:
The proposed method is a simple yet effective mechanism for reducing the transfer latency of data messages while improving the execution efficiency of dataflow processors. With this mechanism, the average power efficiency of the dataflow processor improves by 22.4% and the average utilization of functional units improves by 23.9%, while the area and power overhead increases by only about 0.9%. The evaluation results show that the proposed look-ahead instruction-triggering mechanism is well suited to dataflow execution scenarios and outperforms the instruction-triggering mechanisms of conventional dataflow processors.

Abstract: The dataflow architecture, which is characterized by a lack of a redundant unified control logic, has been shown to have an advantage over the control-flow architecture as it improves the computational performance and power efficiency, especially of applications used in high-performance computing (HPC). Importantly, the high computational efficiency of systems using the dataflow architecture is achieved by allowing program kernels to be activated in a simultaneous manner. Therefore, a proper acknowledgment mechanism is required to distinguish the data that logically belongs to different contexts. Possible solutions include the tagged-token matching mechanism, in which the data is sent before acknowledgments are received but retried after rejection, or a handshake mechanism, in which the data is only sent after acknowledgments are received. However, these mechanisms are characterized by both inefficient data transfer and increased area cost. Good performance of the dataflow architecture depends on the efficiency of data transfer. In order to optimize the efficiency of data transfer in existing dataflow architectures with a minimal increase in area and power cost, we propose a Look-Ahead Acknowledgment (LAA) mechanism. LAA accelerates the execution flow by speculatively acknowledging ahead without penalties. Our simulation analysis based on a handshake mechanism shows that our LAA increases the average utilization of computational units by 23.9%, with a reduction in the average execution time by 17.4% and an increase in the average power efficiency of dataflow processors by 22.4%. Crucially, our novel approach results in a relatively small increase in the area and power consumption of the on-chip logic of less than 0.9%.
In conclusion, the evaluation results suggest that Look-Ahead Acknowledgment is an effective improvement for data transfer in existing dataflow architectures.