Tetris: A Heuristic Static Memory Management Framework for Uniform Memory Multicore Neural Network Accelerators

陈小兵; 齐豪; 彭少辉; 庄毅敏; 支天; 陈云霁

doi:10.1007/s11390-021-1213-3

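Abstract

1. Context
As neural network models grow larger and their connectivity becomes more complex, the limited memory capacity of computing platforms has become a key constraint on model deployment. Efficient memory management is an essential part of any deep learning computing system. Uniform memory multicore neural network accelerators are an important class of such systems, so a dedicated memory management system for them merits study.
2. Objective
For uniform memory multicore neural network accelerators, propose an instruction-level liveness analysis method and a memory sharing technique that reduce the memory capacity required to deploy neural network models.
3. Method
This paper proposes Tetris, a memory management framework for uniform memory multicore neural network accelerators. Tetris analyzes instructions to obtain the access order of neuron data within each accelerator core and the synchronization relationships among cores, and from these derives the liveness interval of each neuron tensor. It then recasts the memory allocation problem as a problem of tuning the allocation order, and uses a genetic algorithm to search the space of allocation orders and optimize the memory allocation strategy.
4. Results and Findings
To validate the efficiency of Tetris, the paper uses Cambricon-X neural network accelerators with different core counts as the hardware platform and runs experiments on typical neural network models. The evaluation shows that, compared with the memory management method used in TensorFlow, this work reduces memory usage by 35.3% to 47.0%.
5. Conclusions
This paper proposes Tetris, a memory management framework for uniform memory multicore neural network accelerators. Tetris effectively reduces the memory footprint of neural network deployment and avoids deployment failures caused by memory failures. The proposed instruction-level liveness analysis uncovers more memory sharing opportunities than network-topology-level methods. The genetic-algorithm-based space allocation method effectively reduces memory fragmentation, but further work is needed to shrink the solution space and reduce compilation time.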

Abstract: Uniform memory multicore neural network accelerators (UNNAs) furnish huge computing power to emerging neural network applications. Meanwhile, as neural network architectures grow deeper and wider, limited memory capacity has become a constraint on deploying models on UNNA platforms. Managing memory space efficiently and reducing workload footprints have therefore become urgent problems. In this paper, we propose Tetris, a heuristic static memory management framework for UNNA platforms. Tetris reconstructs execution flows and synchronization relationships among cores to analyze each tensor’s liveness interval. The memory management problem is then converted to a sequence permutation problem. Tetris uses a genetic algorithm to explore the permutation space, optimizing the memory management strategy and reducing memory footprints. We evaluate several typical neural networks, and the experimental results demonstrate that Tetris outperforms the state-of-the-art memory allocation methods, achieving an average memory reduction ratio of 91.9% and 87.9% for a quad-core and a 16-core Cambricon-X platform, respectively.
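The idea of turning memory allocation into a search over allocation orders can be illustrated with a small sketch. This is not the paper's implementation: the tensor record `(size, first_use, last_use)`, the first-fit placement rule, the order crossover, the swap mutation, and all population settings below are assumptions chosen for illustration. Given liveness intervals, `place` assigns offsets in a given order so that tensors live at the same time never share addresses, and `search` runs a simple genetic algorithm over orders to lower the peak footprint.

```python
import random
from typing import List, Tuple

# Hypothetical tensor record: (size, first_use, last_use). Not from the paper.
Tensor = Tuple[int, int, int]


def overlaps(a: Tensor, b: Tensor) -> bool:
    """Two tensors conflict when their (closed) liveness intervals intersect."""
    return a[1] <= b[2] and b[1] <= a[2]


def place(tensors: List[Tensor], order: List[int]) -> Tuple[List[int], int]:
    """First-fit placement in the given order; returns (offsets, peak footprint)."""
    offsets = [-1] * len(tensors)
    peak = 0
    for i in order:
        size = tensors[i][0]
        # Address ranges held by already placed tensors that are live together with i.
        busy = sorted(
            (offsets[j], offsets[j] + tensors[j][0])
            for j in range(len(tensors))
            if offsets[j] >= 0 and overlaps(tensors[i], tensors[j])
        )
        addr = 0
        for lo, hi in busy:
            if addr + size <= lo:
                break  # the gap before this range is large enough
            addr = max(addr, hi)
        offsets[i] = addr
        peak = max(peak, addr + size)
    return offsets, peak


def order_crossover(p1: List[int], p2: List[int], rng: random.Random) -> List[int]:
    """Keep a slice of p1 in place and fill the rest in p2's relative order."""
    a, b = sorted(rng.sample(range(len(p1)), 2))
    middle = p1[a:b + 1]
    rest = [g for g in p2 if g not in middle]
    return rest[:a] + middle + rest[a:]


def search(tensors: List[Tensor], generations: int = 50,
           pop_size: int = 20, seed: int = 0) -> Tuple[List[int], int]:
    """Toy genetic algorithm over allocation orders (assumed settings)."""
    rng = random.Random(seed)
    n = len(tensors)

    def fitness(order: List[int]) -> int:
        return place(tensors, order)[1]

    # Seed with the default order so the result is never worse than it.
    population = [list(range(n))]
    population += [rng.sample(range(n), n) for _ in range(pop_size - 1)]
    for _ in range(generations):
        population.sort(key=fitness)
        parents = population[: pop_size // 2]  # elitist truncation selection
        children = []
        while len(parents) + len(children) < pop_size:
            child = order_crossover(*rng.sample(parents, 2), rng)
            if rng.random() < 0.2:  # swap mutation
                i, j = rng.sample(range(n), 2)
                child[i], child[j] = child[j], child[i]
            children.append(child)
        population = parents + children
    best = min(population, key=fitness)
    return best, fitness(best)


# Toy input: at time 3, tensors 2, 3 and 4 are all live, so at least 9 units are needed.
tensors = [(4, 0, 1), (2, 1, 2), (4, 2, 3), (2, 0, 3), (3, 3, 4)]
print(place(tensors, list(range(len(tensors))))[1])  # default order: peak 11
print(search(tensors)[1])
```

On this toy input the default order needs 11 units, while the order 3, 2, 4, 0, 1 reaches the lower bound of 9, which shows why the choice of order matters. The search is seeded with the default order and keeps the best half of each generation, so its result cannot be worse than the default.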
