VTensor: Using Virtual Tensors to Build a Layout-Oblivious AI Programming Framework

Yu Feng; Zhao Jiacheng; Cui Huimin; Feng Xiaobing; Xue Jingling

doi:10.1007/s11390-022-1457-6

Abstract

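Summary: In developing artificial intelligence (AI) networks and algorithms, the tensor is one of the most popular programming interfaces. Layout, one of a tensor's most important attributes, is the order in which tensor data is placed in physical memory. Because layout affects performance through factors such as data locality, different architectures favor different layout designs, and high-performance libraries therefore impose conventions on layout. Applications, by contrast, may use any layout, that is, they may designate any dimension as contiguous in memory. Since existing AI systems provide no programming abstraction for decoupling libraries from applications, developers must write a large amount of layout-related code, which makes the code of AI systems hard to maintain. Moreover, because an application's input layout is known only at runtime, layout conversion must either be performed inside the operators or be sidestepped by conservatively fixing each operator's input layout, and either choice forfeits opportunities for layout optimization. We observe that the closer developers work to the application layer, the more they rely on the mathematical semantics of layout, and the closer they work to the library layer, the more they rely on its physical semantics. Based on this observation, we borrow the idea of polymorphism from object-oriented programming, decompose a tensor into a virtual tensor and a physical tensor, and let developers access tensors through an API. This decouples developers from libraries, which reduces their programming burden and improves code maintainability. In addition, because complete layout information is available when virtual tensors are resolved into physical tensors at runtime, we can discover more layout optimization opportunities, including extracting common layout conversion operations and globally selecting layouts for element-wise operators with broadcast semantics. Experimental results show that, compared with TensorFlow, VTensor reduces the lines of code needed to write an operator by 47.8% on average and improves whole-network performance by 18%. VTensor therefore has great potential for developing new operators, such as emerging operators or operators on emerging architectures, where it can lower development cost and improve whole-network performance. Because VTensor is currently offered only as an API and supports only dense tensors, in future work we will consider how to abstract the layout of sparse tensors in the form of an intermediate representation.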

Abstract: Tensors are a popular programming interface for developing artificial intelligence (AI) algorithms. Layout refers to the order in which tensor data is placed in memory, and it affects performance through data locality; deep neural network libraries therefore impose conventions on the layout. Since AI applications can use arbitrary layouts, and existing AI systems do not provide programming abstractions to shield the layout conventions of libraries, operator developers need to write a lot of layout-related code, which reduces the efficiency of integrating new libraries or developing new operators. Furthermore, to cope with the uncertainty of the input layout, developers place the layout conversion inside the operator itself, thus losing opportunities for layout optimization. Based on the idea of polymorphism, we propose a layout-agnostic virtual tensor programming interface, namely the VTensor framework, which enables developers to write new operators without caring about the underlying physical layout of tensors. In addition, the VTensor framework performs global layout inference at runtime to transparently resolve the required layout of virtual tensors, and applies runtime layout-oriented optimizations to globally minimize the number of layout transformation operations. Experimental results demonstrate that with VTensor, developers can avoid writing layout-dependent code. Compared with TensorFlow, for the 16 operations used in 12 popular networks, VTensor reduces the lines of code (LOC) needed to write a new operation by 47.82% on average, and improves the overall performance by 18.65% on average.
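The split between a virtual tensor and a physical tensor can be illustrated with a small sketch. This is not the VTensor API, which the abstract does not specify. The names `PhysicalTensor`, `VirtualTensor`, `to_layout`, and `channel_sum` are hypothetical, and layouts are assumed to be strings such as "NCHW" whose last letter names the contiguous dimension. The point is only that an operator written against named logical dimensions needs no layout-specific code.

```python
from dataclasses import dataclass
from itertools import product
from typing import Dict, List


@dataclass
class PhysicalTensor:
    """Hypothetical physical tensor: a flat buffer plus its storage order."""
    data: List[float]
    layout: str             # e.g. "NCHW" or "NHWC"; last letter is contiguous
    shape: Dict[str, int]   # logical extent of each named dimension

    def offset(self, index: Dict[str, int]) -> int:
        # Row-major offset computed over the dimensions in physical order.
        off = 0
        for dim in self.layout:
            off = off * self.shape[dim] + index[dim]
        return off


class VirtualTensor:
    """Hypothetical layout-oblivious view: callers use logical dims only."""

    def __init__(self, physical: PhysicalTensor):
        self._physical = physical

    @property
    def shape(self) -> Dict[str, int]:
        return self._physical.shape

    def get(self, **index: int) -> float:
        # Resolve the logical index against whatever layout backs this view.
        return self._physical.data[self._physical.offset(index)]


def to_layout(src: PhysicalTensor, layout: str) -> PhysicalTensor:
    """Explicit layout conversion (the operation a runtime tries to minimize)."""
    dst = PhysicalTensor([0.0] * len(src.data), layout, dict(src.shape))
    dims = list(src.layout)
    for coords in product(*(range(src.shape[d]) for d in dims)):
        index = dict(zip(dims, coords))
        dst.data[dst.offset(index)] = src.data[src.offset(index)]
    return dst


def channel_sum(t: VirtualTensor) -> List[float]:
    """Example operator: per-channel sum, written without any layout code."""
    s = t.shape
    return [
        sum(
            t.get(N=n, C=c, H=h, W=w)
            for n in range(s["N"])
            for h in range(s["H"])
            for w in range(s["W"])
        )
        for c in range(s["C"])
    ]
```

Calling `channel_sum` on a view of an "NCHW" buffer and on a view of the same data converted with `to_layout(..., "NHWC")` yields the same result, because only `offset` depends on the physical order. A real framework would resolve the layout once at runtime instead of per element, so the sketch says nothing about performance.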
