Joint-Communication Optimal Matrix Multiplication with Asymmetric Memories
Emerging hardware such as non-volatile memory (NVM) and high-speed network interface cards (NICs) promises to improve the performance of matrix multiplication. However, a critical challenge in achieving high performance lies in accounting for the tradeoff between horizontal communication (data movement between processors) and vertical communication (data movement across the memory hierarchy).
In this paper, we analyze matrix multiplication in the distributed-memory parallel model, extended to account for communication between main memory and cache. We define the joint-communication cost as the sum of the horizontal and vertical bandwidth costs, and study the joint-communication cost of square matrix multiplication in both the read-write symmetric setting (such as DRAM) and the asymmetric setting (such as NVM). In the symmetric setting, we show that a joint-communication optimal algorithm can be obtained directly by combining the horizontally optimal and vertically optimal algorithms. In the asymmetric setting, by contrast, horizontal and vertical communication cannot be optimized simultaneously: there is an inherent tradeoff between the two. For this case, we first establish a joint-communication lower bound, and then propose JOMMA, a parallel matrix multiplication algorithm whose joint-communication cost matches the lower bound. The key idea behind JOMMA is to derive the optimal dimensions of the local matrix multiplication performed by each processor, which in turn determines the processor grid and an optimal schedule.