Joint-Communication Optimal Matrix Multiplication with Asymmetric Memories

Lin Zhu; Qiang-Sheng Hua; Hai Jin

doi:10.1007/s11390-023-3489-y

Zhu L, Hua QS, Jin H. Joint-communication optimal matrix multiplication with asymmetric memories. JOURNAL OFCOMPUTER SCIENCE AND TECHNOLOGY, 40(3): 835−854, May 2025. DOI: 10.1007/s11390-023-3489-y

Citation:

Zhu L, Hua QS, Jin H. Joint-communication optimal matrix multiplication with asymmetric memories. JOURNAL OFCOMPUTER SCIENCE AND TECHNOLOGY, 40(3): 835−854, May 2025. DOI: 10.1007/s11390-023-3489-y

Citation:

Zhu L, Hua QS, Jin H. Joint-communication optimal matrix multiplication with asymmetric memories. JOURNAL OFCOMPUTER SCIENCE AND TECHNOLOGY, 40(3): 835−854, May 2025. DOI: 10.1007/s11390-023-3489-y

Joint-Communication Optimal Matrix Multiplication with Asymmetric Memories

Abstract

Abstract

Emerging hardware like non-volatile memory (NVM) and high-speed network interface cards are promising to improve the performance of matrix multiplication. However, a critical challenge in achieving high performance is the tradeoff between horizontal communication (data movement between processors) and vertical communication (data movement across memory hierarchies). In this paper, we provide an analysis in the distributed memory parallel model with additional consideration for communication between main memory and cache. We measure joint communication as the sum of the horizontal bandwidth and vertical bandwidth cost, and study the joint-communication cost of square matrix multiplication in the read-write symmetric setting (such as DRAM) and asymmetric setting (such as NVM). Specifically, we identify that in the symmetric setting, a joint-communication optimal algorithm can be directly obtained by combining the horizontally optimal and vertically optimal algorithms. We also identify that in the asymmetric setting, horizontal and vertical communications cannot be optimal at the same time, which means that there is a tradeoff between the two communications. In this case, we first present a joint-communication lower bound, and then we propose Joint-Communication Optimal Matrix Multiplication Algorithm (JOMMA), a parallel matrix multiplication algorithm whose joint-communication complexity meets the lower bound. The key idea behind JOMMA is to derive optimal matrix dimensions that each processor locally performs, which leads to determining the processor grid and an optimal schedule.

FullText(HTML)

References (34)

Relative Articles

Supplements (3)

Cited By

Joint-Communication Optimal Matrix Multiplication with Asymmetric Memories

Abstract

Catalog

Export File

Citation

Format

Content