HEC: Heterogeneity-Enriched Communication for AI Symphony
Abstract
Modern parallel and distributed computing systems are becoming increasingly complex as applications in high-performance computing (HPC) and artificial intelligence (AI) demand ever-greater computation and communication efficiency. To meet these demands, recent architectures integrate heterogeneous computing devices, such as CPUs, GPUs, and DPUs (or SmartNICs), within a single compute node, forming what we refer to as multi-rail heterogeneity. This trend offers substantial potential for scalability and performance, but it also amplifies the challenges of data movement, synchronization, and coordination across heterogeneous components. We propose Heterogeneity-Enriched Communication (HEC), a new paradigm that embraces multi-rail heterogeneity by accurately analyzing communication primitives, adaptively composing multi-rail strategies, and scalably optimizing end-to-end pipelines. Through three representative case studies, HCCL (collective communication), TrimEC (multi-rail erasure coding), and DPU-KV (edge data services), we demonstrate that HEC improves efficiency, scalability, and resilience in parallel and distributed systems for AI workloads. We envision HEC as a foundation for the next generation of AI infrastructure, harmonizing heterogeneous computing instruments into a symphony of scalable and efficient systems tailored for the emerging AI era.