Compression-Aware Gradient Splitting for Collective Communications in Distributed Training
While distributed training is crucial for scaling deep learning models, it incurs significant overhead from the collective communication of gradients. To alleviate this burden, compression techniques are commonly used to improve network bandwidth utilization. However, compression poses challenges for synchronized AllReduce collective communications, even more so in scalable systems. Non-uniform data sizes resulting from compression can cause bandwidth under-utilization, as faster nodes sit idle waiting for slower nodes to complete data exchanges, increasing overall communication time and, consequently, training time. Fortunately, the inherent similarity of gradients across consecutive batches presents an opportunity to mitigate these inefficiencies. By leveraging gradient quantization and the consistent distribution of zeros, gradients can be partitioned logically to speed up communication: splitting them into groups with and without zeros allows a different compression approach for each. The bandwidth under-utilization caused by non-uniform data sizes can also be addressed by partitioning gradients into variable-sized chunks, leading to more balanced compressed data sizes and reduced idle waiting time. We propose two novel gradient-splitting strategies, collectively named OSCAR, designed to improve communication and training. OSCAR-SW is a novel AllReduce-compatible software-based technique that splits gradients into probable zeros and non-zeros to apply count-sketch compression. OSCAR-HW is a novel hardware/software co-designed gradient-splitting technique paired with ASC (Adaptive Stepwise Coding), an encoding technique for gradient compression in distributed training. OSCAR-HW dynamically splits fixed-point quantized gradients for AllReduce communications and maximizes bandwidth utilization for state-of-the-art hardware compression techniques.
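The core splitting idea can be illustrated with a minimal sketch. This is not the paper's implementation; it only shows the stated observation that zero positions stay similar across consecutive batches, so the previous batch's zero mask can partition the current gradients into "probable zero" and "non-zero" groups for separate compression (the function and variable names here are hypothetical):

```python
# Hypothetical sketch of gradient splitting by zero pattern: positions that
# were zero in the previous batch are grouped as probable zeros, the rest as
# non-zeros, so each group can use a different compression scheme.

def split_by_zero_pattern(grads, prev_zero_mask):
    """Partition gradient indices into probable-zero and non-zero groups.

    grads          -- flat list of quantized gradient values
    prev_zero_mask -- True where the previous batch's gradient was zero
    """
    probable_zeros, non_zeros = [], []
    for i, was_zero in enumerate(prev_zero_mask):
        (probable_zeros if was_zero else non_zeros).append(i)
    return probable_zeros, non_zeros

# Example: a gradient vector and the zero mask observed in the last batch.
grads = [0, 3, 0, -1, 0, 7]
prev_mask = [g == 0 for g in grads]   # assume the zero pattern repeats
pz, nz = split_by_zero_pattern(grads, prev_mask)
print(pz, nz)   # indices of probable zeros vs. non-zeros
```

In this sketch the non-zero group would go through a dense compressor such as a count sketch, while the probable-zero group can be represented far more cheaply.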
ASC is a variant of Adaptive Arithmetic Coding (AAC) that generates a distinct probability table for each timestep of AllReduce to adapt to its unique value ranges, avoiding the need to send the probability table during gradient communication. Our experimental results show that OSCAR-SW achieves a 1.18× speedup and 3% better accuracy over the state-of-the-art CountSketch algorithm. OSCAR-HW achieves an average AllReduce speedup of 3.77× and an average end-to-end training speedup of 1.38×. ASC achieves an average AllReduce speedup of 1.05× over Atalanta and 4.66× over no compression.
Tue 3 Feb (displayed time zone: Hobart)
15:50 - 17:10

15:50 (20m, Talk): Compression-Aware Gradient Splitting for Collective Communications in Distributed Training (Main Conference)
  Pranati Majhi (Texas A&M University), Sabuj Laskar (Texas A&M University), Abdullah Muzahid (Texas A&M University), Eun Jung Kim

16:10 (20m, Talk): SCALE: Tackling Communication Bottlenecks in Confidential Multi-GPU ML (Main Conference)
  Joongun Park (Georgia Tech), Yongqin Wang (University of Southern California), Huan Xu (Georgia Institute of Technology), Hanjiang Wu (Georgia Institute of Technology), Mengyuan Li (USC), Tushar Krishna (Georgia Institute of Technology)

16:30 (20m, Talk): AutoHAAP: Automated Heterogeneity-Aware Asymmetric Partitioning for LLM Training (Main Conference)
  Yuanyuan Wang, Nana Tang, Yuyang Wang, Shu Pan, Dingding Yu, Zeyue Wang, Mou Sun, Kejie Fu, Fangyu Wang, Yunchuan Chen, Ning Sun, Fei Yang (all Zhejiang Lab)

16:50 (20m, Talk): Towards Compute-Aware In-Switch Computing for LLMs Tensor-Parallelism on Multi-GPU Systems (Main Conference)
  Chen Zhang, Qijun Zhang, Zhuoshan Zhou, Yijia Diao (Shanghai Jiao Tong University); Haibo Wang, Zhe Zhou, Zhipeng Tu, Zhiyao Li (Huawei); Guangyu Sun (Peking University); Zhuoran Song, Zhigang Ji, Jingwen Leng, Minyi Guo (Shanghai Jiao Tong University)