Compression-Aware Gradient Splitting for Collective Communications in Distributed Training
While distributed training is crucial for scaling deep learning models, it incurs significant overhead from the collective communication of gradients. To alleviate this burden, compression techniques are commonly used to improve network bandwidth utilization. However, compression poses challenges for synchronized AllReduce collective communications, even more so in scalable systems. Non-uniform data sizes resulting from compression can cause bandwidth under-utilization, as faster nodes sit idle waiting for slower nodes to complete data exchanges, increasing overall communication time and, consequently, training time. Fortunately, the inherent similarity of gradients across consecutive batches presents an opportunity to mitigate these inefficiencies. By leveraging gradient quantization and the consistent distribution of zeros, gradients can be partitioned logically to speed up communication: splitting them into groups with and without zeros allows a different compression approach for each. The bandwidth under-utilization caused by non-uniform data sizes can also be addressed by partitioning gradients into variable-sized chunks, leading to more balanced compressed data sizes and reduced idle waiting time. We propose two novel gradient-splitting strategies, collectively named OSCAR, designed to improve communication and training. OSCAR-SW is a novel AllReduce-compatible software-based technique that splits gradients into probable zeros and non-zeros to apply count-sketch compression. OSCAR-HW is a novel hardware/software co-designed gradient-splitting technique paired with ASC (Adaptive Stepwise Coding), an encoding technique for gradient compression in distributed training. OSCAR-HW dynamically splits fixed-point quantized gradients for AllReduce communications and maximizes bandwidth utilization for state-of-the-art hardware compression techniques.
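The core splitting idea can be illustrated with a minimal sketch. This is not the paper's implementation; it only shows the stated observation that zero positions stay similar across consecutive batches, so the previous batch's zero mask can partition the current gradients into "probable zero" and "non-zero" groups for separate compression (the function and variable names here are hypothetical):

```python
# Hypothetical sketch of gradient splitting by zero pattern: positions that
# were zero in the previous batch are grouped as probable zeros, the rest as
# non-zeros, so each group can use a different compression scheme.

def split_by_zero_pattern(grads, prev_zero_mask):
    """Partition gradient indices into probable-zero and non-zero groups.

    grads          -- flat list of quantized gradient values
    prev_zero_mask -- True where the previous batch's gradient was zero
    """
    probable_zeros, non_zeros = [], []
    for i, was_zero in enumerate(prev_zero_mask):
        (probable_zeros if was_zero else non_zeros).append(i)
    return probable_zeros, non_zeros

# Example: a gradient vector and the zero mask observed in the last batch.
grads = [0, 3, 0, -1, 0, 7]
prev_mask = [g == 0 for g in grads]   # assume the zero pattern repeats
pz, nz = split_by_zero_pattern(grads, prev_mask)
print(pz, nz)   # indices of probable zeros vs. non-zeros
```

In this sketch the non-zero group would go through a dense compressor such as a count sketch, while the probable-zero group can be represented far more cheaply.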
ASC is a variant of Adaptive Arithmetic Coding (AAC) that generates a distinct probability table for each timestep of AllReduce to adapt to its unique value ranges, avoiding the need to send the probability table during gradient communication. Our experimental results show that OSCAR-SW achieves a 1.18× speedup and 3% better accuracy over the state-of-the-art CountSketch algorithm. OSCAR-HW achieves an average AllReduce speedup of 3.77× and an average end-to-end training speedup of 1.38×. ASC achieves an average AllReduce speedup of 1.05× over Atalanta and 4.66× over no compression.
Tue 3 Feb (displayed time zone: Hobart)
15:50 - 17:10

15:50 (20m, Talk): Compression-Aware Gradient Splitting for Collective Communications in Distributed Training (Main Conference)
  Pranati Majhi (Texas A&M University), Sabuj Laskar (Texas A&M University), Abdullah Muzahid (Texas A&M University), Eun Jung Kim

16:10 (20m, Talk): SCALE: Tackling Communication Bottlenecks in Confidential Multi-GPU ML (Main Conference)
  Joongun Park (Georgia Tech), Yongqin Wang (University of Southern California), Huan Xu (Georgia Institute of Technology), Hanjiang Wu (Georgia Institute of Technology), Mengyuan Li (USC), Tushar Krishna (Georgia Institute of Technology)

16:30 (20m, Talk): AutoHAAP: Automated Heterogeneity-Aware Asymmetric Partitioning for LLM Training (Main Conference)
  Yuanyuan Wang, Nana Tang, Yuyang Wang, Shu Pan, Dingding Yu, Zeyue Wang, Mou Sun, Kejie Fu, Fangyu Wang, Yunchuan Chen, Ning Sun, Fei Yang (all Zhejiang Lab)

16:50 (20m, Talk): Towards Compute-Aware In-Switch Computing for LLMs Tensor-Parallelism on Multi-GPU Systems (Main Conference)
  Chen Zhang, Qijun Zhang, Zhuoshan Zhou, Yijia Diao (Shanghai Jiao Tong University); Haibo Wang, Zhe Zhou, Zhipeng Tu, Zhiyao Li (Huawei); Guangyu Sun (Peking University); Zhuoran Song, Zhigang Ji, Jingwen Leng, Minyi Guo (Shanghai Jiao Tong University)