HPCA 2026
Sat 31 January - Wed 4 February 2026 Sydney, Australia
co-located with HPCA/CGO/PPoPP/CC 2026

This program is tentative and subject to change.

Tue 3 Feb 2026 16:10 - 16:30 at Coogee - Distributed and Multi-GPU Training Chair(s): J. Nelson Amaral

Machine Learning (ML) has become a cornerstone of numerous applications, creating the need for secure and efficient distributed ML frameworks. However, maintaining data privacy in these systems poses significant challenges, particularly in distributed environments where user data and model parameters must frequently be transmitted between GPUs. Confidential GPU computing technologies, such as NVIDIA’s Confidential GPU mode, offer hardware-based enterprise solutions designed to protect ML workloads in untrusted environments (e.g., public clouds). These technologies leverage heterogeneous systems that combine Confidential Virtual Machines (CVMs) with GPU-based Trusted Execution Environments (TEEs). Nevertheless, confidential computing introduces considerable performance overhead due to its complex heterogeneous architecture and the high-throughput data flows required across TEE security boundaries. For example, encrypted communication occurs both between CVMs and GPU TEEs and among multiple GPU TEEs, resulting in significant latency compared to native PCIe or high-speed interconnects such as NVLink. Our extensive evaluation shows that these overheads become particularly severe during collective communication operations, which suffer from encryption-induced delays that negatively impact end-to-end training performance.
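To make the scale of this effect concrete, the back-of-envelope sketch below models a ring all-reduce in which every hop pays an encryption and decryption cost on top of the wire transfer. This is not the paper's evaluation methodology; the link bandwidth, encryption throughput, and message size are illustrative placeholders rather than measured H100/H200 figures.

```python
# Back-of-envelope model of how per-hop encryption inflates a ring all-reduce.
# All figures (link bandwidth, AES-GCM throughput, message size) are
# illustrative placeholders, not measured H100/H200 numbers.

def ring_allreduce_time(msg_bytes, n_gpus, link_gbps, enc_gbps=None):
    """Estimate ring all-reduce latency in seconds; enc_gbps=None means no TEE encryption."""
    chunk = msg_bytes / n_gpus                     # bytes moved per hop
    steps = 2 * (n_gpus - 1)                       # reduce-scatter + all-gather hops
    per_step = chunk / (link_gbps * 1e9 / 8)       # wire transfer time per hop
    if enc_gbps is not None:                       # encrypt at sender + decrypt at receiver
        per_step += 2 * chunk / (enc_gbps * 1e9 / 8)
    return steps * per_step

msg = 1 << 30                                      # 1 GiB gradient bucket
plain  = ring_allreduce_time(msg, 8, link_gbps=400)
secure = ring_allreduce_time(msg, 8, link_gbps=400, enc_gbps=100)
print(f"plain: {plain*1e3:.1f} ms, secure: {secure*1e3:.1f} ms, "
      f"slowdown: {secure/plain:.1f}x")
```

Even with generous assumed encryption throughput, the per-hop crypto cost dominates the collective once message sizes reach typical gradient-bucket scale, which is the bottleneck the abstract describes.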

To address this, we propose a co-encryption approach that leverages underutilized GPU resources, optimizes encryption and authentication, and introduces a communication algorithm tailored to confidential settings. We evaluate our design using real ML workloads and execution traces collected from four HGX H100/H200 clusters. Because Confidential Computing (CC) mode was not available on current NVIDIA software stacks, we incorporate encryption-aware modeling based on hardware specifications, enabling realistic estimation of secure communication overheads. Our results demonstrate a 40–70% reduction in communication-related security costs.
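As one illustration of why overlapping encryption with communication pays off, the sketch below compares a serialized encrypt-then-send schedule against a two-stage pipeline over chunked data. It is a toy model, not the SCALE algorithm itself, and the per-chunk times and chunk count are assumed values.

```python
# Minimal sketch (not SCALE itself): chunking lets encryption of the next chunk
# overlap with transfer of the current one, so the serialized enc+send cost
# collapses toward the slower of the two stages per chunk.
# t_enc / t_send per chunk and the chunk count are assumed values.

def serialized_time(chunks, t_enc, t_send):
    # Encrypt a chunk, then send it, strictly one after the other.
    return chunks * (t_enc + t_send)

def pipelined_time(chunks, t_enc, t_send):
    # Two-stage pipeline: fill with one encrypt + one send, then the
    # slower stage dictates the steady-state per-chunk rate.
    return t_enc + t_send + (chunks - 1) * max(t_enc, t_send)

chunks, t_enc, t_send = 32, 0.8, 1.0               # ms per chunk, illustrative
print(serialized_time(chunks, t_enc, t_send))      # 57.6 ms
print(pipelined_time(chunks, t_enc, t_send))       # 32.8 ms
```

Under these assumed numbers the pipelined schedule hides most of the encryption cost behind the transfer, which is the kind of saving a co-encryption design targets.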


Tue 3 Feb

Displayed time zone: Hobart

15:50 - 17:10
Distributed and Multi-GPU Training (Main Conference) at Coogee
Chair(s): J. Nelson Amaral
15:50
20m
Talk
Compression-Aware Gradient Splitting for Collective Communications in Distributed Training
Main Conference
Pranati Majhi Texas A&M University, Sabuj Laskar Texas A&M University, Abdullah Muzahid Texas A&M University, Eun Jung Kim
16:10
20m
Talk
SCALE: Tackling Communication Bottlenecks in Confidential Multi-GPU ML
Main Conference
Joongun Park Georgia Institute of Technology, Yongqin Wang University of Southern California, Huan Xu Georgia Institute of Technology, Hanjiang Wu Georgia Institute of Technology, Mengyuan Li University of Southern California, Tushar Krishna Georgia Institute of Technology
16:30
20m
Talk
AutoHAAP: Automated Heterogeneity-Aware Asymmetric Partitioning for LLM Training
Main Conference
Yuanyuan Wang Zhejiang Lab, Nana Tang Zhejiang Lab, Yuyang Wang Zhejiang Lab, Shu Pan Zhejiang Lab, Dingding Yu Zhejiang Lab, Zeyue Wang Zhejiang Lab, Mou Sun Zhejiang Lab, Kejie Fu Zhejiang Lab, Fangyu Wang Zhejiang Lab, Yunchuan Chen Zhejiang Lab, Ning Sun Zhejiang Lab, Fei Yang Zhejiang Lab
16:50
20m
Talk
Towards Compute-Aware In-Switch Computing for LLMs Tensor-Parallelism on Multi-GPU Systems
Main Conference
Chen Zhang Shanghai Jiao Tong University, Qijun Zhang Shanghai Jiao Tong University, Zhuoshan Zhou Shanghai Jiao Tong University, Yijia Diao Shanghai Jiao Tong University, Haibo Wang Huawei, Zhe Zhou Huawei, Zhipeng Tu Huawei, Zhiyao Li Huawei, Guangyu Sun Peking University, Zhuoran Song Shanghai Jiao Tong University, Zhigang Ji Shanghai Jiao Tong University, Jingwen Leng Shanghai Jiao Tong University, Minyi Guo Shanghai Jiao Tong University