eGPU: Production-Scale Elastic Sharing over 10,000 GPUs
As the cost of GPUs continues to rise, GPU-sharing solutions have become increasingly important for improving efficiency and maximizing resource utilization. At the same time, large-scale operational deployments of such solutions remain relatively underexplored, especially in heterogeneous production environments where workload dynamics and orchestration complexity introduce new practical considerations. In this paper, we introduce eGPU, an elastic, efficient, and scalable GPU-sharing framework tailored for production-scale concurrent machine learning (ML) training and inference. eGPU enables fine-grained, runtime-adjustable sharing of GPUs across multiple jobs, while preserving high resource utilization and fault isolation. To address communication bottlenecks, eGPU supports native NVLink/NCCL-based communication between shared GPU instances, a capability that is limited or unavailable in many existing designs. Built with production deployment in mind, eGPU integrates with Kubernetes (K8s) to support large-scale orchestration. It has been deployed and has run stably in production clusters with over 10,000 GPUs for five years. Our evaluation shows that eGPU achieves elastic and precise control over instance sizes, improves job efficiency by 21% to 31% over state-of-the-art (SOTA) sharing solutions, reduces the number of GPUs required by up to 8×, and improves cluster GPU utilization by more than 3×.
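To make the K8s integration concrete, the sketch below shows how a fractional GPU request might look from a user's perspective. This is an illustrative assumption, not eGPU's actual API: the resource names (`egpu.example.com/compute-percent`, `egpu.example.com/memory-mib`) are hypothetical extended-resource identifiers in the style Kubernetes device plugins commonly use.

```yaml
# Hypothetical pod spec requesting a fractional GPU slice via a
# sharing framework's device plugin (resource names are illustrative,
# not eGPU's real interface).
apiVersion: v1
kind: Pod
metadata:
  name: inference-worker
spec:
  containers:
    - name: model-server
      image: model-server:latest
      resources:
        limits:
          # Request 30% of one GPU's compute and 8 GiB of its memory,
          # leaving the remainder available to co-located jobs.
          egpu.example.com/compute-percent: "30"
          egpu.example.com/memory-mib: "8192"
```

Under this style of interface, the scheduler packs multiple such slices onto each physical GPU, which is what enables the utilization and GPU-count savings the abstract reports.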