μShare: Non-Intrusive Kernel Co-Locating on NVIDIA GPUs (HPCA 2026 - Main Conference)

Who

Wenhao Huang, Zhaolin Duan, Laiping Zhao, Yuhao Zhang, Yanjie Wang, Yiming Li, Yihan Wang, Yichi Chen, Zhihang Tang, Kang Chen, Deze Zeng, Wenxin Li, Keqiu Li

Track

HPCA 2026 Main Conference

Time Zone

The program is currently displayed in (GMT+11:00) Hobart.

The GMT offsets shown reflect the offsets at the moment of the conference.

When

Wed 4 Feb 2026, 09:50 - 10:10, at Cronulla. Session: GPU Kernel Optimization and Resource Sharing. Chair(s): Hyojin Sung

Abstract

The hardware scheduler on NVIDIA GPUs utilizes micro-architectural hardware resources inefficiently. It places blocks from the same kernel within the same GPU Streaming Multiprocessor (SM) core. This causes a stacking co-locating problem: identical blocks saturate only a subset of intra-SM hardware resources while leaving others underutilized.

The primary challenge in addressing this issue is that the NVIDIA hardware is closed-source, preventing us from directly modifying the hardware scheduler. To bridge the semantic gap between the resource demands of kernels and the scheduler, we introduce μShare, which enables intra-SM scattered co-locating of kernels through a non-intrusive half-plus blocksize shaping method. This method shapes the blocksize of a kernel to a half-plus blocksize (i.e., slightly more than half of the SM's thread capacity), scattering identical blocks of the same kernel across different SMs. μShare further adopts a time-shifted launching method to reduce intra-SM resource contention. Compared to state-of-the-art systems, μShare does not require intrusive modifications to hardware or kernel code, yet it improves inference throughput by 26.90%–54.09% and increases low-level hardware utilization by 38.53%–61.15%.
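The arithmetic behind the half-plus blocksize can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the warp-aligned rounding, and the SM thread capacity of 1536 are assumptions chosen for illustration, and the model considers only the per-SM thread limit (registers and shared memory are ignored). It shows that once a block holds more than half of an SM's threads, a second block of the same size no longer fits on that SM, so the remaining thread slots are left for blocks of a different shape.

```python
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


def half_plus_blocksize(sm_thread_capacity, warp_size=WARP_SIZE):
    """Smallest warp-aligned block size strictly greater than half the SM capacity.

    Hypothetical helper: the rounding rule is an assumption, not taken from the paper.
    """
    half = sm_thread_capacity // 2
    return (half // warp_size + 1) * warp_size


def blocks_per_sm(sm_thread_capacity, blocksize):
    """Blocks of a given size that fit on one SM, counting only the thread limit."""
    return sm_thread_capacity // blocksize


capacity = 1536                      # assumed SM thread capacity (illustrative)
shaped = half_plus_blocksize(capacity)

print("half-plus blocksize:", shaped)
print("same-kernel blocks per SM:", blocks_per_sm(capacity, shaped))
print("thread slots left on the SM:", capacity - shaped)
print("blocks per SM at blocksize 256:", blocks_per_sm(capacity, 256))
```

Under these assumptions, a 256-thread block can be stacked six times on one SM, whereas the shaped block occupies an SM alone with respect to the thread limit, leaving 736 thread slots for other work.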

Wenhao Huang

Tianjin University

Zhaolin Duan

Tianjin University

Laiping Zhao

Tianjin University

Yuhao Zhang

Tianjin University

Yanjie Wang

Tianjin University

Yiming Li

Tianjin University

Yihan Wang

Tianjin University

Yichi Chen

Tianjin University

Zhihang Tang

Tianjin University

Kang Chen

Tsinghua University

Deze Zeng

China University of Geosciences

Wenxin Li

Tianjin University

Keqiu Li

Tianjin University

China

Session Program

Wed 4 Feb
Displayed time zone: Hobart

09:50 - 11:10	GPU Kernel Optimization and Resource Sharing, Main Conference, at Cronulla. Chair(s): Hyojin Sung (Seoul National University)

09:50 20m Talk		μShare: Non-Intrusive Kernel Co-Locating on NVIDIA GPUs. Main Conference. Wenhao Huang (Tianjin University), Zhaolin Duan (Tianjin University), Laiping Zhao (Tianjin University), Yuhao Zhang (Tianjin University), Yanjie Wang (Tianjin University), Yiming Li (Tianjin University), Yihan Wang (Tianjin University), Yichi Chen (Tianjin University), Zhihang Tang (Tianjin University), Kang Chen (Tsinghua University), Deze Zeng (China University of Geosciences), Wenxin Li (Tianjin University), Keqiu Li (Tianjin University)
10:10 20m Talk		FlashFuser: Expanding the Scale of Kernel Fusion for Compute-Intensive operators via Inter-Core Connection. Main Conference. huang ziyu (Shanghai Jiao Tong University), Yangjie Zhou (National University of Singapore), Zihan Liu (Shanghai Jiao Tong University), Xinhao Luo (Shanghai Jiao Tong University), Yijia Diao (Shanghai Jiao Tong University), Minyi Guo (Shanghai Jiao Tong University), Jidong Zhai (Tsinghua University), Yu Feng (Shanghai Jiao Tong University), Chen Zhang (Shanghai Jiao Tong University), Anbang Wu (Shanghai Jiao Tong University), Jingwen Leng (Shanghai Jiao Tong University)
10:30 20m Talk		Swift: High-Performance Sparse-Dense Matrix Multiplication on GPUs. Main Conference. Jinyu Hu (Hunan University), Huizhang Luo (Hunan University), Hong Jiang (UT Arlington), Marc Casas (Barcelona Supercomputing Center), Kenli Li (National Supercomputing Center in Changsha, Hunan University), Chubo Liu (Hunan University)
10:50 20m Talk		QuCo: Efficient and Flexible Hardware-Driven Automatic Configuration of Tile Transfers in GPUs. Main Conference. Nicolas Meseguer (University of Murcia), daoxuan xu (William & Mary), Yifan Sun (William & Mary), Michael Pellauer (NVIDIA), José L. Abellán (University of Murcia), Manuel E. Acacio (Universidad de Murcia (UMU))