Swift: High-Performance Sparse-Dense Matrix Multiplication on GPUs (HPCA 2026 - Main Conference)

Who

Jinyu Hu, Huizhang Luo, Hong Jiang, Marc Casas, Kenli Li, Chubo Liu

Track

HPCA 2026 Main Conference

Time Zone

The program is currently displayed in (GMT+11:00) Hobart.

The GMT offsets shown reflect the offsets at the moment of the conference.

When

Wed 4 Feb 2026 10:30 - 10:50 at Cronulla - GPU Kernel Optimization and Resource Sharing Chair(s): Hyojin Sung

Abstract

Sparse-Dense Matrix Multiplication (SpMM) on GPUs has gained significant attention because of its importance in modern applications and the increasing computing power of GPUs over the last decade. Previous SpMM studies have focused on the importance of storage format and load balance for the overall performance of SpMM on GPUs. However, very little attention has been paid to how coalesced memory access can improve the efficiency of data loading, which incurs a notable overhead that, according to our experimental observation, accounts on average for more than 32% of the overall performance. Existing state-of-the-art (SOTA) solutions fail to adequately support coalesced memory access to both sparse and dense matrices between global memory and threads on GPUs. In this paper, we propose Swift, an efficient algorithm that speeds up the loading of both the sparse and dense matrices of SpMM on modern GPUs. Leveraging coalesced memory access, Swift achieves high loading efficiency by sorting both the columns of the sparse matrix and the elements of the dense matrix based on the number of non-zero elements, and it balances the load by handling the regular and irregular parts differently and judiciously. Swift takes the Compressed Sparse Column (CSC) format as an implementation case study to prove the concept and gain insights. We conduct a comprehensive comparison of Swift with four SOTA solutions (ASpT, cuSPARSE, RoDe, and Sputnik), using the full SuiteSparse Matrix Collection as the workload. The experimental results on RTX 4080s and RTX 3090Ti demonstrate that our method outperforms the baselines significantly.
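
For readers unfamiliar with the Compressed Sparse Column layout mentioned in the abstract, the following is a minimal CPU-side sketch of SpMM over a CSC matrix, with columns visited in order of their non-zero count. It is not the authors' Swift implementation and does not model GPU memory coalescing or the regular/irregular split. The function names (`column_order_by_nnz`, `spmm_csc`), the descending sort direction, and the toy matrices are all assumptions made for illustration.

```python
from typing import List, Sequence


def column_order_by_nnz(col_ptr: Sequence[int]) -> List[int]:
    """Return column indices sorted by non-zero count, largest first.

    Assumption: descending order; the abstract only says columns are
    sorted "based on the number of non-zero elements".
    """
    n_cols = len(col_ptr) - 1
    return sorted(
        range(n_cols),
        key=lambda j: col_ptr[j + 1] - col_ptr[j],
        reverse=True,
    )


def spmm_csc(
    n_rows: int,
    col_ptr: Sequence[int],
    row_idx: Sequence[int],
    vals: Sequence[float],
    dense: Sequence[Sequence[float]],
) -> List[List[float]]:
    """Compute C = A @ B, where A is sparse (CSC) and B is dense.

    A has shape (n_rows, n_cols); B has shape (n_cols, k).
    Column j of A stores its entries in vals[col_ptr[j]:col_ptr[j + 1]],
    with matching row indices in row_idx.
    """
    k = len(dense[0]) if dense else 0
    out = [[0.0] * k for _ in range(n_rows)]
    for j in column_order_by_nnz(col_ptr):
        b_row = dense[j]  # column j of A pairs with row j of B
        for p in range(col_ptr[j], col_ptr[j + 1]):
            i, a = row_idx[p], vals[p]
            for c in range(k):
                out[i][c] += a * b_row[c]
    return out


# Toy example: A = [[1, 0, 2], [0, 3, 0], [4, 0, 5]] stored in CSC form.
col_ptr = [0, 2, 3, 5]
row_idx = [0, 2, 1, 0, 2]
vals = [1.0, 4.0, 3.0, 2.0, 5.0]
dense = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

result = spmm_csc(3, col_ptr, row_idx, vals, dense)
print(result)  # [[11.0, 14.0], [9.0, 12.0], [29.0, 38.0]]
```

The visiting order does not change the result; it only illustrates how columns with similar non-zero counts could be grouped so that neighbouring GPU threads would handle similarly sized work.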

Jinyu Hu

Hunan University

Huizhang Luo

Hunan University

Hong Jiang

UT Arlington

Marc Casas

Barcelona Supercomputing Center

Kenli Li

National Supercomputing Center in Changsha, Hunan University, China

Chubo Liu

Hunan University

Session Program

Wed 4 Feb
Displayed time zone: Hobart

09:50 - 11:10	GPU Kernel Optimization and Resource Sharing. Main Conference, at Cronulla. Chair(s): Hyojin Sung (Seoul National University)

09:50 20m Talk		μShare: Non-Intrusive Kernel Co-Locating on NVIDIA GPUs. Main Conference. Wenhao Huang (Tianjin University), Zhaolin Duan (Tianjin University), Laiping Zhao (Tianjin University), Yuhao Zhang (Tianjin University), Yanjie Wang (Tianjin University), Yiming Li (Tianjin University), Yihan Wang (Tianjin University), Yichi Chen (Tianjin University), Zhihang Tang (Tianjin University), Kang Chen (Tsinghua University), Deze Zeng (China University of Geosciences), Wenxin Li (Tianjin University), Keqiu Li (Tianjin University)
10:10 20m Talk		FlashFuser: Expanding the Scale of Kernel Fusion for Compute-Intensive Operators via Inter-Core Connection. Main Conference. Huang Ziyu (Shanghai Jiao Tong University), Yangjie Zhou (National University of Singapore), Zihan Liu (Shanghai Jiao Tong University), Xinhao Luo (Shanghai Jiao Tong University), Yijia Diao (Shanghai Jiao Tong University), Minyi Guo (Shanghai Jiao Tong University), Jidong Zhai (Tsinghua University), Yu Feng (Shanghai Jiao Tong University), Chen Zhang (Shanghai Jiao Tong University), Anbang Wu (Shanghai Jiao Tong University), Jingwen Leng (Shanghai Jiao Tong University)
10:30 20m Talk		Swift: High-Performance Sparse-Dense Matrix Multiplication on GPUs. Main Conference. Jinyu Hu (Hunan University), Huizhang Luo (Hunan University), Hong Jiang (UT Arlington), Marc Casas (Barcelona Supercomputing Center), Kenli Li (National Supercomputing Center in Changsha, Hunan University), Chubo Liu (Hunan University)
10:50 20m Talk		QuCo: Efficient and Flexible Hardware-Driven Automatic Configuration of Tile Transfers in GPUs. Main Conference. Nicolas Meseguer (University of Murcia), Daoxuan Xu (William & Mary), Yifan Sun (William & Mary), Michael Pellauer (NVIDIA), José L. Abellán (University of Murcia), Manuel E. Acacio (Universidad de Murcia, UMU)