HPCA 2026
Sat 31 January - Wed 4 February 2026 Sydney, Australia
co-located with HPCA/CGO/PPoPP/CC 2026
Wed 4 Feb 2026 10:30 - 10:50 at Cronulla - GPU Kernel Optimization and Resource Sharing Chair(s): Hyojin Sung

Sparse-Dense Matrix Multiplication (SpMM) on GPUs has gained significant attention because of its importance in modern applications and the increasing computing power of GPUs in the last decade. Previous SpMM studies have focused on the importance of storage format and load balance for the overall performance of SpMM on GPUs. However, very little attention has been paid to the efficacy of coalesced memory access in improving the efficiency of data loading, even though data loading incurs a notable overhead that averages more than 32% of the overall execution time, according to our experimental observations. Existing state-of-the-art (SOTA) solutions fail to adequately support coalesced memory access of both sparse and dense matrices between the global memory and threads on GPUs. In this paper, we propose an efficient algorithm called Swift that speeds up the loading of both sparse and dense matrices of SpMM on modern GPUs. Leveraging coalesced memory access, Swift achieves high loading efficiency by sorting both the columns of the sparse matrix and the elements of the dense matrix based on the number of non-zero elements, and it balances the load by handling the regular and irregular parts differently and judiciously. Swift takes the Compressed Sparse Column (CSC) format as an implementation case study to prove the concept and gain insights. We conduct a comprehensive comparison of Swift with four SOTA solutions: ASpT, cuSPARSE, RoDe, and Sputnik, using the full SuiteSparse Matrix Collection as the workload. The experimental results on RTX 4080s and RTX 3090Ti demonstrate that our method outperforms the baselines significantly.
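
For readers unfamiliar with the setting, the following is a minimal CPU-side sketch of SpMM over the Compressed Sparse Column format, together with a column reordering by non-zero count. It is not the authors' implementation: the function names `spmm_csc` and `sort_columns_by_nnz` are hypothetical, the descending sort order is an assumption, and neither GPU memory coalescing nor the regular/irregular load balancing mentioned in the abstract is modeled. It only illustrates that permuting the sparse matrix's columns and the dense matrix's rows in the same way leaves the product unchanged, which is what makes this kind of reordering legal.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def spmm_csc(n_rows: int, col_ptr: List[int], row_idx: List[int],
             vals: List[float], dense: Matrix) -> Matrix:
    """Compute C = A @ B, where A (n_rows x n_cols) is in CSC form
    and B (n_cols x k) is a dense row-major matrix."""
    k = len(dense[0])
    out = [[0.0] * k for _ in range(n_rows)]
    for j in range(len(col_ptr) - 1):
        # Non-zeros of column j live in positions col_ptr[j] .. col_ptr[j+1]-1.
        for p in range(col_ptr[j], col_ptr[j + 1]):
            i, v = row_idx[p], vals[p]
            for c in range(k):
                out[i][c] += v * dense[j][c]
    return out


def sort_columns_by_nnz(col_ptr: List[int], row_idx: List[int],
                        vals: List[float], dense: Matrix
                        ) -> Tuple[List[int], List[int], List[float], Matrix]:
    """Reorder A's columns by descending non-zero count (an assumed order)
    and permute B's rows identically so that A @ B is unchanged."""
    n_cols = len(col_ptr) - 1
    order = sorted(range(n_cols),
                   key=lambda j: col_ptr[j + 1] - col_ptr[j],
                   reverse=True)
    new_ptr, new_rows, new_vals = [0], [], []
    for j in order:
        new_rows.extend(row_idx[col_ptr[j]:col_ptr[j + 1]])
        new_vals.extend(vals[col_ptr[j]:col_ptr[j + 1]])
        new_ptr.append(len(new_rows))
    new_dense = [dense[j] for j in order]
    return new_ptr, new_rows, new_vals, new_dense


# A = [[1, 0, 2],
#      [0, 0, 3],
#      [0, 0, 4]]  stored column by column.
col_ptr = [0, 1, 1, 4]
row_idx = [0, 0, 1, 2]
vals = [1.0, 2.0, 3.0, 4.0]
B = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

C = spmm_csc(3, col_ptr, row_idx, vals, B)
C_sorted = spmm_csc(3, *sort_columns_by_nnz(col_ptr, row_idx, vals, B))
```

In this toy input the column with three non-zeros moves to the front, and `C_sorted` equals `C`. On a GPU, grouping columns of similar length is one plausible way to give neighbouring threads similar amounts of work and contiguous data to load, but the actual kernel design is described in the paper, not here.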

Wed 4 Feb

Displayed time zone: Hobart

09:50 - 11:10
GPU Kernel Optimization and Resource Sharing
Main Conference at Cronulla
Chair(s): Hyojin Sung Seoul National University
09:50
20m
Talk
μShare: Non-Intrusive Kernel Co-Locating on NVIDIA GPUs
Main Conference
Wenhao Huang Tianjin University, Zhaolin Duan Tianjin University, Laiping Zhao Tianjin University, Yuhao Zhang Tianjin University, Yanjie Wang Tianjin University, Yiming Li Tianjin University, Yihan Wang Tianjin University, Yichi Chen Tianjin University, Zhihang Tang Tianjin University, Kang Chen Tsinghua University, Deze Zeng China University of Geosciences, Wenxin Li Tianjin University, Keqiu Li Tianjin University
10:10
20m
Talk
FlashFuser: Expanding the Scale of Kernel Fusion for Compute-Intensive Operators via Inter-Core Connection
Main Conference
Huang Ziyu Shanghai Jiao Tong University, Yangjie Zhou National University of Singapore, Zihan Liu Shanghai Jiao Tong University, Xinhao Luo Shanghai Jiao Tong University, Yijia Diao Shanghai Jiao Tong University, Minyi Guo Shanghai Jiao Tong University, Jidong Zhai Tsinghua University, Yu Feng Shanghai Jiao Tong University, Chen Zhang Shanghai Jiao Tong University, Anbang Wu Shanghai Jiao Tong University, Jingwen Leng Shanghai Jiao Tong University
10:30
20m
Talk
Swift: High-Performance Sparse-Dense Matrix Multiplication on GPUs
Main Conference
Jinyu Hu Hunan University, Huizhang Luo Hunan University, Hong Jiang UT Arlington, Marc Casas Barcelona Supercomputing Center, Kenli Li National Supercomputing Center in Changsha, Hunan University, Chubo Liu Hunan University
10:50
20m
Talk
QuCo: Efficient and Flexible Hardware-Driven Automatic Configuration of Tile Transfers in GPUs
Main Conference
Nicolas Meseguer University of Murcia, Daoxuan Xu William & Mary, Yifan Sun William & Mary, Michael Pellauer Nvidia, José L. Abellán University of Murcia, Manuel E. Acacio Universidad de Murcia (UMU)