Swift: High-Performance Sparse-Dense Matrix Multiplication on GPUs
Sparse-Dense Matrix Multiplication (SpMM) on GPUs has gained significant attention because of its importance in modern applications and the growth in GPU computing power over the last decade. Previous SpMM studies have focused on how storage format and load balance affect the overall performance of SpMM on GPUs. However, very little attention has been paid to how coalesced memory access can improve the efficiency of data loading, which incurs a notable overhead: on average more than 32% of the overall execution time, according to our experimental observation. Existing state-of-the-art (SOTA) solutions fail to adequately support coalesced memory access for both the sparse and the dense matrix between global memory and threads on GPUs. In this paper, we propose an efficient algorithm called Swift that speeds up the loading of both the sparse and the dense matrix of SpMM on modern GPUs. Leveraging coalesced memory access, Swift achieves high loading efficiency by sorting both the columns of the sparse matrix and the elements of the dense matrix based on the number of non-zero elements, and it balances the load by handling the regular and irregular parts differently. Swift takes the Compressed Sparse Column format as an implementation case study to prove the concept and gain insights. We conduct a comprehensive comparison of Swift with four SOTA solutions (ASpT, cuSPARSE, RoDe, and Sputnik), using the full SuiteSparse Matrix Collection as the workload. The experimental results on RTX 4080s and RTX 3090Ti demonstrate that our method outperforms the baselines significantly.
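To make the terms in the abstract concrete, here is a minimal CPU reference sketch of SpMM over a Compressed Sparse Column (CSC) matrix, with the columns visited in order of their non-zero count. It is not Swift's GPU kernel and makes no claim about how the paper implements sorting, coalescing, or load balancing. The function names `columns_by_nnz` and `spmm_csc`, and the choice of descending order, are assumptions made purely for illustration.

```python
def columns_by_nnz(col_ptr):
    """Return column indices ordered by non-zero count, largest first.

    Hypothetical helper: it only illustrates "sorting columns by the number
    of non-zero elements"; the ordering Swift actually uses is not given here.
    """
    n_cols = len(col_ptr) - 1
    return sorted(range(n_cols),
                  key=lambda j: col_ptr[j + 1] - col_ptr[j],
                  reverse=True)


def spmm_csc(n_rows, col_ptr, row_idx, values, dense):
    """Compute C = A @ B with A sparse in CSC form and B dense.

    A is n_rows x n_cols, described by col_ptr, row_idx and values.
    B is a list of n_cols rows, each a list of k numbers.
    """
    k = len(dense[0]) if dense else 0
    out = [[0.0] * k for _ in range(n_rows)]
    for j in columns_by_nnz(col_ptr):
        b_row = dense[j]  # column j of A pairs with row j of B
        for p in range(col_ptr[j], col_ptr[j + 1]):
            i, a = row_idx[p], values[p]
            for c in range(k):
                out[i][c] += a * b_row[c]
    return out


# A = [[1, 0, 2],
#      [0, 3, 0]] stored column by column
col_ptr = [0, 1, 2, 3]
row_idx = [0, 1, 0]
values = [1.0, 3.0, 2.0]
B = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
C = spmm_csc(2, col_ptr, row_idx, values, B)
```

Reordering the columns does not change the result, because each column contributes an independent partial sum to the output. On a GPU, the point of such an ordering would be to group columns of similar length so that neighbouring threads do similar amounts of work and read adjacent memory.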
Wed 4 Feb (displayed time zone: Hobart)
09:50 - 11:10 | GPU Kernel Optimization and Resource Sharing | Main Conference at Cronulla | Chair(s): Hyojin Sung (Seoul National University)
09:50 | 20m | Talk | μShare: Non-Intrusive Kernel Co-Locating on NVIDIA GPUs | Main Conference | Wenhao Huang (Tianjin University), Zhaolin Duan (Tianjin University), Laiping Zhao (Tianjin University), Yuhao Zhang (Tianjin University), Yanjie Wang (Tianjin University), Yiming Li (Tianjin University), Yihan Wang (Tianjin University), Yichi Chen (Tianjin University), Zhihang Tang (Tianjin University), Kang Chen (Tsinghua University), Deze Zeng (China University of Geosciences), Wenxin Li (Tianjin University), Keqiu Li (Tianjin University)
10:10 | 20m | Talk | FlashFuser: Expanding the Scale of Kernel Fusion for Compute-Intensive operators via Inter-Core Connection | Main Conference | huang ziyu (Shanghai Jiao Tong University), Yangjie Zhou (National University of Singapore), Zihan Liu (Shanghai Jiao Tong University), Xinhao Luo (Shanghai Jiao Tong University), Yijia Diao (Shanghai Jiao Tong University), Minyi Guo (Shanghai Jiao Tong University), Jidong Zhai (Tsinghua University), Yu Feng (Shanghai Jiao Tong University), Chen Zhang (Shanghai Jiao Tong University), Anbang Wu (Shanghai Jiao Tong University), Jingwen Leng (Shanghai Jiao Tong University)
10:30 | 20m | Talk | Swift: High-Performance Sparse-Dense Matrix Multiplication on GPUs | Main Conference | Jinyu Hu (Hunan University), Huizhang Luo (Hunan University), Hong Jiang (UT Arlington), Marc Casas (Barcelona Supercomputing Center), Kenli Li (National Supercomputing Center in Changsha, Hunan University), Chubo Liu (Hunan University)
10:50 | 20m | Talk | QuCo: Efficient and Flexible Hardware-Driven Automatic Configuration of Tile Transfers in GPUs | Main Conference | Nicolas Meseguer (University of Murcia), daoxuan xu (William & Mary), Yifan Sun (William & Mary), Michael Pellauer (Nvidia), José L. Abellán (University of Murcia), Manuel E. Acacio (Universidad de Murcia (UMU))