PADE: A Predictor-Free Sparse Attention Accelerator via Unified Execution and Stage Fusion
Attention-based models have revolutionized AI, but the quadratic cost of self-attention incurs severe computational and memory overhead. Sparse attention methods alleviate this by skipping low-relevance token pairs. However, current approaches lack practicality due to the heavy cost of the added sparsity predictor, which severely degrades their hardware efficiency.
This paper advances the state-of-the-art (SOTA) by proposing a bit-serial-enabled stage-fusion (BSF) mechanism, which eliminates the need for a separate predictor. However, BSF faces three key challenges: 1) inaccurate bit-sliced sparsity speculation leads to incorrect pruning; 2) fine-grained and imbalanced bit-level workloads cause hardware under-utilization; and 3) the row-wise dependency in the sparsity-pruning criterion makes tiling difficult.
We propose PADE, a predictor-free algorithm-hardware co-design for sparse attention acceleration. PADE features three key innovations: 1) a bit-wise uncertainty interval-enabled guard filtering (BUI-GF) strategy to accurately identify trivial tokens during each bit round; 2) bidirectional sparsity-based out-of-order execution (BS-OOE) to improve hardware utilization; and 3) interleaving-based sparsity-tiled attention (ISTA) to reduce both I/O and computational complexity. These techniques, combined with custom accelerator designs, enable practical sparsity acceleration without relying on an added sparsity predictor. Extensive experiments on 22 benchmarks show that PADE achieves a $7.43\times$ speedup and $31.1\times$ higher energy efficiency than an NVIDIA H100 GPU. Compared to SOTA attention accelerators, PADE achieves $5.1\times$, $4.3\times$, and $3.4\times$ higher energy efficiency than Sanger, DOTA, and SOFA, respectively.
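To make the bit-serial speculation idea concrete, the sketch below accumulates query-key scores one key bit plane at a time (MSB to LSB) and, after each bit round, prunes keys whose optimistic score upper bound cannot reach the running top-k lower bound. This is a minimal illustration of interval-guarded bit-serial pruning in general, not PADE's actual BUI-GF algorithm; the quantization scheme, the `keep_k` parameter, and the function name are assumptions made for this example only.

```python
import numpy as np

def bit_serial_topk_prune(q, K, num_bits=8, keep_k=4):
    """Toy sketch of interval-guarded bit-serial score pruning (not PADE's algorithm).

    Keys are assumed pre-normalized to [0, 1] and quantized to `num_bits`
    unsigned bits; the query stays in floating point. Scores q.K^T are
    accumulated one key bit plane at a time (MSB first). After each round,
    a key is dropped only if even its optimistic upper bound (partial sum
    plus the largest contribution the remaining bits could add) cannot
    reach the k-th best pessimistic lower bound among surviving keys.
    Returned partial scores are exact only for keys that survive.
    """
    seq_len, dim = K.shape
    K_q = np.clip(np.round(K * (2 ** num_bits - 1)), 0, 2 ** num_bits - 1).astype(np.int64)

    partial = np.zeros(seq_len)            # bit-serial partial scores
    active = np.ones(seq_len, dtype=bool)  # keys not yet pruned
    q_pos = np.clip(q, 0, None).sum()      # positive mass of the query
    q_neg = np.clip(-q, 0, None).sum()     # negative mass of the query

    for b in range(num_bits - 1, -1, -1):  # MSB -> LSB bit rounds
        bit_plane = (K_q >> b) & 1         # 0/1 slice of every key
        partial[active] += (bit_plane[active] @ q) * (2 ** b)

        slack = float(2 ** b - 1)          # max weight the remaining lower bits carry
        upper = partial + np.where(active, q_pos * slack, 0.0)
        lower = partial - np.where(active, q_neg * slack, 0.0)

        if active.sum() > keep_k:
            kth_lower = np.sort(lower[active])[-keep_k]
            active &= upper >= kth_lower   # guard filter: drop provably trivial keys

    return partial, active

# Usage: keep the 4 most relevant of 16 keys without a separate predictor.
rng = np.random.default_rng(0)
q = rng.standard_normal(64)
K = rng.random((16, 64))                   # keys already in [0, 1] here
scores, kept = bit_serial_topk_prune(q, K, num_bits=8, keep_k=4)
print("surviving keys:", np.flatnonzero(kept))
```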
Mon 2 Feb (times shown in the Hobart time zone)
15:50 - 17:10 | Efficient LLM Inference Techniques | Main Conference at Coogee | Chair(s): Jovan Stojkovic (University of Illinois at Urbana-Champaign)
15:50 (20m) Talk | PADE: A Predictor-Free Sparse Attention Accelerator via Unified Execution and Stage Fusion | Huizheng Wang (Tsinghua University), Hongbin Wang (Tsinghua University), Zichuan Wang (Tsinghua University), Zhiheng Yue (Tsinghua University), Yang Wang (Tsinghua University), Chao Li (Shanghai Jiao Tong University), Yang Hu (Tsinghua University), Shouyi Yin (Tsinghua University)
16:10 (20m) Talk | AQPIM: Breaking the PIM Capacity Wall for LLMs with In-Memory Activation Quantization | Kosuke Matsushima (Institute of Science Tokyo), Yasuyuki Okoshi (Institute of Science Tokyo), Masato Motomura (Institute of Science Tokyo), Daichi Fujiki (Institute of Science Tokyo)
16:30 (20m) Talk | BitDecoding: Unlocking Tensor Cores for Long-Context LLMs with Low-Bit KV Cache | Dayou Du (University of Edinburgh), Shijie Cao (Microsoft Research), Jianyi Cheng (University of Edinburgh, UK), Luo Mai (University of Edinburgh), Ting Cao (Institute for AI Industry Research (AIR), Tsinghua University), Mao Yang (Microsoft Research)
16:50 (20m) Talk | GyRot: Leveraging Hidden Synergy between Rotation and Fine-grained Group Quantization for Low-bit LLM Inference