HPCA 2026
Sat 31 January - Wed 4 February 2026 Sydney, Australia
co-located with HPCA/CGO/PPoPP/CC 2026
Mon 2 Feb 2026 11:50 - 12:10 at Coogee - Near-Data Processing and Storage Chair(s): Jisung Park

Transformer-based large language models (LLMs) exhibit remarkable generative capabilities, but their inference throughput is limited by the autoregressive decoding process, which generates only one token per iteration. Speculative decoding mitigates this bottleneck by using a lightweight draft language model (DLM) to generate multiple draft tokens, which are then verified in parallel by a more accurate target language model (TLM). To accommodate the differing computational patterns of the DLM and TLM, prior work has leveraged heterogeneous systems that combine xPUs and processing-in-memory (PIM) units to offload compute-intensive and memory-intensive operators, respectively.
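The draft-then-verify loop described above can be sketched with toy stand-ins for the two models. This is an illustrative sketch only, not the paper's implementation: `draft_next` and `target_accepts` are hypothetical callables standing in for DLM generation and TLM verification.

```python
def speculative_decode_step(draft_next, target_accepts, prefix, k):
    """One speculative-decoding iteration (toy sketch).

    draft_next(prefix) -> next token proposed by the draft model (DLM).
    target_accepts(prefix, token) -> True if the target model (TLM)
    would keep this draft token during parallel verification.
    Returns the tokens committed this iteration.
    """
    # 1. The DLM autoregressively proposes k draft tokens (cheap).
    drafts = []
    for _ in range(k):
        drafts.append(draft_next(prefix + drafts))

    # 2. The TLM checks all k drafts in a single parallel pass;
    #    the longest accepted prefix of the drafts is kept.
    accepted = []
    for tok in drafts:
        if target_accepts(prefix + accepted, tok):
            accepted.append(tok)
        else:
            break

    # The verification pass also yields the TLM's own next token,
    # so at least one token is committed per iteration.
    accepted.append("<tlm-token>")  # stand-in for that token
    return accepted
```

With a fixed `k`, every rejected draft token represents wasted DLM generation and TLM verification work, which is the inefficiency the paper targets.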

However, existing systems often adopt a fixed draft sequence length, leading to excessive rejection of draft tokens during verification, especially under large-batch scenarios, which results in redundant computation and reduced efficiency. This paper proposes a runtime adaptive draft length adjustment technique that dynamically tailors the draft length for each request by monitoring cumulative acceptance probabilities, thereby minimizing the generation and verification of invalid tokens. Yet integrating adaptive draft lengths into existing PIM-enabled heterogeneous systems introduces two new challenges: (1) sequential execution of the DLM and TLM becomes inefficient due to synchronization bubbles caused by request-wise variability in draft lengths, and (2) static operator mappings become suboptimal because draft length variability dynamically alters operator arithmetic intensities.

To address these issues, we introduce SADDLE, a PIM-enabled heterogeneous system designed to exploit adaptive draft lengths effectively. SADDLE incorporates two key mechanisms: (1) an asynchronous speculative decoding pipeline that decouples DLM prediction from TLM verification to reduce idle time, and (2) an arithmetic intensity-aware operator scheduler that dynamically assigns operators to the most suitable hardware units. Experimental results show that SADDLE achieves average speedups of 2.88× over a state-of-the-art GPU-only solution and 1.71× over the best-performing GPU+PIM baseline.
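The cumulative-acceptance idea in the abstract can be illustrated with a small policy sketch. This is an assumption-laden toy, not SADDLE's actual algorithm: the per-token acceptance estimates, the threshold, and the cap are all hypothetical parameters introduced here for illustration.

```python
def adaptive_draft_length(accept_probs, threshold=0.5, max_len=8):
    """Pick a per-request draft length (illustrative sketch only).

    accept_probs: estimated probability that the TLM accepts the i-th
    draft token for this request (e.g. from a running average of past
    verification outcomes; how SADDLE estimates this is not specified
    in the abstract).

    Stops drafting once the cumulative acceptance probability (the
    chance that all tokens so far survive verification) drops below
    `threshold`, so that few invalid tokens are generated and verified.
    """
    cumulative = 1.0
    length = 0
    for p in accept_probs[:max_len]:
        cumulative *= p
        if cumulative < threshold:
            break
        length += 1
    return max(length, 1)  # always draft at least one token
```

Because each request ends up with its own draft length, a batch no longer finishes drafting in lockstep, which is exactly the request-wise variability that motivates SADDLE's asynchronous DLM/TLM pipeline and its arithmetic intensity-aware remapping of operators between xPU and PIM units.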

Mon 2 Feb

Displayed time zone: Hobart

11:30 - 12:50
Near-Data Processing and Storage (Main Conference) at Coogee
Chair(s): Jisung Park POSTECH (Pohang University of Science and Technology)
11:30
20m
Talk
PIMphony: Overcoming Bandwidth and Capacity Inefficiency in PIM-based Long-Context LLM Inference System
Main Conference
Hyucksung Kwon Hanyang University, Kyungmo Koo Hanyang University, Janghyeon Kim Hanyang University, Woongkyu Lee Hanyang University, Minjae Lee Hanyang University, Gyeonggeun Jung KAIST, Hyungdeok Lee Solution Advanced Technology, SK hynix, Yousub Jung Solution Advanced Technology, SK hynix, Jaehan Park Solution Advanced Technology, SK hynix, Yosub Song Solution Advanced Technology, SK hynix, Byeongsu Yang Solution Advanced Technology, SK hynix, Haerang Choi Solution Advanced Technology, SK hynix, Guhyun Kim Solution Advanced Technology, SK hynix, Jongsoon Won Solution Advanced Technology, SK hynix, Woojae Shin Solution Advanced Technology, SK hynix, Changhyun Kim Solution Advanced Technology, SK hynix, Shin Gyeongcheol Solution Advanced Technology, SK hynix, Yongkee Kwon Tenstorrent, Ilkon Kim Solution Advanced Technology, SK hynix, Euicheol Lim SK hynix, John Kim KAIST, Jungwook Choi Hanyang University
11:50
20m
Talk
Adaptive Draft Sequence Length: Enhancing Speculative Decoding Throughput on PIM-Enabled Systems
Main Conference
Runze Wang Huazhong University of Science and Technology, Qinggang Wang Huazhong University of Science and Technology, Haifeng Liu Huazhong University of Science and Technology, Long Zheng Huazhong University of Science and Technology, Xiaofei Liao Huazhong University of Science and Technology, Hai Jin Huazhong University of Science and Technology, Jingling Xue UNSW Sydney
12:10
20m
Talk
Conduit: Programmer-Transparent Near-Data Processing Using Multiple Compute-Capable Resources in SSDs
Main Conference
Rakesh Nadig ETH Zurich, Vamanan Arulchelvan ETH Zurich, Mayank Kabra ETH Zurich, Harshita Gupta ETH Zurich, Rahul Bera ETH Zurich, Nika Mansouri Ghiasi ETH Zurich, Nanditha Rao ETH Zurich, Qingcai Jiang ETH Zurich, Andreas Kosmas Kakolyris ETH Zurich, Yu Liang ETH Zurich, Mohammad Sadrosadati ETH Zürich, Onur Mutlu ETH Zurich
12:30
20m
Talk
N-DIPPER: A Distributed Inter-die Peak Power Management Network for NAND Systems
Main Conference
Jinwoo Park KAIST, John Kim KAIST