HPCA 2026
Sat 31 January - Wed 4 February 2026 Sydney, Australia
co-located with HPCA/CGO/PPoPP/CC 2026
Mon 2 Feb 2026 11:50 - 12:10 at Coogee - Near-Data Processing and Storage Chair(s): Jisung Park

Transformer-based large language models (LLMs) exhibit remarkable generative capabilities, but their inference throughput is limited by the autoregressive decoding process, which generates only one token per iteration. Speculative decoding mitigates this bottleneck by using a lightweight draft language model (DLM) to generate multiple draft tokens, which are then verified in parallel by a more accurate target language model (TLM). To accommodate the differing computational patterns of the DLM and TLM, prior work has leveraged heterogeneous systems that combine xPUs and processing-in-memory (PIM) units to offload compute-intensive and memory-intensive operators, respectively.
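The draft-then-verify loop described above can be sketched with toy stand-ins for the two models. This is an illustrative sketch only, not the paper's implementation: `draft_next` and `target_accepts` are hypothetical callables standing in for DLM generation and TLM verification.

```python
def speculative_decode_step(draft_next, target_accepts, prefix, k):
    """One speculative-decoding iteration (toy sketch).

    draft_next(prefix) -> next token proposed by the draft model (DLM).
    target_accepts(prefix, token) -> True if the target model (TLM)
    would keep this draft token during parallel verification.
    Returns the tokens committed this iteration.
    """
    # 1. The DLM autoregressively proposes k draft tokens (cheap).
    drafts = []
    for _ in range(k):
        drafts.append(draft_next(prefix + drafts))

    # 2. The TLM checks all k drafts in a single parallel pass;
    #    the longest accepted prefix of the drafts is kept.
    accepted = []
    for tok in drafts:
        if target_accepts(prefix + accepted, tok):
            accepted.append(tok)
        else:
            break

    # The verification pass also yields the TLM's own next token,
    # so at least one token is committed per iteration.
    accepted.append("<tlm-token>")  # stand-in for that token
    return accepted
```

With a fixed `k`, every rejected draft token represents wasted DLM generation and TLM verification work, which is the inefficiency the paper targets.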

However, existing systems often adopt a fixed draft sequence length, leading to excessive rejection of draft tokens during verification, especially under large-batch scenarios, which results in redundant computation and reduced efficiency. This paper proposes a runtime adaptive draft length adjustment technique that dynamically tailors the draft length for each request by monitoring cumulative acceptance probabilities, thereby minimizing the generation and verification of invalid tokens. Yet integrating adaptive draft lengths into existing PIM-enabled heterogeneous systems introduces two new challenges: (1) sequential execution of the DLM and TLM becomes inefficient due to synchronization bubbles caused by request-wise variability in draft lengths, and (2) static operator mappings become suboptimal because draft length variability dynamically alters operator arithmetic intensities.

To address these issues, we introduce SADDLE, a PIM-enabled heterogeneous system designed to exploit adaptive draft lengths effectively. SADDLE incorporates two key mechanisms: (1) an asynchronous speculative decoding pipeline that decouples DLM prediction from TLM verification to reduce idle time, and (2) an arithmetic intensity-aware operator scheduler that dynamically assigns operators to the most suitable hardware units. Experimental results show that SADDLE achieves average speedups of 2.88× over a state-of-the-art GPU-only solution and 1.71× over the best-performing GPU+PIM baseline.
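The cumulative-acceptance idea in the abstract can be illustrated with a small policy sketch. This is an assumption-laden toy, not SADDLE's actual algorithm: the per-token acceptance estimates, the threshold, and the cap are all hypothetical parameters introduced here for illustration.

```python
def adaptive_draft_length(accept_probs, threshold=0.5, max_len=8):
    """Pick a per-request draft length (illustrative sketch only).

    accept_probs: estimated probability that the TLM accepts the i-th
    draft token for this request (e.g. from a running average of past
    verification outcomes; how SADDLE estimates this is not specified
    in the abstract).

    Stops drafting once the cumulative acceptance probability (the
    chance that all tokens so far survive verification) drops below
    `threshold`, so that few invalid tokens are generated and verified.
    """
    cumulative = 1.0
    length = 0
    for p in accept_probs[:max_len]:
        cumulative *= p
        if cumulative < threshold:
            break
        length += 1
    return max(length, 1)  # always draft at least one token
```

Because each request ends up with its own draft length, a batch no longer finishes drafting in lockstep, which is exactly the request-wise variability that motivates SADDLE's asynchronous DLM/TLM pipeline and its arithmetic intensity-aware remapping of operators between xPU and PIM units.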

Mon 2 Feb

Displayed time zone: Hobart

11:30 - 12:50
Near-Data Processing and Storage (Main Conference) at Coogee
Chair(s): Jisung Park POSTECH (Pohang University of Science and Technology)
11:30
20m
Talk
PIMphony: Overcoming Bandwidth and Capacity Inefficiency in PIM-based Long-Context LLM Inference System
Main Conference
Hyucksung Kwon Hanyang University, Kyungmo Koo Hanyang University, Janghyeon Kim Hanyang University, Woongkyu Lee Hanyang University, Minjae Lee Hanyang University, Gyeonggeun Jung KAIST, Hyungdeok Lee Solution Advanced Technology, SK hynix, Yousub Jung Solution Advanced Technology, SK hynix, Jaehan Park Solution Advanced Technology, SK hynix, Yosub Song Solution Advanced Technology, SK hynix, Byeongsu Yang Solution Advanced Technology, SK hynix, Haerang Choi Solution Advanced Technology, SK hynix, Guhyun Kim Solution Advanced Technology, SK hynix, Jongsoon Won Solution Advanced Technology, SK hynix, Woojae Shin Solution Advanced Technology, SK hynix, Changhyun Kim Solution Advanced Technology, SK hynix, Shin Gyeongcheol Solution Advanced Technology, SK hynix, Yongkee Kwon Tenstorrent, Ilkon Kim Solution Advanced Technology, SK hynix, Euicheol Lim SK hynix, John Kim KAIST, Jungwook Choi Hanyang University
11:50
20m
Talk
Adaptive Draft Sequence Length: Enhancing Speculative Decoding Throughput on PIM-Enabled Systems
Main Conference
Runze Wang Huazhong University of Science and Technology, Qinggang Wang Huazhong University of Science and Technology, Haifeng Liu Huazhong University of Science and Technology, Long Zheng Huazhong University of Science and Technology, Xiaofei Liao Huazhong University of Science and Technology, Hai Jin Huazhong University of Science and Technology, Jingling Xue UNSW Sydney
12:10
20m
Talk
Conduit: Programmer-Transparent Near-Data Processing Using Multiple Compute-Capable Resources in SSDs
Main Conference
Rakesh Nadig ETH Zurich, Vamanan Arulchelvan ETH Zurich, Mayank Kabra ETH Zurich, Harshita Gupta ETH Zurich, Rahul Bera ETH Zurich, Nika Mansouri Ghiasi ETH Zurich, Nanditha Rao ETH Zurich, Qingcai Jiang ETH Zurich, Andreas Kosmas Kakolyris ETH Zurich, Yu Liang ETH Zurich, Mohammad Sadrosadati ETH Zürich, Onur Mutlu ETH Zurich
12:30
20m
Talk
N-DIPPER: A Distributed Inter-die Peak Power Management Network for NAND Systems
Main Conference
Jinwoo Park KAIST, John Kim KAIST