HPCA 2026
Sat 31 January - Wed 4 February 2026, Sydney, Australia
co-located with HPCA/CGO/PPoPP/CC 2026

This program is tentative and subject to change.

Tue 3 Feb 2026 14:10 - 14:30 at Coogee - LLM Systems and Microarchitecture Tools

The ever-growing sizes of frontier large language models (LLMs) introduce significant infrastructure challenges due to their immense memory capacity demands. While the de facto approach has been to deploy multiple high-end GPUs, each with limited memory capacity, the prohibitive cost of such systems has become a major barrier to the widespread deployment of frontier LLMs. As a CPU can offer an order of magnitude larger memory capacity at a few times lower cost per bit than a GPU, CPU-based inference using the latest on-chip accelerator, Intel Advanced Matrix Extensions (AMX), has emerged as a cost-effective alternative to GPU-based inference. Nevertheless, even the CPU's large memory capacity has become insufficient to serve LLMs with hundreds of billions of parameters. Under such a memory capacity constraint, we may offload parameters to storage devices and fetch them on demand, but doing so significantly degrades inference performance due to the high latency and low bandwidth of storage devices. To address this challenge, we propose LILo, an LLM inference framework that leverages an on-chip lossless compression accelerator in the latest Intel CPUs to accelerate inference under memory capacity constraints. By storing model parameters in a compressed format and decompressing them on demand, LILo significantly reduces storage access during inference while preserving the model's accuracy and behavior. However, decompression must be fast enough that its overhead does not overshadow the benefits of reduced storage access. LILo addresses this by orchestrating the concurrent execution of on-chip accelerators, i.e., the In-Memory Analytics Accelerator (IAA), Advanced Vector Extensions (AVX), and AMX, to facilitate high-throughput decompression alongside inference computation. Furthermore, LILo implements selective compression, a Mixture-of-Experts (MoE)-aware optimization that reduces the decompression overhead by up to 1.9×. We demonstrate that LILo reduces inference latency by up to 4.9× and 4.3× for Llama3-405B and DeepSeek-R1, respectively, under memory capacity constraints, compared to baseline inference that relies solely on storage offloading without compression.
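
To make the overlap idea concrete, below is a minimal C++ sketch of the pipelining pattern the abstract describes: decompressing the next layer's weights concurrently with the current layer's compute. The helpers load_compressed, iaa_decompress, and compute_layer are hypothetical stand-ins for storage I/O, the IAA-backed decompressor, and the AMX/AVX inference kernels; this is an illustration of the scheduling pattern, not the paper's implementation.

```cpp
#include <cstdint>
#include <future>
#include <vector>

using Blob = std::vector<uint8_t>;

// Hypothetical stand-ins for the components named in the abstract.
Blob load_compressed(int layer) { return Blob(1 << 20, uint8_t(layer)); }  // placeholder: SSD read
Blob iaa_decompress(const Blob& in) { return in; }  // placeholder: IAA lossless inflate
void compute_layer(const Blob& /*weights*/) {}      // placeholder: AMX/AVX GEMM kernels

int main() {
    const int num_layers = 4;

    // Prefetch and decompress layer 0 before the compute loop starts.
    std::future<Blob> next = std::async(std::launch::async, [] {
        return iaa_decompress(load_compressed(0));
    });

    for (int l = 0; l < num_layers; ++l) {
        Blob weights = next.get();  // block until this layer's weights are ready
        if (l + 1 < num_layers) {
            // Kick off fetch + decompression of the next layer so the
            // decompressor runs concurrently with this layer's compute.
            next = std::async(std::launch::async, [l] {
                return iaa_decompress(load_compressed(l + 1));
            });
        }
        compute_layer(weights);  // compute overlaps the next layer's decompression
    }
}
```

In the same spirit, one plausible reading of selective compression is to leave the most frequently activated MoE experts uncompressed so that only rarely used parameters pay the decompression cost; the abstract states only that the optimization is MoE-aware and cuts decompression overhead by up to 1.9×.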

Tue 3 Feb

Displayed time zone: Hobart

14:10 - 15:30
LLM Systems and Microarchitecture Tools (Main Conference) at Coogee
14:10
20m
Talk
LILo: Harnessing the On-chip Accelerators in Intel CPUs for Compressed LLM Inference Acceleration
Main Conference
Hyungyo Kim (UIUC), Qirong Xia (UIUC), Jinghan Huang (UIUC), Nachuan Wang (UIUC), Jung Ho Ahn (Seoul National University), Younjoo Lee (Seoul National University), Wajdi K Feghali (Intel), Ren Wang (Intel Labs), Nam Sung Kim (UIUC)
14:30
20m
Talk
ReThermal: Co-Design of Thermal-Aware Static and Dynamic Scheduling for LLM Training on Liquid-Cooled Wafer-Scale Chips
Main Conference
Chengran Li (Tsinghua University), Huizheng Wang (Tsinghua University), Jiaxin Liu (Tsinghua University), Jingyao Liu (Tsinghua University), Zhiheng Yue (Tsinghua University), Xia Li (Shanghai AI Lab), Shenfei Jiang (Shanghai AI Lab), Jinyi Deng (Tsinghua University), Yang Hu (Tsinghua University), Shouyi Yin (Tsinghua University)
14:50
20m
Talk
TraceRTL: Agile Performance Evaluation for Microarchitecture Exploration
Main Conference
Zifei Zhang (SKLP, Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences), Yinan Xu (SKLP, Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences), Sa Wang (SKLP, Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences), Dan Tang (SKLP, Institute of Computing Technology, Chinese Academy of Sciences; Beijing Institute of Open Source Chip), Yungang Bao (State Key Lab of Processors, Institute of Computing Technology, CAS; University of Chinese Academy of Sciences)
15:10
20m
Talk
Nugget: Portable Program Snippets
Main Conference
Zhantong Qiu (University of California, Davis), Mahyar Samani (University of California, Davis), Jason Lowe-Power (University of California, Davis & Google)