HPCA 2026
Sat 31 January - Wed 4 February 2026, Sydney, Australia
co-located with HPCA/CGO/PPoPP/CC 2026

This program is tentative and subject to change.

Tue 3 Feb 2026 14:10 - 14:30 at Coogee - LLM Systems and Microarchitecture Tools

The ever-growing sizes of frontier large language models (LLMs) introduce significant infrastructure challenges due to their immense memory capacity demands. While the de facto approach has been to deploy multiple high-end GPUs, each with limited memory capacity, the prohibitive cost of such systems has become a major barrier to the widespread deployment of frontier LLMs. As a CPU can offer an order of magnitude larger memory capacity at a few times lower cost per bit than a GPU, CPU-based inference using the latest on-chip accelerator, Intel Advanced Matrix Extensions (AMX), has emerged as a cost-effective alternative to GPU-based inference. Nevertheless, even the CPU's large memory capacity has become insufficient to serve LLMs with hundreds of billions of parameters. Under such a memory capacity constraint, we may offload parameters to storage devices and fetch them on demand, but doing so significantly degrades inference performance due to the high latency and low bandwidth of storage devices. To address this challenge, we propose LILo, an LLM inference framework that leverages an on-chip lossless compression accelerator in the latest Intel CPUs to accelerate inference under memory capacity constraints. By storing model parameters in a compressed format and decompressing them on demand, LILo significantly reduces storage access during inference while preserving the model's accuracy and behavior. However, decompression must be fast enough that its overhead does not overshadow the benefits of reduced storage access. LILo addresses this by orchestrating the concurrent execution of on-chip accelerators, i.e., the In-Memory Analytics Accelerator (IAA), Advanced Vector Extensions (AVX), and AMX, to facilitate high-throughput decompression alongside inference computation. Furthermore, LILo implements selective compression, a Mixture-of-Experts (MoE)-aware optimization that reduces the decompression overhead by up to 1.9×. We demonstrate that LILo reduces inference latency by up to 4.9× and 4.3× for Llama3-405B and DeepSeek-R1, respectively, under memory capacity constraints, compared to baseline inference that relies solely on storage offloading without compression.
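
To make the overlap idea concrete, below is a minimal C++ sketch of the pipelining pattern the abstract describes: decompressing the next layer's weights concurrently with the current layer's compute. The helpers load_compressed, iaa_decompress, and compute_layer are hypothetical stand-ins for storage I/O, the IAA-backed decompressor, and the AMX/AVX inference kernels; this is an illustration of the scheduling pattern, not the paper's implementation.

```cpp
#include <cstdint>
#include <future>
#include <vector>

using Blob = std::vector<uint8_t>;

// Hypothetical stand-ins for the components named in the abstract.
Blob load_compressed(int layer) { return Blob(1 << 20, uint8_t(layer)); }  // placeholder: SSD read
Blob iaa_decompress(const Blob& in) { return in; }  // placeholder: IAA lossless inflate
void compute_layer(const Blob& /*weights*/) {}      // placeholder: AMX/AVX GEMM kernels

int main() {
    const int num_layers = 4;

    // Prefetch and decompress layer 0 before the compute loop starts.
    std::future<Blob> next = std::async(std::launch::async, [] {
        return iaa_decompress(load_compressed(0));
    });

    for (int l = 0; l < num_layers; ++l) {
        Blob weights = next.get();  // block until this layer's weights are ready
        if (l + 1 < num_layers) {
            // Kick off fetch + decompression of the next layer so the
            // decompressor runs concurrently with this layer's compute.
            next = std::async(std::launch::async, [l] {
                return iaa_decompress(load_compressed(l + 1));
            });
        }
        compute_layer(weights);  // compute overlaps the next layer's decompression
    }
}
```

In the same spirit, one plausible reading of selective compression is to leave the most frequently activated MoE experts uncompressed so that only rarely used parameters pay the decompression cost; the abstract states only that the optimization is MoE-aware and cuts decompression overhead by up to 1.9×.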

Tue 3 Feb

Displayed time zone: Hobart

14:10 - 15:30
LLM Systems and Microarchitecture Tools (Main Conference) at Coogee
14:10
20m
Talk
LILo: Harnessing the On-chip Accelerators in Intel CPUs for Compressed LLM Inference Acceleration
Main Conference
Hyungyo Kim (UIUC), Qirong Xia (UIUC), Jinghan Huang (UIUC), Nachuan Wang (UIUC), Jung Ho Ahn (Seoul National University), Younjoo Lee (Seoul National University), Wajdi K Feghali (Intel), Ren Wang (Intel Labs), Nam Sung Kim (UIUC)
14:30
20m
Talk
ReThermal: Co-Design of Thermal-Aware Static and Dynamic Scheduling for LLM Training on Liquid-Cooled Wafer-Scale Chips
Main Conference
Chengran Li (Tsinghua University), Huizheng Wang (Tsinghua University), Jiaxin Liu (Tsinghua University), Jingyao Liu (Tsinghua University), Zhiheng Yue (Tsinghua University), Xia Li (Shanghai AI Lab), Shenfei Jiang (Shanghai AI Lab), Jinyi Deng (Tsinghua University), Yang Hu (Tsinghua University), Shouyi Yin (Tsinghua University)
14:50
20m
Talk
TraceRTL: Agile Performance Evaluation for Microarchitecture Exploration
Main Conference
Zifei Zhang (SKLP, Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences), Yinan Xu (SKLP, Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences), Sa Wang (SKLP, Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences), Dan Tang (SKLP, Institute of Computing Technology, Chinese Academy of Sciences; Beijing Institute of Open Source Chip), Yungang Bao (State Key Lab of Processors, Institute of Computing Technology, CAS; University of Chinese Academy of Sciences)
15:10
20m
Talk
Nugget: Portable Program Snippets
Main Conference
Zhantong Qiu (University of California, Davis), Mahyar Samani (University of California, Davis), Jason Lowe-Power (University of California, Davis & Google)