LILo: Harnessing the On-chip Accelerators in Intel CPUs for Compressed LLM Inference Acceleration
This program is tentative and subject to change.
The ever-growing sizes of frontier large language models (LLMs) introduce significant infrastructure challenges due to their immense memory capacity demands. While the de facto approach has been to deploy multiple high-end GPUs, each with a limited memory capacity, the prohibitive cost of such systems has become a major barrier to the widespread deployment of frontier LLMs. As a CPU can offer an order of magnitude larger memory capacity at a few times lower cost per bit than a GPU, CPU-based inference using the latest on-chip accelerator, Intel Advanced Matrix Extensions (AMX), has emerged as a cost-effective alternative to GPU-based inference. Nevertheless, even the CPU's large memory capacity has become insufficient to serve LLMs with hundreds of billions of parameters. Under such a memory capacity constraint, we may offload parameters to storage devices and fetch them on demand, but doing so significantly degrades inference performance due to the high latency and low bandwidth of storage devices. To address this challenge, we propose LILo, an LLM inference framework that leverages an on-chip lossless compression accelerator in the latest Intel CPUs to accelerate inference under memory capacity constraints. By storing model parameters in a compressed format and decompressing them on demand, LILo significantly reduces storage access during inference while preserving the model's accuracy and behavior. However, it is crucial to achieve high-speed decompression so that the benefits of reduced storage access are not overshadowed by the decompression overhead. LILo addresses this by orchestrating the concurrent execution of on-chip accelerators, i.e., the In-memory Analytics Accelerator (IAA), Advanced Vector Extensions (AVX), and AMX, to facilitate high-throughput decompression alongside inference computation. Furthermore, LILo implements selective compression, a Mixture-of-Experts (MoE)-aware optimization that reduces the decompression overhead by up to 1.9×. We demonstrate that LILo reduces inference latency by up to 4.9× and 4.3× for Llama3-405B and DeepSeek-R1, respectively, under memory capacity constraints compared to baseline inference that relies solely on storage offloading without compression.
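The core mechanism described in the abstract, keeping offloaded weights losslessly compressed and decompressing them on demand while that work overlaps with inference compute, can be illustrated with a minimal sketch. The code below is a conceptual illustration, not LILo's implementation: the helper names (load_compressed_layer, hw_decompress, run_layer) are hypothetical, zlib stands in for the IAA's hardware DEFLATE decompression, and a background thread models the accelerator running concurrently with the AVX/AMX compute.

```python
# Conceptual sketch (assumptions, not LILo's actual code): overlap on-demand
# decompression of offloaded, compressed layer weights with the compute of
# the current layer. zlib models lossless hardware decompression (IAA);
# run_layer is a placeholder for AMX/AVX inference kernels.
import zlib
from concurrent.futures import ThreadPoolExecutor


def load_compressed_layer(layer_id: int) -> bytes:
    """Hypothetical: read one layer's compressed weight blob from storage."""
    with open(f"weights/layer_{layer_id}.deflate", "rb") as f:
        return f.read()


def hw_decompress(blob: bytes) -> bytes:
    """Stand-in for hardware (IAA) decompression of a lossless DEFLATE stream."""
    return zlib.decompress(blob)


def run_layer(activations, weights: bytes):
    """Hypothetical: execute one transformer layer with AMX/AVX kernels."""
    return activations  # placeholder for matmuls, attention, etc.


def infer(activations, num_layers: int):
    with ThreadPoolExecutor(max_workers=1) as prefetcher:
        # Fetch and decompress layer 0 before compute begins.
        pending = prefetcher.submit(
            lambda: hw_decompress(load_compressed_layer(0)))
        for i in range(num_layers):
            weights = pending.result()  # wait for layer i's decompressed weights
            if i + 1 < num_layers:
                # Kick off fetch + decompression of layer i+1 in the background,
                # so it overlaps with layer i's compute below.
                pending = prefetcher.submit(
                    lambda j=i + 1: hw_decompress(load_compressed_layer(j)))
            activations = run_layer(activations, weights)
    return activations
```

In this toy pipeline the only weights resident in memory at any time are those of the current layer and the one being prefetched, which mimics operating under a memory capacity constraint; the paper's selective compression for MoE models would further skip decompression for experts that are unlikely to be activated.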
Tue 3 Feb (displayed time zone: Hobart)
14:10 - 15:30
14:10 | 20m Talk | LILo: Harnessing the On-chip Accelerators in Intel CPUs for Compressed LLM Inference Acceleration | Main Conference | Hyungyo Kim (UIUC), Qirong Xia (UIUC), Jinghan Huang (UIUC), Nachuan Wang (UIUC), Jung Ho Ahn (Seoul National University), Younjoo Lee (Seoul National University), Wajdi K Feghali (Intel), Ren Wang (Intel Labs), Nam Sung Kim (UIUC)
14:30 | 20m Talk | ReThermal: Co-Design of Thermal-Aware Static and Dynamic Scheduling for LLM Training on Liquid-Cooled Wafer-Scale Chips | Main Conference | Chengran Li (Tsinghua University), Huizheng Wang (Tsinghua University), Jiaxin Liu (Tsinghua University), Jingyao Liu (Tsinghua University), Zhiheng Yue (Tsinghua University), Xia Li (Shanghai AI Lab), Shenfei Jiang (Shanghai AI Lab), Jinyi Deng (Tsinghua University), Yang Hu (Tsinghua University), Shouyi Yin (Tsinghua University)
14:50 | 20m Talk | TraceRTL: Agile Performance Evaluation for Microarchitecture Exploration | Main Conference | Zifei Zhang (SKLP, Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences), Yinan Xu (SKLP, Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences), Sa Wang (SKLP, Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences), Dan Tang (SKLP, Institute of Computing Technology, Chinese Academy of Sciences; Beijing Institute of Open Source Chip), Yungang Bao (State Key Lab of Processors, Institute of Computing Technology, CAS; University of Chinese Academy of Sciences)
15:10 | 20m Talk | Nugget: Portable Program Snippets | Main Conference | Zhantong Qiu (University of California, Davis), Mahyar Samani (University of California, Davis), Jason Lowe-Power (University of California, Davis & Google)