Towards Resource-Efficient Serverless LLM Inference with SLINFER
The rise of LLMs has driven demand for private serverless deployments, which are characterized by moderate-sized models and infrequent requests. While existing serverless solutions rely on exclusive GPU allocation, we take a step back to examine modern platforms and find that emerging CPU architectures with built-in accelerators are capable of serving LLMs yet remain underutilized, and that both CPUs and GPUs can accommodate multiple LLMs simultaneously.
We propose SLINFER, a resource-efficient serverless inference scheme tailored for small- to mid-sized LLMs that enables elastic, on-demand sharing across heterogeneous hardware. SLINFER tackles three fundamental challenges: (1) precise, fine-grained, token-level compute resource allocation to handle fluctuating computational demands; (2) a coordinated, forward-looking memory scaling mechanism that detects out-of-memory hazards and reduces operational overhead; and (3) a dual approach that reduces resource fragmentation through proactive preemption and reactive bin-packing. Experimental results on four 32-core CPUs and four A100 GPUs show that SLINFER improves serving capacity by 47%-62% through sharing alone, and further leveraging CPUs boosts this to 86%-154%.
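To make the fragmentation-reduction idea concrete, below is a minimal, hypothetical sketch of reactive best-fit bin-packing of model instances onto a shared CPU/GPU fleet. The abstract does not disclose SLINFER's actual algorithm; the device names, memory figures, and the best-fit heuristic here are illustrative assumptions only.

```python
# Illustrative sketch, NOT SLINFER's implementation: best-fit bin-packing of
# model instances onto shared heterogeneous devices by free memory, so large
# contiguous headroom is preserved for future arrivals. A real system would
# also preempt or queue requests when no device fits.
from dataclasses import dataclass, field


@dataclass
class Device:
    name: str                              # hypothetical label, e.g. "gpu0" or "cpu0"
    capacity_gb: float                     # memory budget for weights + KV cache
    residents: list = field(default_factory=list)

    @property
    def free_gb(self) -> float:
        return self.capacity_gb - sum(size for _, size in self.residents)


def place(model: str, size_gb: float, devices: list[Device]) -> Device | None:
    """Pick the device whose free memory fits the model with the least slack."""
    candidates = [d for d in devices if d.free_gb >= size_gb]
    if not candidates:
        return None                        # fall back to preemption / queueing
    best = min(candidates, key=lambda d: d.free_gb - size_gb)
    best.residents.append((model, size_gb))
    return best


if __name__ == "__main__":
    # Hypothetical fleet: two GPUs and two CPU sockets sharing models.
    fleet = [Device("gpu0", 80), Device("gpu1", 80),
             Device("cpu0", 128), Device("cpu1", 128)]
    for name, size in [("model-7b", 16), ("model-7b-b", 16), ("model-13b", 28)]:
        target = place(name, size, fleet)
        print(name, "->", target.name if target else "no fit")
```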
Mon 2 Feb (displayed time zone: Hobart)
14:10 - 15:30 | LLM Inference Serving Systems (Main Conference) at Coogee
Chair(s): Jian Li (Chinese Academy of Meteorological Sciences)

14:10 (20m talk) | Towards Resource-Efficient Serverless LLM Inference with SLINFER
14:30 (20m talk) | ELORA: Efficient LoRA and KV Cache Management for Multi-LoRA LLM Serving
  Jiuchen Shi (Shanghai Jiao Tong University & The Hong Kong Polytechnic University), Hang Zhang (Shanghai Jiao Tong University), Yixiao Wang (Shanghai Jiao Tong University), Quan Chen (Shanghai Jiao Tong University, China), Yizhou Shan (Huawei Cloud), Kaihua Fu (Hong Kong University of Science and Technology), Wei Wang (Hong Kong University of Science and Technology), Minyi Guo (Shanghai Jiao Tong University)
14:50 (20m talk) | PASCAL: A Phase-Aware Scheduling Algorithm for Serving Reasoning-based Large Language Models
15:10 (20m talk) | The Cost of Dynamic Reasoning: Demystifying AI Agents and Test-Time Scaling from an AI Infrastructure Perspective