ELORA: Efficient LoRA and KV Cache Management for Multi-LoRA LLM Serving
Multiple Low-Rank Adapters (Multi-LoRA) are gaining popularity for task-specific Large Language Model (LLM) applications. For Multi-LoRA serving, caching hot LoRAs and KV caches in GPU memory can improve inference performance. However, existing Multi-LoRA inference systems neglect the usage dependencies between LoRAs and KV caches when caching them, and thus fail to optimize serving metrics such as Time-To-First-Token (TTFT). We therefore propose ELORA, a Multi-LoRA caching system that optimizes serving performance. ELORA comprises a dependency-aware cache manager and a performance-driven cache swapper. The cache manager maintains the usage dependencies between LoRAs and KV caches during inference with a unified caching pool. The cache swapper decides which LoRAs and KV caches to swap in when GPU memory is idle, and which to swap out when it is busy, based on a unified cost model. Experimental results show that ELORA reduces TTFT by 45.7% on average compared to state-of-the-art works.
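The abstract only names ELORA's two components, so as a concrete illustration, here is a minimal Python sketch of one plausible reading of a unified caching pool that tracks LoRA-to-KV dependencies and evicts by a simple cost model. All identifiers (UnifiedCachePool, CacheEntry, _victim_rank) and the scoring heuristic are assumptions for illustration, not ELORA's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class CacheEntry:
    key: str
    size_mb: float               # GPU memory footprint
    reload_cost_ms: float        # time to restore the entry (reload LoRA / recompute KV)
    lora_key: str | None = None  # for a KV-cache entry: the LoRA that produced it
    last_used: int = 0

class UnifiedCachePool:
    """Hypothetical sketch: a single pool for LoRA adapters and KV caches."""

    def __init__(self, capacity_mb: float):
        self.capacity_mb = capacity_mb
        self.used_mb = 0.0
        self.entries: dict[str, CacheEntry] = {}
        self.clock = 0

    def _kv_dependents(self, lora_key: str) -> list[CacheEntry]:
        # KV caches that become useless if this LoRA is evicted.
        return [e for e in self.entries.values() if e.lora_key == lora_key]

    def touch(self, key: str) -> None:
        # Using a KV cache also counts as using the LoRA it depends on.
        self.clock += 1
        entry = self.entries[key]
        entry.last_used = self.clock
        if entry.lora_key is not None:
            self.entries[entry.lora_key].last_used = self.clock

    def _victim_rank(self, lora: CacheEntry) -> tuple[int, float]:
        # Toy cost model: evict the least-recently-used dependency group,
        # tie-broken by the cheapest reload cost per MB freed.
        group = [lora] + self._kv_dependents(lora.key)
        recency = max(e.last_used for e in group)
        cost_per_mb = sum(e.reload_cost_ms for e in group) / sum(e.size_mb for e in group)
        return (recency, cost_per_mb)

    def _evict_for(self, needed_mb: float, protect: str | None) -> None:
        # Swap out whole LoRA+KV groups until the new entry fits.
        while self.used_mb + needed_mb > self.capacity_mb:
            candidates = [e for e in self.entries.values()
                          if e.lora_key is None and e.key != protect]
            if not candidates:
                raise MemoryError("cannot free enough GPU memory")
            victim = min(candidates, key=self._victim_rank)
            for e in [victim] + self._kv_dependents(victim.key):
                self.used_mb -= e.size_mb
                del self.entries[e.key]

    def insert(self, entry: CacheEntry) -> None:
        if entry.lora_key is not None and entry.lora_key not in self.entries:
            raise ValueError("KV cache inserted without its LoRA resident")
        self._evict_for(entry.size_mb, protect=entry.lora_key)
        self.entries[entry.key] = entry
        self.used_mb += entry.size_mb
        self.touch(entry.key)

# Illustrative usage: a LoRA must be resident before its KV cache is cached.
pool = UnifiedCachePool(capacity_mb=1000)
pool.insert(CacheEntry("loraA", size_mb=200, reload_cost_ms=50))
pool.insert(CacheEntry("kv:req1", size_mb=300, reload_cost_ms=400, lora_key="loraA"))
```

The point of the sketch is the dependency rule the abstract implies: a KV cache is only reusable together with the LoRA that produced it, so eviction operates on whole LoRA+KV groups rather than individual entries, preventing orphaned KV caches.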
Mon 2 Feb (times shown in the Hobart time zone)
14:10 - 15:30 | Session: LLM Inference Serving Systems (Main Conference), at Coogee | Chair(s): Jian Li (Chinese Academy of Meteorological Sciences)
14:10 (20m) Talk | Towards Resource-Efficient Serverless LLM Inference with SLINFER | Main Conference
14:30 (20m) Talk | ELORA: Efficient LoRA and KV Cache Management for Multi-LoRA LLM Serving | Main Conference | Jiuchen Shi (Shanghai Jiao Tong University & The Hong Kong Polytechnic University), Hang Zhang (Shanghai Jiao Tong University), Yixiao Wang (Shanghai Jiao Tong University), Quan Chen (Shanghai Jiao Tong University), Yizhou Shan (Huawei Cloud), Kaihua Fu (Hong Kong University of Science and Technology), Wei Wang (Hong Kong University of Science and Technology), Minyi Guo (Shanghai Jiao Tong University)
14:50 (20m) Talk | PASCAL: A Phase-Aware Scheduling Algorithm for Serving Reasoning-based Large Language Models | Main Conference
15:10 (20m) Talk | The Cost of Dynamic Reasoning: Demystifying AI Agents and Test-Time Scaling from an AI Infrastructure Perspective | Main Conference