V-Rex: Real-Time Streaming Video LLM Acceleration via Dynamic KV Cache Retrieval (HPCA 2026 - Main Conference)

Who

Donghyuk Kim, Sejeong Yang, Wonjin Shin, Joo-Young Kim

Track

HPCA 2026 Main Conference

Time Zone

The program is currently displayed in (GMT+11:00) Hobart.

The GMT offsets shown reflect the offsets at the moment of the conference.

When

Tue 3 Feb 2026 11:30 - 11:50 at Coogee - Visual and Multimodal Acceleration Chair(s): Yu Feng

Abstract

Streaming video large language models (LLMs) are increasingly used for real-time multimodal tasks such as video captioning, question answering, conversational agents, and augmented reality. However, these models face fundamental memory and computational challenges because their key-value (KV) caches keep growing with continuous streaming video input, a process that requires an iterative prefill stage unique to streaming video LLMs. Prior works reduce the cache overhead with KV cache retrieval algorithms, which offload the full KV cache to CPU memory or storage and then selectively fetch the most relevant entries. Nevertheless, because of the iterative prefill stage, they suffer from significant limitations, including extensive computation, substantial data transfer, and accuracy degradation. The problem is worse for edge deployment, where the KV cache footprint exceeds the device's memory capacity within minutes of video streaming, making low-latency, energy-efficient inference infeasible.
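To make the memory pressure concrete, the following minimal sketch estimates how quickly a KV cache fills a fixed memory budget. The formula (keys plus values, per layer, per KV head, per token) is the standard one; every number in the usage note (layer count, head count, tokens per frame, frame rate, memory budget) is a hypothetical illustration and is not taken from the paper.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes needed to cache keys and values for num_tokens tokens."""
    # Factor 2: one key vector and one value vector per token, layer, and head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * num_tokens


def seconds_until_full(capacity_bytes, fps, tokens_per_frame, **model):
    """Seconds of streaming before the KV cache alone fills capacity_bytes.

    Ignores model weights and activations, which share the same memory,
    so the real limit is reached sooner.
    """
    bytes_per_frame = kv_cache_bytes(tokens_per_frame, **model)
    return capacity_bytes / (bytes_per_frame * fps)


if __name__ == "__main__":
    # Hypothetical model and stream parameters, chosen only for illustration.
    model = dict(num_layers=32, num_kv_heads=32, head_dim=128)
    t = seconds_until_full(16 * 2**30, fps=1, tokens_per_frame=256, **model)
    print(f"KV cache fills 16 GiB after {t:.0f} s of video")
```

Under these assumed values, each token costs 512 KiB in FP16, each frame costs 128 MiB, and a 16 GiB budget is exhausted after 128 seconds, which is consistent in spirit with the "within minutes" claim above.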

In this work, we propose V-Rex, the first software-hardware co-designed accelerator that comprehensively addresses both algorithmic and hardware bottlenecks in streaming video LLM inference. At its core, V-Rex introduces ReSV, a training-free dynamic KV cache retrieval algorithm. ReSV exploits temporal and spatial similarity-based token clustering to reduce excessive KV cache memory across video frames, and dynamically adjusts token selection per transformer layer and attention head to minimize the number of selected tokens. To fully realize these algorithmic benefits, V-Rex offers a compact, low-latency hardware accelerator with a dynamic KV cache retrieval engine (DRE), featuring bit-level and early-exit based computing units, as well as hierarchical KV cache memory management. Evaluated on COIN benchmarks, V-Rex achieves unprecedented real-time performance of 3.9-8.3 FPS and energy-efficient streaming video LLM inference in edge deployment. While the DRE accounts for only 2.2% of power and 2.0% of area, the system delivers 1.9-19.7× speedup and 3.1-18.5× energy efficiency improvements over an AGX Orin GPU. At the server level, V-Rex maintains significant advantages, with 2.6-19.7× speedup and 5.9-70.6× energy efficiency gains over an A100 GPU. V-Rex reduces the token retrieval ratio to 32.7% (frame processing) and 2.5% (text generation) with negligible accuracy loss. This work is the first to comprehensively tackle KV cache retrieval across algorithm and hardware, enabling real-time streaming video LLM inference on resource-constrained edge devices, with clear potential for scalable deployment in large-scale server environments.
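The idea of adjusting token selection per attention head can be illustrated with a generic sketch. This is not ReSV: the abstract does not specify its selection criterion, so the code below assumes a simple stand-in rule (keep the fewest cached tokens whose softmax attention weights sum to a target mass). The function names and the `mass` parameter are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a non-empty list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_tokens(query, keys, mass=0.9):
    """Indices of the fewest cached tokens covering `mass` of attention.

    query: list of floats; keys: list of key vectors for ONE attention head.
    Assumed criterion for illustration only, not the paper's method.
    """
    if not keys:
        return []
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    probs = softmax(scores)
    # Visit tokens from most to least attended and stop once enough mass is covered.
    order = sorted(range(len(keys)), key=lambda i: probs[i], reverse=True)
    chosen, covered = [], 0.0
    for i in order:
        chosen.append(i)
        covered += probs[i]
        if covered >= mass:
            break
    return sorted(chosen)


def retrieve_per_head(queries, keys_per_head, mass=0.9):
    """Apply the selection independently to each head, so budgets differ per head."""
    return [select_tokens(q, k, mass) for q, k in zip(queries, keys_per_head)]


if __name__ == "__main__":
    keys = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]
    # Head 0 has a sharply peaked query; head 1 attends uniformly.
    print(retrieve_per_head([[10.0, 0.0], [0.0, 0.0]], [keys, keys]))
```

In this toy input the peaked head needs only one token while the uniform head needs all four, which is the motivation for a per-head budget over a fixed top-k: heads with concentrated attention fetch far fewer entries from offloaded memory.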

Donghyuk Kim

KAIST

Sejeong Yang

KAIST

Wonjin Shin

KAIST

Joo-Young Kim

KAIST

Session Program

Tue 3 Feb
Displayed time zone: Hobart

11:30 - 12:50	Visual and Multimodal Acceleration (Main Conference) at Coogee. Chair(s): Yu Feng (Shanghai Jiao Tong University)

11:30 20m Talk		V-Rex: Real-Time Streaming Video LLM Acceleration via Dynamic KV Cache Retrieval. Main Conference. Donghyuk Kim (KAIST), Sejeong Yang (KAIST), Wonjin Shin (KAIST), Joo-Young Kim (KAIST)
11:50 20m Talk		SFD: Towards Segment Fusion Dataflow for Spatial Accelerators. Main Conference. Fuyu Wang (Sun Yat-sen University), Minghua Shen (Sun Yat-sen University), Yufei Ding (UCSD), Nong Xiao (National University of Defense Technology & Sun Yat-sen University), Yutong Lu (Sun Yat-sen University)
12:10 20m Talk		VAR-Turbo: Unlocking the Potential of Visual Autoregressive Models through Dual Redundancy. Main Conference. Xujiang Xiang (The Hong Kong University of Science and Technology), Fengbin Tu (The Hong Kong University of Science and Technology)
12:30 20m Talk		Cambricon-GS: An Accelerator for 3D Gaussian Splatting Training with Gaussian-Pixel Hybrid Parallelism. Main Conference. Rui Wen (Institute of Computing Technology, Chinese Academy of Sciences), Zhifei Yue (University of Science and Technology of China), Tianbo Liu (University of Science and Technology of China), Xinkai Song (Institute of Computing Technology, Chinese Academy of Sciences), Jin Li (Institute of Computing Technology, Chinese Academy of Sciences), Di Huang (Chinese Academy of Sciences, Institute of Computing Technology), Jiaming Guo (Institute of Computing Technology, Chinese Academy of Sciences), Xing Hu (Institute of Computing Technology, Chinese Academy of Sciences), Zidong Du (Institute of Computing Technology, Chinese Academy of Sciences), Qi Guo (Chinese Academy of Sciences), Tianshi Chen (Cambricon Technologies)