Near-Zero-Overhead Freshness for Recommendation Systems via Inference-Side Model Updates (HPCA 2026 - Main Conference)

Who

Wenjun Yu, Sitian Chen, Amelie Chi Zhou, Cheng Chen

Track

HPCA 2026 Main Conference

Time Zone

The program is currently displayed in (GMT+11:00) Hobart.

When

Wed 4 Feb 2026 11:30 - 11:50 at Coogee - Efficient Serving and Resource Management Chair(s): Mohammad A. Islam

Abstract

Deep Learning Recommendation Models (DLRMs) underpin personalized services but face a critical freshness-accuracy tradeoff due to massive parameter synchronization overheads. Production DLRMs deploy decoupled training and inference clusters, where synchronizing petabyte-scale embedding tables (EMTs) causes multi-minute staleness, degrading recommendation quality and revenue. We observe that (1) inference nodes exhibit sustained CPU underutilization (peak ≤20%), and (2) EMT gradients possess intrinsic low-rank structure, enabling compact update representation. We present a system that eliminates inter-cluster synchronization by co-locating Low-Rank Adaptation (LoRA) trainers within inference nodes. The system addresses two core challenges: (1) dynamic rank adaptation via singular value monitoring to constrain memory overhead (<2% of EMTs), and (2) NUMA-aware resource scheduling with hardware-enforced QoS to eliminate update-inference contention (P99 latency impact <10 ms). Evaluations show that the system reduces update costs by 2× versus delta-update baselines while achieving higher accuracy within 1-hour windows. By transforming idle inference resources into freshness engines, it delivers online model updates while outperforming state-of-the-art methods by 1.41–2.44% in accuracy.
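
To make the low-rank idea in the abstract more concrete, the following is a minimal sketch of how a rank could be picked from observed singular values and how the resulting adapter size compares with a full embedding table. It is not the authors' method: the abstract does not describe the rank-selection policy, so the energy threshold, the `max_rank` cap, and the function names `choose_rank` and `lora_overhead` are all assumptions made for illustration. The singular values are assumed to be computed elsewhere.

```python
from typing import Sequence


def choose_rank(singular_values: Sequence[float],
                energy: float = 0.95,
                max_rank: int = 64) -> int:
    """Return the smallest rank whose leading singular values retain
    `energy` of the total squared spectrum, capped at `max_rank`.

    Hypothetical policy for illustration only; the threshold and the cap
    are assumptions, not values taken from the paper.
    """
    if not singular_values:
        raise ValueError("need at least one singular value")
    # Work on squared values, largest first.
    squared = sorted((s * s for s in singular_values), reverse=True)
    total = sum(squared)
    if total == 0.0:
        return 1  # no signal at all: keep the smallest possible adapter
    kept = 0.0
    for rank, value in enumerate(squared, start=1):
        kept += value
        if kept >= energy * total or rank == max_rank:
            return rank
    return len(squared)


def lora_overhead(num_rows: int, dim: int, rank: int) -> float:
    """Adapter parameters (A: num_rows x rank, B: rank x dim) expressed as
    a fraction of the full num_rows x dim embedding table."""
    return rank * (num_rows + dim) / (num_rows * dim)
```

Under these assumptions, a table with 1,000,000 rows and 128-dimensional embeddings adapted at rank 2 needs about 1.56% of the table's parameters, since the ratio is roughly `rank / dim` when the row count dominates. This is only an arithmetic illustration of how a small rank keeps the adapter small; the table shape is hypothetical.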

Wenjun Yu

Hong Kong Baptist University

Sitian Chen

Hong Kong Baptist University

Amelie Chi Zhou

Hong Kong Baptist University

Cheng Chen

ByteDance, China

Session Program

Wed 4 Feb
Displayed time zone: Hobart

11:30 - 12:50	Efficient Serving and Resource Management (Main Conference) at Coogee
Chair(s): Mohammad A. Islam (University of Texas at Arlington)

11:30 (20m, Talk)	Near-Zero-Overhead Freshness for Recommendation Systems via Inference-Side Model Updates (Main Conference)
Wenjun Yu (Hong Kong Baptist University), Sitian Chen (Hong Kong Baptist University), Amelie Chi Zhou (Hong Kong Baptist University), Cheng Chen (ByteDance, China)

11:50 (20m, Talk)	AccelFlow: Orchestrating an On-Package Ensemble of Fine-Grained Accelerators for Microservices (Main Conference)
Jovan Stojkovic (University of Illinois at Urbana-Champaign), Abraham Farrell (University of Illinois Urbana-Champaign), Zhangxiaowen Gong (Intel), Christopher J. Hughes (Intel), Josep Torrellas (University of Illinois at Urbana-Champaign)

12:10 (20m, Talk)	SpotCC: Facilitating Coded Computation for Prediction Serving Systems on Spot Instances (Main Conference)
Lin Wang, Yuchong Hu (Huazhong University of Science and Technology), Ziling Duan (Huazhong University of Science and Technology), Mingqi Li (Huazhong University of Science and Technology), Chenxuan Yao (Huazhong University of Science and Technology), feifanliu (Huazhong University of Science and Technology), Xiaolu Li (Huazhong University of Science and Technology), Leihua Qin (Huazhong University of Science and Technology), Dan Feng (Huazhong University of Science and Technology, China)

12:30 (20m, Talk)	LowCarb: Carbon-Aware Scheduling of Serverless Functions (Main Conference)
Rohan Basu Roy (University of Utah), Devesh Tiwari (Northeastern University)