HPCA 2026
Sat 31 January - Wed 4 February 2026 Sydney, Australia
co-located with HPCA/CGO/PPoPP/CC 2026
Mon 2 Feb 2026 15:10 - 15:30 at Coogee - LLM Inference Serving Systems Chair(s): Jian Li

Large-language-model (LLM)-based AI agents have recently showcased impressive versatility by employing dynamic reasoning, an adaptive, multi-step process that coordinates with external tools. This shift from static, single-turn inference to agentic, multi-turn workflows broadens task generalization and behavioral flexibility, but it also introduces serious concerns about system-level cost, efficiency, and sustainability. This paper presents the first comprehensive system-level analysis of AI agents, quantifying their resource usage, latency behavior, energy consumption, and datacenter-wide power demands across diverse agent designs and test-time scaling strategies. We further characterize how AI agent design choices, such as few-shot prompting, reflection depth, and parallel reasoning, impact accuracy-cost tradeoffs. Our findings reveal that while agents improve accuracy with increased compute, they suffer from rapidly diminishing returns, widening latency variance, and unsustainable infrastructure costs. Through detailed evaluation of representative agents, we highlight the profound computational demands introduced by AI agent workflows, uncovering a looming sustainability crisis. These results call for a paradigm shift in agent design toward compute-efficient reasoning, balancing performance with deployability under real-world constraints.

Mon 2 Feb

Displayed time zone: Hobart

14:10 - 15:30
LLM Inference Serving Systems
Main Conference at Coogee
Chair(s): Jian Li Chinese Academy of Meteorological Sciences
14:10
20m
Talk
Towards Resource-Efficient Serverless LLM Inference with SLINFER
Main Conference
Chuhao Xu Shanghai Jiao Tong University, Zijun Li Shanghai Jiao Tong University, Quan Chen Shanghai Jiao Tong University, China, Han Zhao Shanghai Jiao Tong University, Xueyan Tang Nanyang Technological University, Minyi Guo Shanghai Jiao Tong University
14:30
20m
Talk
ELORA: Efficient LoRA and KV Cache Management for Multi-LoRA LLM Serving
Main Conference
Jiuchen Shi Shanghai Jiao Tong University & The Hong Kong Polytechnic University, Hang Zhang Shanghai Jiao Tong University, Yixiao Wang Shanghai Jiao Tong University, Quan Chen Shanghai Jiao Tong University, China, Yizhou Shan Huawei Cloud, Kaihua Fu Hong Kong University of Science and Technology, Wei Wang Hong Kong University of Science and Technology, Minyi Guo Shanghai Jiao Tong University
14:50
20m
Talk
PASCAL: A Phase-Aware Scheduling Algorithm for Serving Reasoning-based Large Language Models
Main Conference
15:10
20m
Talk
The Cost of Dynamic Reasoning: Demystifying AI Agents and Test-Time Scaling from an AI Infrastructure Perspective
Main Conference
Jiin Kim KAIST, Byeongjun Shin KAIST, Jinha Chung KAIST, Minsoo Rhu KAIST