Exploration of LLM Workload Reliability based on di/dt effects and Voltage Droops
Large language model (LLM) inference workloads have emerged as a critical reliability challenge for cloud GPU systems. Unlike traditional workloads, the highly structured execution of LLMs creates large power oscillations. These oscillations become a vulnerability when their frequency aligns with the resonant modes of a GPU’s power delivery network (PDN), leading to excessive voltage droops and unreliable operation. In this work, we present the first comprehensive profiling of LLM-induced power oscillations, revealing that many workloads generate oscillatory patterns in the MHz range, critically aligning with typical GPU PDN resonant frequencies.
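To make the resonance mechanism concrete, the following is a minimal sketch (not from the paper) of a lumped RLC model of a GPU PDN: a sinusoidal load current of fixed amplitude produces a much larger voltage ripple when its frequency lands on the PDN's impedance peak. All component values and the 50 A oscillation amplitude are illustrative assumptions, not measured GPU parameters.

```python
# Minimal sketch (not from the paper): a lumped RLC model of a GPU PDN driven
# by an oscillating load current, illustrating why power oscillations near the
# PDN resonant frequency produce larger voltage droops.
# Component values below are illustrative assumptions, not measured GPU values.
import numpy as np

R = 0.5e-3        # effective series resistance to the regulator (ohms), assumed
L = 80e-12        # effective loop inductance (henries), assumed
C = 120e-6        # package/board decoupling capacitance (farads), assumed

f_res = 1.0 / (2 * np.pi * np.sqrt(L * C))   # PDN resonant (anti-resonance) frequency
print(f"assumed PDN resonance: {f_res / 1e6:.2f} MHz")

def droop_amplitude(f, i_ac=50.0):
    """Steady-state voltage ripple amplitude for a sinusoidal load current of
    amplitude i_ac (A) at frequency f (Hz), using the impedance seen by the die:
    the inductive supply path in parallel with the decoupling capacitance."""
    w = 2 * np.pi * f
    z_series = R + 1j * w * L          # path back to the regulator
    z_cap = 1.0 / (1j * w * C)         # local decoupling
    z_pdn = (z_series * z_cap) / (z_series + z_cap)
    return i_ac * abs(z_pdn)

for f in [0.1e6, f_res, 10e6]:
    print(f"{f / 1e6:6.2f} MHz oscillation -> ~{droop_amplitude(f) * 1e3:.0f} mV ripple")
```

With these assumed values the resonance falls near 1.6 MHz, and the ripple there is several times larger than at 0.1 MHz or 10 MHz, which is the qualitative effect the profiling in this work attributes to LLM kernels.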
To systematically investigate this phenomenon, we developed a novel stressmark framework that generates workloads with controllable, high-frequency power oscillations and voltage droops. Our evaluation shows that operating at a resonant frequency induces voltage droops up to 2× larger than those of conventional workloads, exceeding critical noise margins. Critically, we find that real LLM workloads operating even near these frequencies generate significant voltage droops greater than 100 mV. Based on these findings, we propose a kernel staggering technique that mitigates this threat by shifting power oscillation frequencies away from the resonant frequency, successfully reducing voltage droops and alleviating reliability concerns. This work provides the first systematic understanding of LLM-PDN resonance and offers a practical solution to improve GPU reliability in AI cloud environments.
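As a rough illustration of the staggering idea (a toy sketch under assumed waveform parameters, not the authors' implementation), offsetting half of the GPU's work by half an oscillation period removes most of the aggregate power trace's energy at the oscillation frequency:

```python
# Minimal sketch (not the authors' implementation): illustrating kernel
# staggering by offsetting half of the GPU's work by half an oscillation
# period, which suppresses aggregate power energy at the resonant frequency.
# All waveform parameters are illustrative assumptions.
import numpy as np

fs = 200e6          # sampling rate of the synthetic power trace (Hz), assumed
f_osc = 1.6e6       # kernel-induced oscillation frequency (Hz), assumed near PDN resonance
t = np.arange(0, 200e-6, 1 / fs)

# Square-wave power profile: alternating compute-heavy and memory/idle phases.
phase_a = (t * f_osc) % 1.0
group_a = np.where(phase_a < 0.5, 300.0, 100.0)   # watts per SM group, assumed

# Staggered copy: the second group runs the same pattern, shifted by half a
# period so its peaks fill the first group's valleys.
phase_b = (t * f_osc + 0.5) % 1.0
group_b = np.where(phase_b < 0.5, 300.0, 100.0)

baseline  = group_a + group_a                     # both groups in lockstep
staggered = group_a + group_b                     # groups offset by T/2

def tone_at(signal, f):
    """Magnitude of the spectral component of `signal` closest to frequency f."""
    spec = np.abs(np.fft.rfft(signal - signal.mean()))
    freqs = np.fft.rfftfreq(len(signal), 1 / fs)
    return spec[np.argmin(np.abs(freqs - f))]

print("energy at f_osc, lockstep :", round(tone_at(baseline, f_osc)))
print("energy at f_osc, staggered:", round(tone_at(staggered, f_osc)))
```

In this toy model the staggered groups cancel each other's oscillation, collapsing the spectral component at f_osc; the paper's kernel staggering similarly reshapes the power spectrum so its dominant components move away from the PDN's resonant frequency.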
Wed 4 Feb (displayed time zone: Hobart)
11:30 - 12:50 | GPU Memory Management and Multi-Chiplet Systems (Main Conference) at Cronulla | Chair(s): EJ Kim (Texas A&M University)
11:30 (20m) Talk | Exploration of LLM Workload Reliability based on di/dt effects and Voltage Droops (Main Conference) | Zhixing Jiang (University of Texas at Austin), Justin Garrigus (University of Texas at Austin), Allison Seigler (University of Texas at Austin), Ethan Syed (University of Texas at Austin), Yan-Lun Huang (University of Texas at Austin), Mehdi Sadi (Advanced Micro Devices), Tawfik Rahal-Arabi (Advanced Micro Devices), Lizy John (University of Texas at Austin)
11:50 (20m) Talk | ARIADNE: Adaptive UVM Management for Efficient GPU Memory Oversubscription (Main Conference) | Hyunkyun Shin (Yonsei University), Seongtae Bang (DGIST), Hyungwon Park (DGIST), Daehoon Kim (Yonsei University)
12:10 (20m) Talk | LRM-GPU: Alleviating Synchronization Overhead for Multi-Chiplet GPU Architecture (Main Conference) | Baiqing Zhong (Sun Yat-Sen University), Zhirong Ye (Sun Yat-Sen University), Xiaojie Li (Sun Yat-Sen University), Peilin Wang (Sun Yat-Sen University), Haiqiu Huang (Sun Yat-Sen University), Zhaolin Li (Tsinghua University), Zhiyi Yu (Sun Yat-sen University), Mingyu Wang (Sun Yat-Sen University)
12:30 (20m) Talk | LEGO: Supporting LLM-enhanced Games with One Gaming GPU (Main Conference) | Han Zhao (Shanghai Jiao Tong University), Weihao Cui (Shanghai Jiao Tong University), Zeshen Zhang (Tongji University), Wenhao Zhang (Shanghai Jiao Tong University), Jiangtong Li (Tongji University), Quan Chen (Shanghai Jiao Tong University, China), Youmin Chen (Shanghai Jiao Tong University), Pu Pang (Shanghai Jiao Tong University), Zijun Li (Shanghai Jiao Tong University), Zhenhua Han (The University of Hong Kong), Yuqing Yang (Microsoft Research), Minyi Guo (Shanghai Jiao Tong University)