HDPAT: Hierarchical Distributed Page Address Translation for Wafer-Scale GPUs (HPCA 2026 - Main Conference)

Who

daoxuan xu, Ying Li, Yuwei Sun, Jie Ren, Yifan Sun

Track

HPCA 2026 Main Conference

Time Zone

The program is currently displayed in (GMT+11:00) Hobart.

The GMT offsets shown reflect the offsets at the moment of the conference.

When

Tue 3 Feb 2026 14:50 - 15:10 at Collaroy - Memory Systems for Scalable Computing Chair(s): Alexandros Daglis

Abstract

Multi-GPU systems deliver high memory capacity and computing power but suffer from slow inter-GPU communication. Wafer-scale GPUs provide a promising solution to the scalability challenge by connecting numerous GPUs with a high-bandwidth and low-latency interposer-based network. While prior work has prototyped wafer-scale GPUs to demonstrate technical feasibility, limited research focuses on the architectural designs required to harness the power of such massive devices. With the large number of chiplets and the large-scale interconnect, new bottlenecks emerge that differ from those previously identified in traditional multi-GPU systems. Specifically, the virtual-to-physical address translation process has become a critical constraint in wafer-scale GPU systems due to 1) massive concurrent translation requests and 2) long multi-hop latency in the network. To address these issues, we propose HDPAT, a solution that alleviates translation pressure and enables GPU chiplets to locate the required physical addresses efficiently. HDPAT leverages the GMMUs in other chiplets to improve the concurrency of translation. Moreover, as translation requests are eventually resolved in the center chiplet (CPU), we convert the close-to-center chiplets to double as translation caches for surrounding chiplets. Additionally, to reduce compulsory misses, HDPAT introduces page table entry prefetching. Experimental results on 13 representative applications indicate that HDPAT improves overall performance by an average of 57.99% with negligible area overhead: 4.22% of the GPU L2 cache for the cuckoo filter and 0.26% of the GPU L2 cache for the Next-Hop Computation Unit.
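
As a rough illustration of why multi-hop latency matters, and of how a translation cache sitting between a requester and the center could shorten a lookup, the toy model below counts hops on a small mesh. Everything in it is an assumption made for illustration: the 7 x 7 grid, the X-then-Y routing, and the function names are hypothetical and are not taken from the paper. The abstract does not describe HDPAT's actual routing or its Next-Hop Computation Unit, so this sketch does not attempt to model them.

```python
# Toy hop-count model for translation requests on an n x n chiplet mesh.
# ASSUMPTIONS (not from the paper): square mesh, translation resolved at the
# center chiplet, X-then-Y routing, one hop per neighboring chiplet.

def hops_to_center(x, y, n):
    """Manhattan hop count from chiplet (x, y) to the center of an n x n mesh."""
    c = n // 2
    return abs(x - c) + abs(y - c)


def lookup_path(x, y, n):
    """Chiplets visited, in order, when routing X-then-Y toward the center."""
    c = n // 2
    path = []
    while x != c:
        x += 1 if x < c else -1
        path.append((x, y))
    while y != c:
        y += 1 if y < c else -1
        path.append((x, y))
    return path


def hops_until_hit(x, y, n, cached):
    """Hops until the request reaches a chiplet in `cached`, else the center.

    `cached` is a set of (x, y) coordinates assumed to hold the translation.
    """
    for hop, node in enumerate(lookup_path(x, y, n), start=1):
        if node in cached:
            return hop
    return hops_to_center(x, y, n)


if __name__ == "__main__":
    n = 7  # hypothetical mesh size
    print(hops_until_hit(0, 0, n, set()))      # no intermediate cache
    print(hops_until_hit(0, 0, n, {(3, 1)}))   # a chiplet on the path caches it
```

In this model a corner chiplet pays the full Manhattan distance to the center when no intermediate chiplet holds the translation, and fewer hops when one on its route does, which is the intuition behind letting close-to-center chiplets act as translation caches.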

daoxuan xu

William & Mary

Ying Li

William & Mary

United States

Yuwei Sun

UIUC

Jie Ren

William & Mary

United States

Yifan Sun

William & Mary

Session Program

Tue 3 Feb
Displayed time zone: Hobart

14:10 - 15:30	Memory Systems for Scalable Computing (Main Conference) at Collaroy. Chair(s): Alexandros Daglis (Georgia Tech)

14:10 20m Talk	BARD: Reducing Write Latency of DDR5 Memory by Exploiting Bank-Parallelism (Main Conference). Suhas Vittal (Georgia Tech), Moinuddin K. Qureshi (Georgia Tech)
14:30 20m Talk	RoMe: Row Granularity Access Memory System for Large Language Models (Main Conference). Hwayong Nam (Seoul National University), Seungmin Baek (Seoul National University), Jumin Kim (Seoul National University), Michael Jaemin Kim (Meta), Jung Ho Ahn (Seoul National University). Pre-print available.
14:50 20m Talk	HDPAT: Hierarchical Distributed Page Address Translation for Wafer-Scale GPUs (Main Conference). daoxuan xu (William & Mary), Ying Li (William & Mary), Yuwei Sun (UIUC), Jie Ren (William & Mary), Yifan Sun (William & Mary)
15:10 20m Talk	Pulse: Fine-Grained Hierarchical Hashing Index for Disaggregated Memory (Main Conference). Guangyang Deng (Xiamen University), Zixiang Yu (Xiamen University), Zhirong Shen (Xiamen University), Qiangsheng Su (Xiamen University), Jiwu Shu (Xiamen University)