HPCA 2026
Sat 31 January - Wed 4 February 2026 Sydney, Australia
co-located with HPCA/CGO/PPoPP/CC 2026
Tue 3 Feb 2026 14:10 - 14:30 at Collaroy - Memory Systems for Scalable Computing Chair(s): Alexandros Daglis

This paper studies the impact of DRAM writes on DDR5-based systems. To perform DRAM writes efficiently, modern systems buffer write requests and try to complete multiple write operations whenever the DRAM is switched from read mode to write mode. While the DRAM system is performing writes, it is unavailable to service read requests, which increases read latency and reduces performance. We observe that, given the presence of on-die ECC in DDR5 devices, the time to perform a write operation varies significantly: from 1x (for writes to banks in different bankgroups) to 6x (for writes to banks within the same bankgroup) to 24x (for conflicting requests to the same bank). If we can orchestrate the write stream to favor write requests that incur lower latency, then we can reduce the stall time from DRAM writes and improve performance. However, in current systems, the write stream is dictated by the cache replacement policy, which makes eviction decisions without being aware of the variable latency of DRAM writes. The key insight of our work is to improve performance by modifying the cache replacement policy to increase the bank-parallelism of DRAM writes.
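
For concreteness, the 1x/6x/24x cost structure above can be expressed as a small lookup. The sketch below is our own illustration, not code from the paper; the address fields and decode logic are hypothetical.

```c
#include <stdio.h>

/* Illustrative sketch (not from the paper): relative cost of issuing a
 * DDR5 write immediately after another, using the 1x / 6x / 24x ratios
 * quoted in the abstract. Field names are hypothetical. */
typedef struct {
    unsigned bankgroup;
    unsigned bank;
} WriteAddr;

/* Relative time to issue `next` right after `prev`. */
int relative_write_cost(WriteAddr prev, WriteAddr next) {
    if (prev.bankgroup != next.bankgroup)
        return 1;   /* different bankgroups: writes overlap fully */
    if (prev.bank != next.bank)
        return 6;   /* same bankgroup, different banks */
    return 24;      /* conflicting requests to the same bank */
}

int main(void) {
    WriteAddr a = {0, 0}, b = {1, 0}, c = {0, 1}, d = {0, 0};
    printf("different bankgroup: %dx\n", relative_write_cost(a, b));
    printf("same bankgroup:      %dx\n", relative_write_cost(a, c));
    printf("same bank:           %dx\n", relative_write_cost(a, d));
    return 0;
}
```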

Our paper proposes BARD (Bank-Aware Replacement Decisions), which modifies the cache replacement policy to favor dirty lines that belong to banks without pending writes. We analyze two variants of BARD: BARD-E (Eviction-based), which changes the eviction policy to evict low-cost dirty lines, and BARD-C (Cleansing-based), which proactively cleans low-cost dirty lines without modifying the eviction decisions. Both BARD-E and BARD-C increase the bank-parallelism of DRAM writes by 35%. As the two variants incur different overheads (extra misses for BARD-E, extra writebacks for BARD-C), no single policy works well across all workloads. We therefore develop a hybrid policy, BARD-H, which uses a selective combination of both eviction and writeback. Our evaluations across workloads from SPEC2017, LIGRA, STREAM, and Google server traces show that BARD-H improves performance by 2.9% on average and by up to 8.1%. BARD requires only 8 bytes of SRAM.
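
To make the replacement rule concrete, here is a minimal C sketch of what a BARD-E-style eviction decision could look like, applying the "favor dirty lines that belong to banks without pending writes" rule from the abstract. All structures, field names, and the scoring scheme are our own illustrative assumptions, not the paper's implementation.

```c
#include <stdbool.h>
#include <stdio.h>

#define NUM_WAYS  4
#define NUM_BANKS 32   /* assumed bank count, for illustration only */

/* Hypothetical cache-line metadata; not from the paper. */
typedef struct {
    bool valid;
    bool dirty;
    unsigned bank;      /* DRAM bank this line would write back to */
    unsigned lru_rank;  /* 0 = least recently used */
} CacheLine;

/* BARD-E-style victim selection: prefer (1) clean lines (no writeback),
 * then (2) dirty lines whose bank has no pending write (low-cost
 * writeback), then (3) dirty lines whose bank already has a pending
 * write (high-cost), breaking ties by recency. */
int pick_victim(const CacheLine set[NUM_WAYS],
                const unsigned pending_writes[NUM_BANKS]) {
    int best = -1, best_score = 0;
    for (int w = 0; w < NUM_WAYS; w++) {
        if (!set[w].valid) return w;   /* free way: no eviction needed */
        int cost = !set[w].dirty ? 0
                 : pending_writes[set[w].bank] ? 2 : 1;
        int score = cost * NUM_WAYS + (int)set[w].lru_rank;
        if (best < 0 || score < best_score) {
            best = w;
            best_score = score;
        }
    }
    return best;
}

int main(void) {
    unsigned pending[NUM_BANKS] = {0};
    pending[5] = 1;  /* bank 5 already has a queued write */
    CacheLine set[NUM_WAYS] = {
        {true, true, 5, 0},  /* dirty, busy bank, LRU */
        {true, true, 7, 1},  /* dirty, idle bank      */
        {true, true, 9, 2},  /* dirty, idle bank      */
        {true, true, 5, 3},  /* dirty, busy bank, MRU */
    };
    /* Picks way 1: the oldest dirty line targeting an idle bank. */
    printf("victim way: %d\n", pick_victim(set, pending));
    return 0;
}
```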

Tue 3 Feb

Displayed time zone: Hobart

14:10 - 15:30
Memory Systems for Scalable Computing (Main Conference) at Collaroy
Chair(s): Alexandros Daglis Georgia Tech
14:10
20m
Talk
BARD: Reducing Write Latency of DDR5 Memory by Exploiting Bank-Parallelism
Main Conference
Suhas Vittal Georgia Tech, Moinuddin K. Qureshi Georgia Tech
14:30
20m
Talk
RoMe: Row Granularity Access Memory System for Large Language Models
Main Conference
Hwayong Nam Seoul National University, Seungmin Baek Seoul National University, Jumin Kim Seoul National University, Michael Jaemin Kim Meta, Jung Ho Ahn Seoul National University
Pre-print
14:50
20m
Talk
HDPAT: Hierarchical Distributed Page Address Translation for Wafer-Scale GPUs
Main Conference
Daoxuan Xu William & Mary, Ying Li William & Mary, Yuwei Sun UIUC, Jie Ren William & Mary, Yifan Sun William & Mary
15:10
20m
Talk
Pulse: Fine-Grained Hierarchical Hashing Index for Disaggregated Memory
Main Conference
Guangyang Deng Xiamen University, Zixiang Yu Xiamen University, Zhirong Shen Xiamen University, Qiangsheng Su Xiamen University, Jiwu Shu Xiamen University