QuCo: Efficient and Flexible Hardware-Driven Automatic Configuration of Tile Transfers in GPUs (HPCA 2026 - Main Conference)

Who

Nicolas Meseguer, Daoxuan Xu, Yifan Sun, Michael Pellauer, José L. Abellán, Manuel E. Acacio

Track

HPCA 2026 Main Conference

When

Wed 4 Feb 2026 10:50 - 11:10 (GMT+11:00) at Cronulla
Session: GPU Kernel Optimization and Resource Sharing
Chair(s): Hyojin Sung

Abstract

The growing complexity and parallelism demands of modern GPU workloads have driven architectural innovations toward asynchronous tile transfers (ATTs) to overlap computation and data movement. While ATT units such as NVIDIA's Tensor Memory Accelerator (TMA) introduce high-throughput memory transfers, programmers must deal with wavefront specialization and select tile sizes, queue slots, and synchronization primitives, all of which are hardware-specific and workload-dependent. Existing GPU libraries fall short, offering limited ATT support and configurability, so developers still resort to manual exploration of this vast parameter space, which is laborious and error-prone, and which fundamentally limits performance portability across GPUs.

In this work, we present QuCo (Queue Configurator), a single lightweight hardware unit embedded in the GPU that fully automates the ATT configuration process. Inspired by the Blackwell GPU design, QuCo includes a compact RISC-V processor, small memory structures for instructions and data, and a GPU Specification Table (GST) that stores key architectural parameters. Using the GST and workload characteristics, along with built-in heuristics, QuCo computes optimal queue configurations at kernel launch. This relieves the programmer of the tedious, time-consuming task of tuning and offline profiling, while also increasing post-compilation performance portability.
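To make the idea of a launch-time configurator concrete, the following is a minimal sketch and not QuCo's actual design. The abstract does not specify the GST fields or the heuristics, so everything here is an assumption made for illustration: the names `GpuSpecTable`, `Workload`, and `configure_queue`, the two GST fields, the candidate tile shapes, and the rule itself (pick the largest tile that still allows at least two queue slots, so a transfer can overlap with computation).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuSpecTable:
    """Hypothetical stand-in for the GST; both fields are assumptions."""
    shared_mem_bytes: int  # on-chip buffer budget for tile queues
    max_queue_slots: int   # upper bound on in-flight tile buffers


@dataclass(frozen=True)
class Workload:
    """Hypothetical workload description: a 2D operand and its element size."""
    rows: int
    cols: int
    elem_bytes: int


def configure_queue(gst, wl, candidate_tiles=((128, 128), (64, 64), (32, 32))):
    """Illustrative heuristic only: return the largest candidate tile that
    fits the operand and leaves room for at least two queue slots."""
    for tile_rows, tile_cols in candidate_tiles:
        if tile_rows > wl.rows or tile_cols > wl.cols:
            continue  # tile is larger than the operand itself
        tile_bytes = tile_rows * tile_cols * wl.elem_bytes
        slots = min(gst.max_queue_slots, gst.shared_mem_bytes // tile_bytes)
        if slots >= 2:  # need two buffers to overlap transfer and compute
            return {"tile": (tile_rows, tile_cols), "slots": slots}
    raise ValueError("no candidate tile fits the buffer budget")


if __name__ == "__main__":
    workload = Workload(rows=4096, cols=4096, elem_bytes=2)
    large = GpuSpecTable(shared_mem_bytes=96 * 1024, max_queue_slots=4)
    small = GpuSpecTable(shared_mem_bytes=48 * 1024, max_queue_slots=4)
    print(configure_queue(large, workload))
    print(configure_queue(small, workload))
```

Under these assumed numbers, the same workload receives a different tile size and slot count on the two hypothetical GPUs without any change to the kernel source, which is the kind of post-compilation portability the abstract describes.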

Nicolas Meseguer

University of Murcia

Daoxuan Xu

William & Mary

Yifan Sun

William & Mary

Michael Pellauer

Nvidia

José L. Abellán

University of Murcia, Spain

Manuel E. Acacio

Universidad de Murcia (UMU)