COMET: Communication and Memory Co-Design for Fine-Grained AI Inference in MCM Accelerators (HPCA 2026 - Main Conference)

Who

Taishu Sheng, Guangyu Sun, Dezun Dong

Track

HPCA 2026 Main Conference

Time Zone

The program is currently displayed in (GMT+11:00) Hobart.

When

Tue 3 Feb 2026 16:50 - 17:10 at Collaroy - Accelerator Design and Modeling. Chair(s): Leeor Peled

Abstract

Chiplet-based architectures have emerged as a promising approach to overcome the physical and manufacturing constraints faced by monolithic chips, enabling scalable integration of compute resources to meet the growing demands of AI workloads. However, efficient inter-chiplet communication remains a significant bottleneck, especially under the fine-grained and bursty DMA request patterns generated by processing elements in modern AI tasks. Existing communication models and simulators fail to capture these characteristics, which limits the accuracy of performance analysis and the effectiveness of optimization strategies. These limitations hinder efforts to address DMA-communication inefficiencies in chiplet-based AI systems and pose challenges for designing HPC architectures.

To address these challenges, we present the first comprehensive chiplet communication model that explicitly incorporates fine-grained DMA traffic observed in realistic AI workloads. Building on this model, we propose COMET, a novel framework that intelligently searches for optimal DMA request aggregation and memory address mapping strategies tailored to chiplet environments. COMET dynamically consolidates small DMA transfers to improve bandwidth utilization and reduce communication latency, while also adapting on-chip memory mapping to align with workload-specific dataflows. This mitigates synchronization overhead across diverse AI tasks. Compared with inference on conventional chiplet communication schemes, COMET achieves $1.7\times$–$2.5\times$ speedup and $1.5\times$–$4.4\times$ higher bandwidth utilization across different DNN and LLM workloads.
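
As a rough illustration of what consolidating small DMA transfers can look like, the sketch below merges address-contiguous requests into larger bursts. It is not COMET's search procedure, which the abstract does not detail; the `DmaRequest` type, the `aggregate` function, and the `max_burst` limit are hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DmaRequest:
    """Hypothetical DMA request: a start byte address and a length in bytes."""
    addr: int
    size: int


def aggregate(requests, max_burst):
    """Merge address-contiguous requests into bursts of at most max_burst bytes.

    Illustrative assumption: requests may be reordered by address, and two
    requests are merged only when the second starts exactly where the first ends.
    """
    merged = []
    for req in sorted(requests, key=lambda r: r.addr):
        if merged:
            last = merged[-1]
            contiguous = last.addr + last.size == req.addr
            if contiguous and last.size + req.size <= max_burst:
                # Extend the previous burst instead of issuing a new transfer.
                merged[-1] = DmaRequest(last.addr, last.size + req.size)
                continue
        merged.append(req)
    return merged


# Four 64-byte requests, three of them contiguous.
small = [DmaRequest(0, 64), DmaRequest(64, 64), DmaRequest(128, 64), DmaRequest(512, 64)]
bursts = aggregate(small, max_burst=256)
```

With a 256-byte burst limit, the three contiguous requests collapse into one 192-byte transfer and the isolated request is left unchanged, so four transfers become two. A real design would also have to weigh the latency added by waiting for mergeable requests, which this sketch ignores.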

Taishu Sheng

College of Computer Science and Technology, National University of Defense Technology

Guangyu Sun

Peking University

Dezun Dong

NUDT