FACE: Fully PD Overlapped Scheduling and Multi-Level Architecture Co-Exploration on Wafer
The deployment of large language models (LLMs) imposes significant demands on computing, memory, and communication resources. Wafer-scale chips, leveraging advanced packaging technologies, integrate computing and memory resources at high density while offering high die-to-die (D2D) communication bandwidth, making them a promising architectural solution to the demands of LLMs. However, their unprecedented chip area introduces significant architectural design complexity. Wafer-scale chips feature a multi-level architecture spanning the wafer, die, and core levels, each involving critical parameter choices and trade-offs. This scale also poses major challenges for LLM service scheduling: fully leveraging the advantages of wafer-scale technology while mitigating its limitations is essential to turning massive hardware resources into actual performance. Unfortunately, methods that address these challenges remain scarce.
To bridge this gap, we propose FACE, a co-exploration framework for multi-level architecture and scheduling. We first define a highly configurable, general hardware template to systematically explore optimal architecture and micro-architecture parameters. Leveraging the fine-grained control and high interconnect bandwidth of wafer-scale chips, FACE implements an LLM scheduling strategy that achieves fully overlapped prefill-decode execution and efficient KV cache management, minimizing prefill-decode interference while maximizing hardware resource utilization. Our evaluation shows that FACE achieves an average overall performance improvement of 3.68$\times$ across various LLMs and datasets compared to state-of-the-art (SOTA) LLM serving systems on wafer-scale chips. Moreover, FACE provides valuable insights into wafer-scale multi-level architecture design and LLM workload execution. \textit{The FACE framework will be open-sourced.}
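The fully overlapped prefill-decode execution described in the abstract can be illustrated with a toy tick-based scheduler. This is a minimal sketch, not FACE's actual algorithm: all class and variable names (`OverlappedPDScheduler`, `Request`, etc.) are illustrative inventions. The key idea shown is that prefill and decode run on disjoint resource pools each tick, so in-flight decode requests are never stalled by incoming prefills.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    rid: int
    decode_len: int       # tokens to generate after prefill completes
    done_decode: int = 0  # tokens generated so far


class OverlappedPDScheduler:
    """Toy model: each tick, the prefill pool admits one waiting request
    (building its KV cache) while the decode pool advances every in-flight
    request by one token -- the two phases never contend for compute."""

    def __init__(self):
        self.waiting = deque()  # requests awaiting prefill
        self.decoding = []      # requests with KV cache ready, generating
        self.finished = []

    def submit(self, req):
        self.waiting.append(req)

    def tick(self):
        # Prefill pool: one prefill per tick, overlapped with decode below.
        if self.waiting:
            self.decoding.append(self.waiting.popleft())
        # Decode pool: one token per in-flight request per tick.
        still_running = []
        for req in self.decoding:
            req.done_decode += 1
            dest = self.finished if req.done_decode >= req.decode_len else still_running
            dest.append(req)
        self.decoding = still_running


sched = OverlappedPDScheduler()
for i in range(3):
    sched.submit(Request(rid=i, decode_len=2))

ticks = 0
while sched.waiting or sched.decoding:
    sched.tick()
    ticks += 1
print(ticks, [r.rid for r in sched.finished])  # → 4 [0, 1, 2]
```

Because decode work proceeds in the same tick that a new request is prefilled, no request's generation is paused by another's prefill; a serialized scheduler would instead interleave whole prefill and decode phases, inflating decode latency.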
Tue 3 Feb (displayed time zone: Hobart)
09:50 - 11:10 | Wafer-Scale Systems for Large Models (Main Conference at Coogee) | Chair(s): Hyesoon Kim (Georgia Institute of Technology)
09:50 (20m) Talk | WATOS: Efficient LLM Training Strategies and Architecture Co-exploration for Wafer-scale Chip | Main Conference | Huizheng Wang (Tsinghua University), Zichuan Wang (Tsinghua University), Hongbin Wang (Tsinghua University), Jingxiang Hou (Tsinghua University), Taiquan Wei (Tsinghua University), Chao Li (Shanghai Jiao Tong University), Yang Hu (Tsinghua University), Shouyi Yin (Tsinghua University)
10:10 (20m) Talk | FACE: Fully PD Overlapped Scheduling and Multi-Level Architecture Co-Exploration on Wafer | Main Conference | Zheng Xu (Tsinghua University), Dehao Kong (Tsinghua University), Jiaxin Liu (Tsinghua University), Dingcheng Jiang (Tsinghua University), Xu Dai (Shanghai Artificial Intelligence Laboratory), Jinyi Deng (Tsinghua University), Yang Hu (Tsinghua University), Shouyi Yin (Tsinghua University)
10:30 (20m) Talk | TEMP: A Memory Efficient Physical-aware Tensor Partition-Mapping Framework on Wafer-scale Chips | Main Conference | Huizheng Wang (Tsinghua University), Taiquan Wei (Tsinghua University), Zichuan Wang (Tsinghua University), Dingcheng Jiang (Tsinghua University), Qize Yang (Tsinghua University), Jiaxin Liu (Tsinghua University), Jingxiang Hou (Tsinghua University), Chao Li (Shanghai Jiao Tong University), Jinyi Deng (Tsinghua University), Yang Hu (Tsinghua University), Shouyi Yin (Tsinghua University)
10:50 (20m) Talk | MoEntwine: Unleashing the Potential of Wafer-scale Chips for Large-scale Expert Parallel Inference | Main Conference | Xinru Tang (Tsinghua University), Jingxiang Hou (Tsinghua University), Dingcheng Jiang (Tsinghua University), Taiquan Wei (Tsinghua University), Jiaxin Liu (Tsinghua University), Jinyi Deng (Tsinghua University), Huizheng Wang (Tsinghua University), Qize Yang (Tsinghua University), Haoran Shang (Tsinghua University), Chao Li (Shanghai Jiao Tong University), Yang Hu (Tsinghua University), Shouyi Yin (Tsinghua University)