FACE: Fully PD Overlapped Scheduling and Multi-Level Architecture Co-Exploration on Wafer
The deployment of large language models (LLMs) imposes significant demands on computing, memory, and communication resources. Wafer-scale chips, leveraging advanced packaging technologies, integrate computing and memory resources at high density while offering high die-to-die (D2D) communication bandwidth, making them a promising architectural solution to the demands of LLMs. However, their unprecedented chip area introduces significant architectural design complexity. Wafer-scale chips feature a multi-level architecture spanning the wafer, die, and core levels, each involving critical parameter choices and trade-offs. This scale also poses major challenges for LLM service scheduling: fully leveraging the advantages of wafer-scale technology while mitigating its limitations is essential to turning massive hardware resources into actual performance. Unfortunately, methods that address these challenges remain scarce.
To bridge this gap, we propose FACE, a co-exploration framework for multi-level architecture and scheduling. We first define a highly configurable, general hardware template to systematically explore optimal architecture and micro-architecture parameters. Leveraging the fine-grained control and high interconnect bandwidth of wafer-scale chips, FACE implements an LLM scheduling strategy that achieves fully overlapped prefill-decode execution and efficient KV cache management, minimizing prefill-decode interference while maximizing hardware resource utilization. Our evaluation shows that FACE achieves an average overall performance improvement of 3.68$\times$ across various LLMs and datasets compared to state-of-the-art (SOTA) LLM serving systems on wafer-scale chips. Moreover, FACE provides valuable insights into wafer-scale multi-level architecture design and LLM workload execution. \textit{The FACE framework will be open-sourced.}
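The fully overlapped prefill-decode execution described in the abstract can be illustrated with a toy tick-based scheduler. This is a minimal sketch, not FACE's actual algorithm: all class and variable names (`OverlappedPDScheduler`, `Request`, etc.) are illustrative inventions. The key idea shown is that prefill and decode run on disjoint resource pools each tick, so in-flight decode requests are never stalled by incoming prefills.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    rid: int
    decode_len: int       # tokens to generate after prefill completes
    done_decode: int = 0  # tokens generated so far


class OverlappedPDScheduler:
    """Toy model: each tick, the prefill pool admits one waiting request
    (building its KV cache) while the decode pool advances every in-flight
    request by one token -- the two phases never contend for compute."""

    def __init__(self):
        self.waiting = deque()  # requests awaiting prefill
        self.decoding = []      # requests with KV cache ready, generating
        self.finished = []

    def submit(self, req):
        self.waiting.append(req)

    def tick(self):
        # Prefill pool: one prefill per tick, overlapped with decode below.
        if self.waiting:
            self.decoding.append(self.waiting.popleft())
        # Decode pool: one token per in-flight request per tick.
        still_running = []
        for req in self.decoding:
            req.done_decode += 1
            dest = self.finished if req.done_decode >= req.decode_len else still_running
            dest.append(req)
        self.decoding = still_running


sched = OverlappedPDScheduler()
for i in range(3):
    sched.submit(Request(rid=i, decode_len=2))

ticks = 0
while sched.waiting or sched.decoding:
    sched.tick()
    ticks += 1
print(ticks, [r.rid for r in sched.finished])  # → 4 [0, 1, 2]
```

Because decode work proceeds in the same tick that a new request is prefilled, no request's generation is paused by another's prefill; a serialized scheduler would instead interleave whole prefill and decode phases, inflating decode latency.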
Tue 3 Feb (displayed time zone: Hobart)
09:50 - 11:10 | Wafer-Scale Systems for Large Models (Main Conference at Coogee) | Chair(s): Hyesoon Kim (Georgia Institute of Technology)
09:50 (20m) Talk | WATOS: Efficient LLM Training Strategies and Architecture Co-exploration for Wafer-scale Chip | Main Conference | Huizheng Wang (Tsinghua University), Zichuan Wang (Tsinghua University), Hongbin Wang (Tsinghua University), Jingxiang Hou (Tsinghua University), Taiquan Wei (Tsinghua University), Chao Li (Shanghai Jiao Tong University), Yang Hu (Tsinghua University), Shouyi Yin (Tsinghua University)
10:10 (20m) Talk | FACE: Fully PD Overlapped Scheduling and Multi-Level Architecture Co-Exploration on Wafer | Main Conference | Zheng Xu (Tsinghua University), Dehao Kong (Tsinghua University), Jiaxin Liu (Tsinghua University), Dingcheng Jiang (Tsinghua University), Xu Dai (Shanghai Artificial Intelligence Laboratory), Jinyi Deng (Tsinghua University), Yang Hu (Tsinghua University), Shouyi Yin (Tsinghua University)
10:30 (20m) Talk | TEMP: A Memory Efficient Physical-aware Tensor Partition-Mapping Framework on Wafer-scale Chips | Main Conference | Huizheng Wang (Tsinghua University), Taiquan Wei (Tsinghua University), Zichuan Wang (Tsinghua University), Dingcheng Jiang (Tsinghua University), Qize Yang (Tsinghua University), Jiaxin Liu (Tsinghua University), Jingxiang Hou (Tsinghua University), Chao Li (Shanghai Jiao Tong University), Jinyi Deng (Tsinghua University), Yang Hu (Tsinghua University), Shouyi Yin (Tsinghua University)
10:50 (20m) Talk | MoEntwine: Unleashing the Potential of Wafer-scale Chips for Large-scale Expert Parallel Inference | Main Conference | Xinru Tang (Tsinghua University), Jingxiang Hou (Tsinghua University), Dingcheng Jiang (Tsinghua University), Taiquan Wei (Tsinghua University), Jiaxin Liu (Tsinghua University), Jinyi Deng (Tsinghua University), Huizheng Wang (Tsinghua University), Qize Yang (Tsinghua University), Haoran Shang (Tsinghua University), Chao Li (Shanghai Jiao Tong University), Yang Hu (Tsinghua University), Shouyi Yin (Tsinghua University)