MoEntwine: Unleashing the Potential of Wafer-scale Chips for Large-scale Expert Parallel Inference
This program is tentative and subject to change.
As large language models (LLMs) continue to scale up, mixture-of-experts (MoE) has become a common technique in SOTA models. MoE models rely on expert parallelism (EP) to alleviate the memory bottleneck, which introduces all-to-all communication to dispatch and combine tokens across devices. However, in widely adopted GPU clusters, high-overhead cross-node communication makes all-to-all expensive, hindering the adoption of EP. Recently, wafer-scale chips (WSCs) have emerged as a platform integrating numerous devices on a wafer-sized interposer. WSCs provide a unified high-performance network connecting all devices, making them a promising platform for hosting MoE models. Yet their network is restricted to a mesh topology, causing imbalanced communication pressure and performance loss. Moreover, the lack of on-wafer disks places high-overhead expert migration on the critical path.
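The dispatch/combine pattern mentioned above is the core of expert parallelism. The following minimal sketch (not from the paper) illustrates the two all-to-all exchanges with torch.distributed, assuming an already-initialized NCCL process group in which each rank hosts one expert; `moe_layer` and `expert_fn` are illustrative names, and gate scaling of expert outputs is omitted.

```python
# Minimal sketch of EP dispatch/combine, assuming dist.init_process_group() has
# been called with the NCCL backend and all tensors live on each rank's GPU.
import torch
import torch.distributed as dist


def moe_layer(tokens: torch.Tensor, router_logits: torch.Tensor, expert_fn):
    """tokens: [num_tokens, hidden]; router_logits: [num_tokens, world_size]."""
    world_size = dist.get_world_size()

    # 1. Top-1 routing: each token goes to the rank hosting its chosen expert.
    expert_idx = router_logits.argmax(dim=-1)                     # [num_tokens]
    order = expert_idx.argsort()                                  # group tokens by destination rank
    send_counts = torch.bincount(expert_idx, minlength=world_size)

    # 2. Exchange counts so every rank knows how many tokens it will receive.
    recv_counts = torch.empty_like(send_counts)
    dist.all_to_all_single(recv_counts, send_counts)

    # 3. Dispatch all-to-all: tokens travel to the ranks that own their experts.
    sent = tokens[order].contiguous()
    recv = torch.empty(int(recv_counts.sum()), tokens.shape[1],
                       dtype=tokens.dtype, device=tokens.device)
    dist.all_to_all_single(recv, sent,
                           output_split_sizes=recv_counts.tolist(),
                           input_split_sizes=send_counts.tolist())

    # 4. Local expert computation on the received tokens.
    out = expert_fn(recv)

    # 5. Combine all-to-all: results return to the ranks that sent the tokens.
    back = torch.empty_like(sent)
    dist.all_to_all_single(back, out,
                           output_split_sizes=send_counts.tolist(),
                           input_split_sizes=recv_counts.tolist())

    # 6. Undo the sort so outputs line up with the original token order.
    result = torch.empty_like(tokens)
    result[order] = back
    return result
```

On a GPU cluster, step 3 and step 5 cross node boundaries for most tokens, which is exactly the cost the abstract identifies as the obstacle to large EP.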
To fully unleash this potential, we first propose Entwined Ring Mapping (ER-Mapping), which co-designs the mapping of attention and MoE layers to balance communication pressure and achieve better performance. We find that under ER-Mapping, the distributions of cold and hot links in the attention and MoE layers are complementary. Therefore, to hide the migration overhead, we propose the Non-invasive Balancer (NI-Balancer), which splits a complete expert migration into multiple steps and alternately utilizes the cold links of both layers. Evaluation shows that ER-Mapping reduces communication by up to 62%. NI-Balancer further delivers 54% and 22% improvements in MoE computation and communication, respectively. Compared with the SOTA NVL72 supernode, the WSC platform delivers an average of 39% higher per-device MoE performance owing to its scalability to larger EP.
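The NI-Balancer idea stated above (spreading one expert migration over many steps that alternately use each layer's cold links) can be pictured with a toy scheduling sketch. The `MigrationPlan` structure, chunk size, and phase policy below are assumptions made for illustration only, not the paper's implementation.

```python
# Illustrative sketch of splitting an expert migration into chunks that are moved
# over whichever links are cold in the currently executing phase (attention or MoE).
from dataclasses import dataclass


@dataclass
class MigrationPlan:
    expert_bytes: int   # total size of the expert weights to migrate
    chunk_bytes: int    # amount moved during each phase of a decoding step


def schedule_migration(plan: MigrationPlan, num_steps: int):
    """Yield (step, phase, bytes_moved): one chunk per phase, alternating between
    links left idle by the attention layer and links left idle by the MoE layer."""
    remaining = plan.expert_bytes
    for step in range(num_steps):
        for phase in ("attention", "moe"):
            if remaining <= 0:
                return
            moved = min(plan.chunk_bytes, remaining)
            remaining -= moved
            yield step, phase, moved


if __name__ == "__main__":
    plan = MigrationPlan(expert_bytes=512 * 2**20, chunk_bytes=64 * 2**20)
    for step, phase, moved in schedule_migration(plan, num_steps=16):
        print(f"step {step:2d}: move {moved // 2**20:3d} MiB over links cold in the {phase} phase")
```

Because the migration traffic only ever rides links that the active layer leaves cold, it stays off the critical path, which is the sense in which the balancer is "non-invasive".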
Tue 3 Feb (displayed time zone: Hobart)

09:50 - 11:10

09:50 | 20m Talk | WATOS: Efficient LLM Training Strategies and Architecture Co-exploration for Wafer-scale Chip (Main Conference)
Huizheng Wang (Tsinghua University), Zichuan Wang (Tsinghua University), Hongbin Wang (Tsinghua University), Jingxiang Hou (Tsinghua University), Taiquan Wei (Tsinghua University), Chao Li (Shanghai Jiao Tong University), Yang Hu (Tsinghua University), Shouyi Yin (Tsinghua University)

10:10 | 20m Talk | FACE: Fully PD Overlapped Scheduling and Multi-Level Architecture Co-Exploration on Wafer (Main Conference)
Zheng Xu (Tsinghua University), Dehao Kong (Tsinghua University), Jiaxin Liu (Tsinghua University), Dingcheng Jiang (Tsinghua University), Xu Dai (Shanghai Artificial Intelligence Laboratory), Jinyi Deng (Tsinghua University), Yang Hu (Tsinghua University), Shouyi Yin (Tsinghua University)

10:30 | 20m Talk | TEMP: A Memory Efficient Physical-aware Tensor Partition-Mapping Framework on Wafer-scale Chips (Main Conference)
Huizheng Wang (Tsinghua University), Taiquan Wei (Tsinghua University), Zichuan Wang (Tsinghua University), Dingcheng Jiang (Tsinghua University), Qize Yang (Tsinghua University), Jiaxin Liu (Tsinghua University), Jingxiang Hou (Tsinghua University), Chao Li (Shanghai Jiao Tong University), Jinyi Deng (Tsinghua University), Yang Hu (Tsinghua University), Shouyi Yin (Tsinghua University)

10:50 | 20m Talk | MoEntwine: Unleashing the Potential of Wafer-scale Chips for Large-scale Expert Parallel Inference (Main Conference)
Xinru Tang (Tsinghua University), Jingxiang Hou (Tsinghua University), Dingcheng Jiang (Tsinghua University), Taiquan Wei (Tsinghua University), Jiaxin Liu (Tsinghua University), Jinyi Deng (Tsinghua University), Huizheng Wang (Tsinghua University), Qize Yang (Tsinghua University), Haoran Shang (Tsinghua University), Chao Li (Shanghai Jiao Tong University), Yang Hu (Tsinghua University), Shouyi Yin (Tsinghua University)