ReThermal: Co-Design of Thermal-Aware Static and Dynamic Scheduling for LLM Training on Liquid-Cooled Wafer-Scale Chips
As Large Language Models demand ever more computational power, wafer-scale chips have emerged as a solution, providing the integration and computing capability these workloads need. However, their ultra-large area and extreme power density introduce critical thermal management challenges in liquid-cooled environments. In addressing this issue, we identify two key opportunities and three major challenges. The challenges are the behavior–thermal black box, the wafer-scale simulation bottleneck, and the runtime heat–schedule drift.
To tackle these challenges, we propose ReThermal, a holistic scheduling framework that integrates three innovations. First, we introduce behavior-driven thermal modeling to capture workload-induced compute, communication, and heat coupling patterns at the system level. Second, we develop a DNN-accelerated wafer-scale thermal simulator that enables fast and accurate temperature prediction, significantly reducing simulation time. Third, we implement an adaptive thermal-aware scheduling strategy that coordinates compile-time and runtime decisions to dynamically optimize task placement. Evaluations show that ReThermal reduces peak temperature by up to 8.0°C and improves throughput by up to 39.23%, providing a scalable and effective thermal control solution for future liquid-cooled wafer-scale systems.
Tue 3 Feb (displayed time zone: Hobart)
14:10 - 15:30
14:10 (20m, Talk) | LILo: Harnessing the On-chip Accelerators in Intel CPUs for Compressed LLM Inference Acceleration | Main Conference | Hyungyo Kim (UIUC), Qirong Xia (UIUC), Jinghan Huang (UIUC), Nachuan Wang (UIUC), Jung Ho Ahn (Seoul National University), Younjoo Lee (Seoul National University), Wajdi K Feghali (Intel), Ren Wang (Intel Labs), Nam Sung Kim (UIUC)
14:30 (20m, Talk) | ReThermal: Co-Design of Thermal-Aware Static and Dynamic Scheduling for LLM Training on Liquid-Cooled Wafer-Scale Chips | Main Conference | Chengran Li (Tsinghua University), Huizheng Wang (Tsinghua University), Jiaxin Liu (Tsinghua University), Jingyao Liu (Tsinghua University), Zhiheng Yue (Tsinghua University), Xia Li (Shanghai AI Lab), Shenfei Jiang (Shanghai AI Lab), Jinyi Deng (Tsinghua University), Yang Hu (Tsinghua University), Shouyi Yin (Tsinghua University)
14:50 (20m, Talk) | TraceRTL: Agile Performance Evaluation for Microarchitecture Exploration | Main Conference | Zifei Zhang (SKLP, Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences), Yinan Xu (SKLP, Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences), Sa Wang (SKLP, Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences), Dan Tang (SKLP, Institute of Computing Technology, Chinese Academy of Sciences; Beijing Institute of Open Source Chip), Yungang Bao (State Key Lab of Processors, Institute of Computing Technology, CAS; University of Chinese Academy of Sciences)
15:10 (20m, Talk) | Nugget: Portable Program Snippets | Main Conference | Zhantong Qiu (University of California, Davis), Mahyar Samani (University of California, Davis), Jason Lowe-Power (University of California, Davis & Google)