AutoHAAP: Automated Heterogeneity-Aware Asymmetric Partitioning for LLM Training
This program is tentative and subject to change.
Heterogeneous clusters with diverse devices mitigate the computational and memory burdens of large language model (LLM) training, yet their inherent resource heterogeneity, characterized by divergent computation, memory, and bandwidth capabilities, makes manual optimization of parallelization strategies both challenging and time-intensive. Automatic parallelization is therefore critical for scaling complex workloads across heterogeneous architectures. However, current methodologies suffer from significant inefficiencies. First, insufficient pruning of the parameter initialization space results in impractically large search spaces. Second, prevailing automatic parallel search strategies perform poorly in load balancing and adaptation to resource constraints. Third, dynamic parallel strategy tuning incurs substantial overhead because latencies are redundantly recomputed for operators whose configurations have not changed. These three issues constitute the major bottlenecks addressed in this work.
To address these challenges, we propose AutoHAAP (Automated Heterogeneity-Aware Asymmetric Partitioning), a novel framework incorporating three core innovations: (1) memory-aware initialization that drastically reduces the viable search space; (2) a heterogeneity-aware load-balancing estimator that guides resource-efficient configuration search; and (3) a state-caching mechanism that eliminates redundant latency calculations.
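To make the first and third ideas concrete, the following is a minimal, hypothetical Python sketch: a memory-aware filter discards parallel configurations whose estimated per-device footprint exceeds the smallest device's capacity, and a memoized latency estimator avoids re-evaluating unchanged (operator, device) pairs during the search. All names, formulas, and constants here are illustrative assumptions for exposition, not AutoHAAP's actual data structures or cost model.

from dataclasses import dataclass
from functools import lru_cache
from itertools import product

@dataclass(frozen=True)
class Device:
    name: str
    memory_gb: float   # per-device memory capacity
    tflops: float      # peak compute throughput

@dataclass(frozen=True)
class ParallelConfig:
    tp: int  # tensor-parallel degree
    pp: int  # pipeline-parallel degree
    dp: int  # data-parallel degree

def estimate_memory_gb(model_params_b: float, cfg: ParallelConfig) -> float:
    """Rough per-device memory estimate (GB) for a model with
    model_params_b billion parameters, assuming fp16 weights plus
    Adam optimizer states sharded across all parallel groups."""
    bytes_per_param = 16  # assumed: 2 B weights + 14 B optimizer/gradient state
    total_gb = model_params_b * bytes_per_param
    return total_gb / (cfg.tp * cfg.pp * cfg.dp)

def memory_aware_candidates(model_params_b: float, devices: list[Device],
                            world_size: int) -> list[ParallelConfig]:
    """Memory-aware initialization: keep only configurations that use the
    whole cluster and fit within the smallest device's memory."""
    min_mem = min(d.memory_gb for d in devices)
    candidates = []
    for tp, pp, dp in product([1, 2, 4, 8], repeat=3):
        if tp * pp * dp != world_size:
            continue
        cfg = ParallelConfig(tp, pp, dp)
        if estimate_memory_gb(model_params_b, cfg) <= min_mem:
            candidates.append(cfg)
    return candidates

@lru_cache(maxsize=None)
def stage_latency_ms(layer_gflops: float, tflops: float) -> float:
    """Cached per-operator latency estimate (ms): repeated lookups for an
    unchanged (workload, device) pair never recompute the cost."""
    return layer_gflops / tflops

if __name__ == "__main__":
    cluster = [Device("A100", 80, 312), Device("V100", 32, 125)]
    for cfg in memory_aware_candidates(model_params_b=7, devices=cluster, world_size=8):
        print(cfg, f"{estimate_memory_gb(7, cfg):.1f} GB/device")
    # Second call with identical arguments is served from the cache.
    print(stage_latency_ms(1000.0, cluster[0].tflops))
    print(stage_latency_ms(1000.0, cluster[0].tflops))

In a full search, a load-balancing estimator would additionally weight each candidate by how evenly it spreads work across the fast and slow devices; that part is omitted here for brevity.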
Evaluations across GPT3 and Llama3 models of varying scales on both homogeneous and heterogeneous clusters demonstrate that AutoHAAP achieves 0.68–98× search efficiency gains, 6.57%–106.9% throughput improvements in homogeneous environments, and 10.1%–22.28% throughput improvements in heterogeneous setups. These results validate AutoHAAP's effectiveness for distributed LLM training on diverse hardware.
Tue 3 Feb (displayed time zone: Hobart)
Session: 15:50–17:10

15:50 (20m, Talk) Compression-Aware Gradient Splitting for Collective Communications in Distributed Training. Main Conference. Pranati Majhi (Texas A&M University), Sabuj Laskar (Texas A&M University), Abdullah Muzahid (Texas A&M University), Eun Jung Kim

16:10 (20m, Talk) SCALE: Tackling Communication Bottlenecks in Confidential Multi-GPU ML. Main Conference. Joongun Park (Georgia Tech), Yongqin Wang (University of Southern California), Huan Xu (Georgia Institute of Technology), Hanjiang Wu (Georgia Institute of Technology), Mengyuan Li (USC), Tushar Krishna (Georgia Institute of Technology)

16:30 (20m, Talk) AutoHAAP: Automated Heterogeneity-Aware Asymmetric Partitioning for LLM Training. Main Conference. Yuanyuan Wang, Nana Tang, Yuyang Wang, Shu Pan, Dingding Yu, Zeyue Wang, Mou Sun, Kejie Fu, Fangyu Wang, Yunchuan Chen, Ning Sun, Fei Yang (all Zhejiang Lab)

16:50 (20m, Talk) Towards Compute-Aware In-Switch Computing for LLMs Tensor-Parallelism on Multi-GPU Systems. Main Conference. Chen Zhang (Shanghai Jiao Tong University), Qijun Zhang (Shanghai Jiao Tong University), Zhuoshan Zhou (Shanghai Jiao Tong University), Yijia Diao (Shanghai Jiao Tong University), Haibo Wang (Huawei), Zhe Zhou (Huawei), Zhipeng Tu (Huawei), Zhiyao Li (Huawei), Guangyu Sun (Peking University), Zhuoran Song (Shanghai Jiao Tong University), Zhigang Ji (Shanghai Jiao Tong University), Jingwen Leng (Shanghai Jiao Tong University), Minyi Guo (Shanghai Jiao Tong University)