PASCAL: A Phase-Aware Scheduling Algorithm for Serving Reasoning-based Large Language Models
The emergence of reasoning-based LLMs leveraging Chain-of-Thought (CoT) inference introduces new serving challenges, as their extended reasoning phases delay user-visible output and inflate Time-To-First-Token (TTFT). Existing LLM serving frameworks fail to distinguish between reasoning and answering phases, leading to performance degradation under GPU memory constraints. We present PASCAL, a phase-aware scheduling algorithm that prioritizes reasoning to reduce TTFT while using controlled preemption and token pacing during answering to preserve Quality-of-Experience (QoE). Our hierarchical scheduler combines instance-level placement with intra-instance execution and enables dynamic migration at phase boundaries to balance load and reduce interference. Across benchmarks using DeepSeek-R1-Distill-Qwen-32B, PASCAL reduces tail TTFT by up to 72% while maintaining answering phase SLO attainment, demonstrating the importance of phase-aware scheduling for reasoning-based LLM deployment.
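The abstract's core idea, dispatch reasoning-phase requests ahead of answering-phase ones, can be illustrated with a toy priority scheduler. This is a minimal sketch under our own assumptions, not PASCAL's actual implementation; all class and method names here are hypothetical.

```python
import heapq


class PhaseAwareScheduler:
    """Toy phase-aware admission queue: reasoning-phase requests are
    dispatched before answering-phase ones, FCFS within each phase.
    Illustrative only; not the PASCAL codebase."""

    # Lower value = higher priority in the min-heap.
    REASONING, ANSWERING = 0, 1

    def __init__(self):
        self._heap = []   # entries: (phase_priority, arrival_seq, request_id)
        self._seq = 0     # monotonically increasing arrival counter

    def submit(self, request_id, phase):
        # Arrival sequence breaks ties so ordering within a phase is FCFS.
        heapq.heappush(self._heap, (phase, self._seq, request_id))
        self._seq += 1

    def next_batch(self, max_size):
        # Pop up to max_size requests; all queued reasoning-phase
        # requests drain before any answering-phase request runs.
        batch = []
        while self._heap and len(batch) < max_size:
            _, _, rid = heapq.heappop(self._heap)
            batch.append(rid)
        return batch


if __name__ == "__main__":
    s = PhaseAwareScheduler()
    s.submit("ans-1", PhaseAwareScheduler.ANSWERING)
    s.submit("rsn-1", PhaseAwareScheduler.REASONING)
    s.submit("ans-2", PhaseAwareScheduler.ANSWERING)
    print(s.next_batch(3))  # reasoning request jumps the queue
```

The paper's full scheduler adds mechanisms this sketch omits: controlled preemption and token pacing during answering to protect QoE, plus cross-instance migration at phase boundaries.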
Mon 2 Feb (times shown in the Hobart time zone)

14:10 - 15:30 | Session: LLM Inference Serving Systems (Main Conference, at Coogee) | Chair(s): Jian Li, Chinese Academy of Meteorological Sciences

14:10 (20m, Talk) | Towards Resource-Efficient Serverless LLM Inference with SLINFER (Main Conference)
14:30 (20m, Talk) | ELORA: Efficient LoRA and KV Cache Management for Multi-LoRA LLM Serving (Main Conference) | Jiuchen Shi (Shanghai Jiao Tong University & The Hong Kong Polytechnic University), Hang Zhang (Shanghai Jiao Tong University), Yixiao Wang (Shanghai Jiao Tong University), Quan Chen (Shanghai Jiao Tong University, China), Yizhou Shan (Huawei Cloud), Kaihua Fu (Hong Kong University of Science and Technology), Wei Wang (Hong Kong University of Science and Technology), Minyi Guo (Shanghai Jiao Tong University)
14:50 (20m, Talk) | PASCAL: A Phase-Aware Scheduling Algorithm for Serving Reasoning-based Large Language Models (Main Conference)
15:10 (20m, Talk) | The Cost of Dynamic Reasoning: Demystifying AI Agents and Test-Time Scaling from an AI Infrastructure Perspective (Main Conference)