Beyond Prediction: Tail-Aware Scheduling for LLM Inference
Summary
This paper introduces a distribution-aware, prediction-free scheduling framework for LLM inference that replaces explicit length prediction with soft priority boosting driven by lightweight statistical signals. The method co-optimizes scheduling and cache-aware preemption to control tail latency, reducing P99 TTLT by up to 35-50% relative to SRPT with perfect length knowledge and reducing TTFT by 34-47% across workloads.
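To make the SRPT baseline and the tail metric concrete, here is a minimal toy sketch. It is not the paper's scheduler: it assumes a single server that emits one token per step, with no batching and no KV-cache memory limits, and the names `simulate` and `percentile` are hypothetical helpers introduced only for illustration. It compares first-come-first-served with SRPT under perfect length knowledge, and reports a nearest-rank percentile of completion latency.

```python
import heapq
import math


def simulate(jobs, policy):
    """Toy single-server simulation, one token per step (an assumption).

    jobs: list of (arrival_step, decode_length) with decode_length >= 1.
    policy: "fcfs", or "srpt" (SRPT here knows the true remaining length).
    Returns completion latency (finish - arrival) per job, in input order.
    """
    order = sorted(range(len(jobs)), key=lambda i: jobs[i][0])
    remaining = [length for _, length in jobs]
    latency = [0] * len(jobs)
    ready = []  # min-heap of (priority key, job index)
    t, nxt, done = 0, 0, 0

    def key_of(i):
        # SRPT ranks by remaining work; FCFS ranks by arrival time.
        return remaining[i] if policy == "srpt" else jobs[i][0]

    while done < len(jobs):
        # Admit every job that has arrived by the current step.
        while nxt < len(order) and jobs[order[nxt]][0] <= t:
            heapq.heappush(ready, (key_of(order[nxt]), order[nxt]))
            nxt += 1
        if not ready:
            # Server is idle: jump to the next arrival.
            t = jobs[order[nxt]][0]
            continue
        # Run the highest-priority job for one step (preemption point).
        _, i = heapq.heappop(ready)
        remaining[i] -= 1
        t += 1
        if remaining[i] == 0:
            latency[i] = t - jobs[i][0]
            done += 1
        else:
            heapq.heappush(ready, (key_of(i), i))
    return latency


def percentile(values, p):
    """Nearest-rank percentile (p in 0..100) of a non-empty list."""
    ranked = sorted(values)
    k = max(1, math.ceil(p / 100 * len(ranked)))
    return ranked[k - 1]


# One long request followed by two short ones.
jobs = [(0, 10), (1, 1), (2, 1)]
for policy in ("fcfs", "srpt"):
    lat = simulate(jobs, policy)
    print(policy, "mean:", sum(lat) / len(lat), "p99:", percentile(lat, 99))
```

In this three-request example SRPT lowers the mean latency but raises the worst-case latency, because the long request is preempted by each short arrival. This is only a small illustration of why a mean-centric view and a tail-centric view can disagree; it says nothing about the magnitude of the effects reported in the paper.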
# Beyond Prediction: Tail-Aware Scheduling for LLM Inference

Source: [https://arxiv.org/abs/2606.18431](https://arxiv.org/abs/2606.18431) [View PDF](https://arxiv.org/pdf/2606.18431)

> Abstract: LLM serving exhibits extreme length variability, making size-based scheduling difficult in practice. Recent LLM schedulers approximate SJF/SRPT using predicted decode lengths or ranks and primarily report mean-centric metrics such as TTFT and TBT. We show that these prediction-driven policies can be fragile under distribution shifts, bursty arrivals, and GPU memory pressure, while offering limited control over the tail latency (P90-P99) that dominates user experience, even with perfect decode-length knowledge. We introduce a distribution-aware, prediction-free scheduling framework that replaces explicit length prediction with soft priority boosting driven by lightweight statistical signals. Our design co-optimizes scheduling and cache-aware preemption to account for memory-coupled decode dynamics across workload mixes. Evaluated on production and open-source traces, our method reduces P99 TTLT by up to 35-50% relative to SRPT with perfect length knowledge and reduces TTFT by 34-47% across workloads, including reasoning-heavy and chat-heavy tasks. These results demonstrate a robust alternative for optimizing tail latency in online LLM serving.

## Submission history

From: Yueying Li [[view email](https://arxiv.org/show-email/7e1d9899/2606.18431)]

**[v1]** Tue, 16 Jun 2026 19:25:37 UTC (465 KB)
Similar Articles
Threshold-Based Exclusive Batching for LLM Inference
This paper analyzes the trade-off between mixed batching and exclusive batching for LLM inference, showing that the optimal choice depends on GPU memory bandwidth. It proposes a threshold-based hybrid scheduler that dynamically switches between the two methods, achieving up to 41.9% higher throughput on bandwidth-constrained GPUs.
Prefilling-dLLM: Predictive Prefilling for Long-Context Inference in Diffusion Language Models
This paper proposes Prefilling-dLLM, a training-free framework that partitions the prefix into chunks and caches KV representations, achieving state-of-the-art quality and up to 28x speedup for long-context inference in diffusion language models.
Statistically Reliable LLM-Based Ranking Evaluation via Prediction-Powered Inference
This paper introduces PRECISE, an extension of Prediction-Powered Inference that combines a small set of human labels with a large set of LLM judgments to produce unbiased and variance-reduced estimates of ranking evaluation metrics like Precision@K. The method is validated on the ESCI benchmark and in a production A/B test, where it correctly identified the best system variant using only 100 human labels, confirmed by a +407 bps sales improvement.
Towards Multi-Model LLM Schedulers: Empirical Insights into Offloading and Preemption
This paper presents an empirical study on scheduling multiple LLMs on shared heterogeneous hardware, focusing on performance implications of CPU-GPU offloading and preemption. It finds that offloading causes non-linear decode degradation, especially for smaller models, and preemption overhead is dominated by model state reload, providing design guidance for future multi-model schedulers.
CATS: Cascaded Adaptive Tree Speculation for Memory-Limited LLM Inference Acceleration
This paper introduces CATS, a cascaded adaptive tree speculation framework designed to accelerate LLM inference on memory-constrained edge devices by optimizing memory usage while maintaining high token acceptance rates.