Beyond Prediction: Tail-Aware Scheduling for LLM Inference

arXiv cs.LG Papers

Summary

This paper introduces a distribution-aware, prediction-free scheduling framework for LLM inference that replaces explicit length prediction with soft priority boosting driven by lightweight statistical signals. The method co-optimizes scheduling and cache-aware preemption to reduce tail latency. On production and open-source traces, it reduces P99 TTLT by up to 35-50% relative to SRPT with perfect length knowledge and reduces TTFT by 34-47%.

arXiv:2606.18431v1

Abstract: LLM serving exhibits extreme length variability, making size-based scheduling difficult in practice. Recent LLM schedulers approximate SJF/SRPT using predicted decode lengths or ranks and primarily report mean-centric metrics such as TTFT and TBT. We show that these prediction-driven policies can be fragile under distribution shifts, bursty arrivals, and GPU memory pressure, while offering limited control over the tail latency (P90-P99) that dominates user experience, even with perfect decode-length knowledge. We introduce a distribution-aware, prediction-free scheduling framework that replaces explicit length prediction with soft priority boosting driven by lightweight statistical signals. Our design co-optimizes scheduling and cache-aware preemption to account for memory-coupled decode dynamics across workload mixes. Evaluated on production and open-source traces, our method reduces P99 TTLT by up to 35-50% relative to SRPT with perfect length knowledge and reduces TTFT by 34-47% across workloads, including reasoning-heavy and chat-heavy tasks. These results demonstrate a robust alternative for optimizing tail latency in online LLM serving.
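The abstract's point that size-based policies are mean-centric and give limited control over the tail, even with perfect length knowledge, can be illustrated with a toy queue. The sketch below is not the paper's scheduler or its evaluation setup: it assumes a single non-preemptive server, exactly known job sizes, and a hypothetical workload of one long job plus a steady stream of short ones. The names `simulate` and `percentile` are introduced here for illustration only.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def simulate(jobs, policy):
    """Run a non-preemptive single-server queue (toy model, not the paper's system).

    jobs   -- list of (arrival_time, size) tuples in abstract time units
    policy -- "fcfs" (earliest arrival first) or "sjf" (smallest known size first)
    Returns each job's latency (finish time minus arrival time), in input order.
    """
    if policy not in ("fcfs", "sjf"):
        raise ValueError(f"unknown policy: {policy}")
    by_arrival = sorted(range(len(jobs)), key=lambda i: jobs[i][0])
    latencies = [0.0] * len(jobs)
    waiting = []
    clock = 0.0
    nxt = 0
    while nxt < len(by_arrival) or waiting:
        # Admit every job that has arrived by the current time.
        while nxt < len(by_arrival) and jobs[by_arrival[nxt]][0] <= clock:
            waiting.append(by_arrival[nxt])
            nxt += 1
        if not waiting:
            # Server is idle: jump to the next arrival.
            clock = jobs[by_arrival[nxt]][0]
            continue
        if policy == "sjf":
            chosen = min(waiting, key=lambda i: (jobs[i][1], jobs[i][0]))
        else:
            chosen = min(waiting, key=lambda i: jobs[i][0])
        waiting.remove(chosen)
        clock += jobs[chosen][1]  # run the chosen job to completion
        latencies[chosen] = clock - jobs[chosen][0]
    return latencies


if __name__ == "__main__":
    # Hypothetical workload: one long job at t=0, then one short job per time unit.
    workload = [(0, 10)] + [(t, 1) for t in range(20)]
    for name in ("fcfs", "sjf"):
        lat = simulate(workload, name)
        print(name, "mean:", round(sum(lat) / len(lat), 2), "P99:", percentile(lat, 99))
```

On this workload, shortest-job-first brings the mean latency down from about 10.95 to about 2.38 time units, but its P99 rises from 11 to 30 because the long job waits behind every short arrival. That is the kind of gap between mean and tail behavior the abstract describes; how the paper's soft priority boosting closes it is not specified in the abstract and is not modeled here.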

Cached at: 06/18/26, 05:42 AM

Source: [https://arxiv.org/abs/2606.18431](https://arxiv.org/abs/2606.18431)
[View PDF](https://arxiv.org/pdf/2606.18431)

## Submission history

From: Yueying Li

**[v1]** Tue, 16 Jun 2026 19:25:37 UTC (465 KB)

Similar Articles

Threshold-Based Exclusive Batching for LLM Inference

arXiv cs.AI

This paper analyzes the trade-off between mixed batching and exclusive batching for LLM inference, showing that the optimal choice depends on GPU memory bandwidth. It proposes a threshold-based hybrid scheduler that dynamically switches between the two methods, achieving up to 41.9% higher throughput on bandwidth-constrained GPUs.

Statistically Reliable LLM-Based Ranking Evaluation via Prediction-Powered Inference

arXiv cs.LG

This paper introduces PRECISE, an extension of Prediction-Powered Inference that combines a small set of human labels with a large set of LLM judgments to produce unbiased and variance-reduced estimates of ranking evaluation metrics like Precision@K. The method is validated on the ESCI benchmark and in a production A/B test, where it correctly identified the best system variant using only 100 human labels, confirmed by a +407 bps sales improvement.

Towards Multi-Model LLM Schedulers: Empirical Insights into Offloading and Preemption

arXiv cs.AI

This paper presents an empirical study on scheduling multiple LLMs on shared heterogeneous hardware, focusing on performance implications of CPU-GPU offloading and preemption. It finds that offloading causes non-linear decode degradation, especially for smaller models, and preemption overhead is dominated by model state reload, providing design guidance for future multi-model schedulers.