Beyond Prediction: Tail-Aware Scheduling for LLM Inference

arXiv cs.LG 06/18/26, 04:00 AM Papers

scheduling llm-inference tail-latency distribution-aware preemption serving

Summary

This paper introduces a distribution-aware, prediction-free scheduling framework for LLM inference that replaces explicit length prediction with soft priority boosting using statistical signals. The method co-optimizes scheduling and cache-aware preemption to reduce tail latency, achieving up to 35-50% reduction in P99 TTLT compared to SRPT with perfect length knowledge.

arXiv:2606.18431v1 Abstract: LLM serving exhibits extreme length variability, making size-based scheduling difficult in practice. Recent LLM schedulers approximate SJF/SRPT using predicted decode lengths or ranks and primarily report mean-centric metrics such as TTFT and TBT. We show that these prediction-driven policies can be fragile under distribution shifts, bursty arrivals, and GPU memory pressure, while offering limited control over the tail latency (P90-P99) that dominates user experience, even with perfect decode-length knowledge. We introduce a distribution-aware, prediction-free scheduling framework that replaces explicit length prediction with soft priority boosting driven by lightweight statistical signals. Our design co-optimizes scheduling and cache-aware preemption to account for memory-coupled decode dynamics across workload mixes. Evaluated on production and open-source traces, our method reduces P99 TTLT by up to 35-50% relative to SRPT with perfect length knowledge and reduces TTFT by 34-47% across workloads, including reasoning-heavy and chat-heavy tasks. These results demonstrate a robust alternative for optimizing tail latency in online LLM serving.
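
The abstract contrasts mean-centric metrics with tail percentiles. As a minimal sketch of that distinction, the snippet below computes mean and nearest-rank percentile latencies from per-request timestamps. It assumes TTLT means time to last token, which the abstract does not spell out. The `RequestTrace` record, the helper names, and the synthetic data are all hypothetical and are not taken from the paper.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestTrace:
    arrival: float      # when the request entered the system (seconds)
    first_token: float  # when the first output token was emitted
    last_token: float   # when the final output token was emitted


def percentile(values, q):
    """Nearest-rank percentile: smallest sample with at least q% of samples at or below it."""
    if not values:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def latency_report(traces):
    # TTFT: arrival to first token. TTLT (assumed): arrival to last token.
    ttft = [t.first_token - t.arrival for t in traces]
    ttlt = [t.last_token - t.arrival for t in traces]
    return {
        "mean_ttft": sum(ttft) / len(ttft),
        "mean_ttlt": sum(ttlt) / len(ttlt),
        "p90_ttlt": percentile(ttlt, 90),
        "p99_ttlt": percentile(ttlt, 99),
    }


# Synthetic data: 98 fast requests and 2 slow ones.
traces = [RequestTrace(0.0, 0.1, 1.0) for _ in range(98)]
traces += [RequestTrace(0.0, 0.1, 30.0) for _ in range(2)]
report = latency_report(traces)
print(report)
```

On this synthetic data the mean TTLT is 1.58 s and P90 is 1 s, while P99 is 30 s: two slow requests out of a hundred barely move the mean but fully determine the tail.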

Cached at: 06/18/26, 05:42 AM

# Beyond Prediction: Tail-Aware Scheduling for LLM Inference
Source: [https://arxiv.org/abs/2606.18431](https://arxiv.org/abs/2606.18431)
[View PDF](https://arxiv.org/pdf/2606.18431)

> Abstract: LLM serving exhibits extreme length variability, making size-based scheduling difficult in practice. Recent LLM schedulers approximate SJF/SRPT using predicted decode lengths or ranks and primarily report mean-centric metrics such as TTFT and TBT. We show that these prediction-driven policies can be fragile under distribution shifts, bursty arrivals, and GPU memory pressure, while offering limited control over the tail latency (P90-P99) that dominates user experience, even with perfect decode-length knowledge. We introduce a distribution-aware, prediction-free scheduling framework that replaces explicit length prediction with soft priority boosting driven by lightweight statistical signals. Our design co-optimizes scheduling and cache-aware preemption to account for memory-coupled decode dynamics across workload mixes. Evaluated on production and open-source traces, our method reduces P99 TTLT by up to 35-50% relative to SRPT with perfect length knowledge and reduces TTFT by 34-47% across workloads, including reasoning-heavy and chat-heavy tasks. These results demonstrate a robust alternative for optimizing tail latency in online LLM serving.
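
The claim that size-based policies can leave the tail poorly controlled even with perfect length knowledge can be illustrated with a toy single-server simulation. This is not the paper's scheduler or its evaluation setup. It is a sketch that assumes one decode step per time unit, no batching, and no memory model. The `Job` record, the key functions, and the workload are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Job:
    arrival: int     # step at which the request arrives
    length: int      # total decode steps the request needs (>= 1)
    served: int = 0  # decode steps received so far
    finish: int = -1 # step at which the last token was produced


def fcfs_key(job):
    # First-come-first-served: earliest arrival runs first.
    return job.arrival


def srpt_key(job):
    # Oracle SRPT: shortest true remaining length runs first.
    return job.length - job.served


def simulate(jobs, key):
    """Serve one decode step per time unit to the ready job with the smallest key.

    Returns each job's completion latency (finish - arrival) in input order.
    """
    jobs = [Job(j.arrival, j.length) for j in jobs]  # work on fresh copies
    now, done = 0, 0
    while done < len(jobs):
        ready = [j for j in jobs if j.arrival <= now and j.finish < 0]
        if ready:
            job = min(ready, key=lambda j: (key(j), j.arrival))
            job.served += 1
            if job.served == job.length:
                job.finish = now + 1
                done += 1
        now += 1
    return [j.finish - j.arrival for j in jobs]


# One long request plus a steady stream of short ones that saturates the server.
workload = [Job(0, 20)] + [Job(2 * k, 2) for k in range(20)]
for name, key in (("FCFS", fcfs_key), ("SRPT", srpt_key)):
    lat = simulate(workload, key)
    print(name, "mean:", round(sum(lat) / len(lat), 2), "max:", max(lat))
```

On this workload SRPT lowers the mean completion latency from about 21.9 to about 4.8 steps, but the long request waits behind every short one, so the worst case rises from 22 to 60 steps. The toy only shows that a mean-optimal ordering can trade away the tail; it says nothing about how large the effect is in a real serving stack.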

## Submission history

From: Yueying Li \[[view email](https://arxiv.org/show-email/7e1d9899/2606.18431)\] **\[v1\]** Tue, 16 Jun 2026 19:25:37 UTC \(465 KB\)

Similar Articles

Threshold-Based Exclusive Batching for LLM Inference

Prefilling-dLLM: Predictive Prefilling for Long-Context Inference in Diffusion Language Models

Statistically Reliable LLM-Based Ranking Evaluation via Prediction-Powered Inference

Towards Multi-Model LLM Schedulers: Empirical Insights into Offloading and Preemption

CATS: Cascaded Adaptive Tree Speculation for Memory-Limited LLM Inference Acceleration
