Tag
This blog post explains the inspection paradox in system latency and recovery time measurement, showing why customers experience longer average waits than service metrics suggest. It includes an interactive simulation and emphasizes the importance of understanding the tail of the distribution.
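The effect described above can be reproduced in a few lines. The following is a minimal sketch, not the post's interactive simulation: it assumes recovery times are exponentially distributed with mean 1 (an illustrative choice) and that customers arrive at uniformly random moments, so a customer lands in a given outage with probability proportional to its length. The function name `inspection_paradox` is hypothetical.

```python
import random


def inspection_paradox(durations):
    """Return (mean per outage, mean outage length seen by a random arrival)."""
    total = sum(durations)
    # The service-side metric: every outage counts once.
    per_event_mean = total / len(durations)
    # The customer-side metric: an arrival at a uniformly random moment lands
    # in an outage with probability proportional to its duration, so each
    # outage is weighted by its own length (a length-biased average).
    experienced_mean = sum(d * d for d in durations) / total
    return per_event_mean, experienced_mean


# Assumption: exponential recovery times with mean 1; any skewed
# distribution shows the same gap, and heavier tails widen it.
rng = random.Random(0)
durations = [rng.expovariate(1.0) for _ in range(100_000)]
per_event, experienced = inspection_paradox(durations)
print(f"mean recovery time per outage:      {per_event:.2f}")
print(f"mean recovery time as experienced:  {experienced:.2f}")
```

The experienced mean equals E[X^2]/E[X], which exceeds the plain mean E[X] whenever recovery times vary at all. Because the numerator is a second moment, the gap is dominated by the tail of the distribution.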
This paper introduces a distribution-aware, prediction-free scheduling framework for LLM inference that replaces explicit length prediction with soft priority boosting driven by statistical signals. The method co-optimizes scheduling and cache-aware preemption to reduce tail latency, reducing P99 time-to-last-token (TTLT) by up to 35-50% relative to shortest-remaining-processing-time (SRPT) scheduling with perfect length knowledge.