@raydistributed: Ray Serve LLM now offers 4.4x higher request throughput on prefill-heavy workloads, and 24.8x higher request throughput…
Summary
Ray Serve LLM achieves 4.4x higher request throughput on prefill-heavy workloads and 24.8x higher on decode-heavy workloads through direct streaming, a new Ray V2 executor backend in vLLM, and HAProxy ingress. All three are available in Ray 2.56, built in partnership with Google Cloud and vLLM.
Cached at: 06/18/26, 10:11 PM
Ray Serve LLM now offers 4.4x higher request throughput on prefill-heavy workloads, and 24.8x higher request throughput on decode-heavy workloads!
Three major optimizations:
- Direct streaming, bypassing an intermediate Ray Serve deployment on the response path with a new, control plane-only endpoint picker
- A new Ray V2 executor backend in vLLM, enabling optimizations such as async scheduling
- HAProxy ingress, for ingress request routing at the speed of C
All available in Ray 2.56. This is awesome work with @googlecloud and @vllm_project!
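To make the direct-streaming idea concrete, here is a toy sketch in plain Python. It is not Ray Serve's implementation or API: `Replica`, `EndpointPicker`, `ProxyingRouter`, and `direct_stream` are hypothetical names, and the least-in-flight selection policy is an assumption chosen only for illustration. The point it shows is the difference in the response path: a proxying router handles every streamed token, while a control plane-only picker just returns a replica choice and never sees response data.

```python
from dataclasses import dataclass
from typing import Dict, Iterator


@dataclass
class Replica:
    """Hypothetical stand-in for a model replica that streams tokens."""
    name: str
    in_flight: int = 0

    def stream(self, prompt: str) -> Iterator[str]:
        # Echo the prompt word by word to mimic token streaming.
        for word in prompt.split():
            yield word


def least_loaded(replicas: Dict[str, Replica]) -> Replica:
    # Assumed policy for illustration: pick the replica with the fewest in-flight requests.
    return min(replicas.values(), key=lambda r: r.in_flight)


@dataclass
class ProxyingRouter:
    """Baseline: the router sits on the response path and relays every token."""
    replicas: Dict[str, Replica]
    tokens_seen: int = 0

    def stream(self, prompt: str) -> Iterator[str]:
        chosen = least_loaded(self.replicas)
        chosen.in_flight += 1
        try:
            for token in chosen.stream(prompt):
                self.tokens_seen += 1  # extra hop for each token
                yield token
        finally:
            chosen.in_flight -= 1


@dataclass
class EndpointPicker:
    """Control plane only: returns a replica name and never handles tokens."""
    replicas: Dict[str, Replica]
    tokens_seen: int = 0  # never incremented, no response data passes through

    def pick(self) -> str:
        chosen = least_loaded(self.replicas)
        chosen.in_flight += 1
        return chosen.name


def direct_stream(picker: EndpointPicker, replicas: Dict[str, Replica], prompt: str) -> Iterator[str]:
    """Ask the picker where to go, then stream straight from that replica."""
    replica = replicas[picker.pick()]
    try:
        yield from replica.stream(prompt)
    finally:
        replica.in_flight -= 1


if __name__ == "__main__":
    pool = {"a": Replica("a"), "b": Replica("b")}
    router = ProxyingRouter(pool)
    picker = EndpointPicker(pool)
    print(list(router.stream("hello from ray serve")), router.tokens_seen)
    print(list(direct_stream(picker, pool, "hello from ray serve")), picker.tokens_seen)
```

In this sketch both paths return the same tokens; the only difference is how many components touch each token on the way back to the client.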
Seiji Eicher (@seiji_________): Today we are excited to announce, in partnership with the GKE team at Google Cloud (@googlecloud), a major milestone in Ray Serve LLM’s production serving capability. Ray Serve LLM now matches high-performance, Rust-based routing frameworks such as vllm-router (@vllm_project) in