@raydistributed: Ray Serve LLM now offers 4.4x higher request throughput on prefill-heavy workloads, and 24.8x higher request throughput…

X AI KOLs Following 06/18/26, 04:22 PM Tools

ray-serve llm-serving throughput-optimization inference google-cloud vllm high-performance

Summary

Ray Serve LLM achieves 4.4x higher request throughput on prefill-heavy workloads and 24.8x higher request throughput on decode-heavy workloads via direct streaming, a new Ray V2 executor backend in vLLM, and HAProxy ingress. All three optimizations are available in Ray 2.56, and the work was done with Google Cloud and vLLM.

Ray Serve LLM now offers 4.4x higher request throughput on prefill-heavy workloads, and 24.8x higher request throughput on decode-heavy workloads!

Three major optimizations:

- Direct streaming, bypassing an intermediate Ray Serve deployment on the response path with a new, control plane-only endpoint picker
- A new, Ray V2 executor backend in vLLM, enabling optimizations such as async scheduling
- HAProxy ingress, for ingress request routing at the speed of C

All available in Ray 2.56. This is awesome work with @googlecloud and @vllm_project!


Cached at: 06/18/26, 10:11 PM

Ray Serve LLM now offers 4.4x higher request throughput on prefill-heavy workloads, and 24.8x higher request throughput on decode-heavy workloads!
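To make the multipliers concrete, here is a rough capacity-planning sketch. It assumes, purely for illustration, that the speedup applies uniformly per replica and that load scales linearly with replica count; the post does not state either. The baseline of 2 requests/s per replica, the 100 requests/s target, and the helper `replicas_needed` are hypothetical and do not come from the announcement.

```python
import math

# Multipliers quoted in the post.
PREFILL_SPEEDUP = 4.4
DECODE_SPEEDUP = 24.8


def replicas_needed(target_rps: float, baseline_rps_per_replica: float,
                    speedup: float = 1.0) -> int:
    """Replicas required to sustain target_rps.

    Assumption (not from the post): each replica's throughput is the
    baseline multiplied by `speedup`, and replicas scale linearly.
    """
    per_replica = baseline_rps_per_replica * speedup
    return math.ceil(target_rps / per_replica)


if __name__ == "__main__":
    target, baseline = 100.0, 2.0  # hypothetical numbers
    print("before:       ", replicas_needed(target, baseline))
    print("prefill-heavy:", replicas_needed(target, baseline, PREFILL_SPEEDUP))
    print("decode-heavy: ", replicas_needed(target, baseline, DECODE_SPEEDUP))
```

Real savings depend on the model, hardware, and request mix, so treat this as arithmetic on the headline numbers rather than a prediction.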

Three major optimizations:

- Direct streaming, bypassing an intermediate Ray Serve deployment on the response path with a new, control plane-only endpoint picker
- A new, Ray V2 executor backend in vLLM, enabling optimizations such as async scheduling
- HAProxy ingress, for ingress request routing at the speed of C

All available in Ray 2.56. This is awesome work with @googlecloud and @vllm_project!
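The direct-streaming idea can be illustrated with a toy asyncio model. This is a minimal sketch, not Ray Serve's implementation: `Replica`, `EndpointPicker`, and `RelayDeployment` are hypothetical classes invented here, and the least-loaded selection rule is an assumption. It only shows the shape of the change: the picker makes a routing decision and never handles response tokens, whereas a relay hop copies every token.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Replica:
    """Toy stand-in for a model replica that streams tokens."""
    name: str
    active_streams: int = 0

    async def stream(self, prompt: str):
        self.active_streams += 1
        try:
            for token in prompt.split():
                await asyncio.sleep(0)  # yield to the event loop between tokens
                yield token.upper()
        finally:
            self.active_streams -= 1


class EndpointPicker:
    """Control plane only: chooses a replica, never touches response tokens."""

    def __init__(self, replicas):
        self.replicas = replicas

    def pick(self) -> Replica:
        # Assumed policy for this sketch: least active streams wins.
        return min(self.replicas, key=lambda r: r.active_streams)


class RelayDeployment:
    """Intermediate hop that copies every token (the path being bypassed)."""

    def __init__(self, picker: EndpointPicker):
        self.picker = picker
        self.tokens_relayed = 0

    async def stream(self, prompt: str):
        replica = self.picker.pick()
        async for token in replica.stream(prompt):
            self.tokens_relayed += 1  # per-token work on the response path
            yield token


async def relayed_request(relay: RelayDeployment, prompt: str):
    return [t async for t in relay.stream(prompt)]


async def direct_request(picker: EndpointPicker, prompt: str):
    replica = picker.pick()  # one control-plane decision
    return [t async for t in replica.stream(prompt)]  # tokens flow directly


async def main():
    picker = EndpointPicker([Replica("r0"), Replica("r1")])
    relay = RelayDeployment(picker)
    via_relay = await relayed_request(relay, "hello from ray serve")
    direct = await direct_request(picker, "hello from ray serve")
    return via_relay, direct, relay.tokens_relayed


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Both paths return the same tokens; the difference is that the relay counter only grows on the relayed path, which is the per-token overhead the direct path avoids in this model.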

Seiji Eicher (@seiji_________): Today we are excited to announce, in partnership with the GKE team at Google Cloud (@googlecloud), a major milestone in Ray Serve LLM’s production serving capability. Ray Serve LLM now matches high-performance, Rust-based routing frameworks such as vllm-router (@vllm_project) in

Similar Articles

@seiji_________: Today we are excited to announce, in partnership with the GKE team at Google Cloud (@googlecloud), a major milestone in…

vllm-project/vllm v0.20.0rc1

vllm-project/vllm v0.21.0rc1

@robertnishihara: Some intuition about PD disaggregation from the blog - PD doesn't speed up prefill and can actually hurt TTFT - PD's re…

@AndrewYNg: New course on serving LLMs efficiently -- how do you serve models to many concurrent users at low latency and reasonabl…
