@raydistributed: Ray Serve LLM now offers 4.4x higher request throughput on prefill-heavy workloads, and 24.8x higher request throughput…


Summary

Ray Serve LLM reports 4.4x higher request throughput on prefill-heavy workloads and 24.8x higher request throughput on decode-heavy workloads. The gains come from three optimizations: direct streaming, a new Ray V2 executor backend in vLLM, and HAProxy ingress. All three are available in Ray 2.56 and were developed with Google Cloud and the vLLM project.


Cached at: 06/18/26, 10:11 PM

Ray Serve LLM now offers 4.4x higher request throughput on prefill-heavy workloads, and 24.8x higher request throughput on decode-heavy workloads!

Three major optimizations:

  • Direct streaming, bypassing an intermediate Ray Serve deployment on the response path with a new, control plane-only endpoint picker
  • A new, Ray V2 executor backend in vLLM, enabling optimizations such as async scheduling
  • HAProxy ingress, for ingress request routing at the speed of C

All available in Ray 2.56. This is awesome work with @googlecloud and @vllm_project!
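The first optimization can be pictured with a toy model. The sketch below is not Ray Serve's implementation: `Replica`, `EndpointPicker`, `proxied_stream`, `direct_stream`, and the least-in-flight policy are all hypothetical names and assumptions made up for illustration. It only shows the idea stated in the post, namely that the picker makes one control-plane decision per request while response tokens skip the intermediate hop.

```python
from dataclasses import dataclass
from typing import Iterator, List, Tuple


@dataclass
class Replica:
    """Hypothetical stand-in for a model replica that streams tokens."""
    name: str
    in_flight: int = 0

    def stream(self, prompt: str) -> Iterator[str]:
        # Pretend each whitespace-separated word is one generated token.
        for token in prompt.split():
            yield token


@dataclass
class EndpointPicker:
    """Hypothetical control-plane-only component: picks a replica, never relays tokens."""
    replicas: List[Replica]

    def pick(self) -> Replica:
        # Least-in-flight selection; an assumed policy, for illustration only.
        return min(self.replicas, key=lambda r: r.in_flight)


def proxied_stream(picker: EndpointPicker, prompt: str,
                   hops: List[Tuple[str, str]]) -> Iterator[str]:
    """Baseline path: every response token is relayed by an intermediate hop."""
    replica = picker.pick()
    for token in replica.stream(prompt):
        hops.append(("intermediate", token))  # one extra relay per token
        yield token


def direct_stream(picker: EndpointPicker, prompt: str,
                  hops: List[Tuple[str, str]]) -> Iterator[str]:
    """Direct path: one routing decision, then tokens flow straight from the replica."""
    replica = picker.pick()
    hops.append(("picker", replica.name))  # the only control-plane touch
    yield from replica.stream(prompt)


if __name__ == "__main__":
    picker = EndpointPicker([Replica("r0", in_flight=2), Replica("r1", in_flight=0)])
    proxied_hops: List[Tuple[str, str]] = []
    direct_hops: List[Tuple[str, str]] = []
    list(proxied_stream(picker, "a b c", proxied_hops))
    list(direct_stream(picker, "a b c", direct_hops))
    print(len(proxied_hops), len(direct_hops))  # prints "3 1"
```

In this toy model the relay work on the proxied path grows with the number of streamed tokens, while the direct path pays a fixed cost per request. That is consistent with the larger gain reported for decode-heavy workloads, though the post itself does not break the speedup down by optimization.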

Seiji Eicher (@seiji_________): Today we are excited to announce, in partnership with the GKE team at Google Cloud (@googlecloud), a major milestone in Ray Serve LLM’s production serving capability. Ray Serve LLM now matches high-performance, Rust-based routing frameworks such as vllm-router (@vllm_project) in
