Tag
Ray Serve LLM achieves 4.4x and 24.8x throughput improvements on prefill- and decode-heavy workloads via direct streaming, a new vLLM V2 executor backend, and HAProxy ingress, now available in Ray 2.56 in partnership with Google Cloud and vLLM.
Anyscale published a technical guide on deploying production-ready AI agents using Ray Serve, MCP, and A2A protocols. The article addresses common infrastructure bottlenecks by proposing a decoupled microservices architecture that enables independent scaling of LLMs, tools, and agents.