Ray 2.56 is released with stability improvements for Ray Data and a re-architecture of Ray Serve for better LLM serving performance.
Ray 2.56 has been released with improvements to Ray Data, Ray Serve for LLMs, GPU-domain-aware placement groups, and Kubernetes integration.
Robert Nishihara highlights a paper on disaggregating RL workloads, showing that using compute-optimized H800s for prefill and bandwidth-optimized H20s for decode can cut rollout times by 21-51% and 47% respectively, emphasizing that no single hardware type fits all stages.
In Ray 2.56, Ray Serve LLM achieves up to 4x higher throughput on prefill-heavy workloads and 24x on decode-heavy workloads, matching Rust-based routing frameworks like vllm-router in production benchmarks. The results were announced in partnership with the Google Cloud GKE team.
This blog post from Anyscale explains the intuition behind Prefill-Decode (PD) disaggregation for LLM serving, showing how separating prefill and decode phases onto dedicated GPUs can achieve up to 2.7x better goodput and 67% cost savings when using Ray and vLLM on AMD MI325X, while also discussing when PD disaggregation does not help.
Anyscale on Azure is now in public preview. Daniel Arrizza and Paul Yu will host a working session on building and deploying production AI workloads within an Azure tenant, integrating with existing Azure services.
Microsoft AI announces MAI-Thinking-1, a 35B active/1T total MoE reasoning model competitive on STEM and coding tasks, developed using Ray for distributed training and orchestration.
Snowflake now supports job-based batch inference powered by Ray, enabling distributed GPU execution for scaling model inference over millions of unstructured datapoints with a single API call.
Anyscale is hosting a hands-on virtual lab session teaching developers how to build and scale data pipelines with Ray, covering video data curation, distributed GPU inference, and CPU/GPU streaming pipelines.
Anyscale releases Agent Skills to help coding agents correctly deploy Ray workloads with proper GPU memory handling and up-to-date APIs.