OpenAI has released MRC (Multipath Reliable Connection), a novel networking protocol developed with industry partners to improve performance and resilience in large-scale AI training clusters. The specification was published via the Open Compute Project to standardize infrastructure for efficient supercomputer operations.
Emerald AI demonstrated how power-flexible AI factories can autonomously adjust their electricity consumption to help stabilize the grid. The demonstration used NVIDIA GPUs and infrastructure at a London data center to absorb peak power surges without disrupting critical workloads.
OpenAI shares detailed lessons from scaling a single Kubernetes cluster to 7,500 nodes to support large machine learning workloads, covering networking, scheduling, and infrastructure challenges. The post builds on the company's earlier experience scaling to 2,500 nodes and aims to help the broader Kubernetes community.