Scaling Kubernetes to 2,500 nodes

OpenAI Blog News

Summary

OpenAI shares infrastructure lessons from scaling Kubernetes to 2,500 nodes, detailing optimizations for container image pulls including kubelet configuration changes, Docker overlay2 migration, and preloading strategies to resolve Pending pod issues.


Cached at: 04/20/26, 02:45 PM

# Scaling Kubernetes to 2,500 nodes

Source: [https://openai.com/index/scaling-kubernetes-to-2500-nodes/](https://openai.com/index/scaling-kubernetes-to-2500-nodes/)

Our [Dota](https://blog.openai.com/more-on-dota-2/) project started out on Kubernetes, and as it scaled, we noticed that fresh Kubernetes nodes often have pods sitting in [Pending](https://kubernetes.io/docs/tasks/debug-application-cluster/debug-pod-replication-controller/#my-pod-stays-pending) for a long time. The game image is around 17GB, and would often take 30 minutes to pull on a fresh cluster node, so we understood why the Dota container would be Pending for a while, but this was true for other containers as well. Digging in, we found that [kubelet](https://kubernetes.io/docs/admin/kubelet) has a `--serialize-image-pulls` flag which defaults to `true`, meaning the Dota image pull blocked all other images. Changing to `false` required switching Docker to overlay2 rather than AUFS. To further speed up pulls, we also moved the Docker root to the instance-attached SSD, like we did for the etcd machines.

Even after optimizing the pull speed, we saw pods failing to start with a cryptic error message: `rpc error: code = 2 desc = net/http: request canceled`. The kubelet and Docker logs also contained messages indicating that the image pull had been canceled, due to a lack of progress. We tracked the root cause to large images taking too long to pull/extract, or times when we had a long backlog of images to pull. To address this, we set kubelet’s `--image-pull-progress-deadline` flag to 30 minutes, and set the Docker daemon’s `max-concurrent-downloads` option to 10. (The second option didn’t speed up extraction of large images, but allowed the queue of images to pull in parallel.)

Our last Docker pull issue was due to the Google Container Registry. By default, kubelet pulls a special image from `gcr.io` (controlled by the `--pod-infra-container-image` flag) which is used when starting any new container. If that pull fails for any reason, like exceeding your [quota](https://github.com/kubernetes/kubernetes/issues/50568), that node won’t be able to launch any containers. Because our nodes go through a NAT to reach `gcr.io` rather than having their own public IP, it’s quite likely that we’ll hit this per-IP quota limit. To fix this, we simply preloaded that Docker image in the machine image for our Kubernetes workers by using `docker image save -o /opt/preloaded_docker_images.tar` and `docker image load -i /opt/preloaded_docker_images.tar`. To improve performance, we do the same for a whitelist of common OpenAI-internal images like the Dota image.
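The kubelet and Docker settings mentioned above can be gathered in one place. The Python sketch below is only an illustration, not the configuration OpenAI actually deployed: it renders a Docker `daemon.json` and the two kubelet flags with the values the post reports. The SSD mount point `/mnt/ssd/docker` is a hypothetical path, and using the `data-root` key to relocate the Docker root is an assumption (older Docker releases used a different key). The `--image-pull-progress-deadline` flag belongs to the Docker-based kubelet setup of that era and may not exist in newer Kubernetes releases.

```python
import json
from typing import List

# Hypothetical mount point for the instance-attached SSD (not given in the post).
SSD_DOCKER_ROOT = "/mnt/ssd/docker"


def docker_daemon_config(data_root: str = SSD_DOCKER_ROOT) -> dict:
    """Return daemon.json settings matching the changes described in the post."""
    return {
        "storage-driver": "overlay2",    # used instead of AUFS
        "data-root": data_root,          # Docker root on the local SSD (assumed key)
        "max-concurrent-downloads": 10,  # let queued image layers download in parallel
    }


def kubelet_image_pull_flags() -> List[str]:
    """Return the kubelet flags named in the post, with the values it reports."""
    return [
        "--serialize-image-pulls=false",       # one large pull no longer blocks others
        "--image-pull-progress-deadline=30m",  # tolerate slow pulls of large images
    ]


if __name__ == "__main__":
    print(json.dumps(docker_daemon_config(), indent=2))
    print(" ".join(kubelet_image_pull_flags()))
```

Generating both pieces from one script keeps the storage driver and the parallel-pull flag in step, which matters because the post notes that disabling serialized pulls required overlay2.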
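The preloading step can be sketched in the same way. The Python helper below builds the `docker image save` and `docker image load` command lines for a list of images. The tar path comes from the post, but the image references are hypothetical placeholders, since the post does not list the actual whitelist, and the split between "run at machine-image build time" and "run at worker boot" is an assumption about how the two commands would be used.

```python
import shlex
from typing import List

# Path used in the post for the bundled images.
PRELOAD_TAR = "/opt/preloaded_docker_images.tar"

# Hypothetical image references; the real whitelist is not given in the post.
PRELOAD_IMAGES = [
    "gcr.io/example/pod-infra:latest",
    "registry.example.com/dota:latest",
]


def save_command(images: List[str], tar_path: str = PRELOAD_TAR) -> List[str]:
    """Build the command that bundles the given images into a single tar file."""
    if not images:
        raise ValueError("at least one image is required")
    return ["docker", "image", "save", "-o", tar_path, *images]


def load_command(tar_path: str = PRELOAD_TAR) -> List[str]:
    """Build the command that loads the bundled images into the local Docker daemon."""
    return ["docker", "image", "load", "-i", tar_path]


if __name__ == "__main__":
    # Assumed usage: save while building the machine image, load on the worker.
    print(shlex.join(save_command(PRELOAD_IMAGES)))
    print(shlex.join(load_command()))
```

Each function returns an argument list rather than a shell string, so the result can be passed directly to `subprocess.run(cmd, check=True)` without shell quoting concerns.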
