Scaling Kubernetes to 2,500 nodes

OpenAI Blog 01/18/18, 08:00 AM News

kubernetes scaling infrastructure docker container-orchestration openai

Summary

OpenAI shares infrastructure lessons from scaling Kubernetes to 2,500 nodes, detailing optimizations for container image pulls including kubelet configuration changes, Docker overlay2 migration, and preloading strategies to resolve Pending pod issues.


Original Article


Cached at: 04/20/26, 02:45 PM

# Scaling Kubernetes to 2,500 nodes

Source: [https://openai.com/index/scaling-kubernetes-to-2500-nodes/](https://openai.com/index/scaling-kubernetes-to-2500-nodes/)

Our [Dota](https://blog.openai.com/more-on-dota-2/) project started out on Kubernetes, and as it scaled, we noticed that fresh Kubernetes nodes often have pods sitting in [Pending](https://kubernetes.io/docs/tasks/debug-application-cluster/debug-pod-replication-controller/#my-pod-stays-pending) for a long time. The game image is around 17GB, and would often take 30 minutes to pull on a fresh cluster node, so we understood why the Dota container would be Pending for a while, but this was true for other containers as well. Digging in, we found that [kubelet](https://kubernetes.io/docs/admin/kubelet) has a `--serialize-image-pulls` flag which defaults to `true`, meaning the Dota image pull blocked all other images. Changing to `false` required switching Docker to overlay2 rather than AUFS. To further speed up pulls, we also moved the Docker root to the instance-attached SSD, like we did for the etcd machines.

Even after optimizing the pull speed, we saw pods failing to start with a cryptic error message: `rpc error: code = 2 desc = net/http: request canceled`. The kubelet and Docker logs also contained messages indicating that the image pull had been canceled, due to a lack of progress. We tracked the root to large images taking too long to pull/extract, or times when we had a long backlog of images to pull. To address this, we set kubelet’s `--image-pull-progress-deadline` flag to 30 minutes, and set the Docker daemon’s `max-concurrent-downloads` option to 10. (The second option didn’t speed up extraction of large images, but allowed the queue of images to pull in parallel.)

Our last Docker pull issue was due to the Google Container Registry. By default, kubelet pulls a special image from `gcr.io` (controlled by the `--pod-infra-container-image` flag) which is used when starting any new container. If that pull fails for any reason, like exceeding your [quota](https://github.com/kubernetes/kubernetes/issues/50568), that node won’t be able to launch any containers. Because our nodes go through a NAT to reach `gcr.io` rather than having their own public IP, it’s quite likely that we’ll hit this per-IP quota limit. To fix this, we simply preloaded that Docker image in the machine image for our Kubernetes workers by using `docker image save -o /opt/preloaded_docker_images.tar` and `docker image load -i /opt/preloaded_docker_images.tar`. To improve performance, we do the same for a whitelist of common OpenAI-internal images like the Dota image.
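The settings described above can be collected in one place. The following is a minimal sketch in Python, not the configuration OpenAI actually used: it assembles a Docker `daemon.json` fragment, the two kubelet flags, and the `docker image save` / `docker image load` command lines. The values for overlay2, `max-concurrent-downloads`, the 30-minute deadline, and the tar path come from the article. The SSD mount point, the use of Docker's `data-root` key to relocate the Docker root, the `30m` duration spelling, and the example image name are assumptions for illustration. Flag availability also depends on the kubelet version in use.

```python
import json
import shlex

# Hypothetical mount point for the instance-attached SSD (not from the article).
SSD_DOCKER_ROOT = "/mnt/ssd/docker"
# Tar path as given in the article.
PRELOAD_TAR = "/opt/preloaded_docker_images.tar"


def docker_daemon_config(data_root: str = SSD_DOCKER_ROOT) -> dict:
    """Return a daemon.json fragment reflecting the Docker-side changes."""
    return {
        "storage-driver": "overlay2",    # article: overlay2 rather than AUFS
        "max-concurrent-downloads": 10,  # article: pull queued images in parallel
        "data-root": data_root,          # assumption: relocate Docker root via data-root
    }


def kubelet_flags() -> list[str]:
    """Return the kubelet flags named in the article with the values it describes."""
    return [
        "--serialize-image-pulls=false",       # stop one large pull blocking the rest
        "--image-pull-progress-deadline=30m",  # assumption: 30 minutes written as 30m
    ]


def preload_commands(images: list[str], tar_path: str = PRELOAD_TAR) -> tuple[str, str]:
    """Build the save command (image build time) and load command (node boot)."""
    save = shlex.join(["docker", "image", "save", "-o", tar_path, *images])
    load = shlex.join(["docker", "image", "load", "-i", tar_path])
    return save, load


if __name__ == "__main__":
    print(json.dumps(docker_daemon_config(), indent=2))
    print(" ".join(kubelet_flags()))
    # The image name below is a placeholder, not the real pod-infra image.
    for command in preload_commands(["registry.example.com/pause:example"]):
        print(command)
```

The script only prints the configuration and commands; it does not touch Docker or kubelet. In practice the save command would run while building the worker machine image and the load command when a node boots, with the whitelist of images passed as the `images` argument.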


Similar Articles

Scaling Kubernetes to 7,500 nodes

Infrastructure for deep learning

Advancing Open Source AI, NVIDIA Donates Dynamic Resource Allocation Driver for GPUs to Kubernetes Community

Building the compute infrastructure for the Intelligence Age

how do you scale infrastructure for ai agents on a budget?
