IREN acquires Mirantis for $625 million to integrate its cloud-native Kubernetes and AI infrastructure software into IREN's data centers, aiming to offer a full AI cloud platform.
Ex-Google engineers published a map of Google's internal tools and their open-source equivalents, giving teams a shortcut for choosing components when building scalable infrastructure.
Kubernetes v1.36 “Haru” ships 70 enhancements—18 stable, 25 beta, 25 alpha—plus deprecations and removals.
Developer explores how to abstract GPU workloads so they can run across multiple GPU providers without provider-specific configuration, leaning toward separating workload definition from infrastructure binding.
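One way to picture the split between workload definition and infrastructure binding is the minimal Python sketch below. It is an illustration only, not the developer's design: the `Workload`, `InstanceType`, and `ProviderBinding` classes, the `bind` function, and all provider and instance names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    """Provider-neutral description of what the job needs (hypothetical schema)."""
    image: str
    command: tuple[str, ...]
    gpu_count: int
    min_gpu_memory_gb: int  # required memory per GPU


@dataclass(frozen=True)
class InstanceType:
    """One machine shape offered by a provider (made-up names and sizes)."""
    name: str
    gpu_count: int
    gpu_memory_gb: int  # memory per GPU


@dataclass(frozen=True)
class ProviderBinding:
    """Infrastructure-side details, kept apart from the workload definition."""
    provider: str
    instance_types: tuple[InstanceType, ...]


def bind(workload: Workload, binding: ProviderBinding) -> dict:
    """Resolve a workload against one provider, picking the smallest shape that fits."""
    candidates = [
        it for it in binding.instance_types
        if it.gpu_count >= workload.gpu_count
        and it.gpu_memory_gb >= workload.min_gpu_memory_gb
    ]
    if not candidates:
        raise ValueError(f"{binding.provider} has no instance type that fits")
    chosen = min(candidates, key=lambda it: (it.gpu_count, it.gpu_memory_gb))
    return {
        "provider": binding.provider,
        "instance_type": chosen.name,
        "image": workload.image,
        "command": list(workload.command),
    }


# The same workload definition is reused unchanged across two hypothetical providers.
job = Workload(image="example/train:latest", command=("python", "train.py"),
               gpu_count=2, min_gpu_memory_gb=40)
provider_a = ProviderBinding("provider-a", (
    InstanceType("a.small", 1, 24), InstanceType("a.large", 4, 80)))
provider_b = ProviderBinding("provider-b", (
    InstanceType("b.duo", 2, 48), InstanceType("b.octo", 8, 80)))

for provider in (provider_a, provider_b):
    print(bind(job, provider))
```

In this sketch the job states only its requirements, and each provider's catalog lives in its own binding, so adding a provider means adding a binding rather than editing the workload.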
ByteDance has open-sourced Gödel, a high-performance Kubernetes scheduler.
NVIDIA is donating its Dynamic Resource Allocation (DRA) Driver for GPUs to the Cloud Native Computing Foundation (CNCF) and Kubernetes community, moving it from vendor-governed to community-owned. The donation aims to simplify GPU resource management in Kubernetes for AI workloads and includes GPU support for Kata Containers through collaboration with CNCF's Confidential Containers community.
OpenAI shares detailed lessons learned from scaling a single Kubernetes cluster to 7,500 nodes to support large machine learning workloads, covering networking, scheduling, and infrastructure challenges. The post builds on their earlier experience scaling to 2,500 nodes and aims to help the broader Kubernetes community.
OpenAI shares infrastructure lessons from scaling Kubernetes to 2,500 nodes, detailing optimizations for container image pulls including kubelet configuration changes, Docker overlay2 migration, and preloading strategies to resolve Pending pod issues.
OpenAI shares their deep learning infrastructure approach and open-sources kubernetes-ec2-autoscaler, a batch-optimized scaling manager for Kubernetes, emphasizing how infrastructure quality multiplies research progress.