Modular published a blog post explaining why traditional HTTP routing doesn't work for LLM inference workloads: their distributed inference framework must handle stateful, heterogeneous GPU pods with KV caches, specialized prefill/decode backends, and conversation-level routing, none of which stateless load-balancing algorithms address.
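The post's core claim is that session affinity has to be cache-aware. As a rough illustration (not Modular's implementation; the class, pod names, and hashing scheme below are invented for the sketch), a router can pin decode traffic to the pod that already holds a conversation's KV cache while load-balancing stateless prefill separately:

```python
import hashlib

class CacheAwareRouter:
    """Sticky, cache-aware routing sketch: decode requests return to the
    pod that already holds their KV cache; prefill is load-balanced."""

    def __init__(self, prefill_pods, decode_pods):
        self.prefill_pods = list(prefill_pods)
        self.decode_pods = list(decode_pods)
        self.session_to_decode = {}  # session_id -> decode pod with warm cache
        self.load = {p: 0 for p in self.prefill_pods + self.decode_pods}

    def route_prefill(self, session_id: str) -> str:
        # Prefill is compute-bound; in this sketch, balance purely on load.
        pod = min(self.prefill_pods, key=self.load.__getitem__)
        self.load[pod] += 1
        return pod

    def route_decode(self, session_id: str) -> str:
        # Decode is stateful: stick with the pod holding this session's cache.
        if session_id not in self.session_to_decode:
            # First turn: consistent hash keeps placement deterministic.
            h = int(hashlib.sha256(session_id.encode()).hexdigest(), 16)
            self.session_to_decode[session_id] = self.decode_pods[h % len(self.decode_pods)]
        return self.session_to_decode[session_id]

router = CacheAwareRouter(["prefill-0", "prefill-1"], ["decode-0", "decode-1"])
assert router.route_decode("conv-42") == router.route_decode("conv-42")  # sticky
```

A production router would additionally handle cache eviction, pod failure, and rebalancing, which is exactly where the stateless-HTTP assumptions break down.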
Modly is an open-source desktop app that generates fully textured 3D meshes from images, running 100% locally on your GPU with pluggable AI model extensions.
Meta's In-Kernel Broadcast Optimization (IKBO) eliminates redundant user-embedding broadcast in RecSys inference via kernel-model-system co-design, delivering up to 2/3 latency reduction and ~4x speedup on H100 GPUs, and serving as the backbone for the Meta Adaptive Ranking Model.
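IKBO itself lives in Meta's kernel-level serving stack, but the redundancy it attacks is visible in plain PyTorch (shapes and names below are invented for illustration): scoring one user's embedding against many candidate items does not require materializing a tiled copy of that embedding.

```python
import torch

num_items, dim = 4096, 256
user = torch.randn(1, dim)            # one user embedding
items = torch.randn(num_items, dim)   # candidate item embeddings

# Redundant: replicate the user embedding once per candidate before scoring.
tiled = user.repeat(num_items, 1)               # (num_items, dim) copy
scores_tiled = (tiled * items).sum(dim=-1)

# Broadcast: reuse the single copy via broadcasting, no replicated tensor.
scores_bcast = (user * items).sum(dim=-1)

assert torch.allclose(scores_tiled, scores_bcast)
```

Presumably the "in-kernel" part pushes this one step further, performing the broadcast inside fused GPU kernels so replicated reads never materialize at all; that is where the kernel-model-system co-design comes in.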
AMD is set to release new PCIe add-in-card Instinct GPUs aimed at the enterprise AI market, offering a potential new hardware option for local LLM deployment.
AMD introduces the Instinct MI350P accelerator featuring CDNA 4 architecture in a PCIe form factor, though pricing and availability details are not yet announced.
Modal engineers profiled SGLang's scheduler on multimodal VLM workloads and found that replacing expensive GPU memory bookkeeping with a simple Python dict cache improved throughput by 16% and reduced latency by over 13%, with the fix merged into SGLang v0.5.10.
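The merged fix is SGLang-internal, but the general pattern is simple to sketch (the names below are hypothetical, not the actual patch): key expensive per-input work by a host-side content hash in a plain dict instead of tracking it with device-side bookkeeping.

```python
import hashlib

class HostSideCache:
    """Memoize expensive per-input work (e.g., multimodal preprocessing)
    with a plain Python dict keyed by content hash on the host."""

    def __init__(self):
        self._cache = {}  # sha256 hex digest -> computed result

    def get_or_compute(self, payload: bytes, compute):
        key = hashlib.sha256(payload).hexdigest()
        if key not in self._cache:            # O(1) host-side lookup
            self._cache[key] = compute(payload)
        return self._cache[key]

# Hypothetical usage: identical images skip recomputation on repeat requests.
cache = HostSideCache()
fake_image = b"\x89PNG..."                    # stand-in payload
feats = cache.get_or_compute(fake_image, lambda b: len(b))  # toy "encoder"
```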
Anyscale releases Agent Skills to help coding agents correctly deploy Ray workloads with proper GPU memory handling and up-to-date APIs.
Sam Altman shares a manga created with ChatGPT Images 2.0 depicting the GPU hunt, hinting at an upcoming image-generation upgrade.
vLLM launched a redesigned recipes site that turns any HuggingFace model URL into a ready-to-run inference recipe for specific hardware and tasks.
A quick breakdown of ballpark numbers for a 100k H100 GPU datacenter, covering GPU costs (~$3B), full datacenter build (~$5B), power consumption (~0.2GW), and annual energy costs (~$50M).
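The headline figures hang together arithmetically. A quick reproduction, where the per-GPU price, all-in watts per GPU, and electricity rate are assumptions (roughly $30k/GPU, ~2 kW all-in, ~$0.03/kWh) rather than numbers from the post:

```python
n_gpus = 100_000

gpu_capex = n_gpus * 30_000            # ~$30k per H100 (assumed) -> $3.0B
datacenter_capex = 5e9                 # ~$5B all-in build, per the post

power_gw = n_gpus * 2_000 / 1e9        # ~2 kW/GPU all-in (assumed) -> 0.2 GW

kwh_per_year = power_gw * 1e6 * 8760   # GW -> kW, times hours per year
energy_cost = kwh_per_year * 0.03      # ~$0.03/kWh (assumed) -> ~$53M/yr

print(f"GPUs ${gpu_capex/1e9:.1f}B | power {power_gw:.1f} GW | "
      f"energy ~${energy_cost/1e6:.0f}M/yr")
```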
A tweet highlights how coding agents can clarify complex ideas, using on-device GPU vs. NPU memory contention as an example worked through in code.
An opinion piece arguing that current GPU hardware is fundamentally insufficient for achieving AGI and that compute architectures would need to be redesigned from the ground up.
A researcher shares their home compute setup for MLX and AI research, featuring M3 Ultra with 512GB, RTX PRO 6000 with 96GB, and M3 Max with 96GB for model porting and stress testing.
A technical blog post by a Saints Row: The Third Remastered developer explaining modern rendering culling techniques including distance culling, backface culling, and frustum culling, with practical insights for game developers working on real-time graphics optimization.
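All three techniques reduce to cheap geometric rejection tests. A readable sketch (real engines run these vectorized, often on the GPU; the plane convention below is an assumption, not taken from the post):

```python
def distance_cull(obj_pos, cam_pos, max_dist):
    """Distance culling: drop objects beyond the draw distance.
    Squared comparison avoids the sqrt."""
    d2 = sum((o - c) ** 2 for o, c in zip(obj_pos, cam_pos))
    return d2 > max_dist * max_dist          # True -> culled

def backface_cull(tri_normal, cam_to_tri):
    """Backface culling: on a closed mesh, a triangle whose normal points
    away from the camera (normal . view >= 0) can never be visible."""
    return sum(n * v for n, v in zip(tri_normal, cam_to_tri)) >= 0.0

def frustum_cull_sphere(center, radius, planes):
    """Frustum culling: reject a bounding sphere lying entirely outside
    any frustum plane. Planes are (nx, ny, nz, d) with unit normals
    pointing into the frustum, so inside points satisfy n . p + d >= 0."""
    for nx, ny, nz, d in planes:
        if nx * center[0] + ny * center[1] + nz * center[2] + d < -radius:
            return True                      # fully outside this plane
    return False
```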
vLLM v0.19.1 release: a fast and easy-to-use open-source library for LLM inference and serving with state-of-the-art throughput, supporting 200+ model architectures and diverse hardware including NVIDIA/AMD GPUs and CPUs.
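For reference, the offline inference API keeps the same minimal shape across releases (the model id below is just an example of a supported architecture):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")    # any supported HF model id
params = SamplingParams(temperature=0.7, max_tokens=64)

outputs = llm.generate(["Summarize KV caching in one sentence."], params)
print(outputs[0].outputs[0].text)
```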
NVIDIA is donating its Dynamic Resource Allocation (DRA) Driver for GPUs to the Cloud Native Computing Foundation (CNCF) and Kubernetes community, moving it from vendor-governed to community-owned. The donation aims to simplify GPU resource management in Kubernetes for AI workloads and includes GPU support for Kata Containers through collaboration with CNCF's Confidential Containers community.
AMD and OpenAI announce a strategic partnership to deploy 6 gigawatts of AMD Instinct GPUs, with initial 1 gigawatt deployment starting in H2 2026. AMD will issue OpenAI warrants for up to 160 million shares, with vesting tied to deployment milestones and financial targets.
OpenAI announces Stargate Norway, its first European AI data center initiative, in Narvik, planned to deliver 100,000 NVIDIA GPUs by the end of 2026 with 230MW of capacity powered entirely by renewable hydropower. The facility is a joint venture between Nscale and Aker, reflecting OpenAI's broader expansion of AI infrastructure partnerships across Europe and globally.