This article explores the feasibility of attaching an external NVIDIA RTX 5090 GPU to an Apple Silicon Mac over Thunderbolt for CUDA inference and gaming, covering approaches such as tinygrad's eGPU driver work and PCI passthrough to a Linux VM.
Modal Labs has released an open-source, interlinked GPU glossary that consolidates fragmented NVIDIA documentation, CUDA details, and compiler flags into a single navigable resource for engineers optimizing LLM training and inference.
NVIDIA has open-sourced cuda-oxide, an experimental rustc backend (a Rust-to-CUDA compiler) that lets developers write safe, idiomatic GPU kernels in pure Rust that compile directly to PTX, with no DSLs, FFI bindings, or source-to-source translation.
This article introduces the Cornell Virtual Workshop's free online tutorial on basic CUDA programming using C, covering prerequisites and additional resources.
Discussion of the shift in GPU kernel engineering from C++ CuTe/CUTLASS to NVIDIA's Python-based CuTeDSL, questioning whether new engineers should learn legacy C++ templates or prioritize the emerging stack for LLM inference work.
NVIDIA GTC 2026 keynote highlights the 20th anniversary of CUDA, introduces DLSS 5 with AI-powered neural rendering, and surveys NVIDIA's accelerated computing platforms across automotive, healthcare, robotics, and other sectors. CEO Jensen Huang projects $1 trillion in computing revenue from 2025-2027 driven by massive AI demand.
OpenAI releases Triton 1.0, an open-source Python-like GPU programming language that enables researchers without CUDA experience to write highly efficient GPU kernels, achieving performance on par with expert-written CUDA code in as few as 25 lines.
DeepSeek releases DeepGEMM, a high-performance CUDA kernel library for LLM computation primitives including FP8/FP4/BF16 GEMMs, fused MoE with overlapped communication, and MQA scoring, compiled at runtime via JIT with no installation-time CUDA compilation required. The library achieves up to 1550 TFLOPS on H800 and matches or exceeds expert-tuned libraries across various matrix shapes.