@npashi: Finally able to talk about what I've been heads-down on for 6 months at @nvidia. We just open-sourced cuda-oxide, an experimental rustc backend for writing CUDA kernels in pure Rust.
Summary
NVIDIA has open-sourced cuda-oxide, an experimental rustc backend that allows developers to write CUDA kernels directly in pure Rust without DSLs, FFI, or source-to-source translation.
Finally able to talk about what I’ve been heads-down on for 6 months at @nvidia 🦀⚡
We just open-sourced cuda-oxide — an experimental rustc backend that lets you write CUDA kernels in pure Rust.
No DSLs. No FFI. No source-to-source step. Single source.
Short🧵👇 https://t.co/YRERctlysd
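The thread itself doesn't show any code, so here is a minimal sketch of what "single source" could mean in practice. Everything below is assumed for illustration: the `#[kernel]` attribute, the `thread_index` intrinsic, the `DeviceBuffer` type, and the `launch!` macro are hypothetical placeholders, not cuda-oxide's actual API.

```rust
// Hypothetical sketch of single-source Rust CUDA. The #[kernel] attribute,
// thread_index(), DeviceBuffer, and launch! are placeholders assumed for
// illustration; the thread does not show cuda-oxide's real interface.

// Device side: a SAXPY kernel written as ordinary Rust. Under a rustc CUDA
// backend, a function like this would be compiled to PTX instead of host
// machine code: no DSL, no FFI shim, no source-to-source step.
#[kernel]
fn saxpy(a: f32, x: &[f32], y: &mut [f32]) {
    let i = thread_index(); // placeholder for a block/thread index intrinsic
    if i < y.len() {
        y[i] = a * x[i] + y[i];
    }
}

// Host side of the same source file: allocate device memory, launch the
// kernel, and copy the result back.
fn main() {
    let x = vec![1.0f32; 1 << 20];
    let mut y = vec![2.0f32; 1 << 20];

    let dx = DeviceBuffer::from_slice(&x);     // placeholder device alloc
    let mut dy = DeviceBuffer::from_slice(&y);

    // Placeholder launch syntax: 4096 blocks of 256 threads each.
    launch!(saxpy, grid = 4096, block = 256, (2.0f32, &dx, &mut dy));

    dy.copy_to_host(&mut y);                   // placeholder copy-back
    assert_eq!(y[0], 4.0); // 2.0 * 1.0 + 2.0
}
```

The point of "single source", as the thread frames it, is that host and device code share one crate, one type system, and one compiler invocation, much as nvcc treats a .cu file, rather than going through bindgen-style FFI or a separate kernel language.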
Similar Articles
cuda-oxide: an experimental Rust-to-CUDA compiler
cuda-oxide is an experimental Rust-to-CUDA compiler backend released by NVIDIA, enabling pure Rust GPU kernel development without foreign language bindings.
The cuda-oxide Book
cuda-oxide is an experimental Rust-to-CUDA compiler that allows developers to write safe, idiomatic Rust GPU kernels that compile directly to PTX.
How (and why) we rewrote our production C++ frontend infrastructure in Rust
NearlyFreeSpeech.NET rewrote nfsncore, the production C++ frontend infrastructure that handles routing, caching, and access control for all incoming requests, in Rust. The migration was motivated by Rust's safety guarantees, performance, and ecosystem strength, and by the limitations of the aging C++ codebase.
Introducing Triton: Open-source GPU programming for neural networks
OpenAI releases Triton 1.0, an open-source Python-like GPU programming language that enables researchers without CUDA experience to write highly efficient GPU kernels, achieving performance on par with expert-written CUDA code in as few as 25 lines.
@QingQ77: Pure Rust LLM inference engine with custom CUDA kernels for each hardware × model × quantization combination, achieving higher inference speed than vLLM and TensorRT-LLM. https://github.com/Avarok-Cybersecurity/a…
Atlas is a pure Rust LLM inference engine that delivers faster inference than vLLM and TensorRT-LLM by customizing CUDA kernels for each hardware × model × quantization combination.