CUDA-oxide: NVIDIA's experimental Rust-to-CUDA compiler

Hacker News Top Tools

Summary

CUDA-oxide is an experimental Rust-to-CUDA compiler developed by NVIDIA that enables writing safe GPU kernels in idiomatic Rust, compiling directly to PTX without requiring domain-specific languages or foreign bindings.


Cached at: 05/11/26, 06:56 PM

# The cuda-oxide Book — cuda-oxide

Source: [https://nvlabs.github.io/cuda-oxide/index.html](https://nvlabs.github.io/cuda-oxide/index.html)

[![cuda-oxide logo](https://nvlabs.github.io/cuda-oxide/_images/logo.png)](https://nvlabs.github.io/cuda-oxide/_images/logo.png)

**cuda-oxide** is an experimental Rust-to-CUDA compiler that lets you write (SIMT) GPU kernels in safe(ish), idiomatic Rust. It compiles standard Rust code directly to PTX — no DSLs, no foreign language bindings, just Rust.

> **Note:** This book assumes familiarity with the Rust programming language, including ownership, traits, and generics. Later chapters on async GPU programming also assume working knowledge of `async`/`.await` and runtimes like tokio. For a refresher, see [The Rust Programming Language](https://doc.rust-lang.org/book/), [Rust by Example](https://doc.rust-lang.org/rust-by-example/), or the [Async Book](https://rust-lang.github.io/async-book/).

---

## Project Status

The v0.1.0 release is an early-stage alpha: **expect bugs, incomplete features, and API breakage** as we work to improve it. We hope you'll try it and help shape its direction by sharing feedback on your experience.
---

## 🚀 Quick start

```rust
use cuda_device::{cuda_module, kernel, thread, DisjointSlice};
use cuda_core::{CudaContext, DeviceBuffer, LaunchConfig};

#[cuda_module]
mod kernels {
    use super::*;

    #[kernel]
    fn vecadd(a: &[f32], b: &[f32], mut c: DisjointSlice<f32>) {
        let idx = thread::index_1d();
        let i = idx.get();
        if let Some(c_elem) = c.get_mut(idx) {
            *c_elem = a[i] + b[i];
        }
    }
}

fn main() {
    let ctx = CudaContext::new(0).unwrap();
    let stream = ctx.default_stream();
    let module = kernels::load(&ctx).unwrap();

    let a = DeviceBuffer::from_host(&stream, &[1.0f32; 1024]).unwrap();
    let b = DeviceBuffer::from_host(&stream, &[2.0f32; 1024]).unwrap();
    let mut c = DeviceBuffer::<f32>::zeroed(&stream, 1024).unwrap();

    module
        .vecadd(&stream, LaunchConfig::for_num_elems(1024), &a, &b, &mut c)
        .unwrap();

    let result = c.to_host_vec(&stream).unwrap();
    assert_eq!(result[0], 3.0);
}
```

Build and run with `cargo oxide run vecadd` after installing the [prerequisites](https://nvlabs.github.io/cuda-oxide/getting-started/installation.html).

> **Note:** `#[cuda_module]` embeds the generated device artifact into the host binary and generates a typed `kernels::load` function plus one launch method per kernel. The lower-level `load_kernel_module` and `cuda_launch!` APIs remain available when you need to load a specific sidecar artifact or build custom launch code.

---

## Why cuda-oxide?

🦀 **Rust on the GPU**
Write GPU kernels with Rust's type system and ownership model. Safety is a first-class goal, but GPUs have subtleties — read about [the safety model](https://nvlabs.github.io/cuda-oxide/gpu-safety/the-safety-model.html).

💎 **A SIMT Compiler**
Not a DSL. A custom rustc codegen backend that compiles pure Rust to PTX.

⚡ **Async Execution**
Compose GPU work as lazy `DeviceOperation` graphs. Schedule across stream pools. Await results with `.await`.
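The quick-start kernel leans on two pieces of launch arithmetic: `thread::index_1d()` yields a per-thread global element index, and `LaunchConfig::for_num_elems(1024)` sizes the grid to cover all elements. Under the standard CUDA SIMT convention, the 1-D global index is `blockIdx.x * blockDim.x + threadIdx.x`, and covering N elements takes ceil(N / block_size) blocks. A minimal host-side sketch of that arithmetic in plain Rust (the functions `global_index_1d` and `grid_blocks_for` are illustrative helpers, not part of the cuda-oxide API, and whether the crate uses exactly this scheme internally is an assumption):

```rust
/// Global 1-D element index under the usual CUDA convention:
/// one thread per element, indexed as block * block_dim + thread.
fn global_index_1d(block_idx: u32, block_dim: u32, thread_idx: u32) -> u32 {
    block_idx * block_dim + thread_idx
}

/// Number of blocks needed to cover `num_elems` elements with
/// `block_size` threads per block (ceiling division).
fn grid_blocks_for(num_elems: u32, block_size: u32) -> u32 {
    num_elems.div_ceil(block_size)
}

fn main() {
    // 1024 elements with 256-thread blocks fit exactly into 4 blocks.
    assert_eq!(grid_blocks_for(1024, 256), 4);
    // 1000 elements still need 4 blocks; the trailing 24 threads fall
    // outside the data, which is why the kernel bounds-checks with
    // `c.get_mut(idx)` instead of writing unconditionally.
    assert_eq!(grid_blocks_for(1000, 256), 4);
    // Thread 5 of block 2 with 256-thread blocks handles element 517.
    assert_eq!(global_index_1d(2, 256, 5), 517);
}
```

Because the grid is rounded up, some threads in the last block have no element to process; the `if let Some(...)` guard in `vecadd` masks those threads off, which is the conventional pattern for element-wise CUDA kernels.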

Similar Articles

The cuda-oxide Book

Lobsters Hottest

cuda-oxide is an experimental Rust-to-CUDA compiler that allows developers to write safe, idiomatic Rust GPU kernels that compile directly to PTX.

A hackable compiler to generate efficient fused GPU kernels for AI models [P]

Reddit r/MachineLearning

The author presents a custom, hackable ML compiler written in Python that lowers LLMs to optimized CUDA kernels through a multi-stage IR pipeline, achieving performance competitive with or superior to PyTorch on specific operations. The article details the compiler's optimization passes, lowering rules, and CLI usage for generating efficient fused GPU kernels.