CUDA-oxide: NVIDIA's official Rust-to-CUDA compiler

Hacker News Top 2026/05/11 15:55 Tools

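Summary

CUDA-oxide is an experimental Rust-to-CUDA compiler developed by NVIDIA. It supports writing safe GPU kernels in idiomatic Rust and compiles them directly to PTX, with no domain-specific language or external bindings required.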



Cached at: 2026/05/11 18:56

The cuda-oxide Book: cuda-oxide

Source: https://nvlabs.github.io/cuda-oxide/index.html

**cuda-oxide** is an experimental Rust-to-CUDA compiler that lets you write (SIMT) GPU kernels in safe(ish), idiomatic Rust. It compiles standard Rust code directly to PTX: no DSLs, no foreign language bindings, just Rust.

Note: This book assumes familiarity with the Rust programming language, including ownership, traits, and generics. Later chapters on async GPU programming also assume working knowledge of `async`/`.await` and of runtimes like tokio. For a refresher, see The Rust Programming Language (https://doc.rust-lang.org/book/), Rust by Example (https://doc.rust-lang.org/rust-by-example/), or the Async Book (https://rust-lang.github.io/async-book/).

---

## Project status

The v0.1.0 release is an early alpha: **expect bugs, incomplete features, and breaking API changes** while we work to improve it. We hope you will try it out and help shape its direction by sharing feedback on your experience.

---

## 🚀 Quick start

```
use cuda_device::{cuda_module, kernel, thread, DisjointSlice};
use cuda_core::{CudaContext, DeviceBuffer, LaunchConfig};

#[cuda_module]
mod kernels {
    use super::*;

    // Device code: each thread adds one pair of elements.
    #[kernel]
    fn vecadd(a: &[f32], b: &[f32], mut c: DisjointSlice<f32>) {
        let idx = thread::index_1d();
        let i = idx.get();
        // Only write when this thread's index maps to an element of `c`.
        if let Some(c_elem) = c.get_mut(idx) {
            *c_elem = a[i] + b[i];
        }
    }
}

fn main() {
    // Host code: set up device 0 and load the embedded device module.
    let ctx = CudaContext::new(0).unwrap();
    let stream = ctx.default_stream();
    let module = kernels::load(&ctx).unwrap();

    // Upload the inputs and allocate a zeroed output buffer.
    let a = DeviceBuffer::from_host(&stream, &[1.0f32; 1024]).unwrap();
    let b = DeviceBuffer::from_host(&stream, &[2.0f32; 1024]).unwrap();
    let mut c = DeviceBuffer::<f32>::zeroed(&stream, 1024).unwrap();

    // Launch through the typed method generated for the `vecadd` kernel.
    module
        .vecadd(&stream, LaunchConfig::for_num_elems(1024), &a, &b, &mut c)
        .unwrap();

    // Copy the result back to the host and check it.
    let result = c.to_host_vec(&stream).unwrap();
    assert_eq!(result[0], 3.0);
}
```

After installing the prerequisites (https://nvlabs.github.io/cuda-oxide/getting-started/installation.html), build and run with `cargo oxide run vecadd`.

Note: `#[cuda_module]` embeds the generated device artifact in the host binary and generates a typed `kernels::load` function plus one launch method per kernel. The lower-level `load_kernel_module` and `cuda_launch!` APIs remain available when you need to load a specific sidecar artifact or build custom launch code.

---

## Why cuda-oxide?

🦀 **Rust on the GPU.** Write GPU kernels using Rust's type system and ownership model. Safety is a first-class goal, but GPUs have their subtleties; read the safety model (https://nvlabs.github.io/cuda-oxide/gpu-safety/the-safety-model.html).

💎 **A SIMT compiler, not a DSL.** A custom rustc codegen backend that compiles pure Rust to PTX.

⚡ **Async execution.** Compose GPU work as lazy `DeviceOperation` graphs. Schedule across stream pools. Wait for results with `.await`.
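
In the quick-start kernel, each thread writes only the element it obtains from `c.get_mut(idx)`, so no two threads touch the same output slot. The following is a rough CPU-only analogy of that idea in plain standard-library Rust, not cuda-oxide code: the function name `vecadd_parallel` and the `block` parameter are hypothetical, and OS threads stand in for GPU thread blocks. It only illustrates how Rust's borrow rules can hand each worker a non-overlapping mutable view of the output.

```rust
use std::thread;

/// CPU analogy of the vecadd kernel (hypothetical helper, not a cuda-oxide API).
/// Splits `c` into non-overlapping chunks of `block` elements and lets one
/// OS thread fill each chunk with `a[i] + b[i]`.
fn vecadd_parallel(a: &[f32], b: &[f32], c: &mut [f32], block: usize) {
    assert!(block > 0, "block size must be non-zero");
    assert!(a.len() >= c.len() && b.len() >= c.len(), "inputs shorter than output");

    thread::scope(|s| {
        // `chunks_mut` yields disjoint `&mut [f32]` views, so the borrow
        // checker guarantees no two threads can write the same element.
        for (block_idx, c_block) in c.chunks_mut(block).enumerate() {
            let start = block_idx * block;
            s.spawn(move || {
                for (offset, c_elem) in c_block.iter_mut().enumerate() {
                    let i = start + offset; // global index, like a 1D thread index
                    *c_elem = a[i] + b[i];
                }
            });
        }
        // All spawned threads are joined when the scope ends.
    });
}
```

Calling `vecadd_parallel(&a, &b, &mut c, 256)` with inputs filled with `1.0` and `2.0` leaves every element of `c` equal to `3.0`, matching the check in the quick start. The last chunk may be shorter than `block`, which plays the same role as the bounds guard in the kernel.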

Similar articles

cuda-oxide: an experimental Rust-to-CUDA compiler

Lobsters Hottest

cuda-oxide is an experimental Rust-to-CUDA compiler backend released by NVIDIA. It supports GPU kernel development in pure Rust, with no external language bindings.

The cuda-oxide Book

Lobsters Hottest

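cuda-oxide is an experimental Rust-to-CUDA compiler that lets developers write safe, idiomatic Rust GPU kernels and compile them directly to PTX.

@npashi: I can finally talk about what I have been heads-down on at @nvidia for the past 6 months. We just open-sourced cuda-oxide, an experimental…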

X AI KOLs Timeline

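NVIDIA has open-sourced cuda-oxide, an experimental rustc backend that lets developers write CUDA kernels directly in pure Rust, with no DSL, FFI, or source-to-source translation.

@QingQ77: An LLM inference engine implemented in pure Rust, with CUDA kernels customized for each hardware × model × quantization combination, reaching higher inference speed than vLLM and TensorRT-LLM. https://github.com/Avarok-Cybersecurity/a…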

X AI KOLs Timeline

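Atlas is an LLM inference engine implemented in pure Rust. By customizing CUDA kernels for each hardware × model × quantization combination, it achieves faster inference than vLLM and TensorRT-LLM.

A customizable compiler for generating efficient fused GPU kernels for AI models [P]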

Reddit r/MachineLearning

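The author presents an ML compiler written in Python that is highly customizable and easy to modify. It converts LLMs into optimized CUDA kernels through a multi-level IR pipeline and matches or exceeds PyTorch performance on certain operations. The post details the compiler's optimization passes, lowering rules, and the CLI usage for generating efficient fused GPU kernels.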