Working on a cgo-free CUDA binding in Go for ML stuff Week 3 - open source [P]

Reddit r/MachineLearning 05/24/26, 12:41 PM Tools

go cuda ml gpu open-source binding no-cgo

Summary

Developer shares progress on gocudrv, a cgo-free CUDA binding for Go, explaining thread affinity challenges and showing example code.

At our work we use CUDA in Rust since the company switched to it recently. Rust has pretty good Driver API bindings but it made me wonder why the hell we cant have something decent in Go without cgo. I mostly build ML tools in the last month and Go is my main language for pretty much everything. Problem is most Go CUDA projects still need cgo and the full toolkit at build time. That breaks cross compilation and makes Docker images huge which sucks when working on machine learning projects. So last month I started messing around with a proof of concept that loads [libcuda.so](http://libcuda.so) at runtime using purego. No cgo at all. Biggest pain was thread affinity. CUDA keeps context per thread so goroutines switching around kept breaking things. I built a simple executor that locks an OS thread with runtime.LockOSThread and funnels all calls through a channel. Heres roughly what using it looks like right now: func run() error { cuda.Init() dev, _ := cuda.GetDevice(0) ctx, _ := dev.Primary() defer ctx.Close() a, _ := cuda.Alloc[float32](ctx, 1024) b, _ := cuda.Alloc[float32](ctx, 1024) c, _ := cuda.Alloc[float32](ctx, 1024) stream, _ := ctx.NewStream() start, _ := ctx.NewEvent() stop, _ := ctx.NewEvent() start.Record(stream) fn.LaunchOn(bg, stream, cfg, cuda.Arg(a), cuda.Arg(b), cuda.Arg(c), cuda.ArgValue(int32(1024)), ) stop.Record(stream) stop.Synchronize() duration, _ := start.Elapsed(stop) fmt.Printf("GPU time: %v\n", duration) return nil } On my 4070 Ti a 10M vector add showed CPU timer at like 160us but actual GPU event timing was 434us. That difference surprised me. The project is still super early and moves slow cuz i only code on weekends and im a total noob with CUDA. Slowly adding Graphs and multi gpu support. THIS IS SO early , so treat it more like a learning cuda repo, but im having fun learning cuda. Thought some of you might find it interesting too. repo is [github.com/eitamring/gocudrv](http://github.com/eitamring/gocudrv) if you wanna take a look. Would be cool if anyone with 5xxx series cards wants to try it wink wink

Original Article

Working on a cgo-free CUDA binding in Go for ML stuff Week 3 - open source [P]

Similar Articles

@npashi: Finally able to talk about what I've been heads-down on for 6 months at @nvidia We just open-sourced cuda-oxide — an ex…

Writing a bindless GPU abstraction layer

Go-Flavored Concurrency in C

Notes from Optimizing CPU-Bound Go Hot Paths

@reprompting: https://x.com/reprompting/status/2074133435401064486

Submit Feedback

Similar Articles

@npashi: Finally able to talk about what I've been heads-down on for 6 months at @nvidia We just open-sourced cuda-oxide — an ex…

Writing a bindless GPU abstraction layer

Notes from Optimizing CPU-Bound Go Hot Paths

@reprompting: https://x.com/reprompting/status/2074133435401064486