Working on a cgo-free CUDA binding in Go for ML stuff Week 3 - open source [P]

Reddit r/MachineLearning Tools

Summary

Developer shares progress on gocudrv, a cgo-free CUDA binding for Go, explaining thread affinity challenges and showing example code.

At our work we use CUDA in Rust since the company switched to it recently. Rust has pretty good Driver API bindings but it made me wonder why the hell we cant have something decent in Go without cgo. I mostly build ML tools in the last month and Go is my main language for pretty much everything. Problem is most Go CUDA projects still need cgo and the full toolkit at build time. That breaks cross compilation and makes Docker images huge which sucks when working on machine learning projects. So last month I started messing around with a proof of concept that loads [libcuda.so](http://libcuda.so) at runtime using purego. No cgo at all. Biggest pain was thread affinity. CUDA keeps context per thread so goroutines switching around kept breaking things. I built a simple executor that locks an OS thread with runtime.LockOSThread and funnels all calls through a channel. Heres roughly what using it looks like right now: func run() error { cuda.Init() dev, _ := cuda.GetDevice(0) ctx, _ := dev.Primary() defer ctx.Close() a, _ := cuda.Alloc[float32](ctx, 1024) b, _ := cuda.Alloc[float32](ctx, 1024) c, _ := cuda.Alloc[float32](ctx, 1024) stream, _ := ctx.NewStream() start, _ := ctx.NewEvent() stop, _ := ctx.NewEvent() start.Record(stream) fn.LaunchOn(bg, stream, cfg, cuda.Arg(a), cuda.Arg(b), cuda.Arg(c), cuda.ArgValue(int32(1024)), ) stop.Record(stream) stop.Synchronize() duration, _ := start.Elapsed(stop) fmt.Printf("GPU time: %v\n", duration) return nil } On my 4070 Ti a 10M vector add showed CPU timer at like 160us but actual GPU event timing was 434us. That difference surprised me. The project is still super early and moves slow cuz i only code on weekends and im a total noob with CUDA. Slowly adding Graphs and multi gpu support. THIS IS SO early , so treat it more like a learning cuda repo, but im having fun learning cuda. Thought some of you might find it interesting too. repo is [github.com/eitamring/gocudrv](http://github.com/eitamring/gocudrv) if you wanna take a look. Would be cool if anyone with 5xxx series cards wants to try it wink wink
Original Article

Similar Articles

Notes from Optimizing CPU-Bound Go Hot Paths

Hacker News Top

The article discusses performance optimization techniques for CPU-bound Go code, highlighting the limitations of generics and interface abstractions due to lack of inlining, and advocates for code duplication in hot paths. It provides examples from a Brotli port and deep-dive benchmarking.

The cuda-oxide Book

Lobsters Hottest

cuda-oxide is an experimental Rust-to-CUDA compiler that allows developers to write safe, idiomatic Rust GPU kernels that compile directly to PTX.