Kuma: compiling PyTorch models into self-contained WebGPU executables [P]

Reddit r/MachineLearning Tools

Summary

Kuma is a compiler/runtime that compiles exported PyTorch models into self-contained WebGPU executables, enabling direct browser inference without Python or server dependencies.

I've been experimenting with a compiler/runtime project that I'm not entirely sure is a good idea, so I'd love some feedback from people who've worked on deployment systems.

The idea is to compile an exported PyTorch model into a self-contained package that contains:

- graph
- binary weights
- backend kernels (currently WGSL)
- runtime metadata

A lightweight runtime loads that package and executes it directly in the browser with WebGPU. No Python, no server inference, and no dependency on a heavyweight runtime.

Right now the attached demos are just neural video representations because they were easy to test, but the motivation is actually operator networks and scientific ML, where I like the idea of distributing a single portable artifact.

The repo is here: https://github.com/Slater-Victoroff/Kuma

I'm mostly looking for architectural feedback. Some questions I'm wrestling with:

- Is embedding backend kernels in the artifact a terrible idea?
- Is this solving a real deployment problem or just reinventing ONNX Runtime?
- Are there existing systems I should study that take a similar approach?
- If you were designing a deployment format today, what would you change?

I'd especially appreciate thoughts from people who've worked on ONNX, IREE, TVM, ExecuTorch, MLIR, or similar compiler/runtime projects.
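
To make the "single portable artifact" idea easier to discuss, here is a minimal sketch of how the four parts named above (graph, weights, kernels, metadata) could be packed into one file and read back. This is purely illustrative and is not Kuma's actual format: the `KPKG` magic bytes, the section names, the JSON index layout, and the `pack_artifact` / `unpack_artifact` helpers are all assumptions made up for this example.

```python
import json
import struct

MAGIC = b"KPKG"  # hypothetical magic bytes, not Kuma's real format


def pack_artifact(graph: dict, weights: bytes, kernels: dict, metadata: dict) -> bytes:
    """Pack all sections into one blob: magic, index length, JSON index, payload."""
    sections = {
        "graph": json.dumps(graph).encode("utf-8"),
        "weights": weights,
        "metadata": json.dumps(metadata).encode("utf-8"),
    }
    # Each kernel's WGSL source text is stored as its own named section.
    for name, source in kernels.items():
        sections[f"kernel/{name}"] = source.encode("utf-8")

    index = {}
    payload = b""
    for name, data in sections.items():
        # Offsets are relative to the start of the payload region.
        index[name] = {"offset": len(payload), "length": len(data)}
        payload += data

    header = json.dumps(index).encode("utf-8")
    return MAGIC + struct.pack("<I", len(header)) + header + payload


def unpack_artifact(blob: bytes) -> dict:
    """Return a mapping of section name to raw bytes."""
    if blob[:4] != MAGIC:
        raise ValueError("not a package in this (hypothetical) format")
    (header_len,) = struct.unpack_from("<I", blob, 4)
    index = json.loads(blob[8:8 + header_len])
    base = 8 + header_len
    return {
        name: blob[base + entry["offset"]: base + entry["offset"] + entry["length"]]
        for name, entry in index.items()
    }
```

A browser runtime would do the equivalent of `unpack_artifact` on a fetched buffer, then hand each kernel section to WebGPU as shader source. Keeping the kernels as named sections next to the graph is the design choice the first question above is asking about: the file is self-describing, but it is also tied to one backend's shader language.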
