Kuma: compiling PyTorch models into self-contained WebGPU executables [P]

Reddit r/MachineLearning 06/25/26, 08:17 PM Tools

pytorch webgpu compiler runtime deployment machine-learning browser

Summary

Kuma is a compiler/runtime that turns exported PyTorch models into self-contained WebGPU executables, so inference runs directly in the browser with no Python or server dependency.

I've been experimenting with a compiler/runtime project that I'm not entirely sure is a good idea, so I'd love some feedback from people who've worked on deployment systems.

The idea is to compile an exported PyTorch model into a self-contained package that contains:

- graph
- binary weights
- backend kernels (currently WGSL)
- runtime metadata

A lightweight runtime loads that package and executes it directly in the browser with WebGPU. No Python, no server inference, and no dependency on a heavyweight runtime.

Right now the attached demos are just neural video representations because they were easy to test, but the motivation is actually operator networks and scientific ML, where I like the idea of distributing a single portable artifact.

The repo is here: https://github.com/Slater-Victoroff/Kuma

I'm mostly looking for architectural feedback. Some questions I'm wrestling with:

- Is embedding backend kernels in the artifact a terrible idea?
- Is this solving a real deployment problem or just reinventing ONNX Runtime?
- Are there existing systems I should study that take a similar approach?
- If you were designing a deployment format today, what would you change?

I'd especially appreciate thoughts from people who've worked on ONNX, IREE, TVM, ExecuTorch, MLIR, or similar compiler/runtime projects.
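To make the "single portable artifact" idea concrete, here is a minimal sketch of how the four parts listed in the post (graph, binary weights, backend kernels, runtime metadata) could be bundled into one blob and read back. This is not Kuma's actual format: the `KPKG` magic bytes, the section names, the JSON index layout, and the `pack`/`unpack` helpers are all assumptions made up for illustration.

```python
import json
import struct

MAGIC = b"KPKG"  # hypothetical magic bytes, not taken from the Kuma repo


def pack(graph, weights, kernels, metadata):
    """Bundle the parts as: magic | header length (u32 LE) | JSON index | payload."""
    sections = {
        "graph": json.dumps(graph).encode("utf-8"),
        "weights": weights,  # raw bytes, stored as-is
        "kernels": json.dumps(kernels).encode("utf-8"),  # WGSL source keyed by name
        "metadata": json.dumps(metadata).encode("utf-8"),
    }
    index = {}
    payload = b""
    for name, data in sections.items():
        # Record where each section sits inside the payload.
        index[name] = {"offset": len(payload), "length": len(data)}
        payload += data
    header = json.dumps(index).encode("utf-8")
    return MAGIC + struct.pack("<I", len(header)) + header + payload


def unpack(blob):
    """Inverse of pack(); raises ValueError if the magic bytes do not match."""
    if blob[:4] != MAGIC:
        raise ValueError("not a package produced by pack()")
    (header_len,) = struct.unpack_from("<I", blob, 4)
    index = json.loads(blob[8:8 + header_len].decode("utf-8"))
    base = 8 + header_len

    def section(name):
        entry = index[name]
        start = base + entry["offset"]
        return blob[start:start + entry["length"]]

    return {
        "graph": json.loads(section("graph").decode("utf-8")),
        "weights": section("weights"),
        "kernels": json.loads(section("kernels").decode("utf-8")),
        "metadata": json.loads(section("metadata").decode("utf-8")),
    }


# Toy usage: one matmul node, four float32 weights, one placeholder WGSL kernel.
blob = pack(
    graph={"nodes": [{"op": "matmul", "kernel": "matmul_f32"}]},
    weights=struct.pack("<4f", 1.0, 2.0, 3.0, 4.0),
    kernels={"matmul_f32": "@compute @workgroup_size(64) fn main() {}"},
    metadata={"format_version": 1, "backend": "wgsl"},
)
package = unpack(blob)
```

Storing an offset/length index ahead of the payload means a loader could slice out the weights without parsing them, which is one possible reason to keep kernels and weights in the same file. Whether embedding the kernels is wise is exactly the open question the post asks.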

