@googlegemma: “Agentic kernel optimization is the future of on-device inference” @xenovacom used Fable 5 to write kernels that pushed…
Summary
Xenova used Fable 5 to write optimized WebGPU kernels that run Gemma 4 at 255 tokens per second on an M4, demonstrating agentic kernel optimization for on-device inference.
Cached at: 07/02/26, 02:24 PM
“Agentic kernel optimization is the future of on-device inference”
@xenovacom used Fable 5 to write kernels that pushed Gemma 4 to a massive 255 tok/s on WebGPU with M4. He shared the demo, so you can try in your browser!! https://t.co/xPuh5OLGEt
Similar Articles
Gemma 4 E2B running in-browser at 255 tok/s using WebGPU kernels written by Fable 5
Gemma 4 is demonstrated running in-browser via WebGPU at 255 tokens per second, using kernels generated by Fable 5, showcasing efficient on-device inference.
@googlegemma: Gemma 4 up to 3x faster, directly in your phone! Check out the difference Speculative Decoding makes! Multi-Token Predi…
Google's Gemma 4 achieves up to 3x faster inference speeds through speculative decoding and multi-token prediction, enabling efficient on-device deployment.
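Speculative decoding pairs a cheap draft model with the full target model. The draft proposes several tokens, the target verifies them, and every proposal the target agrees with is kept. The entry above does not describe how Gemma 4 implements this. The following is a minimal, generic sketch of the greedy variant only, assuming toy "models" that map a token list to a next token. The names `greedy_decode`, `speculative_decode`, `draft`, `target` and the parameter `k` are hypothetical, and multi-token prediction is not shown.

```python
from typing import Callable, List

# Hypothetical model interface: maps a token sequence to its greedy next token.
NextToken = Callable[[List[int]], int]


def greedy_decode(model: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Plain autoregressive greedy decoding: one model call per new token."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(model(seq))
    return seq


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[int], n_new: int, k: int = 4) -> List[int]:
    """Greedy speculative decoding (sketch): draft proposes k tokens, target verifies.

    The output is identical to greedy_decode(target, ...). In a real system the
    speedup comes from scoring all k proposals in one batched target forward
    pass; here the target is called per position for clarity.
    """
    seq = list(prompt)
    total_len = len(prompt) + n_new
    while len(seq) < total_len:
        # 1. The cheap draft model proposes up to k tokens autoregressively.
        proposal: List[int] = []
        ctx = list(seq)
        for _ in range(min(k, total_len - len(seq))):
            tok = draft(ctx)
            proposal.append(tok)
            ctx.append(tok)
        # 2. The target checks each proposed position in order.
        accepted: List[int] = []
        for tok in proposal:
            expected = target(seq + accepted)
            if expected == tok:
                accepted.append(tok)        # draft agreed with target: keep it
            else:
                accepted.append(expected)   # first mismatch: take target's token
                break
        else:
            # Every proposal was accepted: the target yields one bonus token.
            if len(seq) + len(accepted) < total_len:
                accepted.append(target(seq + accepted))
        seq.extend(accepted)
    return seq
```

With a toy target such as `lambda ctx: sum(ctx) % 10` and a draft that agrees most of the time, `speculative_decode` returns the same sequence as `greedy_decode` while accepting several tokens per round. Sampled (non-greedy) decoding needs a rejection-sampling acceptance rule instead of the equality check used here.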
@hank_aibtc: Amazing! Running Gemma 4 in the browser, on par with ChatGPT?! No server, no data uploads, fully offline: pure local WebGPU inference! Xenova has open-sourced all 27 custom WebGPU kernels written by Fable 5: - Gemma 4 E2B (2.3B parameters...)
The article covers Xenova's open-source release of 27 custom WebGPU kernels that let Gemma 4 run fully offline and locally in the browser at 255 tok/s, and discusses advantages such as privacy and offline use. It also mentions FLUX.2's 3D generation capability.
@googlegemma: Gemma 4 E2B goes super fast on Intel AI PCs thanks to LiteRT NPU support on OpenVINO! 1.3x faster prefill performance o…
Gemma 4 E2B achieves 1.3x faster prefill and 2.8x better performance-per-watt on Intel AI PCs using OpenVINO with LiteRT NPU support, enabling efficient background LLM tasks.
@analogalok: gemma-4-12B-agentic-fable5-composer2.5 V2 is out. the agentic upgrade to the model trained on Fable 5's reasoning. Runn…
A new fine-tuned version of Gemma 4 12B, trained on Fable 5's reasoning, achieves a significant jump in agentic coding benchmarks (from 15% to 55%) and can run locally on an 8GB VRAM GPU using a custom fork of llama.cpp.