@googlegemma: “Agentic kernel optimization is the future of on-device inference” @xenovacom used Fable 5 to write kernels that pushed…

X AI KOLs Timeline 07/01/26, 08:26 PM Tools

on-device-inference kernel-optimization webgpu gemma-4 llm-inference agentic-kernel

Summary

Xenova used Fable 5 to write optimized WebGPU kernels that run Gemma 4 at 255 tokens per second on an M4, demonstrating agentic kernel optimization for on-device inference.

“Agentic kernel optimization is the future of on-device inference” @xenovacom used Fable 5 to write kernels that pushed Gemma 4 to a massive 255 tok/s on WebGPU with M4. He shared the demo, so you can try in your browser!! https://t.co/xPuh5OLGEt

Original Article

Cached at: 07/02/26, 02:24 PM

Similar Articles

Gemma 4 E2B running in-browser at 255 tok/s using WebGPU kernels written by Fable 5

Reddit r/LocalLLaMA

Gemma 4 is demonstrated running in-browser via WebGPU at 255 tokens per second, using kernels generated by Fable 5, showcasing efficient on-device inference.

@googlegemma: Gemma 4 up to 3x faster, directly in your phone! Check out the difference Speculative Decoding makes! Multi-Token Predi…

X AI KOLs Timeline

Google's Gemma 4 achieves up to 3x faster inference speeds through speculative decoding and multi-token prediction, enabling efficient on-device deployment.

@hank_aibtc: Amazing! Running Gemma 4 in the browser, on par with ChatGPT?! Completely zero server, zero data upload, offline, pure WebGPU local inference! Xenova has open-sourced all 27 custom WebGPU kernels written by Fable 5: - Gemma 4 E2B (2.3B parameters...)

X AI KOLs Timeline

The post covers Xenova's open-sourcing of 27 custom WebGPU kernels written by Fable 5, which let Gemma 4 run fully offline and locally in the browser at 255 tok/s, and discusses advantages such as privacy and offline use. It also mentions FLUX.2's 3D generation capability.

@googlegemma: Gemma 4 E2B goes super fast on Intel AI PCs thanks to LiteRT NPU support on OpenVINO! 1.3x faster prefill performance o…

X AI KOLs Timeline

Gemma 4 E2B achieves 1.3x faster prefill and 2.8x better performance-per-watt on Intel AI PCs using OpenVINO with LiteRT NPU support, enabling efficient background LLM tasks.

@analogalok: gemma-4-12B-agentic-fable5-composer2.5 V2 is out. the agentic upgrade to the model trained on Fable 5's reasoning. Runn…

X AI KOLs Timeline

A new fine-tuned version of Gemma 4 12B, trained on Fable 5's reasoning, achieves a significant jump in agentic coding benchmarks (from 15% to 55%) and can run locally on an 8GB VRAM GPU using a custom fork of llama.cpp.
