webgpu

#webgpu

@googlegemma: “Agentic kernel optimization is the future of on-device inference” @xenovacom used Fable 5 to write kernels that pushed…

X AI KOLs Timeline ↗ · 2d ago Cached

Xenova used Fable 5 to write optimized WebGPU kernels that run Gemma 4 at 255 tokens per second on an M4, demonstrating agentic kernel optimization for on-device inference.


@tom_doerr: Modular GraphRAG implementation in Rust with WebGPU acceleration support. https://github.com/automataIA/graphrag-rs…

X AI KOLs Timeline ↗ · 6d ago Cached

A modular, high-performance Rust implementation of GraphRAG (Graph-based Retrieval Augmented Generation) with support for WebGPU acceleration and three deployment architectures: server-only, WASM-only (client-side), and hybrid.


@victormustar: 300tok/s on mobile is insane... open source must win

X AI KOLs Following ↗ · 2026-06-26 Cached

Open source AI inference reaches 300 tok/s on mobile, with a WebGPU framework pushing Liquid AI's LFM2.5 230M to 1,400 tok/s in browser.


Kuma: compiling PyTorch models into self-contained WebGPU executables [P]

Reddit r/MachineLearning ↗ · 2026-06-25

Kuma is a compiler/runtime that compiles exported PyTorch models into self-contained WebGPU executables, enabling direct browser inference without Python or server dependencies.


LFM2.5 230M running in-browser at 1,400 tok/s using custom WebGPU kernels

Reddit r/LocalLLaMA ↗ · 2026-06-25

LFM2.5 230M model achieves 1,400 tokens per second in-browser using custom WebGPU kernels, demonstrating efficient local inference.


@yoheinakajima: who wants to help eyal poke holes in this approach to run LLM inference... in browser?

X AI KOLs Following ↗ · 2026-06-25 Cached

Eyal Toledano built an LLM inference engine using pure WebGPU/WGSL, running on-device in browser and Node without API keys, and is seeking peer review.


SDXL running locally in the browser on WebGPU, open-source

Reddit r/LocalLLaMA ↗ · 2026-06-24

Stable Diffusion XL (SDXL) can now run locally in the browser using WebGPU, enabling high-quality AI image generation directly on-device with open-source code.


Experimenting with the proposed Cross-Origin Storage API in Transformers.js

Hugging Face Blog ↗ · 2026-06-23 Cached

This guest post explores the proposed Cross-Origin Storage API to improve caching of AI model resources in Transformers.js, enabling efficient reuse across origins while maintaining privacy and integrity for in-browser inference.


Porting the Moebius 0.2B image inpainting model to run in the browser with Claude Code

Simon Willison's Blog ↗ · 2026-06-22 Cached

Simon Willison ported the Moebius 0.2B image inpainting model to run in the browser using WebGPU and ONNX Runtime, assisted by Claude Code. The resulting demo allows users to upload images and remove objects via inpainting.


@rauchg: The team cooked on http://vercel.com/home performance. 'Everything the light touches' was optimized Simba. Painting, la…

X AI KOLs Following ↗ · 2026-06-21 Cached

The team at Vercel has significantly optimized the performance of their homepage, using techniques like WebGPU shaders and scrutinizing every frame, and they plan to share the lessons learned.


@hank_aibtc: Amazing! Running Gemma 4 in the browser, on par with ChatGPT?! Completely zero server, zero data upload, offline, pure WebGPU local inference! Xenova has open-sourced all 27 custom WebGPU kernels written by Fable 5: - Gemma 4 E2B (2.3B parameters...)

X AI KOLs Timeline ↗ · 2026-06-18 Cached

The post covers Xenova's open-sourcing of 27 custom WebGPU kernels written by Fable 5, which let Gemma 4 run fully offline in the browser at 255 tok/s, and discusses advantages such as privacy and offline use. It also mentions FLUX.2's 3D generation capability.


Gemma 4 E2B running in-browser at 255 tok/s using WebGPU kernels written by Fable 5

Reddit r/LocalLLaMA ↗ · 2026-06-17

Gemma 4 is demonstrated running in-browser via WebGPU at 255 tokens per second, using kernels generated by Fable 5, showcasing efficient on-device inference.


[browser-use-wasm] I made a browser-use agent that runs in WASM at zero cost

Reddit r/LocalLLaMA ↗ · 2026-06-12

A developer built a self-contained browser-use agent that runs entirely in WASM/WebGPU at zero server cost, enabling full webpage control via natural language prompts.


@PierceZhang34: Train a Small Model in 10 Seconds! First Look at the LLM Training Tool: http://llm.istanbul Recently discovered a super fun open-source style tool website — http://llm.istanbul, which claims to be a WebGPU LLM Workbench, meaning it fully...

X AI KOLs Timeline ↗ · 2026-06-12 Cached

Introduces llm.istanbul, a WebGPU LLM workbench that lets you train small models, train tokenizers, and generate text entirely in the browser, no server required, fully local.


ggml-webgpu: Improve prefill speeds for k-quants + refactor matmul for Q4/Q5/Q8 and k-quants by yomaytk · Pull Request #24225 · ggml-org/llama.cpp

Reddit r/LocalLLaMA ↗ · 2026-06-09 Cached

Improves prefill speeds for k-quants and refactors matrix multiplication for Q4/Q5/Q8 and k-quants in llama.cpp's WebGPU backend.


@hank_aibtc: WTF? Image generation has completely changed! PrismML just released Bonsai Image 4B — a 1-bit binary and ternary quantized diffusion model! - Model is only ~3GB (1-bit version even compressed to 0.93GB), while the same-parameter FLUX.2 Klein 4B requires...

X AI KOLs Timeline ↗ · 2026-05-29 Cached

PrismML has released Bonsai Image 4B, a 1-bit binary and ternary quantized diffusion model. It is only about 3GB (0.93GB for the 1-bit version), which the post describes as over 8x compression relative to the same-parameter FLUX.2 Klein 4B at 16GB, and it can run fully locally in the browser.


@nicodotdev: Everything you always wanted to know about Transformers.js, in one video. I made a deep dive into how AI models run fro…

X AI KOLs Following ↗ · 2026-05-27 Cached

A deep dive video explaining how AI models run from JavaScript using Transformers.js, covering tensors, ONNX, quantization, WebGPU/WASM, and more.


@hank_aibtc: Holy crap, OpenAI finally did something good!!! The first open-source model of 2026 is here – Privacy Filter, released under Apache 2.0! A 1.5B parameter PII detection tool, specifically for redacting names, addresses, phone numbers, emails, IDs, etc. from text…

X AI KOLs Timeline ↗ · 2026-05-27 Cached

OpenAI open-sourced the Privacy Filter model (1.5B parameters, Apache 2.0), which runs locally in the browser via WebGPU to detect personal information without uploading data, protecting privacy.


PrismML just released Binary and Ternary Bonsai Image 4B: 1-bit/ternary text-to-image diffusion transformers that can even run 100% locally in your browser on WebGPU.

Reddit r/LocalLLaMA ↗ · 2026-05-26

PrismML released Bonsai Image 4B models in binary and ternary quantized versions under the Apache-2.0 license. At only about 3GB, they allow text-to-image generation to run locally in a browser via WebGPU.


@ggerganov: Highlighting the new WebGPU backend in llama.cpp/ggml The work to bring full-fledged WebGPU support in llama.cpp starte…

X AI KOLs Following ↗ · 2026-05-22 Cached

The post highlights the new WebGPU backend in llama.cpp/ggml, which enables GPU-accelerated local AI model inference in browsers. It was developed by Reese Levine and team at UCSC over the past year and a half.
