Tag
Frontier models have become so proficient at generating web apps that the author now benchmarks them on building a WebGPU water renderer from scratch, comparing Opus-4.8 and a Fable-5 orchestrator with a GPT-5.5 implementer.
Xenova used Fable 5 to write optimized kernels that reach 255 tokens per second for Gemma 4 on WebGPU on an M4, demonstrating agentic kernel optimization for on-device inference.
A modular, high-performance Rust implementation of GraphRAG (Graph-based Retrieval Augmented Generation) with support for WebGPU acceleration and three deployment architectures: server-only, WASM-only (client-side), and hybrid.
Open-source AI inference reaches 300 tok/s on mobile, with a WebGPU framework pushing Liquid AI's LFM2.5 230M to 1,400 tok/s in the browser.
Kuma is a compiler/runtime that compiles exported PyTorch models into self-contained WebGPU executables, enabling direct browser inference without Python or server dependencies.
The LFM2.5 230M model achieves 1,400 tokens per second in the browser using custom WebGPU kernels, demonstrating efficient local inference.
Eyal Toledano built an LLM inference engine in pure WebGPU/WGSL that runs on-device in the browser and in Node without API keys, and is seeking peer review.
Stable Diffusion XL (SDXL) can now run locally in the browser using WebGPU, enabling high-quality AI image generation directly on-device with open-source code.
This guest post explores the proposed Cross-Origin Storage API to improve caching of AI model resources in Transformers.js, enabling efficient reuse across origins while maintaining privacy and integrity for in-browser inference.
Simon Willison ported the Moebius 0.2B image inpainting model to run in the browser using WebGPU and ONNX Runtime, assisted by Claude Code. The resulting demo allows users to upload images and remove objects via inpainting.
The team at Vercel has significantly optimized the performance of their homepage, using techniques like WebGPU shaders and scrutinizing every frame, and they plan to share the lessons learned.
The article introduces Xenova's open-sourcing of 27 custom WebGPU kernels, enabling Gemma 4 to run fully offline and locally in the browser at 255 tok/s, and discusses advantages like privacy and offline use. It also mentions FLUX.2's 3D generation capability.
Gemma 4 is demonstrated running in-browser via WebGPU at 255 tokens per second, using kernels generated by Fable 5, showcasing efficient on-device inference.
A developer built a self-contained browser-use agent that runs entirely in WASM/WebGPU at zero server cost, enabling full webpage control via natural language prompts.
Introduces llm.istanbul, a WebGPU LLM workbench for training small models, training tokenizers, and generating text entirely in the browser: fully local, with no server required.
Improves prefill speeds for k-quants and refactors matrix multiplication for Q4/Q5/Q8 and k-quants in llama.cpp's WebGPU backend.
PrismML has released Bonsai Image 4B, a diffusion model in 1-bit binary and ternary quantized versions. It is only 3GB in size (0.93GB for the 1-bit version), a stated compression of over 8x compared to the same-parameter FLUX.2 Klein 4B at 16GB, and it fully supports local execution in the browser.
A deep dive video explaining how AI models run from JavaScript using Transformers.js, covering tensors, ONNX, quantization, WebGPU/WASM, and more.
OpenAI open-sourced the Privacy Filter model (1.5B parameters, Apache 2.0), which runs locally in the browser via WebGPU to detect personal information without uploading data, protecting privacy.
PrismML released the Bonsai Image 4B models in binary and ternary quantized versions under the Apache-2.0 license, enabling text-to-image generation to run locally in a browser via WebGPU at a size of only 3GB.