A benchmark shows that using vLLM with DFlash speculative decoding boosts Gemma 4 26B inference to ~578 tokens per second on a single RTX 5090, achieving a 2.56x speedup over baseline.
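A setup like the one benchmarked above can be sketched as a `vllm serve` invocation with a speculative-decoding config. This is a minimal sketch, not the benchmark's actual command: the model ID and drafter repo name are placeholders, and the speculative-decoding knobs vary between vLLM versions, so check `vllm serve --help` for your install.

```shell
# Hypothetical invocation -- model and drafter names are placeholders.
# Recent vLLM versions take speculative-decoding settings as a JSON
# --speculative-config; older versions used separate flags instead.
vllm serve google/gemma-4-26b-it \
  --speculative-config '{"model": "zlab/dflash-gemma4-drafter", "num_speculative_tokens": 5}' \
  --gpu-memory-utilization 0.90
```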
Google released a roundup of major AI updates from April 2026, including the Gemma 4 model, Gemini Enterprise Agent Platform, and eighth-generation TPUs announced at Cloud Next '26.
Z-lab released DFlash, a speculative decoding drafter model for Gemma-4-31B-it that uses lightweight block diffusion to draft multiple tokens in parallel, achieving up to 5.8x speedup over autoregressive baseline.
Google DeepMind released Gemma 4 MTP drafters for the Gemma 4 family, enabling significant decoding speedups via speculative decoding while maintaining exact generation quality for low-latency applications.
Google DeepMind releases Gemma 4, a family of open-weights multimodal models featuring Multi-Token Prediction (MTP) for up to 2x decoding speedups, supporting text, image, video, and audio with enhanced reasoning and coding capabilities.
Google DeepMind’s Model Garden now hosts 200+ leading models, including the newly released Gemini 3.1 Pro, Gemini 3.1 Flash Image, Lyria 3, and the open Gemma 4.
NVIDIA and Hugging Face publish a hands-on demo showing Gemma 4 running as a vision-language-action model entirely on the Jetson Orin Nano Super, using local STT/TTS and webcam input.
Personal benchmark shows Qwen3.5-27B Dense and Gemma4-31B Dense fix 100% of 37 test failures, outperforming Gemma4-26B MoE even at 8-bit quantization, while using fewer tokens and less wall-clock time.
Qwen 3.6 35B achieves near-perfect 283/285 line recall on a 108k-token JS file, outperforming Gemma 4 27B (6/16 passes) and fixing long-context weaknesses of earlier Qwen versions.
A user documents how closed models (GPT-4o→5.3, Gemini) degraded and censored Chinese novel translations, while local Gemma 4 31B now outperforms them with natural, uncensored output.
Developer Ivan Fioravanti demonstrates running Andrej Karpathy's autoresearch project locally with a 6-bit quantized Gemma-4-26B model on Apple Silicon, suggesting the Gemma 4 E2B IT variant can be trained successfully.
A user reports that the 3.6 GB Gemma 4 e4b model extracted from Google AI Edge Gallery on Android outperforms larger 3.7 GB Unsloth versions and community ports, raising questions about hidden optimizations.
Gemma 4’s vision performance is bottlenecked by low default token budgets; raising --image-max-tokens to 2240 in llama.cpp unlocks state-of-the-art OCR and detail recognition at the cost of ~14 GB extra VRAM.
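The workaround above amounts to one extra flag on the server launch. A sketch of the invocation, with placeholder model/projector filenames; `--image-max-tokens` is the flag named in the post, but llama.cpp flags change between releases, so verify it against `llama-server --help` on your build:

```shell
# Placeholder GGUF paths -- substitute your own downloads.
# Raising the image token budget trades ~14 GB extra VRAM (per the
# report above) for better OCR and fine-detail recognition.
llama-server \
  -m gemma-4-27b-it-Q4_K_M.gguf \
  --mmproj mmproj-gemma-4.gguf \
  --image-max-tokens 2240 \
  -ngl 99
```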
A user reports that Google's Gemma-4-E2B local/offline model has overly aggressive safety filters, refusing basic survival information such as first aid, water purification, and emergency repairs, which makes it unsuitable for emergency-preparedness scenarios where internet access is unavailable.
An ML team documents practical challenges encountered while fine-tuning and deploying Gemma-4, including incompatibilities with PEFT, SFTTrainer, and DeepSpeed ZeRO-3 and missing runtime LoRA serving support, along with workarounds for each issue.
A practical guide for audio transcription on macOS using Gemma 4 E2B model with MLX and mlx-vlm, including a uv run recipe and demonstration of the workflow.
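A recipe in the spirit of the guide above might look like the following. Treat every detail as an assumption: the model repo, the `--audio` flag, and other options are illustrative, so check `python -m mlx_vlm.generate --help` in your mlx-vlm version before relying on them.

```shell
# Hypothetical uv run recipe -- model repo and flags are assumptions,
# not confirmed from the guide. uv resolves mlx-vlm (and MLX) into an
# ephemeral environment, so no prior install is needed.
uv run --with mlx-vlm \
  python -m mlx_vlm.generate \
  --model mlx-community/gemma-4-e2b-it-4bit \
  --prompt "Transcribe this recording." \
  --audio meeting.wav \
  --max-tokens 2048
```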
NVIDIA and Google collaborate to optimize Gemma 4 models for local deployment across RTX GPUs, DGX Spark, and Jetson devices, enabling efficient on-device agentic AI with support for reasoning, coding, multimodal capabilities, and 35+ languages.
Google DeepMind releases Gemma 4, a frontier multimodal model family available on Hugging Face with Apache 2 licensing, optimized for on-device deployment and supported by various inference libraries.
Google DeepMind releases Gemma 4, a family of open-weight multimodal models ranging from 2.3B to 31B parameters with support for text, image, video, and audio inputs. The models feature 256K context windows, MoE and dense architectures, enhanced reasoning capabilities, and are optimized for deployment across devices from mobile to servers.