@mtschannen: For the past years my research focus was on unifying models and training paradigms across modalities. Today I'm excited…
Summary
Google DeepMind researcher announces the release of Gemma 4 12B, a dense encoder-free model that processes text, image, and audio inputs, continuing work on unifying models across modalities.
Cached at: 06/03/26, 09:55 PM
Over the past years, my research has focused on unifying models and training paradigms across modalities. Today I’m excited that we’re releasing our latest model aligned with this theme:
Gemma 4 12B, a dense encoder-free model which processes raw text, image, and audio inputs!
1/
Despite being encoder-free, Gemma 4 12B sits on the Pareto frontier of the Gemma 4 family. And in contrast to many other open-weight encoder-free models, which focus on vision-language tasks, Gemma 4 12B also shows strong performance on text-focused and agentic tasks.
2/
Even more importantly, Gemma 4 12B fits on GPU laptops with 16 GB of VRAM, so it’s ideal for building local multimodal applications.
Find out more:
- The Keyword: https://blog.google/innovation-and-ai/technology/developers-tools/introducing-gemma-4-12B/…
- Dev Blog: https://developers.googleblog.com/gemma-4-12b-the-developer-guide/…
- Visual Guide: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-gemma-4-12b…
3/3
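A rough back-of-the-envelope check of the 16 GB claim above can be sketched as follows. The thread does not say which weight precision is meant, so the parameter count (a nominal 12 billion) and the bit widths below are assumptions for illustration only. The estimate covers the weights alone and ignores the KV cache, activations, and runtime overhead.

```python
def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Approximate memory needed for the model weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30


# Assumption: "12B" is taken as exactly 12e9 parameters; the real count may differ.
NUM_PARAMS = 12e9
# Assumption: the laptop's "16 GB" of VRAM is treated as 16 GiB for simplicity.
VRAM_GIB = 16

if __name__ == "__main__":
    # Hypothetical precisions; the thread does not state which one is used.
    for bits in (16, 8, 4):
        gib = weight_memory_gib(NUM_PARAMS, bits)
        verdict = "fits" if gib < VRAM_GIB else "does not fit"
        print(f"{bits}-bit weights: {gib:.1f} GiB ({verdict} in {VRAM_GIB} GiB, weights only)")
```

Under these assumptions, 16-bit weights alone would exceed 16 GiB while 8-bit or 4-bit weights would leave headroom, which suggests the claim most plausibly refers to a quantized or otherwise reduced-precision checkpoint. The linked posts are the place to confirm the intended configuration.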
Similar Articles
@googleaidevs: We’re launching Gemma 4 12B: Our unified, encoder-free model that brings powerful multimodal intelligence straight to y…
Google launches Gemma 4 12B, an encoder-free multimodal model with native audio support, optimized for local execution on laptops under Apache 2.0.
google/gemma-4-31B-it-assistant
Google DeepMind releases Gemma 4, a family of open-weight multimodal models featuring Multi-Token Prediction (MTP) for up to 2x decoding speedups, supporting text, image, video, and audio with enhanced reasoning and coding capabilities.
google/gemma-4-26B-A4B-it
Google DeepMind releases Gemma 4, a family of open-weight multimodal models ranging from 2.3B to 31B parameters with support for text, image, video, and audio inputs. The models feature 256K context windows, MoE and dense architectures, enhanced reasoning capabilities, and are optimized for deployment across devices from mobile to servers.
Welcome Gemma 4: Frontier multimodal intelligence on device
Google DeepMind releases Gemma 4, a frontier multimodal model family available on Hugging Face under the Apache 2.0 license, optimized for on-device deployment and supported by various inference libraries.
Gemma 4: Byte for byte, the most capable open models
Google DeepMind introduces Gemma 4, its most capable family of open models to date, designed for advanced reasoning and agentic workflows with high intelligence-per-parameter efficiency across multiple sizes.