@mtschannen: For the past years my research focus was on unifying models and training paradigms across modalities. Today I'm excited…


Summary

A Google DeepMind researcher announces the release of Gemma 4 12B, a dense, encoder-free model that processes raw text, image, and audio inputs, continuing a research focus on unifying models and training paradigms across modalities.


Cached at: 06/03/26, 09:55 PM

Over the past few years, my research has focused on unifying models and training paradigms across modalities. Today I’m excited that we’re releasing our latest model aligned with this theme:

Gemma 4 12B, a dense encoder-free model which processes raw text, image, and audio inputs!

1/

Despite being encoder-free, Gemma 4 12B sits on the Pareto frontier of the Gemma 4 family. And in contrast to many other open-weight encoder-free models that focus on vision-language tasks, Gemma 4 12B also shows strong performance on text-focused and agentic tasks.

2/

Even more importantly, Gemma 4 12B fits on GPU laptops with 16 GB of VRAM, so it’s ideal for building local multimodal applications.

Find out more:

  • The Keyword: https://blog.google/innovation-and-ai/technology/developers-tools/introducing-gemma-4-12B/…
  • Dev Blog: https://developers.googleblog.com/gemma-4-12b-the-developer-guide/…
  • Visual Guide: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-gemma-4-12b…

3/3
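
The 16GB VRAM point in the thread can be sanity-checked with rough weight-memory arithmetic. The sketch below is only a back-of-the-envelope estimate under stated assumptions: it takes the parameter count to be exactly 12 billion, counts weights only (no KV cache, activations, or runtime overhead), and tries a few common precisions. The thread does not say which precision the 16GB figure refers to, and the helper name `weight_memory_gib` is hypothetical.

```python
def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    bytes_total = num_params * bits_per_param / 8
    return bytes_total / 2**30


# Assumption: "12B" is taken as exactly 12e9 parameters.
NUM_PARAMS = 12e9
# Assumption: a 16GB laptop GPU is treated as 16 GiB of usable memory.
VRAM_GIB = 16

for bits in (16, 8, 4):
    gib = weight_memory_gib(NUM_PARAMS, bits)
    verdict = "fits" if gib < VRAM_GIB else "does not fit"
    print(f"{bits}-bit weights: {gib:.1f} GiB ({verdict} in {VRAM_GIB} GiB)")
```

Under these assumptions, 16-bit weights alone would exceed 16 GiB, while 8-bit or 4-bit weights would leave headroom for the KV cache and activations. Real memory use depends on the runtime, context length, and the exact checkpoint format.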
