@osanseviero: Super excited to introduce Gemma 4 12B! - Multimodal: audio, image, video, and text input - Novel architecture: we remo…
Summary
Introducing Gemma 4 12B, a multimodal model that accepts audio, image, video, and text input, built on a novel unified architecture with no separate multimodal encoders, with MTP support and a new macOS desktop app powered by LiteRT.
Cached at: 06/03/26, 05:52 PM
Super excited to introduce Gemma 4 12B! 💎
- Multimodal: audio, image, video, and text input
- Novel architecture: we removed the multimodal encoders for a unified, streamlined arch
- New MacOS desktop app powered by LiteRT
- MTP support
Excited to see what you build with it! https://t.co/De5id2XQfz
Similar Articles
@_philschmid: We just launched a Gemma 4 12B! Our first mid-sized model with native audio inputs. Gemma 4 12 B is a unified, encoder-…
Gemma 4 12B has launched: a mid-sized multimodal model with native audio inputs that requires only 16 GB of memory and is released under Apache 2.0.
@googleaidevs: We’re launching Gemma 4 12B: Our unified, encoder-free model that brings powerful multimodal intelligence straight to y…
Google launches Gemma 4 12B, an encoder-free multimodal model with native audio support, optimized for local execution on laptops under Apache 2.0.
Google’s Gemma 4 12B just dropped - here’s how to run it locally on your Mac
Google released Gemma 4 12B, an Apache 2.0 open-source multimodal model supporting text, vision, and audio with a 256K context window. The article provides a guide for running it locally on Macs using Ollama, LM Studio, or llama.cpp.
@RedHat_AI: Gemma 4 12B dropped today. Apache 2.0, multimodal: text, image, audio, and video. 256K context, built-in thinking, nati…
Gemma 4 12B has been released under Apache 2.0, supporting multimodal inputs (text, image, audio, video), 256K context, built-in thinking, and native tool calling, running on Red Hat OpenShift AI.
@mtschannen: For the past years my research focus was on unifying models and training paradigms across modalities. Today I'm excited…
Google DeepMind researcher announces the release of Gemma 4 12B, a dense encoder-free model that processes text, image, and audio inputs, continuing work on unifying models across modalities.