T5Gemma: A new collection of encoder-decoder Gemma models

Google DeepMind Blog Models

Summary

Google introduces T5Gemma, a new collection of encoder-decoder models adapted from the Gemma 2 decoder-only architecture, offering improved quality-efficiency trade-offs for tasks like summarization and translation.

Introducing T5Gemma, a new collection of encoder-decoder LLMs.

Cached at: 05/08/26, 09:47 AM

# T5Gemma: A new collection of encoder-decoder Gemma models

Source: [https://developers.googleblog.com/en/t5gemma/](https://developers.googleblog.com/en/t5gemma/)

In the rapidly evolving landscape of large language models (LLMs), the spotlight has largely focused on the decoder-only architecture. While these models have shown impressive capabilities across a wide range of generation tasks, the classic encoder-decoder architecture, such as T5 (the Text-to-Text Transfer Transformer), remains a popular choice for many real-world applications. Encoder-decoder models often excel at summarization, translation, QA, and more due to their high inference efficiency, design flexibility, and richer encoder representation for understanding input. Nevertheless, the powerful encoder-decoder architecture has received relatively little attention.

Today, we revisit this architecture and introduce [T5Gemma](https://arxiv.org/abs/2504.06225), a new collection of encoder-decoder LLMs developed by converting pretrained decoder-only models into the encoder-decoder architecture through a technique called adaptation. T5Gemma is based on the Gemma 2 framework, including adapted Gemma 2 2B and 9B models as well as a set of newly trained T5-sized models (Small, Base, Large and XL). We are excited to release pretrained and instruction-tuned T5Gemma models to the community to unlock new opportunities for research and development.

## From decoder-only to encoder-decoder

In T5Gemma, we ask the following question: *can we build top-tier encoder-decoder models based on pretrained decoder-only models?* We answer this question by exploring a technique called *model adaptation*. The core idea is to initialize the parameters of an encoder-decoder model using the weights of an already pretrained decoder-only model, and then further adapt them via UL2 or PrefixLM-based pre-training.
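To make the initialization step concrete, here is a minimal toy sketch in plain Python. It is not the T5Gemma implementation: the checkpoint layout, the `init_encoder_decoder` and `toy_checkpoint` helpers, and in particular the way cross-attention is seeded (copied from self-attention when widths match, small random values otherwise) are assumptions made for illustration. Lists of floats stand in for weight tensors, and the second stage (continued pre-training with UL2 or PrefixLM) is not shown.

```python
import random


def init_encoder_decoder(enc_ckpt, dec_ckpt, seed=0):
    """Initialize toy encoder-decoder parameters from decoder-only checkpoints.

    Each checkpoint is a dict of the form
    {"width": int, "layers": [{"self_attn": [...], "mlp": [...]}, ...]},
    where the lists of floats stand in for real weight tensors.
    """
    rng = random.Random(seed)

    # Encoder: reuse the pretrained weights, but mark attention as bidirectional.
    encoder_layers = [
        {
            "self_attn": list(layer["self_attn"]),
            "mlp": list(layer["mlp"]),
            "causal": False,
        }
        for layer in enc_ckpt["layers"]
    ]

    # Decoder: reuse the pretrained weights and keep causal self-attention.
    same_width = enc_ckpt["width"] == dec_ckpt["width"]
    decoder_layers = []
    for layer in dec_ckpt["layers"]:
        if same_width:
            # ASSUMPTION: seed the new cross-attention from self-attention.
            cross_attn = list(layer["self_attn"])
        else:
            # ASSUMPTION: widths differ, so start from small random values.
            cross_attn = [rng.gauss(0.0, 0.02) for _ in layer["self_attn"]]
        decoder_layers.append(
            {
                "self_attn": list(layer["self_attn"]),
                "cross_attn": cross_attn,
                "mlp": list(layer["mlp"]),
                "causal": True,
            }
        )

    return {
        "encoder": {"layers": encoder_layers},
        "decoder": {"layers": decoder_layers},
    }


def toy_checkpoint(width, num_layers, fill):
    """Hypothetical helper that fabricates a tiny checkpoint for the demo."""
    return {
        "width": width,
        "layers": [
            {"self_attn": [fill + i] * width, "mlp": [fill - i] * width}
            for i in range(num_layers)
        ],
    }


small = toy_checkpoint(width=4, num_layers=2, fill=1.0)
large = toy_checkpoint(width=8, num_layers=3, fill=2.0)

balanced = init_encoder_decoder(small, small)    # same source for both halves
unbalanced = init_encoder_decoder(large, small)  # large encoder, small decoder
```

In this sketch the only genuinely new parameters are the decoder's cross-attention weights; everything else starts from pretrained values, which is what lets adaptation reuse the compute already spent on the decoder-only model.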
![decoder-only model](https://storage.googleapis.com/gweb-developer-goog-blog-assets/images/Chart-1.original.png)

An overview of our approach, showing how we initialize a new encoder-decoder model using the parameters from a pretrained, decoder-only model.

This adaptation method is highly flexible, allowing for creative combinations of model sizes. For instance, we can pair a large encoder with a small decoder (e.g., a 9B encoder with a 2B decoder) to create an "unbalanced" model. This allows us to tune the quality-efficiency trade-off for specific tasks, such as summarization, where a deep understanding of the input is more critical than the complexity of the generated output.

## Towards better quality-efficiency trade-off

*How does T5Gemma perform?*

In our experiments, T5Gemma models achieve comparable or better performance than their decoder-only Gemma counterparts, nearly dominating the quality-inference efficiency Pareto frontier across several benchmarks, such as SuperGLUE, which measures the quality of the learned representation.

![Encoder-decoder models benchmarks](https://storage.googleapis.com/gweb-developer-goog-blog-assets/images/Encoder-decoder_models_benchmarks.original.png)

Encoder-decoder models consistently offer better performance for a given level of inference compute, leading the quality-efficiency frontier across a range of benchmarks.

This performance advantage isn't just theoretical; it translates to real-world quality and speed too. When measuring the actual latency for GSM8K (math reasoning), T5Gemma provided a clear win. For example, T5Gemma 9B-9B achieves higher accuracy than Gemma 2 9B at a similar latency. Even more impressively, T5Gemma 9B-2B delivers a significant accuracy boost over the 2B-2B model, yet its latency is nearly identical to the much smaller Gemma 2 2B model.
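For readers less familiar with the term, the quality-efficiency Pareto frontier used in this section is the set of models that no other model beats on both cost and quality at once. The sketch below shows one simple way to compute it. The `pareto_frontier` function and the `candidates` list are hypothetical, and the numbers are made-up placeholders, not results from this post.

```python
def pareto_frontier(points):
    """Return names of points not dominated on (lower cost, higher quality).

    points: iterable of (name, cost, quality) tuples.
    The result is ordered by increasing cost.
    """
    frontier = []
    best_quality = float("-inf")
    # Sort by cost; at equal cost, consider the higher-quality point first.
    for name, cost, quality in sorted(points, key=lambda p: (p[1], -p[2])):
        # A point is kept only if it beats every cheaper-or-equal point.
        if quality > best_quality:
            frontier.append(name)
            best_quality = quality
    return frontier


# Placeholder values for illustration only: (name, inference cost, quality).
candidates = [
    ("model_a", 1.0, 50.0),
    ("model_b", 2.0, 48.0),  # costs more than model_a but scores lower
    ("model_c", 2.0, 60.0),
    ("model_d", 4.0, 65.0),
    ("model_e", 5.0, 61.0),  # costs more than model_d but scores lower
]

print(pareto_frontier(candidates))  # ['model_a', 'model_c', 'model_d']
```

A model family "leads the frontier" when its members make up most of this list for a given benchmark.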
Ultimately, these experiments showcase that encoder-decoder adaptation offers a flexible, powerful way to balance quality and inference speed.

## Unlocking foundational and fine-tuned capabilities

*Could encoder-decoder LLMs have similar capabilities to decoder-only models?*

Yes, T5Gemma shows promising capabilities both before and after instruction tuning.

After pre-training, T5Gemma achieves impressive gains on complex tasks that require reasoning. For instance, T5Gemma 9B-9B scores over 9 points higher on GSM8K (math reasoning) and 4 points higher on DROP (reading comprehension) than the original Gemma 2 9B model. This pattern demonstrates that the encoder-decoder architecture, when initialized via adaptation, has the potential to create a more capable, performant foundational model.

![Detailed results for pretrained models](https://storage.googleapis.com/gweb-developer-goog-blog-assets/images/pretrained-model-results.original.png)

Detailed results for pretrained models, illustrating how adapted models have significant gains on several reasoning-intensive benchmarks compared to decoder-only Gemma 2.

These foundational improvements from pre-training set the stage for even more dramatic gains after instruction tuning. For example, comparing Gemma 2 IT to T5Gemma IT, the performance gap widens significantly across the board. T5Gemma 2B-2B IT sees its MMLU score jump by nearly 12 points over Gemma 2 2B, and its GSM8K score increases from 58.0% to 70.7%. The adapted architecture not only potentially provides a better starting point but also responds more effectively to instruction tuning, ultimately leading to a substantially more capable and helpful final model.
![Results for fine-tuned + RLHFed models](https://storage.googleapis.com/gweb-developer-goog-blog-assets/images/results-for-fine-tuned-RLHFed-models.original.png)

Detailed results for fine-tuned + RLHFed models, illustrating the capabilities of post-training to significantly amplify the performance advantages of the encoder-decoder architecture.

## Explore our models: Releasing T5Gemma checkpoints

We’re very excited to present this new method of building powerful, general-purpose encoder-decoder models by adapting from pretrained decoder-only LLMs like Gemma 2. To help accelerate further research and allow the community to build on this work, we are excited to release a suite of our T5Gemma checkpoints.

The release includes:

- **Multiple Sizes:** Checkpoints for T5-sized models (Small, Base, Large, and XL), the Gemma 2-based models (2B and 9B), as well as an additional model in between T5 Large and T5 XL.
- **Multiple Variants:** Pretrained and instruction-tuned models.
- **Flexible Configurations:** A powerful and efficient unbalanced 9B-2B checkpoint to explore the trade-offs between encoder and decoder size.
- **Different Training Objectives:** Models trained with either PrefixLM or UL2 objectives to provide either state-of-the-art generative performance or representation quality.

We hope these checkpoints will provide a valuable resource for investigating model architecture, efficiency, and performance.

## Getting started with T5Gemma

We can't wait to see what you build with T5Gemma. Please see the following links for more information:

- Learn about the research behind this project by reading [the paper](https://arxiv.org/abs/2504.06225).
- Download the models: Find the model weights on [Hugging Face](https://huggingface.co/collections/google/t5gemma-686ba262fe290b881d21ec86) and [Kaggle](https://www.kaggle.com/models/google/t5gemma).
- Explore the models' capabilities or fine-tune them for your own use cases with the [Colab notebook](https://github.com/google-gemini/gemma-cookbook/blob/main/Research/%5BT5Gemma%5DExample.ipynb).
- Run inference with the models on [Vertex AI](https://console.cloud.google.com/vertex-ai/publishers/google/model-garden/t5gemma).
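As a rough starting point for the Hugging Face route listed above, the sketch below wraps a single generation call using the generic sequence-to-sequence classes from the `transformers` library. Several things here are assumptions rather than details from this post: the checkpoint name in `DEFAULT_MODEL_ID` should be checked against the Hugging Face collection, the installed `transformers` release must include T5Gemma support, and the `generate_text` helper is hypothetical. The Colab notebook linked above is the place to look for the intended usage.

```python
# ASSUMPTION: checkpoint name chosen for illustration; verify it against the
# Hugging Face collection before use.
DEFAULT_MODEL_ID = "google/t5gemma-2b-2b-prefixlm"


def generate_text(prompt, model_id=DEFAULT_MODEL_ID, max_new_tokens=32):
    """Run one encoder-decoder generation pass and return the decoded text."""
    # Imported lazily so that defining this function does not require
    # transformers (or a model download) until it is actually called.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

    # The encoder reads the whole prompt; the decoder then generates the output.
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `generate_text("Summarize: ...")` downloads the weights on first use, which may require accepting the model license on Hugging Face and a machine with enough memory for the chosen checkpoint size.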
