google/gemma-4-26B-A4B-it-assistant

Hugging Face Models Trending 2026/04/23 20:16 Model

Summary

Google DeepMind has released the Gemma 4 MTP drafter models for the Gemma 4 family. Used for speculative decoding, they deliver significant decoding speedups while preserving exactly the same generation quality, which suits low-latency applications.

Task: any-to-any Tags: transformers, safetensors, gemma4_assistant, text-generation, any-to-any, license:apache-2.0, endpoints_compatible, region:us

Source: https://huggingface.co/google/gemma-4-26B-A4B-it-assistant

Hugging Face (https://huggingface.co/collections/google/gemma-4) | GitHub (https://github.com/google-gemma) | MTP Documentation (https://ai.google.dev/gemma/docs/mtp/mp)

License: Apache 2.0 (https://ai.google.dev/gemma/docs/gemma_4_license) | Authors: Google DeepMind (https://deepmind.google/models/gemma/)

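This model card is for the Multi-Token Prediction (MTP) drafters for the Gemma 4 models. MTP is implemented by extending the base model with a smaller, faster draft model. When used in a Speculative Decoding pipeline, the draft model predicts several tokens ahead, and the target model then verifies them in parallel. This yields significant decoding speedups (up to 2x) while guaranteeing exactly the same quality as standard generation, which makes these checkpoints well suited to low-latency and on-device applications.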
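
The draft-then-verify loop described above can be illustrated with a toy sketch. It is not the Gemma 4 implementation: `target_next` and `draft_next` are hypothetical stand-ins for the two models, decoding is greedy, and the verification step loops over positions where a real target model would score them in a single batched forward pass. The point it shows is that the output matches plain target-only decoding while needing fewer target passes.

```python
def target_next(context):
    """Hypothetical 'target model': a deterministic next token for a context."""
    return (sum(context) * 31 + len(context)) % 50


def draft_next(context):
    """Hypothetical 'drafter': agrees with the target unless the target's
    token is a multiple of 7 (deliberately imperfect)."""
    token = target_next(context)
    return token if token % 7 else (token + 1) % 50


def greedy_decode(prompt, max_new_tokens):
    """Baseline: one target call per generated token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(target_next(tokens))
    return tokens[len(prompt):]


def speculative_decode(prompt, max_new_tokens, num_draft_tokens=4):
    """Greedy speculative decoding: draft a few tokens, verify, keep the match."""
    tokens = list(prompt)
    target_passes = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1) The drafter proposes a short continuation, one token at a time.
        draft = []
        for _ in range(num_draft_tokens):
            draft.append(draft_next(tokens + draft))

        # 2) The target verifies the whole draft (counted as one pass).
        target_passes += 1
        accepted = []
        for proposed in draft:
            expected = target_next(tokens + accepted)
            accepted.append(expected)  # always keep the target's own token
            if proposed != expected:
                break  # first mismatch: discard the rest of the draft
        else:
            # Every drafted token matched, so the pass yields one bonus token.
            accepted.append(target_next(tokens + accepted))
        tokens.extend(accepted)
    new_tokens = tokens[len(prompt):len(prompt) + max_new_tokens]
    return new_tokens, target_passes


output, passes = speculative_decode([1, 2, 3], max_new_tokens=32)
baseline = greedy_decode([1, 2, 3], max_new_tokens=32)
print(output == baseline, passes)
```

Because the target's own token is kept at every position, a weak drafter only reduces the speedup; it cannot change the generated text.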

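Gemma is a family of open models built by Google DeepMind. Gemma 4 models are multimodal: they handle text and image input (the small models also support audio) and generate text output. This release includes open-weights models in both pre-trained and instruction-tuned variants. Gemma 4 features a context window of up to 256K tokens and supports over 140 languages. The family includes both Dense and Mixture-of-Experts (MoE) architectures and is suited to tasks such as text generation, coding, and reasoning. The models come in four sizes: E2B, E4B, 26B A4B, and 31B. This range of sizes makes them deployable in environments from high-end phones to laptops and servers, broadening access to state-of-the-art AI.

Gemma 4 introduces key capability and architectural advancements:

Reasoning: all models in the family are capable reasoners, with configurable thinking modes.
Extended multimodality: processes text, images (variable aspect ratio and resolution, all models), video, and audio (natively on the E2B and E4B models).
Diverse and efficient architectures: Dense and Mixture-of-Experts (MoE) variants in different sizes for scalable deployment.
Optimized for on-device use: the smaller models are designed for efficient local execution on laptops and mobile devices.
Larger context window: the small models have a 128K context window, while the medium models support 256K.
Enhanced coding and agentic capabilities: notable gains on coding benchmarks, plus native function calling to power capable autonomous agents.
Native system prompt support: Gemma 4 introduces native support for the system role, enabling more structured and controllable conversations.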

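Models Overview

Gemma 4 models are designed to deliver frontier-level performance at each size, with target deployments ranging from mobile and edge devices (E2B, E4B) to consumer GPUs and workstations (26B A4B, 31B). They are suited to reasoning, agentic workflows, coding, and multimodal understanding. The models use a hybrid attention mechanism that interleaves local sliding-window attention with global attention, and the final layer is always global. This hybrid design offers the processing speed and low memory footprint of a lightweight model without sacrificing the deep awareness that complex long-context tasks require. To optimize memory use for long contexts, the global layers use unified Keys and Values and apply Proportional RoPE (p-RoPE).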
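
To make the local/global distinction concrete, here is a minimal sketch of the two causal masks and of one possible layer schedule. The window semantics (each query sees itself plus the preceding `window - 1` positions) and the `global_every` ratio are assumptions for illustration; the card only states that local and global layers are interleaved and that the last layer is global.

```python
def attention_mask(seq_len, window=None):
    """mask[i][j] is True when query position i may attend to key position j.

    window=None gives full causal (global) attention; an integer limits each
    query to the most recent `window` positions, itself included (assumed).
    """
    return [
        [j <= i and (window is None or i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]


def layer_schedule(num_layers, global_every):
    """Hypothetical interleaving: every `global_every`-th layer is global."""
    kinds = [
        "global" if (index + 1) % global_every == 0 else "local"
        for index in range(num_layers)
    ]
    kinds[-1] = "global"  # the card states the final layer is always global
    return kinds


# 30 layers matches the 26B A4B table; global_every=6 is a made-up ratio.
schedule = layer_schedule(num_layers=30, global_every=6)
local_mask = attention_mask(seq_len=6, window=3)
global_mask = attention_mask(seq_len=6)

print(schedule)
print(local_mask[5])   # last query under a window of 3
print(global_mask[5])  # last query under global attention
```

A local layer only needs to keep keys and values for the last `window` positions, which is where the memory saving on long contexts comes from.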

Dense Models

Property	E2B	E4B	31B Dense
Total Parameters	2.3B effective (5.1B with embeddings)	4.5B effective (8B with embeddings)	30.7B
Layers	35	42	60
Sliding Window	512 tokens	512 tokens	1024 tokens
Context Length	128K tokens	128K tokens	256K tokens
Vocabulary Size	262K	262K	262K
Supported Modalities	Text, Image, Audio	Text, Image, Audio	Text, Image
Vision Encoder Parameters	~150M	~150M	~550M
Audio Encoder Parameters	~300M	~300M	No Audio

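The "E" in E2B and E4B stands for "effective" parameters. The smaller models use Per-Layer Embeddings (PLE) to maximize parameter efficiency in on-device deployments. Rather than adding more layers or parameters, PLE gives each decoder layer its own small embedding for every token. These embedding tables are large but are only used for fast lookups, which is why the effective parameter count is much smaller than the total.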
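
As a rough illustration of why lookup tables add parameters without adding much compute, the sketch below builds one small table per decoder layer and fetches a vector by indexing. All sizes (`num_layers`, `vocab_size`, `ple_dim`) are toy values chosen for the example, not Gemma 4's actual dimensions, and the way the vector is consumed inside a layer is not shown because the card does not describe it.

```python
import random


def build_ple_tables(num_layers, vocab_size, ple_dim, seed=0):
    """One small embedding table per decoder layer (toy random values)."""
    rng = random.Random(seed)
    return [
        [[rng.uniform(-1.0, 1.0) for _ in range(ple_dim)] for _ in range(vocab_size)]
        for _ in range(num_layers)
    ]


def ple_lookup(tables, layer, token_id):
    """Fetching a per-layer embedding is an index operation, not a matmul."""
    return tables[layer][token_id]


def ple_parameter_count(num_layers, vocab_size, ple_dim):
    """Table parameters grow with layers x vocabulary x embedding width."""
    return num_layers * vocab_size * ple_dim


tables = build_ple_tables(num_layers=4, vocab_size=10, ple_dim=3)
vector = ple_lookup(tables, layer=2, token_id=7)
print(vector, ple_parameter_count(num_layers=4, vocab_size=10, ple_dim=3))
```

The tables count toward the total parameter figure, but each token touches only one row per layer, which matches the card's distinction between effective and total parameters.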

Mixture-of-Experts (MoE) Model

Property	26B A4B MoE
Total Parameters	25.2B
Active Parameters	3.8B
Layers	30
Sliding Window	1024 tokens
Context Length	256K tokens
Vocabulary Size	262K
Expert Count	8 active / 128 total, 1 shared
Supported Modalities	Text, Image
Vision Encoder Parameters	~550M

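The "A" in 26B A4B stands for "active parameters", as opposed to the total number of parameters the model contains. Because only a 4B subset of parameters is activated during inference, the MoE model runs much faster than its 26B total would suggest. This makes it an excellent choice for fast inference: compared with the dense 31B model, it runs almost as fast as a 4B-parameter model.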
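
The "8 active out of 128" figure corresponds to top-k expert routing, which can be sketched as follows. This is a generic illustration rather than Gemma 4's router: the random logits stand in for a learned router, and normalizing the weights with a softmax over the selected experts is an assumption. The shared expert listed in the table would run for every token in addition to the routed ones and is not modeled here.

```python
import math
import random

NUM_EXPERTS = 128  # total routed experts (from the table above)
TOP_K = 8          # routed experts active per token (from the table above)


def route_token(router_logits, top_k=TOP_K):
    """Pick the top_k experts for one token and normalize their weights."""
    chosen = sorted(
        range(len(router_logits)),
        key=lambda expert: router_logits[expert],
        reverse=True,
    )[:top_k]
    peak = router_logits[chosen[0]]  # subtract the max for numerical stability
    weights = [math.exp(router_logits[expert] - peak) for expert in chosen]
    total = sum(weights)
    return [(expert, weight / total) for expert, weight in zip(chosen, weights)]


rng = random.Random(0)
router_logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]  # stand-in router output
routing = route_token(router_logits)

# Share of parameters touched per token, using the figures in the table.
active_fraction = 3.8 / 25.2

print(routing)
print(f"{active_fraction:.1%} of parameters active per token")
```

Only the selected experts' feed-forward weights are used for a given token, so per-token compute tracks the 3.8B active figure rather than the 25.2B total.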

Benchmark Results

These models were evaluated on a large collection of datasets and metrics covering different aspects of text generation. The evaluation results marked in the table are for the instruction-tuned models.

	Gemma 4 31B	Gemma 4 26B A4B	Gemma 4 E4B	Gemma 4 E2B	Gemma 3 27B (no thinking)
MMLU Pro	85.2%	82.6%	69.4%	60.0%	67.6%
AIME 2026 no tools	89.2%	88.3%	42.5%	37.5%	20.8%
LiveCodeBench v6	80.0%	77.1%	52.0%	44.0%	29.1%
Codeforces ELO	2150	1718	940	633	110
GPQA Diamond	84.3%	82.3%	58.6%	43.4%	42.4%
Tau2 (average over 3)	76.9%	68.2%	42.2%	24.5%	16.2%
HLE no tools	19.5%	8.7%	-	-	-
HLE with search	26.5%	17.2%	-	-	-
BigBench Extra Hard	74.4%	64.8%	33.1%	21.9%	19.3%
MMMLU	88.4%	86.3%	76.6%	67.4%	70.7%
Vision
MMMU Pro	76.9%	73.8%	52.6%	44.2%	49.7%
OmniDocBench 1.5 (average edit distance, lower is better)	0.131	0.149	0.181	0.290	0.365
MATH-Vision	85.6%	82.4%	59.5%	52.4%	46.0%
MedXPertQA MM	61.3%	58.1%	28.7%	23.5%	-
Audio
CoVoST	-	-	35.54	33.47	-
FLEURS (lower is better)	-	-	0.08	0.09	-
Long Context
MRCR v2 8 needle 128k (average)	66.4%	44.1%	25.4%	19.1%	13.5%

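Core Capabilities

Gemma 4 models handle a broad range of tasks across text, vision, and audio. Key capabilities include:

Thinking: a built-in reasoning mode that lets the model think step by step before answering.
Long context: context windows of 128K tokens (E2B/E4B) and 256K tokens (26B A4B/31B).
Image understanding: object detection, document/PDF parsing, screen and UI understanding, chart comprehension, OCR (including multilingual), handwriting recognition, and pointing. Images can be processed at variable aspect ratios and resolutions.
Video understanding: analyzes video by processing sequences of frames.
Interleaved multimodal input: freely mix text and images in any order within a single prompt.
Function calling: native support for structured tool use, enabling agentic workflows.
Coding: code generation, completion, and correction.
Multilingual: out-of-the-box support for 35+ languages, with pre-training covering 140+ languages.
Audio (E2B and E4B only): automatic speech recognition (ASR) and multilingual speech-to-translated-text.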

Getting Started

You can run all Gemma 4 models with the latest version of the Transformers library. First, install the necessary dependencies in your environment:

pip install -U transformers torch accelerate

Once everything is installed, load the target model and the assistant model with the following code:

from transformers import AutoProcessor, AutoModelForCausalLM

TARGET_MODEL_ID = "google/gemma-4-26B-A4B-it"
ASSISTANT_MODEL_ID = "google/gemma-4-26B-A4B-it-assistant"

# Target model
processor = AutoProcessor.from_pretrained(TARGET_MODEL_ID)
target_model = AutoModelForCausalLM.from_pretrained(
    TARGET_MODEL_ID,
    dtype="auto",
    device_map="auto",
)

# Assistant model (drafter)
assistant_model = AutoModelForCausalLM.from_pretrained(
    ASSISTANT_MODEL_ID,
    dtype="auto",
    device_map="auto",
)

Once the models are loaded, you can start generating output:

# Prompt
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Write a short joke about saving RAM."},
]

# Process input
text = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
inputs = processor(text=text, return_tensors="pt").to(target_model.device)
input_len = inputs["input_ids"].shape[-1]

# Generate output
outputs = target_model.generate(
    **inputs,
    assistant_model=assistant_model,
    max_new_tokens=256,
)
response = processor.decode(outputs[0][input_len:], skip_special_tokens=False)

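# Parse output
processor.parse_response(response)

To enable reasoning, set enable_thinking=True; the parse_response function will then take care of parsing the thinking output.

Code snippets for processing audio (E2B and E4B only), images, and video alongside text are also provided below:

Code for processing audio

Use AutoModelForMultimodalLM instead of AutoModelForCausalLM to process audio. Make sure the following packages are installed first:

pip install -U transformers torch torchvision librosa accelerate

Once everything is installed, load the target model and the assistant model with the following code: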

import torch
from transformers import AutoProcessor, AutoModelForCausalLM, AutoModelForMultimodalLM

TARGET_MODEL_ID = "google/gemma-4-26B-A4B-it"
ASSISTANT_MODEL_ID = "google/gemma-4-26B-A4B-it-assistant"

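# Target model (this card lists audio input for E2B and E4B only, so swap in
# the matching target and assistant model IDs before running the audio example)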
processor = AutoProcessor.from_pretrained(TARGET_MODEL_ID)
target_model = AutoModelForMultimodalLM.from_pretrained(
    TARGET_MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Assistant model (drafter)
assistant_model = AutoModelForCausalLM.from_pretrained(
    ASSISTANT_MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

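Once the models are loaded, you can generate output by referencing the audio URL directly in the prompt:

# Prompt - add the audio before the text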
messages = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": "https://github.com/google-gemma/cookbook/raw/refs/heads/main/apps/sample-data/journal1.wav"},
            {"type": "text", "text": "Transcribe the following speech segment in its original language. Follow these specific instructions for formatting the answer:\n* Only output the transcription, with no newlines.\n* When transcribing numbers, write the digits, i.e. write 1.7 and not one point seven, and write 3 instead of three."},
        ]
    }
]

# Process input (tokenize=True so the processor also loads the audio
# referenced in the message, as in the image and video examples)
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    add_generation_prompt=True,
).to(target_model.device)
input_len = inputs["input_ids"].shape[-1]

# Generate output
outputs = target_model.generate(
    **inputs,
    assistant_model=assistant_model,
    max_new_tokens=256,
)
response = processor.decode(outputs[0][input_len:], skip_special_tokens=False)

# Parse output
processor.parse_response(response)

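Code for processing images

Use AutoModelForMultimodalLM instead of AutoModelForCausalLM to process images. Make sure the following packages are installed first:

pip install -U transformers torch torchvision accelerate

Once everything is installed, load the target model and the assistant model with the following code: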

import torch
from transformers import AutoProcessor, AutoModelForCausalLM, AutoModelForMultimodalLM

TARGET_MODEL_ID = "google/gemma-4-26B-A4B-it"
ASSISTANT_MODEL_ID = "google/gemma-4-26B-A4B-it-assistant"

# Target model
processor = AutoProcessor.from_pretrained(TARGET_MODEL_ID)
target_model = AutoModelForMultimodalLM.from_pretrained(
    TARGET_MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Assistant model (drafter)
assistant_model = AutoModelForCausalLM.from_pretrained(
    ASSISTANT_MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

Once the models are loaded, you can generate output by referencing the image URL directly in the prompt:

# Prompt - add the image before the text
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://raw.githubusercontent.com/google-gemma/cookbook/refs/heads/main/apps/sample-data/GoldenGate.png"},
            {"type": "text", "text": "What is shown in this image?"}
        ]
    }
]

# Process input
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    add_generation_prompt=True,
).to(target_model.device)
input_len = inputs["input_ids"].shape[-1]

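# Generate output (pass the drafter loaded above so speculative decoding is used)
outputs = target_model.generate(
    **inputs,
    assistant_model=assistant_model,
    max_new_tokens=512,
)
response = processor.decode(outputs[0][input_len:], skip_special_tokens=False)

# Parse output
processor.parse_response(response)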

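Code for processing video

Use AutoModelForMultimodalLM instead of AutoModelForCausalLM to process video. Make sure the following packages are installed first:

pip install -U transformers torch torchvision librosa accelerate

Once everything is installed, load the target model and the assistant model with the following code: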

import torch
from transformers import AutoProcessor, AutoModelForCausalLM, AutoModelForMultimodalLM

TARGET_MODEL_ID = "google/gemma-4-26B-A4B-it"
ASSISTANT_MODEL_ID = "google/gemma-4-26B-A4B-it-assistant"

# Target model
processor = AutoProcessor.from_pretrained(TARGET_MODEL_ID)
target_model = AutoModelForMultimodalLM.from_pretrained(
    TARGET_MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Assistant model (drafter)
assistant_model = AutoModelForCausalLM.from_pretrained(
    ASSISTANT_MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

Once the models are loaded, you can generate output by referencing the video URL directly in the prompt:

# Prompt - add the video before the text
messages = [
    {
        'role': 'user',
        'content': [
            {"type": "video", "video": "https://github.com/bebechien/gemma/raw/refs/heads/main/videos/ForBiggerBlazes.mp4"},
            {'type': 'text', 'text': 'Describe this video.'}
        ]
    }
]

# Process input
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    add_generation_prompt=True,
).to(target_model.device)
input_len = inputs["input_ids"].shape[-1]

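# Generate output (pass the drafter loaded above so speculative decoding is used)
outputs = target_model.generate(
    **inputs,
    assistant_model=assistant_model,
    max_new_tokens=512,
)
response = processor.decode(outputs[0][input_len:], skip_special_tokens=False)

# Parse output
processor.parse_response(response)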

Best Practices

For best performance, use the following configurations and best practices:

1. Sampling Parameters

Use the following standardized sampling configuration across all use cases:

temperature=1.0
top_p=0.95
top_k=64
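
For readers unfamiliar with what these three settings do, here is a small self-contained sketch of temperature scaling followed by top-k and then top-p (nucleus) filtering. It is a generic illustration, not the sampler used by any particular library; the order of the two filters and the renormalization steps are assumptions, and the logits are made up.

```python
import math
import random


def filter_distribution(logits, temperature=1.0, top_k=64, top_p=0.95):
    """Return {token_id: probability} after temperature, top-k and top-p."""
    scaled = [value / temperature for value in logits]

    # Top-k: keep only the k highest-scoring tokens.
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    peak = scaled[order[0]]  # subtract the max for numerical stability
    exps = [math.exp(scaled[i] - peak) for i in order]
    total = sum(exps)
    probs = [value / total for value in exps]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for token_id, prob in zip(order, probs):
        kept.append((token_id, prob))
        cumulative += prob
        if cumulative >= top_p:
            break

    norm = sum(prob for _, prob in kept)
    return {token_id: prob / norm for token_id, prob in kept}


def sample(logits, rng, **settings):
    """Draw one token id from the filtered distribution."""
    dist = filter_distribution(logits, **settings)
    token_ids = list(dist)
    return rng.choices(token_ids, weights=[dist[t] for t in token_ids], k=1)[0]


toy_logits = [index * 0.1 for index in range(200)]  # made-up scores
dist = filter_distribution(toy_logits, temperature=1.0, top_k=64, top_p=0.95)
token = sample(toy_logits, random.Random(0), temperature=1.0, top_k=64, top_p=0.95)
print(len(dist), token)
```

With Transformers, the same three values are normally passed to generate() as temperature, top_p, and top_k together with do_sample=True.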

2. Thinking Mode Configuration

Compared to Gemma 3, the models use the standard system, assistant, and user roles. To manage the thinking process correctly, use the following control tokens:

**Trigger
