unsloth/diffusiongemma-26B-A4B-it-GGUF


Summary

Unsloth releases GGUF quantizations of Google DeepMind's DiffusionGemma (26B-A4B), a new block-diffusion architecture for faster text generation, ready for llama.cpp.

Task: image-text-to-text
Tags: gguf, gemma4, unsloth, gemma, google, diffusion_gemma, image-text-to-text, base_model:google/diffusiongemma-26B-A4B-it, base_model:quantized:google/diffusiongemma-26B-A4B-it, license:apache-2.0, endpoints_compatible, region:us, conversational

Cached at: 06/11/26, 01:33 PM


Source: https://huggingface.co/unsloth/diffusiongemma-26B-A4B-it-GGUF

See Unsloth Dynamic 2.0 GGUFs for our quantization benchmarks.


Hugging Face | GitHub | Launch Blog | Documentation
License: Apache 2.0 | Authors: Google DeepMind

Run with llama.cpp (GGUF)

These GGUFs run with the DiffusionGemma build of llama.cpp (the DiffusionGemma PR, ggml-org/llama.cpp#24423). DiffusionGemma is a block-diffusion architecture, so it needs that branch plus the dedicated llama-diffusion-cli runner. The standard llama-cli and llama-server cannot generate from it yet.


1. Build the branch

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

# check out the DiffusionGemma PR (#24423)
gh pr checkout 24423
# or, without the gh CLI:
# git fetch origin pull/24423/head:diffusiongemma && git checkout diffusiongemma

# build with CUDA (drop -DGGML_CUDA=ON for a CPU-only build)
cmake -B build -DGGML_CUDA=ON
cmake --build build -j --config Release --target llama-diffusion-cli

2. Download a GGUF

pip install -U "huggingface_hub[cli]"
hf download unsloth/diffusiongemma-26B-A4B-it-GGUF \
    --local-dir unsloth/diffusiongemma-26B-A4B-it-GGUF \
    --include "*Q8_0*" # Use "*Q4_K_M*" for a smaller 16 GB download

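| Quant  | Size  | Notes                                |
|--------|-------|--------------------------------------|
| BF16   | 47 GB | full precision reference             |
| Q8_0   | 25 GB | near-lossless, recommended           |
| Q6_K   | 21 GB |                                      |
| Q5_K_M | 18 GB |                                      |
| Q4_K_M | 16 GB | smallest, fits a single 24 GB GPU    |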
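
As a rough aid for choosing a file, the sketch below picks the largest quant from the table that fits a given memory budget. The helper names and the 8 GB headroom default are assumptions made for illustration (headroom stands in for KV cache and activations, and was chosen so that a 24 GB card lands on Q4_K_M, as the table note suggests); only the sizes come from the table.

```python
from typing import Optional

# File sizes in GB, copied from the quant table; everything else here is hypothetical.
QUANT_SIZES_GB = {
    "BF16": 47,
    "Q8_0": 25,
    "Q6_K": 21,
    "Q5_K_M": 18,
    "Q4_K_M": 16,
}


def pick_quant(budget_gb: float, headroom_gb: float = 8.0) -> Optional[str]:
    """Return the largest quant whose file fits in budget_gb minus headroom_gb."""
    usable = budget_gb - headroom_gb
    fitting = [(size, name) for name, size in QUANT_SIZES_GB.items() if size <= usable]
    return max(fitting)[1] if fitting else None


def include_pattern(quant: str) -> str:
    """Glob in the same style as the --include argument of the download command."""
    return f"*{quant}*"


if __name__ == "__main__":
    for budget in (16, 24, 48, 80):
        quant = pick_quant(budget)
        print(budget, quant, include_pattern(quant) if quant else "-")
```

Treat the result as a starting point: actual memory use depends on context length and on how many layers are offloaded.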

3. Run it (chat)

-n / --n-predict is the only length knob you need: it auto-derives the diffusion block count and grows the batch/context to fit. On a single GPU the prompt prefix KV cache turns on automatically.

./build/bin/llama-diffusion-cli \
  -m unsloth/diffusiongemma-26B-A4B-it-GGUF/diffusiongemma-26B-A4B-it-Q8_0.gguf \
  -ngl 99 -cnv -n 2048
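
To give a feel for what "derives the block count" means, here is a minimal sketch of the arithmetic, assuming the runner simply rounds the token target up to whole 256-token canvases. The rounding rule and the function name are assumptions; the actual derivation inside llama-diffusion-cli may differ.

```python
import math

CANVAS_TOKENS = 256  # canvas length this model uses


def diffusion_blocks(n_predict: int, canvas: int = CANVAS_TOKENS) -> int:
    """Number of canvases needed to cover n_predict tokens (assumed: round up)."""
    if n_predict <= 0:
        raise ValueError("n_predict must be positive")
    return math.ceil(n_predict / canvas)


if __name__ == "__main__":
    for n in (256, 300, 2048):
        print(n, "tokens ->", diffusion_blocks(n), "canvases")
```

Under that assumption, the -n 2048 used above corresponds to eight canvases.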

4. Watch the canvas denoise live (--diffusion-visual)

Add --diffusion-visual to see each 256-token canvas being denoised in place (flicker-free, keeps your scrollback):

./build/bin/llama-diffusion-cli \
  -m unsloth/diffusiongemma-26B-A4B-it-GGUF/diffusiongemma-26B-A4B-it-Q8_0.gguf \
  -ngl 99 -cnv -n 2048 --diffusion-visual

Flags

  • -n, --n-predict N - target tokens; derives --diffusion-blocks and grows -ub / -b / -c.
  • -ngl 99 - offload all layers to the GPU (-ngl 0 for CPU-only).
  • -cnv - multi-turn conversation mode.
  • --diffusion-visual - live canvas denoising view.
  • The Entropy-Bound sampler is on by default (--diffusion-eb auto). Tune it with --diffusion-eb-max-steps (default 48), --diffusion-eb-t-max / --diffusion-eb-t-min (0.8 -> 0.4), --diffusion-eb-entropy-bound (0.1), and --diffusion-eb-confidence (0.005).
  • --diffusion-kv-cache {auto,on,off} - prompt prefix KV cache (auto = on for single GPU).
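
If you launch the runner from scripts, a small wrapper can keep the flag combinations consistent. The sketch below only assembles an argument list from flags listed above; the function name and its defaults are hypothetical, and it does not start the process.

```python
import shlex
from typing import List, Optional


def build_command(
    model_path: str,
    n_predict: int = 2048,
    gpu_layers: int = 99,
    visual: bool = False,
    eb_max_steps: Optional[int] = None,
    kv_cache: Optional[str] = None,
    binary: str = "./build/bin/llama-diffusion-cli",
) -> List[str]:
    """Assemble an argv list using only flags documented in this section."""
    cmd = [binary, "-m", model_path, "-ngl", str(gpu_layers), "-cnv", "-n", str(n_predict)]
    if visual:
        cmd.append("--diffusion-visual")
    if eb_max_steps is not None:
        cmd += ["--diffusion-eb-max-steps", str(eb_max_steps)]
    if kv_cache is not None:
        if kv_cache not in {"auto", "on", "off"}:
            raise ValueError("kv_cache must be one of: auto, on, off")
        cmd += ["--diffusion-kv-cache", kv_cache]
    return cmd


if __name__ == "__main__":
    argv = build_command(
        "unsloth/diffusiongemma-26B-A4B-it-GGUF/diffusiongemma-26B-A4B-it-Q8_0.gguf",
        visual=True,
    )
    # Print a copy-pasteable shell line; pass argv to subprocess.run to execute it.
    print(shlex.join(argv))
```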

DiffusionGemma is a generative model built by Google DeepMind. Based on the 26B A4B Mixture-of-Experts (MoE) Gemma 4 architecture, DiffusionGemma generates tokens using discrete diffusion. This open-weights model is multimodal, handling text, image, and video inputs to generate text output.

Built on a MoE foundation, DiffusionGemma is designed to improve generation speed (tokens per second) while remaining deployable across various hardware environments. DiffusionGemma builds upon the architectural and capability advancements of Gemma 4, introducing several core features:

  • Discrete Text Diffusion – Shifts from token-by-token autoregression to block-autoregressive multi-canvas sampling. It generates text by iteratively denoising blocks of tokens (a ‘canvas’) in parallel, significantly increasing decoding speed.
  • Multimodal Input Processing – Processes interleaved text, image (with variable aspect ratio and resolution support), and video inputs to generate text outputs.
  • Encoder-Decoder Architecture – Utilizes an autoregressive encoder to process and cache the prompt context, paired with a decoder that applies bidirectional attention over the generation canvas.
  • Mixture-of-Experts (MoE) Efficiency – Leverages a sparse MoE design (8 active experts out of 128 total) to provide strong reasoning capabilities while maintaining a low memory footprint suitable for local execution.
  • Thinking Mode (Reasoning) – Designed as a highly capable reasoner, with configurable thinking modes.
  • Optimized for Small Batch Size Inference – Specifically engineered for low-latency, high-speed generation on a single capable accelerator.
  • Native System Prompt Support – As with Gemma 4, it supports updating the system role, enabling more structured and controllable conversations.

Model Overview

DiffusionGemma is engineered to reduce the sequential bottlenecks of standard causal language models. It employs an encoder-decoder architecture specifically optimized for inference speed.

The encoder operates in a prefill capacity, processing the initial prompt and generating the KV cache. The decoder then utilizes bidirectional attention to process an input block (a ‘canvas’) of tokens, accessing the cached context via cross-attention.

During inference, DiffusionGemma leverages multi-canvas sampling. Rather than generating one token at a time, the model iteratively denoises a full block of tokens using a diffusion sampler. Once a canvas is fully denoised, it is processed by the encoder and appended to the KV cache, after which the model generates the next canvas. This block-autoregressive approach facilitates text generation at higher speeds.
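
The control flow described above can be sketched with a toy stand-in for the model. Everything here is illustrative: the "denoiser" just copies tokens from a known target, whereas the real decoder predicts them with bidirectional attention and may fill positions in any order. The point is only the loop shape: denoise one canvas over several steps, append it to the context, move to the next canvas.

```python
from typing import List, Tuple

MASK = "_"  # placeholder for a still-noised position


def toy_denoise_step(canvas: List[str], target: List[str], reveal_per_step: int) -> List[str]:
    """Stand-in for one decoder pass: resolves up to reveal_per_step masked slots."""
    out = list(canvas)
    revealed = 0
    for i, tok in enumerate(out):
        if tok == MASK and revealed < reveal_per_step:
            out[i] = target[i]
            revealed += 1
    return out


def generate(
    prompt: List[str], target: List[str], canvas_len: int = 4, reveal_per_step: int = 2
) -> Tuple[List[str], int]:
    """Block-autoregressive loop: finish a canvas, append it, start the next one."""
    context = list(prompt)  # stands in for the encoder's KV cache
    steps = 0
    for start in range(0, len(target), canvas_len):
        block_target = target[start:start + canvas_len]
        canvas = [MASK] * len(block_target)  # fully noised canvas
        while MASK in canvas:  # iterative denoising of the whole block
            canvas = toy_denoise_step(canvas, block_target, reveal_per_step)
            steps += 1
        context.extend(canvas)  # finished canvas joins the cached context
    return context, steps


if __name__ == "__main__":
    tokens, steps = generate(["Q:"], list("abcdefghij"))
    print("".join(tokens[1:]), "in", steps, "denoising steps")
```

In this toy run, ten tokens take five denoising steps instead of ten sequential decoding steps, which is the kind of saving the block-autoregressive design aims for.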

DiffusionGemma

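| Property                  | Value                                |
|---------------------------|--------------------------------------|
| Total Parameters          | 25.2B                                |
| Active Parameters         | 3.8B                                 |
| Layers                    | 30                                   |
| Sliding Window            | 1024 tokens                          |
| Context Length            | Up to 256K tokens                    |
| Canvas Length             | 256                                  |
| Vocabulary Size           | 262K                                 |
| Expert Count              | 8 active / 128 total and 1 shared    |
| Supported Modalities      | Text, Image                          |
| Vision Encoder Parameters | ~550M                                |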
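
A quick calculation helps relate the parameter and expert counts. The snippet uses only figures from the table; the variable names are illustrative. The active-parameter share (about 15%) is larger than the routed-expert share (6.25%), presumably because components outside the routed experts, such as the shared expert and attention layers, run for every token; the card does not give that breakdown.

```python
# Figures from the specification table.
TOTAL_PARAMS_B = 25.2
ACTIVE_PARAMS_B = 3.8
ROUTED_ACTIVE = 8
ROUTED_TOTAL = 128

# Share of all weights that participate in a single forward pass.
active_param_fraction = ACTIVE_PARAMS_B / TOTAL_PARAMS_B

# Share of routed experts selected per token.
routed_expert_fraction = ROUTED_ACTIVE / ROUTED_TOTAL

if __name__ == "__main__":
    print(f"active parameters: {active_param_fraction:.2%}")
    print(f"routed experts used: {routed_expert_fraction:.2%}")
```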

Benchmark Results

These models were evaluated against a large collection of different datasets and metrics to cover different aspects of text generation. Evaluation results marked in the table are for instruction-tuned models, with the recommended Entropy Bound (EB) sampler (see Best Practices below).

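| Benchmark                                                  | DiffusionGemma 26B A4B | Gemma 4 26B A4B |
|------------------------------------------------------------|------------------------|-----------------|
| MMLU Pro                                                   | 77.6%                  | 82.6%           |
| AIME 2026 no tools                                         | 69.1%                  | 88.3%           |
| LiveCodeBench v6                                           | 69.1%                  | 77.1%           |
| Codeforces ELO                                             | 1429                   | 1718            |
| GPQA Diamond                                               | 73.2%                  | 82.3%           |
| Tau2 (average over 3)                                      | 56.2%                  | 68.2%           |
| HLE no tools                                               | 11.0%                  | 8.7%            |
| HLE with search                                            | 11.9%                  | 17.2%           |
| BigBench Extra Hard                                        | 47.6%                  | 64.8%           |
| MMMLU                                                      | 81.5%                  | 86.3%           |
| Vision                                                     |                        |                 |
| MMMU Pro                                                   | 54.3%                  | 73.8%           |
| OmniDocBench 1.5 (average edit distance, lower is better)  | 0.319                  | 0.149           |
| MATH-Vision                                                | 70.5%                  | 82.4%           |
| MedXPertQA MM                                              | 49.0%                  | 58.1%           |
| Long Context                                               |                        |                 |
| MRCR v2 8 needle 128k (average)                            | 32.0%                  | 44.1%           |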

Core Capabilities

DiffusionGemma handles a broad range of tasks across text and vision. Key capabilities include:

  • High-Speed Generation – Parallel denoising of 256 tokens via diffusion sampling achieves low latency by generating 15-20 tokens per forward pass, unlocking per-user generation speeds exceeding 1100 tokens per second in low-batch-size settings (H100, FP8).
  • Adaptive Inference Time Computation – Simpler prompts and structured tasks like code require fewer denoising steps, enabling dynamic tokens-per-second speeds based on task complexity.
  • Thinking – Built-in reasoning mode that lets the model think step-by-step before answering.
  • Long Context – Context windows of up to 256K tokens.
  • Image Understanding – Object detection, document/PDF parsing, screen and UI understanding, chart comprehension, OCR (including multilingual), handwriting recognition, and pointing. Images can be processed at variable aspect ratios and resolutions.
  • Video Understanding – Analyzes and describes video content by processing sequences of frames.
  • Interleaved Multimodal Input – Mix images, video, and text within a single prompt for context-heavy reasoning.
  • Function Calling – Native support for structured tool use, enabling agentic workflows.
  • Coding & Reasoning – Capable of code generation, completion, and step-by-step logical reasoning.
  • Multilingual – Out-of-the-box support for 35+ languages, pre-trained on 140+ languages.
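
The speed figures in the first bullet can be related with some back-of-the-envelope arithmetic. The sketch assumes that each forward pass is one denoising step and that the 15-20 tokens-per-pass average holds uniformly across a canvas; both are simplifications, and the function names are illustrative.

```python
import math


def passes_per_second(tokens_per_second: float, tokens_per_pass: float) -> float:
    """Forward passes per second implied by a throughput and a per-pass yield."""
    return tokens_per_second / tokens_per_pass


def steps_per_canvas(canvas_len: int, tokens_per_pass: float) -> int:
    """Denoising steps needed to resolve one canvas at a given per-pass yield."""
    return math.ceil(canvas_len / tokens_per_pass)


if __name__ == "__main__":
    for yield_per_pass in (15, 20):
        rate = passes_per_second(1100, yield_per_pass)
        steps = steps_per_canvas(256, yield_per_pass)
        print(f"{yield_per_pass} tok/pass: {rate:.1f} passes/s, {steps} steps per canvas")
```

Under these assumptions, 1100 tokens per second corresponds to roughly 55 to 73 forward passes per second, and a 256-token canvas resolves in about 13 to 18 steps, well under the 48-step default cap of the sampler.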

Getting Started

You can use all Gemma 4 models with the latest version of Transformers. To get started, install the necessary dependencies in your environment:

pip install -U transformers torch accelerate

Once you have everything installed, you can proceed to load the model with the code below:

from transformers import DiffusionGemmaForBlockDiffusion, AutoProcessor

MODEL_ID = "google/diffusiongemma-26B-A4B-it"

# Load model
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = DiffusionGemmaForBlockDiffusion.from_pretrained(
    MODEL_ID,
    dtype="auto",
    device_map="auto",
)

Once the model is loaded, you can start generating output:

# Prompt
message = [
    {"role": "user", "content": "Why is the sky blue?"}
]

# Process input
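# apply_chat_template with return_dict=True returns a dict of tensors, not just input IDs
inputs = processor.apply_chat_template(
    message,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
).to(model.device)
output = model.generate(**inputs, max_new_tokens=512)

# Parse output (special tokens are kept so the thinking channel stays visible)
text = processor.decode(output[0], skip_special_tokens=False)
print(text)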

Best Practices

For the best performance, use these configurations and best practices:

1. Diffusion Sampling Settings

Use the following standardized sampling configuration across all use cases:

  • Method: Diffusion sampling with Entropy-Bounded Denoising and Adaptive Stopping.
  • Sampling Configuration:
      - Maximum number of denoising steps = 48
      - Temperature schedule (for logit shaping): linear decay from 0.8 → 0.4
      - Token Selection: At each step, the sampler selects the lowest-entropy tokens such that their mutual information bound stays below entropy bound = 0.1
      - Token Renoising: The sampler fully renoises the non-selected tokens
  • Adaptive Stopping: Sampling terminates early if and only if both of the following conditions are met simultaneously:
      - Confident predictions: The average model entropy over the canvas is below the entropy threshold = 0.005
      - Stable predictions: The highest-probability token predictions remain identical across two consecutive denoising steps
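
The three pieces of this configuration (temperature schedule, token selection, adaptive stopping) can be sketched in a few functions. This is a minimal illustration, not the reference sampler: in particular, it approximates the "mutual information bound" by the running sum of per-token entropies and always keeps at least one token per step. Both choices are assumptions, as are all function names.

```python
import math
from typing import List, Optional, Sequence

MAX_STEPS = 48
T_MAX, T_MIN = 0.8, 0.4
ENTROPY_BOUND = 0.1
STOP_ENTROPY = 0.005


def temperature(step: int, max_steps: int = MAX_STEPS) -> float:
    """Linear decay from T_MAX at step 0 to T_MIN at the final step."""
    if max_steps <= 1:
        return T_MIN
    return T_MAX + (T_MIN - T_MAX) * (step / (max_steps - 1))


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (nats) of one position's token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_positions(entropies: Sequence[float], bound: float = ENTROPY_BOUND) -> List[int]:
    """Keep lowest-entropy positions while their summed entropy stays within bound.

    Assumption: the summed entropy stands in for the mutual-information bound,
    and at least one position is always kept so the sampler makes progress.
    Positions not returned would be renoised.
    """
    order = sorted(range(len(entropies)), key=lambda i: entropies[i])
    chosen: List[int] = []
    total = 0.0
    for i in order:
        if chosen and total + entropies[i] > bound:
            break
        chosen.append(i)
        total += entropies[i]
    return sorted(chosen)


def should_stop(
    entropies: Sequence[float],
    prev_argmax: Optional[Sequence[int]],
    cur_argmax: Sequence[int],
    threshold: float = STOP_ENTROPY,
) -> bool:
    """Stop only when predictions are both confident and unchanged since the last step."""
    confident = sum(entropies) / len(entropies) < threshold
    stable = prev_argmax is not None and list(prev_argmax) == list(cur_argmax)
    return confident and stable


if __name__ == "__main__":
    print([round(temperature(s), 3) for s in (0, 24, 47)])
    print(select_positions([0.5, 0.01, 0.03, 0.2]))
    print(should_stop([0.001, 0.002], [3, 4], [3, 4]))
```

A full sampler would call these once per denoising step, for at most 48 steps per canvas.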

2. Thinking Mode Configuration

Similar to Gemma 4 models, we use standard system, assistant, and user roles. To properly manage the thinking process, use the following control tokens:

  • Trigger Thinking: Thinking is enabled by including the <|think|> token at the start of the system prompt. To disable thinking, remove the token (note that an empty thinking channel might still be emitted).
  • Standard Generation: When thinking is enabled, the model will output its internal reasoning followed by the final answer using this structure: <|channel>thought\n[Internal reasoning]<channel|>[Final answer].
  • Disabled Thinking Behavior: If thinking is disabled, the model will still generate the tags but with an empty thought block: <|channel>thought\n<channel|>[Final answer].

Note that many libraries like transformers handle the complexities of the chat template for you.
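
If you handle the raw text yourself, the control tokens above can be managed with two small helpers. This is a sketch with hypothetical function names; the token strings follow the card, and the parser assumes at most one thought block at the start of the decoded output.

```python
from typing import Dict, List, Tuple

THINK_TOKEN = "<|think|>"
THOUGHT_OPEN = "<|channel>thought\n"
THOUGHT_CLOSE = "<channel|>"


def build_messages(user_text: str, system_text: str = "", thinking: bool = True) -> List[Dict[str, str]]:
    """Put the think token at the start of the system prompt when thinking is on."""
    system = (THINK_TOKEN if thinking else "") + system_text
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": user_text})
    return messages


def split_thought(raw: str) -> Tuple[str, str]:
    """Split decoded output into (thought, final answer); thought may be empty."""
    start = raw.find(THOUGHT_OPEN)
    if start == -1:
        return "", raw.strip()
    end = raw.find(THOUGHT_CLOSE, start + len(THOUGHT_OPEN))
    if end == -1:
        return "", raw.strip()
    thought = raw[start + len(THOUGHT_OPEN):end]
    answer = raw[:start] + raw[end + len(THOUGHT_CLOSE):]
    return thought.strip(), answer.strip()


if __name__ == "__main__":
    print(build_messages("Why is the sky blue?", "Be brief."))
    print(split_thought("<|channel>thought\nRayleigh scattering.<channel|>Blue light scatters more."))
```

Decoded output may contain other special tokens around these markers, so treat the parser as a starting point.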

3. Multi-Turn Conversations

  • No Thinking Content in History: In multi-turn conversations, the historical model output should only include the final response. Thoughts from previous model turns must not be added before the next user turn begins.
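
One way to follow this rule is to strip the thought block before storing each model turn. The sketch below assumes the thought markers documented for thinking mode (<|channel>thought\n ... <channel|>); the helper name is hypothetical.

```python
import re
from typing import Dict, List

# Matches one thought block, including an empty one.
THOUGHT_BLOCK = re.compile(r"<\|channel>thought\n.*?<channel\|>", re.DOTALL)


def append_model_turn(history: List[Dict[str, str]], raw_output: str) -> List[Dict[str, str]]:
    """Return a new history with only the final response of the model turn added."""
    final = THOUGHT_BLOCK.sub("", raw_output).strip()
    return history + [{"role": "assistant", "content": final}]


if __name__ == "__main__":
    history = [{"role": "user", "content": "What is 2 + 2?"}]
    history = append_model_turn(history, "<|channel>thought\nSimple sum.<channel|>4")
    history.append({"role": "user", "content": "And times 3?"})
    print(history)
```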

4. Modality order

  • For optimal performance with multimodal inputs, place image content before the text in your prompt.
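
In a chat-style message this means listing image parts ahead of text parts. The sketch assumes the list-of-parts content layout commonly used by Transformers chat templates for multimodal models ({"type": "image", ...} and {"type": "text", ...}); the exact keys this model's processor expects are an assumption, and the helper names and URL are placeholders.

```python
from typing import Any, Dict, List


def image_first_message(image_url: str, question: str) -> List[Dict[str, Any]]:
    """Build a single user turn with the image part ahead of the text part."""
    return [{
        "role": "user",
        "content": [
            {"type": "image", "url": image_url},
            {"type": "text", "text": question},
        ],
    }]


def images_first(parts: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Stable reorder: image parts first, other parts keep their relative order."""
    return sorted(parts, key=lambda part: part.get("type") != "image")


if __name__ == "__main__":
    print(image_first_message("https://example.com/chart.png", "What does this chart show?"))
```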

5. Variable Image Resolution

Aside from variable aspect ratios, DiffusionGemma supports variable image resolution through a configurable visual token budget, which controls how many tokens are used to represent an image. A higher token budget preserves more visual detail at the cost of additional compute, while a lower budget enables faster inference for tasks that don’t require fine-grained understanding.

  • The supported token budgets are: 70, 140, 280, 560, and 1120.
      - Use lower budgets for classification, captioning, or video understanding, where faster inference and processing many frames outweigh fine-grained detail.
      - Use higher budgets for tasks like OCR, document parsing, or reading small text.
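
A small helper can keep requested budgets on the supported values. The snapping rule and the per-task defaults below are illustrative assumptions that follow the lower/higher guidance; the card does not prescribe specific values per task, and how the budget is passed to the processor is not covered here.

```python
SUPPORTED_BUDGETS = (70, 140, 280, 560, 1120)

# Hypothetical defaults: low budgets for coarse tasks, high budgets for fine detail.
TASK_DEFAULTS = {
    "classification": 70,
    "captioning": 140,
    "video": 70,
    "ocr": 1120,
    "document_parsing": 1120,
}


def snap_budget(requested: int) -> int:
    """Smallest supported budget that is >= requested, capped at the largest."""
    for budget in SUPPORTED_BUDGETS:
        if budget >= requested:
            return budget
    return SUPPORTED_BUDGETS[-1]


def budget_for_task(task: str, fallback: int = 280) -> int:
    """Look up an illustrative default budget, snapped to a supported value."""
    return snap_budget(TASK_DEFAULTS.get(task, fallback))


if __name__ == "__main__":
    print(snap_budget(100), budget_for_task("ocr"), budget_for_task("unknown"))
```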

6. Video Length

All models support image inputs and can process videos as frames. Video supports a maximum of 60 seconds assuming the images are processed at one frame per second.
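
To estimate how much context a video consumes, multiply the frame count by the per-image token budget. The sketch assumes each frame costs exactly its visual token budget (70 to 1120 tokens) and ignores text and any separator tokens; the function name is hypothetical.

```python
import math

MAX_FRAMES = 60  # 60 seconds at one frame per second


def video_visual_tokens(seconds: float, tokens_per_frame: int, fps: float = 1.0) -> int:
    """Approximate visual tokens for a clip sampled at fps frames per second."""
    frames = math.ceil(seconds * fps)
    if frames > MAX_FRAMES:
        raise ValueError(f"{frames} frames exceeds the {MAX_FRAMES}-frame limit")
    return frames * tokens_per_frame


if __name__ == "__main__":
    for budget in (70, 280, 1120):
        print(budget, "tokens/frame ->", video_visual_tokens(60, budget), "tokens for 60 s")
```

At the lowest budget a full 60-second clip costs 4,200 visual tokens; at the highest it costs 67,200, which is why lower budgets are suggested for video.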

Model Data

Data used for model training and how the data was processed.

Training Dataset

Our pre-training dataset is a large-scale, diverse collection of data spanning a wide range of domains and modalities, including web documents, code, images, and audio, with a cutoff date of January 2025. Here are the key components:

  • Web Documents: A diverse collection of web text ensures the model is exposed to a broad range of linguistic styles, topics, and vocabulary. The training dataset includes content in over 140 languages.
  • Code: Exposing the model to code helps it to learn the syntax and patterns of programming languages, which improves its ability to generate code and understand code-related questions.
  • Mathematics: Training on mathematical text helps the model learn logical reasoning, symbolic representation, and address mathematical queries.
  • Images: A wide range of images enables the model to perform image analysis and visual data extraction tasks.

The combination of these diverse data sources is crucial for training a powerful multimodal model that can handle a wide variety of different tasks and data formats.

Data Preprocessing

Here are the key data cleaning and filtering methods applied to the training data:

  • CSAM Filtering: Rigorous CSAM (Child Sexual Abuse Material) filtering was applied at multiple stages in the data preparation process to ensure the exclusion of harmful and illegal content.
  • Sensitive Data Filtering: As part of making Gemma pre-trained models safe and reliable, automated techniques were used to filter out certain personal information and other sensitive data from training sets.
  • Additional methods: Filtering based on content quality and safety in line with our policies.

Ethics and Safety

As open models become central to enterprise infrastructure, provenance and security are paramount. Developed by Google DeepMind, DiffusionGemma undergoes the same rigorous safety evaluations as our proprietary Gemini models.

Evaluation Approach

DiffusionGemma was developed in partnership with internal safety and responsible AI teams. A range of automated as well as human evaluations were conducted to help improve model safety. These evaluations align with Google’s AI principles, as well as safety policies, which aim to prevent our generative AI models from generating harmful content, including:

  • Content related to child sexual abuse material and exploitation
  • Dangerous content (e.g., promoting suicide, or instructing in activities that could cause real-world harm)
  • Sexually explicit content
  • Hate speech (e.g., dehumanizing members of protected groups)
  • Harassment (e.g., encouraging violence against people)

Evaluation Results

For all areas of safety testing, we saw major improvements in all categories of content safety relative to previous generations of Gemma models. Overall, DiffusionGemma, like Gemma 4 models, significantly outperforms Gemma 3 and 3n models in improving safety, while keeping unjustified refusals low. All testing was intentionally conducted without safety filters to evaluate the model’s raw capabilities and baseline behaviors. For both text-to-text and image-to-text, and across all model sizes, the model produced minimal policy violations, and showed significant improvements over previous Gemma models.

Usage and Limitations

These models have certain limitations that users should be aware of.

Intended Usage

Multimodal models (capable of processing vision, language, and/or audio) have a wide range of applications across various industries and domains. The following list of potential uses is not comprehensive. The purpose of this list is to provide contextual information about the possible use-cases that the model creators considered as part of model training and development.

  • Content Creation and Communication
      - Text Generation: Generates creative text formats such as poems, scripts, code, marketing copy, and email drafts.
      - Chatbots and Conversational AI: Powers conversational interfaces for customer service, virtual assistants, or interactive applications.
      - Text Summarization: Generates concise summaries of a text corpus, research papers, or reports.
      - Image Data Extraction: Extracts, interprets and summarizes visual data for text communications.
  • Research and Education
      - Natural Language Processing (NLP) and VLM Research: Serves as a foundation for researchers to experiment with VLM and NLP techniques, develop algorithms, and contribute to the advancement of the field.
      - Language Learning Tools: Supports interactive language learning experiences, aiding in grammar correction or providing writing practice.
      - Knowledge Exploration: Assists researchers in exploring large bodies of text by generating summaries or answering questions about specific topics.

Limitations

  • Training Data
      - The quality and diversity of the training data significantly influence the model’s capabilities. Biases or gaps in the training data can lead to limitations in the model’s responses.
      - The scope of the training dataset determines the subject areas the model can handle effectively.
  • Context and Task Complexity
      - The model performs well on tasks that can be framed with clear prompts and instructions. Open-ended or highly complex tasks might be challenging.
      - The model’s performance can be influenced by the amount of context provided (longer context generally leads to better outputs, up to a certain point).
  • Language Ambiguity and Nuance
      - Natural language is inherently complex. The model might struggle to grasp subtle nuances, sarcasm, or figurative language.
  • Factual Accuracy
      - The model generates responses based on information it learned from its training datasets, but it is not a knowledge base. It may generate incorrect or outdated factual statements.
  • Common Sense
      - The model relies on statistical patterns in language. It might lack the ability to apply common sense reasoning in certain situations.

Ethical Considerations and Risks

In creating an open, vision-language model, we have carefully considered the following:

  • Bias and Fairness
      - VLMs trained on large-scale, real-world text and image data can reflect socio-cultural biases embedded in the training material. DiffusionGemma underwent careful scrutiny, input data pre-processing, and post-training evaluations as reported in this card to help mitigate the risk of these biases.
  • Misinformation and Misuse
      - VLMs can be misused to generate text that is false, misleading, or harmful.
      - Guidelines are provided for responsible use with the model; see the Responsible Generative AI Toolkit.
  • Transparency and Accountability
      - This model card summarizes details on the model’s architecture, capabilities, limitations, and evaluation processes.
      - A responsibly developed open model offers the opportunity to share innovation by making VLM technology accessible to developers and researchers across the AI ecosystem.

Risks identified and mitigations:

  • Generation of harmful content: Mechanisms and guidelines for content safety are essential. Developers are encouraged to exercise caution and implement appropriate content safety safeguards based on their specific product policies and application use cases.
  • Misuse for malicious purposes: Technical limitations and developer and end-user education can help mitigate against malicious applications of VLMs. Educational resources and reporting mechanisms for users to flag misuse are provided.
  • Privacy violations: Models were trained on data filtered for removal of certain personal information and other sensitive data. Developers are encouraged to adhere to privacy regulations with privacy-preserving techniques.
  • Perpetuation of biases: It’s encouraged to perform continuous monitoring (using evaluation metrics, human review) and the exploration of de-biasing techniques during model training, fine-tuning, and other use cases.

Benefits

At the time of release, this is a low-latency, high-performance open vision-language model that provides a compelling option for developers and those interested in researching diffusion language models. The model is designed from the ground up for responsible AI development compared to similarly sized models.
