Tag
Reports running DiffusionGemma 26B on four AMD 7900 XTX GPUs using vLLM, achieving 100 t/s generation and 45-60 t/s overall, and shares performance metrics and setup commands.
This post presents benchmark results and tuning parameters for running DiffusionGemma 26B A4B GGUF models on an RTX 5090 GPU, showing up to 44% speedup via optimized temperature settings and quantization choices.
NVIDIA released an NVFP4-quantized version of DiffusionGemma, a 26B MoE multimodal model, on Hugging Face, achieving over 1,100 tokens per second on Hopper hardware.
NVIDIA optimizes Google DeepMind's DiffusionGemma, an open model that generates text in parallel 256-token blocks, achieving up to 4x faster performance on local RTX GPUs, DGX Spark, and DGX Station systems.
DiffusionGemma is a new experimental model from Google DeepMind that uses parallel generation on a 256-token canvas, achieving up to 4x faster token generation on GPUs. This developer guide explains its architecture, bidirectional context, and includes a fine-tuning recipe for solving Sudoku.
Unsloth releases GGUF quantizations of Google DeepMind's DiffusionGemma (26B-A4B), a new block-diffusion architecture for faster text generation, ready for llama.cpp.