DiffusionGemma 4 on 4x 7900 XTX

Reddit r/LocalLLaMA 06/11/26, 03:18 PM News

diffusiongemma vllm amd 7900xtx inference performance gpu

Summary

Reports running DiffusionGemma 26B on four AMD 7900 XTX GPUs with vLLM, reaching about 100 tokens/s during generation and 45-60 tokens/s overall once prompt processing is included, and shares performance metrics and the setup command.

Just got 100 tps on generation, but measured over total time it is around 45-60 t/s because of waiting on prompt processing.

Available memory shows:

    GPU KV cache size: 152,671 tokens
    Maximum concurrency for 131,072 tokens per request: 1.16x

amd-smi monitor for these GPUs:

    GPU  XCP  POWER  GPU_T  MEM_T  GFX_CLK   GFX%   MEM%  ENC%  DEC%  VRAM_USAGE
    3    0    183 W  82 °C  84 °C  3036 MHz  100 %  5 %   N/A   0 %   23.6/24.0 GB
    5    0    161 W  81 °C  88 °C  3101 MHz  100 %  0 %   N/A   0 %   23.7/24.0 GB
    7    0    165 W  78 °C  86 °C  3095 MHz  100 %  1 %   N/A   0 %   23.7/24.0 GB
    8    0    154 W  80 °C  88 °C  3090 MHz  100 %  0 %   N/A   0 %   23.6/24.0 GB

Launch script:

    # DiffusionGemma 26B on vllm dgemma branch (4x 7900 XTX)
    # $1 = container name and served model name, $2 = host port
    set -uo pipefail

    docker run --name "$1" \
      --rm --tty --ipc=host --shm-size=32g \
      --device /dev/kfd:/dev/kfd \
      --device /dev/dri/renderD131:/dev/dri/renderD131 \
      --device /dev/dri/renderD133:/dev/dri/renderD133 \
      --device /dev/dri/renderD136:/dev/dri/renderD136 \
      --device /dev/dri/renderD135:/dev/dri/renderD135 \
      --device /dev/mem:/dev/mem \
      --security-opt seccomp=unconfined \
      --group-add video \
      -e HIP_VISIBLE_DEVICES=0,1,2,3 \
      -e ROCR_VISIBLE_DEVICES=0,1,2,3 \
      -v /mnt/tb_disk/llm:/app/models:ro \
      -v /mnt/tb_disk/llm/torch_compile_cache:/root/.cache/vllm/torch_compile_cache \
      -v /opt/services/llama-swap/moe_configs/E=128,N=176,device_name=AMD_Radeon_RX7900XTX.json:/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=128,N=176,device_name=AMD_Radeon_RX7900XTX.json:ro \
      -e TRUST_REMOTE_CODE=1 \
      -e OMP_NUM_THREADS=8 \
      -e PYTORCH_TUNABLEOP_ENABLED=1 \
      -e GPU_MAX_HW_QUEUES=1 \
      -e VLLM_ROCM_USE_AITER=0 \
      -e VLLM_ROCM_USE_AITER_MOE=0 \
      -e VLLM_USE_V2_MODEL_RUNNER=1 \
      -e PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:256 \
      -p "$2":8000 \
      --entrypoint vllm \
      vllm-dgemma:nocompile \
      serve \
      /app/models/models/vllm/diffusiongemma-26B-A4B-it \
      --served-model-name "$1" --host 0.0.0.0 --port 8000 --trust-remote-code \
      --gpu-memory-utilization 0.65 --tensor-parallel-size 4 \
      --tool-call-parser gemma4 --enable-auto-tool-choice \
      --reasoning-parser gemma4 \
      --attention-backend TRITON_ATTN \
      --max-num-seqs 2 --max-model-len 131072 \
      --generation-config vllm \
      --hf-overrides '{"diffusion_sampler": "entropy_bound", "diffusion_entropy_bound": 0.1}'

So it works, but to launch it we spent 2-3M deepseek-v4-pro tokens preparing the Docker image.
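The two headline numbers in the post can be sanity-checked with a little arithmetic. The sketch below is not from the original post, and its function names are made up for illustration. It assumes that the reported concurrency is simply the KV cache size divided by the per-request context length, and that "overall t/s" means generated tokens divided by total wall time (prompt-processing wait plus decode). Under that second assumption, the share of wall time spent waiting is 1 - overall/decode.

```python
# Sanity-check sketch for the numbers quoted in the post.
# Assumption (not confirmed by the post): "overall t/s" is
# generated tokens / total wall time, where total wall time is
# prompt-processing wait plus decode time.


def max_concurrency(kv_cache_tokens: int, max_model_len: int) -> float:
    """How many full-length requests fit in the KV cache at once (simple ratio)."""
    return kv_cache_tokens / max_model_len


def prefill_wait_share(decode_tps: float, overall_tps: float) -> float:
    """Fraction of wall time not spent decoding, under the assumption above.

    overall = N / (t_wait + N / decode)  =>  t_wait / t_total = 1 - overall / decode
    """
    return 1.0 - overall_tps / decode_tps


if __name__ == "__main__":
    # Values taken from the post: 152,671 cached tokens, 131,072-token requests.
    print(f"concurrency: {max_concurrency(152_671, 131_072):.2f}x")

    # 100 t/s decode versus the reported 45-60 t/s overall.
    for overall in (45, 60):
        share = prefill_wait_share(100, overall)
        print(f"overall {overall} t/s -> {share:.0%} of wall time waiting")
```

The first line prints 1.16x, which matches the logged value. If the wall-time assumption holds, the 45-60 t/s range corresponds to roughly 40-55% of each request being spent waiting on prompt processing rather than generating.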

Similar Articles

@mervenoyann: DiffusionGemma is out it's compute-bound so 4x faster compared to other Gemma-4 models (1k tok/s on H100) also great on…

DiffusionGemma 26B A4B results on my 5090

Gemma 4 26B Hits 600 Tok/s on One RTX 5090

DiffusionGemma under real workloads feels very different from benchmark demos

Gemma 4 QAT benchmark results (AMD 7900 XTX): faster, less VRAM, no quality loss
