TurboQuant + MTP for ROCm (llama.cpp)

Reddit r/LocalLLaMA Tools

Summary

A developer gets TurboQuant TBQ4 KV cache and Multi-Token Prediction working on AMD ROCm for RDNA3 GPUs in llama.cpp, enabling 64k context on 24 GB VRAM with competitive token rates.

TL;DR: I got TBQ4 KV cache + MTP working on AMD ROCm for the RX 7900 XTX / RDNA3 / gfx1100 in llama.cpp. Main win: 64k context fits on 24 GB VRAM and remains usable.

Branch: tbq4-rdna3-experiment (https://github.com/DrBearJew/llama.cpp/tree/tbq4-rdna3-experiment)

I dug into TurboQuant / TBQ4 + MTP on AMD because the existing AMD paths were incomplete or broken for my setup. This branch uses the ROCm VEC Flash Attention path with inline TBQ4 dequant.

Test setup:

- RX 7900 XTX, 24 GB
- RDNA3 / gfx1100
- ROCm / HIP
- Qwen3.6-27B Q4_K_M MTP GGUF
- tbq4_0 KV cache
- MTP with --spec-draft-n-max 3

Current numbers:

- tbq4_0, 64k ctx: 38–54 tok/s, ~20 GB VRAM
- Prefill: 537.7 tok/s at 16k; 360.8 tok/s in the 64k test
- q8_0 baseline: ~49.8 tok/s at 16k, ~31 tok/s at 32k, ~22–23 GB VRAM

Caveats:

- The RX 7900 XTX is RDNA3 / gfx1100, not RDNA3.5.
- RDNA3.5 / RDNA4 paths are enabled but untested.
- RotorQuant / PlanarQuant / IsoQuant are present but not validated.
- These are reported points from separate runs, not a clean scaling curve.

Happy to have new testers. Useful bug reports > hype.
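For readers who want to reproduce a run like the one described, the invocation might look roughly like the sketch below. Only `--spec-draft-n-max 3` and the `tbq4_0` cache type are stated in the post; the flag names `-ctk`/`-ctv` (KV cache quantization type), `-c` (context size), and `-ngl` (GPU layer offload) are assumptions carried over from mainline llama.cpp conventions, and the model filename is illustrative, not a real artifact name.

```shell
# Hypothetical invocation on the tbq4-rdna3-experiment branch.
# Assumes the branch reuses mainline llama.cpp flags; the model
# filename below is a placeholder for the Qwen3.6-27B Q4_K_M MTP GGUF.
./llama-server \
  -m Qwen3.6-27B-Q4_K_M-MTP.gguf \
  -ctk tbq4_0 -ctv tbq4_0 \
  -c 65536 \
  -ngl 99 \
  --spec-draft-n-max 3
```

If the branch's flags differ, `./llama-server --help` on a build of that branch is the authoritative reference.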

Similar Articles

MTP+GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 - llama.cpp

Reddit r/LocalLLaMA

A user benchmarks token generation speed on llama.cpp with the GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 flag, comparing performance with and without MTP (Multi-Token Prediction). Results show a significant speedup from 49 tok/s to 64 tok/s when MTP is enabled on an RTX5090 with a Qwen3.6-27B model.

Running Qwen3.6 35b a3b on 8gb vram and 32gb ram ~190k context

Reddit r/LocalLLaMA

The author shares a high-performance local inference configuration for running Qwen3.6 35B A3B on limited hardware (8GB VRAM, 32GB RAM) using a modified llama.cpp with TurboQuant support, achieving ~37-51 tok/sec with ~190k context.