TurboQuant + MTP for ROCm (llama.cpp)

Reddit r/LocalLLaMA Tools

Summary

A developer gets TurboQuant TBQ4 KV cache and Multi-Token Prediction (MTP) working on AMD ROCm for RDNA3 GPUs in llama.cpp, fitting 64k context in 24 GB VRAM at a reported 38–54 tok/s.

TL;DR: I got TBQ4 KV cache + MTP working on AMD ROCm for RX 7900 XTX / RDNA3 / gfx1100 in llama.cpp. Main win: 64k context fits on 24 GB VRAM and remains usable.

Branch: tbq4-rdna3-experiment (https://github.com/DrBearJew/llama.cpp/tree/tbq4-rdna3-experiment)

I dug into TurboQuant / TBQ4 + MTP on AMD because the existing AMD paths were incomplete or broken for my setup. This branch uses the ROCm VEC Flash Attention path with inline TBQ4 dequant.

Test setup:

- RX 7900 XTX, 24 GB
- RDNA3 / gfx1100
- ROCm / HIP
- Qwen3.6-27B Q4_K_M MTP GGUF
- tbq4_0 KV cache
- MTP with --spec-draft-n-max 3

Current numbers:

- tbq4_0, 64k ctx: 38–54 tok/s, ~20 GB VRAM
- Prefill: 537.7 tok/s at 16k; 360.8 tok/s in the 64k test
- q8_0 baseline: ~49.8 tok/s at 16k, ~31 tok/s at 32k, ~22–23 GB VRAM

Caveats:

- RX 7900 XTX is RDNA3 / gfx1100, not RDNA3.5.
- RDNA3.5 / RDNA4 are enabled but untested.
- RotorQuant / PlanarQuant / IsoQuant are present but not validated.
- These are reported points from separate runs, not a clean scaling curve.

New testers are welcome. Useful bug reports > hype.
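
For anyone wondering why KV cache precision decides whether 64k context fits in 24 GB, here is a rough back-of-the-envelope sketch. It is not taken from the branch. The layer count, KV head count, and head dimension are hypothetical placeholders, not Qwen3.6-27B's real config, and the 4.5 bits/value figure assumes a q4_0-like block layout, which may not match how tbq4_0 is actually stored. The q8_0 figure (34 bytes per block of 32 values) is the standard ggml layout.

```python
# Rough KV cache size estimate for a plain transformer (K and V per layer).
# Bits per cached value, including per-block scale overhead.
BITS_PER_VALUE = {
    "f16": 16.0,
    "q8_0": 8.5,          # ggml q8_0: 34 bytes per block of 32 values
    "4bit_assumed": 4.5,  # ASSUMPTION: q4_0-like layout; tbq4_0 may differ
}


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bits_per_value):
    """Bytes needed to hold K and V for n_ctx tokens across all layers."""
    values = 2 * n_layers * n_kv_heads * head_dim * n_ctx  # 2 = K and V
    return values * bits_per_value / 8


# HYPOTHETICAL model shape, chosen only to make the arithmetic concrete.
N_LAYERS, N_KV_HEADS, HEAD_DIM = 64, 8, 128
N_CTX = 64 * 1024

for name, bits in BITS_PER_VALUE.items():
    gib = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, N_CTX, bits) / 2**30
    print(f"{name:>13}: {gib:.2f} GiB of KV cache at {N_CTX} tokens")
```

Under those placeholder numbers, halving the bits per value roughly halves the cache, and that is the headroom that decides whether a long context fits next to the model weights. Plug in the real model dimensions from the GGUF metadata before drawing conclusions for a specific model.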