TurboQuant + MTP for ROCm (llama.cpp)
Summary
A developer gets the TurboQuant TBQ4 KV cache and Multi-Token Prediction (MTP) working in llama.cpp on AMD ROCm for RDNA3 GPUs, enabling 64k context on 24 GB of VRAM at competitive token rates.
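A minimal sketch of how such a setup is built and launched with mainline llama.cpp, assuming ROCm is installed and the card is gfx1100 (RDNA3, e.g. a 7900 XTX); the TBQ4 cache type is the fork's addition and not upstream, so the stock q4_0 KV-cache types stand in for it here:

```bash
# Build llama.cpp with the ROCm/HIP backend for RDNA3 (gfx1100).
cmake -B build -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1100 -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j

# Serve with a 64k context (-c), all layers on the GPU (-ngl),
# flash attention (-fa, needed for a quantized V cache), and a
# 4-bit KV cache. q4_0 is the mainline type; the fork's TBQ4 would
# be selected the same way (hypothetical value, not upstream).
./build/bin/llama-server -m ./models/model.gguf -c 65536 -ngl 99 -fa \
  --cache-type-k q4_0 --cache-type-v q4_0
```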
Similar Articles
Got MTP + TurboQuant running: Qwen3.6-27B, 80+ t/s at 262K context on a single RTX 4090
Developer achieved 80+ t/s inference on Qwen3.6-27B with 262K context on a single RTX 4090 by combining MTP (Multi-Token Prediction) with TurboQuant's lossless KV cache compression, sharing their implementation fork and technical details.
80 tok/sec and 128K context on 12GB VRAM with Qwen3.6 35B A3B and llama.cpp MTP
A user shares a configuration for achieving over 80 tokens per second with Qwen3.6 35B A3B on a 12GB VRAM GPU using llama.cpp and Multi-Token Prediction (MTP). The post includes benchmark results and specific command-line parameters to optimize performance.
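The post's exact command line isn't reproduced above, but a plausible shape for a 128K-context run on 12 GB, assuming a MoE model and mainline flags (the model path and the expert-offload regex are illustrative, not the author's):

```bash
# Long-context, small-VRAM sketch for a MoE model: keep attention
# and dense weights on the GPU, quantize the KV cache to 4 bits,
# and route the large FFN expert tensors to system RAM with
# --override-tensor (-ot).
./build/bin/llama-server -m ./models/qwen-a3b.gguf \
  -c 131072 -ngl 99 -fa \
  --cache-type-k q4_0 --cache-type-v q4_0 \
  --override-tensor "ffn_.*_exps=CPU"
```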
Multi-Token Prediction (MTP) for Qwen on llama.cpp + TurboQuant
Implemented Multi-Token Prediction for Qwen in llama.cpp with TurboQuant, achieving a 40% performance boost at a 90% acceptance rate, running locally on a MacBook Pro M5 Max.
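The 40%-boost / 90%-acceptance pairing is consistent with the standard speculative-decoding estimate. A back-of-envelope check, assuming a single drafted token per step and i.i.d. acceptance (neither stated in the post):

```latex
% Expected committed tokens per step with k drafted tokens and
% per-token acceptance probability p (i.i.d. assumption):
\mathbb{E}[\text{tokens/step}] = \frac{1 - p^{\,k+1}}{1 - p}
% With p = 0.9 and one MTP draft token (k = 1):
%   (1 - 0.9^2) / (1 - 0.9) = 1.9 tokens/step,
% so a net 40% gain (1.4x) rather than 1.9x implies the combined
% draft + verify step costs roughly 1.9 / 1.4 ≈ 1.36x a plain
% decode step.
```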
MTP+GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 - llama.cpp
A user benchmarks token generation speed in llama.cpp with the GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 environment variable, comparing performance with and without MTP (Multi-Token Prediction). Results show a significant speedup, from 49 tok/s to 64 tok/s (~31%), when MTP is enabled on an RTX 5090 with a Qwen3.6-27B model.
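For context, GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 is an environment variable recognized by llama.cpp's CUDA backend on Linux that lets allocations oversubscribe VRAM into system memory. A minimal way to reproduce the baseline half of such a benchmark, with the model path as a placeholder (mainline llama-bench has no MTP switch; that toggle lives in the fork):

```bash
# Allow CUDA allocations to spill into system RAM (Linux, CUDA build).
export GGML_CUDA_ENABLE_UNIFIED_MEMORY=1

# Measure throughput: -p is the prompt length, -n the number of
# generated tokens. Run once on mainline and once on the MTP fork
# and compare the tok/s columns.
./build/bin/llama-bench -m ./models/qwen-27b.gguf -ngl 99 -p 512 -n 128
```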
Running Qwen3.6 35B A3B on 8GB VRAM and 32GB RAM, ~190k context
The author shares a high-performance local inference configuration for running Qwen3.6 35B A3B on limited hardware (8GB VRAM, 32GB RAM) using a modified llama.cpp with TurboQuant support, achieving ~37-51 tok/sec with ~190k context.
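The common thread in these setups is that a roughly 4-bit KV cache shrinks the dominant long-context memory cost. A rough sizing sketch, with the layer count, KV-head count, and head dimension as illustrative placeholders rather than the model's real numbers:

```bash
# Back-of-envelope KV-cache size:
#   2 (K and V) * layers * kv_heads * head_dim * context * bytes/elem
# Placeholder dimensions below; q4 treated as ~4x smaller than fp16
# (real 4-bit formats carry scale overhead, so slightly more).
layers=48; kv_heads=4; head_dim=128; ctx=190000
fp16_bytes=$(( 2 * layers * kv_heads * head_dim * ctx * 2 ))
q4_bytes=$(( fp16_bytes / 4 ))
echo "fp16 KV: $(( fp16_bytes / 1024 / 1024 )) MiB"
echo "q4   KV: $(( q4_bytes  / 1024 / 1024 )) MiB"
```

With these placeholder dimensions the fp16 cache lands around 17 GiB while the 4-bit cache is under 5 GiB, which is the difference between impossible and feasible on an 8 GB card once expert tensors are also offloaded to RAM.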