TurboQuant + MTP for ROCm (llama.cpp)
Summary
A developer gets TurboQuant TBQ4 KV cache and Multi-Token Prediction working on AMD ROCm for RDNA3 GPUs in llama.cpp, enabling 64k context on 24 GB VRAM with competitive token rates.
Similar Articles
Got MTP + TurboQuant running — Qwen3.6-27B -- 80+ t/s at 262K context on a single RTX 4090
A developer achieved 80+ t/s inference on Qwen3.6-27B with 262K context on a single RTX 4090 by combining MTP (Multi-Token Prediction) with TurboQuant's lossless KV cache compression, and shares the implementation fork and technical details.
2× Radeon R9700 — Qwen 3.6 27B Q8 MTP on llama.cpp
A technical report on running the Qwen 3.6 27B Q8 model on a dual AMD Radeon R9700 setup using llama.cpp with ROCm, including performance benchmarks and configuration details.
80 tok/sec and 128K context on 12GB VRAM with Qwen3.6 35B A3B and llama.cpp MTP
A user shares a configuration for achieving over 80 tokens per second with Qwen3.6 35B A3B on a 12GB VRAM GPU using llama.cpp and Multi-Token Prediction (MTP). The post includes benchmark results and specific command-line parameters to optimize performance.
Multi-Token Prediction (MTP) for Qwen on LLaMA.cpp + TurboQuant
An implementation of Multi-Token Prediction for Qwen in llama.cpp with TurboQuant, achieving a 40% performance boost and a 90% acceptance rate while running locally on a MacBook Pro M5 Max.
@no_stp_on_snek: got it here if ya want to try it out:
A fork of llama.cpp integrating TurboQuant+ for advanced KV-cache and weight quantization, with cross-backend kernel support (Apple Silicon, NVIDIA CUDA, AMD ROCm, Vulkan) and used in production by LocalAI, Chronara, and AtomicChat.