A developer gets TurboQuant TBQ4 KV cache and Multi-Token Prediction (MTP) working on AMD ROCm for RDNA3 GPUs in llama.cpp, enabling 64k context on 24 GB VRAM at competitive token rates.
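The 64k-on-24GB figure is plausible from KV-cache arithmetic alone. Here is a back-of-envelope sketch; the layer count, KV-head count, and head dimension are hypothetical stand-ins (the post doesn't give Qwen3.6's exact architecture), and TBQ4 is modeled simply as roughly 4.5 bits per element, which is also an assumption:

```python
# Rough KV-cache sizing for a long-context run. All architecture numbers
# below are hypothetical placeholders, not confirmed Qwen3.6 specs.
n_layers = 64        # assumed transformer layer count
n_kv_heads = 8       # assumed GQA key/value heads
head_dim = 128       # assumed per-head dimension
ctx = 64 * 1024      # 64k-token context window

def kv_cache_gib(bits_per_elt: float) -> float:
    # 2x for keys and values, one entry per layer per position.
    elts = 2 * n_layers * n_kv_heads * head_dim * ctx
    return elts * bits_per_elt / 8 / 2**30

print(f"f16 KV cache : {kv_cache_gib(16):5.1f} GiB")   # ~16.0 GiB
print(f"~4-bit KV    : {kv_cache_gib(4.5):5.1f} GiB")  # ~4.5 GiB
```

Under these assumptions an f16 cache alone would rival the whole 24 GB budget once the weights are loaded, so a ~4-bit cache is what makes the 64k window fit.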
UnslothAI founder Daniel Han released experimental MTP GGUF versions of Qwen3.6, achieving 140 tokens/s for the 27B model and 220 tokens/s for the 35B-A3B on consumer GPUs, a 1.4x speedup with zero accuracy loss.
A systematic analysis of Qwen3.6 27B benchmarks reveals that speculative inference (MTP) significantly accelerates coding tasks but slows down creative writing, with task type mattering more than quantization or temperature settings.
A user shares a configuration achieving over 80 tokens per second with Qwen3.6 35B-A3B on a 12 GB VRAM GPU using llama.cpp and MTP. The post includes benchmark results and the specific command-line parameters used to tune performance.
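For reproducing this kind of setup, the main knob is how many layers stay on the GPU. A small budgeting sketch follows; the file size, layer count, and overheads are made-up illustrative numbers, not the post's actual values:

```python
# Estimate how many layers of a GGUF fit in a VRAM budget, to pick a
# starting value for llama.cpp's -ngl (--n-gpu-layers) flag.
# All numbers here are illustrative assumptions, not the post's values.
model_file_gib = 18.0   # assumed size of the quantized 35B-A3B GGUF
n_layers = 48           # assumed layer count
vram_gib = 12.0
overhead_gib = 1.5      # context buffers, compute scratch, display, etc.
kv_cache_gib = 1.0      # assumed KV cache at the chosen context size

per_layer_gib = model_file_gib / n_layers
budget = vram_gib - overhead_gib - kv_cache_gib
n_gpu_layers = int(budget / per_layer_gib)
print(f"-ngl {min(n_gpu_layers, n_layers)}  (~{per_layer_gib:.2f} GiB/layer)")
```

For MoE models like the A3B variants, setups like this typically also keep the expert tensors in system RAM via llama.cpp's tensor-override options, so the always-active dense weights fill the GPU first.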
A user benchmarked MTP on Gemma 4 with mlx-vlm on an M4 Max Mac Studio, finding it excellent for code generation (1.53x faster, 66% acceptance) but detrimental for JSON output (50% slower, only 8% acceptance) and neutral for long-form prose, suggesting MTP's benefits vanish when acceptance drops below 50%.
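The acceptance-rate threshold follows from the standard speculative-decoding cost model. A toy sketch; the draft depth k and relative overhead c are assumed values chosen for illustration, not measurements from the post:

```python
# Toy cost model for speculative decoding / MTP with k drafted tokens:
# assuming an i.i.d. per-token acceptance probability a, the expected
# tokens accepted per verification step is (1 - a**(k+1)) / (1 - a);
# drafting adds a relative overhead c per drafted token.
def speedup(a: float, k: int = 3, c: float = 0.19) -> float:
    expected_tokens = (1 - a ** (k + 1)) / (1 - a)
    return expected_tokens / (1 + k * c)

for a in (0.66, 0.50, 0.08):
    print(f"acceptance {a:.0%}: {speedup(a):.2f}x")
```

With these assumed constants the model lands near the reported 1.53x at 66% acceptance and shows a clear net loss at 8%; the exact break-even point depends on overheads the post doesn't quantify.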
A developer achieved 80+ t/s inference on Qwen3.6 27B with 262K context on a single RTX 4090 by combining MTP with TurboQuant's lossless KV cache compression, sharing their implementation fork and technical details.
llama.cpp will soon support MTP, speeding up inference by predicting several tokens per forward pass.
The author provides extracted GGUF files containing only the MTP tensors for Qwen3.6 models, so users can graft the tensors onto full models they already have instead of downloading complete MTP-enabled files.
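Before grafting, it's worth verifying what an MTP-only file actually contains. A minimal inspection sketch using the gguf Python package (pip install gguf); the filename and the "mtp" substring in tensor names are assumptions, since the post's exact naming scheme isn't given:

```python
from gguf import GGUFReader  # gguf-py, the reader that ships with llama.cpp

# Hypothetical filename; the post's actual artifact names aren't reproduced here.
reader = GGUFReader("qwen3.6-27b-mtp-tensors.gguf")

total_bytes = 0
for t in reader.tensors:
    # Assumes MTP tensors are identifiable by name; adjust to the real scheme.
    if "mtp" in t.name.lower():
        total_bytes += int(t.n_bytes)
        shape = "x".join(str(int(d)) for d in t.shape)
        print(f"{t.name:48s} {shape:>16s} {t.tensor_type.name}")

print(f"\nMTP tensor payload: {total_bytes / 2**20:.1f} MiB")
```

The graft itself, copying these tensors into a full model file, can then be done with gguf's GGUFWriter or with whatever tooling the post ships.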