MTP+GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 - llama.cpp

Reddit r/LocalLLaMA Tools

Summary

A user benchmarks token generation speed on llama.cpp with the GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 flag, comparing performance with and without MTP (Multi-Token Prediction). Results show a significant speedup from 49 tok/s to 64 tok/s when MTP is enabled on an RTX5090 with a Qwen3.6-27B model.

I was wondering what the difference in results would be with `GGML_CUDA_ENABLE_UNIFIED_MEMORY=1` alone vs **MTP + `GGML_CUDA_ENABLE_UNIFIED_MEMORY=1`**. The results are quite interesting: **49 tok/sec without MTP** vs **64 tok/sec with MTP.**

**PC:** RTX 5090 + 128 GB DDR5-5600 CL36 + Ryzen 9 9950X3D

**Model:** Qwen3.6-27B-Q8_0.gguf (Unsloth with MTP)

Command:

```
CUDA_VISIBLE_DEVICES=0 GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 /home/marcin/llama-server \
  -m /home/marcin/Pobrane/Qwen3.6-27B-Q8_0.gguf \
  --threads 16 \
  -c 262144 -fa on -np 1 \
  --spec-type mtp --spec-draft-n-max 3 \
  --webui-mcp-proxy \
  --chat-template-kwargs '{"preserve_thinking": true}' \
  --host 0.0.0.0 \
  --port 8090 \
  --jinja
```
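For anyone who wants to reproduce the tok/sec comparison, here is a minimal sketch of one way to measure it against the server started by the command above. It assumes the server is reachable on `localhost:8090` (the port from the command) via llama-server's OpenAI-compatible `/v1/chat/completions` endpoint; the prompt, `max_tokens` value, and the `requests` dependency are illustrative choices, not from the original post.

```python
# Minimal sketch: time one non-streaming generation against llama-server and
# report completion tokens per second. Assumes the server from the command
# above is listening on localhost:8090.
import time
import requests  # assumption: requests is installed

URL = "http://localhost:8090/v1/chat/completions"

payload = {
    "messages": [{"role": "user", "content": "Write a short story about a robot."}],
    "max_tokens": 512,      # generate enough tokens for a stable tok/s estimate
    "temperature": 0.0,
}

start = time.perf_counter()
resp = requests.post(URL, json=payload, timeout=600)
resp.raise_for_status()
elapsed = time.perf_counter() - start

completion_tokens = resp.json()["usage"]["completion_tokens"]
print(f"{completion_tokens} tokens in {elapsed:.2f}s "
      f"-> {completion_tokens / elapsed:.1f} tok/s (wall clock, incl. prompt processing)")
```

Running the same script once with the `--spec-type mtp --spec-draft-n-max 3` flags and once without gives a rough apples-to-apples comparison; since the wall-clock figure includes prompt processing, a short prompt and a larger `max_tokens` keep that overhead small.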

Similar Articles

More Qwen3.6-27B MTP success but on dual Mi50s

Reddit r/LocalLLaMA

The article benchmarks the Qwen3.6-27B model using Multi-Token Prediction (MTP) and tensor parallelism on dual Mi50 GPUs, demonstrating significant speedups via llama.cpp.

MTP is all about acceptance rate

Reddit r/LocalLLaMA

A user benchmarked MTP (Multi-Token Prediction) on Gemma 4 with mlx-vlm on M4 Max Studio, finding it excellent for code generation (1.53x faster, 66% acceptance) but detrimental for JSON output (50% slower, only 8% acceptance) and neutral for long-form prose, suggesting MTP benefits vanish when acceptance drops below 50%.
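The "benefits vanish below ~50% acceptance" observation can be sanity-checked with a toy cost model of speculative decoding. This is not the accounting any particular MTP implementation uses; it assumes `k` drafted tokens per step, each accepted independently with probability `a`, and a per-token draft cost expressed as a fraction of one verification pass (`k = 3` and `draft_cost = 0.3` below are illustrative assumptions).

```python
# Toy model of speculative-decoding speedup vs. acceptance rate.
# Assumptions (not from the article): k drafted tokens per step, independent
# acceptance with probability a, drafting one token costs draft_cost of a
# verification pass, and the target model always emits one token per pass.
def expected_speedup(a: float, k: int = 3, draft_cost: float = 0.3) -> float:
    # Expected tokens emitted per step: accepted prefix of the k drafts plus
    # the one token the target model contributes = (1 - a**(k+1)) / (1 - a).
    expected_tokens = sum(a**i for i in range(k + 1))
    step_cost = 1.0 + k * draft_cost   # one verify pass + k drafted tokens
    return expected_tokens / step_cost  # baseline emits 1 token per verify pass

for a in (0.08, 0.50, 0.66, 0.90):
    print(f"acceptance {a:.2f}: ~{expected_speedup(a):.2f}x")
```

Where the break-even lands depends heavily on the assumed draft cost and rejection overhead, so the numbers are purely illustrative; the point the model makes is only that throughput falls off quickly once the acceptance rate drops, which matches the JSON-output result reported above.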