llama.cpp docker images to run MTP models


Summary

Provides Docker images for running MTP models with llama.cpp, including quantization comparisons and usage instructions.

This is a follow-up to my previous post: https://www.reddit.com/r/LocalLLaMA/comments/1t5ageq/

There have been many improvements to the MTP pull request and to the llama.cpp main branch, such as image support and various bug fixes. I recently made a new build for my local machine, but keeping guides up to date is an issue, so I built Docker images to make running them easier. If you are already using the llama.cpp Docker images, switching over should be straightforward until official builds support MTP.

Here, pick your flavour:

```
havenoammo/llama:cuda13-server
havenoammo/llama:cuda12-server
havenoammo/llama:vulkan-server
havenoammo/llama:intel-server
havenoammo/llama:rocm-server
```

I have not been able to test all of them, as I only run cuda13 for now. Feel free to give them a try and see if they work on your hardware.

Also, Unsloth released MTP models for Qwen 3.6, which makes my previous grafted models obsolete. You can find them here if you missed them:

* https://huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF
* https://huggingface.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF

I believe they quantize some of the MTP layers, whereas I kept mine at Q8 quantization for improved prediction. It is possible that keeping the MTP layers at higher precision makes their draft predictions more accurate, giving you more speed at the cost of more VRAM usage. I will keep my versions up for now, until I finish some benchmarks and am sure they are fully obsolete.

*Quick edit:* They do quantize the MTP layers more aggressively. Here is a comparison:

| Tensor | havenoammo (UD XL + Q8_0 MTP) | Unsloth (UD XL) |
|---|---|---|
| `blk.64.attn_k.weight` | **Q8_0** | Q3_K |
| `blk.64.attn_k_norm.weight` | F32 | F32 |
| `blk.64.attn_norm.weight` | F32 | F32 |
| `blk.64.attn_output.weight` | **Q8_0** | Q4_K |
| `blk.64.attn_q.weight` | **Q8_0** | Q3_K |
| `blk.64.attn_q_norm.weight` | F32 | F32 |
| `blk.64.attn_v.weight` | **Q8_0** | Q5_K |
| `blk.64.ffn_down.weight` | **Q8_0** | Q4_K |
| `blk.64.ffn_gate.weight` | **Q8_0** | Q3_K |
| `blk.64.ffn_up.weight` | **Q8_0** | Q3_K |
| `blk.64.nextn.eh_proj.weight` | Q8_0 | Q8_0 |
| `blk.64.nextn.enorm.weight` | F32 | F32 |
| `blk.64.nextn.hnorm.weight` | F32 | F32 |
| `blk.64.nextn.shared_head_norm.weight` | F32 | F32 |
| `blk.64.post_attention_norm.weight` | F32 | F32 |
| MTP layers size | 430.41 MB | 222.33 MB |

I will run some benchmarks to see whether this quantization causes any precision or speed loss for multi-token prediction. Until then, if you have the VRAM, feel free to test out my releases:

* https://huggingface.co/havenoammo/Qwen3.6-35B-A3B-MTP-GGUF
* https://huggingface.co/havenoammo/Qwen3.6-27B-MTP-UD-GGUF

Finally, here is how I use it:

```
docker run --gpus all --rm \
  -p 8080:8080 \
  -v ./models:/models \
  havenoammo/llama:cuda13-server \
  -m /models/Qwen3.6-27B-MTP-UD-Q8_K_XL.gguf \
  --port 8080 \
  --host 0.0.0.0 \
  -n -1 \
  --parallel 1 \
  --ctx-size 262144 \
  --fit-target 844 \
  --mmap \
  -ngl -1 \
  --flash-attn on \
  --metrics \
  --temp 1.0 \
  --min-p 0.0 \
  --top-p 0.95 \
  --top-k 20 \
  --jinja \
  --chat-template-kwargs '{"preserve_thinking":true}' \
  --ubatch-size 512 \
  --batch-size 2048 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --spec-type mtp \
  --spec-draft-n-max 3
```

Adjust as you see fit. What matters most for MTP is `--spec-type mtp` and `--spec-draft-n-max 3`.
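If you would rather manage this with Compose than a long `docker run` invocation, here is a minimal `docker-compose.yml` sketch. It is my own illustration, not something shipped with the images: it assumes the image's entrypoint launches the server itself (as the `docker run` example above implies), so only server arguments go in `command`; the service name, the trimmed flag set, and the model filename simply mirror the command above, and the GPU reservation requires the NVIDIA Container Toolkit on the host.

```
services:
  llama:
    image: havenoammo/llama:cuda13-server
    ports:
      - "8080:8080"
    volumes:
      - ./models:/models
    # Server arguments only; the image entrypoint is assumed to start the server.
    command: >
      -m /models/Qwen3.6-27B-MTP-UD-Q8_K_XL.gguf
      --host 0.0.0.0 --port 8080
      --ctx-size 262144 --flash-attn on --jinja
      --cache-type-k q8_0 --cache-type-v q8_0
      --spec-type mtp --spec-draft-n-max 3
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```

Once the container is up (`docker compose up -d`), a quick smoke test against the OpenAI-compatible endpoint that the llama.cpp server exposes:

```
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Say hello"}]}'
```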