llama.cpp docker images to run MTP models


Summary

Provides Docker images for running MTP models with llama.cpp, including quantization comparisons and usage instructions.

This is a follow-up to my previous post: https://www.reddit.com/r/LocalLLaMA/comments/1t5ageq/

There have been many improvements to the MTP pull request and to the llama.cpp main branch, such as image support and various bug fixes. I recently made a new build for my local machine, but keeping guides up to date is an issue, so I built Docker images to make running them easier. If you are already using the llama.cpp Docker images, switching over should be straightforward until official builds support MTP.

Here, pick your flavour:

```
havenoammo/llama:cuda13-server
havenoammo/llama:cuda12-server
havenoammo/llama:vulkan-server
havenoammo/llama:intel-server
havenoammo/llama:rocm-server
```

I have not been able to test all of them, as I only run cuda13 for now. Feel free to give them a try and see if they work on your hardware.

Also, Unsloth released MTP models for Qwen 3.6, which makes my previous grafted models obsolete. You can find them here if you missed them:

* https://huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF
* https://huggingface.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF

I believe they quantize some of the MTP layers, whereas I kept mine at Q8 quantization for improved prediction. It is possible that keeping the MTP layers at higher precision makes their draft predictions more accurate, giving you more speed at the cost of more VRAM usage. I will keep my versions up for now, until I finish some benchmarks and am sure they are fully obsolete.

*Quick edit:* They do quantize the MTP layers more aggressively. Here is a comparison:

| Tensor | havenoammo (UD XL + Q8_0 MTP) | Unsloth (UD XL) |
|---|---|---|
| `blk.64.attn_k.weight` | **Q8_0** | Q3_K |
| `blk.64.attn_k_norm.weight` | F32 | F32 |
| `blk.64.attn_norm.weight` | F32 | F32 |
| `blk.64.attn_output.weight` | **Q8_0** | Q4_K |
| `blk.64.attn_q.weight` | **Q8_0** | Q3_K |
| `blk.64.attn_q_norm.weight` | F32 | F32 |
| `blk.64.attn_v.weight` | **Q8_0** | Q5_K |
| `blk.64.ffn_down.weight` | **Q8_0** | Q4_K |
| `blk.64.ffn_gate.weight` | **Q8_0** | Q3_K |
| `blk.64.ffn_up.weight` | **Q8_0** | Q3_K |
| `blk.64.nextn.eh_proj.weight` | Q8_0 | Q8_0 |
| `blk.64.nextn.enorm.weight` | F32 | F32 |
| `blk.64.nextn.hnorm.weight` | F32 | F32 |
| `blk.64.nextn.shared_head_norm.weight` | F32 | F32 |
| `blk.64.post_attention_norm.weight` | F32 | F32 |
| MTP layers size | 430.41 MB | 222.33 MB |

I will run some benchmarks to see whether this quantization causes any precision or speed loss for multi-token prediction. Until then, if you have the VRAM, feel free to test out my releases:

* https://huggingface.co/havenoammo/Qwen3.6-35B-A3B-MTP-GGUF
* https://huggingface.co/havenoammo/Qwen3.6-27B-MTP-UD-GGUF

Finally, here is how I use it:

```
docker run --gpus all --rm \
  -p 8080:8080 \
  -v ./models:/models \
  havenoammo/llama:cuda13-server \
  -m /models/Qwen3.6-27B-MTP-UD-Q8_K_XL.gguf \
  --port 8080 \
  --host 0.0.0.0 \
  -n -1 \
  --parallel 1 \
  --ctx-size 262144 \
  --fit-target 844 \
  --mmap \
  -ngl -1 \
  --flash-attn on \
  --metrics \
  --temp 1.0 \
  --min-p 0.0 \
  --top-p 0.95 \
  --top-k 20 \
  --jinja \
  --chat-template-kwargs '{"preserve_thinking":true}' \
  --ubatch-size 512 \
  --batch-size 2048 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --spec-type mtp \
  --spec-draft-n-max 3
```

Adjust as you see fit. What matters most for MTP is `--spec-type mtp` and `--spec-draft-n-max 3`.
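If you would rather manage this with Compose than a long `docker run` invocation, here is a minimal `docker-compose.yml` sketch. It is my own illustration, not something shipped with the images: it assumes the image's entrypoint launches the server itself (as the `docker run` example above implies), so only server arguments go in `command`; the service name, the trimmed flag set, and the model filename simply mirror the command above, and the GPU reservation requires the NVIDIA Container Toolkit on the host.

```
services:
  llama:
    image: havenoammo/llama:cuda13-server
    ports:
      - "8080:8080"
    volumes:
      - ./models:/models
    # Server arguments only; the image entrypoint is assumed to start the server.
    command: >
      -m /models/Qwen3.6-27B-MTP-UD-Q8_K_XL.gguf
      --host 0.0.0.0 --port 8080
      --ctx-size 262144 --flash-attn on --jinja
      --cache-type-k q8_0 --cache-type-v q8_0
      --spec-type mtp --spec-draft-n-max 3
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```

Once the container is up (`docker compose up -d`), a quick smoke test against the OpenAI-compatible endpoint that the llama.cpp server exposes:

```
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Say hello"}]}'
```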