Mudler released APEX-MTP GGUF quantizations of the Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled model, bundling the multi-token prediction head for self-speculative decoding with llama.cpp.
Description of the module:

I host **30+ free APEX MoE quantizations** as independent research. My only local hardware is an **NVIDIA DGX Spark** (122 GB unified memory), which is enough for roughly 30-50B-class MoEs, but **bigger ones (200B+) require rented compute** on H100/H200/Blackwell, typically $20-100 per quant. If APEX quants are useful to you, your support directly funds those bigger runs.

# Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled: APEX-MTP GGUF

**APEX (Adaptive Precision for EXpert Models)** quantizations of [lordx64/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled](https://huggingface.co/lordx64/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled), with the **MTP (multi-token prediction) head bundled** for in-the-box self-speculative decoding.

# What's different from the plain APEX repo?

These GGUFs bundle the model's **MTP (multi-token prediction) head** alongside the trunk in a single file, courtesy of [llama.cpp PR #22673](https://github.com/ggml-org/llama.cpp/pull/22673). With a recent llama.cpp (>= commit 255582687) you can enable self-speculative decoding using just this one file, with no separate draft model needed:

`llama-server -m Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-APEX-MTP-I-Balanced.gguf --draft-mtp`

The non-MTP version is still available at [mudler/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-APEX-GGUF](https://huggingface.co/mudler/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-APEX-GGUF). It is slightly smaller but offers no self-speculative decoding.
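For scripted setups, the launch command above can be assembled programmatically. The following is a minimal sketch, not part of the model card: the helper name `build_server_command` is hypothetical, and only the `-m` and `--draft-mtp` flags are taken from the card itself. It builds the argument list and prints it rather than starting the server.

```python
import shlex
import shutil

# File name as given in the model card's example command.
MODEL_FILE = (
    "Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-"
    "APEX-MTP-I-Balanced.gguf"
)


def build_server_command(model_path, draft_mtp=True, binary="llama-server"):
    """Return the argv list for serving a single GGUF file.

    Hypothetical helper: it only mirrors the command shown in the model card.
    """
    argv = [binary, "-m", str(model_path)]
    if draft_mtp:
        # Enables self-speculative decoding with the bundled MTP head;
        # per the card this needs a sufficiently recent llama.cpp build.
        argv.append("--draft-mtp")
    return argv


if __name__ == "__main__":
    cmd = build_server_command(MODEL_FILE)
    # Print a shell-safe version of the command instead of running it.
    print(shlex.join(cmd))
    if shutil.which(cmd[0]) is None:
        print("note: llama-server was not found on PATH")
```

Passing `draft_mtp=False` yields the plain serving command, which can be handy for comparing throughput with and without the draft head on the same file.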
# File sizes

Each quant is about 2.5% larger than its non-MTP counterpart (one extra transformer block's worth of weights; there is no embedding duplication because MTP shares the trunk's `embed_tokens`).

# MTP draft head precision

The bundled MTP head (`blk.40.*`, including the `nextn.*` projection and norms) is quantized to **Q8_0** (near-lossless) on **every tier except I-Nano**. I-Nano keeps the trunk-tier precision on the MTP block (Q3_K routed experts, Q4_K attention) but pins `blk.40.nextn.eh_proj` to Q4_K; see the [explainer below](https://huggingface.co/mudler/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-APEX-MTP-GGUF#why-the-mtp-head-doesnt-use-imatrix). Quantizing the head to Q8_0 keeps draft accuracy high (important for spec-decode acceptance rate) at a modest cost of about 1 GB per file vs. trunk-tier precision.

# Why the MTP head doesn't use imatrix

`llama-imatrix` runs normal forward passes that only activate the trunk (`blk.0..blk.39`). The MTP head only fires during `--draft-mtp` spec decoding, so its tensors get no imatrix activation data. We work around this by quantizing the MTP head with static K-quant / Q8_0, which doesn't require imatrix. (A patch to `llama-imatrix` that records MTP activations during collection is in progress at [mudler/llama.cpp#mtp-imatrix](https://github.com/mudler/llama.cpp/tree/mtp-imatrix); once upstream, this will let us push the drafter to lower bit-widths cleanly.)

# What is APEX?

APEX is a MoE-aware mixed-precision quantization strategy.
Per-tensor-role gradient: routed experts compress hardest, shared experts are kept high (always active), and attention/Mamba tensors are kept uniform. A 5+5 symmetric edge gradient runs across the 40 trunk layers, with MTP layer 40 at edge precision. I-variants use diverse imatrix calibration (chat, code, reasoning, tool-calling, agentic traces, Wikipedia).

**Architecture**

* **Base**: Qwen 3.6 35B-A3B family (`Qwen3_5MoeForCausalLM`)
* **Layers**: 40 trunk + 1 MTP (bundled)
* **Experts**: 256 routed + 1 shared (8 active per token)
* **Hidden size**: 2048
* **Calibration**: v1.3 diverse dataset
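The layer layout described in the model card (40 trunk blocks, an MTP block at index 40, and a 5+5 edge gradient) can be illustrated by classifying tensors by block index. This is only a sketch under stated assumptions: it reads "5+5" as the first five and last five trunk layers, which the card does not spell out, and the function names are hypothetical. It is not APEX's actual quantization code, and apart from `blk.40.nextn.eh_proj` the tensor names in the usage note are illustrative.

```python
import re

TRUNK_LAYERS = 40  # blk.0 .. blk.39, per the architecture list
MTP_LAYER = 40     # blk.40.* holds the bundled MTP head
EDGE_WIDTH = 5     # ASSUMPTION: "5+5" = first five and last five trunk layers

_BLOCK_RE = re.compile(r"^blk\.(\d+)\.")


def block_index(tensor_name):
    """Return the block number encoded in a tensor name, or None."""
    match = _BLOCK_RE.match(tensor_name)
    return int(match.group(1)) if match else None


def layer_zone(tensor_name):
    """Classify a tensor as 'edge', 'middle', 'mtp', or 'non-block'."""
    idx = block_index(tensor_name)
    if idx is None:
        return "non-block"
    if idx > MTP_LAYER:
        raise ValueError(f"unexpected block index {idx}")
    if idx == MTP_LAYER:
        return "mtp"
    if idx < EDGE_WIDTH or idx >= TRUNK_LAYERS - EDGE_WIDTH:
        return "edge"
    return "middle"


def is_mtp_tensor(tensor_name):
    """True for tensors in the MTP block.

    Per the model card these receive no imatrix activations, because
    ordinary forward passes only exercise the trunk blocks.
    """
    return block_index(tensor_name) == MTP_LAYER
```

For example, `layer_zone("blk.40.nextn.eh_proj")` returns `"mtp"`, while `layer_zone("blk.5.attn_q.weight")` returns `"middle"` under the assumed edge width. A quantization script could use such a classification to pick a static type for MTP tensors and an imatrix-guided type elsewhere.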
This Hugging Face repository provides GGUF files for Qwen3.6-27B with Multi-Token Prediction (MTP) layers grafted onto Unsloth UD XL quantizations. It includes instructions for building llama.cpp with MTP support to enable speculative decoding.
A 35B-parameter Qwen3.6 model fine-tuned with Claude-Opus-style chain-of-thought distillation data and released in GGUF quantized formats for efficient local inference.
A GGUF quantized version of the Qwopus3.6-27B-Coder-MTP model is released on Hugging Face, optimized for local inference and compatible with Transformers, vLLM, SGLang, and Unsloth Studio.
This article announces the release of the Qwen3.6-35B-A3B model weights on Hugging Face, optimized by Unsloth with Multi-Token Prediction (MTP) for faster generation via llama.cpp. It highlights improvements in agentic coding capabilities, tool calling, and reasoning context preservation.
A fine-tuned uncensored version of the Qwen model (Qwen3.6-35B-A3B) with MTP support and APEX quantization, tested stable at 200k context and recommended for use in LM Studio.