bytkim/Qwen3.6-27B-MTP-pi-tune-GGUF
Summary
bytkim releases a 4-bit QLoRA SFT Multi-Token Prediction fine-tune of Qwen3.6-27B, packaged as GGUF for local agentic coding. The no-thinking tune is designed for low-latency direct output in agent loops.
View Cached Full Text
Cached at: 06/20/26, 02:20 PM
bytkim/Qwen3.6-27B-MTP-pi-tune-GGUF · Hugging Face
Source: https://huggingface.co/bytkim/Qwen3.6-27B-MTP-pi-tune-GGUF No-Thinking · PI Tune · MTP · GGUF
🪐 Qwen3.6-27B-MTP-pi-tune
A 4-bit QLoRA SFT Multi-Token Prediction tune of Qwen3.6-27B fine-tuned forno-thinking agentic codingthrough a PI-style harness. Packaged as llama.cpp-compatible GGUF for local agent loops.
🧠 27B Dense Foundation🚫 No-Thinking Tuned⚡ MTP Speculative Decoding🛠️ Coding · DevOps · Agents📦 llama.cpp GGUF🖼️ Multi-Modal Compatible🪟 256k Native · 1M Max Context
For the strongest Pi-style coding-agent behavior, use the reasoning-trained release:
bytkim/Qwen3\.6\-27B\-MTP\-pi\-reasoning\-GGUF. See thetechnical writeupfor the broader evaluation context. This no-thinking tune remains useful when you specifically want the lower-latency direct / instruct path.
⚡
MTP Decoding
Multi-Token Prediction drafts future tokens and accepts them when the main decode path agrees — cutting wall time on long reasoning, code generation, and tool-call setup.
🧩
No-Thinking by Design
Trained on Qwen3.6’s non-thinking inference path. The model responds directly with tool calls, edits, and structured output — no<thinking\>preamble eating wall time before the harness can act.
🧪
Local-First Throughput
Designed for llama.cpp-class runtimes on a single workstation, with MTP draft acceptance measured on real agent workloads.
🚀
Practical Speed
Built for workflows where waiting minutes per turn breaks the loop — the tune favors decisive output over scratch-pad expansion.
Context Window
128ktested
256knatively supported by the base · extensible up to1Mtokens via RoPE scaling.
MTP Draft Acceptance
~78%
Drafted future tokens accepted by the main decode path on agent workloads.3 speculative steps · 4 draft tokens.
https://huggingface.co/bytkim/Qwen3.6-27B-MTP-pi-tune-GGUF#%F0%9F%92%A1-1-model-overview💡 1. Model Overview
AttributeDetailsBase modelQwen/Qwen3\.6\-27BRelease formatGGUFRuntime targetllama.cpp-compatible local inferenceTuning focusHarness fluency, coding-agent tasks, terminal workflows, tool use, repository workFine-tuning style4-bit QLoRA SFT on private passed agent trajectoriesTechnical writeupQwen3.6 27B reasoning writeupReasoning data policyInternal reasoning traces were not exported into the SFT rowsRecommended quantQ4\_K\_Mas the default starting point
Qwen3.6-27B is a causal language model with a vision encoder.Image and video understanding are supportedby pairing the language-model GGUF with the compatible Qwen3.6
mmproj\-F16\.ggufsidecar (see §4Multi-modal inference). TheMTP draft heads are kept atQ8\_0precision inside every quantvia\-\-tensor\-type nextn=q8\_0at quantize time — speculative decoding works at any quant level, not justQ8\_0/bf16.
https://huggingface.co/bytkim/Qwen3.6-27B-MTP-pi-tune-GGUF#%F0%9F%A7%A9-2-why-this-tune🧩 2. Why This Tune
Qwen3.6 supports both athinkinginference mode (which emits a<thinking\>\.\.\.</thinking\>block before responding) and anon-thinkingmode (where the model answers and acts directly). This release is fine-tuned specifically for theno-thinking path— the mode where agent loops actually live. In a PI-style harness running tool-call loops, every thinking token is wall time the harness can’t dispatch the next action against, so the tune is shaped to make the no-thinking path quality where it counts: tool calls, repository edits, terminal commands, verifier feedback, and structured output.
This carries forward Qwen3.6’s existing agentic-coding posture — frontend workflows, repository-level reasoning, and tool calling — but pulls the quality into the inference mode that local agent runtimes can budget for.
- Terminal and shell task execution.
- Repository inspection, patching, and test iteration.
- Tool-call-shaped interactions and structured outputs.
- DevOps runbooks, environment setup, and debugging loops.
- Coding tasks where command use, file edits, and verifier feedback matter.
https://huggingface.co/bytkim/Qwen3.6-27B-MTP-pi-tune-GGUF#%F0%9F%93%A6-3-quantizations📦 3. Quantizations
Recommended starting point:Q4\_K\_M.
QuantFile sizeVRAM (approx)Suggested useQ2\_K~11 GB~13 GBSmallest memory footprint; quality tradeoffs are expected.Q3\_K\_S~12 GB~15 GBLow-memory 3-bit option.Q3\_K\_M~14 GB~16 GBBalanced 3-bit option.Q3\_K\_L~15 GB~17 GBHigher-quality 3-bit option.Q4\_K\_S~16 GB~18 GBSmaller 4-bit option.Q4\_K\_M~17 GB~19 GB**Default recommendation for most local use.**Comfortable on a 24 GB GPU.Q5\_K\_S~19 GB~21 GBHigher-quality 5-bit option.Q5\_K\_M~20 GB~22 GBStrong quality/memory tradeoff; near the upper edge of a 24 GB GPU.Q6\_K~22 GB~25 GBHigh-quality local inference if you have the memory.Q8\_0~29 GB~32 GBHighest-precision quantized option.bf16~55 GB~58 GBBF16 GGUF reference, if present.
VRAM figures are rough estimates for GPU-offload inference (\-ngl 99 \-fa) at a moderate context (~32k) with quantized KV cache; they scale up with longer contexts.
Every quant in this release ships with the MTP
nextnprediction heads stored atQ8\_0precision, regardless of the overall quant target. That means speculative decoding works at any quant level — pick the smallest one that fits your VRAM and you still get the MTP throughput profile described in §6.
Some files may still be uploading. Check the Files tab for the exact artifacts currently available.
https://huggingface.co/bytkim/Qwen3.6-27B-MTP-pi-tune-GGUF#%F0%9F%9A%80-4-quickstart🚀 4. Quickstart
Run with llama.cpp (standard launch — works on any build):
# 128k context shown; base model natively supports 256k and is extensible to ~1M via RoPE scaling.
# Sampling values match Qwen3.6's recommended non-thinking-mode defaults — this is the inference
# path the tune was trained for, so prefer these.
llama-server -hf bytkim/Qwen3.6-27B-MTP-pi-tune-GGUF:Q4_K_M \
--jinja -ngl 99 -fa -c 131072 \
--temp 0.7 --top-p 0.8 --top-k 20 --min-p 0 \
--presence-penalty 1.5
Run withupstream llama.cpp + MTP speculative decoding(ggml\-org/llama\.cpp, MTP support merged inPR #22673):
# The nextn prediction heads in this release activate via upstream's draft-mtp speculator.
# -np must be 1 with MTP (parallel slots are not yet supported alongside MTP).
llama-server -hf bytkim/Qwen3.6-27B-MTP-pi-tune-GGUF:Q4_K_M \
--spec-type draft-mtp \
--spec-draft-n-max 3 \
-np 1 \
--jinja -ngl 99 -fa -c 131072 \
--cache-type-k q8_0 --cache-type-v q8_0 \
--temp 0.7 --top-p 0.8 --top-k 20 --min-p 0 \
--presence-penalty 1.5
Run with Ollama:
ollama run hf.co/bytkim/Qwen3.6-27B-MTP-pi-tune-GGUF:Q4_K_M
Download a single GGUF file:
hf download bytkim/Qwen3.6-27B-MTP-pi-tune-GGUF \
Qwen3.6-27B-MTP-pi-tune-Q4_K_M.gguf \
--local-dir .
Download the whole repo:
hf download bytkim/Qwen3.6-27B-MTP-pi-tune-GGUF --local-dir .
https://huggingface.co/bytkim/Qwen3.6-27B-MTP-pi-tune-GGUF#multi-modal-inference-image–videoMulti-modal inference (image + video)
This release is compatible with the Qwen3.6mmproj\-F16\.ggufsidecar for vision-language inference.A single mmproj file pairs with every quantin this release; the projector is architecturally tied to the base model’s vision tower, not to the LM quant level, so download it once and reuse it.
The compatible mmproj can be downloaded fromunsloth/Qwen3\.6\-27B\-MTP\-GGUF. The fine-tune in this release islanguage-only— the vision encoder weights have not been touched. Image / video understanding is therefore inherited unchanged from the upstream Qwen3.6-27B base model; this release does not claim to improve it, only to preserve it.
# Pull the LM weights from this repo
hf download bytkim/Qwen3.6-27B-MTP-pi-tune-GGUF \
Qwen3.6-27B-MTP-pi-tune-Q4_K_M.gguf \
--local-dir .
# Pull the compatible mmproj sidecar
hf download unsloth/Qwen3.6-27B-MTP-GGUF \
mmproj-F16.gguf \
--local-dir .
# Launch llama-server with vision attached
llama-server -m ./Qwen3.6-27B-MTP-pi-tune-Q4_K_M.gguf \
--mmproj ./mmproj-F16.gguf \
--jinja -ngl 99 -fa -c 131072 \
--temp 0.7 --top-p 0.8 --top-k 20 --min-p 0 \
--presence-penalty 1.5
For a quick text-and-image session without spinning up a server:
llama-mtmd-cli -m ./Qwen3.6-27B-MTP-pi-tune-Q4_K_M.gguf \
--mmproj ./mmproj-F16.gguf
https://huggingface.co/bytkim/Qwen3.6-27B-MTP-pi-tune-GGUF#use-as-an-openai-compatible-apiUse as an OpenAI-compatible API
llama\-serverexposes an OpenAI-compatible/v1/chat/completionsendpoint, so any client written against the OpenAI SDK can point at it directly — no client changes required:
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8080/v1",
api_key="not-needed",
)
resp = client.chat.completions.create(
model="Qwen3.6-27B-MTP-pi-tune",
messages=[
{"role": "system", "content": "You are a precise coding agent."},
{"role": "user", "content": "Write a Python function that merges overlapping intervals."},
],
)
print(resp.choices[0].message.content)
The same endpoint acceptstools=\[\.\.\.\]for function calling and supports streaming viastream=True.
https://huggingface.co/bytkim/Qwen3.6-27B-MTP-pi-tune-GGUF#%F0%9F%A7%AC-5-training–data-notes🧬 5. Training & Data Notes
Tuned onreal agent traces, not synthetic generations or distilled chat — every training row is the trail of an assistant that actually executed the task end-to-end through a PI-style harness, exported as Qwen-compatible ChatML rows with tool schemas and runtime prompts preserved.
High-level task coverage spanned:
- Terminal and shell-environment agent tasks.
- Tool / function-calling interactions.
- Multi-language code editing and repair tasks.
- Repository issue resolution and test-driven patching.
- Coding and API integration tasks.
- Shell, package, migration, ops, and verifier-driven tasks.
🧭
Training philosophy
Trained with4-bit QLoRA SFT. Training rows are exported with internal reasoning tracesremovedand assistant turns formatted in Qwen3.6’snon-thinking style. The quality this release ships is the no-thinking path’s quality — that’s where the tune was trained, and that’s where it should be run.
Specific dataset names and training-row counts are intentionally omitted from this initial card.
https://huggingface.co/bytkim/Qwen3.6-27B-MTP-pi-tune-GGUF#%E2%9A%A1-6-mtp-throughput⚡ 6. MTP Throughput
MTP stands forMulti-Token Prediction. The model drafts likely future tokens and the runtime accepts them when they agree with the main decode path. On local agent work this matters because long reasoning, code generation, tool-call setup, and shell-oriented turns otherwise spend most of their wall time waiting on generation.
The numbers below describe thecurrent local profileof the release. They are representative figures from internal runs against the PI harness — full benchmark publication is forthcoming and will replace these with task-success-rate tables in §7.
https://huggingface.co/bytkim/Qwen3.6-27B-MTP-pi-tune-GGUF#raw-decode-profileRaw decode profile
Prompt / Prefill
~615tok/s
Reading and processing prompt / context tokens.
Decode / Generation
~40tok/s
Raw generated-token speed reported by llama.cpp.
End-to-End Request
~71tok/s
Full llama request throughput across prompt and decode.
MTP Draft Acceptance
~78%
Share of drafted future tokens accepted by the main decode path.
https://huggingface.co/bytkim/Qwen3.6-27B-MTP-pi-tune-GGUF#agentic-throughputAgentic throughput
The agentic number is different from raw tokens/sec. It measures real task throughput across agent runs — including model generation, tool calls, shell commands, package installs, file I/O, and verifier-facing work.
Effective Output
~33tok/s
Output tokens divided by full agent execution time.
Effective Total
~1.6ktok/s
Input + output tokens counted by the harness, end-to-end.
Effective output throughput is computed as:
sum(output tokens) / sum(agent execution duration)
That makes it a more realistic agent-workflow number than plain decode speed — it includes time spent operating through the harness, not just time spent generating text.
https://huggingface.co/bytkim/Qwen3.6-27B-MTP-pi-tune-GGUF#%F0%9F%93%8A-7-coding-eval-benchmarks📊 7. Coding Eval Benchmarks
🛰️
Releasing soon
Task-success benchmarks are tracked separately from throughput and will be published in a follow-up update to this card. Throughput answershow fast the local MTP stack moves; coding evals answerhow often the agent actually solves the task— the two should not be inferred from one another.
The follow-up release will cover task-success rates across the high-level areas listed in §5: terminal/shell agent tasks, tool & function calling, multi-language code editing, repository issue resolution, and coding/API integration tasks.
Throughput figures above are from the local MTP-enabled run. Task success rates should be reported only from completed eval runs, not inferred from speed.
https://huggingface.co/bytkim/Qwen3.6-27B-MTP-pi-tune-GGUF#%F0%9F%8E%AF-8-recommended-use-cases🎯 8. Recommended Use Cases
- Local coding-agent experiments.
- Tool-heavy chat and function-calling experiments.
- DevOps troubleshooting and runbook drafting.
- Repository navigation, patch planning, and test iteration.
- Long-context engineering workflows where local inference is preferred.
https://huggingface.co/bytkim/Qwen3.6-27B-MTP-pi-tune-GGUF#%E2%9A%A0%EF%B8%8F-9-limitations⚠️ 9. Limitations
- This is a community release for research, evaluation, and workflow exploration.
- Low-bit quantizations may reduce instruction following and tool-call reliability.
- Coding-eval success rates are not finalized in this initial card.
- This card does not claim safety alignment beyond the behavior inherited from the base model and the fine-tuning data.
https://huggingface.co/bytkim/Qwen3.6-27B-MTP-pi-tune-GGUF#%F0%9F%93%9C-10-license📜 10. License
Released under theApache 2.0license, inherited from the upstream Qwen3.6-27B base model. You are free to use, modify, and redistribute the model and its derivatives subject to the terms of that license.
https://huggingface.co/bytkim/Qwen3.6-27B-MTP-pi-tune-GGUF#%F0%9F%99%8F-11-acknowledgements🙏 11. Acknowledgements
Thanks to theQwen teamfor the Qwen3.6 base model and its MTP design, to theggml-org / llama.cppmaintainers for native multi-token-prediction support in upstream, and to the broader open-source quantization tooling community whose work makes local-first inference of frontier models possible.
Built for local agent loops
Speed where it matters · context where it counts · MTP at every quant.
128k
tested · 256k native · 1M max
~78%
MTP draft acceptance
~1.6ktok/s
end-to-end agent throughput
Similar Articles
unsloth/Qwen3.6-27B-MTP-GGUF
Unsloth has released GGUF weights for the Qwen3.6-27B model, featuring Multi-Token Prediction (MTP) for faster generation and enhanced agentic coding capabilities.
havenoammo/Qwen3.6-27B-MTP-UD-GGUF
This Hugging Face repository provides GGUF files for Qwen3.6-27B with Multi-Token Prediction (MTP) layers grafted onto Unsloth UD XL quantizations. It includes instructions for building llama.cpp with MTP support to enable speculative decoding.
unsloth/Qwen3.6-35B-A3B-MTP-GGUF
This article announces the release of the Qwen3.6-35B-A3B model weights on Hugging Face, optimized by Unsloth with Multi-Token Prediction (MTP) for faster generation via llama.cpp. It highlights improvements in agentic coding capabilities, tool calling, and reasoning context preservation.
Qwen3.6-27B-GGUF is here!
Community GGUF release of Qwen’s 27B hybrid-architecture model with 262k context, multimodal inputs, tool calling and "Thinking Preservation" for agentic coding.
MTP on Unsloth
Unsloth releases GGUF-quantized versions of Qwen3.6 models with Multi Token Prediction (MTP) support.