Jackrong/Qwopus3.6-27B-v2-MTP-GGUF


Summary

Jackrong/Qwopus3.6-27B-v2-MTP-GGUF is a quantized GGUF version of a 27B parameter language model, hosted on Hugging Face with instructions for use with various libraries and tools.

Task: text-generation

Tags: transformers, gguf, text-generation-inference, unsloth, qwen3_6, reasoning, chain-of-thought, mtp, multi-token-prediction, speculative-decoding, lora, sft, agent, coder, devops, math, science, image, text-generation, en, zh, ko, ru, ja, es, dataset:Jackrong/Claude-opus-4.7-TraceInversion-5000x, dataset:Jackrong/Claude-opus-4.6-TraceInversion-9000x, license:apache-2.0, endpoints_compatible, region:us, conversational

Cached at: 05/24/26, 01:52 PM


Source: https://huggingface.co/Jackrong/Qwopus3.6-27B-v2-MTP-GGUF

Instructions to use Jackrong/Qwopus3.6-27B-v2-MTP-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

  • Libraries
  • Transformers
    How to use Jackrong/Qwopus3.6-27B-v2-MTP-GGUF with Transformers:

        # Use a pipeline as a high-level helper
        from transformers import pipeline

        pipe = pipeline("text-generation", model="Jackrong/Qwopus3.6-27B-v2-MTP-GGUF")
        messages = [
            {"role": "user", "content": "Who are you?"},
        ]
        pipe(messages)

        # Load model directly
        from transformers import AutoModel

        model = AutoModel.from_pretrained("Jackrong/Qwopus3.6-27B-v2-MTP-GGUF", dtype="auto")
  • llama-cpp-python
    How to use Jackrong/Qwopus3.6-27B-v2-MTP-GGUF with llama-cpp-python:

        # !pip install llama-cpp-python
        from llama_cpp import Llama

        llm = Llama.from_pretrained(
            repo_id="Jackrong/Qwopus3.6-27B-v2-MTP-GGUF",
            filename="Qwopus3.6-27B-v2-MTP-BF16.gguf",
        )
        llm.create_chat_completion(
            messages=[
                {
                    "role": "user",
                    "content": "What is the capital of France?"
                }
            ]
        )
  • Notebooks
  • Google Colab
  • Kaggle
  • Local Apps (https://huggingface.co/settings/local-apps#local-apps)
  • llama.cpp
    How to use Jackrong/Qwopus3.6-27B-v2-MTP-GGUF with llama.cpp:

    Install from brew

        brew install llama.cpp
        # Start a local OpenAI-compatible server with a web UI:
        llama-server -hf Jackrong/Qwopus3.6-27B-v2-MTP-GGUF:Q4_K_M
        # Run inference directly in the terminal:
        llama-cli -hf Jackrong/Qwopus3.6-27B-v2-MTP-GGUF:Q4_K_M

    Install from WinGet (Windows)

        winget install llama.cpp
        # Start a local OpenAI-compatible server with a web UI:
        llama-server -hf Jackrong/Qwopus3.6-27B-v2-MTP-GGUF:Q4_K_M
        # Run inference directly in the terminal:
        llama-cli -hf Jackrong/Qwopus3.6-27B-v2-MTP-GGUF:Q4_K_M

    Use pre-built binary

        # Download pre-built binary from:
        # https://github.com/ggerganov/llama.cpp/releases
        # Start a local OpenAI-compatible server with a web UI:
        ./llama-server -hf Jackrong/Qwopus3.6-27B-v2-MTP-GGUF:Q4_K_M
        # Run inference directly in the terminal:
        ./llama-cli -hf Jackrong/Qwopus3.6-27B-v2-MTP-GGUF:Q4_K_M

    Build from source code

        git clone https://github.com/ggerganov/llama.cpp.git
        cd llama.cpp
        cmake -B build
        cmake --build build -j --target llama-server llama-cli
        # Start a local OpenAI-compatible server with a web UI:
        ./build/bin/llama-server -hf Jackrong/Qwopus3.6-27B-v2-MTP-GGUF:Q4_K_M
        # Run inference directly in the terminal:
        ./build/bin/llama-cli -hf Jackrong/Qwopus3.6-27B-v2-MTP-GGUF:Q4_K_M

    Use Docker

        docker model run hf.co/Jackrong/Qwopus3.6-27B-v2-MTP-GGUF:Q4_K_M
  • LM Studio
  • Jan
  • vLLM
    How to use Jackrong/Qwopus3.6-27B-v2-MTP-GGUF with vLLM:

    Install from pip and serve model

        # Install vLLM from pip:
        pip install vllm
        # Start the vLLM server:
        vllm serve "Jackrong/Qwopus3.6-27B-v2-MTP-GGUF"
        # Call the server using curl (OpenAI-compatible API):
        curl -X POST "http://localhost:8000/v1/chat/completions" \
          -H "Content-Type: application/json" \
          --data '{
            "model": "Jackrong/Qwopus3.6-27B-v2-MTP-GGUF",
            "messages": [
              {
                "role": "user",
                "content": "What is the capital of France?"
              }
            ]
          }'

    Use Docker

        docker model run hf.co/Jackrong/Qwopus3.6-27B-v2-MTP-GGUF:Q4_K_M
  • SGLang
    How to use Jackrong/Qwopus3.6-27B-v2-MTP-GGUF with SGLang:

    Install from pip and serve model

        # Install SGLang from pip:
        pip install sglang
        # Start the SGLang server:
        python3 -m sglang.launch_server \
          --model-path "Jackrong/Qwopus3.6-27B-v2-MTP-GGUF" \
          --host 0.0.0.0 \
          --port 30000
        # Call the server using curl (OpenAI-compatible API):
        curl -X POST "http://localhost:30000/v1/chat/completions" \
          -H "Content-Type: application/json" \
          --data '{
            "model": "Jackrong/Qwopus3.6-27B-v2-MTP-GGUF",
            "messages": [
              {
                "role": "user",
                "content": "What is the capital of France?"
              }
            ]
          }'

    Use Docker images

        docker run --gpus all \
          --shm-size 32g \
          -p 30000:30000 \
          -v ~/.cache/huggingface:/root/.cache/huggingface \
          --env "HF_TOKEN=<secret>" \
          --ipc=host \
          lmsysorg/sglang:latest \
          python3 -m sglang.launch_server \
            --model-path "Jackrong/Qwopus3.6-27B-v2-MTP-GGUF" \
            --host 0.0.0.0 \
            --port 30000
        # Call the server using curl (OpenAI-compatible API):
        curl -X POST "http://localhost:30000/v1/chat/completions" \
          -H "Content-Type: application/json" \
          --data '{
            "model": "Jackrong/Qwopus3.6-27B-v2-MTP-GGUF",
            "messages": [
              {
                "role": "user",
                "content": "What is the capital of France?"
              }
            ]
          }'
  • Ollama
    How to use Jackrong/Qwopus3.6-27B-v2-MTP-GGUF with Ollama:

        ollama run hf.co/Jackrong/Qwopus3.6-27B-v2-MTP-GGUF:Q4_K_M
  • Unsloth Studio
    How to use Jackrong/Qwopus3.6-27B-v2-MTP-GGUF with Unsloth Studio:

    Install Unsloth Studio (macOS, Linux, WSL)

        curl -fsSL https://unsloth.ai/install.sh | sh
        # Run unsloth studio
        unsloth studio -H 0.0.0.0 -p 8888
        # Then open http://localhost:8888 in your browser
        # Search for Jackrong/Qwopus3.6-27B-v2-MTP-GGUF to start chatting

    Install Unsloth Studio (Windows)

        irm https://unsloth.ai/install.ps1 | iex
        # Run unsloth studio
        unsloth studio -H 0.0.0.0 -p 8888
        # Then open http://localhost:8888 in your browser
        # Search for Jackrong/Qwopus3.6-27B-v2-MTP-GGUF to start chatting

    Using HuggingFace Spaces for Unsloth

        # No setup required
        # Open https://huggingface.co/spaces/unsloth/studio in your browser
        # Search for Jackrong/Qwopus3.6-27B-v2-MTP-GGUF to start chatting
  • Pi
    How to use Jackrong/Qwopus3.6-27B-v2-MTP-GGUF with Pi:

    Start the llama.cpp server

        # Install llama.cpp:
        brew install llama.cpp
        # Start a local OpenAI-compatible server:
        llama-server -hf Jackrong/Qwopus3.6-27B-v2-MTP-GGUF:Q4_K_M

    Configure the model in Pi

        # Install Pi:
        npm install -g @mariozechner/pi-coding-agent
        # Add to ~/.pi/agent/models.json:
        {
          "providers": {
            "llama-cpp": {
              "baseUrl": "http://localhost:8080/v1",
              "api": "openai-completions",
              "apiKey": "none",
              "models": [
                {
                  "id": "Jackrong/Qwopus3.6-27B-v2-MTP-GGUF:Q4_K_M"
                }
              ]
            }
          }
        }

    Run Pi

        # Start Pi in your project directory:
        pi
  • Hermes Agent
    How to use Jackrong/Qwopus3.6-27B-v2-MTP-GGUF with Hermes Agent:

    Start the llama.cpp server

        # Install llama.cpp:
        brew install llama.cpp
        # Start a local OpenAI-compatible server:
        llama-server -hf Jackrong/Qwopus3.6-27B-v2-MTP-GGUF:Q4_K_M

    Configure Hermes

        # Install Hermes:
        curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
        hermes setup
        # Point Hermes at the local server:
        hermes config set model.provider custom
        hermes config set model.base_url http://127.0.0.1:8080/v1
        hermes config set model.default Jackrong/Qwopus3.6-27B-v2-MTP-GGUF:Q4_K_M

    Run Hermes

        hermes
  • Docker Model Runner
    How to use Jackrong/Qwopus3.6-27B-v2-MTP-GGUF with Docker Model Runner:

        docker model run hf.co/Jackrong/Qwopus3.6-27B-v2-MTP-GGUF:Q4_K_M
  • Lemonade
    How to use Jackrong/Qwopus3.6-27B-v2-MTP-GGUF with Lemonade:

    Pull the model

        # Download Lemonade from https://lemonade-server.ai/
        lemonade pull Jackrong/Qwopus3.6-27B-v2-MTP-GGUF:Q4_K_M

    Run and chat with the model

        lemonade run user.Qwopus3.6-27B-v2-MTP-GGUF-Q4_K_M

    List all available models

        lemonade list

🪐 Qwopus3.6-27B-v2-MTP

MTP Release

Multi-Token Prediction reasoning model fine-tuned from Qwen3.6-27B

🧬 Trace Inversion & Negentropy · 🧠 27B Parameters · ⚡ Speculative Decoding · 🛠️ Coding / DevOps / Math

💡 What is Qwopus3.6-27B-v2-MTP?

🪐 Qwopus3.6-27B-v2-MTP is a speed-oriented reasoning release built on top of Qwen3.6-27B. It keeps the Qwopus line’s focus on reconstructed reasoning traces, coding discipline, DevOps procedures, and mathematical derivations, while adding Multi-Token Prediction for faster generation. The goal is simple: preserve the depth and structure of a 27B reasoning model while making real interactive use noticeably faster.

⚡ MTP Decoding: Auxiliary future-token prediction improves throughput on long reasoning, code, math, and strict-format prompts.

🧩 Structured Reasoning: Inherits the Qwopus training recipe built around reconstructed step-by-step reasoning trajectories.

🧪 GB10 Tested: Validated on a 30-question local benchmark across Logic, Coding, DevOps, Math, and Edge tasks.

🚀 Practical Speed: Designed for workflows where strong answers matter but waiting several extra minutes per task is not acceptable.

💡 1. Base Model, Training Library & Cooperation

🧠 1.1 Base Model Specifications (Qwen3.6-27B)

Qwen3.6-27B provides the dense 27B foundation for this release. Qwopus3.6-27B-v2-MTP focuses on preserving the base model’s broad reasoning capability while tuning the output style toward stepwise analysis, tool-aware execution, and practical engineering answers.

| Attribute | Specifications & Details |
|---|---|
| 🧠 Architecture | Dense Transformer / 27 Billion Parameters |
| 🎯 Focus Domains | Agentic Coding, DevOps, structured logic, mathematics, and strict-format output |
| ⚡ MTP Objective | Improve generation throughput through multi-token speculative prediction while retaining final-answer quality. |
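To make the "multi-token speculative prediction" objective concrete, here is a toy sketch of a generic greedy draft-and-verify loop. It is not this model's MTP implementation or the llama.cpp code path: `target_next`, `draft_next`, and the arithmetic token rule are hypothetical stand-ins chosen only so the example runs. It illustrates why accepted drafts leave the final output unchanged while reducing the number of full-model verification passes.

```python
from typing import List, Tuple

def target_next(ctx: List[int]) -> int:
    """Toy stand-in for the full model: a deterministic next-token rule."""
    return (ctx[-1] * 3 + 1) % 7

def draft_next(ctx: List[int]) -> int:
    """Toy stand-in for a cheap draft head: right except at every 4th position.

    It reuses target_next only to keep the example short; a real draft head
    would be a separate, cheaper predictor.
    """
    guess = target_next(ctx)
    return guess if len(ctx) % 4 else (guess + 1) % 7

def greedy_decode(prompt: List[int], n_new: int) -> List[int]:
    """Baseline: one full-model step per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(target_next(out))
    return out

def speculative_decode(prompt: List[int], n_new: int, k: int = 3) -> Tuple[List[int], int]:
    """Draft k tokens, then verify them against the target in one counted pass."""
    out = list(prompt)
    goal = len(prompt) + n_new
    passes = 0
    while len(out) < goal:
        # 1) Draft k tokens cheaply.
        ctx, draft = list(out), []
        for _ in range(k):
            token = draft_next(ctx)
            draft.append(token)
            ctx.append(token)
        # 2) Verify: conceptually a single batched full-model pass.
        passes += 1
        ctx = list(out)
        for token in draft:
            expected = target_next(ctx)
            ctx.append(expected)      # always keep the target's own token
            if expected != token:     # first mismatch ends the accepted run
                break
        out = ctx[:goal]
    return out, passes

baseline = greedy_decode([1], 20)
spec, passes = speculative_decode([1], 20, k=3)
print(spec == baseline, passes)
```

With these toy functions both decoders return the same sequence, and the speculative loop needs fewer verification passes than the baseline needs decoding steps. Real throughput gains depend on how often the draft is accepted and on the cost of drafting.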

🧪 1.2 Hardware Cooperation & Joint Collaboration

This project is built in close collaboration with hardware engineer Kyle Hessling, whose infrastructure and training support helped make stable 27B-scale experimentation possible.

👉 You can follow him for hardware and model training updates on X / Twitter: @KyleHessling1

🦥 1.3 Fine-tuning Framework (Unsloth)

The model training workflow is accelerated and memory-optimized with Unsloth. Special thanks to the Unsloth team for making efficient large-model fine-tuning more accessible.

⚙️ 1.4 Custom MTP Heads Processing & Automation Tooling

This release features a custom splitting and merging methodology designed specifically for Qwen series Multi-Token Prediction (MTP) heads. The automation skill and complete processing pipeline scripts are open-sourced in qwen-mtp-gguf.

🌟 If you find this toolkit helpful, please support the project by leaving a star on GitHub!

Community Release Notice: Qwopus3.6-27B-v2-MTP is an experimental community release intended for research, evaluation, and workflow exploration.


🚀 2. MTP Benchmark: Qwen3.6-27B vs Qwopus3.6-27B-v2-MTP

Performance Snapshot

Across a 30-question benchmark covering Logic, Coding, DevOps, Math, and Edge-format tasks, Qwopus3.6-27B-v2-MTP delivers a clear speed advantage over Qwen3.6-27B while producing a more compact overall answer stream. The benchmark is not just a raw throughput test: it includes long coding prompts, operational runbooks, math derivations, and strict constrained-output cases.

  • Overall Throughput: 10.46 T/s (1.66x vs Qwen3.6-27B)
  • Latency Saved: 2.34 h (56.5% total time reduction)
  • Token Efficiency: -27.7% (fewer completion tokens overall)
  • Coverage: 30 / 30 (all benchmark prompts completed)

  • Speed: Qwopus3.6-27B-v2-MTP reaches 10.46 overall tokens/sec, compared with 6.29 tokens/sec for Qwen3.6-27B.
  • Latency: total evaluation time drops from 14,901.69s to 6,487.81s, saving 8,413.88s across the full run.
  • Output shape: MTP produces 67,862 completion tokens versus 93,802 from Qwen3.6-27B, giving a more compact overall response profile.
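The headline figures above follow directly from the reported totals. The short sketch below recomputes them; the variable names are illustrative, and only the four input totals come from the benchmark summary.

```python
# Totals reported in the benchmark summary.
QWEN_TIME_S = 14_901.69
MTP_TIME_S = 6_487.81
QWEN_TOKENS = 93_802
MTP_TOKENS = 67_862

# Overall throughput = completion tokens / total wall-clock time.
qwen_tps = QWEN_TOKENS / QWEN_TIME_S
mtp_tps = MTP_TOKENS / MTP_TIME_S
throughput_ratio = mtp_tps / qwen_tps

# Latency saved, in seconds, hours, and as a share of the baseline run.
saved_s = QWEN_TIME_S - MTP_TIME_S
saved_h = saved_s / 3600
saved_pct = 100 * saved_s / QWEN_TIME_S

# Relative change in completion tokens (negative means fewer tokens).
token_delta_pct = 100 * (MTP_TOKENS - QWEN_TOKENS) / QWEN_TOKENS

# Wall-clock ratio: larger than the throughput ratio because the MTP run
# is both faster per token and emits fewer tokens.
time_ratio = QWEN_TIME_S / MTP_TIME_S

print(f"{qwen_tps:.2f} T/s -> {mtp_tps:.2f} T/s ({throughput_ratio:.2f}x)")
print(f"saved {saved_s:.2f}s = {saved_h:.2f} h ({saved_pct:.1f}%)")
print(f"token delta {token_delta_pct:.1f}%, wall-clock ratio {time_ratio:.2f}x")
```

Note that the 1.66x figure is a tokens-per-second ratio. The end-to-end time ratio is higher because the shorter outputs compound with the faster decoding.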

Benchmark source: /workspace/renji-training/Jackrong/qwopus3.6-27B-v2-MTP/benchmark_27b_pair_report.md on the GB10 server. Local workspace date: 2026-05-22.


⚙️ 3. Test Environment & Configuration

  • Compute platform: GB10 dedicated server platform.
  • Evaluation format: same local GGUF server stack for both models.
  • llama-server total context: 49152.
  • Temperature / Top-p: 1.0 / 0.95.
  • Max generated tokens: no explicit cap; generation is bounded by the request budget.
  • Request format: /v1/chat/completions with user content as text payload.
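A request matching the settings listed above might look like the following sketch. It is not the benchmark harness itself: the base URL (`http://localhost:8080/v1`, the address used in the llama-server examples earlier on this page), the model id, the helper names, and the timeout are all assumptions.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080/v1"  # assumption: local llama-server address
MODEL_ID = "Jackrong/Qwopus3.6-27B-v2-MTP-GGUF:Q4_K_M"  # assumption: quant tag

def build_chat_request(prompt: str) -> urllib.request.Request:
    """Build a /v1/chat/completions request with the listed sampling settings."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],  # plain-text user content
        "temperature": 1.0,
        "top_p": 0.95,
        # "max_tokens" is deliberately omitted: no explicit cap was set.
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def ask(prompt: str, timeout: float = 3600.0) -> str:
    """Send the request and return the assistant text (requires a running server)."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

if __name__ == "__main__":
    request = build_chat_request("What is the capital of France?")
    print(request.full_url)
```

Calling `ask(...)` only works once a server such as `llama-server` is listening at `BASE_URL`; building the request does not need one.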

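Benchmark Summary: Qwen3.6-27B vs Qwopus3.6-27B-v2-MTP

| Model | Completed | Avg Speed | Overall T/s | Completion Tokens | Total Time |
|---|---|---|---|---|---|
| Qwen3.6-27B | 30 | 6.32 | 6.29 | 93,802 | 14,901.69s |
| Qwopus3.6-27B-v2-MTP | 30 | 10.66 | 10.46 | 67,862 | 6,487.81s |

Domain-Level Performance

| Domain | Questions | Qwen3.6-27B T/s | MTP T/s | Latency Gain | Qwen3.6-27B Time | MTP Time | Token Delta |
|---|---|---|---|---|---|---|---|
| Logic | 5 | 6.33 | 10.77 | 2.31x | 38.5 min | 16.7 min | -26.3% |
| Coding | 7 | 6.26 | 10.27 | 2.25x | 1.52 h | 40.6 min | -27.3% |
| DevOps | 6 | 6.29 | 10.39 | 2.31x | 47.4 min | 20.5 min | -28.5% |
| Math | 8 | 6.29 | 11.00 | 2.35x | 1.01 h | 25.8 min | -25.6% |
| Edge | 4 | 6.48 | 8.28 | 2.27x | 10.3 min | 4.5 min | -43.6% |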
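The domain-level times mix hours and minutes, so comparing them takes a unit conversion first. The sketch below assumes the Latency Gain column is the ratio of Qwen3.6-27B time to MTP time; `to_seconds` and `latency_gain` are illustrative helpers, not part of the benchmark tooling. Because the displayed times are rounded, the recomputed ratios can differ from the reported ones in the second decimal place.

```python
def to_seconds(text: str) -> float:
    """Convert a duration such as '38.5 min', '1.52 h', or '49.6 s' to seconds."""
    value, unit = text.split()
    factor = {"s": 1.0, "min": 60.0, "h": 3600.0}[unit]
    return float(value) * factor

def latency_gain(qwen_time: str, mtp_time: str) -> float:
    """Assumed definition: baseline wall-clock time divided by MTP wall-clock time."""
    return to_seconds(qwen_time) / to_seconds(mtp_time)

# (Qwen3.6-27B time, MTP time, reported latency gain) from the domain-level rows.
DOMAINS = {
    "Logic": ("38.5 min", "16.7 min", 2.31),
    "Coding": ("1.52 h", "40.6 min", 2.25),
    "DevOps": ("47.4 min", "20.5 min", 2.31),
    "Math": ("1.01 h", "25.8 min", 2.35),
    "Edge": ("10.3 min", "4.5 min", 2.27),
}

for name, (qwen_time, mtp_time, reported) in DOMAINS.items():
    gain = latency_gain(qwen_time, mtp_time)
    print(f"{name:<7} recomputed {gain:.2f}x, reported {reported:.2f}x")
```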

📊 4. Full 30-Question Comparison

The table below keeps the benchmark concrete: every row compares the base Qwen3.6-27B run against the Qwopus MTP run on the same prompt. The strongest improvements appear in strict output, probability, DevOps configuration, and medium-length coding tasks, while a few prompts intentionally produce more detailed MTP answers.

30-Question Detailed Comparison

| Q | Domain | Task | Qwen T/s | Qwen Time | Qwen Tokens | MTP T/s | MTP Time | MTP Tokens | Result Pattern |
|---|---|---|---|---|---|---|---|---|---|
| Q1 | Logic | Wrong-label coin boxes | 6.36 | 9.4 min | 3,569 | 11.40 | 2.3 min | 1,530 | 4.16x faster; much more concise |
| Q2 | Logic | Engineer deployment ordering | 6.39 | 6.1 min | 2,349 | 10.98 | 3.1 min | 2,034 | 1.98x faster; more concise |
| Q3 | Logic | Self-referential truth card | 6.37 | 7.8 min | 2,990 | 10.83 | 4.5 min | 2,942 | 1.72x faster; similar length |
| Q4 | Logic | Three switches and bulbs | 6.32 | 3.6 min | 1,342 | 10.44 | 1.6 min | 999 | 2.21x faster; more concise |
| Q5 | Logic | HH vs TH stopping probability | 6.30 | 11.6 min | 4,367 | 10.62 | 5.2 min | 3,266 | 2.25x faster; more concise |
| Q6 | Coding | Streaming top-k frequency | 6.28 | 13.8 min | 5,210 | 9.95 | 13.3 min | 7,917 | 1.04x faster; more expansive |
| Q7 | Coding | Thread-safe TTL cache | 6.28 | 18.6 min | 7,009 | 10.64 | 5.3 min | 3,367 | 3.52x faster; much more concise |
| Q8 | Coding | Interval merge implementation | 6.25 | 11.2 min | 4,203 | 10.83 | 3.3 min | 2,157 | 3.36x faster; much more concise |
| Q9 | Coding | Streaming CSV to JSONL | 6.26 | 16.5 min | 6,200 | 10.62 | 5.9 min | 3,741 | 2.81x faster; more concise |
| Q10 | Coding | C++17 LRU cache | 6.27 | 13.1 min | 4,920 | 10.15 | 6.0 min | 3,644 | 2.18x faster; more concise |
| Q11 | Coding | Highest-paid employee SQL | 6.29 | 6.1 min | 2,283 | 10.37 | 2.4 min | 1,475 | 2.54x faster; more concise |
| Q12 | Coding | Atomic Bash backup | 6.28 | 12.1 min | 4,545 | 10.33 | 4.4 min | 2,695 | 2.76x faster; much more concise |
| Q13 | DevOps | Nginx reverse proxy | 6.29 | 10.4 min | 3,924 | 10.88 | 2.8 min | 1,821 | 3.70x faster; much more concise |
| Q14 | DevOps | Linux service OOM diagnosis | 6.29 | 9.9 min | 3,727 | 9.96 | 4.9 min | 2,888 | 2.04x faster; more concise |
| Q15 | DevOps | systemd worker unit | 6.29 | 8.0 min | 3,023 | 10.39 | 3.3 min | 2,037 | 2.43x faster; more concise |
| Q16 | DevOps | Kubernetes rollback runbook | 6.32 | 6.3 min | 2,387 | 10.36 | 2.9 min | 1,820 | 2.14x faster; more concise |
| Q17 | DevOps | Docker CMD vs ENTRYPOINT | 6.33 | 5.4 min | 2,028 | 10.78 | 2.9 min | 1,892 | 1.82x faster; more concise |
| Q18 | DevOps | Prometheus pull monitoring | 6.32 | 7.4 min | 2,818 | 10.67 | 3.7 min | 2,342 | 2.02x faster; more concise |
| Q19 | Math | Derivative and critical point | 6.32 | 8.7 min | 3,274 | 12.06 | 3.7 min | 2,631 | 2.37x faster; more concise |
| Q20 | Math | Linear system solve | 6.32 | 10.7 min | 4,065 | 11.91 | 4.2 min | 2,976 | 2.57x faster; more concise |
| Q21 | Math | Different-color probability | 6.28 | 3.9 min | 1,472 | 10.18 | 49.6 s | 490 | 4.74x faster; much more concise |
| Q22 | Math | 2x2 eigen decomposition | 6.31 | 12.3 min | 4,662 | 11.28 | 4.5 min | 3,058 | 2.72x faster; more concise |
| Q23 | Math | Induction proof | 6.32 | 5.8 min | 2,211 | 11.53 | 1.7 min | 1,193 | 3.34x faster; much more concise |
| Q24 | Math | Bayes disease test | 6.34 | 5.0 min | 1,878 | 11.38 | 3.2 min | 2,156 | 1.56x faster; more expansive |
| Q25 | Math | Integration by parts | 6.29 | 5.5 min | 2,064 | 11.80 | 3.5 min | 2,493 | 1.55x faster; more expansive |
| Q26 | Math | Central Limit Theorem | 6.24 | 8.8 min | 3,289 | 8.26 | 4.1 min | 2,046 | 2.12x faster; more concise |
| Q27 | Edge | Strict JSON output | 6.32 | 3.6 min | 1,350 | 10.43 | 23.1 s | 225 | 9.28x faster; much more concise |
| Q28 | Edge | Exact token pattern | 6.37 | 52.4 s | 328 | 12.15 | 29.9 s | 345 | 1.75x faster; similar length |
| Q29 | Edge | Forbidden-word explanation | 6.71 | 5.1 min | 2,040 | 7.62 | 3.5 min | 1,573 | 1.47x faster; more concise |
| Q30 | Edge | Ignore noisy input | 6.35 | 44.5 s | 275 | 10.94 | 11.4 s | 109 | 3.89x faster; much more concise |


🧭 5. Domain Reading

Logic

Logic prompts show a strong latency reduction, especially on the box-label puzzle and the HH-vs-TH stopping problem. The MTP model tends to reach the same kind of structured decision path with fewer generated tokens, making it useful when reasoning traces need to stay readable and quick.

Coding

Coding is one of the most practical wins. Thread-safe caching, interval merging, CSV streaming, C++ LRU, SQL, and Bash backup tasks all become substantially faster. Q6 is intentionally more expansive, but the broader coding group remains much faster overall.

DevOps

DevOps prompts benefit from concise operational structure. Nginx, OOM diagnosis, systemd, Kubernetes rollback, Docker command semantics, and Prometheus monitoring all show faster completion while preserving stepwise command-oriented guidance.

Math & Edge Tasks

Math has the highest MTP throughput among the five domains. Edge tasks show the sharpest wall-clock wins, especially strict JSON and noisy-input filtering, where the model can quickly settle into the required output pattern.


🎯 6. Recommended Use Cases

  • Agentic coding and code review assistance.
  • DevOps runbooks, configuration generation, and incident diagnosis.
  • Multi-step math and probability derivations.
  • Structured reasoning with explicit intermediate logic.
  • Fast constrained output generation where latency matters.

Resources, Acknowledgements & Citation

🙏 Acknowledgements

Thanks to the Qwen team, Unsloth, open-source contributors, and Kyle Hessling for close collaboration on hardware and training infrastructure.

📖 Citation

@misc{qwopus36_27b_v2_mtp_2026,
  title        = {Qwopus3.6-27B-v2-MTP},
  author       = {Jack Rong},
  year         = {2026},
  note         = {Qwen3.6-27B based Multi-Token Prediction reasoning model},
  howpublished = {Hugging Face model card}
}


Datasets used to train Jackrong/Qwopus3.6-27B-v2-MTP-GGUF

  • Jackrong/Claude-opus-4.6-TraceInversion-9000x
  • Jackrong/Claude-opus-4.7-TraceInversion-5000x

