Tmax-27b - a Qwen3.6-27b terminal agent for small GPUs trained with DPPO (RL)

Reddit r/LocalLLaMA 06/23/26, 07:05 PM Models

terminal-agent quantization dppo reinforcement-learning small-gpu gguf speculative-decoding

Summary

Ai2 released Tmax-27B, a terminal-agent LLM trained with DPPO (RL) on Qwen3.6-27B, and the author provides importance-matrix-calibrated GGUF quantizations that achieve competitive performance on agentic benchmarks even at very low bit-widths, with a grafted MTP draft head for speculative decoding.

Hey everyone, wanted to share some work on making the new Tmax-27B terminal agent actually runnable on consumer hardware. What is Tmax-27B? Ai2 just released Tmax, a family of terminal-agent LLMs trained with DPPO (RL) on top of Qwen3.6. The 27B model hits ~43% on Terminal Bench 2.0 and ~69% on TB Lite. These are agentic benchmarks where the model navigates a shell, edits files, runs tests, and completes real dev tasks in a container. The problem: 27B at FP16 is ~54 GB. Not fitting on your RTX 5070. What we did: A bunch of importance-matrix-calibrated GGUF quants from ~2-5 bits-per-weight, each with a grafted MTP draft head at Q8_0 for built-in speculative decoding. Pick the tier that fits your VRAM: Q2_K (plain) IQ2_XS IQ2_M Q2_K_S IQ3_M IQ4_XS Q5_K_M File Q2_K IQ2_XS IQ2_M Q2_K_S IQ3_M IQ4_XS Technique plain hybrid imatrix hybrid imatrix hybrid imatrix hybrid imatrix hybrid imatrix Size (GiB) 9.98 8.47 9.32 9.54 11.72 14.05 BPW 3.186 2.704 2.976 3.048 3.742 4.486 PPL (general) 7.6005 20.3585 21.0408 16.7292 20.4368 13.1867 KLD med (general) 0.1727 0.1262 0.0783 0.0826 0.0278 0.0059 top_p (general) 73.03% 73.89% 77.77% 77.96% 83.56% 91.45% Lower KLD / higher top_p = closer to FP16. Q2_K is a plain (non-imatrix) anchor; everything else uses the hybrid importance matrix. Why calibration matters for agents. Agentic tasks are brutal on quantization. The model has to produce valid tool-call XML, reason over multi-step contexts, and not degrade on long trajectories where token-level errors compound. Raw 2-bit quantization shreds this. An importance matrix tells the quantizer where precision matters most, per channel, based on real activation energy from agentic coding sessions. Critical layers keep more bits; everything else gets squeezed. Additionally, we increase our calibration context from 512 tokens to 4K while also minimizing the influence of the system prompt which can sometimes take the entire calibration budget without leaving room for any tool calls. The agentic results. Every quant was run as a coding agent (mini-swe-agent) over the same 10 held-out SWE-rebench instances, one clean Docker container each. pass_rate = fraction whose patch makes the gold FAIL_TO_PASS tests pass; patch_rate = fraction that produced a non-empty diff: Quant pass_rate patch_rate resolved mean tokens mean steps tool-err Q2_K 50% 100% 5/10 621,931 38.7 11% IQ2_XS 70% 100% 7/10 784,972 49.8 9% IQ2_M 60% 100% 6/10 596,658 40.9 10% Q2_K_S 70% 100% 7/10 529,560 37.1 12% IQ3_M 70% 100% 7/10 770,113 47.5 10% IQ4_XS 70% 100% 7/10 791,474 48.3 9% IQ2_XS at 8.5 GiB / 2.7 BPW hits 70% pass rate. Same as IQ4_XS at 14 GiB. The plain Q2_K (no imatrix) is the only one that drops to 50%. Calibration is the difference between "falls apart mid-task" and "actually resolves bugs." Every quant produced a non-empty diff on all 10 instances (100% patch_rate). They all attempt the work. The question is whether the patches actually fix the tests, and that's where calibrated vs. plain diverges hard. Tool error rates stay in the 9-12% range across the board. The imatrix quants keep tool-call generation stable even at 2-bit, which is where uncalibrated quants typically choke. Grafted MTP head. Tmax-27B dropped Qwen3.6's native Multi-Token-Prediction draft head. Since Tmax is architecturally identical to Qwopus3.6-Coder (same Qwen3.6-27B base), we grafted Qwopus's trained nextn head back on at Q8_0. Built-in speculative decoding with ~95% draft acceptance at --spec-draft-n-max 1. Pure speed, not quality, but a free 1.5-2x decode speedup on memory-bound GPUs. How to try it: ollama run hf.co/pearsonkyle/tmax-27b-imatrix-MTP-GGUF:IQ2_M # also: :IQ2_XS :Q2_K_S :Q2_K :IQ3_M :IQ4_XS :Q5_K_M Or with llama.cpp + MTP speculative decoding: ./llama-server --model tmax-27b-IQ4_XS.gguf \ --ctx-size 16384 --n-gpu-layers 999 \ --spec-type draft-mtp --spec-draft-n-max 1 \ --flash-attn on --cache-type-k q8_0 --cache-type-v q8_0 📎 Repo: pearsonkyle/tmax-27b-imatrix-MTP-GGUF 📎 Base model: allenai/tmax-27b 📎 Paper: Tmax: A simple recipe for terminal agents Happy to answer questions on the calibration methodology, the MTP graft, or the agentic eval setup. Let me know if folks would like to see results for the 9B model family too.

Original Article

Tmax-27b - a Qwen3.6-27b terminal agent for small GPUs trained with DPPO (RL)

Similar Articles

UPDATE: Qwen-27B-IQ4_KS and Qwen-27B-IQ_KS_KT for ik_llama.cpp, especially for NVIDIA with 16GB VRAM

@iotcoi: Qwen3.6-27B-FP8 + Dflash + DDTree, 256k context, 10 agents ~200 tokens/sec max decode 136t/s average on a single tiny G…

@Ex0byt: And... Ladies and Gentlemen: Qwen3.6-27B-PRISM-PRO-DQ - enjoy!

@SpaceTimeViking: Qwen3.6 27B getting some love on the new AEON ULTIMATE VLLM image @NVIDIAAI DGX SPARK OPTIMIZED! https://github.com/AEO…

qwen 3.6 27B AR-> Diffusion - local training on 5090

Submit Feedback

Similar Articles

UPDATE: Qwen-27B-IQ4_KS and Qwen-27B-IQ_KS_KT for ik_llama.cpp, especially for NVIDIA with 16GB VRAM
New GGUF quantizations of Qwen3.6-27B optimized for 16GB VRAM NVIDIA GPUs, including an experimental Trellis variant, with perplexity benchmarks.

@iotcoi: Qwen3.6-27B-FP8 + Dflash + DDTree, 256k context, 10 agents ~200 tokens/sec max decode 136t/s average on a single tiny G…

@Ex0byt: And... Ladies and Gentlemen: Qwen3.6-27B-PRISM-PRO-DQ - enjoy!

@SpaceTimeViking: Qwen3.6 27B getting some love on the new AEON ULTIMATE VLLM image @NVIDIAAI DGX SPARK OPTIMIZED! https://github.com/AEO…

qwen 3.6 27B AR-> Diffusion - local training on 5090