Ai2 released Tmax-27B, a terminal-agent LLM trained with DPPO (RL) on Qwen3.6-27B, and the author provides importance-matrix-calibrated GGUF quantizations that achieve competitive performance on agentic benchmarks even at very low bit-widths, with a grafted MTP draft head for speculative decoding.
Hey everyone, wanted to share some work on making the new Tmax-27B terminal agent actually runnable on consumer hardware. What is Tmax-27B? Ai2 just released Tmax, a family of terminal-agent LLMs trained with DPPO (RL) on top of Qwen3.6. The 27B model hits ~43% on Terminal Bench 2.0 and ~69% on TB Lite. These are agentic benchmarks where the model navigates a shell, edits files, runs tests, and completes real dev tasks in a container. The problem: 27B at FP16 is ~54 GB. Not fitting on your RTX 5070. What we did: A bunch of importance-matrix-calibrated GGUF quants from ~2-5 bits-per-weight, each with a grafted MTP draft head at Q8_0 for built-in speculative decoding. Pick the tier that fits your VRAM: Q2_K (plain) IQ2_XS IQ2_M Q2_K_S IQ3_M IQ4_XS Q5_K_M File Q2_K IQ2_XS IQ2_M Q2_K_S IQ3_M IQ4_XS Technique plain hybrid imatrix hybrid imatrix hybrid imatrix hybrid imatrix hybrid imatrix Size (GiB) 9.98 8.47 9.32 9.54 11.72 14.05 BPW 3.186 2.704 2.976 3.048 3.742 4.486 PPL (general) 7.6005 20.3585 21.0408 16.7292 20.4368 13.1867 KLD med (general) 0.1727 0.1262 0.0783 0.0826 0.0278 0.0059 top_p (general) 73.03% 73.89% 77.77% 77.96% 83.56% 91.45% Lower KLD / higher top_p = closer to FP16. Q2_K is a plain (non-imatrix) anchor; everything else uses the hybrid importance matrix. Why calibration matters for agents. Agentic tasks are brutal on quantization. The model has to produce valid tool-call XML, reason over multi-step contexts, and not degrade on long trajectories where token-level errors compound. Raw 2-bit quantization shreds this. An importance matrix tells the quantizer where precision matters most, per channel, based on real activation energy from agentic coding sessions. Critical layers keep more bits; everything else gets squeezed. Additionally, we increase our calibration context from 512 tokens to 4K while also minimizing the influence of the system prompt which can sometimes take the entire calibration budget without leaving room for any tool calls. The agentic results. Every quant was run as a coding agent (mini-swe-agent) over the same 10 held-out SWE-rebench instances, one clean Docker container each. pass_rate = fraction whose patch makes the gold FAIL_TO_PASS tests pass; patch_rate = fraction that produced a non-empty diff: Quant pass_rate patch_rate resolved mean tokens mean steps tool-err Q2_K 50% 100% 5/10 621,931 38.7 11% IQ2_XS 70% 100% 7/10 784,972 49.8 9% IQ2_M 60% 100% 6/10 596,658 40.9 10% Q2_K_S 70% 100% 7/10 529,560 37.1 12% IQ3_M 70% 100% 7/10 770,113 47.5 10% IQ4_XS 70% 100% 7/10 791,474 48.3 9% IQ2_XS at 8.5 GiB / 2.7 BPW hits 70% pass rate. Same as IQ4_XS at 14 GiB. The plain Q2_K (no imatrix) is the only one that drops to 50%. Calibration is the difference between "falls apart mid-task" and "actually resolves bugs." Every quant produced a non-empty diff on all 10 instances (100% patch_rate). They all attempt the work. The question is whether the patches actually fix the tests, and that's where calibrated vs. plain diverges hard. Tool error rates stay in the 9-12% range across the board. The imatrix quants keep tool-call generation stable even at 2-bit, which is where uncalibrated quants typically choke. Grafted MTP head. Tmax-27B dropped Qwen3.6's native Multi-Token-Prediction draft head. Since Tmax is architecturally identical to Qwopus3.6-Coder (same Qwen3.6-27B base), we grafted Qwopus's trained nextn head back on at Q8_0. Built-in speculative decoding with ~95% draft acceptance at --spec-draft-n-max 1. Pure speed, not quality, but a free 1.5-2x decode speedup on memory-bound GPUs. How to try it: ollama run hf.co/pearsonkyle/tmax-27b-imatrix-MTP-GGUF:IQ2_M # also: :IQ2_XS :Q2_K_S :Q2_K :IQ3_M :IQ4_XS :Q5_K_M Or with llama.cpp + MTP speculative decoding: ./llama-server --model tmax-27b-IQ4_XS.gguf \ --ctx-size 16384 --n-gpu-layers 999 \ --spec-type draft-mtp --spec-draft-n-max 1 \ --flash-attn on --cache-type-k q8_0 --cache-type-v q8_0 📎 Repo: pearsonkyle/tmax-27b-imatrix-MTP-GGUF 📎 Base model: allenai/tmax-27b 📎 Paper: Tmax: A simple recipe for terminal agents Happy to answer questions on the calibration methodology, the MTP graft, or the agentic eval setup. Let me know if folks would like to see results for the 9B model family too.
Quantized 27B Qwen3.6 model achieves 200 tok/s peak (136 avg) with 256k context and 10 agents on a single 49W GB10 GPU using Dflash+DDTree optimizations.
Release of Qwen3.6-27B-PRISM-PRO-DQ, a dynamically quantized GGUF version of Qwen3.6-27B with bias/propaganda removal, preserving native MTP draft head and vision tower, enabling lossless speculative decoding for faster inference.
AEON-7 releases a fully uncensored, capability-enhanced abliteration of Qwen3.6-27B, optimized for NVIDIA DGX Spark with NVFP4 quantization and DFlash speculative decoding for improved performance.
The author details attempts to locally train a Qwen 3.6 27B autoregressive-to-diffusion model on an Nvidia 5090 GPU using qlora and modifications from open-dllm and d3LLM, facing VRAM constraints and hardware issues while exploring one-shot diffusion techniques.