Ai2 released Tmax-27B, a terminal-agent LLM trained with DPPO (RL) on Qwen3.6-27B. The author provides importance-matrix-calibrated GGUF quantizations that stay competitive on agentic benchmarks even at very low bit-widths, along with a grafted MTP draft head for speculative decoding.
New GGUF quantizations of Qwen3.6-27B optimized for 16GB VRAM NVIDIA GPUs, including an experimental Trellis variant, with perplexity benchmarks.
A GGUF conversion of MiniMax M3's EAGLE draft model for llama.cpp is now available, enabling speculative decoding speedups on compatible hardware.
A guide on running Z.ai's open model GLM-5.2 locally using Unsloth Dynamic GGUFs. The model has 744B total parameters (40B active) and a 1M context window; the 2-bit quantization cuts the memory footprint to 239GB, enabling local inference on 256GB Macs.
Qwopus 3.6 27B-Coder-Compat is a new GGUF release with compatibility fixes for various harnesses, reducing looping and improving thinking stability. It can generate full HTML games and is suitable for local deployment.
The user released Apostate, an abliterated version of Qwen 3.6 27B that reduces the safety-alignment refusal rate from 92% to 7.6% with minimal capability loss (KL divergence of 0.120).
Antirez reports the first working implementation of GLM-5.2 in DwarfStar, running a 433GB GGUF file on an M3 Ultra with 512GB of RAM, though it needs further refinement.
A user questions why AutoRound, a quantization tool offering superior accuracy retention at low bits and direct GGUF export, is overlooked despite outperforming standard AWQ and RTN, especially on complex models like Qwen3.6 27B.
Empero AI releases Qwythos-9B-Claude-Mythos-5-1M-GGUF, a 9B-parameter reasoning model fine-tuned on 500M+ tokens of Claude Mythos/Fable traces with chain-of-thought. It achieves significant gains over Qwen3.5-9B and supports a 1M-token context via YaRN RoPE scaling. The GGUF quantizations enable local inference with llama.cpp and compatible runtimes.
A fine-tuned version of Gemma-4-12B, optimized for local coding and agentic tasks, that achieves a ~3.5x improvement over the base model on the tau2-bench telecom benchmark.
This article introduces calibrated 2-bit GGUF quantizations of the Qwopus3.6-27B-Coder model for agentic coding tasks. It shows that the IQ2_M quant (9.74 GiB) achieves a 63% pass rate on the SWE-rebench benchmark, comparable to a Q5_K_M quant while being half the size.
Mia-AiLab released Gemmable 4 12B, a version of Google's Gemma 4 12B model fine-tuned on Fable-5-style reasoning and assistant traces, available in GGUF and MLX formats for local inference.
UnslothAI announces that GLM-5.2, Z.ai's strongest open model with 744B parameters, is now runnable locally via dynamic GGUF quantization, which reduces its size by ~84% to 239GB while retaining ~82% accuracy. It fits on 256GB Macs and supports long-context, reasoning, and agentic tasks.
A GGUF-quantized GLM-5.2 model has been released; the 8-bit version is half the size of the full model, and smaller versions are coming soon.
Unsloth has uploaded a GGUF version of GLM-5.2 to Hugging Face, providing ready-to-use model files for inference engines such as llama.cpp, vLLM, and SGLang.
GestaltLabs releases Ornstein-3.5-9B-V1.5 GGUF quantizations, a reasoning-focused fine-tune of Qwen 3.5 9B with an MTP head and vision projector for multimodal use.
Alibaba's Tongyi Lab highlights a community model, Qwen3.6-27B-MTP-pi-reasoning-GGUF, built on Qwen3.6-27B, optimized for automated programming and debugging workflows for local coding agents.
GGUF quantized versions of Cohere's command-a-plus-05-2026 model, optimized for llama.cpp and available in various quantization levels for local inference.
A Hugging Face repository (kaitchup/Qwen3.6-27B-GGUF-MoQ) provides GGUF quantized weights for the Qwen3.6-27B MoQ model, enabling local inference with tools like llama.cpp and Ollama.
The writer shares their experience with Nex-N2 Pro, originally mistaken for Rio-3.5, and finds that it performs exceptionally well on coding benchmarks without hallucination, rivaling GPT-5.x on their Mac setup.
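The GLM-5.2 size figures quoted in the items above (744B parameters, 239GB at 2-bit, ~84% smaller) can be cross-checked with a little arithmetic. The sketch below is only a sanity check, not anything taken from the releases themselves: it assumes the unquantized weights are stored at 16 bits (2 bytes) per parameter and that 1 GB means 10^9 bytes, and the function names are made up for illustration.

```python
# Sanity check of the GLM-5.2 size figures quoted in the feed items.
# ASSUMPTIONS (not stated in the source): the full model is stored at
# 16 bits (2 bytes) per parameter, and 1 GB = 1e9 bytes.

TOTAL_PARAMS = 744e9          # 744B total parameters (from the feed)
QUANT_SIZE_GB = 239           # 2-bit dynamic GGUF size (from the feed)
BYTES_PER_PARAM_16BIT = 2     # assumption: 16-bit full-precision weights


def full_size_gb(params: float, bytes_per_param: float) -> float:
    """Size of the unquantized weights in GB (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9


def reduction_pct(full_gb: float, quant_gb: float) -> float:
    """Percentage by which the quantized file is smaller than the full one."""
    return 100.0 * (1.0 - quant_gb / full_gb)


def effective_bits_per_weight(size_gb: float, params: float) -> float:
    """Average number of bits stored per parameter in a file of this size."""
    return size_gb * 1e9 * 8 / params


full_gb = full_size_gb(TOTAL_PARAMS, BYTES_PER_PARAM_16BIT)
print(f"full 16-bit size:    {full_gb:.0f} GB")
print(f"size reduction:      {reduction_pct(full_gb, QUANT_SIZE_GB):.1f} %")
print(f"effective bits/wt:   {effective_bits_per_weight(QUANT_SIZE_GB, TOTAL_PARAMS):.2f}")
```

Under those assumptions the full model comes to about 1488GB, the reduction to 239GB works out to roughly 84%, and the file averages about 2.57 bits per weight. An average above 2 bits would be consistent with a "2-bit" quant that keeps some tensors at higher precision and stores per-block scales, though the feed items do not say how the bits are distributed.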