Tag
Qwythos 9B is a new open-source, uncensored reasoning model based on Qwen3.5, offering GGUF quantizations, a 1-million-token context, vision, and function calling, with significant performance improvements over the base model.
A quantized GGUF version of the abliterated GLM-5.2 model is released on Hugging Face, enabling local inference with various tools like Transformers, llama.cpp, and vLLM.
A new uncensored GGUF quantized version of the Qwythos-9B-Claude-Mythos-5-1M model, created using abliteration, is released on Hugging Face.
Unsloth released a GGUF quantization of Qwen-AgentWorld-35B-A3B, a native language world model that simulates agentic environments across seven domains (MCP, Search, Terminal, SWE, Android, Web, OS) using long chain-of-thought reasoning and trained via CPT, SFT, and RL.
Antirez says a branch implementing GLM 5.2 in DwarfStar is highly likely to be merged; GLM 5.2 could become the best model for a 512GB Mac Studio and could potentially run on distributed 128GB MacBooks with 2-bit quantization.
NVIDIA released GLM-5.2-NVFP4, a quantized version of ZAI's GLM-5.2 MoE language model optimized for inference on NVIDIA Blackwell GPUs using Model Optimizer.
Unsloth has uploaded a GGUF version of GLM-5.2 to Hugging Face, providing ready-to-use model files for inference engines such as llama.cpp, vLLM, and SGLang.
GestaltLabs releases Ornstein-3.5-9B-V1.5 GGUF quantizations, a reasoning-focused fine-tune of Qwen 3.5 9B with an MTP head and vision projector for multimodal use.
A Hugging Face repository (kaitchup/Qwen3.6-27B-GGUF-MoQ) provides GGUF quantized weights for the Qwen3.6-27B MoQ model, enabling local inference with tools like llama.cpp and Ollama.
A GGUF quantized version of the Qwopus3.6-27B-Coder-MTP model is released on Hugging Face, optimized for local inference and compatible with Transformers, vLLM, SGLang, and Unsloth Studio.
Holo3.1 is an updated computer-use model family that improves robustness across web, desktop, and mobile environments, introduces quantized checkpoints for local execution, and adds native support for function-calling protocols.
NVIDIA releases Qwen3.6-35B-A3B-NVFP4, a quantized version of Alibaba's mixture-of-experts multimodal language model, optimized for deployment on NVIDIA GPUs using Model Optimizer.
DealignAI releases CRACK-abliterated and MXFP4/MXFP8 quantized versions of Qwen3.6-27B and 35B models, preserving MTP for faster speculative decoding on Apple Silicon.
Qwen 3.6 27B runs fast on 16 GB of VRAM thanks to 'Pure Quant' technology, achieving 40 tokens/s with MTP and supporting 64k contexts, enabling local AI on consumer GPUs such as the RTX 4060 Ti.
TurboQuant is a GGUF quantized version of the Qwopus3.6-27B-v2 model, confirmed with GPQA test results and shared on Hugging Face, with credits to Jackrong and KyleHessling.
Qwen3.6-27B-PRISM-PRO-DQ is released: a dynamically quantized GGUF version of Qwen3.6-27B with bias/propaganda removal that preserves the native MTP draft head and vision tower, enabling lossless speculative decoding for faster inference.
CohereLabs releases Command A+, an open-source model with 25B active parameters, optimized for agentic, multilingual, and reasoning tasks, with vision support and an Apache 2.0 license.
DavidAU releases a custom 40B-parameter model based on Qwen 3.6, expanded and fine-tuned with Claude 4.6 Opus distill and Deckard datasets, with optimized GGUF quantizations for improved precision and uncensored capabilities.
A new 18B merged quantized model, Qwopus-GLM-18B-GGUF, outperforms 35B MoE models while using half the VRAM and running on consumer GPUs.
Google’s Gemma 4 E2B/E4B quantized variants now run fully offline on iPhone via apps like Locally AI, leveraging the Apple Neural Engine for on-device inference.