A guide on running Z.ai's open model GLM-5.2 locally using Unsloth Dynamic GGUFs. The model has 744B total parameters (40B active) and a 1M context window; the 2-bit quantized version reduces the memory footprint to 239GB, enabling local inference on 256GB Macs.
A developer fine-tunes a small Qwen 3 0.6B model using the Unsloth framework to categorize household questions, achieving good results with only 850 training examples.
A tweet promoting the Qwen 3.6 27B model and recommending UnslothAI for running it on any GPU.
Unsloth has compressed the GLM-5.2 model to 238GB using 2-bit quantization, allowing it to run locally on 256GB Macs while retaining about 82% of its accuracy.
UnslothAI announces GLM-5.2, Z.ai's strongest open model with 744B parameters, now runnable locally via dynamic GGUF quantization reducing size by ~84% to 239GB while retaining ~82% accuracy. It fits on 256GB Macs and supports long-context, reasoning, and agentic tasks.
A GGUF-quantized version of the GLM 5.2 model has been released; the 8-bit version is half the size of the full model, and smaller versions are coming soon.
Unsloth quantizations for the GLM 5.2 model are being released.
Daniel Han built Unsloth, a tool that rewrites GPU kernels to make fine-tuning 2-3 times faster on a single GPU, enabling many open-source users to train models without a cluster.
Unsloth releases GGUF quantizations of Kimi K2.7 Code, a 1 trillion parameter MoE coding model built on Kimi K2.6 with improved token efficiency and agentic coding capabilities.
Unsloth is uploading a GGUF quantized version of the MiniMax M3 model to Hugging Face.
Unsloth releases a GGUF quantized version of the MiniMax-M3 multimodal model, enabling image-text-to-text tasks with support for Transformers, llama.cpp, vLLM, and other inference engines.
Unsloth AI announces that Gemma 4 runs 2x faster with MTP GGUFs, making it feasible for local coding agents on hardware like a MacBook Pro M1 Max at 72 tokens/s.
Unsloth releases a 2-bit quantized Gemma 4 12B model, only 4.66GB, runnable locally, with capabilities like autonomous online search and deep analysis similar to McKinsey consulting.
The user is implementing reasoning training with verifiers using Unsloth and TRL, reports progress on locally generating GRPO-like rollouts with a small language model (SLM) and a tiny reward model (RM), and promises a video soon.
Unsloth releases GGUF quantizations of Google DeepMind's DiffusionGemma (26B-A4B), a new block-diffusion architecture for faster text generation, ready for llama.cpp.
Unsloth released Gemma 4 QAT MTP assistant models as GGUF files on Hugging Face, available in q8_0 and larger quantization formats.
A detailed benchmark comparing ByteShape and Unsloth quantizations of Qwen3.6-35B-A3B on tool calling performance, KV cache quantization effects, and long context degradation using llama.cpp and tool-eval-bench.
A discussion of whether it makes sense to apply alternative quantization methods to quantization-aware trained (QAT) models like Gemma 4, questioning whether Unsloth's benchmarks, which show performance closer to the QAT fine-tunes, indicate a benefit or a counterproductive result.
Unsloth has released Multi-Token Prediction (MTP) GGUF weights for Gemma 4 models (31B, 26B-A4B, 12B) in Q8, F16, and BF16 precisions, available on Hugging Face.
Unsloth releases GGUF quantized versions of Google DeepMind's Gemma 4 models, optimized with Quantization-Aware Training (QAT) to reduce memory requirements while preserving quality, supporting multiple formats and sizes for diverse deployment.
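The GLM-5.2 entries above quote a ~84% size reduction to 239GB for a 744B-parameter model. The short sketch below checks that arithmetic. It assumes the unquantized baseline is stored in a 16-bit format (2 bytes per parameter) and that 1GB means 10^9 bytes; neither assumption is stated in the summaries, and the function names are illustrative only.

```python
def full_size_gb(params_billion: float, bytes_per_param: float = 2.0) -> float:
    """Size in GB of an unquantized checkpoint.

    Assumption: 16-bit weights (2 bytes per parameter), 1 GB = 10**9 bytes.
    """
    return params_billion * bytes_per_param


def reduction_pct(full_gb: float, quant_gb: float) -> float:
    """Percentage by which the quantized file is smaller than the full one."""
    return (1.0 - quant_gb / full_gb) * 100.0


def bits_per_weight(quant_gb: float, params_billion: float) -> float:
    """Average number of bits stored per parameter in the quantized file."""
    return quant_gb * 8.0 / params_billion


if __name__ == "__main__":
    full = full_size_gb(744)  # 744B parameters, as quoted above
    print(f"full size:       {full:.0f} GB")
    print(f"reduction:       {reduction_pct(full, 239):.1f} %")
    print(f"bits per weight: {bits_per_weight(239, 744):.2f}")
```

Under these assumptions the baseline comes to 1488GB, so 239GB is a reduction of roughly 84%, matching the quoted figure, and works out to an average of about 2.6 bits per weight. An average above 2 bits would be consistent with a dynamic scheme that keeps some layers at higher precision, though the summaries do not spell out the per-layer breakdown.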