Tag
Unsloth, a popular LLM fine-tuning library, announces upcoming support for Apple Silicon devices, expanding its optimization capabilities beyond NVIDIA GPUs.
A step-by-step guide on using MCP servers with local LLMs like Qwen3.6 and Gemma 4 via Unsloth and llama.cpp, enabling private automated workflows with tools, files, and APIs.
The post announces an upcoming video on training tiny models for preference tuning, covering reward models, RLHF, DPO, and ORPO with Unsloth and TRL.
UnslothAI announces that its 4-bit Qwen3.6 MTP GGUF model can search over 70 websites from a single prompt while running locally on 20 GB of RAM via Unsloth Studio. The update also adds automatic MTP and speculative decoding support.
Unsloth's Qwen3.6 27B Q6_K quantization achieves over 100 tokens per second with MTP on an RTX 5090, up from 45-50 tokens per second without MTP.
A user shares their preference for Unsloth quantized models due to fast releases and low perplexity, compares them with Apex MoE quants, and asks the community for their favorite quant publisher.
The article highlights the ability to run Qwen3-35B-A3B locally on a laptop for free using llama.cpp and Unsloth 4-bit quantization.
The tweet announces that users can now fine-tune Google's Gemma 4 model for free in the browser using the Unsloth Colab notebook, significantly lowering the barrier to entry for model customization.
Unsloth has released an optimized GGUF version of the Qwen3.6-27B MTP model that reaches up to 114 tok/s on an RTX 5090, significantly faster than previous quantizations.
A user benchmarks MTP, TriAttention, and TurboQuant optimizations on Qwen 3.6 35B using Unsloth on consumer hardware, finding TurboQuant to be the most effective.
The article provides a detailed tutorial on setting up a local coding agent using Qwen3.6-27B via Unsloth Studio and the Pi coding harness. It highlights the benefits of using GGUF quantized models for efficient inference on consumer hardware like Apple Silicon Macs.
Unsloth releases GGUF-quantized versions of Qwen3.6 models with Multi-Token Prediction (MTP) support.
This article announces the release of the Qwen3.6-35B-A3B model weights on Hugging Face, optimized by Unsloth with Multi-Token Prediction (MTP) for faster generation via llama.cpp. It highlights improvements in agentic coding capabilities, tool calling, and reasoning context preservation.
Unsloth has released GGUF weights for the Qwen3.6-27B model, featuring Multi-Token Prediction (MTP) for faster generation and enhanced agentic coding capabilities.
NVIDIA and Unsloth have published a technical guide detailing three low-level optimizations that can accelerate LLM fine-tuning by up to 25%, including packed-sequence caching, double-buffered checkpointing, and optimized MoE routing. The guide provides deep systems-level explanations and benchmarks aimed at ML engineers and developers.
This Hugging Face repository provides GGUF files for Qwen3.6-27B with Multi-Token Prediction (MTP) layers grafted onto Unsloth UD XL quantizations. It includes instructions for building llama.cpp with MTP support to enable speculative decoding.
This entry describes Qwen3.5-9B-DeepSeek-V4-Flash, a distilled model that transfers reasoning capabilities from DeepSeek-V4 into a smaller 9B-parameter model for efficient inference.
A user demonstrates that running Qwen 3.6 27B/35B locally with llama-server cuts Claude Code API costs from $142 to under $4 for an 8-hour vibe-coding session, achieving a 30-day payback on a $4,500 dual-RTX 3090 rig.
Unsloth releases a GGUF quantized version of the Qwen3.6-27B model, featuring improved agentic coding capabilities, tool calling, and support for Unsloth Studio.
Unsloth has released a GGUF-quantized version of the Kimi K2.6 model, enabling efficient local inference.