Tag
A diffusion model that can transform any image into an interactive, playable hallucination, running locally on user hardware.
A guide to running Z.ai's open model GLM-5.2 locally using Unsloth Dynamic GGUFs. The model has 744B total parameters (40B active) and a 1M context window; the 2-bit quantized version reduces the memory footprint to 239GB, enabling local inference on 256GB Macs.
A comprehensive guide to optimizing local LLM inference on consumer hardware, covering tools like llama.cpp, vLLM, and LM Studio, with practical advice on memory hierarchy, layer placement, and common failure modes.
DwarfStar is a self-contained native inference engine optimized for DeepSeek V4 Flash and PRO models, supporting Metal, CUDA, and ROCm backends, with a focus on high-end personal machines and Mac Studios.
Antirez reports the first working implementation of GLM 5.2 in DwarfStar, using a 433 GB GGUF file on an M3 Ultra with 512GB RAM, though it needs further refinement.
GLM 5.2 offers improved token efficiency, allowing users to achieve 98% of max-level intelligence using less than half the tokens. The model's 'high' effort level provides a practical alternative for day-to-day use compared to the resource-intensive 'max' level.
GLM-5.2 is now supported for local execution via llama.cpp and Unsloth Studio.
Unsloth has compressed the GLM-5.2 model to 238GB using 2-bit quantization, allowing it to run locally on a 256GB Mac while retaining about 82% accuracy.
A user runs GLM-5.2 locally on CPU only, demonstrating how to run a large model on a modest setup.
GLM 5.2 is an open-weights LLM that is sufficiently capable to allow businesses to manage their IT needs locally on affordable hardware, potentially transforming small/medium enterprise data management.
Mia-AiLab released Gemmable 4 12B, a fine-tuned version of Google's Gemma 4 12B model using Fable-5 style reasoning and assistant traces, available in GGUF and MLX formats for local inference.
UnslothAI announces GLM-5.2, Z.ai's strongest open model with 744B parameters, now runnable locally via dynamic GGUF quantization reducing size by ~84% to 239GB while retaining ~82% accuracy. It fits on 256GB Macs and supports long-context, reasoning, and agentic tasks.
Xenova has open-sourced 27 custom WebGPU kernels that let Gemma 4 run fully offline and locally in the browser at 255 tok/s. The article discusses advantages such as privacy and offline use, and also mentions FLUX.2's 3D generation capability.
A pruned and quantized version of MiniMax-M3 (MiniMax-M3-Medium-JANG_2L) optimized to run on 128GB Macs using vMLX, featuring 32% expert pruning and JANG_2L mixed-precision quantization to fit within ~105 GB.
parakeet.cpp enables running NVIDIA Parakeet ASR behind the OpenAI API locally with prebuilt Docker images, supporting CPU and CUDA (including arm64) for real-time transcription with word timestamps.
Mia-AiLab releases Qwable-3.6-27b-MTP, a full fine-tuned checkpoint of Qwen3.6-27B using a cleaned Fable 5 reasoning and instruction dataset, focused on code, structured reasoning, and local inference with MTP layers.
WeiboAI released VibeThinker-3B, a small 3B reasoning model tested locally on coding tasks, achieving 3/3 on algorithm problems.
A Hugging Face repository (kaitchup/Qwen3.6-27B-GGUF-MoQ) provides GGUF quantized weights for the Qwen3.6-27B MoQ model, enabling local inference with tools like llama.cpp and Ollama.
Tweet reminding developers they can run coding agents locally using llama.cpp and OpenCode for fast, reliable, and private inference, demonstrating with UnslothAI's North-Mini-Code-1.0-GGUF model.
Gemma 4 now runs 2x faster with MTP GGUF format and can run locally on just 6GB RAM. The linked article explains how GGUF works, including quantization and memory mapping.
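The size figures quoted in the GLM-5.2 entries above (744B parameters, a 239GB 2-bit GGUF, a reduction of roughly 84%) can be sanity-checked with a little arithmetic. The sketch below is only illustrative: the helper names are hypothetical, and it assumes 1GB = 10^9 bytes and a 16-bit baseline checkpoint, neither of which is stated in the summaries.

```python
def effective_bits_per_weight(file_size_gb: float, params_billion: float) -> float:
    """Average bits stored per parameter, assuming 1 GB = 1e9 bytes."""
    # (GB * 1e9 bytes * 8 bits) / (billions of params * 1e9) simplifies to GB * 8 / B
    return file_size_gb * 8 / params_billion


def size_reduction(baseline_bits: float, file_size_gb: float, params_billion: float) -> float:
    """Fractional size reduction relative to a checkpoint stored at baseline_bits per weight."""
    baseline_gb = params_billion * baseline_bits / 8
    return 1 - file_size_gb / baseline_gb


if __name__ == "__main__":
    # Figures taken from the GLM-5.2 summaries; the 16-bit baseline is an assumption.
    bpw = effective_bits_per_weight(239, 744)
    reduction = size_reduction(16, 239, 744)
    print(f"effective bits per weight: {bpw:.2f}")
    print(f"reduction vs 16-bit baseline: {reduction:.1%}")
```

Under those assumptions, the "2-bit" file averages somewhat more than 2 bits per weight, which fits a dynamic quantization scheme that keeps some tensors at higher precision, and the computed reduction lands close to the quoted ~84%.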