Tag
A guide on optimizing VRAM usage on an AMD 7900XTX to run a 27B Qwen model with Q6K quantization and 131k context by compiling llama.cpp with OpenBLAS and CUDA_FA_ALL_QUANTS, and using kvcache quantization at q5_0/q4_0.
Unsloth enables free fine-tuning of a 31B parameter multimodal model on Kaggle using 4-bit quantization, requiring only 22-24GB VRAM for local runs.
A user reports running GLM 5.2 locally on a Mac Studio with 2-bit quantization, claiming it outperforms Opus 4.8 and enables free, private superintelligence for coding and agent tasks.
NVFP4 KV cache quantization on sm120 significantly improves memory efficiency for large language models, enabling 32GB VRAM systems to achieve ~60 tok/sec inference at 196k context size with Qwen3.6-27B.
UnslothAI announces GLM-5.2, Z.ai's strongest open model with 744B parameters, now runnable locally via dynamic GGUF quantization reducing size by ~84% to 239GB while retaining ~82% accuracy. It fits on 256GB Macs and supports long-context, reasoning, and agentic tasks.
This paper proposes a unified framework for customizing and deploying LLM-based multi-agent systems in enterprise settings, combining model customization through continual pretraining, fine-tuning, and preference optimization with inference optimization using speculative decoding and FP8 quantization. It achieves 4.48x throughput speedup while maintaining performance on enterprise workloads.
Proposes a structural pruning framework for MoE models that maximizes channel-score coverage via attribution-based approximation, achieving 50% or 25% pruning with 4-bit quantization and reducing memory footprint by 5.27x on Qwen3-30B-A3B.
A pruned and quantized version of MiniMax-M3 (MiniMax-M3-Medium-JANG_2L) optimized to run on 128GB Macs using vMLX, featuring 32% expert pruning and JANG_2L mixed-precision quantization to fit within ~105 GB.
This paper identifies a fundamental limitation (shrinkage bias) in non-uniform FP4 quantization formats for LLM pretraining and proposes UFP4, a uniform 4-bit training recipe that outperforms existing E2M1-based methods.
GLM 5.2 GGUF quantized model is released, with 8-bit version half the size of the full model; smaller versions are coming soon.
Unsloth quantizations for the GLM 5.2 model are being released.
A user reports achieving over 90 tokens per second inference speed with Qwen 3.6-35b-a3b MoE model on an RTX 3090 using llama.cpp, with prefill speeds exceeding 1000 t/s, indicating practical local deployment of large language models on consumer hardware.
GLM-5.2, a 753B parameter open-source model with MIT license, offers frontier-level coding capabilities and massive context window. Its distillation potential promises significant improvements for local AI setups.
A user asks about the feasibility of running GLM-5.2 at 4-bit quantization on four Ascend GX10s or DGX Sparks, wondering about speed and memory for 100k context.
This paper presents a quantized, integer-only transformer implementation for jet tagging on AMD Versal AI Engines, including a reusable open-source framework that maps transformer layers to AIE tiles for low-latency trigger systems at CERN LHC.
This paper introduces MODE, a modality-decomposed expert-level mixed-precision quantization framework for MoE multimodal LLMs that addresses biases in expert importance estimation by decomposing selection frequency by modality and filtering redundant vision tokens, achieving minimal performance loss under aggressive quantization.
A discussion on the cheapest local hardware setups for running GLM 5.x and similarly sized models at 4-bit quantization, including CPU-only and multi-GPU options, with a user sharing their experience running Minimax 2.7 and Qwen 3.6 on a 5900X + 128GB DDR4 + 7900XT setup.
User runs Gemma 4 31B dense model on 8GB VRAM gaming laptop at ~3 tokens/sec using llama.cpp with MTP speculative decoding, demonstrating feasibility of running a 31B dense model on consumer hardware and proposing agentic workflows where a fast MoE model routes to this slower dense model for hard tasks.
GLM 5.2, an open-weight AI model comparable to top closed models, has been released and is now running on MLX on two Mac Studios (M3 Ultra).
GGUF quantized versions of Cohere's command-a-plus-05-2026 model, optimized for llama.cpp and available in various quantization levels for local inference.