Tag
Ai2 released Tmax-27B, a terminal-agent LLM trained with DPPO (RL) on Qwen3.6-27B, and the author provides importance-matrix-calibrated GGUF quantizations that achieve competitive performance on agentic benchmarks even at very low bit-widths, with a grafted MTP draft head for speculative decoding.
New GGUF quantizations of Qwen3.6-27B optimized for 16GB VRAM NVIDIA GPUs, including an experimental Trellis variant, with perplexity benchmarks.
An analysis questioning whether OpenRouter's API pricing for open models like GLM-5.2 implies more aggressive quantization than assumed, given the economics of running large models on expensive hardware like 8xH200.
GLM 5.2 delivers major performance gains on Mac Studio with 512GB RAM, achieving prefill speeds above 100 t/s at high context lengths and enabling 4-bit quantization for contexts over 100k tokens, as detailed in a pull request by the oMLX creator.
The author maps the Kullback-Leibler divergence of KV cache quantization for the Qwen3.6-35B-A3B and Gemma4-E2B QAT models.
A GGUF conversion of MiniMax M3's EAGLE draft model for llama.cpp is now available, enabling speculative decoding speedups on compatible hardware.
Baseten announces the world's fastest API for the GLM-5.2 open model, achieving over 280 tokens per second via NVFP4 quantization, disaggregated inference, and other optimizations.
A guide on running Z.ai's open model GLM-5.2 locally using Unsloth Dynamic GGUFs. The model features 744B total parameters (40B active) and a 1M context window, with quantized versions reducing memory to 239GB for 2-bit, enabling local inference on 256GB Macs.
A user proposes a hardware setup using four RTX 5060 Ti GPUs and 512 GB of DDR3 server RAM to run GLM2 at a decent quantization and seeks feedback on the idea's viability.
详细介绍了针对语音克隆模型的W4A4 CUDA内核优化,通过INT4量化和融合LoRA,实现了比FP16快2.6倍的推理速度。
Using TurboQuant, the user achieved 20 tokens per second on a Qwen 3.6 35B MoE model running on a GTX1060 3GB, showcasing impressive performance on outdated hardware.
The Gemma 4 QAT 31B model demonstrates improved behavior with KV cache quantization, suggesting enhanced inference efficiency.
Discusses running a Q6 quantized version of the Gemma 4 31B model on a dual 9060 XT GPU configuration, likely for local inference.
A new page in the LLM Engineer's Almanac provides a block quant visualizer to help engineers understand quantization formats for owning their LLM inference.
A tweet thread introduces a visualizer for micro-scaling/block quant formats like NVFP4 and MXFP4, explaining how these low-precision floats work and their use in LLM inference to reduce memory bandwidth demands.
A user questions why AutoRound, a quantization tool offering superior accuracy retention at low bits and direct GGUF export, is overlooked despite outperforming standard AWQ and RTN, especially on complex models like Qwen3.6 27B.
Luke Alonso uploaded an NVFP4 quantized version of GLM 5.2 (467GB) that can fit on 4x DGX Sparks hardware, costing approximately $20k.
A new tool enables converting and running EXL3 quantized models on Apple Silicon Macs, matching or nearly matching RTX conversion quality, making high-fidelity quants more accessible.
Sharing a machine learning systems notes repo on GitHub, covering distributed computing, parallelization, quantization, and PyTorch internals related to LLM training and inference. Suitable for learners interested in ML systems.
A guide on optimizing VRAM usage on an AMD 7900XTX to run a 27B Qwen model with Q6K quantization and 131k context by compiling llama.cpp with OpenBLAS and CUDA_FA_ALL_QUANTS, and using kvcache quantization at q5_0/q4_0.