quantization

#quantization

I Figured Out What Causes 'Super Weights'

Reddit r/ArtificialInteligence ↗ · 1h ago

Explains that super weights in large language models arise from the SoftMax-Attention interaction creating a 'Nothing Dump' token that serves as a stable reference point; removing these weights cripples performance.


Tmax-27b - a Qwen3.6-27b terminal agent for small GPUs trained with DPPO (RL)

Reddit r/LocalLLaMA ↗ · 3h ago

Ai2 released Tmax-27B, a terminal-agent LLM trained with DPPO (RL) on Qwen3.6-27B. The author provides importance-matrix-calibrated GGUF quantizations that remain competitive on agentic benchmarks even at very low bit-widths, along with a grafted MTP draft head for speculative decoding.


UPDATE: Qwen-27B-IQ4_KS and Qwen-27B-IQ_KS_KT for ik_llama.cpp, especially for NVIDIA with 16GB VRAM

Reddit r/LocalLLaMA ↗ · 4h ago

New GGUF quantizations of Qwen3.6-27B optimized for 16GB VRAM NVIDIA GPUs, including an experimental Trellis variant, with perplexity benchmarks.


Openrouter model prices implying heavier quantization?

Reddit r/LocalLLaMA ↗ · 5h ago

An analysis questioning whether OpenRouter's API pricing for open models like GLM-5.2 implies more aggressive quantization than assumed, given the economics of running large models on expensive hardware like 8xH200.


GLM 5.2 on Mac Studio Speedup PR

Reddit r/LocalLLaMA ↗ · 5h ago

A pull request by the oMLX creator brings major speedups for GLM 5.2 on a Mac Studio with 512GB RAM, reaching prefill speeds above 100 t/s at high context lengths and enabling 4-bit quantization for contexts over 100k tokens.


I mapped the KLD of KV cache quantization for Qwen3.6-35B-A3B and Gemma4-E2B QAT

Reddit r/LocalLLaMA ↗ · 7h ago

The author maps the Kullback-Leibler divergence of KV cache quantization for the Qwen3.6-35B-A3B and Gemma4-E2B QAT models.


MiniMax-M3-EAGLE3-GGUF - Llama.cpp compatible MiniMax M3 EAGLE draft model!

Reddit r/LocalLLaMA ↗ · 18h ago

A GGUF conversion of MiniMax M3's EAGLE draft model for llama.cpp is now available, enabling speculative decoding speedups on compatible hardware.


@philipkiely: https://x.com/philipkiely/status/2069212319746506968

X AI KOLs Timeline ↗ · 22h ago

Baseten announces the world's fastest API for the GLM-5.2 open model, achieving over 280 tokens per second via NVFP4 quantization, disaggregated inference, and other optimizations.


Unsloth GLM-5.2 – How to Run Locally

Hacker News Top ↗ · yesterday

A guide on running Z.ai's open model GLM-5.2 locally using Unsloth Dynamic GGUFs. The model features 744B total parameters (40B active) and a 1M context window, with quantized versions reducing memory to 239GB for 2-bit, enabling local inference on 256GB Macs.


Idea for how to run GLM2 at a decent quant, need critique/feedback

Reddit r/LocalLLaMA ↗ · yesterday

A user proposes a hardware setup using four RTX 5060 Ti GPUs and 512 GB of DDR3 server RAM to run GLM2 at a decent quantization and seeks feedback on the idea's viability.


@charles_irl: https://x.com/charles_irl/status/2069113412869914944

X AI KOLs Timeline ↗ · yesterday

Details W4A4 CUDA kernel optimizations for a voice cloning model, which use INT4 quantization and fused LoRA to achieve inference 2.6x faster than FP16.


@BlackRainLabs: Using TurboQuant i was able to push 20 tk/s on qwen 3.6 35b MoE on a GTX1060 3GB. Insane for such a small and old card.…

X AI KOLs Following ↗ · yesterday

Using TurboQuant, the user achieved 20 tokens per second on a Qwen 3.6 35B MoE model running on a GTX1060 3GB, showcasing impressive performance on outdated hardware.


Gemma 4 QAT 31B responds better to KV cache quantization too

Reddit r/LocalLLaMA ↗ · yesterday

The post reports that the Gemma 4 QAT 31B model also responds better to KV cache quantization.


Gemma 4 31B Q6 on Dual 9060 XT

Reddit r/LocalLLaMA ↗ · yesterday

Discusses running a Q6 quantized version of the Gemma 4 31B model on a dual 9060 XT GPU configuration, likely for local inference.


@charles_irl: This block quant visualizer is another page in our LLM Engineer's Almanac -- a one-stop shop for engineers looking to o…

X AI KOLs Following ↗ · yesterday

A new page in the LLM Engineer's Almanac adds a block quant visualizer that helps engineers who want to own their LLM inference understand block quantization formats.


@charles_irl: Low-precision floats are weird. I have been building up my intuition by playing with them outside of inference/training…

X AI KOLs Following ↗ · yesterday

A tweet thread introduces a visualizer for micro-scaling/block quant formats like NVFP4 and MXFP4, explaining how these low-precision floats work and their use in LLM inference to reduce memory bandwidth demands.


Why is AutoRound being slept on so hard?

Reddit r/LocalLLaMA ↗ · 2d ago

A user asks why AutoRound, a quantization tool that offers better accuracy retention at low bit-widths and direct GGUF export, is overlooked even though it outperforms standard AWQ and RTN, especially on complex models like Qwen3.6 27B.


@TheAhmadOsman: Luke Alonso has uploaded an NVFP4 of GLM 5.2 467GB, would fit on 4x DGX Sparks (~$20k)

X AI KOLs Following ↗ · 3d ago

Luke Alonso uploaded an NVFP4 quantized version of GLM 5.2 (467GB) that can fit on 4x DGX Sparks hardware, costing approximately $20k.


You can now convert EXL3 quants on Apple Silicon Mac

Reddit r/LocalLLaMA ↗ · 3d ago

A new tool enables converting and running EXL3 quantized models on Apple Silicon Macs, matching or nearly matching RTX conversion quality, making high-fidelity quants more accessible.


@PierceZhang34: A Machine Learning Systems Notes Repo on GitHub — The author has deeply studied machine learning systems over the past few months, mainly focusing on training and inference of large language models. This notes collection covers distributed computing, parallelization, quantization, and PyTorch internals, with most content derived from the author's experiments. 1. Distributed Technologies - covering distributed training…

X AI KOLs Timeline ↗ · 3d ago

Sharing a machine learning systems notes repo on GitHub, covering distributed computing, parallelization, quantization, and PyTorch internals related to LLM training and inference. Suitable for learners interested in ML systems.
