quantization

#quantization

7900XTX 24GB vram, can finally fit Q6K+MTP with Qwen 3.6 27B at 131k context

Reddit r/LocalLLaMA · 3d ago

A guide to fitting Qwen 3.6 27B at Q6K quantization with MTP and a 131k context into the 24GB of VRAM on an AMD 7900XTX: compile llama.cpp with OpenBLAS and GGML_CUDA_FA_ALL_QUANTS, and quantize the KV cache to q5_0/q4_0.


@0x0SojalSec: Imagine fine-tuning a 31B parameter multimodal model for free, on Kaggle. Now you can train this massive 31B dense mul…

X AI KOLs Timeline · 3d ago

Unsloth enables free fine-tuning of a 31B parameter multimodal model on Kaggle using 4-bit quantization, requiring only 22-24GB VRAM for local runs.


@AlexFinn: I can't believe this is real I have GLM 5.2 running 100% locally on my Mac Studio. 2 bit quant. The results I'm getting…

X AI KOLs Following · 4d ago

A user reports running GLM 5.2 locally on a Mac Studio with 2-bit quantization, claiming it outperforms Opus 4.8 and enables free, private superintelligence for coding and agent tasks.


NVFP4 kv cache quantization on sm120 will make 32GB VRAM systems very capable

Reddit r/LocalLLaMA · 5d ago

NVFP4 KV cache quantization on sm120 significantly improves memory efficiency for large language models, enabling 32GB VRAM systems to achieve ~60 tok/sec inference at 196k context size with Qwen3.6-27B.


@UnslothAI: GLM-5.2 can now be run locally! The 2-bit model retains ~82% accuracy after we shrunk it from 1.51TB to 238GB (-84% siz…

X AI KOLs Timeline · 5d ago

UnslothAI announces that GLM-5.2, Z.ai's strongest open model with 744B parameters, can now be run locally via dynamic GGUF quantization, which shrinks it by ~84% (from 1.51TB to 238GB) while retaining ~82% accuracy. It fits on 256GB Macs and supports long-context, reasoning, and agentic tasks.


Towards Scalable Customization and Deployment of Multi-Agent Systems for Enterprise Applications

arXiv cs.CL · 5d ago

This paper proposes a unified framework for customizing and deploying LLM-based multi-agent systems in enterprise settings, combining model customization through continual pretraining, fine-tuning, and preference optimization with inference optimization using speculative decoding and FP8 quantization. It achieves 4.48x throughput speedup while maintaining performance on enterprise workloads.


Attribution-Guided and Coverage-Maximized Pruning for Structural MoE Compression

arXiv cs.LG · 5d ago

Proposes a structural pruning framework for MoE models that maximizes channel-score coverage via attribution-based approximation, achieving 50% or 25% pruning with 4-bit quantization and reducing memory footprint by 5.27x on Qwen3-30B-A3B.


@dealignai: MiniMax m3, made for 128gb Mac’s Thank you to @hornsby_andrew for preparing the pruning calibration dataset and doing e…

X AI KOLs Timeline · 5d ago

A pruned and quantized version of MiniMax-M3 (MiniMax-M3-Medium-JANG_2L) optimized to run on 128GB Macs using vMLX, featuring 32% expert pruning and JANG_2L mixed-precision quantization to fit within ~105 GB.


Rethinking Shrinkage Bias in LLM FP4 Pretraining: Geometric Origin, Systemic Impact, and UFP4 Recipe

Hugging Face Daily Papers · 5d ago

This paper identifies a fundamental limitation (shrinkage bias) in non-uniform FP4 quantization formats for LLM pretraining and proposes UFP4, a uniform 4-bit training recipe that outperforms existing E2M1-based methods.


@aisearchio: GLM 5.2 GGUF is already here! 8-bit is ~half the size of the full model. Smaller versions coming soon https://huggingfa…

X AI KOLs Timeline · 6d ago

GGUF quantizations of GLM 5.2 have been released; the 8-bit version is roughly half the size of the full model, and smaller versions are coming soon.


@Sentdex: SITUATION DETECTED: Unsloth quants for GLM 5.2 are landing.

X AI KOLs Following · 6d ago

Unsloth quantizations for the GLM 5.2 model are being released.


@ItsmeAjayKV: Update on 3090: Now with Qwen 3.6-35b-a3b moe (q6_k_xl). Crossed 90 t/s for the very first time, no MTP yet, prefill sp…

X AI KOLs Timeline · 6d ago

A user reports crossing 90 tokens per second for the first time with the Qwen 3.6-35b-a3b MoE model (q6_k_xl) on an RTX 3090 using llama.cpp, without MTP, and with prefill speeds exceeding 1000 t/s, indicating that local deployment of large language models on consumer hardware is practical.


GLM-5.2 is a win for local AI

Reddit r/LocalLLaMA · 6d ago

GLM-5.2, a 753B parameter open-source model with an MIT license, offers frontier-level coding capabilities and a massive context window. Its distillation potential promises significant improvements for local AI setups.


GLM 5.2 on 4x Sparks reasonable?

Reddit r/LocalLLaMA · 6d ago

A user asks whether running GLM-5.2 at 4-bit quantization on four Ascent GX10s or DGX Sparks is feasible, and what speed and memory headroom to expect at 100k context.


Reconfigurable Computing Challenge: Transformer for Jet Tagging on Versal AI Engines

arXiv cs.LG · 6d ago

This paper presents a quantized, integer-only transformer implementation for jet tagging on AMD Versal AI Engines, including a reusable open-source framework that maps transformer layers to AIE tiles for low-latency trigger systems at the CERN LHC.


MODE: Modality-Decomposed Expert-Level Mixed-Precision Quantization for MoE Multimodal LLMs

arXiv cs.LG · 6d ago

This paper introduces MODE, a modality-decomposed expert-level mixed-precision quantization framework for MoE multimodal LLMs that addresses biases in expert importance estimation by decomposing selection frequency by modality and filtering redundant vision tokens, achieving minimal performance loss under aggressive quantization.


Cheapest way to run GLM 5.x locally that's not a unified memory system?

Reddit r/LocalLLaMA · 6d ago

A discussion of the cheapest local hardware setups for running GLM 5.x and similarly sized models at 4-bit quantization, including CPU-only and multi-GPU options, with a user sharing their experience running MiniMax 2.7 and Qwen 3.6 on a 5900X + 128GB DDR4 + 7900XT setup.


@analogalok: my 8 GB VRAM gaming laptop is absolutely going to hate me for this. but I still did it. ran a 31b dense model (Gemma 4 …

X AI KOLs Timeline · 6d ago

A user runs the Gemma 4 31B dense model on an 8GB VRAM gaming laptop at ~3 tokens/sec using llama.cpp with MTP speculative decoding, demonstrating that a 31B dense model can run on consumer hardware, and proposes agentic workflows in which a fast MoE model routes hard tasks to this slower dense model.


@pcuenq: GLM 5.2 has just been released Here it's already running with MLX on two Mac Studios (M3 Ultra). This is comparable to …

X AI KOLs Timeline · 2026-06-16

GLM 5.2, an open-weight AI model comparable to top closed models, has been released and is already running with MLX on two Mac Studios (M3 Ultra).


bartowski/command-a-plus-05-2026-GGUF · Hugging Face

Reddit r/LocalLLaMA · 2026-06-16

GGUF quantized versions of Cohere's command-a-plus-05-2026 model, optimized for llama.cpp and available in various quantization levels for local inference.
