model-compression

#model-compression

clark-labs/clark-air-sana-1.6b-1.58bit · Hugging Face

Reddit r/LocalLLaMA · 17h ago

Clark Labs released Clark Air Sana 1.6B, a ternary-quantized version of the Sana 1.6B text-to-image transformer that is 8.6× smaller than FP16 while maintaining near-FP16 quality, enabling efficient deployment.

We built a calibration-aware Q4_K_M quant of Qwen3.5 0.8B that recovers 96.5% of the BF16 gap vs pure llama.cpp Q4_K_M (SpectralQuant)

Reddit r/LocalLLaMA · yesterday

A calibration-aware Q4_K_M quantization of Qwen3.5 0.8B built with SpectralQuant reportedly recovers 96.5% of the gap to BF16 relative to the standard llama.cpp Q4_K_M quant.

What Survives When You Compress a Recursive Reasoner for the Edge?

arXiv cs.LG · 2d ago

A systematic study of compressing recursive reasoning models for edge hardware finds that aggressive quantization destroys global reasoning while preserving local prediction. The paper introduces per-channel calibrated INT4 to recover reasoning ability and provides deployment recipes fitting 8 MB SoC and 4 MB MCU targets.

Cascaded Multi-Granularity Pruning for On-Device LLM Inference in Industrial IoT

arXiv cs.CL · 2d ago

This paper presents a cascaded multi-granularity pruning framework for deploying LLMs on Industrial IoT edge devices. It achieves up to 13.8× compression with minimal accuracy loss on MHA+GELU architectures while exposing a collapse on GQA+SwiGLU designs.

CAT-Q: Cost-efficient and Accurate Ternary Quantization for LLMs

arXiv cs.CL · 2d ago

CAT-Q introduces a post-training ternary quantization method for LLMs that uses learnable modulation and softened ternarization, achieving superior performance over BitNet 1.58-bit while using only 512 calibration samples and scaling to 235B parameters.

Why is AutoRound being slept on so hard?

Reddit r/LocalLLaMA · 2026-06-21

A user questions why AutoRound, a quantization tool offering superior accuracy retention at low bits and direct GGUF export, is overlooked despite outperforming standard AWQ and RTN, especially on complex models like Qwen3.6 27B.

Attribution-Guided and Coverage-Maximized Pruning for Structural MoE Compression

arXiv cs.LG · 2026-06-18

Proposes a structural pruning framework for MoE models that maximizes channel-score coverage via attribution-based approximation. It is evaluated at 25% and 50% pruning combined with 4-bit quantization, reducing memory footprint by 5.27× on Qwen3-30B-A3B.

MODE: Modality-Decomposed Expert-Level Mixed-Precision Quantization for MoE Multimodal LLMs

arXiv cs.LG · 2026-06-17

This paper introduces MODE, a modality-decomposed, expert-level mixed-precision quantization framework for MoE multimodal LLMs. It addresses biases in expert importance estimation by decomposing selection frequency by modality and filtering redundant vision tokens, and reports minimal performance loss under aggressive quantization.

Simplex-Constrained Sparse Bagging: Transitioning from Uniform Priors to Sparse Posteriors in Ensemble Learning

arXiv cs.AI · 2026-06-15

Introduces Simplex-Constrained Sparse Bagging (SCSB), a post-training framework that optimizes estimator weights over the probability simplex using out-of-bag samples, achieving up to 96% ensemble compression and improved calibration.

Small LLMs: Pruning vs. Training from Scratch

arXiv cs.LG · 2026-06-15

This paper empirically compares pruning vs. training small language models from scratch, finding that pruning provides a strong advantage under limited token budgets but that the advantage diminishes as training scales, especially with coarse pruning.

@Ex0byt: Open frontier intelligence, in your hands - the MiniMax-M3 PRISM Dynamic-Quant recipe is ready! 428B parameters compres…

X AI KOLs Timeline · 2026-06-14

The MiniMax-M3 PRISM Dynamic-Quant recipe compresses a 428B-parameter model from ~450 GB to 119 GB using per-tensor sensitivity ranking, with plans to prune further to 60–80 GB for local deployment.

Reducing the Complexity of Deep Learning Models for EEG Analysis on Wearable Devices

arXiv cs.AI · 2026-06-12

This paper applies parameter quantization and electrode reduction to lower the computational complexity of deep neural networks for EEG analysis on wearable devices, demonstrating a significant reduction with minimal accuracy loss for epileptic seizure detection.

Squeeze-Release: Iterative Pruning with Exact Structural Minimization

Hugging Face Daily Papers · 2026-06-12

This paper introduces Squeeze-Release, an iterative pruning method that achieves exact structural minimization.

Should I Commit and Publish the Results? [R]

Reddit r/MachineLearning · 2026-06-10

A researcher describes building a 270k-parameter deep learning model that predicts melting points from topological indices, reaching an R² of 0.6399, and asks whether the results are worth publishing.

Sigma-Branch: Hierarchical Single-Path Network Reconstruction for Dynamic Inference with Reduced Active Parameters

arXiv cs.LG · 2026-06-10

Sigma-Branch restructures pretrained dense networks into a hierarchical binary tree with a shared backbone, routers, and specialized leaves, reducing per-inference active parameters by 58–60% while staying within 1.72 pp of baseline accuracy on CIFAR-100, ImageNet-1K, and ModelNet40.

SHAPE: Coalition-Aware Expert Pruning for Sparse Mixture-of-Experts LLMs

arXiv cs.LG · 2026-06-10

SHAPE proposes a coalition-aware expert pruning framework for sparse MoE LLMs that uses Shapley-style attribution over routing traces to identify essential experts, achieving competitive accuracy under 20–40% pruning and reducing GPU memory footprint.

Structured Neuron Pruning in Deep Neural Networks Using Multi-Armed Bandits

arXiv cs.LG · 2026-06-09

This paper proposes a novel structured neuron pruning framework for deep neural networks using multi-armed bandit algorithms, demonstrating effectiveness on various tasks.

@_philschmid: More Gemma 4! New QAT Gemma 4 checkpoints with similar performance while using ~4x less memory! It comes with a new mob…

X AI KOLs Following · 2026-06-08

New QAT Gemma 4 checkpoints offer similar performance with ~4× less memory, enabling a 1 GB memory footprint for Gemma 4 E2B via a new mobile quantization format.

SigmaScale: LLM Compression with SVD-based Low-Rank Decomposition and Learned Scaling Matrices

arXiv cs.CL · 2026-06-08

Introduces SigmaScale, a method that learns auxiliary scaling matrices for SVD-based LLM compression, showing competitive performance on Llama 3.1 8B and Qwen3-8B benchmarks.

FAIR-Calib: Frontier-Aware Instability-Reweighted Calibration for Post-Training Quantization of Diffusion Large Language Models

arXiv cs.LG · 2026-06-08

This paper proposes FAIR-Calib, a two-stage post-training quantization framework for diffusion large language models that addresses the instability of token commitments during iterative refinement. It achieves state-of-the-art results on LLaDA and Dream models under low-bit quantization.
