Tag
Clark Labs released Clark Air Sana 1.6B, a ternary-quantized version of the Sana 1.6B text-to-image transformer that is 8.6× smaller than FP16 while maintaining near-FP16 quality, enabling efficient deployment.
A calibration-aware Q4_K_M quantization of Qwen3.5 0.8B using SpectralQuant closes 96.5% of the performance gap between the standard llama.cpp Q4_K_M quant and BF16.
A systematic study of compressing recursive reasoning models for edge hardware finds that aggressive quantization destroys global reasoning while preserving local prediction. The paper introduces per-channel calibrated INT4 to recover reasoning ability and provides deployment recipes fitting 8 MB SoC and 4 MB MCU targets.
This paper presents a cascaded multi-granularity pruning framework for deploying LLMs on Industrial IoT edge devices, achieving up to 13.8× compression with minimal accuracy loss on MHA+GELU architectures while exposing a collapse on GQA+SwiGLU designs.
CAT-Q introduces a post-training ternary quantization method for LLMs that uses learnable modulation and softened ternarization, achieving superior performance over BitNet 1.58-bit while using only 512 calibration samples and scaling to 235B parameters.
A user questions why AutoRound, a quantization tool offering superior accuracy retention at low bits and direct GGUF export, is overlooked despite outperforming standard AWQ and RTN, especially on complex models like Qwen3.6 27B.
Proposes a structural pruning framework for MoE models that maximizes channel-score coverage via attribution-based approximation, supporting 25% or 50% pruning combined with 4-bit quantization and reducing memory footprint by 5.27× on Qwen3-30B-A3B.
This paper introduces MODE, a modality-decomposed expert-level mixed-precision quantization framework for MoE multimodal LLMs that addresses biases in expert importance estimation by decomposing selection frequency by modality and filtering redundant vision tokens, achieving minimal performance loss under aggressive quantization.
Introduces Simplex-Constrained Sparse Bagging (SCSB), a post-training framework that optimizes estimator weights over the probability simplex using out-of-bag samples, achieving up to 96% ensemble compression and improved calibration.
This paper empirically compares pruning vs. training small language models from scratch, finding that pruning provides a strong advantage under limited token budgets but that the advantage diminishes as training scales, especially with coarse pruning.
The MiniMax-M3 PRISM Dynamic-Quant recipe compresses a 428B-parameter model from ~450 GB to 119 GB using per-tensor sensitivity ranking, with plans to prune further to 60–80 GB for local deployment.
This paper investigates reducing the computational complexity of deep neural networks for EEG analysis on wearable devices by applying parameter quantization and electrode reduction techniques, demonstrating significant complexity reduction with minimal accuracy loss for epileptic seizure detection.
This paper introduces Squeeze-Release, an iterative pruning method that achieves exact structural minimization.
A researcher describes building a 270k-parameter deep learning model to predict melting points from topological indices, achieving an R² of 0.6399, and asks whether the results are worth publishing.
Sigma-Branch restructures pretrained dense networks into a hierarchical binary tree with a shared backbone, routers, and specialized leaves, reducing per-inference active parameters by 58–60% while staying within 1.72 pp of baseline accuracy on CIFAR-100, ImageNet-1K, and ModelNet40.
SHAPE proposes a coalition-aware expert pruning framework for sparse MoE LLMs that uses Shapley-style attribution over routing traces to identify essential experts, achieving competitive accuracy under 20-40% pruning and reducing GPU memory footprint.
This paper proposes a novel structured neuron pruning framework for deep neural networks using multi-armed bandit algorithms, demonstrating effectiveness on various tasks.
New QAT Gemma 4 checkpoints offer similar performance with ~4x less memory, enabling a 1GB memory footprint for Gemma 4 E2B via a new mobile quantization format.
Introduces SigmaScale, a method that learns auxiliary scaling matrices for SVD-based LLM compression, showing competitive performance on Llama 3.1 8B and Qwen3-8B benchmarks.
This paper proposes FAIR-Calib, a two-stage post-training quantization framework for diffusion large language models that addresses the instability of token commitments during iterative refinement. It achieves state-of-the-art results on LLaDA and Dream models under low-bit quantization.
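Several entries above mention per-channel INT4 quantization. As a rough illustration of why per-channel scales can help, here is a minimal pure-Python sketch. It is not the method from any of the listed papers: it uses simple absmax scales rather than calibration data, and the weight values and helper names (`quantize_int4`, `absmax_scale`, `mse`) are made up for the example.

```python
def absmax_scale(values):
    """Scale that maps the largest magnitude onto the INT4 level 7."""
    return max(abs(v) for v in values) / 7.0


def quantize_int4(values, scale):
    """Symmetric INT4 round trip: round to integers in [-8, 7], then dequantize."""
    if scale == 0:
        return list(values)
    return [max(-8, min(7, round(v / scale))) * scale for v in values]


def mse(original, restored):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(original, restored)) / len(original)


# Hypothetical weight matrix: each row is one output channel.
# The two channels have very different dynamic ranges.
weights = [
    [0.01, -0.02, 0.015, -0.005],
    [1.0, -0.8, 0.6, -0.9],
]

# Per-tensor: one scale shared by every channel.
flat = [w for row in weights for w in row]
tensor_scale = absmax_scale(flat)
per_tensor = [quantize_int4(row, tensor_scale) for row in weights]

# Per-channel: each row gets its own scale.
per_channel = [quantize_int4(row, absmax_scale(row)) for row in weights]

per_tensor_err = [mse(r, q) for r, q in zip(weights, per_tensor)]
per_channel_err = [mse(r, q) for r, q in zip(weights, per_channel)]

for i in range(len(weights)):
    print(f"channel {i}: per-tensor MSE={per_tensor_err[i]:.3e}, "
          f"per-channel MSE={per_channel_err[i]:.3e}")
```

With a shared scale, the small-magnitude channel rounds entirely to zero, while a per-channel scale keeps its values distinguishable. The large-magnitude channel sets the shared scale, so its error is the same in both cases.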