@no_stp_on_snek: Config-I quant of MiniMax-M3 is up on MLX. 2-bit experts, 4-bit attention, 8-bit boundaries + embeddings, f16 router. ~…
Summary
Announces the release of a Config-I quantization of MiniMax-M3 on MLX, using 2-bit experts and 4-bit attention to reduce the 427B MoE model from 869GB to ~167GB, though the quant is untested and requires a patch for mlx_lm.
Cached at: 06/16/26, 07:39 PM
Config-I quant of MiniMax-M3 is up on MLX. 2-bit experts, 4-bit attention, 8-bit boundaries + embeddings, f16 router. ~167GB.
Have I run it? No. Can I run it? Also no. My machine takes one look at this thing and files for divorce.
Shipped blind. If you’ve got the RAM, load it and tell me if it’s a genius or brain damage.
thetom-ai/MiniMax-M3-ConfigI-MLX · Hugging Face
Source: https://huggingface.co/thetom-ai/MiniMax-M3-ConfigI-MLX
MiniMax-M3, TurboQuant+ Config-I (MLX)
⚠️ UNTESTED MODEL, USE AT YOUR OWN RISK
**I did not have enough disk/RAM to host or run this model, so it has NOT been validated.** No perplexity, MMLU, needle-in-a-haystack, or generation testing was performed on this M3 quant. The size and bits-per-weight figures below are the measured output of the conversion; **everything about output quality is unverified.** It may produce broken or degraded output. The Config-I policy itself is proven on other MoE models (see MiniMax-M2.7-ConfigI-MLX, 93.5% MMLU), and M3 uses the same policy, but M3 is a different, larger architecture (`minimax_m3_vl`, ~427B) that has not been independently confirmed to survive 2-bit expert compression. **Validate before relying on it.** If you run it, please report results.
🔧 PATCH REQUIRED, M3 is not in stock mlx_lm yet
MiniMax-M3 (`minimax_m3_vl`) has no model class in released `mlx_lm`. Support is in flight upstream; this quant was made against ml-explore/mlx-lm#1398 (see also #1401). Until one of those merges, you need that model class present. Two ways:
- **Bundled here:** `minimax_m3_vl.py` ships in this repo; drop it into your `mlx_lm/models/` directory.
- **From the PR:** check out the PR branch, or `pip install "git+https://github.com/ml-explore/mlx-lm.git@refs/pull/1398/head"`.
Once #1398/#1401 lands in a release, stock `mlx_lm` will load it and no patch is needed.
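Before pulling ~167 GB of weights, it may be worth checking whether the installed `mlx_lm` already carries the model class. The following is a minimal sketch, assuming the class lives in a module named `mlx_lm.models.minimax_m3_vl` (inferred from the bundled file name and drop-in directory above); `has_module` and `models_dir` are hypothetical helper names, not part of `mlx_lm`.

```python
import importlib.util
from pathlib import Path


def has_module(name: str) -> bool:
    """Return True if `name` can be located, without importing the leaf module."""
    try:
        return importlib.util.find_spec(name) is not None
    except ImportError:
        # Raised when a parent package (e.g. mlx_lm itself) is not installed.
        return False


def models_dir(package: str = "mlx_lm"):
    """Return the package's models/ directory, or None if it is not installed."""
    if not has_module(package):
        return None
    spec = importlib.util.find_spec(package)
    if spec is None or spec.origin is None:
        return None
    return Path(spec.origin).parent / "models"


if __name__ == "__main__":
    # Assumed module path; adjust if the merged PR names it differently.
    if has_module("mlx_lm.models.minimax_m3_vl"):
        print("minimax_m3_vl is available; no patch needed")
    else:
        print("minimax_m3_vl missing; copy minimax_m3_vl.py into:", models_dir())
```

If the check reports the module as missing, the printed directory is where the bundled file would go under the first option above.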
Config-I quantization of MiniMaxAI/MiniMax-M3 (~427B total MoE, 60 layers, 128 experts/layer top-4 + 1 shared expert). The MoE/attention weights are Config-I quantized; the vision tower and MiniMax Sparse Attention (MSA) indexer weights are retained at bf16 so a future VL/MSA-capable MLX can use them (current `mlx_lm` ignores them and runs the model text-only with dense attention). The policy applies aggressive 2-bit compression to expert MLPs (where MoE is most tolerant), protects attention at 4-bit, and shields boundary layers, routing, and embeddings at higher precision. See the Config-I paper for the policy derivation.
Compression
| | Size |
|---|---|
| bf16 source | ~869 GB |
| MXFP8 source (used for this conversion) | ~444 GB |
| **Config-I (quantized weights 3.097 bpw) + bf16 vision/MSA** | **~167 GB** |
| Reduction vs bf16 | ~81% |
Includes the bf16 vision tower + MSA indexer (+2.2 GB) retained for forward-compatibility.
Converted from the official MXFP8 checkpoint (FP8 weights dequantized at load). The sensitive layers (router gates, embeddings, lm_head) are full-precision in the MXFP8 source, so Config-I’s FP8→low-bit step only touches the expert/attention weights it crushes anyway.
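The headline figures can be cross-checked with back-of-the-envelope arithmetic. This sketch assumes decimal gigabytes (1 GB = 10^9 bytes) and treats all ~427B parameters as quantized at the reported 3.097 bpw average; both are simplifications, and the variable names are illustrative only.

```python
PARAMS = 427e9          # approximate total parameter count from the card
AVG_BPW = 3.097         # reported average bits per quantized weight
BF16_EXTRA_GB = 2.2     # bf16 vision tower + MSA indexer kept alongside
BF16_SOURCE_GB = 869    # reported size of the bf16 source
REPORTED_GB = 167       # reported size of this quant

# bits -> bytes -> decimal GB (assumed unit)
quantized_gb = PARAMS * AVG_BPW / 8 / 1e9
estimated_total_gb = quantized_gb + BF16_EXTRA_GB

# Fractional size reduction relative to the bf16 source.
reduction = 1 - REPORTED_GB / BF16_SOURCE_GB

print(f"estimated size: {estimated_total_gb:.1f} GB")
print(f"reduction vs bf16: {reduction:.1%}")
```

Under these assumptions the estimate lands within about a gigabyte of the reported ~167 GB, and the reduction rounds to the ~81% quoted above.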
Quality
**NOT MEASURED.** See the warning at the top. The tables of MMLU / PPL / NIAH / throughput that accompany the validated M2.7 release are deliberately absent here because no such measurements exist for this M3 quant.
Config-I Policy (MiniMax-M3 adaptation)
| Component | Bits | Layers | Rationale |
|---|---|---|---|
| Expert MLP gate/up (w1/w3) | 2-bit | middle 56 | bulk of params, MoE-tolerant |
| Expert MLP down (w2) | 3-bit | middle 56 | write-back sensitivity (Config-I finding) |
| Attention Q/K/V/O | 4-bit | middle 56 | uniform per layer |
| Boundary (all tensors) | 8-bit | first 2 + last 2 | boundary-layer protection |
| MoE router | f16 | all | routing precision critical |
| Embeddings + lm_head | 8-bit | - | protected |
Uniform MLX quantization produces broken output on MiniMax-class MoE because it compresses attention and routing to the same bits as expert MLPs. Config-I protects the components that control coherence while compressing the ~97% of parameters (expert MLPs) that tolerate it.
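The table reads naturally as a lookup from (tensor role, layer index) to bit width. The sketch below is one way to encode it. It is not the author's `convert_m3.py`; the role labels, the function name `config_i_bits`, and the rule precedence (router first, then boundary layers) are assumptions made for illustration.

```python
from typing import Optional

N_LAYERS = 60      # from the model description
BOUNDARY = 2       # first 2 + last 2 layers are protected

# Bit widths for the middle 56 layers, keyed by a hypothetical role label.
MIDDLE_BITS = {
    "expert_gate_up": 2,   # w1/w3
    "expert_down": 3,      # w2
    "attention": 4,        # Q/K/V/O
}


def config_i_bits(role: str, layer: Optional[int] = None) -> int:
    """Return the bit width the policy table assigns to a tensor."""
    if role == "router":
        return 16                      # f16 in every layer
    if role in ("embeddings", "lm_head"):
        return 8                       # not tied to a layer index
    if layer is None or not 0 <= layer < N_LAYERS:
        raise ValueError(f"layer index in [0, {N_LAYERS}) required for {role!r}")
    if layer < BOUNDARY or layer >= N_LAYERS - BOUNDARY:
        return 8                       # boundary layers: all tensors at 8-bit
    return MIDDLE_BITS[role]


if __name__ == "__main__":
    for layer in (0, 1, 2, 30, 57, 58):
        print(layer, config_i_bits("expert_gate_up", layer))
```

Keeping the policy as a small pure function makes it easy to audit against the table before any weights are touched.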
Compatibility
| Field | Value |
|---|---|
| Format | MLX safetensors (standard) |
| Avg bits | 3.097 bpw (quantized weights; vision + MSA-index kept bf16) |
| Runtime | `mlx_lm` (Python), `mlx-swift-lm` (Swift) |
| Model type | `minimax_m3_vl` (text backbone) |
| Platform | Apple Silicon, needs ~200 GB unified memory (M3 Ultra 256 GB / M-series with 192 GB+) |
| Quantized on | 2026-06-14 |
Standard MLX per-layer quantization, but M3 support is new and needs the patch above (see “🔧 Patch required”): the `minimax_m3_vl` model class isn’t in released `mlx_lm` yet. Use the bundled `minimax_m3_vl.py` (drop into `mlx_lm/models/`) or the in-flight PR #1398.
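A quick preflight check can save a long download on a machine that cannot hold the weights. This is a sketch assuming a POSIX system where `os.sysconf` exposes `SC_PHYS_PAGES` and `SC_PAGE_SIZE`; the 200 GB threshold is the card's rough figure, and the helper names are hypothetical.

```python
import os
from typing import Optional

REQUIRED_GB = 200  # rough unified-memory requirement quoted in the table


def physical_memory_bytes() -> Optional[int]:
    """Return installed RAM in bytes, or None if the platform does not report it."""
    try:
        return os.sysconf("SC_PHYS_PAGES") * os.sysconf("SC_PAGE_SIZE")
    except (AttributeError, ValueError, OSError):
        return None


def has_headroom(total_bytes: int, required_gb: float = REQUIRED_GB) -> bool:
    """True if total memory meets the requirement (1 GB taken as 10**9 bytes)."""
    return total_bytes >= required_gb * 10**9


if __name__ == "__main__":
    total = physical_memory_bytes()
    if total is None:
        print("could not determine physical memory")
    else:
        verdict = "should fit" if has_headroom(total) else "likely too small"
        print(f"{total / 10**9:.0f} GB installed: {verdict}")
```

The check only compares installed memory against the quoted figure; actual headroom also depends on context length and whatever else is running.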
How to Run
Python (mlx_lm)
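# Needs minimax_m3_vl support: use the bundled minimax_m3_vl.py or PR #1398
# (see "🔧 Patch required" above). Then, from the shell:
python -m mlx_lm.generate --model thetom-ai/MiniMax-M3-ConfigI-MLX --prompt "Hello"
# Or from Python. Recent mlx_lm releases take sampling settings through a
# sampler object rather than temp/top_p keyword arguments:
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler
model, tokenizer = load("thetom-ai/MiniMax-M3-ConfigI-MLX")
sampler = make_sampler(temp=1.0, top_p=0.95)
print(generate(model, tokenizer, prompt="Hello", max_tokens=256, sampler=sampler))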
**Note:** MiniMax models are always-reasoning; use temperature=1.0. Greedy decoding (temp=0) can cause infinite thinking loops.
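To make the difference between greedy and sampled decoding concrete, here is a toy sampler over a plain list of logits. It is an illustrative pure-Python sketch of temperature plus top-p (nucleus) sampling, not `mlx_lm`'s implementation, and `sample_token` is a hypothetical name.

```python
import math
import random


def sample_token(logits, temp=1.0, top_p=0.95, rng=random):
    """Pick a token index from raw logits using temperature and top-p."""
    if temp == 0:
        # Greedy decoding: always the single most likely token.
        return max(range(len(logits)), key=lambda i: logits[i])
    # Softmax over temperature-scaled logits (shifted for numerical stability).
    scaled = [x / temp for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Keep the smallest set of top tokens whose cumulative mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # Sample among the kept tokens in proportion to their probability.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]
```

With `temp=0` the same input always yields the same token, so a model that has started repeating itself has no way to break out; sampling at temperature 1.0 leaves room for a different continuation.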
Limitations (current loader)
With today’s `minimax_m3_vl` loader (PR #1398), this runs as a text-only, dense-attention model:
- **No image input.** The vision tower weights ship in the repo but the loader doesn’t wire up VL inference yet; they are dead weight until MLX adds M3-VL support, at which point no re-quantization is needed.
- **Dense attention, not MSA.** MiniMax Sparse Attention is run as full causal attention, which is numerically exact (equal-or-better quality), but long context is slower and more KV-hungry than native M3. The MSA indexer weights are retained (bf16) for a future MSA-capable loader.
Both are intentional: the weights are kept so the artifact is forward-compatible without re-quantizing from source.
What is Config-I?
Config-I is a tensor-role-aware weight compression policy from TurboQuant+. Through systematic A/B isolation it was found that attention tensors, FFN read projections (gate/up), FFN write-back projections (down), and boundary layers have dramatically different compression sensitivity. The key insight is that compression policy matters more than compression math: which tensors to compress, which to protect, and how aggressively. For MoE models, expert MLPs dominate parameter count but tolerate aggressive compression because only a few of the 128 experts are active per token; Config-I compresses them to 2–3 bit while protecting attention and routing.
This quant was produced from the MXFP8 checkpoint with `convert_m3.py`. It is shared as-is, untested, for others with the hardware to evaluate it.