@no_stp_on_snek: Config-I quant of MiniMax-M3 is up on MLX. 2-bit experts, 4-bit attention, 8-bit boundaries + embeddings, f16 router. ~…


Summary

Announces the release of a Config-I quantization of MiniMax-M3 on MLX, using 2-bit experts and 4-bit attention to reduce the 427B MoE model from 869GB to ~167GB, though the quant is untested and requires a patch for mlx_lm.

Config-I quant of MiniMax-M3 is up on MLX. 2-bit experts, 4-bit attention, 8-bit boundaries + embeddings, f16 router. ~167GB. Have I run it? No. Can I run it? Also no. My machine takes one look at this thing and files for divorce. Shipped blind. If you've got the RAM, load it and tell me if it's a genius or brain damage.

Cached at: 06/16/26, 07:39 PM


thetom-ai/MiniMax-M3-ConfigI-MLX · Hugging Face

Source: https://huggingface.co/thetom-ai/MiniMax-M3-ConfigI-MLX

MiniMax-M3, TurboQuant+ Config-I (MLX)

⚠️ UNTESTED MODEL, USE AT YOUR OWN RISK

**I did not have enough disk/RAM to host or run this model, so it has NOT been validated.** No perplexity, MMLU, needle-in-a-haystack, or generation testing was performed on this M3 quant. The size and bits-per-weight figures below are the measured output of the conversion; **everything about output quality is unverified.** It may produce broken or degraded output. The Config-I policy itself is proven on other MoE models (see MiniMax-M2.7-ConfigI-MLX, 93.5% MMLU), and M3 uses the same policy, but M3 is a different, larger architecture (minimax_m3_vl, ~427B) that has not been independently confirmed to survive 2-bit expert compression. **Validate before relying on it.** If you run it, please report results.

🔧 PATCH REQUIRED, M3 is not in stock mlx_lm yet

MiniMax-M3 (minimax_m3_vl) has no model class in released mlx_lm. Support is in flight upstream; this quant was made against ml-explore/mlx-lm#1398 (see also #1401). Until one of those merges, you need that model class present. There are two ways to get it:

- **Bundled here:** minimax_m3_vl.py ships in this repo; drop it into your mlx_lm/models/ directory.
- **From the PR:** check out the PR branch, or pip install "git+https://github.com/ml-explore/mlx-lm.git@refs/pull/1398/head".

Once #1398/#1401 lands in a release, stock mlx_lm will load it and no patch is needed.
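Before attempting a load, it can help to confirm that the patched model class is actually importable. The following is a minimal sketch, not part of the model card: `module_available` is a hypothetical helper, and the module path `mlx_lm.models.minimax_m3_vl` is an assumption inferred from the bundled file name and the drop-in directory described above.

```python
import importlib.util


def module_available(dotted_name: str) -> bool:
    """Return True if the module can be located.

    Parent packages are imported as a side effect; the module itself is not.
    """
    try:
        return importlib.util.find_spec(dotted_name) is not None
    except ImportError:
        # A parent package (for example mlx_lm itself) is missing or broken.
        return False


if __name__ == "__main__":
    # Assumed module path, based on the bundled file name minimax_m3_vl.py.
    name = "mlx_lm.models.minimax_m3_vl"
    if module_available(name):
        print(f"{name} found: the M3 model class is present")
    else:
        print(f"{name} missing: use the bundled file or install from PR #1398")
```

If the check fails after dropping the file in, the most likely cause is that the file went into a different Python environment than the one running mlx_lm.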

Config-I quantization of MiniMaxAI/MiniMax-M3 (~427B total MoE, 60 layers, 128 experts/layer top-4 + 1 shared expert). The MoE/attention weights are Config-I quantized; the vision tower and MiniMax Sparse Attention (MSA) indexer weights are retained at bf16 so a future VL/MSA-capable MLX can use them (current mlx_lm ignores them and runs the model text-only with dense attention). The policy applies aggressive 2-bit compression to expert MLPs (where MoE is most tolerant), protects attention at 4-bit, and shields boundary layers, routing, and embeddings at higher precision. See the Config-I paper for the policy derivation.

Compression

| | Size |
|---|---|
| bf16 source | ~869 GB |
| MXFP8 source (used for this conversion) | ~444 GB |
| **Config-I (quantized weights 3.097 bpw) + bf16 vision/MSA** | **~167 GB** |
| Reduction vs bf16 | ~81% |

Includes the bf16 vision tower + MSA indexer (+2.2 GB) retained for forward-compatibility.
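The figures in this table can be sanity-checked with simple arithmetic. The sketch below assumes the ~427B parameter count applies to the quantized weights, that sizes are in decimal gigabytes, and that the 2.2 GB of bf16 vision/MSA weights sit on top of the quantized total; the function names are illustrative, not from the conversion script.

```python
GB = 1e9  # decimal gigabytes, assumed to match the units in the table


def quantized_size_gb(n_params: float, bits_per_weight: float,
                      extra_gb: float = 0.0) -> float:
    """Size of n_params weights at an average bits-per-weight, plus extras."""
    return n_params * bits_per_weight / 8 / GB + extra_gb


def reduction(before_gb: float, after_gb: float) -> float:
    """Fractional size reduction going from before_gb to after_gb."""
    return 1.0 - after_gb / before_gb


if __name__ == "__main__":
    # ~427B params at 3.097 bpw, plus 2.2 GB of bf16 vision/MSA weights.
    est = quantized_size_gb(427e9, 3.097, extra_gb=2.2)
    print(f"estimated size: {est:.1f} GB")
    print(f"reduction vs bf16: {reduction(869, 167):.1%}")
```

Under these assumptions the estimate comes to about 167.5 GB and the reduction to about 80.8%, which agrees with the rounded ~167 GB and ~81% reported above.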

Converted from the official MXFP8 checkpoint (FP8 weights dequantized at load). The sensitive layers (router gates, embeddings, lm_head) are full-precision in the MXFP8 source, so Config-I’s FP8→low-bit step only touches the expert/attention weights it crushes anyway.

Quality

**NOT MEASURED.** See the warning at the top. The tables of MMLU / PPL / NIAH / throughput that accompany the validated M2.7 release are deliberately absent here because no such measurements exist for this M3 quant.

Config-I Policy (MiniMax-M3 adaptation)

| Component | Bits | Layers | Rationale |
|---|---|---|---|
| Expert MLP gate/up (w1/w3) | 2-bit | middle 56 | bulk of params, MoE-tolerant |
| Expert MLP down (w2) | 3-bit | middle 56 | write-back sensitivity (Config-I finding) |
| Attention Q/K/V/O | 4-bit | middle 56 | uniform per layer |
| Boundary (all tensors) | 8-bit | first 2 + last 2 | boundary-layer protection |
| MoE router | f16 | all | routing precision critical |
| Embeddings + lm_head | 8-bit | n/a | protected |

Uniform MLX quantization produces broken output on MiniMax-class MoE because it compresses attention and routing to the same bits as expert MLPs. Config-I protects the components that control coherence while compressing the ~97% of parameters (expert MLPs) that tolerate it.
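The policy table can be read as a lookup from tensor role and layer index to a bit width. The sketch below illustrates that reading only; it is not the author's conversion code. The role names are hypothetical labels, and giving the router precedence over the boundary rule is an assumption based on the table listing it as f16 for all layers.

```python
from typing import Optional

NUM_LAYERS = 60  # from the model description above
BOUNDARY = 2     # first 2 and last 2 layers are protected

# Bit widths for the middle layers, keyed by hypothetical role names.
MIDDLE_BITS = {"expert_gate_up": 2, "expert_down": 3, "attention": 4}


def is_boundary(layer_idx: int) -> bool:
    """True for the first BOUNDARY and last BOUNDARY layers."""
    return layer_idx < BOUNDARY or layer_idx >= NUM_LAYERS - BOUNDARY


def config_i_bits(role: str, layer_idx: Optional[int] = None) -> int:
    """Return the bit width for a tensor role (16 stands for f16)."""
    if role == "router":
        return 16  # assumed to stay f16 even inside boundary layers
    if role in ("embedding", "lm_head"):
        return 8   # not tied to a layer index
    if layer_idx is None:
        raise ValueError(f"role {role!r} needs a layer index")
    if is_boundary(layer_idx):
        return 8   # boundary layers: all remaining tensors at 8-bit
    return MIDDLE_BITS[role]


if __name__ == "__main__":
    for idx in (0, 1, 2, 30, 57, 58, 59):
        print(idx, config_i_bits("expert_gate_up", idx),
              config_i_bits("expert_down", idx),
              config_i_bits("attention", idx))
```

With 60 layers and two protected at each end, 56 layers fall under the middle rules, matching the "middle 56" entries in the table.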

Compatibility

| Field | Value |
|---|---|
| Format | MLX safetensors (standard) |
| Avg bits | 3.097 bpw (quantized weights; vision + MSA-index kept bf16) |
| Runtime | mlx_lm (Python), mlx-swift-lm (Swift) |
| Model type | minimax_m3_vl (text backbone) |
| Platform | Apple Silicon, needs ~200 GB unified memory (M3 Ultra 256 GB / M-series with 192 GB+) |
| Quantized on | 2026-06-14 |

Standard MLX per-layer quantization, but M3 support is new and needs the patch above (see “🔧 Patch required”): the minimax_m3_vl model class isn’t in released mlx_lm yet. Use the bundled minimax_m3_vl.py (drop into mlx_lm/models/) or the in-flight PR #1398.
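Given the ~200 GB memory requirement, a quick pre-flight check can save a long download. This is a rough sketch under stated assumptions: it reads physical memory through POSIX `sysconf` (assumed to be available on macOS as it is on Linux), treats the ~200 GB figure as decimal gigabytes, and does not account for memory used by other processes. Both function names are illustrative.

```python
import os


def total_memory_bytes() -> int:
    """Physical memory in bytes via POSIX sysconf (assumed present on macOS)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")


def can_hold(total_bytes: float, required_gb: float = 200.0) -> bool:
    """True if total_bytes meets the required size in decimal gigabytes."""
    return total_bytes >= required_gb * 1e9


if __name__ == "__main__":
    total = total_memory_bytes()
    verdict = "enough" if can_hold(total) else "not enough"
    print(f"{total / 1e9:.0f} GB physical memory: {verdict} for this quant")
```

The 200 GB default mirrors the platform row above; the actual headroom needed depends on context length and KV cache size, which this check ignores.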

How to Run

Python (mlx_lm)

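# Needs minimax_m3_vl support: use the bundled minimax_m3_vl.py or PR #1398
# (see "🔧 Patch required" above). Then, from the shell:
python -m mlx_lm.generate --model thetom-ai/MiniMax-M3-ConfigI-MLX --prompt "Hello" --temp 1.0 --top-p 0.95

# Or from Python. Recent mlx_lm releases take sampling settings through a
# sampler object rather than temp/top_p keyword arguments.
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

model, tokenizer = load("thetom-ai/MiniMax-M3-ConfigI-MLX")
sampler = make_sampler(temp=1.0, top_p=0.95)
print(generate(model, tokenizer, prompt="Hello", max_tokens=256, sampler=sampler))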

**Note:** MiniMax models are always-reasoning; use temperature=1.0. Greedy decoding (temp=0) can cause infinite thinking loops.

Limitations (current loader)

With today’s minimax_m3_vl loader (PR #1398), this runs as a text-only, dense-attention model:

  • **No image input.** The vision tower weights ship in the repo but the loader doesn’t wire up VL inference yet; they are dead weight until MLX adds M3-VL support, at which point no re-quantization is needed.
  • **Dense attention, not MSA.** MiniMax Sparse Attention is run as full causal attention, which is numerically exact (equal-or-better quality), but long context is slower and more KV-hungry than native M3. The MSA indexer weights are retained (bf16) for a future MSA-capable loader.

Both are intentional: the weights are kept so the artifact is forward-compatible without re-quantizing from source.

What is Config-I?

Config-I is a tensor-role-aware weight compression policy from TurboQuant+. Systematic A/B isolation showed that attention tensors, FFN read projections (gate/up), FFN write-back projections (down), and boundary layers have dramatically different compression sensitivity. The key insight is that compression policy matters more than compression math: which tensors to compress, which to protect, and how aggressively. For MoE models, expert MLPs dominate parameter count but tolerate aggressive compression because only a few of the 128 experts are active per token; Config-I compresses them to 2–3 bit while protecting attention and routing.
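To put a number on "only a few of the 128 experts are active per token", the sketch below computes the active share from the figures given earlier (128 routed experts per layer, top-4 routing, 1 shared expert). It assumes the shared expert is the same size as a routed expert, which the model card does not state; the function name is illustrative.

```python
def active_expert_fraction(routed: int = 128, top_k: int = 4,
                           shared: int = 1) -> float:
    """Share of a layer's experts that run for one token.

    Assumes every expert (routed or shared) has the same parameter count.
    """
    return (top_k + shared) / (routed + shared)


if __name__ == "__main__":
    frac = active_expert_fraction()
    print(f"active experts per token: {frac:.1%} of each layer's experts")
```

Under that assumption, 5 of 129 experts (about 3.9%) contribute to any given token, so quantization error in any single expert touches only the tokens routed to it.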


This quant was produced from the MXFP8 checkpoint with convert_m3.py. It is shared as-is, untested, for others with the hardware to evaluate it.
