We added W8A8 activation quantization to MLX — prefill went from 2.84s to 2.52s on M5 Pro

Reddit r/LocalLLaMA 05/25/26, 08:16 AM Tools

Summary

Mininglamp AI released Cider, a small SDK that adds W8A8 activation quantization to Apple's MLX framework, achieving up to 1.84x speedup on prefill for large language models on M5 Pro via custom Metal kernels. The tool works with any MLX model, with INT8 TensorOps support for M5 and above.

Hey, I work on inference tooling at Mininglamp AI. We needed faster prefill for a 4B VLM running on Apple Silicon. Problem was MLX only does weight-only quant — activations stay FP16 the whole way through. So we wrote Cider, a small SDK that adds W8A8 activation quant on top of MLX. Numbers on M5 Pro (64GB, 307 GB/s), 4516 token context: |Quantization|Prefill|Decode| |:-|:-|:-| |W8A16 (MLX)|2.839s|80.1 tok/s| |W8A8 (Cider)|2.519s|79.5 tok/s| Under the hood it's custom Metal kernels we registered as MLX primitives. At M=4096 the per-channel path runs 1.84x faster than W8A16 on the same shape. Not just for our model btw — works with anything that runs through MLX. One catch: INT8 TensorOps only compile on M5 and above. pip install on M4 still works, just falls back to the regular path. Repo: [https://github.com/Mininglamp-AI/cider](https://github.com/Mininglamp-AI/cider) Edit: adding accuracy numbers since it came up. Wikitext2 PPL on Qwen3-8B: FP16 9.73, W8A16 9.71, W8A8 per-channel 9.76. Llama3-8B: FP16 6.14, W8A16 6.15, W8A8 per-channel 6.27. Per-group gs=64 keeps it tighter if precision matters more than speed for your use case.

Original Article

Similar Articles

@neural_avb: I am working on porting SAM models and harness into Apple silicon. Already seeing 1.25x inference speed increase on mlx…

X AI KOLs Following

Porting SAM 2.1 models to Apple silicon with MLX, achieving 1.25x inference speed increase on the small model, with quantized versions planned.

MLX 16/8/4/2-bit quants of nvidia/llama-embed-nemotron-8b

Reddit r/LocalLLaMA

The user converted Nvidia's Llama-Embed-Nemotron-8B model to MLX format with fp16, 8-bit, 4-bit, and 2-bit quantizations, enabling in-process embedding loading on Apple Silicon via mlx-embeddings.

@nash_su: Mac inference speed doubled. MTPLX is an integrated solution combining MLX and MTP, specifically optimized for model inference on Apple Silicon. By using models with a custom MTP head, it can deliver doubled inference speed. I tested it with Qwen3.6-27…

X AI KOLs Timeline

MTPLX is an integrated solution combining MLX and MTP, specifically optimized for model inference speed on Apple Silicon. Tests show that Qwen3.6-27B achieves double the inference speed of LM Studio, and it also integrates fan management.

Gemma4 26b MoE running in MLX with turboquant (and custom kernel)

Reddit r/LocalLLaMA

A developer successfully ran Gemma4 26b MoE on Apple MacBook Air M5 using MLX with turboquant and a custom kernel, achieving faster prompt processing and generation speeds than llama.cpp with lower memory usage. The implementation includes instructions for local deployment.

@ivanfioravanti: Apple M5 Max + MLX = raw power! Look at this demo I'm playing with "FasterLivePortrait-MLX" I started with MPS but resu…