We added W8A8 activation quantization to MLX — prefill went from 2.84s to 2.52s on M5 Pro
Summary
Mininglamp AI released Cider, a small SDK that adds W8A8 (8-bit weight, 8-bit activation) quantization to Apple's MLX framework. Using custom Metal kernels, it achieves up to a 1.84x speedup on prefill for large language models on M5 Pro. The tool works with any MLX model, and INT8 TensorOps are supported on M5 and later chips.
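For readers unfamiliar with W8A8, the idea can be sketched in plain NumPy: weights are quantized to int8 once, activations are quantized to int8 at run time, the matrix multiply accumulates in int32, and the result is rescaled to float. This is only an illustration of the general technique, not Cider's API or its Metal kernels; the function names `quantize_int8` and `w8a8_matmul`, the symmetric scheme, and the per-token and per-output-channel scales are all assumptions made for the example.

```python
import numpy as np

def quantize_int8(x, axis):
    """Symmetric int8 quantization with one scale per slice along `axis` (hypothetical helper)."""
    max_abs = np.max(np.abs(x), axis=axis, keepdims=True)
    scale = np.maximum(max_abs, 1e-8) / 127.0  # guard against all-zero rows
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float32)

def w8a8_matmul(x, w_q, w_scale):
    """Compute x @ W.T with int8 activations and int8 weights.

    x:       (tokens, in_features) float activations
    w_q:     (out_features, in_features) int8 weights, quantized ahead of time
    w_scale: (out_features, 1) float scales for the weights
    """
    # Activations are quantized on the fly, one scale per token (row).
    x_q, x_scale = quantize_int8(x, axis=1)
    # Integer matmul with int32 accumulation, as an INT8 matrix unit would do.
    acc = x_q.astype(np.int32) @ w_q.astype(np.int32).T
    # Rescale back to float: per-token scale times per-output-channel scale.
    return acc.astype(np.float32) * x_scale * w_scale.T

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 64)).astype(np.float32)   # 4 prompt tokens
w = rng.standard_normal((32, 64)).astype(np.float32)  # one linear layer

w_q, w_scale = quantize_int8(w, axis=1)  # offline weight quantization
y_ref = x @ w.T                          # float reference
y_q = w8a8_matmul(x, w_q, w_scale)       # W8A8 result

rel_err = np.linalg.norm(y_q - y_ref) / np.linalg.norm(y_ref)
print(f"relative error: {rel_err:.4f}")
```

Prefill processes many prompt tokens at once, so it is dominated by large matrix multiplies; that is why moving both operands to int8 tends to help prefill more than token-by-token decoding.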
Similar Articles
@neural_avb: I am working on porting SAM models and harness into Apple silicon. Already seeing 1.25x inference speed increase on mlx…
Porting SAM 2.1 models to Apple silicon with MLX, achieving 1.25x inference speed increase on the small model, with quantized versions planned.
MLX 16/8/4/2-bit quants of nvidia/llama-embed-nemotron-8b
The user converted Nvidia's Llama-Embed-Nemotron-8B model to MLX format with fp16, 8-bit, 4-bit, and 2-bit quantizations, enabling in-process embedding loading on Apple Silicon via mlx-embeddings.
@nash_su: Mac inference speed doubled. MTPLX is an integrated solution combining MLX and MTP, specifically optimized for model inference on Apple Silicon. By using models with a custom MTP head, it can deliver doubled inference speed. I tested it with Qwen3.6-27…
MTPLX is an integrated solution combining MLX and MTP, specifically optimized for model inference speed on Apple Silicon. Tests show that Qwen3.6-27B achieves double the inference speed of LM Studio, and it also integrates fan management.
Gemma4 26b MoE running in MLX with turboquant (and custom kernel)
A developer successfully ran Gemma4 26b MoE on Apple MacBook Air M5 using MLX with turboquant and a custom kernel, achieving faster prompt processing and generation speeds than llama.cpp with lower memory usage. The implementation includes instructions for local deployment.
@ivanfioravanti: Apple M5 Max + MLX = raw power! Look at this demo I'm playing with "FasterLivePortrait-MLX" I started with MPS but resu…
The author demonstrates that migrating a LivePortrait implementation from MPS to Apple's MLX framework on an M5 Max chip results in significantly better performance and speed.