Command A+ (218B MoE) running on Apple Silicon — MLX port, PR open

Reddit r/LocalLLaMA 05/23/26, 08:14 PM Tools

mlx cohere apple-silicon moe inference open-source command-a-plus

Summary

A PR for mlx-lm adds support for Cohere's Command A+ (218B MoE) model on Apple Silicon, with architecture details for the implementation.

Cohere dropped Command A+ on the 20th (218B total / 25B active, 128 experts top-8, Apache 2.0). Wrote a cohere2_moe implementation for mlx-lm to get it running on Apple Silicon.

Architecture notes for anyone digging into this model:

- Single shared expert with a larger intermediate (16384 = 4096×4), combined with the routed output via (routed + shared)/2
- Sigmoid routing (not softmax), normalized top-8
- Sliding window 3:1 (3 sliding + 1 full), interleaved RoPE on sliding layers only
- Parallel attn+MLP block off the same LayerNorm
- Gotcha that cost me a few iterations: the biases in the W4A4 checkpoint are NVFP4 quantization artifacts; the BF16 model is entirely bias-free. sanitize() handles both formats.

I couldn't validate locally (W4A4 needs ~132GB, my M3 Max is 128). [https://github.com/vlbosch](https://github.com/vlbosch) ran it on a bigger box: BF16→Q8 conversion + clean generation, tool calling, multi-turn with KV-cache continuation, 22.9 tok/s gen / 57.6 tok/s prompt, 241GB peak.

PR is open on ml-explore/mlx-lm (in review). Happy to take feedback or fixes, and if someone with 192GB+ wants to test the W4A4 path directly, would love the error output.

[https://github.com/ml-explore/mlx-lm/pull/1294](https://github.com/ml-explore/mlx-lm/pull/1294)

https://preview.redd.it/wvwa6irg6y2h1.png?width=3006&format=png&auto=webp&s=52c0a56ff7bc6ea0dec7fd4e43e79d7525047c1c
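
For anyone who wants the routing and shared-expert notes above in concrete form, here is a minimal pure-Python sketch for a single token. It is not the code from the PR: the function names (`route`, `combine`, `moe_token`, `layer_kinds`) are made up for illustration, "normalized top-8" is assumed to mean dividing the selected sigmoid scores by their sum, and the position of the full-attention layer inside each group of four is a guess.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def route(logits, top_k=8):
    """Sigmoid routing: score each expert independently, keep the top_k,
    then renormalize the kept scores so they sum to 1 (assumed meaning
    of "normalized top-8")."""
    scores = [sigmoid(z) for z in logits]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    total = sum(scores[i] for i in top)
    return [(i, scores[i] / total) for i in top]


def combine(routed_out, shared_out):
    """Merge the routed mixture with the shared expert: (routed + shared) / 2."""
    return [(r + s) / 2.0 for r, s in zip(routed_out, shared_out)]


def moe_token(x, logits, experts, shared_expert, top_k=8):
    """Run one token through the selected experts plus the shared expert.
    `experts` is a list of callables and `shared_expert` a callable; both
    are stand-ins for real MLP blocks."""
    routed = [0.0] * len(x)
    for idx, weight in route(logits, top_k):
        y = experts[idx](x)
        routed = [acc + weight * v for acc, v in zip(routed, y)]
    return combine(routed, shared_expert(x))


def layer_kinds(num_layers, period=4):
    """3 sliding + 1 full per group of 4. Putting the full-attention layer
    last in each group is an assumption, not something the post states."""
    return ["full" if (i + 1) % period == 0 else "sliding" for i in range(num_layers)]
```

Unlike softmax, sigmoid scores each expert independently, so the scores do not sum to 1 on their own; the renormalization over the kept experts is what makes the routed output a weighted average.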


Similar Articles

mlx-code — local LLM coding agent for Apple Silicon

I fitted the new δ-mem research for apple silicon using mlx and openclaw integration! My findings

@cohere: Introducing: Cohere Command A+ We’ve created our most powerful LLM yet, optimized it to run on as little hardware as po…

I built mlx-Chronos — a community benchmark leaderboard for local LLM engines on Apple Silicon (oMLX, Rapid-MLX, mlx-lm, Ollama)

@neural_avb: I am working on porting SAM models and harness into Apple silicon. Already seeing 1.25x inference speed increase on mlx…

Porting SAM 2.1 models to Apple silicon with MLX, achieving 1.25x inference speed increase on the small model, with quantized versions planned.