@AlexJonesax: Two open-source MLX inference servers worth knowing about if you run LLMs on Mac: MTPLX (@youssofal) Uses a model's own…

Summary

This post highlights two open-source MLX inference servers for Mac: MTPLX, which maximizes tokens per second by using the model's own MTP heads for speculative decoding instead of a separate draft model, and oMLX, which improves workflow efficiency for coding agents with a tiered KV cache that persists to SSD.

Two open-source MLX inference servers worth knowing about if you run LLMs on Mac:

MTPLX (@youssofal): Uses a model's own MTP (multi-token prediction) heads for speculative decoding, so no draft model is needed. ~63 tok/s on Qwen3.6-27B (M5 Max). Sampling is mathematically exact too, not just greedy prefix matching.

oMLX (@jundot): Tiered KV cache that persists to SSD across restarts. Huge for coding agents where you're sending the same codebase context repeatedly. Also serves LLMs, VLMs, embeddings, rerankers, and audio simultaneously.

They're solving different problems: MTPLX maximizes tok/s, oMLX maximizes workflow efficiency. Both have OpenAI- and Anthropic-compatible APIs, and both work with Claude Code/OpenCode/Cursor out of the box.

I'm running both depending on the task. Both are worth checking out.
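
For readers wondering what "mathematically exact sampling" means in speculative decoding, here is a minimal sketch of the standard accept/reject rule from the speculative sampling literature. It is not taken from MTPLX's code, and whether MTPLX implements exactly this rule is an assumption. The names `speculative_accept` and `empirical` are hypothetical, `p` stands in for the full model's next-token distribution, and `q` stands in for the proposal distribution (for MTPLX, presumably what the MTP heads produce).

```python
import random


def speculative_accept(p, q, draft_token, rng):
    """One accept/reject step of speculative sampling (illustrative only).

    p: target probabilities over the vocabulary (list of floats)
    q: proposal probabilities over the same vocabulary
    draft_token: an index that was sampled from q
    Returns the token to emit; over many calls its distribution matches p.
    """
    # Accept the proposed token with probability min(1, p/q).
    ratio = p[draft_token] / q[draft_token]
    if rng.random() < min(1.0, ratio):
        return draft_token
    # On rejection, resample from the residual max(0, p - q).
    # random.choices normalizes the weights for us.
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    return rng.choices(range(len(p)), weights=residual, k=1)[0]


def empirical(p, q, n, seed=0):
    """Draw n tokens through the accept/reject step and return frequencies."""
    rng = random.Random(seed)
    counts = [0] * len(p)
    for _ in range(n):
        draft = rng.choices(range(len(q)), weights=q, k=1)[0]
        counts[speculative_accept(p, q, draft, rng)] += 1
    return [c / n for c in counts]


if __name__ == "__main__":
    target = [0.6, 0.3, 0.1]    # made-up target distribution
    proposal = [0.2, 0.5, 0.3]  # made-up, deliberately mismatched proposal
    print(empirical(target, proposal, 100_000))
```

Even with a badly mismatched proposal, the emitted frequencies track the target distribution. That is the sense in which this kind of scheme is exact rather than an approximation; a poor proposal only lowers the acceptance rate and therefore the speedup.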

Similar Articles

jundot/omlx

GitHub Trending (daily)

oMLX is a new open-source tool for optimized LLM inference on Apple Silicon Macs, featuring continuous batching and tiered KV caching managed via a menu bar app.

New MLX LM Server From Apple

Reddit r/LocalLLaMA

Apple's MLX team introduces MLX LM Server, a tool for running AI agent workflows fully locally on Mac, supporting continuous batching, distributed inference, and M5 neural acceleration, with no need for cloud or API keys.