@AlexJonesax: Two open-source MLX inference servers worth knowing about if you run LLMs on Mac: MTPLX (@youssofal) Uses a model's own…

X AI KOLs Timeline 05/10/26, 01:32 PM Tools

open-source mlx macos llm-inference speculative-decoding ai-tools

Summary

This article highlights two open-source MLX inference servers for Macs. MTPLX increases generation speed (tokens per second) through speculative decoding without a separate draft model. oMLX improves workflow efficiency for coding agents with KV caches that persist across restarts.

Two open-source MLX inference servers worth knowing about if you run LLMs on a Mac:

MTPLX (@youssofal): Uses the model's own multi-token prediction (MTP) heads for speculative decoding, so no separate draft model is needed. About 63 tok/s on Qwen3.6-27B (M5 Max). Sampling is mathematically exact, not just greedy prefix matching.

oMLX (@jundot): A tiered KV cache that persists to SSD across restarts. This is a big win for coding agents, which send the same codebase context repeatedly. It also serves LLMs, VLMs, embeddings, rerankers, and audio models simultaneously.

They solve different problems: MTPLX maximizes tok/s, while oMLX maximizes workflow efficiency. Both expose OpenAI- and Anthropic-compatible APIs, and both work with Claude Code, OpenCode, and Cursor out of the box. I run both depending on the task, and both are worth checking out.
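The "mathematically exact" claim most likely refers to the standard speculative-sampling acceptance rule.

- A drafted token x is accepted with probability min(1, p(x)/q(x)). Here p is the target model's distribution and q is the distribution that proposed the token; with MTP, q would come from the model's own MTP head.
- On rejection, a replacement token is sampled from the renormalized residual max(0, p - q).

Together these steps make the emitted tokens follow p exactly. The sketch below shows the rule on toy distributions over a three-token vocabulary. It is a generic illustration of speculative sampling under these assumptions, not MTPLX's code, and the function names are hypothetical.

```python
import random
from collections import Counter

def accept_or_resample(draft_token, p, q, rng):
    """One verification step of exact speculative sampling (generic sketch).

    p: target-model probabilities (list summing to 1)
    q: draft probabilities that proposed draft_token (stand-in for an MTP head)
    Returns (emitted_token, accepted).
    """
    # Accept the drafted token with probability min(1, p(x) / q(x)).
    if rng.random() < min(1.0, p[draft_token] / q[draft_token]):
        return draft_token, True
    # On rejection, sample from the residual max(0, p - q), renormalized.
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(residual)
    weights = [r / total for r in residual]
    return rng.choices(range(len(p)), weights=weights)[0], False

def speculative_step(p, q, rng):
    """Draft one token from q, then verify it against p."""
    draft = rng.choices(range(len(q)), weights=q)[0]
    return accept_or_resample(draft, p, q, rng)

if __name__ == "__main__":
    rng = random.Random(0)
    p = [0.5, 0.3, 0.2]  # toy target distribution
    q = [0.2, 0.2, 0.6]  # toy draft distribution, deliberately different from p
    n = 100_000
    counts = Counter(speculative_step(p, q, rng)[0] for _ in range(n))
    # Empirical frequencies approach p, not q, which is what "exact" means here.
    print([round(counts[i] / n, 2) for i in range(3)])
```

Greedy prefix matching, by contrast, keeps a drafted token only if it equals the target model's argmax. That preserves greedy decoding but not sampling at a nonzero temperature.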


Similar Articles

jundot/omlx

GitHub Trending (daily)

oMLX is a new open-source tool for optimized LLM inference on Apple Silicon Macs, featuring continuous batching and tiered KV caching managed via a menu bar app.
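A tiered KV cache can be pictured as a small hot tier in RAM backed by a larger tier on SSD. Entries are keyed by a hash of the token prefix, so a repeated codebase prompt can skip prefill even after a restart. The sketch below is a hypothetical illustration, not oMLX's design. The class name, the LRU eviction policy and the pickled files on disk are all assumptions, and the cached value stands in for real per-layer KV tensors.

```python
import hashlib
import os
import pickle
import tempfile
from collections import OrderedDict

class TieredPrefixCache:
    """Hypothetical two-tier cache: an LRU dict in RAM over files on disk."""

    def __init__(self, disk_dir, max_hot=2):
        self.disk_dir = disk_dir
        self.max_hot = max_hot
        self.hot = OrderedDict()  # key -> cached value, most recently used last
        os.makedirs(disk_dir, exist_ok=True)

    @staticmethod
    def key(tokens):
        # Hash the exact token prefix; identical prompts map to the same entry.
        return hashlib.sha256(repr(list(tokens)).encode()).hexdigest()

    def _path(self, k):
        return os.path.join(self.disk_dir, k + ".pkl")

    def _evict(self):
        # Drop least recently used entries from RAM; their files stay on disk.
        while len(self.hot) > self.max_hot:
            self.hot.popitem(last=False)

    def put(self, tokens, kv):
        k = self.key(tokens)
        self.hot[k] = kv
        self.hot.move_to_end(k)
        # Write through to disk so the entry survives a process restart.
        with open(self._path(k), "wb") as f:
            pickle.dump(kv, f)
        self._evict()

    def get(self, tokens):
        k = self.key(tokens)
        if k in self.hot:  # hot-tier hit
            self.hot.move_to_end(k)
            return self.hot[k]
        path = self._path(k)
        if os.path.exists(path):  # cold-tier hit: load and promote to RAM
            with open(path, "rb") as f:
                kv = pickle.load(f)
            self.hot[k] = kv
            self._evict()
            return kv
        return None  # miss: the caller must run prefill

if __name__ == "__main__":
    d = tempfile.mkdtemp()
    TieredPrefixCache(d).put([1, 2, 3], {"layer0": "kv-for-1-2-3"})
    # A fresh instance simulates a server restart: RAM is empty, disk is not.
    print(TieredPrefixCache(d).get([1, 2, 3]))
```

For simplicity the sketch only matches whole prompts. A production cache would more likely match the longest cached prefix block by block, so that a prompt sharing only its first part with an earlier one can still reuse that part.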

MTPLX V1: The Swift App For Running & Creating MLX MTP Models (2x TPS Qwen 3.6 27B)

Reddit r/LocalLLaMA

MTPLX V1 is a native Mac app that bundles the MTP speculative decoding engine for MLX models. It adds model conversion via Forge, built-in chat, benchmarking, and support for smaller models. It reports a speedup of more than 2x while keeping sampling mathematically exact.

MLX engine comparison… and oMLX is the top choice.

Reddit r/LocalLLaMA

A blog post comparing MLX inference engines concludes that oMLX is the top choice. The conclusion rests on benchmarks run on an M5 Max (64 GB) with Qwen3.6-35B-A3B-4bit.

New MLX LM Server From Apple

Reddit r/LocalLLaMA

Apple's MLX team introduces MLX LM Server, a tool for running AI agent workflows fully locally on a Mac, with no cloud services or API keys required. It supports continuous batching, distributed inference, and M5 neural acceleration.
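Continuous batching means the server re-forms its batch at every decoding step. Finished sequences leave immediately and queued requests take their slots, instead of the whole batch waiting for its longest member.

The toy scheduler below illustrates only that scheduling idea. It rests on these assumptions:

- Each request is reduced to a fixed, positive number of tokens to generate.
- One "step" decodes one token for every active request.

It is a hypothetical sketch, not MLX LM Server's implementation.

```python
from collections import deque

def continuous_batching(requests, max_batch):
    """Simulate step-level (continuous) batching.

    requests: list of (request_id, tokens_to_generate), tokens_to_generate >= 1
    Returns ({request_id: step at which it finished}, total steps).
    """
    queue = deque(requests)
    active = {}  # request_id -> tokens still to generate
    finished_at = {}
    step = 0
    while queue or active:
        # Admit waiting requests into any free batch slots before this step.
        while queue and len(active) < max_batch:
            rid, n = queue.popleft()
            active[rid] = n
        step += 1
        # One decoding step produces one token for every active request.
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:
                finished_at[rid] = step
                del active[rid]  # the slot is free for the next step
    return finished_at, step

if __name__ == "__main__":
    done, steps = continuous_batching(
        [("a", 5), ("b", 1), ("c", 1), ("d", 2)], max_batch=2
    )
    print(done, steps)
```

With these four requests and two slots, all requests finish in 5 steps. With static batching, ("a", "b") would occupy the batch for 5 steps before ("c", "d") could start, for 7 steps in total.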

@jundotkim: oMLX 0.3.9rc1 released. Highlights: - Low-memory Macs stay stable instead of getting killed by the OS - DFlash bumped t…

X AI KOLs Timeline

oMLX 0.3.9rc1, an LLM inference server optimized for Apple Silicon Macs, adds low-memory stability, chunked prefill, multi-tasking admin chat, and more.
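Chunked prefill processes a long prompt in fixed-size slices, appending each slice's keys and values to the cache before moving on. As a result, the activations of any single forward pass scale with the chunk size rather than the full prompt length, which is one way a server can stay within memory on smaller Macs. The sketch below shows only the control flow. `process_chunk` is a hypothetical stand-in that records token positions instead of running a model, and none of this is oMLX's code.

```python
def process_chunk(chunk, kv_cache):
    """Hypothetical stand-in for one forward pass over `chunk`.

    A real model would attend over kv_cache plus the chunk, then append the
    chunk's keys/values. Here we only append (position, token) pairs.
    """
    start = len(kv_cache)
    kv_cache.extend((start + i, tok) for i, tok in enumerate(chunk))
    return len(chunk)  # tokens processed in this pass

def chunked_prefill(prompt_tokens, chunk_size):
    """Fill the cache chunk by chunk; return the cache and the largest pass size."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    kv_cache = []
    largest_pass = 0
    for i in range(0, len(prompt_tokens), chunk_size):
        chunk = prompt_tokens[i:i + chunk_size]
        largest_pass = max(largest_pass, process_chunk(chunk, kv_cache))
    return kv_cache, largest_pass

if __name__ == "__main__":
    cache, peak = chunked_prefill(list(range(10)), chunk_size=4)
    print(len(cache), peak)  # 10 cached positions, at most 4 tokens per pass
```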
