@cevenif: For those running local LLMs on Macs, here's a tool worth watching — Rapid-MLX. It delivers 2-4x faster inference on M-series chips than Ollama, thanks to being built directly on Apple's MLX framework for more thorough utilization of the chip architecture. Key highlights: KV cache pruning plus…

X AI KOLs Timeline Tools

Summary

Rapid-MLX is a local LLM inference tool optimized for Apple M-series chips. Built on the MLX framework, it achieves 2 to 4 times faster inference than Ollama, supports multiple models, tool calling, and an OpenAI API-compatible interface.

For those running local LLMs on Macs, here's a tool worth watching — Rapid-MLX. It delivers 2-4x faster inference on M-series chips than Ollama, thanks to being built directly on Apple's MLX framework for more thorough utilization of the chip architecture. Key highlights: - KV cache pruning combined with DeltaNet state snapshots bring first-token latency for multi-turn conversations down to around 0.08 seconds, virtually eliminating any noticeable wait - Tool calling supports 17 parsers, automatically recognizing output formats of models like Qwen, DeepSeek, Gemma, GLM, and can auto-repair issues that arise after quantization - Compatible with the OpenAI API spec, so Cursor, Claude Code, Aider, LangChain can all connect with minimal code changes It also supports reasoning chain separation, cloud routing, multimodal vision/audio, V-cache compression, and more. If you have an M-series Mac and find Ollama not fast enough, Rapid-MLX is worth a try. https://github.com/raullenchai/Rapid-MLX…
Original Article
View Cached Full Text

Cached at: 06/18/26, 02:17 PM

Rapid-MLX

Run AI on your Mac. Faster than anything else.

Run local AI models on your Mac — no cloud, no API costs. Works with Cursor, Claude Code, and any OpenAI-compatible app.

rapidmlx.com · Desktop app · Community benchmarks · Model mirror

pip install → serve Gemma 4 26B → chat + tool calling → works with PydanticAI, LangChain, Aider, and more.

Similar Articles

@nash_su: Mac inference speed doubled. MTPLX is an integrated solution combining MLX and MTP, specifically optimized for model inference on Apple Silicon. By using models with a custom MTP head, it can deliver doubled inference speed. I tested it with Qwen3.6-27…

X AI KOLs Timeline

MTPLX is an integrated solution combining MLX and MTP, specifically optimized for model inference speed on Apple Silicon. Tests show that Qwen3.6-27B achieves double the inference speed of LM Studio, and it also integrates fan management.

@sitinme: There's a pretty interesting open-source project called Cider, specifically designed to accelerate local AI inference on Macs with Apple Silicon chips. Many people buy a Mac mini or MacBook Pro and want to run models locally, but often encounter issues like insufficient speed and high memory usage. Actually...

X AI KOLs Timeline

Cider is an open-source project designed for Apple Silicon Macs, accelerating local AI inference by fully leveraging the computing power of M-series chips. It is compatible with the MLX ecosystem, supports models like Qwen and Llama, and is easy to install.

New MLX LM Server From Apple

Reddit r/LocalLLaMA

Apple's MLX team introduces MLX LM Server, a tool for running AI agent workflows fully locally on Mac, supporting continuous batching, distributed inference, and M5 neural acceleration, with no need for cloud or API keys.

@berryxia: Apple has been betting on on-device models all along! Unified architecture memory is the natural habitat for on-device models! Unified memory means memory is VRAM. We are seeing more and more excellent on-device models emerge. OpenBMB released MiniCPM-V 4.6, a 1.3B multimodal model. After reading it…

X AI KOLs Timeline

OpenBMB released MiniCPM-V 4.6, a 1.3B parameter multimodal model. Using high-resolution visual processing and efficient compression, it achieves fast inference on consumer hardware and mobile phones, outperforming larger models. It is fully open-source and supports multiple inference and quantization frameworks.