@cevenif: For those running local LLMs on Macs, here's a tool worth watching — Rapid-MLX. It delivers 2-4x faster inference on M-series chips than Ollama, thanks to being built directly on Apple's MLX framework for more thorough utilization of the chip architecture. Key highlights: KV cache pruning plus…

X AI KOLs Timeline 06/18/26, 12:35 AM Tools

local-ai apple-silicon mlx inference open-source macos tool-calling

Summary

Rapid-MLX is a local LLM inference tool optimized for Apple M-series chips. Built on the MLX framework, it achieves 2 to 4 times faster inference than Ollama, supports multiple models, tool calling, and an OpenAI API-compatible interface.

For those running local LLMs on Macs, here's a tool worth watching — Rapid-MLX. It delivers 2-4x faster inference on M-series chips than Ollama, thanks to being built directly on Apple's MLX framework for more thorough utilization of the chip architecture. Key highlights: - KV cache pruning combined with DeltaNet state snapshots bring first-token latency for multi-turn conversations down to around 0.08 seconds, virtually eliminating any noticeable wait - Tool calling supports 17 parsers, automatically recognizing output formats of models like Qwen, DeepSeek, Gemma, GLM, and can auto-repair issues that arise after quantization - Compatible with the OpenAI API spec, so Cursor, Claude Code, Aider, LangChain can all connect with minimal code changes It also supports reasoning chain separation, cloud routing, multimodal vision/audio, V-cache compression, and more. If you have an M-series Mac and find Ollama not fast enough, Rapid-MLX is worth a try. https://github.com/raullenchai/Rapid-MLX…

Original Article

View Cached Full Text

Cached at: 06/18/26, 02:17 PM

Rapid-MLX

Run AI on your Mac. Faster than anything else.

Run local AI models on your Mac — no cloud, no API costs. Works with Cursor, Claude Code, and any OpenAI-compatible app.

rapidmlx.com · Desktop app · Community benchmarks · Model mirror

pip install → serve Gemma 4 26B → chat + tool calling → works with PydanticAI, LangChain, Aider, and more.

Similar Articles

@nash_su: Mac inference speed doubled. MTPLX is an integrated solution combining MLX and MTP, specifically optimized for model inference on Apple Silicon. By using models with a custom MTP head, it can deliver doubled inference speed. I tested it with Qwen3.6-27…

X AI KOLs Timeline

MTPLX is an integrated solution combining MLX and MTP, specifically optimized for model inference speed on Apple Silicon. Tests show that Qwen3.6-27B achieves double the inference speed of LM Studio, and it also integrates fan management.

@sitinme: There's a pretty interesting open-source project called Cider, specifically designed to accelerate local AI inference on Macs with Apple Silicon chips. Many people buy a Mac mini or MacBook Pro and want to run models locally, but often encounter issues like insufficient speed and high memory usage. Actually...

X AI KOLs Timeline

Cider is an open-source project designed for Apple Silicon Macs, accelerating local AI inference by fully leveraging the computing power of M-series chips. It is compatible with the MLX ecosystem, supports models like Qwen and Llama, and is easy to install.

SwiftLM: Pure-Swift Apple Silicon LLM inference server—no Python, runs big models on low-RAM Macs

X AI KOLs Timeline

SwiftLM is a Swift-native LLM inference server for Apple Silicon that runs large models without Python, using SSD streaming to load MoE weights and enabling 122B models on 64 GB Macs.

New MLX LM Server From Apple

Reddit r/LocalLLaMA

Apple's MLX team introduces MLX LM Server, a tool for running AI agent workflows fully locally on Mac, supporting continuous batching, distributed inference, and M5 neural acceleration, with no need for cloud or API keys.

@berryxia: Apple has been betting on on-device models all along! Unified architecture memory is the natural habitat for on-device models! Unified memory means memory is VRAM. We are seeing more and more excellent on-device models emerge. OpenBMB released MiniCPM-V 4.6, a 1.3B multimodal model. After reading it…

X AI KOLs Timeline

OpenBMB released MiniCPM-V 4.6, a 1.3B parameter multimodal model. Using high-resolution visual processing and efficient compression, it achieves fast inference on consumer hardware and mobile phones, outperforming larger models. It is fully open-source and supports multiple inference and quantization frameworks.

Similar Articles

@nash_su: Mac inference speed doubled. MTPLX is an integrated solution combining MLX and MTP, specifically optimized for model inference on Apple Silicon. By using models with a custom MTP head, it can deliver doubled inference speed. I tested it with Qwen3.6-27…

SwiftLM: Pure-Swift Apple Silicon LLM inference server—no Python, runs big models on low-RAM Macs

New MLX LM Server From Apple

Submit Feedback