A guide on building a large language model from scratch using Apple's MLX framework.
A collection of 650+ Apache-2.0 licensed biomedical NER and de-identification models that run on-device via MLX, achieving 30-40x faster inference than PyTorch-CPU on an M3 Max with identical outputs.
Rapid-MLX is a local LLM inference tool optimized for Apple M-series chips. Built on the MLX framework, it achieves 2 to 4 times faster inference than Ollama, supports multiple models, tool calling, and an OpenAI API-compatible interface.
GLM 5.2, an open-weight AI model comparable to top closed models, has been released and is now running on MLX on two Mac Studios (M3 Ultra).
A Config-I quantization of MiniMax-M3 for MLX has been released. It uses 2-bit experts and 4-bit attention to shrink the 427B MoE model from 869GB to ~167GB, though the quant is untested and requires a patch for mlx_lm.
The react-native-executorch library now integrates Google's Gemma 4 model, enabling fully offline, GPU-accelerated inference in React Native apps using Vulkan on Android and MLX on Apple Silicon.
MLX-LoRA-Studio is a native macOS app for fine-tuning LLMs on Apple Silicon, offering a user-friendly interface and support for various training algorithms including SFT, DPO, and QAT. It is fully open-source and allows local, private fine-tuning without cloud dependency.
oMLX, an MLX server for local AI, now supports the standard Hugging Face cache model directory, simplifying model loading.
A tweet highlights an excellent WWDC video by Angelos Kath on building local agentic AI with MLX, noting rapid progress in open-weight models and hardware capabilities.
MTPLX V1 is a native Mac app that bundles the MTP speculative decoding engine for MLX models, offering features like model conversion via Forge, built-in chat, benchmarking, and support for smaller models. It achieves over a 2x speedup while keeping outputs mathematically exact.
Yagil Bubrovnik presented at WWDC, demoing LM Studio's upcoming clustering feature on stage, crediting the MLX team for their work.
Cohere officially launches North Mini Code, a coding model, with weights available on Hugging Face and deployment support for vLLM and MLX.
Three MLX videos from WWDC demonstrate running AI agents entirely locally on Apple Silicon using the MLX stack, including local inference, tool calling, and distributed inference across Macs, enabling no-cloud, offline AI workflows.
Apple's MLX team introduces MLX LM Server, a tool for running AI agent workflows fully locally on Mac, supporting continuous batching, distributed inference, and M5 neural acceleration, with no need for cloud or API keys.
A special guest from Google will discuss next-generation foundation models at the Extreme Alpha RN event, which will also feature Awni Hannun, co-creator of MLX, as a speaker.
oMLX v0.4.0 ships a native Swift macOS app with redesigned onboarding, settings UI, Hugging Face cache discovery, and improved model management for running local AI on Macs.
A CS student built mlx-Chronos, an open-source CLI tool that standardizes benchmarking of MLX inference engines on Apple Silicon by measuring TTFT, throughput, memory usage, and thermal state, with a community leaderboard for sharing results.
mlx-code is a Python package that provides a local-first LLM coding agent for Apple Silicon, bundling an MLX inference server, multi-protocol API support, git worktree isolation, and composable multi-agent primitives.
pibot is now fully local, using Parakeet for STT, Qwen3-tts for TTS, and Qwen 3.6 as the local multimodal LLM via llama.cpp, with Rust/mlx-c-based inference engines, achieving zero Python dependencies.
Mininglamp AI released Cider, a small SDK that adds W8A8 activation quantization to Apple's MLX framework, achieving up to 1.84x speedup on prefill for large language models on M5 Pro via custom Metal kernels. The tool works with any MLX model, with INT8 TensorOps support for M5 and above.
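For readers unfamiliar with the W8A8 scheme mentioned in the Cider item, the idea can be sketched in a few lines of plain Python. This is a generic illustration only, not Cider's API or its Metal kernels: the function names (`quantize_int8`, `w8a8_matvec`), the symmetric per-tensor activation scale, and the per-row weight scale are all assumptions chosen for clarity.

```python
def quantize_int8(values):
    """Symmetric quantization (assumed scheme): map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values) or 1.0  # avoid a zero scale for all-zero input
    scale = max_abs / 127.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def w8a8_matvec(weight_rows, activations):
    """Matrix-vector product with 8-bit weights (W8) and 8-bit activations (A8)."""
    x_q, x_scale = quantize_int8(activations)  # one scale for the activation vector
    out = []
    for row in weight_rows:
        w_q, w_scale = quantize_int8(row)  # one scale per output row (hypothetical choice)
        acc = sum(w * x for w, x in zip(w_q, x_q))  # integer multiply-accumulate
        out.append(acc * w_scale * x_scale)  # rescale the integer sum back to float
    return out


def float_matvec(weight_rows, activations):
    """Full-precision reference for comparison."""
    return [sum(w * x for w, x in zip(row, activations)) for row in weight_rows]


weights = [[0.5, -1.0, 0.25], [1.5, 0.75, -0.5]]
x = [2.0, -1.0, 0.5]
approx = w8a8_matvec(weights, x)
exact = float_matvec(weights, x)
```

The inner loop runs entirely on small integers, which is the part dedicated INT8 hardware can accelerate; the quantized result differs from the full-precision one only by a small rounding error.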