@awnihannun: Three MLX videos dropped at WWDC: Running agents locally by @angeloskath https://youtube.com/watch?v=wykPErJ8M-8… Distr…
Summary
Three MLX videos from WWDC demonstrate running AI agents entirely locally on Apple Silicon using the MLX stack, including local inference, tool calling, and distributed inference across Macs, enabling no-cloud, offline AI workflows.
View Cached Full Text
Cached at: 06/09/26, 08:59 AM
Three MLX videos dropped at WWDC: Running agents locally by @angeloskath https://youtube.com/watch?v=wykPErJ8M-8… Distributed inference and training by Tatiana Likhomanenko https://youtube.com/watch?v=CzgK02zsRg4… MLX Swift by David Koski https://youtube.com/watch?v=KCL8f9ztKFk…
TL;DR: Build and run agent-based AI workflows entirely locally on Mac using MLX — no cloud or API keys required, data stays on your device, fully offline.
How Local Agent AI Works
Over the past year, AI agents have evolved from research prototypes into everyday productivity tools. The traditional chat experience works like this: you give a prompt to a language model and it returns a response; you then manually act on that response. Agent workflows are different: you talk to the agent, the agent talks to the model, decides what to do next, calls tools (run commands, read files, call APIs), observes the results, goes back to the model, and loops until the task is complete.
On Apple Silicon, the entire loop can run entirely locally. Data stays on your machine, AI is always available, and there are no usage costs.
The Four Layers of the Stack
The stack that drives local agent AI has four layers (from bottom to top):
- MLX: An open-source array framework built specifically for Apple Silicon, handling low-level computation, Metal acceleration, and memory management.
- MLX-LM: Provides everything needed to load, run, quantize, and fine-tune large language models. Supports thousands of models from Hugging Face, offers a CLI tool and Python API.
- MLX-LM Server: An HTTP server compatible with OpenAI’s API, exposing local models through a standard API. Supports structured tool calling and reasoning models (step-by-step analysis of complex problems), and is a plug-and-play replacement for any cloud LLM API.
- The Agent Itself: Any framework or tool that follows the OpenAI chat completions protocol — such as Xcode, OpenCode, Pi agent, custom scripts, etc. Because MLX-LM Server provides a standard interface, agent frameworks work out of the box.
It’s worth noting that many popular applications and tools (like Ollama, LM Studio, vLLM) are also built on top of MLX and MLX-LM.
Three Steps to Set Up a Local Agent
- Install MLX-LM:
pip install mlx-lm - Start the server: Run
mlx_lm.server, using a model that supports tool calls (start with a small model for testing). - Point your agent to the local server: Set the base URL to
localhost:8080in your agent framework.
For example, in OpenCode’s configuration: define a local provider, set the URL to localhost, specify the model name, then tell OpenCode to use this local model for all requests.
Three Key Challenges Accelerated by MLX for Agents
Prompt Processing (Neural Accelerator)
In an agent loop, the model must process a lot of new context after each tool output. Sessions often contain hundreds of thousands of tokens. The dedicated Neural Accelerator introduced in the M5 chip makes matrix multiplication 4× faster than the M4. MLX’s specialized matrix multiplication and attention kernels translate directly into faster prompt processing — the agent can read codebases or process tool results almost four times as fast. This all happens automatically, no special parameters or code changes needed.
Concurrency (Continuous Batching)
When multiple sub-agents work in parallel, they send requests simultaneously. MLX-LM Server uses continuous batching: incoming requests are dynamically grouped and processed together on the GPU, and new requests can join an ongoing batch. This way sub-agents don’t wait in a queue, and the entire workflow stays fluid.
Model Size (Distributed Inference)
When a single Mac doesn’t have enough memory (e.g., DeepSeek’s 1.6 trillion parameter model — weights alone exceed 800 GB), MLX allows distributing the model across multiple Macs connected via Thunderbolt or Ethernet. This not only makes running larger models possible, but also parallelizes prompt processing across devices, accelerating the agent loop. Thunderbolt RDMA is supported starting in macOS 26.2, and with four nodes speedups of up to 3× are possible. Use mlx.launch with a hostfile to start a distributed server.
Live Demo: Building an App from Scratch and Fixing Bugs
Demo 1: Building a SwiftUI Drawing App
Starting from a blank Xcode project, the agent is asked to build a drawing app for iPad. The agent looks at the current directory, makes a plan, writes files, then builds the app and fixes errors. The first version was generated in about two minutes, fully functional. Then the agent was asked to add round end caps — it edited the code and recompiled successfully.
Demo 2: Fixing a Bug in Xcode
Connect Xcode to an MLX server already running locally (Settings → Intelligence → Add Local Hosted Provider, port 8080). A bug was intentionally introduced into the project, and the model was asked to fix it. The model identified the bug, checked the surrounding code, wrote a fix, and then the project could be built and run. Throughout the entire process, the code never left the Mac.
Get Started
Install MLX-LM, start the server, and point your favorite agent to it. All demo content is open source and ready to use now.
Source: https://www.youtube.com/watch?v=wykPErJ8M-8
Similar Articles
@awnihannun: The video from @angeloskath on local agentic AI with MLX is excellent. I also hear it's one of the most viewed videos i…
A tweet highlights an excellent WWDC video by Angelos Kath on building local agentic AI with MLX, noting rapid progress in open-weight models and hardware capabilities.
New MLX LM Server From Apple
Apple's MLX team introduces MLX LM Server, a tool for running AI agent workflows fully locally on Mac, supporting continuous batching, distributed inference, and M5 neural acceleration, with no need for cloud or API keys.
I've created the fastest local AI engine for Apple Silicon. Optimised for agentic use.
The author announces the release of 'lightning-mlx', a local AI engine optimized for Apple Silicon that achieves high token speeds for coding agents and tool-calling workflows.
@AlexJonesax: Two open-source MLX inference servers worth knowing about if you run LLMs on Mac: MTPLX (@youssofal) Uses a model's own…
This article highlights two open-source MLX inference servers for Mac: MTPLX, which optimizes token speed using speculative decoding without a draft model, and oMLX, which improves workflow efficiency with persistent KV caches for coding agents.
@julien_c: and is Apple Silicon the King of Local AI?
Discussion on whether Apple Silicon is the best hardware for running local AI models, referencing a linked article or thread.