@awnihannun: Three MLX videos dropped at WWDC: Running agents locally by @angeloskath https://youtube.com/watch?v=wykPErJ8M-8… Distr…

X AI KOLs Following News

Summary

Three MLX videos from WWDC demonstrate running AI agents entirely locally on Apple Silicon using the MLX stack, including local inference, tool calling, and distributed inference across Macs, enabling no-cloud, offline AI workflows.

Three MLX videos dropped at WWDC: Running agents locally by @angeloskath https://youtube.com/watch?v=wykPErJ8M-8… Distributed inference and training by Tatiana Likhomanenko https://youtube.com/watch?v=CzgK02zsRg4… MLX Swift by David Koski https://youtube.com/watch?v=KCL8f9ztKFk…
Original Article
View Cached Full Text

Cached at: 06/09/26, 08:59 AM

Three MLX videos dropped at WWDC: Running agents locally by @angeloskath https://youtube.com/watch?v=wykPErJ8M-8… Distributed inference and training by Tatiana Likhomanenko https://youtube.com/watch?v=CzgK02zsRg4… MLX Swift by David Koski https://youtube.com/watch?v=KCL8f9ztKFk…


TL;DR: Build and run agent-based AI workflows entirely locally on Mac using MLX — no cloud or API keys required, data stays on your device, fully offline.

How Local Agent AI Works

Over the past year, AI agents have evolved from research prototypes into everyday productivity tools. The traditional chat experience works like this: you give a prompt to a language model and it returns a response; you then manually act on that response. Agent workflows are different: you talk to the agent, the agent talks to the model, decides what to do next, calls tools (run commands, read files, call APIs), observes the results, goes back to the model, and loops until the task is complete.

On Apple Silicon, the entire loop can run entirely locally. Data stays on your machine, AI is always available, and there are no usage costs.

The Four Layers of the Stack

The stack that drives local agent AI has four layers (from bottom to top):

  1. MLX: An open-source array framework built specifically for Apple Silicon, handling low-level computation, Metal acceleration, and memory management.
  2. MLX-LM: Provides everything needed to load, run, quantize, and fine-tune large language models. Supports thousands of models from Hugging Face, offers a CLI tool and Python API.
  3. MLX-LM Server: An HTTP server compatible with OpenAI’s API, exposing local models through a standard API. Supports structured tool calling and reasoning models (step-by-step analysis of complex problems), and is a plug-and-play replacement for any cloud LLM API.
  4. The Agent Itself: Any framework or tool that follows the OpenAI chat completions protocol — such as Xcode, OpenCode, Pi agent, custom scripts, etc. Because MLX-LM Server provides a standard interface, agent frameworks work out of the box.

It’s worth noting that many popular applications and tools (like Ollama, LM Studio, vLLM) are also built on top of MLX and MLX-LM.

Three Steps to Set Up a Local Agent

  1. Install MLX-LM: pip install mlx-lm
  2. Start the server: Run mlx_lm.server, using a model that supports tool calls (start with a small model for testing).
  3. Point your agent to the local server: Set the base URL to localhost:8080 in your agent framework.

For example, in OpenCode’s configuration: define a local provider, set the URL to localhost, specify the model name, then tell OpenCode to use this local model for all requests.

Three Key Challenges Accelerated by MLX for Agents

Prompt Processing (Neural Accelerator)

In an agent loop, the model must process a lot of new context after each tool output. Sessions often contain hundreds of thousands of tokens. The dedicated Neural Accelerator introduced in the M5 chip makes matrix multiplication 4× faster than the M4. MLX’s specialized matrix multiplication and attention kernels translate directly into faster prompt processing — the agent can read codebases or process tool results almost four times as fast. This all happens automatically, no special parameters or code changes needed.

Concurrency (Continuous Batching)

When multiple sub-agents work in parallel, they send requests simultaneously. MLX-LM Server uses continuous batching: incoming requests are dynamically grouped and processed together on the GPU, and new requests can join an ongoing batch. This way sub-agents don’t wait in a queue, and the entire workflow stays fluid.

Model Size (Distributed Inference)

When a single Mac doesn’t have enough memory (e.g., DeepSeek’s 1.6 trillion parameter model — weights alone exceed 800 GB), MLX allows distributing the model across multiple Macs connected via Thunderbolt or Ethernet. This not only makes running larger models possible, but also parallelizes prompt processing across devices, accelerating the agent loop. Thunderbolt RDMA is supported starting in macOS 26.2, and with four nodes speedups of up to 3× are possible. Use mlx.launch with a hostfile to start a distributed server.

Live Demo: Building an App from Scratch and Fixing Bugs

Demo 1: Building a SwiftUI Drawing App

Starting from a blank Xcode project, the agent is asked to build a drawing app for iPad. The agent looks at the current directory, makes a plan, writes files, then builds the app and fixes errors. The first version was generated in about two minutes, fully functional. Then the agent was asked to add round end caps — it edited the code and recompiled successfully.

Demo 2: Fixing a Bug in Xcode

Connect Xcode to an MLX server already running locally (Settings → Intelligence → Add Local Hosted Provider, port 8080). A bug was intentionally introduced into the project, and the model was asked to fix it. The model identified the bug, checked the surrounding code, wrote a fix, and then the project could be built and run. Throughout the entire process, the code never left the Mac.

Get Started

Install MLX-LM, start the server, and point your favorite agent to it. All demo content is open source and ready to use now.

Source: https://www.youtube.com/watch?v=wykPErJ8M-8

Similar Articles

New MLX LM Server From Apple

Reddit r/LocalLLaMA

Apple's MLX team introduces MLX LM Server, a tool for running AI agent workflows fully locally on Mac, supporting continuous batching, distributed inference, and M5 neural acceleration, with no need for cloud or API keys.