@PyTorch: ExecuTorch now has an MLX delegate that runs PyTorch models on Apple Silicon GPUs. It supports LLMs, speech-to-text, an…

X AI KOLs Following 05/18/26, 04:00 PM Tools

executorch mlx pytorch apple-silicon gpu-inference quantization open-source

Summary

ExecuTorch now has an MLX delegate that enables GPU-accelerated inference for PyTorch models on Apple Silicon Macs, supporting LLMs, speech-to-text, and MoE models with quantization via TorchAO.

ExecuTorch now has an MLX delegate that runs PyTorch models on Apple Silicon GPUs. It supports LLMs, speech-to-text, and MoE models with quantization via TorchAO. Export with torch.export, run on Metal. Read our latest blog: https://pytorch.org/blog/running-pytorch-models-on-apple-silicon-gpus-with-the-executorch-mlx-delegate/…


Cached at: 05/18/26, 04:34 PM

Running PyTorch Models on Apple Silicon GPUs with the ExecuTorch MLX Delegate – PyTorch

Source: https://pytorch.org/blog/running-pytorch-models-on-apple-silicon-gpus-with-the-executorch-mlx-delegate/


TL;DR: Introducing the ExecuTorch MLX Delegate

The new MLX delegate enables optimized, GPU-accelerated inference for PyTorch models on Apple Silicon Macs, using Apple’s MLX framework.
The delegate seamlessly integrates with the PyTorch 2 export stack and supports a wide range of quantization options (BF16, FP16, FP32, 2/4/8-bit affine, NVFP4).
It supports various models, including dense transformers (Llama, Qwen, Gemma), sparse Mixture-of-Experts, and speech-to-text models (Whisper, Voxtral, Parakeet) for both offline and real-time transcription.
Note: The MLX delegate is currently experimental.

Apple Silicon has become a popular platform for running large language models locally. Until now, ExecuTorch users on macOS were limited to CPU-based backends like XNNPACK or the AOTI Metal backend. Now we’ve released the MLX delegate, which brings fully optimized GPU-accelerated inference to Apple Silicon Macs through Apple’s MLX framework.

In this post we’ll cover what the MLX delegate is, why we built it as an ExecuTorch backend, and what you can run with it today.

**Note:** The MLX delegate is currently experimental and under active development. APIs and supported features may change.

What is the MLX Delegate?

The MLX delegate is a new ExecuTorch backend that compiles and runs PyTorch models on Apple Silicon GPUs. You export your model using the standard ExecuTorch pipeline, and the delegate handles the rest: partitioning the graph, serializing it into an optimized format, and dispatching operations to MLX’s Metal GPU kernels at runtime.

From the user’s perspective, the workflow is the same as any other ExecuTorch backend:

Export your model with torch.export
Lower it with to_edge_transform_and_lower using the MLXPartitioner
Run the resulting .pte file with the ExecuTorch runtime
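
The three steps above can be sketched in Python roughly as follows. This is a minimal sketch, not the delegate's documented recipe: `torch.export.export` and `to_edge_transform_and_lower` are the standard export entry points, but the import path `executorch.backends.mlx` for `MLXPartitioner`, its no-argument constructor, and the helper name `export_for_mlx` are assumptions made for illustration. The MLX Delegate README is the place to confirm the exact usage.

```python
def export_for_mlx(model, example_inputs, out_path="model_mlx.pte"):
    """Export `model` to a .pte file lowered to the MLX delegate (sketch)."""
    # Imports are kept inside the function so the sketch can be loaded
    # even on a machine where torch / executorch are not installed.
    import torch
    from executorch.exir import to_edge_transform_and_lower

    # ASSUMPTION: the import path and the no-argument constructor below are
    # illustrative guesses; verify them against the MLX Delegate README.
    from executorch.backends.mlx import MLXPartitioner

    # Step 1: capture the model graph with torch.export.
    exported = torch.export.export(model.eval(), example_inputs)

    # Step 2: lower to the edge dialect and hand supported subgraphs
    # to the MLX partitioner.
    edge = to_edge_transform_and_lower(
        exported,
        partitioner=[MLXPartitioner()],
    )

    # Step 3: serialize the program into a .pte file for the runtime.
    program = edge.to_executorch()
    with open(out_path, "wb") as f:
        f.write(program.buffer)
    return out_path
```

Here `example_inputs` is a tuple of example tensors matching the model's forward signature, as `torch.export.export` expects.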

The delegate currently supports around 90 ATen ops, covering the full range of operations needed for transformer inference: quantized matmul, multi-head attention, rotary position embeddings, mixture-of-experts routing, recurrent state-space operations, and more.

Why Build This as an ExecuTorch Delegate?

There are already excellent tools for running models on Apple Silicon, including MLX’s own mlx-lm. So why build another one? Three reasons:

**Performance.** The MLX delegate achieves 3-6x higher throughput on generative AI workloads compared to existing ExecuTorch delegates on macOS. Moving inference to MLX’s optimized Metal kernels makes a meaningful difference for ExecuTorch applications like chat and real-time transcription.

**PyTorch 2 integration.** The delegate plugs directly into the PyTorch 2 export stack. It uses torch.export for graph capture and TorchAO for quantization, the same tools used by every other ExecuTorch backend. If you can export a model with torch.export, you can run it on MLX. When new models or quantization techniques land in PyTorch, they become available to the MLX delegate without additional work.

**Portable applications.** ExecuTorch provides a single runtime API across all backends. An application built against the ExecuTorch C++ or Python runtime can run models exported for MLX, XNNPACK, CoreML, Vulkan, or CUDA without changing application code.
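
To illustrate the portability point, here is a minimal sketch of loading and running a .pte file through the ExecuTorch Python runtime. The application code does not name a backend; the delegate is chosen at export time and recorded in the file. The helper name `run_pte` is hypothetical, and whether a given ExecuTorch installation ships with the MLX backend enabled is an assumption to check against the install instructions.

```python
def run_pte(pte_path, inputs, method_name="forward"):
    """Load a .pte program and execute one method on `inputs` (sketch).

    `inputs` is a sequence of tensors matching the exported method's
    signature. Nothing here is backend-specific: the same call works
    whichever delegate the file was exported for, provided that backend
    is available in the installed runtime.
    """
    # Imported lazily so the sketch loads without executorch installed.
    from executorch.runtime import Runtime

    runtime = Runtime.get()
    program = runtime.load_program(pte_path)
    method = program.load_method(method_name)
    return method.execute(inputs)
```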

Quantization and Dtype Support

The delegate supports the precision and quantization options you’d expect for on-device inference:

BF16, FP16, and FP32 for weights and activations
2, 4, and 8-bit affine quantization via TorchAO’s quantize_ API. This uses the same quantization scheme as the XNNPACK and Vulkan backends, which means a single quantized model definition can target multiple backends, and opens the door to fat PTE files that run on whichever backend is available at runtime.
NVFP4 quantization using NVIDIA’s FP4 data type
Tied quantized embeddings for models that share weights between the embedding layer and the language model head
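
For readers unfamiliar with affine quantization, the idea can be sketched in a few lines of plain Python. This is a generic illustration of the technique, not TorchAO's or MLX's implementation: each group of weights is stored as small integer codes plus a per-group scale and offset. The function names and the float-offset convention are assumptions chosen for clarity.

```python
def quantize_group(values, bits=4):
    """Affine-quantize one group of floats to unsigned `bits`-bit codes."""
    qmax = (1 << bits) - 1                 # e.g. 15 for 4-bit codes
    lo, hi = min(values), max(values)
    scale = (hi - lo) / qmax or 1.0        # avoid a zero scale for constant groups
    # Map each value to the nearest code and clamp to the valid range.
    codes = [min(qmax, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo                # `lo` acts as the affine offset


def dequantize_group(codes, scale, offset):
    """Reconstruct approximate floats: value ~= code * scale + offset."""
    return [c * scale + offset for c in codes]


weights = [-0.8, -0.1, 0.0, 0.25, 0.7, 1.2, -0.45, 0.9]
codes, scale, offset = quantize_group(weights, bits=4)
restored = dequantize_group(codes, scale, offset)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(f"scale={scale:.4f} offset={offset:.2f} max_error={max_error:.4f}")
```

The reconstruction error of each weight is bounded by half the scale, which is why fewer bits (a coarser scale) trade accuracy for memory.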

What Models Can I Run?

We’ve validated the delegate across a range of architectures:

Large Language Models

Dense transformers work out of the box, with support for both full KV caches and sliding window caches:

Llama 3.2 1B
Qwen 3 (0.6B, 1.7B, 4B)
Phi-4 mini (3.8B)
Gemma 3 (1B, 4B) with sliding window attention
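
As a reminder of what sliding window attention means, the sketch below builds the boolean attention mask in plain Python. It is a generic illustration, not the delegate's kernel: with a full KV cache every query position may attend to all earlier positions, whereas with a window of size `window` it attends only to the most recent `window` positions (itself included). The function name is hypothetical.

```python
def sliding_window_mask(seq_len, window):
    """Return mask[q][k] == True when query q may attend to key k.

    A key is visible if it is not in the future (k <= q) and lies
    within the last `window` positions (q - k < window).
    """
    return [
        [0 <= q - k < window for k in range(seq_len)]
        for q in range(seq_len)
    ]


# Render a small mask: 'x' marks a visible key, '.' a masked one.
for row in sliding_window_mask(seq_len=5, window=3):
    print("".join("x" if visible else "." for visible in row))
```

Because no query ever looks further back than `window` positions, the cache only needs to keep that many entries per layer, regardless of sequence length.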

Sparse Mixture-of-Experts models are supported through custom gather operations that efficiently route tokens to the correct experts on the GPU:

Qwen 3.5 35B-A3B: 256 experts with top-8 routing, combining GatedDeltaNet linear attention layers with full SDPA attention layers
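
Top-k expert routing can be sketched in plain Python as follows. This is a generic illustration of the technique, not the delegate's gather kernels or the model's exact router: softmax over the router logits, selection of the `top_k` highest-scoring experts, and renormalization of the selected weights are assumptions about a typical MoE layer. The function name is hypothetical.

```python
import math


def route_token(router_logits, top_k):
    """Pick the top_k experts for one token and return (index, weight) pairs."""
    # Numerically stable softmax over all experts.
    peak = max(router_logits)
    exps = [math.exp(x - peak) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the top_k most probable experts, highest first.
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]

    # Renormalize so the selected experts' weights sum to 1.
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]


picks = route_token([0.1, 2.0, -1.0, 1.5], top_k=2)
for expert, weight in picks:
    print(f"expert {expert}: weight {weight:.3f}")
```

Only the selected experts' weights need to be read for a given token, which is why a model with 256 experts and top-8 routing does far less work per token than its total parameter count suggests.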

Speech-to-Text

Offline transcription models process a complete audio recording and return the transcript:

OpenAI Whisper (tiny through large-v3-turbo)
NVIDIA Parakeet TDT (0.6B) with word-level timestamps
Mistral Voxtral (3B)

Real-time streaming transcription processes audio in small chunks as it arrives, enabling live use cases:

Mistral Voxtral Realtime (4B) with live microphone input, ring buffer KV caches, and sliding window attention
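
The ring buffer idea behind a streaming KV cache can be sketched in plain Python. This is a generic illustration, not the delegate's implementation: a fixed number of slots is allocated once, and each new entry overwrites the oldest one, so memory stays constant however long the audio stream runs. The class and method names are hypothetical.

```python
class RingBufferCache:
    """Fixed-capacity cache that overwrites its oldest entry when full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = [None] * capacity
        self.count = 0  # total number of entries ever appended

    def append(self, entry):
        # The write position wraps around once the buffer is full.
        self.slots[self.count % self.capacity] = entry
        self.count += 1

    def window(self):
        """Return the retained entries in order, oldest first."""
        kept = min(self.count, self.capacity)
        start = self.count - kept
        return [self.slots[i % self.capacity] for i in range(start, self.count)]


cache = RingBufferCache(capacity=3)
for step in range(5):
    cache.append(f"kv{step}")
print(cache.window())
```

Paired with sliding window attention, the retained entries are the only positions a new query is allowed to attend to, so nothing useful is lost when old entries are overwritten.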

Broader Coverage

Beyond these flagship models, over 30 additional models have been validated through our backend test suites, covering dense transformers, encoder-decoder architectures, and vision models.

Getting Started

Each supported model has a README with detailed export and inference instructions:

LLMs via HuggingFace: covers Llama, Qwen, and Gemma using optimum-executorch
LLMs via export_llm: covers Phi-4 and Stories 110M using the Hydra-based pipeline
Qwen 3.5 MoE: covers the sparse MoE export with `--backend mlx`
Voxtral Realtime: covers streaming and offline speech-to-text
Parakeet: covers speech recognition with timestamps
Whisper: covers OpenAI’s speech recognition models

For an overview of the delegate architecture, supported operations, and development guide, see the MLX Delegate README.

We’d love to hear what models and use cases matter most to you. If you run into issues or have feature requests, please open an issue on the ExecuTorch GitHub repo or join our Discord Channel.

