ExecuTorch -- A Unified PyTorch Solution to Run AI Models On-Device


Summary

This article introduces ExecuTorch, a unified PyTorch-native deployment framework designed to run AI models on diverse edge devices without requiring model conversion or reimplementation.

arXiv:2605.08195v1

# 1 Introduction
Source: [https://arxiv.org/html/2605.08195](https://arxiv.org/html/2605.08195)


Anonymous Authors1

###### Abstract

Local execution of AI on edge devices is important for low latency and offline operation\. However, deploying models on diverse hardware remains fragmented, often requiring model conversion or complete reimplementation outside the PyTorch ecosystem where the model was originally authored\. We introduce ExecuTorch, a unified PyTorch\-native deployment framework for edge AI\. ExecuTorch enables seamless deployment of machine learning models across heterogeneous compute environments\. It scales from embedded microcontrollers to complex system\-on\-chips \(SoCs\) with dedicated accelerators, powering devices ranging from wearables and smartphones to large compute clusters\. ExecuTorch preserves PyTorch semantics while allowing customization, support for optimizations like quantization, and pluggable execution “backends”\. These features together enable fast experimentation, allowing researchers to validate deployment behavior entirely within PyTorch, bridging the gap between research and production\.

1 Anonymous Institution, Anonymous City, Anonymous Region, Anonymous Country\. Correspondence to: Anonymous Author <anon\.email@domain\.com\>\.

Preliminary work\. Under review by the Machine Learning and Systems \(MLSys\) Conference\. Do not distribute\.

Local execution of AI on edge devices is critical for applications that demand low latency or offline operation, from live translation to autonomous vehicles and patient monitoring Wang and Jia \([2025](https://arxiv.org/html/2605.08195#bib.bib43)\); Wang et al\. \([2025](https://arxiv.org/html/2605.08195#bib.bib44)\); Ng et al\. \([2025](https://arxiv.org/html/2605.08195#bib.bib45)\); Kuo et al\. \([2020](https://arxiv.org/html/2605.08195#bib.bib46)\); Xu et al\. \([2021](https://arxiv.org/html/2605.08195#bib.bib47)\); Sperling and Ernst \([2024](https://arxiv.org/html/2605.08195#bib.bib48)\); Nigade et al\. \([2024](https://arxiv.org/html/2605.08195#bib.bib49)\); Kang et al\. \([2024](https://arxiv.org/html/2605.08195#bib.bib50)\); Pons et al\. \([2023](https://arxiv.org/html/2605.08195#bib.bib51)\)\. Continued advances in model architectures Lin et al\. \([2024](https://arxiv.org/html/2605.08195#bib.bib25)\); Vasu et al\. \([2023](https://arxiv.org/html/2605.08195#bib.bib53);[2024](https://arxiv.org/html/2605.08195#bib.bib54)\) and specialized accelerators such as NPUs Ahsan et al\. \([2025](https://arxiv.org/html/2605.08195#bib.bib55)\) have made on\-device inference increasingly practical\. However, moving from research to production remains fragmented Wang and Jia \([2025](https://arxiv.org/html/2605.08195#bib.bib43)\); Wang et al\. \([2025](https://arxiv.org/html/2605.08195#bib.bib44)\): although PyTorch powers over 70% Foundation \([2024](https://arxiv.org/html/2605.08195#bib.bib71)\) of AI research, ML developers must either leave the PyTorch environment for platform\-specific tools or accept performance and portability trade\-offs\.

Deploying AI models on edge devices has spawned numerous solutions, yet existing frameworks suffer from key limitations:

- Model conversion between disconnected authoring environments with different semantics \(e\.g\., ONNX Microsoft \([2018](https://arxiv.org/html/2605.08195#bib.bib58)\), TensorFlow Lite David et al\. \([2021](https://arxiv.org/html/2605.08195#bib.bib56)\)\)
- Forced reimplementation in framework\-specific formats \(e\.g\., llama\.cpp Gerganov \([2023](https://arxiv.org/html/2605.08195#bib.bib61)\)\)
- Tight coupling to specific hardware vendors \(e\.g\., Qualcomm SNPE Qualcomm Technologies, Inc\. \([2016](https://arxiv.org/html/2605.08195#bib.bib63)\), Apple CoreML Apple Inc\. \([2017](https://arxiv.org/html/2605.08195#bib.bib57)\)\)
- Prohibitive runtime overhead \(PyTorch Mobile\)

These factors create friction in the experimentation\-deployment loop\. A unified workflow is needed that preserves PyTorch semantics from training to production across all devices without sacrificing performance for portability\.

![Refer to caption](https://arxiv.org/html/2605.08195v1/imgs/intro-bold.png)

Figure 1: Users can bring PyTorch models into ExecuTorch for compilation and optimization \(both backend\-agnostic and backend\-specific\) to generate a PTE file that runs on platforms from 0\.01 to 800 watts\.

ExecuTorch addresses this challenge by providing a PyTorch\-native development environment as shown in Figure [1](https://arxiv.org/html/2605.08195#S1.F1)\. It leverages PyTorch’s core technology underlying `torch.compile` and `torch.export` to construct a portable representation of the original PyTorch model, which can be deployed as a binary across diverse hardware and exploit device\-specific kernel optimizations\. To do so, ExecuTorch implements infrastructure for backend delegation, allowing different parts of the model to run on the most suitable hardware for a given device\. ExecuTorch performs ahead\-of\-time \(AOT\) graph\-level compilation, which greatly reduces interpreter overhead and runtime dependencies compared to previous approaches such as TorchScript PyTorch Team \([2019](https://arxiv.org/html/2605.08195#bib.bib65)\)\. As a result, researchers can validate quantization, profile performance, and debug models within PyTorch before deployment\.

One advantage of ExecuTorch is experimentation velocity: researchers can validate runtime behavior entirely within PyTorch\. Other frameworks’ pipelines require model conversion and separate validation, often leading to numerical mismatches and resource\-intensive debug cycles\. ExecuTorch eliminates this gap by using `torch.export` to generate execution graphs that can run directly in PyTorch eager mode while capturing a faithful representation that can be executed on\-device\. For example, a quantized LLM \(Large Language Model\) can be tested and debugged in PyTorch, then deployed to a phone’s NPU, with confidence that its behavior will match almost exactly\.

This parity is achieved through several technologies working in tandem\. First, `torch.export` converts models into hardware\-agnostic AOT graphs built from a small set of fewer than 300 Core ATen primitives, reducing the burden for edge environments\. These graphs remove Python dependencies but retain debug symbols and, unlike ONNX, remain executable in PyTorch for pre\-deployment validation\. Second, ExecuTorch supports selective backend delegation, allowing parts of a model to run on specialized accelerators like Qualcomm Hexagon or Apple Neural Engine, with CPU fallback as needed\. Both AOT and just\-in\-time compilation modes are supported, and hardware vendors can integrate via a clean API without modifying the core runtime\. Third, quantization\-aware export makes post\-training quantization and quantization\-aware training first\-class steps\. Backends declare their capabilities, and ExecuTorch applies quantization accordingly, ensuring that quantization validated in PyTorch matches on\-device execution\.

ExecuTorch also addresses memory constraints of on\-device LLM deployment through techniques such as KV\-cache quantization, sliding\-window attention, and 4\-bit group\-wise weight quantization, which reduces model size by 50% Meta AI \([2024](https://arxiv.org/html/2605.08195#bib.bib66)\)\. These techniques operate on the model graph produced by `torch.export`, which allows for evaluation of accuracy and memory trade\-offs prior to deployment on a mobile device\.

We provide a comprehensive evaluation of ExecuTorch’s latency and throughput across large language models \(LLMs\) and traditional computer vision models on mobile phones, spanning CPUs, GPUs, and NPUs, and compare it against widely used alternatives such as ONNX Runtime, llama\.cpp, and LiteRT\. Across devices and workloads, we find that ExecuTorch delivers competitive or state\-of\-the\-art performance across heterogeneous backends: it is consistently strong on CPU \(via XNNPACK\), achieves high token\-generation throughput on mobile GPUs \(via Vulkan\), and unlocks substantial prefill speedups on NPUs \(via QNN\) when models are well\-delegated, while matching native performance on iOS through CoreML when full\-graph delegation is available\.

Beyond performance, ExecuTorch provides production\-critical customization at two levels: runtime extensions allow developers to implement custom kernels for specialized operations, selectively build operators to reduce binary size, and create custom data loaders for embedded systems; AOT extensions support memory planning and target\-specific compiler passes, enabling further optimization for hardware constraints\.

This paper makes three principal contributions\. First, we present ExecuTorch, the first PyTorch framework to achieve experimentation parity through locally executable export graphs, enabling unified deployment from microcontrollers to smartphones without model conversion or reimplementation\. Second, we introduce experimentation parity as a design principle: through `torch.export` and capability\-driven backend integration, researchers validate deployment behavior \(quantization, hardware delegation, performance\) within PyTorch before committing to production\. Third, we demonstrate production viability at scale: ExecuTorch powers billions of daily inferences across Meta’s family of apps Meta \([2025a](https://arxiv.org/html/2605.08195#bib.bib73)\) and Reality Labs \(e\.g\., Ray\-Ban smart glasses\) Meta \([2025b](https://arxiv.org/html/2605.08195#bib.bib74)\)\. It supports execution of vision, audio, and large language models on 12 hardware backends, enabling efficient inference on mobile, embedded, and desktop devices\. ExecuTorch demonstrates that researchers need not choose between PyTorch’s development velocity and edge deployment requirements: a unified workflow can deliver both\.

## 2 Related Work

Edge AI deployment frameworks make different trade\-offs between development velocity, performance, and portability\. We evaluate existing approaches through the lens of experimentation parity, i\.e\., the ability to validate deployment behavior within the model development and training environment\.

Conversion\-based frameworks such as ONNX Runtime Microsoft \([2018](https://arxiv.org/html/2605.08195#bib.bib58)\) and TensorFlow Lite David et al\. \([2021](https://arxiv.org/html/2605.08195#bib.bib56)\) decouple training and deployment through intermediate representations, but the conversion step introduces semantic gaps that surface only after deployment\. For example, QAT in PyTorch may not translate faithfully to ONNX’s quantization semantics\. Early frameworks such as Caffe Jia et al\. \([2014](https://arxiv.org/html/2605.08195#bib.bib29)\) enabled C\+\+\-based on\-device deployment but required complete model authoring within their ecosystem\.

Compiler\-based approaches such as TVM Chen et al\. \([2018](https://arxiv.org/html/2605.08195#bib.bib59)\) and MNN Jiang et al\. \([2020](https://arxiv.org/html/2605.08195#bib.bib60)\) generate optimized kernels through domain\-specific compilation but require learning separate toolchains, tuning procedures, and debugging workflows distinct from PyTorch\.

Vendor\-specific runtimes like Apple’s CoreML Apple Inc\. \([2017](https://arxiv.org/html/2605.08195#bib.bib57)\), Qualcomm’s SNPE Qualcomm Technologies, Inc\. \([2016](https://arxiv.org/html/2605.08195#bib.bib63)\), and Apple’s MLX Hannun et al\. \([2023](https://arxiv.org/html/2605.08195#bib.bib20)\) deliver excellent platform\-specific performance but fragment the deployment landscape, requiring parallel implementations for multi\-platform support\.

PyTorch Mobile PyTorch Team \([2019](https://arxiv.org/html/2605.08195#bib.bib65)\) and TorchScript PyTorch Foundation \([2021](https://arxiv.org/html/2605.08195#bib.bib16)\) attempted PyTorch\-native deployment but were limited by high memory footprint and narrow hardware integration\.

Model\-specific runtimes such as llama\.cpp Gerganov \([2023](https://arxiv.org/html/2605.08195#bib.bib61)\) and vLLM Kwon et al\. \([2023](https://arxiv.org/html/2605.08195#bib.bib70)\) achieve strong performance through architecture\-specific optimization but require complete reimplementation outside the training framework, breaking iteration velocity\. vLLM also requires a Python runtime, which is not feasible in embedded systems\.

## 3 Architecture

ExecuTorch’s export APIs and lean runtime allow for seamless deployment of PyTorch models on any target device, as shown in Figure [1](https://arxiv.org/html/2605.08195#S1.F1)\. The key design goals are to offer:

- Unified and Portable Runtime – A lightweight runtime with minimal dependencies and execution overhead, ensuring maximum portability and consistent behavior across deployment environments\.
- Composable and Extensible Architecture – Modular interfaces for backends, graph transform passes, and quantizers allow hardware vendors and developers to plug in custom components without modifying the core runtime\.
- Efficient Model Execution – Leverage device capabilities and access to accelerators \(CPU, GPU, DSP/NPU\), as well as architectural optimizations such as quantization and memory planning, to minimize memory footprint and latency\.

The two main components of ExecuTorch are the AOT export stack and the Runtime stack \(Figure [2](https://arxiv.org/html/2605.08195#S3.F2)\)\.

On the AOT side, ExecuTorch integrates tightly with PyTorch\. It uses `torch.export` to capture computation graphs from `torch.nn.Module` and the `torch.fx` graph pass infrastructure to implement graph\-level optimizations such as operator fusion\. Graph transforms such as quantization, subgraph delegation, and memory planning are performed ahead of time, allowing the runtime to stay lean and focus on executing the pre\-optimized model graph\.

On the runtime side, ExecuTorch provides a compact and customizable execution environment\. Users can link target\-specific kernels and backend libraries to tailor deployments for specific hardware\. The core runtime library is small and efficient to ensure compatibility with resource\-constrained platforms\.

![Refer to caption](https://arxiv.org/html/2605.08195v1/imgs/executorch-stack-no-code-bold.png)

Figure 2: High\-level architecture of ExecuTorch, showing two stages: model preparation and model execution\. The preparation flow exports a PyTorch model using `torch.export`, converts it to the ExecuTorch edge dialect, optionally applies backend delegation and graph optimizations, and eventually serializes the result into the PTE format for deployment\.

## 4 Model Preparation

At a high level, the model preparation workflow follows the steps illustrated by the diagram in Figure [2](https://arxiv.org/html/2605.08195#S3.F2)\.

### 4\.1 `torch.export` and IRs

`torch.export` is an execution graph capture mechanism provided by PyTorch\. It uses tracing technologies introduced in PyTorch 2\.0 Ansel et al\. \([2024](https://arxiv.org/html/2605.08195#bib.bib9)\) to convert a model defined in PyTorch Python code into a static graph data structure, which we call Export IR\.

Export IR PyTorch Developers \([2022b](https://arxiv.org/html/2605.08195#bib.bib12)\) is a Torch FX graph Reed et al\. \([2021](https://arxiv.org/html/2605.08195#bib.bib10)\) with strong guarantees: \(1\) Shape soundness: shapes in the captured graph satisfy the shape rules defined by each operator’s semantics; \(2\) Graph normalization: the graph contains no Python semantics, and nodes are restricted to a defined operator set; \(3\) Tensor metadata availability: shape metadata is available on inputs, intermediate values, and outputs; \(4\) Program metadata availability: provenance information records the original program’s `torch.nn.Module` hierarchy and Python call stack\.

Export IR supports progressive lowering into dialects\. Subsequent dialects may enforce custom graph properties and constraints\. Complex operators may be decomposed to enforce a limited operator set; operators may be converted to functional forms to eliminate mutations and aliasing\.

Edge Dialect\. ExecuTorch defines a more restrictive Edge Dialect on top of Export IR with three additional properties: \(1\) fully functional graphs with no mutations or aliasing; \(2\) restriction to fewer than 300 “Core ATen” operators PyTorch Developers \([2022a](https://arxiv.org/html/2605.08195#bib.bib11)\), minimizing the implementation burden for custom kernel libraries and delegates; and \(3\) explicit dtype and memory format specialization, including a dim order concept that describes the memory layout of tensors\.

Edge Dialect is the IR provided to custom graph passes and delegate lowering logic\. In the final lowering stages, the constraints preventing mutation and aliasing will be relaxed in very specific scenarios to allow for optimizations such as KV\-cache writeback that require in\-place updates\. Only non\-computational state updates \(i\.e\., direct data copies without modification\) are allowed\. Delegates are also allowed to diverge from the constraints of Edge Dialect during lowering to optimize performance, but they must ensure that graph transformations produce computations equivalent to the original Edge Dialect graph\.

### 4\.2 Memory Planning

Memory planning is performed before serialization as the final preprocessing step\. ExecuTorch analyzes the size and lifespan of each tensor to allocate space within fixed\-size memory arenas, with mutable state tensors given infinite lifespan to prevent overwriting\. The default greedy best\-fit algorithm reuses the smallest non\-overlapping buffer when available, but otherwise allocates linearly to minimize fragmentation\. Custom memory planning algorithms are also supported\.
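The greedy best-fit idea can be sketched in a few lines of plain Python. This is a toy model rather than ExecuTorch's planner: the `TensorSpec`, `Buffer`, and `plan_memory` names, the byte sizes, and the choice to place larger tensors first are all assumptions made for illustration. Lifetimes are expressed as instruction indices, and mutable state is given an infinite lifetime so its buffer is never reused.

```python
from dataclasses import dataclass, field
import math

@dataclass
class TensorSpec:
    name: str
    size: int          # bytes
    first_use: int     # index of the first instruction touching the tensor
    last_use: float    # index of the last one; math.inf for mutable state

@dataclass
class Buffer:
    offset: int
    size: int
    lifetimes: list = field(default_factory=list)  # (first_use, last_use) per occupant

    def is_free_for(self, t: TensorSpec) -> bool:
        # Reusable only if no occupant is live at the same time as `t`.
        return all(t.last_use < lo or t.first_use > hi for lo, hi in self.lifetimes)

def plan_memory(tensors):
    buffers, placement, arena_size = [], {}, 0
    # Assumption: place larger tensors first so smaller ones can reuse their buffers.
    for t in sorted(tensors, key=lambda t: -t.size):
        fits = [b for b in buffers if b.size >= t.size and b.is_free_for(t)]
        if fits:
            buf = min(fits, key=lambda b: b.size)         # best fit: smallest reusable buffer
        else:
            buf = Buffer(offset=arena_size, size=t.size)  # otherwise extend the arena linearly
            buffers.append(buf)
            arena_size += t.size
        buf.lifetimes.append((t.first_use, t.last_use))
        placement[t.name] = buf.offset
    return placement, arena_size

tensors = [
    TensorSpec("kv_cache", 64, 0, math.inf),  # mutable state: never reused
    TensorSpec("a", 32, 0, 1),
    TensorSpec("b", 32, 1, 2),
    TensorSpec("c", 16, 2, 3),
    TensorSpec("d", 32, 3, 4),
]
placement, arena_size = plan_memory(tensors)
print(placement, arena_size)
```

In this example `d` lands in the slot that `a` vacated, so the arena needs 144 bytes instead of the 176 bytes that one buffer per tensor would require.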

### 4\.3 Quantization

Quantization is an essential technique for deploying models on device\. It significantly reduces model size and inference latency at some cost to accuracy\. ExecuTorch builds on TorchAO torchao \([2024](https://arxiv.org/html/2605.08195#bib.bib72)\) to support a variety of backend\-specific and generalized quantization algorithms, such as SpinQuant Liu et al\. \([2025b](https://arxiv.org/html/2605.08195#bib.bib24)\)\. Two quantization workflows are available: PyTorch 2 Export for static quantization, and Eager mode for dynamic or weight\-only quantization\.
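To make the size versus accuracy trade-off concrete, the following is a minimal sketch of symmetric group-wise 4-bit weight quantization, the family of scheme mentioned for LLM weights. It is a generic textbook formulation in plain Python, not the TorchAO or ExecuTorch implementation; the function names and the example weights are assumptions chosen for illustration.

```python
def quantize_groupwise_int4(weights, group_size):
    """Symmetric per-group quantization to the signed 4-bit range [-8, 7]."""
    q_values, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        scale = max_abs / 7 if max_abs > 0 else 1.0  # one scale per group
        scales.append(scale)
        q_values.extend(max(-8, min(7, round(w / scale))) for w in group)
    return q_values, scales

def dequantize_groupwise(q_values, scales, group_size):
    """Recover approximate weights: each value uses the scale of its group."""
    return [q * scales[i // group_size] for i, q in enumerate(q_values)]

weights = [7.0, -3.0, 1.2, 0.0, 14.0, -6.0, 2.0, 0.5]
q_values, scales = quantize_groupwise_int4(weights, group_size=4)
restored = dequantize_groupwise(q_values, scales, group_size=4)
print(q_values, scales, restored)
```

Each weight is stored as a 4-bit integer plus one shared scale per group, and the reconstruction error of any weight is bounded by half of its group's scale. Smaller groups track outliers more closely at the cost of storing more scales.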

PyTorch 2 Export Quantization Jerry Zhang \([2025](https://arxiv.org/html/2605.08195#bib.bib13)\); Andrew Or \([2025](https://arxiv.org/html/2605.08195#bib.bib14)\) transforms a model in Export IR for both post\-training quantization \(PTQ\) and quantization\-aware training \(QAT\)\. The graph is first captured with `torch.export`, and each backend uses its own `Quantizer` class with annotation APIs to specify quantization intent for operators and patterns, as shown in Figure [3](https://arxiv.org/html/2605.08195#S4.F3)\.

Eager Mode Quantization operates directly on `nn.Module` instances by converting the weight tensor of target submodules \(e\.g\., linear or embedding\) into a quantized tensor subclass PyTorch \([2025](https://arxiv.org/html/2605.08195#bib.bib15)\) or replacing the submodule with a quantized variant\. Each type of quantization \(dtype, packing format\) has its own Tensor subclass\.

![Refer to caption](https://arxiv.org/html/2605.08195v1/imgs/pt2e-bold.png)

Figure 3: The quantizer annotates input/output tensors of an operator \(or pattern\) with quantization info such as dtype, bitwidth, range, and observer\.

### 4\.4 Backend Delegate Interface

ExecuTorch’s backend delegate abstraction enables executing model subgraphs on specialized processors \(Figure[4](https://arxiv.org/html/2605.08195#S4.F4)\)\. Each delegate provides \(1\) an AOT compiler that lowers compatible Edge Dialect subgraphs into a backend\-specific representation \(a “delegate blob”\), and \(2\) a runtime library that can deserialize and execute the delegate blob on the target processor\. The delegate specifies which operators it can accelerate; the partitioner identifies matching subgraphs composed of supported operators and routes them to the delegate compiler\. This way, backends will not receive subgraphs that they cannot execute\.
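The partitioning step can be illustrated with a deliberately simplified sketch. It treats the graph as a topologically ordered list of operator names and splits it into maximal runs that the delegate either fully supports or does not; a real partitioner works on a graph with data dependencies. The `partition` function, the operator names, and the supported set are hypothetical.

```python
from itertools import groupby

def partition(ops, supported):
    """Split an ordered op list into maximal runs for the delegate or the CPU fallback."""
    plan = []
    for is_supported, run in groupby(ops, key=lambda op: op in supported):
        target = "delegate" if is_supported else "cpu"
        plan.append((target, list(run)))
    return plan

# Hypothetical model and delegate capability list.
ops = ["conv2d", "relu", "topk", "linear", "softmax"]
delegate_supported = {"conv2d", "relu", "linear", "softmax"}

plan = partition(ops, delegate_supported)
for target, subgraph in plan:
    print(target, subgraph)
```

Because only runs made entirely of supported operators are handed to the delegate compiler, the unsupported `topk` stays on the CPU path and the backend never sees a subgraph it cannot execute.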

![Refer to caption](https://arxiv.org/html/2605.08195v1/imgs/backend_delegate_large.png)

Figure 4: An example showing how the backend receives the graph, compiles it, and executes it\.

### 4\.5 Model Serialization

ExecuTorch introduces the PyTorch Edge file format, with the `.pte` file extension, designed for minimal runtime overhead and file size \(Figure [5](https://arxiv.org/html/2605.08195#S4.F5)\)\.

The `program` component contains execution plans for each model method \(e\.g\., `forward`, `encode`\) represented as a list of instructions\. `KernelCall` invokes an operator, and `DelegateCall` invokes execution of a delegated subgraph\. Arguments are represented as indices into a shared list of `EValue`s, each of which corresponds to a tensor or scalar\. Linear execution \(plus the `Jump` instruction for control flow\) reduces computational overhead compared to executing a graph representation\.
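A toy interpreter shows why a flat instruction list is cheap to execute: the loop only indexes into a shared value list and dispatches on the instruction type. The classes below borrow the instruction names from the text, but their fields, the `JumpIfFalse` variant, and the `execute` function are simplified assumptions and do not reflect the actual PTE schema or runtime.

```python
from dataclasses import dataclass

@dataclass
class KernelCall:
    op: str          # name of a registered kernel
    args: list       # indices into the shared value list
    out: int         # index that receives the result

@dataclass
class DelegateCall:
    delegate: str    # name of a registered delegate
    args: list
    out: int

@dataclass
class JumpIfFalse:
    cond: int        # index of the value holding the condition
    target: int      # instruction index to continue from when the condition is false

def execute(instructions, values, kernels, delegates):
    pc = 0
    while pc < len(instructions):
        inst = instructions[pc]
        if isinstance(inst, KernelCall):
            values[inst.out] = kernels[inst.op](*(values[i] for i in inst.args))
        elif isinstance(inst, DelegateCall):
            values[inst.out] = delegates[inst.delegate](*(values[i] for i in inst.args))
        elif isinstance(inst, JumpIfFalse) and not values[inst.cond]:
            pc = inst.target
            continue
        pc += 1
    return values

# y = x * w + b, expressed as two kernel calls over a shared value list.
kernels = {"mul": lambda a, b: a * b, "add": lambda a, b: a + b}
values = [3.0, 2.0, 1.0, None, None]          # x, w, b, tmp, y
program = [KernelCall("mul", [0, 1], 3), KernelCall("add", [3, 2], 4)]
result = execute(program, values, kernels, delegates={})
print(result[4])
```

No graph traversal or name lookup by schema happens inside the loop; everything the interpreter needs was resolved into indices ahead of time.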

The `segments` component contains discrete, aligned memory blocks that can be independently loaded and freed\. Segments holding small program data persist for the lifetime of the model\. Segments representing large delegate blobs can be freed after model initialization to reduce peak memory\. Page\-aligned segments also support direct mmap access without additional copying\.

ExecuTorch also defines the PTD format, with the `.ptd` file extension, for storing named tensor and delegate data, enabling weight sharing between PTE files, checkpointing for on\-device training, and independent deployment of program and data\.

![Refer to caption](https://arxiv.org/html/2605.08195v1/imgs/pte-combined-1.png)

Figure 5: Overview of the PTE file \(a\) and weight sharing mechanisms; multi\-method \(b\) and program data separation \(c\)\.

### 4\.6 Weight Sharing

ExecuTorch supports weight sharing via two mechanisms \(Figure [5](https://arxiv.org/html/2605.08195#S4.F5)b, c\): \(1\) multi\-method PTE files, where different model methods \(e\.g\., LLM prefill and decode\) share data segments within a single file; and \(2\) program\-data separation, where a PTE program references external PTD weight files that can be shared across models \(e\.g\., LoRA adapters sharing foundation weights\)\. Both strategies reduce binary size and enable buffer reuse at runtime\.

### 4\.7 On\-Device Fine\-Tuning

ExecuTorch supports on\-device training by lowering both forward and backward execution graphs\. Updated weights are written as new PTD checkpoints\. We validated fine\-tuning a classification model \(lowered to the XNNPACK backend\) using the CIFAR\-10 dataset on an Android device\.

## 5 Model Execution

ExecuTorch provides a lightweight and modular runtime with tight memory and compute budgets \(Figure [6](https://arxiv.org/html/2605.08195#S5.F6)\)\. The runtime executes the instruction lists encoded in the `program` component of the PTE file \(Section [4\.5](https://arxiv.org/html/2605.08195#S4.SS5)\), where each instruction maps to a statically registered kernel or delegate call\. At build time, the selective build API allows developers to link only the required kernel and delegate libraries\. Static registration removes dynamic operator resolution overhead and minimizes per\-instruction latency\.

![Refer to caption](https://arxiv.org/html/2605.08195v1/imgs/runtime.png)

Figure 6: ExecuTorch Runtime

### 5\.1 Core Runtime Portability

The core runtime targets C\+\+17 and does not depend on dynamic memory allocation, synchronization primitives, or exceptions in order to maximize portability across hardware platforms \(servers, mobile phones, and bare\-metal embedded systems\) and software environments \(POSIX, Windows, bare\-metal environments\)\. Note that these restrictions do not apply to extensions and backends, which may be intended for use only on specific platforms\.

The core runtime does not create or manage heap\-allocated memory or use C\+\+ STL types that allocate or manage their own memory\. All memory must be user provided via the `MemoryManager` abstraction which, in addition to ensuring portability, enables placement of tensor data in specialized memory regions such as SRAM or DRAM\. The `FreeableBuffer` abstraction enables lifetime management by wrapping a buffer pointer and a user\-defined free function\.
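The "user-provided memory" pattern can be sketched as a bump allocator over a caller-owned buffer. This is only a conceptual illustration in Python; the `ArenaAllocator` class, its alignment default, and the 64-byte pool are assumptions, and the real runtime exposes this idea through its own C++ abstractions.

```python
class ArenaAllocator:
    """Bump allocator over a caller-owned buffer; it never grows and never frees."""

    def __init__(self, buffer, alignment=16):
        self._view = memoryview(buffer)
        self._alignment = alignment
        self._offset = 0

    def allocate(self, size):
        a = self._alignment
        start = (self._offset + a - 1) // a * a   # round up to the next aligned offset
        end = start + size
        if end > len(self._view):
            return None                           # report exhaustion instead of raising
        self._offset = end
        return self._view[start:end]              # zero-copy slice of the caller's buffer

# The caller decides where the memory lives (for example, a statically reserved region).
pool = bytearray(64)
arena = ArenaAllocator(pool)
first = arena.allocate(10)
second = arena.allocate(20)
too_big = arena.allocate(40)
first[0] = 0xFF   # writes go straight into `pool`
print(len(first), len(second), too_big, pool[0])
```

Since the allocator only advances an offset inside memory it was handed, its behavior is deterministic and it works equally well on platforms without a heap.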

The Platform Abstraction Layer \(PAL\) allows overriding system operations such as logging, querying the time, or panicking the process/system\. The `DataLoader` interface supports custom PTE loading strategies \(e\.g\., file I/O or mmap\)\.

### 5\.2 Runtime API Language Bindings

ExecuTorch ships optional extensions that present a PyTorch\-like façade over the core runtime, including a C\+\+ `Module` API mirroring eager\-mode usage and `TensorPtr` for safe, zero\-copy tensor passing\. Native bindings for iOS \(Objective\-C/Swift\) and Android \(Java/Kotlin\) allow apps to call into ExecuTorch without touching C\+\+ directly\.

### 5\.3 Runtime Overhead

To quantify reduction in framework overhead, we compare inference of a minimal model \(mul\+add\) between ExecuTorch and the TorchScript\-based PyTorch Mobile Interpreter PyTorch Foundation \([2021](https://arxiv.org/html/2605.08195#bib.bib16)\)\. FlatBuffer\-based deserialization yields a 5\.3× speedup, runtime initialization is 37\.4× faster by eliminating dynamic operator resolution, and simplified instruction\-to\-kernel routing reduces per\-operator overhead\. Deterministic memory behavior follows from ahead\-of\-time memory planning\.

Table 1: Runtime overhead comparison between ExecuTorch \(ET\) and the PyTorch Mobile Interpreter \(MI\) on a minimal model\. All values in CPU cycles\.

| Phase | Component | MI | ET | Speedup |
| --- | --- | --- | --- | --- |
| Loading | Deserialization | 510 | 97 | 5\.3× |
| Loading | Initialization | 312,631 | 8,350 | 37\.4× |
| Execution | Framework overhead | 324,399 | 75 | 4,325× |
| Execution | aten::mul | 7,976 | 360 | 22\.0× |
| Execution | aten::add | 8,493 | 390 | 21\.8× |
| Total per inference | | 654,009 | 9,272 | 70\.5× |
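The totals in Table 1 can be cross-checked with a few lines of arithmetic. The snippet below only re-adds the per-component cycle counts reported in the table and recomputes the overall ratio; the variable names are ours.

```python
# (phase, component, Mobile Interpreter cycles, ExecuTorch cycles), copied from Table 1
rows = [
    ("Loading", "Deserialization", 510, 97),
    ("Loading", "Initialization", 312_631, 8_350),
    ("Execution", "Framework overhead", 324_399, 75),
    ("Execution", "aten::mul", 7_976, 360),
    ("Execution", "aten::add", 8_493, 390),
]

mi_total = sum(mi for _, _, mi, _ in rows)
et_total = sum(et for _, _, _, et in rows)
total_speedup = round(mi_total / et_total, 1)

# Share of the Mobile Interpreter total spent outside the two kernels.
mi_non_kernel = sum(mi for _, comp, mi, _ in rows if not comp.startswith("aten::"))
mi_non_kernel_share = mi_non_kernel / mi_total

print(mi_total, et_total, total_speedup)
print(f"{mi_non_kernel_share:.1%}")
```

The sums reproduce the reported totals of 654,009 and 9,272 cycles and the 70.5× overall speedup. They also show that about 97% of the Mobile Interpreter's cycles on this model go to loading and framework overhead rather than to the `mul` and `add` kernels themselves.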

## 6 Kernels

ExecuTorch kernels are stateless functions that implement tensor operations characterized by a fixed name, an operator schema, and clear input and output semantics \(i\.e\., expected data types, aliasing, etc\.\)\. Built\-in kernel libraries \(i\.e\., a collection of kernels that is linked during build and invoked during execution\) implement the Core ATen operator set used in Edge Dialect\. Operator semantics are identical to the corresponding implementations in PyTorch’s ATen library, except that tensor memory must be densely packed\. As in PyTorch, custom operators may be defined using PyTorch’s custom operator API\.

### 6\.1 Kernel Libraries

ExecuTorch ships with two CPU kernel libraries\. The Portable Kernel Library is a reference implementation of Core ATen operators with no external dependencies\. It provides functional and correct implementations that are always available\. The Optimized Kernel Library accelerates selected operators using SIMD intrinsics and optimized math libraries \(e\.g\., SLEEF, OpenBLAS\), trading portability for performance\. Users may map operators to either library\.

### 6\.2 Kernel Registration APIs

Selective Build: The full portable library is ∼2\.3 MiB, which is too large for many resource\-constrained applications\. ExecuTorch’s selective build feature allows users to specify a subset of kernels to include when building, which can reduce binary size from MiB to KiB\. To further reduce binary size, dtype selective build preserves only kernel code paths for data types that will actually be exercised during inference, discarding the rest\.
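Conceptually, selective build is a filter over the kernel library driven by what the exported program actually calls. The sketch below models that filter in Python with kernels keyed by operator name and dtype, so that it covers the dtype-selective case as well. The `select_kernels` function, the registry layout, and the listed operators are hypothetical; the real feature is applied by the build system, not at runtime.

```python
def select_kernels(registry, used):
    """Return only the (operator, dtype) kernels that the exported program calls."""
    needed = set(used)
    missing = sorted(needed - set(registry))
    if missing:
        raise KeyError(f"no kernel registered for {missing}")
    return {key: registry[key] for key in sorted(needed)}

# Hypothetical full library: one entry per operator and dtype code path.
full_library = {
    ("aten::add", "float32"): lambda a, b: a + b,
    ("aten::add", "int8"): lambda a, b: a + b,
    ("aten::mul", "float32"): lambda a, b: a * b,
    ("aten::mul", "int8"): lambda a, b: a * b,
    ("aten::neg", "float32"): lambda a: -a,
}

# Operator/dtype pairs collected from a program's instruction list (duplicates are fine).
used_by_program = [
    ("aten::mul", "float32"),
    ("aten::add", "float32"),
    ("aten::mul", "float32"),
]

linked = select_kernels(full_library, used_by_program)
print(sorted(linked))
```

Only two of the five code paths survive here. Failing early on a missing kernel mirrors the idea that a selectively built binary must still cover every operator the program references.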

Runtime Registration API: In PyTorch, operator schemas are resolved by parsing a string DSL at static initialization time, incurring significant startup latency\. ExecuTorch avoids this by capturing and storing argument sequence and type information based on the exported graph\. To account for the possibility of PyTorch operator schema/functionality being updated, Edge Dialect operators have strong backward compatibility guarantees\.

## 7 Backends

ExecuTorch currently includes a diverse selection of production\-ready backends, which together enable efficient execution across a wide range of processors\. Additionally, several more backends are under active development: MediaTek, OpenVINO, Samsung Exynos, NXP eIQ Neutron, CUDA, and Metal\.

### 7\.1 XNNPACK CPU Backend

The XNNPACK backend is a high\-performance CPU backend built on Google’s XNNPACK library Google \([2019](https://arxiv.org/html/2605.08195#bib.bib36)\)\. Active collaboration between the ExecuTorch and Google teams ensures tight integration and alignment between the two projects\. Complementary libraries such as Arm’s KleidiAI Arm Ltd\. \([2024b](https://arxiv.org/html/2605.08195#bib.bib37)\) are also used to extend hardware coverage\. The quantizer exposes support for static and dynamic quantization for int8 inference, as well as per\-channel and group\-wise int4 weights for LLM workloads\.

At runtime, the delegate achieves high performance using an extensive library of SIMD\-optimized, multithreading\-compatible kernels tuned across a wide range of CPU architectures and input shapes\. A weight caching mechanism enables efficient model reloading and LoRA weight sharing\.

### 7\.2 Vulkan Backend

The Vulkan Khronos Group \([2023](https://arxiv.org/html/2605.08195#bib.bib41)\) backend is designed for inference on mobile GPUs\. Model inference is powered by a growing library of GLSL compute shaders that currently implement 76 ATen operators\. One operator may map to multiple compute shaders, which are selected at runtime based on tensor storage type, memory layout, supported Vulkan extensions, input shapes, and GPU architecture\. Similar to XNNPACK, the backend also includes int8 variants of several operators \(convolution, matmul, linear\) to support int8 inference via static or dynamic quantization\. Integer inference leverages hardware\-accelerated integer dot product instructions when available\. Inference with group\-wise quantized int4 weights is also supported for LLM workloads\.

### 7\.3 Arm Ethos\-U NPU Backend

The Arm backend targets Ethos\-U NPUs Arm Ltd\. \([2024a](https://arxiv.org/html/2605.08195#bib.bib40)\), including the latest Ethos\-U85, for efficient inference on embedded platforms\. The delegate converts Edge Dialect subgraphs to TOSA \(Tensor Operator Set Architecture\) Consortium \([2023](https://arxiv.org/html/2605.08195#bib.bib39)\) IR, which is then compiled by the Vela compiler \(Regor backend\) into optimized binaries for Ethos\-U NPUs\. The backend’s quantizer features symmetric int8 and mixed\-precision quantization to support a wide range of models\.

### 7\.4 Qualcomm QNN Backend

The QNN backend, built on Qualcomm AI Engine Direct Qualcomm Technologies \([2024](https://arxiv.org/html/2605.08195#bib.bib42)\), targets inference on the Hexagon DSP using the A8W8 and A16W4 quantization formats\. While static shapes offer optimal performance, limited dynamic shape support is also available\. The backend is compatible with many quantization algorithms and features in TorchAO, including SpinQuant, Range\-Setting, and a shared SeqMSE observer\. It supports multi\-method execution, spill\-fill buffers, runtime\-adjustable power modes, profiling, and both offline and online compilation\.

### 7\.5 CoreML Backend

The CoreML backend Apple Inc\. \([2023](https://arxiv.org/html/2605.08195#bib.bib38)\) targets inference on Apple platforms\. It can perform model inference on CPU, GPU, and Neural Engine \(ANE\) processors\. It exposes most capabilities available when using CoreML directly, such as 8\-bit static quantization and weight\-only quantization; selection of compute units and compute precision; support for static, enumerated, and dynamic shapes; and execution of stateful models\.

## 8 Devtools

ExecuTorch provides a suite of developer tools for profiling and debugging on\-device deployments\. These tools assist developers in linking the eager\-mode behavior of a model in PyTorch to the on\-device behavior when executing with ExecuTorch\. The workflow centers on two artifacts: ETRecord, produced during export, captures the Edge Dialect graph and contains debug metadata linking runtime events to the Python source code from the original model; and ETDump, produced during execution, captures operator latencies, memory lifetimes, delegate and kernel call events, and optionally intermediate tensor values\.

The Inspector API then provides tools to view and analyze the data contained in these artifacts\. Developers can use latency data to identify slow operators and bottlenecks within delegates\. When a model is producing incorrect outputs, intermediate tensor values can be inspected and compared to a reference model to identify the source of the error\. Tensor lifetimes and allocator behavior can also be analyzed to reduce memory footprint\.
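The comparison of intermediate values against a reference can be sketched in a few lines\. The example below is not the Inspector API; it is a hypothetical pure\-Python helper that assumes both traces have already been extracted as aligned lists of operator names and flat value lists, and it reports the first operator whose maximum absolute error exceeds a tolerance\.

```python
def first_divergence(reference, on_device, atol=1e-3):
    """Return (op_name, max_abs_err) for the first op exceeding atol, else None.

    Both arguments are lists of (op_name, values) in execution order.
    Hypothetical helper; the trace format is an assumption of this sketch.
    """
    for (ref_name, ref_vals), (dev_name, dev_vals) in zip(reference, on_device):
        if ref_name != dev_name:
            raise ValueError("traces must be aligned operator by operator")
        err = max(abs(r - d) for r, d in zip(ref_vals, dev_vals))
        if err > atol:
            return ref_name, err
    return None


# Made-up traces: the device output drifts starting at the softmax.
eager_trace = [("linear", [0.50, 1.25]), ("softmax", [0.32, 0.68]), ("mul", [0.16, 0.34])]
device_trace = [("linear", [0.50, 1.25]), ("softmax", [0.40, 0.60]), ("mul", [0.20, 0.30])]
culprit = first_divergence(eager_trace, device_trace)
print(culprit)
```

Reporting the first divergent operator rather than the largest error matters because errors propagate: every operator downstream of the faulty one will also look wrong\.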

## 9 Enabling Use Cases

### 9\.1 Large Language Models

LLMs can be exported to ExecuTorch using either ExecuTorch’s modular transformer\-decoder definition, which supports popular LLMs like Llama 3\.2 and Qwen3 Yang et al\. \([2025](https://arxiv.org/html/2605.08195#bib.bib23)\), or Hugging Face’s Transformers library Wolf et al\. \([2020](https://arxiv.org/html/2605.08195#bib.bib35)\) using Optimum ExecuTorch PyTorch Developers \([2025](https://arxiv.org/html/2605.08195#bib.bib21)\)\. Over 80% of models on Hugging Face’s text generation leaderboard can be exported through Optimum ExecuTorch\.

To represent an LLM’s prefill and decode steps with a single graph, torch\.export marks the sequence length dimension as dynamic\. For backends that require static shapes \(e\.g\., QNN\), two separate models are exported: one for prefill, padded to the maximum context length, and one for single\-token decode\. ExecuTorch provides a C\+\+ tokenizer that can be deployed to native platforms with minimal dependencies\.
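For the static\-shape case, the padding step can be sketched as follows\. This is a minimal illustration, not the runner’s actual code: the helper name, the choice of right\-padding, and the pad token id of 0 are assumptions made for the example\.

```python
def pad_prompt(token_ids, max_context_len, pad_id=0):
    """Right-pad a prompt to a fixed length and return (padded_ids, valid_mask).

    Hypothetical helper: padding side and pad_id are assumptions of this sketch.
    """
    if len(token_ids) > max_context_len:
        raise ValueError("prompt is longer than the exported context length")
    n_pad = max_context_len - len(token_ids)
    padded = list(token_ids) + [pad_id] * n_pad
    # 1 marks a real token, 0 marks padding that attention must ignore.
    valid_mask = [1] * len(token_ids) + [0] * n_pad
    return padded, valid_mask


padded, valid_mask = pad_prompt([101, 7592, 2088], max_context_len=8)
print(padded)
print(valid_mask)
```

The prefill model then always sees the same input shape, and the mask tells it which positions carry real tokens\.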

![Refer to caption](https://arxiv.org/html/2605.08195v1/imgs/llm_optmize.png)

Figure 7: \(a\) Quantized Flash Attention and \(b\) Efficient sliding window attention\.

Key optimizations accelerate LLM inference on CPU \(Figure [7](https://arxiv.org/html/2605.08195#S9.F7)\):

Flash attention: Avoids the cost of materializing intermediate attention tensors, which is particularly useful for reducing memory footprint and inference latency for long contexts\.

Quantized KV cache and attention: A per\-channel quantized KV cache with quantized attention reduces memory for long contexts \(Figure [7](https://arxiv.org/html/2605.08195#S9.F7)\(a\)\)\.

Efficient sliding window attention: For models with local\-global attention Shao \([2024](https://arxiv.org/html/2605.08195#bib.bib68)\) \(e\.g\., Gemma 3\), cache positions are tracked in a separate array to generate causal masks without shifting the KV cache \(Figure [7](https://arxiv.org/html/2605.08195#S9.F7)\(b\)\), avoiding costly memory movement\.
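The sliding window idea can be illustrated with a small ring buffer\. The sketch below is an assumption\-laden simplification, not ExecuTorch’s implementation: the class name and the use of Python lists in place of tensors are hypothetical\. It shows how a separate positions array lets the mask be computed while entries are overwritten in place rather than shifted\.

```python
class RingKVCache:
    """Fixed-size KV cache that overwrites the oldest slot instead of shifting.

    Hypothetical sketch: real caches hold tensors, not Python objects.
    """

    def __init__(self, window):
        self.window = window
        self.keys = [None] * window
        self.values = [None] * window
        # Absolute token position currently stored in each slot (-1 = empty).
        self.positions = [-1] * window

    def append(self, pos, key, value):
        slot = pos % self.window  # ring-buffer write, no data movement
        self.keys[slot] = key
        self.values[slot] = value
        self.positions[slot] = pos

    def mask(self, query_pos):
        """True where a slot holds a token the query at query_pos may attend to."""
        lo = query_pos - self.window + 1
        return [p >= 0 and lo <= p <= query_pos for p in self.positions]


cache = RingKVCache(window=4)
for pos in range(6):
    cache.append(pos, key=f"k{pos}", value=f"v{pos}")
print(cache.positions)
print(cache.mask(5))
```

After six tokens, the slots hold positions 4, 5, 2, and 3: tokens 0 and 1 were overwritten in place, and the mask is derived from the positions array alone\.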

The QNN and CoreML backends also support speculative decoding as an optimization for token generation throughput\.

### 9\.2 Multi\-modality

ExecuTorch supports multi\-modal transformers by splitting models at export time into text embedding, text decoder, and multi\-modal encoder components\. These components are then efficiently stitched together in the C\+\+ model runner\. This targets Early Fusion models Gadzicki et al\. \([2020](https://arxiv.org/html/2605.08195#bib.bib32)\) \(e\.g\., Voxtral Liu et al\. \([2025a](https://arxiv.org/html/2605.08195#bib.bib30)\), Gemma 3 4B Team et al\. \([2025](https://arxiv.org/html/2605.08195#bib.bib31)\)\); cross\-attention models are supported via an attention interface for external KV cache management\.

### 9\.3 MCU Deployment

ExecuTorch’s portable runtime and selective build system enable deployment on microcontroller\-class targets\. We demonstrate this with an MNIST digit classifier on the Raspberry Pi Pico 2 \(Arm Cortex\-M33, 520 KiB SRAM, 4 MiB Flash\), comparing two configurations: FP32 inference using the Portable Kernel Library, and int8 inference using Arm’s CMSIS\-NN Arm Limited \([2025](https://arxiv.org/html/2605.08195#bib.bib62)\) library via the Arm Ethos\-U backend\.

Selective build reduces runtime code size by linking only the kernels required by the model\. For the FP32 Portable configuration, selective build shrinks total code size from 1,322 KiB to 253 KiB \(5\.2×\); for int8 CMSIS\-NN, from 1,248 KiB to 203 KiB \(6\.2×\)\.

Table [2](https://arxiv.org/html/2605.08195#S9.T2) provides a detailed Flash breakdown with selective build enabled\. The ExecuTorch runtime itself occupies only 13–26 KiB; the majority of Flash is consumed by the model artifact, kernel libraries, and system code \(Pico SDK \+ libc\)\.

Table 2: Flash breakdown with selective build \(KiB\)\.

Table [3](https://arxiv.org/html/2605.08195#S9.T3) reports measured RAM usage\. Ahead\-of\-time memory planning determines the arena size at export time, eliminating runtime allocation\. Int8 quantization and operator fusion reduce the memory\-planned arena from 101\.2 KiB to 3\.8 KiB, bringing total RAM to approximately 11 KiB, well within the Pico 2’s 520 KiB SRAM budget\.
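To illustrate how ahead\-of\-time planning can fix an arena size before the model ever runs, the sketch below assigns offsets to tensors from their sizes and lifetimes\. It is a simple greedy strategy chosen for illustration, not ExecuTorch’s memory planner; the function name, the tuple format, and the example tensors are all hypothetical\.

```python
def plan_memory(tensors):
    """Assign arena offsets so tensors with overlapping lifetimes never overlap.

    tensors: list of (name, size_bytes, first_use, last_use), where the use
    indices are operator steps. Returns (offsets, arena_size).
    Greedy illustration only; not the planner used by any real framework.
    """
    placed = []  # (offset, size, first_use, last_use)
    offsets = {}
    # Placing large tensors first tends to reduce fragmentation.
    for name, size, first, last in sorted(tensors, key=lambda t: -t[1]):
        # Address ranges of already-placed tensors that are live at the same time.
        busy = sorted((o, o + s) for o, s, f, l in placed
                      if not (l < first or last < f))
        offset = 0
        for start, end in busy:
            if offset + size <= start:
                break  # the gap before this range is large enough
            offset = max(offset, end)
        offsets[name] = offset
        placed.append((offset, size, first, last))
    arena_size = max((o + s for o, s, _, _ in placed), default=0)
    return offsets, arena_size


# Hypothetical activations of a 4-layer chain: each is read by the next layer.
activations = [("a", 64, 0, 1), ("b", 32, 1, 2), ("c", 64, 2, 3), ("d", 32, 3, 4)]
offsets, arena_size = plan_memory(activations)
print(offsets, arena_size)
```

Here the four activations total 192 bytes, but because their lifetimes only partly overlap they fit in a 96\-byte arena that can be reserved once and reused\.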

Table 3: RAM breakdown, measured on device \(KiB\)\.

Int8 quantization with CMSIS\-NN delivers a 16\.46× inference speedup \(3\.5 ms vs 57\.6 ms\) while reducing model size by 3\.6× and RAM by 10× compared to the FP32 configuration\. These results demonstrate that ExecuTorch can deploy real models on sub\-dollar MCUs\.

## 10 Platform and Hardware Support

ExecuTorch supports diverse platforms via source\-level portability and multiple hardware backends\. The core runtime builds on Windows and Unix using Clang or GCC\. Embedded targets use standard C\+\+ with minimal toolchain assumptions and macro\-based abstractions for compiler extensions\. Platform bindings for Android and iOS offer out\-of\-the\-box usability, while backends may relax portability to leverage hardware\-specific optimizations\.

Table 4: Platform compatibility\.

| Platform | Backends | Language |
| --- | --- | --- |
| Windows, Linux | Vulkan, CUDA, XNNPACK, Intel OpenVINO | C\+\+ |
| MacOS, iOS | CoreML, MPS, XNNPACK | C\+\+, Swift, Objective\-C |
| Android | Vulkan, XNNPACK, Arm VGF, Qualcomm NPU, MediaTek NPU, Samsung Exynos | C\+\+, Java, Kotlin |
| Embedded Systems | Cortex\-M, Arm Ethos\-U, NXP NPU, Cadence DSP | C\+\+ |

Table [4](https://arxiv.org/html/2605.08195#S10.T4) summarizes platform and backend coverage\. For platforms without an accelerated backend, the portable kernel library serves as a fallback\.

## 11 Performance Evaluations

We evaluate ExecuTorch’s performance on a representative set of large language models and image classification models\. We compare against other widely adopted on\-device ML frameworks: llama\.cpp, ONNX Runtime, LiteRT, and CoreML\. The most recent framework versions as of March 31, 2026, are used\. Experiments were conducted on a Samsung Galaxy S25 Ultra \(Snapdragon 8 Elite SoC; 16 GiB RAM, 2× Cortex\-X925 \+ 6× Cortex\-A725 CPUs, Adreno 830 GPU, Hexagon NPU\), a Google Pixel 9 Pro XL \(Tensor G4 SoC; 16 GiB RAM, 1× Cortex\-X4 \+ 3× Cortex\-A720 \+ 4× Cortex\-A520 CPUs, Mali\-G715 GPU, Edge TPU\), and an Apple iPhone 15 Pro\. Missing entries \(“–”\) indicate configurations for which data could not be collected due to issues during model export or inference\.

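Table 5: ExecuTorch \(ET\) prefill and decode throughput range in tokens/sec, and model size \(i\.e\., the “Size” column\) in MiB for Qwen3 0\.6B, Llama 3\.2 1B, and Phi4 Mini compared against other frameworks\. The “GS” column indicates the quantization group size\.

Samsung Galaxy S25 Ultra \(Snapdragon 8 Elite\)

| Hardware | GS | Framework | Qwen3 0\.6B Prefill min | Qwen3 0\.6B Prefill max | Qwen3 0\.6B Decode min | Qwen3 0\.6B Decode max | Qwen3 0\.6B Size | Llama 3\.2 1B Prefill min | Llama 3\.2 1B Prefill max | Llama 3\.2 1B Decode min | Llama 3\.2 1B Decode max | Llama 3\.2 1B Size | Phi4 Mini \(3\.8B\) Prefill min | Phi4 Mini \(3\.8B\) Prefill max | Phi4 Mini \(3\.8B\) Decode min | Phi4 Mini \(3\.8B\) Decode max | Phi4 Mini \(3\.8B\) Size |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CPU | 32 | ET XNNPACK | 716\.62 | 732\.59 | 72\.34 | 72\.73 | 417 | 524\.59 | 528\.93 | 65\.88 | 67\.10 | 821 | 143\.60 | 159\.70 | 18\.50 | 19\.90 | 2428 |
| CPU | 32 | llama\.cpp | 747\.70 | 750\.40 | 97\.90 | 100\.80 | 442 | 512\.70 | 537\.80 | 65\.60 | 66\.50 | 728 | 151\.20 | 153\.30 | 22\.10 | 22\.90 | 2216 |
| CPU | 32 | ONNX | 387\.68 | 443\.93 | 60\.32 | 65\.09 | 376 | 284\.12 | 328\.76 | 60\.37 | 64\.73 | 769 | 60\.02 | 65\.72 | 17\.37 | 18\.29 | 2337 |
| CPU | 32 | LiteRT | 150\.83 | 153\.38 | 11\.88 | 12\.48 | 326 | 143\.49 | 149\.63 | 31\.36 | 33\.03 | 667 | \- | \- | \- | \- | \- |
| CPU | 128 | ET XNNPACK | 837\.58 | 848\.39 | 74\.92 | 75\.25 | 376 | 649\.75 | 658\.10 | 71\.69 | 71\.97 | 742 | 195\.70 | 202\.60 | 20\.50 | 22\.00 | 2201 |
| CPU | 128 | ONNX | 483\.27 | 554\.74 | 63\.52 | 65\.39 | 323 | 490\.63 | 538\.20 | 74\.60 | 76\.56 | 659 | 92\.08 | 106\.54 | 18\.81 | 20\.02 | 1994 |
| CPU | 128 | LiteRT | 172\.57 | 173\.22 | 12\.56 | 13\.23 | 299 | 185\.35 | 190\.26 | 34\.30 | 34\.56 | 612 | \- | \- | \- | \- | \- |
| GPU | 32 | ET Vulkan | 1206\.42 | 1246\.45 | 57\.98 | 58\.23 | 457 | 927\.54 | 930\.91 | 59\.19 | 59\.40 | 920 | 191\.20 | 238\.92 | 16\.03 | 16\.38 | 2829 |
| GPU | 32 | llama\.cpp | 1709\.90 | 1718\.00 | 77\.80 | 78\.80 | 442 | 1064\.80 | 1092\.70 | 36\.70 | 42\.00 | 728 | 339\.10 | 341\.50 | 11\.60 | 13\.10 | 2216 |
| GPU | 32 | ONNX | \- | \- | \- | \- | \- | \- | \- | \- | \- | \- | \- | \- | \- | \- | \- |
| GPU | 32 | LiteRT | \- | \- | \- | \- | \- | \- | \- | \- | \- | \- | \- | \- | \- | \- | \- |
| GPU | 128 | ET Vulkan | 1538\.01 | 1556\.21 | 62\.35 | 63\.52 | 337 | 1207\.55 | 1207\.55 | 66\.25 | 66\.41 | 676 | 343\.57 | 372\.60 | 18\.75 | 19\.67 | 2088 |
| GPU | 128 | ONNX | \- | \- | \- | \- | \- | \- | \- | \- | \- | \- | \- | \- | \- | \- | \- |
| GPU | 128 | LiteRT | \- | \- | \- | \- | \- | \- | \- | \- | \- | \- | \- | \- | \- | \- | \- |
| NPU | 32 | ET QNN | 1462\.86 | 1542\.17 | 61\.02 | 62\.38 | 681 | 2813\.19 | 2976\.74 | 46\.50 | 46\.57 | 1434 | 1161\.29 | 1229\.27 | 18\.12 | 19\.63 | 3584 |
| NPU | \- | QAIRT | \- | \- | \- | \- | \- | 2277\.90 | 2392\.34 | 52\.03 | 52\.72 | 1229 | \- | \- | \- | \- | \- |
| NPU | 32 | llama\.cpp | 343\.10 | 409\.10 | 22\.50 | 23\.40 | 442 | 329\.90 | 374\.40 | 23\.60 | 25\.60 | 728 | 114\.40 | 130\.00 | 12\.10 | 12\.50 | 2216 |
| NPU | 32 | LiteRT | \- | \- | \- | \- | \- | \- | \- | \- | \- | \- | \- | \- | \- | \- | \- |

Google Pixel 9 Pro XL \(Tensor G4\)

| Hardware | GS | Framework | Qwen3 0\.6B Prefill min | Qwen3 0\.6B Prefill max | Qwen3 0\.6B Decode min | Qwen3 0\.6B Decode max | Qwen3 0\.6B Size | Llama 3\.2 1B Prefill min | Llama 3\.2 1B Prefill max | Llama 3\.2 1B Decode min | Llama 3\.2 1B Decode max | Llama 3\.2 1B Size | Phi4 Mini \(3\.8B\) Prefill min | Phi4 Mini \(3\.8B\) Prefill max | Phi4 Mini \(3\.8B\) Decode min | Phi4 Mini \(3\.8B\) Decode max | Phi4 Mini \(3\.8B\) Size |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CPU | 32 | ET XNNPACK | 169\.13 | 299\.54 | 37\.07 | 40\.38 | 417 | 235\.51 | 245\.92 | 31\.29 | 32\.99 | 821 | 39\.32 | 51\.94 | 7\.93 | 11\.25 | 2428 |
| CPU | 32 | llama\.cpp | 240\.60 | 241\.10 | 46\.50 | 46\.60 | 442 | 158\.10 | 202\.40 | 29\.60 | 30\.90 | 728 | 57\.20 | 58\.60 | 10\.00 | 10\.30 | 2216 |
| CPU | 32 | ONNX | 144\.42 | 259\.93 | 26\.52 | 35\.23 | 376 | 133\.25 | 156\.72 | 22\.55 | 27\.07 | 769 | 32\.75 | 34\.89 | 6\.06 | 6\.48 | 2337 |
| CPU | 32 | LiteRT | 96\.03 | 97\.74 | 9\.53 | 10\.08 | 326 | 94\.03 | 99\.19 | 19\.93 | 20\.74 | 667 | \- | \- | \- | \- | \- |
| CPU | 128 | ET XNNPACK | 187\.99 | 379\.51 | 41\.92 | 42\.22 | 376 | 270\.61 | 296\.98 | 34\.24 | 35\.03 | 742 | 55\.65 | 63\.27 | 12\.00 | 12\.51 | 2201 |
| CPU | 128 | ONNX | 229\.85 | 285\.70 | 32\.71 | 34\.27 | 323 | 256\.34 | 267\.86 | 37\.98 | 39\.35 | 659 | 38\.28 | 54\.64 | 6\.87 | 7\.76 | 1994 |
| CPU | 128 | LiteRT | 101\.14 | 109\.68 | 10\.22 | 10\.31 | 299 | 117\.00 | 117\.51 | 21\.48 | 21\.65 | 612 | \- | \- | \- | \- | \- |
| GPU | 32 | ET Vulkan | 313\.10 | 329\.99 | 20\.59 | 20\.81 | 457 | 192\.77 | 197\.38 | 23\.43 | 23\.54 | 920 | 49\.67 | 50\.22 | 8\.67 | 8\.86 | 2829 |
| GPU | 32 | llama\.cpp | 75\.50 | 76\.40 | 24\.00 | 31\.30 | 442 | 36\.00 | 38\.70 | 15\.70 | 18\.30 | 728 | 11\.50 | 11\.70 | 6\.70 | 6\.80 | 2216 |
| GPU | 32 | ONNX | \- | \- | \- | \- | \- | \- | \- | \- | \- | \- | \- | \- | \- | \- | \- |
| GPU | 32 | LiteRT | \- | \- | \- | \- | \- | \- | \- | \- | \- | \- | \- | \- | \- | \- | \- |
| GPU | 128 | ET Vulkan | 591\.01 | 601\.83 | 20\.71 | 21\.12 | 337 | 530\.02 | 540\.08 | 23\.74 | 24\.03 | 676 | 119\.64 | 120\.07 | 9\.58 | 9\.64 | 2088 |
| GPU | 128 | ONNX | \- | \- | \- | \- | \- | \- | \- | \- | \- | \- | \- | \- | \- | \- | \- |
| GPU | 128 | LiteRT | \- | \- | \- | \- | \- | \- | \- | \- | \- | \- | \- | \- | \- | \- | \- |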

Table 6: ExecuTorch \(ET\) inference time in milliseconds \(avg, p5, p95\) for MV3, ResNet50, ViT, and Swin\-T compared against other frameworks for different backends\. The “dtype” column indicates the inference precision\.

Dense LLM Performance \- We benchmarked Qwen3 0\.6B, Llama 3\.2 1B, and Phi4 Mini to showcase inference performance across a range of model sizes\. Quantization configurations were standardized as much as possible across frameworks to ensure comparable model quality\. For GPU and CPU inference on ExecuTorch, ONNX, and LiteRT, model weights are group\-wise quantized to 4\-bit precision and activations are dynamically quantized to 8\-bit precision using runtime\-computed quantization parameters\. Results for 32 and 128 quantization group sizes are shown\. For llama\.cpp, the Q4\_0 quantization format is used; most weights are quantized to 4\-bit with a quantization group size of 32\. However, unlike the other frameworks, the LM\-head weights are quantized to 6 bits, and dynamic quantization of activations may or may not be performed depending on the backend\. For inference on Qualcomm NPU, QNN/Qualcomm AI Runtime \(QAIRT\) requires that models be statically quantized, with 4\-bit weights and 16\-bit activations\. ExecuTorch’s QNN delegate uses group\-wise quantized weights with a group size of 32, while QAIRT uses channel\-wise quantized weights\. For NPU inference via llama\.cpp, Q4\_0 quantization is used\.

All models were benchmarked with 256 prompt tokens and 256 generated tokens; 3 runs were performed for each model, and the minimum/maximum throughput values observed are recorded in Table[5](https://arxiv.org/html/2605.08195#S11.T5)\. Models were configured to use a maximum context length of 2048; for ExecuTorch, the maximum context length is configured at export time, and for production settings higher values can be used if needed\. To mitigate the impact of thermal throttling, the device undergoes a 60\-second cooldown period between runs\. For CPU inference, the number of threads is set to the number of performance cores on the device: 8 for the Samsung Galaxy S25 Ultra and 4 for the Google Pixel 9 Pro XL\. A warmup run is performed before measurement to mitigate “cold\-start” effects\.

The model size reported for each framework is the size of the model artifacts produced by each framework for a given model\. ExecuTorch, LiteRT, and llama\.cpp all produce self\-contained files \(\.pte, \.litertlm, and \.gguf, respectively\) which contain the constant and weight tensor data \(i\.e\., quantized weights, quantization parameters, etc\.\) as well as serialized model representations required for the framework to execute the model\. ONNX uses a \.onnx file to store the model representation, and a separate \.onnx\.data file to store constant and weight tensor data; the model artifact size for ONNX is the sum of the sizes of these two files\.

ExecuTorch’s model artifacts tend to be larger compared to other frameworks\. For XNNPACK, this is because the delegate does not yet support tied embeddings \(a technique which shares the weight tensor between the embedding layer and the LM\-head linear layer\), which results in a duplicated embedding table\. For Vulkan, although tied embeddings are supported, the delegate pre\-computes per\-group integer weight sums \(required for quantized accumulation\) during export and stores them in the model artifact\. The overhead of these pre\-computed sums increases with smaller group sizes, which explains the difference in model size between 32 and 128 group sizes\. For both QNN and QAIRT, models use 16\-bit embeddings and 8\-bit LM\-head, which prevents the use of tied embeddings\. QNN also contains higher quantization parameter overhead compared to QAIRT due to the use of group\-wise quantization\.
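The dependence of artifact size on group size follows from simple arithmetic: every group adds a fixed amount of metadata, so smaller groups mean more metadata per weight\. The sketch below works this out under loudly stated assumptions that are not taken from the paper: 16\-bit scales and, for the variant with pre\-computed sums, one additional 32\-bit value per group\. The function name is hypothetical\.

```python
def bits_per_weight(group_size, weight_bits=4, scale_bits=16, extra_bits_per_group=0):
    """Average storage cost per weight for group-wise quantization.

    scale_bits and extra_bits_per_group are assumed widths for illustration;
    the actual on-disk formats of the delegates may differ.
    """
    return weight_bits + (scale_bits + extra_bits_per_group) / group_size


for gs in (32, 128):
    plain = bits_per_weight(gs)
    with_sums = bits_per_weight(gs, extra_bits_per_group=32)
    print(f"group size {gs}: {plain} bits/weight, {with_sums} with per-group sums")
```

Under these assumptions, moving from group size 128 to 32 raises the cost from 4\.125 to 4\.5 bits per weight, and storing an extra per\-group sum widens the gap further \(4\.375 versus 5\.5\), which is consistent in direction with the size differences discussed above\.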

ExecuTorch’s XNNPACK delegate delivers strong performance compared to ONNX and LiteRT across both devices\. On the Samsung Galaxy S25 Ultra, llama\.cpp demonstrates higher decode throughput for Qwen3 and Phi4 Mini \(although prefill throughput is comparable\) because its attention implementation is more efficient for single\-token decode than ExecuTorch’s\.

ExecuTorch’s Vulkan delegate is able to execute all models with full graph delegation; no operators fall back to CPU\. It underperforms llama\.cpp in prefill throughput on the Samsung Galaxy S25 Ultra, but tends to deliver better token generation throughput\. On the Pixel 9 Pro XL, the Vulkan delegate greatly outperforms llama\.cpp in prefill throughput, but this is because llama\.cpp currently only contains optimized compute shaders for Adreno GPUs\. Although group size has a large impact on prefill throughput for both devices, for the Pixel 9 Pro XL, prefill throughput drops sharply at group size 32 compared to 128\. This suggests a threshold at which the number of unique quantization parameters fetched during the requantization step of quantized linear layers increases memory traffic enough to cause GPU cache thrashing\. For LiteRT, we observed a segmentation fault when attempting to benchmark exported models with GPU acceleration; for ONNX, we could not find a way to execute LLMs with GPU acceleration\.

For a detailed operator\-level performance comparison of ExecuTorch and llama\.cpp across all 3 models on the Samsung Galaxy S25 Ultra, see Appendix[A](https://arxiv.org/html/2605.08195#A1)\. Note that the analysis only covers CPU and GPU inference\.

ExecuTorch’s QNN delegate demonstrates stronger prefill throughput compared to executing via QAIRT for Llama 3\.2 1B\. QAIRT achieves higher decode throughput, which may be due to its use of per\-channel quantization rather than the per\-group quantization used by ExecuTorch’s QNN delegate\. We could not generate QAIRT binaries for Qwen3 0\.6B and Phi4 Mini due to errors during the model export process\. llama\.cpp’s Hexagon backend \(currently marked experimental\) targets NPU inference using custom DSP kernels, and some ops may be falling back to the CPU, which would explain the much lower throughput compared to QNN\. In contrast, the QNN delegate executes the model with full graph delegation with no operators falling back to CPU\.

Vision Model Performance \- Performance results for MV3, ResNet50, ViT, and Swin\-T are reported in Table [6](https://arxiv.org/html/2605.08195#S11.T6)\. These models are selected as a representative sample of convolution\-based and transformer\-based image processing workloads\. Each model was benchmarked with 10 warmup iterations and 200 inference iterations, and the average, p5, and p95 inference latencies in milliseconds are reported\. For CPU inference, the number of threads is set to the number of performance cores on the device: 8 for the Samsung Galaxy S25 Ultra and 4 for the Google Pixel 9 Pro XL\.
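The reported statistics can be computed from raw timings as in the sketch below\. It is an assumption\-based illustration rather than the benchmark harness used here: the function name is hypothetical, the timings are synthetic, and the inclusive percentile interpolation is one of several reasonable conventions, since the exact method is not specified\.

```python
import statistics


def summarize_latencies(samples_ms, warmup=10):
    """Drop warmup iterations, then report average, p5, and p95 latency."""
    measured = samples_ms[warmup:]
    # quantiles(n=100) returns the 99 percentile cut points p1..p99.
    cuts = statistics.quantiles(measured, n=100, method="inclusive")
    return {"avg": statistics.fmean(measured), "p5": cuts[4], "p95": cuts[94]}


# Synthetic timings: 10 slow warmup runs followed by 200 measured runs.
timings = [100.0] * 10 + [float(i) for i in range(1, 201)]
summary = summarize_latencies(timings)
print(summary)
```

Reporting p5 and p95 alongside the average makes tail behavior visible, which a mean alone would hide on devices subject to scheduling noise or thermal throttling\.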

We encountered errors when exporting the Swin\-T model with LiteRT, so no measurements are reported for Swin\-T on LiteRT\. We were also unable to collect GPU inference measurements for many LiteRT models due to a segmentation fault that occurred after loading the GPU acceleration library\. For ONNX, GPU/NPU inference was tested via the QNN execution provider, which is not available for the Pixel 9 Pro XL\. A runtime exception was encountered when executing Swin\-T with the QNN execution provider, and therefore no data was collected for that model on GPU/NPU with ONNX\.

For CPU inference, ExecuTorch’s XNNPACK delegate delivers extremely strong performance relative to LiteRT and ONNX\.

For GPU inference, ExecuTorch’s Vulkan delegate executes MobileNet V3 and ResNet50 \(both quantized and fp16 variants\) with full graph delegation\. For ViT\-B/16, 4 unsupported operator types \(72 instances total\) in the attention masking pipeline \(mul\.Scalar, logical\_not, eq\.Scalar, and any\.dim\) cause the model to be split into 25 Vulkan partitions\. For Swin\-T, 7 unsupported operator types \(∼160 instances\), primarily slice\_scatter, fmod\.Scalar, and index\.Tensor with 2D sources, produce 12 partitions\. Although CPU fallback operators account for ∼29% of ViT execution latency and ∼22% of Swin\-T execution latency, the overhead introduced by graph breaks \(i\.e\., copying tensors between CPU and GPU\) accounts for only 5%–6% of execution latency\.
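The way a handful of unsupported operators fragments a model can be illustrated with a toy partition counter\. The sketch below simplifies heavily: it treats the model as a linear operator sequence rather than a graph, and the function name and operator list are hypothetical, so it is not the partitioner used by the delegate\.

```python
def count_delegate_partitions(ops, supported):
    """Count maximal runs of consecutive supported ops in a linear op sequence.

    Simplification: real partitioners operate on a dataflow graph, not a list.
    """
    partitions = 0
    in_partition = False
    for op in ops:
        if op in supported:
            if not in_partition:
                partitions += 1  # a new delegated run starts here
            in_partition = True
        else:
            in_partition = False  # unsupported op runs on CPU and breaks the run
    return partitions


supported_ops = {"conv", "relu", "matmul", "softmax", "linear"}
model_ops = ["conv", "relu", "logical_not", "matmul", "softmax", "any.dim", "linear"]
print(count_delegate_partitions(model_ops, supported_ops))
```

In this toy sequence, two unsupported operators are enough to produce three separate partitions, each of which implies a hand\-off between the delegate and the CPU\.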

Generally, the delegate delivers comparable performance to other frameworks on the Samsung Galaxy S25 Ultra\. A notable exception is ResNet50, where ONNX achieves much faster inference\. Likewise, LiteRT achieves much faster inference on MobileNet V3 compared to both ExecuTorch and ONNX\. These gaps do not appear to be consistent across models\. Since neither QNN nor LiteRT’s GPU acceleration library is open source, it is difficult to diagnose the source of the performance gap\. For int8 inference on the GPU, support for static int8 quantization in the Vulkan delegate is an ongoing effort, and so far only ResNet50 is supported among the models tested\.

For NPU inference, ExecuTorch’s QNN delegate delivers extremely strong performance relative to LiteRT and ONNX\. We found that for LiteRT and ONNX, the NPU execution provider / accelerator was only claiming a limited number of nodes in the model graph, resulting in a majority of model inference being executed on the CPU\.

On the iPhone 15 Pro, ExecuTorch’s CoreML delegate matches or slightly outperforms native CoreML across all four models, and is the only configuration that produces results for Swin\-T\. These results confirm that ExecuTorch’s delegation overhead is negligible compared to directly running CoreML\.

## 12 Limitations and Future Work

Model exportability: ExecuTorch relies on torch\.export for graph capture\. Unfortunately, there are several classes of models that pose challenges for export\.

- Data\-dependent control flow: models that contain operations where control flow depends on runtime tensor values, e\.g\., models with dynamic padding, data\-dependent slicing, or models that branch on data\-dependent values \(e\.g\., beam search\)\.
- Dynamic shapes: models with input\-dependent tensor dimensions such as dynamic LSTMs or Mask R\-CNN architectures produce graphs whose structure changes at runtime\. These often require splitting the model into separately exported subgraphs or rewriting the model to be export\-friendly\.
- Custom operators: models that rely on custom C\+\+ or CUDA kernels \(e\.g\., FlashAttention\) require those kernels to be registered with a fake\-tensor implementation that describes output shapes symbolically\. A corresponding kernel must also be registered in the ExecuTorch runtime, and each target delegate must implement support for it, in order for ExecuTorch to handle the custom operator\.

Models containing any of the above elements may require changes such as:

- Rewriting control flow with higher\-order operators such as torch\.cond \(if/else\), torch\.scan or torch\.while\_loop \(loops\), and torch\.where \(element\-wise selection\) Wu et al\. \([2025](https://arxiv.org/html/2605.08195#bib.bib75)\)\.
- Wrapping untraceable ops in a custom op\.
- Adding assertions \(torch\.\_check\), which serve as compiler hints for constraining dynamic shapes\.

Hardware retargetability: AOT compilation optimizes models for specific hardware, but hardware diversity in the Android ecosystem \(e\.g\., Qualcomm, MediaTek, Samsung NPUs\) may require developers to query device capabilities and select a hardware\-appropriate model file when downloading models from a delivery service, or bundle multiple hardware\-specific model files in one APK \(perhaps with a common shared weight file\)\. This increases engineering complexity compared to runtime\-retargetable solutions \(like ONNX or LiteRT\), which are more flexible but forgo AOT optimizations\.
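The selection logic described above might look like the following sketch\. Everything in it is hypothetical: the file names, the vendor keys, and the preference order are placeholders that an application would define for itself, and how the vendor string is obtained from the device is outside the scope of the example\.

```python
# Hypothetical file names and preference order, for illustration only.
PREFERENCES = {
    "qualcomm": ["model_qnn.pte", "model_vulkan.pte", "model_xnnpack.pte"],
    "mediatek": ["model_mediatek.pte", "model_vulkan.pte", "model_xnnpack.pte"],
    "samsung": ["model_exynos.pte", "model_vulkan.pte", "model_xnnpack.pte"],
}
CPU_FALLBACK = ["model_xnnpack.pte"]


def select_model_file(soc_vendor, available):
    """Return the most specific available model file for a device.

    soc_vendor: vendor string reported by the device (lookup is case-insensitive).
    available: set of file names offered by the delivery service or bundled.
    """
    for name in PREFERENCES.get(soc_vendor.lower(), CPU_FALLBACK):
        if name in available:
            return name
    raise LookupError(f"no compatible model file for vendor {soc_vendor!r}")


bundle = {"model_qnn.pte", "model_vulkan.pte", "model_xnnpack.pte"}
print(select_model_file("Qualcomm", bundle))
print(select_model_file("MediaTek", bundle))
print(select_model_file("unknown", bundle))
```

Keeping a CPU artifact as the last entry of every preference list gives each device a working fallback even when no accelerator\-specific file is present\.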

Desktop/Laptop Support: ExecuTorch accelerates inference on consumer desktops and laptops using backends like XNNPACK, OpenVINO, and QNN\. With rising demand for local inference \(exemplified by llama\.cpp, MLX\), ExecuTorch is now experimenting with CUDA and Metal backend support, leveraging PyTorch’s AOTInductor technology\.

Sparsity: ExecuTorch’s IR can represent sparse weights using dense values and mask tensors, but hardware\-accelerated sparse kernels \(e\.g\., 2:4 structured sparsity\) are not yet widely available on edge targets\. Sparsity support remains a future direction as backend capabilities mature\.
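The dense\-values\-plus\-mask representation can be illustrated for the 2:4 pattern with a short sketch\. It is a pure\-Python illustration under assumptions: the function name is hypothetical, magnitude\-based pruning is just one common way to choose which weights to keep, and real models would store tensors rather than lists\.

```python
def prune_2_of_4(weights):
    """Keep the 2 largest-magnitude values in every block of 4; zero the rest.

    Returns (values, mask): dense values with zeros in pruned positions and a
    0/1 mask of the same length. Hypothetical helper for illustration.
    """
    if len(weights) % 4 != 0:
        raise ValueError("len(weights) must be a multiple of 4")
    values, mask = [], []
    for start in range(0, len(weights), 4):
        block = weights[start:start + 4]
        # Indices of the two largest magnitudes within this block.
        keep = sorted(range(4), key=lambda i: -abs(block[i]))[:2]
        for i, w in enumerate(block):
            kept = i in keep
            mask.append(1 if kept else 0)
            values.append(w if kept else 0.0)
    return values, mask


values, mask = prune_2_of_4([0.1, -0.9, 0.4, 0.05, 1.0, 0.0, -0.2, 0.3])
print(values)
print(mask)
```

The result stays dense in layout, so it runs on any backend; only hardware with structured\-sparsity support could exploit the mask to skip the zeroed half of each block\.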

## Acknowledgements

The completion of this project would not have been possible without the support, input, and contributions of numerous individuals and institutions\. We wish to express our sincere gratitude to all those who offered their expertise, resources, and guidance throughout this project\. For a full list, please see Appendix [B](https://arxiv.org/html/2605.08195#A2)\.

## References

- Hardware accelerators for artificial intelligence\. Springer, Cham\. External Links: [Document](https://dx.doi.org/10.1007/978-3-031-71436-8%5F14), ISBN 978\-3\-031\-71436\-8\. Cited by: [§1](https://arxiv.org/html/2605.08195#S1.p1.1)\.
- Andrew Or \(2025\) PyTorch 2 export quantization\-aware training \(qat\)\. External Links: [Link](https://docs.pytorch.org/ao/stable/tutorials_source/pt2e_quant_qat.html)\. Cited by: [§4\.3](https://arxiv.org/html/2605.08195#S4.SS3.p2.1)\.
- J\. Ansel, E\. Yang, H\. He, N\. Gimelshein, A\. Jain, M\. Voznesensky, B\. Bao, P\. Bell, D\. Berard, E\. Burovski, G\. Chauhan, A\. Chourdia, W\. Constable, A\. Desmaison, Z\. DeVito, E\. Ellison, W\. Feng, J\. Gong, M\. Gschwind, B\. Hirsh, S\. Huang, K\. Kalambarkar, L\. Kirsch, M\. Lazos, M\. Lezcano, Y\. Liang, J\. Liang, Y\. Lu, C\. K\. Luk, B\. Maher, Y\. Pan, C\. Puhrsch, M\. Reso, M\. Saroufim, M\. Y\. Siraichi, H\. Suk, S\. Zhang, M\. Suo, P\. Tillet, X\. Zhao, E\. Wang, K\. Zhou, R\. Zou, X\. Wang, A\. Mathews, W\. Wen, G\. Chanan, P\. Wu, and S\. Chintala \(2024\) PyTorch 2: faster machine learning through dynamic python bytecode transformation and graph compilation\. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, ASPLOS ’24, New York, NY, USA, pp\. 929–947\. External Links: ISBN 9798400703850, [Link](https://doi.org/10.1145/3620665.3640366), [Document](https://dx.doi.org/10.1145/3620665.3640366)\. Cited by: [§4\.1](https://arxiv.org/html/2605.08195#S4.SS1.p1.1)\.
- Apple Inc\. \(2017\) Core ML\. Apple Inc\., Cupertino, CA\. Note: Machine learning framework for iOS, macOS, watchOS, and tvOS\. Introduced at WWDC 2017\. External Links: [Link](https://developer.apple.com/documentation/coreml)\. Cited by: [3rd item](https://arxiv.org/html/2605.08195#S1.I1.i3.p1.1), [§2](https://arxiv.org/html/2605.08195#S2.p4.1)\.
- Apple Inc\. \(2023\) Core ml: machine learning framework for apple platforms\. Note: [https://developer\.apple\.com/documentation/coreml](https://developer.apple.com/documentation/coreml)\. Accessed: 2025\-10\-28\. Cited by: [§7\.5](https://arxiv.org/html/2605.08195#S7.SS5.p1.1)\.
- Arm Limited \(2025\) CMSIS\-nn: efficient neural network kernels for arm cortex\-m cpus\. GitHub\. External Links: [Link](https://github.com/)\. Cited by: [§9\.3](https://arxiv.org/html/2605.08195#S9.SS3.p1.1)\.
- Arm Ltd\. \(2024a\) Arm ethos\-u ecosystem: micronpus and software for efficient edge ai\. Note: [https://developer\.arm\.com/Processors/Ethos\-U](https://developer.arm.com/Processors/Ethos-U)\. Accessed: 2025\-10\-28\. Cited by: [§7\.3](https://arxiv.org/html/2605.08195#S7.SS3.p1.1)\.
- Arm Ltd\. \(2024b\) KleidiAI: open\-source micro\-kernel library for ai workloads on arm cpus\. Note: [https://gitlab\.arm\.com/kleidi/kleidiai](https://gitlab.arm.com/kleidi/kleidiai) \(mirror: [https://github\.com/ARM\-software/kleidiai](https://github.com/ARM-software/kleidiai)\)\. Accessed: 2025\-10\-28\. Cited by: [§7\.1](https://arxiv.org/html/2605.08195#S7.SS1.p1.1)\.
- T\. Chen, T\. Moreau, Z\. Jiang, L\. Zheng, E\. Yan, H\. Shen, M\. Cowan, L\. Wang, Y\. Hu, L\. Ceze, C\. Guestrin, and A\. Krishnamurthy \(2018\) TVM: an automated end\-to\-end optimizing compiler for deep learning\. In 13th USENIX Symposium on Operating Systems Design and Implementation \(OSDI 18\), Carlsbad, CA, USA, pp\. 578–594\. External Links: ISBN 978\-1\-939133\-08\-3, [Link](https://www.usenix.org/conference/osdi18/presentation/chen)\. Cited by: [§2](https://arxiv.org/html/2605.08195#S2.p3.1)\.
- M\. P\. Consortium \(2023\) Tensor operator set architecture \(tosa\) specification v1\.0\.1\. Note: [https://www\.mlplatform\.org/tosa/tosa\_spec\.html](https://www.mlplatform.org/tosa/tosa_spec.html)\. Accessed: 2025\-10\-28\. Cited by: [§7\.3](https://arxiv.org/html/2605.08195#S7.SS3.p1.1)\.
- R\. David, J\. Duke, A\. Jain, V\. J\. Reddi, N\. Jeffries, J\. Li, N\. Kreeger, I\. Nappier, M\. Natraj, S\. Regev, R\. Rhodes, T\. Wang, and P\. Warden \(2021\) TensorFlow Lite Micro: Embedded Machine Learning on TinyML Systems\. In Proceedings of Machine Learning and Systems, A\. Smola, A\. Dimakis, and I\. Stoica \(Eds\.\), Vol\. 3, San Jose, CA, USA, pp\. 800–811\. Note: Now known as LiteRT\. External Links: [Link](https://proceedings.mlsys.org/paper_files/paper/2021/file/6c44dc73014d66ba49b28d483a8f8b0d-Paper.pdf)\. Cited by: [1st item](https://arxiv.org/html/2605.08195#S1.I1.i1.p1.1), [§2](https://arxiv.org/html/2605.08195#S2.p2.1)\.
- T\. L\. Foundation \(2024\) Annual report 2024: accelerating industry innovation\. The Linux Foundation\. Note: Accessed: 2025\-10\-30\. External Links: [Link](https://www.linuxfoundation.org/resources/publications/linux-foundation-annual-report-2024)\. Cited by: [§1](https://arxiv.org/html/2605.08195#S1.p1.1)\.
- K\. Gadzicki, R\. Khamsehashari, and C\. Zetzsche \(2020\) Early vs late fusion in multimodal convolutional neural networks\. In 2020 IEEE 23rd International Conference on Information Fusion \(FUSION\), pp\. 1–6\. External Links: [Document](https://dx.doi.org/10.23919/FUSION45008.2020.9190246)\. Cited by: [§9\.2](https://arxiv.org/html/2605.08195#S9.SS2.p1.1)\.
- G\. Gerganov \(2023\) Llama\.cpp: LLM inference in C/C\+\+\. GitHub\. Note: [https://github\.com/ggerganov/llama\.cpp](https://github.com/ggerganov/llama.cpp)\. MIT License\. External Links: [Link](https://github.com/ggerganov/llama.cpp)\. Cited by: [2nd item](https://arxiv.org/html/2605.08195#S1.I1.i2.p1.1), [§2](https://arxiv.org/html/2605.08195#S2.p6.1)\.
- Google \(2019\) XNNPACK: high\-efficiency floating\-point neural network inference operators for mobile and server platforms\. Note: [https://github\.com/google/XNNPACK](https://github.com/google/XNNPACK)\. Accessed: 2025\-10\-28\. Cited by: [§7\.1](https://arxiv.org/html/2605.08195#S7.SS1.p1.1)\.
- A\. Hannun, J\. Digani, A\. Katharopoulos, and R\. Collobert \(2023\) MLX: efficient and flexible machine learning on apple silicon\. External Links: [Link](https://github.com/ml-explore)\. Cited by: [§2](https://arxiv.org/html/2605.08195#S2.p4.1)\.
- Jerry Zhang \(2025\) PyTorch 2 export post training quantization\. External Links: [Link](https://docs.pytorch.org/ao/stable/tutorials_source/pt2e_quant_ptq.html)\. Cited by: [§4\.3](https://arxiv.org/html/2605.08195#S4.SS3.p2.1)\.
- Y\. Jia, E\. Shelhamer, J\. Donahue, S\. Karayev, J\. Long, R\. Girshick, S\. Guadarrama, and T\. Darrell \(2014\) Caffe: convolutional architecture for fast feature embedding\. External Links: 1408\.5093, [Link](https://arxiv.org/abs/1408.5093)\. Cited by: [§2](https://arxiv.org/html/2605.08195#S2.p2.1)\.
- X\. Jiang, H\. Wang, Y\. Chen, Z\. Wu, L\. Wang, B\. Zou, Y\. Yang, Z\. Cui, Y\. Cai, T\. Yu, C\. Lv, and Z\. Wu \(2020\) MNN: a universal and efficient inference engine\. In Proceedings of Machine Learning and Systems, Vol\. 2, pp\. 1–13\. Note: Alibaba Inc\. Available at [https://github\.com/alibaba/MNN](https://github.com/alibaba/MNN)\. External Links: [Link](https://proceedings.mlsys.org/paper_files/paper/2020/hash/bc19061f88f16e9ed4a18f0bbd47048a-Abstract.html)\. Cited by: [§2](https://arxiv.org/html/2605.08195#S2.p3.1)\.
- W\. Kang, J\. Lee, Y\. Lee, S\. Oh, K\. Lee, and H\. S\. Chwa \(2024\) RT\-Swap: addressing GPU memory bottlenecks for real\-time multi\-DNN inference\. In 2024 IEEE 30th Real\-Time and Embedded Technology and Applications Symposium \(RTAS\), Hong Kong, China, pp\. 373–385\. External Links: [Document](https://dx.doi.org/10.1109/RTAS61025.2024.00037)\. Cited by: [§1](https://arxiv.org/html/2605.08195#S1.p1.1)\.
- Khronos Group \(2023\)Vulkan api specification, version 1\.3\.The Khronos Group Inc\.\.Note:Accessed: 2025\-10\-28External Links:[Link](https://registry.khronos.org/vulkan/specs/1.3/html/vkspec.html)Cited by:[§7\.2](https://arxiv.org/html/2605.08195#S7.SS2.p1.1)\.
- T\. Kuo, J\. Kim, and R\. A\. Gabriel \(2020\)Privacy\-preserving model learning on a blockchain network\-of\-networks\.Vol\.27,Oxford University Press\.External Links:[Document](https://dx.doi.org/10.1093/jamia/ocz214)Cited by:[§1](https://arxiv.org/html/2605.08195#S1.p1.1)\.
- W\. Kwon, Z\. Li, S\. Zhuang, Y\. Sheng, L\. Zheng, C\. H\. Yu, J\. E\. Gonzalez, H\. Zhang, and I\. Stoica \(2023\)Efficient memory management for large language model serving with pagedattention\.External Links:2309\.06180,[Link](https://arxiv.org/abs/2309.06180)Cited by:[§2](https://arxiv.org/html/2605.08195#S2.p6.1)\.
- P\. Langley \(2000\)Crafting papers on machine learning\.InProceedings of the 17th International Conference on Machine Learning \(ICML 2000\),P\. Langley \(Ed\.\),Stanford, CA,pp\. 1207–1216\.Cited by:[Appendix B](https://arxiv.org/html/2605.08195#A2.p15.1)\.
- J\. Lin, J\. Tang, H\. Tang, S\. Yang, W\. Chen, W\. Wang, G\. Xiao, X\. Dang, C\. Gan, and S\. Han \(2024\)AWQ: activation\-aware weight quantization for llm compression and acceleration\.External Links:2306\.00978,[Link](https://arxiv.org/abs/2306.00978)Cited by:[§1](https://arxiv.org/html/2605.08195#S1.p1.1)\.
- A\. H\. Liu, A\. Ehrenberg, A\. Lo, C\. Denoix, C\. Barreau, G\. Lample, J\. Delignon, K\. R\. Chandu, P\. von Platen, P\. R\. Muddireddy, S\. Gandhi, S\. Ghosh, S\. Mishra, T\. Foubert, A\. Rastogi, A\. Yang, A\. Q\. Jiang, A\. Sablayrolles, A\. Héliou, A\. Martin, A\. Agarwal, A\. Roux, A\. Darcet, A\. Mensch, B\. Bout, B\. Rozière, B\. D\. Monicault, C\. Bamford, C\. Wallenwein, C\. Renaudin, C\. Lanfranchi, D\. Dabert, D\. S\. Chaplot, D\. Mizelle, D\. de las Casas, E\. Chane\-Sane, E\. Fugier, E\. B\. Hanna, G\. Berrada, G\. Delerce, G\. Guinet, G\. Novikov, G\. Martin, H\. Jaju, J\. Ludziejewski, J\. Rute, J\. Chabran, J\. Chudnovsky, J\. Studnia, J\. Barmentlo, J\. Amar, J\. S\. Roberts, J\. Denize, K\. Saxena, K\. Yadav, K\. Khandelwal, K\. Jain, L\. R\. Lavaud, L\. Blier, L\. Zhao, L\. Martin, L\. Saulnier, L\. Gao, M\. Pellat, M\. Guillaumin, M\. Felardos, M\. Dinot, M\. Darrin, M\. Augustin, M\. Seznec, N\. Gupta, N\. Raghuraman, O\. Duchenne, P\. Wang, P\. Saffer, P\. Jacob, P\. Wambergue, P\. Kurylowicz, P\. Chagniot, P\. Stock, P\. Agrawal, R\. Delacourt, R\. Sauvestre, R\. Soletskyi, S\. Vaze, S\. Subramanian, S\. Garg, S\. Dalal, S\. Gandhi, S\. Aithal, S\. Antoniak, T\. L\. Scao, T\. Schueller, T\. Lavril, T\. Robert, T\. Wang, T\. Lacroix, T\. Bewley, V\. Nemychnikova, V\. Paltz, V\. Richard, W\. Li, W\. Marshall, X\. Zhang, Y\. Wan, and Y\. Tang \(2025a\)Voxtral\.External Links:2507\.13264,[Link](https://arxiv.org/abs/2507.13264)Cited by:[§9\.2](https://arxiv.org/html/2605.08195#S9.SS2.p1.1)\.
- Z\. Liu, C\. Zhao, I\. Fedorov, B\. Soran, D\. Choudhary, R\. Krishnamoorthi, V\. Chandra, Y\. Tian, and T\. Blankevoort \(2025b\)SpinQuant: llm quantization with learned rotations\.External Links:2405\.16406,[Link](https://arxiv.org/abs/2405.16406)Cited by:[§4\.3](https://arxiv.org/html/2605.08195#S4.SS3.p1.1)\.
- Meta AI \(2024\)Introducing quantized Llama models with increased speed and a reduced memory footprint\.Meta AI\.Note:[https://ai\.meta\.com/blog/meta\-llama\-quantized\-lightweight\-models/](https://ai.meta.com/blog/meta-llama-quantized-lightweight-models/)Accessed: 2025\-10\-29Cited by:[§1](https://arxiv.org/html/2605.08195#S1.p6.1)\.
- Meta \(2025a\)Accelerating On\-Device ML on Meta’s Family of Apps with ExecuTorch\.Note:[https://engineering\.fb\.com/2025/07/28/android/executorch\-on\-device\-ml\-meta\-family\-of\-apps/](https://engineering.fb.com/2025/07/28/android/executorch-on-device-ml-meta-family-of-apps/)Accessed: 2025\-12\-09Cited by:[§1](https://arxiv.org/html/2605.08195#S1.p9.1)\.
- Meta \(2025b\)ExecuTorch Reality Labs On\-Device AI\.Note:[https://ai\.meta\.com/blog/executorch\-reality\-labs\-on\-device\-ai/](https://ai.meta.com/blog/executorch-reality-labs-on-device-ai/)Accessed: 2025\-12\-09Cited by:[§1](https://arxiv.org/html/2605.08195#S1.p9.1)\.
- Microsoft \(2018\)ONNX Runtime: cross\-platform, high performance ml inferencing and training accelerator\.Note:[https://github\.com/microsoft/onnxruntime](https://github.com/microsoft/onnxruntime)Open source inference engineExternal Links:[Link](https://onnxruntime.ai/)Cited by:[1st item](https://arxiv.org/html/2605.08195#S1.I1.i1.p1.1),[§2](https://arxiv.org/html/2605.08195#S2.p2.1)\.
- M\. Y\. Ng, J\. Helzer, M\. A\. Pfeffer, T\. Seto, and T\. Hernandez\-Boussard \(2025\)Development of secure infrastructure for advancing generative artificial intelligence research in healthcare at an academic medical center\.Vol\.32,Oxford University Press\.External Links:[Document](https://dx.doi.org/10.1093/jamia/ocaf005),ISSN 1527\-974XCited by:[§1](https://arxiv.org/html/2605.08195#S1.p1.1)\.
- V\. Nigade, P\. Bauszat, H\. E\. Bal, and L\. Wang \(2024\)Inference serving with end\-to\-end latency SLOs over dynamic edge networks\.Vol\.60,Springer\.External Links:[Document](https://dx.doi.org/10.1007/s11241-024-09418-4)Cited by:[§1](https://arxiv.org/html/2605.08195#S1.p1.1)\.
- M\. Pons, E\. Valenzuela, B\. Rodríguez
- PyTorch Developers \(2022a\)IRs\.External Links:[Link](https://docs.pytorch.org/docs/2.9/torch.compiler_ir.html)Cited by:[§4\.1](https://arxiv.org/html/2605.08195#S4.SS1.p4.1)\.
- PyTorch Developers \(2022b\)Torch\.export\.External Links:[Link](https://docs.pytorch.org/docs/2.9/export.html)Cited by:[§4\.1](https://arxiv.org/html/2605.08195#S4.SS1.p2.1)\.
- PyTorch Developers \(2025\)Optimum ExecuTorch — optimize and deploy hugging face models with executorch\.Note:[https://github\.com/huggingface/optimum\-executorch](https://github.com/huggingface/optimum-executorch)Accessed: 2025\-10\-28Cited by:[§9\.1](https://arxiv.org/html/2605.08195#S9.SS1.p1.1)\.
- PyTorch Foundation \(2021\)PyTorch 1\.9 release, including torch\.linalg and mobile interpreter\.External Links:[Link](https://pytorch.org/blog/pytorch-1-9-released/)Cited by:[§2](https://arxiv.org/html/2605.08195#S2.p5.1),[§5\.3](https://arxiv.org/html/2605.08195#S5.SS3.p1.2)\.
- PyTorch Team \(2019\)PyTorch Mobile: end\-to\-end workflow for mobile deployment\.Meta AI\.Note:Introduced in PyTorch 1\.3\. Now superseded by ExecuTorchExternal Links:[Link](https://pytorch.org/mobile/)Cited by:[§1](https://arxiv.org/html/2605.08195#S1.p3.1),[§2](https://arxiv.org/html/2605.08195#S2.p5.1)\.
- PyTorch \(2025\)Subclassing torch\.tensor\.External Links:[Link](https://docs.pytorch.org/docs/stable/notes/extending.html#subclassing-torch-tensor)Cited by:[§4\.3](https://arxiv.org/html/2605.08195#S4.SS3.p3.1)\.
- Inc\. Qualcomm Technologies \(2024\)Qualcomm ai engine direct sdk\.Note:[https://www\.qualcomm\.com/developer/software/qualcomm\-ai\-engine\-direct\-sdk](https://www.qualcomm.com/developer/software/qualcomm-ai-engine-direct-sdk)Accessed: 2025\-10\-28Cited by:[§7\.4](https://arxiv.org/html/2605.08195#S7.SS4.p1.1)\.
- Qualcomm Technologies, Inc\. \(2016\)Snapdragon Neural Processing Engine SDK\.Qualcomm Technologies, Inc\.\.Note:Now branded as Qualcomm Neural Processing SDK for AIExternal Links:[Link](https://developer.qualcomm.com/software/qualcomm-neural-processing-sdk)Cited by:[3rd item](https://arxiv.org/html/2605.08195#S1.I1.i3.p1.1),[§2](https://arxiv.org/html/2605.08195#S2.p4.1)\.
- J\. K\. Reed, Z\. DeVito, H\. He, A\. Ussery, and J\. Ansel \(2021\)Torch\.fx: practical program capture and transformation for deep learning in python\.Vol\.abs/2112\.08429\.External Links:[Link](https://arxiv.org/abs/2112.08429),2112\.08429Cited by:[§4\.1](https://arxiv.org/html/2605.08195#S4.SS1.p2.1)\.
- Y\. Shao \(2024\)Local\-global attention: an adaptive mechanism for multi\-scale feature integration\.External Links:2411\.09604,[Link](https://arxiv.org/abs/2411.09604)Cited by:[§9\.1](https://arxiv.org/html/2605.08195#S9.SS1.p6.1)\.
- N\. Sperling and R\. Ernst \(2024\)Reducing communication cost and latency in autonomous vehicles with subscriber\-centric selective data distribution\.In2024 IEEE 99th Vehicular Technology Conference \(VTC Spring\),Singapore\.Note:Document ID: 10683426Cited by:[§1](https://arxiv.org/html/2605.08195#S1.p1.1)\.
- G\. Team, A\. Kamath, J\. Ferret, S\. Pathak, N\. Vieillard, R\. Merhej, S\. Perrin, T\. Matejovicova, A\. Ramé, M\. Rivière, L\. Rouillard, T\. Mesnard, G\. Cideron, J\. Grill, S\. Ramos, E\. Yvinec, M\. Casbon, E\. Pot, I\. Penchev, G\. Liu, F\. Visin, K\. Kenealy, L\. Beyer, X\. Zhai, A\. Tsitsulin, R\. Busa\-Fekete, A\. Feng, N\. Sachdeva, B\. Coleman, Y\. Gao, B\. Mustafa, I\. Barr, E\. Parisotto, D\. Tian, M\. Eyal, C\. Cherry, J\. Peter, D\. Sinopalnikov, S\. Bhupatiraju, R\. Agarwal, M\. Kazemi, D\. Malkin, R\. Kumar, D\. Vilar, I\. Brusilovsky, J\. Luo, A\. Steiner, A\. Friesen, A\. Sharma, A\. Sharma, A\. M\. Gilady, A\. Goedeckemeyer, A\. Saade, A\. Feng, A\. Kolesnikov, A\. Bendebury, A\. Abdagic, A\. Vadi, A\. György, A\. S\. Pinto, A\. Das, A\. Bapna, A\. Miech, A\. Yang, A\. Paterson, A\. Shenoy, A\. Chakrabarti, B\. Piot, B\. Wu, B\. Shahriari, B\. Petrini, C\. Chen, C\. L\. Lan, C\. A\. Choquette\-Choo, C\. Carey, C\. Brick, D\. Deutsch, D\. Eisenbud, D\. Cattle, D\. Cheng, D\. Paparas, D\. S\. Sreepathihalli, D\. Reid, D\. Tran, D\. Zelle, E\. Noland, E\. Huizenga, E\. Kharitonov, F\. Liu, G\. Amirkhanyan, G\. Cameron, H\. Hashemi, H\. Klimczak\-Plucińska, H\. Singh, H\. Mehta, H\. T\. Lehri, H\. Hazimeh, I\. Ballantyne, I\. Szpektor, I\. Nardini, J\. Pouget\-Abadie, J\. Chan, J\. Stanton, J\. Wieting, J\. Lai, J\. Orbay, J\. Fernandez, J\. Newlan, J\. Ji, J\. Singh, K\. Black, K\. Yu, K\. Hui, K\. Vodrahalli, K\. Greff, L\. Qiu, M\. Valentine, M\. Coelho, M\. Ritter, M\. Hoffman, M\. Watson, M\. Chaturvedi, M\. Moynihan, M\. Ma, N\. Babar, N\. Noy, N\. Byrd, N\. Roy, N\. Momchev, N\. Chauhan, N\. Sachdeva, O\. Bunyan, P\. Botarda, P\. Caron, P\. K\. Rubenstein, P\. Culliton, P\. Schmid, P\. G\. Sessa, P\. Xu, P\. Stanczyk, P\. Tafti, R\. Shivanna, R\. Wu, R\. Pan, R\. Rokni, R\. Willoughby, R\. Vallu, R\. Mullins, S\. Jerome, S\. Smoot, S\. Girgin, S\. Iqbal, S\. Reddy, S\. Sheth, S\. Põder, S\. Bhatnagar, S\. R\. Panyam, S\. Eiger, S\. 
Zhang, T\. Liu, T\. Yacovone, T\. Liechty, U\. Kalra, U\. Evci, V\. Misra, V\. Roseberry, V\. Feinberg, V\. Kolesnikov, W\. Han, W\. Kwon, X\. Chen, Y\. Chow, Y\. Zhu, Z\. Wei, Z\. Egyed, V\. Cotruta, M\. Giang, P\. Kirk, A\. Rao, K\. Black, N\. Babar, J\. Lo, E\. Moreira, L\. G\. Martins, O\. Sanseviero, L\. Gonzalez, Z\. Gleicher, T\. Warkentin, V\. Mirrokni, E\. Senter, E\. Collins, J\. Barral, Z\. Ghahramani, R\. Hadsell, Y\. Matias, D\. Sculley, S\. Petrov, N\. Fiedel, N\. Shazeer, O\. Vinyals, J\. Dean, D\. Hassabis, K\. Kavukcuoglu, C\. Farabet, E\. Buchatskaya, J\. Alayrac, R\. Anil, Dmitry, Lepikhin, S\. Borgeaud, O\. Bachem, A\. Joulin, A\. Andreev, C\. Hardin, R\. Dadashi, and L\. Hussenot \(2025\)Gemma 3 technical report\.External Links:2503\.19786,[Link](https://arxiv.org/abs/2503.19786)Cited by:[§9\.2](https://arxiv.org/html/2605.08195#S9.SS2.p1.1)\.
- torchao \(2024\)TorchAO: pytorch\-native training\-to\-serving model optimization\.External Links:[Link](https://github.com/pytorch/ao)Cited by:[§4\.3](https://arxiv.org/html/2605.08195#S4.SS3.p1.1)\.
- P\. K\. A\. Vasu, J\. Gabriel, J\. Zhu, O\. Tuzel, and A\. Ranjan \(2023\)FastViT: a fast hybrid vision transformer using structural reparameterization\.InProceedings of the IEEE/CVF International Conference on Computer Vision \(ICCV\),pp\. 5785–5795\.Cited by:[§1](https://arxiv.org/html/2605.08195#S1.p1.1)\.
- P\. K\. A\. Vasu, H\. Pouransari, F\. Faghri, R\. Vemulapalli, and O\. Tuzel \(2024\)MobileCLIP: fast image\-text models through multi\-modal reinforced training\.InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition \(CVPR\),pp\. 15963–15974\.Cited by:[§1](https://arxiv.org/html/2605.08195#S1.p1.1)\.
- T\. Wang, J\. Guo, B\. Zhang, G\. Yang, and D\. Li \(2025\)Deploying AI on edge: advancement and challenges in edge intelligence\.Vol\.13,MDPI\.External Links:[Document](https://dx.doi.org/10.3390/math13111878),ISSN 2227\-7390Cited by:[§1](https://arxiv.org/html/2605.08195#S1.p1.1)\.
- X\. Wang and W\. Jia \(2025\)Optimizing edge AI: a comprehensive survey on data, model, and system strategies\.External Links:2501\.03265,[Link](https://arxiv.org/abs/2501.03265),[Document](https://dx.doi.org/10.48550/arXiv.2501.03265)Cited by:[§1](https://arxiv.org/html/2605.08195#S1.p1.1)\.
- T\. Wolf, L\. Debut, V\. Sanh, J\. Chaumond, C\. Delangue, A\. Moi, P\. Cistac, T\. Rault, R\. Louf, M\. Funtowicz, J\. Davison, S\. Shleifer, P\. von Platen, C\. Ma, Y\. Jernite, J\. Plu, C\. Xu, T\. L\. Scao, S\. Gugger, M\. Drame, Q\. Lhoest, and A\. M\. Rush \(2020\)Transformers: state\-of\-the\-art natural language processing\.InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations,Online,pp\. 38–45\.External Links:[Link](https://www.aclweb.org/anthology/2020.emnlp-demos.6)Cited by:[§9\.1](https://arxiv.org/html/2605.08195#S9.SS1.p1.1)\.
- Y\. Wu, T\. Ortner, R\. Zou, E\. Z\. Yang, A\. Akhundov, H\. He, and Y\. Cao \(2025\)Control flow operators in pytorch\.InChampioning Open\-source Development in ML Workshop @ ICML25,External Links:[Link](https://openreview.net/forum?id=GMFG27v26J)Cited by:[1st item](https://arxiv.org/html/2605.08195#S12.I2.i1.p1.1)\.
- J\. Xu, B\. S\. Glicksberg, C\. Su, P\. Walker, J\. Bian, and F\. Wang \(2021\)Federated learning for healthcare informatics\.Vol\.5,Springer\.External Links:[Document](https://dx.doi.org/10.1007/s41666-020-00082-4)Cited by:[§1](https://arxiv.org/html/2605.08195#S1.p1.1)\.
- A\. Yang, A\. Li, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Gao, C\. Huang, C\. Lv, C\. Zheng, D\. Liu, F\. Zhou, F\. Huang, F\. Hu, H\. Ge, H\. Wei, H\. Lin, J\. Tang, J\. Yang, J\. Tu, J\. Zhang, J\. Yang, J\. Yang, J\. Zhou, J\. Zhou, J\. Lin, K\. Dang, K\. Bao, K\. Yang, L\. Yu, L\. Deng, M\. Li, M\. Xue, M\. Li, P\. Zhang, P\. Wang, Q\. Zhu, R\. Men, R\. Gao, S\. Liu, S\. Luo, T\. Li, T\. Tang, W\. Yin, X\. Ren, X\. Wang, X\. Zhang, X\. Ren, Y\. Fan, Y\. Su, Y\. Zhang, Y\. Zhang, Y\. Wan, Y\. Liu, Z\. Wang, Z\. Cui, Z\. Zhang, Z\. Zhou, and Z\. Qiu \(2025\)Qwen3 technical report\.External Links:2505\.09388,[Link](https://arxiv.org/abs/2505.09388)Cited by:[§9\.1](https://arxiv.org/html/2605.08195#S9.SS1.p1.1)\.

## Appendix A LLM Operator\-Level Performance Breakdown

To diagnose the performance gaps between ExecuTorch and llama\.cpp observed in Table [5](https://arxiv.org/html/2605.08195#S11.T5), we profiled Llama 3\.2 1B, Qwen3 0\.6B, and Phi4 Mini on the Samsung Galaxy S25 Ultra\. ExecuTorch models use 8da4w quantization \(int8 activations × int4 weights, group size 32\); llama\.cpp uses Q4\_0 \(dynamically quantized int8 or fp activations × int4 weights, group size 32; 6\-bit group\-wise quantized weights for the LM head\)\. All models are profiled with a maximum context length of 2048, 256 prompt tokens, and 256 generated tokens\.
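As a rough illustration of what "int8 activations × int4 weights, group size 32" means numerically, the sketch below quantizes one weight row group\-wise to int4, quantizes the activation vector to int8 at run time, accumulates in integers inside each group, and applies the scales once per group. It is a minimal pure\-Python sketch, not the ExecuTorch or llama\.cpp kernels: the function names are hypothetical, and symmetric quantization with a single activation scale is an assumption made for brevity.

```python
import random

def quantize_weights_int4(row, group_size=32):
    """Symmetric group-wise int4 quantization of one weight row (assumed scheme).

    Returns (quantized ints in [-8, 7], one float scale per group).
    """
    q, scales = [], []
    for start in range(0, len(row), group_size):
        group = row[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        scale = max_abs / 7.0 if max_abs > 0 else 1.0
        scales.append(scale)
        q.extend(max(-8, min(7, round(w / scale))) for w in group)
    return q, scales

def quantize_activations_int8(x):
    """Dynamic symmetric int8 quantization: the scale is derived from x at run time."""
    max_abs = max(abs(v) for v in x)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    return [max(-128, min(127, round(v / scale))) for v in x], scale

def dot_8da4w(x, w_q, w_scales, group_size=32):
    """Dot product with integer accumulation per group, then a per-group rescale."""
    x_q, x_scale = quantize_activations_int8(x)
    total = 0.0
    for g, w_scale in enumerate(w_scales):
        lo = g * group_size
        # Integer multiply-accumulate over one group of the reduction dimension.
        acc = sum(a * b for a, b in zip(x_q[lo:lo + group_size], w_q[lo:lo + group_size]))
        # The per-group weight scale is applied once per group, not per element.
        total += acc * x_scale * w_scale
    return total

random.seed(0)
K = 128  # reduction dimension, i.e. 4 groups of 32
x = [random.uniform(-1, 1) for _ in range(K)]
w = [random.uniform(-1, 1) for _ in range(K)]
w_q, w_scales = quantize_weights_int4(w)
exact = sum(a * b for a, b in zip(x, w))
print(f"float dot = {exact:.4f}, quantized dot = {dot_8da4w(x, w_q, w_scales):.4f}")
```

The number of rescale steps per output element is the reduction length divided by the group size, so a larger group size means fewer scale loads and multiplications per dot product.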

Llama 3\.2 1B uses 16 layers with hidden dimension 2048 and 32 query heads, resulting in a head dimension of 64; Qwen3 0\.6B uses 28 layers, hidden dimension 1024, 16 query heads, and an explicit head dimension of 128; and Phi4 Mini uses 32 layers with hidden dimension 3072 and 24 query heads, resulting in a head dimension of 128\. All models use 8 KV heads\.
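The head dimensions above follow from the stated shapes, and they determine how large the KV cache grows. The following back\-of\-the\-envelope sketch recomputes them and estimates the cache footprint at the maximum context length of 2048. The helper names are hypothetical, the formula (two tensors per layer of shape KV heads × context × head dimension) is an assumed dense layout, and the 4\-byte and 2\-byte element sizes are shown only for comparison; none of the resulting sizes are measurements from the paper.

```python
# Model shapes as stated in the text; "head_dim" is set only where it is explicit.
MODELS = {
    "Llama 3.2 1B": {"layers": 16, "hidden": 2048, "q_heads": 32, "kv_heads": 8},
    "Qwen3 0.6B": {"layers": 28, "hidden": 1024, "q_heads": 16, "kv_heads": 8, "head_dim": 128},
    "Phi4 Mini": {"layers": 32, "hidden": 3072, "q_heads": 24, "kv_heads": 8},
}

def head_dim(cfg):
    """Explicit head dimension if given, otherwise hidden size / query heads."""
    return cfg.get("head_dim", cfg["hidden"] // cfg["q_heads"])

def kv_cache_bytes(cfg, context_len, bytes_per_elem):
    """Estimated size of the K and V caches across all layers (assumed dense layout)."""
    per_tensor = cfg["kv_heads"] * context_len * head_dim(cfg)
    return 2 * cfg["layers"] * per_tensor * bytes_per_elem

for name, cfg in MODELS.items():
    mib_4byte = kv_cache_bytes(cfg, 2048, 4) / 2**20
    mib_2byte = kv_cache_bytes(cfg, 2048, 2) / 2**20
    print(f"{name}: head_dim={head_dim(cfg)}, "
          f"KV cache ~{mib_4byte:.0f} MiB (4-byte) / ~{mib_2byte:.0f} MiB (2-byte)")
```

Under these assumptions, doubling the head dimension doubles the per\-layer cache size, and halving the element width halves it.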

Due to profiling overhead and run\-to\-run variation, the profiling data presented in this Appendix may not line up exactly with the measurements reported in Table [5](https://arxiv.org/html/2605.08195#S11.T5)\. However, relative performance between frameworks and backends should be fairly consistent\.

In the tables below, columns associated with “ggml” represent inference with llama\.cpp\.

### A\.1 CPU: ExecuTorch XNNPACK vs llama\.cpp CPU

Table 7: CPU decode latency breakdown \(ms/token\) for 256 generated tokens\.

Decode \(Table [7](https://arxiv.org/html/2605.08195#A1.T7)\)\. Linear layers are at parity across frameworks \(11\.22 vs 12\.10 ms for Llama\)\. The dominant gap is SDPA; llama\.cpp’s implementation is 1\.1–2\.3× faster than XNNPACK’s custom\_sdpa for single\-query decode \(1\.71 vs 1\.95 ms on Llama, 2\.39 vs 5\.37 ms on Qwen, 4\.36 vs 10\.01 ms on Phi4\)\. Note that in ExecuTorch, many ops such as RMSNorm, Activation, and RoPE are decomposed into core ATen ops, which makes it difficult to attribute the resulting operators to their original category\. As a result, RoPE is accounted for in the “Other” category\. The overhead introduced by decomposition also contributes to a slight disadvantage for ExecuTorch\.

Table 8: CPU prefill latency breakdown \(total ms for 256 tokens\)\.

Prefill \(Table [8](https://arxiv.org/html/2605.08195#A1.T8)\)\. In contrast to decode, XNNPACK is slightly faster in linear layers \(6–19% faster\), which account for the majority of inference time\. Notably, for prefill ExecuTorch’s custom\_sdpa operator is 2\.1–2\.6× faster than llama\.cpp’s attention implementation\. Operator decompositions and inefficiencies in some operator implementations such as embedding account for worse performance in the “Activation” and “Other” categories\. The net result of these effects is that prefill performance between both frameworks is roughly on par\.

ExecuTorch’s custom\_sdpa operator uses a tiled implementation to optimize for parallel processing of multiple queries\. However, minimal adjustments are made when handling single\-token decode\. In contrast, llama\.cpp’s attention implementation makes significant adjustments to its execution strategy when the batch dimension is 1\. This may explain why ExecuTorch’s attention implementation is much faster for prefill, but much slower for decode\.

### A\.2 GPU: ExecuTorch Vulkan vs llama\.cpp OpenCL

Table 9: GPU decode latency breakdown \(ms/token\) for 256 generated tokens\. “Dynamic Quant” occurs only for Vulkan, and consists of a “choose quantization parameters” shader and “activation quantization” shader dispatched before each linear layer; for llama\.cpp, “Linear” includes the Q6\_K lm\_head kernel\.

Decode \(Table [9](https://arxiv.org/html/2605.08195#A1.T9)\): There are two notable factors that explain differences in operator latency distribution between the two frameworks\. First, llama\.cpp uses 6\-bit group\-wise quantized weights for the LM head, and the associated compute shader is 3\.8–6\.7× slower than ExecuTorch Vulkan’s corresponding 4\-bit quantized linear compute shader \(e\.g\., 9\.74 vs 2\.58 ms for Llama, 40\.86 vs 6\.06 ms for Phi4\)\. Note that due to the differences in quantization, the computation being performed by both frameworks is not the same\.

Next, notice that while ExecuTorch Vulkan’s SDPA implementation is faster in Llama 3\.2 1B \(1\.82 vs 3\.22 ms\), it is 1\.9–2\.3× slower in Qwen3 0\.6B and Phi4 Mini \(6\.05 vs 3\.15 ms, 11\.46 vs 5\.09 ms\)\. In the Vulkan backend, SDPA is implemented by three shaders: one to compute the attention weight \(i\.e\., matrix multiply between query tensor and key cache fused with masking and scaling\), one to apply softmax to the attention weight, and one to compute the final matrix multiplication between the attention weight and the value cache\. The latency of the first two shaders is fairly consistent across models, but the final shader is much slower for Qwen3 and Phi4; furthermore, the latency increases dramatically with context length \(see Table [10](https://arxiv.org/html/2605.08195#A1.T10)\)\. Llama 3\.2 uses a head dimension of 64, while Qwen3 and Phi4 use a head dimension of 128, which doubles the memory footprint of the value cache\. For the value cache tensor, ExecuTorch Vulkan keeps the head dim contiguous and the context dim \(i\.e\., the reduction dim\) as the outermost dimension; this memory layout may be suboptimal and make the implementation more susceptible to the increased memory pressure from the doubled head dimension in Qwen3 and Phi4\. Another significant factor is that llama\.cpp uses fp16 storage for cache tensors \(compared to fp32 for ExecuTorch Vulkan\), which helps alleviate the increased memory pressure from larger cache tensors\.
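The three\-stage decomposition described above can be sketched for a single head and a single decode query as follows. This is a minimal pure\-Python illustration of the math only, not the Vulkan shaders: the function names are hypothetical, grouped\-query attention and GPU tiling are omitted, and at least one unmasked position is assumed. The value cache is indexed as `v_cache[t][d]`, mirroring the layout with the context dimension outermost and the head dimension contiguous.

```python
import math

def attn_scores(q, k_cache, valid_len):
    """Stage 1: scaled dot product of the query with each cached key, fused with masking.

    Positions at or beyond valid_len are masked to -inf. Assumes valid_len >= 1.
    """
    scale = 1.0 / math.sqrt(len(q))
    return [
        scale * sum(a * b for a, b in zip(q, k)) if t < valid_len else float("-inf")
        for t, k in enumerate(k_cache)
    ]

def softmax(scores):
    """Stage 2: numerically stable softmax over the context dimension."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def weighted_values(weights, v_cache):
    """Stage 3: reduce over the context dimension; v_cache[t] is one head_dim vector."""
    out = [0.0] * len(v_cache[0])
    for w, v in zip(weights, v_cache):
        for d, x in enumerate(v):
            out[d] += w * x
    return out

def sdpa_decode(q, k_cache, v_cache, valid_len):
    """Single-query attention as three separate passes, one per stage."""
    return weighted_values(softmax(attn_scores(q, k_cache, valid_len)), v_cache)

# Tiny example: head_dim = 2, cache capacity 3, only the first 2 positions are valid.
q = [1.0, 0.0]
k_cache = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
v_cache = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]]
print(sdpa_decode(q, k_cache, v_cache, valid_len=2))
```

Only the third stage reads the value cache, and its work grows with both the context length and the head dimension, which is consistent with that stage being the one whose latency varies across models.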

Table 10: Per\-layer average SDPA latency \(μs\) at different context lengths for ExecuTorch Vulkan on the Samsung Galaxy S25\.

The net effect of these two factors is that ExecuTorch Vulkan achieves lower decode latency than llama\.cpp, except for the Qwen3 0\.6B model, where the increased SDPA latency outweighs the differences in linear layers\.

As with ExecuTorch XNNPACK, there are some operators \(e\.g\., RMSNorm, SiLU\) that are currently decomposed by ExecuTorch when lowering to the Vulkan delegate, which results in higher latencies for those categories when compared to llama\.cpp\.

Table 11: GPU prefill latency breakdown \(total ms, 256 tokens\)\. “Dynamic Quant” occurs only for Vulkan, and consists of a choose quantization parameters shader and activation quantization shader dispatched before each linear layer\.

Prefill \(Table [11](https://arxiv.org/html/2605.08195#A1.T11)\): llama\.cpp has a clear advantage in prefill driven by lower latency for linear layers and attention, which account for the majority of inference time\. llama\.cpp’s linear layers perform fp16 accumulation with fp16 activations and dequantized 4\-bit weights, while ExecuTorch Vulkan dynamically quantizes fp32 activations to 8\-bit and performs integer accumulation using hardware\-accelerated integer dot product instructions\. Though int8 compute throughput is theoretically double that of fp16 compute throughput, Vulkan’s linear layer shaders have longer latencies compared to llama\.cpp, especially for Phi4 Mini\. This is likely due to the additional overhead of loading and applying per\-group quantization parameters during the requantization step; as seen in Table [5](https://arxiv.org/html/2605.08195#S11.T5), using a group size of 128 compared to 32 results in a ∼25–30% increase in overall prefill throughput for Qwen3 and Llama 3\.2, and a ∼56% increase for Phi4 Mini\. The additional cost of dynamic quantization further compounds the latency difference in linear layers between the two frameworks\. For attention, the same factors that contributed to worse latency in the decode step also apply in prefill\.

As with decode, operator decomposition accounts for more latency in Activation operators\. Additionally, the “Other” category for ExecuTorch Vulkan is dominated by reshape operators, which wrap attention layers\.

## Appendix B Contributor Acknowledgements

ExecuTorch is a collaborative effort spanning many teams and organizations\. We gratefully acknowledge the following contributors\.

Apple Kulin Seth\. Yifan Shen\. Gyan Sinha\. Denis Vieriu\.

Arm Tom Allsop\. Zingo Andersen\. Oscar Andersson\. Per Åstrand\. Baris Demirbilek\. Rob Elliott\. George Gekov\. Per Held\. Agrima Khare\. Benjamin Klimczak\. Fredrik Knutsson\. Emma Kujala\. Sebastian Larsson\. Xingguo Li\. Martin Lindström\. Adrian Lundell\. Erik Lundell\. Måns Nilsson\. Michiel Olieslagers\. Ryan O’Shea\. Yufeng Shi\. Saoirse Stewart\. Charlie Stokes\. Robert Taylor\. Carey Williams\. Elena Zhelezina\.

Cadence Andrew Grebenisan\. Chandana Madhira\. The Cadence team\.

Intel Aamir Nazir\. Yamini Nimmagadda\. Surya Siddharth Pemmaraju\.

MediaTek The NeuroPilot team\.

NXP Roman Janik\. Robert Kalmar\. Jiri Ocenasek\. Martin Pavella\. Davis Sawer\. Šimon Strýček\.

Qualcomm Felix Baum\. Hao\-Wei Hsu\. Winston Kuo\. Harsh Shah\. Chun\-I Tsai\. Kiwi Wang\. Sheng Feng Wu\. YuYang Zhuang\.

Samsung Collin Allen\. Hoon Choi\. Alex Dean\. Mostafa El\-Khamy\. SangHyuck Ha\. Fangming He\. Shujie Huang\. Bruce Kim\. Sangsoo Ko\. Pavan Lanka\. Jiseong Oh\. Sicheon Oh\.

Tencent Jie Fu\.

Independent Contributors Zuby Afzal\. Xiang Li\.

Meta In addition to the paper authors, we thank: Eli Amesefe\. Stefano Cadario\. Avik Chaudhuri\. Xingying Cheng\. Matthias Cremon\. Salil Desai\. Alban Desmaison\. Huy Do\. Riley Dulin\. Soumyadeep Ghosh\. Chris Gottbrath\. Min Guo\. Lunwen He\. Nitin Jain\. Svetlana Karslioglu\. Harshit Khaitan\. Ali Khosh\. Ji Li\. Yi Li\. Juniper Pineda\. Varun Puri\. Jathu Satkunarajah\. Nathanael See\. Nikita Shulga\. Jake Stevens\. Michael Suo\. Andrey Talman\. Chris Thompson\. Vivek Trivedi\. Chakri Uddaraju\. Eli Uriegas\. Jesse White\. Yidi Wu\. Yiwen Xie\. Justin Yip\. Shangdi Yu\.

We also thank these individuals for their contributions to ExecuTorch while working at Meta:

Ishan Aryendu\. Michael Gschwind\. Rohan Joshi\. Akshit Khurana\. Juntian Liu\. Olivia Liu\. Dhruv Matani\. Bujji Setty\. Joe Spisak\. Conan Truong\. Shen Chen Xu\.
