Systematic Optimization of Real-Time Diffusion Model Inference on Apple M3 Ultra
Summary
This paper presents a systematic optimization study of real-time diffusion model inference on the Apple M3 Ultra, achieving 22.7 FPS at 512x512 resolution using CoreML conversion and a distillation model, revealing that CUDA-optimized techniques do not directly transfer to Apple's unified memory architecture.
# Systematic Optimization of Real-Time Diffusion Model Inference on Apple M3 Ultra

Source: [https://arxiv.org/html/2605.16259](https://arxiv.org/html/2605.16259)

Yoichi Ochiai, University of Tsukuba, Faculty of Library, Information and Media Science, wizard@slis.tsukuba.ac.jp

###### Abstract

While real-time image generation using diffusion models has advanced rapidly on NVIDIA GPUs, systematic optimization research on non-CUDA platforms such as Apple Silicon remains extremely limited. In this study, we conducted comprehensive optimization experiments across 10 phases targeting the Apple M3 Ultra (60-core GPU, 512 GB unified memory) with the goal of achieving real-time camera img2img transformation. We explored a wide range of techniques including CoreML conversion, quantization, Token Merging, Neural Engine utilization, compact model exploration, frame interpolation, kNN search-based synthesis, pix2pix-turbo, optical flow frame skipping, and knowledge distillation, quantitatively evaluating the effectiveness of each approach. Ultimately, by combining CoreML conversion of the distillation-specialized model SDXS-512 with a 3-thread camera pipeline, we achieved real-time camera img2img transformation at 22.7 FPS at 512×512 resolution. The primary contribution of this work is the systematic demonstration that optimization insights established for CUDA are not necessarily effective on Apple Silicon's unified memory architecture. We reveal an optimization landscape fundamentally different from that of NVIDIA GPUs (including the absence of speedup from quantization, the ineffectiveness of parallel inference, and the unsuitability of the Neural Engine for large-scale models) and provide practical guidelines for diffusion model inference on Apple Silicon.

## 1 Introduction

Diffusion models \[[1](https://arxiv.org/html/2605.16259#bib.bib1), [2](https://arxiv.org/html/2605.16259#bib.bib2)\] have become the dominant paradigm for text-to-image generation and image transformation. However, their iterative denoising process is computationally expensive, and real-time inference remains a challenging problem. In recent years, acceleration techniques have developed rapidly, including single-step distillation approaches such as SD-Turbo \[[3](https://arxiv.org/html/2605.16259#bib.bib3)\] and SDXS \[[4](https://arxiv.org/html/2605.16259#bib.bib4)\], as well as few-step inference methods such as Latent Consistency Models \[[5](https://arxiv.org/html/2605.16259#bib.bib5)\]. StreamDiffusion \[[8](https://arxiv.org/html/2605.16259#bib.bib8)\] achieved real-time inference exceeding 100 FPS on NVIDIA GPUs through pipeline-level optimization.

However, these acceleration studies almost exclusively assume NVIDIA GPUs and the CUDA ecosystem. CUDA benefits from decades of accumulated resources including extensive kernel libraries, mature profiling tools, and inference optimization through TensorRT. In contrast, systematic research on diffusion model optimization for non-CUDA platforms (such as Apple Silicon, Qualcomm Snapdragon, and Intel Arc GPUs) is virtually nonexistent.

The Apple M3 Ultra is a System-on-Chip (SoC) featuring up to 80 GPU cores, up to 512 GB of unified memory, and 800 GB/s memory bandwidth, employing a Unified Memory Architecture (UMA) in which the CPU, GPU, and Neural Engine (ANE) share the same memory space. While this design eliminates the need for CPU–GPU data transfers, it exhibits fundamentally different memory access patterns and computational characteristics from CUDA, meaning that optimization techniques effective on NVIDIA GPUs cannot necessarily be directly applied.

In this study, we conducted systematic optimization experiments across 10 phases on the M3 Ultra (60-core GPU, 512 GB unified memory configuration) with the goal of achieving real-time camera img2img transformation, starting from StreamDiffusion \[[8](https://arxiv.org/html/2605.16259#bib.bib8)\]. The contributions of this work are as follows:

- A comprehensive benchmark covering more than 10 techniques for diffusion model inference on Apple Silicon
- Demonstration that CoreML conversion is the only effective UNet acceleration technique, with analysis of the underlying reasons
- Systematic elucidation of why quantization, Token Merging, parallel inference, and other techniques effective in CUDA environments are ineffective on M3 Ultra
- Demonstration of the fundamental limitations of replacing diffusion models with kNN search leveraging 512 GB memory
- Achievement of 22.7 FPS real-time img2img transformation using the SDXS-512 CoreML pipeline
- Practical guidelines for diffusion model optimization on unified memory architectures

## 2 Related Work

### 2.1 Fast Diffusion Models

Acceleration of Stable Diffusion \[[2](https://arxiv.org/html/2605.16259#bib.bib2)\] toward real-time inference has been pursued primarily along two directions: model distillation and inference step reduction. SD-Turbo \[[3](https://arxiv.org/html/2605.16259#bib.bib3)\] was the first practical model to achieve single-step inference through Adversarial Diffusion Distillation (ADD), combining adversarial training with score distillation. SDXS \[[4](https://arxiv.org/html/2605.16259#bib.bib4)\] employs a more advanced distillation approach, achieving high-quality single-step image generation with a lightweight architecture that completely removes the UNet mid-block and reduces the down/up blocks from 4 to 3. The parameter count is reduced to approximately 38% (328.2M) of the standard SD-Turbo (865.9M).

Latent Consistency Models (LCM) \[[5](https://arxiv.org/html/2605.16259#bib.bib5)\] enable high-quality generation in 2–4 steps through consistency distillation. LCM-LoRA \[[6](https://arxiv.org/html/2605.16259#bib.bib6)\] makes this distillation applicable as a LoRA adapter, enabling retrofitting to existing models. Hyper-SD \[[7](https://arxiv.org/html/2605.16259#bib.bib7)\] combines score distillation with reinforcement learning from human feedback (RLHF), achieving high-quality generation across 1–4 step configurations. These methods have been primarily evaluated on NVIDIA GPUs, with no reported performance on Apple Silicon.

### 2.2 Real-Time Inference Pipelines

StreamDiffusion \[[8](https://arxiv.org/html/2605.16259#bib.bib8)\] is a framework that optimizes real-time diffusion model inference at the pipeline level. Its core innovation is the Stream Batch mechanism, which packs different denoising steps of consecutive frames into a single batch to maximize GPU computational parallelism. For example, in the case of 4-step inference, step 1 of frame $t$, step 2 of frame $t-1$, step 3 of frame $t-2$, and step 4 of frame $t-3$ are processed as a single batch. Additionally, Residual CFG reuses the Classifier-Free Guidance (CFG) result from the previous frame, halving the number of UNet calls per frame. These optimizations enable real-time inference exceeding 100 FPS on an NVIDIA RTX 4090.

However, StreamDiffusion's acceleration depends heavily on NVIDIA-specific technologies such as CUDA Streams, TensorRT, and xformers, necessitating alternatives for porting to Apple Silicon. Furthermore, when using single-step inference (SD-Turbo, SDXS, etc.), the Stream Batch mechanism is inapplicable (there is only one step to batch), making optimization of other pipeline components critical.
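
The Stream Batch schedule described above can be illustrated with a few lines of Python. This is only a sketch of the batching pattern as the text describes it, not StreamDiffusion's implementation; the function name `stream_batch` and the zero-based frame indexing are assumptions made for illustration.

```python
def stream_batch(frame_index: int, num_steps: int) -> list[tuple[int, int]]:
    """Return the (frame, step) pairs packed into one batch when frame
    `frame_index` arrives. Steps are numbered from 1; frames from 0."""
    batch = []
    for step in range(1, num_steps + 1):
        frame = frame_index - (step - 1)  # older frames are at later steps
        if frame >= 0:  # the pipeline is still filling during the first frames
            batch.append((frame, step))
    return batch


if __name__ == "__main__":
    # 4-step inference: the batch grows until the pipeline is full.
    for t in range(5):
        print(t, stream_batch(t, num_steps=4))
    # Single-step inference: the batch only ever holds the current frame.
    print(stream_batch(10, num_steps=1))
```

With `num_steps=1` the batch always contains exactly one entry, which is why the mechanism offers nothing to single-step models such as SD-Turbo and SDXS.
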

### 2.3 Machine Learning Inference on Apple Silicon

Apple has released ml-stable-diffusion \[[9](https://arxiv.org/html/2605.16259#bib.bib9)\], a CoreML-based Stable Diffusion inference pipeline. This framework enables inference on the Neural Engine through a technique called SPLIT_EINSUM_V2, which performs split execution of attention operations. However, SPLIT_EINSUM_V2 was designed for the M1/M2 generation Neural Engine and does not fully exploit the high computational performance of M3-generation GPUs.

CoreML is Apple's model inference framework, which converts models from PyTorch or TensorFlow and performs inference using optimized kernels on the Metal GPU backend. Model conversion via coremltools \[[9](https://arxiv.org/html/2605.16259#bib.bib9)\] automatically applies static optimization of the computation graph, operator fusion, and memory layout optimization. In contrast, PyTorch's Metal Performance Shaders (MPS) backend is a general-purpose implementation with limited model-specific optimization. The performance gap between these two backends has significant implications for diffusion model inference on Apple Silicon.

### 2.4 Image Translation Models

pix2pix-turbo \[[10](https://arxiv.org/html/2605.16259#bib.bib10)\] is an end-to-end img2img translation model based on SD-Turbo that introduces skip connections between the VAE encoder and decoder, directly conveying structural information from the input image to the decoder. These structural skip connections achieve high structure preservation in transformations from edge-detected images to photorealistic images. However, because the encoder and decoder share intermediate feature tensors, the model is difficult to convert to CoreML as a single subgraph.

### 2.5 Retrieval-Based Image Generation

Retrieval-based image generation leveraging large-scale memory \[[11](https://arxiv.org/html/2605.16259#bib.bib11)\] is an approach that retrieves nearest-neighbor samples from a pre-computed image feature database and uses them as conditioning to assist diffusion model generation. FAISS \[[12](https://arxiv.org/html/2605.16259#bib.bib12)\] is a high-speed approximate nearest-neighbor search library that enables sub-millisecond search even on databases of one billion vectors using IVF-PQ indexing. The M3 Ultra with 512 GB of unified memory can hold large-scale vector databases in memory that would be infeasible on typical GPUs (24 GB), suggesting new possibilities for retrieval-based approaches.

## 3 Experimental Setup

The experimental environment used in this study is shown in Table [1](https://arxiv.org/html/2605.16259#S3.T1). The Apple M3 Ultra is Apple's flagship chip released in 2025, comprising two M3 Max dies interconnected via UltraFusion. The configuration used in our experiments features a 60-core GPU (20 cores fewer than the maximum 80-core configuration) and 512 GB of unified memory. The theoretical FP16 compute performance is approximately 22 TFLOPS, which is roughly 1/15th of the NVIDIA RTX 4090's approximately 330 TFLOPS. However, the ability of the CPU, GPU, and Neural Engine to share 800 GB/s of memory bandwidth through unified memory is an advantage unavailable in discrete GPU configurations.

Table 1: Experimental environment

For the software stack, we used PyTorch 2.6.0 with the MPS backend as our baseline and performed model conversion using CoreML Tools 9.0. Hugging Face diffusers 0.36.0 was used for loading and preprocessing various diffusion models, and OpenCV was used for camera input/output.

## 4 Phase 1: Porting to the MPS Backend

As the first phase, we ported StreamDiffusion, originally implemented exclusively for CUDA, to the Apple Metal Performance Shaders (MPS) backend. The three primary modifications were: (1) replacing timing measurements using `torch.cuda.Event` with `time.perf_counter`, (2) changing device specifications from `cuda` to `mps`, and (3) initializing the random number generator on the CPU for reproducibility on MPS before transferring to the GPU.

StreamDiffusion's Stream Batch mechanism is not applicable in our experiments using single-step inference (SD-Turbo). This is because Stream Batch is designed to parallelize multiple steps, and with single-step inference there are no steps to batch. Consequently, the optimization focus in this study is directed toward accelerating the entire single-frame inference pipeline.

Baseline performance after porting is shown in Table [2](https://arxiv.org/html/2605.16259#S4.T2). Inference of SD-Turbo (865.9M parameters) on MPS required 95.8 ms/frame (10.4 FPS) at 512×512 resolution. Even at a reduced resolution of 256×256, it reached only 79.5 ms (12.6 FPS): a 4× reduction in pixel count bought only a 17% reduction in latency, so simple resolution reduction alone is insufficient.

Table 2: Baseline performance after MPS porting (SD-Turbo, 1-step)

## 5 Phase 2: Comprehensive Evaluation of Acceleration Techniques

In this phase, we systematically evaluated, on the Apple M3 Ultra, multiple optimization techniques whose effectiveness has been reported on NVIDIA GPUs. To state the conclusion upfront, only CoreML conversion proved effective; all other techniques were either ineffective or counterproductive.

### 5.1 CoreML Conversion

We converted the PyTorch model to CoreML format (`.mlpackage`) and performed inference on the Metal GPU backend.
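
A conversion of this kind might look roughly like the sketch below. It is not the authors' script: the wrapper class, the input names, the timestep value, and the text-embedding shape (77 tokens of width 1024, as in SD 2.1-family models) are assumptions for illustration, and `unet` is assumed to be a diffusers `UNet2DConditionModel` already loaded in PyTorch. PyTorch and coremltools are imported lazily so that the small shape helper can be used on its own.

```python
def latent_shape(resolution: int, channels: int = 4, vae_factor: int = 8):
    """Shape of the UNet latent input for a square image (batch size 1)."""
    side = resolution // vae_factor
    return (1, channels, side, side)


def convert_unet(unet, out_path="unet.mlpackage", resolution=512,
                 text_len=77, text_dim=1024):
    """Trace a diffusers-style UNet and convert it to a CoreML ML Program."""
    import torch               # imported lazily: only needed for conversion
    import coremltools as ct

    class Wrapper(torch.nn.Module):
        # diffusers UNets return a dataclass by default; tracing needs tensors.
        def __init__(self, model):
            super().__init__()
            self.model = model

        def forward(self, sample, timestep, encoder_hidden_states):
            return self.model(sample, timestep, encoder_hidden_states,
                              return_dict=False)[0]

    example = (
        torch.randn(*latent_shape(resolution)),
        torch.tensor([999.0]),                     # assumed single-step timestep
        torch.randn(1, text_len, text_dim),
    )
    with torch.no_grad():
        traced = torch.jit.trace(Wrapper(unet).eval(), example)

    names = ("sample", "timestep", "encoder_hidden_states")  # hypothetical names
    mlmodel = ct.convert(
        traced,
        inputs=[ct.TensorType(name=n, shape=tuple(x.shape))
                for n, x in zip(names, example)],
        convert_to="mlprogram",
        compute_units=ct.ComputeUnit.CPU_AND_GPU,  # GPU backend, no Neural Engine
        compute_precision=ct.precision.FLOAT16,
    )
    mlmodel.save(out_path)
    return mlmodel
```

Fixing all input shapes at conversion time is what allows the converter to optimize the graph statically; the trade-off is that a separate package is needed for each resolution.
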
Conversion was performed using trace-based conversion via `ct.convert` with `compute_units=CPU_AND_GPU` specified. During conversion, static analysis of the computation graph for operator fusion, direct mapping to Metal kernels, and memory layout optimization are automatically applied.

As shown in Table [3](https://arxiv.org/html/2605.16259#S5.T3), CoreML conversion reduced UNet inference time from 87.6 ms to 53.4 ms, a 39% reduction. This improvement is attributed to CoreML's optimized Metal kernels, which account for model structure, in contrast to the general-purpose Metal shader implementation of PyTorch's MPS backend. Throughout this study, CoreML conversion was found to be the only effective acceleration technique for UNet inference on Apple Silicon.

Table 3: UNet inference acceleration via CoreML conversion

### 5.2 Quantization

We comprehensively evaluated CoreML's post-training quantization options (INT8 linear, 6-bit palettization, 4-bit palettization, and 2-bit palettization). On NVIDIA GPUs, inference acceleration through memory bandwidth reduction using techniques such as TensorRT INT8 quantization and AWQ/GPTQ has been widely reported.

However, as shown in Table [4](https://arxiv.org/html/2605.16259#S5.T4), no change in inference speed was observed at any quantization level on the M3 Ultra. The fact that even 2-bit palettization (theoretically 1/8 of the FP16 weight traffic) produced no acceleration strongly suggests that M3 Ultra GPU inference for UNet is compute-bound rather than memory-bandwidth-bound. This is likely because the unified memory's 800 GB/s bandwidth provides sufficient headroom for transferring the weights of an 865.9M-parameter model.
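
A back-of-the-envelope calculation makes the headroom argument concrete. The sketch below uses only figures quoted in the text (865.9M parameters, 800 GB/s, 53.4 ms CoreML UNet latency) and assumes, optimistically, that the weights are streamed exactly once at peak bandwidth; it ignores activation traffic and any decompression cost, so it is a rough lower bound on memory time rather than a profile.

```python
PARAMS = 865.9e6      # SD-Turbo UNet parameter count (from the text)
BANDWIDTH = 800e9     # unified memory bandwidth in bytes per second (from the text)
MEASURED_MS = 53.4    # CoreML UNet latency in milliseconds (from the text)


def weight_read_ms(bits_per_weight: float) -> float:
    """Time to stream all weights once at peak bandwidth, in milliseconds."""
    size_bytes = PARAMS * bits_per_weight / 8
    return size_bytes / BANDWIDTH * 1e3


for name, bits in [("FP16", 16), ("INT8", 8), ("6-bit", 6), ("4-bit", 4), ("2-bit", 2)]:
    ms = weight_read_ms(bits)
    size_gb = PARAMS * bits / 8 / 1e9
    print(f"{name:>5}: {size_gb:5.2f} GB, {ms:5.2f} ms "
          f"({ms / MEASURED_MS:5.1%} of measured latency)")
```

Under these assumptions, reading the FP16 weights takes about 2.2 ms, a small fraction of the 53.4 ms measured per inference, so shrinking the weights further leaves little memory time to save.
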

Table 4: Effect of CoreML quantization (UNet, 512×512)

### 5.3 Token Merging

Token Merging (ToMe) \[[13](https://arxiv.org/html/2605.16259#bib.bib13)\] is a technique that merges tokens based on similarity between Key tokens in self-attention, reducing the computational cost of the attention operation. Its effectiveness has been reported on NVIDIA GPUs when the attention operation is the bottleneck.

However, on MPS, the overhead of token similarity computation and merging operations exceeded the reduction in attention computation, resulting in an approximately 10% slowdown. This suggests that the attention implementation on MPS has different bottleneck characteristics compared to xformers and Flash Attention on NVIDIA GPUs. That is, on MPS, the attention operation itself is not the primary bottleneck, and the additional cost of token manipulation is relatively large.

### 5.4 CoreML Parallel Inference

To maximize utilization of the M3 Ultra's 60-core GPU, we attempted parallel inference using multiple CoreML model instances. On NVIDIA GPUs, parallel execution of multiple kernels via CUDA Streams is possible and widely used for batch inference and pipeline parallelization.

However, with CoreML, no throughput improvement was observed with 1–4 instances of parallel execution. CoreML serializes inference requests on the Metal Command Queue, and simultaneous execution of multiple models does not translate to GPU-level parallelism. This represents a fundamental difference from NVIDIA GPUs, where low-level parallel control via CUDA Streams is possible, indicating that model-level parallelization strategies cannot be applied on Apple Silicon.

### 5.5 Neural Engine

To leverage the 32-core Neural Engine (ANE) of the M3 Ultra, we evaluated different CoreML compute unit configurations.
The ANE is a dedicated processor specialized for power-efficient matrix operations, known to demonstrate superior power efficiency over GPUs for small-scale model inference.

As shown in Table [5](https://arxiv.org/html/2605.16259#S5.T5), the ANE alone (CPU_AND_NE) required 329.4 ms for UNet inference, 6.2× slower than the GPU alone. Even with mixed GPU+ANE execution (ALL), the result was 63.6 ms, 19% slower than GPU alone (53.2 ms). A UNet with 865.9M parameters far exceeds the ANE's processing capacity, and the overhead of data partitioning, transfer, and recombination likely outweighs any benefit from parallelized computation.

Table 5: UNet inference speed by compute unit

### 5.6 Other Techniques

`torch.compile` crashed with runtime errors on the MPS backend and could not be evaluated. This reflects the fact that PyTorch's compiler stack is optimized for CUDA/CPU and has insufficient support for the MPS backend. Attention Slicing caused approximately 40% slowdown, likely because the memory management overhead between slices on MPS outweighs the VRAM savings. Switching to FP32 precision produced no change in speed, and construction of an asynchronous pipeline yielded no practical improvement.

### 5.7 Phase 2 Summary

Table [6](https://arxiv.org/html/2605.16259#S5.T6) summarizes the effects of all techniques evaluated in Phase 2. Only CoreML conversion was effective, yielding a 1.64× speedup (87.6 ms to 53.4 ms, a 39% reduction in latency); all other techniques were either ineffective or counterproductive. This result systematically demonstrates that many optimization techniques established in CUDA environments are inapplicable on Apple Silicon.

Table 6: Comprehensive evaluation of acceleration techniques

## 6 Phase 3: Compact Models and Resolution Optimization

Since Phase 2 revealed that CoreML conversion is the only effective UNet acceleration technique, we next explored acceleration through reducing model parameter counts.
We converted Small-SD (579.4M) and Tiny-SD (323.4M), made available through the Knowledge Distillation Stable Diffusion project, to CoreML and evaluated them.

As shown in Table [7](https://arxiv.org/html/2605.16259#S6.T7), parameter reduction directly translated to inference speed, with Tiny-SD achieving 1.7× the speed of SD-Turbo. At a resolution reduced to 320×320, UNet inference alone reached 16.4 ms (61 FPS), but the quality of fine details in generated images degraded significantly. For SD-Turbo-family models, deviation from the training resolution (512×512) directly results in quality degradation, leading us to conclude that resolution reduction is not a viable strategy.

Table 7: Relationship between model size and inference speed (CoreML, 512×512)

It should be noted that Small-SD and Tiny-SD were unable to maintain the same quality as SD-Turbo through the distillation process, and their subjective image quality in camera img2img transformation was inferior to SD-Turbo. While parameter reduction contributes to inference speed, maintaining distillation quality remains a challenge. This insight leads to the superiority of SDXS-512 (designed specifically for distillation with maintained quality) in the subsequent Phase 6.

## 7 Phase 4: Camera img2img Pipeline Construction

### 7.1 Bottleneck Analysis

To achieve real-time camera img2img transformation, we profiled the entire pipeline. The pipeline consists of camera capture → preprocessing (resize, normalization) → VAE encode → noise addition → UNet inference → VAE decode → postprocessing (display).

As shown in Table [8](https://arxiv.org/html/2605.16259#S7.T8), the UNet constitutes a clear bottleneck, accounting for 68% of the total inference time. VAE encoding and decoding used CoreML-converted TAESD (Tiny Autoencoder for Stable Diffusion), achieving fast execution at 6.5 ms each.
Preprocessing (camera frame resize and Canny edge detection) required 7.9 ms, and postprocessing (tensor-to-BGR conversion and display) required 2.4 ms.

Table 8: Camera pipeline time breakdown (SD-Turbo CoreML)

### 7.2 Frame Interpolation and Smoothing

To reduce the execution frequency of the UNet, we attempted intermediate frame generation through linear interpolation between UNet frames and the previous frame. However, the quality difference between UNet-generated frames and interpolated frames was large, causing severe visual oscillation (flicker) that prevented achieving practical quality. EMA (Exponential Moving Average) smoothing reduced oscillation but introduced ghosting artifacts for fast-moving subjects.

### 7.3 3-Thread Camera Architecture

Ultimately, we adopted an architecture that executes camera acquisition, inference, and display in three independent threads. The inference thread continuously processes the latest camera frame, while the display thread shows the latest inference result. Temporal coherence between frames was ensured through a combination of fixed noise seeds, feedback of the previous latent output ($\alpha = 0.3$), and EMA smoothing. This configuration achieved flicker-free real-time camera img2img transformation at 13.8 FPS with SD-Turbo CoreML.

## 8 Phase 5: Exploring Model Quality Improvements

By Phase 4, SD-Turbo CoreML had achieved 13.8 FPS, but we explored several approaches to further improve generation quality.

Applying Apple's official SPLIT_EINSUM_V2 conversion resulted in a 167% slowdown (5.2 FPS) due to wasted computation caused by the design enforcing batch=2. Applying the Hyper-SD 1.5 LoRA adapter degraded the stability of single-step inference with no quality improvement observed. SDXL-class models (2.6B parameters) required over 200 ms for UNet inference even after CoreML conversion, making real-time performance infeasible.

These experiments led to the insight that quality improvement requires not adapter additions to existing models or the use of larger models, but rather models designed from the outset to balance inference efficiency with quality. This perspective guided the selection of SDXS-512 in Phase 6.

## 9 Phase 6: Acceleration with SDXS-512

### 9.1 SDXS-512 Design

SDXS-512 \[[4](https://arxiv.org/html/2605.16259#bib.bib4)\] is a model distilled specifically for single-step inference, with the following design features: (1) complete removal of the UNet mid-block to reduce computation, (2) reduction of down/up blocks from the standard 4 stages to 3, and (3) training during distillation to maximize image quality in a single step. As a result, the parameter count is reduced to 328.2M (38% of SD-Turbo).

### 9.2 Performance Evaluation

As shown in Table [9](https://arxiv.org/html/2605.16259#S9.T9), CoreML conversion of SDXS-512 achieved 24.4 ms for UNet inference (2.2× faster than SD-Turbo) and a camera FPS of 22.7 (1.6× that of SD-Turbo). Compared with Tiny-SD (323.4M, similar parameter count) evaluated in Phase 3, SDXS-512 demonstrated significantly superior image quality at comparable speed. This is attributable to SDXS-512's design, which achieves both architectural reduction and maintained distillation quality.

Table 9: SDXS-512 vs SD-Turbo (CoreML, 512×512, camera)

The 22.7 FPS in the camera pipeline corresponds to a total processing time of 44.1 ms per frame. The breakdown is UNet inference at 24.4 ms, VAE encode/decode at approximately 5 ms each, and pre/postprocessing at approximately 10 ms. Thanks to the 3-thread architecture, which parallelizes inference with camera acquisition and display, the perceived video was even smoother.
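
The 3-thread arrangement can be sketched with the standard library alone. The code below is an illustrative skeleton, not the authors' pipeline: the `LatestSlot` class and `run_pipeline` function are hypothetical names, and `time.sleep` stands in for camera capture and for the encode, UNet, and decode stages. The point it demonstrates is the "latest value wins" hand-off, in which a slow stage never builds up a backlog of stale frames.

```python
import threading
import time


class LatestSlot:
    """Single-item mailbox: writers overwrite, readers always get the newest item."""

    def __init__(self):
        self._cond = threading.Condition()
        self._item = None
        self._version = 0

    def put(self, item):
        with self._cond:
            self._item = item
            self._version += 1
            self._cond.notify_all()

    def get_newer(self, seen_version, timeout=0.1):
        """Wait for an item newer than `seen_version`; return (item, version)."""
        with self._cond:
            self._cond.wait_for(lambda: self._version > seen_version, timeout)
            return self._item, self._version


def run_pipeline(duration_s=0.5, capture_ms=5.0, infer_ms=20.0):
    frames, results = LatestSlot(), LatestSlot()
    stop = threading.Event()
    shown = []

    def camera():
        index = 0
        while not stop.is_set():
            time.sleep(capture_ms / 1e3)   # stand-in for reading a camera frame
            frames.put(index)
            index += 1

    def inference():
        seen = 0
        while not stop.is_set():
            frame, version = frames.get_newer(seen)
            if version == seen:
                continue                   # timed out without a new frame
            seen = version
            time.sleep(infer_ms / 1e3)     # stand-in for encode -> UNet -> decode
            results.put(("stylized", frame))

    def display():
        seen = 0
        while not stop.is_set():
            result, version = results.get_newer(seen)
            if version > seen:
                shown.append(result)       # stand-in for drawing to the screen
                seen = version

    threads = [threading.Thread(target=fn) for fn in (camera, inference, display)]
    for thread in threads:
        thread.start()
    time.sleep(duration_s)
    stop.set()
    for thread in threads:
        thread.join()
    return shown


if __name__ == "__main__":
    out = run_pipeline()
    print(f"displayed {len(out)} results; frame ids: {[f for _, f in out]}")
```

Because the camera produces frames faster than the inference stand-in consumes them, the displayed frame ids skip values: intermediate frames are dropped rather than queued, which keeps latency bounded by one inference time.
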

## 10 Phase 7: kNN Search-Based Image Synthesis

### 10.1 Research Hypothesis and Motivation

The M3 Ultra's 512 GB unified memory is more than 20× the capacity of typical GPUs (12–24 GB VRAM). To fully exploit this large memory capacity, we tested the hypothesis of replacing the UNet's "generation through computation" with "retrieval from memory." Specifically, we attempted to pre-generate and store a large number of (input, output) pairs and, for new inputs, search for and interpolate the most similar outputs from the database, thereby completely bypassing UNet inference.

### 10.2 FAISS Search Speed Evaluation

We first evaluated kNN search speed against large-scale vector databases. As shown in Table [10](https://arxiv.org/html/2605.16259#S10.T10), for 768-dimensional CLIP embedding vectors, approximate search using IVF-PQ indexing achieved extremely fast search times of 0.50 ms even at 100 million vectors. Even exact search (Flat Index) achieved 49.6 ms at 1 million vectors. The advantage of 512 GB of memory is also clear in terms of feasibility: exact search over 10 million vectors is not possible on a 24 GB GPU due to memory limitations.

Table 10: FAISS kNN search speed (768-dim CLIP embedding)

### 10.3 Image Synthesis Trials

We attempted image synthesis using three approaches.

(a) CLIP-kNN: Search by CLIP embedding of the input image and compute a weighted average of the top-$k$ output images in latent space. While fast at 26.1 ms (38.3 FPS), blurring from weighted averaging and temporal instability (oscillation) of search results were problematic.

(b) VAE Latent kNN: Direct search and interpolation in VAE latent space. This was the fastest at 22.1 ms (45.3 FPS), but search accuracy was poor due to the lack of semantic structure in the latent space, resulting in significantly degraded output quality.

(c) Hybrid: Use kNN search results as the initial latent for UNet, refined with one step of UNet inference. Quality improved, but the addition of CLIP encoding overhead resulted in slowdown to 93.6 ms (10.7 FPS).

Table 11: Results of kNN search-based image synthesis

### 10.4 Fundamental Limitations

The failure to achieve practical quality across all approaches stems from the fundamental difference between kNN search and diffusion models. The UNet in a diffusion model is a nonlinear function approximator with hundreds of millions of parameters that generates continuous and smooth outputs for given inputs. In contrast, kNN search approximates from a finite set of discrete samples: (1) weighted averaging in latent space destroys the structural coherence of individual samples, (2) it is fundamentally impossible to pre-compute coverage of the infinite combinations of camera inputs (lighting, composition, subject), and (3) the lack of guaranteed temporal consistency in search results causes oscillation, where search results change discontinuously between frames.

## 11 Phase 8: pix2pix-turbo

pix2pix-turbo \[[10](https://arxiv.org/html/2605.16259#bib.bib10)\] is an SD-Turbo-based model specialized for edge-to-image tasks, achieving high preservation of input structure through skip connections between the VAE encoder and decoder. Since the UNet component is equivalent to the standard SD-Turbo (865.9M), CoreML conversion yielded an inference speed of 52.9 ms.

However, the skip-connected VAE, in which the encoder and decoder share intermediate feature tensors, is incompatible with CoreML's static computation graph conversion. Consequently, VAE encoding and decoding remained as PyTorch inference on the MPS backend, requiring approximately 160 ms total (encode ~80 ms + decode ~80 ms).
As a result, the VAE became a bottleneck 3× larger than the UNet (53 ms), limiting camera FPS to 4.0 (Table [12](https://arxiv.org/html/2605.16259#S11.T12)).

Table 12: pix2pix-turbo performance (512×512)

Replacing the VAE with TAESD is not possible because TAESD lacks the skip connection structure, and we concluded that pix2pix-turbo is not suitable for real-time operation on Apple Silicon in its current form. This result illustrates the important lesson that model architecture design directly affects the feasibility of hardware optimization.

## 12 Phase 9: Optical Flow Frame Skipping

Rather than executing the UNet on every frame, we evaluated a technique that runs the UNet only once every $N$ frames, with the intermediate $N-1$ frames filled in by warping the previous frame using optical flow. Theoretically, for $N = 3$, averaging one UNet frame (51.7 ms) and two warp frames (6.6 ms each) yields $(51.7 + 6.6 \times 2)/3 = 21.6$ ms/frame, or approximately 46 FPS.

The Farneback method was used for optical flow, and to reduce computational cost, the input resolution was downscaled from 512×512 to 256×256 for flow field computation, then upscaled to 512×512 for warping. This half-resolution flow computation reduced warping time from 22.3 ms to 6.6 ms (Table [13](https://arxiv.org/html/2605.16259#S12.T13)).

Table 13: Optical flow frame skipping performance ($N = 3$)

However, the measured result was only 17.4 FPS, merely 38% of the theoretical value (46 FPS). This discrepancy is attributable to synchronization overhead in the single-threaded implementation. Furthermore, since warp frames are mere image deformations rather than AI-generated outputs, pronounced jelly-like distortions occurred in regions of large motion, and the subjective image quality was significantly inferior to the SDXS baseline (22.7 FPS).
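
The gap between the ideal and measured throughput can be reproduced with a short calculation. The sketch below simply restates the averaging formula from the text; the function name `frame_skip_fps` is hypothetical, and the model assumes that UNet and warp frames run back to back with no other cost.

```python
def frame_skip_fps(unet_ms: float, warp_ms: float, n: int) -> float:
    """Ideal FPS when 1 of every n frames runs the UNet and n-1 are warped."""
    avg_ms = (unet_ms + warp_ms * (n - 1)) / n
    return 1000.0 / avg_ms


ideal = frame_skip_fps(unet_ms=51.7, warp_ms=6.6, n=3)
measured = 17.4  # FPS reported in the text

print(f"ideal: {ideal:.1f} FPS, measured: {measured} FPS "
      f"({measured / ideal:.0%} of ideal)")
# Per-frame time that the ideal model does not account for:
print(f"unexplained overhead: {1000 / measured - 1000 / ideal:.1f} ms/frame")
```

The measured 17.4 FPS corresponds to roughly 57.5 ms per frame, so about 36 ms per frame is spent outside the UNet and warp steps counted in the ideal model.
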
As this approach was inferior to the SDXS baseline in both speed and quality, it was not adopted.

## 13 Phase 10: Knowledge Distillation for Direct Transformation

We evaluated a technique that uses the entire SDXS pipeline (VAE encode → UNet → VAE decode) as a teacher model and distills direct edge-to-stylized image transformation into a lightweight feedforward CNN (FastStyleNet). FastStyleNet employs a U-Net structure using depthwise separable convolutions, with model size controllable via the number of blocks and base channel count.

As shown in Table [14](https://arxiv.org/html/2605.16259#S13.T14), inference was extremely fast, ranging from 6.0 ms (167 FPS) to 7.2 ms (140 FPS). However, when trained on synthetic edge data (random combinations of lines, circles, and rectangles) with L1 loss alone, the output was visually unrecognizable after 10 epochs.

Table 14: Distilled FastStyleNet specifications and results

| Configuration | Parameters | MPS inference | FPS |
| --- | --- | --- | --- |
| FastStyleNet (32ch) | 398K | 6.0 ms | 167 |
| FastStyleNet (48ch) | 875K | 6.1 ms | 164 |
| FastStyleNet (64ch) | 1.5M | 7.2 ms | 140 |

Training: L1 loss, 10 epochs, 2000 steps/epoch. Result: output visually unrecognizable (quality ×).

The causes of failure were multifaceted. First, L1 loss drives the network toward the pixel-wise median of all plausible outputs, which is a blurry, averaged image, so it cannot learn the sharp and diverse outputs that diffusion models generate. Second, synthetic edge data differs substantially in distribution from actual Canny edge output from cameras, causing a distribution shift between training and inference data.
Third, a feedforward network with 875K parameters fundamentally cannot reproduce the capacity of a 328.2M-parameter diffusion model's "iterative image refinement through the denoising process." While there is room for improvement through the introduction of GAN loss and perceptual loss, as well as training on large-scale real data, this approach yielded negative results within the scope of this study.

## 14 Comprehensive Comparison of All Experiments

Table [15](https://arxiv.org/html/2605.16259#S14.T15) summarizes the quantitative results across all phases. After exploration across 10 phases, SDXS-512 CoreML was confirmed as the optimal solution in terms of the balance between speed and quality.

Table 15: Comprehensive comparison of all approaches (camera img2img, 512×512)

## 15 Discussion

### 15.1 Re-examining CUDA-Centric Optimization Assumptions

The most important finding of this study is that optimization "common knowledge" established for NVIDIA GPUs and the CUDA ecosystem largely does not hold on Apple Silicon's unified memory architecture. Below, we systematically organize comparisons with CUDA-based insights.

**Ineffectiveness of quantization.** On NVIDIA GPUs, quantization techniques such as TensorRT INT8 and AWQ/GPTQ are widely reported to accelerate inference through reduced memory bandwidth for model weights \[[4](https://arxiv.org/html/2605.16259#bib.bib4)\]. This presumes that NVIDIA GPU inference is memory-bandwidth-bound. In discrete GPU configurations, model weights must be transferred from on-board device memory (GDDR or HBM) to the compute units, and this transfer bandwidth becomes the bottleneck.

In contrast, the M3 Ultra's unified memory architecture provides 800 GB/s of memory bandwidth shared among the CPU, GPU, and ANE. For the weights of an 865.9M-parameter model (approximately 1.7 GB in FP16), 800 GB/s bandwidth provides ample headroom, and memory transfer does not become a bottleneck.
As a result, inference on the M3 Ultra is compute-bound, and reducing memory traffic through quantization has no effect on inference speed. This compute-bound versus memory-bandwidth-bound distinction is one of the most fundamental architectural differences between Apple Silicon and NVIDIA GPUs.

Impossibility of parallel inference. On NVIDIA GPUs, low-level parallel control via CUDA Streams enables parallel execution of multiple kernels and pipeline parallelization as standard optimization techniques. StreamDiffusion’s [[8](https://arxiv.org/html/2605.16259#bib.bib8)] Stream Batch also achieves high efficiency by parallelizing multi-frame batch processing at the CUDA kernel level. Apple Silicon’s CoreML framework abstracts inference execution through the Metal Command Queue and does not expose GPU kernel-level parallel control to users. As confirmed in this study, simultaneous inference of multiple CoreML models is serialized on Metal GPU resources and does not improve throughput. This is a consequence of the CoreML stack not exposing a low-level GPU control API comparable to NVIDIA’s CUDA. Rather than a hardware performance gap, it is the difference in software stack design philosophy that governs optimization possibilities.

Immaturity of compiler optimization. PyTorch’s torch.compile generates optimized code for CUDA/CPU through the Inductor backend, but support for the MPS backend is incomplete, resulting in runtime errors. A comprehensive inference optimization tool equivalent to TensorRT for NVIDIA GPUs does not exist for Apple Silicon. CoreML conversion partially fulfills this role, but its integration with the PyTorch ecosystem is far less mature than TensorRT’s.

### 15.2 The Dual Nature of the Unified Memory Architecture

The M3 Ultra’s unified memory architecture presents both clear advantages and limitations for diffusion model inference.
Advantage: Zero-copy data sharing. Unified memory eliminates CPU–GPU data transfers (host-device transfers in NVIDIA GPU parlance). The zero-cost handoff of tensors produced by preprocessing (OpenCV on CPU) to GPU inference is particularly advantageous in camera pipelines. Additionally, the 512 GB capacity allows multiple models to be kept in memory simultaneously, eliminating load delays during model switching.

Advantage: Potential of large-scale memory. As demonstrated in the Phase 7 kNN experiment, 512 GB of memory enables in-memory retention of a 100-million-vector search database with sub-0.5 ms search times. This scale is physically impossible on a 24 GB GPU, presenting unique possibilities for retrieval-based methods and large-batch training.

Limitation: Insufficient compute performance. The M3 Ultra’s approximately 22 TFLOPS FP16 is roughly 1/15th of the NVIDIA RTX 4090’s approximately 330 TFLOPS, representing a fundamental performance gap for compute-bound inference. Combined with the ineffectiveness of quantization (inference is not bandwidth-bound), pure reduction of computation, i.e., using smaller models, becomes the only means of speed improvement.

Limitation: Immature software ecosystem. Compared to CUDA’s long-established ecosystem (cuDNN, TensorRT, xformers, Flash Attention, Triton, etc.), the Metal/CoreML ecosystem is substantially inferior in both quality and breadth. The MPS incompatibility of torch.compile, the overhead of Token Merging on MPS, and the inefficiency of Attention Slicing are all manifestations of this ecosystem’s immaturity.

### 15.3 Interdependence of Model Architecture and Hardware Optimization

This study provides the important lesson that model architecture design directly governs the feasibility of hardware optimization.
The skip-connected VAE in pix2pix-turbo is central to quality in img2img tasks, but the intermediate tensor sharing between encoder and decoder prevents CoreML conversion, causing the slow MPS-based VAE inference to become the pipeline bottleneck (53 ms UNet vs. 160 ms VAE). In contrast, SDXS-512’s simple feedforward architecture has high affinity with CoreML conversion, enabling all components to execute efficiently on CoreML.

This “architecture–hardware co-design” perspective is expected to grow in importance for future diffusion model research. Architectures designed for NVIDIA GPUs do not necessarily perform equivalently on other platforms, necessitating model designs that account for target hardware constraints.

### 15.4 Fundamental Differences Between kNN Search and Diffusion Models

The Phase 7 experiments demonstrated that “replacing computation with search” using large-scale memory is fundamentally difficult in the context of diffusion models. kNN search is an operation that selects nearest neighbors from a finite set of discrete data points and is fundamentally different from the continuous nonlinear function approximation realized by a diffusion model’s UNet.

Weighted averaging in latent space destroys the structural coherence of individual samples. For example, averaging the latent vectors of a cat image and a dog image at 0.5:0.5 does not produce a “meaningful” intermediate image between a cat and a dog, but rather an incoherent image with collapsed structure. Diffusion models circumvent this problem by projecting outputs onto a smooth manifold in latent space through the iterative denoising process.

Furthermore, the combinatorial space of camera inputs (lighting conditions × subjects × composition × camera parameters) is effectively infinite, making it impossible to adequately cover this space through pre-computation.
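The retrieve-and-blend operation discussed in this section can be sketched as follows. This is a minimal pure-Python illustration, not the implementation used in Phase 7: the brute-force search, the inverse-distance weighting, the toy data, and all names are assumptions made for the example.

```python
import math

def knn_blend(query, keys, latents, k=2, eps=1e-8):
    """Return the inverse-distance-weighted average of the k nearest latents.

    query:   feature vector of the incoming frame (list of floats)
    keys:    database feature vectors, one per stored sample
    latents: precomputed latent vectors, aligned with `keys`
    """
    # Brute-force Euclidean distance from the query to every database key.
    dists = [math.dist(query, key) for key in keys]
    # Indices of the k nearest neighbors (ties keep database order).
    nearest = sorted(range(len(keys)), key=lambda i: dists[i])[:k]
    # Inverse-distance weights, normalized so they sum to 1.
    raw = [1.0 / (dists[i] + eps) for i in nearest]
    total = sum(raw)
    weights = [w / total for w in raw]
    # Weighted average of the selected latents, component by component.
    dim = len(latents[0])
    blended = [
        sum(w * latents[i][d] for w, i in zip(weights, nearest))
        for d in range(dim)
    ]
    return blended, nearest, weights

# Toy database (hypothetical): two stored samples with very different latents.
keys = [[0.0, 0.0], [1.0, 0.0]]
latents = [[1.0, -1.0, 1.0], [-1.0, 1.0, 1.0]]

# A query halfway between the two keys blends the latents at 0.5:0.5.
blended, nearest, weights = knn_blend([0.5, 0.0], keys, latents, k=2)
print("weights:", weights)
print("blended latent:", blended)
```

In this toy case the components on which the two neighbors disagree cancel to zero, a small-scale analogue of the collapsed structure described for the cat and dog example.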
The true value of 512 GB memory lies not in replacing inference with search, but rather in complete in-memory retention of large-scale models, large batch sizes during training, or hybrid methods that combine retrieval and generation, such as Retrieval-Augmented Generation.

### 15.5 Limitations and Prospects of Frame Interpolation Techniques

Phase 9’s optical flow frame skipping was theoretically promising but faced three practical problems. First, sequential execution of flow computation, warping, and UNet inference in a single thread achieved only 38% of the theoretical FPS. Second, since warp frames are pure image deformations rather than AI-generated outputs, non-physical distortions occurred in regions of large motion. Third, while the half-resolution flow computation improved speed, it reduced flow field accuracy, increasing warping artifacts.

Future improvements may be possible through CoreML conversion of neural optical flow (e.g., RAFT) for high-accuracy flow estimation, pipeline integration with the 3-thread architecture, and introduction of AI-aware frame interpolation (e.g., FILM). However, given that SDXS-512 can perform “true” AI generation per frame at 24.4 ms, the necessity of frame interpolation itself is called into question.

### 15.6 Implications for Non-CUDA Environments

The findings of this study offer insights not only for Apple Silicon but for diffusion model inference on non-CUDA environments in general. Diffusion model inference on diverse hardware platforms (including Qualcomm Snapdragon’s unified memory, Intel Arc GPU’s oneAPI stack, and RISC-V-based accelerators) is expected to grow in importance. The “failure of CUDA assumptions” demonstrated in this study is likely to occur on these platforms as well.
The effectiveness of quantization depends on memory architecture, the feasibility of parallelization depends on software stack design, and the optimizability of model architectures depends on runtime conversion capabilities. Understanding the unique characteristics of each platform and selecting appropriate optimization strategies is required as a practical methodology to replace the CUDA-centric approach.

## 16 Conclusion

We conducted systematic optimization experiments across 10 phases for real-time diffusion model inference on the Apple M3 Ultra, achieving real-time camera img2img transformation at 22.7 FPS at 512×512 resolution. This represents a 118% speedup from the MPS baseline, achieved through the combination of SDXS-512 CoreML conversion and a 3-thread camera pipeline.

The key findings of this study are summarized as follows:

1. CoreML conversion is the only effective UNet acceleration technique on Apple Silicon.
2. Quantization is ineffective, due to the M3 Ultra being compute-bound. The high bandwidth of unified memory means that memory transfer is not a bottleneck.
3. Parallel inference is not possible, as CoreML’s Metal GPU resource serialization prevents low-level parallelization comparable to CUDA Streams.
4. kNN search cannot replace diffusion models. While large-scale search is feasible with 512 GB of memory, the fundamental difference between discrete search and continuous function approximation creates a quality barrier.
5. Co-design of model architecture and hardware optimization is critical. Architectural choices such as skip connections can impede CoreML conversion.
6. Distillation-specialized models (SDXS-512) provide the optimal balance of speed and quality.

This study provides a perspective from a different architecture, Apple Silicon, for diffusion model optimization research that has developed primarily around CUDA.
The optimization landscape on unified memory architectures is qualitatively different from that of discrete GPUs and requires its own research approach. Future work includes direct Metal Compute Shader programming to bypass CoreML, introduction of 2-step inference for quality improvement, efficient inference of SDXL-scale models, and performance evaluation on next-generation Apple Silicon (M4 series).

## References

- [1] Ho, J., Jain, A., & Abbeel, P. (2020). Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems (NeurIPS), 33, 6840–6851.
- [2] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2022). High-resolution image synthesis with latent diffusion models. Proc. IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 10684–10695.
- [3] Sauer, A., Lorenz, D., Blattmann, A., & Rombach, R. (2023). Adversarial diffusion distillation. arXiv preprint arXiv:2311.17042.
- [4] Song, Y., Dhariwal, P., Chen, M., & Sutskever, I. (2024). SDXS: Real-time one-step latent diffusion models with image conditions. arXiv preprint arXiv:2403.16627.
- [5] Luo, S., Tan, Y., Huang, L., Li, J., & Zhao, H. (2023). Latent consistency models: Synthesizing high-resolution images with few-step inference. arXiv preprint arXiv:2310.04378.
- [6] Luo, S., Tan, Y., Patil, S., Gu, D., von Platen, P., Passos, A., Huang, L., Li, J., & Zhao, H. (2023). LCM-LoRA: A universal stable-diffusion acceleration module. arXiv preprint arXiv:2311.05556.
- [7] Ren, J., Xia, Y., Lu, K., Deng, J., & Luo, Z. (2024). Hyper-SD: Trajectory segmented consistency model for efficient image synthesis. arXiv preprint arXiv:2404.13686.
- [8] Kodaira, A., Xu, C., Hazama, T., Yoshimoto, T., Ohno, K., Mitsuhori, S., Sugano, S., Cho, H., Liu, Z., & Keutzer, K. (2023). StreamDiffusion: A pipeline-level solution for real-time interactive generation. arXiv preprint arXiv:2312.12491.
- [9] Apple Inc. (2022). ml-stable-diffusion: Stable diffusion with Core ML on Apple Silicon. GitHub repository. [https://github.com/apple/ml-stable-diffusion](https://github.com/apple/ml-stable-diffusion)
- [10] Parmar, G., Park, T., Narasimhan, S., & Zhu, J.-Y. (2024). One-step image translation with text-to-image models. Proc. European Conf. on Computer Vision (ECCV).
- [11] Blattmann, A., Rombach, R., Oktay, O., Müller, J., & Ommer, B. (2022). Retrieval-augmented diffusion models. Advances in Neural Information Processing Systems (NeurIPS), 35.
- [12] Johnson, J., Douze, M., & Jégou, H. (2019). Billion-scale similarity search with GPUs. IEEE Trans. on Big Data, 7(3), 535–547.
- [13] Bolya, D., & Hoffman, J. (2023). Token merging for fast stable diffusion. CVPR Workshop on Efficient Deep Learning for Computer Vision.