Tag
This paper presents a systematic optimization study of real-time diffusion model inference on the Apple M3 Ultra. Using CoreML conversion and a distilled model, it achieves 22.7 FPS at 512x512 resolution, and it shows that CUDA-optimized techniques do not transfer directly to Apple's unified memory architecture.
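To put the reported frame rate in concrete terms, the sketch below converts a target FPS into a per-frame latency budget and times an arbitrary inference step. It is a minimal illustration, not the paper's benchmarking harness: the names `frame_budget_ms` and `measure_fps` are hypothetical, and `step` stands in for whatever callable runs one full image generation.

```python
import time


def frame_budget_ms(fps: float) -> float:
    """Per-frame latency budget in milliseconds implied by a frame rate."""
    return 1000.0 / fps


def measure_fps(step, warmup: int = 3, iters: int = 20) -> float:
    """Average calls per second of a zero-argument callable.

    `step` is a hypothetical placeholder for one end-to-end generation
    (for example, a single call into a converted model).
    """
    # Warm-up calls are excluded so one-time costs (model load, caching)
    # do not skew the average.
    for _ in range(warmup):
        step()

    start = time.perf_counter()
    for _ in range(iters):
        step()
    elapsed = time.perf_counter() - start

    return iters / elapsed


if __name__ == "__main__":
    # 22.7 FPS leaves roughly 44 ms for each 512x512 frame.
    print(f"{frame_budget_ms(22.7):.1f} ms per frame")
```

Under this framing, every stage of the pipeline (denoising steps, decoding, any host-side pre- or post-processing) has to fit inside that roughly 44 ms budget for the 22.7 FPS figure to hold.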