Flash-WAM introduces a modality-aware distillation method for world-action models. It achieves real-time inference by compressing diffusion to a single step per modality, yielding a 23x speedup.
Kog announces real-time LLM inference at more than 3,000 output tokens per second per request on standard datacenter GPUs, bringing speeds previously limited to custom silicon to production hardware.
This paper presents a systematic optimization study of real-time diffusion model inference on the Apple M3 Ultra, achieving 22.7 FPS at 512x512 resolution using CoreML conversion and a distilled model. It finds that CUDA-optimized techniques do not transfer directly to Apple's unified memory architecture.