How we catch silent NPU fallback on Snapdragon in CI [D]

Reddit r/MachineLearning Tools

Summary

A blog post detailing how to detect silent NPU fallback on Snapdragon in CI, including methods like running on real hardware, gating on coefficient of variation, and parsing ORT profiling JSON to identify fallen-back ops.

Posting because I've now seen this exact bug at multiple teams shipping ML to Snapdragon, and the pattern is worth writing up.

ONNX Runtime's QNN execution provider (the one that targets Qualcomm's Hexagon NPU on Snapdragon SoCs) will silently route unsupported ops to the CPU. Your accuracy is fine, your eval latency on the dev board looks fine, but production latency mysteriously triples because the input distribution stresses fallback paths differently, and the runtime never raises anything louder than a startup-log line nobody reads.

The default median-of-N latency gate doesn't catch this, because fallback creates a bimodal distribution and the median lands on the fast cluster.

Three things end up being necessary:

1. **Run on real hardware**: emulators implement the ISA in software so every op is "supported" (for the wrong reason), and cloud x86 doesn't load the QNN EP at all
2. **Gate on coefficient of variation alongside median**: healthy on-NPU CV is 2–5%, intermittent fallback pushes it >15%
3. **Parse the ORT profiling JSON and assert NPU FLOP percentage**: the routing info is in there but you have to opt into `profiling_level=detailed` and post-process it; the default warning-level log just says "23 nodes assigned to QNN, 7 to CPU"

The third one is the diagnostic that actually identifies which op fell back, so you can either swap it for a supported equivalent, pin the QNN SDK, or escalate to firmware.

Wrote up the full pattern with the actual Python (CV gating function + ORT profile parser): [https://edgegate.frozo.ai/blog/how-we-catch-silent-npu-fallback-on-snapdragon-in-ci](https://edgegate.frozo.ai/blog/how-we-catch-silent-npu-fallback-on-snapdragon-in-ci)

Curious if anyone here has hit similar silent-fallback patterns with TensorRT on Jetson or CoreML on iOS. I'd expect the same symptom (bimodal latency, silent provider routing) but haven't gone digging. Same with ExecuTorch.
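
For anyone who wants the shape of the CV gate without clicking through, here is a minimal sketch. It is not the code from the blog post: the function name `latency_gate`, the 12 ms budget, and the sample latencies are all made up for illustration. Only the 15% CV threshold comes from the numbers above.

```python
import statistics


def latency_gate(samples_ms, median_budget_ms, max_cv=0.15):
    """Check a latency run against a median budget AND a CV ceiling.

    Returns (passed, median_ms, cv). The max_cv default mirrors the
    ">15% means intermittent fallback" rule of thumb; tune it per model.
    """
    if len(samples_ms) < 2:
        raise ValueError("need at least two samples to compute CV")
    median = statistics.median(samples_ms)
    mean = statistics.fmean(samples_ms)
    # Coefficient of variation: sample standard deviation over the mean.
    cv = statistics.stdev(samples_ms) / mean
    passed = median <= median_budget_ms and cv <= max_cv
    return passed, median, cv


# Synthetic runs (hypothetical numbers, not measurements).
healthy = [10.0, 10.2, 9.8, 10.1, 9.9, 10.3, 9.7, 10.0, 10.2, 9.8]
# Same run, but three inferences landed on a slow (fallback-like) cluster.
bimodal = [10.0, 10.2, 9.8, 10.1, 9.9, 10.3, 9.7, 30.5, 31.0, 29.8]

for name, run in (("healthy", healthy), ("bimodal", bimodal)):
    ok, median, cv = latency_gate(run, median_budget_ms=12.0)
    print(f"{name}: median={median:.2f} ms cv={cv:.1%} passed={ok}")
```

In the bimodal run the median is 10.15 ms, comfortably under the 12 ms budget, so a median-only gate passes it. The CV is roughly 60%, which is what trips the gate.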
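
And a rough sketch of the profile-parsing idea, with loud caveats. This is not the parser from the blog post. It assumes the ORT profile is a Chrome-trace-style JSON array where per-node events have `cat == "Node"`, a name ending in `_kernel_time`, a `dur` in microseconds, and `args` carrying `provider` and `op_name`; check those field names against a profile from your own ORT build. It also measures share of kernel time per provider rather than FLOPs, since I'm not assuming the profile exposes FLOP counts. The sample events, op names, and the 95% floor are invented.

```python
import json
from collections import defaultdict

NPU_PROVIDER = "QNNExecutionProvider"
CPU_PROVIDER = "CPUExecutionProvider"


def provider_time_share(events):
    """Return ({provider: share of kernel time}, {cpu op_name: total us})."""
    per_provider = defaultdict(int)
    cpu_ops = defaultdict(int)
    for ev in events:
        args = ev.get("args") or {}
        provider = args.get("provider")
        # Assumed schema: only "*_kernel_time" Node events carry a provider.
        is_kernel = ev.get("cat") == "Node" and str(ev.get("name", "")).endswith("_kernel_time")
        if not is_kernel or not provider:
            continue
        dur = int(ev.get("dur", 0))
        per_provider[provider] += dur
        if provider == CPU_PROVIDER:
            cpu_ops[args.get("op_name", "unknown")] += dur
    total = sum(per_provider.values())
    share = {p: d / total for p, d in per_provider.items()} if total else {}
    return share, dict(cpu_ops)


def assert_npu_share(events, min_share=0.95):
    """Fail with the slowest CPU ops listed if the NPU share is too low."""
    share, cpu_ops = provider_time_share(events)
    got = share.get(NPU_PROVIDER, 0.0)
    if got < min_share:
        worst = sorted(cpu_ops.items(), key=lambda kv: kv[1], reverse=True)
        raise AssertionError(
            f"NPU time share {got:.1%} is below {min_share:.0%}; CPU ops by time (us): {worst}"
        )
    return got


# Hypothetical profile contents, shaped like the assumed schema above.
sample_events = json.loads("""
[
  {"cat": "Session", "name": "model_run", "dur": 1700, "args": {}},
  {"cat": "Node", "name": "qnn_graph_0_kernel_time", "dur": 900,
   "args": {"provider": "QNNExecutionProvider", "op_name": "QNN"}},
  {"cat": "Node", "name": "grid_sample_3_kernel_time", "dur": 600,
   "args": {"provider": "CPUExecutionProvider", "op_name": "GridSample"}},
  {"cat": "Node", "name": "cast_7_kernel_time", "dur": 100,
   "args": {"provider": "CPUExecutionProvider", "op_name": "Cast"}}
]
""")

share, cpu_ops = provider_time_share(sample_events)
print(share, cpu_ops)
```

In CI you'd load the file that ORT writes when session profiling is enabled, pass the parsed list to `assert_npu_share`, and let the assertion message name the ops that ended up on the CPU.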
