How we catch silent NPU fallback on Snapdragon in CI [D]

Reddit r/MachineLearning 05/15/26, 05:28 AM Tools

npu-fallback snapdragon onnx-runtime qnn edge-ml ci-testing latency-optimization

Summary

A blog post detailing how to detect silent NPU fallback on Snapdragon in CI, including methods like running on real hardware, gating on coefficient of variation, and parsing ORT profiling JSON to identify fallen-back ops.

Posting because I've now seen this exact bug at multiple teams shipping ML to Snapdragon, and the pattern is worth writing up.

ONNX Runtime's QNN execution provider (the one that targets Qualcomm's Hexagon NPU on Snapdragon SoCs) will silently route unsupported ops to the CPU. Your accuracy is fine, your eval latency on the dev board looks fine, but production latency mysteriously triples because the input distribution stresses fallback paths differently, and the runtime never raises anything louder than a startup-log line nobody reads.

The default median-of-N latency gate doesn't catch this, because fallback creates a bimodal distribution and the median lands on the fast cluster.

Three things end up being necessary:

1. **Run on real hardware**: emulators implement the ISA in software so every op is "supported" (for the wrong reason), and cloud x86 doesn't load the QNN EP at all.
2. **Gate on coefficient of variation alongside median**: healthy on-NPU CV is 2–5%, intermittent fallback pushes it above 15%.
3. **Parse the ORT profiling JSON and assert NPU FLOP percentage**: the routing info is in there but you have to opt into `profiling_level=detailed` and post-process it; the default warning-level log just says "23 nodes assigned to QNN, 7 to CPU".

The third one is the diagnostic that actually identifies which op fell back, so you can either swap it for a supported equivalent, pin the QNN SDK, or escalate to firmware.

Wrote up the full pattern with the actual Python (CV gating function + ORT profile parser): [https://edgegate.frozo.ai/blog/how-we-catch-silent-npu-fallback-on-snapdragon-in-ci](https://edgegate.frozo.ai/blog/how-we-catch-silent-npu-fallback-on-snapdragon-in-ci)

Curious if anyone here has hit similar silent-fallback patterns with TensorRT on Jetson or CoreML on iOS. I'd expect the same symptom (bimodal latency, silent provider routing) but haven't gone digging. Same with ExecuTorch.
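For anyone who wants the shape of gates 2 and 3 without clicking through, here is a minimal sketch. It is not the code from the linked write-up. The function names `latency_gate` and `provider_time_share` are made up for illustration. The 15% CV threshold is the one quoted above; the 12 ms median budget is an arbitrary example. The profile parser assumes each node event in the ORT profiling JSON carries a `dur` field and an `args.provider` field, and it uses time share per provider as a stand-in for FLOP percentage, since the events in this sketch carry no FLOP counts.

```python
import statistics
from collections import defaultdict


def latency_gate(samples_ms, max_median_ms, max_cv=0.15):
    """Gate on median AND coefficient of variation (sample std dev / mean).

    Returns (passed, median, cv). max_cv=0.15 follows the >15% figure
    quoted in the post; max_median_ms is whatever budget your CI uses.
    """
    median = statistics.median(samples_ms)
    cv = statistics.stdev(samples_ms) / statistics.fmean(samples_ms)
    passed = median <= max_median_ms and cv <= max_cv
    return passed, median, cv


def provider_time_share(events):
    """Sum kernel time per execution provider from profile events.

    ASSUMED event shape (not verified against every ORT version):
      {"cat": "Node", "dur": <microseconds>,
       "args": {"provider": "...", "op_name": "..."}}
    Events without a provider (session events, fences) are skipped.
    Returns (share_by_provider, cpu_time_by_op).
    """
    totals = defaultdict(float)
    cpu_ops = defaultdict(float)
    for ev in events:
        args = ev.get("args") or {}
        provider = args.get("provider")
        if ev.get("cat") != "Node" or not provider:
            continue
        dur = ev.get("dur", 0)
        totals[provider] += dur
        if provider == "CPUExecutionProvider":
            cpu_ops[args.get("op_name", "?")] += dur
    grand = sum(totals.values())
    share = {p: t / grand for p, t in totals.items()} if grand else {}
    return share, dict(cpu_ops)


if __name__ == "__main__":
    # Synthetic bimodal run: 70% of inferences on the fast path, 30% falling back.
    bimodal = [10.0] * 70 + [30.0] * 30
    passed, median, cv = latency_gate(bimodal, max_median_ms=12.0)
    print(f"median={median:.1f} ms cv={cv:.2f} passed={passed}")

    # Synthetic profile events in the assumed shape.
    events = [
        {"cat": "Session", "name": "model_run", "dur": 1000},
        {"cat": "Node", "dur": 800,
         "args": {"provider": "QNNExecutionProvider", "op_name": "QNN"}},
        {"cat": "Node", "dur": 200,
         "args": {"provider": "CPUExecutionProvider", "op_name": "TopK"}},
    ]
    share, cpu_ops = provider_time_share(events)
    print(share, cpu_ops)
```

The bimodal sample is the case a median-only gate misses: the median sits on the fast cluster and passes the budget, while the CV is far above threshold. To feed the parser real data, the usual route is setting `enable_profiling = True` on the `SessionOptions`, calling `end_profiling()` on the session to get the output file path, and passing the result of `json.load` on that file to `provider_time_share`. Check the field names against your own profile first, since the layout assumed here may differ across ORT and QNN SDK versions.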

Similar Articles

Reverse Engineering the Qualcomm NPU Compiler

Energy-Efficient On-Device RAG on a Mobile NPU: System Design and Benchmark on Snapdragon X Elite

Efficient On-Device Diffusion LLM Inference with Mobile NPU

Quant.npu: Enabling Efficient Mobile NPU Inference for on-device LLMs via Fully Static Quantization

RAG on Snapdragon X2 Laptop, 200K documents.
