How we catch silent NPU fallback on Snapdragon in CI [D]
Summary
A blog post on detecting silent NPU fallback on Snapdragon in CI. The methods covered include running tests on real hardware, gating on the coefficient of variation, and parsing ONNX Runtime (ORT) profiling JSON to identify ops that fell back.
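The two checks named in the summary might look roughly like the sketch below. It is not the post's actual code. It assumes the ORT profile is a JSON array of events, that node events carry `"cat": "Node"` with `args.provider` and `args.op_name`, and that the NPU is reached through `QNNExecutionProvider`. The function names, the 5% CV threshold, and the idea that a high CV marks a run as unreliable are all hypothetical choices made for illustration.

```python
import json
import statistics

# Assumption: the NPU is exposed through ONNX Runtime's QNN execution provider.
EXPECTED_PROVIDER = "QNNExecutionProvider"


def coefficient_of_variation(samples):
    """Return sample standard deviation divided by the mean."""
    mean = statistics.fmean(samples)
    if mean == 0:
        raise ValueError("mean of samples is zero; CV is undefined")
    return statistics.stdev(samples) / mean


def fallen_back_ops(events, expected_provider=EXPECTED_PROVIDER):
    """List (op_name, event_name, provider) for nodes run on another provider.

    Assumes each node event looks like
    {"cat": "Node", "name": ..., "args": {"op_name": ..., "provider": ...}}.
    Events without a provider field are skipped.
    """
    found = set()
    for event in events:
        if event.get("cat") != "Node":
            continue
        args = event.get("args") or {}
        provider = args.get("provider")
        if provider and provider != expected_provider:
            found.add((args.get("op_name", "?"), event.get("name", ""), provider))
    return sorted(found)


def gate(latencies_ms, events, max_cv=0.05):
    """Return a list of failure messages; an empty list means the gate passes.

    max_cv is a hypothetical threshold, not a value taken from the post.
    """
    failures = []
    cv = coefficient_of_variation(latencies_ms)
    if cv > max_cv:
        failures.append(f"latency CV {cv:.3f} exceeds {max_cv:.3f}")
    for op_name, event_name, provider in fallen_back_ops(events):
        failures.append(f"{op_name} ({event_name}) ran on {provider}")
    return failures


if __name__ == "__main__":
    # Inline sample standing in for a real profile file written by ORT.
    sample = json.loads("""[
      {"cat": "Session", "name": "model_run", "args": {}},
      {"cat": "Node", "name": "conv1_kernel_time",
       "args": {"op_name": "Conv", "provider": "QNNExecutionProvider"}},
      {"cat": "Node", "name": "softmax_kernel_time",
       "args": {"op_name": "Softmax", "provider": "CPUExecutionProvider"}}
    ]""")
    for message in gate([9.0, 10.0, 11.0], sample):
        print("FAIL:", message)
```

In a CI job, a non-empty result from `gate` would typically fail the build. The per-op messages make it clear which nodes left the NPU, which a latency threshold alone would not show.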
Similar Articles
Reverse Engineering the Qualcomm NPU Compiler
Reverse engineering the Qualcomm NPU compiler reveals undocumented VTCM memory management, MILP-based placement, automatic precision alteration, and a hidden analytical simulator (Hextimate) for edge deployment optimization.
Energy-Efficient On-Device RAG on a Mobile NPU: System Design and Benchmark on Snapdragon X Elite
This paper presents the first end-to-end RAG pipeline running entirely on a mobile NPU (Qualcomm Hexagon on Snapdragon X Elite), achieving up to 18x faster LLM prefilling and 4x lower energy vs. CPU, with no quality regression.
Efficient On-Device Diffusion LLM Inference with Mobile NPU
This paper presents llada.cpp, an NPU-aware inference framework for accelerating diffusion large language models (dLLMs) on smartphones. It introduces three techniques—Multi-Block Speculative Decoding, Dual-Path Progressive Revision, and Swap-Optimized Memory Runtime—to align dLLM inference with mobile NPU characteristics, achieving 17-42x latency reduction over CPU baseline.
Quant.npu: Enabling Efficient Mobile NPU Inference for on-device LLMs via Fully Static Quantization
Quant.npu introduces a fully static quantization framework for mobile NPUs, using learnable parameters and rotation matrices to enable efficient low-bit LLM inference without runtime re-computation, achieving up to 15.1% latency reduction.
RAG on Snapdragon X2 Laptop, 200K documents.
VecML demonstrates its AI-PC software running RAG on 200K documents using the new Snapdragon X2 laptop, achieving low-token and low-memory retrieval. The software integrates multiple database functions into one platform, and controlled testing for macOS is now open.