Efficient On-Device Diffusion LLM Inference with Mobile NPU

arXiv cs.LG 06/15/26, 04:00 AM Papers

Summary

This paper presents llada.cpp, an NPU-aware inference framework for accelerating diffusion large language models (dLLMs) on smartphones. It introduces three techniques—Multi-Block Speculative Decoding, Dual-Path Progressive Revision, and Swap-Optimized Memory Runtime—to align dLLM inference with mobile NPU characteristics, achieving 17-42x latency reduction over CPU baseline.

arXiv:2606.13740v1 Announce Type: new


Cached at: 06/15/26, 09:07 AM

# Efficient On-Device Diffusion LLM Inference with Mobile NPU
Source: [https://arxiv.org/html/2606.13740](https://arxiv.org/html/2606.13740)
, Yanfan Sun \(Beihang University, China\), and Ju Ren \(Tsinghua University, China\)

###### Abstract\.

Diffusion large language models \(dLLMs\) accelerate generation by denoising multiple tokens in parallel, making them attractive for latency\-sensitive mobile inference\. However, repeated denoising introduces substantial computation on smartphones\. Mobile neural processing units \(NPUs\) offer high\-throughput dense matrix computation, but efficiently exploiting them remains challenging: token commitment shrinks per\-block effective workloads, token revision complicates KV cache reuse, and limited NPU\-visible address space incurs costly remapping and data transfer overheads\.

In this paper, we propose llada\.cpp, the first NPU\-aware inference framework for accelerating dLLMs on smartphones\. llada\.cpp aligns block\-wise dLLM inference with the execution characteristics of mobile NPUs through three techniques\. \(1\) Multi\-Block Speculative Decoding fills the shrinking workload in late\-stage current\-block decoding with speculative future\-block tokens\. \(2\) Dual\-Path Progressive Revision keeps committed tokens revisable until stable and refreshes unstable tokens through a CPU\-side path without stalling dense NPU execution\. \(3\) Swap\-Optimized Memory Runtime compacts NPU\-visible address layouts and overlaps data staging with NPU computation to reduce remapping and transfer overheads\. We implement llada\.cpp as an end\-to\-end framework and evaluate it across diverse hardware platforms and dLLM workloads\. llada\.cpp reduces LLaDA\-8B generation latency by 17×\-42× over the CPU baseline with prefix KV cache reuse, while preserving generation quality\.

## 1\. Introduction

Large language models \(LLMs\) are becoming a core interface for mobile computing, enabling applications such as personalized assistants \(Liu et al\., [2024](https://arxiv.org/html/2606.13740#bib.bib2); Wang et al\., [2023](https://arxiv.org/html/2606.13740#bib.bib1)\), multimodal interaction \(Chu et al\., [2023](https://arxiv.org/html/2606.13740#bib.bib3); Zhang et al\., [2025](https://arxiv.org/html/2606.13740#bib.bib4)\), and context\-aware automation \(Wen et al\., [2024](https://arxiv.org/html/2606.13740#bib.bib5); Lee et al\., [2024](https://arxiv.org/html/2606.13740#bib.bib6)\)\. Despite their success, LLM inference remains dominated by autoregressive decoding \(Vaswani et al\., [2017](https://arxiv.org/html/2606.13740#bib.bib7)\), which generates outputs token by token and therefore incurs latency that scales with output length\. On mobile devices, this latency pressure is exacerbated by limited memory bandwidth and strict power budgets, impeding the development of interactive LLM applications\.

Recent diffusion large language models \(dLLMs\) \(Nie et al\., [2026](https://arxiv.org/html/2606.13740#bib.bib12); Austin et al\., [2021](https://arxiv.org/html/2606.13740#bib.bib11); Khanna et al\., [2025](https://arxiv.org/html/2606.13740#bib.bib13)\) offer a promising alternative\. Instead of generating tokens one by one, dLLMs iteratively denoise a masked sequence and update multiple token positions in parallel\. To make this iterative process practical for long\-sequence serving, existing systems often adopt block\-wise decoding \(Arriola et al\., [2025](https://arxiv.org/html/2606.13740#bib.bib14)\), where the output sequence is partitioned into consecutive blocks \(e\.g\., 32 tokens\)\. Blocks are decoded sequentially from left to right to enable prefix key\-value \(KV\) cache reuse, while tokens within each block are updated in parallel through iterative denoising\. This non\-autoregressive formulation exposes sequence\-level parallelism and shifts the latency bottleneck from output length to the number of denoising steps\.

Although dLLMs reduce token\-level sequential dependencies, they introduce a new performance bottleneck: repeated parallel computation\. A dLLM typically performs multiple denoising iterations, each requiring expensive transformer computation over the full sequence\. Moreover, because dLLM iterations repeatedly revise token states, reusing the KV cache across iterations within a block becomes difficult\. On resource\-constrained mobile devices such as smartphones, this repeated parallel computation can impose a substantial computational burden\. As shown in Figure [1](https://arxiv.org/html/2606.13740#S1.F1), such overhead can offset the benefits of dLLMs over autoregressive decoding\.

![Refer to caption](https://arxiv.org/html/2606.13740v1/x1.png)

Figure 1\. End\-to\-end generation speedup for 128\-token outputs on the OnePlus Ace5 Pro with the SM8750 SoC\.

In this paper, we identify a key architectural opportunity: the parallel decoding nature of dLLMs aligns well with the tensor\-computation capability of modern mobile neural processing units \(NPUs\) \(Mahurin, [2023](https://arxiv.org/html/2606.13740#bib.bib8); Rico et al\., [2024](https://arxiv.org/html/2606.13740#bib.bib9)\)\. Mobile NPUs are designed to execute dense tensor operations with high throughput and energy efficiency\. For example, Qualcomm reports that the Hexagon NPU in Snapdragon X Elite delivers up to 45 TOPS INT8 performance \(Qualcomm Technologies, Inc\., [2023](https://arxiv.org/html/2606.13740#bib.bib10)\)\. While autoregressive decoding exposes limited parallelism at each generation step, block\-wise dLLM inference naturally creates large parallel workloads within each block and across denoising iterations\. This observation creates a new opportunity for on\-device acceleration: by rethinking dLLM inference around the execution characteristics of mobile NPUs, we can convert the repeated parallel computation of dLLMs from a liability into an advantage\.

However, efficiently realizing dLLM inference on smartphones requires more than naive NPU offloading\. Although mobile NPUs are well suited for dense tensor operations, they favor static, regular, and shape\-stable execution, whereas block\-wise dLLM inference is inherently dynamic\. This mismatch manifests at three levels: \(1\) At the token\-commit level, as denoising leaves fewer masked tokens in each block, the effective workload shrinks, making late\-stage NPU forwards poorly amortized\. \(2\) At the token\-revision level, committed tokens may still require revision as more context becomes available, introducing sparse and irregular KV cache updates\. \(3\) At the memory\-access level, the working set of dLLM inference often exceeds the limited NPU\-visible address space, forcing frequent data swaps with system memory and incurring costly mapping and transfer overheads\.

To address these challenges, we propose llada\.cpp, the first NPU\-aware inference framework for accelerating dLLMs on smartphones\. The key idea of llada\.cpp is to align the parallel decoding nature of dLLMs with the powerful tensor computation capabilities of mobile NPUs, translating improved hardware utilization into lower generation latency\. Specifically, llada\.cpp introduces three key components:

\(1\) Multi\-Block Speculative Decoding improves NPU utilization during the later stages of denoising\. llada\.cpp speculatively incorporates tokens from future blocks to maintain sufficient parallel workload, while preserving the original commitment order through block\-wise acceptance\.

\(2\) Dual\-Path Progressive Revision enables efficient token revision without disrupting dense NPU execution\. llada\.cpp uses a fine\-grained algorithm to identify tokens that require revision and offloads the corresponding updates and logits computation to specialized CPU\-side kernels\.

\(3\) Swap\-Optimized Memory Runtime targets the memory\-access bottleneck caused by the limited NPU\-visible address space\. llada\.cpp uses graph\-level lifetime and access\-order information to build compact VA layouts, reducing fragmentation and remapping overhead, and pipelines data staging with NPU execution to hide data\-movement latency\.

We implement llada\.cpp as an end\-to\-end framework and evaluate it on diverse hardware platforms and dLLM workloads\. llada\.cpp reduces LLaDA\-8B end\-to\-end latency by 17×\-42× over the CPU baseline with prefix KV cache reuse, while preserving generation quality\. Notably, llada\.cpp achieves up to 3\.9× speedup for LLaDA\-8B over an equal\-size autoregressive model, highlighting the advantage of diffusion parallel generation on mobile devices\.

In summary, this paper makes the following contributions:

- We identify the mismatch between static mobile NPU execution and dynamic block\-wise dLLM inference, and characterize three key challenges in token commitment, token revision, and memory access\.
- We propose llada\.cpp, the first NPU\-aware inference framework for accelerating dLLMs on smartphones, combining algorithmic and system\-level optimizations across the NPU, CPU, and system memory\.
- We evaluate llada\.cpp across diverse hardware platforms and dLLM workloads, showing that it effectively translates dLLM parallelism into lower smartphone latency than autoregressive and CPU\-only baselines\.

## 2\. Background and Motivation

### 2\.1\. Diffusion Large Language Model

Autoregressive large language models \(LLMs\) generate text in a strictly left\-to\-right manner, where each decoding step predicts only the next token conditioned on previously generated tokens\. This paradigm has become the dominant generation approach in current LLMs and naturally supports incremental decoding with KV cache reuse\. However, its sequential dependency exposes limited token\-level parallelism\. As a result, even with token\-level optimizations, generation latency remains fundamentally tied to the output length\.

![Refer to caption](https://arxiv.org/html/2606.13740v1/x2.png)

Figure 2\. Comparison of decoding paradigms: \(a\) autoregressive, \(b\) diffusion, and \(c\) block\-wise diffusion LLM decoding\.

Diffusion large language models \(dLLMs\) provide an alternative to autoregressive generation\. As shown in Figure [2](https://arxiv.org/html/2606.13740#S2.F2)\(b\), they generate text by iteratively denoising masked token sequences rather than producing tokens one at a time\. At each step, the model predicts unresolved positions in parallel and updates a confidence\-selected subset, gradually refining the sequence into the final output\. This shifts latency from output length to the number of denoising steps, while converting token\-wise decoding into sequence\-wise matrix workloads that better utilize accelerators and reduce generation overhead\. These properties make dLLMs particularly attractive for latency\-sensitive scenarios on mobile devices\.

Block\-Wise Decoding\. A key limitation of vanilla dLLM decoding is the lack of effective KV cache reuse\. Since the model repeatedly denoises a sequence whose token states may change across iterations, cached KV states can quickly become stale, forcing the runtime to recompute attention states over a large portion of the sequence\. Block\-wise decoding addresses this issue and makes dLLM generation more practical for long sequences\. As illustrated in Figure [2](https://arxiv.org/html/2606.13740#S2.F2)\(c\), instead of repeatedly denoising the entire growing sequence, the output is divided into consecutive blocks\. Blocks are decoded from left to right, while tokens within the current block are denoised in parallel over multiple steps\. This structure localizes repeated computation and enables KV cache reuse across block boundaries\. Therefore, block\-wise decoding retains intra\-block diffusion parallelism while imposing a structured left\-to\-right workflow across blocks\.
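
The block-wise loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `blockwise_decode`, `toy_denoise`, the `MASK` sentinel, and the confidence-threshold commit rule with a most-confident fallback are all assumptions made for the example, and the `prefix` list only stands in for the prefix KV cache.

```python
from typing import Callable, List, Optional, Tuple

MASK = None  # hypothetical sentinel for a still-masked position

Denoiser = Callable[[List[int], List[Optional[int]]], List[Tuple[int, float]]]


def blockwise_decode(denoise: Denoiser, prompt: List[int], num_blocks: int,
                     block_size: int, threshold: float) -> List[int]:
    """Decode blocks left to right; denoise tokens inside a block in parallel."""
    prefix = list(prompt)  # committed context (stands in for the prefix KV cache)
    for _ in range(num_blocks):
        block: List[Optional[int]] = [MASK] * block_size
        while any(tok is MASK for tok in block):
            preds = denoise(prefix, block)  # one parallel pass over the whole block
            masked = [i for i, tok in enumerate(block) if tok is MASK]
            chosen = [i for i in masked if preds[i][1] >= threshold]
            if not chosen:
                # Assumed fallback: commit the most confident position to guarantee progress.
                chosen = [max(masked, key=lambda i: preds[i][1])]
            for i in chosen:
                block[i] = preds[i][0]
        prefix.extend(block)  # the finished block becomes reusable prefix context
    return prefix[len(prompt):]


def toy_denoise(prefix: List[int], block: List[Optional[int]]) -> List[Tuple[int, float]]:
    """Toy stand-in for a model: confident only when at most one left neighbour is masked."""
    out = []
    for i in range(len(block)):
        masked_left = sum(tok is MASK for tok in block[:i])
        conf = 1.0 if masked_left <= 1 else 0.5
        out.append((len(prefix) + i, conf))  # the "token" is just its absolute position
    return out


if __name__ == "__main__":
    print(blockwise_decode(toy_denoise, [0, 1], num_blocks=2, block_size=4, threshold=0.9))
```

With the toy denoiser, each 4-token block finishes in two denoising steps (two positions per step), and the call above returns `[2, 3, 4, 5, 6, 7, 8, 9]`.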

KV Cache Refresh\. While block\-wise decoding enables KV cache reuse across completed blocks, KV states within the current block remain inherently dynamic\. Unlike autoregressive decoding, where generated tokens become fixed once appended to the prefix, dLLM decoding repeatedly refines masked tokens over multiple denoising steps\. As token identities change, the corresponding KV states computed in earlier steps may become stale\. Table [1](https://arxiv.org/html/2606.13740#S2.T1) shows that reusing such stale KV cache entries can make attention operate on outdated representations, causing the model to condition on inconsistent context and degrading generation quality\. Therefore, determining when and how to refresh the KV cache becomes a key design problem for practical dLLM systems\.

Table 1\. Accuracy impact of block\-wise decoding\. The baseline follows vanilla dLLM decoding, which repeatedly denoises the entire growing sequence, while block\-wise decoding reuses the prefix KV cache from completed blocks\.

Table 2\. Latency of one LLaDA\-8B denoising step with 32\-128 active tokens on the CPU and NPU\.

### 2\.2\. Opportunity: Mobile Neural Processing Unit

Modern mobile devices increasingly integrate neural processing units \(NPUs\) as first\-class compute engines for on\-device deep learning inference\. Unlike CPUs, which provide lightweight and flexible computation for control\-intensive tasks, and GPUs, which are primarily optimized for graphics rendering and often incur high\-cost synchronization, NPUs employ dedicated datapaths to execute dense tensor operations efficiently \(Chen et al\., [2025a](https://arxiv.org/html/2606.13740#bib.bib15)\)\. As shown in Table [2](https://arxiv.org/html/2606.13740#S2.T2), NPUs exhibit clear performance advantages over CPUs on regular LLaDA denoising workloads such as matrix multiplication\.

Key Insight\. In this work, we observe that block\-wise dLLM decoding naturally exposes a computation unit that aligns well with the execution model of mobile NPUs\. Autoregressive decoding typically advances only one token per step, leading to small matrix shapes and low NPU utilization\. In contrast, a dLLM block contains multiple token positions that are denoised together and can be viewed as a small token batch\. This allows block\-wise denoising to be mapped to dense matrix workloads on the NPU\. Since mobile NPUs are often idle during conventional LLM inference, such mapping enables dLLM generation to exploit near\-free additional compute capacity, improving generation efficiency while largely hiding the extra computation of parallel denoising\.

![Refer to caption](https://arxiv.org/html/2606.13740v1/x3.png)

Figure 3\. Single FFN down projection matrix multiplication latency on the SM8750 SoC\. \(a\) Measured latency with 4 to 128 tokens on the CPU, GPU, and NPU\. \(b\) Measured latency for processing the same 32 active tokens as sparse groups of different sizes on the CPU and NPU\.

Despite this opportunity, effectively exploiting mobile NPUs requires understanding their computation and memory characteristics\. Through comprehensive evaluations, we identify the following four key properties:

\(1\) Stage Matrix Performance\. On the computation side, mobile NPUs derive most of their throughput from fixed\-size tiled matrix engines, such as systolic arrays, which operate on hardware\-preferred tile shapes\. When tensor dimensions are smaller than the tile size or not divisible by it, the NPU compiler inserts internal padding to form aligned tiles, and these padded elements still consume matrix\-engine cycles\. Consequently, matrix\-multiplication latency exhibits a stage\-like pattern rather than scaling smoothly with tensor size\. As shown in Figure [3](https://arxiv.org/html/2606.13740#S2.F3)\(a\), latency remains nearly constant within the same tile\-aligned range, but increases abruptly once the tensor shape crosses a tile boundary\.
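
The padding effect behind this stage-like pattern can be illustrated with a small calculation. The sketch below assumes the token dimension is padded up to a multiple of 32 (the FP16 tile edge mentioned later in this section); the real padding granularity depends on the hardware and compiler, and `padded_rows` and `utilization` are hypothetical helper names.

```python
import math


def padded_rows(num_tokens: int, tile: int = 32) -> int:
    """Rows the matrix engine processes once the token dimension is padded to a tile multiple."""
    if num_tokens <= 0:
        return 0
    return math.ceil(num_tokens / tile) * tile


def utilization(num_tokens: int, tile: int = 32) -> float:
    """Fraction of processed rows that carry real tokens rather than padding."""
    padded = padded_rows(num_tokens, tile)
    return num_tokens / padded if padded else 0.0


if __name__ == "__main__":
    for n in (4, 31, 32, 33, 64, 65):
        print(n, padded_rows(n), round(utilization(n), 3))
```

Under this assumption, 4 and 32 tokens both occupy one 32-row tile, while 33 tokens already occupy two tiles. That is the same plateau-then-jump shape the text attributes to tile boundaries.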

\(2\) Weak Vector Support\. Beyond tiled matrix computation, mobile NPUs provide much weaker support for irregular or vector\-style operations\. While NPUs excel at matrix multiplication through dedicated matrix engines, their general\-purpose vector units typically offer much lower compute throughput and memory bandwidth\. As shown in Figure [3](https://arxiv.org/html/2606.13740#S2.F3)\(b\), vector\-style and irregular operations achieve much lower efficiency on the NPU than dense tiled matrix kernels\. This motivates a computation split that offloads dense matrix operations to the NPU while leaving irregular control and vector\-heavy operations to the CPU\.

![Refer to caption](https://arxiv.org/html/2606.13740v1/x4.png)

Figure 4\. NPU memory characteristics: \(a\) FP16 HMX tile layout and \(b\) CPU\-NPU system with shared memory\.

\(3\) Specialized Memory Layout\. On the memory side, mobile NPUs rely on specialized data layouts to sustain high matrix throughput\. Their matrix engines consume activations and weights in hardware\-specific tiled formats\. As shown in Figure [4](https://arxiv.org/html/2606.13740#S2.F4), the FP16 path uses 32×32 matrix tiles with hardware\-expected internal ordering, including tile\-level inner\-product order and row shuffling\. Since CPU\-oriented tensor layouts are incompatible with the formats directly consumed by NPU matrix engines, additional layout conversion is required, introducing non\-trivial runtime overhead\.

\(4\) Limited Virtual Address\. In addition to layout constraints, mobile NPUs access shared buffers with the CPU through a constrained virtual memory mechanism\. Before a buffer can be consumed by an NPU kernel, it must be registered and mapped into the NPU virtual address space\. This address space is limited by session\-level address windows, mapper entries, alignment requirements, and fragmentation\. As a result, frequent buffer allocation or remapping can introduce noticeable latency and may even cause mapping failures under memory pressure\. Efficient NPU\-backed execution therefore requires stable buffer reuse and careful management of the limited NPU virtual address space\.

### 2\.3\. Challenges: Static NPUs Meet Dynamic dLLMs

Despite the natural alignment between parallel decoding of dLLMs and the tensor computation capability of mobile NPUs, efficient dLLM inference requires more than simply mapping parallel tensor computation onto the hardware\. The core challenge of llada\.cpp lies in the mismatch between the static execution model of mobile NPUs and the dynamic behavior of dLLM inference\. Mobile NPUs are optimized for regular, dense, and shape\-stable tensor computation, whose efficiency relies on fixed tile granularity, contiguous data layouts, and predictable memory access\. In contrast, block\-wise dLLM inference continuously changes the active token set, reusable KV states, and accessed data regions across denoising steps\. This mismatch manifests in three dimensions:

![Refer to caption](https://arxiv.org/html/2606.13740v1/x5.png)

Figure 5\. Dynamic behavior of block\-wise dLLM inference on mobile NPUs\. \(a\) Share of denoising steps in which the current block has only a small number of remaining masked tokens\. \(b\) For different commit thresholds, the number of tokens committed per step and the fraction of committed tokens whose confidence remains below 0\.9 after all tokens in the block are committed\. \(c\) Share of output decisions that are still evaluated and the share that can be skipped after positions become stable\. \(d\) Per\-layer FFN operator latency, showing spikes when buffer mappings are triggered to make weights visible in the NPU address space\.

\(1\) Token\-Commit Level\. In block\-wise decoding, tokens in the current block are progressively committed as denoising proceeds\. Early in decoding, many masked tokens remain active, providing sufficient parallel workload for each NPU forward pass\. Later, however, the number of remaining masked tokens decreases, reducing the effective computation per forward pass and leaving the NPU underutilized\. Figure [5](https://arxiv.org/html/2606.13740#S2.F5)\(a\) shows that this low\-workload tail appears across many denoising steps, exposing little useful token work while still occupying NPU execution slots\.

\(2\) Token\-Revision Level\. Block\-wise decoding must decide when a predicted token can be exposed to later computation, but an early decision made under incomplete intra\-block context is not necessarily reliable\. Figure [5](https://arxiv.org/html/2606.13740#S2.F5)\(b\) quantifies this tension with a single commit threshold: lowering the threshold commits more tokens per step, improving decoding progress, but after the whole block is committed, a larger fraction of those tokens still have confidence below 0\.9\. This means that treating every committed token as final prefix context can carry uncertain token states into later computation; the runtime should keep such low\-confidence committed tokens revisable\. At the same time, revision should not force the system to repeatedly recompute output decisions for positions that are already reliable\. Figure [5](https://arxiv.org/html/2606.13740#S2.F5)\(c\) shows that many such decisions can be skipped, motivating a revision path that tracks uncertain tokens while avoiding redundant output projection and decision logic for stable positions\.

\(3\) Memory\-Access Level\. Mobile NPUs expose limited device\-visible address space, which must accommodate model weights, the KV cache, activations, and temporary workspaces\. dLLM inference repeatedly accesses and swaps among these data across denoising steps, amplifying the overhead of buffer mapping, data transfer, and replacement\. As shown in Figure [5](https://arxiv.org/html/2606.13740#S2.F5)\(d\), a representative FFN operator becomes a latency spike when a buffer mapping is triggered, even though the same operator is short when the required buffers are already visible\. Across one forward pass, only a small fraction of NPU operators trigger mapping, but their cumulative mapping time is large enough to enter the critical path\. Such recurring memory\-access overheads can substantially offset the latency reduction from NPU acceleration\.

## 3\. System Design

### 3\.1\. Overview of llada\.cpp

We propose llada\.cpp, the first NPU\-aware inference framework for dLLMs on smartphones\. llada\.cpp follows a block\-wise decoding workflow: it partitions the sequence into consecutive blocks and decodes them from left to right\. Completed blocks are reused through the KV cache, while tokens in the current block are updated in parallel across denoising steps\. Figure [6](https://arxiv.org/html/2606.13740#S3.F6) presents the system overview\.

![Refer to caption](https://arxiv.org/html/2606.13740v1/x6.png)

Figure 6\. Overview of llada\.cpp\.

As analyzed in Section [2\.3](https://arxiv.org/html/2606.13740#S2.SS3), efficient dLLM inference cannot be achieved by simply mapping tensor computation onto NPUs\. To address the challenges, llada\.cpp designs three key components: ❶ Multi\-Block Speculative Decoding addresses the dynamic token\-commit process within the current block: as the block approaches completion and the number of remaining masked tokens decreases, llada\.cpp speculatively includes tokens from future blocks into the same NPU forward pass to maintain high hardware utilization\. ❷ Dual\-Path Progressive Revision handles the dynamic token\-revision process in the prefix context: it first identifies revisable tokens through fine\-grained state tracking and then offloads their sparse KV cache refreshes and logits computation to a CPU\-side path, keeping the NPU dedicated to dense denoising\. ❸ Swap\-Optimized Memory Runtime targets the dynamic memory\-access process within the limited NPU address space: it uses graph\-guided buffer mapping to compact fragmented tensors and reduce avoidable VA swaps, while pipelining mapping preparation, data transfer, and NPU execution to hide the latency of unavoidable swaps\.

### 3\.2\. Multi\-Block Speculative Decoding

In standard block\-wise decoding, although each forward pass performs global attention over the block, the number of masked tokens gradually decreases as denoising progresses\. As a result, later decoding steps provide insufficient effective workload to fully utilize the NPU compute capacity\. As depicted in Figure [7](https://arxiv.org/html/2606.13740#S3.F7), llada\.cpp dynamically incorporates a subset of tokens from subsequent blocks into current\-block decoding\. By using future tokens as speculative work, this design increases the effective workload of each NPU forward pass while preserving the original decoding semantics\.

![Refer to caption](https://arxiv.org/html/2606.13740v1/x7.png)

Figure 7\. Multi\-block speculative decoding\. llada\.cpp uses two complementary strategies to expand the decoding window\.

llada\.cpp introduces a decoding window as the basic execution unit, defined as the token range included in each NPU forward pass\. Unlike a block, which serves as the semantic unit for left\-to\-right commitment, a decoding window may extend beyond the current block to include tokens from future blocks\. These future\-block tokens are marked as *draft*: they record early denoising progress but neither update the prefix KV cache nor affect the commitment order\. When a future block later becomes current, llada\.cpp re\-evaluates its draft tokens and accepts them only if they satisfy the commitment criterion\. By decoupling execution from commitment, llada\.cpp adapts the executed token range while preserving decoding semantics\. Specifically, llada\.cpp adapts the decoding window through two complementary strategies:

![Refer to caption](https://arxiv.org/html/2606.13740v1/x8.png)

Figure 8\. Average first commitment step by token position across 32\-token blocks in LLaDA\-8B on GSM8K\.

Strategy \#1: Continuous Sliding\. llada\.cpp first exploits a key property of block\-wise diffusion decoding: token commitment within a block follows a left\-to\-right frontier rather than completing uniformly across all positions\. As shown in Figure [8](https://arxiv.org/html/2606.13740#S3.F8), positions closer to the confirmed prefix tend to commit earlier, while later positions require more denoising because they depend on still\-uncertain preceding context\. This means the useful active region of a block naturally moves rightward as decoding proceeds, creating an opportunity to move the execution window with the commit frontier\.

Based on this observation, llada\.cpp slides the decoding window along the token\-commit frontier to focus computation on tokens that remain actively denoised\. At the beginning of decoding, the window covers the current block\. After each denoising step, llada\.cpp commits tokens in the window that satisfy the confidence criterion\. Once tokens on the left side of the window are committed, the window slides to the right while keeping its size unchanged, gradually extending into the next block\. These future positions then participate in subsequent denoising forward passes and produce tentative predictions and confidence states\.
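
One way to picture the sliding rule is the sketch below. It assumes the window's left edge advances past the contiguous run of committed positions and that the window is clipped at the end of the sequence; `slide_window` and the boolean `committed` list are hypothetical names for illustration, not the framework's actual data structures.

```python
from typing import List, Tuple


def slide_window(committed: List[bool], start: int, size: int) -> Tuple[int, int]:
    """Return the new [start, end) window after skipping committed tokens on the left edge."""
    n = len(committed)
    # Advance the left edge along the commit frontier.
    while start < n and committed[start]:
        start += 1
    # Keep the window size fixed, clipped to the sequence length.
    end = min(start + size, n)
    return start, end


if __name__ == "__main__":
    # Two 4-token blocks; positions 0, 1 and 3 of the first block are committed.
    committed = [True, True, False, True, False, False, False, False]
    print(slide_window(committed, start=0, size=4))
```

In this example the window moves from positions 0 to 3 to positions 2 to 5 (the call returns `(2, 6)`), so it now covers two positions of the next block, which would be treated as draft tokens.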

Strategy \#2: Adaptive Lookahead\. Continuous sliding advances future blocks only as committed tokens release space in the window\. To better exploit late\-stage compute capacity, llada\.cpp introduces adaptive lookahead, which proactively extends the window to include additional future\-block tokens in the same NPU forward pass\. This strategy is motivated by the non\-linear latency behavior of mobile NPUs\. As discussed in Section [2\.2](https://arxiv.org/html/2606.13740#S2.SS2), NPU forward latency grows stepwise rather than linearly with the number of participating tokens\. In particular, larger token batches better amortize fixed overheads from data transfers, kernel launches, and buffer allocation, allowing extra future\-block tokens to improve NPU utilization with limited marginal latency\.

Based on this observation, llada\.cpp converts otherwise wasted compute capacity in late\-stage decoding into useful speculative work for future blocks\. When only a few masked tokens remain in the current block, each forward pass still pays fixed scheduling, data movement, and kernel execution costs, but exposes too little effective parallel work\. llada\.cpp therefore expands the decoding window to include the next block and, when resources allow, additional future blocks in the same forward pass\. These future\-block tokens reuse the already paid NPU execution overhead for early denoising, increasing the effective workload per forward pass and making the available computation more productive\.

llada\.cpp uses an adaptive policy to determine when to expand the decoding window and how many future tokens to include\. At each denoising step, it counts the remaining masked tokens in the current block\. When this count drops below a threshold, the block is considered to have entered the late decoding stage, triggering lookahead\. llada\.cpp then greedily appends future\-block tokens according to an offline\-profiled NPU latency table, which captures the stepwise NPU latency buckets\. This allows the runtime to estimate how many extra tokens can be added before crossing into the next bucket and incurring noticeable marginal latency\. A candidate expansion is accepted only if it remains within the profiled latency allowance and its additional memory usage fits the available budget\. Otherwise, expansion stops, and the resulting window is used for the next forward pass\.
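
A greedy policy of this kind might look like the sketch below. Everything concrete in it is an assumption for illustration: the `LATENCY_BUCKETS` values are placeholders rather than measurements, and `plan_window`, `bucket_latency`, the per-token memory cost, and the token-by-token expansion granularity are hypothetical choices that the text does not specify.

```python
import math
from bisect import bisect_left
from typing import List, Tuple

# Hypothetical offline-profiled table: (largest token count in bucket, latency in ms).
# The numbers are illustrative placeholders, not measured values.
LATENCY_BUCKETS: List[Tuple[int, float]] = [(32, 10.0), (64, 18.0), (96, 27.0), (128, 35.0)]


def bucket_latency(num_tokens: int, buckets: List[Tuple[int, float]]) -> float:
    """Stepwise latency lookup; sizes beyond the profiled range count as infinitely slow."""
    bounds = [upper for upper, _ in buckets]
    idx = bisect_left(bounds, num_tokens)
    return buckets[idx][1] if idx < len(buckets) else math.inf


def plan_window(remaining_masked: int, window_tokens: int, future_available: int,
                late_threshold: int, allowance_ms: float, bytes_per_token: int,
                memory_budget: int,
                buckets: List[Tuple[int, float]] = LATENCY_BUCKETS) -> int:
    """Return the window size (in tokens) to use for the next forward pass."""
    if remaining_masked >= late_threshold:
        return window_tokens  # not in the late stage yet, so no lookahead
    base = bucket_latency(window_tokens, buckets)
    if math.isinf(base):
        return window_tokens  # already outside the profiled range
    extra = 0
    while extra < future_available:
        candidate = window_tokens + extra + 1
        if bucket_latency(candidate, buckets) > base + allowance_ms:
            break  # would cross into a slower bucket than the allowance permits
        if (extra + 1) * bytes_per_token > memory_budget:
            break  # additional memory would exceed the available budget
        extra += 1
    return window_tokens + extra


if __name__ == "__main__":
    print(plan_window(3, 20, 64, late_threshold=8, allowance_ms=0.0,
                      bytes_per_token=1, memory_budget=1000))
```

With a zero latency allowance, a 20-token window grows to 32 tokens, the edge of its current bucket. Raising `allowance_ms` to 8.0 lets it grow to 64 in this toy table, and a tight memory budget stops the expansion earlier.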

### 3\.3\. Dual\-Path Progressive Revision

Block\-wise decoding enables dLLMs to generate tokens in parallel within each block and reuse previous blocks through the prefix KV cache\. However, a token reused as prefix context may still become suboptimal as subsequent tokens are denoised and completed\. Conversely, repeatedly denoising such tokens within the current block can be wasteful, as insufficient context limits meaningful refinement\. To address this issue, llada\.cpp introduces a staged token stabilization mechanism that identifies tokens requiring revision and applies the corresponding updates through a dedicated CPU\-side path, allowing the NPU to focus on dense denoising\.

![Refer to caption](https://arxiv.org/html/2606.13740v1/x9.png)

Figure 9\. \(a\) Example of the staged token stabilization algorithm\. \(b\) llada\.cpp offloads unstable\-token revision to a CPU path and merges the revised cache in a delayed manner\.

Staged Token Stabilization\. llada\.cpp first introduces a three\-state transition mechanism to manage the lifecycle of each generated token\. The key insight is that a token becoming available for subsequent denoising does not necessarily make it reliable enough to be permanently merged into the prefix KV cache\. As shown in Figure [9](https://arxiv.org/html/2606.13740#S3.F9)\(a\), llada\.cpp therefore separates short\-term visibility from long\-term stability\.

Specifically, each generated position is assigned one of three logical states: invisible \(IV\), visible \(V\), or stable \(S\)\. An invisible token is either masked or has a candidate prediction with insufficient confidence and thus is not used as context\. A visible token has a concrete prediction and can participate in subsequent denoising steps, enabling decoding to progress in parallel\. However, visible tokens are not yet treated as complete: their token identities and KV states may still be revised as the surrounding context becomes more complete\. A stable token has passed a stronger stability criterion and is considered reliable enough for long\-term reuse\.

llada\.cpp uses two thresholds to control token state transitions\. At each denoising step, tokens that satisfy the visibility threshold $\theta_{\text{visible}}$ are promoted to the visible state, while tokens that further satisfy the stability threshold $\theta_{\text{stable}}$ are promoted to the stable state\. Both visible and stable tokens can be exposed for parallel decoding, but only stable tokens are permanently reused as prefix context\. If no token satisfies $\theta_{\text{visible}}$, llada\.cpp promotes the most confident token to visible to ensure continuous progress\. Importantly, early visibility does not imply permanent commitment: visible tokens remain tracked, even after their blocks have been reused as prefix context, until they satisfy $\theta_{\text{stable}}$ and are safely merged into the long\-term prefix KV cache\.
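The three-state lifecycle can be illustrated with a short sketch. Several details here are assumptions rather than statements from the paper: "satisfying" a threshold is modeled as confidence greater than or equal to it, states never move backward (a visible token is not demoted), and the fallback fires when no invisible token was promoted in the step. The names `TokenState` and `step_states` are hypothetical; the default thresholds 0.7 and 0.9 are the values reported in the experimental setup.

```python
from enum import Enum


class TokenState(Enum):
    INVISIBLE = "IV"
    VISIBLE = "V"
    STABLE = "S"


def step_states(states, confidences, theta_visible=0.7, theta_stable=0.9):
    """Apply one denoising step of state transitions and return the new states."""
    new_states = list(states)
    promoted_from_invisible = False
    for i, (state, conf) in enumerate(zip(states, confidences)):
        if state is TokenState.STABLE:
            continue  # stable tokens are final in this sketch
        if conf >= theta_stable:
            new_states[i] = TokenState.STABLE
        elif conf >= theta_visible:
            new_states[i] = TokenState.VISIBLE
        if state is TokenState.INVISIBLE and new_states[i] is not TokenState.INVISIBLE:
            promoted_from_invisible = True
    # Fallback: guarantee progress by exposing the most confident invisible token.
    invisible = [i for i, s in enumerate(new_states) if s is TokenState.INVISIBLE]
    if invisible and not promoted_from_invisible:
        best = max(invisible, key=lambda i: confidences[i])
        new_states[best] = TokenState.VISIBLE
    return new_states
```

Tokens left in the visible state after a block finishes are the ones a refresh path would keep tracking until they cross the stability threshold.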

Sparse Cache Refresh\. The next challenge is to efficiently track and refresh visible tokens once they enter the prefix context\. These revisions are sparse, irregular, and position\-dependent, often involving only a few tokens scattered across previous blocks\. Since such computation poorly matches the NPU's dense execution model, performing it on the NPU would disrupt current\-block denoising and add data movement and synchronization overheads\.

To avoid this inefficiency, llada\.cpp separates sparse token maintenance from dense NPU decoding\. As shown in Figure [9](https://arxiv.org/html/2606.13740#S3.F9)\(b\), llada\.cpp introduces a CPU\-side path that asynchronously recomputes token confidence, refreshes token identities, and updates their KV states\. This division matches the hardware characteristics of the two processors: the NPU handles high\-throughput dense computation, while the CPU handles sparse updates with lower scheduling overhead\.

llada\.cpp enables the CPU path with two techniques\. First, the CPU refresh path reuses the same HMX\-specialized weights as the main NPU model\. Because these weights are stored in NPU\-specific tiled and quantized layouts, standard CPU matrix multiplication cannot directly interpret them under the original row\-column semantics\. llada\.cpp therefore implements specialized CPU kernels that explicitly decode the HMX tile permutation and quantized storage order, ensuring that CPU\-side recomputation preserves the shared memory layout and avoids inconsistent computation\.

Second, llada\.cpp adopts a delayed\-merge policy to keep CPU refreshes off the critical path\. At the beginning of each denoising step, the NPU reads the current prefix KV cache and proceeds with current\-block decoding\. In parallel, the CPU refreshes selected KV entries based on the latest token states\. These refreshed entries are not consumed by the NPU within the same step; instead, they are merged back into the prefix KV cache only at step boundaries\. This step\-boundary merge avoids fine\-grained CPU\-NPU synchronization while making refreshed context available to later denoising steps\.
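A minimal sketch of the delayed-merge idea follows, using a Python thread to stand in for the CPU refresh path. The class `PrefixKVCache`, the function `run_step`, and the callables `npu_forward` and `cpu_refresh` are hypothetical placeholders; real KV entries would be tensors in shared memory rather than dictionary values.

```python
from concurrent.futures import ThreadPoolExecutor


class PrefixKVCache:
    def __init__(self, entries):
        self._committed = dict(entries)  # position -> KV payload read by the NPU
        self._pending = {}               # refreshed entries awaiting the merge

    def snapshot(self):
        """View read at the start of a step (a copy, for the sake of the sketch)."""
        return dict(self._committed)

    def stage_refresh(self, position, kv):
        self._pending[position] = kv

    def merge_at_step_boundary(self):
        self._committed.update(self._pending)
        merged = len(self._pending)
        self._pending.clear()
        return merged


def run_step(cache, npu_forward, cpu_refresh, refresh_positions):
    """Run one denoising step with the CPU refresh overlapped with the NPU pass."""
    view = cache.snapshot()
    with ThreadPoolExecutor(max_workers=1) as cpu_path:
        refresh = cpu_path.submit(
            lambda: {p: cpu_refresh(p, view[p]) for p in refresh_positions})
        output = npu_forward(view)   # never observes refreshes from this step
        refreshed = refresh.result()
    for position, kv in refreshed.items():
        cache.stage_refresh(position, kv)
    cache.merge_at_step_boundary()   # visible to the NPU from the next step on
    return output
```

Because the NPU pass works on the snapshot taken at the start of the step, the only synchronization point is the merge at the step boundary.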

Selective Logits Skipping\. In addition to sparse cache refreshes, llada\.cpp offloads logits computation to the CPU path to relieve NPU memory pressure from vocabulary\-sized output projection weights\. Since naively computing logits for all tokens on the CPU would incur substantial overhead, llada\.cpp performs selective logits skipping to keep CPU computation manageable\. The key observation is that logits are needed only for positions whose token identities may still change or whose confidence must be evaluated for future decisions\. For stable tokens whose identities no longer affect commit decisions, llada\.cpp skips output projection, confidence estimation, and sampling logic\.
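The skipping rule can be sketched in a few lines. This sketch assumes greedy decoding and uses the maximum softmax probability as the confidence score, which is a common choice but an assumption here; `selective_logits`, `hidden`, `is_stable`, and `w_out` are hypothetical names, and plain Python lists stand in for tensors.

```python
import math


def selective_logits(hidden, is_stable, w_out):
    """Compute (token_id, confidence) only for positions that are not stable.

    hidden:    per-position hidden vectors
    is_stable: per-position flags; stable positions are skipped entirely
    w_out:     output projection, one row per vocabulary entry
    """
    results = {}
    for pos, (h, stable) in enumerate(zip(hidden, is_stable)):
        if stable:
            continue  # no output projection, confidence estimate, or sampling
        logits = [sum(w * x for w, x in zip(row, h)) for row in w_out]
        peak = max(logits)
        exps = [math.exp(v - peak) for v in logits]  # numerically stable softmax
        best = max(range(len(logits)), key=logits.__getitem__)
        results[pos] = (best, exps[best] / sum(exps))
    return results
```

The cost of the output projection then scales with the number of non-stable positions rather than with the full window length.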

### 3\.4\. Swap\-Optimized Memory Runtime

Mobile NPUs access system memory through specific buffers \(e\.g\., RPCMEM \(Qualcomm Technologies, Inc\., [2026b](https://arxiv.org/html/2606.13740#bib.bib22)\) or DMABUF \(The Linux Kernel Developers, [2026a](https://arxiv.org/html/2606.13740#bib.bib23)\)\), which must be mapped into the NPU virtual address \(VA\) space before kernel execution\. However, this space is limited and often cannot simultaneously hold model weights, the KV cache, and workspaces, forcing frequent data swapping with extra mapping and transfer overheads\. As shown in Figure [10](https://arxiv.org/html/2606.13740#S3.F10), llada\.cpp builds a swap\-optimized memory runtime that minimizes avoidable VA swaps and makes unavoidable swaps more efficient by hiding their latency behind NPU execution\.

![Refer to caption](https://arxiv.org/html/2606.13740v1/x10.png)

Figure 10\. \(a\) llada\.cpp uses the computational graph and profiled metrics to determine which tensors remain resident in NPU\-visible memory\. \(b\) llada\.cpp uses double buffering to hide data transfer latency for short\-lived tensors\.

Graph\-Guided Buffer Mapping\. llada\.cpp first reduces unnecessary VA swaps by organizing buffer mappings according to the computational graph, which provides the execution order of NPU operators and the producer\-consumer relationships of tensors\. Using this information, the runtime classifies tensors as either long\-lived or short\-lived under the available VA budget\. Tensors that must remain visible across denoising steps, such as the model weights and KV cache, are treated as long\-lived by default\. In contrast, activations and temporary tensors are classified as short\-lived because their live ranges are bounded by local graph dependencies\.

When the VA budget cannot keep all long\-lived data resident, llada\.cpp uses a profile\-guided placement algorithm to determine which objects remain resident and which are mapped temporarily\. Figure [10](https://arxiv.org/html/2606.13740#S3.F10)\(a\) illustrates this procedure\. During initialization, llada\.cpp profiles representative operators to collect three types of information: \(1\) the size of associated long\-lived tensors, \(2\) NPU execution time, and \(3\) the transfer time required to prepare inputs on the host and make them visible in the NPU VA space\. Using this profile, llada\.cpp constructs alternating resident data and temporary data\. Guided by the execution order, llada\.cpp traverses all operators and classifies their associated tensors\. Given the buffer budget, llada\.cpp first fills it with as much resident data as possible and then selects temporary data whose transfers can be maximally overlapped with computation on resident data\. In particular, dependency\-free operators can be reordered to better satisfy the overlap criterion\.

For long\-lived data that remain resident, llada\.cpp assigns stable VA regions that stay mapped across denoising steps\. These tensors are further packed according to their graph\-derived consumption order\. In particular, tensors used by consecutive NPU operators are placed close to each other and accessed through a small number of large buffer mappings\. For short\-lived data, including naturally temporary tensors and long\-lived data demoted by the classification algorithm, llada\.cpp uses compact scratch or staging regions and reuses offsets based on graph\-derived lifetimes\. Tensors with non\-overlapping live ranges share the same VA range, while operator\-local workspaces are released immediately after use\. This organization consolidates fragmented tensors into fewer mappings, reduces avoidable VA swaps, and lowers the bookkeeping cost of VA management\.
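Lifetime-based offset reuse for short-lived tensors can be illustrated with a first-fit allocator over live ranges. This is a generic sketch of the idea, not the allocator used by llada.cpp; the tuple format `(name, size, first_op, last_op)` and the function name `assign_offsets` are assumptions, with live ranges expressed as inclusive operator indices from the execution order.

```python
def assign_offsets(tensors):
    """Assign scratch offsets so that only tensors alive at the same time are disjoint.

    tensors: list of (name, size, first_op, last_op), live range inclusive.
    Returns ({name: offset}, total_scratch_size).
    """
    placed = []   # (offset, size, first_op, last_op) of tensors already assigned
    offsets = {}
    for name, size, first, last in sorted(tensors, key=lambda t: t[2]):
        # Address ranges held by tensors whose live range overlaps this one.
        busy = sorted((o, o + s) for o, s, f, l in placed
                      if not (l < first or last < f))
        offset = 0
        for start, end in busy:
            if offset + size <= start:
                break                 # the gap before this range is large enough
            offset = max(offset, end)
        offsets[name] = offset
        placed.append((offset, size, first, last))
    total = max((o + s for o, s, _, _ in placed), default=0)
    return offsets, total
```

For a chain of four activations in which each is consumed only by the next operator, the scratch region needs to hold just two of them at a time, so the total is well below the sum of their sizes.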

Pipeline\-Based Data Transfer\. After compacting buffer mappings, llada\.cpp reduces unavoidable VA\-swap overhead by prefetching mapping metadata and staging data\. Guided by the computational graph, the runtime prepares buffer handles, offsets, sizes, and target staging slots while the NPU executes preceding operators\. For short\-lived data that must be staged before use, especially long\-lived data demoted by the placement algorithm, this prefetch step also identifies the target staging slot and starts preparing the data before its consumer operator is reached\. This preserves tensor layout and computation semantics while moving mapping preparation and data staging off the critical path\.

llada\.cpp uses double\-buffered staging to hide transfer latency\. The runtime allocates two reusable staging slots within the NPU\-visible VA space\. While the NPU consumes data from one slot, the host prepares the next required data object in the other slot by preparing its buffer descriptor and validating its VA mapping\. Once the current operator finishes, the two slots exchange roles: the prepared slot becomes the input to the next NPU kernel, and the released slot is reused for the following transfer\. Following the graph\-derived access sequence, the runtime prefetches data ahead of its consumer operators, overlapping transfer and mapping costs with resident NPU computation whenever possible\.
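The slot exchange described above follows the usual double-buffering pattern, sketched here with a single host worker thread. The callables `stage` and `execute` are hypothetical stand-ins for host-side staging and NPU kernel execution, and the operator list stands in for the graph-derived access sequence.

```python
from concurrent.futures import ThreadPoolExecutor


def run_pipeline(operators, stage, execute):
    """Execute operators in order while staging the next input in the other slot.

    stage(op, slot)   -> staged input for op, prepared on the host in `slot`
    execute(op, data) -> result of running op on the staged input
    """
    results = []
    if not operators:
        return results
    with ThreadPoolExecutor(max_workers=1) as host:
        slot = 0
        pending = host.submit(stage, operators[0], slot)
        for i, op in enumerate(operators):
            data = pending.result()  # the slot for this operator is ready
            if i + 1 < len(operators):
                # Prepare the next input in the slot released by the previous operator.
                pending = host.submit(stage, operators[i + 1], 1 - slot)
            results.append(execute(op, data))  # overlaps with the staging above
            slot = 1 - slot                    # the two slots exchange roles
    return results
```

Staging for operator i+1 is submitted before operator i runs, so the transfer cost is exposed only when staging takes longer than the computation it overlaps with.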

## 4\. Implementation

We implement llada\.cpp as an end\-to\-end framework built on llama\.cpp \(ggml\-org, [2026](https://arxiv.org/html/2606.13740#bib.bib16)\), extending it with over 12K lines of code\. Our implementation targets Qualcomm Hexagon NPUs due to their relatively open software ecosystem, while the core design can be adapted to other mobile platforms with similar execution paradigms\. llada\.cpp consists of two modules:

The first module is a Qualcomm Hexagon NPU operator library, compiled as a Hexagon DSP shared object using the Hexagon SDK \(Qualcomm Technologies, Inc\., [2026c](https://arxiv.org/html/2606.13740#bib.bib19)\)\. It targets the DSP\-coupled HMX \(Qualcomm Technologies, Inc\., [2026d](https://arxiv.org/html/2606.13740#bib.bib20)\) and HVX \(Qualcomm Technologies, Inc\., [2026a](https://arxiv.org/html/2606.13740#bib.bib21)\) engines, and implements FP16 matrix multiplication, Q4\_0 matrix multiplication for 4\-bit quantized weights, FlashAttention over FP16 KV states, RMSNorm, elementwise operators, shared\-memory mapping, and a device\-side worker pool\. The model converter extends the GGUF pipeline \([19](https://arxiv.org/html/2606.13740#bib.bib17)\) with a layout pass that rearranges FP16 weights into 32\-by\-32 HMX tiles and packs quantized weights for HVX\-based dequantization kernels\.

The second module is an Android host runtime, which adds a GGML Hexagon backend \([18](https://arxiv.org/html/2606.13740#bib.bib18)\), a dLLM model path, and a block\-wise diffusion driver\. It invokes the device library through FastRPC and uses RPCMEM shared memory \(Qualcomm Technologies, Inc\., [2026b](https://arxiv.org/html/2606.13740#bib.bib22)\), backed by Linux DMABUF \(The Linux Kernel Developers, [2026a](https://arxiv.org/html/2606.13740#bib.bib23)\), as the tensor buffer type, allowing the CPU and NPU to access the same physical pages without copies\. Before dispatching each operator, the backend validates NPU\-visible mappings and passes tensors as file\-descriptor/offset pairs\. Supported operators are executed on the NPU, while the host manages block\-wise denoising, KV cache views, token\-state transitions, refresh candidates, lookahead positions, and output decisions\.
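To make the layout pass of the first module concrete, the sketch below rearranges a row-major matrix into tile-major order and back. It is only a generic illustration: the element order inside a real HMX tile is not described in this section, so row-major order within each tile is an assumption, and `to_tiles` and `from_tiles` are hypothetical names. The inverse function plays the role of a CPU-side kernel that has to decode the tiled storage order before it can index weights by row and column.

```python
def to_tiles(matrix, tile=32):
    """Flatten a row-major matrix into tile-major order (row-major inside each tile)."""
    rows, cols = len(matrix), len(matrix[0])
    if rows % tile or cols % tile:
        raise ValueError("matrix dimensions must be multiples of the tile size")
    flat = []
    for tr in range(0, rows, tile):
        for tc in range(0, cols, tile):
            for r in range(tr, tr + tile):
                flat.extend(matrix[r][tc:tc + tile])
    return flat


def from_tiles(flat, rows, cols, tile=32):
    """Inverse of to_tiles: recover row-column indexing from the tiled buffer."""
    matrix = [[None] * cols for _ in range(rows)]
    values = iter(flat)
    for tr in range(0, rows, tile):
        for tc in range(0, cols, tile):
            for r in range(tr, tr + tile):
                for c in range(tc, tc + tile):
                    matrix[r][c] = next(values)
    return matrix
```

A small tile size such as `tile=2` makes the reordering easy to inspect by hand before switching to 32.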

## 5\. Evaluation

### 5\.1\. Experimental Setup

Platform\. We evaluate llada\.cpp on three Qualcomm smartphones \(Table [3](https://arxiv.org/html/2606.13740#S5.T3)\)\. The OnePlus Ace5 Pro with the Snapdragon 8 Elite \(SM8750\) SoC is the primary platform used for all end\-to\-end, ablation, and energy experiments\. The OnePlus 12 with the Snapdragon 8 Gen 3 \(SM8650\) and the OnePlus 15 with the Snapdragon 8 Elite Gen 5 \(SM8850\) are used in the cross\-device study to test portability across SoC generations\. They represent three successive SoC generations with progressively higher NPU compute capability\. All three phones run Android and execute the same llada\.cpp binary on the Hexagon NPU\. We enable the NPU FlashAttention \(Dao et al\., [2022](https://arxiv.org/html/2606.13740#bib.bib47)\) operator on SM8650 and SM8750, while SM8850 uses the CPU FlashAttention operator due to runtime support limitations\.

Table 3\. Smartphone configurations\.

Models\. LLaDA\-8B\-Instruct \(Nie et al\., [2026](https://arxiv.org/html/2606.13740#bib.bib12)\) is the primary workload, as it is a mainstream dLLM in recent studies\. We also include Llama\-3\-8B\-Instruct \(AI at Meta, [2024](https://arxiv.org/html/2606.13740#bib.bib46)\) as an autoregressive reference in the end\-to\-end generation, quality, and energy studies\. All models use Q4\_0 low\-bit weight quantization to enable LLaDA and Llama inference on smartphones\. In addition, we adapt Dream\-7B \(Ye et al\., [2025](https://arxiv.org/html/2606.13740#bib.bib44)\), another dLLM whose most notable architectural difference from LLaDA is its grouped\-query attention with fewer KV heads, which changes attention computation\. This cross\-model experiment tests whether the benefits of llada\.cpp extend beyond one dLLM architecture\.

Tasks\. We evaluate generation quality on 200\-sample subsets from each of four datasets \(Table [4](https://arxiv.org/html/2606.13740#S5.T4)\)\. These tasks are used both for model\-level comparison between LLaDA and Llama and for isolating the quality impact of llada\.cpp components\.

Table 4\. Evaluation datasets and task types\.

Baselines and settings\. We compare llada\.cpp with three baselines\. The *CPU baseline* follows vanilla dLLM decoding and repeatedly denoises the full sequence on the smartphone CPU\. The *CPU with prefix KV cache reuse* baseline keeps computation on the CPU but applies prefix KV cache reuse across completed blocks\. The *NPU baseline* keeps the vanilla denoising schedule but offloads dense Transformer computation, mainly large matrix multiplications, to the NPU\. Their denoising step count is set to the number of target generated tokens by default\. Unless otherwise stated, dual\-path progressive revision uses visibility and stability thresholds of 0\.7 and 0\.9, respectively\. For performance breakdown and sensitivity analysis, our implementation can enable or disable prefix KV cache reuse, NPU offloading, and each proposed component independently while keeping the other settings fixed\.

Metrics\. We report end\-to\-end latency, generation quality, average power, and energy per request\. Latency is measured as wall\-clock request time, including prompt processing and generation\. For each dataset, we measure five cases, repeat each run three times, and report exact generated\-token latency for 32, 64, or 128 target tokens\. Since the default dLLM block size is 32 tokens, this range covers both single\-block and multi\-block generation cases\. All generations run with temperature 0\. Generation quality follows the task\-specific scoring logic in lm\-evaluation\-harness \(Gao et al\., [2024](https://arxiv.org/html/2606.13740#bib.bib45)\)\. We map each model output to the corresponding task answer format and report accuracy\. Power is sampled from battery sysfs traces \(The Linux Kernel Developers, [2026b](https://arxiv.org/html/2606.13740#bib.bib36)\), and energy per request is computed by integrating idle\-subtracted power over the request window\.
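The energy metric can be written down as a short numerical integration. The sketch below assumes timestamped power samples in watts, trapezoidal integration, and clamping of idle-subtracted power at zero; the integration rule and the clamping are assumptions, since the paper only states that idle-subtracted power is integrated over the request window. The function names are hypothetical.

```python
def energy_per_request(samples, idle_power_w):
    """Integrate idle-subtracted power over the request window.

    samples: list of (time_s, power_w) pairs sorted by time.
    Returns energy in joules.
    """
    energy = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        a = max(p0 - idle_power_w, 0.0)
        b = max(p1 - idle_power_w, 0.0)
        energy += 0.5 * (a + b) * (t1 - t0)  # trapezoid between two samples
    return energy


def average_power(samples):
    """Time-weighted average of the raw power samples over the window, in watts."""
    duration = samples[-1][0] - samples[0][0]
    total = sum(0.5 * (p0 + p1) * (t1 - t0)
                for (t0, p0), (t1, p1) in zip(samples, samples[1:]))
    return total / duration
```

Under this definition a configuration that draws more power but finishes sooner can still consume less energy per request, which is why the two quantities are reported separately.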

### 5\.2\. End\-to\-End dLLM Generation

![Refer to caption](https://arxiv.org/html/2606.13740v1/x11.png)

Figure 11\. End\-to\-end generation latency on different smartphones\. LLaDA groups report 32\-, 64\-, and 128\-token outputs under CPU, CPU with prefix KV cache reuse, NPU, and full llada\.cpp execution; Llama is a 128\-token autoregressive reference\.

Figure [11](https://arxiv.org/html/2606.13740#S5.F11) compares the time needed to generate 32, 64, and 128 tokens\. We report latency across four datasets because dLLM decoding speed depends on token predictability, and different tasks can produce different confidence patterns and effective denoising steps\. This effect is evident even under the same setup: end\-to\-end latency differs noticeably across datasets\. As NPU compute capability increases from SM8650 to newer SoCs, the absolute latency of llada\.cpp generally decreases\. However, the relative speedup does not grow monotonically, because the baselines also benefit from newer platforms and the optimized execution starts to expose scheduling, token\-state management, and CPU\-side work\.

The comparison also shows that NPU offloading alone is insufficient\. The NPU baseline accelerates dense Transformer computation, but it still follows the vanilla denoising schedule and repeatedly pays for full\-sequence forward passes\. As a result, it does not separate clearly from the CPU baseline in some settings\. llada\.cpp reduces this gap by increasing useful token work in each NPU forward pass and avoiding unnecessary output decisions for stable tokens, making NPU\-backed dLLM generation practical on recent SoCs\. For 128\-token generation, llada\.cpp is still slower than Llama on SM8650, where NPU compute is more limited, but becomes faster than Llama on SM8750 and SM8850\. Llama itself shows a smaller and less consistent NPU benefit, because autoregressive decoding exposes much less token\-level parallelism to the NPU than block\-wise dLLM decoding\.

![Refer to caption](https://arxiv.org/html/2606.13740v1/x12.png)

Figure 12\. End\-to\-end generation latency on Dream\-7B with different generated\-token lengths, comparing CPU, CPU with prefix KV cache reuse, NPU, and llada\.cpp\.

We adapt Dream\-7B and measure its performance on the SM8750 device\. Figure [12](https://arxiv.org/html/2606.13740#S5.F12) shows acceleration trends similar to LLaDA\-8B, indicating that our method carries over to a second dLLM architecture\.

### 5\.3\. Quality Preservation

Table [5](https://arxiv.org/html/2606.13740#S5.T5) first compares LLaDA with the Llama autoregressive reference\. LLaDA remains competitive across the four tasks, which supports using it as a usable dLLM on smartphones\. The lower block then isolates the quality impact of llada\.cpp components\. Moving dense computation to the NPU does not introduce a systematic accuracy drop, while multi\-block speculative decoding alone can reduce accuracy because future\-block work is used before the surrounding context is fully stable\. Dual\-path progressive revision brings quality back close to the CPU path by keeping early visible tokens revisable and refreshing unstable prefix states\.

Table 5\. Generation quality on four 200\-sample datasets\. The upper block compares LLaDA and Llama under CPU and NPU\-backed settings, while the lower block reports the LLaDA component ablation with multi\-block speculative decoding \(MBSD\) and dual\-path progressive revision \(DPPR\)\.

Table 6\. 128\-token generation performance breakdown of llada\.cpp using GSM8K on SM8750, where the rows gradually add prefix KV cache reuse, swap\-optimized memory runtime \(SOMR\), multi\-block speculative decoding \(MBSD\), staged token stabilization \(STS\), selective logits skipping \(SLS\), and sparse cache refresh \(SCR\)\.

### 5\.4\. Performance Breakdown

Table [6](https://arxiv.org/html/2606.13740#S5.T6) reports the 128\-token latency breakdown on the SM8750 device using the GSM8K cases as the main latency experiment\. Prefix KV cache reuse reduces repeated prefix computation in both CPU and NPU baselines, while moving dense denoising to the NPU provides a larger system\-level reduction\. Swap\-optimized memory runtime further lowers latency by reducing recurrent NPU buffer remapping, but its isolated gain is smaller because dense denoising forward passes still dominate the NPU baseline\. The largest additional reductions come from the decoding components\. Multi\-block speculative decoding fills late\-stage forward passes with future\-block work, and staged token stabilization accelerates decoding by separating early visibility from long\-term stability\. Selective logits skipping removes redundant output decisions for stable tokens\. Its benefit varies by input, since early denoising steps may stabilize many tokens while later steps only decide a few tokens, allowing stable\-position logits to be skipped and reducing latency by more than 10% in some cases\. Sparse cache refresh adds the quality\-preserving CPU\-side refresh step with small latency overhead\.

![Refer to caption](https://arxiv.org/html/2606.13740v1/x13.png)

Figure 13\. Sensitivity analysis for llada\.cpp components\. \(a\) Latency as future\-block tokens are admitted at different remaining masked\-token thresholds in multi\-block speculative decoding \(MBSD\)\. \(b\) Tokens that remain below the stability threshold when the current block finishes, under different dual\-path progressive revision \(DPPR\) visibility and stability thresholds\. \(c\) Synchronous staging time under different NPU\-visible weight budgets in swap\-optimized memory runtime \(SOMR\)\.

### 5\.5\. Sensitivity Analysis

Figure [13](https://arxiv.org/html/2606.13740#S5.F13) focuses on the runtime\-level control parameters that directly correspond to llada\.cpp components\. Figure [13](https://arxiv.org/html/2606.13740#S5.F13)\(a\) shows that the threshold for admitting future\-block tokens has a clear operating point on the device\. Triggering at four remaining tokens gives the lowest latency, while triggering too early adds unnecessary future\-block work and triggering too late misses useful NPU work\. Figure [13](https://arxiv.org/html/2606.13740#S5.F13)\(b\) reports how many tokens still fail the stability threshold when the current block finishes under different visibility and stability thresholds\. This result guides the threshold choice in dual\-path progressive revision\. A lower visibility threshold exposes more tokens earlier and improves decoding progress, but it also leaves more visible tokens that may still need revision\. A higher stability threshold protects quality by keeping uncertain tokens out of the long\-term prefix KV cache, but it increases the amount of additional CPU\-side refresh work\. Figure [13](https://arxiv.org/html/2606.13740#S5.F13)\(c\) fixes the temporary staging budget at 256 MB and reduces the NPU\-visible weight budget\. The model remains runnable even when no weights are kept visible across steps, but synchronous staging time increases as more weights must be streamed through staging buffers\.

### 5\.6\. Memory Boundary and Long\-Output Runs

Memory boundary\. Repeated denoising revisits the same weights, prefix KV states, activations, and temporary buffers across many steps\. This makes NPU\-visible address\-space management part of the dLLM critical path rather than a one\-time initialization cost\. Table [7](https://arxiv.org/html/2606.13740#S5.T7) reports the measured boundary runs under the same SM8750 device configuration\. Without swap\-optimized memory runtime, the NPU baseline fails before producing tokens at microbatch sizes 256 and 512\. With graph\-guided buffer mapping and pipelined staging, llada\.cpp completes both boundary settings\.

Long\-output runs\. A longer generation makes the prefix KV cache grow and keeps more model states active across denoising steps, which increases pressure on the NPU\-visible address space\. Swap\-optimized memory runtime keeps these larger execution points runnable by combining graph\-guided buffer mapping with pipelined staging\. This allows llada\.cpp to support longer\-context dLLM generation on smartphones, improving its practical usability beyond short requests\.

Table 7\. Memory\-boundary runs on SM8750\. The table varies the maximum generated\-token budget and compares the full swap\-optimized memory runtime \(SOMR\) with a paging\-only setting that keeps replacement but disables graph\-guided buffer mapping and pipelined staging\.

### 5\.7\. Energy Behavior

We measure battery power on SM8750 using a fixed 128\-token workload\. Figure [14](https://arxiv.org/html/2606.13740#S5.F14) reports average power and idle\-subtracted energy\. For LLaDA, llada\.cpp reduces both power and per\-request energy by giving the NPU sufficient parallel work to finish quickly\. In contrast, NPU\-accelerated Llama shows higher average power than its CPU setting: autoregressive decoding repeatedly issues small NPU workloads, incurring accelerator scheduling overhead without enough token\-level parallelism to amortize it\. As a result, its limited latency reduction largely cancels the energy benefit\. This contrast shows that mobile NPU energy efficiency depends on exposing sufficient parallel token work, not merely on offloading an LLM to the NPU\.

![Refer to caption](https://arxiv.org/html/2606.13740v1/x14.png)

Figure 14\. Comparison of average power and energy consumption on SM8750 under fixed 128\-token workloads\.

## 6\. Related Works

On\-Device LLM Inference\. Recent work has explored quantization, heterogeneous execution, and mobile NPU specialization for LLM inference\. llm\.npu \(Xu et al\., [2025](https://arxiv.org/html/2606.13740#bib.bib24)\) maps autoregressive LLMs to NPUs using INT8 NPU kernels, CPU\-side outlier compensation, chunk\-sharing graphs, and out\-of\-order subgraph scheduling\. llama\.cpp\-npu \(Hao et al\., [2026](https://arxiv.org/html/2606.13740#bib.bib25)\) reverse\-engineers Hexagon HMX execution and builds a mobile\-NPU path for test\-time scaling with hardware\-aware tile quantization and LUT\-based vector kernels\. HeteroLLM \(Chen et al\., [2025a](https://arxiv.org/html/2606.13740#bib.bib15)\) characterizes mobile SoC execution and improves heterogeneous LLM inference through GPU\-NPU tensor partitioning, fast synchronization, and reusable memory pools\. These systems show that mobile acceleration depends on layout, scheduling, synchronization, and memory organization, rather than peak TOPS alone\. However, they primarily target autoregressive decoding or static dense inference graphs, where each decoding step exposes limited matrix parallelism on the NPU\. In contrast, llada\.cpp targets block\-wise dLLM inference, whose parallel denoising naturally forms dense NPU forward passes that translate into lower latency\.

dLLM Inference Optimization\. Prior work \(Austin et al\., [2021](https://arxiv.org/html/2606.13740#bib.bib11); Nie et al\., [2026](https://arxiv.org/html/2606.13740#bib.bib12); Khanna et al\., [2025](https://arxiv.org/html/2606.13740#bib.bib13)\) reduces the cost of iterative denoising over masked sequences\. Block\-wise decoding \(Arriola et al\., [2025](https://arxiv.org/html/2606.13740#bib.bib14)\) makes long generation practical by combining intra\-block parallel updates with left\-to\-right block progression\. Fast\-dLLM \(Wu et al\., [2025](https://arxiv.org/html/2606.13740#bib.bib27)\) enables KV cache reuse and confidence\-aware parallel decoding for dLLMs, while cache\-oriented methods such as dKV\-Cache \(Ma et al\., [2025](https://arxiv.org/html/2606.13740#bib.bib28)\), dLLM\-Cache \(Liu et al\., [2025](https://arxiv.org/html/2606.13740#bib.bib29)\), and FlashBlock \(Chen et al\., [2026](https://arxiv.org/html/2606.13740#bib.bib30)\) exploit step\-to\-step stability in KV states or attention outputs\. Other techniques improve decoding through remasking or token editing \(Wang et al\., [2025](https://arxiv.org/html/2606.13740#bib.bib31); Bie et al\., [2026](https://arxiv.org/html/2606.13740#bib.bib32)\), distillation and parallel decoding \(Qian et al\., [2026](https://arxiv.org/html/2606.13740#bib.bib33); Chen et al\., [2025b](https://arxiv.org/html/2606.13740#bib.bib34)\), or early skipping \(Zhu et al\., [2026](https://arxiv.org/html/2606.13740#bib.bib35)\)\. These methods primarily optimize decoding algorithms or reusable model states\. llada\.cpp focuses on algorithm\-system co\-design, aligning the parallel decoding nature of dLLMs with the tensor\-computation capabilities of mobile NPUs\.

## 7\. Conclusion

We presented llada\.cpp, the first NPU\-aware inference framework for accelerating dLLMs on smartphones\. We hope llada\.cpp paves the way for low\-latency, privacy\-preserving, and widely accessible AI capabilities on mobile devices\.

## References

- AI at Meta \(2024\)Llama 3 model card\.Note:[https://github\.com/meta\-llama/llama3/blob/main/MODEL\_CARD\.md](https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md)Accessed: 2026\-06\-10Cited by:[§5\.1](https://arxiv.org/html/2606.13740#S5.SS1.p2.1)\.
- M\. Arriola, A\. Gokaslan, J\. Chiu, Z\. Yang, Z\. Qi, J\. Han, S\. Sahoo, and V\. Kuleshov \(2025\)Block diffusion: interpolating between autoregressive and diffusion language models\.InInternational Conference on Learning Representations,Vol\.2025,pp\. 50726–50753\.Cited by:[§1](https://arxiv.org/html/2606.13740#S1.p2.1),[§6](https://arxiv.org/html/2606.13740#S6.p2.1)\.
- J. Austin, D. D. Johnson, J. Ho, D. Tarlow, and R. Van Den Berg (2021). Structured denoising diffusion models in discrete state-spaces. Advances in Neural Information Processing Systems 34, pp. 17981–17993. Cited by: [§1](https://arxiv.org/html/2606.13740#S1.p2.1), [§6](https://arxiv.org/html/2606.13740#S6.p2.1).
- T. Bie, M. Cao, X. Cao, B. Chen, et al. (2026). LLaDA2.1: speeding up text diffusion via token editing. CoRR abs/2602.08676. External Links: [Document](https://dx.doi.org/10.48550/arXiv.2602.08676). Cited by: [§6](https://arxiv.org/html/2606.13740#S6.p2.1).
- L. Chen, D. Feng, E. Feng, Y. Wang, R. Zhao, Y. Xia, P. Xu, and H. Chen (2025a). Characterizing mobile SoC for accelerating heterogeneous LLM inference. In Proceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles, pp. 359–374. External Links: [Document](https://dx.doi.org/10.1145/3731569.3764808). Cited by: [§2.2](https://arxiv.org/html/2606.13740#S2.SS2.p1.1), [§6](https://arxiv.org/html/2606.13740#S6.p1.1).
- Z. Chen, J. Cai, and B. Zhuang (2026). FlashBlock: attention caching for efficient long-context block diffusion. CoRR abs/2602.05305. External Links: [Document](https://dx.doi.org/10.48550/arXiv.2602.05305). Cited by: [§6](https://arxiv.org/html/2606.13740#S6.p2.1).
- Z. Chen, G. Fang, X. Ma, R. Yu, and X. Wang (2025b). DParallel: learnable parallel decoding for dllms. CoRR abs/2509.26488. External Links: [Document](https://dx.doi.org/10.48550/arXiv.2509.26488). Cited by: [§6](https://arxiv.org/html/2606.13740#S6.p2.1).
- X. Chu, L. Qiao, X. Lin, S. Xu, Y. Yang, Y. Hu, F. Wei, X. Zhang, B. Zhang, X. Wei, et al. (2023). Mobilevlm: a fast, strong and open vision language assistant for mobile devices. arXiv preprint arXiv:2312.16886. Cited by: [§1](https://arxiv.org/html/2606.13740#S1.p1.1).
- C. Clark, K. Lee, M. Chang, T. Kwiatkowski, M. Collins, and K. Toutanova (2019). BoolQ: exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2924–2936. Cited by: [Table 4](https://arxiv.org/html/2606.13740#S5.T4.1.3.2.1).
- P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord (2018). Think you have solved question answering? try ARC, the AI2 reasoning challenge. arXiv preprint arXiv:1803.05457. Cited by: [Table 4](https://arxiv.org/html/2606.13740#S5.T4.1.4.3.1).
- K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, C. Hesse, and J. Schulman (2021). Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168. Cited by: [Table 4](https://arxiv.org/html/2606.13740#S5.T4.1.2.1.1).
- T. Dao, D. Y. Fu, S. Ermon, A. Rudra, and C. Ré (2022). FlashAttention: fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems, Vol. 35, pp. 16344–16359. Cited by: [§5.1](https://arxiv.org/html/2606.13740#S5.SS1.p1.1).
- GadgetVersus (2026a). Qualcomm SM8650-AB Snapdragon 8 Gen 3 Processor Specifications. Note: [https://gadgetversus.com/processor/qualcomm-sm8650-ab-snapdragon-8-gen-3-specs/](https://gadgetversus.com/processor/qualcomm-sm8650-ab-snapdragon-8-gen-3-specs/). Accessed: 2026-06-10. Cited by: [Table 3](https://arxiv.org/html/2606.13740#S5.T3.1.2.1.4).
- GadgetVersus (2026b). Qualcomm SM8750-AB Snapdragon 8 Elite Processor Specifications. Note: [https://gadgetversus.com/processor/qualcomm-sm8750-ab-snapdragon-8-elite-specs/](https://gadgetversus.com/processor/qualcomm-sm8750-ab-snapdragon-8-elite-specs/). Accessed: 2026-06-10. Cited by: [Table 3](https://arxiv.org/html/2606.13740#S5.T3.1.3.2.4).
- GadgetVersus (2026c). Qualcomm SM8850-AC Snapdragon 8 Elite Gen 5 Processor Specifications. Note: [https://gadgetversus.com/processor/qualcomm-sm8850-ac-snapdragon-8-elite-gen-5-specs/](https://gadgetversus.com/processor/qualcomm-sm8850-ac-snapdragon-8-elite-gen-5-specs/). Accessed: 2026-06-10. Cited by: [Table 3](https://arxiv.org/html/2606.13740#S5.T3.1.4.3.4).
- L. Gao, J. Tow, B. Abbasi, S. Biderman, S. Black, A. DiPofi, C. Foster, L. Golding, J. Hsu, A. Le Noac’h, H. Li, K. McDonell, N. Muennighoff, C. Ociepa, J. Phang, L. Reynolds, H. Schoelkopf, A. Skowron, L. Sutawika, E. Tang, A. Thite, B. Wang, K. Wang, and A. Zou (2024). A framework for few-shot language model evaluation. Note: [https://github.com/EleutherAI/lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness). Accessed: 2026-06-10. Cited by: [§5.1](https://arxiv.org/html/2606.13740#S5.SS1.p5.1).
- ggml-org (2026). llama.cpp: llm inference in c/c++. Note: [https://github.com/ggml-org/llama.cpp](https://github.com/ggml-org/llama.cpp). Accessed: 2026-06-08. Cited by: [§4](https://arxiv.org/html/2606.13740#S4.p1.1).
- \[18\] (2026). GGML: tensor library for machine learning. Note: [https://github.com/ggml-org/ggml](https://github.com/ggml-org/ggml). Accessed: 2026-06-09. Cited by: [§4](https://arxiv.org/html/2606.13740#S4.p3.1).
- \[19\] (2026). GGUF: ggml universal file format. Note: [https://github.com/ggml-org/ggml/blob/master/docs/gguf.md](https://github.com/ggml-org/ggml/blob/master/docs/gguf.md). Accessed: 2026-06-09. Cited by: [§4](https://arxiv.org/html/2606.13740#S4.p2.1).
- Z. Hao, J. Wei, T. Wang, M. Huang, H. Jiang, S. Jiang, T. Cao, and J. Ren (2026). Scaling llm test-time compute with mobile npu on smartphones. In Proceedings of the 21st European Conference on Computer Systems, pp. 2157–2172. Note: arXiv:2509.23324. External Links: [Document](https://dx.doi.org/10.1145/3767295.3769382). Cited by: [§6](https://arxiv.org/html/2606.13740#S6.p1.1).
- S. Khanna, S. Kharbanda, S. Li, H. Varma, E. Wang, S. Birnbaum, Z. Luo, Y. Miraoui, A. Palrecha, S. Ermon, et al. (2025). Mercury: ultra-fast language models based on diffusion. arXiv e-prints, pp. arXiv–2506. Cited by: [§1](https://arxiv.org/html/2606.13740#S1.p2.1), [§6](https://arxiv.org/html/2606.13740#S6.p2.1).
- S. Lee, J. Choi, J. Lee, M. H. Wasi, H. Choi, S. Ko, S. Oh, and I. Shin (2024). MobileGPT: augmenting LLM with human-like app memory for mobile task automation. In Proceedings of the 30th Annual International Conference on Mobile Computing and Networking, pp. 1119–1133. External Links: [Document](https://dx.doi.org/10.1145/3636534.3690682). Cited by: [§1](https://arxiv.org/html/2606.13740#S1.p1.1).
- Z. Liu, C. Zhao, F. Iandola, C. Lai, Y. Tian, I. Fedorov, Y. Xiong, E. Chang, Y. Shi, R. Krishnamoorthi, L. Lai, and V. Chandra (2024). MobileLLM: optimizing sub-billion parameter language models for on-device use cases. In Proceedings of the 41st International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 235, pp. 32431–32454. Cited by: [§1](https://arxiv.org/html/2606.13740#S1.p1.1).
- Z. Liu, Y. Yang, Y. Zhang, J. Chen, C. Zou, Q. Wei, S. Wang, Y. Zhu, and L. Zhang (2025). DLLM-cache: accelerating diffusion large language models with adaptive caching. CoRR abs/2506.06295. External Links: [Document](https://dx.doi.org/10.48550/arXiv.2506.06295). Cited by: [§6](https://arxiv.org/html/2606.13740#S6.p2.1).
- X. Ma, R. Yu, G. Fang, and X. Wang (2025). DKV-cache: the cache for diffusion language models. CoRR abs/2505.15781. External Links: [Document](https://dx.doi.org/10.48550/arXiv.2505.15781). Cited by: [§6](https://arxiv.org/html/2606.13740#S6.p2.1).
- E. Mahurin (2023). Qualcomm® hexagon™ npu. In 2023 IEEE Hot Chips 35 Symposium (HCS), pp. 1–19. External Links: [Document](https://dx.doi.org/10.1109/HCS59251.2023.10254715). Cited by: [§1](https://arxiv.org/html/2606.13740#S1.p4.1).
- S. Nie, F. Zhu, Z. You, X. Zhang, J. Ou, J. Hu, J. Zhou, Y. Lin, J. Wen, and C. Li (2026). Large language diffusion models. Advances in Neural Information Processing Systems 38, pp. 50608–50646. Cited by: [§1](https://arxiv.org/html/2606.13740#S1.p2.1), [§5.1](https://arxiv.org/html/2606.13740#S5.SS1.p2.1), [§6](https://arxiv.org/html/2606.13740#S6.p2.1).
- Y. Qian, J. Su, L. Hu, P. Zhang, et al. (2026). D3LLM: ultra-fast diffusion llm using pseudo-trajectory distillation. CoRR abs/2601.07568. External Links: [Document](https://dx.doi.org/10.48550/arXiv.2601.07568). Cited by: [§6](https://arxiv.org/html/2606.13740#S6.p2.1).
- Qualcomm Technologies, Inc. (2023). Snapdragon X Elite Product Brief. Note: [https://www.qualcomm.com/content/dam/qcomm-martech/dm-assets/documents/Product-Brief-Snapdragon-X-Elite.pdf](https://www.qualcomm.com/content/dam/qcomm-martech/dm-assets/documents/Product-Brief-Snapdragon-X-Elite.pdf). Accessed: 2026-06-09. Cited by: [§1](https://arxiv.org/html/2606.13740#S1.p4.1).
- Qualcomm Technologies, Inc. (2026a). Bringing xnnpack to qualcomm hexagon npu: accelerating ml inference on hvx. Note: [https://www.qualcomm.com/developer/blog/2026/03/bringing-xnnpack-hexagon-npu](https://www.qualcomm.com/developer/blog/2026/03/bringing-xnnpack-hexagon-npu). Accessed: 2026-06-08. Cited by: [§4](https://arxiv.org/html/2606.13740#S4.p2.1).
- Qualcomm Technologies, Inc. (2026b). FastRPC: qualcomm’s userspace library for remote procedure calls between cpu and dsp. Note: [https://github.com/qualcomm/fastrpc](https://github.com/qualcomm/fastrpc). Accessed: 2026-06-08. Cited by: [§3.4](https://arxiv.org/html/2606.13740#S3.SS4.p1.1), [§4](https://arxiv.org/html/2606.13740#S4.p3.1).
- Qualcomm Technologies, Inc. (2026c). Hexagon NPU SDK. Note: [https://www.qualcomm.com/developer/software/hexagon-npu-sdk](https://www.qualcomm.com/developer/software/hexagon-npu-sdk). Accessed: 2026-06-08. Cited by: [§4](https://arxiv.org/html/2606.13740#S4.p2.1).
- Qualcomm Technologies, Inc. (2026d). Qualcomm Hexagon V81 HMX Programmer’s Reference Manual. Note: Document 80-N2040-62 Rev. AA. Cited by: [§4](https://arxiv.org/html/2606.13740#S4.p2.1).
- A. Rico, S. Pareek, J. Cabezas, D. Clarke, B. Ozgul, F. Barat, Y. Fu, S. Munz, D. Stuart, P. Schlangen, P. Duarte, S. Date, I. Paul, J. Weng, S. Santan, V. Kathail, A. Sirasao, and J. Noguera (2024). AMD xdna npu in ryzen ai processors. IEEE Micro 44(6), pp. 73–82. External Links: [Document](https://dx.doi.org/10.1109/MM.2024.3423692). Cited by: [§1](https://arxiv.org/html/2606.13740#S1.p4.1).
- The Linux Kernel Developers (2026a). Buffer sharing and synchronization (dma-buf). Note: [https://docs.kernel.org/driver-api/dma-buf.html](https://docs.kernel.org/driver-api/dma-buf.html). Accessed: 2026-06-08. Cited by: [§3.4](https://arxiv.org/html/2606.13740#S3.SS4.p1.1), [§4](https://arxiv.org/html/2606.13740#S4.p3.1).
- The Linux Kernel Developers (2026b). Linux Power Supply Class. The Linux Kernel Documentation. Note: Accessed: 2026-06-10. External Links: [Link](https://docs.kernel.org/power/power_supply_class.html). Cited by: [§5.1](https://arxiv.org/html/2606.13740#S5.SS1.p5.1).
- A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017). Attention is all you need. In Advances in Neural Information Processing Systems. Cited by: [§1](https://arxiv.org/html/2606.13740#S1.p1.1).
- B. Wang, G. Li, and Y. Li (2023). Enabling conversational interaction with mobile ui using large language models. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, pp. 1–17. External Links: [Document](https://dx.doi.org/10.1145/3544548.3580895). Cited by: [§1](https://arxiv.org/html/2606.13740#S1.p1.1).
- G. Wang, Y. Schiff, S. S. Sahoo, and V. Kuleshov (2025). Remasking discrete diffusion models with inference-time scaling. CoRR abs/2503.00307. External Links: [Document](https://dx.doi.org/10.48550/arXiv.2503.00307). Cited by: [§6](https://arxiv.org/html/2606.13740#S6.p2.1).
- H. Wen, Y. Li, G. Liu, S. Zhao, T. Yu, T. J. Li, S. Jiang, Y. Liu, Y. Zhang, and Y. Liu (2024). AutoDroid: LLM-powered task automation in android. In Proceedings of the 30th Annual International Conference on Mobile Computing and Networking, pp. 543–557. External Links: [Document](https://dx.doi.org/10.1145/3636534.3649379). Cited by: [§1](https://arxiv.org/html/2606.13740#S1.p1.1).
- C. Wu, H. Zhang, S. Xue, Z. Liu, S. Diao, L. Zhu, P. Luo, S. Han, and E. Xie (2025). Fast-dllm: training-free acceleration of diffusion llm by enabling kv cache and parallel decoding. CoRR abs/2505.22618. External Links: [Document](https://dx.doi.org/10.48550/arXiv.2505.22618). Cited by: [§6](https://arxiv.org/html/2606.13740#S6.p2.1).
- D. Xu, H. Zhang, L. Yang, R. Liu, G. Huang, M. Xu, and X. Liu (2025). Fast on-device llm inference with npus. In Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1, pp. 445–462. External Links: [Document](https://dx.doi.org/10.1145/3669940.3707239). Cited by: [§6](https://arxiv.org/html/2606.13740#S6.p1.1).
- J. Ye, Z. Xie, L. Zheng, J. Gao, Z. Wu, X. Jiang, Z. Li, and L. Kong (2025). Dream 7b: diffusion large language models. CoRR abs/2508.15487. External Links: [Document](https://dx.doi.org/10.48550/arXiv.2508.15487). Cited by: [§5.1](https://arxiv.org/html/2606.13740#S5.SS1.p2.1).
- R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, and Y. Choi (2019). HellaSwag: can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 4791–4800. Cited by: [Table 4](https://arxiv.org/html/2606.13740#S5.T4.1.5.4.1).
- C. Zhang, Z. Yang, J. Liu, Y. Li, Y. Han, X. Chen, Z. Huang, B. Fu, and G. Yu (2025). AppAgent: multimodal agents as smartphone users. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, pp. 1–20. External Links: [Document](https://dx.doi.org/10.1145/3706598.3713600). Cited by: [§1](https://arxiv.org/html/2606.13740#S1.p1.1).
- Z. Zhu, F. Ren, Z. Tan, and K. Ma (2026). ES-dllm: efficient inference for diffusion large language models by early-skipping. CoRR abs/2603.10088. External Links: [Document](https://dx.doi.org/10.48550/arXiv.2603.10088). Cited by: [§6](https://arxiv.org/html/2606.13740#S6.p2.1).
