MIRAGE: Mobile Agents with Implicit Reasoning and Generative World Models


Summary

MIRAGE is a framework for mobile GUI agents that replaces verbose chain-of-thought reasoning with compact continuous latent representations, incorporating a generative world model perspective to predict future screen states before acting. On AndroidWorld and AndroidControl benchmarks, it achieves competitive or superior performance while reducing generated tokens by over 75%.

arXiv:2606.04627v1 Announce Type: new Abstract: Mobile agents are increasingly expected to operate everyday applications from screenshots and language goals, where reliable control requires reasoning over screen affordances, multi-step navigation, and future state changes. However, many agents externalize this computation as long textual chains of thought, which slows interaction, increases supervision cost, and complicates deployment. We introduce MIRAGE, a framework that learns continuous latent reasoning representations from visible textual reasoning traces. MIRAGE transfers explicit reasoning into compact hidden states, enabling the agent to reason internally without decoding long rationales. It also incorporates a generative world-model objective: latent reasoning vectors are aligned with future screenshots, encouraging the agent to anticipate upcoming interface states before acting. This turns hidden computation into both a compressed thought representation and a forward-looking model of environment dynamics. At inference time, MIRAGE reasons in continuous latent space, reducing token generation while improving execution efficiency. On AndroidWorld, MIRAGE matches explicit chain-of-thought supervised fine-tuning in the 4B ablation with a 3-5x lower decoded-token budget and improves a comparable instruction-tuned baseline by 10.2 points; on AndroidControl, it improves action grounding while generating over 75% fewer tokens.

Cached at: 06/05/26, 02:08 AM

# MIRAGE: Mobile Agents with Implicit Reasoning and Generative World Models
Source: [https://arxiv.org/html/2606.04627](https://arxiv.org/html/2606.04627)
###### Abstract

Mobile agents are increasingly expected to operate everyday applications from screenshots and language goals, where reliable control requires reasoning over screen affordances, multi-step navigation, and future state changes. Yet many agents externalize this computation as long textual thoughts, making interaction slower, supervision more costly, and deployment less efficient. We introduce MIRAGE (Mobile agents with Implicit Reasoning And Generative world modEls), a framework that learns continuous latent reasoning representations from visible textual thoughts. MIRAGE introduces an efficient latent-space learning procedure that transfers explicit reasoning into compact hidden states, allowing the agent to reason internally without decoding long rationales. It further brings a world-model perspective into mobile-agent training: the model’s latent reasoning vectors are aligned with future screenshots, encouraging the agent to predict upcoming interface states in latent space before executing an action. This makes the hidden computation not only a compressed thought trace, but also a forward-looking representation of how the environment may change. At inference time, MIRAGE reasons in continuous latent space, reducing token generation while improving execution efficiency. On AndroidWorld, MIRAGE matches explicit-CoT SFT in the 4B ablation under a 3–5× lower decoded-token budget, and improves a comparable instruction-tuned baseline by 10.2 points; on AndroidControl, it improves action grounding with over 75% fewer generated tokens.

## 1 Introduction

As vision-language models improve, more mobile-agent systems now use them to execute mobile operations directly from screenshots and user instructions. Recent systems such as UI-TARS, MAI-UI, OS-ATLAS, and SeeClick train VLMs to understand GUI screens and produce taps, swipes, text entry, or navigation commands [[1](https://arxiv.org/html/2606.04627#bib.bib1),[2](https://arxiv.org/html/2606.04627#bib.bib2),[3](https://arxiv.org/html/2606.04627#bib.bib3),[4](https://arxiv.org/html/2606.04627#bib.bib4)]. Yet turning a screen-level VLM into a reliable mobile agent still hinges on navigation reasoning: the model must track task progress, decide which screen to visit next, and anticipate how each action will change the interface. Figure [1](https://arxiv.org/html/2606.04627#S1.F1) highlights this reasoning bottleneck. Current mobile agents often make this reasoning explicit through long thoughts or verbose action traces, which increases decoding time, context usage, and supervision cost. Their execution accuracy remains limited, suggesting that mobile agents need reasoning that preserves explicit-trace capability while being cheaper to run under realistic deployment constraints, where every extra token slows the interaction and delays feedback during interactive control.

![Refer to caption](https://arxiv.org/html/2606.04627v1/x1.png)

Figure 1: Inference-time comparison on a randomly sampled task and MIRAGE workflow: baselines emit verbose visible traces, while MIRAGE reasons with latent tokens and exposes concise actions.

To address these issues, we propose MIRAGE, whose central idea is that a mobile agent should learn to “think forward” inside the model. Before emitting an action, the agent forms a compact internal representation of what the current screen affords, why an action is appropriate, and how the interface is likely to change afterwards. This reasoning is carried out in latent space rather than decoded as visible text, so MIRAGE avoids emitting intermediate thoughts and reduces both output tokens and first-token-to-last-token latency. MIRAGE uses a two-stage training procedure. In the first stage, it trains on explicit text traces so the backbone learns the mobile action space and how to express observation, rationale, and future-screen prediction. In the second stage, MIRAGE replaces the textual reasoning block with continuous latent reasoning slots and continues training, gradually teaching the model to move reasoning into latent space. At inference time, only action tokens are decoded; no rationale text is emitted and the interaction latency is substantially reduced.

To transfer explicit reasoning capability into latent space, MIRAGE introduces Approximate Parallel Latent Refinement (APLR), which refines all latent slots in parallel through $K$ Jacobi-style rounds, approximating full sequential latent chain-of-thought: the first $K$ slots provably match the serial rollout while the remaining tail error is bounded. To further enable the model to internalize future-state prediction, MIRAGE attaches a lightweight Q-Former world-model head that aligns the output latent representations with stop-gradient visual features of the next screenshot from the frozen vision encoder. This alignment teaches the agent to anticipate GUI state transitions, prevents latent representation collapse, and supplies dense supervision that compensates for APLR’s bounded tail error. Crucially, the action cross-entropy loss and the next-frame feature-alignment loss are optimized jointly, so the same latent states become both action-discriminative and transition-predictive. This coupling lets MIRAGE retain CoT-level reasoning capacity while moving the intermediate computation out of the decoded text stream, yielding comparable explicit-reasoning capability at substantially lower inference cost.

MIRAGE yields strong results on established mobile-agent benchmarks. On AndroidWorld, MIRAGE improves task success by 10.2 percentage points over a comparable instruction-tuned baseline and matches explicit CoT in the 4B ablation under a 3–5× lower decoded-token cost. On AndroidControl, it improves action-grounding accuracy while reducing generated tokens by over 75%, demonstrating that latent reasoning and future-state alignment preserve explicit-reasoning capability with substantially leaner inference.

This paper makes three contributions:

- We present MIRAGE, a mobile agent that reasons entirely in latent space, producing far fewer output tokens and substantially lower inference latency while matching explicit-CoT task performance in the 4B ablation under a 3–5× lower decoded-token cost.
- We introduce APLR, a parallel Jacobi-style latent refinement procedure that approximates full serial latent chain-of-thought at a fraction of the training cost, with a provable bound on the approximation error of the tail slots.
- We introduce a Q-Former world-model head that aligns latent reasoning states with future-screenshot features in latent space, enabling the agent to predict upcoming GUI transitions and directly improving task capability.

## 2 Related Work

##### Mobile and GUI agents\.

GUI\-agent benchmarks have evolved from web interaction to dynamic Android control, covering grounded shopping, open\-web navigation, data\-scaling studies, and online task completion\[[5](https://arxiv.org/html/2606.04627#bib.bib5),[6](https://arxiv.org/html/2606.04627#bib.bib6),[7](https://arxiv.org/html/2606.04627#bib.bib7),[8](https://arxiv.org/html/2606.04627#bib.bib8)\]\. Recent systems specialize VLMs for screenshot grounding, GUI actions, and mobile\-device operation\[[4](https://arxiv.org/html/2606.04627#bib.bib4),[3](https://arxiv.org/html/2606.04627#bib.bib3),[1](https://arxiv.org/html/2606.04627#bib.bib1),[9](https://arxiv.org/html/2606.04627#bib.bib9)\]\. Instead of extending grounding or planning pipelines, MIRAGE trains internal agent states by replacing visible reasoning traces with latent slots predictive of the next GUI state during action decoding and transition modeling\.

##### Reasoning in language and vision\-language agents\.

Visible CoT and ReAct\-style traces improve reasoning and acting but expose long rationales and consume context\[[10](https://arxiv.org/html/2606.04627#bib.bib10),[11](https://arxiv.org/html/2606.04627#bib.bib11)\]\. Other work internalizes computation through pause tokens, private thoughts, distillation, or continuous latent CoT\[[12](https://arxiv.org/html/2606.04627#bib.bib12),[13](https://arxiv.org/html/2606.04627#bib.bib13),[14](https://arxiv.org/html/2606.04627#bib.bib14),[15](https://arxiv.org/html/2606.04627#bib.bib15)\]\. MIRAGE brings implicit reasoning to mobile GUI control, where latent thoughts support action selection and transition understanding, using APLR to approximate serial latent refinement in a strictly causal triangular system rather than an equilibrium model\.

##### World models and visual feature prediction\.

World models learn predictive dynamics for control, from compact latent simulators to latent\-imagination agents\[[16](https://arxiv.org/html/2606.04627#bib.bib16),[17](https://arxiv.org/html/2606.04627#bib.bib17)\]\. Joint\-embedding vision objectives show that feature prediction can learn semantics without pixel generation\[[18](https://arxiv.org/html/2606.04627#bib.bib18),[19](https://arxiv.org/html/2606.04627#bib.bib19)\], while modal\-alignment work highlights bottlenecks between visual and language representations\[[20](https://arxiv.org/html/2606.04627#bib.bib20)\]\. Q\-Former queries offer a lightweight cross\-attention bottleneck\[[21](https://arxiv.org/html/2606.04627#bib.bib21)\]\. Recent GUI world models predict future screens, sketches, or semantic states\[[22](https://arxiv.org/html/2606.04627#bib.bib22),[23](https://arxiv.org/html/2606.04627#bib.bib23)\]\. MIRAGE instead uses future prediction to shape latent reasoning states, encouraging action\-induced transition representations without generating pixels or future text at inference time\.

## 3 Method

### 3.1 Problem Formulation: From Explicit-Thought to Latent-Thought Mobile Agents

At interaction step $t$, a mobile GUI agent observes the current screenshot $x_t$, a user instruction $u$, and an interaction history $h_t = (a_{<t}, x_{<t})$. It outputs an action $a_t$, such as tapping a coordinate, typing text, scrolling, or navigating back. We serialize actions as text tokens so that a VLM can model mobile control as conditional generation:

$$o_t = (x_t, u, h_t), \qquad p_\theta(a_t \mid o_t). \tag{1}$$

Mobile GUI agents face a critical trade-off between reasoning capability and deployment efficiency. Explicit chain-of-thought improves multi-step decision making because the model can verbalize observations, rationales, and predicted state transitions before acting; however, this reasoning is paid for with visible tokens. Many explicit-thought mobile agents first generate a readable reasoning trace $\tau_t$ and then generate the action. This factorization can be written as

$$p_\theta(\tau_t, a_t \mid o_t) = \prod_{i=1}^{|\tau_t|} p_\theta\!\left(\tau_t^{(i)} \mid \tau_t^{(<i)}, o_t\right) \cdot \prod_{j=1}^{|a_t|} p_\theta\!\left(a_t^{(j)} \mid \tau_t, a_t^{(<j)}, o_t\right). \tag{2}$$

The intermediate thought $\tau_t$ is useful, but it is also a sequence of discrete text tokens: it consumes context budget, increases decoding latency, and usually requires human-written or synthetic rationale supervision. This makes explicit CoT attractive as a training scaffold but costly as an inference-time interface for mobile deployment.

To resolve this tension, MIRAGE formulates mobile control as a latent-thought generation problem. Instead of removing reasoning, we internalize it: the model keeps a structured computation budget, but the intermediate computation is carried by hidden states rather than text. We introduce $N$ continuous latent variables $\mathbf{z}_t = (z_{t,1}, \ldots, z_{t,N})$, $z_{t,i} \in \mathbb{R}^d$, in the decoder context:

$$p_\theta(a_t \mid o_t) = \int p_\theta(a_t \mid \mathbf{z}_t, o_t)\, q_\theta(\mathbf{z}_t \mid o_t)\, d\mathbf{z}_t. \tag{3}$$

Unlike explicit CoT, $\mathbf{z}_t$ does not live in the vocabulary and is never detokenized. Its inference distribution $q_\theta$ is computed implicitly by decoder hidden states, so latent-thought training does not need rationale-text supervision. This formulation gives four practical advantages: it preserves context by removing explicit rationale tokens, speeds up inference by shortening the decoded sequence, retains multi-step reasoning capacity through latent slots, and exposes the hidden reasoning state to auxiliary regularization such as next-frame prediction. In this sense, latent thoughts bridge the gap between CoT-style reasoning strength and efficient mobile-agent deployment, motivating the Approximate Parallel Latent Refinement (APLR) procedure and the world-model regularization developed next.

![Refer to caption](https://arxiv.org/html/2606.04627v1/x2.png)

Figure 2: MIRAGE pipeline. Stage 1 learns explicit mobile thoughts and action formatting. Stage 2 replaces the thought text with latent slots, refines them with APLR, and trains a Q-Former world model to align latent states with next-frame visual features.

### 3.2 Training Latent Chain-of-Thought via Approximate Parallel Latent Refinement

#### 3.2.1 Structured Thought Data and Latent Insertion

Our training data uses a structured thought format with three mobile\-agent reasoning dimensions:

> `<THOUGHT> [observation] [rationale] [predict] <ACTION_DESC> ... <ACTION> ...`

The `observation` field describes the visible screen state, `rationale` explains why a particular operation should be taken, and `predict` describes the expected next-screen transition. Explicit-thought warmup uses this structured text as ordinary next-token supervision. Latent-thought training then replaces the entire `<THOUGHT>` block with $N$ latent tokens. The token sequence becomes

$$[\mathrm{ctx}]\ ;\ [\mathrm{start}]\ ;\ \underbrace{\langle\mathrm{lat}\rangle, \ldots, \langle\mathrm{lat}\rangle}_{N\ \mathrm{slots}}\ ;\ [\mathrm{end}]\ ;\ \langle\mathrm{ACTION\_DESC}\rangle \cdots \langle\mathrm{ACTION}\rangle. \tag{4}$$

The latent slots occupy positions $p_1 < \cdots < p_N$ and share the learned initializer $e_{\mathrm{lat}}$.
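The latent insertion step can be pictured at the string level. The following is a minimal sketch, not the authors' data pipeline: the placeholder strings `<lat>`, `[start]`, `[end]`, the helper name `insert_latent_slots`, and the `tap(120, 340)` action syntax are all assumptions made for illustration, since the text does not specify the concrete token strings.

```python
import re

# Hypothetical placeholder strings; the paper does not give the literal tokens.
LATENT_TOKEN = "<lat>"
START, END = "[start]", "[end]"

# Match the whole thought block, stopping just before <ACTION_DESC>.
THOUGHT_BLOCK = re.compile(r"<THOUGHT>.*?(?=<ACTION_DESC>)", flags=re.DOTALL)


def insert_latent_slots(sample: str, num_slots: int) -> str:
    """Replace the <THOUGHT> block of one training sample with latent slots."""
    slots = " ".join([LATENT_TOKEN] * num_slots)
    replacement = f"{START} {slots} {END} "
    # A callable replacement avoids any escape processing of the template.
    new_sample, count = THOUGHT_BLOCK.subn(lambda m: replacement, sample, count=1)
    if count != 1:
        raise ValueError("sample has no <THOUGHT> block followed by <ACTION_DESC>")
    return new_sample


# Illustrative sample in the structured thought format (content is made up).
example = (
    "<THOUGHT> [observation] Settings is open. [rationale] Wi-Fi is listed. "
    "[predict] The Wi-Fi page opens. <ACTION_DESC> Tap Wi-Fi <ACTION> tap(120, 340)"
)
converted = insert_latent_slots(example, num_slots=9)
print(converted)
```

In an actual model the slots would be special token ids whose embeddings all start from the shared initializer; the string form here only shows which part of the sequence is swapped out and which part (action description and action) is kept as supervision.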

#### 3.2.2 From Serial Masking to Approximate Parallel Latent Refinement

Original Coconut-style latent reasoning [[15](https://arxiv.org/html/2606.04627#bib.bib15)] replaces visible reasoning steps with continuous hidden states and computes those states serially; Appendix [D](https://arxiv.org/html/2606.04627#A4) gives the full formalization. Our APLR keeps the same latent-thinking motivation but changes the execution pattern: rather than advancing through thought steps one by one, it treats the thought block as $N$ slots and refines all slots in parallel rounds.

With APLR, latent reasoning becomes a small number of synchronous refinement passes instead of a long serial rollout. Conceptually, this lets the model keep an internal multi-step computation budget while making training cost depend on the refinement depth $K$, not directly on the latent length $N$.

With fixed non-latent context $c = o_t$ and latent positions $p_1 < \cdots < p_N$, the serial target states are

$$s_i = G_i(s_1, \ldots, s_{i-1}; c), \qquad i = 1, \ldots, N, \tag{5}$$

where $G_i$ is the hidden state at position $p_i - 1$ when the previous latent states have already been written into the sequence. This is a forward-substitution procedure over a causal triangular system. It is accurate, but expensive: fully refining all $N$ latent tokens requires one full decoder pass per latent update, plus a final pass for action logits.

We use APLR to eliminate the need for $N$ serial forward passes in traditional latent CoT training. The key observation is that causal attention makes the latent system strictly triangular: $z_i$ cannot depend on itself or on any future $z_{j>i}$. Therefore, instead of filling the slots one by one, APLR performs Jacobi-style rounds:

$$z_i^{(k+1)} = G_i\!\left(z_1^{(k)}, z_2^{(k)}, \ldots, z_{i-1}^{(k)}; c\right), \qquad i = 1, \ldots, N, \quad k = 0, \ldots, K-1. \tag{6}$$

All right-hand sides use old round-$k$ values, so one full forward pass updates all slots. In practice, we use a small refinement budget, $K = 3$ by default.

APLR has a precise relationship to the serial refinement in ([5](https://arxiv.org/html/2606.04627#S3.E5)). After $K$ parallel rounds, the first $K$ latent slots exactly match the serial solution, $z_i^{(K)} = s_i$ for $i \le K$. Compared with refining all latent tokens one by one, the unrefined tail slots retain a structured residual

$$\delta^{(K)} \approx A^K \delta^{(0)}, \tag{7}$$

where $A$ is a strictly lower-triangular Jacobian. Appendix [F](https://arxiv.org/html/2606.04627#A6) proves that after $K$ APLR rounds, the first $K$ latent slots exactly recover the serial solution, and then derives the tail-error form in ([7](https://arxiv.org/html/2606.04627#S3.E7)). In implementation, early bootstrap passes may run without gradient, while the first pass can retain gradient through $e_{\mathrm{lat}}$ for stability. Before the final gradient-enabled pass, we rebuild the input embeddings with a mask: latent positions use detached bootstrap values, while non-latent positions retain their original gradient lineage to the vision tower and token embeddings.
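The relationship between serial forward substitution and Jacobi-style rounds can be checked on a toy system. The sketch below is an illustration under strong assumptions, not the MIRAGE decoder: each slot is a scalar, and the map `G` is a hypothetical stand-in `tanh(A z + c)` with a strictly lower-triangular `A`, so that slot `i` only sees earlier slots. The names `serial_rollout` and `jacobi_rounds` are invented for this example.

```python
import numpy as np


def G(z, W, c):
    """Toy stand-in for the decoder map: output i depends only on z[:i]."""
    # np.tril(W, k=-1) zeroes the diagonal and upper triangle (strict causality).
    return np.tanh(np.tril(W, k=-1) @ z + c)


def serial_rollout(W, c):
    """Forward substitution: fill one slot per pass, in order."""
    s = np.zeros(len(c))
    for i in range(len(c)):
        s[i] = G(s, W, c)[i]  # only the already-final s[:i] influence entry i
    return s


def jacobi_rounds(W, c, z0, num_rounds):
    """Parallel refinement: every slot is updated from the previous round."""
    z = z0.copy()
    for _ in range(num_rounds):
        z = G(z, W, c)
    return z


rng = np.random.default_rng(0)
N, K = 9, 3                       # slot count and round count used for the 4B setting
W = 0.5 * rng.normal(size=(N, N))
c = rng.normal(size=N)
z0 = np.full(N, 0.1)              # shared initial value, standing in for e_lat

s = serial_rollout(W, c)
z_K = jacobi_rounds(W, c, z0, K)
z_N = jacobi_rounds(W, c, z0, N)

# Linear part of the toy map; it shares the sparsity pattern of the Jacobian.
A = np.tril(W, k=-1)
A_pow_K = np.linalg.matrix_power(A, K)

print("first K slots match serial:", np.allclose(z_K[:K], s[:K]))
print("max tail error after K rounds:", np.abs(z_K - s)[K:].max())
print("all slots match after N rounds:", np.allclose(z_N, s))
```

Two facts are visible here. Slot 1 depends on nothing, so it is exact after one round, slot 2 after two, and so on, which is why the first $K$ slots agree with the serial rollout. And because a strictly lower-triangular matrix is nilpotent, the first $K$ rows of $A^K$ are zero and $A^N = 0$, which matches the picture of a residual that survives only in the tail slots.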

The final APLR pass is trained with standard next-token CE over serialized action tokens:

$$\mathcal{L}_{\mathrm{ce}} = -\sum_{j=1}^{|a_t|} \log p_\theta\!\left(a_t^{(j)} \mid \mathbf{z}_t^{(K)}, a_t^{(<j)}, o_t\right). \tag{8}$$

When $K \ll N$, the tail latents $z_{K+1:N}^{(K)}$ retain approximation error relative to a trainer that refines every latent token serially. In our structured thought format, the `predict` dimension concerns how the screen will change after the action and typically lies late in the thought sequence. Action CE only penalizes tail-error directions that affect the next action tokens; it does not directly supervise errors that matter for environment prediction. This motivates the world-model objective.

### 3.3 World Model: Q-Former Next-Frame Alignment

The `predict` field describes future screen semantics after a mobile action. Unlike GUI world models that generate pixels or sketches [[22](https://arxiv.org/html/2606.04627#bib.bib22)], our Q-Former head directly regularizes latent reasoning states to predict future GUI features, forcing latent slots to encode environment dynamics rather than only current-screen action correlations. Tail-error directions that harm future-screen prediction therefore receive direct gradients; Appendix [G](https://arxiv.org/html/2606.04627#A7) formalizes this with a local second-order argument.

Because the training records are sequential, most non-terminal step-$t$ examples provide the next screenshot $x_{t+1}$. We use it as feature supervision, not as a pixel-generation target: $x_{t+1}$ is passed through the VLM’s own frozen vision tower, yielding a stable next-frame feature target that does not drift with the world-model head.

We gather the final-pass hidden states at latent positions to obtain latent CoT hidden states

$$C_t = \mathrm{Gather}\!\left(H_t^{(K)}, p_1, \ldots, p_N\right) \in \mathbb{R}^{L \times H}, \tag{9}$$

where each row is one latent thought vector after APLR, so the row count $L$ equals the number of latent slots $N$. The next screenshot is first converted by the VLM image processor into visual patches. After the VLM’s spatial merge operation, each target patch has both a feature vector and a 2D grid coordinate. We write the target features as

$$V_{t+1} = \mathrm{sg}\!\left(f_{\mathrm{vis}}(x_{t+1})\right) = [v_1, \ldots, v_M] \in \mathbb{R}^{M \times H_v}, \tag{10}$$

where $\mathrm{sg}(\cdot)$ denotes stop-gradient. The stop-gradient prevents collapse through the target branch, no image decoder is required, and the target lives on the same visual manifold as the backbone features.

For each next-frame patch $j$ with grid coordinate $(r_j, c_j)$, we build a learnable spatial query

$$q_j = e^{\mathrm{row}}_{r_j} + e^{\mathrm{col}}_{c_j} \in \mathbb{R}^{d_q}. \tag{11}$$

This says, in simple terms, “predict the feature at row $r_j$ and column $c_j$ of the next screen.” The separable row/column design gives a stable 2D positional prior even when mobile screenshots have different resolutions and therefore different patch grids.
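A separable row/column query table can be sketched in a few lines. This is an illustrative NumPy version under assumptions: the table sizes (`MAX_ROWS`, `MAX_COLS`, `D_Q`), the random initialization, and the helper name `spatial_queries` are hypothetical, and in training the two tables would be learned parameters rather than fixed random arrays.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes; the paper does not state the table dimensions.
MAX_ROWS, MAX_COLS, D_Q = 64, 64, 16
row_table = rng.normal(size=(MAX_ROWS, D_Q))  # stands in for e^row
col_table = rng.normal(size=(MAX_COLS, D_Q))  # stands in for e^col


def spatial_queries(grid_h: int, grid_w: int) -> np.ndarray:
    """Build one query per patch as row embedding plus column embedding."""
    rows, cols = np.meshgrid(np.arange(grid_h), np.arange(grid_w), indexing="ij")
    r, c = rows.ravel(), cols.ravel()        # row-major patch order
    return row_table[r] + col_table[c]       # shape (grid_h * grid_w, D_Q)


# Two screenshots with different patch grids share the same tables.
q_small = spatial_queries(4, 3)
q_large = spatial_queries(6, 5)
print(q_small.shape, q_large.shape)
```

Because the query for a coordinate depends only on its row and column index, patch (1, 2) receives the same query in both grids even though its flat index differs (5 in the 4×3 grid, 7 in the 6×5 grid). That is one way to read the claim that the design stays stable across resolutions.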

The BLIP-2-style Q-Former aligner [[21](https://arxiv.org/html/2606.04627#bib.bib21)] self-attends over queries, cross-attends to $C_t$, and projects each query output into the VLM feature space:

$$U_t = \mathrm{QFormer}_\phi(Q_t, C_t), \qquad \hat{V}_{t+1} = W_{\mathrm{vis}} U_t, \tag{12}$$

where $Q_t = [q_1, \ldots, q_M]$ stacks the spatial queries. The output $\hat{V}_{t+1} = [\hat{v}_1, \ldots, \hat{v}_M]$ is therefore a predicted next-frame representation, one vector per next-frame patch. The default objective is masked per-patch cosine distance:

$$\mathcal{L}_{\mathrm{wm}} = \frac{1}{|\mathcal{M}|} \sum_{(b,j) \in \mathcal{M}} \left(1 - \frac{\langle \hat{v}_{b,j}, v_{b,j} \rangle}{\|\hat{v}_{b,j}\|_2 \, \|v_{b,j}\|_2}\right). \tag{13}$$

The mask $\mathcal{M}$ keeps only valid patches from non-terminal latent samples; cosine distance is the default, with MSE used for ablations.
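The masked per-patch cosine distance is simple to write out. The NumPy sketch below is a forward-only illustration; the function name `masked_cosine_loss`, the `eps` guard against zero norms, and the choice to return zero for an empty mask are assumptions added for numerical safety, not details from the paper.

```python
import numpy as np


def masked_cosine_loss(v_hat, v, mask, eps=1e-8):
    """Mean of (1 - cosine similarity) over the patches selected by mask.

    v_hat, v: arrays of shape (B, M, H_v), predicted and target patch features.
    mask:     boolean array of shape (B, M), True for valid patches.
    """
    dot = np.sum(v_hat * v, axis=-1)
    norms = np.linalg.norm(v_hat, axis=-1) * np.linalg.norm(v, axis=-1)
    cos = dot / np.maximum(norms, eps)        # eps is an added safeguard
    per_patch = 1.0 - cos
    n_valid = mask.sum()
    if n_valid == 0:                          # e.g. a batch of terminal steps
        return 0.0
    return float(np.sum(per_patch * mask) / n_valid)


# One sample, two patches: the first prediction is parallel to its target,
# the second is orthogonal to its target.
target = np.array([[[1.0, 0.0], [0.0, 2.0]]])
pred = np.array([[[2.0, 0.0], [3.0, 0.0]]])

loss_all = masked_cosine_loss(pred, target, np.array([[True, True]]))
loss_first = masked_cosine_loss(pred, target, np.array([[True, False]]))
print(loss_all, loss_first)
```

The loss depends only on direction, not magnitude, so the parallel patch contributes 0 regardless of scale and the orthogonal patch contributes 1. Masking out the second patch removes its contribution entirely.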

### 3.4 Two-Stage Training Pipeline

We use a curriculum learning strategy to make latent reasoning learnable rather than asking the model to discover a continuous thought space from scratch. Stage 1 warms up the model on explicit structured thoughts to learn action formatting and observation–rationale–prediction reasoning patterns. Specifically, the full `<THOUGHT>` trace is exposed as text supervision and the VLM is trained with standard next-token CE over the structured thought, action description, and action tokens. Stage 2 distills this explicit reasoning process into compact latent slots: the `<THOUGHT>` block is replaced by the latent sequence in ([4](https://arxiv.org/html/2606.04627#S3.E4)), APLR refines the latent states ($K = 3$ by default), the collator loads next screenshots, and the world model regularizes the latent slots through Q-Former next-frame feature alignment. The joint objective is

$$\mathcal{L} = \lambda \mathcal{L}_{\mathrm{ce}} + (1 - \lambda) \mathcal{L}_{\mathrm{wm}}, \qquad \lambda \in (0, 1). \tag{14}$$

We use $\lambda = 0.8$ by default. Figure [2](https://arxiv.org/html/2606.04627#S3.F2) summarizes the pipeline, and Appendix [E](https://arxiv.org/html/2606.04627#A5) gives pseudocode. At inference time, the agent only performs latent substitution and greedy action decoding; the Q-Former head is used only for training-time representation shaping [[24](https://arxiv.org/html/2606.04627#bib.bib24)].
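As a worked illustration of Eq. (14), take purely hypothetical loss values $\mathcal{L}_{\mathrm{ce}} = 1.20$ and $\mathcal{L}_{\mathrm{wm}} = 0.40$ (chosen for arithmetic only, not taken from the paper). The default $\lambda = 0.8$ gives

$$\mathcal{L} = 0.8 \times 1.20 + 0.2 \times 0.40 = 0.96 + 0.08 = 1.04,$$

so the action term carries four times the weight of the world-model term. Under the same hypothetical values, $\lambda = 0.1$ would give

$$\mathcal{L} = 0.1 \times 1.20 + 0.9 \times 0.40 = 0.12 + 0.36 = 0.48,$$

where the world-model term is weighted nine times more heavily than action cross-entropy.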

## 4 Experiments

### 4.1 Setup

##### Backbones\.

We fine-tune two vision-language models from the Qwen3-VL family [[25](https://arxiv.org/html/2606.04627#bib.bib25)]: Qwen3-VL-4B-Instruct (4B parameters) and Qwen3-VL-8B-Instruct (8B parameters). Both backbones share the same action-serialization vocabulary, latent-slot token format, and Q-Former world-model head. Their latent computation budgets differ: the main Qwen3-VL-4B setting uses 9 latent slots with 3 APLR refinement passes, while the Qwen3-VL-8B setting uses 6 latent slots with 3 refinement passes.

##### Evaluation benchmarks\.

We evaluate on two standard mobile-agent benchmarks. AndroidControl [[7](https://arxiv.org/html/2606.04627#bib.bib7)] provides paired high-level and low-level instructions with ground-truth action sequences, allowing separate measurement of instruction-following exact match (EM) and action accuracy. AndroidWorld [[8](https://arxiv.org/html/2606.04627#bib.bib8)] is a dynamic, on-device benchmark spanning 116 real-world task instances across 20 Android apps, measuring end-to-end task completion rate under live Android dynamics.

##### Baselines\.

We compare against size-matched Qwen3-VL-4B/8B-Instruct backbones [[25](https://arxiv.org/html/2606.04627#bib.bib25)], a general multimodal baseline GPT-4o [[26](https://arxiv.org/html/2606.04627#bib.bib26)], reinforcement-tuned GUI agents GUI-R1/UI-R1 [[27](https://arxiv.org/html/2606.04627#bib.bib27),[28](https://arxiv.org/html/2606.04627#bib.bib28)], and recent GUI-agent systems including ShowUI, MAI-UI, UI-Venus-Navi, UI-TARS-7B-SFT, and Ferret-UI Lite across AndroidControl and AndroidWorld [[29](https://arxiv.org/html/2606.04627#bib.bib29),[2](https://arxiv.org/html/2606.04627#bib.bib2),[30](https://arxiv.org/html/2606.04627#bib.bib30),[1](https://arxiv.org/html/2606.04627#bib.bib1),[31](https://arxiv.org/html/2606.04627#bib.bib31)].

### 4.2 Main Results

Table 1: AndroidControl results. EM = exact match, Action Acc. = action accuracy, and Tokens = average generated tokens per step. External rows use reported Type/EM metrics; green parentheses show MIRAGE gains over size-matched Qwen3-VL-Instruct baselines. Best primary metrics and lowest Tokens are in bold.

Table 2: AndroidWorld results. SR = task success rate; Avg. Steps/Tokens are per task. Green parentheses show MIRAGE gains over size-matched Qwen3-VL-Instruct baselines; red marks higher step count. Best SR and lowest averages are in bold.

##### AndroidControl.

Table [1](https://arxiv.org/html/2606.04627#S4.T1) shows that MIRAGE improves the size-matched Qwen3-VL-Instruct baselines while producing much shorter action outputs. On the low-level split, MIRAGE-4B improves EM from 68.48 to 77.59 and action accuracy from 75.15 to 91.09, while reducing the average generation length from 115.67 to 18.92 tokens. MIRAGE-8B shows the same pattern, increasing low-level EM from 77.66 to 83.75 and action accuracy from 82.54 to 94.62 with 18.01 tokens per step. On the high-level split, MIRAGE-4B improves EM/action accuracy by 9.85/6.28 percentage points over Qwen3-VL-4B-Instruct, and MIRAGE-8B improves them by 11.57/8.17 points over Qwen3-VL-8B-Instruct. These gains suggest that replacing verbose visible thoughts with latent reasoning can improve grounding and action selection without increasing the decoded token budget.

##### AndroidWorld\.

Table [2](https://arxiv.org/html/2606.04627#S4.T2) evaluates the same models in a dynamic on-device setting. MIRAGE raises AndroidWorld SR from 42.9 to 52.6 for 4B and from 47.6 to 57.8 for 8B, while reducing average tokens from 103.0/108.0 to 31.0/27.0. The 8B model uses slightly more steps than its Qwen3-VL baseline, so the gain does not simply come from shorter trajectories. Among retained specialized GUI agents, MIRAGE-8B gives the highest AndroidWorld SR with far fewer generated tokens. Together, the two benchmarks show that latent reasoning and world-model training improve task effectiveness while keeping outputs compact.
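The headline efficiency claims can be checked against the token counts quoted in the AndroidControl and AndroidWorld paragraphs above. The short script below only redoes that arithmetic; the helper names `reduction` and `ratio` are made up for this example.

```python
def reduction(before: float, after: float) -> float:
    """Fraction of tokens removed when going from `before` to `after`."""
    return 1.0 - after / before


def ratio(before: float, after: float) -> float:
    """How many times fewer tokens are generated."""
    return before / after


# AndroidControl low-level split, 4B: 115.67 -> 18.92 tokens per step.
control_4b_reduction = reduction(115.67, 18.92)

# AndroidWorld average tokens: 103.0 -> 31.0 (4B) and 108.0 -> 27.0 (8B).
world_4b_ratio = ratio(103.0, 31.0)
world_8b_ratio = ratio(108.0, 27.0)

print(f"AndroidControl 4B reduction: {control_4b_reduction:.1%}")
print(f"AndroidWorld ratios: {world_4b_ratio:.2f}x (4B), {world_8b_ratio:.2f}x (8B)")
```

The AndroidControl reduction comes out at roughly 84%, in line with the "over 75%" statement, and the AndroidWorld ratios of about 3.3× and 4.0× fall inside the stated 3–5× range.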

Figure [3](https://arxiv.org/html/2606.04627#S4.F3) further analyzes efficiency and robustness. The left panel measures first-to-last-token latency; MIRAGE-4B is fastest, consistent with the token reductions in Tables [1](https://arxiv.org/html/2606.04627#S4.T1) and [2](https://arxiv.org/html/2606.04627#S4.T2). The right panel breaks AndroidControl low-level results into IDD, app-unseen, category-unseen, and task-unseen subsplits. After correcting each model to its reported all-split score, MIRAGE remains strong across subsplits, especially on action accuracy, suggesting that Table [1](https://arxiv.org/html/2606.04627#S4.T1) is not driven by a single easy split.

![Refer to caption](https://arxiv.org/html/2606.04627v1/x3.png)

![Refer to caption](https://arxiv.org/html/2606.04627v1/x4.png)

Figure 3: Left: Average latency from the first generated token to the final generated token. MIRAGE-4B produces the shortest decoded sequence latency among the compared models. Right: AndroidControl low-level subsplit EM and action accuracy, corrected by subtracting each model’s offset between the raw low-level subsplit average and the reported low-level all-split score.
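The subsplit correction described in the caption amounts to a constant shift per model. The sketch below shows one reading of it, assuming the "raw subsplit average" is an unweighted mean over the four subsplits; that assumption, the function name `correct_subsplits`, and the raw scores are all hypothetical, and only the 77.59 all-split value is taken from the text (MIRAGE-4B low-level EM).

```python
import numpy as np


def correct_subsplits(raw_scores, reported_all_split):
    """Shift subsplit scores so their mean equals the reported all-split score."""
    raw = np.asarray(raw_scores, dtype=float)
    offset = raw.mean() - reported_all_split  # assumed: unweighted mean
    return raw - offset


# Hypothetical raw scores for IDD, app-unseen, category-unseen, task-unseen.
raw = [80.0, 76.0, 74.0, 78.0]
corrected = correct_subsplits(raw, reported_all_split=77.59)
print(corrected, corrected.mean())
```

A constant shift preserves the differences between subsplits for a given model, so the relative ordering of subsplits within a model is unchanged by the correction.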

### 4.3 Ablation Study

We ablate the three core components of MIRAGE—latent CoT slots, APLR parallel refinement, and the Q\-Former world\-model head—using Qwen3\-VL\-4B\-Instruct on AndroidWorld\.

##### Ablation analysis\.

![Refer to caption](https://arxiv.org/html/2606.04627v1/x5.png)

Figure 4: Cross-entropy training loss for Qwen3-VL-4B variants matched to Table [3](https://arxiv.org/html/2606.04627#S4.T3).

Figure [4](https://arxiv.org/html/2606.04627#S4.F4) reports cross-entropy loss for the same Qwen3-VL-4B training variants summarized in Table [3](https://arxiv.org/html/2606.04627#S4.T3). For the latent settings, we allocate 9 latent slots; in the serial latent-CoT reference, the observation, rationale, and prediction fields each correspond to three latent slots and are masked through a stepwise curriculum. The sharp change after the first epoch is expected rather than a failure mode: training switches from explicit-thought warmup to latent-CoT training at that point, and the learning rate is reset with the new stage. After this transition, APLR without the world-model objective exhibits a higher loss than the serial latent-CoT reference, consistent with the fact that parallel approximation leaves tail latent states less directly supervised by action CE alone. Adding the Q-Former world-model objective makes the APLR curve track the serial latent-CoT trend more closely, suggesting that next-frame feature alignment supplies useful gradients for the transition-predictive parts of the latent state.

Table 3: Component ablation on AndroidWorld (Qwen3-VL-4B, SR %).

Table [3](https://arxiv.org/html/2606.04627#S4.T3) shows the same pattern in final AndroidWorld SR. Action-only SFT performs worse than the base model, indicating that removing thought supervision without replacing it by internal computation can harm interactive decision making. Explicit CoT SFT and full MIRAGE-4B both reach 52.6 SR, showing that latent reasoning can match explicit CoT while avoiding decoded rationale tokens. Serial latent CoT preserves much of this benefit (50.9 SR), and APLR without the world-model objective reaches 48.2; adding the Q-Former world-model objective restores the explicit-CoT-level result while keeping reasoning latent at inference time.

We study the sensitivity of MIRAGE to the number of latent slots $L$, APLR refinement passes $K$, and the loss balance $\lambda_{\mathrm{ce}}$ on AndroidWorld. Table [4](https://arxiv.org/html/2606.04627#S4.T4) shows that the latent computation budget matters.

Table 4: AndroidWorld ablations over latent-slot budget, APLR refinement passes, and loss balance. SR is task success rate (%). Unless otherwise noted, $\lambda_{\mathrm{ce}} = 0.8$.

For MIRAGE-8B, increasing APLR refinement from two to three passes improves AndroidWorld SR from 46.6 to 57.8, suggesting that the third pass substantially reduces harmful tail-latent error. For MIRAGE-4B, reducing the latent budget from 9 to 3 slots lowers SR from 52.6 to 32.8, showing that the mobile-agent thought state needs enough continuous capacity to represent observation, rationale, and future-screen prediction. Finally, setting $\lambda_{\mathrm{ce}} = 0.1$ decreases SR from 52.6 to 48.3, consistent with the intended role of the world model as an auxiliary regularizer rather than the dominant training objective.

### 4.4 Latent Reasoning Visualization

We analyze a MIRAGE-8B checkpoint with six latent slots on the AndroidControl IDD split; Appendix [C](https://arxiv.org/html/2606.04627#A3) provides full t-SNE and per-slot action projections. Figure [5](https://arxiv.org/html/2606.04627#S4.F5) shows that latent training does not collapse to a single undifferentiated representation. Before slot-centering, slots 1–2, 3–4, and 5–6 occupy three separated regions, corresponding to the observation, rationale, and prediction dimensions of the structured thought. After subtracting the slot mean, examples remain organized by action type: open-app, swipe, type, and tap states form distinguishable regions in the centered latent space. These visualizations indicate that MIRAGE learns both position-specific reasoning subspaces and action-discriminative latent states.

![Refer to caption](https://arxiv.org/html/2606.04627v1/x6.png)

![Refer to caption](https://arxiv.org/html/2606.04627v1/x7.png)

Figure 5: Left: UMAP by latent slot group. Right: slot-centered UMAP by action type after subtracting per-slot means.
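The slot-centering step used before the right-hand projection can be illustrated with a minimal sketch. It assumes latents are stored as nested lists indexed by example, slot, and hidden dimension, and that centering subtracts the mean vector of each slot position over all examples; the function name `slot_center` and the toy values are hypothetical and are not taken from the MIRAGE code.

```python
from statistics import fmean


def slot_center(latents):
    """Subtract the per-slot mean vector from every example.

    latents[n][s] is the hidden vector (list of floats) of slot s in example n.
    """
    num_slots = len(latents[0])
    dim = len(latents[0][0])
    # Mean vector of each slot position, averaged over all examples.
    means = [
        [fmean(ex[s][d] for ex in latents) for d in range(dim)]
        for s in range(num_slots)
    ]
    return [
        [[ex[s][d] - means[s][d] for d in range(dim)] for s in range(num_slots)]
        for ex in latents
    ]


# Toy data (hypothetical): 3 examples, 2 slots, 2 dims.
# Slot 1 carries a large position-specific offset of +10.
latents = [
    [[0.0, 1.0], [10.0, 11.0]],
    [[1.0, 0.0], [11.0, 10.0]],
    [[2.0, 2.0], [12.0, 12.0]],
]
centered = slot_center(latents)
```

In this toy input the two slots of each example differ only by the offset, so after centering they coincide. This mirrors the idea that removing the dominant slot-identity component exposes whatever structure remains, such as action type.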

## 5 Limitations, Broader Impact, and Conclusion

MIRAGE replaces visible rationales with APLR\-refined latent slots regularized by next\-frame features, preserving explicit\-CoT capability at much lower decoding cost\. Limitations include supervised\-only training, feature\-level world modeling, next\-frame supervision, and the need for privacy and action safeguards before deployment\.

## References

- Qin et al. [2025] Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, et al. Ui-tars: Pioneering automated gui interaction with native agents. *arXiv preprint arXiv:2501.12326*, 2025.
- Zhou et al. [2025] Hanzhang Zhou, Xu Zhang, Panrong Tong, Jianan Zhang, Liangyu Chen, Quyu Kong, Chenglin Cai, Chen Liu, Yue Wang, Jingren Zhou, et al. Mai-ui technical report: Real-world centric foundation gui agents. *arXiv preprint arXiv:2512.22047*, 2025.
- Wu et al. [2024] Zhiyong Wu, Zhenyu Wu, Fangzhi Xu, Yian Wang, Qiushi Sun, Chengyou Jia, Kanzhi Cheng, Zichen Ding, Liheng Chen, Paul Pu Liang, et al. Os-atlas: A foundation action model for generalist gui agents. *arXiv preprint arXiv:2410.23218*, 2024.
- Cheng et al. [2024] Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Li YanTao, Jianbing Zhang, and Zhiyong Wu. Seeclick: Harnessing gui grounding for advanced visual gui agents. In *Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)*, pages 9313–9332, 2024.
- Yao et al. [2022a] Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. Webshop: Towards scalable real-world web interaction with grounded language agents. *Advances in Neural Information Processing Systems*, 35:20744–20757, 2022a.
- Deng et al. [2023a] Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Sam Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2web: Towards a generalist agent for the web. *Advances in Neural Information Processing Systems*, 36:28091–28114, 2023a.
- Li et al. [2024] Wei Li, William Bishop, Alice Li, Chris Rawles, Folawiyo Campbell-Ajala, Divya Tyamagundlu, and Oriana Riva. On the effects of data scale on ui control agents. *Advances in Neural Information Processing Systems*, 37:92130–92154, 2024.
- Rawles et al. [2024] Christopher Rawles, Sarah Clinckemaillie, Yifan Chang, Jonathan Waltz, Gabrielle Lau, Marybeth Fair, Alice Li, William Bishop, Wei Li, Folawiyo Campbell-Ajala, et al. Androidworld: A dynamic benchmarking environment for autonomous agents. *arXiv preprint arXiv:2405.14573*, 2024.
- Wang et al. [2024] Junyang Wang, Haiyang Xu, Jiabo Ye, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, and Jitao Sang. Mobile-agent: Autonomous multi-modal mobile device agent with visual perception. *arXiv preprint arXiv:2401.16158*, 2024.
- Wei et al. [2022] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. *Advances in neural information processing systems*, 35:24824–24837, 2022.
- Yao et al. [2022b] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. *arXiv preprint arXiv:2210.03629*, 2022b.
- Goyal et al. [2023] Sachin Goyal, Ziwei Ji, Ankit Singh Rawat, Aditya Krishna Menon, Sanjiv Kumar, and Vaishnavh Nagarajan. Think before you speak: Training language models with pause tokens. *arXiv preprint arXiv:2310.02226*, 2023.
- Zelikman et al. [2024] Eric Zelikman, Georges Harik, Yijia Shao, Varuna Jayasiri, Nick Haber, and Noah D Goodman. Quiet-star: Language models can teach themselves to think before speaking. *arXiv preprint arXiv:2403.09629*, 2024.
- Deng et al. [2023b] Yuntian Deng, Kiran Prasad, Roland Fernandez, Paul Smolensky, Vishrav Chaudhary, and Stuart Shieber. Implicit chain of thought reasoning via knowledge distillation. *arXiv preprint arXiv:2311.01460*, 2023b.
- Hao et al. [2024] Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, Xian Li, Zhiting Hu, Jason Weston, and Yuandong Tian. Training large language models to reason in a continuous latent space. *arXiv preprint arXiv:2412.06769*, 2024.
- Ha and Schmidhuber [2018] David Ha and Jürgen Schmidhuber. World models. *arXiv preprint arXiv:1803.10122*, 2(3):440, 2018.
- Hafner et al. [2019] Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination. *arXiv preprint arXiv:1912.01603*, 2019.
- Assran et al. [2023] Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, and Nicolas Ballas. Self-supervised learning from images with a joint-embedding predictive architecture. In *Proceedings of the IEEE/CVF conference on computer vision and pattern recognition*, pages 15619–15629, 2023.
- Bardes et al. [2024] Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mahmoud Assran, and Nicolas Ballas. Revisiting feature prediction for learning visual representations from video. *arXiv preprint arXiv:2404.08471*, 2024.
- Hu et al. [2025] Yuanze Hu, Zhaoxin Fan, Xinyu Wang, Gen Li, Ye Qiu, Zhichao Yang, Wenjun Wu, Kejian Wu, Yifan Sun, Xiaotie Deng, et al. Tinyalign: Boosting lightweight vision-language models by mitigating modal alignment bottlenecks. *arXiv preprint arXiv:2505.12884*, 2025.
- Li et al. [2023] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In *International conference on machine learning*, pages 19730–19742. PMLR, 2023.
- Luo et al. [2025a] Dezhao Luo, Bohan Tang, Kang Li, Georgios Papoudakis, Jifei Song, Shaogang Gong, Jianye Hao, Jun Wang, and Kun Shao. Vimo: A generative visual gui world model for app agents. *arXiv preprint arXiv:2504.13936*, 2025a.
- Cao et al. [2026] Yilin Cao, Yufeng Zhong, Zhixiong Zeng, Liming Zheng, Jing Huang, Haibo Qiu, Peng Shi, Wenji Mao, and Wan Guanglu. Mobiledreamer: Generative sketch world model for gui agent. *arXiv preprint arXiv:2601.04035*, 2026.
- Yuan et al. [2026] Tianyuan Yuan, Zibin Dong, Yicheng Liu, and Hang Zhao. Fast-wam: Do world action models need test-time future imagination? *arXiv preprint arXiv:2603.16666*, 2026.
- Bai et al. [2025] Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report. *arXiv preprint arXiv:2511.21631*, 2025.
- Hurst et al. [2024] Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. *arXiv preprint arXiv:2410.21276*, 2024.
- Luo et al. [2025b] Run Luo, Lu Wang, Wanwei He, Longze Chen, Jiaming Li, and Xiaobo Xia. Gui-r1: A generalist r1-style vision-language action model for gui agents. *arXiv preprint arXiv:2504.10458*, 2025b.
- Lu et al. [2026] Zhengxi Lu, Yuxiang Chai, Yaxuan Guo, Xi Yin, Liang Liu, Hao Wang, Han Xiao, Shuai Ren, Pengxiang Zhao, Guangyi Liu, et al. Ui-r1: Enhancing efficient action prediction of gui agents by reinforcement learning. In *Proceedings of the AAAI Conference on Artificial Intelligence*, volume 40, pages 17608–17616, 2026.
- Lin et al. [2025] Kevin Qinghong Lin, Linjie Li, Difei Gao, Zhengyuan Yang, Shiwei Wu, Zechen Bai, Stan Weixian Lei, Lijuan Wang, and Mike Zheng Shou. Showui: One vision-language-action model for gui visual agent. In *Proceedings of the Computer Vision and Pattern Recognition Conference*, pages 19498–19508, 2025.
- Gu et al. [2025] Zhangxuan Gu, Zhengwen Zeng, Zhenyu Xu, Xingran Zhou, Shuheng Shen, Yunfei Liu, Beitong Zhou, Changhua Meng, Tianyu Xia, Weizhi Chen, et al. Ui-venus technical report: Building high-performance ui agents with rft. *arXiv preprint arXiv:2508.10833*, 2025.
- Yang et al. [2025] Zhen Yang, Zi-Yi Dou, Di Feng, Forrest Huang, Anh Nguyen, Keen You, Omar Attia, Yuhao Yang, Michael Feng, Haotian Zhang, et al. Ferret-ui lite: Lessons from building small on-device gui agents. *arXiv preprint arXiv:2509.26539*, 2025.
- Chai et al. [2025] Yuxiang Chai, Siyuan Huang, Yazhe Niu, Han Xiao, Liang Liu, Guozhi Wang, Dingyu Zhang, Shuai Ren, and Hongsheng Li. Amex: Android multi-annotation expo dataset for mobile gui agents. In *Findings of the Association for Computational Linguistics: ACL 2025*, pages 2138–2156, 2025.

## Appendix A Additional Method Details

##### Q\-Former target construction\.

The world-model target is built online during training. For each sample with a valid next frame, the collator loads $x_{t+1}$ and the base VLM vision stack extracts post-spatial-merge patch features. The Q-Former aligner never receives these target features as input; it receives only latent CoT hidden states and row/column query embeddings.

##### No target leakage\.

The world\-model predictor receives only the final latent hidden states from the current observation and history\. Next\-frame pixels are used only on the detached target side of the feature\-alignment loss, and missing next frames are masked out\.

## Appendix B Training Data and Schedule

##### Training data\.

We train on two sources of mobile-interaction data. First, we sample a subset of AMEX [[32](https://arxiv.org/html/2606.04627#bib.bib32)], the Android Multi-annotation Expo dataset, which provides 104K high-resolution screenshots annotated with GUI-element groundings and step-wise action chains across diverse Android applications. Second, we collect self-explored trajectories on AndroidWorld [[8](https://arxiv.org/html/2606.04627#bib.bib8)] by running an exploration policy on device, producing task rollouts that extend the distribution of observed app states beyond the static AMEX corpus.

##### Training procedure\.

Training proceeds in two stages. In Stage 1, we fine-tune the backbone for 1 epoch on explicit chain-of-thought demonstrations, initializing the model with standard text-based reasoning traces and action formatting. In Stage 2, we switch to the latent-CoT objective and train for 3 epochs. The main configuration compresses reasoning into $L=6$ latent slots and refines them over $K=3$ APLR passes; component ablations additionally evaluate the 4B model with $L=9$ latent slots. Unless otherwise specified, the cross-entropy loss weight is $\lambda_{\mathrm{ce}} = 0.8$.
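The schedule above can be written down as a small configuration sketch. The class and field names (`StageConfig`, `latent_slots`, and so on) are hypothetical and do not come from the MIRAGE codebase; only the epoch counts, $L=6$, $K=3$, and $\lambda_{\mathrm{ce}}=0.8$ are taken from the text. Setting the Stage 1 cross-entropy weight to 1.0 is an assumption, based on Stage 1 using next-token CE only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageConfig:
    name: str
    epochs: int
    objective: str
    latent_slots: int = 0    # L; 0 means reasoning is still explicit text
    aplr_passes: int = 0     # K; number of parallel refinement passes
    lambda_ce: float = 1.0   # weight on action cross-entropy (assumed 1.0 in Stage 1)


SCHEDULE = (
    StageConfig(name="explicit_cot_warmup", epochs=1, objective="explicit_cot"),
    StageConfig(
        name="latent_cot",
        epochs=3,
        objective="latent_cot_with_world_model",
        latent_slots=6,
        aplr_passes=3,
        lambda_ce=0.8,
    ),
)

total_epochs = sum(stage.epochs for stage in SCHEDULE)
```

The 4B component ablations described above would correspond to changing `latent_slots` to 9 in the second stage.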

## Appendix C Complete Latent Visualization Diagnostics

This appendix reports the full latent-slot diagnostic views used to support Section [4.4](https://arxiv.org/html/2606.04627#S4.SS4). The analysis is run on a MIRAGE-8B checkpoint evaluated on the AndroidControl IDD split. The diagnostic dump contains 1,012 examples and 3,036 latent vectors after grouping the six latent slots into three adjacent pairs: slots 1–2, slots 3–4, and slots 5–6. This grouping matches the main-paper visualization, where the paired slots form three separated positional clusters.

![Refer to caption](https://arxiv.org/html/2606.04627v1/figures/appendix/slot_identity_projection.png)

Figure 6: Latent slot identity in existing projections. The left panel colors a t-SNE projection by latent slot group, while the right panel colors a UMAP projection by the same group. Both projections show that adjacent slot pairs form three well-separated regions. This indicates that the latent representation contains a strong slot-position component: early, middle, and late latent slots occupy distinct subspaces before any slot-centering operation is applied.

Figure [6](https://arxiv.org/html/2606.04627#A3.F6) supports the interpretation that the latent sequence is not a bag of exchangeable hidden states. The three slot groups form separate clusters, especially in UMAP, which means the model uses different parts of the continuous latent sequence differently. In the MIRAGE design, this is desirable: the first slot group can specialize toward observation-level evidence, the middle group toward action rationale, and the final group toward transition prediction. The visualization does not prove this semantic assignment by itself, but it shows that the model has learned a stable positional organization on which such specialization can be built.

![Refer to caption](https://arxiv.org/html/2606.04627v1/figures/appendix/per_slot_action_projections.png)

Figure 7: Per-slot action projections. Each subplot projects one slot group and colors points by the ground-truth action type. These plots ask whether action semantics are visible inside each slot group without first removing the slot mean.

Figure [7](https://arxiv.org/html/2606.04627#A3.F7) shows that action information is present but partially entangled with slot identity. High-frequency tap examples spread broadly, as expected for generic GUI interaction, while more specialized actions such as open-app, swipe, and type appear in more localized regions. The per-slot views are useful because they reveal whether one slot group alone carries the action signal. In our diagnostic run, action separation is visible in each group but remains imperfect, suggesting that action semantics are distributed across the latent sequence rather than isolated in a single slot.

![Refer to caption](https://arxiv.org/html/2606.04627v1/figures/appendix/slot_centered_action_projection.png)

Figure 8: Action semantics after removing the slot mean. The left panel shows slot-centered t-SNE by action type, and the right panel shows slot-centered UMAP by action type. Slot centering subtracts the average representation of each slot group before projection, reducing the dominant slot-identity component.

Figure [8](https://arxiv.org/html/2606.04627#A3.F8) complements Figure [6](https://arxiv.org/html/2606.04627#A3.F6). Once the slot-specific mean is removed, action structure becomes clearer. Open-app examples form a compact region, swipe examples occupy a separate region, type examples gather in another area, and tap remains a broad central manifold. This behavior is consistent with the main claim of Section 4.4: MIRAGE latent states encode both where they are in the latent reasoning sequence and what action semantics they support. The slot-position component is strong, but it does not erase action-relevant information; after centering, the residual representation still organizes examples by action type.

## Appendix D Formal Derivation of Original Coconut

This section formalizes the original Coconut-style latent reasoning procedure [[15](https://arxiv.org/html/2606.04627#bib.bib15)] that motivates our APLR approximation. Coconut starts from supervised examples with a context $c$ and a visible chain of reasoning split into $M$ ordered chunks $\rho_{1:M}$, followed by a final answer or action $a$. For mobile agents, $c$ corresponds to the current screenshot, instruction, and interaction history, while $a$ is the serialized GUI action.

##### Explicit chain\-of\-thought objective\.

The ordinary explicit\-CoT likelihood factorizes as

$$p_{\theta}(\rho_{1:M}, a \mid c) = \prod_{i=1}^{M} p_{\theta}(\rho_{i} \mid c, \rho_{<i}) \cdot p_{\theta}(a \mid c, \rho_{1:M}), \tag{15}$$

where each factor over a chunk denotes the product of token probabilities inside that chunk. Stage 0 of Coconut is equivalent to minimizing the standard next-token loss on the full sequence $(c, \rho_{1:M}, a)$.

##### Continuous\-thought substitution\.

Coconut introduces boundary tokens $\langle\mathrm{bot}\rangle$ and $\langle\mathrm{eot}\rangle$ for a latent reasoning span. Let $H_{\theta}(S)$ be the final-layer hidden states produced by the decoder for a mixed sequence $S$ containing both token embeddings and continuous vectors. Let $F_{\theta}(S)$ denote the hidden state at the last position of $S$, optionally followed by a projection $\Pi$ into the input-embedding space:

$$F_{\theta}(S) = \Pi\, H_{\theta}(S)_{|S|}. \tag{16}$$

At curriculum stage $m$, Coconut replaces the first $m$ visible reasoning chunks by $m$ continuous thoughts $z_{1:m}$. These thoughts are generated recursively:

$$S_{0} = [c; \langle\mathrm{bot}\rangle], \tag{17}$$

$$z_{i} = F_{\theta}([S_{0}; z_{1}; \ldots; z_{i-1}]), \qquad i = 1, \ldots, m. \tag{18}$$

The resulting training sequence is

$$\tilde{S}^{(m)} = [c; \langle\mathrm{bot}\rangle; z_{1}; \ldots; z_{m}; \langle\mathrm{eot}\rangle; \rho_{m+1}; \ldots; \rho_{M}; a]. \tag{19}$$

Continuous positions are not vocabulary targets. The supervised likelihood at stage $m$ is therefore

$$p_{\theta}^{(m)}(\rho_{m+1:M}, a \mid c) = \prod_{i=m+1}^{M} p_{\theta}(\rho_{i} \mid c, z_{1:m}, \rho_{m+1:i-1}) \cdot p_{\theta}(a \mid c, z_{1:m}, \rho_{m+1:M}), \tag{20}$$

with loss

$$\mathcal{L}_{\mathrm{Coconut}}^{(m)} = -\sum_{\ell \in \mathcal{I}^{(m)}} \log p_{\theta}\big(\tilde{S}^{(m)}_{\ell} \mid \tilde{S}^{(m)}_{<\ell}\big), \tag{21}$$

where $\mathcal{I}^{(m)}$ indexes only discrete supervised tokens after the latent prefix. Increasing $m$ gradually moves supervision from explicit rationales to latent thoughts: $m=0$ recovers standard CoT training, while $m=M$ leaves only the final answer or action as visible supervision after the latent span.
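The stage-$m$ training sequence of Eq. (19) and the supervised index set $\mathcal{I}^{(m)}$ of Eq. (21) can be sketched as follows. This is an illustration under stated assumptions rather than the Coconut or MIRAGE implementation: tokens are plain strings, continuous thoughts are represented by placeholder entries such as `<z1>`, and the `<eot>` boundary token is treated as unsupervised. The helper name `build_stage_sequence` is hypothetical.

```python
BOT, EOT = "<bot>", "<eot>"


def build_stage_sequence(context, chunks, action, m):
    """Return (sequence, supervised_mask) for curriculum stage m.

    context: list of context tokens c.
    chunks:  list of M reasoning chunks, each a list of tokens.
    action:  list of action tokens a.
    The first m chunks are replaced by m latent placeholders; only the
    discrete tokens after the latent span are marked as supervised.
    """
    if not 0 <= m <= len(chunks):
        raise ValueError("m must lie in [0, M]")
    latents = [f"<z{i}>" for i in range(1, m + 1)]
    prefix = list(context) + [BOT] + latents + [EOT]
    # Remaining visible chunks rho_{m+1:M}, flattened, followed by the action.
    suffix = [tok for chunk in chunks[m:] for tok in chunk] + list(action)
    sequence = prefix + suffix
    mask = [False] * len(prefix) + [True] * len(suffix)
    return sequence, mask


context = ["screenshot", "goal"]
chunks = [["obs"], ["why", "tap"], ["next"]]
action = ["TAP(3)"]
seq, mask = build_stage_sequence(context, chunks, action, m=2)
```

With `m=len(chunks)`, the suffix contains only the action tokens, which matches the statement that $m=M$ leaves the final action as the only visible supervision.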

##### Serial triangular computation\.

Equation ([18](https://arxiv.org/html/2606.04627#A4.E18)) is inherently serial because $z_{i}$ must be computed and inserted before $z_{i+1}$ can attend to it. Abstracting away the visible suffix, define a decoder-induced map $G_{i}$ for the $i$-th latent update under fixed context $c$:

$$z_{i} = G_{i}(z_{1}, \ldots, z_{i-1}; c), \qquad i = 1, \ldots, m. \tag{22}$$

Causal attention makes this system strictly triangular: $G_{i}$ cannot depend on $z_{i}$ or any future latent $z_{j}$ with $j > i$. Original Coconut solves Eq. ([22](https://arxiv.org/html/2606.04627#A4.E22)) by forward substitution, i.e., it computes $z_{1}$, writes it into the sequence, then computes $z_{2}$, and so on. This exact serial execution is the reference solution that APLR approximates with parallel Jacobi-style refinement.

## Appendix E Appendix Pipeline Pseudocode

Algorithm [1](https://arxiv.org/html/2606.04627#alg1) summarizes the full training pipeline.

Algorithm 1: MIRAGE two-stage training pipeline

1. Require: explicit records $(o_{t}, \tau_{t}, d_{t}, a_{t})$ and sequential records with optional next frame $x_{t+1}$
2. Ensure: a latent-reasoning GUI policy; discard the Q-Former head at inference time
3. Stage 1: explicit-thought warmup
4. Serialize $[o_{t}; \langle\mathrm{THOUGHT}\rangle; \tau_{t}; \langle\mathrm{ACTION\_DESC}\rangle; d_{t}; \langle\mathrm{ACTION}\rangle; a_{t}]$
5. Optimize next-token CE over the structured thought, action description, and action tokens
6. Stage 2: latent CoT and world-model training
7. Replace the \<THOUGHT\> block with $N$ learned latent slots as in Eq. ([4](https://arxiv.org/html/2606.04627#S3.E4))
8. for $r = 0$ to $K-1$ do
9. Run one causal VLM pass and update all latent slots synchronously using Eq. ([6](https://arxiv.org/html/2606.04627#S3.E6))
10. end for
11. Decode action tokens and compute $\mathcal{L}_{\mathrm{ce}}$ by Eq. ([8](https://arxiv.org/html/2606.04627#S3.E8))
12. if a valid next frame $x_{t+1}$ exists then
13. Extract detached targets $V_{t+1} = \mathrm{sg}(f_{\mathrm{vis}}(x_{t+1}))$
14. Predict $\hat{V}_{t+1}$ from latent states with the Q-Former and compute $\mathcal{L}_{\mathrm{wm}}$
15. else
16. Mask out the world-model term for this sample
17. end if
18. Update parameters with $\mathcal{L} = \lambda \mathcal{L}_{\mathrm{ce}} + (1-\lambda)\mathcal{L}_{\mathrm{wm}}$
19. return latent-reasoning GUI policy
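Steps 11 to 18 of the algorithm combine the action cross-entropy with a world-model term that is masked out when no next frame exists. The sketch below shows one way to express that combination on precomputed per-sample losses. It is an illustration only: the function name `combined_loss` is hypothetical, and averaging the world-model term over valid samples (and using zero when none are valid) is an assumption about how the masking is normalized.

```python
def combined_loss(ce_losses, wm_losses, lam=0.8):
    """Batch loss lam * L_ce + (1 - lam) * L_wm.

    ce_losses: per-sample action cross-entropy values.
    wm_losses: per-sample world-model losses, with None for samples that
               have no valid next frame (their term is masked out).
    lam:       weight on the cross-entropy term (0.8 by default, as in the text).
    """
    l_ce = sum(ce_losses) / len(ce_losses)
    valid = [w for w in wm_losses if w is not None]
    # Assumption: average over valid samples; zero if the batch has none.
    l_wm = sum(valid) / len(valid) if valid else 0.0
    return lam * l_ce + (1.0 - lam) * l_wm


# Two samples; the second has no next frame, so only the first
# contributes to the world-model term.
loss = combined_loss(ce_losses=[1.0, 2.0], wm_losses=[0.5, None])
```

Because the Q-Former head only enters through the second term, dropping it at inference time leaves the action-decoding path unchanged.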

## Appendix F APLR Recovery Proof

We formalize the statement used in Section [3.2.2](https://arxiv.org/html/2606.04627#S3.SS2.SSS2). The goal is not to prove that a small number of APLR rounds exactly reproduces every serial latent state. Instead, the result is sharper: after $K$ refinement rounds, the first $K$ latent states are exactly identical to the serial rollout, while the remaining latent states have a structured tail error that propagates only through the strictly causal dependency graph.

##### Setup\.

Consider one sequence with $N$ latent slots. Each latent state lies in $\mathbb{R}^{d}$. Let $c$ denote all non-latent context: instruction tokens, history tokens, image embeddings, attention masks, and position encodings. For each latent index $i$, define $G_{i}$ as the decoder-induced update that maps the fixed context and all earlier latent states to the hidden state used to replace latent slot $i$. Causal attention implies

$$G_{i} = G_{i}(z_{1}, \ldots, z_{i-1}; c), \tag{23}$$

so $G_{i}$ cannot depend on $z_{i}$ or any future latent $z_{j}$ with $j > i$.

The serial latent rollout is the forward-substitution solution

$$s_{i} = G_{i}(s_{1}, \ldots, s_{i-1}; c), \qquad i = 1, \ldots, N, \tag{24}$$

with the convention that $G_{1}$ depends only on $c$. APLR initializes every latent slot with the same learned latent embedding, written abstractly as $z_{i}^{(0)} = e_{\mathrm{lat}}$, and performs the parallel Jacobi-style update

$$z_{i}^{(r+1)} = G_{i}(z_{1}^{(r)}, \ldots, z_{i-1}^{(r)}; c), \qquad i = 1, \ldots, N. \tag{25}$$

The word "parallel" means that all right-hand sides in Eq. [25](https://arxiv.org/html/2606.04627#A6.E25) are evaluated using round-$r$ latent states before any slot is overwritten for round $r+1$.

###### Lemma 1 (Causal dependency cone).

After $r$ APLR refinement rounds, latent slot $z_{i}^{(r)}$ can depend on the initial latent values only through slots with indices at most $i-r$. Equivalently, information from the exact serial prefix can advance by at most one latent position per refinement round.

###### Proof.

The claim is immediate for $r=0$ because $z_{i}^{(0)}$ is itself an initial value. Assume it holds for round $r$. At round $r+1$, $z_{i}^{(r+1)}$ is a function of $z_{1}^{(r)}, \ldots, z_{i-1}^{(r)}$ and the fixed context $c$. By the induction hypothesis, each predecessor $z_{j}^{(r)}$ can depend on initial slots only up to index $j-r$. Since $j \leq i-1$, the largest such index is $(i-1)-r = i-(r+1)$. Thus $z_{i}^{(r+1)}$ can depend on initial values only through slots with indices at most $i-(r+1)$. This proves the claim by induction. ∎

###### Proposition 1 (Exact recovery of early latent states).

For any number of refinement rounds $r \geq 0$, APLR satisfies

$$z_{i}^{(r)} = s_{i} \qquad \text{for all } i \leq r. \tag{26}$$

Consequently, after $K$ refinement rounds, APLR exactly recovers the first $K$ latent states of the serial rollout, and after $N$ rounds it recovers the full serial latent sequence.

###### Proof\.

We prove the claim by induction on the refinement round $r$.

*Base case.* For $r=0$, the set of indices satisfying $i \leq 0$ is empty, so the claim holds vacuously. No latent has been refined yet, and no equality with the serial rollout is required.

*Inductive step.* Assume that after round $r$, $z_{j}^{(r)} = s_{j}$ for every $j \leq r$. Consider round $r+1$ and any latent index $i \leq r+1$. If $i=1$, then the causal update has no latent predecessor. Both APLR and serial forward substitution therefore use exactly the same context-only map:

$$z_{1}^{(r+1)} = G_{1}(c) = s_{1}. \tag{27}$$

If $i > 1$, then every predecessor index $j < i$ satisfies $j \leq r$ because $i \leq r+1$. By the induction hypothesis, all predecessor values used by the APLR update are already equal to their serial values:

$$(z_{1}^{(r)}, \ldots, z_{i-1}^{(r)}) = (s_{1}, \ldots, s_{i-1}). \tag{28}$$

Substituting these equal predecessors into the parallel update gives

$$z_{i}^{(r+1)} = G_{i}(z_{1}^{(r)}, \ldots, z_{i-1}^{(r)}; c) = G_{i}(s_{1}, \ldots, s_{i-1}; c) = s_{i}, \tag{29}$$

where the final equality follows from the serial recurrence in Eq. [24](https://arxiv.org/html/2606.04627#A6.E24). Thus the claim holds for round $r+1$, completing the induction. ∎

###### Corollary 1 (Which latent slots remain approximate).

After $K$ refinement rounds, the only latent slots that can differ from the serial rollout are the tail slots

$$z_{K+1}^{(K)}, z_{K+2}^{(K)}, \ldots, z_{N}^{(K)}. \tag{30}$$

The first $K$ slots have zero serial-approximation error.
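Proposition 1 and Corollary 1 can be checked numerically on a toy system. The sketch below is not the MIRAGE decoder: it assumes scalar latent states and a hypothetical causal map `G` built from `tanh`, chosen only so that each slot depends on the context and on strictly earlier slots. It compares serial forward substitution (Eq. 24) with the parallel Jacobi-style update (Eq. 25).

```python
import math


def G(i, prev, c):
    """Toy causal update for slot i (1-indexed).

    Depends only on the context c and on earlier slots prev = [z_1, ..., z_{i-1}].
    """
    return math.tanh(c + sum((j + 1) * z for j, z in enumerate(prev)) / i)


def serial_rollout(n, c):
    """Forward substitution: compute s_1, then s_2 from s_1, and so on."""
    s = []
    for i in range(1, n + 1):
        s.append(G(i, s, c))
    return s


def aplr(n, c, rounds, e_lat=0.0):
    """Parallel refinement: every slot starts at e_lat and all slots are
    updated together from the previous round's values."""
    z = [e_lat] * n
    for _ in range(rounds):
        # The list comprehension reads only round-r values of z.
        z = [G(i, z[: i - 1], c) for i in range(1, n + 1)]
    return z


N, K, c = 5, 3, 0.3
s = serial_rollout(N, c)
z = aplr(N, c, rounds=K)
head_matches = z[:K] == s[:K]   # first K slots: identical inputs, identical outputs
tail_matches = z[K:] == s[K:]   # tail slots: still approximate when K < N
```

In this toy run the first $K$ slots agree exactly with the serial rollout because they are computed from identical inputs, while the tail slots still differ. Running `aplr` for $N$ rounds reproduces the whole serial sequence.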

##### Local tail\-error expression\.

The same triangular structure explains the approximation error for the tail slots when $K < N$. Let $\delta_{i}^{(r)} = z_{i}^{(r)} - s_{i}$. Assume each $G_{i}$ is twice continuously differentiable in a neighborhood of the serial solution. Define the block Jacobian

$$A_{ij} = \begin{cases} \dfrac{\partial G_{i}}{\partial z_{j}}\Big|_{s,c}, & j < i, \\ 0, & j \geq i. \end{cases} \tag{31}$$

The matrix $A$ is strictly block-lower-triangular. A Taylor expansion around $(s_{1}, \ldots, s_{i-1})$ gives

$$\delta_{i}^{(r+1)} = \sum_{j<i} A_{ij}\,\delta_{j}^{(r)} + R_{i}^{(r)}, \qquad \|R_{i}^{(r)}\| \leq C_{i}\,\|\delta_{<i}^{(r)}\|^{2} \tag{32}$$

for constants $C_{i}$ determined by local Hessian bounds. Stacking all latent errors yields

$$\delta^{(r+1)} = A\,\delta^{(r)} + R^{(r)}, \qquad \|R^{(r)}\| = O(\|\delta^{(r)}\|^{2}). \tag{33}$$

Ignoring higher-order terms, the $K$-round tail error is

$$\delta^{(K)} \approx A^{K}\delta^{(0)}. \tag{34}$$

Because $A$ is strictly lower triangular, $(A^{K})_{ij}$ can be nonzero only when $j \leq i-K$. Thus Eq. [34](https://arxiv.org/html/2606.04627#A6.E34) has two consequences. First, rows $i \leq K$ are exactly zero, matching the exact-recovery proposition. Second, for a tail latent $i > K$, the remaining error comes only from causal chains of length at least $K$ that connect earlier imperfect initial values to slot $i$. These are precisely the deep latent states that have not received enough refinement rounds to match the serial rollout.
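The sparsity claim about $A^{K}$ can be verified on a small example. The sketch below assumes scalar blocks and a hypothetical strictly lower-triangular matrix with all sub-diagonal entries equal to 0.5; the helper names `matmul` and `matpow` are illustrative. Indices in the code are 0-based, so the condition $j \leq i-K$ reads the same way but rows $0, \ldots, K-1$ play the role of slots $1, \ldots, K$.

```python
def matmul(a, b):
    """Product of two square matrices stored as lists of rows."""
    n = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def matpow(a, k):
    """k-th power of a square matrix by repeated multiplication."""
    n = len(a)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(k):
        result = matmul(result, a)
    return result


N, K = 5, 3
# Strictly lower-triangular toy Jacobian (hypothetical values).
A = [[0.5 if j < i else 0.0 for j in range(N)] for i in range(N)]
AK = matpow(A, K)

# Positions (i, j) where the K-th power is nonzero.
support = [(i, j) for i in range(N) for j in range(N) if AK[i][j] != 0.0]
first_rows_zero = all(AK[i][j] == 0.0 for i in range(K) for j in range(N))
```

Every nonzero entry of `AK` sits at least $K$ positions below the diagonal, and the first $K$ rows vanish, so in the linearized model the first $K$ slots carry no error. Raising `A` to the $N$-th power gives the zero matrix, which is the linearized counterpart of full recovery after $N$ rounds.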

## Appendix G How the World-Model Loss Regularizes Tail Errors

The previous section identifies the approximation error left by using $K < N$ APLR rounds: it lives only in the tail latent slots $z_{K+1:N}^{(K)}$. The Q-Former world-model loss does not make these tail slots algebraically identical to the serial rollout. Instead, it adds direct supervision to the components of their error that affect prediction of the next mobile screen in the VLM visual feature space. This is the useful notion of compensation for a mobile agent: the auxiliary loss penalizes tail errors that discard environment-transition information.

##### Tail-error notation.

Let

$$\tau^{(K)}=\big[\delta_{K+1}^{(K)},\ldots,\delta_{N}^{(K)}\big] \tag{35}$$

denote the stacked tail error after $K$ APLR rounds. Let $P_{\phi}$ denote the Q-Former aligner, including the row/column query embeddings, cross-attention into latent CoT hidden states, and final visual projection. For a valid next-frame transition, define the detached target feature matrix

$$V^{\star}=\mathrm{sg}(f_{\mathrm{vis}}(x_{t+1}))\in\mathbb{R}^{M\times H_{v}}. \tag{36}$$

The Q-Former prediction as a function of the latent state is

$$\hat{V}(\tau)=P_{\phi}(s+\tau;R), \tag{37}$$

where $R$ is the set of valid row/column patch coordinates. For the cosine variant used by default, write normalized predicted and target features as

$$g(\tau)=\mathrm{norm}(\hat{V}(\tau)),\qquad u=\mathrm{norm}(V^{\star}), \tag{38}$$

where normalization is applied patchwise. The masked world-model loss is

$$\ell_{\mathrm{wm}}(\tau)=\frac{1}{|\mathcal{M}|}\sum_{j\in\mathcal{M}}\left(1-\langle g_{j}(\tau),u_{j}\rangle\right)=\frac{1}{2|\mathcal{M}|}\sum_{j\in\mathcal{M}}\|g_{j}(\tau)-u_{j}\|_{2}^{2}, \tag{39}$$

because both $g_{j}$ and $u_{j}$ have unit norm. Here $\mathcal{M}$ contains only real next-frame patches from samples with a valid next frame and at least one latent slot.
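The equality of the two forms in Eq. 39 follows from $\|g-u\|_2^2 = 2 - 2\langle g,u\rangle$ for unit vectors. The following minimal sketch checks this numerically on made-up patch features; the function names and the toy values are assumptions for illustration, not part of the paper's code.

```python
import math


def normalize(v):
    """Scale a vector to unit Euclidean norm (patchwise normalization)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def cosine_form(pred, target):
    """Mean over patches of 1 - <g_j, u_j>."""
    total = 0.0
    for p, t in zip(pred, target):
        g, u = normalize(p), normalize(t)
        total += 1.0 - sum(a * b for a, b in zip(g, u))
    return total / len(pred)


def squared_form(pred, target):
    """Mean over patches of 0.5 * ||g_j - u_j||^2."""
    total = 0.0
    for p, t in zip(pred, target):
        g, u = normalize(p), normalize(t)
        total += sum((a - b) ** 2 for a, b in zip(g, u))
    return total / (2 * len(pred))


# Three toy patches with three feature dimensions each (arbitrary values).
pred = [[1.0, 2.0, 3.0], [0.5, -1.0, 0.0], [2.0, 2.0, 1.0]]
target = [[1.0, 0.0, 1.0], [0.0, -2.0, 1.0], [2.0, 2.0, 1.0]]

gap = abs(cosine_form(pred, target) - squared_form(pred, target))
print(gap < 1e-12)
```

The squared-distance form is the one used in the local expansion that follows, since it makes the quadratic dependence on the prediction error explicit.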

##### Local expansion.

Let $J$ be the Jacobian of the normalized Q-Former prediction $g(\tau)$ with respect to the tail error at $\tau=0$:

$$g(\tau)=g(0)+J\tau+O(\|\tau\|^{2}). \tag{40}$$

Substituting Eq. [40](https://arxiv.org/html/2606.04627#A7.E40) into Eq. [39](https://arxiv.org/html/2606.04627#A7.E39) yields

$$\ell_{\mathrm{wm}}(\tau)-\ell_{\mathrm{wm}}(0)=\frac{1}{|\mathcal{M}|}\langle J^{\top}(g(0)-u),\tau\rangle+\frac{1}{2|\mathcal{M}|}\|J\tau\|_{2}^{2}+O(\|\tau\|^{3}). \tag{41}$$

When the serial latent solution is locally optimized for the world-model target, or when we analyze directions orthogonal to the residual gradient $J^{\top}(g(0)-u)$, the linear term vanishes. The leading term is then a quadratic penalty on the components of $\tau$ that the Q-Former prediction can observe.

###### Proposition 2 (Q-Former world-model supervision controls future-predictive tail error).

Let $U$ be a subspace of tail-error directions on which the normalized Q-Former Jacobian is locally observable: for all $\xi\in U$,

$$\|J\xi\|_{2}\geq\sigma_{U}\|\xi\|_{2} \tag{42}$$

for some $\sigma_{U}>0$. Assume the residual-gradient term in Eq. [41](https://arxiv.org/html/2606.04627#A7.E41) is zero or treated as a first-order optimization gradient. Then, for sufficiently small tail error,

$$\ell_{\mathrm{wm}}(\tau^{(K)})-\ell_{\mathrm{wm}}(0)\geq\frac{\sigma_{U}^{2}}{2|\mathcal{M}|}\|P_{U}\tau^{(K)}\|^{2}-O(\|\tau^{(K)}\|^{3}), \tag{43}$$

where $P_{U}$ projects onto the future-predictive subspace of tail errors that change normalized next-frame feature predictions.

###### Proof.

Using Eq. [41](https://arxiv.org/html/2606.04627#A7.E41) and dropping the zero residual-gradient term gives

$$\ell_{\mathrm{wm}}(\tau^{(K)})-\ell_{\mathrm{wm}}(0)\geq\frac{1}{2|\mathcal{M}|}\|J\tau^{(K)}\|_{2}^{2}-O(\|\tau^{(K)}\|^{3}). \tag{44}$$

Decompose the tail error into a future-predictive component and a locally unobserved component:

$$\tau^{(K)}=P_{U}\tau^{(K)}+(I-P_{U})\tau^{(K)}. \tag{45}$$

By definition of $U$, the Q-Former prediction is informative on $P_{U}\tau^{(K)}$, with minimum singular value at least $\sigma_{U}$. Reading the locally unobserved component $(I-P_{U})\tau^{(K)}$ as lying in the local nullspace of $J$, that is, among the directions that do not change the prediction, we have $J\tau^{(K)}=JP_{U}\tau^{(K)}$. Therefore

$$\|J\tau^{(K)}\|_{2}\geq\sigma_{U}\|P_{U}\tau^{(K)}\|. \tag{46}$$

Squaring this inequality and substituting it into Eq. 44 yields Eq. [43](https://arxiv.org/html/2606.04627#A7.E43). ∎
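A toy numerical instance can make the bound concrete. The sketch below assumes a hand-picked linear Jacobian that observes the first two coordinates of the tail error and ignores the third, so that $U=\mathrm{span}(e_1,e_2)$, the unobserved direction is the nullspace of $J$, and $\sigma_U=2$. The matrix, the patch count, and all function names are illustrative assumptions and are not taken from the paper.

```python
import math


def matvec(J, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * b for a, b in zip(row, v)) for row in J]


def norm(v):
    """Euclidean norm."""
    return math.sqrt(sum(x * x for x in v))


# Assumed toy Jacobian: gains 3 and 2 on the first two coordinates,
# zero gain on the third (the locally unobserved direction).
J = [[3.0, 0.0, 0.0],
     [0.0, 2.0, 0.0]]
SIGMA_U = 2.0  # smallest gain of J on U = span(e1, e2)


def project_U(tau):
    """Orthogonal projection onto U = span(e1, e2)."""
    return [tau[0], tau[1], 0.0]


def quadratic_term(tau, num_patches):
    """Leading term ||J tau||^2 / (2|M|) of the loss increase."""
    return norm(matvec(J, tau)) ** 2 / (2 * num_patches)


def lower_bound(tau, num_patches):
    """Bound sigma_U^2 ||P_U tau||^2 / (2|M|) from the proposition."""
    return SIGMA_U ** 2 * norm(project_U(tau)) ** 2 / (2 * num_patches)


tau = [0.01, -0.02, 0.5]  # large unobserved component, small observed one
print(quadratic_term(tau, 4) >= lower_bound(tau, 4))  # True
```

The large third component of `tau` changes neither side of the inequality, which reflects that the world-model loss leaves nullspace directions unconstrained.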

##### MSE variant.

If the implementation uses the MSE world-model loss instead of cosine distance, the same argument applies without patchwise normalization. Let $\hat{V}(\tau)=\hat{V}(0)+B\tau+O(\|\tau\|^{2})$. The masked MSE loss has local expansion

$$\ell_{\mathrm{mse}}(\tau)-\ell_{\mathrm{mse}}(0)=\frac{1}{|\mathcal{M}|}\langle B^{\top}(\hat{V}(0)-V^{\star}),\tau\rangle+\frac{1}{|\mathcal{M}|}\|B\tau\|_{2}^{2}+O(\|\tau\|^{3}). \tag{47}$$

Thus any tail-error direction that changes predicted next-frame features is quadratically penalized once the first-order residual term is optimized.

##### Interpretation.

The proposition says exactly which unrefined latent-token errors the Q-Former world-model objective can compensate: errors in the tail slots $z_{K+1:N}^{(K)}$ that are visible to next-frame feature prediction, i.e., errors outside the local nullspace of the Q-Former prediction Jacobian. These directions correspond to latent information that changes the predicted semantic layout of the next screenshot, such as whether a tap opens a menu, whether a typed query changes a text field, or whether navigation moves to a new screen. When the combined loss

$$\mathcal{L}=\lambda_{\mathrm{ce}}\mathcal{L}_{\mathrm{ce}}+\lambda_{\mathrm{wm}}\mathcal{L}_{\mathrm{wm}} \tag{48}$$

is optimized, any such future-predictive tail error increases the objective by at least

$$\frac{\lambda_{\mathrm{wm}}\sigma_{U}^{2}}{2|\mathcal{M}|}\|P_{U}\tau^{(K)}\|^{2} \tag{49}$$

up to higher-order terms. Thus the auxiliary world-model loss supplies curvature and gradients for tail latent states that may be weakly constrained by action imitation alone. It does not control tail-error directions that do not affect next-frame visual feature predictions; those directions are either irrelevant to the chosen environment-model target or must be constrained by the action loss and other regularizers.

## Appendix H System Prompt

Figure [9](https://arxiv.org/html/2606.04627#A8.F9) shows the complete system prompt provided to the MIRAGE agent at inference time. The prompt defines the role, input format, output format with structured `<THOUGHT>`/`<ACTION_DESC>`/`<ACTION>` tags, and the full command vocabulary. At inference time the `<THOUGHT>` block is replaced by the $N$ latent slots described in Section [3.2.2](https://arxiv.org/html/2606.04627#S3.SS2.SSS2); the explicit text template here is therefore used only during Stage 1 warmup training.

System Prompt (inference-time template)

You are an Android mobile GUI agent.

Input: the user task, a summary list of completed past actions, and the current screenshot.

Choose the single best next action from the current screen only. Do not invent unseen UI. Produce one atomic step. Coordinates are normalized integers in [0,999]. Do not emit markdown fences or hidden reasoning tags such as <THOUGHT\>.

Output:

- Emit only the required blocks using the literal tags `<THOUGHT>`, `<ACTION_DESC>`, and `<ACTION>`.
- `<THOUGHT>`: one English line with exactly `1. observation: ... | 2. rationale: ... | 3. predict: ...`.
  - `observation`: describe only the currently visible UI on the current screenshot.
  - `rationale`: combine task goal, prior progress, and the reason this single next action is best.
  - `predict`: describe the immediate visible result right after the action.

  Keep it grounded in the current screenshot and history. Do not mention coordinates. Do not mention prompt structure, annotation metadata, or screenshot-availability phrases such as "next screenshot", "no next screenshot", "one image", or "two images".
- `<ACTION_DESC>`: English, one imperative phrase.
- `<ACTION>`: exactly one executable operation command string.

If the task is completed successfully, use `finish(success)`. If the task should terminate unsuccessfully, use `finish(fail)`.

Command format:

- Tap a visible UI element: `click(<|point_start|>(x,y)<|point_end|>)`
- Swipe to move the page or navigate across the screen: `swipe(<|point_start|>(x1,y1)<|point_end|>, <|point_start|>(x2,y2)<|point_end|>)`
- Drag to move a slider, handle, or draggable object: `drag(<|point_start|>(x1,y1)<|point_end|>, <|point_start|>(x2,y2)<|point_end|>)`
- Long press to open a context action or select an item: `long_press(<|point_start|>(x,y)<|point_end|>)`
- Type into the currently focused field: `type("text")`
- Type into a specific text box: `type(<|box_start|>(x1,y1),(x2,y2)<|box_end|>, "text")`
- Press a supported device key: `press(HOME)`
- Open an app directly when that is the intended system action: `open_app("Settings")`
- Wait briefly for the UI to load or update: `wait()` or `wait(t)`
- Return a textual answer for a non-GUI question: `answer("text")`
- Request human takeover when the task cannot be safely continued: `call_user()`
- End the task: `finish(success)` or `finish(fail)`

Use exactly one command. Do not wrap it in JSON or natural language. Use English double quotes for string arguments and uppercase English for key names in `press(...)`.

Figure 9: Full system prompt used by MIRAGE. The `<THOUGHT>` block is a visible text template during Stage 1 warmup; it is replaced by learned latent slots during Stage 2 and at inference time. Coordinates are normalized integers in $[0,999]$.

## Appendix I Action Space

MIRAGE operates over a compact, typed action space that maps directly to Android system APIs. Each action is a single atomic command; the agent never emits more than one command per step. Table [5](https://arxiv.org/html/2606.04627#A9.T5) lists all supported action types.

Table 5: Complete action space of MIRAGE. Coordinates $(x,y)$ are normalized integers in $[0,999]$ relative to the screenshot dimensions.

##### Coordinate convention.

All spatial coordinates are normalized to $[0,999]$ independently in width and height, mapping the top-left corner of the screenshot to $(0,0)$ and the bottom-right corner to $(999,999)$. This convention is device-resolution-agnostic: the same action is valid on any screen size and is converted to absolute pixel coordinates at execution time by the environment driver.
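The conversion to pixel coordinates might look like the following minimal sketch. The paper does not state the driver's exact rounding rule, so the linear map onto pixel indices $0,\ldots,W-1$ and $0,\ldots,H-1$ and the function name `to_pixels` are assumptions chosen so that the two corners land on the corner pixels.

```python
def to_pixels(x_norm, y_norm, width, height):
    """Map normalized [0, 999] coordinates to pixel indices.

    Assumed rule: (0, 0) maps to the top-left pixel and (999, 999) to the
    bottom-right pixel (width - 1, height - 1), with linear interpolation
    and rounding in between.
    """
    for value in (x_norm, y_norm):
        if not 0 <= value <= 999:
            raise ValueError(f"coordinate {value} is outside [0, 999]")
    x_px = round(x_norm * (width - 1) / 999)
    y_px = round(y_norm * (height - 1) / 999)
    return x_px, y_px


# The same normalized action resolves to different pixels on different screens.
print(to_pixels(512, 340, 1080, 2400))  # (553, 816)
print(to_pixels(999, 999, 720, 1280))   # (719, 1279)
```

Because width and height are scaled independently, the aspect ratio of the device does not need to be known when the action is generated.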

##### Action tokenization.

Each action command is serialized as a plain ASCII string (e.g. `click((512,340))`) and appended to the VLM's output token sequence. The model is trained with next-token cross-entropy over the action tokens, $\mathcal{L}_{\mathrm{ce}}$ (Eq. [8](https://arxiv.org/html/2606.04627#S3.E8)). No structured JSON, markup, or special delimiters are used; the action string is the only content inside the `<ACTION>` block.
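On the environment side, such a string has to be split back into a command name and its arguments. The sketch below is a hypothetical parser, not the MIRAGE driver: the function name `parse_action`, the returned dictionary layout, and the choice to strip the `<|point_start|>`-style markers from the prompt template before parsing are all assumptions made for illustration.

```python
import re

# Optional coordinate markers that appear in the prompt's command templates.
_MARKERS = re.compile(r"<\|(?:point|box)_(?:start|end)\|>")
# A single command: name(arguments)
_COMMAND = re.compile(r"^(\w+)\((.*)\)$", re.DOTALL)
_QUOTED = re.compile(r'"([^"]*)"')
_POINT = re.compile(r"\((\d+)\s*,\s*(\d+)\)")


def parse_action(action):
    """Split one action string into its name, integer points, and string arguments."""
    text = _MARKERS.sub("", action.strip())
    match = _COMMAND.match(text)
    if match is None:
        raise ValueError(f"not a single command: {action!r}")
    name, args = match.groups()
    strings = _QUOTED.findall(args)
    # Remove quoted text first so that typed text like "(1,2)" is not read as a point.
    unquoted = _QUOTED.sub("", args)
    points = [(int(x), int(y)) for x, y in _POINT.findall(unquoted)]
    return {"name": name, "points": points, "strings": strings}


print(parse_action("click((512,340))"))
# {'name': 'click', 'points': [(512, 340)], 'strings': []}
```

Bare arguments such as `HOME` in `press(HOME)` or `success` in `finish(success)` are not captured by this sketch; a fuller parser would need a rule for them.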
