Perceive Before Reasoning: A Pre-Reasoning Perception Framework for Efficient and Reliable Proactive Mobile Agents

arXiv cs.AI 06/03/26, 04:00 AM Papers
Summary
This paper proposes a Pre-Reasoning Perception Framework (PRPF) for proactive mobile agents, decoupling intervention timing from assistance generation to improve efficiency and reduce false triggers.
arXiv:2606.03236v1 Announce Type: new
Cached at: 06/03/26, 09:43 AM
# Perceive Before Reasoning: A Pre-Reasoning Perception Framework for Efficient and Reliable Proactive Mobile Agents
Source: [https://arxiv.org/html/2606.03236](https://arxiv.org/html/2606.03236)
Zhijie Ding¹,²,\*, Weinan Hong¹,³,\*, Zicheng Zhu¹,⁴,\*, Lei Li¹,\*, Dezhi Kong¹, Hao Wang¹, Peng Zhou¹, Xuchu Jiang²,†, Jiaming Xu¹,†

¹ HyperAI Team, Xiaomi Corporation; ² Zhongnan University of Economics and Law; ³ Jilin University; ⁴ The Chinese University of Hong Kong, Shenzhen

###### Abstract

Multimodal large language models (MLLMs) have substantially advanced mobile agents, yet proactive mobile assistance remains challenging because agents must decide *when* to intervene before determining *how* to assist. Existing systems often implement these two decisions within a unified MLLM-based pipeline, leading to goal misalignment between conservative intervention filtering and comprehensive assistance generation, as well as redundant inference when the agent should remain silent. To address these limitations, we propose the Pre-Reasoning Perception Framework (PRPF), a two-stage framework built on perceiving before reasoning. PRPF introduces a lightweight Multimodal Proactive Perceptor (MPP) for intervention gating and context compression, and activates the Proactive Agent Reasoner (PAR) only when intervention is warranted. Experiments on the ProactiveMobile benchmark show that PRPF substantially reduces false trigger rates (FTR) while improving success rates (SR) and inference efficiency over the ProactiveMobile baseline.

\* These authors contributed equally. † Corresponding authors.

## 1 Introduction

Multimodal large language models (MLLMs), particularly vision-language models (VLMs), have substantially advanced mobile agents by enabling them to perceive mobile interfaces and execute user instructions (Bai et al., [2025](https://arxiv.org/html/2606.03236#bib.bib33); Hurst et al., [2024](https://arxiv.org/html/2606.03236#bib.bib34); Zhang et al., [2025b](https://arxiv.org/html/2606.03236#bib.bib13)). However, mobile agents are increasingly expected to move beyond reactive instruction following toward proactive assistance, where they anticipate user needs and intervene without explicit prompts (Lu et al., [2025](https://arxiv.org/html/2606.03236#bib.bib8); Kong et al., [2026](https://arxiv.org/html/2606.03236#bib.bib9)). Existing studies commonly formulate proactive assistance as a *when–how* problem, in which an agent must first determine *when* to intervene and then decide *how* to assist (Yang et al., [2026b](https://arxiv.org/html/2606.03236#bib.bib12); Xie et al., [2026](https://arxiv.org/html/2606.03236#bib.bib18); Liu et al., [2026](https://arxiv.org/html/2606.03236#bib.bib11)).

![Refer to caption](https://arxiv.org/html/2606.03236v1/Figure/Figure1_comparison.png)

Figure 1: Comparison between unified proactive reasoning and PRPF.

Recent systems typically realize this formulation within a unified VLM-based agent. For example, ProAgentBench (Tang et al., [2026](https://arxiv.org/html/2606.03236#bib.bib15)) treats when to assist and how to assist as timing judgment and content generation subtasks handled by the same fine-tuned backbone. Similarly, PARE-Bench (Nathani et al., [2026](https://arxiv.org/html/2606.03236#bib.bib10)) frames proactive assistance as an Observe–Execute process, where observation and execution are realized by switching tool-use modes within one VLM-based assistant. Despite differences in implementation, these systems generally couple intervention timing and assistance generation within a single VLM-agent pipeline. Figure [1](https://arxiv.org/html/2606.03236#S1.F1) contrasts this unified reasoning paradigm with PRPF, which filters non-intervention cases before reasoning to reduce unwarranted user disturbance. The unified design gives rise to two key limitations:

Goal misalignment. The *when* decision requires conservative and highly discriminative judgment to suppress false triggers under asymmetric intervention costs, whereas the *how* decision requires broad multimodal reasoning and flexible content generation. Coupling these goals within a single VLM-agent makes it difficult to simultaneously achieve conservative intervention filtering and comprehensive assistance generation. As observed in Proactive Agent (Lu et al., [2025](https://arxiv.org/html/2606.03236#bib.bib8)) and PRISM (Fu et al., [2026](https://arxiv.org/html/2606.03236#bib.bib16)), even strong models struggle to maintain the low false-trigger rates required for satisfactory mobile user experience. Excessive interventions disrupt user workflows, while overly conservative policies risk reducing proactive agents to passive assistants.

Inference inefficiency. Existing single-stage VLM agents execute the full reasoning pipeline regardless of whether intervention is ultimately needed. As a result, large models are activated even when the correct behavior is to remain silent, incurring unnecessary long-context and multi-turn reasoning costs (Gao et al., [2024](https://arxiv.org/html/2606.03236#bib.bib55); Yang et al., [2026a](https://arxiv.org/html/2606.03236#bib.bib56)). This inefficiency becomes more severe in mobile environments with diverse scenarios and large function spaces. For example, ProactiveMobile (Kong et al., [2026](https://arxiv.org/html/2606.03236#bib.bib9)) requires reasoning over schemas spanning 14 scenarios, substantially increasing prompt length and inference cost. ContextAgent (Yang et al., [2026b](https://arxiv.org/html/2606.03236#bib.bib12)) similarly reports higher computational costs with limited performance gains under complex contextual settings.

To address these limitations, we propose the Pre-Reasoning Perception Framework (PRPF), a mobile proactive intelligence framework built on the principle of perceiving before reasoning. Rather than using a large VLM to jointly determine when to intervene and how to assist, PRPF separates the two stages at the architectural level. A lightweight perceptual front-end first performs intervention gating and candidate function compression, assigning the conservative *when* judgment to an efficient discriminative module. The VLM-based reasoner is then activated only when intervention is warranted, allowing it to focus on the generative *how*-stage. This separation aligns each module with its corresponding goal while avoiding full VLM inference in cases where the agent should remain silent.

Concretely, PRPF introduces a Multimodal Proactive Perceptor (MPP), a lightweight multimodal fusion encoder with short- and long-term pathways that operates before large-model reasoning. MPP first decides whether the agent should intervene or remain silent, thereby reducing false triggers at the source. It then compresses the multimodal context into a small set of top-$K$ scenario candidates, filtering irrelevant long-tail contexts and reducing the reasoning burden for the subsequent stage. Given this compact evidence, the Proactive Agent Reasoner (PAR) conducts focused deep reasoning to generate the final proactive recommendation. As a result, MPP serves as a pre-reasoning perceptual cutoff, while PAR concentrates its reasoning capacity on cases where proactive assistance is actually needed. Experiments on the ProactiveMobile benchmark show that PRPF improves the success rate from 20.82% to 41.15%, reduces the false trigger rate from 13.76% to 7.21%, and cuts expected inference compute by 69.3%. These results show that PRPF improves proactive recommendation and tool invocation accuracy while reducing false interruptions by better identifying when to remain silent. In summary, our contributions are as follows:

- We propose PRPF, a two-stage pre-reasoning perception framework that architecturally decouples *when* judgment from *how*-stage reasoning for proactive mobile agents.
- We design MPP, a lightweight and plug-and-play multimodal perceptor for intervention gating and context compression, together with PAR, a Proactive Agent Reasoner for complete and focused *how*-stage reasoning.
- Extensive experiments on the ProactiveMobile benchmark demonstrate that PRPF substantially reduces false trigger rates while improving success rates and inference efficiency.

![Refer to caption](https://arxiv.org/html/2606.03236v1/Figure/Figure3_pipeline.png)

Figure 2: Overall framework of PRPF.

## 2 Related Work

### 2.1 Proactive Agents

Proactive-agent research extends LLM agents from instruction following to active assistance, where agents must judge intervention timing and generate useful assistance from ongoing context (Lu et al., [2025](https://arxiv.org/html/2606.03236#bib.bib8); Tang et al., [2026](https://arxiv.org/html/2606.03236#bib.bib15); Fu et al., [2026](https://arxiv.org/html/2606.03236#bib.bib16)). Prior work studies this distinction through intervention-timing prediction, assistance-content generation, cost-sensitive selective intervention, and staged proactive behavior (Tang et al., [2026](https://arxiv.org/html/2606.03236#bib.bib15); Fu et al., [2026](https://arxiv.org/html/2606.03236#bib.bib16); Nathani et al., [2026](https://arxiv.org/html/2606.03236#bib.bib10); Xie et al., [2026](https://arxiv.org/html/2606.03236#bib.bib18)). Proactive GUI and mobile settings make this problem concrete because user intent must be inferred from interface trajectories before an explicit command is issued. Existing benchmarks study intent recommendation over GUI trajectories and executable function-sequence generation from on-device context (Chai et al., [2026](https://arxiv.org/html/2606.03236#bib.bib19); Kong et al., [2026](https://arxiv.org/html/2606.03236#bib.bib9)). Using ProactiveMobile as the evaluation setting, PRPF focuses on the architectural separation between lightweight pre-reasoning intervention perception and heavy VLM-based assistance reasoning.

### 2.2 GUI Perception and Efficient Reasoning

Mobile and GUI agents provide the perception and execution substrate for proactive assistance, but most existing systems remain reactive. Prior work has advanced smartphone operation, mobile task execution, screenshot-based GUI understanding, and visual grounding under explicit instructions (Zhang et al., [2025b](https://arxiv.org/html/2606.03236#bib.bib13); Wang et al., [2024](https://arxiv.org/html/2606.03236#bib.bib24); Rawles et al., [2025](https://arxiv.org/html/2606.03236#bib.bib14); Deng et al., [2024](https://arxiv.org/html/2606.03236#bib.bib25); Qin et al., [2025](https://arxiv.org/html/2606.03236#bib.bib39); Cheng et al., [2024](https://arxiv.org/html/2606.03236#bib.bib26)). These advances improve how an agent perceives and acts on a requested GUI task, but not whether a continuously observed mobile context warrants intervention. Efficiency-oriented work further introduces intermediate perception, adaptive visual processing, region selection, or model routing before expensive reasoning (Wu et al., [2025](https://arxiv.org/html/2606.03236#bib.bib27); Mehrotra et al., [2025](https://arxiv.org/html/2606.03236#bib.bib28); Tang et al., [2025](https://arxiv.org/html/2606.03236#bib.bib29); Liu et al., [2026](https://arxiv.org/html/2606.03236#bib.bib30); Ong et al., [2024](https://arxiv.org/html/2606.03236#bib.bib31); Xu et al., [2025](https://arxiv.org/html/2606.03236#bib.bib32)). These methods use front-end selection, but their decision signals are typically task difficulty, region relevance, or model-call cost for a given query. They do not by themselves resolve the proactive intervention variable: whether a mobile context warrants intervention and which intent candidates should condition subsequent assistance.

## 3 Method

### 3.1 Task Definition

Proactive intelligence on mobile devices requires models to infer users’ latent needs before explicit requests and generate executable function-call sequences (Kong et al., [2026](https://arxiv.org/html/2606.03236#bib.bib9)). The task therefore requires both accurately determining when to trigger recommendations and generating function sequences aligned with user intent. Given a mobile interaction sample, the input includes the user profile ($U$), device state ($D$), world information ($W$), and interaction history ($I$). In the multimodal setting, $I$ is represented as a GUI screenshot sequence, whereas in the text setting, it is represented as trajectory text. The model needs to produce an executable function-call sequence from the function set ($F$):

$$\hat{y}=\begin{cases}\varnothing\\ \hat{s}=[(\hat{f}_{1},\hat{a}_{1}),\ldots,(\hat{f}_{m},\hat{a}_{m})]\end{cases}\tag{1}$$

where $\varnothing$ indicates that proactive recommendation should not be triggered under the current state; $\hat{s}$ denotes the function-call sequence predicted by the model; $\hat{f}_{m}\in F$ denotes the $m$-th predicted function; and $\hat{a}_{m}$ denotes the corresponding function argument.

### 3.2 PRPF Framework

To address the high false-trigger rate and inference inefficiency in proactive mobile intelligence, we propose the Pre-Reasoning Perception Framework (PRPF), as shown in Figure [2](https://arxiv.org/html/2606.03236#S1.F2). PRPF consists of two parts: 1) the lightweight Multimodal Proactive Perceptor (MPP), which quickly determines whether the current mobile interaction state should trigger proactive recommendation and predicts candidate functions; 2) the Proactive Agent Reasoner (PAR), which generates function-call sequences that are consistent with users’ intent. For a sample that does not require recommendation, MPP directly filters it out; otherwise, it enters PAR reasoning to generate the function-call sequence.
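The control flow described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Perception`, `run_prpf`, `perceive`, and `reason` are hypothetical, and `tau=0.5` is an assumed placeholder threshold. What the sketch shows is that the expensive reasoner is never called when the gate decides to stay silent.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional, Tuple

FunctionCall = Tuple[str, dict]  # (function name, arguments)


@dataclass
class Perception:
    """Hypothetical container for the two MPP outputs."""
    p_trig: float                    # trigger probability from the gating head
    candidate_functions: List[str]   # compressed candidate function set


def run_prpf(
    sample: Any,
    perceive: Callable[[Any], Perception],
    reason: Callable[[Any, List[str]], List[FunctionCall]],
    tau: float = 0.5,  # assumed threshold, not a value taken from the paper
) -> Optional[List[FunctionCall]]:
    """Gate first; invoke the expensive reasoner only when the gate fires."""
    perception = perceive(sample)
    if perception.p_trig < tau:
        return None  # remain silent; the reasoner is skipped entirely
    return reason(sample, perception.candidate_functions)
```

Passing the two stages in as callables keeps the gate and the reasoner independently replaceable, which mirrors the plug-and-play role the text assigns to MPP.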

### 3.3 Multimodal Proactive Perceptor

MPP consists of a fast–slow dual-channel interaction module and two task-specific MLP heads for trigger gating and function-scenario prediction. The two heads determine whether to trigger proactive recommendation and predict candidate functions, providing upfront preprocessing for PAR.

Taking multimodal data as an example, the textual information and GUI screenshot sequence are first encoded by the text encoder and image encoder, respectively:

$$H_{\mathrm{text}}=f_{\mathrm{text}}(\{U,D,W\}),\quad H_{\mathrm{img}}=\{f_{\mathrm{img}}(I_{t})\}_{t=1}^{T},\tag{2}$$

where $f_{\mathrm{text}}$ and $f_{\mathrm{img}}$ denote the text encoder and image encoder, respectively; $I_{t}$ denotes the $t$-th GUI screenshot; and $T$ denotes the length of the complete GUI screenshot sequence.

The textual and visual features are then projected into a unified dimensional space through linear mappings:

$$c_{\mathrm{text}}=W_{c}H_{\mathrm{text}},\quad c_{\mathrm{img}}=W_{v}H_{\mathrm{img}},\tag{3}$$

where $W_{c}$ and $W_{v}$ are the linear projection matrices. $c_{\mathrm{text}}\in\mathbb{R}^{3\times d}$ and $c_{\mathrm{img}}\in\mathbb{R}^{T\times d}$ denote the projected textual context representation and screenshot sequence representation, respectively, and $d$ is the unified projection dimension.

The function\-call sequence depends on both short\-term GUI dynamics and long\-term behavioral preferences reflected in the textual context\. Accordingly, we construct fast and slow channels to capture these complementary signals\.

The slow channel captures long\-range behavioral patterns that reflect users’ preferences\. First, we apply attention pooling over the complete GUI sequence to obtain the global visual\-state representation:

$$\bar{x}=\mathrm{AttnPool}(H_{\mathrm{img}}).\tag{4}$$

The pooled visual representation is then concatenated with $c_{\mathrm{text}}$ and fed into a Transformer Encoder (Vaswani et al., [2017](https://arxiv.org/html/2606.03236#bib.bib59)) to model the long-term cross-modal representation $H_{\mathrm{slow}}$ over $U$, $D$, $W$, and the complete GUI screenshot sequence.

The fast channel captures short-term user dynamics. Based on empirical observations, we select the most recent four GUI screenshot representations and feed them through a projection layer and a Transformer Encoder to obtain the short-term dynamic representation $H_{\mathrm{fast}}$:

$$H_{\mathrm{fast}}=H_{\mathrm{img}}[T-k+1:T],\quad k=4.\tag{5}$$

To enhance the information interaction between the fast and slow channels, MPP introduces a cross-attention mechanism, enabling the model to simultaneously consider users’ long-term behavioral background and the instantaneous state of the current interface when determining the recommendation timing. Specifically, for any channel pair $(\bar{c},c)\in\{(\mathrm{fast},\mathrm{slow}),(\mathrm{slow},\mathrm{fast})\}$, the representation of the current channel $c$ is used as the Query, while the representation of the other channel $\bar{c}$ is used as the Key and Value, thereby obtaining the cross-channel context-enhanced representation:

$$\widetilde{H}_{c}=\mathrm{MHA}\left(Q=H_{c},K=H_{\bar{c}},V=H_{\bar{c}}\right).\tag{6}$$

Attention pooling is then applied to the two enhanced representations, which are concatenated to obtain the fused representation $z\in\mathbb{R}^{2d}$.
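The bidirectional cross-attention and fusion step can be illustrated with a dependency-free sketch. It is a simplification under loudly stated assumptions: a single attention head with no learned projections stands in for MHA, and mean pooling stands in for attention pooling. The function names `cross_attend`, `mean_pool`, and `fuse` are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]  # rows are token representations of width d


def softmax(xs: List[float]) -> List[float]:
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(query: Matrix, context: Matrix) -> Matrix:
    """Single-head scaled dot-product attention with Q=query, K=V=context."""
    d = len(query[0])
    out = []
    for q in query:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in context]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, context)) for j in range(d)])
    return out


def mean_pool(h: Matrix) -> List[float]:
    """Assumed stand-in for attention pooling: average over rows."""
    return [sum(col) / len(h) for col in zip(*h)]


def fuse(h_fast: Matrix, h_slow: Matrix) -> List[float]:
    """Each channel queries the other; pooled results are concatenated (length 2d)."""
    fast_enhanced = cross_attend(h_fast, h_slow)
    slow_enhanced = cross_attend(h_slow, h_fast)
    return mean_pool(fast_enhanced) + mean_pool(slow_enhanced)
```

With four fast-channel rows and any number of slow-channel rows of width `d`, `fuse` returns a vector of length `2 * d`, matching the shape of the fused representation.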

The fused representation $z$ is fed into the trigger-gating head to obtain the probability of triggering proactive recommendation:

$$p_{\mathrm{trig}}=\sigma(\mathrm{MLP}_{\mathrm{trig}}(z)),\tag{7}$$

where $\sigma$ denotes the sigmoid function. If $p_{\mathrm{trig}}<\tau$, the sample is directly filtered out. Otherwise, it is passed to PAR for subsequent reasoning. The details are provided in Appendix [B.1](https://arxiv.org/html/2606.03236#A2.SS1).

To further compress the function candidate set and enhance reasoning efficiency, we feed $z$ into the function-scenario prediction head to obtain the probabilities of all scenarios:

$$p_{\mathrm{scenario}}=\sigma(\mathrm{MLP}_{\mathrm{scenario}}(z)).\tag{8}$$

We select the top-5 candidate function scenarios with the highest $p_{\mathrm{scenario}}$ and retain only the functions associated with these scenarios, thereby constructing the compressed candidate function set $\tilde{F}$. This filtering reduces the function space by approximately 68.25%. We present the quantitative analysis of MPP filtering effects in Appendix [B.4](https://arxiv.org/html/2606.03236#A2.SS4).
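The top-5 selection and function-set compression can be sketched as follows. The function name `compress_functions`, the dictionary layout, and the example scenario and function names are all hypothetical; only the top-k-then-filter logic comes from the text.

```python
from typing import Dict, List


def compress_functions(
    p_scenario: Dict[str, float],
    scenario_functions: Dict[str, List[str]],
    k: int = 5,
) -> List[str]:
    """Keep only functions belonging to the k highest-probability scenarios."""
    top_scenarios = sorted(p_scenario, key=p_scenario.get, reverse=True)[:k]
    kept: List[str] = []
    for scenario in top_scenarios:
        for fn in scenario_functions.get(scenario, []):
            if fn not in kept:  # a function may be shared across scenarios
                kept.append(fn)
    return kept


# Hypothetical example with three scenarios and k=2.
probs = {"alarm": 0.9, "music": 0.2, "navigation": 0.7}
functions = {
    "alarm": ["set_alarm", "cancel_alarm"],
    "music": ["play_music"],
    "navigation": ["plan_route", "set_alarm"],
}
compressed = compress_functions(probs, functions, k=2)
```

Here `compressed` keeps the functions of `alarm` and `navigation` and drops `play_music`, so the reasoner's prompt only needs to describe the retained functions.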

### 3.4 Proactive Agent Reasoner

For samples that MPP predicts as requiring recommendations, we input $U$, $D$, $W$, the complete GUI screenshot sequence, and the compressed candidate function set $\tilde{F}$ into PAR for structured reasoning and content generation. Since MPP serves as a lightweight pre-reasoning gate, PAR performs a final consistency verification before generating executable recommendations, mitigating potential gating errors from the upstream perceptor. To better predict users’ intent, we design a multi-step reasoning process.

First, a UI summary is generated from the GUI screenshots to capture their semantics and sequential information. Then, from $U$, $D$, $W$, the UI summary, and $\tilde{F}$, a chain of thought that includes function selection is generated. Finally, PAR outputs the structured proactive recommendation results, including validated intervention decisions, selected function names, and corresponding parameters.

Reasoning Prompt $\mathcal{P}_{r}$: Given the user profile, device status, environmental information, screenshots, predicted scenes, and available functions, analyze the user’s current behavior and determine whether a recommendation is needed.

1. ui_summary: Summarize the current interface, key content, and the user’s operation stage.
2. thinking: Analyze the user’s behavior, intent, task status, and whether a recommendation is needed; if needed, determine the recommendation function and parameters.
3. recommendations: Output the recommendation instruction and function call; if no recommendation is needed, output “No recommendation”.

### 3.5 PRPF Training Objective

Since the two stages play different roles, we train MPP and PAR separately, and use the training set of ProactiveMobile (Kong et al., [2026](https://arxiv.org/html/2606.03236#bib.bib9)) to train the two modules.

MPP is jointly optimized with two losses. Due to class imbalance, non-recommendation samples are oversampled to the same proportion as recommendation samples, and Focal Loss is used (Lin et al., [2017](https://arxiv.org/html/2606.03236#bib.bib60)). Given the ground-truth label $y_{\mathrm{trig}}$ and the predicted trigger probability $p_{\mathrm{trig}}$, where $y_{\mathrm{trig}}=1$ indicates triggering recommendation, the trigger-gate loss is denoted as $\mathcal{L}_{\mathrm{trig}}$:

$$p_{t}=y_{\mathrm{trig}}p_{\mathrm{trig}}+(1-y_{\mathrm{trig}})(1-p_{\mathrm{trig}}),\tag{9}$$

$$\mathcal{L}_{\mathrm{trig}}=-\alpha_{t}(1-p_{t})^{\gamma}\log(p_{t}),\tag{10}$$

where $p_{t}$ denotes the true-class predicted probability, $\gamma$ reduces the weight of well-classified samples, and $\alpha_{t}$ balances different categories.
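Equations (9) and (10) translate directly into a per-sample function. In this sketch, the defaults `alpha=0.25` and `gamma=2.0` are the common values from the focal loss literature, not values reported for PRPF, and the rule that $\alpha_t$ equals `alpha` for positives and `1 - alpha` for negatives is likewise an assumption. The `eps` clamp is added only to keep the logarithm finite.

```python
import math


def focal_loss(
    p_trig: float,
    y_trig: int,
    alpha: float = 0.25,  # assumed class-balance weight
    gamma: float = 2.0,   # assumed focusing parameter
    eps: float = 1e-7,
) -> float:
    """Per-sample focal loss for the binary trigger gate."""
    # Eq. (9): probability assigned to the true class.
    p_t = y_trig * p_trig + (1 - y_trig) * (1.0 - p_trig)
    p_t = min(max(p_t, eps), 1.0)
    # Assumed convention: alpha for the positive class, 1 - alpha otherwise.
    alpha_t = y_trig * alpha + (1 - y_trig) * (1.0 - alpha)
    # Eq. (10): down-weight samples that are already well classified.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

With `gamma=0` the modulating factor is 1 and the expression reduces to class-weighted cross-entropy; increasing `gamma` shrinks the contribution of confident, correct predictions.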

For triggered samples, we additionally compute the function-scenario prediction loss $\mathcal{L}_{\mathrm{scenario}}$ using binary cross-entropy (BCE):

$$\mathcal{L}_{\mathrm{scenario}}=\mathrm{BCE}\left(y_{\mathrm{scenario}},p_{\mathrm{scenario}}\right),\tag{11}$$

where $y_{\mathrm{scenario}}$ denotes the corresponding ground-truth multi-label function-scenario vector, represented in multi-hot form. Finally, the joint training objective of MPP is defined as $\mathcal{L}_{\mathrm{MPP}}$:

$$\mathcal{L}_{\mathrm{MPP}}=\mathcal{L}_{\mathrm{trig}}+\lambda\mathcal{L}_{\mathrm{scenario}},\tag{12}$$

where $\lambda$ is the weight of the function-scenario prediction loss.
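A per-sample sketch of Eqs. (11) and (12) follows. Averaging the BCE over scenarios, the default `lam=1.0`, and the function names `bce` and `mpp_loss` are assumptions made for illustration; the text only fixes the form of the joint objective and that the scenario term applies to triggered samples.

```python
import math
from typing import List


def bce(y: List[int], p: List[float], eps: float = 1e-7) -> float:
    """Mean binary cross-entropy over scenarios for one multi-hot target."""
    total = 0.0
    for y_k, p_k in zip(y, p):
        p_k = min(max(p_k, eps), 1.0 - eps)  # clamp so both logs stay finite
        total += -(y_k * math.log(p_k) + (1 - y_k) * math.log(1.0 - p_k))
    return total / len(y)


def mpp_loss(
    l_trig: float,
    y_trig: int,
    y_scenario: List[int],
    p_scenario: List[float],
    lam: float = 1.0,  # assumed weight; the value used for PRPF is not given here
) -> float:
    """Eq. (12); the scenario term is only counted for triggered samples."""
    l_scenario = bce(y_scenario, p_scenario) if y_trig == 1 else 0.0
    return l_trig + lam * l_scenario
```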

To better train PAR, we further fine-tune Qwen3.5-9B using SFT and GRPO. We construct a total of 8,876 multimodal and text data samples for SFT on Qwen3.5-9B, and we present the detailed construction process in Appendix [A.3](https://arxiv.org/html/2606.03236#A1.SS3). Thus, given the input $x_{i}$, the model generates the structured output $y_{i}$. The two types of data are uniformly trained using the standard autoregressive cross-entropy loss, and the SFT loss is defined as:

$$\mathcal{L}_{\mathrm{SFT}}=-\frac{1}{Z}\sum_{i,j}m_{i,j}\log P_{\theta}\left(y_{i,j}\mid x_{i},y_{i,<j}\right),\tag{13}$$

where $y_{i,j}$ denotes the $j$-th token in the output of the $i$-th sample, and $y_{i,<j}$ denotes all tokens generated before the $j$-th token. $Z$ denotes the total number of valid supervised tokens.
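Equation (13) can be checked numerically with a short sketch. It assumes that $m_{i,j}$ is a 0/1 mask that is 1 exactly on supervised tokens, so that $Z$ is the sum of the mask; that reading is inferred from the definition of $Z$ rather than stated explicitly. The function name `sft_loss` and the input layout are hypothetical.

```python
from typing import List


def sft_loss(token_logprobs: List[List[float]], masks: List[List[int]]) -> float:
    """Masked mean negative log-likelihood over all supervised tokens."""
    total = 0.0
    z = 0  # number of valid supervised tokens
    for logprob_seq, mask_seq in zip(token_logprobs, masks):
        for logprob, m in zip(logprob_seq, mask_seq):
            total += m * logprob
            z += m
    if z == 0:
        raise ValueError("no supervised tokens in batch")
    return -total / z


# Two samples; the second token of the first sample is not supervised.
loss = sft_loss([[-1.0, -2.0], [-3.0]], [[1, 0], [1]])
```

In this example only the log-probabilities −1.0 and −3.0 are counted, so the loss is their negated mean.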

While SFT teaches PAR the structured response format, it may still be unstable on boundary cases and imperfect in function selection or argument grounding. We therefore apply GRPO (Guo et al., [2025](https://arxiv.org/html/2606.03236#bib.bib23)) after SFT to stabilize no-intervention decisions and improve executable assistance generation.

We construct GRPO prompts from rollouts of the SFT-trained PAR, denoted as the reference policy $\pi_{\mathrm{ref}}$, and retain only rollout groups with non-trivial reward variance and intermediate correctness rates, leaving boundary no-intervention cases and intervention cases with disagreement in function selection or argument grounding. For each retained prompt $x_{i}$, GRPO samples $N$ rollouts $\{\hat{y}_{i,j}\}_{j=1}^{N}$ and forms the group-relative advantage

$$\hat{A}_{i,j}=\frac{R(\hat{y}_{i,j},x_{i})-\mu_{i}}{\sigma_{i}+\delta},\tag{14}$$

where $\mu_{i}$ and $\sigma_{i}$ are the mean and standard deviation of the rollout rewards, and $\delta$ is a small numerical constant. We optimize the policy with the clipped GRPO objective and a KL penalty to $\pi_{\mathrm{ref}}$.
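The group-relative advantage in Eq. (14) and the group-retention rule can be sketched as below. Using the population standard deviation, the default `delta`, and the thresholds `min_std`, `lo`, and `hi` are all assumptions; the text only says that groups with non-trivial reward variance and intermediate correctness rates are kept.

```python
import statistics
from typing import List


def group_advantages(rewards: List[float], delta: float = 1e-6) -> List[float]:
    """Eq. (14): standardize each rollout reward within its own group."""
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards)  # population std, an assumed choice
    return [(r - mu) / (sigma + delta) for r in rewards]


def keep_group(
    rewards: List[float],
    correct: List[bool],
    min_std: float = 1e-3,  # hypothetical variance floor
    lo: float = 0.1,        # hypothetical lower bound on correctness rate
    hi: float = 0.9,        # hypothetical upper bound on correctness rate
) -> bool:
    """Retain groups whose rollouts disagree enough to carry a learning signal."""
    rate = sum(correct) / len(correct)
    return statistics.pstdev(rewards) > min_std and lo <= rate <= hi
```

A group in which every rollout earns the same reward yields all-zero advantages and therefore no gradient signal, which is why such groups are filtered out before training.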

For a rollout $\hat{y}$, intervention outcomes are determined by whether the predicted and gold function sequences are empty. We use a structured reward

$$R(\hat{y},x)=R_{\mathrm{acc}}(\hat{y},x)+R_{\mathrm{fmt}}(\hat{y})+R_{\mathrm{cal}}(\hat{y},x),\tag{15}$$

where $R_{\mathrm{acc}}$ scores the intervention decision, function-name sequence matching, and argument grounding, while $R_{\mathrm{fmt}}$ scores compliance with the required output schema. The calibration term $R_{\mathrm{cal}}$ adjusts the reward across intervention outcomes: it anchors correct silence on no-intervention cases, keeps false interventions penalized even when they are well formed, rewards complete gold-function coverage, and discourages redundant function calls. Full reward-component definitions and calibration details are provided in Appendix [A.4](https://arxiv.org/html/2606.03236#A1.SS4.SSS0.Px2).

## 4 Experiments

### 4.1 Experimental Settings

#### Training Details.

We use the ProactiveMobile train set, which contains 8,876 samples (see Appendix [A.2](https://arxiv.org/html/2606.03236#A1.SS2)). The text and image encoders used in PRPF are BGE-small-zh-v1.5 (Luo et al., [2024](https://arxiv.org/html/2606.03236#bib.bib45)) and CLIP ViT-B-32 (Radford et al., [2021](https://arxiv.org/html/2606.03236#bib.bib36)). The base model used by PRPF is Qwen3.5-9B. Specific experimental details are provided in Appendix [A.1](https://arxiv.org/html/2606.03236#A1.SS1).

#### Baselines.

To verify the effectiveness of PRPF, we select three representative types of models for comparison experiments. Closed-source models: GPT-5.5 (OpenAI, [2026](https://arxiv.org/html/2606.03236#bib.bib42)), o3 (OpenAI, [2025](https://arxiv.org/html/2606.03236#bib.bib43)), Gemini-3.1-Pro (Google DeepMind, [2026](https://arxiv.org/html/2606.03236#bib.bib44)), Claude-Opus-4.7 (Anthropic, [2026b](https://arxiv.org/html/2606.03236#bib.bib46)), GLM-4.6V (Hong et al., [2025](https://arxiv.org/html/2606.03236#bib.bib40)), Kimi-K2.5 (Kimi Team et al., [2026](https://arxiv.org/html/2606.03236#bib.bib38)), and MiMo-2.5v (Xiaomi, [2026](https://arxiv.org/html/2606.03236#bib.bib47)). Open-source models: TongUI-7B and Qwen3.5-9B (Zhang et al., [2025a](https://arxiv.org/html/2606.03236#bib.bib41); Bai et al., [2025](https://arxiv.org/html/2606.03236#bib.bib33)). Proactive intelligence models: ProactiveMobile (7B), UI-TARS-7B-DPO+Proactive, and Qwen3.5-9B+Proactive are SFT-trained models using the same dataset and configuration.

| Model | Multimodal Type-Acc ↑ | Multimodal SR ↑ | Multimodal FTR ↓ | Text Type-Acc ↑ | Text SR ↑ | Text FTR ↓ | ALL Type-Acc ↑ | ALL SR ↑ | ALL FTR ↓ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-5.5 | 40.34 | 18.40 | 33.48 | 55.09 | 49.02 | 19.24 | 47.71 | 33.69 | 23.84 |
| o3 | 39.63 | 17.25 | 41.45 | 51.31 | 44.53 | 26.89 | 45.46 | 30.87 | 31.61 |
| Gemini-3.1-Pro | 18.94 | 9.33 | 76.47 | 33.97 | 26.75 | 56.60 | 26.45 | 18.03 | 61.11 |
| Claude-Opus-4.7 | 37.23 | 15.83 | 41.03 | 55.91 | 48.69 | 18.81 | 46.56 | 32.24 | 26.11 |
| GLM-4.6V | 36.25 | 11.74 | 74.45 | 29.98 | 20.19 | 67.04 | 33.12 | 15.96 | 69.33 |
| Kimi-K2.5 | 25.33 | 9.33 | 30.02 | 39.50 | 35.56 | 32.51 | 32.40 | 22.43 | 31.64 |
| MiMo-2.5v | 32.10 | 11.63 | 39.45 | 39.08 | 34.54 | 37.06 | 35.58 | 23.07 | 37.87 |
| TongUI-7B | 1.75 | 0.33 | 15.79 | 8.15 | 8.04 | 18.58 | 4.95 | 4.18 | 18.10 |
| UI-TARS-7B-DPO+Proactive | 30.90 | 8.52 | 63.13 | 14.17 | 9.90 | 64.67 | 22.54 | 9.21 | 63.84 |
| Qwen3.5-9B | 8.30 | 2.95 | 71.52 | 8.15 | 5.69 | 77.13 | 8.22 | 4.32 | 75.30 |
| ProactiveMobile (7B) | 33.68 | 15.61 | 23.91 | 56.84 | 26.04 | 8.51 | 45.25 | 20.82 | 13.76 |
| Qwen3.5-9B+Proactive | 39.52 | 14.85 | 18.54 | 59.52 | 54.32 | 10.63 | 49.51 | 34.56 | 13.49 |
| PRPF (Ours) | 40.83 | 17.19 | 17.99 | 69.20 | 65.15 | 1.75 | 55.00 | 41.15 | 7.21 |

Table 1: Overall performance comparison on the ProactiveMobile test set. We report Type-Acc ↑, SR ↑, and FTR ↓ under multimodal, text, and overall settings. The best and second-best results are highlighted in bold and underlined, respectively. All scores are reported in percentage (%).

#### Metrics.

Following ProactiveMobile (Kong et al., [2026](https://arxiv.org/html/2606.03236#bib.bib9)), we adopt three core metrics: function-name sequence accuracy (Type-Acc), success rate (SR), and false trigger rate (FTR). Type-Acc measures whether the predicted function-name sequence exactly matches the ground truth. SR evaluates the overall correctness of the prediction, while FTR quantifies the proportion of false proactive triggers among samples that do not require recommendation. The detailed calculation formulas are provided in Appendix [B.3](https://arxiv.org/html/2606.03236#A2.SS3).
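Two of the three metrics can be sketched from the verbal definitions above; SR is omitted because its exact criterion is deferred to the appendix. The function names are hypothetical, and averaging Type-Acc over all samples (including no-recommendation ones) is an assumption about the denominator, so the benchmark's official formulas should take precedence.

```python
from typing import List, Tuple

CallSequence = List[Tuple[str, dict]]  # [(function name, arguments), ...]


def false_trigger_rate(preds: List[CallSequence], golds: List[CallSequence]) -> float:
    """Share of no-recommendation samples on which the model still triggered."""
    silent = [p for p, g in zip(preds, golds) if not g]
    if not silent:
        return 0.0
    return sum(1 for p in silent if p) / len(silent)


def type_accuracy(preds: List[CallSequence], golds: List[CallSequence]) -> float:
    """Share of samples whose function-name sequence matches exactly."""
    hits = 0
    for p, g in zip(preds, golds):
        if [name for name, _ in p] == [name for name, _ in g]:
            hits += 1
    return hits / len(golds)


# Hypothetical mini-evaluation with four samples.
preds = [[], [("a", {})], [("a", {"x": 1})], []]
golds = [[], [], [("a", {"x": 2})], [("b", {})]]
ftr = false_trigger_rate(preds, golds)
acc = type_accuracy(preds, golds)
```

In the example, two samples require silence and the model triggers on one of them; the third sample counts as a Type-Acc hit because only function names, not arguments, are compared.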

### 4.2 Overall Performance Analysis

Table [1](https://arxiv.org/html/2606.03236#S4.T1) presents the overall performance comparison between PRPF and various baseline models on the ProactiveMobile test set. From these results, we make the following observations:

1. PRPF achieves the best overall performance. Compared with the strongest fine-tuned baseline Qwen3.5-9B+Proactive, PRPF improves Type-Acc by 5.49 percentage points, improves SR by 6.59 percentage points, and reduces FTR by 6.28 percentage points. This suggests that, compared with standalone end-to-end VLMs, the two-stage PRPF better separates when to recommend from how to recommend, thereby improving recommendation success while suppressing false triggers.

2. FTR is significantly reduced. From the table, we observe that models without task-specific fine-tuning often exhibit high FTR. For example, GPT-5.5 reaches 23.84%. Fine-tuned models can better reduce false triggers, such as ProactiveMobile (7B), which reduces FTR to 13.76%. In contrast, our method directly reduces FTR to 7.21%, and even reduces it to 1.75% in the text setting. The extremely low false trigger rate verifies the effectiveness of the two-stage training and inference of PRPF.
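The margins quoted in the first observation can be rechecked from the ALL columns of Table 1. The short script below only redoes that subtraction; the dictionary layout and key names are arbitrary choices for this example.

```python
# ALL-setting scores copied from Table 1 (percent).
qwen_proactive = {"type_acc": 49.51, "sr": 34.56, "ftr": 13.49}
prpf = {"type_acc": 55.00, "sr": 41.15, "ftr": 7.21}

# Positive values mean PRPF is better: higher Type-Acc and SR, lower FTR.
gains = {
    "type_acc_gain": round(prpf["type_acc"] - qwen_proactive["type_acc"], 2),
    "sr_gain": round(prpf["sr"] - qwen_proactive["sr"], 2),
    "ftr_reduction": round(qwen_proactive["ftr"] - prpf["ftr"], 2),
}
print(gains)
```

The three differences are in percentage points, matching the 5.49, 6.59, and 6.28 figures stated above.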

### 4.3 Ablation Study on Core Components

To further verify the effectiveness of each core component in PRPF, we conduct ablation studies on the same test set under the same settings, covering the following four module groups:

- Overall Structure: w/o MPP denotes removing MPP, where all samples directly enter PAR; w/o PAR denotes using Qwen3.5-9B for direct inference in the *how* stage.
- MPP Structure: w/o Slow Channel denotes removing the slow channel from MPP, and w/o Fast Channel denotes removing the fast channel from MPP.
- MPP Functionality: w/o Compression denotes that MPP only provides the recommendation judgment, while the full function set is input into PAR; w/o Recommend denotes that MPP does not perform recommendation judgment and only provides the function compression results.
- PAR Training Strategy: w/o SFT denotes removing SFT from PAR, and w/o GRPO denotes removing GRPO from PAR.

| Model | Multimodal SR ↑ | Multimodal FTR ↓ | Text SR ↑ | Text FTR ↓ | ALL SR ↑ | ALL FTR ↓ |
| --- | --- | --- | --- | --- | --- | --- |
| Full PRPF (Ours) | 17.19 | 17.99 | 65.15 | 1.75 | 41.15 | 7.21 |
| - w/o MPP | 15.56 | 22.01 | 61.16 | 4.22 | 38.33 | 10.38 |
| - w/o PAR | 6.06 | 60.64 | 60.89 | 3.33 | 33.44 | 18.46 |
| - w/o Slow Channel | 16.59 | 20.88 | 62.91 | 3.44 | 39.73 | 9.38 |
| - w/o Fast Channel | 16.54 | 17.73 | 64.33 | 2.02 | 40.41 | 7.32 |
| - w/o Compression | 16.65 | 19.11 | 64.50 | 2.02 | 40.55 | 7.78 |
| - w/o Recommend | 15.88 | 21.28 | 62.75 | 3.87 | 39.29 | 9.74 |
| - w/o GRPO | 16.05 | 48.63 | 63.35 | 2.81 | 39.67 | 15.56 |
| - w/o SFT | 4.91 | 61.60 | 60.18 | 4.37 | 32.51 | 18.61 |

Table 2: Ablation study of PRPF.

As shown in Table [2](https://arxiv.org/html/2606.03236#S4.T2), MPP and PAR provide complementary benefits, as removing either module decreases ALL SR and increases FTR. Within MPP, the slow channel contributes more than the fast channel, since removing it leads to a larger drop in ALL SR and a greater increase in FTR. For MPP functionality, trigger gating plays a more important role than candidate function compression, as w/o Recommend causes more severe performance degradation than w/o Compression. For PAR training, SFT establishes the basic generation capability, whereas GRPO further improves the decision policy. Removing SFT substantially degrades both ALL SR and FTR, while removing GRPO mainly leads to a notable increase in ALL FTR.

### 4.4 Efficiency Analysis

We evaluate inference efficiency on the ProactiveMobile test set using three metrics: per-sample inference compute (TFLOPs), peak GPU memory (GB), and end-to-end latency (ms). The benchmark follows the ablation settings above, and detailed aggregation rules are provided in Appendix [B.6](https://arxiv.org/html/2606.03236#A2.SS6). Figure [3](https://arxiv.org/html/2606.03236#S4.F3) reports the resulting comparison.

Compared with the single-stage ProactiveMobile (7B) baseline, our PRPF (9B) reduces compute by 69.3% and end-to-end latency by 60.1%, at the cost of a 12.0% increase in peak memory because PRPF runs a stronger 9B PAR backbone on the gate-accepted fraction. Both MPP functions contribute to this saving, but in different ways. The intervention gate removes PAR calls for no-intervention observations, giving w/o Compression a 24.6% compute reduction and 29.8% latency reduction; peak memory rises by 26.7% because accepted contexts still use the full function pool on the 9B engine. By contrast, intent-candidate filtering reduces the candidate function pool from 63 functions to no more than 20 on average, so w/o Recommend lowers the cost of every PAR call, reducing compute by 57.1% and latency by 50.5% with the same 12.0% memory overhead. Thus, in the efficiency benchmark, candidate function-pool reduction has the larger standalone effect, while PRPF (9B) combines it with intervention gating so that heavy VLM-based assistance reasoning is invoked less often and with a shorter function-pool prompt when invoked.
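The way the two MPP functions compound can be pictured with a toy expected-cost model: the gate is paid on every sample, the reasoner only on accepted ones, and function-pool filtering lowers the price of each reasoner call. Every number below is a hypothetical cost unit chosen for illustration, not a measurement from the paper, and `expected_cost` is a helper invented for this sketch.

```python
def expected_cost(accept_rate: float, gate_cost: float, par_cost: float) -> float:
    """Expected per-sample cost of a gated two-stage pipeline:
    the gate always runs, the reasoner runs only on accepted samples."""
    return gate_cost + accept_rate * par_cost


# Hypothetical cost units (NOT measurements from the paper).
PAR_FULL_POOL = 10.0   # one reasoner call with the full function pool
PAR_SMALL_POOL = 4.0   # one reasoner call with a filtered function pool
GATE = 0.1             # one lightweight perceptor call
ACCEPT = 0.6           # fraction of samples the gate lets through

single_stage = expected_cost(1.0, 0.0, PAR_FULL_POOL)       # no perceptor at all
gating_only = expected_cost(ACCEPT, GATE, PAR_FULL_POOL)    # analogue of "w/o Compression"
compression_only = expected_cost(1.0, GATE, PAR_SMALL_POOL) # analogue of "w/o Recommend"
both = expected_cost(ACCEPT, GATE, PAR_SMALL_POOL)          # gating plus filtering

for name, cost in [("single-stage", single_stage), ("gating only", gating_only),
                   ("compression only", compression_only), ("both", both)]:
    print(f"{name:>16}: {cost:.2f}")
```

With these made-up constants the ordering mirrors the qualitative pattern described above: filtering alone saves more than gating alone, and combining the two saves the most.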

![Refer to caption](https://arxiv.org/html/2606.03236v1/Figure/Figure2_efficiency.png)

Figure 3: Inference efficiency comparison on ProactiveMobile. Lower values indicate better efficiency.

### 4.5 Analysis of MPP

We select three groups of comparison experiments to verify the lightweight and plug-and-play characteristics of the perceptor. The three groups are ProactiveMobile (7B) vs. ProactiveMobile (7B) + MPP, GLM-4.6V vs. GLM-4.6V + MPP, and Qwen3.5-9B (SFT) Perceptor + PAR vs. PRPF. Here, Qwen3.5-9B (SFT) Perceptor denotes Qwen3.5-9B trained with the same SFT method as ProactiveMobile (7B) and used as the perceptor for filtering and function selection. The results are shown in Table [3](https://arxiv.org/html/2606.03236#S4.T3).

| Model | Multimodal SR ↑ | Multimodal FTR ↓ | Text SR ↑ | Text FTR ↓ | ALL SR ↑ | ALL FTR ↓ |
| --- | --- | --- | --- | --- | --- | --- |
| ProactiveMobile (7B) | 15.61 | 23.91 | 26.04 | 8.51 | 20.82 | 13.76 |
| ProactiveMobile (7B) + MPP | 12.61 | 21.95 | 62.20 | 1.60 | 37.38 | 8.08 |
| GLM-4.6V | 11.74 | 74.45 | 20.19 | 67.04 | 15.96 | 69.33 |
| GLM-4.6V + MPP | 13.92 | 68.75 | 62.91 | 4.15 | 38.39 | 19.97 |
| Qwen3.5-9B (SFT) + PAR | 17.08 | 19.55 | 62.58 | 3.55 | 39.80 | 8.98 |
| Full PRPF (Ours) | 17.19 | 17.99 | 65.15 | 1.75 | 41.15 | 7.21 |

Table 3: Plug-and-play effectiveness of MPP on PRPF.

The first two groups of comparisons demonstrate the plug-and-play property of MPP, showing that MPP becomes effective once integrated into different models. The third group verifies that MPP achieves better results than the fine-tuned Qwen3.5-9B while using far fewer parameters (0.1B vs. 9B).

### 4.6 Case Study

To localize PRPF’s failures, we partition every test sample into one of five mutually exclusive outcomes under SR scoring and report the per-modality breakdown in Figure [4](https://arxiv.org/html/2606.03236#S4.F4). PRPF achieves SR = 1 on 65.2% of Text samples but only 17.2% of Multimodal samples; the multimodal gap is mainly caused by *refusal* (34.8%), where the system stays silent on samples that require recommendation, and *function error* (24.1%), where PAR selects an incorrect intent. A detailed qualitative case and a finer error decomposition (separating gate-side from PAR-side refusal, plus six sub-patterns of non-empty mismatch) are deferred to Appendix [C](https://arxiv.org/html/2606.03236#A3).

![Refer to caption](https://arxiv.org/html/2606.03236v1/Figure/Figure4_casestudy.png)

Figure 4: Per-modality breakdown of PRPF outcomes on ProactiveMobile under SR scoring.

## 5 Conclusion

In this paper, we introduced PRPF, a Pre-Reasoning Perception Framework for mobile proactive intelligence. PRPF decouples when to recommend from how to recommend by using MPP for lightweight pre-reasoning perception and PAR for focused recommendation reasoning. On ProactiveMobile, PRPF achieves 41.15% overall success rate and 7.21% false trigger rate. It also reduces inference compute by 69.3% and end-to-end latency by 60.1% over the ProactiveMobile (7B) baseline, showing the lightweight efficiency of the proposed framework. The consistent gains of MPP across different reasoners further demonstrate its plug-and-play capability. These results highlight the value of separating lightweight intervention perception from expensive recommendation reasoning in proactive mobile agents.

## Limitations

MPP is trained on the 14 high-level intent scenarios defined in the ProactiveMobile benchmark. While this design suffices for the current function pool, extending PRPF to new domains or a significantly larger API space would require re-training (or at least fine-tuning) the perceptor to learn new intent-scenario distributions and update the Top-$K$ filtering vocabulary. Future work could explore continual-learning or prompt-based adapters to mitigate this re-training cost.

Multimodal understanding remains a bottleneck\. Despite the overall improvement, absolute success rates on multimodal tasks are still modest \(17\.19% SR\), and even the strongest baselines struggle to exceed 18% SR on ProactiveMobile’s GUI screenshot setting\. This gap suggests that current vision\-language models still lack the fine\-grained interface understanding required for ProactiveMobile assistance\. Closing this gap likely demands stronger visual grounding, higher\-resolution GUI encoders, or larger\-scale multimodal pre\-training beyond the scope of this work\.

## Ethics Statement

Proactive mobile agents inherently require access to sensitive on-device signals, including GUI screenshots, user profiles, and interaction histories, which raises significant privacy concerns. We emphasize that all training and evaluation data used in this work are derived from the publicly available ProactiveMobile benchmark; no private user data or real-device traces were collected or used. For real-world deployment, PRPF must enforce explicit user consent and strict adherence to the principle of least privilege to prevent unauthorized surveillance or data leakage. Beyond initial consent, users should retain granular control over which on-device signals the agent may access, with the ability to audit and modify these permissions at any time. An opt-out mechanism must allow users to disable proactive interventions globally or per-application without degrading passive query functionality. Furthermore, intervention personalization should enable users to customize the frequency, timing, and sensitivity of proactive triggers (for example, suppressing suggestions during specific activities or setting daily intervention limits), thereby preventing behavioral over-inference and ensuring the agent respects individual boundaries.

## References

- Anthropic (2026a). Claude sonnet 4.6. Note: [https://www.anthropic.com/news/claude-sonnet-4-6](https://www.anthropic.com/news/claude-sonnet-4-6). Accessed: 2026-05-25.
- Anthropic (2026b). Introducing Claude Opus 4.7. Technical report, Anthropic. External Links: [Link](https://www.anthropic.com/news/claude-opus-4-7).
- S. Bai, Y. Cai, R. Chen, K. Chen, X. Chen, Z. Cheng, L. Deng, W. Ding, C. Gao, C. Ge, et al. (2025). Qwen3-vl technical report. arXiv preprint arXiv:2511.21631.
- Y. Chai, S. Tang, H. Xiao, R. Liu, and H. Li (2026). PIRA-bench: a transition from reactive gui agents to gui-based proactive intent recommendation agents. arXiv preprint arXiv:2603.08013.
- K. Cheng, Q. Sun, Y. Chu, F. Xu, L. YanTao, J. Zhang, and Z. Wu (2024). Seeclick: harnessing gui grounding for advanced visual gui agents. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 9313–9332.
- G. Comanici, E. Bieber, M. Schaekermann, I. Pasupat, N. Sachdeva, I. Dhillon, M. Blistein, O. Ram, D. Zhang, E. Rosen, et al. (2025). Gemini 2.5: pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261.
- S. Deng, W. Xu, H. Sun, W. Liu, T. Tan, L. Liujianfeng, A. Li, J. Luan, B. Wang, R. Yan, et al. (2024). Mobile-bench: an evaluation benchmark for llm-based mobile agents. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 8813–8831.
- T. Fawcett (2006). An introduction to roc analysis. Pattern recognition letters 27(8), pp. 861–874.
- Y. Fu, X. Tan, T. Hao, C. Zhan, and X. Qiu (2026). PRISM: festina lente proactivity–risk-sensitive, uncertainty-aware deliberation for proactive agents. arXiv preprint arXiv:2602.01532.
- B. Gao, Z. He, P. Sharma, Q. Kang, D. Jevdjic, J. Deng, X. Yang, Z. Yu, and P. Zuo (2024). Cost-efficient large language model serving for multi-turn conversations with cachedattention. In USENIX Annual Technical Conference. External Links: [Link](https://api.semanticscholar.org/CorpusID:268793498).
- Google DeepMind (2026). Gemini 3.1 pro model card. Technical report, Google DeepMind. External Links: [Link](https://deepmind.google/models/model-cards/gemini-3-1-pro/).
- D. Guo, D. Yang, H. Zhang, J. Song, P. Wang, Q. Zhu, R. Xu, R. Zhang, S. Ma, X. Bi, et al. (2025). DeepSeek-r1 incentivizes reasoning in llms through reinforcement learning. Nature 645(8081), pp. 633–638.
- W. Hong, W. Yu, X. Gu, G. Wang, G. Gan, H. Tang, J. Cheng, J. Qi, J. Ji, L. Pan, et al. (2025). Glm-4.5 v and glm-4.1 v-thinking: towards versatile multimodal reasoning with scalable reinforcement learning. arXiv preprint arXiv:2507.01006.
- A. Hurst, A. Lerer, A. P. Goucher, A. Perelman, A. Ramesh, A. Clark, A. Ostrow, A. Welihinda, A. Hayes, A. Radford, et al. (2024). Gpt-4o system card. arXiv preprint arXiv:2410.21276.
- Kimi Team, T. Bai, Y. Bai, Y. Bao, S. H. Cai, Y. Cao, Y. Charles, H. S. Che, et al. (2026). Kimi k2.5: visual agentic intelligence. External Links: 2602.02276, [Link](https://arxiv.org/abs/2602.02276).
- D. Kong, Z. Feng, Q. Liang, H. Wang, H. Sun, C. Yang, Y. Li, P. Zhou, S. Nie, H. Wang, L. Zhou, H. Jia, J. Xu, R. Shi, and Y. Huang (2026). ProactiveMobile: a comprehensive benchmark for boosting proactive intelligence on mobile devices. External Links: 2602.21858, [Link](https://arxiv.org/abs/2602.21858).
- T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár (2017). Focal loss for dense object detection. In 2017 IEEE International Conference on Computer Vision (ICCV). External Links: [Document](https://dx.doi.org/10.1109/ICCV.2017.324).
- F. Liu et al. (2026). Help without being asked: a deployed proactive agent system for on-call support with continuous self-improvement. In Companion Proceedings of the 34th ACM Symposium on the Foundations of Software Engineering.
- Y. Liu, H. Shen, L. Yu, S. Liu, Z. Chen, and Y. Zhou (2026). DRS-gui: dynamic region search for training-free gui grounding. arXiv preprint arXiv:2605.15542.
- Y. Lu, S. Yang, C. Qian, G. Chen, Q. Luo, Y. Wu, H. Wang, X. Cong, Z. Zhang, Y. Lin, et al. (2025). Proactive agent: shifting llm agents from reactive responses to active assistance. In International Conference on Learning Representations, Vol. 2025, pp. 47431–47457.
- K. Luo, Z. Liu, S. Xiao, and K. Liu (2024). Bge landmark embedding: a chunking-free embedding method for retrieval augmented long-context large language models. arXiv preprint arXiv:2402.11573.
- S. Mehrotra, S. V. Rebbapragada, M. H. R. Bonthu, and V. N. Balasubramanian (2025). ISHIFT: lightweight slow-fast gui agent with adaptive perception. arXiv preprint arXiv:2512.22009.
- D. Nathani, C. Zhang, C. Huan, J. Shan, Y. Yang, A. Patel, Z. Gan, W. Y. Wang, M. Saxon, and X. E. Wang (2026). Proactive agent research environment: simulating active users to evaluate proactive assistants. External Links: 2604.00842, [Link](https://arxiv.org/abs/2604.00842).
- I. Ong, A. Almahairi, V. Wu, W. Chiang, T. Wu, J. E. Gonzalez, M. W. Kadous, and I. Stoica (2024). Routellm: learning to route llms with preference data. arXiv preprint arXiv:2406.18665.
- OpenAI (2025). OpenAI o3 and o4-mini system card. Technical report, OpenAI. External Links: [Link](https://cdn.openai.com/pdf/2221c875-02dc-4789-800b-e7758f3722c1/o3-and-o4-mini-system-card.pdf).
- OpenAI (2026). GPT-5.5 system card. Technical report, OpenAI. External Links: [Link](https://openai.com/index/gpt-5-5-system-card/).
- Y. Qin, Y. Ye, J. Fang, H. Wang, S. Liang, S. Tian, J. Zhang, et al. (2025). UI-tars: pioneering automated gui interaction with native agents. External Links: 2501.12326, [Link](https://arxiv.org/abs/2501.12326).
- A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. (2021). Learning transferable visual models from natural language supervision. In International conference on machine learning, pp. 8748–8763.
- C. Rawles, S. Clinckemaillie, Y. Chang, J. Waltz, G. Lau, M. Fair, A. Li, W. Bishop, W. Li, F. Campbell-Ajala, et al. (2025). Androidworld: a dynamic benchmarking environment for autonomous agents. In International Conference on Learning Representations, Vol. 2025, pp. 406–441.
- F. Tang, Y. Shen, H. Zhang, S. Chen, G. Hou, W. Zhang, W. Zhang, K. Song, W. Lu, and Y. Zhuang (2025). Think twice, click once: enhancing gui grounding via fast and slow systems. arXiv preprint arXiv:2503.06470.
- Y. Tang, H. Tang, T. Cao, L. Nguyen, A. Zhang, X. Cao, C. Liu, W. Ding, and Y. Li (2026). ProAgentBench: evaluating llm agents for proactive assistance with real-world data. External Links: 2602.04482, [Link](https://arxiv.org/abs/2602.04482).
- A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017). Attention is all you need. In Neural Information Processing Systems. External Links: [Link](https://api.semanticscholar.org/CorpusID:13756489).
- J. Wang, H. Xu, J. Ye, M. Yan, W. Shen, J. Zhang, F. Huang, and J. Sang (2024). Mobile-agent: autonomous multi-modal mobile device agent with visual perception. arXiv preprint arXiv:2401.16158.
- Z. Wu, P. Cheng, Z. Wu, T. Ju, Z. Zhang, and G. Liu (2025). Smoothing grounding and reasoning for mllm-powered gui agents with query-oriented pivot tasks. arXiv preprint arXiv:2503.00401.
- Xiaomi (2026). MiMo-V2.5-Pro. Technical report, Xiaomi. External Links: [Link](https://mimo.xiaomi.com/mimo-v2-5-pro/).
- Z. Xie, Z. Hu, F. Ye, X. Zhang, H. Chai, Z. Liu, P. Wu, G. Zhang, Y. Liao, X. Hu, et al. (2026). PASK: toward intent-aware proactive agents with long-term memory. arXiv preprint arXiv:2604.08000.
- Z. Xu, K. D. Nguyen, P. Mukherjee, S. Bagchi, S. Chaterji, Y. Liang, and Y. Li (2025). Learning to inference adaptively for multimodal large language models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3552–3563.
- B. Yang, L. Xu, L. Zeng, Y. Guo, S. Jiang, W. Lu, K. Liu, Y. Li, X. Jiang, G. Xing, and Z. Yan (2026a). ProAgent: harnessing on-demand sensory contexts for proactive llm agent systems in the wild. External Links: 2512.06721, [Link](https://arxiv.org/abs/2512.06721).
- B. Yang, L. Xu, L. Zeng, K. Liu, S. Jiang, W. Lu, H. Chen, X. Jiang, G. Xing, and Z. Yan (2026b). Contextagent: context-aware proactive llm agents with open-world sensory perceptions. Advances in Neural Information Processing Systems 38, pp. 167509–167543.
- B. Zhang, Z. Shang, Z. Gao, W. Zhang, R. Xie, X. Ma, T. Yuan, X. Wu, S. Zhu, and Q. Li (2025a). Tongui: building generalized gui agents by learning from multimodal web tutorials. arXiv e-prints, pp. arXiv–2504.
- C. Zhang, Z. Yang, J. Liu, Y. Li, Y. Han, X. Chen, Z. Huang, B. Fu, and G. Yu (2025b). Appagent: multimodal agents as smartphone users. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, pp. 1–20.

## Appendix A Experimental Settings

### A.1 Implementation Details

The training hyperparameters of MPP, PAR supervised fine-tuning (SFT), and GRPO optimization are summarized in Table [4](https://arxiv.org/html/2606.03236#A1.T4). All experiments are evaluated under the same setting, with the prompts, output format, and the list of available functions in the API pool kept consistent. PRPF is trained on NVIDIA H20 (96 GB HBM3) GPUs, consuming approximately 2,817 GPU-hours in total.

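| Stage | Parameter | Value |
| --- | --- | --- |
| MPP | Optimizer | AdamW |
| MPP | Learning Rate | $1\times 10^{-4}$ |
| MPP | Trigger-Gating Threshold | 0.78 |
| MPP | Batch Size | 8 |
| MPP | Maximum Epochs | 50 |
| MPP | Early-Stopping Patience | 10 |
| MPP | Hidden Dimension | 256 |
| MPP | Scenario Loss Weight | $\lambda=0.5$ |
| PAR-SFT | Optimizer | AdamW |
| PAR-SFT | Learning Rate | $1\times 10^{-5}$ |
| PAR-SFT | LR Schedule | Cosine annealing |
| PAR-SFT | Warmup Ratio | 0.1 |
| PAR-SFT | Training Epochs | 5 |
| PAR-SFT | Batch Size | 16 |
| PAR-SFT | Gradient Accumulation Steps | 2 |
| PAR-GRPO | Optimizer | AdamW |
| PAR-GRPO | Learning Rate | $5\times 10^{-7}$ |
| PAR-GRPO | KL Coefficient | 0.05 |
| PAR-GRPO | Rollouts per Prompt | $N=8$ |
| PAR-GRPO | Batch Size | 32 |
| PAR-GRPO | Training Epochs | 1 |

Table 4: Training hyperparameters for MPP, PAR supervised fine-tuning, and GRPO optimization.

### A.2 Datasets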

ProactiveMobile (Kong et al., [2026](https://arxiv.org/html/2606.03236#bib.bib9)) is a comprehensive benchmark dataset for mobile proactive intelligence. It is designed to evaluate whether a model can proactively infer users’ latent needs from contextual information on mobile devices before users explicitly issue requests, and generate executable function-call sequences.

The dataset is constructed around four types of mobile contexts: user profiles, device states, world information, and behavioral trajectories. Specifically, user profiles describe users’ long-term habits, preferences, and basic attributes; device states reflect the current device and environmental status; world information provides external background such as weather, time, and holidays; and behavioral trajectories characterize the continuous interaction process between users and mobile devices. The dataset also defines a function pool containing 63 composite API functions. In terms of data format, ProactiveMobile contains both text scenarios and multimodal scenarios, where text scenarios use textual descriptions of behavioral trajectories, while multimodal scenarios use consecutive GUI screenshot sequences. Detailed dataset statistics are reported in Table [5](https://arxiv.org/html/2606.03236#A1.T5). In addition, considering the diversity of user needs in real-world mobile scenarios, ProactiveMobile adopts a multi-answer annotation mechanism, where the same context may have multiple reasonable intent–function sequence annotations. Following the same protocol, this paper selects the function sequence with the highest F1 score.

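| Split | Data Type | Items | Intents | Images | Functions |
| --- | --- | --- | --- | --- | --- |
| Train | Multimodal | 4438 | 8977 | 32418 | 9964 |
| Train | Text | 4438 | 4438 | - | 8259 |
| Test | Multimodal | 1832 | 3711 | 14341 | 4173 |
| Test | Text | 1828 | 2676 | - | 2266 |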

Table 5: Statistics of the ProactiveMobile dataset, broken down by Train and Test splits and data modality. The table summarizes the composition of the ProactiveMobile benchmark, including the number of items, intents, UI images, and total functions.

### A.3 SFT Data Construction

We construct two types of SFT data according to the sample modality: multimodal SFT data and text SFT data\.

#### Multimodal SFT data construction\.

We decompose the recommendation generation process into four types of supervision signals: interface-state understanding, function-selection constraints, reasoning-process construction, and final recommendation generation. For recommendation samples, we use Claude Sonnet 4.6 for generation (Anthropic, [2026a](https://arxiv.org/html/2606.03236#bib.bib51)). Specifically, based on the GUI screenshot sequence, we first generate a UI Summary, which contains the current page stage, key interaction regions, and state changes. Then, combined with the user profile, device state, world information, and candidate functions, we generate the Thinking process containing Function Selection. The Function Selection includes candidate function-pool analysis, function-matching judgment, final function selection, and argument-source explanation. In this way, the model learns not only “what to recommend”, but also “based on what context to recommend”, “why to recommend at this moment”, and “how to select executable functions from the candidate function pool” (Yang et al., [2026b](https://arxiv.org/html/2606.03236#bib.bib12); Kong et al., [2026](https://arxiv.org/html/2606.03236#bib.bib9)). For non-recommendation samples, we do not construct the Thinking process and directly output the result.

#### Text SFT data construction\.

We decompose the recommendation generation process into three types of supervision signals: function\-selection constraints, reasoning\-process construction, and final recommendation generation\. Except for the generation of UI Summary, all generation steps are consistent with those of multimodal SFT data\.

### A.4 GRPO Optimization Details

We use GRPO (Guo et al., [2025](https://arxiv.org/html/2606.03236#bib.bib23)) as a post-SFT optimization stage for PAR. This appendix summarizes the rollout filtering, objective, and reward used for the GRPO stage.

#### Data construction\.

GRPO prompts are built from SFT-format prompts and rollouts of the SFT-trained PAR, which is used as the reference policy $\pi_{\mathrm{ref}}$. Let $\mathcal{D}_{\mathrm{roll}}$ denote the rollout set, where each case contains $N=8$ sampled outputs and rule-based reward scores. We retain only rollout groups with non-trivial disagreement:

$$\operatorname{std}(\{R_{i,j}\}_{j=1}^{N})>0.1,\qquad 0.1\leq\operatorname{CR}_{i}\leq 0.9. \tag{16}$$

Here $\operatorname{CR}_{i}$ is the rollout-level correctness rate under a case-specific intervention criterion. For intervention cases, correctness requires exceeding the TP-quality threshold used by the reward implementation, thereby filtering out mere triggering without sufficient function-name or argument correctness. For no-intervention cases, correctness means predicting an empty function-name sequence. Retained cases are weighted to emphasize function-correct but argument-weak cases and false-positive-prone no-intervention cases, then resized to $5{,}000$ examples with at least $30\%$ no-intervention records. During resizing, intervention records in the high-variance disagreement region, $\operatorname{std}_{i}>0.3$ and $0.1<\operatorname{CR}_{i}<0.7$, are preferentially preserved, followed by a $95{:}5$ stratified train/test split.
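The retention rule in Eq. (16) and the priority region used during resizing can be sketched as two small predicates. This is a minimal illustration, not the authors' pipeline: the use of the population standard deviation, the function names, and the toy reward values are assumptions, and the per-rollout correctness flags are taken as given rather than derived from the TP-quality threshold.

```python
from statistics import pstdev
from typing import List


def correctness_rate(correct: List[bool]) -> float:
    """Fraction of rollouts in a group judged correct (CR_i)."""
    return sum(correct) / len(correct)


def keep_rollout_group(rewards: List[float], correct: List[bool]) -> bool:
    """Eq. (16): keep a group only if its rewards disagree (std > 0.1)
    and its correctness rate lies in [0.1, 0.9]."""
    # Assumption: population standard deviation; the text does not say which one.
    return pstdev(rewards) > 0.1 and 0.1 <= correctness_rate(correct) <= 0.9


def in_priority_region(rewards: List[float], correct: List[bool]) -> bool:
    """High-variance disagreement region: std > 0.3 and 0.1 < CR < 0.7."""
    return pstdev(rewards) > 0.3 and 0.1 < correctness_rate(correct) < 0.7


# Hypothetical groups of N = 8 rollouts each.
mixed_rewards = [6.0, 0.5, 6.0, 0.5, 6.0, 0.5, 0.5, 0.5]
mixed_correct = [True, False, True, False, True, False, False, False]
unanimous_rewards = [6.0] * 8
unanimous_correct = [True] * 8

print(keep_rollout_group(mixed_rewards, mixed_correct))          # disagreement: kept
print(keep_rollout_group(unanimous_rewards, unanimous_correct))  # no signal: dropped
```

Groups in which every rollout receives the same reward carry no relative signal for a group-normalized objective, which is why the unanimous group is filtered out.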

#### Reward\.

Let $S(y)$ denote the function-name sequence extracted from output $y$. For the supervised target $y_{i}$, $S_{i}^{\star}=S(y_{i})$ is the gold function-name sequence in the SFT response; for rollout $\hat{y}_{i,j}$, $\hat{S}_{i,j}=S(\hat{y}_{i,j})$ is the predicted sequence. When multiple gold recommendations are available, $S^{\star}$ denotes the best-matched gold sequence selected by exact sequence accuracy and then function-name F1. In the reward formulas below, we omit instance and rollout indices and write $\hat{S}$ and $S^{\star}$ for readability.
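The two-level selection of the best-matched gold sequence (exact match first, then function-name F1) can be expressed as a lexicographic key. The sketch below is one possible reading: computing F1 over sets of names, the helper names, and the toy function names are assumptions made for illustration.

```python
from typing import List


def name_f1(pred: List[str], gold: List[str]) -> float:
    """F1 between predicted and gold function names, treated as sets
    (an assumption; a multiset variant is equally plausible)."""
    p, g = set(pred), set(gold)
    if not p and not g:
        return 1.0
    overlap = len(p & g)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)


def best_gold(pred: List[str], golds: List[List[str]]) -> List[str]:
    """Pick the gold sequence that matches exactly if one exists,
    otherwise the one with the highest name-level F1."""
    return max(golds, key=lambda g: (pred == g, name_f1(pred, g)))


# Hypothetical multi-answer annotation for one context.
pred = ["open_app", "send_message"]
golds = [["open_app"], ["send_message", "open_app"], ["set_alarm"]]
print(best_gold(pred, golds))

# When an exact match is available it wins even if another gold ties on F1.
golds_with_exact = golds + [["open_app", "send_message"]]
print(best_gold(pred, golds_with_exact))
```

Because Python compares tuples element by element, the boolean exact-match flag dominates and F1 only breaks ties among candidates with the same flag.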

The four intervention outcomes are defined by whether $\hat{S}$ and $S^{\star}$ are empty:

$$\begin{aligned} \mathbf{1}_{\mathrm{TP}}&=\mathbf{1}[|\hat{S}|>0\land|S^{\star}|>0],&\mathbf{1}_{\mathrm{TN}}&=\mathbf{1}[|\hat{S}|=0\land|S^{\star}|=0],\end{aligned} \tag{17}$$

$$\begin{aligned} \mathbf{1}_{\mathrm{FP}}&=\mathbf{1}[|\hat{S}|>0\land|S^{\star}|=0],&\mathbf{1}_{\mathrm{FN}}&=\mathbf{1}[|\hat{S}|=0\land|S^{\star}|>0].\end{aligned} \tag{18}$$
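Eqs. (17) and (18) depend only on whether each sequence is empty, which a few lines of code make explicit. The function name and the toy function names below are illustrative choices, not identifiers from the paper.

```python
from typing import List


def intervention_outcome(pred_names: List[str], gold_names: List[str]) -> str:
    """Classify a rollout into TP / TN / FP / FN from emptiness alone,
    following Eqs. (17)-(18)."""
    pred_fires = len(pred_names) > 0
    gold_fires = len(gold_names) > 0
    if pred_fires and gold_fires:
        return "TP"
    if not pred_fires and not gold_fires:
        return "TN"
    if pred_fires:
        return "FP"
    return "FN"


# Hypothetical examples.
print(intervention_outcome(["set_alarm"], ["open_app"]))  # both non-empty
print(intervention_outcome([], []))                       # correct silence
print(intervention_outcome(["set_alarm"], []))            # false trigger
print(intervention_outcome([], ["open_app"]))             # missed recommendation
```

Note that the first example counts as TP even though the predicted function is wrong: under these definitions TP only records that both sides intervene, and function or argument quality is scored separately by the reward terms gated on the TP indicator.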
The rollout reward is decomposed as

$$R(\hat{y},x)=R_{\mathrm{acc}}(\hat{y},x)+R_{\mathrm{fmt}}(\hat{y})+R_{\mathrm{cal}}(\hat{y},x), \tag{19}$$

$$R_{\mathrm{acc}}=w_{T}R_{\mathrm{trig}}+(w_{F}R_{\mathrm{func}}+w_{A}R_{\mathrm{args}})\mathbf{1}_{\mathrm{TP}}, \tag{20}$$

$$R_{\mathrm{fmt}}=w_{M}R_{\mathrm{schema}}, \tag{21}$$

$$R_{\mathrm{cal}}=b_{\mathrm{TN}}\mathbf{1}_{\mathrm{TN}}+(R_{\mathrm{fp}}+p_{\mathrm{FP}})\mathbf{1}_{\mathrm{FP}}+(b_{\mathrm{cov}}+p_{\mathrm{over}})\mathbf{1}_{\mathrm{TP}}. \tag{22}$$
$R_{\mathrm{trig}}$ assigns quadrant-level intervention credit to TP/TN/FP/FN outcomes. $R_{\mathrm{func}}$ combines function-name set-F1 and longest-common-subsequence order consistency. $R_{\mathrm{args}}$ combines must-fill argument completeness and typed value similarity. $R_{\mathrm{schema}}$ scores output schema compliance, including tag order, function-field presence, JSON validity, and function-call structure. The calibration terms reward correct silence, keep false interventions net-penalized, reward full gold-function coverage, and penalize redundant function calls. Unparseable outputs receive only schema credit.

We set $w_{T}=2.0$, $w_{F}=3.0$, $w_{A}=2.5$, and $w_{M}=0.5$, with $b_{\mathrm{TN}}=6.0$, $p_{\mathrm{FP}}=-1.5$, $b_{\mathrm{cov}}=0.2$, and $\lambda_{o}=0.6$ for the over-prediction penalty. These constants are calibrated on development rollouts to preserve the intended ordering: malformed outputs score lowest, followed by valid no-intervention outputs, then partially correct TP outputs, and finally fully correct executable recommendations. Our empirical ablations focus on the MPP/PAR architecture rather than treating reward constants as a separate contribution.
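The composition in Eqs. (19) to (22) can be sketched as follows. The weights are the constants reported above; the component scores ($R_{\mathrm{trig}}$, $R_{\mathrm{func}}$, $R_{\mathrm{args}}$, $R_{\mathrm{schema}}$, $R_{\mathrm{fp}}$, $p_{\mathrm{over}}$) are passed in as plain numbers because their exact functional forms are not given in this appendix. The names `Components` and `rollout_reward` are hypothetical, and the sketch applies Eq. (22) literally.

```python
from dataclasses import dataclass

# Constants reported in the text.
W_T, W_F, W_A, W_M = 2.0, 3.0, 2.5, 0.5
B_TN, P_FP, B_COV = 6.0, -1.5, 0.2


@dataclass
class Components:
    """Per-rollout component scores; how each is computed is NOT specified here."""
    r_trig: float
    r_func: float
    r_args: float
    r_schema: float
    r_fp: float = 0.0    # assumption: supplied by the caller
    p_over: float = 0.0  # assumption: over-prediction penalty, supplied by the caller


def rollout_reward(outcome: str, c: Components) -> float:
    """Combine component scores following Eqs. (19)-(22)."""
    is_tp = outcome == "TP"
    r_acc = W_T * c.r_trig
    if is_tp:
        r_acc += W_F * c.r_func + W_A * c.r_args
    r_fmt = W_M * c.r_schema
    if outcome == "TN":
        r_cal = B_TN
    elif outcome == "FP":
        r_cal = c.r_fp + P_FP
    elif is_tp:
        r_cal = B_COV + c.p_over
    else:  # FN carries no calibration term in Eq. (22)
        r_cal = 0.0
    return r_acc + r_fmt + r_cal
```

Keeping the component scores as inputs makes the sketch usable for checking the ordering of rewards across outcome types once concrete scoring functions are plugged in.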

## Appendix B Additional Experimental Results

### B.1 MPP Hyperparameter Analysis

The main hyperparameter affecting the effectiveness of MPP is the threshold $\tau$ of the trigger-gating head, which directly determines how many samples are filtered. Since the goal of MPP is to filter out a large number of non-recommendation samples while retaining as many recommendation samples as possible, we use two evaluation metrics: Recall and Specificity. Recall measures the proportion of recommendation samples that are correctly retained (avoiding missed recommendations), while Specificity measures the proportion of non-recommendation samples that are correctly filtered (suppressing false triggers):

$$\mathrm{Recall}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}},\qquad \mathrm{Specificity}=\frac{\mathrm{TN}}{\mathrm{TN}+\mathrm{FP}}. \tag{23}$$
We split the original training set into train/dev subsets with an 8:2 ratio. Threshold selection is performed exclusively on the dev split, while the test set is used only once for final evaluation. Following the standard idea of ROC operating-point selection (Fawcett, [2006](https://arxiv.org/html/2606.03236#bib.bib58)), we formulate threshold selection as a constrained optimization problem: under the constraint $\mathrm{Recall}\geq 0.90$, i.e., the missed-recommendation rate does not exceed $10\%$, we maximize Specificity. As shown in Figure [5](https://arxiv.org/html/2606.03236#A2.F5), we perform a grid search on the dev split over thresholds $\tau\in[0.05,0.95]$ with a step size of $0.01$. As $\tau$ increases, Recall decreases monotonically while Specificity increases monotonically, forming a trade-off curve between the two metrics. We select $\tau=0.78$ as the deployment threshold, achieving $\mathrm{Recall}=90.42\%$ and $\mathrm{Specificity}=76.92\%$ on the dev split. This means that the gate filters out approximately $77\%$ of non-recommendation samples while retaining more than $90\%$ of samples that truly require recommendations. Among the thresholds satisfying the recall constraint, $\tau=0.78$ provides the strongest filtering capability on the dev split.
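The constrained selection described above can be sketched as a plain grid search. This is an illustration under stated assumptions: a sample is treated as triggered when its gate score is at least $\tau$, ties in Specificity keep the lowest threshold, and the function name `select_threshold` is hypothetical.

```python
def select_threshold(scores, labels, min_recall=0.90, lo=0.05, hi=0.95, step=0.01):
    """Return (tau, recall, specificity) maximizing specificity s.t. recall >= min_recall.

    scores: gate scores in [0, 1]; labels: 1 = recommendation, 0 = non-recommendation.
    Returns None if no threshold on the grid satisfies the recall constraint.
    """
    n_steps = round((hi - lo) / step)
    best = None
    for k in range(n_steps + 1):
        tau = round(lo + k * step, 2)
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= tau)
        fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < tau)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < tau)
        fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= tau)
        recall = tp / (tp + fn) if tp + fn else 0.0
        specificity = tn / (tn + fp) if tn + fp else 0.0
        # Strict ">" keeps the lowest tau among equally specific candidates.
        if recall >= min_recall and (best is None or specificity > best[2]):
            best = (tau, recall, specificity)
    return best
```

Because Recall falls and Specificity rises as $\tau$ grows, the selected point sits at the edge of the feasible region, which is why the recall constraint rather than the grid resolution determines the final threshold.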

![Refer to caption](https://arxiv.org/html/2606.03236v1/Figure/threshold_curve.png)

Figure 5: Sensitivity analysis of trigger-gate performance under different thresholds $\tau$.

### B.2 Detailed Results and Experimental Analysis of MPP

For multimodal data, the processing procedure of MPP has been described in the main text\. For text data, we only need to replace the GUI screenshot sequence with Trace Text, while keeping all other steps consistent\.

The experiments in this paper use the MPP results obtained with random seed 42. The confusion matrix is shown in Table [6](https://arxiv.org/html/2606.03236#A2.T6). MPP filters out 1,320 samples, of which 1,081 are correctly filtered, giving a correct filtering rate of 81.89%. Moreover, some of the 169 false-positive cases (non-recommendation samples that pass the gate) can still be handled correctly in the subsequent PAR stage.

| | Pred. Pos. | Pred. Neg. | Total |
|---|---|---|---|
| Actual Pos. | 2171 | 239 | 2410 |
| Actual Neg. | 169 | 1081 | 1250 |
| Total | 2340 | 1320 | 3660 |

Table 6: Trigger-decision confusion matrix of MPP on the ProactiveMobile test set. Pos. and Neg. denote recommendation and non-recommendation, respectively.

Table [7](https://arxiv.org/html/2606.03236#A2.T7) reports the five-seed mean $\pm$ standard deviation. The results indicate that the trigger decision is stable and reliable. The filtering capability is also effective: Specificity reaches 83.81–86.48%, which means that MPP can filter out more than 83% of non-recommendation samples, significantly reducing the computational burden of the downstream decision model.

$$\mathrm{Accuracy}=\frac{TP+TN}{TP+TN+FP+FN} \tag{24}$$

$$\mathrm{Precision}=\frac{TP}{TP+FP} \tag{25}$$

$$\mathrm{F1}=\frac{2\times\mathrm{Precision}\times\mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}} \tag{26}$$

$$\mathrm{Intent\ Top\text{-}5\ Acc}=\frac{\sum_{c=1}^{C}N_{c}^{\mathrm{correct}}}{\sum_{c=1}^{C}N_{c}}, \tag{27}$$

where $C$ denotes the total number of scenario categories, $N_{c}^{\mathrm{correct}}$ denotes the number of samples in the $c$-th scenario category that are correctly hit by the Top-5 prediction, and $N_{c}$ denotes the total number of samples in the $c$-th scenario category.
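As a quick worked check, the seed-42 confusion matrix in Table 6 can be plugged into Eqs. (23) to (26). The helper name `gate_metrics` is hypothetical; the counts are taken directly from Table 6.

```python
def gate_metrics(tp, fn, fp, tn):
    """Compute trigger-gate metrics from confusion-matrix counts."""
    total = tp + fn + fp + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),
        "f1": 2 * precision * recall / (precision + recall),
        # Share of filtered (predicted-negative) samples that are true negatives.
        "correct_filtering_rate": tn / (tn + fn),
    }


# Counts from Table 6 (seed 42).
metrics = gate_metrics(tp=2171, fn=239, fp=169, tn=1081)
for name, value in metrics.items():
    print(f"{name}: {100 * value:.2f}%")
```

With these counts, Specificity evaluates to 86.48% and the correct filtering rate to 81.89%, matching the figures quoted in the text for the seed-42 run.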

| Threshold | Accuracy | Recall | Specificity | Precision | F1 | Intent Top-5 Acc |
|---|---|---|---|---|---|---|
| $0.59\pm 0.24$ | $87.99\pm 0.78$ | $90.16\pm 0.11$ | $83.81\pm 2.45$ | $91.49\pm 1.17$ | $90.82\pm 0.53$ | $87.95\pm 3.48$ |

Table 7: Performance analysis of MPP under different random seeds.

### B.3 Expanded Results with Granular Metrics

#### Metric Calculation\.

Given a test set $\mathcal{D}$, let $S(y)$ denote the function-name sequence extracted from an output $y$. For a model prediction $\hat{y}$, we denote the predicted function-name sequence as $\hat{S}=S(\hat{y})$.

When multiple valid ground-truth recommendations are available, $\mathcal{G}=\{S^{\star}_{1},\ldots,S^{\star}_{K}\}$ denotes the set of all valid gold function-name sequences. Each sequence consists of ordered function names.

For metrics that require matching against a single reference, $S^{\star}$ denotes the best-matched gold function-name sequence selected from $\mathcal{G}$ by maximizing exact sequence accuracy and then function-name F1.

For SR evaluation, the associated function arguments are additionally considered by the LLM judge\.

Following ProactiveMobile (Kong et al., [2026](https://arxiv.org/html/2606.03236#bib.bib9)), the calculation formulas for Type-Acc, SR, and FTR are shown in Eq. ([28](https://arxiv.org/html/2606.03236#A2.E28)), Eq. ([29](https://arxiv.org/html/2606.03236#A2.E29)), and Eq. ([30](https://arxiv.org/html/2606.03236#A2.E30)), respectively.

$$\mathrm{Type\text{-}Acc}=\frac{1}{|\mathcal{D}|}\sum_{i\in\mathcal{D}}\mathbb{I}\Big[\exists S^{\star}\in\mathcal{G},\ \hat{S}=S^{\star}\Big], \tag{28}$$

where $|\mathcal{D}|$ denotes the number of test instances, $\mathcal{G}$ denotes the set of valid gold function-name sequences, and $\mathbb{I}[\cdot]$ is the indicator function that returns 1 if the condition is satisfied and 0 otherwise.

$$\mathrm{SR}=\frac{1}{|\mathcal{D}|}\sum_{i\in\mathcal{D}}\mathbb{I}\Big[\exists S^{\star}\in\mathcal{G},\ \hat{y}\equiv_{\mathrm{func}}y^{\star}\Big], \tag{29}$$

where $y^{\star}$ denotes the ground-truth response corresponding to $S^{\star}$, and $\equiv_{\mathrm{func}}$ denotes functional equivalence judged by the evaluation protocol.

Following ProactiveMobile (Kong et al., [2026](https://arxiv.org/html/2606.03236#bib.bib9)), we employ Gemini-2.5-Pro as the LLM judge to evaluate functional equivalence (Comanici et al., [2025](https://arxiv.org/html/2606.03236#bib.bib48)). Specifically, when the predicted output $\hat{y}$ and a ground-truth response $y^{\star}$ are functionally equivalent, i.e., the selected functions and their corresponding arguments are semantically consistent, the SR score of this instance is 1; otherwise, it is 0.

$$\mathrm{FTR}=\frac{\sum_{i\in\mathcal{D}}\mathbb{I}\left[\mathcal{G}=\emptyset\land\hat{S}\neq\emptyset\right]}{\sum_{i\in\mathcal{D}}\mathbb{I}\left[\mathcal{G}=\emptyset\right]}, \tag{30}$$

where $\mathcal{G}=\emptyset$ denotes that the current instance does not require proactive recommendation, while $\hat{S}\neq\emptyset$ indicates that the model incorrectly triggers a recommendation.
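Eqs. (28) and (30) can be sketched as below; SR is omitted because it depends on an LLM judge. The sketch represents a no-recommendation instance by an empty gold set and, as an assumption not stated in Eq. (28), counts an empty prediction on such an instance as a Type-Acc match. The function names are hypothetical.

```python
def type_acc(preds, gold_sets):
    """Exact function-name sequence accuracy over a test set.

    preds: list of predicted sequences (lists of function names).
    gold_sets: list of lists of valid gold sequences; [] = no recommendation.
    """
    hits = 0
    for pred, golds in zip(preds, gold_sets):
        if golds:
            hits += any(list(pred) == list(gold) for gold in golds)
        else:
            # Assumption: correct silence on a no-recommendation instance is a match.
            hits += len(pred) == 0
    return hits / len(preds)


def false_trigger_rate(preds, gold_sets):
    """Share of no-recommendation instances on which the model still recommends."""
    negatives = [pred for pred, golds in zip(preds, gold_sets) if not golds]
    if not negatives:
        return 0.0
    return sum(1 for pred in negatives if len(pred) > 0) / len(negatives)
```

The order-sensitive comparison in `type_acc` is what distinguishes it from the set-based precision and recall introduced next.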

#### Precision \(P\) and Recall \(R\)\.

To provide a more forgiving set-based evaluation, we treat the prediction and ground truth as unordered sets of function names. Let $\hat{P}_{\mathrm{set}}=\mathrm{set}(\hat{S})$ and $G_{\mathrm{set}}=\mathrm{set}(S^{\star})$.
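The formulas themselves are not reproduced here, so the following is only a sketch using the standard set-overlap definitions of precision, recall, and F1 over $\hat{P}_{\mathrm{set}}$ and $G_{\mathrm{set}}$. The handling of empty sets (full credit when both are empty, zero when only one is) is an assumption, as is the function name `set_prf`.

```python
def set_prf(pred_seq, gold_seq):
    """Set-based precision, recall, and F1 over function names (order ignored)."""
    p_set, g_set = set(pred_seq), set(gold_seq)
    if not p_set and not g_set:
        # Assumption: matching silence receives full credit.
        return 1.0, 1.0, 1.0
    overlap = len(p_set & g_set)
    precision = overlap / len(p_set) if p_set else 0.0
    recall = overlap / len(g_set) if g_set else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Under these definitions a prediction that contains every gold function plus one extra keeps full recall and loses only precision, which is the sense in which the set-based view is more forgiving than exact sequence match.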

### B.4 Sensitivity Analysis of the Top-$K$ Candidate-Pool Size in MPP

Beyond the gating threshold $\tau$, a second design choice is the candidate-pool size $K$, which truncates MPP's ranked function list before it is passed to PAR. A natural intuition is that a larger $K$ raises the chance that the ground-truth function lies in the pool and should therefore monotonically improve the pipeline success rate (SR). We test this by fixing the threshold at the value selected in §[B.1](https://arxiv.org/html/2606.03236#A2.SS1) and sweeping $K\in\{1,\ldots,10\}$ on the ProactiveMobile test set, reporting Top-$K$ Recall and SR in Table [8](https://arxiv.org/html/2606.03236#A2.T8). Top-$K$ Recall counts an instance as recalled if at least one of its ground-truth function sequences has all of its scenes covered by the shortlist.

| $K$ | 1 | 2 | 3 | 4 | **5** | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Top-$K$ Recall (%) | 74.92 | 85.19 | 90.66 | 93.69 | **95.52** | 97.02 | 97.84 | 98.47 | 98.93 | 99.37 |
| SR (%) | 39.54 | 39.89 | 40.90 | 40.19 | **41.15** | 40.87 | 40.49 | 40.98 | 40.55 | 40.11 |

Table 8: Sensitivity of MPP's top-$K$ candidate-pool size on ProactiveMobile. The bolded column ($K=5$) marks the saturation point of Top-$K$ Recall, at which SR also peaks; we adopt it as the default.

#### Recall saturates at $K=5$ while SR is non-monotonic.

Top-$K$ Recall rises from 74.92% to 99.37% but flattens sharply after $K=5$: at $K=5$ recall is already 95.52%, and the marginal gain over the entire $K=5\to 10$ range is only 3.85 pp. SR, by contrast, is not monotonic: it peaks at 41.15% at $K=5$ and drifts *down* to 40.11–40.98% across $K\in\{6,\ldots,10\}$, contradicting the naive prior. We attribute this to distractor injection: once the ground-truth function is reliably in the shortlist, every additional slot is almost always filled with an irrelevant function that occupies prompt budget and competes for PAR's attention during candidate scoring, rather than with a missing-but-useful candidate. $K=5$ is therefore both the saturation point of recall and the onset of distractor interference, and we adopt it as the default candidate-pool size in all main-paper experiments.
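The Top-$K$ Recall criterion can be sketched as a subset test between each gold sequence and the truncated shortlist. This is an illustration only: it treats shortlist entries and gold-sequence entries as comparable labels, restricts the average to instances that have at least one gold sequence, and uses the hypothetical name `top_k_recall`.

```python
def top_k_recall(ranked_lists, gold_sets, k):
    """Fraction of positive instances whose top-k shortlist covers a gold sequence.

    ranked_lists: per-instance ranked candidate labels from the perceptor.
    gold_sets: per-instance list of valid gold sequences ([] = no recommendation).
    """
    positives = [(ranked, golds) for ranked, golds in zip(ranked_lists, gold_sets) if golds]
    if not positives:
        return 0.0
    hits = 0
    for ranked, golds in positives:
        shortlist = set(ranked[:k])
        # Recalled if every item of at least one gold sequence is in the shortlist.
        hits += any(set(gold) <= shortlist for gold in golds)
    return hits / len(positives)
```

Since a larger shortlist is a superset of a smaller one, this quantity can only grow with $K$, which is consistent with the monotone Recall row in Table 8 and contrasts with the non-monotone SR row.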

| Difficulty | Model | Multimodal Type-Acc↑ | Multimodal SR↑ | Multimodal FTR↓ | Multimodal F1↑ | Multimodal P↑ | Multimodal R↑ | Text Type-Acc↑ | Text SR↑ | Text FTR↓ | Text F1↑ | Text P↑ | Text R↑ | ALL Type-Acc↑ | ALL SR↑ | ALL FTR↓ | ALL F1↑ | ALL P↑ | ALL R↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| L1 | GPT-5.5 | 52.54 | 22.03 | 57.14 | 61.72 | 63.14 | 62.15 | 83.78 | 81.08 | 7.76 | 87.65 | 88.22 | 87.90 | 74.01 | 62.60 | 13.36 | 79.53 | 80.37 | 79.84 |
| | o3 | 53.39 | 22.88 | 74.07 | 60.03 | 61.44 | 59.75 | 80.70 | 76.45 | 14.16 | 82.05 | 82.63 | 81.79 | 72.15 | 59.68 | 20.73 | 75.16 | 76.00 | 74.89 |
| | Gemini-3.1-Pro | 36.44 | 14.41 | 93.33 | 47.85 | 48.02 | 49.58 | 58.69 | 55.60 | 38.21 | 63.35 | 63.45 | 64.03 | 51.72 | 42.71 | 41.85 | 58.50 | 58.62 | 59.51 |
| | Claude-Opus-4.7 | 48.31 | 23.73 | 76.00 | 63.05 | 63.14 | 65.11 | 86.87 | 82.24 | 8.15 | 88.10 | 88.61 | 87.97 | 74.80 | 63.93 | 15.04 | 80.26 | 80.64 | 80.81 |
| | GLM-4.6V | 50.85 | 23.73 | 91.30 | 63.14 | 64.41 | 63.84 | 43.63 | 37.84 | 58.57 | 45.61 | 46.08 | 45.62 | 45.89 | 33.42 | 61.80 | 51.10 | 51.81 | 51.33 |
| | Kimi-K2.5 | 32.20 | 13.56 | 40.63 | 49.69 | 47.02 | 59.61 | 67.95 | 64.87 | 24.54 | 70.21 | 69.88 | 71.24 | 56.76 | 48.81 | 26.61 | 63.78 | 62.73 | 67.60 |
| | MiMo-2.5v | 50.85 | 13.56 | 48.28 | 62.54 | 62.01 | 65.40 | 59.46 | 57.53 | 32.55 | 63.26 | 62.77 | 65.19 | 56.76 | 43.77 | 34.44 | 63.03 | 62.53 | 65.25 |
| | TongUI-7B | 1.70 | 0.00 | 0.00 | 1.70 | 1.70 | 1.70 | 8.11 | 8.11 | 32.26 | 8.11 | 8.11 | 8.11 | 6.10 | 5.57 | 30.30 | 6.10 | 6.10 | 6.10 |
| | UI-TARS-7B-DPO+Proactive | 36.44 | 6.78 | 75.00 | 43.08 | 44.92 | 42.23 | 21.24 | 18.92 | 53.85 | 22.27 | 22.78 | 22.01 | 26.00 | 15.12 | 57.66 | 28.78 | 29.71 | 28.34 |
| | Qwen3.5-9B | 14.41 | 5.09 | 54.55 | 17.94 | 18.08 | 18.64 | 5.41 | 4.63 | 86.49 | 6.98 | 7.14 | 6.89 | 8.22 | 4.78 | 82.35 | 10.41 | 10.57 | 10.57 |
| | ProactiveMobile (7B) | 32.20 | 10.17 | 42.31 | 38.84 | 40.25 | 38.56 | 82.24 | 37.07 | 6.82 | 83.45 | 83.46 | 83.59 | 66.58 | 28.65 | 10.57 | 69.49 | 69.94 | 69.50 |
| | Qwen3.5-9B+Proactive | 38.98 | 5.09 | 21.62 | 42.09 | 43.22 | 41.53 | 86.87 | 84.17 | 3.65 | 87.65 | 87.65 | 87.65 | 71.88 | 59.42 | 6.25 | 73.39 | 73.74 | 73.21 |
| | PRPF (Ours) | 43.22 | 13.56 | 32.26 | 50.00 | 50.85 | 50.00 | 92.28 | 90.35 | 0.90 | 93.31 | 93.44 | 93.44 | 76.92 | 66.31 | 4.74 | 79.75 | 80.11 | 79.84 |
| L2 | GPT-5.5 | 42.41 | 19.58 | 38.24 | 55.91 | 55.85 | 58.62 | 64.14 | 58.51 | 20.33 | 71.89 | 72.81 | 72.17 | 54.08 | 40.48 | 24.27 | 64.49 | 64.96 | 65.89 |
| | o3 | 41.60 | 19.09 | 54.76 | 53.93 | 55.49 | 54.30 | 58.09 | 51.06 | 31.26 | 64.28 | 65.68 | 63.98 | 50.45 | 36.25 | 36.26 | 59.48 | 60.96 | 59.49 |
| | Gemini-3.1-Pro | 21.04 | 10.28 | 81.25 | 39.64 | 37.65 | 45.73 | 34.60 | 28.27 | 62.28 | 47.10 | 47.29 | 49.35 | 28.32 | 19.94 | 64.88 | 43.64 | 42.83 | 47.68 |
| | Claude-Opus-4.7 | 37.36 | 15.99 | 54.47 | 56.26 | 55.61 | 61.26 | 64.14 | 58.37 | 19.20 | 71.20 | 72.15 | 71.53 | 51.74 | 38.75 | 26.47 | 64.29 | 64.50 | 66.77 |
| | GLM-4.6V | 40.95 | 12.07 | 83.67 | 53.61 | 55.74 | 53.59 | 29.26 | 21.10 | 70.75 | 37.89 | 39.80 | 37.94 | 34.67 | 16.92 | 73.29 | 45.17 | 47.18 | 45.19 |
| | Kimi-K2.5 | 22.35 | 7.34 | 36.03 | 41.67 | 37.81 | 52.72 | 47.26 | 44.30 | 32.20 | 55.78 | 54.91 | 59.12 | 35.73 | 27.19 | 33.10 | 49.25 | 46.99 | 56.16 |
| | MiMo-2.5v | 31.81 | 11.75 | 46.40 | 47.09 | 46.47 | 51.58 | 45.29 | 42.05 | 38.85 | 54.75 | 54.02 | 57.56 | 39.05 | 28.02 | 40.48 | 51.20 | 50.53 | 54.79 |
| | TongUI-7B | 1.96 | 0.49 | 20.00 | 1.96 | 1.96 | 1.96 | 11.81 | 11.81 | 13.40 | 11.81 | 11.81 | 11.81 | 7.25 | 6.57 | 14.29 | 7.25 | 7.25 | 7.25 |
| | UI-TARS-7B-DPO+Proactive | 35.89 | 9.46 | 64.91 | 43.47 | 45.35 | 42.74 | 14.21 | 11.11 | 64.24 | 16.95 | 18.00 | 16.46 | 24.25 | 10.35 | 64.52 | 29.23 | 30.67 | 28.63 |
| | Qwen3.5-9B | 8.81 | 2.94 | 70.73 | 13.82 | 13.72 | 14.76 | 9.28 | 7.17 | 75.42 | 11.90 | 12.34 | 11.98 | 9.06 | 5.21 | 74.55 | 12.79 | 12.98 | 13.27 |
| | ProactiveMobile (7B) | 34.75 | 17.62 | 23.78 | 43.39 | 44.13 | 44.02 | 69.06 | 32.49 | 9.07 | 72.34 | 72.60 | 72.68 | 53.17 | 25.60 | 12.61 | 58.93 | 59.42 | 59.41 |
| | Qwen3.5-9B+Proactive | 36.54 | 13.87 | 25.15 | 40.62 | 41.60 | 40.24 | 70.04 | 64.98 | 11.11 | 73.64 | 74.97 | 73.08 | 54.53 | 41.31 | 14.67 | 58.35 | 59.52 | 57.87 |
| | PRPF (Ours) | 43.23 | 17.46 | 19.64 | 50.24 | 51.33 | 50.27 | 81.72 | 78.90 | 1.10 | 86.69 | 87.20 | 87.04 | 63.90 | 50.45 | 5.48 | 69.81 | 70.59 | 70.02 |
| L3 | GPT-5.5 | 37.88 | 17.35 | 29.14 | 51.06 | 52.98 | 51.54 | 38.93 | 31.47 | 26.45 | 53.10 | 55.58 | 53.08 | 38.34 | 23.53 | 27.86 | 51.95 | 54.12 | 52.22 |
| | o3 | 37.06 | 15.62 | 33.00 | 49.74 | 52.20 | 49.64 | 36.83 | 29.49 | 29.70 | 52.03 | 55.42 | 51.19 | 36.96 | 21.70 | 31.46 | 50.74 | 53.61 | 50.32 |
| | Gemini-3.1-Pro | 15.90 | 8.27 | 72.96 | 33.55 | 32.88 | 37.78 | 25.99 | 16.78 | 64.80 | 47.32 | 48.46 | 50.53 | 20.32 | 12.00 | 68.45 | 39.58 | 39.70 | 43.37 |
| | Claude-Opus-4.7 | 35.97 | 14.90 | 33.13 | 51.82 | 51.35 | 55.27 | 39.74 | 30.54 | 27.10 | 57.32 | 59.55 | 58.50 | 37.62 | 21.75 | 30.41 | 54.23 | 54.94 | 56.68 |
| | GLM-4.6V | 32.06 | 10.26 | 69.14 | 45.47 | 48.62 | 44.85 | 26.46 | 14.10 | 68.47 | 43.86 | 47.27 | 43.77 | 29.61 | 11.95 | 68.83 | 44.77 | 48.03 | 44.38 |
| | Kimi-K2.5 | 26.25 | 9.99 | 26.35 | 41.00 | 38.44 | 48.93 | 24.48 | 19.46 | 40.43 | 44.21 | 41.42 | 52.20 | 25.47 | 14.14 | 32.36 | 42.40 | 39.74 | 50.36 |
| | MiMo-2.5v | 30.25 | 11.35 | 35.87 | 42.24 | 42.83 | 43.90 | 27.77 | 21.35 | 37.66 | 45.86 | 45.34 | 50.14 | 29.16 | 15.73 | 36.64 | 43.83 | 43.93 | 46.63 |
| | TongUI-7B | 1.64 | 0.27 | 14.29 | 1.64 | 1.64 | 1.64 | 5.13 | 4.90 | 20.00 | 5.14 | 5.13 | 5.25 | 3.17 | 2.30 | 18.42 | 3.17 | 3.17 | 3.22 |
| | UI-TARS-7B-DPO+Proactive | 27.52 | 8.17 | 61.45 | 35.92 | 39.03 | 34.60 | 12.01 | 6.18 | 78.21 | 16.26 | 17.41 | 15.81 | 20.73 | 7.30 | 65.29 | 27.31 | 29.56 | 26.37 |
| | Qwen3.5-9B | 7.36 | 2.73 | 73.45 | 12.41 | 13.05 | 12.66 | 8.04 | 4.78 | 72.73 | 12.84 | 13.41 | 13.21 | 7.66 | 3.62 | 73.13 | 12.60 | 13.21 | 12.90 |
| | ProactiveMobile (7B) | 33.24 | 15.08 | 22.63 | 39.79 | 40.77 | 40.21 | 39.04 | 17.37 | 8.75 | 44.50 | 45.28 | 44.85 | 35.78 | 16.08 | 16.08 | 41.85 | 42.74 | 42.24 |
| | Qwen3.5-9B+Proactive | 41.24 | 16.44 | 15.26 | 45.34 | 47.18 | 44.48 | 42.54 | 36.48 | 14.74 | 50.91 | 53.67 | 49.65 | 41.81 | 25.22 | 15.03 | 47.78 | 50.03 | 46.75 |
| | PRPF (Ours) | 39.24 | 17.44 | 16.10 | 44.87 | 46.13 | 44.69 | 51.87 | 46.15 | 3.20 | 60.47 | 62.26 | 60.51 | 44.77 | 30.02 | 9.68 | 51.70 | 53.19 | 51.62 |
| Avg | GPT-5.5 | 40.34 | 18.40 | 33.48 | 53.37 | 54.59 | 54.59 | 55.09 | 49.02 | 19.24 | 65.30 | 66.90 | 65.44 | 47.71 | 33.69 | 23.84 | 59.33 | 60.74 | 60.01 |
| | o3 | 39.63 | 17.25 | 41.45 | 51.80 | 53.89 | 51.85 | 51.31 | 44.53 | 26.89 | 61.05 | 63.27 | 60.50 | 45.46 | 30.87 | 31.61 | 56.42 | 58.58 | 56.17 |
| | Gemini-3.1-Pro | 18.94 | 9.33 | 76.47 | 36.51 | 35.45 | 41.20 | 33.97 | 26.75 | 56.60 | 49.51 | 50.13 | 51.99 | 26.45 | 18.03 | 61.11 | 43.00 | 42.78 | 46.59 |
| | Claude-Opus-4.7 | 37.23 | 15.83 | 41.03 | 54.03 | 53.53 | 57.91 | 55.91 | 48.69 | 18.81 | 67.08 | 68.57 | 67.74 | 46.56 | 32.24 | 26.11 | 60.55 | 61.04 | 62.82 |
| | GLM-4.6V | 36.25 | 11.74 | 74.45 | 49.33 | 52.02 | 49.00 | 29.98 | 20.19 | 67.04 | 41.79 | 44.20 | 41.77 | 33.12 | 15.96 | 69.33 | 45.56 | 48.11 | 45.39 |
| | Kimi-K2.5 | 25.33 | 9.33 | 30.02 | 41.78 | 38.78 | 50.88 | 39.50 | 35.56 | 32.51 | 52.39 | 50.70 | 57.59 | 32.40 | 22.43 | 31.64 | 47.08 | 44.73 | 54.23 |
| | MiMo-2.5v | 32.10 | 11.63 | 39.45 | 45.17 | 45.28 | 47.85 | 39.08 | 34.54 | 37.06 | 51.79 | 51.19 | 55.16 | 35.58 | 23.07 | 37.87 | 48.48 | 48.23 | 51.50 |
| | TongUI-7B | 1.75 | 0.33 | 15.79 | 1.75 | 1.75 | 1.75 | 8.15 | 8.04 | 18.58 | 8.16 | 8.15 | 8.21 | 4.95 | 4.18 | 18.10 | 4.95 | 4.95 | 4.97 |
| | UI-TARS-7B-DPO+Proactive | 30.90 | 8.52 | 63.13 | 38.91 | 41.52 | 37.81 | 14.17 | 9.90 | 64.67 | 17.38 | 18.40 | 16.94 | 22.54 | 9.21 | 63.84 | 28.15 | 29.97 | 27.39 |
| | Qwen3.5-9B | 8.30 | 2.95 | 71.52 | 13.24 | 13.60 | 13.75 | 8.15 | 5.69 | 77.13 | 11.65 | 12.11 | 11.83 | 8.22 | 4.32 | 75.30 | 12.44 | 12.85 | 12.79 |
| | ProactiveMobile (7B) | 33.68 | 15.61 | 23.91 | 40.93 | 41.86 | 41.38 | 56.84 | 26.04 | 8.51 | 60.84 | 61.32 | 61.16 | 45.25 | 20.82 | 13.76 | 50.88 | 51.58 | 51.26 |
| | Qwen3.5-9B+Proactive | 39.52 | 14.85 | 18.54 | 43.55 | 45.06 | 42.87 | 59.52 | 54.32 | 10.63 | 64.96 | 66.77 | 64.15 | 49.51 | 34.56 | 13.49 | 54.24 | 55.90 | 53.50 |
| | PRPF (Ours) | 40.83 | 17.19 | 17.99 | 47.00 | 48.17 | 46.90 | 69.20 | 65.15 | 1.75 | 75.32 | 76.38 | 75.49 | 55.00 | 41.15 | 7.21 | 61.14 | 62.26 | 61.18 |

Table 9: Detailed performance comparison using granular, set-based metrics on the ProactiveMobile test set. We report function-name sequence accuracy (Type-Acc ↑), success rate (SR ↑), false trigger rate (FTR ↓), F1 ↑, Precision (P ↑), and Recall (R ↑) across different difficulties and modalities. Best results are in bold, and second-best results are underlined. All scores are reported in percentage (%).

### B.5 Detailed Results Analysis of Three Difficulty Levels

Following the difficulty division method of the ProactiveMobile dataset, we divide the dataset into three difficulty levels: L1, L2, and L3\.

- Level 1 (Easy): Correctly solved by 4–5 out of 5 reference models.
- Level 2 (Medium): Correctly solved by 2–3 out of 5 reference models.
- Level 3 (Hard): Correctly solved by 0–1 out of 5 reference models.

The results of PRPF and all baseline models are shown in Table[9](https://arxiv.org/html/2606.03236#A2.T9), which also provides more detailed insights:

#### 1. PRPF maintains stable advantages across different difficulty levels.

As the task difficulty increases from L1 to L3, PRPF achieves ALL SR of 66.31%, 50.45%, and 30.02%, respectively, outperforming all baselines at every level. In particular, PRPF still maintains a clear lead on L2 and L3, indicating that the gains do not come from overfitting to easy samples but also hold in medium- and high-difficulty proactive recommendation scenarios.

#### 2. PRPF can generate higher-quality function sets.

The ALL F1 and Precision of PRPF are the highest among all baselines, while Recall is the second highest. This indicates that the improvement of PRPF does not come only from a small number of additional exactly matched samples, but also from an overall improvement in the selection quality of function sets, enabling the model to more accurately cover the key functions required by users' latent needs.

#### 3\. PRPF demonstrates strong cross\-modal applicability\.

Averaged over difficulty levels (the Avg rows of Table [9](https://arxiv.org/html/2606.03236#A2.T9)), PRPF achieves higher SR than the strongest baseline, Qwen3.5-9B+Proactive, in both text and multimodal settings. In the text setting, PRPF improves SR from 54.32% to 65.15% compared with Qwen3.5-9B+Proactive, while reducing FTR from 10.63% to 1.75%. In the multimodal setting, PRPF also improves SR from 14.85% to 17.19%. These results indicate that PRPF provides stable proactive perception and content generation capabilities under both text and multimodal conditions.

#### 4\. Multimodal tasks remain a bottleneck\.

Although PRPF achieves an SR of 17\.19% on multimodal tasks, outperforming Qwen3\.5\-9B\+Proactive, the absolute SR remains low\. Even GPT\-5\.5, which performs best on multimodal tasks, achieves an SR of only 18\.40%\. This further demonstrates the challenge of the ProactiveMobile task and indicates that stronger interface understanding capability is still required in the future to achieve a breakthrough in this domain\.

### B.6 Efficiency Benchmark Details

This appendix details the evaluation protocol for the efficiency analysis in Section[4\.4](https://arxiv.org/html/2606.03236#S4.SS4)\. All reported values are computed on the ProactiveMobile test split and aggregated per test sample after applying the routing policy of each configuration\.

#### Compute Estimation\.

For each PAR invocation, we estimate FLOPs from the actual prompt and generation lengths observed during inference. Let $T_{p}$ denote the prompt length, $T_{g}$ the generated length, $N_{p}$ the number of model parameters, and $L$ and $d$ the number of transformer layers and the hidden size. The PAR FLOPs are estimated as

$$F_{\mathrm{PAR}}=2N_{p}(T_{p}+T_{g})+4LdT_{p}(T_{p}+T_{g}). \tag{31}$$

The first term approximates parameter-dominated dense computation, while the second term accounts for the attention cost induced by long multimodal prompts.
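Eq. (31) translates directly into a small helper. The function name `par_flops` and the numbers in the usage example are hypothetical placeholders, not the configuration used in the paper.

```python
def par_flops(n_params, prompt_len, gen_len, n_layers, hidden_size):
    """Estimate PAR FLOPs for one invocation following Eq. (31)."""
    total_len = prompt_len + gen_len
    dense = 2 * n_params * total_len                             # parameter-dominated term
    attention = 4 * n_layers * hidden_size * prompt_len * total_len  # attention term
    return dense + attention


# Hypothetical model and sequence sizes, for illustration only.
estimate = par_flops(n_params=7e9, prompt_len=4000, gen_len=200, n_layers=28, hidden_size=3584)
print(f"{estimate / 1e12:.2f} TFLOPs")
```

The attention term grows with the product of the prompt length and the total length, so shortening the prompt (for example by restricting the function pool) reduces it faster than it reduces the dense term.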

MPP is measured as an end-to-end front-end, including text encoding, image encoding, and the slow–fast multimodal fusion network. Its measured per-sample cost is 23.58 GFLOPs, 14.16 ms latency, and 1.51 GB peak memory. Let $r$ denote the empirical fraction of test samples routed from MPP to PAR:

$$r=\frac{N_{\mathrm{trig}=1}}{N}=\frac{2{,}340}{3{,}660}=0.639. \tag{32}$$

#### Aggregation Across the Test Set\.

Compute and latency are reported as per-sample averages after applying each system's routing policy. The single-stage baseline invokes the proactive VLM on every sample with the full function pool. The w/o Compression variant keeps the MPP intervention gate but passes the full function pool to PAR, so PAR is invoked only on the fraction $r$ accepted by the gate. The w/o Recommend variant disables the intervention shortcut but keeps MPP's candidate-function filtering, so PAR is invoked on every sample with the candidate function pool. Full PRPF combines both mechanisms: the intervention gate skips no-intervention observations, and candidate-function filtering reduces the function pool for accepted contexts.

| Method | Compute | Latency |
|---|---|---|
| ProactiveMobile (7B) | $F_{\mathrm{PAR}}^{\mathrm{full}}$ | $T_{\mathrm{PAR}}^{\mathrm{full}}$ |
| w/o Compression | $F_{\mathrm{MPP}}+rF_{\mathrm{PAR}}^{\mathrm{full}}$ | $T_{\mathrm{MPP}}+rT_{\mathrm{PAR}}^{\mathrm{full}}$ |
| w/o Recommend | $F_{\mathrm{MPP}}+F_{\mathrm{PAR}}^{\mathrm{comp}}$ | $T_{\mathrm{MPP}}+T_{\mathrm{PAR}}^{\mathrm{comp}}$ |

Table 10: Aggregation rules for per-sample compute and latency in the efficiency benchmark.
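The aggregation rules in Table 10 can be sketched as one function that works for either compute or latency. The row for full PRPF is not listed in the table; the expression used for it below (MPP cost plus $r$ times the compressed-pool PAR cost) is an assumption inferred from the description of full PRPF as combining the gate with candidate filtering. The function name and the toy inputs are hypothetical.

```python
def per_sample_cost(mpp, par_full, par_comp, r):
    """Per-sample average cost (FLOPs or latency) under each routing policy.

    mpp: MPP cost per sample; par_full / par_comp: PAR cost per invocation with
    the full / compressed function pool; r: fraction of samples routed to PAR.
    """
    return {
        "ProactiveMobile (7B)": par_full,
        "w/o Compression": mpp + r * par_full,
        "w/o Recommend": mpp + par_comp,
        # Assumption: not listed in Table 10, inferred from the text.
        "PRPF (assumed)": mpp + r * par_comp,
    }


# Routing fraction from Eq. (32); the cost inputs are toy values.
r = 2340 / 3660
costs = per_sample_cost(mpp=1.0, par_full=10.0, par_comp=4.0, r=r)
```

Because MPP runs on every sample, a gated variant is cheaper than the single-stage baseline only when the MPP cost is smaller than the PAR cost saved on the gated-out fraction $1-r$.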
#### Memory and Latency Measurement\.

Peak memory is reported as the maximum over active stages because MPP and PAR are invoked sequentially: for the single-stage baseline ProactiveMobile (7B), $M=M_{\mathrm{PAR}}$; for the PRPF variants, $M=\max(M_{\mathrm{MPP}},M_{\mathrm{PAR}})$. This convention reflects the deployed cascade, in which the lightweight pre-reasoning stage and the heavy PAR stage are not treated as one simultaneous forward pass. A conservative co-resident upper bound can be obtained by summing the two stage memories, but the main efficiency table reports the sequential peak.

Latency is measured with the same batched inference path used by the evaluation pipeline\. For each configuration, PAR latency is the total wall\-clock inference time divided by the number of samples that enter PAR; for observations rejected by the intervention gate, the latency aggregation includes the measured MPP cost and no PAR forward pass\.

## Appendix C Failure Case Analysis

This appendix provides the failure analysis referenced in Section [4.6](https://arxiv.org/html/2606.03236#S4.SS6) and complements the multimodal limitation discussed in Limitations. All numbers reported here are computed on the ProactiveMobile test set ($N=3{,}660$) and correspond to the PRPF row of Table [1](https://arxiv.org/html/2606.03236#S4.T1): ALL Type-Acc $=55.00$, SR $=41.15$, FTR $=7.21$. The analysis is conducted on the same predictions, so the aggregate numbers in this section can be reconciled with the main table by composition.

### C.1 Error Categorisation

We use the trigger gate output $g\in\{0,1\}$ from MPP, the predicted function-name sequence $\hat{S}$ from PAR, and the best-matched gold sequence $S^{\star}$ defined in Appendix [A.4](https://arxiv.org/html/2606.03236#A1.SS4.SSS0.Px2). A prediction is counted as a Type-Acc error when $\hat{S}$ is not order- and length-equal to $S^{\star}$. Errors are partitioned into four mutually exclusive categories:

- Non-empty mismatch: $|\hat{S}|>0$ and $\hat{S}\neq S^{\star}$.
- Post-gate abstention: $g=1$ and $|\hat{S}|=0$ while $|S^{\star}|>0$. PAR was invoked but the policy declined to recommend.
- Gate misfire: $g=0$ while the best-matched gold intent requires recommendation. PAR is never invoked, so no recovery is possible.
- Parse failure: $\hat{S}$ cannot be parsed back into a function-call sequence.
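The four categories above can be sketched as a single decision function. The order in which the checks are applied (parse failure first, then gate misfire, then the two routed cases) is an assumption made so that the categories stay mutually exclusive; the function name and the string labels are hypothetical.

```python
def categorise_error(gate, pred_seq, gold_seq, parsed_ok=True):
    """Return an error category for one prediction, or None if it is correct.

    gate: MPP trigger decision (0 = silenced, 1 = routed to PAR).
    pred_seq / gold_seq: predicted and best-matched gold function-name sequences.
    """
    if not parsed_ok:
        return "parse_failure"
    pred, gold = list(pred_seq), list(gold_seq)
    if pred == gold:  # order- and length-equal, including both empty
        return None
    if gate == 0 and gold:
        return "gate_misfire"
    if pred:
        return "non_empty_mismatch"
    if gate == 1 and gold:
        return "post_gate_abstention"
    raise ValueError("unexpected combination of gate and sequences")
```

A spurious recommendation on a no-recommendation instance falls into the non-empty mismatch bucket under this scheme, in line with the sub-pattern breakdown of that bucket.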

| Category | Count | % of errors |
|---|---|---|
| Non-empty mismatch | 751 | 45.8 |
| Post-gate abstention | 689 | 42.0 |
| Gate misfire | 200 | 12.2 |
| Parse failure | 7 | - |
| Total Type-Acc errors | 1,647 | 100.0 |

Table 11: Error categorisation for PRPF on the ProactiveMobile test set. Parse failures are reported as orphan items because they fall outside the four well-formed prediction quadrants; the percentage column is consistent with a denominator of 1,640 errors, i.e., the total excluding the 7 parse failures.

The non-empty mismatch bucket can be further decomposed by comparing the predicted set against the best-matched gold set (Table [12](https://arxiv.org/html/2606.03236#A3.T12)). *Off-scene* mismatch dominates (40.6% of non-empty mismatches): PAR returns a syntactically valid call sequence whose function names share no overlap with $S^{\star}$, indicating that the failure is at the intent-selection level rather than the argument-filling level.

| Sub-pattern | Count | % |
|---|---|---|
| Off-scene ($\hat{S}\cap S^{\star}=\emptyset$) | 305 | 40.6 |
| Spurious ($S^{\star}=\emptyset,\ \hat{S}\neq\emptyset$) | 124 | 16.5 |
| Missed steps ($\hat{S}\subsetneq S^{\star}$) | 154 | 20.5 |
| Extra steps ($S^{\star}\subsetneq\hat{S}$) | 80 | 10.7 |
| Reordering (same set, wrong order) | 56 | 7.5 |
| Partial overlap | 32 | 4.3 |
| Total | 751 | 100.0 |

Table 12: Sub-patterns of non-empty mismatch errors. The dominant failure mode is off-scene misrouting, not argument-level error.

### C.2 Where Errors Concentrate: Modality and Gating Path

To localize the bottleneck, we cross-tabulate Type-Acc by modality (TEXT vs. Multimodal) and gate decision ($g=0$, silenced, vs. $g=1$, routed to PAR), shown in Table [13](https://arxiv.org/html/2606.03236#A3.T13).

| Modality | $g=0$ (silenced) $n$ | $g=0$ (silenced) acc | $g=1$ (routed to PAR) $n$ | $g=1$ (routed to PAR) acc |
|---|---|---|---|---|
| TEXT | 1,235 | 0.866 | 593 | 0.329 |
| Multimodal | 85 | 0.588 | 1,747 | 0.400 |

Table 13: Type-Acc cross-tabulated by modality and gating path. Most TEXT samples ($1{,}235/1{,}828=67.6\%$) are silenced by the gate at high accuracy. The bottleneck shifts to PAR once a sample is routed: $g=1$ accuracy is below $0.41$ for both modalities.

Two observations matter for the multimodal-bottleneck argument. First, the $g=0$ silenced path is concentrated on TEXT ($93.6\%$ of silenced samples are TEXT), where the trigger decision is comparatively easy and acc reaches $0.866$. The high overall TEXT accuracy reported in the main table is largely attributable to this silence dividend rather than to stronger TEXT recommendation by PAR. Second, on the $g=1$ path TEXT accuracy ($0.329$) is in fact *lower* than Multimodal accuracy ($0.400$): the easy TEXT cases have already been gated out, and what remains for PAR is the harder TEXT residual. This indicates that further gains require improving PAR's reasoning on the routed subset, not pushing the gate to filter more aggressively.
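The reconciliation by composition mentioned at the start of this appendix can be checked with a few lines of arithmetic over the four cells of Table 13. The variable names are hypothetical; the counts and accuracies are copied from the table.

```python
# (modality, gate) -> (number of samples, Type-Acc) from Table 13.
cells = {
    ("TEXT", 0): (1235, 0.866),
    ("TEXT", 1): (593, 0.329),
    ("Multimodal", 0): (85, 0.588),
    ("Multimodal", 1): (1747, 0.400),
}

total = sum(n for n, _ in cells.values())
# Sample-weighted average of the per-cell accuracies.
overall_acc = sum(n * acc for n, acc in cells.values()) / total

silenced = cells[("TEXT", 0)][0] + cells[("Multimodal", 0)][0]
text_share_of_silenced = cells[("TEXT", 0)][0] / silenced

print(total, round(overall_acc, 4), round(text_share_of_silenced, 3))
```

The cells sum to the 3,660 test samples, the weighted accuracy comes out at about 0.55 (the ALL Type-Acc of 55.00 up to rounding of the per-cell values), and the TEXT share of silenced samples is 93.6%.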

### C.3 Scene-Level Concentration

The 14 high-level intent scenes used by MPP differ markedly in error profile. We group them into three substantive clusters (Table [14](https://arxiv.org/html/2606.03236#A3.T14)); two scenes with very small $n$ (Logistics & Delivery, $n=54$; Smart Home, $n=10$) are excluded from clustering to avoid noise.

| Cluster | Representative scenes | $n$ | acc | $\bar{\Delta}_{T-M}$ |
|---|---|---|---|---|
| I. Multimodal-hard | Sports & Health; Entertainment & Media; Office Work; Social Communication | 1,200 | 0.645 | $+0.41$ |
| II. Domain-knowledge-hard | Travel & Lodging; Food & Dining; Transportation | 997 | 0.431 | $+0.18$ |
| III. Balanced | Device & System Management; Shopping; Personal Management; Content Creation; Financial Services | 1,399 | 0.557 | $+0.24$ |

Table 14: Scene-level clusters with mean Type-Acc and mean TEXT$-$Multimodal gap $\bar{\Delta}_{T-M}$. Cluster II is the largest absolute drag on overall accuracy and shows a small modality gap, indicating that even the TEXT branch underperforms in these scenes. Cluster I has high overall accuracy but the largest modality gap, so further gains specifically require stronger GUI grounding.

The off-scene mismatch sub-pattern is highly concentrated in Cluster II: Travel & Lodging (60), Shopping (47), and Personal Management (44) together account for $151/305=49.5\%$ of all off-scene errors, suggesting that the residual errors in Cluster II are disproportionately rooted in confusable functions across closely related lifestyle scenes (e.g., booking, ordering, ride-hailing) rather than in interface understanding.

### C.4 A Reference Multimodal Success Case

To complement the failure statistics above with a concrete example of successful behavior, we include a single multimodal sample on which PRPF achieves SR $=1$ in a Financial Services scene. It illustrates the behavior the system already exhibits when the gate routes correctly and PAR's <thinking\> and <function\_selection\> both align with the gold function. The case table, rendered as a single-column longtable, reproduces the abridged benchmark inputs (user profile $U$, device state $D$, world context $W$, and trace $H$) together with PAR's parsed structured output as <ui\_summary\> $\rightarrow$ <thinking\> $\rightarrow$ <recommendations\>. The <thinking\> block includes both the free-form reasoning and the embedded <function\_selection\> sub-block. The original ProactiveMobile data is in Chinese; we render all natural-language content in English here for readability, with function names, parameter keys, and structural tags preserved verbatim.

### C\.5Implications

The category\- and slice\-level numbers above isolate two complementary residual problems for PRPF\. \(i\) On the modality axis,g=1g\{=\}1Multimodal samples remain the dominant absolute error mass and the multimodal limitation reported in the Limitations section is quantitatively concentrated in Cluster I, where TEXT accuracy already exceeds0\.800\.80\. Closing this gap will mainly require stronger GUI grounding inside PAR\. \(ii\) On the intent\-selection axis, the off\-scene sub\-pattern in Cluster II indicates that PAR’s failures in travel\-/shopping\-/management\-style scenes are routing failures rather than argument\-level failures\. This points to expanding the function set seen during training and refining the candidate\-pool compression in MPP, rather than further reward shaping at the GRPO stage\.

## Appendix DPrompts for LLM Agents

At inference time, the trained PAR model \(Section[3\.4](https://arxiv.org/html/2606.03236#S3.SS4)\) consumes a single user\-message prompt template assembled by the eval\-data builder: each ProactiveMobile sample is rewritten into one turn that bundles the per\-sample context\(U,D,W\)\(U,D,W\), the MPP\-predicted intent scenes, and the MPP\-restricted function pool, and the resulting message is wrapped with the standard Qwen\-3 chat template before being fed to vLLM\. Only samples for which the MPP trigger gate predictsg=1g=1are routed through this prompt; gate\-silenced samples are short\-circuited to the empty\-recommendation output without a forward pass through PAR\. The sampler\-side conversation is built with a fixed system message"You are a helpful assistant\."followed by a single user message that concatenates a Chinese task header, an XML\-tagged context bundle, the MPP\-predicted scenes, the available function pool, and a structured\-output instruction tail; for multimodal samples the leading<image\>placeholders are stripped from the textual content and the corresponding screenshot files are attached asimage\_urlentries so that vLLM places the image tokens at the same position relative to the textual context as during SFT\. A sample at the end of the appendix documents the assembled prompt for the multimodal branch; the original prompts are in Chinese, and we render them in English here for readability while preserving JSON keys, structural tags, and function names verbatim\.

Table 15:Case study: Multimodal — Match\. Scene: Financial Services\. Sample id40590079\.User Profile \(UU\)A 35\-year\-old technology\-industry professional with seven years of investing experience; holds shares in Zoom, Apple, Tesla, Amazon, and other tech companies…Device Status \(DD\)Time 12:48; WiFi connected; battery sufficient\. Multiple finance apps installed: Yahoo Finance, Investing\.com, Bloomberg, TradingView\. Over the past three months the user has averaged 15\+ daily opens of finance apps…World Information \(WW\)Currently in market hours; NASDAQ\+2\.03%\+2\.03\\%, S&P 500\+1\.02%\+1\.02\\%; market sentiment positive\. Tech sector broadly stronger; multiple cloud\-computing and e\-commerce stocks up…Behavioral Trajectories \(HH\)![[Uncaptioned image]](https://arxiv.org/html/2606.03236v1/Figure/case_c_finance_f1.png)![[Uncaptioned image]](https://arxiv.org/html/2606.03236v1/Figure/case_c_finance_f2.png)![[Uncaptioned image]](https://arxiv.org/html/2606.03236v1/Figure/case_c_finance_f3.png)![[Uncaptioned image]](https://arxiv.org/html/2606.03236v1/Figure/case_c_finance_f4.png)![[Uncaptioned image]](https://arxiv.org/html/2606.03236v1/Figure/case_c_finance_f5.png)![[Uncaptioned image]](https://arxiv.org/html/2606.03236v1/Figure/case_c_finance_f6.png)![[Uncaptioned image]](https://arxiv.org/html/2606.03236v1/Figure/case_c_finance_f7.png)![[Uncaptioned image]](https://arxiv.org/html/2606.03236v1/Figure/case_c_finance_f8.png)![[Uncaptioned image]](https://arxiv.org/html/2606.03236v1/Figure/case_c_finance_f9.png)![[Uncaptioned image]](https://arxiv.org/html/2606.03236v1/Figure/case_c_finance_f10.png)
Frames 1–10: Yahoo Finance→\\rightarrowsearch “Zoom”→\\rightarrowSimply Wall St\. article→\\rightarrowInvesting\.com→\\rightarrowZoom Video \(NASDAQ: ZM\) ticker page\.<ui\_summary\>Page progression: from the home screen the user opens Yahoo Finance and checks “My List” \(containing AMZN, AAPL, TSLA\), enters “Zoom” in the search bar, taps a Simply Wall St…<thinking\>The user is currently viewing the Zoom Video \(NASDAQ: ZM\) ticker page on Investing\.com, with price $62\.00 \(\+1\.57%\+1\.57\\%\)\. Earlier he searched Zoom on Yahoo Finance and read a Simply Wall St\. article on Scully Royalty Ltd\., then switched platforms\. This sequence indicates multi\-platform information gathering at the pre\-decision stage…<function\_selection\>Candidate function pool\.search\_financial\_info,create\_financial\_analysis\_view,create\_portfolio,manage\_budget\_and\_plans,bank\_service\_assistance, …Step\-by\-step reasoning\.•search\_financial\_info: supportssearch\_target="analysis\_articles",search\_keywords="Zoom",location="US"; directly matches\.Match\.•create\_financial\_analysis\_view: builds a chart view, not a search\.Excluded\.•…Selected function and parameters\.search\_financial\_infowithsearch\_target="analysis\_articles",search\_keywords="Zoom",location="US"\.</function\_selection\></thinking\><recommendations\>Instruction:“Recommend deep analysis articles or research reports on Zoom\.”Function call \(parsed\):\[\{
"name":
"search\_financial\_info",
"parameters": \{
"search\_target":
"analysis\_articles",
"search\_keywords": "Zoom",
"filter\_criteria": "",
"sort\_order": "",
"category": "",
"include\_analysis": "",
"location": "US"
\}
\}\]OutcomeType\-Acc match; the LLM judge marks it as passing,SR=1=1\.Prompt𝒫inf\-GUI\\mathcal\{P\}\_\{\\text\{inf\-GUI\}\}: PAR Inference Prompt \(Multimodal/GUI\)You are an intelligent assistant\. Based on the user profile \(profile\), device status \(phone\), environmental information \(world\), and screenshots, analyze the user’s current behavior and determine whether a recommendation is needed\.User profileDevice statusEnvironmental informationpredicted\_scenes\{List of predicted scene probabilities, different for each sample\}available\_functions\{List of available functions and parameter definitions, different for each sample\}Please output in the following order:ui\_summary: Based on the screenshots, describe the current stage of the user interface, key interaction areas, state changes, and the current stage of the workflow\. Include key identifiable entities in the screenshots, such as specific app names, page titles, contacts, file names, etc\.thinking: Conduct sufficient step\-by\-step reasoning covering the following aspects:a\) Behavior recognition: Describe what the user is doing based on the screenshots and extract key entities such as app names, file names, contacts, page content, etc\.;b\) Intent analysis: Combine the user profile \(profile\), device status \(phone\), and environmental information \(world\) to analyze the user’s deeper intent and the current task stage \(just started, in progress, completed\);c\) Need assessment: Determine whether the user currently has any unmet needs or whether there is an optimizable next step\. 
If the task has been completed and there is no obvious follow\-up need, clearly conclude that no recommendation is needed;d\) If a recommendation is needed, explain what to recommend, why it should be recommended at this moment rather than letting the user complete it themselves, and the key parameter information required for the recommendation, such as target app, operation object, recipient, etc\.recommendations: Based on the selected functions and parameters in function\_selection, generate the final function call\(s\), which may include 0, 1, or multiple functions\. The function names and parameter values should be consistent with the reasoning conclusion in function\_selection, in the following format: \["instruction": "Recommendation instruction", "function": \["name": "FunctionName1", "parameters": …, "name": "FunctionName2", "parameters": …\]\]If there is no recommendation, function\_selection is not needed in thinking; directly output: No recommendation
Perceive Before Reasoning: A Pre-Reasoning Perception Framework for Efficient and Reliable Proactive Mobile Agents

Similar Articles

PRISM: Perception Reasoning Interleaved for Sequential Decision Making

Inducing Reasoning Primitives from Agent Traces

MIRAGE: Mobile Agents with Implicit Reasoning and Generative World Models

Efficient Agentic Reasoning Through Self-Regulated Simulative Planning

Anticipate and Learn: Unleashing Idle-Time Compute in Proactive Agents

Submit Feedback

Similar Articles

PRISM: Perception Reasoning Interleaved for Sequential Decision Making
Inducing Reasoning Primitives from Agent Traces
MIRAGE: Mobile Agents with Implicit Reasoning and Generative World Models
Efficient Agentic Reasoning Through Self-Regulated Simulative Planning
Anticipate and Learn: Unleashing Idle-Time Compute in Proactive Agents