MobileExplorer: Accelerating On-Device Inference for Mobile GUI Agents via Online Exploration

arXiv cs.AI Papers

Summary

MobileExplorer is a new framework that accelerates on-device inference for mobile GUI agents by performing lightweight parallel exploration of UI elements during model inference, reducing reasoning steps and latency by 23% while maintaining or improving task success rates.

arXiv:2605.26546v1 Announce Type: new Abstract: Mobile graphical user interface (GUI) agents enable AI models to autonomously operate smartphones on behalf of users. However, most existing systems focus primarily on optimizing task accuracy and rely on cloud-hosted models for inference, which introduces privacy concerns and network-dependent latency. As a result, fully on-device deployment of mobile GUI agents remains underexplored. We propose MobileExplorer, a new framework that accelerates on-device inference for vision-based mobile GUI agents via online exploration. The key idea is to exploit the long per-step reasoning time of vision-language models (VLMs) by performing lightweight, parallel exploration of UI elements. During model inference, the agent proactively probes semantically relevant UI elements and records these exploration traces as structured memory. To ensure reliable execution in live mobile environments, we design a two-level rollback mechanism that robustly restores the initial UI state when a fast but naive backtracking strategy fails. The collected exploration traces are then summarized into concise contextual hints and injected into the prompt to enhance the subsequent reasoning step. We evaluate MobileExplorer on multiple off-the-shelf devices using the AndroidWorld benchmark, as well as newly designed, more complex tasks and dynamic on-device environments. MobileExplorer reduces the average number of reasoning steps and end-to-end latency by 23%, while maintaining or improving task success rates by up to 5%. A video demonstration of MobileExplorer performance in the real world is available at https://youtu.be/thK7MJmdlvM.

Cached at: 05/27/26, 09:06 AM

# MobileExplorer: Accelerating On-Device Inference for Mobile GUI Agents via Online Exploration
Source: [https://arxiv.org/html/2605.26546](https://arxiv.org/html/2605.26546)
###### Abstract\.

Mobile graphical user interface (GUI) agents enable AI models to autonomously operate smartphones on behalf of users. However, most existing systems focus primarily on optimizing task accuracy and rely on cloud-hosted models for inference, which introduces privacy concerns and network-dependent latency. As a result, fully on-device deployment of mobile GUI agents remains underexplored. We propose MobileExplorer, a new framework that accelerates on-device inference for vision-based mobile GUI agents via online exploration. The key idea is to exploit the long per-step reasoning time of vision-language models (VLMs) by performing lightweight, parallel exploration of UI elements. During model inference, the agent proactively probes semantically relevant UI elements and records the resulting exploration traces as structured memory. To ensure reliable execution in live mobile environments, we design a two-level rollback mechanism that robustly restores the initial UI state when a fast but naive backtracking strategy fails. The collected exploration traces are then summarized into concise contextual hints and injected into the prompt to enhance the subsequent reasoning step. We evaluate MobileExplorer on multiple off-the-shelf devices using the AndroidWorld benchmark, as well as newly designed, more complex tasks and dynamic on-device environments. MobileExplorer reduces the average number of reasoning steps and end-to-end latency by 23%, while maintaining or improving task success rates by up to 5%. A video demonstration of MobileExplorer's performance in the real world is available at [https://youtu.be/thK7MJmdlvM](https://youtu.be/thK7MJmdlvM). The source code for MobileExplorer will be released once the paper is accepted.

## 1\.Introduction

Mobile GUI agents have rapidly advanced with the progress of large language models (LLMs) (Achiam et al., [2023](https://arxiv.org/html/2605.26546#bib.bib10); Dubey et al., [2024](https://arxiv.org/html/2605.26546#bib.bib13)) and vision-language models (VLMs) (Bai et al., [2025](https://arxiv.org/html/2605.26546#bib.bib12); Team et al., [2023](https://arxiv.org/html/2605.26546#bib.bib11)), enabling end-to-end mobile task automation where understanding and planning are handled within a single model (Wen et al., [2024](https://arxiv.org/html/2605.26546#bib.bib6)). These agents generally adopt one of two input modalities: text-based, which relies on accessibility trees (Ding, [2024](https://arxiv.org/html/2605.26546#bib.bib5); Wen et al., [2024](https://arxiv.org/html/2605.26546#bib.bib6), [2025](https://arxiv.org/html/2605.26546#bib.bib7); Lee et al., [2024](https://arxiv.org/html/2605.26546#bib.bib8); Dai et al., [2025](https://arxiv.org/html/2605.26546#bib.bib9)), and vision-based, which operates directly on screenshots (Wang et al., [2024](https://arxiv.org/html/2605.26546#bib.bib14); Ye et al., [2025](https://arxiv.org/html/2605.26546#bib.bib15); Wang et al., [2025](https://arxiv.org/html/2605.26546#bib.bib16); Zhou et al., [2025](https://arxiv.org/html/2605.26546#bib.bib17); Yan et al., [2025](https://arxiv.org/html/2605.26546#bib.bib18)). Compared with text-based accessibility trees that mainly expose textual attributes, screenshots offer richer visual context, including layout, spatial relations, and icons, enabling stronger visual grounding for complex interfaces (You et al., [2024](https://arxiv.org/html/2605.26546#bib.bib27); Li and Li, [2022](https://arxiv.org/html/2605.26546#bib.bib26)). Consequently, vision-based agents often outperform text-based methods on challenging GUI tasks (Li and Li, [2022](https://arxiv.org/html/2605.26546#bib.bib26); Hong et al., [2024](https://arxiv.org/html/2605.26546#bib.bib21)) and have become the dominant approach, which we focus on in this work.

However, most existing GUI agent systems, whether text-based or vision-based, run LLM/VLM reasoning in the cloud and only execute actions locally. This design requires uploading user interface data, creating significant privacy risks. It highlights the growing need for fully on-device mobile GUI agents that perform perception, reasoning, and action entirely locally.

Despite these advantages, building a fully on-device vision-based mobile GUI agent remains challenging. First, while VLMs provide stronger visual understanding than LLMs, they incur much higher computation and memory costs, making on-device deployment difficult even for lightweight models. For example, MAI-UI-2B (Zhou et al., [2025](https://arxiv.org/html/2605.26546#bib.bib17)) still incurs about 40 seconds of latency on a Samsung Galaxy S24. Second, mobile GUI tasks demand fine-grained visual grounding over complex interface elements (icons, layouts, text), which cannot be easily compressed or structured like accessibility-tree inputs (Wen et al., [2024](https://arxiv.org/html/2605.26546#bib.bib6)). Finally, mobile interfaces are highly dynamic due to pop-ups, content changes, and device-specific variations, often requiring multiple rounds of VLM reasoning.

Existing approaches for accelerating VLM inference in mobile GUI agents still face notable limitations. Multi-step planning or script-style execution (Wen et al., [2024](https://arxiv.org/html/2605.26546#bib.bib6), [2025](https://arxiv.org/html/2605.26546#bib.bib7)) reduces the number of model calls by generating action sequences in advance, but such plans are brittle and often fail under dynamic UI changes. Verifier-based pipelines (Dai et al., [2025](https://arxiv.org/html/2605.26546#bib.bib9)) shift action generation to lightweight validation over candidate actions, yet their effectiveness depends heavily on candidate quality, and they still incur nontrivial reasoning overhead. Token- or context-pruning techniques (Lin et al., [2025](https://arxiv.org/html/2605.26546#bib.bib33)) reduce the complexity of the input data, but aggressive pruning may discard fine-grained visual information that is essential for accurate GUI grounding. Moreover, most existing systems process reasoning and GUI interaction sequentially, leaving the long VLM inference time underutilized.

In this paper, we propose MobileExplorer, a new on‑device mobile GUI agent framework that exploits the long VLM reasoning step latency to perform parallel online exploration\. We observe that on\-device mobile GUI agents exhibit a clear latency imbalance: UI perception and interaction are relatively lightweight, while on\-device VLM reasoning incurs substantial latency\. Instead of remaining idle during model inference, the system proactively interacts with the current screen to gather task‑relevant information that strengthens subsequent reasoning steps, thereby reducing overall latency\. To enable efficient exploration within the model’s reasoning latency, MobileExplorer adopts a task relevance‑driven exploration strategy that prioritizes semantically important and clickable UI elements using lightweight text embeddings\(Reimers and Gurevych,[2019](https://arxiv.org/html/2605.26546#bib.bib20)\), with each selected element associated with precise coordinates\. To ensure stable navigation, we design a robust rollback mechanism that reliably restores the interface to its initial UI state after each exploration attempt, preventing UI drift and guaranteeing that reasoning decisions are executed on the same screen state used for model inference\. The explored UI elements and interaction traces are then converted into structured, compact textual hints through lightweight template‑based summarization, where semantic labels derived from UI attributes are ranked and refined into concise prompts for the model\. By externalizing and reusing exploration knowledge as a dedicated reasoning context, MobileExplorer enhances per‑step reasoning accuracy to reduce both reasoning steps and end‑to‑end system latency\.

We evaluate MobileExplorer on multiple off\-the\-shelf smartphones using the AndroidWorld benchmark\(Rawleset al\.,[2024](https://arxiv.org/html/2605.26546#bib.bib3)\), along with newly designed complex tasks and dynamic on‑device environments\. Under fully on\-device execution, MobileExplorer preserves task success rates while reducing the average number of interaction steps and end\-to\-end latency by approximately 23%\.

The main contributions of this work are:

- We conduct an in-depth analysis of end-to-end execution latency in on-device mobile GUI agents, and identify that the long VLM reasoning time is underutilized in existing sequential interaction pipelines.
- We propose MobileExplorer, a new on-device mobile GUI agent framework that exploits model reasoning time to perform lightweight, parallel online exploration of UI elements, allowing the system to gather task-relevant information to enhance subsequent reasoning steps.
- To enable effective exploration during model reasoning, we design a task relevance-driven exploration strategy that probes semantically meaningful and diverse UI elements, as well as a two-level rollback mechanism that restores the initial UI state. The resulting exploration traces are then converted into structured prompt hints to enhance model reasoning.
- We evaluate MobileExplorer on multiple commercial devices using the AndroidWorld benchmark and newly designed real-world tasks. The results show that MobileExplorer improves task success rates by up to 5% while reducing both reasoning steps and end-to-end latency by 23%.

## 2\.Related Work

Mobile GUI agent systems. Mobile GUI agents have been widely studied with both text-based and vision-based inputs. Early systems rely on LLMs operating on structured textual representations such as accessibility trees and action histories. For example, AutoDroid (Wen et al., [2024](https://arxiv.org/html/2605.26546#bib.bib6)) and AutoDroid-V2 (Wen et al., [2025](https://arxiv.org/html/2605.26546#bib.bib7)) leverage planning and memory mechanisms for long-horizon GUI interactions, while V-Droid (Dai et al., [2025](https://arxiv.org/html/2605.26546#bib.bib9)) adopts a verifier-based paradigm that selects actions from candidate UI elements. However, text-only representations miss rich visual cues such as icons, layout structures, and spatial relationships. Recent work therefore explores vision-based agents that operate directly on screenshots, including CogAgent (Hong et al., [2024](https://arxiv.org/html/2605.26546#bib.bib21)), Mobile-Agent-V3 (Ye et al., [2025](https://arxiv.org/html/2605.26546#bib.bib15)), MAI-UI (Zhou et al., [2025](https://arxiv.org/html/2605.26546#bib.bib17)), and STEP-UI (Yan et al., [2025](https://arxiv.org/html/2605.26546#bib.bib18)). These systems improve GUI grounding through visual-region alignment and interaction trajectory training. However, most existing agents rely on cloud-based inference and interact with devices remotely via ADB ([Google Android Developers](https://arxiv.org/html/2605.26546#bib.bib23)), which introduces privacy risks and network dependency.

On-device deployment of mobile GUI agents. A few studies explore running language-model-based agents directly on mobile devices (Wen et al., [2024](https://arxiv.org/html/2605.26546#bib.bib6), [2025](https://arxiv.org/html/2605.26546#bib.bib7)). For example, AutoDroid (Wen et al., [2024](https://arxiv.org/html/2605.26546#bib.bib6)) performs local reasoning by representing GUI states as structured text and constructing UI transition graphs, while AutoDroid-V2 (Wen et al., [2025](https://arxiv.org/html/2605.26546#bib.bib7)) reduces latency by generating action plans from app documentation. However, these methods rely on textual input and planning priors, which cannot be directly applied to vision-based agents that must reason over raw screenshots. Moreover, vision-based agents require VLM inference over high-dimensional visual inputs, introducing substantially higher computational overhead for on-device deployment. In contrast, we focus on accelerating on-device inference for vision-based GUI agents by using the reasoning time to perform lightweight exploration in parallel.

Offline knowledge base construction for mobile GUI agents. Some mobile GUI agents improve decision making by incorporating prior knowledge from explored interfaces. For example, AutoDroid (Wen et al., [2024](https://arxiv.org/html/2605.26546#bib.bib6)) constructs a UI transition graph as structured memory, while AutoDroid-V2 (Wen et al., [2025](https://arxiv.org/html/2605.26546#bib.bib7)) leverages app documentation and generated task samples for planning. Other works build such knowledge through offline exploration. GUI-explorer (Xie et al., [2025](https://arxiv.org/html/2605.26546#bib.bib24)) mines transition-aware knowledge from state–action traces, and LLM-Explorer (Zhao et al., [2025](https://arxiv.org/html/2605.26546#bib.bib19)) constructs reusable repositories of UI states and interaction graphs via large-scale app exploration. However, these approaches depend on offline-collected trajectories or pre-built knowledge bases, which are costly to construct and difficult to generalize to dynamic interfaces. In contrast, our work performs lightweight online exploration during inference without requiring offline knowledge construction, enabling adaptation to dynamic real-world mobile applications.

## 3\.A Motivation Study

### 3\.1\.Background

#### 3\.1\.1\.Workflow of Mobile GUI Agents

Figure [1](https://arxiv.org/html/2605.26546#S3.F1) illustrates the typical end-to-end workflow of a mobile GUI agent, which consists of perception, reasoning, and operation. (1) Perception. Given the current screen, the agent first captures a screenshot that reflects the visual layout, icons, text, and spatial relationships among UI elements. (2) Reasoning and planning. An image encoder converts the screenshot into visual embeddings. These embeddings are then combined with textual tokens, such as user instructions, task descriptions, or dialogue context, and fed into a VLM for reasoning. The VLM outputs executable GUI actions, such as tapping icons, typing text, scrolling, or navigating across apps. (3) Operation. The system executes these actions on the device and repeats the perception–reasoning–operation loop until the task is completed. For example, to accomplish the task “Turn on Wi-Fi,” the agent must open the Settings app, scroll to locate the Wi-Fi menu, and tap it.

![Refer to caption](https://arxiv.org/html/2605.26546v1/Figures/end_to_end.png)Figure 1. End-to-end workflow of a mobile GUI agent. The agent takes a screenshot as visual input, performs multimodal reasoning with a VLM, and outputs text that is mapped to executable GUI actions.
#### 3\.1\.2\.Vision\-based Mobile GUI Agents

Compared with accessibility trees that mainly expose textual attributes, screenshots provide richer visual context such as layout structures, spatial relations, and icons, which are essential for understanding complex mobile interfaces\(Youet al\.,[2024](https://arxiv.org/html/2605.26546#bib.bib27); Li and Li,[2022](https://arxiv.org/html/2605.26546#bib.bib26)\)\. As shown in Fig\.[2\(a\)](https://arxiv.org/html/2605.26546#S3.F2.sf1), the top\-performing agents on the AndroidWorld benchmark are predominantly vision\-based\. For example, strong agents such as GUI\-Owl\(Yeet al\.,[2025](https://arxiv.org/html/2605.26546#bib.bib15)\)and MAI\-UI\(Zhouet al\.,[2025](https://arxiv.org/html/2605.26546#bib.bib17)\)rely on screenshot inputs for perception\. In contrast, approaches that depend mainly on accessibility trees rarely appear among the leading entries\. This difference is particularly evident in tasks that require recognizing icons or spatial layout\. These observations indicate that vision is a fundamental capability for reliable mobile GUI agents\.

However, visual perception also introduces significant system challenges. Vision-language models incur substantially higher computational cost than LLMs, leading to large latency when deployed directly on mobile devices. To quantify this overhead, we measure the inference latency of representative VLMs on a Samsung Galaxy S24 smartphone with 12GB RAM. All models are executed using llama.cpp and quantized to Q8 to enable on-device execution. Each inference processes a screenshot with a resolution of $540 \times 1200$. The results in Fig. [2(b)](https://arxiv.org/html/2605.26546#S3.F2.sf2) show that VLM inference introduces substantial latency on mobile hardware, and the delay increases rapidly with model size. These observations highlight a key systems challenge: although visual perception is essential for mobile GUI agents, directly executing VLM reasoning on-device can lead to significant end-to-end latency.

![Refer to caption](https://arxiv.org/html/2605.26546v1/Results/motivation_benchmark.png)\(a\)Top models on AndroidWorld\(Zhouet al\.,[2025](https://arxiv.org/html/2605.26546#bib.bib17); Daiet al\.,[2025](https://arxiv.org/html/2605.26546#bib.bib9); Yeet al\.,[2025](https://arxiv.org/html/2605.26546#bib.bib15); Wanget al\.,[2025](https://arxiv.org/html/2605.26546#bib.bib16)\)\.
![Refer to caption](https://arxiv.org/html/2605.26546v1/Results/motivation_comparison.png)\(b\)On\-device VLM inference latency compared with server\.

Figure 2\.Vision\-based agents dominate mobile GUI benchmarks but introduce substantial on\-device inference latency\.

### 3\.2\.Understanding Latency of On\-device Mobile GUI Agent Systems

![Refer to caption](https://arxiv.org/html/2605.26546v1/Results/android_steps.png)\(a\)Number of steps of tasks in AndroidWorld\(Rawleset al\.,[2024](https://arxiv.org/html/2605.26546#bib.bib3)\)\.
![Refer to caption](https://arxiv.org/html/2605.26546v1/Results/motivation_latency.png)\(b\)Latency across different steps of a task\.

Figure 3. Latency of mobile GUI agent systems. Model inference leaves a large amount of idle time at every step.

![Refer to caption](https://arxiv.org/html/2605.26546v1/Figures/overview.png)Figure 4. System overview of MobileExplorer. During each reasoning step, the system performs lightweight online exploration in parallel with VLM inference, uses a robust rollback mechanism to restore the initial UI state, summarizes the discovered screen information into compact hints, and feeds them into the next reasoning step.

Visual reasoning introduces a critical systems bottleneck on mobile devices due to the high computational cost of VLM inference on resource-constrained hardware. To quantify this overhead, we conduct a preliminary measurement using a task from the AndroidWorld benchmark: recording and saving an audio file on a commercial smartphone. The experiment is performed on a Samsung Galaxy S24 (12GB RAM) by running a 2B VLM (MAI-UI (Zhou et al., [2025](https://arxiv.org/html/2605.26546#bib.bib17))) fully on-device. Fig. [3(a)](https://arxiv.org/html/2605.26546#S3.F3.sf1) shows the latency breakdown of a single decision step. Even for this simple task, each step incurs substantial planning latency (tens of seconds), while perception and UI operations account for only a small fraction of the total time. This indicates that most of the system latency is spent waiting for the VLM to complete reasoning before issuing the next action. Fig. [3(b)](https://arxiv.org/html/2605.26546#S3.F3.sf2) further measures the inference latency of VLMs with different model sizes on the same device.

Moreover, mobile GUI tasks are inherently multi\-step and sequential\. Fig\.[3\(b\)](https://arxiv.org/html/2605.26546#S3.F3.sf2)shows that completing a task often requires many interaction steps, with some tasks exceeding 15–20 actions\. Consequently, the large per\-step reasoning latency accumulates across the interaction trajectory, leading to extremely high task completion time on resource\-constrained devices\. These observations suggest that the inefficiency of on\-device GUI agents stems not from a single slow inference, but from the combination of long\-horizon decision making and heavy per\-step reasoning cost\.
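To make the accumulation argument concrete, the following is a minimal back-of-the-envelope sketch rather than a measurement. It assumes a purely sequential loop, a reasoning latency of about 40 seconds per step (the figure quoted earlier for MAI-UI-2B on a Galaxy S24), and a placeholder of 2 seconds per step for perception and UI operation. The function name `task_latency_s` and the 2-second overhead are illustrative assumptions.

```python
def task_latency_s(num_steps: int, reasoning_s: float, other_s: float) -> float:
    """Total time of a sequential loop: every step pays reasoning plus UI overhead."""
    return num_steps * (reasoning_s + other_s)


# Illustrative values only: ~40 s reasoning per step, ~2 s perception and UI operation.
for steps in (5, 15, 20):
    total = task_latency_s(steps, reasoning_s=40.0, other_s=2.0)
    print(f"{steps:2d} steps: {total:5.0f} s ({total / 60:.1f} min)")
```

Under these assumptions a 15-step task already takes more than ten minutes, and reasoning accounts for almost all of it, which is why reducing the number of reasoning steps matters more than speeding up UI operations.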

#### 3\.2\.1\.Opportunity for online exploration\.

The preliminary analysis above also reveals a key opportunity for accelerating on-device mobile GUI agent systems. Each VLM reasoning step takes tens of seconds, whereas lightweight UI interactions, such as taps or screen transitions, typically require only 1–2 seconds. Yet existing mobile GUI agents follow a sequential pipeline: the system waits for the model's next action before interacting with the UI, leaving the device idle during long inference periods. We argue that this idle interval can instead be exploited. Rather than treating reasoning latency as unavoidable overhead, the system can perform lightweight online exploration in parallel with model inference. By probing selected UI elements and observing the resulting screen transitions, the agent can uncover additional UI context that reduces uncertainty in later decisions and shortens the overall interaction trajectory.

### 3\.3\.Summary

We now summarize the key findings from our motivation study\.

- Although vision-based mobile GUI agents dominate realistic mobile benchmarks, they incur substantial end-to-end latency for on-device deployment due to long VLM inference time and the accumulation of delays across multi-step reasoning.
- Since UI interactions are much faster than model reasoning, the large idle intervals that arise during model inference provide opportunities for lightweight parallel exploration.

## 4\.System Overview

We propose MobileExplorer, a fully on\-device vision\-based mobile GUI agent that overlaps VLM reasoning with lightweight online exploration, as illustrated in Fig\.[4](https://arxiv.org/html/2605.26546#S3.F4)\. Unlike conventional GUI agents that follow a sequential perception–reasoning–operation pipeline, MobileExplorer treats the long VLM reasoning phase as an exploitable time window and performs lightweight UI exploration concurrently on the same device to gather additional task\-relevant UI context\. All processes, including perception, reasoning, exploration, and interaction, are executed locally on the smartphone, preserving user privacy and eliminating any need for cloud communication\.

At each step $i$, the agent captures the current screenshot and feeds it, together with the task description, to the VLM for action reasoning. Meanwhile, instead of waiting for the model output, the system parses UI elements from the current screen and computes semantic similarity between candidate interactive elements and the task description using a lightweight text embedding model. Based on this relevance score, the agent selects a small set of clickable and diverse UI elements and performs lightweight probing actions (e.g., tapping elements and observing screen transitions) to explore potential UI branches. To maintain consistency with the main interaction trajectory, the system records exploration traces and reliably restores the interface to the original UI state where exploration started before executing the next decision. The exploration outcomes, including visited screens and discovered UI semantics, are then summarized into compact textual hints through lightweight template-based summarization. These hints are appended to the prompt of the next reasoning step, enabling the VLM to incorporate newly discovered contextual information instead of relying solely on the original screenshot.

Overall, MobileExplorer forms a partially parallel execution pipeline: VLM reasoning operates on the current step $i$, while exploration simultaneously collects contextual knowledge that can assist future reasoning steps. By overlapping model inference with lightweight UI exploration, the system uses otherwise idle reasoning time to gather additional UI evidence without increasing the number of model invocations. This design improves decision accuracy and reduces unnecessary trial-and-error interactions in long-horizon mobile GUI tasks, ultimately reducing the overall end-to-end latency while remaining fully deployable on commodity smartphones.
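The overlap described in this section can be illustrated with a minimal Python sketch. It is not the authors' implementation: `vlm_infer` and `probe_and_rollback` are hypothetical stand-ins that simulate latency with `time.sleep`, and the element names are made up. The sketch only shows the control flow, in which exploration runs while the reasoning call is in flight and the collected hints are returned for use in the next step.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def vlm_infer(screenshot, task, hints):
    """Hypothetical stand-in for slow on-device VLM reasoning."""
    time.sleep(0.2)
    return {"action": "tap", "hints_used": len(hints)}


def probe_and_rollback(element):
    """Hypothetical stand-in: tap an element, observe the screen, restore the UI."""
    time.sleep(0.02)
    return f"Tapping '{element}' opens the {element} screen."


def run_step(screenshot, task, candidates, prev_hints):
    """Run one reasoning step while exploring in the otherwise idle window."""
    new_hints = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(vlm_infer, screenshot, task, prev_hints)
        for element in candidates:
            if future.done():  # reasoning finished, so stop exploring
                break
            new_hints.append(probe_and_rollback(element))
        action = future.result()  # blocks until the VLM has decided
    return action, new_hints


action, hints = run_step(
    "screen_0.png", "Turn on Wi-Fi", ["Network", "Display", "Battery"], prev_hints=[]
)
print(action)
print(hints)
```

Each probe in this sketch finishes its own rollback before the loop checks the future again, so the action is always applied to the screen the VLM reasoned about. The hints gathered at step $i$ would be passed as `prev_hints` at step $i+1$.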

## 5\.Design of MobileExplorer

### 5\.1\.Problem Formulation

We consider an on-device vision-based mobile GUI agent that interacts with an application to finish a user-defined task through multi-step interactions (including perception, reasoning, and operations). At step $i$, the agent observes the current UI screenshot $s_i$ and receives a task description $\mathcal{G}$. Due to the partial observability of mobile interfaces, the full UI structure cannot be known a priori and can only be revealed through interaction.

*State and Action Space.* At step $i$, the agent observes the current screenshot $s_i$ and parses the set of interactive UI elements:

$$\mathcal{A}_i = \{a_i^{(1)}, a_i^{(2)}, \dots, a_i^{(K_i)}\}. \tag{1}$$

Each action corresponds to interacting with a UI element (e.g., click or scroll), which may trigger a transition to a new UI state.

*Latency-Constrained Interactions.* On-device VLM inference incurs significant latency. Let $\tau_i^{\text{vlm}}$ denote the reasoning latency at step $i$. During this period, the device remains idle from the interaction perspective, providing an opportunity to perform auxiliary operations without increasing user-perceived delay.

*System Objective\.*The objective of MobileExplorer is to maximize the probability of completing a task while minimizing interaction cost under on\-device latency constraints:

$$\max_{\pi} \Pr\left(v_N \in \mathcal{V}_{\text{goal}}(\mathcal{G})\right), \quad \text{s.t.} \quad \sum_{i} \tau_i^{\text{interact}} \leq T, \tag{2}$$

where $T$ is the latency budget and $\pi$ denotes the interaction policy.

To achieve this objective, MobileExplorer leverages the otherwise idle reasoning latency to perform lightweight online exploration and gather additional UI context without increasing the overall execution delay\. This design introduces several challenges, including selecting informative interactions under strict time constraints, summarizing discovered UI states into compact reasoning context, and reliably restoring the original UI state after exploration\.

### 5\.2\.Task Relevance\-driven Exploration

![Refer to caption](https://arxiv.org/html/2605.26546v1/Figures/exploration_strategy.png)Figure 5. Task relevance-driven exploration. During each VLM reasoning step, the system ranks clickable UI elements by task relevance and performs short exploratory probes within the model reasoning time.

As motivated by Section [3](https://arxiv.org/html/2605.26546#S3), the goal of online exploration is to proactively gather additional UI context during VLM inference to improve subsequent reasoning. During model reasoning, the agent interacts with selected UI elements to uncover hidden UI branches and collect auxiliary information that may not be visible from the current screenshot alone.

#### 5\.2\.1\.Challenges\.

First, online exploration must operate strictly within the latency of the VLM inference at each reasoning step to avoid introducing additional delay\. Consequently, exhaustive traversal of the UI graph is infeasible on resource\-constrained mobile devices\. In addition, a typical mobile screen may contain many clickable UI elements associated with different applications or actions, making it difficult to determine which interactions are most informative for the current task\. Therefore, the exploration mechanism must both prioritize task\-relevant elements and avoid repeatedly probing previously explored UI branches\.

#### 5\.2\.2\.Exploration Strategy\.

To efficiently identify interaction candidates, the exploration module leverages the accessibility tree, which exposes structured UI elements with precise coordinates and interaction attributes while remaining far more lightweight than vision-based UI parsing. At the beginning of step $i$, the exploration controller parses the accessibility tree of the current screen $s_i$ and extracts the clickable element set:

$$\mathcal{A}_i = \{a_i^{(1)}, a_i^{(2)}, \dots, a_i^{(K_i)}\}. \tag{3}$$
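
As a rough illustration of this extraction step, the sketch below pulls clickable elements and their tap coordinates out of an accessibility dump. The paper does not state how the tree is obtained; the sketch assumes an XML dump in the style produced by Android's `uiautomator dump`, with `text`, `content-desc`, `resource-id`, `clickable`, and `bounds` attributes. The sample XML, the resource identifiers, and the function name `clickable_elements` are made up for the example.

```python
import re
import xml.etree.ElementTree as ET

# Made-up sample in the style of a uiautomator XML dump (an assumption).
SAMPLE_DUMP = """<hierarchy rotation="0">
  <node text="" content-desc="" resource-id="" clickable="false" bounds="[0,0][1080,2340]">
    <node text="Chrome" content-desc="Chrome" resource-id="launcher:id/icon" clickable="true" bounds="[100,1800][300,2000]"/>
    <node text="" content-desc="Phone" resource-id="launcher:id/icon" clickable="true" bounds="[400,1800][600,2000]"/>
  </node>
</hierarchy>"""

BOUNDS_RE = re.compile(r"\[(\d+),(\d+)\]\[(\d+),(\d+)\]")


def clickable_elements(xml_text):
    """Return clickable nodes as dicts with a text label and a tap coordinate."""
    elements = []
    for node in ET.fromstring(xml_text).iter("node"):
        if node.get("clickable") != "true":
            continue
        match = BOUNDS_RE.fullmatch(node.get("bounds", ""))
        if match is None:
            continue  # skip nodes without usable coordinates
        x1, y1, x2, y2 = map(int, match.groups())
        parts = (node.get("text"), node.get("content-desc"), node.get("resource-id"))
        # Drop empty fields and repeated strings while preserving order.
        label = " ".join(dict.fromkeys(p for p in parts if p))
        elements.append({"label": label, "center": ((x1 + x2) // 2, (y1 + y2) // 2)})
    return elements


for element in clickable_elements(SAMPLE_DUMP):
    print(element)
```

The `label` string plays the role of the lightweight textual representation of an element, and `center` is the coordinate a probe would tap.
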
Each element $a_i^{(k)}$ is converted into a lightweight textual representation (e.g., text label, content description, resource identifier). As illustrated in Fig. [5](https://arxiv.org/html/2605.26546#S5.F5), a typical mobile screen may contain multiple candidate UI elements corresponding to different applications or actions. For example, on a home screen the accessibility tree may expose elements associated with apps such as Phone, Chrome, or YouTube. Given a task description $\mathcal{G}$, the system computes semantic embeddings for both the task description and each UI element, and ranks candidates based on their similarity. Elements whose semantics are more relevant to the task (e.g., launching a browser for a web search task) receive higher exploration priority, while unrelated elements receive lower scores. Specifically, we compute the semantic similarity between the element representation and the task description $\mathcal{G}$ using a lightweight embedding model, producing a task relevance score

$$r(a_i^{(k)}) = \text{sim}\left(e(a_i^{(k)}), e(\mathcal{G})\right), \tag{4}$$

where $e(\cdot)$ denotes a lightweight text embedding model used to encode both UI element descriptions and the task goal into semantic vectors, and $\text{sim}$ denotes the cosine similarity between embeddings.

To avoid repeatedly probing the same UI branches, the system incorporates exploration history. Let $\mathcal{H}_i$ denote the set of previously visited UI elements. The final exploration priority score is defined as

$$S(a_i^{(k)}) = r(a_i^{(k)}) - \lambda \cdot \mathbf{1}[a_i^{(k)} \in \mathcal{H}_i], \tag{5}$$

where $\lambda$ is a penalty weight that discourages revisiting explored elements.
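
As a minimal sketch of how Eqs. 4 and 5 combine, the Python below ranks clickable elements by cosine similarity to the task and subtracts a penalty for elements already in the history. The bag-of-words `embed` function is only a stand-in for the lightweight embedding model (which this section does not name), and `lam=0.5` and the element strings are illustrative assumptions.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words embedding; a stand-in for the embedding model e(.)."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse vectors (sim in Eq. 4)."""
    dot = sum(u[t] * v[t] for t in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def priority(element, task, history, lam=0.5):
    """S(a) = r(a) - lam * 1[a in H], as in Eq. 5."""
    relevance = cosine(embed(element), embed(task))
    return relevance - (lam if element in history else 0.0)


def rank(elements, task, history, lam=0.5):
    """Return elements in descending order of priority score."""
    return sorted(elements, key=lambda a: priority(a, task, history, lam), reverse=True)


elements = ["Chrome web browser", "Phone", "YouTube"]
task = "search the web for weather in a browser"
print(rank(elements, task, history=set()))
print(rank(elements, task, history={"Chrome web browser"}))
```

With an empty history the browser element ranks first; once it has been visited, the penalty pushes it below the unvisited elements, which is the revisit-avoidance behavior the penalty term is meant to produce.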

The exploration controller then probes candidates in descending order of $S(\cdot)$. Each probe consists of executing a click action and observing the resulting screen transition. The exploration proceeds sequentially up to a bounded depth $d$, meaning that at most $d$ exploratory actions are executed before returning to the starting UI state. The resulting screen is parsed again to collect additional UI elements and contextual information. Because each probe is short and bounded, multiple probes can be executed within the same reasoning window.

Exploration terminates immediately once the time budget $\tau_i^{\text{explore}}$ is exhausted. The discovered UI elements, screens, and contextual observations are stored as exploration context $\mathcal{C}_i$ for subsequent reasoning steps. As a result, MobileExplorer can efficiently gather task-relevant UI context within the VLM reasoning window while avoiding redundant exploration, enabling informative probing of UI branches without increasing end-to-end latency.

### 5\.3\.Robust Rollback via Two\-level Checking

Online exploration in mobile GUI environments presents significant challenges compared with offline exploration in simulators\. In offline settings, agents can freely duplicate or clone environment states before exploring alternative branches\. In contrast, MobileExplorer operates on a live mobile device, where every exploratory action directly alters the real UI state\. Without reliable state recovery, exploration may drift the system away from the screen on which the reasoning decision was made, leading to inconsistent or incorrect execution\. Therefore, developing a robust rollback strategy is essential for effective online exploration in dynamic, real‑time settings\.

#### 5\.3\.1\.Challenges

Rollback in mobile GUI exploration is challenging due to two key properties of mobile interfaces. First, many UI transitions lack a true inverse operation. For example, a `back` action may close a dialog, dismiss a keyboard, or skip intermediate screens rather than returning to the exact previous state, making simple reversal of exploration actions unreliable. Second, UI states can evolve between interactions even when following the same navigation path. At step $i$, the agent observes only the rendered screenshot $s_i$ of the current UI state, while the underlying application state and navigation stack remain hidden.

These properties make naïve backtracking insufficient. The `back` operation may skip multiple pages, modal dialogs may insert temporary layers with non-standard navigation behavior, and tapping certain elements may trigger permission prompts or confirmation windows not present before. Moreover, asynchronous updates, such as notification banners or dynamically loaded content, can alter the interface, producing different screenshots even along identical navigation paths. Therefore, reliably restoring the exact starting state requires a robust rollback mechanism that tolerates minor visual variations while ensuring the agent returns to the correct UI context.

![Refer to caption](https://arxiv.org/html/2605.26546v1/Figures/exploration_return.png)Figure 6\.Two\-level rollback strategy with perceptual\-hash state verification and home\-and\-replay recovery\.
#### 5\.3\.2\.Two\-Level Rollback Strategy

To ensure reliable recovery under these conditions, MobileExplorer adopts a two\-level rollback strategy illustrated in Fig\.[6](https://arxiv.org/html/2605.26546#S5.F6)\. The two levels serve complementary roles\. The first level performs fast depth\-bounded backtracking, which efficiently restores the original UI state in most cases\. However, due to irreversible UI transitions and dynamic interface changes, backtracking may occasionally fail to reach the exact starting screen\. The second level therefore provides a deterministic recovery mechanism that reconstructs the UI state through action replay\. Together, these two levels ensure both efficiency and robustness during exploration\.

*Level-1: Depth-Bounded Backtracking.* Exploration is performed with a bounded depth $d$. Let the exploration start from UI state $s_i$. Meanwhile, the system records the interaction trace that led to this exploration state. We compute the perceptual hash of the starting screen as

$$H_0 = H(s_i), \tag{6}$$

where $H(\cdot)$ denotes the pHash function. We use screenshot-based verification instead of accessibility-tree comparison because capturing screenshots on mobile devices is significantly faster than retrieving the accessibility tree. Frequent screen verification is required during rollback, and screenshot capture introduces much lower latency. Perceptual hashing further allows efficient comparison between screens while tolerating minor visual differences caused by dynamic UI elements. After finishing an exploratory branch, the agent issues the `back` action up to $d$ times to return to the previous screens. After each rollback step $k$, the current screen $s_{r_k}$ is captured and verified against the starting screen:

$$d_H(H(s_{r_k}), H_0) \leq \tau, \tag{7}$$

where $d_H(\cdot)$ denotes the Hamming distance between perceptual hashes and $\tau$ is a small threshold (written $\delta$ in Algorithm 1) allowing minor UI variations (e.g., scrolling offsets or dynamic content). If the condition holds, the system concludes that the original UI state has been successfully restored.
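
To make the check in Eq. 7 concrete, the following sketch compares two screens through a hash and a Hamming-distance threshold. It assumes a simplified average hash in place of pHash (real pHash applies a DCT before thresholding), treats a screen as a small grayscale grid, and uses an illustrative threshold `tau=2`; none of these values come from the paper.

```python
def average_hash(pixels):
    """Simplified perceptual hash: one bit per pixel, set if above the mean.

    NOTE: a stand-in for pHash, which additionally applies a DCT.
    `pixels` is a small grayscale grid (list of rows of ints in 0..255).
    """
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming(h1, h2):
    """Number of differing bits between two hashes (d_H in Eq. 7)."""
    return bin(h1 ^ h2).count("1")


def is_restored(current_pixels, start_hash, tau=2):
    """Eq. 7: the start screen counts as restored if d_H <= tau."""
    return hamming(average_hash(current_pixels), start_hash) <= tau


start = [[10, 10, 200, 200], [10, 10, 200, 200]]
noisy = [[12, 9, 198, 201], [10, 11, 200, 199]]    # same layout, slight pixel noise
other = [[200, 200, 10, 10], [200, 200, 10, 10]]   # a different screen

h0 = average_hash(start)
print(is_restored(noisy, h0), is_restored(other, h0))
```

The noisy copy of the start screen hashes to the same bits and passes the check, while the mirrored layout differs in every bit and fails it, which is the tolerance-with-discrimination property the rollback verification relies on.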

*Level-2: Home-and-Replay Recovery.* If the Level-1 rollback fails (e.g., due to irreversible UI transitions or navigation inconsistencies), the system performs a fallback recovery. For example, during exploration an interaction may trigger a confirmation dialog or permission prompt that was not previously visible. In such cases, issuing the `back` action may dismiss the dialog and return to an earlier screen rather than the exact exploration starting state. As a result, the agent cannot reliably restore the original UI state through simple backtracking.

To recover from such situations, the system falls back to a deterministic reconstruction procedure\. Let the interaction trace that led to the exploration starting state be

$$\Pi_i = (a_1, a_2, \dots, a_i), \tag{8}$$

where $a_k$ denotes the action executed at step $k$. The agent first returns to the Home screen and then deterministically replays the recorded trace $\Pi_i$ to reconstruct the UI state $s_i$. Since exploration depth is small in practice, the replay sequence is typically short and introduces negligible overhead compared to VLM inference latency.

**Algorithm 1** Exploration with Rollback

**Require:** Task $\mathcal{G}$, start screen $s_i$, clickable set $\mathcal{A}_i$, depth $d$, budget $\tau_i$, threshold $\delta$

**Ensure:** Exploration hints $\mathcal{C}_i$ and restored UI state $s_i$

1. $\mathcal{C}_i \leftarrow \emptyset$
2. $H_0 \leftarrow H(s_i)$ (compute pHash of start screen)
3. Rank candidates $\mathcal{A}_i^{\text{cand}}$ by semantic similarity to $\mathcal{G}$
4. **for all** $a \in \mathcal{A}_i^{\text{cand}}$ within $\tau_i$ **do**
5. &emsp;Interact with $a$ and explore transitions up to depth $d$
6. &emsp;Record discovered screens / elements into $\mathcal{C}_i$
7. &emsp;Issue `back` for $d$ steps (Level-1 rollback)
8. &emsp;Capture current screen $s'$
9. &emsp;$H' \leftarrow H(s')$
10. &emsp;**if** $d_H(H', H_0) > \delta$ **then**
11. &emsp;&emsp;Go `home` and replay reasoning trace $\Pi_i$ (Level-2 recovery)
12. &emsp;**end if**
13. **end for**
14. **return** $\mathcal{C}_i$

Algorithm[1](https://arxiv.org/html/2605.26546#alg1)outlines the overall exploration process with rollback\. This two\-level design provides a fast rollback path in the common case \(Level\-1\) while guaranteeing correctness under non\-deterministic UI behaviors \(Level\-2\)\. As a result, exploration can safely probe alternative UI branches without permanently altering the agent’s main decision trajectory\.
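
The control flow of Algorithm 1 can be sketched in Python against a toy device. Everything here is a hypothetical stand-in: `ToyDevice` models the phone as a navigation stack, screen names replace perceptual hashes (so the threshold check becomes an equality check), the time budget is approximated by a cap on the number of probes, and each probe is a single click.

```python
class ToyDevice:
    """Hypothetical stand-in for a live phone: a navigation stack over a screen graph."""

    def __init__(self, graph, skipping=()):
        self.graph = graph              # screen -> {element: next screen}
        self.skipping = set(skipping)   # screens where `back` jumps straight to home
        self.stack = ["home"]

    @property
    def screen(self):
        return self.stack[-1]

    def clickable(self):
        return list(self.graph.get(self.screen, {}))

    def click(self, element):
        self.stack.append(self.graph[self.screen][element])

    def back(self):
        if self.screen in self.skipping:
            self.stack = ["home"]       # irreversible transition: back skips pages
        elif len(self.stack) > 1:
            self.stack.pop()

    def home(self):
        self.stack = ["home"]


def explore_with_rollback(device, trace, candidates, depth=1, max_probes=None):
    """Sketch of Algorithm 1; returns the exploration context and Level-2 count."""
    context = {}
    start = device.screen                        # plays the role of H0
    level2_count = 0
    for element in candidates[:max_probes]:      # probe cap approximates the budget
        device.click(element)                    # single-click probe
        context[element] = device.clickable()    # record discovered elements
        for _ in range(depth):                   # Level-1: back up to `depth` times
            if device.screen == start:
                break
            device.back()
        if device.screen != start:               # verification failed
            device.home()                        # Level-2: home, then replay trace
            for action in trace:
                device.click(action)
            level2_count += 1
    return context, level2_count


graph = {
    "home": {"Settings": "settings"},
    "settings": {"Bluetooth": "bt", "Reset": "confirm"},
    "bt": {"Pair": "pair"},
}
device = ToyDevice(graph, skipping={"confirm"})
device.click("Settings")                         # main trajectory so far
context, n_level2 = explore_with_rollback(device, ["Settings"], ["Bluetooth", "Reset"])
print(context, n_level2, device.screen)
```

In this toy run the Bluetooth probe is undone by a single `back`, whereas the Reset probe lands on a screen whose `back` skips to home, so the sketch falls through to home-and-replay and still ends on the starting screen.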

### 5\.4\.Exploration\-Augmented Reasoning

Online exploration allows the agent to proactively gather additional UI context during the VLM inference window\. However, exploration is performed before the reasoning decision is finalized\. As a result, the reasoning action may lead the agent to a different screen than those visited during exploration\. Blindly injecting all exploration results into the prompt may therefore introduce irrelevant information and increase reasoning latency\.

#### 5\.4\.1\.Challenges\.

Two key challenges arise in this process. First, exploration observations may become misaligned with the UI state on which reasoning is performed. Exploration starts from screen $s_i$, but the executed reasoning action may lead to a new screen $s_{i+1}$, making some exploration trajectories no longer relevant to the current decision context. Second, exploration may discover many UI elements within a short time window. Directly injecting all discovered elements into the prompt may introduce noise and unnecessarily increase prompt length, negatively affecting both reasoning accuracy and latency.

#### 5\.4\.2\.UI State Selection\.

During exploration, the agent may visit multiple intermediate UI states\. To ensure that exploration results correspond to the current reasoning context, MobileExplorer records lightweight exploration observations and verifies their relevance to the current screen\.

*Exploration Observations\.*Each visited screen is recorded as a compact observation consisting of the perceptual hash \(pHash\) of the screenshot and the parsed UI elements extracted from the accessibility tree\. These observations capture screen identity and UI structure while avoiding large memory overhead\.

*Step Alignment.* After the reasoning action is executed and the agent reaches screen $s_{i+1}$, the system computes the pHash of the current screen and retrieves exploration observations with similar visual structure. Formally, the matched observation set is defined as

$$\mathcal{O}_i^{\text{match}} = \{o_j \in \mathcal{O}_i \mid d(h_{i+1}, h_j) < \delta\}, \tag{9}$$

where $\mathcal{O}_i$ is the set of observations recorded during exploration at step $i$, $h_{i+1}$ is the perceptual hash of the current screen, $h_j$ is the hash stored in observation $o_j$, and $d(\cdot)$ denotes the Hamming distance between perceptual hashes. This step filters out exploration results that correspond to unrelated UI states.
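
A brief sketch of the filter in Eq. 9 follows. The `Observation` record, the 8-bit hashes, and the threshold `delta=4` are illustrative assumptions; only the strict comparison against the threshold is taken from the equation.

```python
from dataclasses import dataclass, field


@dataclass
class Observation:
    """Compact record of one explored screen: its hash and parsed elements."""
    phash: int
    elements: list = field(default_factory=list)


def hamming(h1, h2):
    """Hamming distance between two integer hashes."""
    return bin(h1 ^ h2).count("1")


def match_observations(observations, current_hash, delta=4):
    """Eq. 9: keep observations whose hash is within delta bits of the current screen."""
    return [o for o in observations if hamming(current_hash, o.phash) < delta]


observations = [
    Observation(0b11110000, ["Wi-Fi", "Mobile data"]),
    Observation(0b00001111, ["Dial pad"]),
]
matched = match_observations(observations, current_hash=0b11110001)
print([o.elements for o in matched])
```

Only the first observation survives: its hash differs from the current screen in one bit, while the second differs in seven.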

![Refer to caption](https://arxiv.org/html/2605.26546v1/Figures/exploration_augment.png)Figure 7\.Exploration\-augmented reasoning\. The agent explores nearby UI states during inference, ranks discovered elements by task relevance, and generates textual hints to guide the next reasoning step\.
#### 5\.4\.3\.UI Element Selection\.

The matched exploration states may still contain multiple interactive UI elements corresponding to different applications or actions\. Directly injecting all discovered elements into the prompt may introduce irrelevant information and mislead the model’s reasoning\.

Therefore, MobileExplorer selects only a small set of task-relevant elements from the explored screens. Specifically, for each element $a_i^{(k)}$, the system computes a task relevance score using the semantic similarity defined in Eq. [4](https://arxiv.org/html/2605.26546#S5.E4), which measures the similarity between the element representation and the task goal $\mathcal{G}$. Elements with higher similarity scores are considered more relevant to the task and are prioritized for hint construction, while unrelated elements are filtered out.

#### 5\.4\.4\.Hint Generation\.

From the selected elements, MobileExplorer summarizes exploration results into concise textual hints that describe potentially useful UI locations. These hints form the exploration context $\mathcal{C}_i$, which is appended to the prompt for the next reasoning step.

The VLM therefore reasons over an augmented input consisting of the current screenshot, the task instruction, and the exploration context\. This design allows the agent to leverage additional UI knowledge discovered during exploration while keeping the reasoning input compact and relevant for on\-device deployment\.
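
As an illustration of element selection followed by hint generation, the sketch below keeps the highest-scoring explored elements and renders them as a short text block appended to the task instruction. The hint wording, the `top_k` and `min_score` values, and the prompt layout are hypothetical; the section does not specify the actual hint template.

```python
def build_hints(candidates, top_k=3, min_score=0.2):
    """Summarize explored elements into a short textual hint block.

    candidates: (entry_element, discovered_element, relevance) triples, where
    relevance is a task-relevance score such as the one in Eq. 4.
    The wording of each hint line is a hypothetical format.
    """
    relevant = sorted(
        (c for c in candidates if c[2] >= min_score),
        key=lambda c: c[2],
        reverse=True,
    )[:top_k]
    if not relevant:
        return ""
    lines = [
        f"- Tapping '{entry}' leads to a screen containing '{found}'."
        for entry, found, _ in relevant
    ]
    return "Exploration hints:\n" + "\n".join(lines)


def build_prompt(task, hints):
    """Append the exploration context to the task instruction."""
    return f"Task: {task}\n{hints}".rstrip()


explored = [
    ("Network", "Wi-Fi", 0.8),
    ("Display", "Brightness", 0.1),
    ("Connected devices", "Bluetooth", 0.5),
]
print(build_prompt("Turn on Wi-Fi", build_hints(explored)))
```

Capping the number of hints and dropping low-relevance elements keeps the added prompt text short, which matters because every extra token lengthens on-device inference.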

## 6\.Evaluation

### 6\.1\.Experiment Setup

Benchmark. We evaluate MobileExplorer on AndroidWorld (Rawles et al., [2024](https://arxiv.org/html/2605.26546#bib.bib3)), a task-oriented mobile GUI interaction benchmark that requires multi-step reasoning over real smartphone interfaces (details in Table [1](https://arxiv.org/html/2605.26546#S6.T1)). Unlike static GUI datasets such as Android-in-the-Wild ([Rawles et al.,](https://arxiv.org/html/2605.26546#bib.bib1)) and AndroidControl (Li et al., [2024](https://arxiv.org/html/2605.26546#bib.bib2)), which rely on offline interaction traces, AndroidWorld provides an interactive environment where agents operate real Android applications through system-level control interfaces. This closed-loop setting captures dynamic execution conditions such as runtime delays, UI transitions, and system responses, better reflecting practical mobile assistant scenarios where interaction efficiency and system responsiveness are critical. Table [1](https://arxiv.org/html/2605.26546#S6.T1) summarizes the key characteristics of the AndroidWorld benchmark. In addition to the benchmark evaluation, we also conduct an end-to-end real-world case study on a physical smartphone to validate the practicality of MobileExplorer in real deployment scenarios.

Devices. We evaluate MobileExplorer on three representative on-device platforms (as shown in Table [2](https://arxiv.org/html/2605.26546#S6.T2)), including a Samsung Galaxy S24 smartphone, an NVIDIA Jetson AGX Orin, and a MacBook Air M4 laptop. These platforms represent typical deployment environments for mobile GUI agents with different compute capabilities.

| Difficulty | Count | Ratio | Example task |
| --- | --- | --- | --- |
| Easy | 61 | 52.6% | AudioRecorderRecordAudio |
| Medium | 36 | 31.0% | NotesTodoItemCount |
| Hard | 19 | 16.4% | ExpenseAddMultipleFromMarkor |

Table 1. Task summary of AndroidWorld (Rawles et al., [2024](https://arxiv.org/html/2605.26546#bib.bib3)).

Implementation. MobileExplorer is implemented as an end-to-end mobile GUI agent system. During evaluation, AndroidWorld tasks are executed in an Android emulator on a host machine, which runs the interaction environment and provides GUI states (e.g., screenshots and accessibility trees). The agent communicates with the model through HTTP requests. To emulate practical on-device deployment while ensuring stable latency measurement, we adopt a dual-device setup: the emulator executes UI interactions, while model inference runs on target hardware platforms (phone, NVIDIA Jetson, and laptop). The models are deployed using `llama.cpp` with Q8 quantization and served through the vLLM framework, providing a unified API interface for the AndroidWorld agent. On the smartphone, `llama.cpp` runs inside the Termux application to enable local model execution. This design allows us to simulate realistic on-device agent execution while measuring the true end-to-end latency of the perception–reasoning–action loop across heterogeneous devices. The evaluation is mainly based on a 4B VLM (Yan et al., [2025](https://arxiv.org/html/2605.26546#bib.bib18)).

| Device | Compute Units | Memory |
| --- | --- | --- |
| Samsung Galaxy S24 | CPU, GPU, NPU | 12 GB |
| Jetson AGX Orin | CPU, GPU | 64 GB |
| MacBook Air M4 | CPU, GPU | 24 GB |

Table 2. Mobile and edge devices used in evaluation.

Baselines. We compare MobileExplorer with the following representative baselines based on 4B VLMs/LLMs.

- M3A Agent (Rawles et al., [2024](https://arxiv.org/html/2605.26546#bib.bib3)): Base mobile GUI agent that uses a VLM to reason over the current screen in a sequential perception–reasoning–action loop.
- T3A Agent (Rawles et al., [2024](https://arxiv.org/html/2605.26546#bib.bib3)): A mobile GUI agent that relies on accessibility trees and performs step-wise reasoning using an LLM.
- Input-pruning VLM agent: Agent that reduces visual token overhead via screenshot or token pruning (Lin et al., [2025](https://arxiv.org/html/2605.26546#bib.bib33)).
- Offline exploration agent: Agent that constructs knowledge through offline exploration (Zhao et al., [2025](https://arxiv.org/html/2605.26546#bib.bib19); Xie et al., [2025](https://arxiv.org/html/2605.26546#bib.bib24); Zhang et al., [2025](https://arxiv.org/html/2605.26546#bib.bib25)).

Beyond the above baselines, we further compare our method with existing methods reported on the AndroidWorld ([Rawles et al.,](https://arxiv.org/html/2605.26546#bib.bib1)) leaderboard.

Evaluation metrics. We evaluate MobileExplorer in terms of both effectiveness and system efficiency on real on-device execution. *Success rate* measures the percentage of tasks successfully completed within a predefined step budget, indicating the overall task-solving capability of the agent. *Total steps* denotes the number of interaction steps required to finish a task, reflecting decision efficiency and the amount of trial-and-error during long-horizon execution. *Latency* is evaluated at two granularities. Step latency is defined as the time of a single interaction cycle, measured from capturing the current screenshot $s_i$ to completing the execution of the selected action $a_i$. This includes perception, model reasoning, online exploration, and action execution. End-to-end latency is defined as the total task completion time from task initialization to the final successful state, i.e., the sum of all step latencies across the task. We also measure overhead such as CPU usage, memory, and power consumption in Section [6.6](https://arxiv.org/html/2605.26546#S6.SS6).

### 6\.2\.Overall Performance

![Refer to caption](https://arxiv.org/html/2605.26546v1/Results/overall_performance.png)\(a\)Success rate and end\-to\-end latency\.
![Refer to caption](https://arxiv.org/html/2605.26546v1/Results/overall_steps.png)\(b\)Average reasoning steps\.

Figure 8. Overall performance on AndroidWorld.

| Method | AndroidWorld (%) |
| --- | --- |
| MobileGPT (Lee et al., [2024](https://arxiv.org/html/2605.26546#bib.bib8)) | 23.0 |
| AutoDroid-V2 (Wen et al., [2025](https://arxiv.org/html/2605.26546#bib.bib7)) | 26.0 |
| M3A (a11y, GPT-4-Turbo) (Rawles et al., [2024](https://arxiv.org/html/2605.26546#bib.bib3)) | 30.6 |
| M3A (a11y, Gemini-2.5-Pro) (Rawles et al., [2024](https://arxiv.org/html/2605.26546#bib.bib3)) | 31.0 |
| M3A (SoM, GPT-4-Turbo) (Rawles et al., [2024](https://arxiv.org/html/2605.26546#bib.bib3)) | 25.4 |
| M3A (SoM, Gemini-2.5-Pro) (Rawles et al., [2024](https://arxiv.org/html/2605.26546#bib.bib3)) | 39.7 |
| GLM-4.1V-9B-Thinking (Hong et al., [2025](https://arxiv.org/html/2605.26546#bib.bib34)) | 41.7 |
| UI-TARS (UI-TARS-7B) (Wang et al., [2025](https://arxiv.org/html/2605.26546#bib.bib16)) | 33.0 |
| MobileExplorer | 50.9 |

Table 3. Success Rate (%) on AndroidWorld.

We evaluate MobileExplorer on the AndroidWorld benchmark under a fully on-device setting. As shown in Table [3](https://arxiv.org/html/2605.26546#S6.T3) and Fig. [8(a)](https://arxiv.org/html/2605.26546#S6.F8.sf1), MobileExplorer achieves a success rate of 50.86% (59/116 tasks), which is the best among all compared methods. Compared with the VLM-based baseline M3A, which achieves 46.55%, this corresponds to a 9.3% relative improvement. This improvement indicates that integrating latency-bounded online exploration into the reasoning process allows the agent to obtain more useful UI context and make better interaction decisions. As a result, MobileExplorer improves the ability of mobile GUI agents to complete long-horizon tasks under the on-device setting.

Meanwhile, MobileExplorer also improves the efficiency of task execution\. As illustrated in Fig\.[8\(b\)](https://arxiv.org/html/2605.26546#S6.F8.sf2), our method requires only 9\.24 interaction steps on average, compared with 10\.93 steps for M3A, corresponding to a 15\.5% reduction in interaction cost\. This reduction in trial\-and\-error operations further translates into lower end\-to\-end latency\. In particular, MobileExplorer completes tasks in 185\.82 s on average, reducing the overall latency by 15\.9% compared with M3A\. Importantly, this improvement is achieved without increasing per\-step reasoning overhead, since exploration is performed in parallel with model reasoning and utilizes otherwise idle inference time\.
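
The relative figures quoted in this subsection can be reproduced from the reported numbers with a few lines of arithmetic. In the sketch below, the inputs are the values stated in the text; the M3A end-to-end latency is not reported here, so `implied_m3a_latency_s` is a derived estimate under the assumption that the 15.9% reduction is measured relative to M3A's average latency.

```python
def relative_change_pct(new, old):
    """Relative change of `new` with respect to `old`, in percent."""
    return (new - old) / old * 100.0


success_rate = 59 / 116 * 100                        # 59 of 116 tasks solved
success_gain = relative_change_pct(50.86, 46.55)     # vs. M3A success rate
step_reduction = -relative_change_pct(9.24, 10.93)   # vs. M3A average steps
# Derived, not reported in the text: M3A latency implied by a 15.9% reduction.
implied_m3a_latency_s = 185.82 / (1 - 0.159)

print(f"{success_rate:.2f}% success, +{success_gain:.1f}% relative, "
      f"-{step_reduction:.1f}% steps, M3A latency ~{implied_m3a_latency_s:.0f} s")
# prints: 50.86% success, +9.3% relative, -15.5% steps, M3A latency ~221 s
```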

### 6\.3\.Real\-world Case Study

While AndroidWorld provides a controlled benchmark for evaluating mobile GUI agents, its tasks are relatively simplified and do not capture several important characteristics of real mobile environments. In practice, mobile applications contain significantly more complex UI structures, experience dynamic resource conditions, and frequently introduce interrupt-driven UI changes such as system dialogs or notifications. To evaluate how these factors affect agent behavior, we construct a set of real-world smartphone tasks using popular mobile applications and analyze them along three practical dimensions, as shown in Fig. [9](https://arxiv.org/html/2605.26546#S6.F9).

| Setting | Feature | Task |
| --- | --- | --- |
| Complicated UI Elements | 48 elements/page on average (vs. <20 in AndroidWorld) | Trip planning for a city in the Trip app. |
| Pop-up Interfering Elements | Interference types: alarm, message, call, app notifications | Input text in the Notes app. |
| Resource Dynamics | Background tasks: video and music playing | Modify system settings (Bluetooth, WiFi). |

Table 4. Task categories and features in the case study.

![Refer to caption](https://arxiv.org/html/2605.26546v1/Figures/case_study_setting.png)Figure 9. Examples of settings in the case study.

Complicated UI Elements. Many AndroidWorld tasks are executed in relatively simple applications with limited UI elements and shallow interface hierarchies. On average, AndroidWorld pages contain about 19.75 interactive elements, whereas real-world mobile applications often present much denser interfaces. For instance, typical screens in the Trip application contain around 48 interactive elements due to long result lists and recommendation panels. To evaluate the agent under such conditions, we design a trip planning task where the agent searches for attractions in the Trip application. We randomly generate 10 different city queries to create diverse UI states, requiring the agent to navigate through dense search results to locate the target item.

Interrupt On-pop Elements. Real mobile environments often introduce unexpected interruptions that temporarily modify the interface structure or occlude interactive elements, such as alarms, message notifications, or incoming calls. These interruptions are largely absent in AndroidWorld, where tasks run in stable environments. To emulate such scenarios, we design a stopwatch task in which different types of interruptions (alarm, message, call, and application notifications) are injected during execution, forcing the agent to correctly recover the UI state before continuing.

Resource Dynamics. Unlike controlled benchmark environments, real smartphones operate under dynamic system conditions where available resources fluctuate due to concurrent background activities, as shown in Fig. [10](https://arxiv.org/html/2605.26546#S6.F10). Background workloads such as video playback or music streaming may consume CPU and memory resources, affecting the latency of on-device perception and reasoning modules. Since AndroidWorld does not model such resource contention, we design a system settings task where the agent toggles Bluetooth or WiFi while background applications run concurrently, creating realistic resource dynamics during execution.

![Refer to caption](https://arxiv.org/html/2605.26546v1/Figures/resource_dynamic.png)Figure 10. Resource dynamics on mobile phone. $T_1$ to $T_3$: video playing; $T_2$ to $T_4$: VLM reasoning.

Results. Together, these tasks evaluate MobileExplorer in realistic mobile scenarios beyond the AndroidWorld benchmark. Each task is repeated three times and the averaged results are shown in Fig. [11](https://arxiv.org/html/2605.26546#S6.F11). MobileExplorer consistently reduces end-to-end latency while achieving comparable or higher success rates across all settings. The improvement is most significant for complicated UI environments, and the results also demonstrate robustness under interrupt-driven UI changes and dynamic resource conditions.

![Refer to caption](https://arxiv.org/html/2605.26546v1/Results/case_study_accuracy.png)\(a\)Success rate\.
![Refer to caption](https://arxiv.org/html/2605.26546v1/Results/case_study_latency.png)\(b\)End\-to\-End latency\.

Figure 11\.Performance in different complicated settings of the case study\.
### 6\.4\.Understanding MobileExplorer’s Performance

#### 6\.4\.1\.Ablation Study

Fig. [12(a)](https://arxiv.org/html/2605.26546#S6.F12.sf1) evaluates the contribution of key components in MobileExplorer. Replacing the task-relevance-driven element selection with random exploration reduces the success rate from 50.9% to 42.2%, as random exploration often probes task-irrelevant UI elements within the limited reasoning window. Removing the two-level rollback mechanism causes the largest degradation, reducing the success rate to 39.7% and increasing end-to-end latency because exploration may fail to return to the original UI state. Disabling exploration alignment lowers the success rate to 47.4%, since unfiltered exploration observations may introduce misleading information into the prompt. These results highlight the importance of task-aware exploration, reliable state recovery, and exploration–reasoning alignment.

#### 6\.4\.2\.Performance on different task categories

AndroidWorld organizes tasks into several categories including *complex UI understanding*, *search*, *information retrieval*, *data entry*, *data edit*, and *verification* (Rawles et al., [2024](https://arxiv.org/html/2605.26546#bib.bib3)). Fig. [12(b)](https://arxiv.org/html/2605.26546#S6.F12.sf2) shows that MobileExplorer consistently improves performance on visually intensive tasks such as *complex UI understanding*, *search*, and *information retrieval*, where identifying relevant UI elements among many candidates is critical. In contrast, for structured interaction tasks such as *data entry* and *data edit*, MobileExplorer performs slightly below M3A because these tasks rely more on deterministic action sequences than UI exploration. Overall, MobileExplorer is particularly effective for exploration-intensive tasks involving complex UI structures.

![Refer to caption](https://arxiv.org/html/2605.26546v1/Results/ablation_study.png)\(a\)Ablation study\.
![Refer to caption](https://arxiv.org/html/2605.26546v1/Results/robustness_task_types.png)\(b\)Task categories\.
![Refer to caption](https://arxiv.org/html/2605.26546v1/Results/rollback_rate.png)\(c\)Effectiveness of two\-level rollback strategy\.
![Refer to caption](https://arxiv.org/html/2605.26546v1/Results/hint_hit.png)\(d\)Effectiveness of exploration\-augmented reasoning\.

Figure 12\.Understanding MobileExplorer’s performance\.
#### 6\.4\.3\.Effectiveness of two\-level rollback strategy\.

We analyze rollback events on UI pages categorized as *Simple*, *Medium*, and *Complicated* based on the number of clickable elements. As shown in Fig. [12(c)](https://arxiv.org/html/2605.26546#S6.F12.sf3), Level-1 rollback handles most cases, while complicated pages show a higher proportion of Level-2 rollbacks, indicating that simple backtracking is more likely to fail in deeper or less reversible navigation paths. Despite this, the overall rollback success rate remains consistently high across all categories, demonstrating that Level-2 recovery effectively complements Level-1 rollback.

#### 6\.4\.4\.Effectiveness of exploration\-augmented reasoning\.

To evaluate whether exploration-derived hints contribute to task completion, we measure the hint-follow rate and relate it to task success. We aggregate step-level hint hits into a task-level metric and group tasks into three complexity levels (*Simple*, *Medium*, and *Complicated*). Fig. [12(d)](https://arxiv.org/html/2605.26546#S6.F12.sf4) shows that the hint-follow rate increases as UI complexity grows. Meanwhile, tasks with higher hint-follow rates consistently exhibit higher success rates across all complexity levels. This observation suggests that exploration-derived hints provide useful guidance, especially when the interface contains many candidate elements.

### 6\.5\.Micro Benchmark Performance

![Refer to caption](https://arxiv.org/html/2605.26546v1/Results/robustness_model_size.png)\(a\)Different sizes of models\.
![Refer to caption](https://arxiv.org/html/2605.26546v1/Results/robustness_resolution.png)\(b\)Different resolutions\.

Figure 13. Micro benchmark performance.

#### 6.5.1. Performance with different model sizes.

To further examine the robustness of our design across different model scales, we evaluate the agents using MAI-UI models of different sizes, including 2B and 8B. Fig. [13(a)](https://arxiv.org/html/2605.26546#S6.F13.sf1) reports the end-to-end latency and success rate under these settings. Across different model sizes, MobileExplorer consistently maintains comparable task success rates while reducing the average number of interaction steps, leading to lower end-to-end latency. In particular, our exploration-augmented design typically saves around one reasoning step per task, demonstrating that the benefit of online exploration generalizes across models with different capacities.

#### 6\.5\.2\.Performance under different resolutions of screenshots\.

We evaluate three input resolutions to examine their impact on task performance\. As shown in Fig\.[13\(b\)](https://arxiv.org/html/2605.26546#S6.F13.sf2), higher resolutions generally improve success rates for both methods since more visual details of UI elements are preserved\. MobileExplorer achieves its best performance at a moderate resolution, while M3A improves more gradually as resolution increases\. Across all settings, MobileExplorer consistently requires fewer interaction steps than M3A, suggesting that exploration\-based hints help guide the agent more efficiently\. Overall, a moderate resolution provides a good balance between visual fidelity and interaction efficiency\.

### 6\.6\.System Overhead

We evaluate the system overhead of MobileExplorer across different platforms, including a commercial smartphone, Jetson, and a laptop. Fig. [14(a)](https://arxiv.org/html/2605.26546#S6.F14.sf1) reports the end-to-end latency, peak memory usage, and peak power consumption. Across all devices, MobileExplorer consistently reduces end-to-end latency compared with the baseline while maintaining nearly identical memory and power usage. This indicates that the exploration controller introduces only minimal additional system overhead.

We further analyze the runtime overhead of individual modules on the smartphone. Fig. [15](https://arxiv.org/html/2605.26546#S6.F15) shows that the additional control logic introduces only tens of milliseconds of latency per step, which is negligible compared with multi-second VLM inference. Rollback verification incurs the largest latency due to screenshot-based state checking, while exploration and element selection remain lightweight. Memory and power usage also remain stable across modules.

![Refer to caption](https://arxiv.org/html/2605.26546v1/Results/robustness_platform.png) \(a\) Latency\.
![Refer to caption](https://arxiv.org/html/2605.26546v1/Results/device_peak.png) \(b\) Memory and power\.

Figure 14\. System overhead on different platforms\.

![Refer to caption](https://arxiv.org/html/2605.26546v1/Results/phone_latency_overhead.png) \(a\) Latency\.
![Refer to caption](https://arxiv.org/html/2605.26546v1/Results/phone_memory_overhead.png) \(b\) Memory\.
![Refer to caption](https://arxiv.org/html/2605.26546v1/Results/phone_power_overhead.png) \(c\) Power\.

Figure 15\. Overhead of different components\.

## 7\.Conclusion and Discussions

In this paper, we introduce MobileExplorer, an on\-device mobile GUI agent framework that improves task efficiency through task relevance\-driven online exploration\. Instead of leaving the system idle during expensive VLM reasoning, MobileExplorer utilizes the reasoning window to probe semantically relevant UI elements and collect additional interface context, while a two\-level rollback mechanism ensures that exploration does not disrupt the main execution trajectory\. Experimental results show that MobileExplorer maintains comparable task success while reducing the average number of interaction steps by approximately 23%, demonstrating that latency\-bounded exploration can effectively improve the efficiency of mobile GUI agents\. In future work, we will explore more adaptive exploration strategies that dynamically adjust exploration paths based on task demands and UI complexity, and investigate more complicated interplay between UI exploration and model reasoning to further improve the efficiency of on\-device GUI agent systems\.
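The control flow summarized above can be sketched as follows\. This is a toy illustration under stated assumptions, not the MobileExplorer implementation: `FakeUI`, `fake_vlm_inference`, `rollback`, `explore`, and `step` are hypothetical names, the UI is a simulated state machine rather than a real device, and a string comparison stands in for screenshot\-based state verification\.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_vlm_inference(seconds: float) -> str:
    """Stand-in for a multi-second VLM reasoning step (hypothetical)."""
    time.sleep(seconds)
    return "click:Settings"


class FakeUI:
    """Toy UI state machine; not a real device API."""

    def __init__(self) -> None:
        self.screen = "home"
        self.sticky = {"dialog"}  # screens a plain 'back' cannot leave

    def tap(self, element: str) -> None:
        self.screen = element

    def back(self) -> None:
        if self.screen not in self.sticky:
            self.screen = "home"

    def reset_to(self, screen: str) -> None:
        self.screen = screen


def rollback(ui: FakeUI, initial: str) -> str:
    """Two-level rollback: fast backtracking first, robust restore if it fails."""
    ui.back()                    # level 1: fast but naive
    if ui.screen != initial:     # verification (stands in for a screenshot check)
        ui.reset_to(initial)     # level 2: robust restore
        return "level2"
    return "level1"


def explore(ui: FakeUI, candidates: list, deadline: float) -> list:
    """Probe candidates until the deadline, restoring the initial screen each time."""
    initial = ui.screen
    traces = []
    for element in candidates:
        if time.monotonic() >= deadline:
            break                # latency bound: stop before inference finishes
        ui.tap(element)
        observed = ui.screen
        traces.append((element, observed, rollback(ui, initial)))
    return traces


def step(ui: FakeUI, candidates: list,
         inference_seconds: float = 0.5, budget_fraction: float = 0.5):
    """Run inference in a worker thread while exploring on the calling thread."""
    deadline = time.monotonic() + inference_seconds * budget_fraction
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(fake_vlm_inference, inference_seconds)
        traces = explore(ui, candidates, deadline)
        action = future.result()
    # Summarize traces into a hint string for the *next* reasoning step.
    hints = "; ".join(f"{e} -> {s}" for e, s, _ in traces)
    return action, traces, hints


if __name__ == "__main__":
    ui = FakeUI()
    action, traces, hints = step(ui, ["menu", "dialog"])
    print(action, traces, hints, ui.screen)
```

In this sketch the exploration budget is a fixed fraction of the expected inference time, which is one simple way to keep exploration inside the reasoning window; the second candidate deliberately defeats the naive back action so that the fallback path is exercised\.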

## References

- J\. Achiam, S\. Adler, S\. Agarwal, L\. Ahmad, I\. Akkaya, F\. L\. Aleman, D\. Almeida, J\. Altenschmidt, S\. Altman, S\. Anadkat,et al\.\(2023\)Gpt\-4 technical report\.arXiv preprint arXiv:2303\.08774\.Cited by:[§1](https://arxiv.org/html/2605.26546#S1.p1.1)\.
- S\. Bai, K\. Chen, X\. Liu, J\. Wang, W\. Ge, S\. Song, K\. Dang, P\. Wang, S\. Wang, J\. Tang,et al\.\(2025\)Qwen2\.5\-vl technical report\.arXiv preprint arXiv:2502\.13923\.Cited by:[§1](https://arxiv.org/html/2605.26546#S1.p1.1)\.
- G\. Dai, S\. Jiang, T\. Cao, Y\. Li, Y\. Yang, R\. Tan, M\. Li, and L\. Qiu \(2025\)Advancing mobile gui agents: a verifier\-driven approach to practical deployment\.arXiv preprint arXiv:2503\.15937\.Cited by:[§1](https://arxiv.org/html/2605.26546#S1.p1.1),[§1](https://arxiv.org/html/2605.26546#S1.p3.1),[§2](https://arxiv.org/html/2605.26546#S2.p1.1),[2\(a\)](https://arxiv.org/html/2605.26546#S3.F2.sf1),[2\(a\)](https://arxiv.org/html/2605.26546#S3.F2.sf1.3.2)\.
- T\. Ding \(2024\)Mobileagent: enhancing mobile control via human\-machine interaction and sop integration\.arXiv preprint arXiv:2401\.04124\.Cited by:[§1](https://arxiv.org/html/2605.26546#S1.p1.1)\.
- A\. Dubey, A\. Jauhri, A\. Pandey, A\. Kadian, A\. Al\-Dahle, A\. Letman, A\. Mathur, A\. Schelten, A\. Yang, A\. Fan,et al\.\(2024\)The llama 3 herd of models\.arXiv e\-prints,pp\. arXiv–2407\.Cited by:[§1](https://arxiv.org/html/2605.26546#S1.p1.1)\.
- Google Android Developers\. Android debug bridge \(adb\)\. [https://developer\.android\.com/tools/adb](https://developer.android.com/tools/adb)\.Cited by:[§2](https://arxiv.org/html/2605.26546#S2.p1.1)\.
- W\. Hong, W\. Wang, Q\. Lv, J\. Xu, W\. Yu, J\. Ji, Y\. Wang, Z\. Wang, Y\. Dong, M\. Ding,et al\.\(2024\)Cogagent: a visual language model for gui agents\.InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,pp\. 14281–14290\.Cited by:[§1](https://arxiv.org/html/2605.26546#S1.p1.1),[§2](https://arxiv.org/html/2605.26546#S2.p1.1)\.
- W\. Hong, W\. Yu, X\. Gu, G\. Wang, G\. Gan, H\. Tang, J\. Cheng, J\. Qi, J\. Ji, L\. Pan,et al\.\(2025\)Glm\-4\.5v and glm\-4\.1v\-thinking: towards versatile multimodal reasoning with scalable reinforcement learning\.arXiv preprint arXiv:2507\.01006\.Cited by:[Table 3](https://arxiv.org/html/2605.26546#S6.T3.2.8.1)\.
- S\. Lee, J\. Choi, J\. Lee, M\. H\. Wasi, H\. Choi, S\. Ko, S\. Oh, and I\. Shin \(2024\)Mobilegpt: augmenting llm with human\-like app memory for mobile task automation\.InProceedings of the 30th Annual International Conference on Mobile Computing and Networking,pp\. 1119–1133\.Cited by:[§1](https://arxiv.org/html/2605.26546#S1.p1.1),[Table 3](https://arxiv.org/html/2605.26546#S6.T3.2.2.1)\.
- G\. Li and Y\. Li \(2022\)Spotlight: mobile ui understanding using vision\-language models with a focus\.arXiv preprint arXiv:2209\.14927\.Cited by:[§1](https://arxiv.org/html/2605.26546#S1.p1.1),[§3\.1\.2](https://arxiv.org/html/2605.26546#S3.SS1.SSS2.p1.1)\.
- W\. Li, W\. E\. Bishop, A\. Li, C\. Rawles, F\. Campbell\-Ajala, D\. Tyamagundlu, and O\. Riva \(2024\)On the effects of data scale on ui control agents\.Advances in Neural Information Processing Systems37,pp\. 92130–92154\.Cited by:[§6\.1](https://arxiv.org/html/2605.26546#S6.SS1.p1.1)\.
- K\. Q\. Lin, L\. Li, D\. Gao, Z\. Yang, S\. Wu, Z\. Bai, S\. W\. Lei, L\. Wang, and M\. Z\. Shou \(2025\)Showui: one vision\-language\-action model for gui visual agent\.InProceedings of the Computer Vision and Pattern Recognition Conference,pp\. 19498–19508\.Cited by:[§1](https://arxiv.org/html/2605.26546#S1.p3.1),[3rd item](https://arxiv.org/html/2605.26546#S6.I1.i3.p1.1)\.
- C\. Rawles, S\. Clinckemaillie, Y\. Chang, J\. Waltz, G\. Lau, M\. Fair, A\. Li, W\. Bishop, W\. Li, F\. Campbell\-Ajala,et al\.\(2024\)Androidworld: a dynamic benchmarking environment for autonomous agents\.arXiv preprint arXiv:2405\.14573\.Cited by:[§1](https://arxiv.org/html/2605.26546#S1.p5.1),[3\(a\)](https://arxiv.org/html/2605.26546#S3.F3.sf1),[3\(a\)](https://arxiv.org/html/2605.26546#S3.F3.sf1.3.2),[1st item](https://arxiv.org/html/2605.26546#S6.I1.i1.p1.1.1),[2nd item](https://arxiv.org/html/2605.26546#S6.I1.i2.p1.1.1),[§6\.1](https://arxiv.org/html/2605.26546#S6.SS1.p1.1),[§6\.4\.2](https://arxiv.org/html/2605.26546#S6.SS4.SSS2.p1.1),[Table 1](https://arxiv.org/html/2605.26546#S6.T1),[Table 1](https://arxiv.org/html/2605.26546#S6.T1.4.2),[Table 3](https://arxiv.org/html/2605.26546#S6.T3.2.4.1),[Table 3](https://arxiv.org/html/2605.26546#S6.T3.2.5.1),[Table 3](https://arxiv.org/html/2605.26546#S6.T3.2.6.1),[Table 3](https://arxiv.org/html/2605.26546#S6.T3.2.7.1)\.
- C\. Rawles, A\. Li, D\. Rodriguez, O\. Riva, and T\. Lillicrap \(2023\)Android in the wild: a large\-scale dataset for android device control\.arXiv preprint arXiv:2307\.10088\.Cited by:[§6\.1](https://arxiv.org/html/2605.26546#S6.SS1.p1.1),[§6\.1](https://arxiv.org/html/2605.26546#S6.SS1.p5.1)\.
- N\. Reimers and I\. Gurevych \(2019\)Sentence\-bert: sentence embeddings using siamese bert\-networks\.arXiv preprint arXiv:1908\.10084\.Cited by:[§1](https://arxiv.org/html/2605.26546#S1.p4.1)\.
- G\. Team, R\. Anil, S\. Borgeaud, J\. Alayrac, J\. Yu, R\. Soricut, J\. Schalkwyk, A\. M\. Dai, A\. Hauth, K\. Millican,et al\.\(2023\)Gemini: a family of highly capable multimodal models\.arXiv preprint arXiv:2312\.11805\.Cited by:[§1](https://arxiv.org/html/2605.26546#S1.p1.1)\.
- H\. Wang, H\. Zou, H\. Song, J\. Feng, J\. Fang, J\. Lu, L\. Liu, Q\. Luo, S\. Liang, S\. Huang,et al\.\(2025\)Ui\-tars\-2 technical report: advancing gui agent with multi\-turn reinforcement learning\.arXiv preprint arXiv:2509\.02544\.Cited by:[§1](https://arxiv.org/html/2605.26546#S1.p1.1),[2\(a\)](https://arxiv.org/html/2605.26546#S3.F2.sf1),[2\(a\)](https://arxiv.org/html/2605.26546#S3.F2.sf1.3.2),[Table 3](https://arxiv.org/html/2605.26546#S6.T3.2.9.1)\.
- J\. Wang, H\. Xu, H\. Jia, X\. Zhang, M\. Yan, W\. Shen, J\. Zhang, F\. Huang, and J\. Sang \(2024\)Mobile\-agent\-v2: mobile device operation assistant with effective navigation via multi\-agent collaboration\.Advances in Neural Information Processing Systems37,pp\. 2686–2710\.Cited by:[§1](https://arxiv.org/html/2605.26546#S1.p1.1)\.
- H\. Wen, Y\. Li, G\. Liu, S\. Zhao, T\. Yu, T\. J\. Li, S\. Jiang, Y\. Liu, Y\. Zhang, and Y\. Liu \(2024\)Autodroid: llm\-powered task automation in android\.InProceedings of the 30th Annual International Conference on Mobile Computing and Networking,pp\. 543–557\.Cited by:[§1](https://arxiv.org/html/2605.26546#S1.p1.1),[§1](https://arxiv.org/html/2605.26546#S1.p2.1),[§1](https://arxiv.org/html/2605.26546#S1.p3.1),[§2](https://arxiv.org/html/2605.26546#S2.p1.1),[§2](https://arxiv.org/html/2605.26546#S2.p2.1),[§2](https://arxiv.org/html/2605.26546#S2.p3.1)\.
- H\. Wen, S\. Tian, B\. Pavlov, W\. Du, Y\. Li, G\. Chang, S\. Zhao, J\. Liu, Y\. Liu, Y\. Zhang,et al\.\(2025\)Autodroid\-v2: boosting slm\-based gui agents via code generation\.InProceedings of the 23rd Annual International Conference on Mobile Systems, Applications and Services,pp\. 223–235\.Cited by:[§1](https://arxiv.org/html/2605.26546#S1.p1.1),[§1](https://arxiv.org/html/2605.26546#S1.p3.1),[§2](https://arxiv.org/html/2605.26546#S2.p1.1),[§2](https://arxiv.org/html/2605.26546#S2.p2.1),[§2](https://arxiv.org/html/2605.26546#S2.p3.1),[Table 3](https://arxiv.org/html/2605.26546#S6.T3.2.3.1)\.
- B\. Xie, R\. Shao, G\. Chen, K\. Zhou, Y\. Li, J\. Liu, M\. Zhang, and L\. Nie \(2025\)Gui\-explorer: autonomous exploration and mining of transition\-aware knowledge for gui agent\.arXiv preprint arXiv:2505\.16827\.Cited by:[§2](https://arxiv.org/html/2605.26546#S2.p3.1),[4th item](https://arxiv.org/html/2605.26546#S6.I1.i4.p1.1)\.
- H\. Yan, J\. Wang, X\. Huang, Y\. Shen, Z\. Meng, Z\. Fan, K\. Tan, J\. Gao, L\. Shi, M\. Yang,et al\.\(2025\)Step\-gui technical report\.arXiv preprint arXiv:2512\.15431\.Cited by:[§1](https://arxiv.org/html/2605.26546#S1.p1.1),[§2](https://arxiv.org/html/2605.26546#S2.p1.1),[§6\.1](https://arxiv.org/html/2605.26546#S6.SS1.p3.1)\.
- J\. Ye, X\. Zhang, H\. Xu, H\. Liu, J\. Wang, Z\. Zhu, Z\. Zheng, F\. Gao, J\. Cao, Z\. Lu,et al\.\(2025\)Mobile\-agent\-v3: fundamental agents for gui automation\.arXiv preprint arXiv:2508\.15144\.Cited by:[§1](https://arxiv.org/html/2605.26546#S1.p1.1),[§2](https://arxiv.org/html/2605.26546#S2.p1.1),[2\(a\)](https://arxiv.org/html/2605.26546#S3.F2.sf1),[2\(a\)](https://arxiv.org/html/2605.26546#S3.F2.sf1.3.2),[§3\.1\.2](https://arxiv.org/html/2605.26546#S3.SS1.SSS2.p1.1)\.
- K\. You, H\. Zhang, E\. Schoop, F\. Weers, A\. Swearngin, J\. Nichols, Y\. Yang, and Z\. Gan \(2024\)Ferret\-ui: grounded mobile ui understanding with multimodal llms\.InEuropean Conference on Computer Vision,pp\. 240–255\.Cited by:[§1](https://arxiv.org/html/2605.26546#S1.p1.1),[§3\.1\.2](https://arxiv.org/html/2605.26546#S3.SS1.SSS2.p1.1)\.
- Y\. Zhang, Z\. Ma, Y\. Ma, Z\. Han, Y\. Wu, and V\. Tresp \(2025\)Webpilot: a versatile and autonomous multi\-agent system for web task execution with strategic exploration\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.39,pp\. 23378–23386\.Cited by:[4th item](https://arxiv.org/html/2605.26546#S6.I1.i4.p1.1)\.
- S\. Zhao, H\. Wen, W\. Du, C\. Liang, Y\. Liu, X\. Ye, Y\. Ouyang, and Y\. Li \(2025\)LLM\-explorer: towards efficient and affordable llm\-based exploration for mobile apps\.InProceedings of the 31st Annual International Conference on Mobile Computing and Networking,pp\. 589–603\.Cited by:[§2](https://arxiv.org/html/2605.26546#S2.p3.1),[4th item](https://arxiv.org/html/2605.26546#S6.I1.i4.p1.1)\.
- H\. Zhou, X\. Zhang, P\. Tong, J\. Zhang, L\. Chen, Q\. Kong, C\. Cai, C\. Liu, Y\. Wang, J\. Zhou,et al\.\(2025\)MAI\-ui technical report: real\-world centric foundation gui agents\.arXiv preprint arXiv:2512\.22047\.Cited by:[§1](https://arxiv.org/html/2605.26546#S1.p1.1),[§1](https://arxiv.org/html/2605.26546#S1.p2.1),[§2](https://arxiv.org/html/2605.26546#S2.p1.1),[2\(a\)](https://arxiv.org/html/2605.26546#S3.F2.sf1),[2\(a\)](https://arxiv.org/html/2605.26546#S3.F2.sf1.3.2),[§3\.1\.2](https://arxiv.org/html/2605.26546#S3.SS1.SSS2.p1.1),[§3\.2](https://arxiv.org/html/2605.26546#S3.SS2.p1.1)\.
