One Forward Beats Two: InnerZoom for Accurate and Efficient GUI Grounding
Summary
InnerZoom proposes a single-forward framework for cross-layer evidence bridging in GUI grounding, achieving state-of-the-art performance on multiple benchmarks while reducing latency by up to 31.8%.
Cached at: 06/30/26, 07:35 AM
Source: https://huggingface.co/papers/2606.30084
Abstract
InnerZoom addresses GUI grounding challenges by preserving target-region awareness across decoder layers through a single-forward pass that bridges cross-layer evidence, achieving state-of-the-art performance with reduced computational cost.
MLLM-based GUI grounding methods commonly formulate target localization as autoregressive coordinate generation, enabling models to leverage the strong instruction-following and semantic understanding capabilities of MLLMs. However, this formulation requires the model to retain region-level target evidence while decoding coordinate tokens with the spatial precision demanded by GUI clicking. Our diagnostic analysis reveals that target-region awareness emerges in intermediate decoder layers but is neither retained nor translated into the final coordinate prediction. Existing ZoomIn-style methods address this issue through an external crop-and-rerun pass, which improves localization but increases end-to-end latency and computational cost. To retain the accuracy benefits of two-pass zooming without this extra cost, we propose InnerZoom, a single-forward framework for cross-layer evidence bridging. InnerZoom transforms target-related cues from the original forward pass into a compact cross-layer evidence state, then preserves, refines, and reinjects this state throughout later decoding layers to guide coordinate prediction. Extensive experimental results suggest that InnerZoom-4B achieves state-of-the-art performance on all six GUI grounding benchmarks, obtaining 64.7 on OSWorld-G, 40.2 on UI-Vision, 73.1 on OSWorld-GR, and 87.6 on MMBench-GUI, surpassing the previous best results by 4.1, 3.2, 2.9, and 2.3 points, respectively. Under a controlled 4B setting, InnerZoom improves the same SFT+RL baseline by 5.3 points on average and outperforms two-pass ZoomIn by 1.3 points on average, while reducing end-to-end latency by up to 31.8% and TFLOPs by about 29%. Code and models will be publicly available.
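The abstract describes the mechanism only at a high level, so the following is a toy sketch of what "extract, preserve, refine, and reinject" a compact evidence state across layers could look like. It is not the authors' architecture. Every name here (`pool_evidence`, `refine`, `reinject`, `run_toy_decoder`), the attention pooling, the blending rule, the gate value, and the layer index where bridging starts are assumptions made for illustration, and random vectors stand in for real decoder activations.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def pool_evidence(visual_tokens, query):
    """Attention-pool visual token states into one compact evidence vector (assumed form)."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(tok, query) / scale for tok in visual_tokens])
    dim = len(query)
    return [sum(w * tok[i] for w, tok in zip(weights, visual_tokens)) for i in range(dim)]


def refine(evidence, fresh, keep=0.8):
    """Preserve most of the carried state and blend in evidence from the current layer."""
    return [keep * e + (1.0 - keep) * f for e, f in zip(evidence, fresh)]


def reinject(query, evidence, gate=0.5):
    """Add the gated evidence state back into the decoding token's hidden state."""
    return [q + gate * e for q, e in zip(query, evidence)]


def run_toy_decoder(num_layers=6, bridge_from=2, num_tokens=8, dim=4, seed=0):
    """Single pass over toy layers; the evidence state is carried from `bridge_from` onward."""
    rng = random.Random(seed)
    evidence = None
    query = [rng.gauss(0, 1) for _ in range(dim)]
    for layer in range(num_layers):
        # Stand-in for a real decoder layer: random visual token states.
        visual_tokens = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_tokens)]
        if layer < bridge_from:
            continue  # early layers run without any bridging
        fresh = pool_evidence(visual_tokens, query)
        evidence = fresh if evidence is None else refine(evidence, fresh)
        query = reinject(query, evidence)
    return query, evidence
```

The point of the sketch is the control flow: the evidence state is built and reused inside one pass over the layers, in contrast to a ZoomIn-style pipeline that crops the screenshot and runs the model a second time.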
Similar Articles
DRS-GUI: Dynamic Region Search for Training-Free GUI Grounding
DRS-GUI proposes a training-free dynamic region search framework for GUI grounding, using a lightweight UI Perceptor with human-like perceptual actions and Monte Carlo Tree Search to progressively locate instruction-relevant elements. Experiments show a 14% improvement on ScreenSpot-Pro for both general and GUI-specific MLLMs.
Trust the Right Teacher: Quality-Aware Self-Distillation for GUI Grounding
Proposes quality-aware self-distillation for GUI grounding, improving coordinate-token teacher signals via correctness-aware gating and probability scaling to enhance vision-language model performance.
@HuggingPapers: Microsoft just released Phi-Ground-Any on Hugging Face A 4B parameter vision model for GUI grounding that achieves SOTA…
Microsoft has released Phi-Ground-Any, a 4B parameter vision model for GUI grounding on Hugging Face that achieves state-of-the-art results, enabling AI agents to precisely interact with screen elements.
VISTA: View-Consistent Self-Verified Training for GUI Grounding
VISTA introduces a view-consistent self-verified training method for GUI grounding that improves GRPO-based coordinate prediction by using multiple target-preserving views, achieving consistent accuracy gains across benchmarks.
PAGER: Bridging the Semantic-Execution Gap in Point-Precise Geometric GUI Control
This paper introduces PAGER, a topology-aware agent that bridges the semantic-execution gap in point-precise GUI control, achieving 4.1x higher task success than baselines on the new PAGE Bench.