One Forward Beats Two: InnerZoom for Accurate and Efficient GUI Grounding

Hugging Face Daily Papers

Summary

InnerZoom proposes a single-forward framework for cross-layer evidence bridging in GUI grounding, achieving state-of-the-art performance on multiple benchmarks while reducing latency by up to 31.8%.

MLLM-based GUI grounding methods commonly formulate target localization as autoregressive coordinate generation, enabling models to leverage the strong instruction-following and semantic understanding capabilities of MLLMs. However, this formulation requires the model to retain region-level target evidence while decoding coordinate tokens with the spatial precision demanded by GUI clicking. Our diagnostic analysis reveals that target-region awareness emerges in intermediate decoder layers but is neither retained nor translated into the final coordinate prediction. Existing ZoomIn-style methods address this issue through an external crop-and-rerun pass, which improves localization but increases end-to-end latency and computational cost. To retain the accuracy benefits of two-pass zooming without this extra cost, we propose InnerZoom, a single-forward framework for cross-layer evidence bridging. InnerZoom transforms target-related cues from the original forward pass into a compact cross-layer evidence state, then preserves, refines, and reinjects this state throughout later decoding layers to guide coordinate prediction. Extensive experimental results suggest that InnerZoom-4B achieves state-of-the-art performance on all six GUI grounding benchmarks, obtaining 64.7 on OSWorld-G, 40.2 on UI-Vision, 73.1 on OSWorld-GR, and 87.6 on MMBench-GUI, surpassing the previous best results by 4.1, 3.2, 2.9, and 2.3 points, respectively. Under a controlled 4B setting, InnerZoom improves the same SFT+RL baseline by 5.3 points on average and outperforms two-pass ZoomIn by 1.3 points on average, while reducing end-to-end latency by up to 31.8% and TFLOPs by about 29%. Code and models will be publicly available.
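The headline numbers above can be unpacked with simple arithmetic. The following is a minimal sketch, not taken from the paper: it derives the previous-best scores implied by the reported margins and the fraction of latency and compute that remains after the stated reductions. The helper names `implied_previous_best` and `remaining_percent` are hypothetical, and the abstract does not say in this excerpt which baseline the latency and TFLOPs reductions are measured against (the surrounding sentence suggests two-pass ZoomIn).

```python
# Scores and margins quoted in the abstract, in points: (InnerZoom-4B score, margin).
reported = {
    "OSWorld-G": (64.7, 4.1),
    "UI-Vision": (40.2, 3.2),
    "OSWorld-GR": (73.1, 2.9),
    "MMBench-GUI": (87.6, 2.3),
}


def implied_previous_best(score, margin):
    """Previous best implied by a reported score and its margin (hypothetical helper)."""
    return round(score - margin, 1)


def remaining_percent(reduction_percent):
    """Percentage of the baseline cost left after a stated reduction (hypothetical helper)."""
    return round(100.0 - reduction_percent, 1)


previous_best = {
    name: implied_previous_best(score, margin)
    for name, (score, margin) in reported.items()
}

for name, value in previous_best.items():
    print(f"{name}: implied previous best {value}")

# "Up to 31.8%" is a best case, so this is the lowest remaining latency share.
print("latency remaining (%):", remaining_percent(31.8))
# "About 29%" is approximate in the abstract, so this figure is approximate too.
print("TFLOPs remaining (%):", remaining_percent(29.0))
```

Only four of the six benchmarks are named in the abstract, so the sketch covers those four and makes no claim about the other two.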

Cached at: 06/30/26, 07:35 AM


Source: https://huggingface.co/papers/2606.30084

Abstract

InnerZoom addresses GUI grounding challenges by preserving target-region awareness across decoder layers through a single-forward pass that bridges cross-layer evidence, achieving state-of-the-art performance with reduced computational cost.

MLLM-based GUI grounding methods commonly formulate target localization as autoregressive coordinate generation, enabling models to leverage the strong instruction-following and semantic understanding capabilities of MLLMs. However, this formulation requires the model to retain region-level target evidence while decoding coordinate tokens with the spatial precision demanded by GUI clicking. Our diagnostic analysis reveals that target-region awareness emerges in intermediate decoder layers but is neither retained nor translated into the final coordinate prediction. Existing ZoomIn-style methods address this issue through an external crop-and-rerun pass, which improves localization but increases end-to-end latency and computational cost. To retain the accuracy benefits of two-pass zooming without this extra cost, we propose InnerZoom, a single-forward framework for cross-layer evidence bridging. InnerZoom transforms target-related cues from the original forward pass into a compact cross-layer evidence state, then preserves, refines, and reinjects this state throughout later decoding layers to guide coordinate prediction. Extensive experimental results suggest that InnerZoom-4B achieves state-of-the-art performance on all six GUI grounding benchmarks, obtaining 64.7 on OSWorld-G, 40.2 on UI-Vision, 73.1 on OSWorld-GR, and 87.6 on MMBench-GUI, surpassing the previous best results by 4.1, 3.2, 2.9, and 2.3 points, respectively. Under a controlled 4B setting, InnerZoom improves the same SFT+RL baseline by 5.3 points on average and outperforms two-pass ZoomIn by 1.3 points on average, while reducing end-to-end latency by up to 31.8% and TFLOPs by about 29%. Code and models will be publicly available.
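The abstract describes the mechanism only at a high level: collect target-related cues, compress them into a compact evidence state, then preserve, refine, and reinject that state in later layers. The toy sketch below illustrates that general shape using two generic techniques, attention pooling and gated residual addition. It is not the authors' implementation: every function name, the blending rule in `refine`, the `gate` and `keep` values, and the toy vectors are assumptions made for illustration, and the paper's actual modules are not specified in this excerpt.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def pool_evidence(query, patches):
    """Attention-pool patch features into one compact evidence vector."""
    weights = softmax([dot(query, p) for p in patches])
    dim = len(patches[0])
    evidence = [sum(w * p[i] for w, p in zip(weights, patches)) for i in range(dim)]
    return evidence, weights


def refine(evidence, fresh, keep=0.5):
    """Blend carried evidence with freshly pooled evidence (assumed rule)."""
    return [keep * e + (1.0 - keep) * f for e, f in zip(evidence, fresh)]


def reinject(hidden, evidence, gate=0.5):
    """Add the gated evidence back into a later layer's hidden state."""
    return [h + gate * e for h, e in zip(hidden, evidence)]


# Toy data: three 2-D patch features; the query aligns with patch index 1.
patches = [[1.0, 0.0], [0.0, 4.0], [-1.0, 0.0]]
query = [0.0, 1.0]

# "Intermediate layer": form the compact evidence state once.
evidence, weights = pool_evidence(query, patches)

# Stand-in for later decoder layers: refine the state, then reinject it.
hidden = [0.2, 0.1]
for _layer in range(2):
    fresh, _fresh_weights = pool_evidence(hidden, patches)
    evidence = refine(evidence, fresh)
    hidden = reinject(hidden, evidence)

print("attention weights:", weights)
print("final hidden state:", hidden)
```

The point of the sketch is the data flow: the evidence vector is computed inside the same forward pass and carried forward, so no second crop-and-rerun pass over the image is needed.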




Similar Articles

DRS-GUI: Dynamic Region Search for Training-Free GUI Grounding

arXiv cs.AI

DRS-GUI proposes a training-free dynamic region search framework for GUI grounding, using a lightweight UI Perceptor with human-like perceptual actions and Monte Carlo Tree Search to progressively locate instruction-relevant elements. Experiments show a 14% improvement on ScreenSpot-Pro for both general and GUI-specific MLLMs.

VISTA: View-Consistent Self-Verified Training for GUI Grounding

Hugging Face Daily Papers

VISTA introduces a view-consistent self-verified training method for GUI grounding that improves GRPO-based coordinate prediction by using multiple target-preserving views, achieving consistent accuracy gains across benchmarks.