One Forward Beats Two: InnerZoom for Accurate and Efficient GUI Grounding

Hugging Face Daily Papers 06/29/26, 12:00 AM Papers

gui-grounding cross-layer single-forward mllm efficient localization zoom-in

Summary

InnerZoom proposes a single-forward framework for cross-layer evidence bridging in GUI grounding, achieving state-of-the-art performance on multiple benchmarks while reducing latency by up to 31.8%.

MLLM-based GUI grounding methods commonly formulate target localization as autoregressive coordinate generation, enabling models to leverage the strong instruction-following and semantic understanding capabilities of MLLMs. However, this formulation requires the model to retain region-level target evidence while decoding coordinate tokens with the spatial precision demanded by GUI clicking. Our diagnostic analysis reveals that target-region awareness emerges in intermediate decoder layers but is neither retained nor translated into the final coordinate prediction. Existing ZoomIn-style methods address this issue through an external crop-and-rerun pass, which improves localization but increases end-to-end latency and computational cost. To retain the accuracy benefits of two-pass zooming without this extra cost, we propose InnerZoom, a single-forward framework for cross-layer evidence bridging. InnerZoom transforms target-related cues from the original forward pass into a compact cross-layer evidence state, then preserves, refines, and reinjects this state throughout later decoding layers to guide coordinate prediction. Extensive experimental results suggest that InnerZoom-4B achieves state-of-the-art performance on all six GUI grounding benchmarks, obtaining 64.7 on OSWorld-G, 40.2 on UI-Vision, 73.1 on OSWorld-GR, and 87.6 on MMBench-GUI, surpassing the previous best results by 4.1, 3.2, 2.9, and 2.3 points, respectively. Under a controlled 4B setting, InnerZoom improves the same SFT+RL baseline by 5.3 points on average and outperforms two-pass ZoomIn by 1.3 points on average, while reducing end-to-end latency by up to 31.8% and TFLOPs by about 29%. Code and models will be publicly available.
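For readers unfamiliar with the two-pass baseline, the crop-and-rerun pattern mentioned above can be sketched roughly as follows. This is an illustrative reading of the idea, not the implementation of any ZoomIn-style method or of InnerZoom: the `predict` callable is a hypothetical stand-in for a grounding model, and the crop size, clamping rule, and function names are assumptions. The sketch shows where the second model call comes from and how a point predicted inside the crop maps back to full-screenshot coordinates.

```python
from typing import Callable, Tuple

Point = Tuple[float, float]
Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def crop_box_around(point: Point, image_size: Tuple[int, int],
                    crop_size: Tuple[int, int]) -> Box:
    """Return a crop window centred on `point`, shifted to stay inside the image."""
    x, y = point
    w, h = image_size
    cw, ch = min(crop_size[0], w), min(crop_size[1], h)
    left = int(min(max(x - cw / 2, 0), w - cw))
    top = int(min(max(y - ch / 2, 0), h - ch))
    return (left, top, left + cw, top + ch)


def two_pass_zoom(predict: Callable[[Box], Point],
                  image_size: Tuple[int, int],
                  crop_size: Tuple[int, int]) -> Point:
    """Hypothetical two-pass grounding: coarse guess, then a rerun on a crop.

    `predict(box)` is assumed to run the model on the given region and
    return a click point in that region's local pixel coordinates.
    """
    # Pass 1: coarse prediction over the full screenshot.
    full: Box = (0, 0, image_size[0], image_size[1])
    coarse = predict(full)

    # Pass 2: rerun the model on a crop centred on the coarse guess.
    box = crop_box_around(coarse, image_size, crop_size)
    local = predict(box)

    # Map the crop-local point back to full-screenshot coordinates.
    return (box[0] + local[0], box[1] + local[1])


if __name__ == "__main__":
    # Stub model for illustration only: slightly off on the full image,
    # exact on the crop. The target location is made up.
    target = (1500.0, 900.0)

    def stub_predict(box: Box) -> Point:
        left, top, right, bottom = box
        if (right - left, bottom - top) == (1920, 1080):
            return (target[0] + 40.0, target[1] - 30.0)
        return (target[0] - left, target[1] - top)

    print(two_pass_zoom(stub_predict, (1920, 1080), (640, 480)))
```

The two calls to `predict` are the extra cost the abstract refers to; a single-forward approach aims to get a comparable refinement without the second call.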

Cached at: 06/30/26, 07:35 AM

Source: https://huggingface.co/papers/2606.30084

Abstract

InnerZoom addresses GUI grounding challenges by preserving target-region awareness across decoder layers through a single-forward pass that bridges cross-layer evidence, achieving state-of-the-art performance with reduced computational cost.

MLLM-based GUI grounding methods commonly formulate target localization as autoregressive coordinate generation, enabling models to leverage the strong instruction-following and semantic understanding capabilities of MLLMs. However, this formulation requires the model to retain region-level target evidence while decoding coordinate tokens with the spatial precision demanded by GUI clicking. Our diagnostic analysis reveals that target-region awareness emerges in intermediate decoder layers but is neither retained nor translated into the final coordinate prediction. Existing ZoomIn-style methods address this issue through an external crop-and-rerun pass, which improves localization but increases end-to-end latency and computational cost. To retain the accuracy benefits of two-pass zooming without this extra cost, we propose InnerZoom, a single-forward framework for cross-layer evidence bridging. InnerZoom transforms target-related cues from the original forward pass into a compact cross-layer evidence state, then preserves, refines, and reinjects this state throughout later decoding layers to guide coordinate prediction. Extensive experimental results suggest that InnerZoom-4B achieves state-of-the-art performance on all six GUI grounding benchmarks, obtaining 64.7 on OSWorld-G, 40.2 on UI-Vision, 73.1 on OSWorld-GR, and 87.6 on MMBench-GUI, surpassing the previous best results by 4.1, 3.2, 2.9, and 2.3 points, respectively. Under a controlled 4B setting, InnerZoom improves the same SFT+RL baseline by 5.3 points on average and outperforms two-pass ZoomIn by 1.3 points on average, while reducing end-to-end latency by up to 31.8% and TFLOPs by about 29%. Code and models will be publicly available.
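The reported margins imply the prior best scores by simple subtraction, and the efficiency figures translate into cost ratios. A minimal sketch of that arithmetic is below; the variable names are illustrative, and the numbers are only those quoted in the abstract. Because the abstract says "up to 31.8%" and "about 29%", the ratios are a best case and an approximation respectively, and the abstract does not spell out the reference system for them (the phrasing suggests two-pass ZoomIn, which is our reading).

```python
# (InnerZoom-4B score, margin over previous best), as quoted in the abstract.
reported = {
    "OSWorld-G": (64.7, 4.1),
    "UI-Vision": (40.2, 3.2),
    "OSWorld-GR": (73.1, 2.9),
    "MMBench-GUI": (87.6, 2.3),
}

# Previous best = reported score minus the stated margin.
implied_previous_best = {
    name: round(score - margin, 1) for name, (score, margin) in reported.items()
}

# A reduction of r means the remaining cost is (1 - r) of the reference.
latency_ratio = round(1 - 0.318, 3)  # best case ("up to 31.8%")
tflops_ratio = round(1 - 0.29, 2)    # approximate ("about 29%")

for name, previous in implied_previous_best.items():
    print(f"{name}: implied previous best = {previous}")
print(f"latency ratio (best case): {latency_ratio}")
print(f"TFLOPs ratio (approx.): {tflops_ratio}")
```

In other words, the implied prior bests are 60.6, 37.0, 70.2, and 85.3, and in the best case InnerZoom takes roughly 0.68 of the reference latency.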

Similar Articles

DRS-GUI: Dynamic Region Search for Training-Free GUI Grounding

Trust the Right Teacher: Quality-Aware Self-Distillation for GUI Grounding

@HuggingPapers: Microsoft just released Phi-Ground-Any on Hugging Face A 4B parameter vision model for GUI grounding that achieves SOTA…

VISTA: View-Consistent Self-Verified Training for GUI Grounding

PAGER: Bridging the Semantic-Execution Gap in Point-Precise Geometric GUI Control
