VISTA: View-Consistent Self-Verified Training for GUI Grounding
Summary
VISTA introduces a view-consistent self-verified training method for GUI grounding that improves GRPO-based coordinate prediction by using multiple target-preserving views, achieving consistent accuracy gains across benchmarks.
Source: https://huggingface.co/papers/2606.14579
VISTA introduces View-Consistent Self-Verified Training for GUI grounding, addressing a key limitation of applying GRPO to coordinate prediction: rollouts from a single screenshot view often collapse into all-success or all-failure groups, providing weak relative advantages.
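To see why a collapsed group carries little signal, consider a minimal sketch of group-relative advantage normalization in the usual GRPO style (reward minus group mean, divided by group standard deviation). The binary hit/miss reward and the function name `group_relative_advantages` are assumptions made for illustration and are not taken from the VISTA code.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one rollout group (GRPO-style sketch)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every rollout gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


# Assumed binary reward: 1.0 if the predicted point hits the target, else 0.0.
all_hit = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
all_miss = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
mixed = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

In the all-success and all-failure groups every advantage is zero, so those rollouts contribute no gradient signal; only the mixed group separates good rollouts from bad ones.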
Our approach builds each GRPO comparison group from multiple target-preserving views of the same GUI instance. These views are generated by crops that keep the target element visible while exactly remapping its bounding box, enabling comparisons across semantically equivalent but geometrically different inputs. VISTA also adds a self-verified cross-view anchor to stabilize short coordinate generation without turning RL into unconditional imitation.
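The crop-and-remap step described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the released implementation: it assumes integer pixel coordinates, a target box lying fully inside the screenshot, and crops that are not resized afterwards (so remapping is a pure translation). The names `Box`, `sample_crop`, `remap_box`, and the `min_scale` parameter are hypothetical.

```python
import random
from typing import NamedTuple


class Box(NamedTuple):
    x0: int
    y0: int
    x1: int
    y1: int


def sample_crop(width, height, target, min_scale=0.5, rng=None):
    """Sample a crop window that fully contains the target box."""
    rng = rng or random.Random()
    # Crop size: a random fraction of the screenshot, never smaller than the target.
    cw = max(target.x1 - target.x0, int(rng.uniform(min_scale, 1.0) * width))
    ch = max(target.y1 - target.y0, int(rng.uniform(min_scale, 1.0) * height))
    # Offsets are constrained so the crop stays in the image and covers the target.
    left = rng.randint(max(0, target.x1 - cw), min(target.x0, width - cw))
    top = rng.randint(max(0, target.y1 - ch), min(target.y0, height - ch))
    return Box(left, top, left + cw, top + ch)


def remap_box(target, crop):
    """Express the target box in the crop's coordinate frame (exact translation)."""
    return Box(target.x0 - crop.x0, target.y0 - crop.y0,
               target.x1 - crop.x0, target.y1 - crop.y0)


# Build four views of one hypothetical 1920x1080 screenshot.
rng = random.Random(0)
target = Box(1200, 640, 1260, 664)
views = []
for _ in range(4):
    crop = sample_crop(1920, 1080, target, rng=rng)
    views.append((crop, remap_box(target, crop)))
```

Each view shows the same element at a different position and scale relative to the visible area, which is what makes the inputs semantically equivalent but geometrically different. If views were resized to a fixed resolution, the remapped box would additionally need to be scaled by the resize factors.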
Across five GUI-grounding benchmarks and multiple Qwen backbones, VISTA consistently improves accuracy. On ScreenSpot-Pro, it improves Qwen3-VL 4B/8B/30B-A3B from 55.5/52.7/53.7 to 63.4/65.8/67.0. Code, project page, and open checkpoints are available:
Code: https://github.com/ZJUSCL/VISTA
Project page: https://zjuscl.github.io/VISTA/
Models: https://huggingface.co/inclusionAI/VISTA-9B and https://huggingface.co/inclusionAI/VISTA-4B
Similar Articles
DRS-GUI: Dynamic Region Search for Training-Free GUI Grounding
DRS-GUI proposes a training-free dynamic region search framework for GUI grounding, using a lightweight UI Perceptor with human-like perceptual actions and Monte Carlo Tree Search to progressively locate instruction-relevant elements. Experiments show a 14% improvement on ScreenSpot-Pro for both general and GUI-specific MLLMs.
Trust the Right Teacher: Quality-Aware Self-Distillation for GUI Grounding
Proposes quality-aware self-distillation for GUI grounding, improving coordinate-token teacher signals via correctness-aware gating and probability scaling to enhance vision-language model performance.
@HuggingPapers: Microsoft just released Phi-Ground-Any on Hugging Face A 4B parameter vision model for GUI grounding that achieves SOTA…
Microsoft has released Phi-Ground-Any, a 4B parameter vision model for GUI grounding on Hugging Face that achieves state-of-the-art results, enabling AI agents to precisely interact with screen elements.
Video2GUI: Synthesizing Large-Scale Interaction Trajectories for Generalized GUI Agent Pretraining
Proposes Video2GUI, a framework to automatically extract GUI interaction trajectories from unlabeled instructional videos, building WildGUI dataset with 12M trajectories across 1500+ apps. Pre-training on this data yields 5-20% improvements on GUI grounding and action benchmarks.
Thinking with Visual Grounding
This paper introduces visually grounded thinking, a method for vision-language models to interleave natural-language reasoning with explicit visual evidence grounding using points or boxes. A scalable synthesis pipeline and grounding-aware reinforcement learning improve reasoning accuracy, enabling a 4B model to match or surpass a 27B model on spatial and counting benchmarks.