Tag
Proposes quality-aware self-distillation for GUI grounding, improving coordinate-token teacher signals via correctness-aware gating and probability scaling to enhance vision-language model performance.
VISTA introduces a view-consistent self-verified training method for GUI grounding that improves GRPO-based coordinate prediction by using multiple target-preserving views, achieving consistent accuracy gains across benchmarks.
DRS-GUI proposes a training-free dynamic region search framework for GUI grounding, using a lightweight UI Perceptor with human-like perceptual actions and Monte Carlo Tree Search to progressively locate instruction-relevant elements. Experiments show a 14% improvement on ScreenSpot-Pro for both general and GUI-specific MLLMs.
Microsoft has released Phi-Ground-Any, a 4B parameter vision model for GUI grounding on Hugging Face that achieves state-of-the-art results, enabling AI agents to precisely interact with screen elements.