Tag
Proposes quality-aware self-distillation for GUI grounding, improving coordinate-token teacher signals via correctness-aware gating and probability scaling to enhance vision-language model performance.
VISTA introduces a view-consistent self-verified training method for GUI grounding that improves GRPO-based coordinate prediction by using multiple target-preserving views, achieving consistent accuracy gains across benchmarks.