Trust the Right Teacher: Quality-Aware Self-Distillation for GUI Grounding
Summary
Proposes quality-aware self-distillation for GUI grounding, improving coordinate-token teacher signals via correctness-aware gating and probability scaling to enhance vision-language model performance.
Cached at: 06/18/26, 03:55 AM
Source: https://huggingface.co/papers/2606.18101
Abstract
Quality-aware self-distillation improves vision-language model performance for GUI grounding by enhancing coordinate-token teacher signals through correctness-aware gating and probability scaling.
Graphical user interface (GUI) grounding requires vision-language models (VLMs) to identify small target elements in high-resolution screenshots and predict precise screen coordinates. On-policy self-distillation (OPSD) is a promising post-training approach for this coordinate-sensitive task, since it provides dense token-level teacher signals beyond hard coordinate labels. However, naive OPSD is not well suited to GUI grounding: because OPSD evaluates the teacher on student-generated prefixes, the quality of coordinate-token teacher signals can degrade when the prefix has already deviated from the target coordinate, making the teacher signal unreliable. To mitigate this, we propose quality-aware self-distillation for VLM-based GUI grounding, which improves coordinate-token teacher-signal quality through soft correctness-aware gating and teacher-probability scaling. The soft correctness-aware gate checks whether the teacher’s current coordinate-token prediction can still be completed into the ground-truth box under the student-generated prefix. If not, the corresponding teacher signal is down-weighted. Teacher-probability scaling then uses the teacher’s confidence as a lightweight factor to further calibrate the strength of the gated supervision. A key empirical finding is that neither component alone improves overall performance, whereas combining them does so consistently. This suggests that the two mechanisms play complementary roles: correctness-aware gating suppresses unreliable coordinate-token supervision, while teacher-probability scaling calibrates the strength of the remaining signals. Experiments across six GUI grounding benchmarks show that our method consistently improves the base model and outperforms strong baselines.
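The abstract does not give formulas, so the following is only a rough sketch of how the gate and the scaling could combine into a per-token weight for one coordinate axis. Everything specific here is an assumption made for illustration and not taken from the paper: coordinates are written as zero-padded, fixed-width digit tokens; the gate looks only at the teacher's top-1 digit; an infeasible prediction is multiplied by a hypothetical `soft_floor` rather than dropped; and the two factors are combined by a plain product. The names `can_complete` and `token_weight` are likewise hypothetical.

```python
def can_complete(prefix: str, next_digit: str, lo: int, hi: int, width: int = 4) -> bool:
    """Return True if prefix + next_digit can still be extended to a
    zero-padded `width`-digit coordinate inside the ground-truth range [lo, hi]."""
    digits = prefix + next_digit
    remaining = width - len(digits)
    if remaining < 0:
        return False
    # Smallest and largest coordinates reachable from this digit prefix.
    smallest = int(digits) * 10 ** remaining
    largest = smallest + 10 ** remaining - 1
    return smallest <= hi and largest >= lo


def token_weight(prefix: str, teacher_probs: dict, lo: int, hi: int,
                 width: int = 4, soft_floor: float = 0.1) -> float:
    """Weight for the teacher signal at one coordinate-token position.

    prefix        -- digits already generated by the student
    teacher_probs -- teacher distribution over the next digit, e.g. {"5": 0.8}
    soft_floor    -- assumed down-weighting factor for infeasible predictions
    """
    # Teacher's current (top-1) prediction for the next coordinate token.
    top_digit = max(teacher_probs, key=teacher_probs.get)
    # Soft gate: full weight if the box is still reachable, reduced otherwise.
    gate = 1.0 if can_complete(prefix, top_digit, lo, hi, width) else soft_floor
    # Probability scaling: multiply by the teacher's confidence in that token.
    return gate * teacher_probs[top_digit]


# Ground-truth box spans x in [120, 180]; the teacher is 80% confident in "5".
on_track = token_weight("01", {"5": 0.8, "1": 0.2}, lo=120, hi=180)
off_track = token_weight("03", {"5": 0.8, "1": 0.2}, lo=120, hi=180)
```

With the student prefix `"01"`, the digit `"5"` can still reach 150 to 159, which overlaps the box, so the teacher signal keeps its confidence-scaled weight. With the deviated prefix `"03"`, no completion lands in the box and the same teacher prediction is down-weighted. In a training loop, a weight like this would presumably multiply the per-token distillation loss, but that wiring is also an assumption.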
Similar Articles
DRS-GUI: Dynamic Region Search for Training-Free GUI Grounding
DRS-GUI proposes a training-free dynamic region search framework for GUI grounding, using a lightweight UI Perceptor with human-like perceptual actions and Monte Carlo Tree Search to progressively locate instruction-relevant elements. Experiments show a 14% improvement on ScreenSpot-Pro for both general and GUI-specific MLLMs.
VISTA: View-Consistent Self-Verified Training for GUI Grounding
VISTA introduces a view-consistent self-verified training method for GUI grounding that improves GRPO-based coordinate prediction by using multiple target-preserving views, achieving consistent accuracy gains across benchmarks.
Skill-Guided Continuation Distillation for GUI Agents
The paper proposes Skill-Guided Continuation Distillation (SGCD), an iterative self-improvement framework that uses skill-guided policies to generate supervision for off-trajectory states during closed-loop execution, improving GUI agent success rates on OSWorld-Verified from around 30% to over 50%.
When Are Teacher Tokens Reliable? Position-Weighted On-Policy Self-Distillation for Reasoning
This paper identifies that teacher token reliability in reasoning distillation is trajectory-structured and proposes Position-Weighted On-Policy Self-Distillation (PW-OPSD), which applies increasing position weights to improve performance without additional teacher computation.
Seeing Before Reasoning: Decoupling Perception and Reasoning for Shortcut-Resilient Multimodal On-Policy Self-Distillation
This paper introduces ViGOS, a method for multimodal on-policy self-distillation that decouples perception and reasoning by having the student model first produce a visual description before reasoning, reducing shortcut reliance and improving image-grounding behavior.