Tag
Proposes quality-aware self-distillation for GUI grounding, using correctness-aware gating and probability scaling to improve the teacher signal on coordinate tokens and thereby enhance vision-language model performance.
Introduces BlendIn, an inference-time alignment framework that uses probabilistic model blending to assess the reliability of guidance and weight each model's contribution proportionally, achieving up to a 50% performance improvement by avoiding harmful interventions.