Seeing Before Reasoning: Decoupling Perception and Reasoning for Shortcut-Resilient Multimodal On-Policy Self-Distillation

Hugging Face Daily Papers 06/17/26, 12:00 AM Papers

multimodal self-distillation reasoning perception shortcuts mllm grounding

Summary

This paper introduces ViGOS, a method for multimodal on-policy self-distillation that decouples perception and reasoning by having the student model first produce a visual description before reasoning, reducing shortcut reliance and improving image-grounding behavior.

On-policy self-distillation (OPSD) trains a model on its own rollouts and uses a frozen copy to provide dense token-level targets conditioned on a reference target. This works well for LLM reasoning, but a direct extension to multimodal large language models (MLLMs) can create a shortcut: the privileged target may guide tokens mainly based on the text reference target rather than the image. We propose ViGOS, a visually grounded OPSD framework for MLLM post-training. The student first writes a visual description and then reasons toward the final answer. For valid rollouts, an image-only perception teacher supervises the description, while a privileged reasoning teacher supervises the reasoning and final answer on the same student prefix. A reference teacher is used only for invalid rollouts to recover the output format. Across general vision-language, expert reasoning, visual math, spatial grounding, and visual-language-prior benchmarks, ViGOS keeps the main benefits of OPSD and improves image-grounded behavior in shortcut-prone settings.
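The teacher assignment described above can be sketched as a small routing function. This is only an illustration under stated assumptions: the `<description>`, `<reasoning>`, and `<answer>` tags, the `Segment` type, and the `route_supervision` name are hypothetical and are not taken from the paper or its code release.

```python
import re
from dataclasses import dataclass
from typing import List

# Hypothetical output format: these tags are illustrative, not the paper's.
ROLLOUT_PATTERN = re.compile(
    r"^<description>(?P<description>.+?)</description>\s*"
    r"<reasoning>(?P<reasoning>.+?)</reasoning>\s*"
    r"<answer>(?P<answer>.+?)</answer>$",
    re.DOTALL,
)


@dataclass
class Segment:
    text: str     # span of the student's rollout
    teacher: str  # which teacher supplies targets for this span


def route_supervision(rollout: str) -> List[Segment]:
    """Assign a teacher to each span of a student rollout."""
    match = ROLLOUT_PATTERN.match(rollout.strip())
    if match is None:
        # Invalid rollout: the reference teacher supervises the whole
        # output so the model can recover the expected format.
        return [Segment(rollout, "reference")]
    return [
        # Image-only teacher: sees the image but not the reference answer.
        Segment(match.group("description"), "perception"),
        # Privileged teacher: supervises reasoning and the final answer.
        Segment(match.group("reasoning"), "reasoning"),
        Segment(match.group("answer"), "reasoning"),
    ]
```

A well-formed rollout yields three segments (perception, reasoning, reasoning), while any output that does not match the assumed format falls back to a single reference-supervised segment.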

Cached at: 06/18/26, 03:57 PM

Source: https://huggingface.co/papers/2606.19120

🚀 ViGOS: Seeing Before Reasoning for Shortcut-Resilient Multimodal OPSD

MLLMs can reason impressively, but do they really look before they reason? 👀 Vanilla multimodal on-policy self-distillation may let the privileged reference answer leak into dense token supervision, pushing the model toward answer-compatible rationales before visual evidence is grounded.

ViGOS fixes this with a simple but powerful idea: see first, reason second. ✨ The student first writes an explicit visual description, supervised by an image-only perception teacher. Then, only after this visual prefix is in place, a privileged reasoning teacher guides the reasoning and final answer. A reference teacher is used only as a fallback for malformed rollouts, and all teachers are removed at inference time.
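To make "dense token supervision" from two teachers concrete, here is a minimal sketch of a masked token-level loss. Several things are assumptions rather than details from the source: the divergence (reverse KL, student against teacher), the equal weighting of the two terms, the dictionary representation of next-token distributions, and all function names.

```python
import math
from typing import Dict, List, Sequence

Dist = Dict[str, float]  # token -> probability at one position


def reverse_kl(student: Dist, teacher: Dist, eps: float = 1e-12) -> float:
    """KL(student || teacher) at a single position (assumed divergence)."""
    return sum(
        p * math.log(p / max(teacher.get(tok, 0.0), eps))
        for tok, p in student.items()
        if p > 0.0
    )


def segment_loss(student: Sequence[Dist], teacher: Sequence[Dist],
                 mask: Sequence[bool]) -> float:
    """Mean divergence over the positions selected by the mask."""
    values = [reverse_kl(s, t) for s, t, m in zip(student, teacher, mask) if m]
    return sum(values) / len(values) if values else 0.0


def two_teacher_loss(student: Sequence[Dist],
                     perception_teacher: Sequence[Dist],
                     reasoning_teacher: Sequence[Dist],
                     is_description: List[bool]) -> float:
    """Description tokens follow the perception teacher; all later tokens
    follow the reasoning teacher. Equal weighting is an assumption."""
    is_reasoning = [not d for d in is_description]
    return (segment_loss(student, perception_teacher, is_description)
            + segment_loss(student, reasoning_teacher, is_reasoning))
```

The point of the mask is that the privileged teacher never contributes targets for description tokens, so reference-answer information cannot shape what the student says it sees.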

📈 Results: ViGOS keeps the main OPSD gains on multimodal reasoning benchmarks while improving image-grounded behavior in shortcut-prone settings. On Qwen2.5-VL backbones, ViGOS reaches 71.97 mean Pass@5 on 3B and 75.60 on 7B, and achieves the best ViLP prior-conflict scores across all tested settings, helping models trust the image when priors are wrong. 🔥
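For readers unfamiliar with the metric, Pass@k is the probability that at least one of k sampled answers is correct. The source does not say how its Pass@5 numbers were computed, so the sketch below assumes the widely used unbiased estimator from n samples with c correct; the function name is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate Pass@k from n samples of which c are correct.

    Computes 1 - C(n - c, k) / C(n, k): one minus the chance that a
    random size-k subset of the n samples contains no correct answer.
    """
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset has a hit.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With n equal to k, this reduces to checking whether any of the five samples is correct. A mean Pass@5 would then be this value averaged over problems and benchmarks, which is also an assumption about the reported numbers.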

One-line pitch: 🧠➡️👁️ ViGOS teaches MLLMs to ground visual evidence before reasoning, reducing shortcuts without sacrificing strong answer guidance.

🔗 Links

Project Page: https://oedosoldier.github.io/ViGOS/
Paper: https://arxiv.org/abs/2606.19120
Code: https://github.com/OedoSoldier/ViGOS
ViGOS-3B: https://huggingface.co/OedoSoldier/ViGOS-3B
ViGOS-7B: https://huggingface.co/OedoSoldier/ViGOS-7B

Similar Articles

From Seeing to Thinking: Decoupling Perception and Reasoning Improves Post-Training of Vision-Language Models

More Thought, Less Accuracy? On the Dual Nature of Reasoning in Vision-Language Models

Training with Harnesses: On-Policy Harness Self-Distillation for Complex Reasoning

Bad Seeing or Bad Thinking? Rewarding Perception for Vision-Language Reasoning

OmniThoughtVis: A Scalable Distillation Pipeline for Deployable Multimodal Reasoning Models
