Seeing Before Reasoning: Decoupling Perception and Reasoning for Shortcut-Resilient Multimodal On-Policy Self-Distillation

Hugging Face Daily Papers Papers

Summary

This paper introduces ViGOS, a method for multimodal on-policy self-distillation that decouples perception and reasoning by having the student model first produce a visual description before reasoning, reducing shortcut reliance and improving image-grounding behavior.

On-policy self-distillation (OPSD) trains a model on its own rollouts and uses a frozen copy to provide dense token-level targets conditioned on a reference target. This works well for LLM reasoning, but a direct extension to multimodal large language models (MLLMs) can create a shortcut: the privileged target may guide tokens mainly based on the text reference target rather than the image. We propose ViGOS, a visually grounded OPSD framework for MLLM post-training. The student first writes a visual description and then reasons toward the final answer. For valid rollouts, an image-only perception teacher supervises the description, while a privileged reasoning teacher supervises the reasoning and final answer on the same student prefix. A reference teacher is used only for invalid rollouts to recover the output format. Across general vision-language, expert reasoning, visual math, spatial grounding, and visual-language-prior benchmarks, ViGOS keeps the main benefits of OPSD and improves image-grounded behavior in shortcut-prone settings.
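
The teacher assignment described above can be pictured with a small routing sketch. This is only an illustration under stated assumptions: the `<description>`, `<reasoning>`, and `<answer>` tags, the `ROLLOUT_PATTERN` regex, and the `route_teachers` function are hypothetical names, since the summary does not specify the actual output format or validity check used by ViGOS.

```python
import re
from typing import List, Tuple

# Hypothetical rollout format (an assumption, not taken from the paper):
# a visual description, then reasoning, then the final answer.
ROLLOUT_PATTERN = re.compile(
    r"^\s*<description>(?P<desc>.+?)</description>\s*"
    r"<reasoning>(?P<reason>.+?)</reasoning>\s*"
    r"<answer>(?P<ans>.+?)</answer>\s*$",
    re.DOTALL,
)


def route_teachers(rollout: str) -> List[Tuple[str, str]]:
    """Return (teacher_name, text_span) pairs for one student rollout."""
    match = ROLLOUT_PATTERN.match(rollout)
    if match is None:
        # Invalid rollout: the reference teacher supervises everything,
        # which is meant to recover the output format.
        return [("reference", rollout)]
    return [
        # Image-only perception teacher supervises the description.
        ("perception", match.group("desc")),
        # Privileged reasoning teacher supervises reasoning and answer.
        ("reasoning", match.group("reason")),
        ("reasoning", match.group("ans")),
    ]
```

In this sketch a rollout that matches the assumed format is split between the perception and reasoning teachers, while anything else falls through to the reference teacher as a whole.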

Cached at: 06/18/26, 03:57 PM


Source: https://huggingface.co/papers/2606.19120

🚀 ViGOS: Seeing Before Reasoning for Shortcut-Resilient Multimodal OPSD

MLLMs can reason impressively, but do they really look before they reason? 👀 Vanilla multimodal on-policy self-distillation may let the privileged reference answer leak into dense token supervision, pushing the model toward answer-compatible rationales before visual evidence is grounded.

ViGOS fixes this with a simple but powerful idea: see first, reason second. ✨ The student first writes an explicit visual description, supervised by an image-only perception teacher. Then, only after this visual prefix is in place, a privileged reasoning teacher guides the reasoning and final answer. A reference teacher is used only as a fallback for malformed rollouts, and all teachers are removed at inference time.

📈 Results: ViGOS keeps the main OPSD gains on multimodal reasoning benchmarks while improving image-grounded behavior in shortcut-prone settings. On Qwen2.5-VL backbones, ViGOS reaches 71.97 mean Pass@5 on 3B and 75.60 on 7B, and achieves the best ViLP prior-conflict scores across all tested settings, helping models trust the image when priors are wrong. 🔥
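
For readers unfamiliar with the metric, the sketch below shows one common way a mean Pass@k score is computed. The post does not say how its Pass@5 numbers were obtained, so this is an assumption: it uses the widely used unbiased estimator 1 - C(n-c, k)/C(n, k), where n samples are drawn per problem and c of them are correct. The function names `pass_at_k` and `mean_pass_at_k` are illustrative.

```python
from math import comb
from typing import Iterable, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the chance that k samples drawn
    # without replacement are all incorrect.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: Iterable[Tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, as a percentage.

    results holds one (n, c) pair per problem.
    """
    scores = [pass_at_k(n, c, k) for n, c in results]
    return 100.0 * sum(scores) / len(scores)
```

With exactly five samples per problem, Pass@5 under this estimator reduces to whether at least one of the five samples is correct.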

One-line pitch: 🧠➡️👁️ ViGOS teaches MLLMs to ground visual evidence before reasoning, reducing shortcuts without sacrificing strong answer guidance.


Similar Articles

More Thought, Less Accuracy? On the Dual Nature of Reasoning in Vision-Language Models

Papers with Code Trending

This paper uncovers that prolonged reasoning in vision-language models can impair perceptual grounding, causing recognition failures on basic visual questions. It proposes Vision-Anchored Policy Optimization (VAPO) to steer reasoning toward visually grounded trajectories, achieving state-of-the-art performance with the VAPO-Thinker-7B model.

Training with Harnesses: On-Policy Harness Self-Distillation for Complex Reasoning

arXiv cs.CL

This paper introduces On-Policy Harness Self-Distillation (OPHSD), a method that internalizes the capabilities of inference-time reasoning harnesses into the base model through self-distillation. The approach improves standalone performance on complex reasoning tasks, allowing the model to retain reasoning scaffolds without permanent external dependencies.

OmniThoughtVis: A Scalable Distillation Pipeline for Deployable Multimodal Reasoning Models

arXiv cs.CL

This paper introduces OmniThoughtVis, a scalable pipeline for distilling multimodal reasoning capabilities from large teacher models to smaller, deployment-oriented MLLMs. The method uses curated chain-of-thought data to significantly improve reasoning performance on benchmarks like MathVerse and MMMU-Pro for models ranging from 2B to 8B parameters.