Seeing Before Reasoning: Decoupling Perception and Reasoning for Shortcut-Resilient Multimodal On-Policy Self-Distillation
Summary
This paper introduces ViGOS, a method for multimodal on-policy self-distillation that decouples perception and reasoning by having the student model first produce a visual description before reasoning, reducing shortcut reliance and improving image-grounding behavior.
Cached at: 06/18/26, 03:57 PM
Source: https://huggingface.co/papers/2606.19120
ViGOS: Seeing Before Reasoning for Shortcut-Resilient Multimodal OPSD
MLLMs can reason impressively, but do they really look before they reason? Vanilla multimodal on-policy self-distillation (OPSD) may let the privileged reference answer leak into dense token supervision, pushing the model toward answer-compatible rationales before visual evidence is grounded.
ViGOS fixes this with a simple but powerful idea: see first, reason second. The student first writes an explicit visual description, supervised by an image-only perception teacher. Then, only after this visual prefix is in place, a privileged reasoning teacher guides the reasoning and final answer. A reference teacher is used only as a fallback for malformed rollouts, and all teachers are removed at inference time.
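The routing described above can be sketched as a per-token assignment of teachers. This is a minimal illustration, not the paper's implementation: the `<description>` delimiters, the well-formedness rule, and the function name `assign_teachers` are all assumptions made for the example, since the summary does not specify the rollout format or how malformed rollouts are detected.

```python
from typing import List

# Hypothetical delimiters for the visual description; the source does not
# specify the actual rollout format.
DESC_OPEN = "<description>"
DESC_CLOSE = "</description>"


def assign_teachers(tokens: List[str]) -> List[str]:
    """Return, for each rollout token, the name of the teacher that supervises it."""
    # Illustrative rule (an assumption): a rollout is well formed only if it
    # starts with the opening tag and contains exactly one description block.
    well_formed = (
        len(tokens) > 0
        and tokens[0] == DESC_OPEN
        and tokens.count(DESC_OPEN) == 1
        and tokens.count(DESC_CLOSE) == 1
    )
    if not well_formed:
        # Fallback: the reference teacher supervises the whole rollout.
        return ["reference"] * len(tokens)

    close_idx = tokens.index(DESC_CLOSE)
    n_prefix = close_idx + 1  # visual prefix, tags included
    # Visual prefix -> image-only perception teacher.
    # Everything after the prefix -> privileged reasoning teacher.
    return ["perception"] * n_prefix + ["reasoning"] * (len(tokens) - n_prefix)


if __name__ == "__main__":
    rollout = [DESC_OPEN, "a", "red", "cube", DESC_CLOSE, "so", "answer", "A"]
    for token, teacher in zip(rollout, assign_teachers(rollout)):
        print(f"{token:>16}  {teacher}")
```

In a training loop, such labels would select which teacher's token distribution each student position is distilled toward. At inference time no routing is needed, because all teachers are removed.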
Results: ViGOS keeps the main OPSD gains on multimodal reasoning benchmarks while improving image-grounded behavior in shortcut-prone settings. On Qwen2.5-VL backbones, ViGOS reaches 71.97 mean Pass@5 on 3B and 75.60 on 7B, and achieves the best ViLP prior-conflict scores across all tested settings, helping models trust the image when priors are wrong.
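For readers unfamiliar with the metric, Pass@k is commonly computed with the unbiased estimator 1 - C(n-c, k) / C(n, k), where n is the number of sampled answers per problem and c is the number of correct ones. The sketch below shows that estimator and a mean over problems. Whether the paper uses this estimator or a plain any-of-5 check is not stated in this summary, so treat the choice as an assumption; the function names are hypothetical.

```python
from math import comb
from typing import Iterable, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn from n is correct,
    given that c of the n samples are correct."""
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: any draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: Iterable[Tuple[int, int]], k: int) -> float:
    """Average Pass@k, as a percentage, over (n, c) pairs, one per problem."""
    scores = [pass_at_k(n, c, k) for n, c in results]
    return 100.0 * sum(scores) / len(scores)


if __name__ == "__main__":
    # Three made-up problems with 10 samples each (illustrative data only).
    print(mean_pass_at_k([(10, 0), (10, 1), (10, 6)], k=5))
```

With n = k = 5, the estimator reduces to checking whether any of the five samples is correct.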
One-line pitch: ViGOS teaches MLLMs to ground visual evidence before reasoning, reducing shortcuts without sacrificing strong answer guidance.
Similar Articles
From Seeing to Thinking: Decoupling Perception and Reasoning Improves Post-Training of Vision-Language Models
This paper proposes a staged training approach for vision-language models that separates visual perception, visual reasoning, and textual reasoning into distinct stages. The method improves visual reasoning accuracy while reducing reasoning trace length, demonstrating that stronger perception reduces the need for excessive reasoning.
More Thought, Less Accuracy? On the Dual Nature of Reasoning in Vision-Language Models
This paper uncovers that prolonged reasoning in vision-language models can impair perceptual grounding, causing recognition failures on basic visual questions. It proposes Vision-Anchored Policy Optimization (VAPO) to steer reasoning toward visually grounded trajectories, achieving state-of-the-art performance with the VAPO-Thinker-7B model.
Training with Harnesses: On-Policy Harness Self-Distillation for Complex Reasoning
This paper introduces On-Policy Harness Self-Distillation (OPHSD), a method that internalizes the capabilities of inference-time reasoning harnesses into the base model through self-distillation. The approach improves standalone performance on complex reasoning tasks, allowing the model to retain reasoning scaffolds without permanent external dependencies.
Bad Seeing or Bad Thinking? Rewarding Perception for Vision-Language Reasoning
This paper introduces a reinforcement learning framework that improves perception-reasoning synergy in vision-language models by explicitly rewarding perceptual fidelity, using a 'blindfolded reasoning' proxy and structured verbal verification to address ambiguity in modality credit assignment.
OmniThoughtVis: A Scalable Distillation Pipeline for Deployable Multimodal Reasoning Models
This paper introduces OmniThoughtVis, a scalable pipeline for distilling multimodal reasoning capabilities from large teacher models to smaller, deployment-oriented MLLMs. The method uses curated chain-of-thought data to significantly improve reasoning performance on benchmarks like MathVerse and MMMU-Pro for models ranging from 2B to 8B parameters.