DyCo-RL: Dynamic Cross-Modal Coordination for Visual Reasoning
Summary
This paper identifies that failures in visual reasoning often stem from breakdowns in dynamic cross-modal coordination between visual and textual evidence during chain-of-thought generation. It introduces DyCo-RL, a reinforcement learning framework that rewards effective cross-modal coordination, leading to improved reasoning performance.
Source: https://huggingface.co/papers/2606.08035
Why do visual reasoning failures persist even after RLVR training?
We find that the issue is often not a visual perception error or a text reasoning error alone, but a failure of dynamic cross-modal coordination. During chain-of-thought generation, successful reasoning requires models to switch continuously between looking at visual evidence and reasoning over previously established textual context, and reasoning failures are often associated with breakdowns in this coordination process. Existing RLVR (reinforcement learning with verifiable rewards) methods optimize final outcomes but largely ignore this token-level behavior.
Through token-level analyses and causal interventions, we show that reasoning failures frequently occur when visually oriented tokens stop attending to relevant image content, or when text-oriented tokens fail to remain grounded in prior reasoning history.
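The kind of token-level check described above can be illustrated with a small sketch. This is not the paper's analysis code: the function names, the "visual"/"text" role labels, and the `min_mass` threshold are all hypothetical, and token roles are assumed to be given. It only shows what it means for a token's attention to be out of line with its role.

```python
def image_attention_mass(attn_row, image_positions):
    """Fraction of one token's attention that lands on image positions."""
    total = sum(attn_row)
    if total <= 0:
        raise ValueError("attention row must have positive mass")
    return sum(attn_row[i] for i in image_positions) / total


def flag_ungrounded_tokens(attn_rows, roles, image_positions, min_mass=0.3):
    """Return indices of tokens whose attention does not match their role.

    attn_rows: one attention distribution per generated token (hypothetical input).
    roles: "visual" or "text" per token (assumed known for this sketch).
    min_mass: illustrative threshold, not a value from the paper.
    """
    flagged = []
    for t, (row, role) in enumerate(zip(attn_rows, roles)):
        mass = image_attention_mass(row, image_positions)
        # A "visual" token should attend to the image, a "text" token to prior text.
        aligned = mass if role == "visual" else 1.0 - mass
        if aligned < min_mass:
            flagged.append(t)
    return flagged
```

For example, with positions 0 and 1 holding image tokens, a "visual" token that places only 10% of its attention on them would be flagged, as would a "text" token that places 95% of its attention on the image.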
To address this problem, we introduce DyCo-RL, a plug-and-play RLVR framework that explicitly rewards effective cross-modal coordination. DyCo-RL identifies token functional roles using Fisher–Rao attention dynamics and reweights policy optimization according to role-attention alignment. The resulting models exhibit substantially stronger reasoning performance across diverse visual and mathematical reasoning benchmarks.
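The summary does not give DyCo-RL's formulas, so the following is only a rough sketch of the two ingredients it names. The Fisher-Rao distance between two categorical distributions has a standard closed form, 2·arccos(Σ √(pᵢqᵢ)), which could be applied to consecutive attention distributions to measure how sharply attention shifts. The reweighting rule, the `beta` parameter, and the alignment scores are hypothetical and not taken from the paper.

```python
import math


def fisher_rao_distance(p, q):
    """Fisher-Rao geodesic distance between two categorical distributions.

    Assumes p and q are normalized and have the same length.
    """
    bc = sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))
    bc = min(1.0, max(0.0, bc))  # guard against floating-point overshoot
    return 2.0 * math.acos(bc)


def reweight_advantages(advantages, alignment, beta=1.0):
    """Scale per-token advantages by how well attention matches the token's role.

    Hypothetical rule: weight = 1 + beta * (alignment - mean alignment),
    so weights average to 1 and well-aligned tokens receive more credit.
    """
    mean_alignment = sum(alignment) / len(alignment)
    return [
        adv * (1.0 + beta * (a - mean_alignment))
        for adv, a in zip(advantages, alignment)
    ]
```

Centering the weights around 1 keeps the overall scale of the policy-gradient signal unchanged in this toy version; whether the actual method normalizes this way is not stated in the summary.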
Similar Articles
DyCon: Dynamic Reasoning Control via Evolving Difficulty Modeling
This paper introduces DyCon, a training-free framework that uses step-level embeddings to model evolving task difficulty and dynamically control reasoning depth in Large Reasoning Models, effectively reducing overthinking and improving efficiency without sacrificing accuracy.
CORA: Analyzing and bridging thinking-answer gap in Multimodal RLVR via Consistency-Oriented Reasoning Alignment
This paper analyzes the thinking-answer inconsistency in multimodal reinforcement learning with verifiable rewards (RLVR) for large vision-language models and proposes CORA, a method that introduces a consistency reward model and hybrid reward advantage splitting to improve faithfulness and task performance.
CollabVR: Collaborative Video Reasoning with Vision-Language and Video Generation Models
CollabVR is a research paper proposing a closed-loop framework that integrates vision-language models with video generation models to improve visual reasoning and correct failures in real time.
Bad Seeing or Bad Thinking? Rewarding Perception for Vision-Language Reasoning
This paper introduces a reinforcement learning framework that improves perception-reasoning synergy in vision-language models by explicitly rewarding perceptual fidelity, using a 'blindfolded reasoning' proxy and structured verbal verification to address ambiguity in modality credit assignment.
Visual Reasoning through Tool-supervised Reinforcement Learning
Introduces ToolsRL, a two-stage reinforcement learning framework that teaches multimodal LLMs to use simple visual tools for complex visual reasoning tasks.