DyCo-RL: Dynamic Cross-Modal Coordination for Visual Reasoning

Hugging Face Daily Papers 06/06/26, 12:00 AM Papers

Summary

This paper identifies that failures in visual reasoning often stem from breakdowns in dynamic cross-modal coordination between visual and textual evidence during chain-of-thought generation. It introduces DyCo-RL, a reinforcement learning framework that rewards effective cross-modal coordination, leading to improved reasoning performance.

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a leading paradigm for enhancing visual reasoning in Multimodal Large Language Models (MLLMs). However, existing RLVR methods optimize primarily for the reasoning outcome, fundamentally overlooking the fine-grained cross-modal coordination required during the generation process. Through token-level analyses and controlled interventions, we reveal that during Chain-of-Thought (CoT) reasoning, MLLMs frequently fail to dynamically alternate between extracting visual evidence and synthesizing textual context, a coordination breakdown that is causally linked to reasoning failures. Motivated by these findings, we propose DyCo-RL, which integrates dynamic cross-modal coordination into RLVR optimization. Specifically, DyCo-RL uses the Fisher-Rao geodesic distance to measure within-modality attention shifts, assigning tokens to either visually-oriented or text-oriented functional roles. It then evaluates the alignment between a token's actual attention allocation and its assigned role, leveraging this score for alignment-guided advantage reweighting during policy optimization. Extensive experiments demonstrate that the algorithm-agnostic DyCo-RL, when applied to Qwen2.5-VL-3B/7B, consistently improves four representative RLVR algorithms across seven benchmarks spanning visual-centric and mathematical reasoning.
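As a rough illustration of the Fisher-Rao step mentioned in the abstract, the sketch below measures how far a token's attention moves inside one modality between two consecutive generation steps. The closed form for categorical distributions, 2 * arccos(sum_i sqrt(p_i * q_i)), is standard. Everything else is an assumption made for illustration and is not taken from the paper: the function names `fisher_rao_distance` and `within_modality_shift`, the choice to renormalize attention inside a modality, and the toy attention values.

```python
import math


def fisher_rao_distance(p, q):
    """Fisher-Rao geodesic distance between two categorical distributions.

    Uses the closed form 2 * arccos(sum_i sqrt(p_i * q_i)), which lies in [0, pi].
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    # Bhattacharyya coefficient; clamp to [0, 1] to absorb rounding error.
    bc = sum(math.sqrt(a * b) for a, b in zip(p, q))
    bc = min(1.0, max(0.0, bc))
    return 2.0 * math.acos(bc)


def within_modality_shift(prev_attn, curr_attn, positions):
    """Hypothetical helper: attention shift inside one modality.

    Assumption (not from the paper): restrict each attention row to the given
    key positions, renormalize so it sums to 1, then compare the two rows.
    """
    def restrict(attn):
        picked = [attn[i] for i in positions]
        total = sum(picked)
        if total <= 0.0:
            raise ValueError("no attention mass on the selected positions")
        return [w / total for w in picked]

    return fisher_rao_distance(restrict(prev_attn), restrict(curr_attn))


# Toy attention rows over 3 image tokens (0-2) and 2 text tokens (3-4).
prev_step = [0.50, 0.10, 0.10, 0.15, 0.15]
curr_step = [0.10, 0.10, 0.50, 0.15, 0.15]

visual_shift = within_modality_shift(prev_step, curr_step, [0, 1, 2])
text_shift = within_modality_shift(prev_step, curr_step, [3, 4])
print(f"visual shift: {visual_shift:.3f}, text shift: {text_shift:.3f}")
```

In this toy case the attention over image tokens moves from the first patch to the third while the text attention keeps the same shape, so the visual shift is positive and the text shift is zero.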

Original Article

Cached at: 06/12/26, 02:52 AM

Source: https://huggingface.co/papers/2606.08035

Why do visual reasoning failures persist even after RLVR training?

We find that the issue is often neither a visual perception error nor a text reasoning error alone, but a failure of dynamic cross-modal coordination. During Chain-of-Thought generation, successful reasoning requires the model to switch continuously between looking at visual evidence and reasoning over previously established textual context. Existing RLVR methods optimize final outcomes but largely ignore this token-level behavior.

Through token-level analyses and causal interventions, we show that reasoning failures frequently occur when visually-oriented tokens stop attending to relevant image content, or when text-oriented tokens fail to remain grounded in prior reasoning history.

To address this problem, we introduce DyCo-RL, a plug-and-play RLVR framework that explicitly rewards effective cross-modal coordination. DyCo-RL identifies token functional roles using Fisher–Rao attention dynamics and reweights policy optimization according to role-attention alignment. The resulting models exhibit substantially stronger reasoning performance across diverse visual and mathematical reasoning benchmarks.
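The page does not give the exact role-assignment rule or reweighting formula, so the following is only a minimal sketch of what role-attention alignment and advantage reweighting could look like. The helper names (`assign_role`, `alignment_score`, `reweight_advantages`), the "larger shift wins" rule, the multiplicative form `A * (1 + strength * (s - mean))`, and the `strength` parameter are all hypothetical and should not be read as the authors' method.

```python
def assign_role(visual_shift, text_shift):
    """Hypothetical rule: the modality whose attention moved more defines the role."""
    return "visual" if visual_shift >= text_shift else "text"


def alignment_score(role, visual_mass):
    """Share of a token's attention that lands on its assigned modality.

    visual_mass is the fraction of attention on image tokens, in [0, 1].
    """
    if not 0.0 <= visual_mass <= 1.0:
        raise ValueError("visual_mass must be in [0, 1]")
    return visual_mass if role == "visual" else 1.0 - visual_mass


def reweight_advantages(advantage, scores, strength=0.5):
    """Turn one sequence-level advantage into per-token advantages.

    Assumed form (not from the paper): A_t = A * (1 + strength * (s_t - mean(s))),
    so tokens that are better aligned than average get a larger magnitude.
    """
    mean_score = sum(scores) / len(scores)
    return [advantage * (1.0 + strength * (s - mean_score)) for s in scores]


# Toy rollout with two tokens: (visual shift, text shift, attention on image).
tokens = [(0.9, 0.1, 0.8), (0.2, 0.7, 0.3)]
roles = [assign_role(v, t) for v, t, _ in tokens]
scores = [alignment_score(role, tok[2]) for role, tok in zip(roles, tokens)]
token_advantages = reweight_advantages(1.0, scores)
print(roles, scores, token_advantages)
```

Centering the scores on their mean keeps the average per-token advantage equal to the original sequence-level value, which is one simple way to stay compatible with different RLVR algorithms. How negative advantages should be treated is left open here.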

Similar Articles

DyCon: Dynamic Reasoning Control via Evolving Difficulty Modeling

CORA: Analyzing and bridging thinking-answer gap in Multimodal RLVR via Consistency-Oriented Reasoning Alignment

CollabVR: Collaborative Video Reasoning with Vision-Language and Video Generation Models

Bad Seeing or Bad Thinking? Rewarding Perception for Vision-Language Reasoning

Visual Reasoning through Tool-supervised Reinforcement Learning
