Multimodal Continuous Reasoning via Asymmetric Mutual Variational Learning
Summary
Proposes Asymmetric Mutual Variational Learning (AMVL) to resolve train-inference mismatch in multimodal continuous reasoning by using bidirectional calibration to prevent answer leakage and improve latent-space stability, improving the average score on the BLINK benchmark by +10.83.
Source: https://huggingface.co/papers/2607.00461
Abstract
Asymmetric Mutual Variational Learning addresses train-inference mismatch in multimodal reasoning by using bidirectional calibration to prevent answer leakage and improve latent-space stability.
Multimodal Large Language Models (MLLMs) are often constrained by a language-space bottleneck, forcing complex visual reasoning into discrete tokens which can lose perceptual nuance. A promising alternative is continuous latent reasoning, where the goal is to discover implicit reasoning pathways that bridge the multimodal query and the final answer. However, this introduces a severe train-inference mismatch: a training-time posterior, conditioned on the ground-truth answer, can exploit answer-dependent shortcuts. Standard variational training then forces the inference-time prior to mimic a posterior that has access to information unavailable at test time, leading to poor performance. To address this, we propose Asymmetric Mutual Variational Learning (AMVL), a framework that resolves this mismatch via a bidirectional calibration objective. A forward KL divergence trains the target-agnostic prior to match the posterior, while a novel reverse KL divergence simultaneously regularizes the posterior, preventing it from collapsing into inference-incompatible regions and mitigating this "answer leakage". We provide theoretical analysis formalizing this leakage as prior contamination and prove that our dual-KL objective reduces it. We instantiate AMVL in a latent-integrated MLLM and show that it consistently outperforms strong discrete and latent-reasoning baselines, improving the average score on the complex BLINK benchmark by +10.83 and achieving gains of up to +32.00 on individual reasoning tasks, with analyses confirming improved latent-space stability.
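The bidirectional calibration idea in the abstract can be illustrated with a small numerical sketch. This is not the paper's implementation: it assumes diagonal Gaussian latents, assumes that the "forward" term is KL(posterior || prior) and the "reverse" term is KL(prior || posterior), and uses hypothetical weights `alpha` and `beta`. The abstract does not specify any of these choices, and the function names are invented for illustration.

```python
import numpy as np


def gaussian_kl(mu_a, logvar_a, mu_b, logvar_b):
    """Closed-form KL(N(mu_a, var_a) || N(mu_b, var_b)) for diagonal Gaussians.

    Inputs are arrays of means and log-variances; the result is summed
    over the last (latent) axis.
    """
    var_a = np.exp(logvar_a)
    var_b = np.exp(logvar_b)
    per_dim = 0.5 * (logvar_b - logvar_a + (var_a + (mu_a - mu_b) ** 2) / var_b - 1.0)
    return per_dim.sum(axis=-1)


def bidirectional_kl(posterior, prior, alpha=1.0, beta=1.0):
    """Weighted sum of both KL directions between posterior and prior.

    ASSUMPTION: forward = KL(q || p), reverse = KL(p || q). In a real
    training loop one would also decide which network receives gradients
    from which term; that routing is not modeled here.
    """
    mu_q, logvar_q = posterior
    mu_p, logvar_p = prior
    forward = gaussian_kl(mu_q, logvar_q, mu_p, logvar_p)
    reverse = gaussian_kl(mu_p, logvar_p, mu_q, logvar_q)
    return alpha * forward + beta * reverse, forward, reverse


# Toy example: an answer-conditioned posterior that is narrow and shifted,
# compared with a standard-normal, answer-agnostic prior.
posterior = (np.array([1.0]), np.log(np.array([0.25])))
prior = (np.array([0.0]), np.log(np.array([1.0])))
total, forward, reverse = bidirectional_kl(posterior, prior)
print(f"forward={float(forward):.4f} reverse={float(reverse):.4f} total={float(total):.4f}")
```

In this toy case the two directions differ noticeably, which is the asymmetry the method's name refers to: the direction with the narrow posterior in the denominator penalizes prior mass that the posterior fails to cover, so a posterior that concentrates on a region the prior rarely reaches is penalized more heavily by that term.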
Similar Articles
Eliciting Complex Spatial Reasoning in MLLMs through Wide-Baseline Matching
This paper introduces ReasonMatch-Bench, a benchmark for wide-baseline matching in multimodal LLMs, and proposes Dynamic Correspondence Reinforcement Learning (DCRL) to improve spatial reasoning. Experiments show significant gains on the benchmark while maintaining general performance.
VLMs are Good Teachers for Video Reasoning via Adaptive Test-Time Optimization
This paper introduces a paradigm where Vision-Language Models (VLMs) act as test-time teachers to guide Video Generation Models (VGMs) via differentiable rewards and LoRA optimization, achieving a 16.7-point average improvement on video reasoning benchmarks.
Do Vision-Language Models Truly Perform Vision Reasoning? A Rigorous Study of the Modality Gap
This paper introduces CrossMath, a controlled multimodal reasoning benchmark that reveals a critical limitation in current vision-language models: they perform reasoning primarily in textual space rather than genuine vision-grounded reasoning, with visual input often degrading performance compared to text-only baselines. The authors propose fine-tuning approaches to mitigate this modality gap and improve multimodal reasoning capabilities.
iVGR: Internalizing Visually Grounded Reasoning for MLLMs with Reinforcement Learning
Introduces iVGR, a reinforcement learning framework that internalizes visual localization into textual reasoning for multimodal language models, eliminating the need for explicit visual grounding during inference while improving fine-grained perception performance.
CORA: Analyzing and bridging thinking-answer gap in Multimodal RLVR via Consistency-Oriented Reasoning Alignment
This paper analyzes the thinking-answer inconsistency in multimodal reinforcement learning with verifiable rewards (RLVR) for large vision-language models and proposes CORA, a method that introduces a consistency reward model and hybrid reward advantage splitting to improve faithfulness and task performance.