Multimodal Continuous Reasoning via Asymmetric Mutual Variational Learning

Hugging Face Daily Papers 07/01/26, 12:00 AM Papers

Summary

Proposes Asymmetric Mutual Variational Learning (AMVL) to resolve train-inference mismatch in multimodal continuous reasoning by using bidirectional calibration to prevent answer leakage and improve latent-space stability, achieving significant gains on the BLINK benchmark.

Multimodal Large Language Models (MLLMs) are often constrained by a language-space bottleneck, forcing complex visual reasoning into discrete tokens, which can lose perceptual nuance. A promising alternative is continuous latent reasoning, where the goal is to discover implicit reasoning pathways that bridge the multimodal query and the final answer. However, this introduces a severe train-inference mismatch: a training-time posterior, conditioned on the ground-truth answer, can exploit answer-dependent shortcuts. Standard variational training then forces the inference-time prior to mimic a posterior that has access to information unavailable at test time, leading to poor performance. To address this, we propose Asymmetric Mutual Variational Learning (AMVL), a framework that resolves this mismatch via a bidirectional calibration objective. A forward KL divergence trains the target-agnostic prior to match the posterior, while a novel reverse KL divergence simultaneously regularizes the posterior, preventing it from collapsing into inference-incompatible regions and mitigating this "answer leakage". We provide theoretical analysis formalizing this leakage as prior contamination and prove that our dual-KL objective reduces it. We instantiate AMVL in a latent-integrated MLLM and show that it consistently outperforms strong discrete and latent-reasoning baselines, improving the average score on the complex BLINK benchmark by +10.83 and achieving gains of up to +32.00 on individual reasoning tasks, with analyses confirming improved latent-space stability.
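To make the bidirectional calibration idea concrete, here is a minimal sketch of a two-direction KL penalty between an answer-conditioned posterior and an answer-agnostic prior. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not give the latent distribution family, the exact direction it labels "forward" versus "reverse", or any weighting, so the diagonal-Gaussian form, the function names `gaussian_kl` and `dual_kl`, and the weights `alpha` and `beta` are all hypothetical.

```python
import math
from typing import Sequence, Tuple


def gaussian_kl(mu_a: Sequence[float], std_a: Sequence[float],
                mu_b: Sequence[float], std_b: Sequence[float]) -> float:
    """KL(N(mu_a, diag(std_a^2)) || N(mu_b, diag(std_b^2))), summed over dimensions."""
    total = 0.0
    for ma, sa, mb, sb in zip(mu_a, std_a, mu_b, std_b):
        total += math.log(sb / sa) + (sa ** 2 + (ma - mb) ** 2) / (2 * sb ** 2) - 0.5
    return total


def dual_kl(post_mu: Sequence[float], post_std: Sequence[float],
            prior_mu: Sequence[float], prior_std: Sequence[float],
            alpha: float = 1.0, beta: float = 0.5) -> Tuple[float, float, float]:
    """Weighted sum of both KL directions between posterior and prior.

    Assumption: one direction is meant to pull the prior toward the posterior,
    the other to keep the posterior inside regions the prior can reach.
    alpha and beta are hypothetical weights, not values from the paper.
    """
    kl_post_prior = gaussian_kl(post_mu, post_std, prior_mu, prior_std)
    kl_prior_post = gaussian_kl(prior_mu, prior_std, post_mu, post_std)
    return alpha * kl_post_prior + beta * kl_prior_post, kl_post_prior, kl_prior_post


# A sharp posterior that has drifted away from a broad prior (toy numbers).
loss, kl_qp, kl_pq = dual_kl([2.0, 0.0], [0.5, 1.0], [0.0, 0.0], [1.0, 1.0])
print(f"KL(post||prior)={kl_qp:.4f}  KL(prior||post)={kl_pq:.4f}  loss={loss:.4f}")
```

The two directions are generally not equal, which is what makes an asymmetric treatment possible. In a gradient-based setup, each term would typically update only one of the two distributions (for example via a stop-gradient on the other); that detail is also an assumption here and is not shown.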


Cached at: 07/02/26, 07:47 AM

Paper page - Multimodal Continuous Reasoning via Asymmetric Mutual Variational Learning

Source: https://huggingface.co/papers/2607.00461

Abstract

Asymmetric Mutual Variational Learning addresses train-inference mismatch in multimodal reasoning by using bidirectional calibration to prevent answer leakage and improve latent-space stability.

Multimodal Large Language Models (MLLMs) are often constrained by a language-space bottleneck, forcing complex visual reasoning into discrete tokens which can lose perceptual nuance. A promising alternative is continuous latent reasoning, where the goal is to discover implicit reasoning pathways that bridge the multimodal query and the final answer. However, this introduces a severe train-inference mismatch: a training-time posterior, conditioned on the ground-truth answer, can exploit answer-dependent shortcuts. Standard variational training then forces the inference-time prior to mimic a posterior that has access to information unavailable at test time, leading to poor performance. To address this, we propose Asymmetric Mutual Variational Learning (AMVL), a framework that resolves this mismatch via a bidirectional calibration objective. A forward KL divergence trains the target-agnostic prior to match the posterior, while a novel reverse KL divergence simultaneously regularizes the posterior, preventing it from collapsing into inference-incompatible regions and mitigating this "answer leakage". We provide theoretical analysis formalizing this leakage as prior contamination and prove that our dual-KL objective reduces it. We instantiate AMVL in a latent-integrated MLLM and show that it consistently outperforms strong discrete and latent-reasoning baselines, improving the average score on the complex BLINK benchmark by +10.83 and achieving gains of up to +32.00 on individual reasoning tasks, with analyses confirming improved latent-space stability.





Similar Articles

Eliciting Complex Spatial Reasoning in MLLMs through Wide-Baseline Matching

VLMs are Good Teachers for Video Reasoning via Adaptive Test-Time Optimization

Do Vision-Language Models Truly Perform Vision Reasoning? A Rigorous Study of the Modality Gap

iVGR: Internalizing Visually Grounded Reasoning for MLLMs with Reinforcement Learning

CORA: Analyzing and bridging thinking-answer gap in Multimodal RLVR via Consistency-Oriented Reasoning Alignment
