multimodal-reasoning

#multimodal-reasoning

MathVis-Fine: Aligning Visual Supervision with Necessity via Progressive Dependency-Guided Training for Multimodal Mathematical Reasoning

arXiv cs.AI · 2026-06-17

This paper introduces MathVis-Fine, a framework for fine-grained visual dependency modeling in multimodal mathematical reasoning, along with a new dataset and a two-stage progressive training paradigm that balances answer correctness and visual grounding rewards based on each sample's intrinsic visual dependency level.

FinAcumen: Financial Multimodal Reasoning via Self-Evolving Experience Memory Harness

arXiv cs.AI · 2026-06-17

FinAcumen is a framework that accumulates reasoning experience from prior trajectories into a persistent memory bank for financial multimodal reasoning, improving performance across four benchmarks while maintaining a frozen 8B vision-language model.

VeriGeo: Controllable Geometry Question Generation with Numerical and Analytical Verification

arXiv cs.AI · 2026-06-15

VeriGeo introduces a controllable geometry question generation framework that uses verification-guided reflection to ensure numerical and analytical consistency. The method produces high-quality synthetic data, achieving state-of-the-art results on GeoQA and strong performance on PGPS9K and MathVista-GPS.

Improving Multimodal Reasoning via Worst Dimension Optimization

arXiv cs.AI · 2026-06-09

This paper introduces Multimodal Multi-Dimensional Scalarization Process Reward Modeling (MMS-PRM), which enforces the worst dimension's robustness in multimodal reasoning to prevent failures like visual hallucinations from being masked by strong text logic.

Spectral-Progressive Thought Flow for Lightweight Multimodal Reasoning

arXiv cs.LG · 2026-06-03

This paper proposes SpecFlow, a lightweight multimodal spatial reasoning framework that represents intermediate visual thoughts in a fixed-size discrete cosine space, reducing computation and KV cache costs by up to 2.1 times while maintaining competitive performance.

LoMo: Local Modality Substitution for Deeper Vision-Language Fusion

Hugging Face Daily Papers · 2026-05-28

LoMo proposes a data curation method that reformulates single-modality prompts into interleaved multimodal sequences to improve cross-modal representation alignment in vision-language models, achieving consistent gains on multiple benchmarks.

Faithful-MR1: Faithful Multimodal Reasoning via Anchoring and Reinforcing Visual Attention

arXiv cs.CL · 2026-05-22

Faithful-MR1 is a training framework that improves faithful multimodal reasoning in MLLMs by anchoring visual attention via a <Focus> token and reinforcing faithful use through counterfactual image intervention. It outperforms baselines on Qwen2.5-VL backbones with less training data.

The Expense of Seeing: Attaining Trustworthy Multimodal Reasoning Within the Monolithic Paradigm

Hugging Face Daily Papers · 2026-05-21

This paper challenges the assumption that current Vision-Language Models faithfully synthesize multimodal data, proposing an information-theoretic Modality Translation Protocol with new metrics (Toll, Curse, Fallacy of Seeing) to evaluate trustworthiness over traditional multimodal gain.

Mirror, Mirror on the Wall: Can VLM Agents Tell Who They Are at All?

arXiv cs.AI · 2026-05-12

This research introduces a 3D benchmark to evaluate whether Vision-Language Model (VLM) agents can achieve mirror self-recognition, a proxy for higher-order cognition. The study finds that while stronger VLMs can use reflected evidence for action, weaker models often fail to extract self-relevant information or misattribute reflections, highlighting the distinction between linguistic compliance and grounded self-identification.

UniPath: Adaptive Coordination of Understanding and Generation for Unified Multimodal Reasoning

Hugging Face Daily Papers · 2026-05-12

UniPath proposes a framework for adaptive coordination of understanding and generation in unified multimodal models, leveraging coordination-path diversity to improve performance over fixed strategies.

Prioritizing the Best: Incentivizing Reliable Multimodal Reasoning by Rewarding Beyond Answer Correctness

arXiv cs.CL · 2026-04-22

Researchers introduce Groupwise Ranking Reward to fix reasoning-answer inconsistency in multimodal RL, boosting reliability-conditioned accuracy from 47.4% to 54.7% over standard RLVR.

Do Vision-Language Models Truly Perform Vision Reasoning? A Rigorous Study of the Modality Gap

arXiv cs.CL · 2026-04-20

This paper introduces CrossMath, a controlled multimodal reasoning benchmark that reveals a critical limitation in current vision-language models: they reason primarily in textual space rather than performing genuinely vision-grounded reasoning, and visual input often degrades performance compared to text-only baselines. The authors propose fine-tuning approaches to mitigate this modality gap and improve multimodal reasoning capabilities.

More Thought, Less Accuracy? On the Dual Nature of Reasoning in Vision-Language Models

Papers with Code Trending · 2025-09-30

This paper uncovers that prolonged reasoning in vision-language models can impair perceptual grounding, causing recognition failures on basic visual questions. It proposes Vision-Anchored Policy Optimization (VAPO) to steer reasoning toward visually grounded trajectories, achieving state-of-the-art performance with the VAPO-Thinker-7B model.
