Tag
This paper introduces MathVis-Fine, a framework for fine-grained visual dependency modeling in multimodal mathematical reasoning, along with a new dataset and a two-stage progressive training paradigm that balances answer correctness and visual grounding rewards based on each sample's intrinsic visual dependency level.
FinAcumen is a framework that accumulates reasoning experience from prior trajectories into a persistent memory bank for financial multimodal reasoning, improving performance across four benchmarks while keeping its 8B vision-language model frozen.
VeriGeo introduces a controllable geometry question generation framework that uses verification-guided reflection to ensure numerical and analytical consistency. The method produces high-quality synthetic data, achieving state-of-the-art results on GeoQA and strong performance on PGPS9K and MathVista-GPS.
This paper introduces Multimodal Multi-Dimensional Scalarization Process Reward Modeling (MMS-PRM), which enforces robustness on the worst-performing dimension in multimodal reasoning so that failures such as visual hallucinations are not masked by strong textual logic.
SpecFlow is a lightweight multimodal spatial reasoning framework that represents intermediate visual thoughts in a fixed-size discrete cosine space, reducing computation and KV cache costs by up to 2.1× while maintaining competitive performance.
LoMo proposes a data curation method that reformulates single-modality prompts into interleaved multimodal sequences to improve cross-modal representation alignment in vision-language models, achieving consistent gains on multiple benchmarks.
Faithful-MR1 is a training framework that improves faithful multimodal reasoning in MLLMs by anchoring visual attention via a <Focus> token and reinforcing faithful use through counterfactual image intervention. It outperforms baselines on Qwen2.5-VL backbones with less training data.
This paper challenges the assumption that current Vision-Language Models faithfully synthesize multimodal data, proposing an information-theoretic Modality Translation Protocol with new metrics (Toll, Curse, Fallacy of Seeing) that evaluate trustworthiness rather than traditional multimodal gain.
This research introduces a 3D benchmark to evaluate whether Vision-Language Model (VLM) agents can achieve mirror self-recognition, a proxy for higher-order cognition. The study finds that while stronger VLMs can use reflected evidence for action, weaker models often fail to extract self-relevant information or misattribute reflections, highlighting the distinction between linguistic compliance and grounded self-identification.
UniPath proposes a framework for adaptive coordination of understanding and generation in unified multimodal models, leveraging coordination-path diversity to improve performance over fixed strategies.
Researchers introduce Groupwise Ranking Reward to fix reasoning-answer inconsistency in multimodal RL, raising reliability-conditioned accuracy from 47.4% with standard RLVR to 54.7%.
This paper introduces CrossMath, a controlled multimodal reasoning benchmark that reveals a critical limitation in current vision-language models: they reason primarily in textual space rather than performing genuine vision-grounded reasoning, and visual input often degrades performance relative to text-only baselines. The authors propose fine-tuning approaches to mitigate this modality gap and improve multimodal reasoning capabilities.
This paper uncovers that prolonged reasoning in vision-language models can impair perceptual grounding, causing recognition failures on basic visual questions. It proposes Vision-Anchored Policy Optimization (VAPO) to steer reasoning toward visually grounded trajectories, achieving state-of-the-art performance with the VAPO-Thinker-7B model.