Tag
This paper introduces MER-R1, a reinforcement learning framework that synergizes fast and slow thinking for multimodal emotion recognition. It achieves state-of-the-art performance by jointly optimizing recall and precision through dual-objective disentanglement and slow-fast confidence calibration.
InnerZoom proposes a single-forward framework for cross-layer evidence bridging in GUI grounding, achieving state-of-the-art performance on multiple benchmarks while reducing latency by up to 31.8%.
Introduces SocialPersona, a benchmark for evaluating multimodal large language models on their ability to recover revealed preferences from longitudinal social-media timelines and use them in personalized dialogue.
HeRA aligns individual attention heads in Multimodal Large Language Models (MLLMs) to preserve local neighborhood relationships across modalities, improving vision-centric task performance and reducing visual hallucinations.
ThinkDeception proposes a novel framework that leverages multimodal large language models and a progressive reinforcement learning strategy with chain-of-thought reasoning for interpretable deception detection, achieving new state-of-the-art results on standard benchmarks.
This paper introduces ViGOS, a method for multimodal on-policy self-distillation that decouples perception from reasoning by having the student model produce a visual description before it reasons, reducing shortcut reliance and improving image-grounding behavior.
This paper proposes MODF-SIR, a multi-agent collaborative framework built on a lightweight multimodal large language model for social intelligence reasoning. It employs knowledge distillation, long-tail event extraction, and test-time adaptation to achieve state-of-the-art results with reduced training data.
This paper introduces PhysTool-Bench, a benchmark for evaluating multimodal large language models' ability to recognize and plan the use of physical tools in real-world scenes. The authors find that even the best model identifies only 58.7% of tools and completes just 21.0% of queries end-to-end, revealing a two-level deficit in perception and functional commonsense.
A training-free framework for spatial reasoning from egocentric videos that lets the model revisit its conclusions using novel-view videos synthesized from predicted 3D geometry.
PathoSage introduces a three-stage framework for pathology multimodal reasoning that separates knowledge retrieval, evidence collection, and evidence adjudication to reduce hallucinations and handle conflicting evidence. It also features a training-free Beta-Bernoulli experience system for modeling tool reliability.
Visual Para-Thinker++ proposes a single-policy multi-agent framework for visual reasoning that uses role-conditioned agents (Main, Worker, Summary) and dedicated training methods to reduce hallucinations and improve efficiency, outperforming baselines on hallucination-sensitive benchmarks.
Introduces WorldBench, a visually diverse multimodal reasoning benchmark that reveals significant limitations in current multimodal large language models' visual understanding.
Proposes CORE, a framework that endows multimodal large language models with an explicit conflict-capturing capability for generalizable manipulation detection, adapting to unseen manipulation types with few or zero samples.
Introduces VSTAT, a new benchmark to measure how well multimodal LLMs track states in videos, revealing that frontier models struggle with tasks humans find easy.
VSTAT is a new benchmark for visual state tracking in videos that reveals perceptual gaps between humans and multimodal LLMs.
Introduces iVGR, a reinforcement learning framework that internalizes visual localization into textual reasoning for multimodal language models, eliminating the need for explicit visual grounding during inference while improving fine-grained perception performance.
Faithful-MR1 is a training framework that improves faithful multimodal reasoning in MLLMs by anchoring visual attention via a <Focus> token and reinforcing faithful use through counterfactual image intervention. It outperforms baselines on Qwen2.5-VL backbones with less training data.
LatentOmni proposes a unified latent space for audio-visual reasoning, avoiding the information loss of text-based chain-of-thought. It achieves state-of-the-art performance among open-source models on audio-visual reasoning benchmarks.
Researchers introduce the MM-OCEAN dataset and a three-tier evaluation framework for grounded personality reasoning in multimodal LLMs, revealing a 'Prejudice Gap' where models often make correct predictions without proper grounding.
This paper identifies imbalanced attention head groups in MLLMs that drive or resist modality-conflict hallucination, and proposes MACI, a causal intervention that suppresses hallucination-driving heads only when a conflict is detected, substantially reducing hallucination across five models.