mllm

#mllm

MER-R1: Multimodal Emotion Reasoning via Slow-Fast Thinking Synergy

arXiv cs.AI · 2d ago

This paper introduces MER-R1, a reinforcement learning framework that synergizes fast and slow thinking for multimodal emotion recognition. It achieves state-of-the-art performance by jointly optimizing recall and precision through dual-objective disentanglement and slow-fast confidence calibration.


One Forward Beats Two: InnerZoom for Accurate and Efficient GUI Grounding

Hugging Face Daily Papers · 2d ago

InnerZoom proposes a single-forward framework for cross-layer evidence bridging in GUI grounding, achieving state-of-the-art performance on multiple benchmarks while reducing latency by up to 31.8%.


SocialPersona: Benchmarking Personalized Profiling and Response with Multimodal Social-Media Context

arXiv cs.CL · 5d ago

Introduces SocialPersona, a benchmark for evaluating multimodal large language models on their ability to recover revealed preferences from longitudinal social-media timelines and use them in personalized dialogue.


Mind the Heads: Topological Representation Alignment for Multimodal LLMs

Hugging Face Daily Papers · 2026-06-22

HeRA aligns individual attention heads in Multimodal Large Language Models (MLLMs) to preserve local neighborhood relationships across modalities, improving vision-centric task performance and reducing visual hallucinations.


ThinkDeception: A Progressive Reinforcement Learning Framework for Interpretable Multimodal Deception Detection

arXiv cs.AI · 2026-06-18

ThinkDeception proposes a novel framework that leverages multimodal large language models and a progressive reinforcement learning strategy with chain-of-thought reasoning for interpretable deception detection, achieving new state-of-the-art results on standard benchmarks.


Seeing Before Reasoning: Decoupling Perception and Reasoning for Shortcut-Resilient Multimodal On-Policy Self-Distillation

Hugging Face Daily Papers · 2026-06-17

This paper introduces ViGOS, a method for multimodal on-policy self-distillation that decouples perception and reasoning by having the student model first produce a visual description before reasoning, reducing shortcut reliance and improving image-grounding behavior.


MODF-SIR: A Multi-agent Omni-modal Distilled Framework for Social Intelligence Reasoning

arXiv cs.AI · 2026-06-11

This paper proposes MODF-SIR, a multi-agent collaborative framework built on a lightweight multimodal large language model for social intelligence reasoning. It employs knowledge distillation, long-tail event extraction, and test-time adaptation to achieve state-of-the-art results with reduced training data.


Beyond APIs: Probing the Limits of MLLMs in Physical Tool Use

arXiv cs.CL · 2026-06-10

This paper introduces PhysTool-Bench, a benchmark for evaluating multimodal large language models' ability to recognize and plan the use of physical tools in real-world scenes. The authors find that even the best model identifies only 58.7% of tools and completes just 21.0% of queries end-to-end, revealing a two-level deficit in perception and functional commonsense.


Reason, Then Re-reason: Cross-view Revisiting Improves Spatial Reasoning

Hugging Face Daily Papers · 2026-06-10

Proposes a training-free framework for spatial reasoning over egocentric videos that revisits its conclusions using novel-view videos synthesized from predicted 3D geometry.


PathoSage: Towards Multi-Source Evidence Adjudication in Pathology via Experience-Aware Agentic Workflow

arXiv cs.AI · 2026-06-09

PathoSage introduces a three-stage framework for pathology multimodal reasoning that separates knowledge retrieval, evidence collection, and evidence adjudication to reduce hallucinations and handle conflicting evidence, featuring a training-free Beta-Bernoulli experience system for modeling tool reliability.


Visual Para-Thinker++: A Single-Policy Multi-Agent Framework for Visual Reasoning

Hugging Face Daily Papers · 2026-06-08

Visual Para-Thinker++ proposes a single-policy multi-agent framework for visual reasoning that uses role-conditioned agents (Main, Worker, Summary) and dedicated training methods to reduce hallucinations and improve efficiency, outperforming baselines on hallucination-sensitive benchmarks.


WorldBench: A Challenging and Visually Diverse Multimodal Reasoning Benchmark

Hugging Face Daily Papers · 2026-06-04

Introduces WorldBench, a visually diverse multimodal reasoning benchmark that reveals significant limitations in current multimodal large language models' visual understanding.


CORE: Conflict-Oriented Reasoning for General Multimodal Manipulation Detection

arXiv cs.AI · 2026-06-03

Proposes the CORE framework that endows multimodal large language models with explicit conflict-capturing capability for generalizable manipulation detection, adapting to unseen manipulation types with few or zero samples.


@PinzhiHuang: State tracking is a core pillar of video understanding: it requires identifying entities and events, and mapping how th…

X AI KOLs Following · 2026-06-03

Introduces VSTAT, a new benchmark to measure how well multimodal LLMs track states in videos, revealing that frontier models struggle with tasks humans find easy.


@ma_nanye: VSTAT highlights the substantial perceptual gap between humans and MLLMs, but it goes far beyond that. Its diverse task…

X AI KOLs Following · 2026-06-03

VSTAT is a new benchmark for visual state tracking in videos that reveals perceptual gaps between humans and multimodal LLMs.


iVGR: Internalizing Visually Grounded Reasoning for MLLMs with Reinforcement Learning

Hugging Face Daily Papers · 2026-05-29

Introduces iVGR, a reinforcement learning framework that internalizes visual localization into textual reasoning for multimodal language models, eliminating the need for explicit visual grounding during inference while improving fine-grained perception performance.


Faithful-MR1: Faithful Multimodal Reasoning via Anchoring and Reinforcing Visual Attention

arXiv cs.CL · 2026-05-22

Faithful-MR1 is a training framework that improves faithful multimodal reasoning in MLLMs by anchoring visual attention via a <Focus> token and reinforcing faithful use through counterfactual image intervention. It outperforms baselines on Qwen2.5-VL backbones with less training data.


LatentOmni: Rethinking Omni-Modal Understanding via Unified Audio-Visual Latent Reasoning

arXiv cs.CL · 2026-05-22

LatentOmni proposes a unified latent space for audio-visual reasoning, avoiding the information loss of text-based chain-of-thought. It achieves state-of-the-art performance among open-source models on audio-visual reasoning benchmarks.


Perception or Prejudice: Can MLLMs Go Beyond First Impressions of Personality?

Hugging Face Daily Papers · 2026-05-21

Researchers introduce the MM-OCEAN dataset and a three-tier evaluation framework for grounded personality reasoning in multimodal LLMs, revealing a 'Prejudice Gap' where models often make correct predictions without proper grounding.


Causal Evidence for Attention Head Imbalance in Modality Conflict Hallucination

arXiv cs.AI · 2026-05-20

This paper identifies imbalanced attention head groups in MLLMs that drive or resist modality-conflict hallucination, and proposes MACI, a causal intervention that suppresses hallucination-driving heads only when conflict is detected, achieving large hallucination reduction across five models.
