multimodal-agents

#multimodal-agents

DMV-Bench: Diagnosing Long-Horizon Multimodal Agents' Visual Memory with Incidental Cue Injection

arXiv cs.CL · 13h ago

Introduces DMV-Bench, an interactive benchmark for evaluating visual memory in multimodal agents using incidental visual cues from product images, and proposes DualMem, a dual-coding memory architecture that outperforms text-only and other multimodal baselines across various chain lengths.


Can Agents Read the Room? Benchmarking Visual Social Intelligence in Multimodal Simulation

arXiv cs.CL · 2026-06-16

This paper introduces AgentViSS, a benchmark for visual social intelligence in multimodal social simulation, comprising 240 scenarios with aligned visual-textual evidence. An evaluation of seven recent MLLMs reveals a gap between local role enactment and visually grounded interaction management.


SpatialWorld: Benchmarking Interactive Spatial Reasoning of Multimodal Agents in Real-World Tasks

Hugging Face Daily Papers · 2026-06-08

SpatialWorld is a unified benchmark for evaluating interactive spatial reasoning in multimodal agents across diverse real-world tasks, revealing that even the strongest models achieve low task success rates.


Task-Focused Memorization for Multimodal Agents

Hugging Face Daily Papers · 2026-05-29

Introduces TaskMem, a reinforcement-learning-based framework for dynamic memorization in multimodal agents, achieving accuracy improvements of 6.3%, 7.0%, and 5.3% on streaming video benchmarks.


Hallucination as Exploit: Evidence-Carrying Multimodal Agents

arXiv cs.AI · 2026-05-20

This paper formalizes hallucination-to-action conversion in multimodal agents and proposes evidence-carrying agents (ECA) that use constrained verifiers to authorize only safe tool calls, achieving a 0% unsafe-action rate on a 200-task pipeline.


Towards On-Policy Data Evolution for Visual-Native Multimodal Deep Search Agents

Hugging Face Daily Papers · 2026-05-11

This paper introduces On-Policy Data Evolution (ODE) and a visual-native agent harness to improve multimodal deep search agents. By enabling reusable visual evidence and closed-loop data generation, ODE significantly boosts the performance of Qwen3-VL agents across multiple benchmarks, surpassing Gemini 2.5 Pro.


HyperEyes: Dual-Grained Efficiency-Aware Reinforcement Learning for Parallel Multimodal Search Agents

Hugging Face Daily Papers · 2026-05-08

HyperEyes is a parallel multimodal search agent that uses dual-grained reinforcement learning to optimize inference efficiency, achieving higher accuracy with significantly fewer tool-call rounds than existing agents.


InterLV-Search: Benchmarking Interleaved Multimodal Agentic Search

Hugging Face Daily Papers · 2026-05-08

InterLV-Search is a benchmark for evaluating interleaved language-vision agentic search; it highlights the limitations of current systems in seeking visual evidence and integrating multimodal information.
