Tag
Introduces DMV-Bench, an interactive benchmark for evaluating visual memory in multimodal agents using incidental visual cues from product images, and proposes DualMem, a dual-coding memory architecture that outperforms text-only and other multimodal baselines across various chain lengths.
This paper introduces AgentViSS, a benchmark evaluating visual social intelligence in multimodal social simulation, containing 240 scenarios with aligned visual-textual evidence. Evaluating seven recent MLLMs reveals a gap between local role enactment and visually grounded interaction management.
SpatialWorld is a unified benchmark for evaluating interactive spatial reasoning in multimodal agents across diverse real-world tasks, revealing that even the strongest models achieve low task success rates.
Introduces TaskMem, a reinforcement-learning-based framework for dynamic memorization in multimodal agents, achieving accuracy improvements of 6.3%, 7.0%, and 5.3% on streaming video benchmarks.
This paper formalizes hallucination-to-action conversion in multimodal agents and proposes evidence-carrying agents (ECA) that use constrained verifiers to authorize only safe tool calls, achieving a 0% unsafe-action rate on a 200-task pipeline.
This paper introduces On-Policy Data Evolution (ODE) and a visual-native agent harness to improve multimodal deep search agents. By enabling reusable visual evidence and closed-loop data generation, ODE significantly boosts the performance of Qwen3-VL agents across multiple benchmarks, surpassing Gemini 2.5 Pro.
HyperEyes is a parallel multimodal search agent that uses dual-grained reinforcement learning to optimize inference efficiency, achieving higher accuracy with significantly fewer tool-call rounds than existing agents.
InterLV-Search is a new benchmark for evaluating interleaved language-vision agentic search; it highlights the limitations of current systems in visual evidence seeking and multimodal integration.