InterLV-Search: Benchmarking Interleaved Multimodal Agentic Search
Summary
InterLV-Search is a new benchmark introduced in this paper to evaluate interleaved language-vision agentic search, highlighting limitations in current systems regarding visual evidence seeking and multimodal integration.
Source: https://huggingface.co/papers/2605.07510
Abstract
The InterLV-Search benchmark evaluates interleaved language-vision agentic search, in which textual and visual evidence is repeatedly used to condition later search. It reveals current systems’ limitations in visual evidence seeking and multimodal evidence integration.
Existing benchmarks for multimodal agentic search evaluate multimodal search and visual browsing, but visual evidence is either confined to the input or treated as an answer endpoint rather than part of an interleaved search trajectory. We introduce InterLV-Search, a benchmark for Interleaved Language-Vision Agentic Search, in which textual and visual evidence is repeatedly used to condition later search. It contains 2,061 examples across three levels: active visual evidence seeking, controlled offline interleaved multimodal search, and open-web interleaved multimodal search. Beyond existing benchmarks, it also includes multimodal multi-branch samples that involve comparison between multiple entities during the evidence search. We construct Level 1 and Level 2 with automated pipelines and Level 3 with a machine-led, human-supervised open-web pipeline. We further provide InterLV-Agent for standardized tool use, trajectory logging, and evaluation. Experiments on proprietary and open-source multimodal agents show that current systems remain far from solving interleaved multimodal search, with the best model below 50% overall accuracy, highlighting challenges in visual evidence seeking, search control, and multimodal evidence integration. We release the benchmark data and evaluation code at https://github.com/hbhalpha/InterLV-Search-Bench
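To make the idea of an interleaved search trajectory concrete, here is a minimal Python sketch in which each piece of evidence, textual or visual, becomes the query for the next tool call. It is an illustration only and does not reflect InterLV-Agent's actual interface: the names `Evidence`, `Trajectory`, `interleaved_search`, `toy_text_search`, and `toy_image_search` are all hypothetical, the tools are fixed stand-ins with no network access, and the fixed `plan` replaces the model-driven tool selection a real agent would perform.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Evidence:
    modality: str  # "text" or "image"
    content: str   # a text snippet, or an identifier standing in for image pixels


@dataclass
class Trajectory:
    question: str
    steps: List[Evidence] = field(default_factory=list)


# Hypothetical tool signature: takes a query string, returns one piece of evidence.
Tool = Callable[[str], Evidence]


def interleaved_search(question: str, tools: Dict[str, Tool], plan: List[str]) -> Trajectory:
    """Call tools in the order given by `plan`, conditioning each query on prior evidence."""
    traj = Trajectory(question)
    query = question
    for tool_name in plan:
        evidence = tools[tool_name](query)
        traj.steps.append(evidence)  # log the step so the trajectory can be inspected later
        # The next query is built from the evidence just found, whatever its modality.
        query = evidence.content
    return traj


# Toy stand-ins for real search tools (assumptions for illustration only).
def toy_text_search(query: str) -> Evidence:
    return Evidence("text", f"text result for [{query}]")


def toy_image_search(query: str) -> Evidence:
    return Evidence("image", f"image result for [{query}]")


tools = {"text": toy_text_search, "image": toy_image_search}
traj = interleaved_search("q", tools, plan=["text", "image", "text"])
modalities = [step.modality for step in traj.steps]
```

In this sketch the second step's image query is derived from the first step's text result, and the third step's text query from the image result. That dependency is what distinguishes an interleaved trajectory from the settings the abstract contrasts it with, where visual evidence appears only in the input or only as the final answer.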
Similar Articles
Visual-Seeker: Towards Visual-Native Multimodal Agentic Search via Active Visual Reasoning
Visual-Seeker proposes a visual-native multimodal deep search agent that actively reasons over fine-grained visual details and synthesizes multimodal evidence, achieving state-of-the-art performance on five challenging multimodal search benchmarks.
DMV-Bench: Diagnosing Long-Horizon Multimodal Agents' Visual Memory with Incidental Cue Injection
Introduces DMV-Bench, an interactive benchmark for evaluating visual memory in multimodal agents using incidental visual cues from product images, and proposes DualMem, a dual-coding memory architecture that outperforms text-only and other multimodal baselines across various chain lengths.
Benchmarking Visual State Tracking in Multimodal Video Understanding
Introduces VSTAT, a benchmark for evaluating visual state tracking in multimodal large language models (MLLMs) using 834 clips and 1,500 questions. Current MLLMs perform poorly compared to humans, failing at visual perception rather than reasoning.
OVO-S-Bench: A Hierarchical Benchmark for Streaming Spatial Intelligence in Multimodal LLMs
OVO-S-Bench introduces a comprehensive human-annotated benchmark of 1,680 questions across 348 videos to evaluate streaming spatial intelligence in multimodal LLMs, revealing that even the best model (Gemini-3.1-Pro) trails human experts by 27 points. The benchmark exposes key limitations including allocentric mapping as a major bottleneck and chain-of-thought reasoning amplifying spatial errors.
OpenSearch-VL: An Open Recipe for Frontier Multimodal Search Agents
OpenSearch-VL is an open-source framework and paper introducing a recipe for training frontier multimodal search agents using reinforcement learning, featuring specialized data curation and a novel training algorithm.