InterLV-Search: Benchmarking Interleaved Multimodal Agentic Search
Summary
InterLV-Search is a new benchmark introduced in this paper to evaluate interleaved language-vision agentic search, highlighting limitations in current systems regarding visual evidence seeking and multimodal integration.
Source: https://huggingface.co/papers/2605.07510
Abstract
The InterLV-Search benchmark evaluates interleaved language-vision agentic search, in which textual and visual evidence is repeatedly used to condition later search steps, revealing current systems’ limitations in visual evidence seeking and multimodal evidence integration.
Existing benchmarks for multimodal agentic search evaluate multimodal search and visual browsing, but visual evidence is either confined to the input or treated as an answer endpoint rather than part of an interleaved search trajectory. We introduce InterLV-Search, a benchmark for Interleaved Language-Vision Agentic Search, in which textual and visual evidence is repeatedly used to condition later search. It contains 2,061 examples across three levels: active visual evidence seeking, controlled offline interleaved multimodal search, and open-web interleaved multimodal search. Beyond existing benchmarks, it also includes multimodal multi-branch samples that involve comparison between multiple entities during the evidence search. We construct Level 1 and Level 2 with automated pipelines and Level 3 with a machine-led, human-supervised open-web pipeline. We further provide InterLV-Agent for standardized tool use, trajectory logging, and evaluation. Experiments on proprietary and open-source multimodal agents show that current systems remain far from solving interleaved multimodal search, with the best model below 50% overall accuracy, highlighting challenges in visual evidence seeking, search control, and multimodal evidence integration. We release the benchmark data and evaluation code at https://github.com/hbhalpha/InterLV-Search-Bench
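The abstract describes an evaluation setup (three difficulty levels, multi-branch comparison samples, and an InterLV-Agent harness for tool use, trajectory logging, and scoring), but this page does not document the released data format or agent API. The sketch below only illustrates the general shape of such a per-level evaluation loop; the example schema, field names, and the run_agent stub are assumptions for illustration, not the actual InterLV-Search format or harness.

```python
# Minimal sketch of an interleaved multimodal search evaluation loop.
# NOTE: the Example schema, field names, and run_agent stub are hypothetical;
# see the InterLV-Search repository for the real data format and harness.
from dataclasses import dataclass, field

@dataclass
class Example:
    question: str
    level: int                                        # 1 = visual evidence seeking, 2 = offline, 3 = open web
    image_paths: list = field(default_factory=list)   # input / intermediate visual evidence
    answer: str = ""

def run_agent(example: Example) -> str:
    """Placeholder for an agent that alternates text and image search.
    A real agent would issue tool calls, log its trajectory, and condition
    later queries on textual and visual evidence retrieved earlier."""
    return "unknown"

def evaluate(examples: list) -> dict:
    """Exact-match accuracy per level (the paper's actual metric may differ)."""
    per_level = {}
    for ex in examples:
        pred = run_agent(ex)
        correct = pred.strip().lower() == ex.answer.strip().lower()
        hits, total = per_level.get(ex.level, (0, 0))
        per_level[ex.level] = (hits + int(correct), total + 1)
    return {lvl: hits / total for lvl, (hits, total) in per_level.items()}

if __name__ == "__main__":
    demo = [Example("Which of the two landmarks shown is older?", level=2,
                    image_paths=["landmark_a.jpg", "landmark_b.jpg"],
                    answer="landmark A")]
    print(evaluate(demo))
```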
Get this paper in your agent:
hf papers read 2605.07510
Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash
Similar Articles
OpenSearch-VL: An Open Recipe for Frontier Multimodal Search Agents
OpenSearch-VL is an open-source framework and paper introducing a recipe for training frontier multimodal search agents using reinforcement learning, featuring specialized data curation and a novel training algorithm.
HyperGVL: Benchmarking and Improving Large Vision-Language Models in Hypergraph Understanding and Reasoning
HyperGVL introduces the first benchmark for evaluating Large Vision-Language Models on hypergraph understanding and reasoning, featuring 84,000 QA samples across 12 tasks and real-world applications. The paper also proposes WiseHyGR, a generalizable router that enhances LVLM performance through adaptive hypergraph representations.
Retrieve, Integrate, and Synthesize: Spatial-Semantic Grounded Latent Visual Reasoning
This paper introduces RIS, a framework for spatial-semantic grounded latent visual reasoning in Multimodal Large Language Models to overcome information bottlenecks. It proposes anchoring latent tokens to spatial and semantic evidence, showing improvements on benchmarks like V* and HRBench.
Images in Sentences: Scaling Interleaved Instructions for Unified Visual Generation
This paper introduces INSET, a unified multimodal model that embeds images as native vocabulary within textual instructions to improve handling of complex interleaved inputs for image generation and editing.
LLaVA-UHD v4: What Makes Efficient Visual Encoding in MLLMs?
This paper introduces LLaVA-UHD v4, which improves visual encoding efficiency in multimodal large language models by using slice-based encoding and intra-ViT early compression. It reduces computational costs by over 55% while maintaining or improving performance on high-resolution image tasks.