TVIR: Building Deep Research Agents Towards Text–Visual Interleaved Report Generation

Hugging Face Daily Papers

Summary

Introduces TVIR, a benchmark and hierarchical multi-agent framework for generating text-visual interleaved reports, evaluating factual reliability and visual alignment in automated report generation.

Deep Research Agents have shown strong capability in multi-step information retrieval, reasoning, and long-form report generation, but existing benchmarks and systems remain predominantly text-centric, with limited evaluation of whether visual elements are factually reliable and well aligned with the surrounding analysis. To address this gap, we introduce TVIR (Text–Visual Interleaved Report Generation), which includes TVIR-Bench, a benchmark of 100 expert-curated multimodal deep research tasks that require visual elements to serve specific analytical sub-goals, and TVIR-Agent, a hierarchical multi-agent framework that serves as a strong baseline for constructing outlines, retrieving images, generating charts with traceable sources, and composing reports through context-aware sequential writing. We further develop a dual-path evaluation framework that combines Textual Assessment and Visual Assessment. Experiments across nine deep research systems show that TVIR-Agent achieves strong overall performance, underscoring the importance of explicit multimodal design and evaluation for evidence-driven report generation.
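The phrase "context-aware sequential writing" can be illustrated with a minimal sketch. This is not TVIR-Agent's implementation: the `Visual` and `Section` types, the `compose_report` function, the `stub_writer` stand-in for a model call, and the example URL are all hypothetical names chosen for illustration. The sketch only assumes that sections are written in outline order, that each writing step sees the text produced so far, and that every visual keeps a pointer to its source.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Visual:
    """A chart or retrieved image tied to one sub-goal (hypothetical type)."""
    caption: str
    source: str  # provenance kept so the visual stays traceable


@dataclass
class Section:
    """One outline entry: heading, analytical sub-goal, and its visuals."""
    heading: str
    sub_goal: str
    visuals: List[Visual] = field(default_factory=list)


def compose_report(outline: List[Section],
                   write_section: Callable[[Section, str], str]) -> str:
    """Write sections in outline order, passing all earlier text as context."""
    parts: List[str] = []
    for section in outline:
        context = "\n\n".join(parts)  # everything written before this section
        body = write_section(section, context)
        figures = "\n".join(
            f"[Figure: {v.caption} (source: {v.source})]"
            for v in section.visuals
        )
        block = f"{section.heading}\n{body}"
        if figures:
            block += "\n" + figures  # interleave visuals with their section
        parts.append(block)
    return "\n\n".join(parts)


def stub_writer(section: Section, context: str) -> str:
    """Stand-in for a model call; reports how much prior context it saw."""
    return f"Addresses: {section.sub_goal} (prior context: {len(context)} chars)"


outline = [
    Section("1. Market overview", "summarize market size",
            [Visual("Market size by year", "https://example.com/data")]),
    Section("2. Outlook", "discuss the trend"),
]
report = compose_report(outline, stub_writer)
print(report)
```

In this sketch the first section is written with empty context and the second receives the full text of the first, which is the property the abstract's wording suggests. A real system would replace `stub_writer` with a model call and would populate `visuals` from retrieval and chart-generation steps.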

Cached at: 06/02/26, 07:33 PM

Source: https://huggingface.co/papers/2606.02320

Abstract

A multimodal deep research benchmark and agent framework are introduced to evaluate and improve the factual reliability and visual alignment of automated report generation systems.


Datasets citing this paper: 1

#### NJU-LINK/TVIR-Bench
