TVIR: Building Deep Research Agents Towards Text–Visual Interleaved Report Generation
Summary
Introduces TVIR, a benchmark and hierarchical multi-agent framework for generating text-visual interleaved reports, evaluating factual reliability and visual alignment in automated report generation.
Cached at: 06/02/26, 07:33 PM
Source: https://huggingface.co/papers/2606.02320
Abstract
A multimodal deep research benchmark and agent framework are introduced to evaluate and improve the factual reliability and visual alignment of automated report generation systems.
Deep Research Agents have shown strong capability in multi-step information retrieval, reasoning, and long-form report generation, but existing benchmarks and systems remain predominantly text-centric, with limited evaluation of whether visual elements are factually reliable and well aligned with the surrounding analysis. To address this gap, we introduce TVIR (Text–Visual Interleaved Report Generation), which includes TVIR-Bench, a benchmark of 100 expert-curated multimodal deep research tasks that require visual elements to serve specific analytical sub-goals, and TVIR-Agent, a hierarchical multi-agent framework that serves as a strong baseline for constructing outlines, retrieving images, generating charts with traceable sources, and composing reports through context-aware sequential writing. We further develop a dual-path evaluation framework that combines Textual Assessment and Visual Assessment. Experiments across nine deep research systems show that TVIR-Agent achieves strong overall performance, underscoring the importance of explicit multimodal design and evaluation for evidence-driven report generation.
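The abstract describes composing reports through context-aware sequential writing, with visuals that carry traceable sources. The following is a minimal Python sketch of that general idea only, not the TVIR-Agent implementation: every name here (`Visual`, `Section`, `write_section`, `compose_report`) is hypothetical, the writer is a plain function standing in for what would presumably be a model call, and dropping visuals without sources is an assumed policy chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Visual:
    """A chart or retrieved image plus the sources it can be traced to."""
    caption: str
    sources: List[str]


@dataclass
class Section:
    title: str
    text: str = ""
    visuals: List[Visual] = field(default_factory=list)


def write_section(title: str, context: List[Section], visuals: List[Visual]) -> str:
    """Hypothetical stand-in for a writer agent; it only reports what it was given."""
    refs = ", ".join(v.caption for v in visuals) or "no visuals"
    return f"{title}: written with {len(context)} prior section(s); refers to {refs}."


def compose_report(outline: List[str], visuals_by_title: Dict[str, List[Visual]]) -> List[Section]:
    """Write sections in outline order, passing earlier sections as context."""
    report: List[Section] = []
    for title in outline:
        # Assumed policy: keep only visuals that have at least one traceable source.
        visuals = [v for v in visuals_by_title.get(title, []) if v.sources]
        text = write_section(title, context=report, visuals=visuals)
        report.append(Section(title, text, visuals))
    return report


# Illustrative inputs; the URL is a placeholder, not a real data source.
outline = ["Market overview", "Growth trends"]
visuals = {
    "Growth trends": [
        Visual("Figure 1", ["https://example.com/data"]),
        Visual("Figure 2", []),  # no source, so it is dropped
    ]
}
report = compose_report(outline, visuals)
for section in report:
    print(section.text)
```

Each section is written only after the ones before it, so a later section can refer back to earlier text and figures. That ordering is the property "sequential" points at in this sketch.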
Datasets citing this paper: 1
NJU-LINK/TVIR-Bench • Updated about 5 hours ago • 100
Similar Articles
Towards Verifiable Multimodal Deep Research: A Multi-Agent Harness for Interleaved Report Generation
This paper presents Ptah, a multi-agent harness for generating verifiable multimodal deep research reports by interleaving textual and visual evidence through specialized agents and verification mechanisms. It introduces PtahEval for evaluation.
VESTA: Visual Exploration with Statistical Tool Agents
This paper introduces VESTA, a framework that equips vision-language models with dynamically growing toolkits for data exploration and statistical model refinement, outperforming prior agent-based methods on complex scientific modeling tasks. The authors also present Dawn, a benchmark for distribution fitting and time series modeling, including real-world astronomy challenges.
ReVision: Scaling Computer-Use Agents via Temporal Visual Redundancy Reduction
This paper introduces ReVision, a method to reduce token usage in computer-use agents by removing redundant visual patches from consecutive screenshots. It demonstrates that this efficiency gain allows agents to process longer trajectories and improve performance on benchmarks like OSWorld.
UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning
UI-TARS-2 is a native GUI-centered agent model that addresses data scalability, multi-turn RL, and environment stability challenges, achieving state-of-the-art results on GUI benchmarks (88.2 on Online-Mind2Web, 47.5 on OSWorld, 50.6 on WindowsAgentArena, 73.3 on AndroidWorld) and outperforming Claude and OpenAI agents.
AtlasVA: Self-Evolving Visual Skill Memory for Teacher-Free VLM Agents
AtlasVA is a teacher-free visual skill memory framework for vision-language model agents that uses spatial heatmaps, visual exemplars, and symbolic text skills to improve spatial decision-making in long-horizon tasks, outperforming baselines on several benchmarks.