Towards Verifiable Multimodal Deep Research: A Multi-Agent Harness for Interleaved Report Generation
Summary
This paper presents Ptah, a multi-agent harness for generating verifiable multimodal deep research reports by interleaving textual and visual evidence through specialized agents and verification mechanisms. It also introduces PtahEval, an evaluation protocol that augments existing benchmarks with image-level and presentation-level assessments.
Cached at: 05/29/26, 07:00 AM
Source: https://huggingface.co/papers/2605.29861
Abstract
Multi-agent system for generating reliable, visually informative multimodal reports by interleaving textual and visual evidence through specialized agents and verification mechanisms.
Large Language Models (LLMs) have advanced autonomous agents from deep search, which retrieves concise factual answers, to deep research, which synthesizes scattered evidence into long-form reports. However, verifiable multimodal deep research remains challenging due to open-ended synthesis without deterministic ground truth and the need to interleave textual arguments with visual evidence. We propose Ptah, a multi-agent harness for interleaved report generation. Ptah orchestrates the lifecycle from user query to rendered web report through planning, research, and writing stages, where specialized agents construct visual-aware plans, collect claim-grounded evidence, maintain source-aligned images in a Visual Working Memory, and compose reports through declarative multimodal tool use. A verifier agent serves as the harness’s acceptance function, enforcing factual grounding, citation fidelity, and cross-modal consistency throughout the workflow. We further introduce PtahEval, an evaluation protocol that augments existing benchmarks with image-level and presentation-level assessments. Experiments on deep research benchmarks show that Ptah produces more reliable, visually informative, and usable human-facing multimodal reports than strong baselines.
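To make the idea of a verifier acting as an acceptance function more concrete, here is a minimal Python sketch. It is not Ptah's implementation: the abstract does not give the paper's interfaces, so every name here (`Claim`, `VisualWorkingMemory`, `verify`, `write_report`, `demo`) is hypothetical, and the verifier is a simple rule-based stand-in for what is presumably a model-driven agent. The sketch assumes that a claim is accepted only if it carries a citation and, when it references an image, that image was collected from the same source as the claim.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Claim:
    text: str
    source_url: str                  # citation backing the claim ("" if missing)
    image_id: Optional[str] = None   # optional image meant to illustrate the claim


@dataclass
class VisualWorkingMemory:
    # Hypothetical store: maps image_id -> URL of the page the image came from.
    image_sources: Dict[str, str] = field(default_factory=dict)

    def add(self, image_id: str, source_url: str) -> None:
        self.image_sources[image_id] = source_url


def verify(claim: Claim, memory: VisualWorkingMemory) -> bool:
    """Toy acceptance function: the claim must be cited, and any attached
    image must come from the same source as the claim."""
    if not claim.source_url:
        return False
    if claim.image_id is None:
        return True
    return memory.image_sources.get(claim.image_id) == claim.source_url


def write_report(title: str, claims: List[Claim], memory: VisualWorkingMemory) -> str:
    """Interleave accepted claims with their images; skip rejected claims."""
    lines = [title]
    for claim in claims:
        if not verify(claim, memory):
            continue
        lines.append(f"{claim.text} [{claim.source_url}]")
        if claim.image_id is not None:
            lines.append(f"(image: {claim.image_id})")
    return "\n".join(lines)


def demo() -> str:
    memory = VisualWorkingMemory()
    memory.add("fig1", "https://example.com/a")
    claims = [
        Claim("Claim with aligned image.", "https://example.com/a", "fig1"),
        Claim("Claim with mismatched image.", "https://example.com/b", "fig1"),
        Claim("Uncited claim.", ""),
    ]
    return write_report("Toy report", claims, memory)
```

Calling `demo()` keeps only the first claim and its image, since the second claim's image comes from a different source and the third has no citation. The design point is that writing never bypasses the verifier: content reaches the report only after passing the acceptance check.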
Similar Articles
Data Journalist Agent: Transforming Data into Verifiable Multimodal Stories
This paper introduces Data Journalist Agent (Data2Story), a multi-agent framework that automates data journalism by generating evidence-grounded, multimodal news stories while ensuring transparency and verifiability.
TVIR: Building Deep Research Agents Towards Text–Visual Interleaved Report Generation
Introduces TVIR, a benchmark and hierarchical multi-agent framework for generating text-visual interleaved reports, evaluating factual reliability and visual alignment in automated report generation.
DuMate-DeepResearch: An Auditable Multi-Agent System with Recursive Search and Rubric-Grounded Reasoning
This technical report introduces DuMate-DeepResearch, a multi-agent framework for deep research tasks that decouples the agent core from a tool ecosystem, and incorporates graph-based dynamic planning, recursive two-level execution, and rubric-based test-time optimization. The system achieves state-of-the-art results on two deep research benchmarks, demonstrating the value of auditable agent infrastructure.
PresentAgent-2: Towards Generalist Multimodal Presentation Agents
PresentAgent-2 is an agentic framework that generates presentation videos from user queries by conducting research, creating multimodal slides, and producing interactive content across single, discussion, and interaction modes.
Self-Evolving Deep Research via Joint Generation and Evaluation
Researchers from HKUST, ByteDance, and UCL propose SCORE, a co-evolutionary training framework that jointly trains an LLM as both a deep research report generator and an evaluator, using a meta-harness to dynamically adjust evaluation difficulty and prevent reward saturation. Experiments show consistent improvement in open-ended research report quality.