Towards Verifiable Multimodal Deep Research: A Multi-Agent Harness for Interleaved Report Generation

Hugging Face Daily Papers

Summary

This paper presents Ptah, a multi-agent harness for generating verifiable multimodal deep research reports by interleaving textual and visual evidence through specialized agents and verification mechanisms. It introduces PtahEval for evaluation.

Large Language Models (LLMs) have advanced autonomous agents from deep search, which retrieves concise factual answers, to deep research, which synthesizes scattered evidence into long-form reports. However, verifiable multimodal deep research remains challenging due to open-ended synthesis without deterministic ground truth and the need to interleave textual arguments with visual evidence. We propose Ptah, a multi-agent harness for interleaved report generation. Ptah orchestrates the lifecycle from user query to rendered web report through planning, research, and writing stages, where specialized agents construct visual-aware plans, collect claim-grounded evidence, maintain source-aligned images in a Visual Working Memory, and compose reports through declarative multimodal tool use. A verifier agent serves as the harness's acceptance function, enforcing factual grounding, citation fidelity, and cross-modal consistency throughout the workflow. We further introduce PtahEval, an evaluation protocol that augments existing benchmarks with image-level and presentation-level assessments. Experiments on deep research benchmarks show that Ptah produces more reliable, visually informative, and usable human-facing multimodal reports than strong baselines.

Cached at: 05/29/26, 07:00 AM

Source: https://huggingface.co/papers/2605.29861

Abstract

Multi-agent system for generating reliable, visually informative multimodal reports by interleaving textual and visual evidence through specialized agents and verification mechanisms.

Large Language Models (LLMs) have advanced autonomous agents from deep search, which retrieves concise factual answers, to deep research, which synthesizes scattered evidence into long-form reports. However, verifiable multimodal deep research remains challenging due to open-ended synthesis without deterministic ground truth and the need to interleave textual arguments with visual evidence. We propose Ptah, a multi-agent harness for interleaved report generation. Ptah orchestrates the lifecycle from user query to rendered web report through planning, research, and writing stages, where specialized agents construct visual-aware plans, collect claim-grounded evidence, maintain source-aligned images in a Visual Working Memory, and compose reports through declarative multimodal tool use. A verifier agent serves as the harness’s acceptance function, enforcing factual grounding, citation fidelity, and cross-modal consistency throughout the workflow. We further introduce PtahEval, an evaluation protocol that augments existing benchmarks with image-level and presentation-level assessments. Experiments on deep research benchmarks show that Ptah produces more reliable, visually informative, and usable human-facing multimodal reports than strong baselines.
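The abstract describes a verifier that acts as an acceptance function over claim-grounded evidence and source-aligned images. The following is a minimal sketch of that idea only, not the paper's implementation. Every name here (`Evidence`, `VisualWorkingMemory`, `verify`, `write_report`) and the three checks are hypothetical, chosen for illustration; the abstract does not specify Ptah's actual interfaces or verification rules.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Evidence:
    """Hypothetical claim-grounded evidence item."""
    claim: str
    source_url: str
    image_id: Optional[str] = None  # key into the visual working memory


@dataclass
class VisualWorkingMemory:
    """Hypothetical store mapping each image id to the source it came from."""
    images: Dict[str, str] = field(default_factory=dict)

    def add(self, image_id: str, source_url: str) -> None:
        self.images[image_id] = source_url


def verify(ev: Evidence, memory: VisualWorkingMemory) -> bool:
    """Toy acceptance function (assumed checks, not the paper's):
    the claim is non-empty, it carries a citation, and any attached
    image was collected from the same source as the claim."""
    if not ev.claim.strip() or not ev.source_url:
        return False
    if ev.image_id is not None:
        return memory.images.get(ev.image_id) == ev.source_url
    return True


def write_report(query: str, items: List[Evidence],
                 memory: VisualWorkingMemory) -> List[str]:
    """Interleave accepted claims with their images; drop rejected items."""
    blocks = [f"# {query}"]
    for ev in items:
        if not verify(ev, memory):
            continue
        blocks.append(f"{ev.claim} [{ev.source_url}]")
        if ev.image_id is not None:
            blocks.append(f"<image:{ev.image_id}>")
    return blocks


memory = VisualWorkingMemory()
memory.add("fig1", "https://example.org/a")
items = [
    Evidence("Claim A.", "https://example.org/a", "fig1"),
    Evidence("Claim B.", "https://example.org/b", "fig1"),  # image from another source
    Evidence("Claim C.", ""),  # missing citation
]
report = write_report("Example query", items, memory)
```

In this toy run only the first item passes the gate, so the report contains the heading, one cited claim, and one image placeholder. A real verifier would presumably rely on model-based checks rather than the string comparisons assumed here.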

Get this paper in your agent:

hf papers read 2605.29861

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Similar Articles

DuMate-DeepResearch: An Auditable Multi-Agent System with Recursive Search and Rubric-Grounded Reasoning

arXiv cs.AI

This technical report introduces DuMate-DeepResearch, a multi-agent framework for deep research tasks that decouples the agent core from a tool ecosystem, and incorporates graph-based dynamic planning, recursive two-level execution, and rubric-based test-time optimization. The system achieves state-of-the-art results on two deep research benchmarks, demonstrating the value of auditable agent infrastructure.

Self-Evolving Deep Research via Joint Generation and Evaluation

arXiv cs.CL

Researchers from HKUST, ByteDance, and UCL propose SCORE, a co-evolutionary training framework that jointly trains an LLM as both a deep research report generator and an evaluator, using a meta-harness to dynamically adjust evaluation difficulty and prevent reward saturation. Experiments show consistent improvement in open-ended research report quality.