This paper presents Ptah, a multi-agent harness that generates verifiable multimodal deep research reports by interleaving textual and visual evidence, using specialized agents and verification mechanisms. It also introduces PtahEval for evaluation.
MemEye is a visual-centric evaluation framework that assesses multimodal agent memory by measuring visual evidence granularity and retrieval complexity across 8 life-scenario tasks, revealing that current architectures struggle to preserve fine-grained visual details and reason about state changes over time.