PresentAgent-2: Towards Generalist Multimodal Presentation Agents

Hugging Face Daily Papers

Summary

PresentAgent-2 is an agentic framework that generates presentation videos from user queries by conducting research, creating multimodal slides, and producing interactive content across single, discussion, and interaction modes.

Presentation generation is moving beyond static slide creation toward end-to-end presentation video generation with research grounding, multimodal media, and interactive delivery. We introduce PresentAgent-2, an agentic framework for generating presentation videos from user queries. Given an open-ended user query and a selected presentation mode, PresentAgent-2 first summarizes the query into a focused topic and performs deep research over presentation-friendly sources to collect multimodal resources, including relevant text, images, GIFs, and videos. It then constructs presentation slides, generates mode-specific scripts, and composes slides, audio, and dynamic media into a complete presentation video. PresentAgent-2 supports three independent presentation modes within a unified framework: Single Presentation, which generates a single-speaker narrated presentation video; Discussion, which creates a multi-speaker presentation with structured speaker roles, such as asking guiding questions, explaining concepts, clarifying details, and summarizing key points; and Interaction, which independently supports answering audience questions grounded in the generated slides, scripts, retrieved evidence, and presentation context. To evaluate these capabilities, we build a multimodal presentation benchmark covering single presentation, discussion, and interaction scenarios, with task-specific evaluation criteria for content quality, media relevance, dynamic media use, dialogue naturalness, and interaction grounding. Overall, PresentAgent-2 extends presentation generation from document-dependent slide creation to query-driven, research-grounded presentation video generation with multimodal media, dialogue, and interaction. Code: https://github.com/AIGeeksGroup/PresentAgent-2. Website: https://aigeeksgroup.github.io/PresentAgent-2.
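
The stage order described in the abstract (summarize the query, research, build slides, write a mode-specific script) can be illustrated with a minimal Python sketch. Every name below (`Mode`, `Resource`, `Presentation`, `summarize_query`, `research`, `build_slides`, `write_script`, `generate`) is hypothetical and is not taken from the PresentAgent-2 code base; each stage is a placeholder stub rather than the authors' method. Audio and video composition and the Interaction mode are not modeled here.

```python
from dataclasses import dataclass, field
from enum import Enum


class Mode(Enum):
    """The three presentation modes named in the abstract."""
    SINGLE = "single"            # single-speaker narration
    DISCUSSION = "discussion"    # multi-speaker, structured roles
    INTERACTION = "interaction"  # audience Q&A (not modeled in this sketch)


# Speaker roles for Discussion, taken from the examples in the abstract.
DISCUSSION_ROLES = [
    "ask guiding questions",
    "explain concepts",
    "clarify details",
    "summarize key points",
]


@dataclass
class Resource:
    kind: str  # "text", "image", "gif", or "video"
    ref: str   # placeholder locator, not a real URL


@dataclass
class Presentation:
    topic: str
    mode: Mode
    resources: list = field(default_factory=list)
    slides: list = field(default_factory=list)
    script: list = field(default_factory=list)  # (speaker, line) pairs


def summarize_query(query: str) -> str:
    # Stub for the topic-summarization stage: only normalizes whitespace.
    return " ".join(query.split())


def research(topic: str) -> list:
    # Stub for the research stage: one placeholder per media type.
    kinds = ("text", "image", "gif", "video")
    return [Resource(kind, f"placeholder:{kind}:{topic}") for kind in kinds]


def build_slides(topic: str, resources: list) -> list:
    # Assumption: one slide per collected resource.
    return [f"Slide {i}: {r.kind} about {topic}"
            for i, r in enumerate(resources, start=1)]


def write_script(slides: list, mode: Mode) -> list:
    # Mode-specific script: rotate through roles in Discussion,
    # use a single narrator in Single Presentation.
    if mode is Mode.INTERACTION:
        raise NotImplementedError("Interaction is not modeled in this sketch")
    speakers = DISCUSSION_ROLES if mode is Mode.DISCUSSION else ["narrator"]
    return [(speakers[i % len(speakers)], slide)
            for i, slide in enumerate(slides)]


def generate(query: str, mode: Mode) -> Presentation:
    # Run the stages in the order the abstract lists them.
    topic = summarize_query(query)
    resources = research(topic)
    slides = build_slides(topic, resources)
    script = write_script(slides, mode)
    return Presentation(topic, mode, resources, slides, script)
```

Calling `generate("how do transformers work", Mode.DISCUSSION)` returns a `Presentation` whose script rotates through the four roles, while `Mode.SINGLE` assigns every line to one narrator. Keeping the mode as a parameter of a single entry point mirrors the abstract's description of three modes within one unified framework.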

Cached at: 05/14/26, 04:17 AM


Source: https://huggingface.co/papers/2605.11363



Get this paper in your agent:

hf papers read 2605.11363

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash


Datasets citing this paper: 1

AIGeeksGroup/PresentEval


Similar Articles

MM-WebAgent: A Hierarchical Multimodal Web Agent for Webpage Generation

Hugging Face Daily Papers

MM-WebAgent is a hierarchical agentic framework that generates coherent and visually consistent webpages by coordinating AIGC-based element generation through joint optimization of layout and multimodal content. The paper introduces a benchmark and multi-level evaluation protocol, demonstrating improvements over code-generation and agent-based baselines.

LectūraAgents: A Multi-Agent Framework for Adaptive Personalized AI-Assisted Learning and Embodied Teaching

Hugging Face Daily Papers

LectūraAgents is a multi-agent framework for adaptive personalized learning that mimics professor-student interactions and generates embodied teaching actions aligned with learner profiles. It introduces a hierarchical architecture, an adaptive embodied teaching mechanism, and a Teaching Action-Speech Alignment algorithm, showing consistent improvements over existing approaches.

Macaron-A2UI: A Model for Generative UI in Personal Agents

Hugging Face Daily Papers

Presents Macaron-A2UI, a model for generative UI in personal agents that synthesizes dynamic interfaces with lightweight executable actions, moving beyond text-only chat. The paper introduces a large-scale corpus, the A2UI-Bench benchmark, and trains models up to 754B parameters using LoRA fine-tuning and reinforcement learning, achieving strong results.