X+Slides: Benchmarking Audience-Conditioned Slide Generation

arXiv cs.AI Papers

Summary

X+Slides is a new benchmark for evaluating audience-conditioned slide generation from source documents, using source-grounded probes and audience-specific utility weights. Experiments on DeepPresenter, SlideTailor, and NotebookLM show that current systems recover a substantial but incomplete share of audience-essential information.

arXiv:2606.19256v1 (Announce Type: new)

Abstract: Automatically generating slide decks from source documents is an important application of large language models (LLMs). Existing benchmarks primarily assess slide completeness and technical depth, while overlooking the target audience as a critical real-world factor. For instance, specialists demand rigorous proofs, whereas decision-makers prioritize actionable conclusions. To bridge this gap, we introduce X+Slides, a benchmark specifically designed for audience-conditioned slide generation. Built on a diverse corpus spanning 113 topics and seven presentation scenes, X+Slides employs a dynamic evaluation framework constructed from 8,133 deduplicated, source-grounded probes. By assigning audience-specific utility weights to the same source-grounded probes, X+Slides reports four complementary metrics: Audience Coverage measures how much audience-essential information is conveyed, Domain-wise Coverage shows which information types are covered, Efficiency measures delivered utility per unit of attention cost, and Correctness verifies whether slide claims are supported by the source. Experiments on DeepPresenter, SlideTailor, and NotebookLM show that current systems can recover a substantial but still incomplete part of audience-essential information: at $\tau_A=0.7$, DeepPresenter reaches a best Audience Coverage of 0.714, SlideTailor reaches 0.594, and the NotebookLM ablation reaches 0.853 while showing clear grounding differences. These results indicate that visual quality and broad topic coverage should not be treated as evidence support without source-grounded evaluation.
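The abstract names the four metrics but does not give their formulas. The following is a minimal, hypothetical sketch of how such metrics could be computed from per-probe results. Everything in it is an assumption rather than the benchmark's definition: the `Probe` schema, reading $\tau_A$ as a cutoff on a probe's utility weight for deciding which probes are "audience-essential", weighting Audience Coverage by utility, and leaving the attention-cost model (words, slides, seconds) to the caller.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Probe:
    """A source-grounded probe (hypothetical schema, not the paper's)."""
    probe_id: str
    domain: str                # information type, e.g. "method" or "result"
    weights: dict[str, float]  # audience name -> utility weight in [0, 1]


def essential_probes(probes, audience, tau_a):
    # Assumption: a probe is audience-essential when its utility weight
    # for this audience is at least tau_a.
    return [p for p in probes if p.weights.get(audience, 0.0) >= tau_a]


def audience_coverage(probes, covered_ids, audience, tau_a=0.7):
    # Utility-weighted fraction of essential probes that the deck answers.
    ess = essential_probes(probes, audience, tau_a)
    total = sum(p.weights[audience] for p in ess)
    if total == 0:
        return 0.0
    hit = sum(p.weights[audience] for p in ess if p.probe_id in covered_ids)
    return hit / total


def domain_coverage(probes, covered_ids, audience, tau_a=0.7):
    # Unweighted coverage per information type, over essential probes only.
    seen, hit = defaultdict(int), defaultdict(int)
    for p in essential_probes(probes, audience, tau_a):
        seen[p.domain] += 1
        hit[p.domain] += int(p.probe_id in covered_ids)
    return {d: hit[d] / seen[d] for d in seen}


def efficiency(probes, covered_ids, audience, attention_cost):
    # Delivered utility (all covered probes) per unit of attention cost.
    delivered = sum(
        p.weights.get(audience, 0.0) for p in probes if p.probe_id in covered_ids
    )
    return delivered / attention_cost if attention_cost > 0 else 0.0


def correctness(claim_supported):
    # Fraction of slide claims judged to be supported by the source.
    return sum(claim_supported) / len(claim_supported) if claim_supported else 0.0


if __name__ == "__main__":
    probes = [
        Probe("p1", "result", {"executive": 0.9, "specialist": 0.5}),
        Probe("p2", "method", {"executive": 0.3, "specialist": 0.95}),
        Probe("p3", "result", {"executive": 0.8, "specialist": 0.7}),
    ]
    covered = {"p1", "p2"}  # probes a judge found answered by the deck
    print(audience_coverage(probes, covered, "executive"))   # 0.9 / 1.7, about 0.529
    print(audience_coverage(probes, covered, "specialist"))  # 0.95 / 1.65, about 0.576
    print(domain_coverage(probes, covered, "specialist"))
    print(efficiency(probes, covered, "executive", attention_cost=10.0))
    print(correctness([True, True, False]))
```

The toy example illustrates the point the abstract makes about audience conditioning: the same deck covering the same probes scores differently for an executive and a specialist, because the utility weights, not the probes, change with the audience.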

Cached at: 06/18/26, 05:42 AM

# X+Slides: Benchmarking Audience-Conditioned Slide Generation
Source: [https://arxiv.org/abs/2606.19256](https://arxiv.org/abs/2606.19256)
[View PDF](https://arxiv.org/pdf/2606.19256)

## Submission history

From: Haodong Chen

**[v1]** Wed, 17 Jun 2026 16:30:26 UTC (39,100 KB)

Similar Articles

DeepSlide: From Artifacts to Presentation Delivery

arXiv cs.AI

DeepSlide is a human-in-the-loop multi-agent system for the full presentation process, from requirement elicitation and time-budgeted narrative planning to evidence-grounded slide-script generation and rehearsal support. It introduces a dual-scoreboard benchmark separating static artifact quality from dynamic delivery excellence, and achieves gains in narrative flow, pacing precision, and slide-script synergy.

AI-Generated Slides: Are They Good? Can Students Tell?

arXiv cs.AI

This paper examines using generative AI tools (NotebookLM, Claude, M365 Copilot, Cursor, Claude Code) to generate slides from instructor notes, finding that coding assistants produce the best slides and that students cannot reliably distinguish AI-generated slides from human-created ones.

Narrative-Driven Paper-to-Slide Generation via ArcDeck

Hugging Face Daily Papers

ArcDeck is a multi-agent framework that generates presentation slides from academic papers by modeling logical flow through discourse trees and iterative agent refinement, outperforming direct summarization methods. The paper introduces ArcBench, a new benchmark for evaluating paper-to-slide generation with emphasis on narrative coherence and logical structure.