Helping Figures Tell their Story! Paper-Grounded Video Generation Explaining Complex Scientific Figures

arXiv cs.CL 06/12/26, 04:00 AM Papers

scientific-figures video-generation narration grounding benchmark multimodal

Summary

Introduces MINARD, a pipeline for generating narrated, region-grounded walkthrough videos from scientific figures and their papers, along with the FigTalk benchmark and new grounding metrics.

arXiv:2606.12576v1. Abstract: Scientific figures compress complex pipelines into a single canvas, yet understanding them requires paper-grounded, step-by-step narration aligned with visual highlights, a capability missing from current video generation systems and benchmarks. To address this, we introduce paper-grounded figure-to-video generation: generating narrated, region-grounded walkthrough videos from a figure and its paper. We propose MINARD (Multimodal Interpretation of Narrated Architecture via Region Decomposition), a pipeline that generates paper-grounded narrations and sequentially grounds them to figure regions. We also release FigTalk, a benchmark with new sequential and component-level grounding metrics. On FigTalk, MINARD generates humanlike, paper-faithful narrations and outperforms existing approaches on narration-conditioned spatial grounding of figures in both automatic and human evaluation.

# Helping Figures Tell their Story! Paper-Grounded Video Generation Explaining Complex Scientific Figures
Source: [https://arxiv.org/abs/2606.12576](https://arxiv.org/abs/2606.12576)
[View PDF](https://arxiv.org/pdf/2606.12576)

## Submission history

From: Ishani Mondal. **[v1]** Wed, 10 Jun 2026 18:25:10 UTC (9,279 KB)

Similar Articles

CityRAG: Stepping Into a City via Spatially-Grounded Video Generation

AutoFigure: Generating and Refining Publication-Ready Scientific Illustrations
AutoFigure is an open-source system for generating and refining editable, publication-ready scientific diagrams, accepted to ICLR 2026.

Towards One-to-Many Temporal Grounding

See Before You Code: Learning Visual Priors for Spatially Aware Educational Animation Generation

Narrative-Driven Paper-to-Slide Generation via ArcDeck