Helping Figures Tell their Story! Paper-Grounded Video Generation Explaining Complex Scientific Figures
Summary
Introduces MINARD, a pipeline for generating narrated, region-grounded walkthrough videos from scientific figures and their papers, along with the FigTalk benchmark and new grounding metrics.
Source: [https://arxiv.org/abs/2606.12576](https://arxiv.org/abs/2606.12576) [View PDF](https://arxiv.org/pdf/2606.12576)

> Abstract: Scientific figures compress complex pipelines into a single canvas, yet understanding them requires paper-grounded, step-by-step narration aligned with visual highlights, a capability missing from current video generation systems and benchmarks. To address this, we introduce paper-grounded figure-to-video generation: generating narrated, region-grounded walkthrough videos from a figure and its paper. We propose MINARD (Multimodal Interpretation of Narrated Architecture via Region Decomposition), a pipeline that generates paper-grounded narrations and sequentially grounds them to figure regions. We also release FigTalk, a benchmark with new sequential and component-level grounding metrics derived. On FigTalk, MINARD generates humanlike, paper-faithful narrations and outperforms existing approaches on narration-conditioned figure spatial grounding in both automatic and human evaluation.

## Submission history

From: Ishani Mondal [[view email](https://arxiv.org/show-email/70f8b856/2606.12576)]

**[v1]** Wed, 10 Jun 2026 18:25:10 UTC (9,279 KB)
Similar Articles
CityRAG: Stepping Into a City via Spatially-Grounded Video Generation
CityRAG introduces a video generative model that produces long, physically grounded, 3D-consistent videos of real-world cities using geo-registered data, enabling realistic navigation and simulation for robotics and autonomous driving.
AutoFigure: Generating and Refining Publication-Ready Scientific Illustrations
AutoFigure is an open-source system for generating and refining editable, publication-ready scientific diagrams, accepted to ICLR 2026.
Towards One-to-Many Temporal Grounding
This paper introduces One-to-Many Temporal Grounding (OMTG), a new task for localizing multiple disjoint video segments from a single text query, along with a benchmark, evaluation metrics, a 56k-sample dataset, and novel reward functions that achieve state-of-the-art results, outperforming Gemini 2.5 Pro and Seed-1.8.
See Before You Code: Learning Visual Priors for Spatially Aware Educational Animation Generation
This paper introduces OmniManim, a render-feedback-aware framework for generating educational animations from natural language descriptions using large language models. It addresses visual defects like element overlap and misalignment by incorporating explicit visual planning, post-render diagnostics, and localized repair, demonstrating improved render quality on newly constructed datasets.
Narrative-Driven Paper-to-Slide Generation via ArcDeck
ArcDeck is a multi-agent framework that generates presentation slides from academic papers by modeling logical flow through discourse trees and iterative agent refinement, outperforming direct summarization methods. The paper introduces ArcBench, a new benchmark for evaluating paper-to-slide generation with emphasis on narrative coherence and logical structure.