Helping Figures Tell their Story! Paper-Grounded Video Generation Explaining Complex Scientific Figures

arXiv cs.CL Papers

Summary

Introduces MINARD, a pipeline for generating narrated, region-grounded walkthrough videos from scientific figures and their papers, along with the FigTalk benchmark and new grounding metrics.

arXiv:2606.12576v1 (announce type: new)

Abstract: Scientific figures compress complex pipelines into a single canvas, yet understanding them requires paper-grounded, step-by-step narration aligned with visual highlights, a capability missing from current video generation systems and benchmarks. To address this, we introduce paper-grounded figure-to-video generation: generating narrated, region-grounded walkthrough videos from a figure and its paper. We propose MINARD (Multimodal Interpretation of Narrated Architecture via Region Decomposition), a pipeline that generates paper-grounded narrations and sequentially grounds them to figure regions. We also release FigTalk, a benchmark with new sequential and component-level grounding metrics. On FigTalk, MINARD generates humanlike, paper-faithful narrations and outperforms existing approaches on narration-conditioned spatial grounding of figures in both automatic and human evaluation.

Cached at: 06/12/26, 08:50 AM

# Helping Figures Tell their Story! Paper-Grounded Video Generation Explaining Complex Scientific Figures
Source: [https://arxiv.org/abs/2606.12576](https://arxiv.org/abs/2606.12576)
[View PDF](https://arxiv.org/pdf/2606.12576)

> Abstract: Scientific figures compress complex pipelines into a single canvas, yet understanding them requires paper-grounded, step-by-step narration aligned with visual highlights, a capability missing from current video generation systems and benchmarks. To address this, we introduce paper-grounded figure-to-video generation: generating narrated, region-grounded walkthrough videos from a figure and its paper. We propose MINARD (Multimodal Interpretation of Narrated Architecture via Region Decomposition), a pipeline that generates paper-grounded narrations and sequentially grounds them to figure regions. We also release FigTalk, a benchmark with new sequential and component-level grounding metrics. On FigTalk, MINARD generates humanlike, paper-faithful narrations and outperforms existing approaches on narration-conditioned spatial grounding of figures in both automatic and human evaluation.
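
To make "sequentially grounding narrations to figure regions" concrete, here is a minimal Python sketch of how such a walkthrough could be represented and scored. It is an illustration only, not MINARD's implementation and not FigTalk's metrics, neither of which the abstract specifies. The `Step` class, the bounding-box format, and the `mean_stepwise_iou` function are all hypothetical names, and intersection-over-union is used here simply as a common, generic overlap measure.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumed box format (hypothetical): (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class Step:
    """One walkthrough step: a narration sentence and the figure region it highlights."""
    narration: str
    region: Box


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mean_stepwise_iou(pred: List[Step], ref: List[Step]) -> float:
    """Average IoU between predicted and reference regions, compared step by step.

    Assumes both walkthroughs have the same number of steps in the same order;
    a real benchmark would need a policy for mismatched lengths.
    """
    if len(pred) != len(ref):
        raise ValueError("walkthroughs must have the same number of steps")
    if not pred:
        return 0.0
    return sum(iou(p.region, r.region) for p, r in zip(pred, ref)) / len(pred)


if __name__ == "__main__":
    reference = [
        Step("The encoder reads the input figure.", (0, 0, 100, 50)),
        Step("Its output feeds the decoder.", (120, 0, 220, 50)),
    ]
    predicted = [
        Step("First, the encoder processes the figure.", (0, 0, 100, 50)),
        Step("Next, the decoder consumes the features.", (300, 0, 400, 50)),
    ]
    print(mean_stepwise_iou(predicted, reference))
```

Comparing steps in order, rather than as an unordered set of boxes, is what makes the score sensitive to the sequence of highlights. Whether FigTalk's sequential metric works this way is not stated in the abstract.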

## Submission history

From: Ishani Mondal

**[v1]** Wed, 10 Jun 2026 18:25:10 UTC (9,279 KB)

Similar Articles

Towards One-to-Many Temporal Grounding

Hugging Face Daily Papers

This paper introduces One-to-Many Temporal Grounding (OMTG), a new task for localizing multiple disjoint video segments from a single text query, along with a benchmark, evaluation metrics, a 56k-sample dataset, and novel reward functions that achieve state-of-the-art results, outperforming Gemini 2.5 Pro and Seed-1.8.

See Before You Code: Learning Visual Priors for Spatially Aware Educational Animation Generation

arXiv cs.AI

This paper introduces OmniManim, a render-feedback-aware framework for generating educational animations from natural language descriptions using large language models. It addresses visual defects like element overlap and misalignment by incorporating explicit visual planning, post-render diagnostics, and localized repair, demonstrating improved render quality on newly constructed datasets.

Narrative-Driven Paper-to-Slide Generation via ArcDeck

Hugging Face Daily Papers

ArcDeck is a multi-agent framework that generates presentation slides from academic papers by modeling logical flow through discourse trees and iterative agent refinement, outperforming direct summarization methods. The paper introduces ArcBench, a new benchmark for evaluating paper-to-slide generation with emphasis on narrative coherence and logical structure.