SANA-Video: Efficient Video Generation with Block Linear Diffusion Transformer

Papers with Code Trending 09/29/25, 12:28 PM Papers

video-generation diffusion-model linear-attention efficient-inference text-to-video open-source

Summary

SANA-Video is a small diffusion model that efficiently generates high-resolution, long videos using linear attention and a constant-memory KV cache, achieving competitive performance at dramatically lower cost and faster speed compared to existing models.

We introduce SANA-Video, a small diffusion model that can efficiently generate videos at up to 720x1280 resolution and minute-length duration. SANA-Video synthesizes high-resolution, high-quality, long videos with strong text-video alignment at a remarkably fast speed, and it is deployable on an RTX 5090 GPU. Two core designs enable efficient, effective, long video generation: (1) Linear DiT: we use linear attention as the core operation, which is more efficient than vanilla attention given the large number of tokens processed in video generation. (2) Constant-memory KV cache for block linear attention: we design a block-wise autoregressive approach for long video generation that employs a constant-memory state, derived from the cumulative properties of linear attention. This KV cache provides the Linear DiT with global context at a fixed memory cost, eliminating the need for a traditional KV cache and enabling efficient, minute-long video generation. In addition, we explore effective data filters and model training strategies, reducing the training cost to 12 days on 64 H100 GPUs, which is only 1% of the cost of MovieGen. Despite this low cost, SANA-Video achieves competitive performance compared to modern state-of-the-art small diffusion models (e.g., Wan 2.1-1.3B and SkyReel-V2-1.3B) while being 16x faster in measured latency. Moreover, SANA-Video can be deployed on RTX 5090 GPUs with NVFP4 precision, reducing the time to generate a 5-second 720p video from 71s to 29s (a 2.4x speedup). In summary, SANA-Video enables low-cost, high-quality video generation.
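The "cumulative properties of linear attention" mentioned above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the ReLU feature map, the single-head layout, and the function names `feature_map`, `block_linear_attention`, and `reference_attention` are all assumptions made for illustration. The idea shown is that the sum of key-value outer products and the sum of keys can be carried forward as a fixed-size state, so each new block sees all earlier blocks without storing their keys and values.

```python
import numpy as np

def feature_map(x):
    # Assumed non-negative feature map (ReLU); the paper's exact choice is not stated here.
    return np.maximum(x, 0.0)

def block_linear_attention(q_blocks, k_blocks, v_blocks, eps=1e-6):
    """Block-wise autoregressive linear attention with a constant-size state.

    Each block attends to itself and to every earlier block. Only `state`
    (d_k x d_v) and `norm` (d_k,) are carried between blocks.
    """
    d_k = q_blocks[0].shape[-1]
    d_v = v_blocks[0].shape[-1]
    state = np.zeros((d_k, d_v))  # running sum of phi(k)^T v
    norm = np.zeros(d_k)          # running sum of phi(k)
    outputs = []
    for q, k, v in zip(q_blocks, k_blocks, v_blocks):
        phi_q, phi_k = feature_map(q), feature_map(k)
        # Fold the current block into the state before reading from it,
        # so tokens inside a block see each other bidirectionally.
        state = state + phi_k.T @ v
        norm = norm + phi_k.sum(axis=0)
        out = (phi_q @ state) / (phi_q @ norm + eps)[:, None]
        outputs.append(out)
    return outputs, state, norm

def reference_attention(q_blocks, k_blocks, v_blocks, eps=1e-6):
    """Same result computed the expensive way, keeping all past keys and values."""
    outputs = []
    for b, q in enumerate(q_blocks):
        k_all = np.concatenate(k_blocks[: b + 1], axis=0)
        v_all = np.concatenate(v_blocks[: b + 1], axis=0)
        weights = feature_map(q) @ feature_map(k_all).T
        out = (weights @ v_all) / (weights.sum(axis=1, keepdims=True) + eps)
        outputs.append(out)
    return outputs

# Toy example: 4 blocks of 8 tokens each, 16-dimensional queries, keys, and values.
rng = np.random.default_rng(0)
q_blocks = [rng.standard_normal((8, 16)) for _ in range(4)]
k_blocks = [rng.standard_normal((8, 16)) for _ in range(4)]
v_blocks = [rng.standard_normal((8, 16)) for _ in range(4)]

outputs, state, norm = block_linear_attention(q_blocks, k_blocks, v_blocks)
print(state.shape, norm.shape)  # (16, 16) (16,)
```

In this sketch the carried state has the same shape no matter how many blocks have been processed, whereas `reference_attention` must concatenate a key and value history that grows with video length. That contrast is the sense in which a constant-memory state can stand in for a traditional KV cache.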


Cached at: 05/16/26, 12:22 AM


Source: https://huggingface.co/papers/2509.24695

Published on Sep 29, 2025

Submitted by Yuyang (https://huggingface.co/Yuyang-z) on Sep 30, 2025




Models citing this paper: 5

#### Efficient-Large-Model/SANA-Video_2B_720p
Text-to-Video • Updated Mar 16 • 20 • 15

#### Efficient-Large-Model/SANA-Video_2B_480p
Text-to-Video • Updated Oct 28, 2025 • 88 • 14

#### Efficient-Large-Model/SANA-Video_2B_480p_diffusers
Text-to-Video • Updated Nov 4, 2025 • 5

#### Efficient-Large-Model/SANA-Video_2B_480p_LongLive_diffusers
Text-to-Video • Updated Dec 9, 2025 • 2

## Datasets citing this paper: 0


Spaces citing this paper: 1

Collections including this paper: 5


Similar Articles

Efficient-Large-Model/SANA-WM_bidirectional

SANA-WM: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer

LVSA: Training-Free Sparse Attention for Long Video Diffusion

NVlabs/Sana

LongCat-Video Technical Report
