LVSA: Training-Free Sparse Attention for Long Video Diffusion

Hugging Face Daily Papers Papers

Summary

LVSA introduces a training-free sparse attention mechanism for video diffusion models, reducing compute up to 3.17x while enabling generation beyond training horizons without quality loss.

Dense self-attention is the compute and quality bottleneck of long-video diffusion inference: cost grows quadratically with the sequence length, and beyond the training horizon the model converges to near-static output, that is, "frozen" repetitive video. State of the art approaches are either too costly, e.g., they require retraining, or fail to satisfy both performance and quality objectives in a scalable manner. To this end, we introduce Long Video Sparse Attention (LVSA), a training-free model-agnostic block-sparse attention for video diffusion transformers that combines a structured window pattern with rotating global anchors, thus removing the fixed-grid bias which causes long-range temporal artifacts. LVSA, combined with a FlashInfer kernel, reduces compute up to 3.17x on Wan 2.1 1.3B at a 6x horizon, 2.98x on Wan 2.1 14B at a 6x horizon, and 3.33x on HunyuanVideo 1.5 at a 1.5x horizon, compared to dense attention. Beyond reducing compute, LVSA enables HunyuanVideo 1.5 generation at a 2x horizon, which is otherwise out-of-memory on a single GPU. Moreover, LVSA provides speedups up to 2.41x compared to RIFLEx and 3.27x compared to UltraViCo on Wan 2.1 1.3B. To demonstrate applicability across diverse platforms, we apply LVSA on NPUs and achieve speedups up to 2.71x on Wan 2.2 A14B and 3.24x on Wan 2.1 1.3B compared to dense attention. To evaluate quality in a fair way, we introduce VQeval, a tool properly scoring loopy video failures, which instead are rewarded in state of the art evaluators like VBench-Long. LVSA is quality-neutral for generation at training horizon length and quality-positive at extended lengths.
Original Article
View Cached Full Text

Cached at: 06/02/26, 03:36 PM

Paper page - LVSA: Training-Free Sparse Attention for Long Video Diffusion

Source: https://huggingface.co/papers/2605.31057

Abstract

Long Video Sparse Attention (LVSA) addresses computational bottlenecks in video diffusion models by introducing a sparse attention mechanism that reduces compute costs while maintaining video quality beyond training horizons.

Dense self-attentionis the compute and quality bottleneck of long-video diffusion inference: cost grows quadratically with the sequence length, and beyond the training horizon the model converges to near-static output, that is, “frozen” repetitive video. State of the art approaches are either too costly, e.g., they require retraining, or fail to satisfy both performance and quality objectives in a scalable manner. To this end, we introduce Long Video Sparse Attention (LVSA), a training-free model-agnosticblock-sparse attentionforvideo diffusion transformersthat combines astructured window patternwithrotating global anchors, thus removing the fixed-grid bias which causes long-range temporal artifacts. LVSA, combined with aFlashInfer kernel, reduces compute up to 3.17x on Wan 2.1 1.3B at a 6x horizon, 2.98x on Wan 2.1 14B at a 6x horizon, and 3.33x on HunyuanVideo 1.5 at a 1.5x horizon, compared to dense attention. Beyond reducing compute, LVSA enables HunyuanVideo 1.5 generation at a 2x horizon, which is otherwise out-of-memory on a single GPU. Moreover, LVSA provides speedups up to 2.41x compared toRIFLExand 3.27x compared toUltraViCoon Wan 2.1 1.3B. To demonstrate applicability across diverse platforms, we apply LVSA on NPUs and achieve speedups up to 2.71x on Wan 2.2 A14B and 3.24x on Wan 2.1 1.3B compared to dense attention. To evaluate quality in a fair way, we introduceVQeval, a tool properly scoring loopy video failures, which instead are rewarded in state of the art evaluators likeVBench-Long. LVSA is quality-neutral for generation at training horizon length and quality-positive at extended lengths.

View arXiv pageView PDFGitHub13Add to collection

Get this paper in your agent:

hf papers read 2605\.31057

Don’t have the latest CLI?curl \-LsSf https://hf\.co/cli/install\.sh \| bash

Models citing this paper0

No model linking this paper

Cite arxiv.org/abs/2605.31057 in a model README.md to link it from this page.

Datasets citing this paper0

No dataset linking this paper

Cite arxiv.org/abs/2605.31057 in a dataset README.md to link it from this page.

Spaces citing this paper0

No Space linking this paper

Cite arxiv.org/abs/2605.31057 in a Space README.md to link it from this page.

Collections including this paper0

No Collection including this paper

Add this paper to acollectionto link it from this page.

Similar Articles

SparDA: Sparse Decoupled Attention for Efficient Long-Context LLM Inference

arXiv cs.CL

SparDA proposes a decoupled sparse attention architecture that adds a lightweight 'Forecast' projection to predict future KV cache needs, enabling lookahead prefetching from CPU to GPU and reducing selection overhead. On 8B sparse-pretrained models, it achieves up to 1.25× prefill and 1.7× decode speedup, with up to 5.3× higher decode throughput over non-offload baselines.

Video2LoRA: Parametric Video Internalization for Vision-Language Models

Hugging Face Daily Papers

This paper introduces Video2LoRA, a method that predicts Low-Rank Adaptation (LoRA) weights directly from video representations, enabling efficient video processing in frozen vision-language models. It reduces visual token load by up to 1500x and query TTFT by 6-80x while maintaining performance on video summarization and captioning benchmarks.

Lightning Unified Video Editing via In-Context Sparse Attention

Hugging Face Daily Papers

This paper introduces In-context Sparse Attention (ISA), a framework that significantly reduces computational costs in video editing by pruning redundant context and using dynamic query grouping. The authors demonstrate the method's effectiveness with LIVEditor, achieving near-lossless acceleration and state-of-the-art results on multiple video editing benchmarks.