SANA-Streaming: Real-time Streaming Video Editing with Hybrid Diffusion Transformer

Hugging Face Daily Papers 05/28/26, 12:00 AM Papers

video-editing diffusion-transformer real-time streaming temporal-consistency consumer-gpus system-co-design

Summary

SANA-Streaming enables real-time high-resolution video-to-video editing on consumer GPUs using a hybrid diffusion transformer architecture, cycle-reverse regularization, and efficient system co-design, achieving 24 FPS at 1280x704 resolution on a single RTX 5090.
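As a rough sanity check on what these figures imply, the arithmetic below converts the quoted 24 FPS at 1280x704 (and the 58 FPS DiT figure quoted in the abstract) into per-frame time budgets and pixel throughput. The split between "DiT" and "everything else" assumes the stages run back to back without overlap, which the source does not state; if the pipeline overlaps stages, that split does not apply.

```python
# Figures quoted in the summary and abstract.
WIDTH, HEIGHT = 1280, 704
END_TO_END_FPS = 24
DIT_FPS = 58


def frame_budget_ms(fps: float) -> float:
    """Time available per frame, in milliseconds, at a given frame rate."""
    return 1000.0 / fps


pixels_per_frame = WIDTH * HEIGHT
pixels_per_second = pixels_per_frame * END_TO_END_FPS

total_ms = frame_budget_ms(END_TO_END_FPS)  # end-to-end budget per frame
dit_ms = frame_budget_ms(DIT_FPS)           # DiT core time per frame

# Assumption (not from the source): stages run sequentially, so whatever
# the DiT does not use is left for encoding, decoding, and I/O.
other_ms = total_ms - dit_ms

print(f"pixels per frame:   {pixels_per_frame}")
print(f"pixels per second:  {pixels_per_second}")
print(f"end-to-end budget:  {total_ms:.2f} ms/frame")
print(f"DiT core:           {dit_ms:.2f} ms/frame")
print(f"remaining (assumed sequential): {other_ms:.2f} ms/frame")
```

Under that sequential assumption, the DiT core accounts for roughly 41% of each frame's time budget.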

Real-time streaming video-to-video editing (V2V) is critical for interactive applications such as live broadcasting and gaming, yet it remains a formidable challenge due to the stringent requirements for temporal consistency and inference throughput. In this paper, we present SANA-Streaming, a system-algorithm co-designed framework for high-resolution, real-time streaming video editing on consumer GPUs, with the following three core designs: (1) Hybrid Diffusion Transformer architecture introduces softmax attention in part of the blocks to improve local modeling capabilities while preserving the efficiency of linear layers. (2) Cycle-Reverse Regularization is a novel training strategy that enforces semantic consistency by predicting source frames from generated content via flow matching, improving temporal consistency without requiring paired long edited videos. (3) Efficient System Co-design combines fused GDN kernels and Mixed-Precision Quantization (MPQ) optimized for the NVIDIA Blackwell (RTX 5090) architecture. By profiling real-world throughput, our MPQ maximizes Tensor Core utilization while maintaining generation quality. The resulting system achieves real-time 1280 x 704 resolution editing at 24 end-to-end FPS on a single RTX 5090 GPU, with the DiT core running at 58 FPS. Experimental results demonstrate that our co-design approach significantly outperforms existing SOTA methods in both temporal coherence and system throughput.
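Design (1) can be pictured with a small NumPy sketch that mixes two attention types in one stack: most blocks use a linear-cost attention, and a few use standard softmax attention. The abstract does not say which blocks use softmax attention, how many there are, or which linear formulation is used, so the layout rule (`softmax_every`), the ReLU feature map, and all function names here are assumptions for illustration. Projections, normalization, and MLPs of a real DiT block are omitted.

```python
import numpy as np


def softmax_attention(q, k, v):
    """Standard softmax attention; cost grows quadratically with tokens."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v


def linear_attention(q, k, v, eps=1e-6):
    """Kernelized linear attention; cost grows linearly with tokens.

    The ReLU feature map is an assumption made for this sketch.
    """
    phi_q, phi_k = np.maximum(q, 0.0), np.maximum(k, 0.0)
    kv = phi_k.T @ v          # (d, d_v) summary of keys and values
    z = phi_k.sum(axis=0)     # (d,) normalizer
    return (phi_q @ kv) / (phi_q @ z + eps)[:, None]


def hybrid_layout(num_blocks, softmax_every):
    """Hypothetical layout: every `softmax_every`-th block uses softmax."""
    return [
        "softmax" if (i + 1) % softmax_every == 0 else "linear"
        for i in range(num_blocks)
    ]


def hybrid_forward(x, layout):
    """Apply the attention stack with residual connections only."""
    for kind in layout:
        attn = softmax_attention if kind == "softmax" else linear_attention
        x = x + attn(x, x, x)
    return x


rng = np.random.default_rng(0)
layout = hybrid_layout(num_blocks=8, softmax_every=4)
x = rng.standard_normal((16, 8))  # 16 tokens, 8 channels
out = hybrid_forward(x, layout)
print(layout, out.shape)
```

The linear path never forms the token-by-token score matrix, which is where its efficiency comes from; the softmax blocks pay the quadratic cost only where the layout places them.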

Original Article

Cached at: 06/01/26, 03:17 AM

Source: https://huggingface.co/papers/2605.30409 Published on May 28

Submitted by Yuyang (https://huggingface.co/Yuyang-z) on Jun 1

Abstract

SANA-Streaming enables real-time high-resolution video-to-video editing through a hybrid diffusion transformer architecture, cycle-reverse regularization, and efficient system co-design optimized for consumer GPUs.

Real-time streaming video-to-video editing (V2V) is critical for interactive applications such as live broadcasting and gaming, yet it remains a formidable challenge due to the stringent requirements for temporal consistency and inference throughput. In this paper, we present SANA-Streaming, a system-algorithm co-designed framework for high-resolution, real-time streaming video editing on consumer GPUs, with the following three core designs: (1) Hybrid Diffusion Transformer architecture introduces softmax attention in part of the blocks to improve local modeling capabilities while preserving the efficiency of linear layers. (2) Cycle-Reverse Regularization is a novel training strategy that enforces semantic consistency by predicting source frames from generated content via flow matching, improving temporal consistency without requiring paired long edited videos. (3) Efficient System Co-design combines fused GDN kernels and Mixed-Precision Quantization (MPQ) optimized for the NVIDIA Blackwell (RTX 5090) architecture. By profiling real-world throughput, our MPQ maximizes Tensor Core utilization while maintaining generation quality. The resulting system achieves real-time 1280 x 704 resolution editing at 24 end-to-end FPS on a single RTX 5090 GPU, with the DiT core running at 58 FPS. Experimental results demonstrate that our co-design approach significantly outperforms existing SOTA methods in both temporal coherence and system throughput.
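One way to read design (2) is as an ordinary flow-matching regression run in the reverse direction: condition on the edited frames and regress a velocity field whose endpoint is the source frames. The sketch below is an interpretation of the abstract, not the paper's training code. The interpolation convention (x_t = (1 - t) * noise + t * target), the function names, and the toy velocity functions are all assumptions.

```python
import numpy as np


def flow_matching_loss(velocity_fn, target, condition, rng):
    """Flow-matching loss for one random time step.

    Assumed convention: x_t = (1 - t) * noise + t * target, so the
    regression target for the velocity is (target - noise).
    """
    noise = rng.standard_normal(target.shape)
    t = rng.uniform()  # in [0, 1)
    x_t = (1.0 - t) * noise + t * target
    v_target = target - noise
    v_pred = velocity_fn(x_t, t, condition)
    return float(np.mean((v_pred - v_target) ** 2))


def cycle_reverse_loss(velocity_fn, source_frames, edited_frames, rng):
    """Reverse direction: predict the source, conditioned on the edit."""
    return flow_matching_loss(
        velocity_fn, target=source_frames, condition=edited_frames, rng=rng
    )


rng = np.random.default_rng(0)
source = rng.standard_normal((4, 8, 8))  # toy clip: 4 frames of 8x8


def oracle(x_t, t, condition):
    # Ideal velocity when the condition already equals the target
    # (an identity "edit"): (target - x_t) / (1 - t) = target - noise.
    return (condition - x_t) / (1.0 - t)


def zero_velocity(x_t, t, condition):
    return np.zeros_like(x_t)


loss_oracle = cycle_reverse_loss(oracle, source, source.copy(), rng)
loss_zero = cycle_reverse_loss(zero_velocity, source, source.copy(), rng)
print(loss_oracle, loss_zero)
```

In this toy setup the loss only needs the source clip and the model's own edited output, which matches the abstract's point that no paired long edited videos are required.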

Get this paper in your agent:

hf papers read 2605.30409

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash
