SANA-Streaming: Real-time Streaming Video Editing with Hybrid Diffusion Transformer
Summary
SANA-Streaming enables real-time high-resolution video-to-video editing on consumer GPUs using a hybrid diffusion transformer architecture, cycle-reverse regularization, and efficient system co-design, achieving 24 FPS at 1280x704 resolution on a single RTX 5090.
Source: https://huggingface.co/papers/2605.30409 Published on May 28
Submitted by Yuyang (https://huggingface.co/Yuyang-z) on Jun 1
Abstract
SANA-Streaming enables real-time high-resolution video-to-video editing through a hybrid diffusion transformer architecture, cycle-reverse regularization, and efficient system co-design optimized for consumer GPUs.
Real-time streaming video-to-video editing (V2V) is critical for interactive applications such as live broadcasting and gaming, yet it remains a formidable challenge due to the stringent requirements for temporal consistency and inference throughput. In this paper, we present SANA-Streaming, a system-algorithm co-designed framework for high-resolution, real-time streaming video editing on consumer GPUs, with the following three core designs: (1) Hybrid Diffusion Transformer architecture introduces softmax attention in part of the blocks to improve local modeling capabilities while preserving the efficiency of linear layers. (2) Cycle-Reverse Regularization is a novel training strategy that enforces semantic consistency by predicting source frames from generated content via flow matching, improving temporal consistency without requiring paired long edited videos. (3) Efficient System Co-design combines fused GDN kernels and Mixed-Precision Quantization (MPQ) optimized for the NVIDIA Blackwell (RTX 5090) architecture. By profiling real-world throughput, our MPQ maximizes Tensor Core utilization while maintaining generation quality. The resulting system achieves real-time 1280 x 704 resolution editing at 24 end-to-end FPS on a single RTX 5090 GPU, with the DiT core running at 58 FPS. Experimental results demonstrate that our co-design approach significantly outperforms existing SOTA methods in both temporal coherence and system throughput.
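To put the reported throughput figures in perspective, the following is a rough back-of-the-envelope sketch of the per-frame time budget they imply. It uses only the numbers stated in the abstract (24 end-to-end FPS, 58 FPS for the DiT core, 1280 x 704 resolution). The "remaining stages" bucket and the assumption that stages run strictly one after another are illustrative simplifications, not details taken from the paper; a pipelined system would not split time this way.

```python
# Back-of-the-envelope frame budget from the throughput figures in the abstract.
# Assumption (not from the paper): stages run sequentially per frame, with no
# pipelining or overlap, so per-frame latencies simply add up.

END_TO_END_FPS = 24.0    # reported end-to-end throughput
DIT_FPS = 58.0           # reported DiT-core throughput
WIDTH, HEIGHT = 1280, 704


def ms_per_frame(fps: float) -> float:
    """Convert a throughput in frames per second to milliseconds per frame."""
    return 1000.0 / fps


total_ms = ms_per_frame(END_TO_END_FPS)
dit_ms = ms_per_frame(DIT_FPS)
# Hypothetical bucket for everything outside the DiT core (e.g. encode/decode, I/O).
other_ms = total_ms - dit_ms
pixels_per_second = WIDTH * HEIGHT * END_TO_END_FPS

print(f"end-to-end budget: {total_ms:.1f} ms/frame")
print(f"DiT core:          {dit_ms:.1f} ms/frame")
print(f"remaining stages:  {other_ms:.1f} ms/frame")
print(f"pixel throughput:  {pixels_per_second / 1e6:.1f} Mpx/s")
```

Under these assumptions the budget is about 41.7 ms per frame, of which the DiT core would account for about 17.2 ms, leaving roughly 24.4 ms for everything else.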