SANA-Streaming enables real-time high-resolution video-to-video editing on consumer GPUs using a hybrid diffusion transformer architecture, cycle-reverse regularization, and efficient system co-design, achieving 24 FPS at 1280x704 resolution on a single RTX 5090.
LatentOmni proposes a cross-modal reasoning framework that interleaves textual reasoning with audio-visual latent states, outperforming explicit text-based chain-of-thought methods in audio-visual reasoning tasks.
An inference-time method for long video generation that combines overlapping sliding windows with Tweedie matching and stochastic early-phase sampling, improving temporal consistency and visual quality without additional training.
MIGA is a training-free method for generating consistent long videos that narrows the training-inference gap and strengthens temporal consistency through dual consistency mechanisms.
Stream-T1 is a framework for test-time scaling in streaming video generation that improves temporal consistency and quality through mechanisms such as noise propagation and reward pruning. It addresses the high computational cost of existing diffusion-based methods by using chunk-level synthesis.
ReImagine introduces an image-first approach to controllable high-quality human video generation, combining SMPL-X motion guidance with video diffusion models to decouple appearance from temporal consistency.