MaineCoon is a 22B real-time text-to-audio-video model that achieves up to 47.5 FPS on a single H100 GPU, enabling low-cost, long-duration streaming with synchronized speech and visuals for live AI characters.
MeshFlow introduces an equivariant optimal-transport flow matching model for direct triangle mesh generation, achieving state-of-the-art quality while providing approximately 18x inference speedup over autoregressive methods.
Go-with-the-Track unifies motion control and reference image compositing in video generation using point-track embeddings with spatial-aware encoding and video diffusion transformers, achieving superior motion and reference control in a single model.
TimeMoDE combines Diffusion Transformers with Mixture-of-Experts to generate realistic time series under data scarcity. It pre-trains on multi-domain datasets, uses domain prompts to handle domain-specific features, and uses diffusion timestep signals for adaptive denoising.
PAIWorld enhances diffusion-transformer world models with geometric awareness and cross-view attention to improve multi-view 3D consistency for robotic manipulation tasks, achieving state-of-the-art results on benchmarks.
UniDDT proposes a decoupled diffusion transformer framework that unifies multimodal understanding and generation by leveraging a Noisy ViT encoder and LLM for semantic encoding, achieving strong performance on both tasks.
Qwen-RobotWorld is a language-conditioned video world model that predicts future visual trajectories across multiple robotic domains using a double-stream diffusion transformer and an 8.6M video-text corpus. It unifies embodied world modeling for robotic manipulation, autonomous driving, indoor navigation, and human-to-robot transfer, achieving top results on the EWMBench and DreamGen Bench benchmarks.
DreamX-World 1.0 is a general-purpose interactive text/image-to-video world model that supports camera navigation, scene persistence, and promptable events across multiple domains, using novel techniques like E-PRoPE, causal forcing, and memory-conditioned scene persistence to achieve controllable long-horizon generation.
This paper introduces a new task, reference-guided generated content super-resolution-refinement (RefGC-SR²), which simultaneously recovers high-resolution details and refines generative artifacts using a frequency-aware diffusion transformer model. The method leverages a high-resolution reference image to improve the quality of AI-generated images during post-processing.
RepFusion proposes using multimodal large language models as noisy representation encoders for diffusion transformers in text-to-image generation, outperforming traditional denoising approaches.
World Tracing introduces a generative pixel-aligned geometry representation that predicts 3D points aligned with observed pixels while completing occluded surfaces. It uses a diffusion transformer trained with pixel-space flow matching, achieving strong performance on visible-surface reconstruction and complete geometry generation across object, scene, and dynamic benchmarks.
AHA-WAM is an asynchronous world-action model that uses dual Diffusion Transformers to decouple world prediction from action execution, achieving efficient long-horizon planning and real-time control. It achieves state-of-the-art performance on robotic manipulation tasks with up to 92.8% success on RoboTwin and 78.3% on real-world tasks, while reaching 24.17 Hz closed-loop control.
LoomVideo introduces a 5B-parameter unified architecture for video generation and editing that reduces computational overhead using novel conditioning mechanisms and multi-modal alignment, achieving competitive performance and faster inference.
Echo-Infinity introduces a learnable evolving memory mechanism for autoregressive video generation, enabling real-time infinite video generation with constant memory cost and state-of-the-art performance.
WavTTS presents the first raw waveform generative text-to-speech model using flow matching and Diffusion Transformer, achieving performance comparable to latent-space diffusion models while avoiding information loss from compressed representations.
This paper demonstrates that text-to-image diffusion transformer models primarily rely on token merging and word order from text encoders rather than full contextual embeddings, suggesting that the image model itself decodes complex linguistic structures.
SwanSphere proposes a unified streaming framework for high-fidelity spatial audio generation from panoramic videos and text prompts using causal autoregressive diffusion transformers and multimodal learning strategies, achieving superior performance in both video-to-spatial and text-to-spatial audio tasks.
SANA-Streaming enables real-time high-resolution video-to-video editing on consumer GPUs using a hybrid diffusion transformer architecture, cycle-reverse regularization, and efficient system co-design, achieving 24 FPS at 1280x704 resolution on a single RTX 5090.
StreamChar is a streaming framework for real-time audio-video generation of character animation, using an LLM orchestrator and joint audio-video DiT with two-stage distillation and memory mechanisms to maintain long-horizon consistency and visual quality.
Prism ML releases Bonsai Image, a 1.21 GB text-to-image diffusion transformer with ternary (1.58-bit) weights for NVIDIA GPUs. It generates a 1024² image in 4.5 s on an RTX 3080 and is much smaller than an equivalent FP16 model.
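The 1.58-bit figure quoted for ternary weights matches the information content of a three-valued symbol, log2(3). The sketch below works through that arithmetic and the idealized size ratio against 16-bit weights. It is an illustration only: the function names are hypothetical, it counts weight storage alone, and it makes no claim about how Bonsai Image actually packs or stores its parameters.

```python
import math


def bits_per_ternary_weight() -> float:
    """Information content of one weight drawn from three values, e.g. {-1, 0, +1}."""
    return math.log2(3)


def fp16_size_ratio(bits_per_weight: float) -> float:
    """Idealized storage ratio of 16-bit weights to weights of the given width."""
    return 16.0 / bits_per_weight


bits = bits_per_ternary_weight()
ratio = fp16_size_ratio(bits)
print(f"{bits:.2f} bits per weight, about {ratio:.1f}x smaller than FP16")
# prints "1.58 bits per weight, about 10.1x smaller than FP16"
```

In practice the ratio for a full checkpoint is usually lower than this ideal, since simple packing schemes often spend 2 bits per ternary weight and some tensors may stay in higher precision.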