diffusion-transformer

#diffusion-transformer

@rohanpaul_ai: AI video is moving into its real-time reaction era, with MaineCoon now leading in low-latency AI video. @catnips_ai jus…

X AI KOLs Following ↗ · yesterday Cached

MaineCoon is a 22B real-time text-to-audio-video model that achieves up to 47.5 FPS on a single H100 GPU, enabling low-cost, long-duration streaming with synchronized speech and visuals for live AI characters.


MeshFlow: Mesh Generation with Equivariant Flow Matching

Hugging Face Daily Papers ↗ · 3d ago Cached

MeshFlow introduces an equivariant optimal-transport flow matching model for direct triangle mesh generation, achieving state-of-the-art quality while providing approximately 18x inference speedup over autoregressive methods.


Go-with-the-Track: Video Compositing and Motion Control with Point Tracking

Hugging Face Daily Papers ↗ · 2026-06-18 Cached

Go-with-the-Track unifies motion control and reference image compositing in video generation using point-track embeddings with spatial-aware encoding and video diffusion transformers, achieving superior motion and reference control in a single model.


Towards a Unified Generative Model for Scarce Time Series with Domain Experts

arXiv cs.LG ↗ · 2026-06-16 Cached

Introduces TimeMoDE, a framework that combines Diffusion Transformers with Mixture-of-Experts to generate realistic time series under data scarcity. It pre-trains on multi-domain datasets, uses domain prompts to handle domain-specific features, and uses diffusion timestep signals for adaptive denoising.


PAIWorld: A 3D-Consistent World Foundation Model for Robotic Manipulation

Hugging Face Daily Papers ↗ · 2026-06-16 Cached

PAIWorld enhances diffusion-transformer world models with geometric awareness and cross-view attention to improve multi-view 3D consistency for robotic manipulation tasks, achieving state-of-the-art results on benchmarks.


UniDDT: Unifying Multimodal Understanding and Generation with Decoupled Diffusion Transformer

Hugging Face Daily Papers ↗ · 2026-06-15 Cached

UniDDT proposes a decoupled diffusion transformer framework that unifies multimodal understanding and generation by leveraging a Noisy ViT encoder and LLM for semantic encoding, achieving strong performance on both tasks.


Qwen-RobotWorld Technical Report: Unifying Embodied World Modeling through Language-Conditioned Video Generation

Hugging Face Daily Papers ↗ · 2026-06-15 Cached

Qwen-RobotWorld is a language-conditioned video world model that predicts future visual trajectories across multiple robotic domains using a double-stream diffusion transformer and an 8.6M video-text corpus. It unifies embodied world modeling for robotic manipulation, autonomous driving, indoor navigation, and human-to-robot transfer, achieving top results on the EWMBench and DreamGen Bench benchmarks.


DreamX-World 1.0: A General-Purpose Interactive World Model

Hugging Face Daily Papers ↗ · 2026-06-15 Cached

DreamX-World 1.0 is a general-purpose interactive text/image-to-video world model that supports camera navigation, scene persistence, and promptable events across multiple domains, using novel techniques like E-PRoPE, causal forcing, and memory-conditioned scene persistence to achieve controllable long-horizon generation.


RefGC-SR^2: Reference-guided Generated Content Super-Resolution and Refinement

Hugging Face Daily Papers ↗ · 2026-06-13 Cached

This paper introduces a new task, reference-guided generated content super-resolution-refinement (RefGC-SR²), which simultaneously recovers high-resolution details and refines generative artifacts using a frequency-aware diffusion transformer model. The method leverages a high-resolution reference image to improve the quality of AI-generated images during post-processing.


RepFusion: Leveraging Multimodal Priors for Denoising in Representation Space

Hugging Face Daily Papers ↗ · 2026-06-12 Cached

RepFusion proposes using multimodal large language models as noisy representation encoders for diffusion transformers in text-to-image generation, outperforming traditional denoising approaches.


World Tracing: Generative Pixel-Aligned Geometry Beyond the Visible

Hugging Face Daily Papers ↗ · 2026-06-11 Cached

World Tracing introduces a generative pixel-aligned geometry representation that predicts 3D points aligned with observed pixels while completing occluded surfaces. It uses a diffusion transformer trained with pixel-space flow matching, achieving strong performance on visible-surface reconstruction and complete geometry generation across object, scene, and dynamic benchmarks.


AHA-WAM: Asynchronous Horizon-Adaptive World-Action Modeling with Observation-Guided Context Routing

Hugging Face Daily Papers ↗ · 2026-06-08 Cached

AHA-WAM is an asynchronous world-action model that uses dual Diffusion Transformers to decouple world prediction from action execution, enabling efficient long-horizon planning and real-time control. It reports state-of-the-art performance on robotic manipulation tasks, with up to 92.8% success on RoboTwin and 78.3% on real-world tasks, while reaching 24.17 Hz closed-loop control.


LoomVideo: Unifying Multimodal Inputs into Video Generation and Editing

Hugging Face Daily Papers ↗ · 2026-06-04 Cached

LoomVideo introduces a 5B-parameter unified architecture for video generation and editing that reduces computational overhead using novel conditioning mechanisms and multi-modal alignment, achieving competitive performance and faster inference.


Echo-Infinity: Learning Evolving Memory for Real-Time Infinite Video Generation

Hugging Face Daily Papers ↗ · 2026-06-03 Cached

Echo-Infinity introduces a learnable evolving memory mechanism for autoregressive video generation, enabling real-time infinite video generation with constant memory cost and state-of-the-art performance.


WavTTS: Towards High-Quality Zero-Shot TTS via Direct Raw Waveform Modeling

Papers with Code Trending ↗ · 2026-06-02 Cached

WavTTS presents the first raw waveform generative text-to-speech model using flow matching and Diffusion Transformer, achieving performance comparable to latent-space diffusion models while avoiding information loss from compressed representations.


Text-to-Image Models Need Less from Text Encoders Than You Think

Hugging Face Daily Papers ↗ · 2026-06-02 Cached

This paper demonstrates that text-to-image diffusion transformer models primarily rely on token merging and word order from text encoders rather than full contextual embeddings, suggesting that the image model itself decodes complex linguistic structures.


Towards Streaming Synchronized Spatial Audio Generation via Autoregressive Diffusion Transformer

Hugging Face Daily Papers ↗ · 2026-05-29 Cached

SwanSphere proposes a unified streaming framework for high-fidelity spatial audio generation from panoramic videos and text prompts using causal autoregressive diffusion transformers and multimodal learning strategies, achieving superior performance in both video-to-spatial and text-to-spatial audio tasks.


SANA-Streaming: Real-time Streaming Video Editing with Hybrid Diffusion Transformer

Hugging Face Daily Papers ↗ · 2026-05-28 Cached

SANA-Streaming enables real-time high-resolution video-to-video editing on consumer GPUs using a hybrid diffusion transformer architecture, cycle-reverse regularization, and efficient system co-design, achieving 24 FPS at 1280x704 resolution on a single RTX 5090.


StreamChar: Long-Horizon Streaming Character Audio-Video Generation with Decoupled Orchestration

Hugging Face Daily Papers ↗ · 2026-05-25 Cached

StreamChar is a streaming framework for real-time audio-video generation of character animation, using an LLM orchestrator and joint audio-video DiT with two-stage distillation and memory mechanisms to maintain long-horizon consistency and visual quality.


prism-ml/bonsai-image-ternary-4B-gemlite-2bit

Hugging Face Models Trending ↗ · 2026-05-21 Cached

Prism ML releases Bonsai Image, a 1.21 GB text-to-image diffusion transformer with ternary (1.58-bit) weights for NVIDIA GPUs. It reports about 4.5 s per 1024² image on an RTX 3080 and is much smaller than an FP16 version.
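
The "1.58-bit" figure matches the information content of a weight that can take one of three values. A short worked note, assuming the usual ternary convention of weights in {-1, 0, +1} (the summary does not state the value set):

```latex
\[
  \text{bits per ternary weight} \;=\; \log_2 3 \;\approx\; 1.585
\]
```

This is an information-theoretic lower bound, not necessarily the on-disk format. The "2bit" in the repository name suggests the weights may be packed at 2 bits each, though the summary does not confirm this.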
