CoInteract introduces an end-to-end Diffusion Transformer framework that jointly models RGB appearance and HOI geometry to generate physically plausible human-object interaction videos with stable hands and faces, at zero extra inference cost.
HiCoDiT is a novel Hierarchical Codec Diffusion Transformer for video-to-speech generation that leverages the hierarchical structure of discrete speech tokens from RVQ-based codecs, using coarse-to-fine conditioning with dual-scale normalization to achieve strong audio-visual alignment.
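The coarse-to-fine token hierarchy here comes from residual vector quantization (RVQ), where each codebook stage quantizes the residual left by the previous one. A minimal illustrative sketch of generic RVQ encoding/decoding (random codebooks and sizes are placeholders, not HiCoDiT's actual codec):

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Residual vector quantization: each stage quantizes the residual
    left by the previous stage, so tokens go coarse -> fine."""
    residual = x.copy()
    tokens = []
    for cb in codebooks:                          # cb: (codebook_size, dim)
        dists = np.linalg.norm(residual[None, :] - cb, axis=1)
        idx = int(np.argmin(dists))               # nearest code at this stage
        tokens.append(idx)
        residual = residual - cb[idx]             # pass residual onward
    return tokens

def rvq_decode(tokens, codebooks):
    """Sum the selected codes across stages to reconstruct the vector."""
    return sum(cb[idx] for idx, cb in zip(tokens, codebooks))

rng = np.random.default_rng(0)
# 4 hierarchical stages, 256 codes each (illustrative sizes)
codebooks = [rng.normal(size=(256, 8)) for _ in range(4)]
x = rng.normal(size=8)
tokens = rvq_encode(x, codebooks)   # one token per stage, coarse to fine
x_hat = rvq_decode(tokens, codebooks)
```

A hierarchical model can then condition later (fine) token predictions on earlier (coarse) ones, which is the structure the summary's coarse-to-fine conditioning exploits.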
HiVLA introduces a hierarchical vision-language-action framework that decouples semantic planning from motor control using a diffusion transformer action expert for improved robotic manipulation. The system combines a VLM planner for task decomposition and visual grounding with a specialized DiT action expert using cascaded cross-attention, outperforming end-to-end baselines, particularly in long-horizon tasks and fine-grained manipulation.
OneHOI is a unified diffusion transformer framework that consolidates human-object interaction (HOI) generation and editing into a single conditional denoising process using relational modeling and structured attention mechanisms. The approach achieves state-of-the-art results across both HOI generation and editing tasks with support for multiple control modalities.
Baidu releases ERNIE-Image, an open-weight text-to-image generation model with 8B parameters built on a Diffusion Transformer architecture, achieving state-of-the-art performance among open-weight models with strong capabilities in text rendering, instruction following, and structured image generation.
Baidu releases ERNIE-Image-Turbo, a distilled text-to-image generation model that achieves fast generation in 8 inference steps while maintaining strong text rendering, instruction following, and structured image generation capabilities.
Nucleus-Image is an open-source text-to-image diffusion transformer with 17B parameters across 64 routed experts, activating only ~2B parameters per forward pass. It matches or exceeds leading models such as Qwen-Image and Imagen4 while maintaining high efficiency, and is released with full model weights, training code, and dataset.
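The "64 routed experts, ~2B active" pattern is standard top-k sparse Mixture-of-Experts routing: a gate scores all experts per token but only the top-k are evaluated. A minimal generic sketch (dimensions, k, and the linear "experts" are illustrative stand-ins, not Nucleus-Image's actual layers):

```python
import numpy as np

def moe_forward(x, gate_w, experts, k=2):
    """Sparse MoE: route a token to its top-k experts by gate logits,
    so only a small fraction of total parameters is active per token."""
    logits = x @ gate_w                            # (num_experts,)
    top = np.argsort(logits)[-k:]                  # indices of top-k experts
    weights = np.exp(logits[top] - logits[top].max())
    weights /= weights.sum()                       # softmax over selected experts
    # combine only the chosen experts' outputs, weighted by gate probabilities
    return sum(w * experts[i](x) for w, i in zip(weights, top))

rng = np.random.default_rng(0)
d, num_experts = 16, 64
gate_w = rng.normal(size=(d, num_experts))
# each "expert" here is a small linear map; in a DiT they would be MLP blocks
expert_mats = [rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(num_experts)]
experts = [lambda x, W=W: x @ W for W in expert_mats]
x = rng.normal(size=d)
y = moe_forward(x, gate_w, experts, k=2)
```

With k=2 of 64 experts active, the per-token compute tracks the active-parameter count rather than the full 17B, which is how such models stay efficient at inference.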