Tag
This paper systematically compares reconstruction-based and semantic latent spaces for action-conditioned latent diffusion world models in robotics. It finds that semantic encoders such as V-JEPA 2.1 generally outperform reconstruction encoders on policy-relevant metrics, and it advocates semantic latent spaces as a stronger foundation for robotics world models.
NVIDIA introduces PiD, a Pixel Diffusion Decoder that replaces traditional VAE/RAE decoders in latent diffusion models, enabling fast, high-resolution decoding with up to 6× speedup and improved visual fidelity.
NVIDIA Spatial Intelligence Lab proposes PiD, which redesigns the decoding stage of latent diffusion models as a conditional pixel diffusion process, unifying decoding and upsampling to achieve low-latency, high-resolution decoding.
This paper proposes AirfoilGen, a latent diffusion model for airfoil shape generation that ensures geometric validity via a circle sweeping representation and enables control over aerodynamic performance (lift and drag coefficients). Experiments on a new dataset of over 200,000 airfoils show 98.41% performance-conditioning accuracy.
This paper identifies a collapse-and-refine mechanism in diffusion models under the manifold hypothesis, proposing Score-induced Latent Diffusion (SiLD) that provably avoids the curse of dimensionality. Experiments show SiLD matches or outperforms VAE-based latent diffusion models.
Stable Audio 3 introduces a family of fast latent diffusion models for variable-length audio generation and editing, with the small and medium model weights released as open source.
This technical report investigates draft-conditioned latent refinement for non-autoregressive text generation, showing that good latent geometry does not guarantee good decoding and emphasizing decoder recoverability as a key evaluation metric.
ByteDance releases Cola-DLM, a hierarchical continuous latent-space diffusion language model combining a Text VAE with a block-causal Diffusion Transformer, available on Hugging Face with model weights, code, and paper.
This paper introduces DAWN, a latent generative baseline for World-Action Interactive Models (WAIMs) that jointly models scene evolution and action generation through recursive refinement, achieving strong long-horizon planning in autonomous driving scenarios.
The L2P paper introduces a Latent-to-Pixel transfer paradigm that leverages pre-trained latent diffusion models to create efficient pixel-space models capable of 4K generation with minimal training overhead.
This article introduces Prior-Aligned Autoencoders (PAE), a new method for creating diffusion-friendly latent manifolds that achieves state-of-the-art image generation quality while enabling 13× faster training convergence.
This paper introduces TextLDM, a method that adapts visual latent diffusion transformers for language modeling by mapping discrete tokens to continuous latents. It demonstrates that this approach, enhanced by representation alignment, matches GPT-2 performance and unifies visual and text generation architectures.
L2P proposes an efficient transfer paradigm that leverages pre-trained latent diffusion models to build pixel-space diffusion models, enabling high-quality generation with minimal computational overhead and data requirements, and supporting native 4K resolution.
This Hugging Face repository provides workflows and model downloads for Lightricks' LTX-2.3 video generation model, designed for use with ComfyUI, including split models, GGUF versions, and required custom nodes.