L2P: Unlocking Latent Potential for Pixel Generation

Hugging Face Daily Papers Papers

Summary

The L2P paper introduces a Latent-to-Pixel transfer paradigm that leverages pre-trained latent diffusion models to create efficient pixel-space models capable of 4K generation with minimal training overhead.

Pixel diffusion models have recently regained attention for visual generation. However, training advanced pixel-space models from scratch demands prohibitive computational and data resources. To address this, we propose the Latent-to-Pixel (L2P) transfer paradigm, an efficient framework that directly harnesses the rich knowledge of pre-trained LDMs to build powerful pixel-space models. Specifically, L2P discards the VAE in favor of large-patch tokenization and freezes the source LDM's intermediate layers, exclusively training shallow layers to learn the latent-to-pixel transformation. By utilizing LDM-generated synthetic images as the sole training corpus, L2P fits an already smooth data manifold, enabling rapid convergence with zero real-data collection. This strategy allows L2P to seamlessly migrate massive latent priors to the pixel space using only 8 GPUs. Furthermore, eliminating the VAE memory bottleneck unlocks native 4K ultra-high resolution generation. Extensive experiments across mainstream LDM architectures show that L2P incurs negligible training overhead, yet performs on par with the source LDM on DPG-Bench and reaches 93% performance on GenEval.
Original Article
View Cached Full Text

Cached at: 05/13/26, 08:11 AM

Paper page - L2P: Unlocking Latent Potential for Pixel Generation

Source: https://huggingface.co/papers/2605.12013 Published on May 12

·

Submitted byhttps://huggingface.co/zhen-nan

chenon May 13

Abstract

Latent-to-Pixel transfer paradigm efficiently leverages pre-trained latent diffusion models to create pixel-space models with minimal training overhead and high-resolution generation capabilities.

Pixel diffusion modelshave recently regained attention for visual generation. However, training advanced pixel-space models from scratch demands prohibitive computational and data resources. To address this, we propose the Latent-to-Pixel (L2P) transfer paradigm, an efficient framework that directly harnesses the rich knowledge of pre-trained LDMs to build powerful pixel-space models. Specifically, L2P discards theVAEin favor oflarge-patch tokenizationand freezes the source LDM’sintermediate layers, exclusively trainingshallow layersto learn the latent-to-pixel transformation. By utilizing LDM-generatedsynthetic imagesas the sole training corpus, L2P fits an already smoothdata manifold, enabling rapid convergence with zero real-data collection. This strategy allows L2P to seamlessly migrate massive latent priors to the pixel space using only 8 GPUs. Furthermore, eliminating theVAEmemory bottleneck unlocks native4K ultra-high resolution generation. Extensive experiments across mainstream LDM architectures show that L2P incurs negligible training overhead, yet performs on par with the source LDM onDPG-Benchand reaches 93% performance onGenEval.

View arXiv pageView PDFProject pageGitHub8Add to collection

Get this paper in your agent:

hf papers read 2605\.12013

Don’t have the latest CLI?curl \-LsSf https://hf\.co/cli/install\.sh \| bash

Models citing this paper0

No model linking this paper

Cite arxiv.org/abs/2605.12013 in a model README.md to link it from this page.

Datasets citing this paper0

No dataset linking this paper

Cite arxiv.org/abs/2605.12013 in a dataset README.md to link it from this page.

Spaces citing this paper0

No Space linking this paper

Cite arxiv.org/abs/2605.12013 in a Space README.md to link it from this page.

Collections including this paper0

No Collection including this paper

Add this paper to acollectionto link it from this page.

Similar Articles

zhen-nan/L2P

Hugging Face Models Trending

L2P proposes an efficient transfer paradigm that leverages pre-trained latent diffusion models to build pixel-space diffusion models, enabling high-quality generation with minimal computational overhead and data requirements, and supporting native 4K resolution.

PiD: Fast and High-Resolution Latent Decoding with Pixel Diffusion

Hugging Face Daily Papers

PiD introduces a pixel diffusion decoder that reformulates latent decoding as conditional pixel diffusion, enabling fast and high-quality image synthesis at high resolutions with reduced computational requirements. It decodes latents into 4x or 8x upscaled images in under a second on consumer hardware.

@FeitengLi: NVIDIA Spatial Intelligence Lab proposes PiD, redesigning the decoding stage in latent diffusion models. Current mainstream text-to-image generation happens in latent space, then uses a VAE decoder to map back to pixels. This decoder's…

X AI KOLs Timeline

NVIDIA Spatial Intelligence Lab proposes PiD, which redesigns the decoding stage of latent diffusion models as a conditional pixel diffusion process, unifying decoding and upsampling to achieve low-latency, high-resolution decoding.

nvidia/PiD

Hugging Face Models Trending

NVIDIA releases PiD (Pixel Diffusion Decoder), a conditional pixel-space diffusion model that unifies latent-to-pixel decoding and upsampling into one generative module, producing super-resolved images in one pass. Model checkpoints and VAE weights are released under a non-commercial license.