PiD: Fast and High-Resolution Latent Decoding with Pixel Diffusion

Hugging Face Daily Papers

Summary

PiD introduces a pixel diffusion decoder that reformulates latent decoding as conditional pixel diffusion, enabling fast, high-quality image synthesis at high resolutions with reduced computational requirements. It decodes latents into 4× or 8× upscaled images; for example, latents of 512 × 512 images become 2048 × 2048 pixels in under a second on a consumer RTX 5090.
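To make the upscaling factors concrete, here is a small sketch of the pixel-count arithmetic. It assumes a square 512 × 512 base image, the size used in the abstract's benchmark; the helper name `upscaled_size` is hypothetical and not part of any PiD release.

```python
def upscaled_size(base_side: int, factor: int) -> tuple[int, int]:
    """Return (output side length, total output pixels) for a square image."""
    side = base_side * factor
    return side, side * side


# Assumption: a 512 x 512 base image, as in the abstract's benchmark.
for factor in (4, 8):
    side, pixels = upscaled_size(512, factor)
    print(f"{factor}x: {side} x {side} = {pixels / 1e6:.1f} megapixels")
```

Because the pixel count grows with the square of the factor, 4× upscaling means 16 times as many pixels as the base image and 8× means 64 times as many.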

Most practical high-resolution text-to-image systems, including latent diffusion and autoregressive models, perform generation in a compact latent space, and a decoder maps the generated latents back to pixels. Yet the latent-to-pixel decoder is reconstruction-oriented, optimized to invert the encoder rather than synthesize more details, and becomes increasingly costly at megapixel scale. This drawback calls for a more expressive and efficient decoding paradigm. Motivated by recent progress in scalable pixel-space diffusion, we introduce PiD, a Pixel diffusion Decoder that reformulates latent decoding as conditional pixel diffusion, unifying decoding and upsampling into one generative module. By denoising directly in high-resolution pixel space, PiD synthesizes 4× and even 8× upscaled images with low latency. For latent conditioning, a lightweight sigma-aware adapter injects noise-corrupted latents into the pixel diffusion backbone, enabling PiD to decode partially denoised latents and terminate the latent diffusion process early. To further improve efficiency, we distill the model using DMD2, reducing inference to just 4 steps. PiD applies to both conventional VAE latents and semantic latents (e.g., SigLIP, DINOv2) used in recent RAE-based models. PiD decodes latents of 512 × 512 images into 2048 × 2048 pixels in under 1 second with 13 GB peak memory on a consumer RTX 5090, and as fast as 210 ms on a GB200 GPU, about 6× faster than cascaded diffusion-based super-resolution pipelines with better visual fidelity.
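The control flow the abstract describes (start from noise in high-resolution pixel space, condition on a possibly noisy latent, finish in a few steps) can be illustrated with a toy sketch. This is not the authors' implementation. Every name here (`corrupt_latent`, `toy_denoiser`, `sample_pixels`) is hypothetical, the sigma schedule is made up, the "latent" is simply treated as a low-resolution 3-channel array, and the trained backbone and adapter are replaced by a stub that returns the nearest-neighbour upsampled latent.

```python
import numpy as np


def corrupt_latent(latent, sigma_latent, rng):
    """Add Gaussian noise of scale sigma_latent, mimicking a partially denoised latent."""
    return latent + sigma_latent * rng.standard_normal(latent.shape)


def toy_denoiser(x_noisy, sigma, cond_latent, sigma_latent, upscale):
    """Hypothetical stand-in for the pixel diffusion backbone plus adapter.

    A real model would be a trained network using all of its inputs. This stub
    ignores x_noisy, sigma and sigma_latent, and returns the nearest-neighbour
    upsampled conditioning latent as its clean-image estimate.
    """
    return cond_latent.repeat(upscale, axis=0).repeat(upscale, axis=1)


def sample_pixels(cond_latent, sigmas, upscale, sigma_latent=0.0, seed=0):
    """Few-step sampling: predict a clean image, then re-noise to the next sigma."""
    rng = np.random.default_rng(seed)
    cond = corrupt_latent(cond_latent, sigma_latent, rng)
    h, w, c = cond.shape
    # Start from pure noise at the target (upscaled) resolution.
    x = sigmas[0] * rng.standard_normal((h * upscale, w * upscale, c))
    for i, sigma in enumerate(sigmas):
        x0 = toy_denoiser(x, sigma, cond, sigma_latent, upscale)
        # After the last step no noise is added back, so x equals the estimate.
        sigma_next = sigmas[i + 1] if i + 1 < len(sigmas) else 0.0
        x = x0 + sigma_next * rng.standard_normal(x0.shape)
    return x


# Toy usage: an 8 x 8 "latent", 4 steps (assumed schedule), 4x upscaling.
latent = np.arange(8 * 8 * 3, dtype=float).reshape(8, 8, 3)
image = sample_pixels(latent, sigmas=[1.0, 0.6, 0.3, 0.1], upscale=4)
print(image.shape)  # (32, 32, 3)
```

The loop calls the denoiser once per entry in `sigmas`, so a four-entry schedule corresponds to the 4-step inference mentioned above. Passing a nonzero `sigma_latent` mimics handing the decoder a latent whose own diffusion process was stopped early.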

Cached at: 05/25/26, 02:35 AM

Source: https://huggingface.co/papers/2605.23902

Models citing this paper: 1

nvidia/PiD (Image-to-Image)

Similar Articles

nvidia/PiD

Hugging Face Models Trending

NVIDIA releases PiD (Pixel Diffusion Decoder), a conditional pixel-space diffusion model that unifies latent-to-pixel decoding and upsampling into one generative module, producing super-resolved images in one pass. Model checkpoints and VAE weights are released under a non-commercial license.

@FeitengLi: NVIDIA Spatial Intelligence Lab proposes PiD, redesigning the decoding stage in latent diffusion models. Current mainstream text-to-image generation happens in latent space, then uses a VAE decoder to map back to pixels. This decoder's…

X AI KOLs Timeline

NVIDIA Spatial Intelligence Lab proposes PiD, which redesigns the decoding stage of latent diffusion models as a conditional pixel diffusion process, unifying decoding and upsampling to achieve low-latency, high-resolution decoding.

L2P: Unlocking Latent Potential for Pixel Generation

Hugging Face Daily Papers

The L2P paper introduces a Latent-to-Pixel transfer paradigm that leverages pre-trained latent diffusion models to create efficient pixel-space models capable of 4K generation with minimal training overhead.