nvidia/PiD
Summary
NVIDIA releases PiD (Pixel Diffusion Decoder), a conditional pixel-space diffusion model that unifies latent-to-pixel decoding and upsampling into one generative module, producing super-resolved images in one pass. Model checkpoints and VAE weights are released under a non-commercial license.
Cached at: 05/26/26, 07:58 AM
Source: https://huggingface.co/nvidia/PiD
PiD: Pixel Diffusion Decoder

Yifan Lu, Qi Wu, Jay Zhangjie Wu, Zian Wang, Huan Ling, Sanja Fidler, Xuanchi Ren
PiD reformulates the latent-to-pixel decoder as a conditional pixel-space diffusion model, unifying decoding and upsampling into a single generative module. It denoises directly in high-resolution pixel space and produces a super-resolved image in one pass. This repository hosts the released decoder checkpoints, plus the encoder/decoder (“VAE”) weights they depend on.
All PiD_* checkpoints in this repo are 4-step distilled. The non-PiD_* entries (ae.safetensors, flux2_ae.safetensors, sd3_vae/, rae/, scale_rae/) are the corresponding encoder/decoder VAE weights that PiD plugs into; they are not PiD checkpoints themselves.
License/Terms of Use
This model is released under the NSCLv1 License. The work and any derivative works may only be used for non-commercial (research or evaluation) purposes.
Deployment Geography:
Global
PiD checkpoints
Two variants are released for each diffusers-style backbone:
- 2k: trained at 2048 px; used as a 4× decoder (512 LDM → 2048 px), or as an 8× decoder for the Scale-RAE backbone (256 → 2048).
- 2kto4k: trained with multi-resolution data bucketing 2048→3840 and an SD3-style dynamic shift; designed for 1024 LDM → 4K (4096 px) decoding.
Both checkpoint variants support multiple aspect ratios.
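The resolution pairs quoted for the two variants are plain multiplication by the SR factor. A minimal Python sketch of that arithmetic follows; `decoded_size` is a hypothetical helper written for illustration, not a function from the PiD codebase, and it assumes the factor scales height and width equally.

```python
def decoded_size(ldm_height: int, ldm_width: int, sr_factor: int) -> tuple[int, int]:
    """Return the output pixel size for a given LDM resolution and SR factor.

    Assumption (not confirmed by the README): the SR factor applies
    equally to height and width.
    """
    if sr_factor not in (4, 8):
        raise ValueError("the released checkpoints use a 4x or 8x SR factor")
    return ldm_height * sr_factor, ldm_width * sr_factor


# The three cases named in the variant descriptions.
print(decoded_size(512, 512, 4))    # 2k variant as a 4x decoder
print(decoded_size(256, 256, 8))    # 2k variant on the Scale-RAE backbone
print(decoded_size(1024, 1024, 4))  # 2kto4k variant
```

Taking height and width separately keeps the helper usable for the non-square aspect ratios that both variants support.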
| Path | Backbone (encoder side) | SR factor | Variant |
| --- | --- | --- | --- |
| checkpoints/PiD_res2k_sr4x_official_flux_distill_4step | Flux1-dev (16-ch VAE) | 4× | 2k |
| checkpoints/PiD_res2k_sr4x_official_flux2_distill_4step | Flux2-dev (128-ch BN VAE) | 4× | 2k |
| checkpoints/PiD_res2k_sr4x_official_sd3_distill_4step | SD3 medium (16-ch VAE) | 4× | 2k |
| checkpoints/PiD_res2k_sr4x_official_dinov2_distill_4step | DINOv2-B + RAE ViT-XL (768-ch) | 4× | 2k |
| checkpoints/PiD_res2k_sr8x_official_siglip_distill_4step | SigLIP-2 So400M + Scale-RAE ViT-XL (1152) | 8× | 2k |
| checkpoints/PiD_res2kto4k_sr4x_official_flux_distill_4step | Flux1-dev (16-ch VAE) | 4× | 2kto4k |
| checkpoints/PiD_res2kto4k_sr4x_official_flux2_distill_4step | Flux2-dev (128-ch BN VAE) | 4× | 2kto4k |
| checkpoints/PiD_res2kto4k_sr4x_official_sd3_distill_4step | SD3 medium (16-ch VAE) | 4× | 2kto4k |
Z-Image shares Flux1’s VAE, so its inference path reuses the flux checkpoints (both 2k and 2kto4k); no separate zimage checkpoint is shipped.
Each directory contains a single file, model_ema_bf16.pth, which holds the EMA weights cast to bfloat16, the format the inference scripts load by default.
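The directory names in the table follow a regular pattern, so the path to a given weight file can be derived from the backbone and variant. The Python sketch below only mirrors that naming pattern for illustration. It is not the code in the repo's checkpoint registry, and `checkpoint_path`, `BACKBONE_TAG`, `RELEASED`, and the lookup keys (`"zimage"`, `"dinov2"`, `"siglip"`, and so on) are hypothetical names.

```python
from pathlib import PurePosixPath

# Hypothetical lookup keys -> backbone tag used in the released directory names.
# Z-Image reuses the Flux1 checkpoints, so it maps to the "flux" tag.
BACKBONE_TAG = {
    "flux": "flux",
    "zimage": "flux",
    "flux2": "flux2",
    "sd3": "sd3",
    "dinov2": "dinov2",
    "siglip": "siglip",
}

# (variant, backbone tag) -> SR factor, for the eight released combinations.
RELEASED = {
    ("2k", "flux"): 4,
    ("2k", "flux2"): 4,
    ("2k", "sd3"): 4,
    ("2k", "dinov2"): 4,
    ("2k", "siglip"): 8,
    ("2kto4k", "flux"): 4,
    ("2kto4k", "flux2"): 4,
    ("2kto4k", "sd3"): 4,
}


def checkpoint_path(backbone: str, variant: str = "2k") -> PurePosixPath:
    """Build the relative path of the EMA bf16 weight file for one checkpoint."""
    tag = BACKBONE_TAG[backbone]
    key = (variant, tag)
    if key not in RELEASED:
        raise KeyError(f"no released checkpoint for variant={variant!r}, backbone={backbone!r}")
    directory = f"PiD_res{variant}_sr{RELEASED[key]}x_official_{tag}_distill_4step"
    return PurePosixPath("checkpoints") / directory / "model_ema_bf16.pth"


print(checkpoint_path("zimage", "2kto4k"))
print(checkpoint_path("siglip"))
```

Requesting a combination that is not in the table, such as the 2kto4k variant for the DINOv2 backbone, raises a `KeyError` rather than producing a path that does not exist.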
VAE / encoder weights
These are the per-backbone encoder (and, where applicable, original decoder) weights that PiD pairs with. They’re hosted here so a single download brings everything needed end-to-end.
| Path | Description |
| --- | --- |
| checkpoints/ae.safetensors | Flux1-dev / Z-Image 16-ch VAE (encoder + original Flux decoder). |
| checkpoints/flux2_ae.safetensors | Flux2-dev 128-ch BN VAE. |
| checkpoints/sd3_vae/ | SD3 medium 16-ch VAE in diffusers format. |
| checkpoints/rae/ | DINOv2-B image encoder + RAE ViT-XL decoder + ImageNet-512 normalization statistics. |
| checkpoints/scale_rae/ | SigLIP-2 So400M encoder + Scale-RAE ViT-XL decoder + decoder config. |
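After downloading, it can be useful to confirm that the encoder/decoder assets listed above are actually present. The Python sketch below checks only the top-level names from the table. `missing_vae_assets` and `VAE_ASSETS` are hypothetical names, and the contents of each directory are not inspected because this README does not list them.

```python
from pathlib import Path

# Top-level VAE / encoder entries from the table, with the expected kind of each.
VAE_ASSETS = {
    "ae.safetensors": "file",
    "flux2_ae.safetensors": "file",
    "sd3_vae": "dir",
    "rae": "dir",
    "scale_rae": "dir",
}


def missing_vae_assets(repo_root: str) -> list[str]:
    """Return the names under <repo_root>/checkpoints that are absent or of the wrong kind."""
    root = Path(repo_root) / "checkpoints"
    missing = []
    for name, kind in VAE_ASSETS.items():
        path = root / name
        present = path.is_file() if kind == "file" else path.is_dir()
        if not present:
            missing.append(name)
    return missing


# Check the current directory; an empty list means every entry was found.
print(missing_vae_assets("."))
```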
Usage
The decoder checkpoints are loaded by the inference scripts in the PiD codebase. The exact (backbone, ckpt_type) → path mapping is defined in pid/_src/inference/checkpoint_registry.py, which is the single source of truth. Clone the repo, point it at this snapshot, and the demos pick the right file automatically:
# Pull just the checkpoints/ tree into the repo root (skips this README and
# the teaser figure so they don't clobber the files in the source repo).
hf download nvidia/PiD --local-dir . --include "checkpoints/*"
# Then run any of the demos, e.g.:
PYTHONPATH=. python -m pid._src.inference.from_ldm_flux \
--prompt "A photorealistic cat" \
--ldm_inference_steps 28 --save_xt_steps 22 24 26 \
--output_dir ./results/demo \
--cfg_scale 1 --pid_inference_steps 4 --scale 4
Pick the 2kto4k variant via --pid_ckpt_type 2kto4k when decoding at 4K.
Citation
@article{lu2026pid,
title={PiD: Fast and High-Resolution Latent Decoding with Pixel Diffusion},
author={Lu, Yifan and Wu, Qi and Wu, Jay Zhangjie and Wang, Zian and Ling, Huan and Fidler, Sanja and Ren, Xuanchi},
journal={arXiv preprint arXiv:2605.23902},
year={2026}
}