nvidia/PiD

Hugging Face Models Trending 04/28/26, 12:41 AM Models

diffusion-model super-resolution pixel-decoder nvidia image-generation open-source research

Summary

NVIDIA releases PiD (Pixel Diffusion Decoder), a conditional pixel-space diffusion model that unifies latent-to-pixel decoding and upsampling into one generative module, producing super-resolved images in one pass. Model checkpoints and VAE weights are released under a non-commercial license.

Task: image-to-image
Tags: pytorch, diffusers, safetensors, super-resolution, diffusion, pixel-diffusion-decoder, vae-decoder, image-to-image, arxiv:2605.23902, base_model:Tongyi-MAI/Z-Image, base_model:finetune:Tongyi-MAI/Z-Image, region:us

Cached at: 05/26/26, 07:58 AM

nvidia/PiD · Hugging Face

Source: https://huggingface.co/nvidia/PiD

PiD: Pixel Diffusion Decoder


Paper, Project Page

Yifan Lu, Qi Wu, Jay Zhangjie Wu, Zian Wang, Huan Ling, Sanja Fidler, Xuanchi Ren

PiD reformulates the latent-to-pixel decoder as a conditional pixel-space diffusion model, unifying decoding and upsampling into a single generative module. It denoises directly in high-resolution pixel space and produces a super-resolved image in one pass. This repository hosts the released decoder checkpoints, plus the encoder/decoder (“VAE”) weights they depend on.

All PiD_* checkpoints in this repo are 4-step distilled. The non-PiD_* entries (ae.safetensors, flux2_ae.safetensors, sd3_vae/, rae/, scale_rae/) are the corresponding encoder/decoder VAE weights that PiD plugs into; they are not PiD checkpoints themselves.

License/Terms of Use

This model is released under the NSCLv1 License. The work and any derivative works may only be used for non-commercial (research or evaluation) purposes.

Deployment Geography:

Global

PiD checkpoints

Two variants are released for each diffusers-style backbone:

2k: trained at 2048 px, used as a 4× decoder (512 LDM → 2048 px), or as an 8× decoder for the Scale-RAE backbone (256 → 2048 px).
2kto4k: trained with multi-resolution data bucketing (2048 → 3840) and an SD3-style dynamic shift; designed for 1024 LDM → 4K (4096 px) decoding.

Both checkpoint variants support multiple aspect ratios.
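
The resolution pairs quoted above follow from multiplying the input side length by the SR factor. The short sketch below only restates that arithmetic for square inputs; the names `EXAMPLES` and `output_side` are illustrative and do not come from the PiD codebase.

```python
# Illustrative only: these names are not part of the PiD codebase.
# (input side length in px, SR factor) pairs quoted in the model card.
EXAMPLES = {
    "2k, 4x decoder": (512, 4),
    "2k, 8x decoder (Scale-RAE)": (256, 8),
    "2kto4k, 4x decoder": (1024, 4),
}


def output_side(input_side: int, sr_factor: int) -> int:
    """Return the decoded side length in pixels for a square input."""
    return input_side * sr_factor


for name, (side, factor) in EXAMPLES.items():
    print(f"{name}: {side} px -> {output_side(side, factor)} px")
```

Non-square aspect ratios are supported by both variants, so in practice the same factor would presumably apply to each side independently.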

| Path | Backbone (encoder side) | SR factor | Variant |
| --- | --- | --- | --- |
| checkpoints/PiD_res2k_sr4x_official_flux_distill_4step | Flux1-dev (16-ch VAE) | 4× | 2k |
| checkpoints/PiD_res2k_sr4x_official_flux2_distill_4step | Flux2-dev (128-ch BN VAE) | 4× | 2k |
| checkpoints/PiD_res2k_sr4x_official_sd3_distill_4step | SD3 medium (16-ch VAE) | 4× | 2k |
| checkpoints/PiD_res2k_sr4x_official_dinov2_distill_4step | DINOv2-B + RAE ViT-XL (768-ch) | 4× | 2k |
| checkpoints/PiD_res2k_sr8x_official_siglip_distill_4step | SigLIP-2 So400M + Scale-RAE ViT-XL (1152) | 8× | 2k |
| checkpoints/PiD_res2kto4k_sr4x_official_flux_distill_4step | Flux1-dev (16-ch VAE) | 4× | 2kto4k |
| checkpoints/PiD_res2kto4k_sr4x_official_flux2_distill_4step | Flux2-dev (128-ch BN VAE) | 4× | 2kto4k |
| checkpoints/PiD_res2kto4k_sr4x_official_sd3_distill_4step | SD3 medium (16-ch VAE) | 4× | 2kto4k |

Z-Image shares Flux1's VAE, so its inference path reuses the flux checkpoints (both 2k and 2kto4k); no separate zimage checkpoint is shipped.

Each directory contains a single file, model_ema_bf16.pth, which holds the EMA weights cast to bfloat16. This is the format the inference scripts load by default.
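
The directory names listed above follow a regular pattern, so a checkpoint path can be derived from a backbone name and a variant. The helper below is a hypothetical sketch of that pattern: `checkpoint_path`, the lowercase backbone keys, and the `zimage` alias are assumptions made for illustration. The authoritative mapping lives in the PiD codebase and may differ.

```python
from pathlib import PurePosixPath

# Hypothetical helper: the real mapping is defined in the PiD codebase
# (pid/_src/inference/checkpoint_registry.py) and may differ from this sketch.
SR_FACTOR = {"flux": 4, "flux2": 4, "sd3": 4, "dinov2": 4, "siglip": 8}
ALIASES = {"zimage": "flux"}  # Z-Image shares Flux1's VAE
HAS_2KTO4K = {"flux", "flux2", "sd3"}  # backbones listed with a 2kto4k checkpoint


def checkpoint_path(backbone: str, ckpt_type: str = "2k") -> PurePosixPath:
    """Build the repo-relative path of a PiD checkpoint file."""
    name = ALIASES.get(backbone, backbone)
    if name not in SR_FACTOR:
        raise ValueError(f"unknown backbone: {backbone!r}")
    if ckpt_type not in ("2k", "2kto4k"):
        raise ValueError(f"unknown checkpoint type: {ckpt_type!r}")
    if ckpt_type == "2kto4k" and name not in HAS_2KTO4K:
        raise ValueError(f"no 2kto4k checkpoint is listed for {name!r}")
    directory = (
        f"PiD_res{ckpt_type}_sr{SR_FACTOR[name]}x_official_{name}_distill_4step"
    )
    return PurePosixPath("checkpoints") / directory / "model_ema_bf16.pth"


print(checkpoint_path("zimage", "2kto4k"))
```

Rejecting unlisted combinations, such as a 2kto4k request for the DINOv2 backbone, keeps the sketch from producing paths that the repository does not contain.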

VAE / encoder weights

These are the per-backbone encoder (and, where applicable, original decoder) weights that PiD pairs with. They’re hosted here so a single download brings everything needed end-to-end.

| Path | Description |
| --- | --- |
| checkpoints/ae.safetensors | Flux1-dev / Z-Image 16-ch VAE (encoder + original Flux decoder). |
| checkpoints/flux2_ae.safetensors | Flux2-dev 128-ch BN VAE. |
| checkpoints/sd3_vae/ | SD3 medium 16-ch VAE in diffusers format. |
| checkpoints/rae/ | DINOv2-B image encoder + RAE ViT-XL decoder + ImageNet-512 normalization statistics. |
| checkpoints/scale_rae/ | SigLIP-2 So400M encoder + Scale-RAE ViT-XL decoder + decoder config. |
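
After downloading, it can be useful to confirm that the VAE and encoder paths listed above are actually present locally. The following is a minimal sketch using only the standard library; `missing_vae_paths` is a hypothetical name, and the check looks only at existence, not at file contents or integrity.

```python
from pathlib import Path

# Paths copied from the list above; a trailing "/" marks a directory.
VAE_PATHS = [
    "checkpoints/ae.safetensors",
    "checkpoints/flux2_ae.safetensors",
    "checkpoints/sd3_vae/",
    "checkpoints/rae/",
    "checkpoints/scale_rae/",
]


def missing_vae_paths(root: str = ".") -> list[str]:
    """Return the listed paths that are absent under `root`."""
    base = Path(root)
    missing = []
    for rel in VAE_PATHS:
        target = base / rel
        # Directories and files are checked differently on purpose.
        present = target.is_dir() if rel.endswith("/") else target.is_file()
        if not present:
            missing.append(rel)
    return missing
```

Calling `missing_vae_paths(".")` from the directory that holds `checkpoints/` returns an empty list when every listed path exists.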

Usage

The decoder checkpoints are loaded by the inference scripts in the PiD codebase. The single source of truth for the exact (backbone, ckpt_type) → path mapping is pid/_src/inference/checkpoint_registry.py. Clone the repo, point it at this snapshot, and the demos pick the right file automatically:

# Pull just the checkpoints/ tree into the repo root (skips this README and
# the teaser figure so they don't clobber the files in the source repo).
hf download nvidia/PiD --local-dir . --include "checkpoints/*"

# Then run any of the demos, e.g.:
PYTHONPATH=. python -m pid._src.inference.from_ldm_flux \
    --prompt "A photorealistic cat" \
    --ldm_inference_steps 28 --save_xt_steps 22 24 26 \
    --output_dir ./results/demo \
    --cfg_scale 1 --pid_inference_steps 4 --scale 4

Pick the 2kto4k variant via --pid_ckpt_type 2kto4k when decoding at 4K.
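
When the demo is launched from a script rather than a shell, the command can be assembled as an argument list. The sketch below reuses only the flags shown above; the function name `build_flux_demo_command` is hypothetical. Whether other flags should change for 4K decoding is not stated here, so the sketch leaves them untouched.

```python
import shlex


def build_flux_demo_command(prompt, output_dir="./results/demo", ckpt_type=None):
    """Assemble the argv list for the from_ldm_flux demo shown above."""
    argv = [
        "python", "-m", "pid._src.inference.from_ldm_flux",
        "--prompt", prompt,
        "--ldm_inference_steps", "28",
        "--save_xt_steps", "22", "24", "26",
        "--output_dir", output_dir,
        "--cfg_scale", "1",
        "--pid_inference_steps", "4",
        "--scale", "4",
    ]
    # Only pass --pid_ckpt_type when a variant is requested explicitly.
    if ckpt_type is not None:
        argv += ["--pid_ckpt_type", ckpt_type]
    return argv


cmd = build_flux_demo_command("A photorealistic cat", ckpt_type="2kto4k")
print(shlex.join(cmd))  # shell-quoted form, for logging or copy-paste
```

To execute it, the list can be handed to `subprocess.run(cmd, check=True, env={**os.environ, "PYTHONPATH": "."})` from the root of the cloned repository, which mirrors the `PYTHONPATH=.` prefix in the shell example.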

Citation

@article{lu2026pid,
    title={PiD: Fast and High-Resolution Latent Decoding with Pixel Diffusion},
    author={Lu, Yifan and Wu, Qi and Wu, Jay Zhangjie and Wang, Zian and Ling, Huan and Fidler, Sanja and Ren, Xuanchi},
    journal={arXiv preprint arXiv:2605.23902},
    year={2026}
}

Similar Articles

PiD: Fast and High-Resolution Latent Decoding with Pixel Diffusion

@FeitengLi: NVIDIA Spatial Intelligence Lab proposes PiD, redesigning the decoding stage in latent diffusion models. Current mainstream text-to-image generation happens in latent space, then uses a VAE decoder to map back to pixels. This decoder's…

@xuanchi13: The latent-vs-pixel debate misses the point. GPT Image 2 shows what users notice: pixel-level fidelity. Latent models s…

Nemotron-Labs-Diffusion from NVIDIA

zhen-nan/L2P
