nvidia/PiD

Summary

NVIDIA releases PiD (Pixel Diffusion Decoder), a conditional pixel-space diffusion model that unifies latent-to-pixel decoding and upsampling into one generative module, producing super-resolved images in one pass. Model checkpoints and VAE weights are released under a non-commercial license.

Task: image-to-image
Tags: pytorch, diffusers, safetensors, super-resolution, diffusion, pixel-diffusion-decoder, vae-decoder, image-to-image, arxiv:2605.23902, base_model:Tongyi-MAI/Z-Image, base_model:finetune:Tongyi-MAI/Z-Image, region:us

Cached at: 05/26/26, 07:58 AM

Source: https://huggingface.co/nvidia/PiD

PiD: Pixel Diffusion Decoder

Paper, Project Page

Yifan Lu, Qi Wu, Jay Zhangjie Wu, Zian Wang, Huan Ling, Sanja Fidler, Xuanchi Ren

PiD reformulates the latent-to-pixel decoder as a conditional pixel-space diffusion model, unifying decoding and upsampling into a single generative module. It denoises directly in high-resolution pixel space and produces a super-resolved image in one pass. This repository hosts the released decoder checkpoints, plus the encoder/decoder (“VAE”) weights they depend on.

All `PiD_*` checkpoints in this repo are 4-step distilled. The non-`PiD_*` entries (`ae.safetensors`, `flux2_ae.safetensors`, `sd3_vae/`, `rae/`, `scale_rae/`) are the corresponding encoder/decoder VAE weights that PiD plugs into; they are not PiD checkpoints themselves.

License/Terms of Use

This model is released under the NSCLv1 License. The work and any derivative works may only be used for non-commercial (research or evaluation) purposes.

Deployment Geography:

Global

PiD checkpoints

Two variants are released for each diffusers-style backbone:

  • `2k`: trained at 2048 px; used as a 4× decoder (512 LDM → 2048 px), or as an 8× decoder for the Scale-RAE backbone (256 → 2048).
  • `2kto4k`: trained with multi-resolution data bucketing (2048 → 3840) and an SD3-style dynamic shift; designed for 1024 LDM → 4K (4096 px) decoding.

Both checkpoint variants support multiple aspect ratios.
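The resolution arithmetic behind the two variants can be sketched as follows. This is a minimal illustration, not code from the PiD repository: the function name `decoded_size` is hypothetical, and it assumes that "512 LDM" refers to the pixel size the backbone's original decoder would produce, which PiD then multiplies by the SR factor.

```python
def decoded_size(ldm_height, ldm_width, sr_factor):
    """Return the (height, width) in pixels after PiD decoding.

    Assumption for this sketch: ldm_height/ldm_width are the pixel
    dimensions the LDM was sampled at, and PiD scales both by sr_factor.
    """
    if sr_factor not in (4, 8):
        # Only 4x and 8x checkpoints are listed in this repository.
        raise ValueError(f"unsupported SR factor: {sr_factor}")
    return ldm_height * sr_factor, ldm_width * sr_factor


print(decoded_size(512, 512, 4))    # (2048, 2048): 2k variant, 4x
print(decoded_size(256, 256, 8))    # (2048, 2048): Scale-RAE backbone, 8x
print(decoded_size(1024, 1024, 4))  # (4096, 4096): 2kto4k variant
print(decoded_size(512, 768, 4))    # (2048, 3072): non-square aspect ratio
```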

| Path | Backbone (encoder side) | SR factor | Variant |
|---|---|---|---|
| `checkpoints/PiD_res2k_sr4x_official_flux_distill_4step` | Flux1-dev (16-ch VAE) | 4× | 2k |
| `checkpoints/PiD_res2k_sr4x_official_flux2_distill_4step` | Flux2-dev (128-ch BN VAE) | 4× | 2k |
| `checkpoints/PiD_res2k_sr4x_official_sd3_distill_4step` | SD3 medium (16-ch VAE) | 4× | 2k |
| `checkpoints/PiD_res2k_sr4x_official_dinov2_distill_4step` | DINOv2-B + RAE ViT-XL (768-ch) | 4× | 2k |
| `checkpoints/PiD_res2k_sr8x_official_siglip_distill_4step` | SigLIP-2 So400M + Scale-RAE ViT-XL (1152) | 8× | 2k |
| `checkpoints/PiD_res2kto4k_sr4x_official_flux_distill_4step` | Flux1-dev (16-ch VAE) | 4× | 2kto4k |
| `checkpoints/PiD_res2kto4k_sr4x_official_flux2_distill_4step` | Flux2-dev (128-ch BN VAE) | 4× | 2kto4k |
| `checkpoints/PiD_res2kto4k_sr4x_official_sd3_distill_4step` | SD3 medium (16-ch VAE) | 4× | 2kto4k |

Z-Image shares Flux1's VAE, so its inference path reuses the `flux` checkpoints (both `2k` and `2kto4k`); no separate `zimage` checkpoint is shipped.

Each directory contains a single file, `model_ema_bf16.pth`, which holds the EMA weights cast to bfloat16, the format the inference scripts load by default.
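The directory names above follow a regular pattern, so the path to a given checkpoint file can be rebuilt from a backbone and a variant. The sketch below only mirrors the listing in this README; the short backbone keys, the `zimage` alias, and the function name are assumptions made for illustration, and the authoritative mapping remains the registry in the PiD codebase.

```python
from pathlib import Path

# SR factor per backbone, as listed in the checkpoint table.
# The short keys are labels chosen for this sketch.
SR_FACTOR = {"flux": 4, "flux2": 4, "sd3": 4, "dinov2": 4, "siglip": 8}

# Backbones for which a 2kto4k checkpoint is listed.
HAS_2KTO4K = {"flux", "flux2", "sd3"}

# Z-Image shares Flux1's VAE, so it reuses the flux checkpoints.
ALIASES = {"zimage": "flux"}


def pid_checkpoint_path(backbone, variant="2k", root="checkpoints"):
    """Build the path to model_ema_bf16.pth for a backbone/variant pair."""
    key = ALIASES.get(backbone, backbone)
    if key not in SR_FACTOR:
        raise ValueError(f"unknown backbone: {backbone}")
    if variant not in ("2k", "2kto4k"):
        raise ValueError(f"unknown variant: {variant}")
    if variant == "2kto4k" and key not in HAS_2KTO4K:
        raise ValueError(f"no 2kto4k checkpoint listed for {backbone}")
    name = f"PiD_res{variant}_sr{SR_FACTOR[key]}x_official_{key}_distill_4step"
    return Path(root) / name / "model_ema_bf16.pth"


print(pid_checkpoint_path("siglip").as_posix())
print(pid_checkpoint_path("zimage", "2kto4k").as_posix())
```

Raising an error for combinations that are not listed (for example `dinov2` with `2kto4k`) keeps the helper from inventing paths that do not exist in the repository.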

VAE / encoder weights

These are the per-backbone encoder (and, where applicable, original decoder) weights that PiD pairs with. They’re hosted here so a single download brings everything needed end-to-end.

| Path | Description |
|---|---|
| `checkpoints/ae.safetensors` | Flux1-dev / Z-Image 16-ch VAE (encoder + original Flux decoder). |
| `checkpoints/flux2_ae.safetensors` | Flux2-dev 128-ch BN VAE. |
| `checkpoints/sd3_vae/` | SD3 medium 16-ch VAE in diffusers format. |
| `checkpoints/rae/` | DINOv2-B image encoder + RAE ViT-XL decoder + ImageNet-512 normalization statistics. |
| `checkpoints/scale_rae/` | SigLIP-2 So400M encoder + Scale-RAE ViT-XL decoder + decoder config. |
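After downloading, it can be useful to confirm that the encoder/VAE entries listed above are actually present. The following is a small sanity-check sketch under stated assumptions: the backbone keys and the function name are hypothetical, and the check only tests that each file or directory exists, not that its contents are valid.

```python
from pathlib import Path

# Entry under checkpoints/ that each backbone's encoder/VAE relies on,
# taken from the listing above. The backbone keys are labels for this sketch.
VAE_ENTRIES = {
    "flux": "ae.safetensors",
    "zimage": "ae.safetensors",  # Z-Image shares the Flux1 VAE
    "flux2": "flux2_ae.safetensors",
    "sd3": "sd3_vae",
    "dinov2": "rae",
    "siglip": "scale_rae",
}


def missing_vae_entries(root="checkpoints"):
    """Return the sorted names of expected VAE entries not found under root."""
    root = Path(root)
    expected = set(VAE_ENTRIES.values())
    return sorted(name for name in expected if not (root / name).exists())


if __name__ == "__main__":
    missing = missing_vae_entries()
    print("all VAE entries present" if not missing else f"missing: {missing}")
```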

Usage

The decoder checkpoints are loaded by the inference scripts in the PiD codebase. The exact `(backbone, ckpt_type) → path` mapping lives in `pid/_src/inference/checkpoint_registry.py`, which is the single source of truth. Clone the repo, point it at this snapshot, and the demos pick the right file automatically:

# Pull just the checkpoints/ tree into the repo root (skips this README and
# the teaser figure so they don't clobber the files in the source repo).
hf download nvidia/PiD --local-dir . --include "checkpoints/*"

# Then run any of the demos, e.g.:
PYTHONPATH=. python -m pid._src.inference.from_ldm_flux \
    --prompt "A photorealistic cat" \
    --ldm_inference_steps 28 --save_xt_steps 22 24 26 \
    --output_dir ./results/demo \
    --cfg_scale 1 --pid_inference_steps 4 --scale 4

Pick the `2kto4k` variant via `--pid_ckpt_type 2kto4k` when decoding at 4K.
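For scripted setups, the download step shown earlier can also be done from Python. This is a sketch that assumes the `huggingface_hub` package is installed; `snapshot_download` and its `allow_patterns` argument are part of that library, while the wrapper name `download_pid_checkpoints` is made up for illustration.

```python
REPO_ID = "nvidia/PiD"
# Same filter as `--include "checkpoints/*"` in the CLI command: fetch only
# the checkpoints/ tree and skip the README and teaser figure.
ALLOW_PATTERNS = ["checkpoints/*"]


def download_pid_checkpoints(local_dir="."):
    """Download only the checkpoints/ tree of nvidia/PiD into local_dir.

    Returns the local folder path reported by huggingface_hub.
    """
    # Imported lazily so this module can be imported even when
    # huggingface_hub is not installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=REPO_ID,
        local_dir=local_dir,
        allow_patterns=ALLOW_PATTERNS,
    )
```

Calling `download_pid_checkpoints(".")` from the root of the cloned PiD repository should leave the files under `./checkpoints/`, matching the layout the demos expect.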

Citation

@article{lu2026pid,
    title={PiD: Fast and High-Resolution Latent Decoding with Pixel Diffusion},
    author={Lu, Yifan and Wu, Qi and Wu, Jay Zhangjie and Wang, Zian and Ling, Huan and Fidler, Sanja and Ren, Xuanchi},
    journal={arXiv preprint arXiv:2605.23902},
    year={2026}
}
