Parallel Rollout Approximation for Pixel-Space Autoregressive Image Generation

Hugging Face Daily Papers

Summary

Parallel Rollout Approximation (PRA) improves pixel-space autoregressive image generation by using low-dimensional intermediate states and parallel training, achieving a new state of the art among pixel-space AR models on class-conditional ImageNet-1K generation.

Pixel-space continuous-token autoregressive (AR) generation directly models images as sequences of raw pixel patches, avoiding discrete tokenization or a separately pretrained tokenizer. However, it faces coupled challenges: high-dimensional patch generation causes large single-step errors, and teacher-forced training creates a train-inference gap that makes these errors accumulate across AR steps. Existing fixes such as x-prediction and input noise injection only partially mitigate these issues. Exact rollout training better matches inference-time conditions, but is impractical due to prohibitively slow sequential sampling. We propose Parallel Rollout Approximation (PRA), a scalable framework that addresses both challenges jointly. PRA generates low-dimensional intermediate states instead of high-dimensional pixel patches, then maps them back to pixel-space tokens with a pixel decoder, preserving a pixel-in, pixel-out AR interface. It also constructs inference-like pixel inputs through the same intermediate-state-to-pixel path used at inference, independently across positions, approximating the pixel-feedback interface encountered during inference-time rollout while retaining parallel teacher-forced training. On class-conditional ImageNet-1K generation at 256×256 resolution, PRA-S with 135M parameters achieves an FID of 2.58, surpassing the previous billion-scale pixel-space AR result of 3.60. Scaling to PRA-L with 511M parameters further improves FID to 1.94, establishing a new state of the art among pixel-space AR models. Beyond generation, PRA achieves higher ImageNet classification probing accuracy than other AR and diffusion baselines, suggesting its potential for unified pixel-space image generation and understanding.

Cached at: 06/29/26, 02:03 PM


Source: https://huggingface.co/papers/2606.27978

Abstract

Parallel Rollout Approximation (PRA) addresses limitations in pixel-space autoregressive image generation by using low-dimensional intermediate states and parallel training to improve quality and efficiency.

Pixel-space continuous-token autoregressive (AR) generation directly models images as sequences of raw pixel patches, avoiding discrete tokenization or a separately pretrained tokenizer. However, it faces coupled challenges: high-dimensional patch generation causes large single-step errors, and teacher-forced training creates a train-inference gap that makes these errors accumulate across AR steps. Existing fixes such as x-prediction and input noise injection only partially mitigate these issues. Exact rollout training better matches inference-time conditions, but is impractical due to prohibitively slow sequential sampling. We propose Parallel Rollout Approximation (PRA), a scalable framework that addresses both challenges jointly. PRA generates low-dimensional intermediate states instead of high-dimensional pixel patches, then maps them back to pixel-space tokens with a pixel decoder, preserving a pixel-in, pixel-out AR interface. It also constructs inference-like pixel inputs through the same intermediate-state-to-pixel path used at inference, independently across positions, approximating the pixel-feedback interface encountered during inference-time rollout while retaining parallel teacher-forced training. On class-conditional ImageNet-1K generation at 256×256 resolution, PRA-S with 135M parameters achieves an FID of 2.58, surpassing the previous billion-scale pixel-space AR result of 3.60. Scaling to PRA-L with 511M parameters further improves FID to 1.94, establishing a new state of the art among pixel-space AR models. Beyond generation, PRA achieves higher ImageNet classification probing accuracy than other AR and diffusion baselines, suggesting its potential for unified pixel-space image generation and understanding.
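The abstract's central idea, building inference-like pixel inputs independently per position so that training stays parallel, can be illustrated with a toy NumPy sketch. This is not the paper's implementation. The linear maps `W_state` and `W_dec`, the sizes `N`, `D`, `d`, the use of a noisy projection of the ground-truth patch as the per-position state, and the mean-pooled causal context are all hypothetical stand-ins chosen only to keep the example runnable. The abstract does not specify how intermediate states are obtained during training.

```python
import numpy as np

N, D, d = 16, 48, 8  # toy sizes: tokens per image, pixel-patch dim, state dim

rng = np.random.default_rng(0)
# Hypothetical stand-ins (random linear maps), NOT the paper's networks.
W_state = rng.standard_normal((D, d)) / np.sqrt(D)  # pixel patch -> low-dim state
W_dec = rng.standard_normal((d, D)) / np.sqrt(d)    # low-dim state -> pixel patch


def pixel_decoder(states):
    """Map low-dimensional states (..., d) back to pixel patches (..., D)."""
    return states @ W_dec


def inference_like_inputs(patches, noise):
    """Build all N input patches at once, each from its own position only.

    Assumption: the per-position state is a projection of the ground-truth
    patch plus noise; the abstract does not specify this step.
    """
    states = patches @ W_state + noise  # (N, d), no cross-position terms
    return pixel_decoder(states)        # (N, D), via the state-to-pixel path


def teacher_forced_loss(patches, noise):
    """One parallel pass: causal context over decoded inputs, next-state targets."""
    x_in = inference_like_inputs(patches, noise)
    mask = np.tril(np.ones((N, N)))                    # position i sees 0..i
    context = (mask @ x_in) / mask.sum(axis=1, keepdims=True)
    pred = context[:-1] @ W_state                      # toy prediction of state i+1
    target = patches[1:] @ W_state
    return float(np.mean((pred - target) ** 2))


patches = rng.standard_normal((N, D))
noise = 0.1 * rng.standard_normal((N, d))
x_in = inference_like_inputs(patches, noise)
loss = teacher_forced_loss(patches, noise)
print(x_in.shape, loss)
```

The point of the sketch is structural: `inference_like_inputs` has no loop over positions, so every input patch passes through the decoder path without waiting for earlier steps, whereas an exact rollout would have to sample position i before it could form the input for position i+1.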



