GEAR: Guided End-to-End AutoRegression for Image Synthesis

Hugging Face Daily Papers 06/30/26, 12:00 AM Papers

Summary

GEAR proposes a method to jointly train a vector-quantized tokenizer and autoregressive generator end-to-end via representation alignment, achieving up to 10x faster convergence on ImageNet gFID compared to strong baselines.

Visual generative models are typically trained in two stages. A tokenizer is first trained for reconstruction and then frozen, after which a generator is trained on its discrete indices or continuous latents. This decoupling leaves the tokenizer unaware of what the generator finds easy to model. We present GEAR (Guided End-to-end AutoRegression), which trains a vector-quantized (VQ) tokenizer and an autoregressive (AR) generator jointly and end-to-end, guided by representation alignment. The key obstacle is that the VQ index fed to the AR model is non-differentiable, so gradients cannot reach the tokenizer, and a straight-through estimator collapses. GEAR resolves this with a dual read-out of the codebook assignment. A hard, one-hot branch trains the AR with next-token prediction, while a differentiable soft branch carries a representation-alignment loss that flows back to guide only the tokenizer. The AR model thereby steers its tokenizer toward an index distribution it can predict more easily. This shifts the alignment burden from the tokenizer to the AR: the tokenizer's own features become less DINOv2-like while the AR's become more so, the opposite of diffusion-side recipes that make the latent itself semantic. GEAR speeds up ImageNet gFID convergence by up to 10x relative to the strong LlamaGen-REPA baseline, learns markedly better patch-level and spatially-coherent features, and generalizes across quantizers (VQVAE, LFQ, IBQ) and to text-to-image generation.
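The dual read-out described above can be illustrated with a small sketch. This is not the authors' implementation: it assumes nearest-neighbour quantization by squared Euclidean distance and a softmax over negative distances for the soft branch. The function name `dual_readout`, the temperature `tau`, and the toy codebook are all hypothetical.

```python
import math

def dual_readout(z, codebook, tau=1.0):
    """Return (hard_index, one_hot, soft) for one encoder vector z.

    Hypothetical helper: distance-based assignment and the temperature
    `tau` are assumptions, not details given in the abstract.
    """
    # Squared Euclidean distance from z to every codebook entry.
    dists = [sum((zi - ci) ** 2 for zi, ci in zip(z, c)) for c in codebook]

    # Hard branch: nearest-code index. argmin is non-differentiable, so
    # this read-out would only feed the AR next-token objective.
    hard_index = min(range(len(dists)), key=dists.__getitem__)
    one_hot = [1.0 if k == hard_index else 0.0 for k in range(len(dists))]

    # Soft branch: softmax over negative distances. It is smooth in z,
    # so a loss attached here could send gradients back to the tokenizer.
    logits = [-d / tau for d in dists]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    soft = [e / total for e in exps]
    return hard_index, one_hot, soft

# Toy codebook with three 2-D entries (illustrative values only).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
hard_index, one_hot, soft = dual_readout([0.9, 0.1], codebook, tau=0.5)
```

Both read-outs come from the same distances, so the soft weights peak at the hard index. Only the soft weights vary smoothly with the encoder output.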

Original Article


Cached at: 07/01/26, 11:42 AM


Source: https://huggingface.co/papers/2606.32039 Published on Jun 30

Submitted by https://huggingface.co/LanguageBind

linbin on Jul 1

Abstract

GEAR trains a vector-quantized tokenizer and autoregressive generator jointly end-to-end using representation alignment, overcoming non-differentiability issues through a dual read-out approach that improves convergence speed and feature quality.

Visual generative models are typically trained in two stages. A tokenizer is first trained for reconstruction and then frozen, after which a generator is trained on its discrete indices or continuous latents. This decoupling leaves the tokenizer unaware of what the generator finds easy to model. We present GEAR (Guided End-to-end AutoRegression), which trains a vector-quantized (VQ) tokenizer and an autoregressive (AR) generator jointly and end-to-end, guided by representation alignment. The key obstacle is that the VQ index fed to the AR model is non-differentiable, so gradients cannot reach the tokenizer, and a straight-through estimator collapses. GEAR resolves this with a dual read-out of the codebook assignment. A hard, one-hot branch trains the AR with next-token prediction, while a differentiable soft branch carries a representation-alignment loss that flows back to guide only the tokenizer. The AR model thereby steers its tokenizer toward an index distribution it can predict more easily. This shifts the alignment burden from the tokenizer to the AR: the tokenizer’s own features become less DINOv2-like while the AR’s become more so, the opposite of diffusion-side recipes that make the latent itself semantic. GEAR speeds up ImageNet gFID convergence by up to 10x relative to the strong LlamaGen-REPA baseline, learns markedly better patch-level and spatially-coherent features, and generalizes across quantizers (VQVAE, LFQ, IBQ) and to text-to-image generation.
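The abstract does not state the form of the representation-alignment loss. A common choice in REPA-style training is the negative mean cosine similarity between per-patch model features and features from a frozen encoder such as DINOv2. The sketch below assumes that form. The names `cosine` and `alignment_loss` and the toy feature values are hypothetical.

```python
import math

def cosine(u, v, eps=1e-8):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)

def alignment_loss(pred_feats, target_feats):
    """Negative mean per-patch cosine similarity (assumed loss form)."""
    sims = [cosine(p, t) for p, t in zip(pred_feats, target_feats)]
    return -sum(sims) / len(sims)

# Toy "frozen encoder" targets for two patches (illustrative values only).
target = [[1.0, 0.0], [0.0, 2.0]]
aligned = [[3.0, 0.0], [0.0, 0.5]]     # same directions, different norms
orthogonal = [[0.0, 1.0], [1.0, 0.0]]  # perpendicular to the targets

loss_aligned = alignment_loss(aligned, target)
loss_orthogonal = alignment_loss(orthogonal, target)
```

Cosine similarity ignores feature norms, so the aligned case approaches the minimum of -1 regardless of scale, and the orthogonal case gives a loss of 0.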


Get this paper in your agent:

hf papers read 2606.32039

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper: 6

BinLin203/Warmup-LFQ, updated about 9 hours ago • 2
BinLin203/Warmup-IBQ, updated about 9 hours ago • 2
BinLin203/GEAR-VQ, updated about 9 hours ago • 1
BinLin203/GEAR-LFQ, updated about 9 hours ago

Datasets citing this paper: 0


Spaces citing this paper: 0



Similar Articles

RankE: End-to-End Post-Training for Discrete Text-to-Image Generation with Decoder Co-Evolution

High-Fidelity Two-Step Image Generation via Teacher-Aligned End-to-End Distillation

InsightTok: Improving Text and Face Fidelity in Discrete Tokenization for Autoregressive Image Generation

Cross-scale Aligned Supervision for Training GANs

unsloth/ERNIE-Image-Turbo-GGUF
