(1D) Ordered Tokens Enable Efficient Test-Time Search

Hugging Face Daily Papers

Summary

This paper investigates how 1D coarse-to-fine token structures in autoregressive models improve test-time search efficiency compared to classical 2D grid tokenization. The authors show that such ordered tokens enable better test-time scaling and even training-free text-to-image generation when guided by image-text verifiers.

Tokenization is a key component of autoregressive (AR) generative models, converting raw data into more manageable units for modeling. Commonly, tokens describe local information, such as regions of pixels in images or word pieces in text, and AR generation predicts these tokens in a fixed order. A worthwhile question is whether token structures affect the ability to steer the generation through test-time search, where multiple candidate generations are explored and evaluated by a verifier. Using image generation as our testbed, we hypothesize that recent 1D ordered tokenizers with coarse-to-fine structure can be more amenable to search than classical 2D grid structures. This is rooted in the fact that the intermediate states in coarse-to-fine sequences carry semantic meaning that verifiers can reliably evaluate, enabling effective steering during generation. Through controlled experiments, we find that AR models trained on coarse-to-fine ordered tokens exhibit improved test-time scaling behavior compared to grid-based counterparts. Moreover, we demonstrate that, thanks to the ordered structure, pure test-time search over token sequences (i.e., without training an AR model) can perform training-free text-to-image generation when guided by an image-text verifier. Beyond this, we systematically study how classical search algorithms (best-of-N, beam search, lookahead search) interact with different token structures, as well as the role of different verifiers and AR priors. Our results highlight the impact of token structure on inference-time scalability and provide practical guidance for test-time scaling in AR models.
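
To make the idea of verifier-guided test-time search concrete, here is a minimal sketch of best-of-N, the simplest of the search algorithms named above. It is a toy illustration, not the paper's implementation: `toy_sample` stands in for an AR model that samples a complete token sequence, `toy_verifier` stands in for an image-text verifier, and `TARGET` and `VOCAB_SIZE` are hypothetical values chosen only so the example runs.

```python
import random
from typing import Callable, List, Sequence


def best_of_n(
    sample: Callable[[random.Random], List[int]],
    verifier: Callable[[Sequence[int]], float],
    n: int,
    seed: int = 0,
) -> List[int]:
    """Draw n complete candidates and return the one the verifier scores highest."""
    rng = random.Random(seed)
    candidates = [sample(rng) for _ in range(n)]
    return max(candidates, key=verifier)


# Hypothetical toy setup: the "ideal" token sequence the verifier prefers.
TARGET = [3, 1, 4, 1, 5, 9, 2, 6]
VOCAB_SIZE = 10


def toy_sample(rng: random.Random) -> List[int]:
    """Stand-in for an AR prior: samples every token uniformly at random."""
    return [rng.randrange(VOCAB_SIZE) for _ in TARGET]


def toy_verifier(tokens: Sequence[int]) -> float:
    """Stand-in for an image-text verifier: counts positions matching TARGET."""
    return float(sum(t == g for t, g in zip(tokens, TARGET)))


if __name__ == "__main__":
    for n in (1, 8, 64):
        best = best_of_n(toy_sample, toy_verifier, n)
        print(n, toy_verifier(best))
```

Because the candidate pool for a larger `n` contains the pool for a smaller `n` when the seed is fixed, the best verifier score can only stay the same or improve as `n` grows. That is the basic test-time scaling knob; the score is only checked on finished sequences here.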

Cached at: 04/21/26, 07:21 AM

Source: https://huggingface.co/papers/2604.15453

Abstract

Autoregressive models with coarse-to-fine token structures show better test-time scaling and enable training-free text-to-image generation when combined with image-text verifiers.

Tokenization is a key component of autoregressive (AR) generative models, converting raw data into more manageable units for modeling. Commonly, tokens describe local information, such as regions of pixels in images or word pieces in text, and AR generation predicts these tokens in a fixed order. A worthwhile question is whether token structures affect the ability to steer the generation through test-time search, where multiple candidate generations are explored and evaluated by a verifier. Using image generation as our testbed, we hypothesize that recent 1D ordered tokenizers with coarse-to-fine structure can be more amenable to search than classical 2D grid structures. This is rooted in the fact that the intermediate states in coarse-to-fine sequences carry semantic meaning that verifiers can reliably evaluate, enabling effective steering during generation. Through controlled experiments, we find that AR models trained on coarse-to-fine ordered tokens exhibit improved test-time scaling behavior compared to grid-based counterparts. Moreover, we demonstrate that, thanks to the ordered structure, pure test-time search over token sequences (i.e., without training an AR model) can perform training-free text-to-image generation when guided by an image-text verifier. Beyond this, we systematically study how classical search algorithms (best-of-N, beam search, lookahead search) interact with different token structures, as well as the role of different verifiers and AR priors. Our results highlight the impact of token structure on inference-time scalability and provide practical guidance for test-time scaling in AR models.
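
The claim that intermediate states can be scored by a verifier is what makes prefix-level search possible. The sketch below shows beam search over token prefixes with no AR prior at all, in the spirit of the training-free setting described above. It is a toy under stated assumptions, not the authors' method: `toy_prefix_verifier`, `TARGET`, and `WEIGHTS` are hypothetical, with the decreasing weights standing in for a coarse-to-fine ordering in which early tokens carry most of the content. In a real system the prefix would presumably be decoded to an image and scored by an image-text model.

```python
from typing import Callable, List, Sequence


def beam_search(
    verifier: Callable[[Sequence[int]], float],
    vocab_size: int,
    length: int,
    beam_width: int,
) -> List[int]:
    """Grow token prefixes one position at a time, keeping the top-scoring ones.

    No generative prior is used: every token in the vocabulary is tried at
    each step, and the verifier alone decides which prefixes survive.
    """
    beams: List[List[int]] = [[]]
    for _ in range(length):
        # Extend every surviving prefix with every possible next token.
        expanded = [prefix + [tok] for prefix in beams for tok in range(vocab_size)]
        # Rank the partial sequences with the verifier and prune to the beam width.
        expanded.sort(key=verifier, reverse=True)
        beams = expanded[:beam_width]
    return beams[0]


# Hypothetical toy setup: earlier (coarser) tokens are weighted more heavily.
TARGET = [3, 1, 4, 1, 5, 9, 2, 6]
WEIGHTS = [2.0 ** -i for i in range(len(TARGET))]


def toy_prefix_verifier(tokens: Sequence[int]) -> float:
    """Stand-in verifier that can score a partial sequence of any length."""
    return sum(w for t, g, w in zip(tokens, TARGET, WEIGHTS) if t == g)


if __name__ == "__main__":
    result = beam_search(toy_prefix_verifier, vocab_size=10, length=len(TARGET), beam_width=2)
    print(result)
```

In this toy the verifier gives a reliable signal on every prefix, so even a narrow beam recovers the preferred sequence. If prefixes carried no meaningful signal, which is the concern raised for 2D grid tokens, pruning early would discard good candidates, and the search would have to fall back on scoring complete sequences.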

Get this paper in your agent:

hf papers read 2604.15453

Don't have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Similar Articles

From 2D Grids to 1D Tokens: Reforming Shared Representations for Multimodal Image Fusion

Hugging Face Daily Papers

This paper introduces a multimodal image fusion method that uses a 1D token interface from a pretrained image tokenizer to enhance global appearance coherence while preserving local details through selective token editing (STE). Experiments on four benchmarks show state-of-the-art performance in both global coherence and local fidelity.

Compute Optimal Tokenization (2 minute read)

TLDR AI

This paper systematically derives compression-aware neural scaling laws by training nearly 1,300 models, demonstrating that the widely used heuristic of 20 tokens per parameter is an artifact of specific tokenizers. The authors propose a tokenizer-agnostic scaling law based on bytes, offering a new framework for compute-efficient training across diverse languages and modalities.

Adaptive Tokenisation Via Temporal Redundancy Masking And Latent Inpainting [R]

Reddit r/MachineLearning

This paper introduces an adaptive video tokenisation method that exploits temporal redundancy in latent space to allocate tokens dynamically, achieving efficient compression without auxiliary networks. The proposed Latent Inpainting Transformer reconstructs dropped positions, delivering 31x speedup over ElasticTok-CV and 2x over InfoTok.