Next-Latent Prediction Transformers Learn Compact World Models

Papers with Code Trending Papers

Summary

Introduces Next-Latent Prediction (NextLat), a self-supervised objective that trains transformers to predict their next latent state, encouraging compact internal world models and improving generalization across sequence modeling tasks.

Transformers replace recurrence with a memory that grows with sequence length and self-attention that enables ad-hoc lookups over past tokens. Consequently, they lack an inherent incentive to compress history into compact latent states with consistent transition rules. This often leads to learning solutions that generalize poorly. We introduce Next-Latent Prediction (NextLat), which extends standard next-token training with self-supervised predictions in the latent space. Specifically, NextLat trains a transformer to learn latent representations that are predictive of its next latent state given the next output token. Theoretically, we show that these latents provably converge to belief states, compressed information of the history necessary to predict the future. This simple auxiliary objective also injects a recurrent inductive bias into transformers, while leaving their architecture, parallel training, and inference unchanged. NextLat effectively encourages the transformer to form compact internal world models with its own belief states and transition dynamics -- a crucial property absent in standard next-token prediction transformers. Empirically, across benchmarks targeting core sequence modeling competencies -- world modeling, reasoning, planning, and language modeling -- NextLat demonstrates significant gains over standard next-token training in downstream accuracy, representation compression, and lookahead planning. NextLat stands as a simple and efficient paradigm for shaping transformer representations toward stronger generalization.
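The objective described above (next-token training plus a prediction of the next latent state given the next token) can be illustrated with a toy sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the one-layer `predict_next_latent` dynamics map, the mean-squared-error latent loss, the weighting `lam`, and the random placeholder latents standing in for a transformer's hidden states. In a real framework the target latent would typically be treated as a constant (stop-gradient), which plain Python cannot express.

```python
import math
import random

def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[target]

def predict_next_latent(h_t, tok_emb, weights):
    """Hypothetical latent dynamics: tanh of a linear map of [h_t ; tok_emb]."""
    x = h_t + tok_emb  # list concatenation
    return [math.tanh(sum(w * v for w, v in zip(row, x))) for row in weights]

def mse(pred, target):
    """Mean squared error between predicted and actual next latent (assumed loss)."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)

def nextlat_objective(latents, logits, tokens, emb, weights, lam=1.0):
    """latents[t] and logits[t] are the model outputs after reading tokens[:t+1];
    logits[t] scores tokens[t+1]. Returns (total, next_token_loss, latent_loss)."""
    steps = len(tokens) - 1
    ce = 0.0
    lat = 0.0
    for t in range(steps):
        nxt = tokens[t + 1]
        ce += cross_entropy(logits[t], nxt)            # standard next-token term
        pred = predict_next_latent(latents[t], emb[nxt], weights)
        lat += mse(pred, latents[t + 1])               # auxiliary next-latent term
    ce /= steps
    lat /= steps
    return ce + lam * lat, ce, lat

# Placeholder data: random latents and uniform logits instead of a real transformer.
rng = random.Random(0)
V, D, E = 4, 3, 2  # vocab size, latent width, token-embedding width (all arbitrary)
tokens = [0, 2, 1, 3]
latents = [[rng.uniform(-1, 1) for _ in range(D)] for _ in tokens]
logits = [[0.0] * V for _ in tokens]
emb = [[rng.uniform(-1, 1) for _ in range(E)] for _ in range(V)]
weights = [[rng.uniform(-1, 1) for _ in range(D + E)] for _ in range(D)]

total, ce, lat = nextlat_objective(latents, logits, tokens, emb, weights, lam=0.5)
print(total, ce, lat)
```

Because the auxiliary term only adds a loss on existing hidden states, a sketch like this leaves the forward pass of the sequence model itself untouched, which matches the abstract's claim that architecture and inference are unchanged.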

Cached at: 06/17/26, 11:38 PM


Source: https://huggingface.co/papers/2511.05963

Abstract

Next-Latent Prediction enhances transformer architectures by introducing self-supervised latent state prediction, enabling more effective history compression and improved generalization in sequence modeling tasks.

Transformers replace recurrence with a memory that grows with sequence length and self-attention that enables ad-hoc lookups over past tokens. Consequently, they lack an inherent incentive to compress history into compact latent states with consistent transition rules. This often leads to learning solutions that generalize poorly. We introduce Next-Latent Prediction (NextLat), which extends standard next-token training with self-supervised predictions in the latent space. Specifically, NextLat trains a transformer to learn latent representations that are predictive of its next latent state given the next output token. Theoretically, we show that these latents provably converge to belief states, compressed information of the history necessary to predict the future. This simple auxiliary objective also injects a recurrent inductive bias into transformers, while leaving their architecture, parallel training, and inference unchanged. NextLat effectively encourages the transformer to form compact internal world models with its own belief states and transition dynamics -- a crucial property absent in standard next-token prediction transformers. Empirically, across benchmarks targeting core sequence modeling competencies -- world modeling, reasoning, planning, and language modeling -- NextLat demonstrates significant gains over standard next-token training in downstream accuracy, representation compression, and lookahead planning. NextLat stands as a simple and efficient paradigm for shaping transformer representations toward stronger generalization.
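The notion of a belief state mentioned in the abstract can be made concrete with a classical example. The sketch below is a standard Bayes filter on a two-state hidden Markov model; the HMM, its `transition` and `emission` tables, and the observation sequence are made-up illustrations and are not the paper's setup. It shows how an entire observation history can be compressed into one fixed-size vector that is updated recurrently and is all that is needed to predict what comes next.

```python
def belief_update(belief, transition, emission, obs):
    """One Bayes-filter step: propagate the belief through the transition
    model, reweight by the likelihood of the new observation, renormalize."""
    n = len(belief)
    # Predict: probability of each next state before seeing the observation.
    predicted = [sum(belief[s] * transition[s][s2] for s in range(n))
                 for s2 in range(n)]
    # Correct: weight each state by how likely it is to emit `obs`.
    unnorm = [predicted[s2] * emission[s2][obs] for s2 in range(n)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

# Hypothetical two-state HMM (all numbers are illustrative).
transition = [[0.9, 0.1],   # transition[s][s2] = P(next state s2 | state s)
              [0.2, 0.8]]
emission = [[0.8, 0.2],     # emission[s][o] = P(observation o | state s)
            [0.3, 0.7]]

first = belief_update([0.5, 0.5], transition, emission, 0)

# The history [0, 0, 1] is folded into a single two-number belief vector.
belief = [0.5, 0.5]
for obs in [0, 0, 1]:
    belief = belief_update(belief, transition, emission, obs)
print(belief)
```

The update takes only the previous belief and the newest observation, which is the kind of consistent transition rule over compact latent states that the abstract says standard transformers are not pushed to learn.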




Datasets citing this paper: 1

#### JaydenTeoh/manhattan • Updated Mar 2 • 91.6M • 428 • 1


Similar Articles

Next-Latent Prediction Transformers [R]

Reddit r/MachineLearning

Microsoft Research introduces Next-Latent Prediction (NextLat), a self-supervised method that trains transformers to predict their own next latent state, enabling compact world models for reasoning and planning and achieving up to 3.3x faster inference via self-speculative decoding.

Looped World Models

Hugging Face Daily Papers

Looped World Models introduce iterative latent state refinement through shared transformer blocks, achieving 100x parameter efficiency while adapting computational depth to prediction complexity.

NITP: Next Implicit Token Prediction for LLM Pre-training

Hugging Face Daily Papers

Next Implicit Token Prediction (NITP) enhances language model pre-training by adding dense continuous supervision in representation space, improving generalization and performance across model sizes with minimal computational overhead.

World Machine: Towards Generative World Modeling for Time-Series

arXiv cs.LG

World Machine proposes a transformer-based generative world modeling architecture for time series that uses latent states to adapt to varying context lengths, addressing the quadratic memory cost of traditional transformers. Experiments on a synthetic dataset validate its feasibility and show improvements over conventional transformers.

Generative modeling with sparse transformers

OpenAI Blog

OpenAI introduces the Sparse Transformer, a deep neural network that improves the attention mechanism from O(N²) to O(N√N) complexity, enabling modeling of sequences 30x longer than previously possible across text, images, and audio. The model uses sparse attention patterns and checkpoint-based memory optimization to train networks up to 128 layers deep, achieving state-of-the-art performance across multiple domains.