@AlphaSignalAI: https://x.com/AlphaSignalAI/status/2069424192274252094


Summary

Microsoft's NextLat introduces a training objective that rewards belief-state representations instead of relying solely on next-token prediction, pushing models toward compact world models for better generalization.

https://t.co/WqF8HzdlQm

Cached at: 06/23/26, 04:12 PM

Next-Token Prediction Doesn’t Teach Models to Understand. It Teaches Them to Predict.

Microsoft’s NextLat adds a training objective that rewards belief-state representations instead of relying solely on next-token prediction.

In 8 mins, learn why next-token prediction doesn’t strongly favor world models and how Microsoft’s NextLat pushes models toward belief-state learning.

There is a difference between a model that predicts the next token accurately and a model that understands the world well enough to generalize.

The first can be achieved by memorizing patterns. The second requires building a compact internal model of how things work, one that captures the rules rather than the surface regularities.

Standard next-token prediction can produce the second, but it does not explicitly favor it over other predictive solutions.

A model can achieve strong token prediction accuracy through sophisticated pattern-matching without ever building a compact world model.

NextLat introduces a training signal that directly rewards representations with belief-state structure. Same architecture. Same inference. Meaningfully different internal representations.

Jayden Teoh (@jayden_teoh_), Jun 16: Next-token prediction is myopic. What if transformers learn to predict their own next latent state?

We present Next-Latent Prediction (NextLat): a self-supervised learning method that teaches transformers to form compact world models for reasoning…

The missing incentive in next-token prediction

When a transformer processes a sequence, it doesn’t compress history into a fixed-size state the way a recurrent network does. It keeps the entire history in memory via its key-value cache and attends to whatever past tokens are relevant at each step.

This is why transformers are powerful. Ad-hoc retrieval over the full context is more flexible than a fixed recurrent state.

But flexibility has a cost. Because the model can always look up what it needs, there is no pressure to compress. The model never has to ask: what is the minimal representation of what I’ve seen so far that would let me predict what comes next?

Many theories of learning link strong generalization to compact representations that capture underlying structure rather than memorizing observations.

NTP alone does not strongly prefer compact world models over surface-level predictive shortcuts. Both produce accurate next-token predictions.

The paper opens with a pointed analogy.

Ptolemy’s geocentric model accurately predicted observations made from Earth, but it was structurally convoluted.

It was supplanted not because it was less accurate in-distribution, but because Copernicus’s heliocentric model was simpler and generalized beyond Earth’s perspective.

Transformers trained on next-token prediction have no incentive to find the simpler explanation.

What belief states are and why they matter

A belief state is a sufficient statistic of the history: a compact representation containing exactly the information needed to predict the future, and nothing more. If you have the belief state, additional history adds no predictive value.
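A toy illustration of the sufficient-statistic property may help. It assumes a made-up binary process, not one from the paper, in which the probability of the next token depends only on the number of ones seen so far, modulo 3. The names `P_ONE`, `next_one_prob`, `belief_update`, and `belief_of` are hypothetical.

```python
from fractions import Fraction

# Hypothetical toy process: P(next token = 1) depends only on
# (number of ones seen so far) mod 3. The table is made up for illustration.
P_ONE = {0: Fraction(1, 2), 1: Fraction(1, 4), 2: Fraction(3, 4)}

def next_one_prob(history):
    """Predict from the full history (what attention over all tokens allows)."""
    return P_ONE[sum(history) % 3]

def belief_update(belief, token):
    """Fold one new token into the compact state; no history needed."""
    return (belief + token) % 3

def belief_of(history):
    """Compress a whole history into the belief state, one token at a time."""
    belief = 0
    for token in history:
        belief = belief_update(belief, token)
    return belief

short = [1]
long = [1, 0, 1, 1, 0, 1]  # four ones: same count mod 3 as `short`

# Different histories, same belief state, hence identical predictions.
assert belief_of(short) == belief_of(long) == 1
assert next_one_prob(short) == next_one_prob(long) == P_ONE[1]
```

Once the belief state is known, the longer history changes nothing about the prediction. That is the sense in which additional history adds no predictive value.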

Recurrent networks naturally develop something like belief states because their fixed-size hidden state must encode everything useful about the past. The bottleneck forces compression.

Transformers have no such bottleneck. Their internal representations can be arbitrarily complex because the full history is always available. There is no mechanism forcing them toward compact representations that capture the world’s structure.

NextLat restores this pressure without adding the costs of recurrence.

What NextLat does

Next-token prediction supervises outputs. NextLat supervises the internal trajectory.

Other methods constrain what the model produces at each step. NextLat constrains how the model’s internal representations evolve between steps. That distinction is what drives belief-state convergence.

The modification is elegant. During training, NextLat adds one auxiliary objective alongside standard next-token prediction:

Train a lightweight latent dynamics model that predicts the transformer’s next hidden state, given the current hidden state and the next token.

That’s it. Nothing changes at inference. The latent dynamics model is used only during training to shape the representations.
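A minimal sketch of what such a latent dynamics model could look like, written in plain Python so it runs without a deep-learning framework. The class name `LatentDynamics`, the layer sizes, the ReLU activation, and the choice to concatenate the hidden state with a token embedding are all assumptions for illustration; the article only says the model is a small MLP that takes the current hidden state and the next token.

```python
import math
import random

def make_linear(n_in, n_out, rng):
    """Random weight matrix (n_out rows of n_in values) plus a zero bias."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out

def linear(layer, x):
    """Apply an affine layer to a vector given as a list of floats."""
    weights, bias = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

class LatentDynamics:
    """Hypothetical small MLP: (h_t, embedding of x_{t+1}) -> predicted h_{t+1}."""

    def __init__(self, d_hidden, d_embed, d_mlp, seed=0):
        rng = random.Random(seed)
        self.fc1 = make_linear(d_hidden + d_embed, d_mlp, rng)
        self.fc2 = make_linear(d_mlp, d_hidden, rng)

    def step(self, h_t, next_token_embedding):
        x = list(h_t) + list(next_token_embedding)       # concatenate inputs
        z = [max(0.0, v) for v in linear(self.fc1, x)]   # ReLU (an assumption)
        return linear(self.fc2, z)

    def rollout(self, h_t, token_embeddings):
        """Apply `step` repeatedly, feeding each prediction back in."""
        predictions = []
        h = h_t
        for embedding in token_embeddings:
            h = self.step(h, embedding)
            predictions.append(h)
        return predictions

dynamics = LatentDynamics(d_hidden=4, d_embed=3, d_mlp=8)
predicted = dynamics.rollout([0.1] * 4, [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
print(len(predicted), len(predicted[0]))
```

In training, each predicted state would be compared against the transformer's actual hidden state at the same position. At inference the module can simply be dropped, which is why the transformer itself is unchanged.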

How the Next-Latent Prediction works


Two components are added on top of the standard cross-entropy loss.

A next-hidden-state prediction loss: a small MLP learns to predict the transformer’s next hidden state from the current hidden state plus the next token. Smooth L1 loss, supervised over a multi-step rollout horizon.

A KL alignment loss: forces predicted latent states to agree with true latent states in token prediction space. This acts like knowledge distillation, guiding the latent dynamics model toward semantically consistent representations.

The combined objective:

L_NextLat = L_next-token + λ₁ · L_next-h + λ₂ · L_KL
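The three terms can be sketched for a single position in plain Python. This is a sketch under stated assumptions, not the paper's implementation: the function names, the default weights `lam1` and `lam2`, the averaging over rollout steps, and the KL direction (true-latent distribution as the teacher, predicted-latent distribution as the student) are all choices made here for illustration.

```python
import math

def smooth_l1(pred, target, beta=1.0):
    """Mean smooth L1 loss: quadratic for small errors, linear for large ones."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d / beta if d < beta else d - 0.5 * beta
    return total / len(pred)

def softmax(logits):
    m = max(logits)                        # subtract max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def cross_entropy(logits, target_index):
    return -math.log(softmax(logits)[target_index])

def nextlat_loss(token_logits, target_token, pred_hiddens, true_hiddens,
                 logits_from_pred, logits_from_true, lam1=1.0, lam2=1.0):
    # Standard next-token term.
    l_token = cross_entropy(token_logits, target_token)
    # Next-hidden-state term, averaged over the rollout steps.
    l_hidden = sum(smooth_l1(p, t)
                   for p, t in zip(pred_hiddens, true_hiddens)) / len(pred_hiddens)
    # Alignment term: token distribution read off the predicted latent
    # should match the one read off the true latent (direction assumed).
    l_kl = kl_divergence(softmax(logits_from_true), softmax(logits_from_pred))
    return l_token + lam1 * l_hidden + lam2 * l_kl
```

When the predicted latents match the true ones exactly, both auxiliary terms vanish and the objective reduces to ordinary cross-entropy.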

The formal guarantee behind this is Theorem 3.2, which rests on two conditions.

First, the output head must recover the true next-token distribution from the hidden state.

Second, the latent dynamics model must recover the true transition law of the hidden states.

When both hold, the hidden state must be a belief state: a sufficient statistic of the history for predicting the future.

The intuition: if the hidden state at step t, combined with the next token, must predict the hidden state at step t+1, and that state must predict the token after, and so on recursively, then the hidden state must contain a sufficient statistic of the history.

The recursive chain forces compression.
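The recursion can be written out informally. The notation below is chosen here for illustration and is not copied from the paper: h_t is the hidden state after tokens x_{1:t}, p_θ is the output head, and f_ψ is the latent dynamics model, treated as deterministic for simplicity. Assuming both conditions hold exactly at every step:

```latex
\begin{align*}
\text{(1) output head:}\quad
  & p_\theta(x_{t+1} \mid h_t) = P(x_{t+1} \mid x_{1:t}) \\
\text{(2) latent dynamics:}\quad
  & h_{t+1} = f_\psi(h_t, x_{t+1}) \\
\text{chain rule, then (1):}\quad
  & P(x_{t+1:t+k} \mid x_{1:t})
    = \prod_{i=1}^{k} P(x_{t+i} \mid x_{1:t+i-1})
    = \prod_{i=1}^{k} p_\theta(x_{t+i} \mid h_{t+i-1})
\end{align*}
```

By (2), every h_{t+i-1} on the right is computed from h_t and the future tokens alone, so the whole product depends on the past only through h_t. That is the sufficient-statistic property.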

The practical transformer trained on finite data won’t achieve the exact theoretical optimum. But the objective creates a meaningful gradient toward it.

What this looks like in practice

The paper evaluates NextLat across four domains: world modeling, reasoning, planning, and language modeling.

The Manhattan taxi rides experiment is the most visually compelling.

A transformer is trained on sequences of Manhattan taxi rides and asked to generate valid trajectories. Edges consistent with the true street network are colored black. Invalid edges are colored red.

Standard GPT: substantial red. Invalid routes generated with confidence.

NextLat: substantially less red. The model generates routes more consistent with the actual street structure.
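The black-versus-red coloring amounts to checking each generated step against the set of real street segments. A minimal sketch of that check, assuming a trajectory is a list of intersection labels and the street network is a set of directed node pairs; the function name `invalid_edge_fraction` and the tiny four-intersection map are hypothetical.

```python
def invalid_edge_fraction(trajectory, street_edges):
    """Fraction of consecutive node pairs in `trajectory` that are not real edges."""
    steps = list(zip(trajectory, trajectory[1:]))
    if not steps:
        return 0.0
    invalid = sum(1 for edge in steps if edge not in street_edges)
    return invalid / len(steps)

# Hypothetical city block with four intersections and two-way streets
# along its sides: A-B, B-D, A-C, C-D. There is no diagonal A-D street.
street_edges = {("A", "B"), ("B", "A"), ("B", "D"), ("D", "B"),
                ("A", "C"), ("C", "A"), ("C", "D"), ("D", "C")}

valid_route = ["A", "B", "D", "C"]
invalid_route = ["A", "D", "C"]   # A -> D cuts diagonally through the block

print(invalid_edge_fraction(valid_route, street_edges))    # 0.0
print(invalid_edge_fraction(invalid_route, street_edges))  # 0.5
```

A model that has only memorized routes can still emit steps like A -> D; a model with a better internal map of the grid should emit fewer of them.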

This is evidence of stronger world-model acquisition. The model isn’t retrieving memorized routes. It’s generating from an internal representation that better reflects how the grid actually works.

NextLat also produces lower-rank latent representations and better sequence compression, evidence that the objective changes the internal structure of the representations rather than simply improving task performance.

The model is organizing its latent space differently, not just predicting more accurately.

NextLat outperforms standard next-token prediction and all baselines on world modeling, valid trajectory generation, sequence compression, and detour robustness.

It also outperforms Belief State Transformers (BST), the prior approach to the same problem, while being substantially more efficient. BST requires a separate transformer. NextLat requires only a small MLP.

On reasoning and planning benchmarks, NextLat shows consistent improvements.

On language modeling, gains are smaller, which makes sense. Natural language has more surface regularities to exploit and is less strictly dependent on compact world modeling.

A free inference speedup

NextLat’s latent dynamics model unlocks something standard transformers can’t do: variable-length self-speculative decoding.

Standard speculative decoding with multi-token prediction drafts a fixed number of tokens per step, constrained by the training horizon.

NextLat’s latent dynamics model composes recursively in latent space, so it can draft a flexible number of tokens regardless of the training horizon.

The result: up to 3.3x faster inference on language modeling benchmarks. No separate draft model. No architectural changes. The latent dynamics model trained during pretraining handles drafting at inference.
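The draft-then-verify loop behind speculative decoding can be sketched with toy stand-ins. Everything named here is hypothetical: `draft_fn` plays the role of a cheap drafter (in NextLat, the draft would come from rolling the latent dynamics model forward), `target_fn` plays the full model's greedy prediction, and the confidence threshold used to end a draft is an assumed stopping rule, not something the article specifies. A real implementation would verify the whole draft in one parallel forward pass; the sequential calls below are for clarity only.

```python
def speculative_step(prefix, draft_fn, target_fn, min_confidence=0.5, max_draft=16):
    """One draft-then-verify round with greedy acceptance.

    draft_fn(tokens)  -> (token, confidence): cheap guess for the next token.
    target_fn(tokens) -> token: the full model's greedy next token.
    """
    # Draft: keep guessing while the drafter is confident, so the draft
    # length varies from round to round.
    draft = []
    while len(draft) < max_draft:
        token, confidence = draft_fn(prefix + draft)
        if confidence < min_confidence:
            break
        draft.append(token)

    # Verify: accept the longest draft prefix the full model agrees with.
    accepted = []
    for token in draft:
        expected = target_fn(prefix + accepted)
        if token != expected:
            accepted.append(expected)   # fix the first mismatch and stop
            return accepted
        accepted.append(token)
    # Whole draft accepted (or nothing drafted): add one full-model token.
    accepted.append(target_fn(prefix + accepted))
    return accepted

# Toy models: the "true" sequence counts upward modulo 10.
def target_fn(tokens):
    return (tokens[-1] + 1) % 10

def draft_fn(tokens):
    last = tokens[-1]
    guess = 0 if last == 4 else (last + 1) % 10   # drafter is wrong after a 4
    confidence = 0.1 if last == 7 else 0.9        # and unsure after a 7
    return guess, confidence

print(speculative_step([1], draft_fn, target_fn, max_draft=6))  # [2, 3, 4, 5]
print(speculative_step([6], draft_fn, target_fn, max_draft=6))  # [7, 8]
```

Every round yields at least one token that matches the full model's greedy output, so the output is unchanged. The speedup comes from rounds in which several drafted tokens are accepted at once.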

The deeper implication

Next-token prediction only cares whether the final prediction is correct. Many different internal representations can produce the same output distribution.

NextLat constrains the representations themselves.

To predict future hidden states consistently over multiple steps, the model must organize its latent space in a way that captures the underlying dynamics of the process generating the sequence.

Standard NTP is well-aligned with fitting the training distribution. It is weakly aligned with understanding the world well enough to generalize.

A model can achieve strong benchmark performance through sophisticated pattern-matching without a compact internal world model.

NextLat directly targets that gap.

What this means for how you think about LLMs

A large language model that predicts tokens accurately is not the same thing as a large language model that has built a compact model of the world.

Current transformers can achieve strong predictive performance without explicitly learning belief states. NTP allows many solutions, some structurally elegant, some built of surface shortcuts. It does not strongly prefer one over the other.

NextLat adds a training signal that rewards representations that behave like belief states. Whether it becomes a standard ingredient in future foundation-model training remains an open question.

What the paper establishes is one of the clearest theoretical links yet between a training objective and world-model formation, with empirical evidence that the link holds across domains.
