@rohanpaul_ai: New Microsoft paper argues that transformers generalize better when they learn compact internal states, not just next t…
Summary
Microsoft's NextLat paper proposes a self-supervised training method where transformers predict their next hidden state instead of just the next token, leading to more compact world models, better planning and reasoning, and up to 3.3x faster generation.
Cached at: 06/24/26, 04:19 AM
New Microsoft paper argues that transformers generalize better when they learn compact internal states, not just next tokens.
The problem is that normal transformers can look back at every earlier token, so they do not have to squeeze the past into a clean summary. Next-token prediction alone can reward shortcuts that do not become coherent world models.
That can work beautifully on familiar data and still fail when the model has to plan, detour, reason, or carry a hidden structure forward.
NextLat fixes this by adding a training task where the model must predict its next hidden state, not just the next word.
A hidden state is the model’s private summary of what it has seen, so predicting the next one pushes the model to learn how situations change over time.
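As a rough illustration of the idea above (not the paper's implementation), the extra training task can be sketched as an auxiliary loss added to the usual next-token loss. Everything named here is an assumption made for illustration: the linear predictor weights `W_h` and `W_x`, the choice to condition on the next token's embedding, the mean-squared-error distance, and the `weight` parameter.

```python
import numpy as np

def next_latent_loss(hidden, token_emb, W_h, W_x):
    """Mean squared error between predicted and actual next hidden states.

    hidden:    (T, d) hidden states h_1..h_T produced by the transformer
    token_emb: (T, d) embeddings of the tokens x_1..x_T
    W_h, W_x:  (d, d) weights of a hypothetical linear latent predictor
    """
    # Predict h_{t+1} from the current state h_t and the next token x_{t+1}
    # (assumption: the predictor sees the next token).
    pred_next = hidden[:-1] @ W_h + token_emb[1:] @ W_x
    # Targets are the hidden states the model actually produced one step later.
    # In a real training loop these would likely be treated as constants
    # (stop-gradient); plain NumPy has no gradients, so this is only a note.
    target_next = hidden[1:]
    return float(np.mean((pred_next - target_next) ** 2))

def combined_loss(next_token_loss, latent_loss, weight=1.0):
    """Usual next-token loss plus the weighted next-latent term."""
    return next_token_loss + weight * latent_loss
```

If the hidden states already evolve in a way the predictor can follow, the latent term goes to zero; otherwise it penalizes states that do not carry enough information to anticipate the next one.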
The authors tested this on map-like world modeling, math reasoning, graph planning, story prediction, and regular language modeling.
The main result is that NextLat often learned more compact and useful internal states, solved planning tasks better, and sped up generation by up to 3.3x.
Overall, it gives transformers some of the useful memory behavior of recurrent models without changing the transformer architecture or slowing normal inference.
Link: arxiv.org/abs/2511.05963
Title: “Next-Latent Prediction Transformers Learn Compact World Models”
Jayden Teoh (@jayden_teoh_): Next-token prediction is myopic. What if transformers learn to predict their own next latent state?
🌠 We present Next-Latent Prediction (NextLat): a self-supervised learning method that teaches transformers to form compact world models for reasoning
Similar Articles
Next-Latent Prediction Transformers [R]
Microsoft Research introduces Next-Latent Prediction (NextLat), a self-supervised method that trains transformers to predict their own next latent state, enabling compact world models for reasoning and planning and achieving up to 3.3x faster inference via self-speculative decoding.
Next-Latent Prediction Transformers Learn Compact World Models
Introduces Next-Latent Prediction (NextLat), a self-supervised objective that trains transformers to predict their next latent state, encouraging compact internal world models and improving generalization across sequence modeling tasks.
@AlphaSignalAI: https://x.com/AlphaSignalAI/status/2069424192274252094
Microsoft's NextLat introduces a training objective that rewards belief-state representations instead of relying solely on next-token prediction, pushing models toward compact world models for better generalization.
@rohanpaul_ai: Interesting, this paper shows that Transformers may not need separate key and value projections to work well. This pape…
This paper investigates whether Transformers need separate key and value projections, finding that sharing them can reduce KV cache by 50% with only 3.1% higher perplexity, and further cuts when combined with GQA and MQA.
@machinestein: ICML 2026: Latent Reasoning in TRMs is Secretly a Policy Improvement Operator Why does recursive reasoning, especially …
The paper shows that latent reasoning in Tiny Recursive Models (TRMs) functions as a policy improvement operator, and proposes an algorithm that improves learning and inference efficiency by up to 18x.