@rohanpaul_ai: New Microsoft paper argues that transformers generalize better when they learn compact internal states, not just next t…

X AI KOLs Timeline 06/24/26, 03:26 AM Papers

transformers next-latent-prediction world-models microsoft self-supervised-learning reasoning

Summary

Microsoft's NextLat paper proposes a self-supervised training method in which transformers predict their own next hidden state in addition to the next token. The reported benefits are more compact world models, better planning and reasoning, and up to 3.3x faster generation.

Original Article

Cached at: 06/24/26, 04:19 AM

New Microsoft paper argues that transformers generalize better when they learn compact internal states, not just next tokens.

The problem is that normal transformers can look back at every earlier token, so they do not have to squeeze the past into a clean summary. Next-token prediction alone can reward shortcuts that do not become coherent world models.

That can work beautifully on familiar data and still fail when the model has to plan, detour, reason, or carry a hidden structure forward.

NextLat fixes this by adding a training task where the model must predict its next hidden state, not just the next word.

A hidden state is the model’s private summary of what it has seen, so predicting the next one pushes the model to learn how situations change over time.
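To make the idea concrete, here is a minimal toy sketch of what such an auxiliary objective could look like. The post does not give the paper's actual loss, so everything specific here is an assumption for illustration: the linear `predict_next_latent` head, the mean-squared-error distance, treating the target hidden states as constants, and the `latent_weight` knob are all hypothetical names and choices, not the authors' implementation.

```python
def linear(weights, bias, x):
    """Apply y = W x + b, with W stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def predict_next_latent(params, hidden, next_token_embedding):
    """Hypothetical dynamics head: guess h[t+1] from h[t] and token t+1."""
    features = hidden + next_token_embedding  # list concatenation
    return linear(params["W"], params["b"], features)


def next_latent_loss(params, hidden_states, token_embeddings):
    """Mean squared error between predicted and actual next hidden states.

    hidden_states[t] is the model's summary after reading tokens 0..t.
    Targets are plain numbers here, i.e. treated as constants (assumption).
    """
    total, count = 0.0, 0
    for t in range(len(hidden_states) - 1):
        predicted = predict_next_latent(
            params, hidden_states[t], token_embeddings[t + 1])
        target = hidden_states[t + 1]
        total += sum((p - y) ** 2 for p, y in zip(predicted, target))
        count += len(target)
    return total / count


def combined_loss(next_token_loss, latent_loss, latent_weight=1.0):
    """Usual next-token loss plus the weighted auxiliary latent loss."""
    return next_token_loss + latent_weight * latent_loss


# Toy "world": the hidden state is a running sum of token embeddings,
# so h[t+1] = h[t] + e[t+1] describes how the situation changes.
token_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hidden_states = [[1.0, 0.0], [1.0, 1.0], [2.0, 2.0]]

# A head that has learned the true dynamics: W = [I | I], b = 0.
perfect_head = {"W": [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]],
                "b": [0.0, 0.0]}
# A head that has learned nothing: it always predicts zeros.
untrained_head = {"W": [[0.0] * 4, [0.0] * 4], "b": [0.0, 0.0]}

perfect = next_latent_loss(perfect_head, hidden_states, token_embeddings)
untrained = next_latent_loss(untrained_head, hidden_states, token_embeddings)
print(perfect, untrained)  # 0.0 2.5
```

In this toy, the auxiliary loss only reaches zero when the next state can be computed from the current state and the incoming token, which is the pressure toward a self-contained summary that the post describes. Since the extra head in this sketch is only used to compute a training loss, it could be dropped at inference time.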

The authors tested this on map-like world modeling, math reasoning, graph planning, story prediction, and regular language modeling.

The main result is that NextLat often learned more compact and useful internal states, solved planning tasks better, and sped up generation by up to 3.3x.

Overall, it gives transformers some of the useful memory behavior of recurrent models without changing the transformer architecture or slowing normal inference.

Link – arxiv.org/abs/2511.05963

Title: “Next-Latent Prediction Transformers Learn Compact World Models”

Jayden Teoh (@jayden_teoh_): Next-token prediction is myopic. What if transformers learn to predict their own next latent state?

🌠 We present 𝗡𝗲𝘅𝘁-𝗟𝗮𝘁𝗲𝗻𝘁 𝗣𝗿𝗲𝗱𝗶𝗰𝘁𝗶𝗼𝗻 (𝗡𝗲𝘅𝘁𝗟𝗮𝘁): a self-supervised learning method that teaches transformers to form compact world models for reasoning

Similar Articles

Next-Latent Prediction Transformers [R]

Next-Latent Prediction Transformers Learn Compact World Models

@AlphaSignalAI: https://x.com/AlphaSignalAI/status/2069424192274252094

@rohanpaul_ai: Interesting, this paper shows that Transformers may not need separate key and value projections to work well. This pape…

@machinestein: ICML 2026: Latent Reasoning in TRMs is Secretly a Policy Improvement Operator Why does recursive reasoning, especially …
