@rohanpaul_ai: New Microsoft paper argues that transformers generalize better when they learn compact internal states, not just next t…


Summary

Microsoft's NextLat paper proposes a self-supervised training method in which transformers predict their next hidden state in addition to the next token, leading to more compact world models, better planning and reasoning, and up to 3.3x faster generation.


Cached at: 06/24/26, 04:19 AM

New Microsoft paper argues that transformers generalize better when they learn compact internal states, not just next tokens.

The problem is that normal transformers can look back at every earlier token, so they do not have to squeeze the past into a clean summary. Next-token prediction alone can reward shortcuts that never add up to a coherent world model.

That can work beautifully on familiar data and still fail when the model has to plan, detour, reason, or carry a hidden structure forward.

NextLat fixes this by adding a training task where the model must predict its next hidden state, not just the next word.

A hidden state is the model’s private summary of what it has seen, so predicting the next one pushes the model to learn how situations change over time.
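To make the idea of a next-hidden-state training task concrete, here is a minimal pure-Python sketch of a combined objective: the usual next-token cross-entropy plus an auxiliary term that scores how well a small head predicts the next hidden state from the current one and the incoming token. Everything named here is an assumption for illustration: the linear head `predict_next_latent`, the mean-squared-error penalty, the weight `lam`, and the function `nextlat_style_loss` are hypothetical and are not taken from the paper, whose exact loss is not described in this post.

```python
import math

def softmax_cross_entropy(logits, target):
    """Next-token loss for one position: -log softmax(logits)[target]."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[target]

def predict_next_latent(h, token_emb, W):
    """Hypothetical latent head: a linear map applied to [h ; token_emb]."""
    z = h + token_emb  # list concatenation, not element-wise addition
    return [sum(w * v for w, v in zip(row, z)) for row in W]

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def nextlat_style_loss(hidden, logits, tokens, embed, W, lam=1.0):
    """Average per-step loss: token cross-entropy + lam * latent prediction error.

    hidden[t] is the state after reading tokens[0..t];
    logits[t] are the scores for tokens[t + 1].
    """
    steps = len(tokens) - 1
    token_loss = 0.0
    latent_loss = 0.0
    for t in range(steps):
        nxt = tokens[t + 1]
        token_loss += softmax_cross_entropy(logits[t], nxt)
        # Assumed design: the real next state is the regression target.
        pred = predict_next_latent(hidden[t], embed[nxt], W)
        latent_loss += mse(pred, hidden[t + 1])
    return (token_loss + lam * latent_loss) / steps
```

With `lam=0` this reduces to plain next-token training. With `lam>0` the loss is only small when the next state can be computed from the current state and the new token alone, which is the pressure toward a compact summary that the post describes.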

The authors tested this on map-like world modeling, math reasoning, graph planning, story prediction, and regular language modeling.

The main result is that NextLat often learned more compact and useful internal states, solved planning tasks better, and sped up generation by up to 3.3x.

Overall, it gives transformers some of the useful memory behavior of recurrent models without changing the transformer architecture or slowing normal inference.
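The related summary further down this page attributes the generation speed-up to self-speculative decoding. As a rough illustration of that general technique (not the paper's implementation), the sketch below drafts a few tokens with a cheap predictor and keeps only those the full model agrees with, so the output matches ordinary greedy decoding. The names `target_next`, `draft_next`, and `k` are hypothetical stand-ins; how NextLat actually produces its drafts is not described in this post.

```python
def greedy_decode(target_next, prompt, n_new):
    """Baseline: one call to the full model per generated token."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(target_next(seq))
    return seq

def speculative_decode(target_next, draft_next, prompt, n_new, k=4):
    """Draft up to k tokens cheaply, then keep what the full model confirms.

    Returns the generated sequence and the number of verification rounds.
    """
    seq = list(prompt)
    goal = len(prompt) + n_new
    rounds = 0
    while len(seq) < goal:
        rounds += 1
        # 1) Draft: propose a block of tokens with the cheap predictor.
        draft = []
        for _ in range(min(k, goal - len(seq))):
            draft.append(draft_next(seq + draft))
        # 2) Verify: a real system would check the whole block in one
        #    parallel forward pass; here it is simulated token by token.
        for tok in draft:
            expected = target_next(seq)
            seq.append(expected)   # always emit the full model's own token
            if expected != tok:    # first disagreement ends this round
                break
    return seq, rounds

# Toy stand-ins: a deterministic "full model" and an imperfect drafter.
def toy_target(seq):
    return (seq[-1] * 3 + 1) % 7

def toy_draft(seq):
    # Deliberately wrong whenever the sequence length is a multiple of 5.
    return 0 if len(seq) % 5 == 0 else toy_target(seq)
```

Because every emitted token is the full model's own choice, the result is identical to `greedy_decode`; the gain comes from needing fewer verification rounds than tokens whenever the drafts are mostly right.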


Link – arxiv.org/abs/2511.05963

Title: “Next-Latent Prediction Transformers Learn Compact World Models”

Jayden Teoh (@jayden_teoh_): Next-token prediction is myopic. What if transformers learn to predict their own next latent state?

🌠 We present 𝗡𝗲𝘅𝘁-𝗟𝗮𝘁𝗲𝗻𝘁 𝗣𝗿𝗲𝗱𝗶𝗰𝘁𝗶𝗼𝗻 (𝗡𝗲𝘅𝘁𝗟𝗮𝘁): a self-supervised learning method that teaches transformers to form compact world models for reasoning

Similar Articles

Next-Latent Prediction Transformers [R]

Reddit r/MachineLearning

Microsoft Research introduces Next-Latent Prediction (NextLat), a self-supervised method that trains transformers to predict their own next latent state, enabling compact world models for reasoning and planning and achieving up to 3.3x faster inference via self-speculative decoding.

Next-Latent Prediction Transformers Learn Compact World Models

Papers with Code Trending

Introduces Next-Latent Prediction (NextLat), a self-supervised objective that trains transformers to predict their next latent state, encouraging compact internal world models and improving generalization across sequence modeling tasks.