@natashajaques: Really enjoyed reading the Microsoft MAI-Thinking-1 "Building a Hill Climbing Machine" paper. Amazing they publicly rel…


Summary

Natasha Jaques praises the Microsoft MAI-Thinking-1 paper for fully disclosing the training recipe for a frontier model, highlighting the token distribution across pre-training, mid-training, and RL post-training phases, and noting that Yann LeCun's cake analogy was prescient.


Cached at: 06/10/26, 01:51 PM

Really enjoyed reading the Microsoft MAI-Thinking-1 “Building a Hill Climbing Machine” paper. Amazing they publicly released all the info needed to train a frontier model, down to hparams.

I also thought this was pretty telling:

  • pre-training: 30 trillion tokens
  • mid-training (SFT on STEM/math/code data): 3.55 trillion tokens
  • RL post-training: 150 billion tokens

Looks like @ylecun was right all along with the cake analogy.

Obviously I still think something like RL (optimizing for long term goals) is fundamental to what we think of as intelligence. But it’s not the volume of learning signal, it’s the optimization on top of an already reasonable predictive model.
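
To make the proportions concrete, here is a rough sketch that turns the three token counts quoted in the post into shares of the total. The figures are taken from the post as written, not checked against the paper, and the sketch assumes tokens from different phases can be added together directly, which is a simplification. The names `TOKENS` and `token_shares` are illustrative only.

```python
# Token counts per training phase, as quoted in the post (not independently verified).
TOKENS = {
    "pre-training": 30_000_000_000_000,     # 30 trillion
    "mid-training": 3_550_000_000_000,      # 3.55 trillion
    "RL post-training": 150_000_000_000,    # 150 billion
}


def token_shares(tokens):
    """Return each phase's fraction of the total token count."""
    total = sum(tokens.values())
    return {phase: count / total for phase, count in tokens.items()}


shares = token_shares(TOKENS)
for phase, share in shares.items():
    # Print each phase's share as a percentage with two decimals.
    print(f"{phase:>18}: {share:.2%}")

# How many pre-training tokens there are per RL token.
pretrain_to_rl = TOKENS["pre-training"] / TOKENS["RL post-training"]
print(f"pre-training / RL ratio: {pretrain_to_rl:.0f}x")
```

Under these assumptions, RL post-training comes out at under half a percent of all tokens, with pre-training roughly 200 times larger, which is the thin top layer the cake analogy describes.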
