100 Trillion+ Pretraining data??? This is the largest dataset I've seen a model being trained on.

Reddit r/LocalLLaMA Models

Summary

A new AI model is reportedly being trained on over 100 trillion tokens, roughly double the upper end of the 27-50 trillion tokens typically used for models like Kimi, Mimo, and DeepSeek.

https://preview.redd.it/oss7g2gnll4h1.png?width=894&format=png&auto=webp&s=5d4295707a700ed7541c274b8be8ad75bbd0903d

Usually we see 27-50 trillion tokens for most models: Kimi, Mimo, DeepSeek. They seem to have doubled the pretraining data. Minimax-m2.5 was around 27T tokens. For comparison, here is what Mimo and DeepSeek have done:

- 27T for Mimo-v2.5-Pro (1 trillion parameters)
- 48T for the smaller Mimo-v2.5 model, which is multimodal
- 32T for Deepseek V4 Flash and Pro

I find it difficult to believe this model will be much bigger than the previous M2 series models. The training data scale is already huge, and a much bigger model on top of that would require far more resources. M3 seems likely to be under 500B params.
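To put rough numbers on that argument, here is a minimal Python sketch. It uses the token counts quoted in the post (not independently verified) and the common rule of thumb that training compute is about 6 FLOPs per active parameter per token. The `hypothetical_active` parameter count is a placeholder invented for illustration, not a figure for any real model, and the 6ND estimate is only an approximation, not how any lab actually budgets a run.

```python
# Token counts in trillions, as quoted in the post (not independently verified).
REPORTED_TOKENS_T = {
    "Minimax-m2.5": 27,
    "Mimo-v2.5-Pro": 27,
    "Mimo-v2.5": 48,
    "Deepseek V4 Flash/Pro": 32,
}
NEW_RUN_TOKENS_T = 100  # "100 Trillion+" from the title, treated as a lower bound


def token_ratio(new_tokens: float, old_tokens: float) -> float:
    """How many times more tokens the new run uses than an older one."""
    return new_tokens / old_tokens


def train_flops(active_params: float, tokens: float) -> float:
    """Rule-of-thumb training compute: ~6 FLOPs per active parameter per token."""
    return 6.0 * active_params * tokens


if __name__ == "__main__":
    for name, tokens_t in REPORTED_TOKENS_T.items():
        ratio = token_ratio(NEW_RUN_TOKENS_T, tokens_t)
        print(f"{name}: {tokens_t}T tokens, new run is {ratio:.1f}x larger")

    # Hypothetical active parameter count, chosen only to illustrate scaling.
    hypothetical_active = 10e9
    baseline = train_flops(hypothetical_active, 27e12)
    same_size = train_flops(hypothetical_active, 100e12)
    double_size = train_flops(2 * hypothetical_active, 100e12)
    print(f"Same model size, 100T tokens: {same_size / baseline:.1f}x baseline compute")
    print(f"Double model size, 100T tokens: {double_size / baseline:.1f}x baseline compute")
```

Under this approximation, compute grows linearly with both tokens and active parameters, so going from 27T to 100T tokens already multiplies the bill several times over before the model gets any bigger. That is the intuition behind guessing the model stays in the same size class as the M2 series.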
