100 Trillion+ Pretraining data??? This is the largest dataset I've seen a model being trained on.

Reddit r/LocalLLaMA 06/01/26, 04:38 AM Models

Summary

A new AI model is being trained on over 100 trillion tokens, roughly double the upper end of the 27-50 trillion token range typical of other models such as Kimi, Mimo, and DeepSeek.

https://preview.redd.it/oss7g2gnll4h1.png?width=894&format=png&auto=webp&s=5d4295707a700ed7541c274b8be8ad75bbd0903d

Usually we see 27-50 trillion tokens in most models: Kimi, Mimo, DeepSeek. They seem to have doubled the pretraining data. Minimax-m2.5 was around 27T tokens. Looking at Mimo and Deepseek, they have done:

- 27T for Mimo-v2.5-Pro (1 trillion parameters)
- 48T for the smaller Mimo-v2.5 model, which is multimodal
- 32T for Deepseek V4 Flash and Pro

I find it difficult to believe this model will be much bigger than the previous M2 series models. Training on this much data is already very expensive, and a much bigger model would need far more resources on top of that. M3 seems likely to be under 500B params.
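To put rough numbers on the "way more resources" argument, here is a back-of-the-envelope sketch. It assumes the common rule of thumb that training compute is about 6 × parameters × tokens FLOPs, which is not something the post itself states. It also plugs in total parameter counts, which overstates compute for MoE models (active parameters would be the better input, but the post does not give them). The 500B and 1T sizes for the 100T-token model are hypothetical, taken only from the guess above.

```python
T = 1e12  # one trillion
B = 1e9   # one billion


def training_flops(params: float, tokens: float) -> float:
    """Approximate training compute using the 6 * N * D rule of thumb (an assumption)."""
    return 6.0 * params * tokens


def tokens_per_param(params: float, tokens: float) -> float:
    """How many training tokens each parameter sees."""
    return tokens / params


# Figures quoted in the post: Mimo-v2.5-Pro, 1T parameters, 27T tokens.
mimo_pro = training_flops(1 * T, 27 * T)

# Hypothetical sizes for the 100T-token model (not confirmed anywhere).
m3_500b = training_flops(500 * B, 100 * T)
m3_1t = training_flops(1 * T, 100 * T)

print(f"Mimo-v2.5-Pro (1T, 27T tokens): {mimo_pro:.2e} FLOPs")
print(f"Hypothetical 500B, 100T tokens: {m3_500b:.2e} FLOPs "
      f"({m3_500b / mimo_pro:.2f}x Mimo-v2.5-Pro)")
print(f"Hypothetical 1T, 100T tokens:   {m3_1t:.2e} FLOPs "
      f"({m3_1t / mimo_pro:.2f}x Mimo-v2.5-Pro)")
print(f"Tokens per parameter at 500B:   {tokens_per_param(500 * B, 100 * T):.0f}")
```

Under these assumptions, even a 500B-parameter model on 100T tokens costs more compute than a 1T-parameter model on 27T tokens, and doubling the model size at that token count doubles the bill again. That is the intuition behind expecting M3 to stay on the smaller side.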
