A new AI model is being trained on over 100 trillion tokens, at least double the 27-50 trillion tokens typically used to pretrain other models such as Kimi, Mimo, and DeepSeek.
GPIC is a new large-scale image-text dataset and benchmark for generative modeling, claimed to be much more efficient than ImageNet and a better proxy for real-world problems, with fully permissive licensing for research and commercial use.
A new arXiv paper announces the largest multilingual food model, trained on 4.1M recipes across 7 languages with 1,790 ingredients and compressed into 2 MB.
SciAtlas is a large-scale, multi-disciplinary academic knowledge graph containing over 43 million papers and 3 billion triplets. It is designed to provide structured knowledge for AI-driven automated scientific research and comes with a neuro-symbolic retrieval algorithm.
Ant Group released Ring-2.6-1T, a 1-trillion-parameter reasoning model for agent workflows, featuring an MIT license, extended context, and Async RL + IcePop training, and achieving state-of-the-art results.
SWE-ZERO-12M-trajectories is the largest open agentic trace dataset for coding, with 112B tokens across 12M trajectories from 122K pull requests and 3K repositories, enabling scalable training of agentic coding models without requiring containerized execution.
Urban-ImageNet is a large-scale multi-modal dataset and evaluation benchmark for urban space perception from social media imagery, supporting scene classification, cross-modal retrieval, and instance segmentation tasks across 61 urban sites in 24 Chinese cities.