large-scale

Tag

#large-scale

100 Trillion+ Pretraining data??? This is the largest dataset I've seen a model being trained on.

Reddit r/LocalLLaMA · 3d ago

A new AI model is being trained on over 100 trillion tokens, roughly two to four times the 27-50 trillion tokens of pretraining data typically used by other models such as Kimi, Mimo, and DeepSeek.


@jcjohnss: GPIC should be the new standard benchmark for generative modeling. Training 1 epoch on GPIC is the same cost as 100 epo…

X AI KOLs Following · 5d ago

GPIC is a new large-scale image-text dataset and benchmark for generative modeling, claimed to be much more efficient than ImageNet and a better proxy for real-world problems, with fully permissive licensing for research and commercial use.


@josefchen: Launching our new paper on arXiv: we trained the largest multilingual food model ever built. 4.1M recipes. 7 languages.…

X AI KOLs Timeline · 2026-05-26

A new arXiv paper presents what its authors call the largest multilingual food model, trained on 4.1M recipes across 7 languages with 1,790 ingredients and compressed into 2MB.


SciAtlas: A Large-Scale Knowledge Graph for Automated Scientific Research

arXiv cs.AI · 2026-05-25

SciAtlas is a large-scale, multi-disciplinary academic knowledge graph containing over 43 million papers and 3 billion triplets, designed to provide structured knowledge for AI-driven automated scientific research with a neuro-symbolic retrieval algorithm.


Ring-2.6-1T is putting up SOTA-level numbers for real-world agents

Reddit r/ArtificialInteligence · 2026-05-18

Ant Group released Ring-2.6-1T, a 1-trillion-parameter reasoning model for agent workflows. It ships under the MIT license, offers extended context, is trained with Async RL + IcePop, and is reported to achieve state-of-the-art results.


@kevin_x_li: Introducing SWE-ZERO-12M-trajectories: the largest agentic trace dataset in the open, 5.7x larger than the previous lar…

X AI KOLs Following · 2026-05-13

SWE-ZERO-12M-trajectories is the largest open agentic trace dataset for coding, with 112B tokens across 12M trajectories from 122K pull requests and 3K repositories, enabling scalable training of agentic coding models without requiring containerized execution.


Urban-ImageNet: A Large-Scale Multi-Modal Dataset and Evaluation Framework for Urban Space Perception

Hugging Face Daily Papers · 2026-05-11

Urban-ImageNet is a large-scale multi-modal dataset and evaluation benchmark for urban space perception from social media imagery, supporting scene classification, cross-modal retrieval, and instance segmentation tasks across 61 urban sites in 24 Chinese cities.
