data-scaling

#data-scaling

VeriEvol: Scaling Multimodal Mathematical Reasoning via Verifiable Evol-Instruct

Hugging Face Daily Papers · 3d ago

VeriEvol is a framework for scaling reinforcement learning in visual mathematical reasoning while keeping reward labels reliable. It treats prompt difficulty and answer reliability as two separate axes, combining evolutionary operators with hypothesis-testing verification. It reports accuracy gains on a five-benchmark visual-math suite.

HumanScale: Egocentric Human Video Can Outperform Real-Robot Data for Embodied Pretraining

Hugging Face Daily Papers · 2026-06-18

This paper finds that egocentric human video, when processed with a filtering and labeling pipeline, can outperform teleoperated real-robot data for pretraining embodied foundation models, achieving lower validation loss and higher success rates on real-robot tasks.

Why don't frontier labs say how much data they are training on?

Reddit r/ArtificialInteligence · 2026-06-17

The post asks why frontier AI labs such as OpenAI and Anthropic do not disclose how much data they train on, and suggests that recent improvements may come from data volume rather than genuine intelligence.

Data Scaling as Progressive Coverage of a Predictive Contribution Spectrum

arXiv cs.CL · 2026-05-21

This paper proposes that real-data scaling laws are governed by progressive coverage of a latent predictive contribution spectrum rather than token-frequency tails alone, and provides empirical evidence using a suffix-automaton representation of text corpora.

@MangQiuyang: Open-ended coding training data may no longer be the bottleneck: AI can scale open-ended tasks—and even outperform huma…

X AI KOLs Timeline · 2026-05-15

FrontierSmith is a system that synthesizes open-ended coding problems at scale from closed-ended tasks. It generates, filters, and builds training environments; models trained on its data outperform those trained on human-curated open-ended data.
