The data black hole at the center of AI

Reddit r/artificial News

Summary

This article examines why AI's sample efficiency is far lower than humans': frontier models require massive amounts of domain-specific data, while humans can learn from just a few examples. It argues that this data black hole is a core bottleneck in current AI development. Through several comparisons (token count, robot manipulation, driving) and rebuttals of common objections, it shows how wide the gap is and explores what it means for the goals of AI automation.


Cached at: 06/22/26, 01:39 AM

### TL;DR

AI's sample efficiency is far worse than humans'. Frontier models rely on massive amounts of domain-specific data (billions to trillions of tokens), whereas humans can learn skills from just a handful of examples. This data black hole is the core bottleneck of current AI.

---

## Sample Efficiency: The Gap Between AI and Humans

An important definition of intelligence is **sample efficiency**: how much data is needed to operate fluently and competently in a given domain.

In recent years, we haven't gained a clear picture of how much progress has been made in training sample efficiency; instead, AI has mainly gotten better by broadening and improving the data distribution and by scaling up compute.

Reinforcement learning is the primary method. It is essentially a form of synthetic data generation: you pour a lot of compute into a verifier or judgment criteria (e.g., using a large language model as a judge) to identify good data, then train the model to predict those correct outcomes, much as you would train a model to predict the next word in internet text.

For this process to work, the model must already assign some prior probability to the correct solution. Therefore, you need an **incredible amount of human expert trajectory data**, covering every domain and skill the model is ultimately expected to handle.

This data is extremely task-specific and customized. For example, job postings from Mercor or Surge recruit:

- **Word experts**: responsible for converting old documents into polished Word files
- **Legal experts**: writing realistic M&A due diligence reports or securities documents
- **Management consultants**: writing templated market research

Behind each skill are at least hundreds of human experts, who generate example completions, write judgment criteria, and explain chain-of-thought reasoning.
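
The verifier-driven loop described in this section can be made concrete with a toy sketch. It is an illustration under loud assumptions, not any lab's pipeline: `sample_completions`, `verifier`, `build_training_set`, the `p_correct` knob, and the addition task are all hypothetical stand-ins for a policy model, an LLM judge or rubric, and real expert tasks.

```python
import random


def sample_completions(prompt, k, p_correct, rng):
    """Hypothetical policy model: return k candidate answers to a + b.

    Each candidate is correct with probability p_correct, which plays the
    role of the model's prior probability of the right solution.
    """
    a, b = prompt
    candidates = []
    for _ in range(k):
        if rng.random() < p_correct:
            candidates.append(a + b)
        else:
            # A wrong answer that is off by a small nonzero amount.
            candidates.append(a + b + rng.choice([-2, -1, 1, 2]))
    return candidates


def verifier(prompt, answer):
    """Hypothetical verifier: accepts only the exact sum."""
    a, b = prompt
    return answer == a + b


def build_training_set(prompts, k, p_correct, seed=0):
    """Sample k candidates per prompt and keep the ones the verifier accepts."""
    rng = random.Random(seed)
    kept = []
    for prompt in prompts:
        for answer in sample_completions(prompt, k, p_correct, rng):
            if verifier(prompt, answer):
                kept.append((prompt, answer))
    return kept


prompts = [(i, i + 1) for i in range(100)]
with_prior = build_training_set(prompts, k=8, p_correct=0.3)
no_prior = build_training_set(prompts, k=8, p_correct=0.0)
print(len(with_prior), len(no_prior))
```

With `p_correct=0.0` the verifier never sees a correct answer, so `no_prior` is empty. Filtering can only surface behavior the model already produces some of the time, which is why expert trajectories are needed to seed that prior.
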
The data industry that generates these expert labels, along with the reinforcement learning environments that bring these skills together, brings in billions of dollars in revenue annually and will soon reach tens of billions.

Imagine needing the equivalent of decades of coursework, hundreds of professors guiding you simultaneously, and millions of practice tasks just to learn how to beautify a Word document. The task count alone understates the disparity, because the model must also practice each task over and over (and each task is harder). A human student might practice a textbook problem once or twice, but with GRPO (Group Relative Policy Optimization), these models generate hundreds to thousands of outputs per task to solve the credit assignment problem.

The correct way to view these models is not as humans who have mastered multiple skills, but as **Frankenstein monsters**: stitched together from tens of billions of carefully constructed examples.

---

## The Scale of the Data Black Hole

Epoch recently reported that open models lag behind frontier models by about four months. Open-source models can catch up relatively easily because **data is the real driver of progress**: data can be easily distilled from public APIs, whereas hyperparameters, training tricks, and architectural optimizations cannot. If the latter were driving most of the progress, catching up would be far harder.

It's easy to forget just how much data these models are trained on, far more than a human sees in a lifetime. AI is like a galaxy of capabilities, but at its center, invisible to the naked eye, holding all the constellations together, is an unimaginably huge **data black hole**.

A few comparisons illustrate the sample efficiency gap:

### Comparison 1: Token count

Assume an average human sees and hears 2000 words per hour (a generous estimate). From birth to adulthood, that's roughly **200 million tokens**.
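
The 200 million figure can be checked with quick arithmetic. The sketch below assumes 16 waking hours per day and 18 years to adulthood, neither of which is stated in the text, and counts one word as one token.

```python
WORDS_PER_HOUR = 2000       # from the text
WAKING_HOURS_PER_DAY = 16   # assumption, not stated in the text
YEARS_TO_ADULTHOOD = 18     # assumption, not stated in the text
DAYS_PER_YEAR = 365


def lifetime_words(words_per_hour, hours_per_day, years):
    """Total words seen and heard over the given number of years."""
    return words_per_hour * hours_per_day * DAYS_PER_YEAR * years


human_words = lifetime_words(WORDS_PER_HOUR, WAKING_HOURS_PER_DAY, YEARS_TO_ADULTHOOD)
print(f"{human_words:,}")  # prints 210,240,000
```

Under these assumptions the total comes to about 210 million words, consistent with the rough 200 million estimate.
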
Frontier models are trained on **trillions to quadrillions of tokens**, a gap of nearly **a million times**.

### Comparison 2: Robotic manipulation

A human can learn to teleoperate an arbitrary humanoid robot or robotic arm within a few hours. If AI could learn this fast, robotics would be a trillion-dollar industry, and you could have endless Unitree G1 robots doing useful work in the world. But it can't: even millions of hours of collected demonstration data are not enough for AI to perform complex, open-ended tasks.

### Comparison 3: Driving

A teenager typically learns to drive with about **20 hours of practice**. Even accounting for 16 years of growing up, understanding the world, and building physical intuition, the data volume is still **three to four orders of magnitude less** than what Waymo and Tesla use to train their autonomous driving models.

---

## Common Objections and Their Rebuttals

### Objection 1: Evolutionary pretraining (Karpathy's view)

"For humans, billions of years of evolution is equivalent to pretraining. So comparing a human's lifetime data to a cold-start LLM is unfair."

**Rebuttal**: The human genome is only about 3 GB (roughly 3 billion base pairs), of which just 1%–2% codes for proteins. There simply isn't enough space to store the network parameters from evolutionary pretraining. A closer analogy is that evolution found the right hyperparameters and loss functions, while humans still build their brain connectome (the analogue of neural network weights) from scratch during a lifetime.

Even if we concede pretraining, it doesn't explain why **every new marginal capability requires so much data**: once educated, a human doesn't need a hundred professors to learn a new programming language, but AI, even after pretraining, still needs massive amounts of data to learn the next skill.

### Objection 2: Multimodal data

"Humans also have vast amounts of visual, auditory, and other multimodal data throughout their lives, so comparing only language tokens is unfair."
**Rebuttal**: Blind or deaf people, though cut off from some sensory input, still possess general intelligence. Deaf people communicate through sign language and reading, likely consuming far fewer than 200 million language tokens. This suggests that the million-fold difference estimated earlier may even be an underestimate.

### Objection 3: Insufficient scaling

"Scaling laws show that larger models are more sample-efficient. The human brain has 100 trillion synapses; current frontier models have about 5 trillion parameters. Perhaps scaling up by one or two orders of magnitude will reach human level."

**Rebuttal**: In the Chinchilla-style scaling law, L(N, D) = E + A/N^α + B/D^β, where N is the parameter count and D is the number of training tokens, the parameter term and the data term add to the loss independently. Relative to compute-optimal training, increasing parameters to infinity only reduces the required data by **about 10x** (based on the fitted Chinchilla constants). Humans are thousands to millions of times more efficient than these models, and scaling alone cannot bridge that gap. Humans are on a completely different scaling curve.

---

## Why Care About Sample Efficiency? Impact on AI Goals

The labs have two major goals: **automating white-collar work** and **automating AI research itself**.

The bet for white-collar work is that common tasks (software engineering, analysis, accounting, etc.) can be included in the training distribution. Revenue curves from the past few months suggest that including common tasks in the distribution yields enormous value, even if it can't replicate the specificity of human learning.

Training AI to perform such tasks may be less efficient than training a human, but so what? Human lifespans don't allow for such massive and broad training. If a person had to read every public repository on GitHub to become a competent software engineer, training would be impractical: they'd be collecting Social Security by then, and could only handle one project at a time.
But AI can learn these skills through a one-time gigawatt-scale training effort, and the learning is amortized over billions of sessions. So even if training is extremely inefficient, huge gains are still possible.

As for **out-of-distribution thinking**: some jobs require handling problems far from the training distribution every day (e.g., software engineering). These should be the first jobs AI replaces, but I'm willing to bet that by 2028, overall demand for human software engineers will be higher than today, largely because AI is a complementary input to their work.

The labs' plan for the latter type of work is: **first automate AI research**, then let the automated AI researcher solve the sample efficiency problem.

The question is: can AI that does not have human-level sample efficiency solve the research problems that stand between us and human-level intelligence? That's a complex question, and I'll discuss it in a longer blog post in the future.

Currently, people think about intelligence explosions too crudely: they either completely deny the possibility of AI accelerating AI, or assume something godlike pops out the other end. They don't reason carefully about what a period of faster-than-usual AI progress (but taking place on top of existing LLMs) would look like.

---

Source: The data black hole at the center of AI (https://www.youtube.com/watch?v=4pG3SJQPAwk)
