Data Isn't Scarce. Your Imagination Is (8 minute read)

TLDR AI News

Summary

Asuka Zheng argues that the 'running out of training data' panic is misplaced; the real scarcity is a lack of imagination in collecting diverse, long-horizon data, illustrated by her SRE replacement project and broader research trends.

Asuka Zheng argues the "we're running out of training data" panic misses the actual shape of the data market, recounting her own SRE-replacement project that trained two world models until it stalled because end-to-end long-horizon incident trajectories from first anomaly to full resolution did not exist as a dataset.

Cached at: 05/29/26, 06:31 PM



Data Isn’t Scarce. Your Imagination Is.

We’re running out of training data.

No. We haven’t started collecting it.

When we talk about data, we picture reasoning, code, audio, image, robotics. Sure, those are real, and they already minted fortunes. But that list isn’t the real shape of data. It just marks the current edge of our imagination. The real space runs so much larger that some of the data sitting in it, we can’t even picture what shape it will take.

I learned this the hard way, and then I watched it confirm itself everywhere I looked.

I spent a long stretch of my life doing SRE, so I kept circling one question: could a large model replace a human SRE engineer, so we’d stop getting woken up by oncall at midnight? We actually tried. We wrote a paper on it. And once we got deep in, we found the hard part wasn’t the model. It was the environment. You almost can’t simulate a real SRE incident. So we reached for an idea that felt elegant at the time: train two world models. One plays the K8s expert and generates commands. The other plays the K8s environment itself and generates plausible responses under specific scenarios. Let them spar. Let the model learn to operate inside a world it conjured for itself.

It didn’t work.

What stopped us was almost cruelly specific. We couldn’t get enough data, and what we did collect came out low quality. We needed the kind of data that captures a real K8s cluster across an entire incident, from the first anomaly to the moment it’s fully resolved, and exactly how it responds at every step in between. In the end we fell back to a multi-agent harness, a sophisticated framework running RL on a model with very little data, and the results came out mediocre. But I swear we were almost there. The direction was right. We just missed that one kind of data, and we ran out of time before the submission.

Looking back now, I realize I’d walked straight into the doorway of an entire new world. I was trying to fit the real behavior of a system into a model, and the thing that turned out scarce was end-to-end, long-horizon data. Not a single instant. A complete chain of cause and effect.
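To make "end-to-end, long-horizon" concrete, here is a minimal sketch of what one such record could look like, together with the two-model sparring loop described earlier. It is not the code from our paper. Every name in it (Step, Trajectory, rollout, the toy_* functions) is hypothetical, and the two "models" are canned stand-ins that return placeholder strings, not real cluster output.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Step:
    command: str   # what the operator model issued
    response: str  # what the environment model answered


@dataclass
class Trajectory:
    anomaly: str                                  # first observed symptom
    steps: List[Step] = field(default_factory=list)
    resolved: bool = False                        # did the chain reach resolution?

    def is_end_to_end(self) -> bool:
        # Usable only if it runs from the anomaly to a confirmed resolution.
        return bool(self.steps) and self.resolved


def rollout(
    anomaly: str,
    operator: Callable[[str], str],
    environment: Callable[[str], str],
    is_resolved: Callable[[str], bool],
    max_steps: int = 20,
) -> Trajectory:
    """Let the operator and the environment take turns until resolution."""
    traj = Trajectory(anomaly=anomaly)
    observation = anomaly
    for _ in range(max_steps):
        command = operator(observation)
        response = environment(command)
        traj.steps.append(Step(command, response))
        if is_resolved(response):
            traj.resolved = True
            break
        observation = response
    return traj


# Hypothetical stand-ins for the two learned models (assumptions, not real output).
def toy_operator(observation: str) -> str:
    if "CrashLoopBackOff" in observation:
        return "kubectl rollout undo deployment/web"
    return "kubectl get pods"


def toy_environment(command: str) -> str:
    if command == "kubectl get pods":
        return "pod web: CrashLoopBackOff"
    return "rollback complete"


def toy_is_resolved(response: str) -> bool:
    return "rollback complete" in response


traj = rollout("latency alert on web", toy_operator, toy_environment, toy_is_resolved)
print(len(traj.steps), traj.is_end_to_end())
```

The is_end_to_end check is the point of the sketch: a rollout cut off before resolution still holds steps, but it lacks the complete chain of cause and effect and would be filtered out.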

The research of this past month convinced me that this is happening everywhere at once.

Thinking Machines Lab’s work on interaction models shook me the most. Everyone this year talks about proactive agents. I tried to fake that proactivity myself, bolting it onto the outside with wearables. Thinking Machines showed me something deeper: give the model the right data, and that capability grows natively from inside it. Interactivity stops being scaffolding you strap to the outside. The model absorbs it. And so silence, interruption, the precise instant the visual world changes, things no dataset ever bothered to record, all turn into data you must train on and can train on.

A more radical direction comes from a paper called Neural Computers (a Meta AI and KAUST team; Jürgen Schmidhuber sits among the authors, which fits, since he first brought the term “world model” into machine learning back in 1990). They fold computation, memory, and I/O into a single learned runtime state, then instantiate it as a video model that rolls out screen frames directly. In CLI and GUI environments, it generates the next frame from instructions, pixels, and user actions.

You should see the turn coming by now. The behavior of a computer itself becomes a stream of data we can model. Every frame of a terminal, every response to a mouse or a keystroke, used to be the cold deterministic output of a program. Now it’s an I/O alignment waiting for someone to learn it.

Bring any human-machine interface inside a model’s scope, and you instantly turn all of its responses and all of its timing into a brand-new class of data. The number of possible interfaces has no ceiling: full-duplex voice, visual proactivity, a sense of elapsed time, and eventually the return to the physical world, the body, biosignals. Every new interface conjures a continent of data out of nothing, a continent that didn’t exist before, whose shape stays entirely open.

That’s the true shape of the data market. Not a fixed pie slowly growing larger. A map that keeps growing, that keeps sprouting new continents.

The signal comes from more than one direction. Block, Jack Dorsey’s company, wrote an essay called From Hierarchy to Intelligence, proposing to rebuild the company itself around a company world model and a customer world model. It’s the most inspiring essay I’ve read this year. This line hit me: money is the most honest signal in the world. People lie on surveys, ignore ads, abandon carts. But when they spend, save, send, borrow, or repay, that’s the truth. Every transaction states a fact about a real person’s life. And once you see transactions that way, a conclusion surfaces on its own, one that should unsettle anyone holding an enterprise AI strategy. ERP, CRM, ticketing systems, two decades of what we filed away as “business records,” never were dead databases. They were the alignment corpus between two world models, a company’s and its customers’, piling up in plain sight the whole time, waiting for someone to claim it. The old interfaces always were interfaces. We just never called them that. We might be sitting on a goldmine right now and calling it “process” and “records.”

Look further out. Three months ago, Karpathy’s autoresearch project let an AI agent run the research loop itself, running its own experiments and iterating on its own results. It turned the data of the research process itself into a hot topic across Silicon Valley.

Horizontally, infinite forms. Vertically, infinite depth. And now even the process that produces data turns into data.

But infinite interfaces don’t mean equal interfaces. The moment everything becomes an interface, the question quietly changes shape. It stops asking “where is there still data” and starts asking “which interface do we open first.” And that question has structure. Some interfaces lie. Surveys lie; money doesn’t. Some show you only the path that happened, never the one that didn’t, which is the exact wall I hit with K8s: one incident gives you one trajectory, never the counterfactual you actually needed to train on. Some bottom out one layer down. Others open a new world at every layer. And a rare few, once you model them, sprout interfaces nobody could have named before they opened.

Imagination opened the map. But a map isn’t a compass. What if you’re sure the market’s already mapped, and it isn’t?

This is what David Deutsch meant. This is the beginning of infinity. To him, knowledge has no terminus. Every good explanation gives rise to new problems we couldn’t have posed before, new possibilities we couldn’t have imagined. Data works the same way. The handful of categories we can name are just the few rooms we’ve already lit. Out in the dark wait countless more, and some of the doors, we don’t even know are there.

The new gold is data. Plenty of people say that already. But the mine almost everyone pictures is just this small patch under our feet. The real lode is the data we can’t yet name, can’t yet draw the shape of, the data that will surface continent by continent, every time we solve another problem, every time we open another mode of interaction.

We are not somewhere in the middle of the data age.

We are in its first chapter. On its first page.

References

Yang, P., et al. (13 authors). “AOI: Turning Failed Trajectories into Training Signals for Autonomous Cloud Diagnosis.” arXiv:2603.03378, March 2026. https://arxiv.org/abs/2603.03378

Thinking Machines Lab. “Interaction Models: A Scalable Approach to Human-AI Collaboration,” Thinking Machines Lab: Connectionism, May 2026. https://thinkingmachines.ai/blog/interaction-models/

Zhuge, M., Zhao, C., Liu, H., et al. (19 authors, including Jürgen Schmidhuber). “Neural Computers.” arXiv:2604.06425, April 2026. Meta AI and KAUST. https://arxiv.org/abs/2604.06425

Dorsey, J., and Botha, R. “From Hierarchy to Intelligence.” Published March 31, 2026, jointly on Block and Sequoia Capital. https://block.xyz/inside/from-hierarchy-to-intelligence

Karpathy, A. “autoresearch.” GitHub repository, released March 7, 2026. https://github.com/karpathy/autoresearch

Deutsch, D. The Beginning of Infinity: Explanations That Transform the World. Viking, 2011.


