@vipulved: https://x.com/vipulved/status/2071404852908081211

X AI KOLs Following 06/29/26, 01:26 AM News

economy tokens modular-architecture transformers open-weights ai-ecosystem interfaces

Summary

An essay arguing that the AI ecosystem is undergoing modularization similar to the PC revolution, with standardized interfaces like transformers, inference APIs, and agentic harnesses enabling specialization and rapid innovation, and that open-weights models are a direct economic consequence.

https://t.co/TIeuZQUj5D

Original Article

View Cached Full Text

Cached at: 06/29/26, 04:42 PM

The Economy of Tokens

Carliss Baldwin and Kim Clark argued that the most important economic event in technology industries is often not the invention of a new product, but the creation of a modular architecture with stable interfaces. This is why the PC ecosystem out-innovated many vertically integrated computer companies: Intel could improve processors, Microsoft could improve operating systems, and thousands of peripheral and software vendors could innovate independently.

Once the interfaces were stable, the ecosystem expanded explosively, improving along every axis at once: processors doubled in speed, graphics cards and modems arrived to extend what the machine could do, and a generation of software, spreadsheets, games, desktop publishing, drove demand that no single hardware maker could have created on its own. And because everyone built to the same interfaces, an advance by one vendor diffused sideways across the layer, quickly matched or bettered by others. A few vertically integrated companies prospered, “but a computer in every home and on every desk” became possible through disaggregation.

I believe something similar is happening in AI.

Over the last nine months, we’ve seen this firsthand at @togethercompute where we focus on efficient post-training and serving of open-weights models. The volume of tokens processed through our APIs increased by nearly 10,000×, from 30B tokens a month to 400T tokens a month, and the trend shows little sign of slowing. Tokens have become dramatically more useful: their application surface has widened, and as a result autonomous software agents, capable of performing long-range tasks with little or no human supervision, have moved from demos into production. This is the beginning of what is likely to be a monumental technological arc.

But what is perhaps less appreciated is that, as with previous platform markets, the appearance of stable interfaces is enabling specialization, independent innovation, and a modular re-organization of the AI ecosystem.

AI has already converged on three powerful, standardized interfaces:

The transformer architecture
Inference API (OpenAI compatible)
Agentic harnesses that operate over these APIs

While the token market is still in an early phase of development, despite producing multiple trillion-dollar startups, these three implicit interfaces have quietly standardized the production, consumption, and coordination of intelligence. The consequence is a progressive unbundling of the stack, with increasingly visible specialization and market formation at each layer.

What I hope to illustrate in this essay, among other things, is that open-weights models are a direct economic consequence of the Baldwin-and-Clark-style modularization of generative AI. Sandwiched between the stable interfaces of the transformer architecture below and the inference API and harness above, building a model has become far more focused and efficient than it was a few years ago. A shared architecture, increasingly capable shared machinery, and a commodity silicon path have collapsed the cost and calendar of standing up a frontier model. Advances diffuse rapidly, and the frontier, until recently limited to a few labs, is going to be reached by many at one.

It all started with transformers. Attention Is All You Need turned out to be more than the title of the paper. It has proven to be a remarkably general theory of intelligence.

The Industrialization of Transformers

The transformer captured something profound about intelligence: that the world can be represented as a sequence of tokens, and that predicting the next token forces a model to discover the relationships that give those tokens meaning. Previous generations of neural networks exploited the structure of particular domains: locality in images and sequentiality in language. The transformer instead learned structure itself. As a result, the same architecture is capable of modeling language, code, images, proteins, and increasingly, the physical world.

In much the same way that steam engines, interchangeable parts, and shipping containers transformed fragmented crafts into industrial systems, the transformer has become the common industrial substrate for intelligence. Every improvement to attention algorithms, optimizers, kernel libraries, inference engines, and training frameworks, advances the frontier for nearly every model at once, and an entire supplier ecosystem has emerged around it, spanning software frameworks, silicon, and models.

Inference Frameworks

Inference frameworks have proliferated. Several open-source inference engines—@vllm_project, @sgl_project, TensorRT-LLM, TokenSpeed—now compete to serve a common workload. Beneath the engines sit the kernel libraries: FlashAttention, FlashInfer, CUTLASS, Triton, DeepGEMM, ThunderKittens, the optimized GPU routines that carry out the actual arithmetic a transformer requires. FlashAttention restructures the attention computation to avoid shuffling an enormous matrix in and out of GPU memory; FlashInfer adds kernels tuned for serving, like paged key-value caches and speculative decoding; CUTLASS supplies the matrix-multiply building blocks specialized for transformer shapes and low-precision formats.

This depth of specialization is itself a product of standardization: because nearly every model is the same kind of object, the effort poured into making one operation faster is repaid across the entire industry rather than stranded on a single architecture. All these libraries are engine-agnostic. FlashAttention doesn’t care whether it’s called from vLLM or TensorRT-LLM, so when @tri_dao releases a new version, the gain propagates everywhere within days.

Training Frameworks

The same layering has formed on the training side. Distributed training frameworks: Megatron-LM, DeepSpeed, NeMo, and PyTorch’s FSDP, have standardized the parallelisms that makes training a large transformer tractable: sharding tensors, pipelines, and experts across thousands of GPUs, so a new lab starts with proven machinery rather than rebuilding it.

Above them, a faster-moving layer has formed for the RL post-training that now drives most frontier gains, and these frameworks are themselves modular compositions. For instance, slime, the open-source framework @Zai_org used to post-train GLM-5.2, bridges Megatron’s training loop and SGLang’s rollout engine through a shared data buffer, passing each engine’s controls through directly rather than wrapping them in a new abstraction. Because systems are being built with these modular pieces, an advance in one, a scheduling trick, an RL recipe, a fault-tolerance mechanism, diffuses to the rest within a release cycle.

Silicon

The standardization afforded by transformers has also created a stable target for hardware innovation. It has enabled @nvidia to pursue one of the most ambitious industrial roadmaps ever undertaken: a decade-long effort to compound advances in compute, memory, networking, packaging, and systems design into successive generations of accelerators and AI factories that reliably deliver more capability at greater scale and lower unit costs. It has also allowed @AMD to assemble an increasingly competitive platform, while Google Cloud TPU and AWS Trainium have emerged as credible alternatives by overfitting to the computational profile of transformer models. A new generation of hardware companies like @cerebras are emerging in what has historically been among the most capital intensive and entrenched sectors of technology. They don’t need to convince the world that it needs a new chip, they only need to claim a place on the Patero of price-performance.

The Open Frontier

The phenomenon we, at Together AI, witness most closely is the impact of this architectural standardization on the supply of frontier open-weights models. A popular view is that the open frontier is a result of distillation, a sort of cheap tracing of the closed frontier. This view is almost certainly mistaken. While some distillation is no doubt useful, what drives the open frontier is the diffusion of the recipe itself, the architecture and training methods that move freely from any one lab into all the others.

If you opened up Minimax, Mistral, Qwen, and DeepSeek, you’d find nearly the same design; the modern transformer block is an assembly of borrowed parts. Rotary position embeddings, from a single 2021 paper, displaced learned and absolute positional encodings and are now effectively universal. RMSNorm replaced LayerNorm; SwiGLU replaced ReLU/GELU in the feed-forward block, both from short Shazeer-era papers, both now default. Multi-query and then grouped-query attention (GQA), introduced to shrink the KV cache, propagated from @Google research into essentially every open model within a year. And mixture-of-experts, an idea from 2017’s sparsely-gated MoE and the Switch Transformer, went from a research curiosity to the dominant frontier design once Mixtral and DeepSeek demonstrated the economics of activating a fraction of parameters per token.

Training methods diffuse the same way. RLHF first taught base models to follow instructions; DPO then simplified that process, achieving the same alignment without the separate reward model RLHF required; Constitutional AI style synthetic feedback eased the human-labeling bottleneck that throttled both; and DeepSeek’s GRPO, together with the R1 recipe for eliciting long chains of reasoning, spread aqcross the ecosystem within weeks of being published.

This doesn’t mean that the transformer or the art of training and serving it has become static. The evolution is continuous. Attention has drifted far from its vanilla softmax beginnings and continues to transmute with almost every major model version – linear, sparse, and state-space variants that appear in modern models are a result of the pursuit of increasingly longer context. Muon and geometry-aware optimizers are starting to replace Adam/AdamW, which has been a de facto standard. But this evolution occurs atop a shared industrial platform, turning what would otherwise be isolated breakthroughs into industry-wide productivity gains.

Coordination of Intelligence

Every general-purpose technology eventually needs a layer that turns raw capability into useful work: the microprocessor got the operating system, the internet got the browser. For the model, that layer is the harness. Left alone, a model does one thing: predict the next token, again and again, until it has answered the prompt in front of it and stops. A harness has emerged as the software that sits between the model and the outside world and turns that stream of tokens into work. It runs the model in a loop—pose a task, let the model call a tool, feed the result back, repeat—and supplies what the loop needs to keep running: memory across steps, retries, permissions, and a sandbox where code can run. The transformer is intelligence; the API is language; the harness is agency.

The same loop—act, observe, correct—works whatever the task; it is built around calling tools, and software has already wrapped most of what we touch in one: calendars, inboxes, payment rails, documents, booking systems. The harness reaches as far as those tools reach, which is to say across most of our digital lives. Coding is where it became visible first, and because training models to code well turned out to sharpen tool use generally, the gains spilled far beyond software. That tool calling fluency is why the harness pattern appeared and jumped the IDE almost immediately: OpenClaw, which wires the loop into the everyday tools people already use, has become one of the most unexpectedly beloved consumer tools of the year.

What lets any harness drive any tool is another standardized interface: the Model Context Protocol. @AnthropicAI open-sourced MCP in late 2024, and within eighteen months it became the common standard for how a harness calls a tool, backed by every major model lab, tool company and now governed by a neutral foundation. Because the interface is shared, a tool written once works with any harness, and the model underneath can be replaced without rewriting anything above it. Other sub-standards are still forming beneath it. Skills, pioneered by Anthropic, as packaged capabilities a harness can pick up without retraining, are converging on a shared, filesystem-based shape. This standard hasn’t fully settled, but it is only a matter of time, because this standardization would turn scattered effort into compounding progress.

The harness pattern is reshaping the models themselves. Frontier labs increasingly post-train models to operate inside harnesses, to plan over many steps, call tools in order, and recover when one fails, using reinforcement learning against live environments rather than static examples. @MiniMax_AI documented one such approach in training M2.5 across more than a hundred thousand real agent environments. Because a coding agent might take fifty steps to fix a bug, a process reward grades each step of the trajectory rather than scoring only the final diff: it asks whether every tool call was well-formed and whether each move advanced the task, to produce a model that is remarkably better suited to drive a harness. The coordination interface in effect has become a training target, resulting in remarkably rapid progress in software engineering and tool use.

Software’s Industrial Turn

Three years ago, when SWE-bench was published, the frontier models at the time scored ~2% on this benchmark. I remember being surprised by the quixotic audacity of SWE-bench. Today, most open and closed models sit comfortably above 85%. The progress in the models that is reflected in this near saturation SWE-bench has produced a legitimate revolution in software engineering. Software production is evolving from an artisanal craft into an industrial one, and the inevitability of software AGI, however you choose to describe that term, is no longer a matter of imagination, but rather a mechanical extrapolation of a time-series.

The models on the absolute frontier remain awe-inspiring, but software engineering competence has become a property of the wider frontier as a whole. Open-weight and derivatives: Composer 2.5, GLM 5.2, Kimi 2.6, Nemotron, Minimax 3 are not only competent, they are widely deployed as software engineers. A substantial portion of the 400T tokens we serve is software. The task of creating software, which is nuanced and complex, is increasingly being solved by AI models as a category, not by an exclusive oligopoly at the nosebleed of the frontier.

The Business Case for Generosity

The obvious objection to open weights is economic: how do you fund frontier-scale capital expenditure while giving the product away?

Open weights model companies, initially with unclear business models, have developed revenue engines by charging for fully produced tokens through their APIs and applications and via licensing model weights to distributors, from telcos to model platforms to companies post training their models to create new intellectual property.

That these engines are starting to become real is now visible in the financials: @Kimi_Moonshot’s ARR crossed $100M in March 2026 and more than doubled past $200M within a month, and @MiniMax_AI, which is public and discloses its results, is showing a similarly rapid revenue ramp. Taken together with the underlying demand for the models, these numbers suggest that open weights companies are innovating on business models as much as on capability and starting to find real purchase.

And capital is flooding toward the open approach on both sides of the Pacific. In Europe, @MistralAI raised a €1.7B Series C at an €11.7B valuation in September 2025, led by @ASMLcompany, and is targeting more than $1B in annual recurring revenue by the end of 2026. In China the open cohort has gone vertical: Moonshot’s valuation went from roughly $4.3B at the end of 2025 to a $20B in May 2026 on a $2B round. Zhipu listed in Hong Kong in January 2026 and has already set the open-weights world on fire with the release of GLM 5.2. Listing by MiniMax followed, leaving the four front-rank Chinese labs carrying more than $180B in combined value. In the US, @reflection_ai raised $2B at an $8B valuation, explicitly to be the Western open-frontier lab. Venture is pricing open weights as a strategic asset, not a giveaway.

And not every open-weights business model needs to be built around direct revenue. With its Nemotron family, NVIDIA does give away what others are monetizing, not just the weights, but the training datasets, the post-training recipes, and the reinforcement-learning environments behind them. NVIDIA’s largest customers are also its emerging rivals, designing their own accelerators, and a broad, healthy population of open models, runnable on anyone’s hardware, is a great counterweight: it keeps the widest possible market from consolidating onto silicon controlled by a handful of frontier companies.

Other efforts to commoditize the complement are likely to appear. Sovereign states, for instance, will pursue independence in the production of intelligence as they did in energy, unwilling to let a foreign stack supply a strategic input. And building the frontier, even as it moves, will only get easier as the substrate matures: more standardized below, more modular above, cheaper to build on at every layer.

An Order of Magnitude Cheaper

Open weights aren’t just converging on benchmarks, and offering frictionless substitution, they are undercutting closed frontier models on price. And unlike benchmark performance, the cost gap is not marginal but categorical.

In a competitive, disaggregated inference market, with multiple providers competing on price-performance-capacity by applying significant capital, focus and research to the problem of efficiently extracting tokens per watt and capex dollar, the pricing tends to be optimal rather than a single vendor banking the spread. As a result open weights tokens now cost over an order of magnitude less than the closed frontier tokens they replace.

The simple arithmetic, that @chamath shared in a recent X post, is illustrative:

Chamath Palihapitiya@chamath·Jun 7Your margin is my opportunity: AI version…

The biggest surprise of 2026 is that the capability gap between the best open-weight/source models and the best closed models has narrowed much faster than the pricing gap. The pricing gap remains enormous while the capability gap isShow moreQuoteGavin Baker@GavinSBaker·Jun 6Quite a week for open-source AI. Especially American open-source. Nemotron 3 Ultra is the most important release in quite some time. And some really cool RL and fine-tuning work from Harvey. x.com/victormustar/s…2433402K936K

Chamath’s example actually understates the ratio of input to output tokens, which is typically 300:1 for agentic workloads, so the results tend to be even more dramatic. As an experiment, we tried the same set of agentic tasks with three models — Minimax M2.7, Kimi K2.6 and Opus 4.8 — and found MiniMax M2.7 was 5x faster and 63x cheaper than Opus 4.8. To be sure, there are still accuracy tradeoffs, but they are becoming less obvious, and closing rapidly. Many companies are already employing harnesses to mix and match, closed and open tokens to achieve optimal cost, as @coinbase recently shared:

Brian Armstrong@brian_armstrong·Jun 27How to keep AI spend flat while token usage grows exponentially: Not with friction and spend alerts. With better defaults, routing, and caching.

Better Defaults (not Usage Caps) – Engineers can choose any model they want, but defaults matter. We’re experimenting with defaultingShow more4461K5.7K3.4M

Sun Microsystems and the Bertrand Collapse

Closed frontier labs hold a differentiation premium that’s real as long as the quality gap is real. The moment open weights cross “good enough” for a given workload, the premium for closed models on that workload collapses toward zero.

This is what happened to Sun Microsystems’s server platform with the appearance of the modular ecosystem of Linux, Intel & AMD. Through the dot-com boom Sun was “the dot in dot-com,” with a market cap somewhere around $200 billion at the peak, because its premium was genuinely justified given reliability, scalability, Solaris’ stability and integrated support.

But two things collapsed that advantage. Linux matured into something enterprise-grade and AMD’s Opteron in 2003 brought 64-bit to commodity x86, erasing one of the last clean technical reasons to pay for RISC/Unix on a large class of workloads. Sun was acquired by Oracle in 2010. The dominance of the disaggregated ecosystem was so complete in this case that ironically the bits of Sun that survive today are in the Java ecosystem that was designed for modularity, as a tool to “write once and run anywhere”.

I don’t expect a wholesale Sun-style collapse of the closed frontier labs. But I do believe that market formation around open weights is already impossible to stop: open weights will keep commoditizing the classes of workloads that become sufficiently popular and application independent.

Land, Power, Shell, Chips

Tokens are becoming a fundamental resource, like capital, electricity, and bandwidth. It takes flops to produce tokens, and flops take power. The world runs 15 to 20 GW of AI capacity today, about as much as Bitcoin mining, and McKinsey and Goldman expect 150 GW within five years, implying $6 to $7 trillion of investment in AI infrastructure.

Today, that bill is paid mostly by a handful of giants borrowing against their own balance sheets, with the largest hyperscalers on track to spend roughly $700 billion this year. But the way AI infrastructure gets funded is starting to disaggregate too. NVIDIA publishes a standard blueprint for an AI data center, and a class of specialist neoclouds has grown up to build them: because every transformer runs on the same kind of GPU cluster, capacity underwritten for one customer is readily resold to the next, and the thin, standardized workload makes such a cloud far cheaper to stand up than the hundreds of services a hyperscaler carries.

Capital is following the same path into separate, tradeable pieces. Wall Street is starting to lend against the GPUs themselves: @CoreWeave recently raised $8.5 billion at an investment-grade rating, secured by chips and not just the contracts to rent them. Whole campuses are getting financed by outsiders rather than the tech companies, as pioneered in Meta’s funding of its $27 billion Hyperion site through a joint venture with Blue Owl. And sovereign wealth funds are paying for AI capacity as critical infrastructure, the UAE through a $100 billion vehicle, Saudi Arabia through @HUMAIN, for reasons of both wealth creation and sovereignty over a strategic input. The pieces once bundled inside hyperscalers, the buildings, the power, the chips, and the money for all of it, are coming apart into separate markets with their own specialists and investors.

This disaggregation of the bottom of the stack is still early, but it is potentially immense: a reorganization that could create enormous wealth and, by opening each layer to new entrants, distribute it far more widely than a few balance sheets ever could.

Open weights could be quietly central to this. A GPU cluster financed against a single company’s closed model is a bet on that company; the same cluster serving open weights is a general-purpose asset, fungible across customers and models, and a fungible asset is one that capital can securitize at scale.

The Politics of Abundance

If there is a real threat to the trajectory of open-weights, it is not economic but political. The economy of tokens may become the largest economy in the world, and a disaggregating ecosystem is precisely the kind of thing the incumbents it threatens may try to arrest. The most durable way to do that is not to out-compete the open frontier but to regulate it: to lobby to write the rules of a new market in terms that only a few vertically integrated companies can satisfy. This is an old pattern. The railroads were eventually bound by the Interstate Commerce Act after their power became a political question; AT&T’s vertically integrated monopoly over American telephony was broken up by antitrust only after decades of entrenchment.

Concentrated control of critical infrastructure invites a regulatory response, but regulation is also the instrument by which incumbents can freeze a market in the shape that favors them. For open weights, the biggest risk is regulatory capture: rules written, in the name of safety or security, to make openness itself the liability.

We are at an early and consequential juncture in a new and immensely important industry. A world where intelligence is abundant, open, and widely accessible is within reach, yet nothing about the technology guarantees that outcome. The disaggregation is real and the economics favor it, but markets this large and this strategic are seldom left to settle on their own. Whether intelligence becomes a commons or a chokepoint will be decided less by what the models can do than by who is permitted to build them.

For now, the question is still open.

@vipulved: https://x.com/vipulved/status/2071404852908081211

The Economy of Tokens

The Industrialization of Transformers

Coordination of Intelligence

Software’s Industrial Turn

The Business Case for Generosity

An Order of Magnitude Cheaper

Sun Microsystems and the Bertrand Collapse

Land, Power, Shell, Chips

The Politics of Abundance

Similar Articles

Open weights aren't catching up to closed models by copying them, but they're winning because of how the whole AI stack is quietly modularising

@oneill_c: https://x.com/oneill_c/status/2054604986269802579

@tuhinone: https://x.com/tuhinone/status/2054603346905080136

@JayaGup10: https://x.com/JayaGup10/status/2052870394093408558

The economics of AI are starting to favor open models

Submit Feedback

Similar Articles

Open weights aren't catching up to closed models by copying them, but they're winning because of how the whole AI stack is quietly modularising

@oneill_c: https://x.com/oneill_c/status/2054604986269802579

@tuhinone: https://x.com/tuhinone/status/2054603346905080136

@JayaGup10: https://x.com/JayaGup10/status/2052870394093408558

The economics of AI are starting to favor open models