@0xcherry: https://x.com/0xcherry/status/2067610347633025281

X AI KOLs Timeline 06/18/26, 02:08 PM News

glm-5.2 model-comparison ai-development chinese-ai activation-parameters post-training rlhf

Summary

This article analyzes the reasons behind the performance leap of Zhipu GLM-5.2, suggesting that its 40B activation parameters provide greater effective capacity after accounting for fixed overhead, making RL post-training more effective. It also reviews the history of Chinese AI model development and notes that the large model approach ultimately prevailed.

https://t.co/P5udLXMZw6


Cached at: 06/18/26, 06:19 PM

Review of GLM-5.2: When Chinese AI Models Start to Deliver

With the release of Zhipu GLM-5.2, user reception has been overwhelmingly positive, with benchmark scores approaching those of Opus 4.8. After extensive cross-validation and hands-on experience, I am convinced that GLM-5.2 is an extremely powerful model, fully capable of being deployed in serious production-level development tasks.

Curious about why GLM-5.2 succeeded, I studied this model series in detail, both across its own versions and against its competitors. This article presents that analysis and looks ahead to the future of the industry.

A friendly reminder: This article is not a product review, nor is it a paid advertisement. It is simply an analysis of why GLM-5.2 has achieved such strong results and a forward-looking perspective. If any speculations turn out to be wrong, just pretend you didn’t see them. Thank you.

1. Bigger is Better

When discussing the source of GLM-5.2’s acceleration, the mainstream narrative points to advances in the training pipeline: a well-executed slime framework, good critic-based PPO, and well-prepared long-horizon RL data.

These are all technically correct, but they are insufficient to explain why GLM-5.2 progressed so quickly.

Look at the timeline. In early 2026, Zhipu was the first to release GLM-5, with Kimi and MiniMax following closely behind. DeepSeek later released V4. GLM-5 was not the most outstanding model at launch; it was even briefly overshadowed by Kimi.

In June, however, GLM-5.2 arrived with a massive improvement over the start of the year. The jump was substantial despite there being no major version bump.

When GLM-5 launched, it was criticized for relying heavily on distilled data. But all of the “big three” users of distilled data (Kimi, MiniMax, Zhipu) do the same, yet GLM improved the fastest. This is highly anomalous.

Comparing the models side by side, the most critical difference likely lies in their activation parameters, that is, the parameters actually active for each token.

DeepSeek V3 series has 37B activation parameters, maintained across V3.1/V3.2. GLM-5/5.1/5.2 is at 40B, essentially the same tier. Kimi K2 series has 32B activation. MiniMax went from 10B in M2 to 23B in M3. Both are clearly a tier smaller.

There is much debate in the industry about whether models should be larger or smaller, but frankly these discussions are often watered down. Rather than rehash them, it is easier to put forward a concrete hypothesis.

By plain arithmetic, 40B is 25% larger than 32B and 74% larger than 23B. This gap seems small, or at least insufficient to explain the speed of performance improvement.

However, if we truly understand a large model’s weights as a structure analogous to the human brain, we can formulate a hypothesis.

A large portion of the brain is dedicated to basic functions: heartbeat, breathing, reflexes. This portion must occupy a certain volume regardless of overall size. The capacity available for thinking is the total volume minus this baseline overhead.

The same applies to models. The 40B activation also has a fixed overhead. The usable portion is 40B minus this overhead.

If a biological brain’s volume increases by 30%, how much does intelligence increase? Certainly not 30%.

Assume the baseline overhead is 10B. Then 40B has 30B usable, 32B has 22B usable, 23B has 13B usable.

Under this metric, the gap becomes enormous.
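The arithmetic behind this hypothesis can be sketched in a few lines of Python. The 10B overhead is the assumed figure from the paragraph above, not a measured quantity, and the helper names `usable` and `advantage` are hypothetical, introduced only for illustration.

```python
# Activation parameters in billions, as quoted in the text.
ACTIVE_B = {"GLM-5.2": 40.0, "Kimi K2": 32.0, "MiniMax M3": 23.0}

def usable(active_b, overhead_b):
    """Hypothetical usable capacity: activation parameters minus a fixed overhead."""
    return active_b - overhead_b

def advantage(a_b, b_b, overhead_b=0.0):
    """How much larger a is than b, in percent, after subtracting the overhead."""
    return (usable(a_b, overhead_b) / usable(b_b, overhead_b) - 1.0) * 100.0

glm = ACTIVE_B["GLM-5.2"]
for name, active in ACTIVE_B.items():
    if name == "GLM-5.2":
        continue
    raw = advantage(glm, active)        # plain parameter-count gap
    adj = advantage(glm, active, 10.0)  # gap with the assumed 10B overhead
    print(f"GLM-5.2 vs {name}: raw +{raw:.0f}%, usable +{adj:.0f}%")
```

With a 10B overhead, the 25% gap becomes roughly 36% and the 74% gap roughly 131%. The result depends heavily on the assumed overhead, which is the untested part of the hypothesis.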

GLM-5.2 is derived from GLM-5 through RLHF and further post-training, without changing the core architecture.

The industry consensus is that RL can unlock the potential inherent in a model, but it cannot make a model exceed its limits.

Post-training searches for better strategies within existing capacity. The larger the capacity, the larger the optimization space. RL on a model with 32B activation parameters might converge after five rounds, while RL on a 40B model might still be improving after ten.

This perspective likely explains why GLM-5.2 made such remarkable progress.

Biological speculation alone cannot prove it.

But if we observe the simultaneous progress of OpenAI, DeepSeek, and even Anthropic, we find cross-validation for this hypothesis.

2. Heaven and Hell

In early 2023, OpenAI stunned the world with the launch of GPT-4.

GPT-4 had far superior performance and task execution compared to GPT-3.5, along with native multimodal capabilities. It could generate a website from a single sketch and infer high-level relationships between entities in a photo, such as why the image is humorous.

However, GPT-4 was not widely promoted. It was quickly overshadowed by the next generation GPT-4o, GPT-4o-mini, and o1. Everyone who used them could clearly feel that the later models were “not quite the same” as GPT-4.

Why?

According to widely circulated reports, GPT-4 was a model with over a trillion parameters, using a MoE architecture with activation parameters possibly reaching 100B or even 200B.

For comparison, the well-known DeepSeek R1 has only 37B activation parameters. Inference costs scale with activation parameters. This means the earlier GPT-4 had a much higher inference load than later models.

A model of this size was too expensive to sustain the user scale of ChatGPT; it couldn’t be delivered at scale.
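As a rough sketch of why activation size drives serving cost, a common rule of thumb puts the forward pass of a transformer at about 2 FLOPs per active parameter per generated token, ignoring attention over the context. The GPT-4 figures below are the rumored values quoted above, not confirmed numbers, and `forward_flops_per_token` is a hypothetical helper.

```python
def forward_flops_per_token(active_params):
    """Rule-of-thumb forward cost: about 2 FLOPs per active parameter per token."""
    return 2.0 * active_params

R1_ACTIVE = 37e9  # DeepSeek R1 activation parameters

# Rumored GPT-4 activation sizes from the text; treat them as assumptions.
for label, active in [("100B", 100e9), ("200B", 200e9)]:
    ratio = forward_flops_per_token(active) / forward_flops_per_token(R1_ACTIVE)
    print(f"GPT-4 at {label} active: about {ratio:.1f}x the per-token compute of R1")
```

Under this approximation, a 100B to 200B activation size costs roughly 2.7x to 5.4x as much compute per generated token as a 37B one, before any batching or caching effects.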

For OpenAI, the result was a massive shift of resources towards the o1 line of reasoning models.

The logic behind o1: since the economics of large base models are a dead end, add inference-time compute to a smaller model, let it think more, and see if that can compensate.

This attempt looked successful on benchmarks; o1 scored significantly higher on reasoning tasks. But in an industrial sense, OpenAI drove itself into a ditch: the o-series models that followed o1 (o3 and o4-mini) received very poor reception, and the separate reasoning line was eventually abandoned in later models.

OpenAI did try to find breakthroughs along the GPT-4 route.

GPT-4o is actually not the legitimate successor of GPT-4, as the feel and cost were off. The later 4.1 seemed more like a successor in the GPT-4 size tier, similar in cost and speed.

But 4.1 was a disastrous model. Perhaps RL went wrong, or changing the size caused issues. It’s unclear from the outside.

Ultimately, the result is that from 2024 to 2025, OpenAI’s work largely went in circles, oscillating between making models smaller and making thinking longer.

In hindsight, this period is also where Claude began to surpass OpenAI.

Anthropic was not led astray by the o1 line. They seriously worked on large-sized models and downward distillation from large models.

This was when Anthropic overtook OpenAI, though it wasn’t yet a widely known phenomenon. By the end of 2024, the primary model used for coding on Cursor was already Claude. Later, with the Sonnet 4 series and Opus 4 series, along with Claude Code becoming a phenomenon, Anthropic’s lead was widely acknowledged.

This article focuses on Chinese models, so the Claude line won’t be expanded here.

OpenAI’s hesitation and oscillation heavily influenced the entire industry’s judgment on “how to build AI models.”

Since OpenAI itself was moving towards smaller models, the mainstream judgment among Chinese model companies in 2024 was that large models had reached their limit, were uneconomical, and it was better to go smaller.

So throughout 2024, Chinese companies churned out various messy ultra-small models that were essentially unusable. Various 7B, 9B, 13B, and small MoE models with tens of billions of activations were products of that era. Some also followed OpenAI by making smaller sizes and extending thinking, but the results were poor.

Meanwhile, Liang Wenfeng stepped onto the track.

DeepSeek V3 first appeared in December 2024. With 671B total parameters / 37B activation, it was quite large.

In the atmosphere where everyone was reversing direction, V3 was the only one willing to go larger.

However, the industry’s initial reaction to V3 was lukewarm: benchmark numbers were respectable but not stunning; performance seemed average.

DeepSeek didn’t mind. Later, during the Chinese New Year, they released R1.

R1 applied RL to the large V3 base and pushed reasoning performance to a level comparable to o1, and because it was open source, anyone could verify it.

Before R1, no one believed that “making a model larger, combined with appropriate RL and inference,” could yield such a massive performance leap.

But R1 did it.

This was the first time Chinese model companies realized that going big on parameters could actually deliver.

What followed was DeepSeek R1 dominating the Chinese market for an entire year.

There were two interpretations of this phenomenon in the market.

The first was the “thinking” camp: believing the success of o1 and R1 was due to improved reasoning quality. They then continued to prune sizes, distill downwards, and extend reasoning time for the next year.

The second camp acknowledged the role of thinking but, more importantly, realized that “models still need a relatively large base size.” So they also increased their model sizes and sought ways to allocate more inference resources per token.

By early 2026, MiniMax M2.5, Kimi K2.5, and GLM-5 were released in succession. Almost all models with “decent” performance were around the same size as DeepSeek V3. Among them, GLM-5, after two iterations, approached Claude Opus.

In other words, “large” models won.

3. Snowy Mountain Manor

When a detective has ruled out every other suspect, the one who remains must be the culprit.

At this point, let’s discuss a more fundamental counterintuitive question.

Why have no models with 100B activation parameters appeared in these three years? Or rather, why hasn’t anyone disclosed such a model?

GPU density has increased roughly 3x in three years from H100 to B200. Theoretically, model size should have scaled accordingly. But the activation size of leading models has been stuck in the narrow 30-50B range, from the GPT-4 era to now, barely moving.

This is abnormal.

In any other industry, if infrastructure density triples, product specifications scale up. Chip density increases -> phones get thinner. Bandwidth increases -> video goes 4K. But not AI model activation size.

When an observation significantly deviates from common sense, the observation is likely wrong.

In fact, models with ultra-large activation parameter sizes probably already exist, but their core weights have not been publicly disclosed. After all, American models haven’t disclosed core parameters since the GPT-4 era.

Fable 5, hailed as the most powerful large model to date, is very likely a model with an extremely large activation parameter size.

Fable 5 is an Anthropic flagship released in June 2026, priced at $10/$50 per million tokens (input/output). This number doesn’t stand out on its own, but in historical context it says a lot.

The original GPT-4 was $30/$60. On output, $60 vs $50 is nearly on par. On input, $30 vs $10 looks three times cheaper, but this is largely due to three years of engineering progress such as prompt caching, MoE sparsity, and KV cache optimization, not because the model itself is smaller.

Moreover, deployed GPU compute hasn’t actually grown by anything like 3x. The spec sheet says H100 to B200 is a 3-5x improvement, but in actual deployment clusters, H100 remains the absolute workhorse.

B200 only started arriving in large quantities in the second half of 2025. The core training and inference clusters for most leading companies are still based on H100/H800 accumulated from 2023-2024.

So hardware cost reduction is 3x on spec, but significantly less in actual deployment.

Given this discount, the Fable 5 output price of $50 can only be explained if its actual computational burden per generated token is the same order of magnitude as the original GPT-4.
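The pricing argument can be made concrete with a small sketch. It assumes, purely for illustration, that output price is proportional to compute per token times hardware cost per unit of compute, with the same margin for both models. `implied_compute_ratio` is a hypothetical helper, and the cost-drop factors are scenarios rather than measurements.

```python
GPT4_OUTPUT_PRICE = 60.0    # USD per million output tokens
FABLE5_OUTPUT_PRICE = 50.0  # USD per million output tokens

def implied_compute_ratio(price_new, price_old, hardware_cost_drop):
    """Per-token compute of the new model relative to the old one.

    Assumes price = margin * compute_per_token * cost_per_unit_compute, with the
    same margin for both models, and that cost_per_unit_compute fell by a factor
    of hardware_cost_drop between the two releases.
    """
    return (price_new / price_old) * hardware_cost_drop

for drop in (1.0, 2.0, 3.0):
    r = implied_compute_ratio(FABLE5_OUTPUT_PRICE, GPT4_OUTPUT_PRICE, drop)
    print(f"hardware {drop:.0f}x cheaper: Fable 5 at about {r:.2f}x GPT-4 compute per token")
```

Under these assumptions the implied compute per output token stays between roughly 0.8x and 2.5x that of the original GPT-4, which is the same order of magnitude.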

This is the “uneconomical size tier” that GPT-4 occupied, roughly around 100B. Fable 5 essentially reused the training paradigm accumulated over three years to do a GPT-4-sized model again.

Conversely, if GLM-5.2’s performance can rival Opus 4.8, it is reasonable to infer that Opus 4.8’s activation parameters are far smaller than commonly guessed.

When GLM-5.2 entered the pool, it became clear that everyone else had been swimming without their underwear.

And if Anthropic’s success is indeed a case of “training an ultra-large model without telling anyone,” then the hypothesis “larger size = higher performance” is proven once again.

4. Public Goods, Social Responsibility, Reverse Flywheel

During the first half of 2026, GLM Coding Plan was typically sold out.

Users had to set an alarm to buy it. In public discussions, this is usually interpreted as “domestic open-source models are so good that demand exceeds supply,” or more advanced: “Zhipu is using low prices to acquire training data.”

The latter is a seemingly clever explanation, but not actually that informative.

A simpler explanation is that Zhipu probably doesn’t have that much compute to give to external users. However, once you build a model, you need a product form for the public. An open-source flagship that only releases weights without providing a hosted service lacks dissemination effect in the domestic market.

Given limited supply, Zhipu had two choices: raise prices to let demand naturally fall to supply levels, or limit supply. Zhipu chose the latter.

Why? Why forego revenue?

The answer might be much simpler than conspiracy theories: AI models are likely public goods. Providing public goods requires subsidizing users, and this thing probably loses money.

Recently, SemiAnalysis ran a test and found that the subscription plans of OpenAI and Anthropic deliver over 20x their price in usage. That is, if you pay $20 for a ChatGPT Plus subscription, OpenAI might provide you with $400 worth of equivalent AI usage.

In other words, whether Anthropic or OpenAI, providing subscription plans is likely a losing proposition. The more you sell, the more you lose.
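A minimal sketch of the subscription economics follows. It takes the $20 price and the $400 usage figure from the paragraph above and, as a simplifying assumption, treats the $400 as the provider's cost at full utilization (usage valued at list price is not the same as serving cost). `monthly_margin` and the utilization levels are hypothetical.

```python
PRICE = 20.0          # USD per month for the subscription
USAGE_AT_CAP = 400.0  # USD of equivalent usage if the plan is fully used

def monthly_margin(price, usage_at_cap, utilization):
    """Provider margin per subscriber for a given fraction of the cap used."""
    return price - usage_at_cap * utilization

# Fraction of the cap at which the margin is exactly zero.
break_even = PRICE / USAGE_AT_CAP

for u in (0.02, break_even, 0.25, 1.0):
    margin = monthly_margin(PRICE, USAGE_AT_CAP, u)
    print(f"utilization {u:.0%}: margin {margin:+.0f} USD")
```

In this framing the plan breaks even only when a subscriber uses about 5% of the cap, and a subscriber who uses all of it costs the provider $380 a month.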

Some think “subscription plans are meant to provide training data for AI.” This statement is not wrong, but incomplete. Inference compute and training compute are both compute. If you sell subscriptions at a loss, while the compute shortage intensifies, it creates a very vicious reverse flywheel.

Subscriptions cannot recoup costs. The more you sell AI as a public good at a loss, the more the model company loses. The more it loses, the more urgently it needs to go public for funding. A bigger story requires larger user coverage. Larger user coverage means even more losses and even less compute available.

This is a closed loop. With each turn, the model company moves further away from a state of “focusing on research.”

Anthropic offers consumer subscription plans; OpenAI doesn’t dare to kill the free tier of ChatGPT. The root cause isn’t just “data collection.” In the current US context, AI is in some ways considered a public good. This positioning has historical reasons. The media narrative repeatedly emphasizes that “AI should be accessible to all” and “AI should not only serve the rich.” Policy also pushes in this direction.

As representative AI companies in the US, Anthropic and OpenAI cannot go against this context. So even if subscriptions are loss-making, they must maintain broad-coverage subscription products.

From this perspective, it’s easier to understand why GLM-5.2 is so expensive and why the subscription plan requires limited-time purchase.

Because this thing probably isn’t profitable, and the more you sell, the more you lose. It’s better to save the compute for in-house training.

All the above analysis is speculative.

But if the speculation is correct, it can explain a phenomenon that wasn’t clearly explained before: Why hasn’t OpenAI made major breakthroughs in model training, while Zhipu, a state-backed company, managed to do so?

There’s even a hint of dark humor.

5. Compute May Not Be Artificial Scarcity

Global AI compute supply chain capacity is fully stretched. This isn’t just a geopolitical issue; physical production capacity itself is insufficient.

TSMC’s CoWoS advanced packaging capacity is fully booked for all of 2025. Jensen Huang has repeatedly publicly stated that capacity is insufficient.

HBM memory supply is so tight that SK Hynix, Samsung, and Micron have to allocate it every quarter. Every generation of NVIDIA GPU is bottlenecked by HBM.

ASML produces only a little over 50 EUV lithography machines per year. Buyers from TSMC, Samsung, and Intel to SMIC are all scrambling in the queue.

In mainstream discussions about “the US banning ASML from selling to China,” the default subtext is “lithography machines are good, so China must have them, the US just won’t allow it.”

But the real situation is another layer: Even if ASML maxes out all its capacity, it cannot fill the US’s own demand gap.

Intel is rebuilding its US domestic foundry business and needs lithography machines. TSMC’s Arizona fab is expanding and needs machines. Samsung’s Texas fab is ramping up and needs machines.

These projects have already pushed ASML’s order book beyond 2028. Americans themselves don’t have enough.

Many people think compute is unavailable solely due to export controls. The reality might be more straightforward: the production speed of compute far lags the consumption speed. And since the US is the biggest advocate of “AI as a public good,” its own companies don’t have enough to spare, let alone sell to others.

From this perspective, the true motivation for domestic substitution in China is not just hedging against supply interruptions, but ensuring its own ability to scale up normally.

Getting ASML to build production capacity in China is unrealistic. ASML is a Dutch company. Japanese and Korean peers (Nikon, Canon, Samsung) are geographically in Asia but are effectively part of the Western camp in the supply chain.

So for China to scale up, it must build its own.

The country most adept at solving capacity problems in the world is China.

Solar, EVs, power batteries, wind power, shipbuilding, home appliances, smartphones, displays – the global production capacity of these industries has been expanded by China to its current scale.

There is no reason to believe compute will be an exception, once China crosses the minimum threshold in design and manufacturing.

For AI compute, this threshold is much lower than for smartphone chips. AI training chips don’t need extreme nodes like 3nm. 7nm or even mature 14nm can work. Google TPU v4 used 7nm.

AI training is latency-insensitive and throughput-sensitive. Being one or two generations behind in process isn’t fatal; it just means burning more electricity.

Huawei’s Tao (τ) Law is a typical product that cares less about chip power consumption and more about performance.

Coincidentally, China is not short on electricity.

Connecting these facts leads to a counterintuitive conclusion: The deepest layer of AI competition is industrial capacity competition, and in this layer, China has a structural advantage.

All US bans and restrictions are secondary variables amidst capacity shortages.

Even if the US didn’t ban them, China couldn’t buy enough lithography machines.

If the US bans them, China has to make its own.

Either way, China must walk the path of domestic substitution. This isn’t even a choice.

6. Bootstrapping

NVIDIA’s moat is not GPU design itself.

The core architecture of GPUs (SIMT, warp, tensor core) is publicly known in academia and industry. AMD, Intel, Huawei, and Google can create technically equivalent products. Tensor Cores have equivalents in AMD’s Matrix Cores and Huawei’s Cube units.

What’s truly hard to replicate is the CUDA ecosystem: nearly two decades of accumulated libraries, documentation, bug fixes, third-party support, and engineer training. This is a software ecosystem moat, not purely a hardware moat.

But CUDA’s moat has a special property: in the AI era, it is being eroded from within. At its core, CUDA is a software stack of drivers, compilers, and libraries, and problems in that stack are development problems.

The world’s best AI models are now very good at handling development problems. Coincidentally, GLM-5.2 has also just passed the bar of “being good at development.”

3D printers have a unique ability: they can print their own parts. For example, you can print a spool holder, waste bin, drying box, etc.

This ability is called “bootstrapping.”

Today, all the AI models we use are products built using code to construct training systems. Therefore, when a model becomes capable of effectively solving engineering problems, it gains the ability to bootstrap.

GLM-5.2 scores 62.1 on SWE-bench Pro and 81.0 on Terminal-Bench, just one tier below Opus 4.8 (69.2 and 85.0).

The gap exists but is small, making it usable for coding.

In 2026, this is the first time an open-source Chinese model has crossed this line.

New GPU drivers are a mess? No problem, let AI optimize them.

Model training pipeline is all over the place? No problem, let AI optimize it.

Need to brute-force expert model merging? No problem, let AI optimize it.

GLM-5.2 is already very good at writing code; the tasks above are not particularly difficult.

In fact, models at the earlier Opus 4.6 level already possessed engineering bootstrapping capability. But for Chinese developers, relying on American models for bootstrapping poses a huge security risk, and could even lead to being “strangled.”

Anthropic’s latest and strongest model, Fable 5, has a built-in classifier. When it detects that a request might be related to training an AI model, it automatically downgrades to Opus 4.8.

A classifier isn’t high-tech, but if it can downgrade Fable 5 to Opus 4.8, it can also keep downgrading from Opus 4.8, or even contaminate the responses that come back.

When Liu Cixin wrote The Three-Body Problem, he designed the “Sophon” to block Earth’s fundamental science by contaminating experimental data.

A classifier that can automatically downgrade output also has the ability to contaminate output. If US models are used to build training pipelines and infrastructure for Chinese models, the consequences could be disastrous once compromised: at best, reduced speed; at worst, code poisoning.

The entire Chinese AI community’s engineering bootstrapping has been exposed to this uncontrollable external risk.

This is why GLM-5.2 is important. When GLM-5.2 can sit at the table, engineering bootstrapping becomes an open-source, accessible, and verifiable ability for everyone.

This is roughly equivalent to humans forcing the Trisolarans to withdraw the Sophon. Isn’t that awesome?

7. Model Bottleneck is Compute Bottleneck, Compute Bottleneck is Industrial Bottleneck

Throughout the year, all large model companies have been talking about “data flywheels,” “data quality,” and “exclusive data.” These narratives aren’t entirely false, but their explanatory power is severely overestimated.

Scaling from a 744B total / 40B activated MoE model to a 1.5T total / 80B activated model likely doesn’t require a significantly different amount of data. Or rather, setting aside the “knowledge” part and focusing only on “capability,” you don’t need that much data.

If a child is inherently smart, they can learn new things quickly.

The inference compute bottleneck circles back to the global production capacity bottleneck mentioned earlier. How many GPUs are needed? How much electricity? How much data center space? None of these can be solved by a “data flywheel”; they can only be solved by expanding industrial capacity.

OpenAI emphasizes data essentially to direct market attention to its advantage (ChatGPT data flow) while ignoring its disadvantage (ability to scale inference compute).

This is a clever market narrative move, but if the narrative is wrong, what it leaves unsaid is precisely the most important thing. And indeed, OpenAI hasn’t achieved much in over a year.

Increasing activation parameters from 40B to 80B has nothing to do with data. The existing parameters are already close to 800B. The increase in activation from 40B to 80B is solely about whether the cluster can bear the resulting inference cost.
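To put rough numbers on this, the sketch below compares the two MoE configurations mentioned in this section. It uses a common rule of thumb of about 2 FLOPs per active parameter per token and assumes 1 byte per parameter (8-bit weights) for the memory estimate. Both are illustrative assumptions, and `MoEConfig` is a hypothetical structure.

```python
from dataclasses import dataclass

@dataclass
class MoEConfig:
    name: str
    total_params: float   # all expert and shared parameters
    active_params: float  # parameters used per token

    def active_fraction(self):
        """Share of the weights that participate in each token."""
        return self.active_params / self.total_params

    def weight_memory_gb(self, bytes_per_param=1.0):
        """Memory for weights alone, assuming 8-bit storage by default."""
        return self.total_params * bytes_per_param / 1e9

    def flops_per_token(self):
        """Rule-of-thumb forward cost: about 2 FLOPs per active parameter."""
        return 2.0 * self.active_params

current = MoEConfig("744B / 40B", 744e9, 40e9)
larger = MoEConfig("1.5T / 80B", 1.5e12, 80e9)

for cfg in (current, larger):
    print(f"{cfg.name}: {cfg.active_fraction():.1%} active, "
          f"{cfg.weight_memory_gb():.0f} GB of weights")

ratio = larger.flops_per_token() / current.flops_per_token()
print(f"per-token compute: {ratio:.1f}x")
```

The sparsity stays about the same, but per-token compute and weight memory both roughly double, which is the cost the cluster has to absorb.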

And whether the RL flywheel for an 80B model can spin depends on whether the compute cluster itself can provide economic viability for an 80B model. Otherwise, users will curse its cost, and enough RL data simply won’t be collected.

To give larger models a chance to be used, compute clusters need to be larger and cheaper.

And how do we get larger, cheaper compute clusters? That is a production problem. As mentioned earlier, Chinese people are best at solving production problems.

In 2026, one reason for the US market’s enthusiasm for SpaceX is precisely that SpaceX embodies the American imagination of “New American Industry.”

The US hasn’t seen an excellent industrial production company in a long time. SpaceX is the village’s greatest hope. Elon Musk is America’s New Industrial King.

Meanwhile, across the ocean, Tesla’s single-model sales are almost being overtaken by Xiaomi.

To wrap up this article, I’d like to share my personal experience.

I am an AI application developer. My representative work is OpenAlice, a harness launcher that converts trading tasks into coding tasks.

Within OpenAlice, there’s a module called Auto Quant that iterates quantitative strategies using a coding approach. Essentially, it’s a “code task with assessable output.”

I often use this module to test the performance of newly released models.

After GLM 5.2 came out, many people around me exclaimed, “How did Zhipu suddenly become so smart?”

As a frontline practitioner, I never trust unofficial evaluations, especially fixed test set performances. They are far too easy to falsify. I absolutely refuse to believe them.

So I took out Auto Quant myself and let GLM-5.2 run for a while.

The result far exceeded expectations.

GLM-5.2’s performance in Auto Quant was neck and neck with Claude Opus 4.8. If GLM-5.2 weren’t so expensive, I would have let it run for two hours straight.

After the test, I collapsed into my chair, as if I had seen an atomic bomb explode, my heart racing for a long time.

China’s AI models have finally sat at the table. Many things will need to be carefully reconsidered from now on.

That’s all. Thank you for reading.
