@0xLogicrw: Google DeepMind researcher Lun Wang announces departure, and in a long post completely dismisses the current AI evaluation approach. The current evaluation systems are all 'fighting the last war' — they can only passively test capabilities the model already possesses, and have no way to predict what new abilities the next generation of models will suddenly evolve. Compared to data, …
Summary
Google DeepMind researcher Lun Wang leaves the company and writes a post criticizing the current AI evaluation system, arguing that it lags behind model evolution and cannot predict new capabilities, leaving the industry in a state of 'flying blind'.
Cached at: 05/18/26, 02:31 PM
Google DeepMind researcher Lun Wang announced his departure and, in a lengthy post, dismissed the current approach to AI evaluation outright.
Current evaluation systems are all fighting the last war — they can only passively test capabilities the model already has, with no way to guess what new abilities the next generation might suddenly evolve. Compared to data, compute, and architecture, the outdated evaluation system has become the biggest bottleneck holding AI back.
Mainstream benchmark-driven tests only work on the current generation of models. Once a model learns a novel behavior that the tests were never designed to cover, those tests become worthless. If a model deliberately conceals key information to achieve its goal, today’s safety tools won’t catch it, because every single sentence the model outputs is factually correct.
Without a “core signal” that gives early warning when AI suddenly gets smarter, the entire industry is flying blind when developing frontier large models. If we don’t solve the fundamental question of “what should we actually measure?”, then training models, building safety protections, and scaling compute based on old metrics will all go badly wrong.
As models become increasingly autonomous, the evaluation system must also come alive. Besides monitoring abnormal score fluctuations, we need to let AI generate its own test questions to probe the boundaries of its peers. The future evaluation suite must be a living organism that co-evolves with large models — not a rigid checklist stamped out according to last year’s standards.
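As a rough illustration of the "monitoring abnormal score fluctuations" idea, here is a minimal Python sketch. It is not taken from Lun Wang's post. The function name `flag_score_jumps`, the z-score rule, and the parameters `z_threshold`, `min_history`, and `sigma_floor` are all assumptions chosen for illustration. It flags a checkpoint when its change in benchmark score is far outside the changes seen before it.

```python
from statistics import mean, stdev


def flag_score_jumps(scores, z_threshold=3.0, min_history=3, sigma_floor=1e-3):
    """Return indices of checkpoints whose score change is an outlier.

    Hypothetical helper: each checkpoint-to-checkpoint change is compared
    with the mean and standard deviation of all earlier changes.
    """
    # Change in score between consecutive checkpoints.
    deltas = [later - earlier for earlier, later in zip(scores, scores[1:])]
    flagged = []
    for i, delta in enumerate(deltas):
        history = deltas[:i]
        if len(history) < min_history:
            continue  # not enough earlier changes to estimate a baseline
        mu = mean(history)
        # Floor the spread so a near-constant history does not make
        # tiny floating-point differences look like huge outliers.
        sigma = max(stdev(history), sigma_floor)
        if abs(delta - mu) / sigma > z_threshold:
            flagged.append(i + 1)  # index of the checkpoint after the jump
    return flagged


# Made-up scores for seven successive checkpoints on one benchmark.
scores = [0.10, 0.11, 0.12, 0.12, 0.13, 0.45, 0.46]
print(flag_score_jumps(scores))
```

With these made-up scores, only the checkpoint at index 5 (the jump from 0.13 to 0.45) is flagged. A monitor like this only reacts after a jump shows up on an existing benchmark, so it covers the passive half of the proposal and not the part where models generate new test questions for their peers.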
Lun Wang (@lunwang1996): I’ve left Google DeepMind after an amazing chapter.
I’m incredibly grateful for the people I worked with, the things we built, and the lessons I learned from taking frontier AI research into production. DeepMind shaped how I think about research, product, evaluation, and what it
Similar Articles
@0xCheshire: "If you sleep soundly tonight, it means you didn't understand a word." This is the warning from Geoffrey Hinton, the godfather who personally built the underlying neural networks of all AI today, after resigning from Google. This 47-minute speech unveils a reality no one wants to face: AI is…
After resigning from Google, Geoffrey Hinton gave a speech warning that AI is evolving abilities that even its creators cannot predict. Humans have been left behind in most cognitive fields, and it is only a matter of time before machines surpass humans.
@FuSheng_0306: Sharp Review of Silicon Valley Giants: None Can Compete
The author sharply reviews the performance of Silicon Valley tech giants in AI, asserting that currently none can truly lead, and analyzes the competitive landscape among companies like Anthropic, OpenAI, and Google.
Inside Google DeepMind: Reasoning, Omni, and Shipping Frontier AI
This article summarizes a deep discussion among three Google DeepMind researchers on reasoning, multimodal generation (Omni), coding, and self-improvement, emphasizing that visual and dynamic thinking will surpass text-based chain-of-thought, and explores future trends in world models and synthetic training cases.
@GoSailGlobal: https://x.com/GoSailGlobal/status/2058455845243847068
This week saw a flurry of AI industry news, with the core trend being that all model labs are pivoting to Agent products: AI21 shuts down its model team, DeepSeek forms a Harness team and permanently cuts the price of V4-Pro; Coding Agents enter a weekly update cycle; the MCP protocol undergoes a major overhaul toward statelessness; Google launches an Agent family; in security, AI vulnerability discovery outpaces manual fixes by a wide margin.
@mubeitech: The Transformer is not the endgame of AI, says NVIDIA VP of AI Research Sanja Fidler.
Sanja Fidler, VP of AI Research at NVIDIA and head of the company’s spatial-intelligence lab, says the Transformer’s Achilles heel is clear: training costs are sky-high and the hunger for data is bottomless. A new architectural breakthrough is overdue, and next-gen variants are already emerging.