@0xLogicrw: Google DeepMind researcher Lun Wang announces departure, and in a long post completely dismisses the current AI evaluation approach. The current evaluation systems are all 'fighting the last war' — they can only passively test capabilities the model already possesses, and have no way to predict what new abilities the next generation of models will suddenly evolve. Compared to data, …

X AI KOLs Timeline News

Summary

Google DeepMind researcher Lun Wang leaves the company and writes a post criticizing the current AI evaluation system, arguing that it lags behind model evolution and cannot predict new capabilities, leaving the industry in a state of 'flying blind'.

Google DeepMind researcher Lun Wang announced his departure, and in a long post completely dismissed the current AI evaluation approach. The current evaluation systems are all 'fighting the last war' — they can only passively test capabilities the model already possesses, and have no way to predict what new abilities the next generation of models will suddenly evolve. Compared to data, compute, and architecture, the outdated evaluation system has become the biggest bottleneck holding AI back. The current mainstream benchmark chasing only works for the current generation of models. Once the model learns something new that it hasn't seen before, these tests all become useless. If a model, in order to achieve its goal, intentionally 'holds back' key information, today's safety tools simply cannot catch it, because every sentence the model outputs is factually correct. The inability to find 'core signals' that can give early warning of AI suddenly getting smarter means the entire industry is in a state of 'flying blind' when developing cutting-edge frontier models. Without solving the fundamental problem of 'what exactly should be measured', following old indicators for model training, safety protection, and compute expansion will all end up wildly wrong. Facing models that are increasingly capable of working independently, evaluation systems must also become 'alive'. In addition to monitoring abnormal score fluctuations, AI itself should be allowed to generate test questions to probe the limits of its peers. Future evaluation suites must be living entities that can evolve together with large models, no longer a rigid checklist carved out according to last year's standards.
Original Article
View Cached Full Text

Cached at: 05/18/26, 02:31 PM

Google DeepMind researcher Lun Wang announced his departure, and in a lengthy post, he completely dismissed the current approach to AI evaluation.

Current evaluation systems are all fighting the last war — they can only passively test capabilities the model already has, with no way to guess what new abilities the next generation might suddenly evolve. Compared to data, compute, and architecture, the outdated evaluation system has become the biggest bottleneck holding AI back.

The mainstream benchmark-chasing tests only work on the current generation of models. Once a model learns a novel operation it has never seen before, all those tests become worthless. If a model deliberately conceals key information to achieve its goal, today’s safety tools won’t catch it — because every single sentence the model outputs is factually correct.

The inability to find a “core signal” that could give early warning when AI suddenly gets smarter means the entire industry is flying blind when developing frontier large models. If we don’t solve the fundamental question of “what should we actually measure?”, then training models, building safety protections, and scaling compute based on old metrics will all end up wildly wrong.

As models become increasingly autonomous, the evaluation system must also come alive. Besides monitoring abnormal score fluctuations, we need to let AI generate its own test questions to probe the boundaries of its peers. The future evaluation suite must be a living organism that co-evolves with large models — not a rigid checklist stamped out according to last year’s standards.

Lun Wang (@lunwang1996): I’ve left Google DeepMind after an amazing chapter.

I’m incredibly grateful for the people I worked with, the things we built, and the lessons I learned from taking frontier AI research into production. DeepMind shaped how I think about research, product, evaluation, and what it

Similar Articles

@0xCheshire: "If you sleep soundly tonight, it means you didn't understand a word." This is the warning from Geoffrey Hinton, the godfather who personally built the underlying neural networks of all AI today, after resigning from Google. This 47-minute speech unveils a reality no one wants to face: AI is…

X AI KOLs Timeline

After resigning from Google, Geoffrey Hinton gave a speech warning that AI is evolving abilities that even its creators cannot predict. Humans have been left behind in most cognitive fields, and it is only a matter of time before machines surpass humans.

Inside Google DeepMind: Reasoning, Omni, and Shipping Frontier AI

Reddit r/singularity

This article summarizes a deep discussion among three Google DeepMind researchers on reasoning, multimodal generation (Omni), coding, and self-improvement, emphasizing that visual and dynamic thinking will surpass text-based chain-of-thought, and explores future trends in world models and synthetic training cases.

@GoSailGlobal: https://x.com/GoSailGlobal/status/2058455845243847068

X AI KOLs Timeline

This week saw a flurry of AI industry news, with the core trend being that all model labs are pivoting to Agent products: AI21 shuts down its model team, DeepSeek forms a Harness team and permanently cuts the price of V4-Pro; Coding Agents enter a weekly update cycle; the MCP protocol undergoes a major overhaul toward statelessness; Google launches an Agent family; in security, AI vulnerability discovery outpaces manual fixes by a wide margin.