@LangChain: "Validate your validators." The eval advice nobody is following. Watch @sh_reya + @HamelHusain’s Interrupt keynote on t…

Summary

The article summarizes common mistakes in AI evaluation, emphasizing the need to validate validators, design specific metrics, and enforce rigorous experimental design. It calls for a return to data science thinking to improve the reliability of AI system evaluation.

"Validate your validators." The eval advice nobody is following. ⏯️ Watch @sh_reya + @HamelHusain’s Interrupt keynote on the return of the data scientist: https://t.co/olOqyGs7a5 https://t.co/IpghaMmPBs

Cached at: 06/12/26, 03:02 PM

TL;DR: In AI evaluation, don’t rely on generic metrics or unverified LLM judges. Act like a data scientist: repeatedly look at your data, design specific failure modes, validate your validators, and use rigorous experimental design.

Background: Regression from Data Science to AI Engineering

The speaker notes that four years ago, in machine learning engineering and data science, teams would carefully inspect data, visualize, ensure models align with human annotations, and thoughtfully design metrics to match business goals. Today, AI engineering has regressed in many ways—people judge correctness by gut feel, often ask the model itself how it’s doing, rarely think about metric design, and just plug in off-the-shelf metric packages. This leads to frequent failures in evals and retrieval. This talk focuses on evaluation, reveals common mistakes, and teaches how to put on a “data scientist” hat to overcome them.

Mistake 1: Using Generic or Off-the-Shelf Metrics

Problem: Many use generic metrics like “helpfulness,” “hallucination,” or “coherence” to measure agent accuracy. These words sound important but are very vague: can you precisely define what “hallucination” means? Different domains (medical, legal) have completely different definitions for the same concept, making it meaningless to use the same off-the-shelf evaluator.

How to fix: Explore data like a data scientist. Use AI-assisted tools like Codex, Claude Code, or Cursor to build custom interfaces, load agent traces, read each message and trace, and discuss where things might go wrong. Pretend you’re the user, talk to product managers, and jot down observed errors as open notes. After a while, categorize these notes into specific failure modes. For example, in a real estate agent tour app, a specific failure mode might be “rescheduling a tour when the user didn’t ask” or “fabricating times for tours.” The core idea: look at your data and scale this process.
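
The step from open notes to failure modes can be sketched as a simple tally. This is a minimal illustration, not the speakers' tooling: the trace IDs, the note format, and the function name `failure_mode_counts` are all hypothetical, and the two failure-mode labels are the real estate examples mentioned above.

```python
from collections import Counter

# Hypothetical review output: one (trace_id, failure_mode) pair per coded note.
# The labels are assigned by a human after reading the trace.
coded_notes = [
    ("trace-001", "rescheduled tour without being asked"),
    ("trace-002", "fabricated tour time"),
    ("trace-003", "rescheduled tour without being asked"),
    ("trace-003", "rescheduled tour without being asked"),  # duplicate note, same trace
    ("trace-004", "fabricated tour time"),
    ("trace-005", "rescheduled tour without being asked"),
]


def failure_mode_counts(notes):
    """Count distinct traces per failure mode, most frequent first."""
    distinct = set(notes)  # a trace counts once per failure mode
    return Counter(mode for _, mode in distinct).most_common()


for mode, count in failure_mode_counts(coded_notes):
    print(f"{count:>3}  {mode}")
```

Sorting by frequency is one reasonable way to decide which failure mode deserves a dedicated evaluator first; the ranking is only as good as the notes behind it.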

Mistake 2: Using LLM Judges Blindly Without Validation

Problem: After finding a failure mode, many people directly ask the LLM “how often does this failure mode occur in my data?” without building any trust in the LLM judge itself. They have the LLM score on a 1–5 scale, look at a histogram, and make business decisions based on it, even though nothing shows the scores can be trusted.

How to fix: Treat the LLM judge as an imperfect classifier. Collect labeled trace examples for each failure mode and split them into training, development, and test sets. Choose the prompt or model that performs well on the train/dev sets, then check it on the held-out test set to confirm you haven’t overfit. This process is just as necessary in LLM applications as in classic ML. Also, since failures are often the minority class, don’t rely on accuracy alone; use metrics suited to imbalanced classification, such as precision and recall, and inspect the false positives and false negatives.
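
A minimal sketch of this validation loop follows, using only the standard library. The split fractions, the function names, and the toy labels are assumptions made for illustration; the talk summary does not prescribe them. `True` means "the failure mode is present" and is treated as the positive class.

```python
import random


def split_labeled_traces(traces, seed=0, train_frac=0.2, dev_frac=0.4):
    """Shuffle labeled traces and split them into train, dev, and test lists.

    The fractions are illustrative assumptions; the remainder becomes the test set.
    """
    shuffled = list(traces)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_dev = int(len(shuffled) * dev_frac)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_dev],
        shuffled[n_train + n_dev:],
    )


def judge_metrics(human_labels, judge_labels):
    """Compare judge verdicts with human labels for one failure mode."""
    if len(human_labels) != len(judge_labels) or not human_labels:
        raise ValueError("label lists must be non-empty and the same length")
    pairs = list(zip(human_labels, judge_labels))
    tp = sum(1 for h, j in pairs if h and j)
    fp = sum(1 for h, j in pairs if not h and j)
    fn = sum(1 for h, j in pairs if h and not j)
    tn = sum(1 for h, j in pairs if not h and not j)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "false_positives": fp,
        "false_negatives": fn,
    }


# Toy data: 2 real failures in 10 traces (failures are the minority class).
human = [True, True] + [False] * 8
lazy_judge = [False] * 10                        # never flags anything
real_judge = [True, False, True] + [False] * 7   # one hit, one miss, one false alarm

print(judge_metrics(human, lazy_judge))
print(judge_metrics(human, real_judge))
```

In this toy data both judges reach the same accuracy, yet the first one never detects a failure. That is the reason to report precision and recall rather than accuracy alone, and to compute the final numbers only on the test split.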

Mistake 3: Poor Experimental Design

The speaker mentions two manifestations.

Low-quality synthetic data generation

When you ask an LLM to “give me some data” or “five generic questions,” it tends to produce homogeneous traces. Solution: generate synthetic data systematically and let the LLM handle only small, well-scoped parts of the process. Identify which dimensions of user input vary (e.g., user role: novice vs. experienced), generate combinations of these dimensions, review all synthetic data for quality, and ensure diversity. A practical exercise: go into your app, look at traces, find at least three dimensions that vary between different users, use an LLM to generate different values for each dimension, then take the Cartesian product of all dimensions to generate synthetic data.
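
The Cartesian-product step can be sketched with `itertools.product`. Only the `user_role` dimension comes from the text above; the other two dimensions, their values, and the function name `build_scenarios` are hypothetical placeholders for whatever you find in your own traces.

```python
from itertools import product

# Three dimensions of variation. In practice the values might be proposed by an
# LLM and then reviewed by a person; here they are hard-coded for illustration.
dimensions = {
    "user_role": ["novice", "experienced"],
    "intent": ["book a tour", "reschedule a tour", "cancel a tour"],
    "tone": ["terse", "chatty"],
}


def build_scenarios(dims):
    """Return one dict per combination of dimension values."""
    names = list(dims)
    return [dict(zip(names, combo)) for combo in product(*(dims[n] for n in names))]


scenarios = build_scenarios(dimensions)
print(len(scenarios))   # 2 * 3 * 2 = 12 combinations
print(scenarios[0])
```

Each scenario can then be turned into a user message, by hand or with a narrowly scoped LLM prompt, and reviewed for quality before it enters the eval set. Enumerating the combinations up front is what guarantees coverage of the dimensions instead of leaving diversity to the model.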

Traps when designing metrics

People often use 1–5 or 1–100 rating scales, which are not very interpretable or actionable. It’s recommended to simplify to binary classifications (success/failure, pass/fail). Keep the judge’s scope very narrow, focused on a single binary task, then label a lot of data yourself and measure how often the judge agrees with your labels on that task. It’s hard to write a good LLM judge prompt on the first try, but you can iterate using the methods above.
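
One way to keep a judge narrow and binary is sketched below. The prompt wording, the PASS/FAIL protocol, and all function names are assumptions for illustration; the call to an actual model is deliberately left out, so the parser is exercised on example strings only.

```python
def build_judge_prompt(failure_mode: str, trace_text: str) -> str:
    """Build a prompt that asks about exactly one failure mode."""
    return (
        "You are checking one agent trace for one specific failure mode.\n"
        f"Failure mode: {failure_mode}\n"
        "Answer with exactly one word: FAIL if the failure mode occurs, "
        "PASS otherwise.\n\n"
        f"Trace:\n{trace_text}"
    )


def parse_verdict(raw: str) -> bool:
    """Return True if the judge reported the failure mode (FAIL)."""
    verdict = raw.strip().upper()
    if verdict not in {"PASS", "FAIL"}:
        # Reject scores, hedges, or explanations instead of guessing.
        raise ValueError(f"unexpected judge output: {raw!r}")
    return verdict == "FAIL"


def agreement_rate(human_labels, judge_labels):
    """Fraction of traces where the judge matches the human label."""
    if len(human_labels) != len(judge_labels) or not human_labels:
        raise ValueError("label lists must be non-empty and the same length")
    matches = sum(1 for h, j in zip(human_labels, judge_labels) if h == j)
    return matches / len(human_labels)


prompt = build_judge_prompt("fabricating times for tours", "user: ...\nagent: ...")
judge_labels = [parse_verdict(v) for v in ["FAIL", "pass", " FAIL\n", "PASS"]]
human_labels = [True, False, False, False]
print(agreement_rate(human_labels, judge_labels))  # 3 of 4 labels match
```

Rejecting anything other than PASS or FAIL keeps scale-style answers such as "3/5" from silently entering the results. Raw agreement is only a first check; when failures are rare, it should be read alongside precision and recall.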

Mistake 4: Lacking Domain Expertise or Ignoring Criteria Drift When Labeling Data

Problem: Many teams have AI engineers or developers label data. Unless you’re building a coding app, this is usually a bad idea—these people lack domain expertise. Also, don’t trust any label blindly; verify the annotator’s expertise and inspect labels yourself.

Criteria drift: This concept comes from the paper “Who Validates the Validators?” (Shreya is a co-author). People don’t know what they want until they see some data. Specifying a scoring rubric in advance is not enough; you need to repeatedly look at data and force people to look at data. It’s a process of “pushing data in front of them.”

Mistake 5: Over-automation

Some think: “Can Claude just do all this evaluation work for me?” The answer is no. Claude doesn’t know all the product nuances that could go wrong, and it can’t read your mind. LLMs can certainly find obvious errors or clearly broken things, but many critical, product-level failures only surface once you externalize the context you carry in your head.

Other Common Traps (List)

  • Misusing similarity scores (ROUGE, BLEU, etc.) — these often appear in off-the-shelf eval frameworks, but measuring similarity isn’t always meaningful.
  • Asking the judge “Is this helpful?” — the prompt is too generic and not specific to your product.
  • Having annotators read raw JSON data — remove friction for viewing data; build your own data labeling interface to make reading data enjoyable.
  • Reporting uncalibrated scores — you must study how well the LLM judge aligns with humans, otherwise you’re just guessing.
  • Ignoring criteria drift — expect your criteria to change as you look at more data.
  • Overfitting the judge to the data — don’t repeatedly hill-climb optimize on the same dataset; hold out data to ensure generalization.
  • Inefficient sampling — ensure effective data sampling.
  • Empty metrics on dashboards — metrics should be worth the space and carry signal.
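
To illustrate the first trap in the list, the sketch below computes a simplified unigram-overlap F1 score. It is a hand-rolled stand-in written for this example, not the official ROUGE or BLEU implementation, and the two sentences are hypothetical.

```python
from collections import Counter


def unigram_overlap_f1(reference: str, candidate: str) -> float:
    """F1 over shared lowercase word counts (a rough ROUGE-1-style score)."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    overlap = sum((ref_counts & cand_counts).values())  # min count per shared word
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


reference = "The tour is confirmed for Tuesday at 3 pm"
candidate = "The tour is confirmed for Thursday at 3 pm"  # wrong day

score = unigram_overlap_f1(reference, candidate)
print(round(score, 3))  # 8 of 9 words match, so the score is about 0.889
```

The candidate gets the day wrong, which is exactly the kind of failure a user would care about, yet it shares eight of nine words with the reference and scores highly. A narrow binary check for the specific failure mode would catch what the similarity score hides.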

Returning to a Data Science Mindset

Many issues discussed today have corresponding skills in data science that can mitigate them:

  • Error analysis / data analysis — look at data, find patterns in traces, similar to exploratory data analysis.
  • Metric design — design specific, actionable metrics for real problems.
  • Validation — validate that LLM judges align with human judgment, just like validating models in ML.
  • Data management — especially proper management of test data.
  • Monitoring and observability — keep continuous observation.
  • Holistic approach — treat evaluation as a system, not isolated steps.

Source: https://www.youtube.com/watch?v=QDQT99csHJQ&feature=youtu.be
