@AdamRLucek: What are Online Evals? Most agent evals run "offline": a premade dataset of inputs goes through the agent, and an inter…

X AI KOLs Following 06/16/26, 05:56 PM News

Summary

Explains the concept of online evaluations for AI agents, which measure agent performance on live traffic over time, as opposed to offline evaluations that use fixed datasets.

Cached at: 06/17/26, 03:58 PM

What are Online Evals?

Most agent evals run “offline”: a premade dataset of inputs goes through the agent, and an intermediate step or final output gets scored. They answer “is this version better than the last?”

Online evals answer a different question: “is the agent still working?” Instead of a fixed dataset, they measure a dimension of the agent as it runs on live traffic, tracked over time. Let’s discuss how they work and how to set them up.

First, how are the two structurally different?

Offline, we compare an agent’s behavior against a ground truth defined while building it. The eval is a dataset of inputs and expected outputs, curated to capture the behavior we want, which lets us compare performance across versions.

Online, the focus shifts from comparison to monitoring. We swap that curated dataset for live production data and score the outputs as they’re generated. Two things change: we don’t control the inputs (they come from real users), and we have no ground truth to measure the outputs against.
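To make the structural difference concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: `toy_agent` stands in for a real agent, and `run_offline_eval`, `run_online_eval`, and `non_empty` are hypothetical helpers, not part of any particular eval framework. The point is only that the offline path needs an expected output per input, while the online path scores the output on its own.

```python
from typing import Callable, Iterable

# Hypothetical stand-in for the agent under evaluation.
def toy_agent(user_input: str) -> str:
    return user_input.strip().upper()

def run_offline_eval(agent: Callable[[str], str],
                     dataset: list[tuple[str, str]]) -> float:
    """Offline: curated (input, expected_output) pairs; returns accuracy."""
    correct = sum(1 for inp, expected in dataset if agent(inp) == expected)
    return correct / len(dataset)

def run_online_eval(agent: Callable[[str], str],
                    live_inputs: Iterable[str],
                    scorer: Callable[[str], float]) -> list[float]:
    """Online: inputs come from users and there is no expected output,
    so the scorer can only look at what the agent produced."""
    return [scorer(agent(inp)) for inp in live_inputs]

# Offline: ground truth was written down while building the agent.
offline_dataset = [("hello", "HELLO"), (" hi ", "HI"), ("ok", "Ok")]
offline_accuracy = run_offline_eval(toy_agent, offline_dataset)

# Online: a reference-free check (here: "did the agent say anything?").
def non_empty(output: str) -> float:
    return 1.0 if output else 0.0

online_scores = run_online_eval(toy_agent, ["what's up", "   ", "help"], non_empty)
```

In the offline run the third case fails because the expected casing differs, which is the kind of version-to-version comparison a fixed dataset supports. The online run has nothing to compare against, so it can only flag the empty reply.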

These constraints shape how online evals are designed. They tend to fall into two categories:

Heuristic evals: written as code that runs directly on the trace, measuring deterministic signals like step count, response length, or content matches.
LLM-as-a-judge evals: a model scores the output against a natural-language rubric (i.e. a prompt), covering subjective, probabilistic metrics like quality, helpfulness, hallucination, or other application-specific judgements.
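The two categories above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not a specific platform's API: the `Trace` shape, the rubric wording, and every function name are hypothetical, and `fake_llm` is a stub standing in for a real model call so the example runs without network access.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Trace:
    """Hypothetical trace shape: user input, intermediate steps, final output."""
    user_input: str
    steps: list[str] = field(default_factory=list)
    output: str = ""

# Heuristic evals: plain code over the trace, deterministic by construction.
def step_count(trace: Trace) -> int:
    return len(trace.steps)

def response_length(trace: Trace) -> int:
    return len(trace.output)

def contains_refusal(trace: Trace) -> bool:
    # Content match against an assumed refusal phrase.
    return "i can't help" in trace.output.lower()

# LLM-as-a-judge eval: the rubric is just a prompt (wording is an assumption).
RUBRIC = (
    "You are grading an assistant's reply.\n"
    "Score 1 if the reply directly addresses the user's request, else 0.\n"
    "Answer with a single digit.\n\n"
    "User: {user_input}\nAssistant: {output}\nScore:"
)

def judge_helpfulness(trace: Trace, call_llm: Callable[[str], str]) -> int:
    """Fill the rubric with the trace and parse the judge's one-digit reply."""
    prompt = RUBRIC.format(user_input=trace.user_input, output=trace.output)
    reply = call_llm(prompt).strip()
    return 1 if reply.startswith("1") else 0

def fake_llm(prompt: str) -> str:
    # Stub judge for this sketch only; a real setup would call a model here.
    return "0" if "i can't help" in prompt.lower() else "1"

trace = Trace(
    user_input="Reset my password",
    steps=["lookup_user", "send_reset_email"],
    output="A reset link is on its way.",
)
scores = {
    "step_count": step_count(trace),
    "response_length": response_length(trace),
    "contains_refusal": contains_refusal(trace),
    "helpful": judge_helpfulness(trace, fake_llm),
}
```

Passing the model call in as `call_llm` keeps the judge testable: the heuristics always return the same value for the same trace, while the judge's score depends on whichever model is plugged in.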

Scoring live traffic this way lets us monitor the agent’s performance in production (akin to observability), see trends over time, and get alerted when a metric drops below a threshold. Combining heuristic functions and LLM-as-a-judge implementations lets us capture both hard and fuzzy metrics from live usage that may not surface in a controlled, offline experiment.
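One simple way to turn per-trace scores into a trend with an alert is a rolling average compared against a threshold. The sketch below assumes that approach; the `MetricMonitor` class, the window size, and the threshold value are all hypothetical choices for illustration, and a production setup would more likely rely on its observability tooling.

```python
from collections import deque

class MetricMonitor:
    """Hypothetical rolling-window monitor for a single online eval metric."""

    def __init__(self, window: int, threshold: float):
        self.scores: deque[float] = deque(maxlen=window)
        self.threshold = threshold

    def rolling_mean(self) -> float:
        # Mean of the scores currently in the window (0.0 if empty).
        return sum(self.scores) / len(self.scores) if self.scores else 0.0

    def record(self, score: float) -> bool:
        """Add a score; return True if an alert should fire.

        No alert fires until the window is full, to avoid reacting
        to the first few traces.
        """
        self.scores.append(score)
        if len(self.scores) < self.scores.maxlen:
            return False
        return self.rolling_mean() < self.threshold

monitor = MetricMonitor(window=4, threshold=0.7)
stream = [1, 1, 1, 1, 0, 1, 0, 0]  # e.g. judge scores arriving over time
alerts = [monitor.record(s) for s in stream]
```

With this stream, the rolling mean stays at 0.75 or above through the sixth score, then falls to 0.5 and 0.25, so only the last two calls return an alert. A larger window smooths out noise at the cost of reacting more slowly.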

Online and offline evals aren’t competitors. Rather, they tend to feed into each other. Online monitoring surfaces problematic traces, which undergo annotation and error analysis, and are ultimately converted into offline evaluations to cement behavior and capture regressions as the agent evolves. Together, they close the full evaluation loop and provide a holistic view of your agent’s performance.
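The hand-off from online monitoring to offline evals can be sketched as a small pipeline: flag low-scoring traces, attach a human-annotated correction, and emit offline test cases. All names here (`ScoredTrace`, `OfflineCase`, `flag_traces`, `to_offline_cases`) and the example data are hypothetical, and the annotation step is reduced to a plain dictionary lookup where a real workflow would involve review and error analysis.

```python
from dataclasses import dataclass

@dataclass
class ScoredTrace:
    """A production trace plus the score an online eval gave it."""
    user_input: str
    output: str
    score: float

@dataclass
class OfflineCase:
    """One row of an offline eval dataset: input plus expected output."""
    user_input: str
    expected_output: str

def flag_traces(traces: list[ScoredTrace], threshold: float) -> list[ScoredTrace]:
    """Surface traces whose online score fell below the threshold."""
    return [t for t in traces if t.score < threshold]

def to_offline_cases(flagged: list[ScoredTrace],
                     annotations: dict[str, str]) -> list[OfflineCase]:
    """Convert annotated traces into offline cases.

    `annotations` maps a user input to the corrected output a reviewer
    wrote; flagged traces without an annotation are skipped.
    """
    return [
        OfflineCase(t.user_input, annotations[t.user_input])
        for t in flagged
        if t.user_input in annotations
    ]

traces = [
    ScoredTrace("Cancel my order", "Your order is cancelled.", 1.0),
    ScoredTrace("Where is my refund?", "I like turtles.", 0.0),
    ScoredTrace("Change my email", "", 0.0),
]
flagged = flag_traces(traces, threshold=0.5)

# Only one flagged trace has been annotated so far.
annotations = {"Where is my refund?": "Your refund was issued on the original payment method."}
offline_cases = to_offline_cases(flagged, annotations)
```

Each resulting `OfflineCase` supplies the ground truth that the live trace lacked, so the failure can be replayed against future versions of the agent as a regression check.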
