@AdamRLucek: What are Online Evals? Most agent evals run "offline": a premade dataset of inputs goes through the agent, and an inter…

Summary

Explains the concept of online evaluations for AI agents, which measure agent performance on live traffic over time, as opposed to offline evaluations that use fixed datasets.


Cached at: 06/17/26, 03:58 PM

What are Online Evals?

Most agent evals run “offline”: a premade dataset of inputs goes through the agent, and an intermediate step or final output gets scored. They answer “is this version better than the last?”

Online evals answer a different question: “is the agent still working?” Instead of a fixed dataset, they measure a dimension of the agent as it runs on live traffic, tracked over time. Let’s discuss how they work and how to set them up.

First, how are the two structurally different?

Offline, we compare an agent’s behavior against a ground truth defined while building it. The eval is a dataset of inputs and expected outputs, curated to capture the behavior we want, which lets us compare performance across versions.
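As a rough illustration of the offline setup, here is a minimal sketch in Python. The dataset, the exact-match scoring rule, and the two stand-in agents (`agent_v1`, `agent_v2`) are all assumptions made for the example; a real eval would call an actual agent and likely use a richer scorer.

```python
from typing import Callable

# Hypothetical curated dataset: inputs paired with expected outputs (ground truth).
DATASET = [
    {"input": "2 + 2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
]

def run_offline_eval(agent: Callable[[str], str], dataset: list) -> float:
    """Return the fraction of examples whose output exactly matches the expected answer."""
    passed = sum(1 for ex in dataset if agent(ex["input"]).strip() == ex["expected"])
    return passed / len(dataset)

# Two placeholder agent versions standing in for real agents.
def agent_v1(prompt: str) -> str:
    return {"2 + 2": "4"}.get(prompt, "unknown")

def agent_v2(prompt: str) -> str:
    return {"2 + 2": "4", "capital of France": "Paris"}.get(prompt, "unknown")

# Same dataset, different versions: this is the "is this version better?" comparison.
score_v1 = run_offline_eval(agent_v1, DATASET)
score_v2 = run_offline_eval(agent_v2, DATASET)
```

Because the dataset is fixed, the two scores are directly comparable across versions, which is exactly what gets lost once the inputs come from live traffic.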

Online, the focus shifts from comparison to monitoring. We swap that curated dataset for live production data and score the outputs as they’re generated. Two things change: we don’t control the inputs (they come from real users), and we have no ground truth to measure the outputs against.

These constraints shape how online evals are designed. They tend to fall into two categories:

  1. Heuristic evals: written as code that runs directly on the trace, measuring deterministic signals like step count, response length, or content matches.

  2. LLM-as-a-judge evals: for subjective, probabilistic metrics like quality, helpfulness, hallucination, or other application-specific judgements, we score the output against a natural language rubric, i.e. a prompt.
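To make the two categories concrete, here is a hedged sketch of one evaluator of each kind. The `Trace` shape, the thresholds, the refusal phrase, and the rubric wording are all hypothetical; tracing tools define their own schemas. The judge takes a `call_llm` function (any prompt-to-text callable) as a parameter rather than assuming a specific model provider.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

@dataclass
class Trace:
    """Hypothetical trace record; real tracing tools define their own schema."""
    user_input: str
    final_output: str
    steps: List[str] = field(default_factory=list)

def heuristic_eval(trace: Trace, max_steps: int = 10, max_chars: int = 2000) -> dict:
    """Deterministic checks computed directly from the trace."""
    return {
        "step_count_ok": len(trace.steps) <= max_steps,
        "length_ok": len(trace.final_output) <= max_chars,
        "no_refusal": "i cannot help" not in trace.final_output.lower(),
    }

# The rubric is just a prompt; this wording is an illustrative assumption.
JUDGE_RUBRIC = (
    "You are grading an assistant response for helpfulness.\n"
    "Reply with a single integer from 1 (unhelpful) to 5 (very helpful).\n\n"
    "User input:\n{user_input}\n\nAssistant response:\n{final_output}\n"
)

def judge_eval(trace: Trace, call_llm: Callable[[str], str]) -> Optional[int]:
    """Score the output against the rubric; returns None if the reply is unusable."""
    prompt = JUDGE_RUBRIC.format(
        user_input=trace.user_input, final_output=trace.final_output
    )
    reply = call_llm(prompt).strip()
    # Judges can return malformed text, so parse defensively.
    if reply.isdigit() and 1 <= int(reply) <= 5:
        return int(reply)
    return None
```

Note that neither function needs an expected output: the heuristic reads properties of the trace itself, and the judge relies only on the rubric.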

Importantly, scoring live traces this way lets us monitor the agent’s live performance (akin to observability), see trends over time, and get alerted when a metric drops below a threshold. Combining heuristic functions and LLM-as-a-judge implementations lets us capture both hard and fuzzy metrics from live usage that may not surface in a controlled, offline experiment.
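One simple way to turn per-trace scores into a trend with alerting is a rolling mean over the most recent traces. The sketch below is an assumption about how this could be done, not a description of any particular monitoring product; the `RollingMetric` class, window size, and threshold are made up for illustration.

```python
from collections import deque

class RollingMetric:
    """Tracks a score over the last `window` traces and flags drops below a threshold."""

    def __init__(self, window: int, threshold: float):
        self.scores = deque(maxlen=window)  # old scores fall off automatically
        self.threshold = threshold

    def mean(self) -> float:
        return sum(self.scores) / len(self.scores)

    def record(self, score: float) -> bool:
        """Add a score; return True if the rolling mean is below the threshold."""
        self.scores.append(score)
        # Wait for a full window so a single early bad trace does not trigger an alert.
        window_full = len(self.scores) == self.scores.maxlen
        return window_full and self.mean() < self.threshold

# Usage: feed in scores as traces arrive (1.0 = pass, 0.0 = fail).
monitor = RollingMetric(window=3, threshold=0.7)
alerts = [monitor.record(s) for s in [1.0, 1.0, 1.0, 0.0, 0.0]]
```

In practice the window would more likely be time-based (per hour or per day) and the alert would go to a paging or chat tool, but the shape of the logic is the same.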

Online and offline evals aren’t competitors. Rather, they tend to feed into each other. Online monitoring surfaces problematic traces, which undergo annotation and error analysis, and are ultimately converted into offline evaluations to cement behavior and capture regressions as the agent evolves. Together, they close the full evaluation loop and provide a holistic view of your agent’s performance.
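The hand-off from online to offline can be sketched as a small conversion step: flagged traces that a reviewer has annotated become input/expected pairs in a regression dataset. All field names here (`id`, `user_input`, `corrected_output`, `failure_modes`) are hypothetical and would depend on the tracing and annotation tools in use.

```python
def to_offline_example(trace: dict, annotation: dict) -> dict:
    """Turn one annotated production trace into an offline eval example."""
    return {
        "input": trace["user_input"],
        # The reviewer's correction supplies the ground truth that was missing online.
        "expected": annotation["corrected_output"],
        "tags": annotation.get("failure_modes", []),
        "source_trace_id": trace["id"],
    }

def build_regression_set(flagged: list, annotations: dict) -> list:
    """Keep only flagged traces that a reviewer has actually annotated."""
    return [
        to_offline_example(t, annotations[t["id"]])
        for t in flagged
        if t["id"] in annotations
    ]

# Usage with made-up data: two traces were flagged, one has been annotated so far.
flagged_traces = [
    {"id": "t1", "user_input": "Cancel my order", "final_output": "Order shipped!"},
    {"id": "t2", "user_input": "Refund status?", "final_output": "I don't know."},
]
reviewer_annotations = {
    "t1": {"corrected_output": "Your order has been cancelled.", "failure_modes": ["wrong_action"]},
}
regression_set = build_regression_set(flagged_traces, reviewer_annotations)
```

The resulting examples have the same input/expected shape as a curated offline dataset, so they can be run against each new agent version to catch the same failure if it reappears.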
