@AdamRLucek: What are Online Evals? Most agent evals run "offline": a premade dataset of inputs goes through the agent, and an inter…
Summary
Explains the concept of online evaluations for AI agents, which measure agent performance on live traffic over time, as opposed to offline evaluations that use fixed datasets.
Cached at: 06/17/26, 03:58 PM
What are Online Evals?
Most agent evals run “offline”: a premade dataset of inputs goes through the agent, and an intermediate step or final output gets scored. They answer “is this version better than the last?”
Online evals answer a different question: “is the agent still working?” Instead of a fixed dataset, they measure a dimension of the agent as it runs on live traffic, tracked over time. Let’s discuss how they work and how to set them up.
First, how are the two structurally different?
Offline, we compare an agent’s behavior against a ground truth defined while building it. The eval is a dataset of inputs and expected outputs, curated to capture the behavior we want, which lets us compare performance across versions.
Online, the focus shifts from comparison to monitoring. We swap that curated dataset for live production data and score the outputs as they’re generated. Two things change: we don’t control the inputs (they come from real users), and we have no ground truth to measure the outputs against.
These constraints shape how online evals are designed. They tend to fall into two categories:
- Heuristic evals: written as code that runs directly on the trace, measuring deterministic signals like step count, response length, or content matches.
- LLM-as-a-judge evals: for subjective, probabilistic metrics like quality, helpfulness, hallucination, or other application-specific judgements, we score the output against a natural language rubric, i.e. a prompt.
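To make the two categories concrete, here is a minimal sketch. It assumes a trace is a plain dictionary with hypothetical `steps` and `output` fields; real tracing tools define their own schema. The `judge` callable is also an assumption: a stand-in for whatever model client you use, taking a prompt string and returning the model's text reply. The rubric wording and the 1 to 5 scale are illustrative only.

```python
from typing import Callable

# Hypothetical trace shape: {"steps": [...], "output": "..."}.
Trace = dict


def step_count(trace: Trace) -> int:
    """Heuristic: how many steps the agent took."""
    return len(trace["steps"])


def response_length(trace: Trace) -> int:
    """Heuristic: length of the final output in characters."""
    return len(trace["output"])


def contains_phrase(trace: Trace, phrase: str) -> bool:
    """Heuristic: case-insensitive content match on the final output."""
    return phrase.lower() in trace["output"].lower()


# Illustrative rubric; in practice this prompt is application-specific.
RUBRIC = (
    "You are grading an assistant's answer for helpfulness.\n"
    "Reply with a single integer from 1 (unhelpful) to 5 (very helpful).\n\n"
    "Answer:\n{output}"
)


def judge_helpfulness(trace: Trace, judge: Callable[[str], str]) -> int:
    """LLM-as-a-judge: score the output against a natural language rubric.

    `judge` is an assumed callable that sends a prompt to a model and
    returns its text reply.
    """
    reply = judge(RUBRIC.format(output=trace["output"]))
    score = int(reply.strip())  # raises ValueError if the reply is not an integer
    if not 1 <= score <= 5:
        raise ValueError(f"judge returned out-of-range score: {score}")
    return score
```

Note that neither kind of function takes an expected output: both score the trace on its own, which is what makes them usable on live traffic. Passing the judge in as a callable keeps the rubric logic testable with a stub, independent of any particular model provider.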
Scoring live traffic this way lets us monitor the agent’s performance in production (akin to observability), see trends over time, and get alerted when a metric drops below a threshold. Combining heuristic functions and LLM-as-a-judge implementations lets us capture both hard and fuzzy metrics from live usage that may not surface in a controlled, offline experiment.
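One simple way to track a metric over time and alert on a drop is a rolling mean over the most recent scores. The sketch below is an assumption about how that could look, not a description of any particular monitoring tool; the `RollingMetric` class, its window size, and its threshold are all hypothetical names and values.

```python
from collections import deque


class RollingMetric:
    """Tracks the mean of the most recent scores and flags drops."""

    def __init__(self, window: int, threshold: float):
        if window < 1:
            raise ValueError("window must be at least 1")
        self.window = window
        self.threshold = threshold
        # A bounded deque discards the oldest score once the window is full.
        self.scores = deque(maxlen=window)

    def mean(self) -> float:
        """Mean of the scores currently in the window."""
        if not self.scores:
            raise ValueError("no scores recorded yet")
        return sum(self.scores) / len(self.scores)

    def record(self, score: float) -> bool:
        """Add a score; return True if an alert should fire.

        Alerts only fire once the window is full, to avoid reacting
        to the first few noisy scores.
        """
        self.scores.append(score)
        window_full = len(self.scores) == self.window
        return window_full and self.mean() < self.threshold
```

Each time an online eval scores a trace, the result is passed to `record`, and a `True` return value is the signal to notify someone. A wider window smooths out noise at the cost of reacting more slowly to a real regression.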
Online and offline evals aren’t competitors. Rather, they tend to feed into each other. Online monitoring surfaces problematic traces, which undergo annotation and error analysis, and are ultimately converted into offline evaluations to cement behavior and capture regressions as the agent evolves. Together, they close the full evaluation loop and provide a holistic view of your agent’s performance.
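The hand-off from online to offline can be sketched as follows. This assumes each flagged trace carries a hypothetical `input` field and that a human annotator supplies the expected output during error analysis. Exact string matching is used as the scorer purely for brevity; a real offline eval would likely use a looser comparison or a rubric.

```python
from typing import Callable


def to_offline_case(trace: dict, expected_output: str) -> dict:
    """Turn an annotated production trace into an offline dataset row."""
    return {"input": trace["input"], "expected": expected_output}


def run_offline(agent: Callable[[str], str], dataset: list[dict]) -> float:
    """Return the fraction of cases where the agent's output matches exactly.

    Exact match is a simplifying assumption for this sketch.
    """
    if not dataset:
        raise ValueError("dataset is empty")
    passed = sum(agent(case["input"]) == case["expected"] for case in dataset)
    return passed / len(dataset)
```

Running `run_offline` against each new agent version then answers the offline question, “is this version better than the last?”, on exactly the cases that failed in production.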