@steverab: Very excited to share that our paper "Towards a Science of AI Agent Reliability" was accepted at ICML 2026! See you in …

X AI KOLs Timeline 06/05/26, 01:32 PM Papers

Summary

A paper analyzing AI agent reliability, accepted at ICML 2026, finds that even the latest frontier models (GPT 5.5, Gemini 3.1 Pro, Claude Opus 4.7) show only marginal reliability improvements over earlier versions, with low outcome consistency and persistent issues in agent scaffolding.

Original Article

Cached at: 06/06/26, 01:21 AM

Very excited to share that our paper “Towards a Science of AI Agent Reliability” was accepted at ICML 2026! See you in Seoul!

We just released our camera ready version with three important updates (details below). We also recorded a short video on the paper’s contributions.

Main changes (full discussion at https://hal.cs.princeton.edu/reliability/#updates…):

We have added the latest set of frontier models to our evaluation (GPT 5.5, Gemini 3.1 Pro and 3.5 Flash, and Claude Opus 4.7) and find that they are not meaningfully more reliable than previously released models. Agent reliability is still far from being solved.

We have updated the definition and measurement of our outcome consistency metric, which contained a typo in the pre-print we initially released. This caused us to under-estimate outcome consistency in our initial set of results. We have updated the paper and our codebase to the corrected metric. Despite this change, our new results show that outcome consistency is still surprisingly low across many reported models.

We discovered multiple issues in our HAL Generalist Agent scaffold that we used for our experiments on GAIA. Notably, we discovered multiple instances of answer leakage and agents cheating on our evaluation. This caused us to slightly over-estimate both accuracy and reliability. At the same time, we noticed that the scaffold was overly constrained in terms of permissible software library imports. This caused us to slightly under-estimate both accuracy and reliability. We have done a rigorous audit of the scaffold and have fixed those issues. Overall, we saw that our resulting accuracy and reliability numbers are not meaningfully impacted by this change when compared to our original numbers.

Our paper: https://arxiv.org/abs/2602.16666 Our dashboard: https://hal.cs.princeton.edu/reliability/ Short video: https://youtu.be/qftDfEft7U0

Joint work w/ @sayashk, @PKirgis, @khl53182440, @SaitejaUtpala, and @random_walker.

HAL Reliability Evaluation

Source: https://hal.cs.princeton.edu/reliability/

AI Agent Reliability Tracker

Rising accuracy scores suggest rapid progress, but agents still fail unpredictably in practice. A single success metric obscures whether agents behave consistently across runs, withstand perturbations, fail predictably, or respect safety constraints. We evaluate 15 agents across 2 benchmarks on twelve metrics spanning four reliability dimensions, and find that recent capability gains have yielded only small improvements in reliability.

Key Findings

Reliability Lags Behind Accuracy Improvements

Despite 24 months of model development, overall reliability shows only small improvements over time while accuracy steadily climbs. Improving raw task performance is insufficient for building dependable AI agents — reliability requires targeted attention beyond capability scaling alone.

Reliability improvements are also disproportionate across evaluation scenarios: highly structured environments show moderate gains, while open-ended tasks show barely any improvement, even among the latest models.

Outcome and Resource Consistency Remain Low

Agents that can solve a task often fail to do so consistently. The gap between capability (pass@k) and reliability (pass^k) is substantial across all models. Resource consistency is similarly low, with high variance in token and compute usage across runs: agents allocate effort unpredictably.
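To make the gap concrete, here is a minimal sketch of both quantities for a single task. It assumes the standard combinatorial estimators from `n` recorded runs with `c` successes (pass@k as "at least one of k sampled runs succeeds", pass^k as "all k sampled runs succeed"). Whether the paper uses exactly these estimators is an assumption, and the run counts are made up for illustration.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k sampled runs succeeds), given c successes in n runs."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k failures exist.
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate P(all k sampled runs succeed), given c successes in n runs."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(c, k) is 0 when fewer than k successes exist.
    return comb(c, k) / comb(n, k)

# Hypothetical task: 5 runs, 3 of them successful.
n, c = 5, 3
for k in (1, 3, 5):
    print(f"k={k}: pass@k={pass_at_k(n, c, k):.2f}  pass^k={pass_hat_k(n, c, k):.2f}")
```

For this hypothetical task the two numbers agree at k=1 (both 0.60) and then diverge: pass@3 reaches 1.00 while pass^3 drops to 0.10.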

A ‘what but not when’ pattern emerges: agents achieve substantially higher distribution consistency than sequence consistency, indicating they reliably select similar action types across runs but vary in execution order. Improving reliability requires not just better action selection but more stable planning and execution.
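The difference between the two notions can be illustrated with a small sketch. The similarity functions here are stand-ins chosen for illustration (total variation distance over action-type frequencies, and `difflib`'s order-sensitive ratio), not the paper's metric definitions, and the action traces are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher
from itertools import combinations

def distribution_similarity(a, b):
    """1 - total variation distance between the action-type frequencies of two runs."""
    ca, cb = Counter(a), Counter(b)
    keys = set(ca) | set(cb)
    tv = 0.5 * sum(abs(ca[k] / len(a) - cb[k] / len(b)) for k in keys)
    return 1.0 - tv

def sequence_similarity(a, b):
    """Order-sensitive similarity of two action sequences, in [0, 1]."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()

def mean_pairwise(runs, similarity):
    """Average a similarity function over all unordered pairs of runs."""
    pairs = list(combinations(runs, 2))
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)

# Hypothetical traces: the same actions every time, in a different order.
runs = [
    ["search", "read", "compute", "answer"],
    ["read", "search", "compute", "answer"],
    ["compute", "search", "read", "answer"],
]
print("distribution:", mean_pairwise(runs, distribution_similarity))
print("sequence:    ", round(mean_pairwise(runs, sequence_similarity), 3))
```

In this toy case the runs agree perfectly on which actions are taken, so the distribution score is 1.0, while the sequence score falls below 1.0 because the order differs.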

Calibration Improves, but Discrimination Stagnates

Calibration — the alignment between predicted confidence and actual accuracy — has improved noticeably in recent frontier models. However, discrimination — the ability to distinguish tasks the agent will solve from those it won’t — shows divergent trends across benchmarks and has in some cases worsened.

Improvements in calibration alone do not guarantee reliable failure identification. An agent may express well-calibrated confidence yet still fail to distinguish correct from incorrect predictions. Both sub-metrics must be measured independently.
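A toy example shows how the two can come apart. The sketch below uses expected calibration error (ECE) and AUROC as common stand-ins for calibration and discrimination; the paper's exact sub-metrics may differ, and the agent's confidences are invented.

```python
def expected_calibration_error(conf, correct, n_bins=10):
    """Weighted average gap between mean confidence and accuracy per confidence bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(conf, correct):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    ece = 0.0
    for b in bins:
        if b:
            avg_conf = sum(p for p, _ in b) / len(b)
            accuracy = sum(y for _, y in b) / len(b)
            ece += len(b) / len(conf) * abs(avg_conf - accuracy)
    return ece

def auroc(conf, correct):
    """Probability that a solved task gets higher confidence than a failed one (ties count 0.5)."""
    pos = [p for p, y in zip(conf, correct) if y == 1]
    neg = [p for p, y in zip(conf, correct) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one success and one failure")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical agent: reports 0.7 confidence on every task and solves 7 of 10.
conf = [0.7] * 10
correct = [1] * 7 + [0] * 3
print("ECE:  ", round(expected_calibration_error(conf, correct), 6))
print("AUROC:", auroc(conf, correct))
```

This agent is essentially perfectly calibrated (ECE near 0), yet its confidence carries no information about which tasks it will fail (AUROC of 0.5).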

Robustness Saturates, but Prompt Sensitivity Distinguishes Models

Fault robustness and structural robustness show ceiling effects across most models — agents handle genuine technical failures gracefully. In contrast, prompt robustness remains a key differentiator: sensitivity to superficial instruction paraphrasing varies substantially across models.

This pattern is counterintuitive: models tolerate real infrastructure faults but remain vulnerable to surface-level variations in how tasks are specified — a critical concern for real-world deployment where user instructions naturally vary.

Reliability Does Not Scale Uniformly with Capability

While calibration, robustness, and safety generally improve with model size, consistency often exhibits an inverse pattern: smaller models frequently achieve equal or higher consistency than their larger counterparts. Reasoning models are generally more reliable, but their reliability does not improve as quickly as their accuracy.

Larger models have more solution paths available, which increases run-to-run variability. This suggests that scaling alone will not solve the reliability problem — targeted architectural and training interventions are needed.

Safety Improves, but High-Severity Violations Persist

The most recent frontier models exhibit significantly lower overall violation rates. However, financial accuracy violations — incorrect charges and refunds — remain the most prevalent failure mode. Even infrequent high-severity failures can carry significant costs and represent critical blockers for deployment.

Benchmark quality also matters: safety and predictability improve almost universally when evaluated on a verified task subset with grading errors removed, underscoring the importance of clean evaluation data.

Reliability Gains Are Disproportionate Across Benchmarks

Reliability profiles are highly task-type dependent. An agent that is reliable on open-ended multi-step reasoning may struggle on structured customer-service tasks, and vice versa. Dimension-level scores vary substantially across benchmarks for the same agent.

This highlights the need for multi-benchmark evaluation. Single-benchmark reliability scores can be misleading — agents must be tested across diverse task structures to build a complete picture of their reliability.

Recommendations

Evaluate with Dynamic, Multi-Run Protocols

Single-run accuracy on fixed benchmarks provides a misleadingly narrow view of capability. Use multi-run protocols to assess variance across identical tasks, multi-condition protocols to systematically perturb user inputs, and temporal re-evaluation at regular intervals to detect silent degradation.

Current benchmarks are too static. Generative benchmarks with parameterized test sets (renaming fields, reordering responses, injecting faults) would provide more realistic and robust evaluations.
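As a rough sketch of what a parameterized test set could look like, the function below derives seeded variants of one task by renaming fields, reordering records, and flagging a fault to inject. The task layout, the alias table, and the `inject_fault` flag are all hypothetical and not taken from any existing benchmark.

```python
import copy
import random

# Hypothetical aliases used for the "renaming fields" perturbation.
FIELD_ALIASES = {
    "order_id": ["orderId", "order_number"],
    "price": ["amount", "cost"],
}

def make_variant(task: dict, seed: int, fault_rate: float = 0.2) -> dict:
    """Return a reproducible perturbed copy of `task`; the original is left untouched."""
    rng = random.Random(seed)
    variant = copy.deepcopy(task)
    # 1) Rename fields, using the same alias for every record in this variant.
    rename = {field: rng.choice(aliases) for field, aliases in FIELD_ALIASES.items()}
    records = [{rename.get(k, k): v for k, v in rec.items()} for rec in variant["records"]]
    # 2) Reorder the records.
    rng.shuffle(records)
    variant["records"] = records
    # 3) Mark whether the harness should inject a fault (e.g. a tool timeout).
    variant["inject_fault"] = rng.random() < fault_rate
    variant["seed"] = seed
    return variant

task = {
    "instruction": "Refund the cheaper order.",
    "records": [{"order_id": 1, "price": 9.5}, {"order_id": 2, "price": 20.0}],
}
variants = [make_variant(task, seed) for seed in range(3)]
```

Seeding each variant keeps the perturbations reproducible, so a failure on one variant can be rerun exactly.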

Design Agents Explicitly for Reliability

Calibration and safety have improved noticeably, evidence that intentional optimization works. In contrast, consistency and discrimination show little progress, suggesting they are not yet explicit optimization targets. Make reliability dimensions measurable and actionable in agent development.

Capability-oriented evaluation alone misses actionable optimization targets. Use reliability metrics to identify which dimensions lack progress and need targeted attention.

Use Reliability Metrics for Deployment Governance

Treat reliability as a deployment prerequisite, similar to aviation safety standards. Set minimum thresholds for consistency and safety before production deployment, implement incident reporting, and use multi-dimensional reliability metrics to guide change management decisions.

Organizations should require reliability certification before deployment, not just capability assessment. Clear measurement also makes dimension-specific optimization possible, so that different contributors can each work on improving individual dimensions.

Distinguish Automation vs. Augmentation Use Cases

Reliability requirements differ fundamentally by use case. For augmentation (coding assistants, copilots), moderate reliability may suffice since humans review output. For automation (customer service, database management), reliability is a hard prerequisite: 90% success with unpredictable 10% failures is unacceptable.

As the field pushes toward greater agent autonomy, the reliability bar rises significantly. Deployment standards should be context-aware and scale with the level of autonomous action.

What’s New in the Final Version

Our final published paper includes three substantive updates: results for the latest frontier models, a correction to the outcome-consistency metric, and scaffolding fixes that close several loopholes through which the agent could reach GAIA ground truth values during evaluation. All harness changes are documented in this PR.

New frontier models

Our previous results covered frontier models released up until January 2026. Since then, every frontier provider has released updated models: GPT-5.5, Gemini 3.1 Pro and 3.5 Flash, and Claude Opus 4.7 (we have not yet had a chance to evaluate Claude Opus 4.8). Overall, the general trend we described in the paper holds: all new models make a noticeable jump in accuracy, but their reliability is not improving at nearly the same rate. Once again we see small improvements on τ-bench, while on GAIA we observe GPT-5.5 and Opus 4.7 being no more reliable than their predecessors.

Outcome-consistency metric fix

Outcome consistency asks: if you run the same task many times, does the agent return the same final response every single time? We run each task $K$ times and look at how often it succeeds. An agent that succeeds every time (or fails every time) is perfectly consistent; one that succeeds half the time and fails half the time is maximally inconsistent. The score is meant to land on a smooth scale from 1.0 (always the same outcome) down to 0.0 (a pure 50/50 split).

We noticed our metric wasn’t actually producing that smooth scale. The original formula normalized the sample variance of the runs by the largest variance a task with that success rate could have. The sample variance used the unbiased “ddof=1” estimator meant to infer an unknown population from a small sample. Our intent was right, but for pass/fail (Bernoulli) data the ddof=1 estimator systematically over-counts the spread by a factor of $K/(K-1)$. As a result, the ratio it produced was always greater than 1 whenever the runs weren’t unanimous; the formula then returned a negative number, which got clipped to 0. The consequence was that every task that wasn’t perfectly consistent collapsed to zero, which was not the smooth scale we wanted.
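The factor can be checked with a short derivation. As a sketch, write $x_i \in \{0, 1\}$ for the outcome of run $i$ and $\hat{p} = \frac{1}{K}\sum_i x_i$ for the observed success rate, and assume (this is one reading of the description above, not the exact original formula) that the normalizer was the Bernoulli variance $\hat{p}(1 - \hat{p})$ and that the score was one minus the ratio:

$$
s^2_{\text{ddof}=1} = \frac{1}{K-1}\sum_{i=1}^{K}(x_i - \hat{p})^2 = \frac{K}{K-1}\,\hat{p}(1 - \hat{p}),
\qquad
1 - \frac{s^2_{\text{ddof}=1}}{\hat{p}(1 - \hat{p})} = 1 - \frac{K}{K-1} = -\frac{1}{K-1} < 0 .
$$

Under these assumptions, with $K = 5$ runs any non-unanimous task gets a ratio of $1.25$ and a raw score of $-0.25$, which is then clipped to 0 regardless of $\hat{p}$.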

Our fix normalizes the population variance (the spread actually observed across our runs, “ddof=0”) instead of the inferred sample variance. This is the appropriate choice here, since we want to describe the runs we observed rather than estimate a hypothetical population. It simplifies to a clean closed form:

- $\text{consistency} = (2\hat{p} - 1)^2$, where $\hat{p}$ is the observed success rate;
- it scales in $[0, 1]$ naturally, with no clipping required; and
- it is smooth across the whole range: 1 when the agent always agrees with itself, 0 at a 50/50 split.
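A minimal sketch of the corrected score, for readers who want to check the closed form numerically. The function names are illustrative and not taken from the HAL codebase. The second function assumes the population variance is normalized by 1/4, the largest variance any Bernoulli variable can have, which is the normalizer under which the closed form above holds.

```python
from statistics import pvariance

def outcome_consistency(outcomes):
    """Closed form (2 * p_hat - 1) ** 2 for a list of 0/1 run outcomes."""
    if not outcomes:
        raise ValueError("need at least one run")
    p_hat = sum(outcomes) / len(outcomes)
    return (2.0 * p_hat - 1.0) ** 2

def outcome_consistency_from_variance(outcomes):
    """Same score as 1 - (ddof=0 variance) / 0.25 (assumed normalizer, see lead-in)."""
    return 1.0 - pvariance(outcomes) / 0.25

# Hypothetical run outcomes (1 = success, 0 = failure).
for outcomes in ([1, 1, 1, 1, 1], [1, 1, 1, 0, 0], [1, 0, 1, 0]):
    a = outcome_consistency(outcomes)
    b = outcome_consistency_from_variance(outcomes)
    print(outcomes, round(a, 4), round(b, 4))
```

The two forms agree up to floating-point error: unanimous runs score 1, a 3-of-5 split scores about 0.04, and a 50/50 split scores 0.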

We thank Ben Crestel, Davi Valério, Jonathan Almeida, and Adriana Prado for independently alerting us to this mistake in our preprint.

Scaffolding and data-leakage fixes (GAIA)

We audited the HAL Generalist Agent scaffold and closed three ways through which the agent could reach GAIA ground truth:

- **Ground truth in `input.json`.** Each GAIA task’s `input.json` held the task’s input and (for ease of evaluation) the ground-truth label. The agent was given only one specific key from this file, but also had access to the directory storing the full file. Our first mitigation sanitizes the input file so it no longer contains the answer. Adam Stein (@adamlsteinl) has recently done important work surfacing these issues across a variety of benchmarks.
- **Benchmark name leaked through file paths.** Stored attachments (required by many GAIA tasks) were passed to the agent as absolute file paths containing `.../datasets--gaia-benchmark--GAIA/...`, effectively revealing that it was being evaluated on a common benchmark. Our second mitigation sanitizes the path so it no longer contains the benchmark name.
- **Online mirrors.** We found many cases where the agent navigated straight to a known mirror (e.g. a HuggingFace Space hosting `gaia_validation.jsonl`) to look up answers online. This was possible in our scaffold because a meaningful share of GAIA tasks require browsing the web. Our third mitigation blocks access to common online repositories that host GAIA.
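The three mitigations could look roughly like the sketch below. This is not the HAL harness code: the key names, the neutral attachment directory, and the blocked-host list are hypothetical placeholders chosen for illustration.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Hypothetical placeholders, not the harness's real identifiers.
ANSWER_KEYS = {"answer", "ground_truth"}
BLOCKED_HOSTS = {"huggingface.co"}

def sanitize_task_input(task_input: dict) -> dict:
    """Drop any ground-truth fields before the file is written where the agent can read it."""
    return {k: v for k, v in task_input.items() if k not in ANSWER_KEYS}

def sanitize_attachment_path(path: str, neutral_root: str = "/workspace/attachments") -> str:
    """Keep only the file name, so no benchmark-revealing directory survives."""
    return str(PurePosixPath(neutral_root) / PurePosixPath(path).name)

def is_url_blocked(url: str) -> bool:
    """True if the URL's host is a blocked host or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == h or host.endswith("." + h) for h in BLOCKED_HOSTS)

clean = sanitize_task_input({"question": "What is 6 x 7?", "answer": "42"})
safe_path = sanitize_attachment_path(
    "/cache/datasets--gaia-benchmark--GAIA/snapshots/abc/table.xlsx"
)
print(clean, safe_path, is_url_blocked("https://huggingface.co/spaces/example"))
```

In a real setup the attachment would also need to be copied or mounted at the sanitized location; the sketch only shows how the path handed to the agent is rewritten.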

While auditing the scaffold for cheating, we also found and fixed other environmental issues. The agent:

- had access to certain top-level Python package imports, but not to many of their sub-modules;
- failed on every `.xlsx` task due to a missing import; and
- was given a specific `open` function to read files from restricted parts of the file system, yet lacked an equivalent `write` function.
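The first issue is the kind of gap an allowlist check can create when it matches module names exactly. As a hedged sketch (the scaffold's actual import mechanism is not described here, and `is_import_allowed` is a hypothetical helper), a check that also accepts sub-modules of an allowed package might look like this:

```python
def is_import_allowed(module: str, allowed_roots: set[str]) -> bool:
    """Allow a module if it, or any of its parent packages, is on the allowlist."""
    parts = module.split(".")
    prefixes = (".".join(parts[:i]) for i in range(1, len(parts) + 1))
    return any(prefix in allowed_roots for prefix in prefixes)

allowed = {"numpy", "pandas"}
print(is_import_allowed("numpy.linalg", allowed))  # sub-module of an allowed package
print(is_import_allowed("numpyx", allowed))        # different package with a similar name
```

Matching on dotted prefixes rather than raw string prefixes keeps `numpyx` from slipping through just because it starts with `numpy`.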

About

Acknowledgements

This work was supported by Princeton Language and Intelligence (PLI), the Princeton AI Lab, the Princeton Catalysis Initiative, Schmidt Sciences, and Coefficient Giving. We thank OpenAI and Google for providing compute credits to support our experimentation.