How do you actually test an agent harness when half of it is non-deterministic?

Reddit r/AI_Agents 06/16/26, 08:25 PM News

agent-testing non-determinism llm-as-judge golden-tests software-engineering testing-practices

Summary

A discussion on the challenges of testing AI agent harnesses with non-deterministic components, exploring approaches like golden output diffing and using an LLM as a judge, while questioning the validity of such methods.

Running into this at Lium, and I'm curious how other people handle it. The deterministic parts of a harness are easy to test: retry logic, parsing, routing, all of that you can unit test like normal code. But the second the model has to make a real judgment call, how do you even write a test for that?

Do you check for an exact output and accept that it'll be brittle, since the model phrases things differently every run? Do you use another model as a judge, and if so, who tests the judge? Do you just run it fifty times and eyeball whether it feels right often enough?

I tried golden output diffing first. It failed constantly even when the agent was doing the right thing, just worded differently. I switched to LLM-as-judge for a bit, which works better, but now I've got a non-deterministic test grading a non-deterministic system, which feels like moving the problem one layer up instead of solving it.

Has anyone landed on something that actually works here? Is it just accepted that agent testing is fuzzier than normal software testing, or is there a pattern I'm missing?
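To make the golden-diff problem concrete, here is a rough Python sketch, not anything from a real harness. It assumes the agent returns a structured result (a hypothetical `action` and `order_id` alongside free-text `message`), and `fake_agent` is a made-up stand-in for the model call that picks the same decision with different wording each run. It compares an exact-match golden check with a check on the structured decision, and turns "run it fifty times and eyeball it" into a measured pass rate.

```python
import random
from typing import Callable

# Hypothetical golden output, only for illustration.
GOLDEN = "Refund approved for order 1234."


def fake_agent(rng: random.Random) -> dict:
    """Made-up stand-in for a model call: same decision, varied wording."""
    message = rng.choice([
        "Refund approved for order 1234.",
        "I've approved the refund on order 1234.",
        "Order 1234 qualifies, so the refund is approved.",
    ])
    return {"action": "approve_refund", "order_id": "1234", "message": message}


def golden_check(output: dict) -> bool:
    """Exact-match diff against the golden text: fails on any rewording."""
    return output["message"] == GOLDEN


def invariant_check(output: dict) -> bool:
    """Checks the decision itself and ignores how it was phrased."""
    return output["action"] == "approve_refund" and output["order_id"] == "1234"


def pass_rate(check: Callable[[dict], bool], runs: int = 50, seed: int = 0) -> float:
    """Runs the agent `runs` times and returns the fraction of runs that pass."""
    rng = random.Random(seed)  # seeded so the sketch is repeatable
    passed = sum(check(fake_agent(rng)) for _ in range(runs))
    return passed / runs


if __name__ == "__main__":
    print("golden diff pass rate:", pass_rate(golden_check))
    print("invariant pass rate:  ", pass_rate(invariant_check))
```

With a real model the invariant check would not pass every time, so the test would assert that the pass rate stays above some threshold you pick rather than demanding 100%. This only covers cases where the judgment call can be reduced to a checkable field; it does not answer the "who tests the judge" question for open-ended output.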
