How do you actually test an agent harness when half of it is non-deterministic?

Reddit r/AI_Agents News

Summary

A discussion on the challenges of testing AI agent harnesses with non-deterministic components, exploring approaches like golden output diffing and using an LLM as a judge, while questioning the validity of such methods.

Running into this at Lium and I'm curious how other people handle it.

The deterministic parts of a harness are easy to test. Retry logic, parsing, routing: all of that you can unit test like normal code. But the second the model has to make a real judgment call, how do you even write a test for that? Do you check for an exact output and accept it'll be brittle since the model phrases things differently every run? Do you use another model as a judge, and if so, who tests the judge? Do you just run it fifty times and eyeball whether it feels right often enough?

I tried golden output diffing first. It failed constantly even when the agent was doing the right thing, just worded differently. Switched to LLM-as-judge for a bit, which works better, but now I've got a non-deterministic test grading a non-deterministic system, which feels like it's just moving the problem one layer up instead of solving it.

Anyone landed on something that actually works here? Is it just accepted that agent testing is fuzzier than normal software testing, or is there a pattern I'm missing?
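
To make the "run it fifty times" idea concrete, here is a rough sketch of one way it could look: swap the eyeballing for a deterministic property check and a pass-rate threshold. Everything in it is hypothetical and for illustration only: `fake_agent` stands in for a real model call, `mentions_refund` is a made-up property, and the 0.9 threshold is arbitrary. It does not settle the judge-of-the-judge question; it only shows why an exact golden match is brittle while a property check tolerates rewording.

```python
import random
from typing import Callable

# Hypothetical golden output and rewordings that all mean the same thing.
GOLDEN = "I have issued a refund for your order."
PHRASINGS = [
    GOLDEN,
    "Your order has been refunded.",
    "A refund is on its way for that order.",
]

_rng = random.Random(0)  # seeded so the sketch is repeatable


def fake_agent(prompt: str) -> str:
    """Stand-in for a real agent: correct behaviour, varying wording."""
    return _rng.choice(PHRASINGS)


def matches_golden(output: str) -> bool:
    """Golden-output diff: passes only on an exact string match."""
    return output == GOLDEN


def mentions_refund(output: str) -> bool:
    """Property check: passes if the decision is present, however it is worded."""
    return "refund" in output.lower()


def pass_rate(agent: Callable[[str], str],
              check: Callable[[str], bool],
              prompt: str,
              runs: int = 50) -> float:
    """Run the agent `runs` times and return the fraction of outputs passing `check`."""
    passed = sum(1 for _ in range(runs) if check(agent(prompt)))
    return passed / runs


def meets_threshold(rate: float, threshold: float = 0.9) -> bool:
    """Turn 'feels right often enough' into an explicit, arbitrary cutoff."""
    return rate >= threshold


if __name__ == "__main__":
    prompt = "Customer asks for a refund on a damaged item."
    print("golden diff pass rate:", pass_rate(fake_agent, matches_golden, prompt))
    print("property pass rate:   ", pass_rate(fake_agent, mentions_refund, prompt))
```

The design choice here is to keep the check itself deterministic, so the only randomness left in the test is the agent's. That only works when the judgment call can be reduced to a checkable property, which is not always possible.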
