Tag
A discussion on the challenges of testing AI agent harnesses with non-deterministic components, exploring approaches like golden output diffing and using an LLM as a judge, while questioning the validity of such methods.
This article argues that standard test suites with fixed inputs and expected outputs are insufficient for AI agents due to infinite input spaces and non-deterministic behavior, advocating for property-based testing instead.