Tag
A discussion on the challenges of testing AI agent harnesses with non-deterministic components, exploring approaches like golden output diffing and using an LLM as a judge, while questioning the validity of such methods.