Tag
A discussion on the challenges of testing AI agent harnesses with non-deterministic components, exploring approaches like golden output diffing and using an LLM as a judge, while questioning the validity of such methods.
A developer seeks a model that frequently gets stuck in loops (e.g., GLM Flash) to test loop detection and recovery features for an agent, aiming to develop heuristics that score loop probability and enable backtracking.