Tag
Gergely Orosz shares his experience using Antithesis, a deterministic testing infrastructure that can run hours of testing in minutes.
This paper introduces layer-isolated evaluation for LLM agents: a production agent is decomposed into architectural layers, each tested with a deterministic, no-LLM harness. It demonstrates that per-slice baseline testing localizes regressions that aggregate metrics mask, validated by controlled regression injections across multiple tenants.
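
A minimal sketch of how a per-slice baseline can surface a regression that an aggregate metric hides. The slice keys, layer names, scores, and tolerance are all hypothetical and chosen for illustration; none of them come from the paper.

```python
from statistics import mean

# Hypothetical pass rates per (tenant, layer) slice. Illustrative numbers only.
baseline = {
    ("tenant_a", "routing"): 0.90,
    ("tenant_a", "retrieval"): 0.80,
    ("tenant_b", "routing"): 0.90,
    ("tenant_b", "retrieval"): 0.80,
}
candidate = {
    ("tenant_a", "routing"): 0.95,
    ("tenant_a", "retrieval"): 0.85,
    ("tenant_b", "routing"): 0.95,
    ("tenant_b", "retrieval"): 0.65,  # injected regression in one slice
}


def regressed_slices(baseline, candidate, tolerance=0.05):
    """Return the slices whose score dropped by more than `tolerance`."""
    return sorted(
        key for key, base in baseline.items()
        if base - candidate[key] > tolerance
    )


# The aggregate barely moves because gains elsewhere offset the drop.
aggregate_delta = mean(candidate.values()) - mean(baseline.values())

print(f"aggregate delta: {aggregate_delta:+.3f}")
print("regressed slices:", regressed_slices(baseline, candidate))
```

In this toy data the mean score is unchanged, while the per-slice comparison points at a single tenant and layer, which is the kind of localization the summary describes.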