Tag
This paper introduces layer-isolated evaluation for LLM agents: a production agent is decomposed into architectural layers, each tested with a deterministic, no-LLM harness. It shows that per-slice baseline testing localizes regressions that aggregate metrics mask, a result validated by controlled regression injections across multiple tenants.
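To make the masking effect concrete, here is a minimal sketch of a per-slice baseline comparison, assuming a slice is a (layer, tenant) pair and that each slice's harness reports a pass count. The names `SliceResult`, `aggregate_pass_rate`, `regressed_slices`, the layer and tenant labels, the `tolerance` threshold, and all numbers are hypothetical illustrations, not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SliceResult:
    """Outcome of one deterministic harness run (hypothetical schema)."""
    layer: str    # illustrative layer name, e.g. "routing"
    tenant: str   # illustrative tenant id
    passed: int
    total: int

    @property
    def pass_rate(self) -> float:
        return self.passed / self.total


def aggregate_pass_rate(results):
    """Pooled pass rate over all slices, i.e. the single aggregate metric."""
    return sum(r.passed for r in results) / sum(r.total for r in results)


def regressed_slices(baseline, current, tolerance=0.05):
    """Return (layer, tenant) keys whose pass rate fell by more than `tolerance`."""
    base = {(r.layer, r.tenant): r.pass_rate for r in baseline}
    flagged = []
    for r in current:
        key = (r.layer, r.tenant)
        if key in base and base[key] - r.pass_rate > tolerance:
            flagged.append(key)
    return sorted(flagged)


# Made-up numbers: small gains on the large slices offset a sharp drop
# on one small slice, so the pooled metric does not move.
BASELINE = [
    SliceResult("routing", "tenant_a", 95, 100),
    SliceResult("routing", "tenant_b", 18, 20),
    SliceResult("tools", "tenant_a", 90, 100),
    SliceResult("tools", "tenant_b", 17, 20),
]
CURRENT = [
    SliceResult("routing", "tenant_a", 97, 100),
    SliceResult("routing", "tenant_b", 18, 20),
    SliceResult("tools", "tenant_a", 93, 100),
    SliceResult("tools", "tenant_b", 12, 20),
]

if __name__ == "__main__":
    print("aggregate baseline:", aggregate_pass_rate(BASELINE))
    print("aggregate current: ", aggregate_pass_rate(CURRENT))
    print("regressed slices:  ", regressed_slices(BASELINE, CURRENT))
```

In this toy data the pooled pass rate is identical before and after the change, while the per-slice comparison points at a single (layer, tenant) pair. A real setup would likely also need a policy for slices that are missing from the baseline, which this sketch simply skips.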