Tag
This paper empirically examines when autonomous AI agents should be interrupted during software execution. It finds that affective-state thresholds saturate quickly, that LLM judges achieve low F1 scores (0.17–0.40) at high cost, and that human annotators themselves show near-chance agreement on intervention timing, which makes the construct unreliable as an optimization target.