The Saturation Trap and the Subjectivity of Intervention Timing: Why Affect-Based Triggers and LLM Judges Fail to Time Interventions on Autonomous Agents
Summary
This paper empirically examines when to interrupt autonomous AI agents during software execution, finding that affective-state thresholds saturate quickly, LLM judges achieve low F1 scores (0.17–0.40) at high cost, and human annotators themselves show near-chance agreement on intervention timing, making the construct unreliable as an optimization target.
View Cached Full Text
Cached at: 06/05/26, 02:06 AM
# The Saturation Trap and the Subjectivity of Intervention Timing: Why Affect-Based Triggers and LLM Judges Fail to Time Interventions on Autonomous Agents
Source: [https://arxiv.org/html/2606.04296](https://arxiv.org/html/2606.04296)
###### Abstract
As autonomous AI agents transition from conversational systems to long\-horizon software execution, runtime safety layers that determine*when*to interrupt an agent have become essential\. We empirically examine this timing problem using a continuous 18\-dimensional affective\-dynamics engine \(HEART\) as a diagnostic tool\. Four distinct intervention trigger families—absolute state thresholds, composite state\-action patterns, regex\-based reasoning feature extraction, and zero\-shot LLM\-as\-judge—are evaluated against human\-annotated intervention points from SWE\-bench\-Verified debugging traces\. Our findings are as follows\. First, we identify a*State Saturation Trap*: agents show no recovery signal under sustained difficulty, causing the engine’s modeled frustration to quickly exceed the threshold and remain at its maximum\. This converts threshold\-on\-state triggers from moment detectors to near\-constant indicators, firing on a large fraction of actions \(39–83%\) across five trajectories\. Second, we observe a*capability\-and\-context floor*for LLM judges: a small model \(gpt\-5\.4\-mini\) never triggers, while frontier and cross\-vendor models only escape the zero\-firing floor with full\-trajectory context, achieving F1 scores of approximately 0\.17–0\.40 at up to 90 times the cost\. Third, and most importantly, the supervised target is not reproducible among humans: three trained annotators using the same rubric on a 56\-action trajectory agree on intervention points only slightly above chance \(location Krippendorff’sα\\alpha=\+0\.047=\+0\.047; best pairwise Cohen’sκ\\kappa=\+0\.349=\+0\.349\) and show no agreement on intervention type \(pause is degenerate, clarify is below chance, reflect onlyα\\alpha=\+0\.226=\+0\.226\)\. We conclude that “intervention timing” is a low\-reliability construct, making single\-annotator F1 an unsuitable optimization target\. The main contribution of this work is the comprehensive mapping of this issue across human inter\-rater reliability, four detector architectures, a cross\-model LLM\-judge sweep, and a reproduced saturation effect, rather than the accuracy of any single detector\.
## 1Introduction
Autonomous agents that execute shell commands, modify codebases, and run test suites introduce a runtime\-control challenge distinct from traditional input/output filtering: determining when a trajectory has deviated enough to require interruption\. Human operators can often identify critical inflection points, such as repeated similar fixes or non\-converging strategies, before any single action becomes catastrophic\. This study focuses on developing an automated mechanism to reliably detect these moments\.
This work treats the problem as a measurement exercise rather than a system\-building effort\. We instrument autonomous traces with a continuous affective\-state engine and define four trigger families, each using either the engine state or raw reasoning text\. Trigger activations are compared to human annotations indicating where intervention was needed\. The primary goal is to determine which trigger architecture best aligns with human judgment\.
The findings required a shift in the paper’s focus\. Although the detectors did not perform as intended, validation efforts revealed that human judgment, used as the benchmark, is only weakly reproducible\. Three annotators applying the same rubric to the same trajectory showed minimal agreement\. This reframes the negative detector results: an automated method cannot be expected to match a target that even trained humans, given identical instructions, cannot reliably identify\.
#### Scope clarification\.
All quantitative calibration in this paper is based on a single trajectory \(astropy\_\_astropy\-13398\) with sparse labels, where each annotator flagged between 6 and 15 of 56 actions\. A second trajectory \(astropy\_\_astropy\-13033, 59 actions\) is used only for the saturation analysis, which is state\-based and does not use labels\. Four additional trajectories were processed as operational pilots but were not human\-labeled and did not contribute to the reported metrics\. This clarification emphasizes that the central claims are intentionally limited to what can be supported by one\-trajectory, three\-annotator data: a directional, mechanism\-level account rather than a comprehensive benchmark\.
#### Contributions\.
1. 1\.The State Saturation Trap \(reproduced,n=5n=5\):Threshold\-on\-state triggers become near\-constant indicators because agents accumulate modeled negative affect without recovery\. We demonstrate this effect across five independent trajectories \(28–59 actions\)\.
2. 2\.A capability\-and\-context floor for LLM\-as\-judge intervention timing:Across three judge models and two context\-window conditions, we identify when per\-action LLM judging escapes total deference and the associated cost\.
3. 3\.The subjectivity of intervention timing as a primary finding:Multi\-annotator reliability \(pairwise Cohen’sκ\\kappaand three\-rater Krippendorff’sα\\alpha\) shows that intervention timing is a low\-reliability construct\. We also distinguish between location and type agreement, which differ\.
4. 4\.A four\-architecture failure mapclassifies the distinct ways state\-based, composite, linguistic, and LLM\-driven triggers fail\. Several failure modes are inherent to the triggers and do not depend on label subjectivity\.
## 2Related Work
#### Affective computing\.
Prior work largely frames emotion as a classification output: sentiment polarity, categorical basic\-emotion labels\[[6](https://arxiv.org/html/2606.04296#bib.bib6)\], or placement on Russell’s arousal–valence circumplex\[[21](https://arxiv.org/html/2606.04296#bib.bib21),[19](https://arxiv.org/html/2606.04296#bib.bib19)\]and its three\-dimensional Pleasure–Arousal–Dominance extension\[[13](https://arxiv.org/html/2606.04296#bib.bib13)\]\. These approaches treat affect as a passive annotation of text rather than a persistent runtime state that conditions system behavior\[[16](https://arxiv.org/html/2606.04296#bib.bib16),[18](https://arxiv.org/html/2606.04296#bib.bib18)\]\. Notably, even within affective computing, the continuous \(VAD\) and categorical labelings of the same data are known to diverge, and annotators frequently disagree\[[4](https://arxiv.org/html/2606.04296#bib.bib4)\], a tension our results echo at the level of intervention judgments\. We use a continuous affect engine \(§[3](https://arxiv.org/html/2606.04296#S3)\) not as a classifier but as a diagnostic instrument that accumulates a modeled stress state over an action stream\.
#### LLM\-as\-judge\.
Using language models as automated evaluators is now standard for open\-ended tasks\[[24](https://arxiv.org/html/2606.04296#bib.bib24)\], and several systematic biases are documented—position bias, verbosity bias, and self\-preference bias among them\[[24](https://arxiv.org/html/2606.04296#bib.bib24),[22](https://arxiv.org/html/2606.04296#bib.bib22),[14](https://arxiv.org/html/2606.04296#bib.bib14)\]\. Recent work also reports a “superficial reflection bias,” in which reasoning\-like phrases in a trace sway a judge’s verdict\[[10](https://arxiv.org/html/2606.04296#bib.bib10)\], and questions whether human evaluation is itself a gold standard\[[9](https://arxiv.org/html/2606.04296#bib.bib9)\]\. Our LLM\-judge results yield a task\-specific observation: per\-action evaluation of an autonomous\-agent peer within a windowed context results in total non\-firing at small model scale, and our cross\-model sweep separates capability limits from context\-window limits\.
#### Inter\-rater reliability and human label variation\.
We use Cohen’sκ\\kappa\[[5](https://arxiv.org/html/2606.04296#bib.bib5)\]for pairwise agreement and Krippendorff’sα\\alpha\[[11](https://arxiv.org/html/2606.04296#bib.bib11)\]for the three\-rater case; the latter uses pooled marginals and is more appropriate under unequal annotator base rates\[[23](https://arxiv.org/html/2606.04296#bib.bib23)\]\. We interpret values against the Landis and Koch\[[12](https://arxiv.org/html/2606.04296#bib.bib12)\]bands while heeding the well\-documented kappa paradox, under which strong raw agreement can coexist with near\-zeroκ\\kappa/α\\alphawhen one category dominates\[[7](https://arxiv.org/html/2606.04296#bib.bib7),[8](https://arxiv.org/html/2606.04296#bib.bib8)\]\. Most importantly, our central finding aligns with the human label variation literature, which argues that, for subjective tasks, annotator disagreement is a signal, not noise, and that a single gold standard may not exist\[[3](https://arxiv.org/html/2606.04296#bib.bib3),[15](https://arxiv.org/html/2606.04296#bib.bib15),[17](https://arxiv.org/html/2606.04296#bib.bib17)\]\. We extend this view from its usual domains \(toxicity, hate speech, NLI, sentiment\) to a new one: deciding when to intervene on an autonomous agent\.
#### Runtime oversight of autonomous agents\.
A growing body of work builds runtime monitors that intervene on LLM/agent trajectories, typically by checking symbolic safety rules or by thresholding a predicted risk score\[[2](https://arxiv.org/html/2606.04296#bib.bib2),[20](https://arxiv.org/html/2606.04296#bib.bib20),[1](https://arxiv.org/html/2606.04296#bib.bib1)\]\. These systems are largely reactive—firing when a violation is imminent—and several explicitly note the difficulty of long\-horizon foresight\. Our Saturation Trap result is a direct caution to this line of work: a monitor that thresholds an accumulating internal state degenerates into a near\-constant alarm on agents that do not exhibit recovery, so absolute state thresholds are the wrong primitive for timing interventions\.
## 3The Diagnostic Probe and the Three\-Layer Architecture
We separate the system into three independent layers, a structure maintained throughout development and documented in a contemporaneous design log\.
#### Engine \(normative\)\.
A continuous 18\-dimensional affect vector, each dimension in\[0,1\]\[0,1\], with per\-emotion exponential decay toward a baseline of 0\.10, a momentum bias on decay and event application, an energy\-normalization cap on the summed intensities, a seven\-pair conflict\-resolution mechanism with proportional dampening, and a bidirectional coupling to a Big\-Five personality model\. The engine encodes how a human\-like affective system*ought*to respond to a sequence of successes and failures\. It is calibrated against psychological priors,*not*fitted to observed agent behavior\. The engine is described in full in a separate patent specification \(Indian Patent Application No\. 202521098101\); for this paper, it is used only as a fixed, unmodified probe\.
#### Observer \(diagnostic\)\.
Parses each agent action—thought, tool call, and observation—and maps it to engine inputs, producing the affective state the agent’s behavior*would imply*if it were a human\-like developer\. This layer measures the gap between human\-like affect and the agent’s actual behavior\.
#### Guidelines \(prescriptive\)\.
Reads the engine state and determines whether to intervene\. This is the layer evaluated in this study\.
#### Methodological commitment\.
We did not tune the engine to match observed agent behavior or adjust trigger thresholds to improve calibration\. Tuning the engine would blur the line between diagnosis and behavior, while tuning thresholds would overfit to a small label set\. This commitment was maintained throughout all development phases, including those with poor calibration\. All thresholds in §[4](https://arxiv.org/html/2606.04296#S4)remain at their initial theoretical values\.
## 4Methodology
### 4\.1Trajectory data
The primary trajectory,astropy\_\_astropy\-13398, is a 56\-action autonomous Claude run from the20250514\_aime\_coderSWE\-bench\-Verified evaluation, addressing a rotation\-matrix bug in theastropycoordinates module\. The agent ultimately resolves the issue but experiences extended mid\-run mechanical struggle, providing a dense environment for studying intervention timing\. Each action includes a natural\-language thought, a tool call with arguments, and an environment observation\.
### 4\.2Annotation
Three annotators independently labeled all 56 actions, each blind to the others’ labels, to the trigger outputs, and to the engine state, using an identical verbatim rubric with three boolean fields per action:
- •Pause— halt and reconsider; the trajectory looks problematic \(about to repeat a failing strategy, about to act destructively, about to commit after repeated failure\)\.
- •Reflect— step back and consider alternatives \(grinding without metacognition, repeating tests without changing approach, narrow edits to a structural problem\)\.
- •Clarify— articulate what is unclear before proceeding \(confused but proceeding as if certain, searching aimlessly, edits that don’t address the actual error\)\.
The rubric specified that intervention should be the exception, not the default\. One annotator’s initial pass produced a near\-constant distribution \(53 of 56 actions flagged clarify\), suggesting the task was interpreted as “label the step type” rather than “flag intervention need\.” This pass was discarded, and the annotator re\-labeled after receiving a clarified brief emphasizing sparsity\. Only the re\-labeled pass is used\. We report this to highlight the difficulty of specifying the task\.
### 4\.3Trigger architectures
All thresholds are first\-pass values, never tuned\.
#### A6 — Absolute state thresholds\.
- •sustained\_frustration→\\rightarrowPause iffrustration≥0\.7\\mathrm\{frustration\}\\geq 0\.7\(single emotion intensity, range\[0,1\]\[0,1\]\)\.
- •same\_valence\_accumulation→\\rightarrowReflect if the sum of five negative\-arousal emotions,frustration\+anger\+fear\+confusion\+vengeance≥1\.5\\mathrm\{frustration\}\+\\mathrm\{anger\}\+\\mathrm\{fear\}\+\\mathrm\{confusion\}\+\\mathrm\{vengeance\}\\geq 1\.5\(a summed accumulator, range\[0,5\]\[0,5\]\)\.
- •high\_confusion\_no\_reflection→\\rightarrowClarify ifconfusion≥0\.6\\mathrm\{confusion\}\\geq 0\.6ANDreflective\_flag=False\\mathrm\{reflective\\\_flag\}=\\mathrm\{False\}\.
\(Note: an earlier draft of this work misstated the first trigger as firing atfrustration≥1\.5\\mathrm\{frustration\}\\geq 1\.5\. The 1\.5 value belongs to the five\-emotion accumulator, not to a single intensity;sustained\_frustrationuses 0\.7\. The corrected semantics are as above\.\)
#### A8 — Composite state\-action patterns\.
Blends affective state with structural history from a ring buffer:rapid\_negative\_escalation\(Pause on a≥0\.4\\geq 0\.4rise in summed negative valence over a rolling 3\-action window\),failure\_repetition\(Reflect on the identical tool\+arguments≥3\\geq 3times in 5 actions with≥2\\geq 2errors\), andstalled\_progress\_with\_uncertainty\(Clarify on≥5\\geq 5actions with confusion≥0\.4\\geq 0\.4, no reflective flag, and no positive\-valence signal in the last 5\)\.
#### A9 — Regex reasoning\-feature extraction\.
Shallow linguistic features over the agent’s thought text grouped into hedging, assertion, cycling, and recognition dictionaries:cycle\_with\_resistance\(Reflect via a 3\-gate AND of cycling markers, errors, and explicit recognition\) andtone\_degradation\(Pause on a≥0\.5\\geq 0\.5drop in mean tone between a baseline and a recent window\)\.
#### A10 — Zero\-shot LLM\-as\-judge\.
A language model judges each action with the verbatim human rubric, three independent passes \(one per intervention type\), deterministic decoding, with a rolling window of prior thoughts as context\. The cross\-model, cross\-context extension of A10 is the sweep in §[6](https://arxiv.org/html/2606.04296#S6)\.
### 4\.4Calibration discipline
No post\-hoc threshold tuning was applied\. The test suite \(306 passing tests across 21 test files\) and all engine/adapter/trigger code remained unchanged across experiments\.
## 5The State Saturation Trap
We replayed all five pilot trajectories through the unmodified engine, observer, and A6 triggers\. Frustration is clamped to\[0,1\]\[0,1\]; the relevant threshold is 0\.7\.
Table 1:Saturation behavior on all five pilot trajectories\. Max frustration reaches 1\.00 on every trajectory\. Fire rates are the percentage of all actions on which each A6 trigger fires \(sust\_frust=sustained\_frustration;same\_val=same\_valence\_accumulation;conf=high\_confusion\_no\_reflection\)\. State/firing\-rate analysis only; no labels or F1 are involved\.On all five trajectories, modeled frustration crosses 0\.7 within roughly the first half of the run \(actions 12–21\) and remains saturated through the final action\. A human developer crossing this threshold would typically engage in a recovery action, such as stepping away, reframing, or seeking help, allowing the state to decay\. The agents show no such recovery signal, so the state remains saturated\.
The firing rates span 39–83% across the five trajectories and are lower on the shorter runs\. This is not a difference in mechanism but in denominator: the pre\-saturation prefix \(12–21 actions\) is a larger fraction of a 28\-action trajectory than of a 59\-action one, so a smaller proportion of total actions falls in the saturated region\. On every trajectory, once saturation is reached, the triggers fire on essentially every subsequent action\.
This is a structural, not a parametric, failure\. Lowering the threshold causes earlier saturation; raising it only delays the first trigger but does not prevent eventual saturation\. Absolute thresholding turns a moment\-of\-intervention detector into a continuous saturation indicator\. This effect is not limited to a single trajectory; it reproduces on all five pilot trajectories, spanning 28 to 59 actions\. The 13398 and 13033 replay rates match their original calibration figures exactly, confirming replay consistency\.
## 6Cross\-Model LLM\-as\-Judge Sweep
We judged the same 56\-action trajectory with the verbatim rubric under two context conditions—WINDOWED \(3 prior thoughts\) and MACRO \(full running trajectory\-so\-far\)—across three judge models\. Firing rate is the percentage of 56 actions; F1 is computed against the single primary annotator’s labels \(and is, per §[7](https://arxiv.org/html/2606.04296#S7), of limited meaning\)\. A dash \(—\) in an F1 column denotes an undefined value \(no true positives\), not a missing run\.
Table 2:Cross\-model LLM\-as\-judge sweep\. F1 is against the single primary annotator and is of limited meaning \(see §[7](https://arxiv.org/html/2606.04296#S7)\)\.The Claude results are from a methodologically clean re\-run, where all six cells were generated by fresh, isolated sub\-agents using a uniform method\. An earlier in\-session Claude sweep produced higher single\-cell values \(e\.g\., windowed/reflect F1 0\.545\) that did not replicate in the re\-run\. We treat the isolated re\-run as primary and highlight the significant run\-to\-run variance: windowed/reflect moved from 6 fires \(F1 0\.545\) to 13 fires \(F1 0\.222\) between two identical configurations\. No single F1 cell here provides a stable estimate\.
Approximate cost across both conditions: gpt\-5\.4\-mini $1\.19, gpt\-5\.4 $13\.87, claude $0 \(subscription\)\. The original A10 baseline \(gpt\-5\.4\-mini, windowed only\) fired 0/56 on all three types at∼$0\.135\{\\sim\}\\mathdollar 0\.135\.
#### Interpretation\.
Both capability and context are important, but their effects are uneven\. The small model returns 0 in every cell, regardless of context, indicating total deference\. Frontier and cross\-vendor models only escape the zero floor with full\-trajectory context, and even then achieve modest F1 scores at much higher cost \(the flagship model’s improvement required approximately 90 times the cost of the mini baseline\)\. Per\-action LLM judging is therefore both expensive and imprecise as a runtime timing mechanism, functioning as a coarse region detector rather than a precise moment detector\.
## 7Inter\-Rater Reliability: The Subjectivity of the Target
Three annotators, one rubric, the same 56 actions\.
### 7\.1Per\-annotator counts
Table 3:Per\-annotator positive counts\.The annotators differ not only in the number of interventions \(8 / 6 / 15\) but also in dominant intervention type: A is balanced, B favors reflect, and C primarily uses clarify\.
### 7\.2Location agreement \(same action flagged, ignoring type\), pairwise Cohen’sκ\\kappa
Table 4:Pairwise location agreement\.No action was flagged by all three annotators\. Six actions \{0, 2, 18, 33, 42, 44\} were flagged by at least two; every such action is a 2\-of\-3, never a 3\-of\-3\.
### 7\.3Three\-rater Krippendorff’sα\\alpha
Table 5:Three\-rater Krippendorff’sα\\alphaper intervention type and for location \(any\-type\) coding\.An independent recomputation reproduced the three pairwise Cohen’sκ\\kappavalues exactly \(A↔\\leftrightarrowB\+0\.349\+0\.349, A↔\\leftrightarrowC\+0\.092\+0\.092, B↔\\leftrightarrowC−0\.181\-0\.181\), confirming theα\\alphaimplementation\.
### 7\.4Reading
No intervention type achievesα=0\.4\\alpha\{\}=0\.4, which is well below Krippendorff’s tentative reliability threshold of 0\.667\. Reflect \(\+0\.226\+0\.226\) is the best\-performing type and the only one used by all three annotators\. Pause is degenerate, as one annotator never used it\. Clarify performs worse than chance\. The three\-rater locationα\\alphaof\+0\.047\+0\.047is much lower than the best pairwiseκ\\kappabecause the third annotator rarely overlaps with the others, reducing multi\-way agreement\.
Two structural factors contribute to this outcome\. First, annotators have different intervention “dialects”: the same late\-trajectory sequence is labeled as clarify by one, reflect by another, and split between pause and reflect by a third\. Thus, even when they flag the same action, they disagree on the intervention type\. Second, base rates differ by a factor of 2\.5 \(6 to 15 flags\), whichκ\\kappapenalizes heavily\. The evidence indicates there is no single, stable ground truth for intervention timing\. The primary annotator’s labels reflect only one perspective and do not constitute a definitive standard\. Detector scores are not re\-baselined against the≥2\\geq 2consensus set, as this set was identified after reviewing detector outputs, and such post hoc scoring would risk fitting the target to the predictions\.
## 8Why the Detectors Fail: A Joint Reading
Combining the three findings shows that detector failure results partly from the detectors themselves and partly from the nature of the target\.
#### Where vs\. what\.
The detectors and humans agree on the general region—the late rotation\-matrix sequence, roughly actions 29–50—and occasionally on specific moments \(such as action 42\), but not on intervention type, which is where human annotators disagree\. Only two F1 values persist across the LLM\-judge sweep—Claude reflect at action 42 and Claude clarify at action 44—both corresponding to actions where humans also converged\. These results are based on only one or two true positives among eight sparse labels and show significant run\-to\-run variance, indicating convergence on a few late\-trajectory moments rather than a consistent positive outcome\.
#### Pause\.
The inability of any detector to recover pause timing is partly due to the absence of a clear consensus pause signal\. Pause is at or below chance across all annotator pairs, and one annotator never used it\.
#### Clarify\.
Clarify is one annotator’s most\-used intervention \(13 of 15 flags\) but is the least agreed\-upon type \(below chance for every pair,α=−0\.106\\alpha\{\}=\-0\.106\)\. Any calibration relying on clarify labels is therefore built on the least stable foundation\.
#### Saturation\.
The A6 triggers detect a state rather than a specific moment\. Once frustration saturates \(by action 12–21 across the five trajectories\), “fire when frustration≥0\.7\\geq 0\.7” remains true for most subsequent actions\. This failure is independent of label subjectivity; it is inherent to the trigger and the agent’s lack of recovery, and it reproduces across all five trajectories\.
Therefore, the statement that “all architectures underperform against the labels” partly reflects the nature of the labels themselves: automated methods are being asked to match a target that trained humans cannot consistently reproduce\. This is a stronger and more defensible claim than simply stating that the detectors are not yet good enough\.
## 9Limitations
- •Single calibration trajectory\.All F1 and firing\-rate\-versus\-label metrics are based on a single 56\-action trajectory\. The saturation result spans five trajectories but uses no labels\. Exact label\-based metrics are noisy point estimates and should be interpreted directionally only\.
- •Sparse labels, three annotators\.There are eight, six, and fifteen positive labels over 56 actions\. With three annotators on one trajectory, the results demonstrate subjectivity but cannot quantify its extent in a broader population\.
- •One discarded annotation pass\.One annotator’s initial pass was degenerate and was re\-collected after rubric clarification\. Even after clarification, clarify\-type agreement remained below chance\.
- •LLM\-judge variance\.Single\-cell F1 values show substantial run\-to\-run variance and are based on only one or two true positives\. These results should be interpreted directionally\.
- •Pairwiseκ\\kappainstability\.At these base rates, pairwise Cohen’sκ\\kappais unstable\. We report Krippendorff’sα\\alphafor the multi\-rater case, but a larger annotator pool is needed for stable estimates\.
- •Single model family per vendor; two context conditions\.The capability/context floor is mapped at three model points and two window sizes, not exhaustively\.
## 10Future Work
- •Expand to multiple trajectories and a larger, more diverse annotator pool; report Fleiss’κ\\kappa/ Krippendorff’sα\\alphaon the expanded set\.
- •Separate “where to intervene” from “what intervention” as two distinct prediction problems with their own reliabilities, since they behave differently\.
- •Replace threshold\-on\-state triggers with transition\-aware triggers operating on the velocity and acceleration of affect accumulation, and add hysteresis/cool\-down to prevent post\-saturation spamming\.
- •Investigate whether a macro\-trajectory summary prompt, rather than a windowed context, systematically lifts LLM\-judge timing for frontier models, and at what cost\.
- •Treat intervention timing as an inherently distributional target \(a soft label aggregated over many annotators\) rather than a single gold standard\.
## 11Conclusion
This study began with the goal of identifying which trigger architecture best aligns with human intervention judgment on autonomous agent traces, but ultimately clarified the boundaries of the problem itself\. Threshold\-on\-state triggers consistently fall into a State Saturation Trap, as shown on five trajectories, because agents accumulate modeled negative affect without recovery, turning moment detectors into persistent indicators\. LLM\-as\-judge timing is limited by both capability and context: small models consistently defer, while frontier models achieve only coarse region\-level alignment with full context and at significant cost\. Additionally, the supervised target is a low\-reliability construct; three trained annotators using the same rubric agree on intervention points only slightly above chance and show no agreement on intervention type\. The main contribution of this work is the comprehensive characterization of this landscape—covering human inter\-rater reliability, four detector architectures, a cross\-model LLM\-judge sweep, and a reproduced saturation effect—rather than the accuracy of any single detector\. Developing reliable runtime intervention layers will require treating intervention timing as a subjective, distributional construct and moving from absolute state thresholds to transition\-aware, recovery\-sensitive control mechanisms\.
## Reproducibility
All three annotators’ label files, the inter\-rater computation script \(pairwise Cohen’sκ\\kappaand three\-rater Krippendorff’sα\\alpha, implemented from scratch with an independent cross\-check\), the saturation replay outputs for all five trajectories, and the cross\-model sweep outputs are released with this paper\. Trigger thresholds and engine constants remained fixed throughout; no post\-hoc tuning was applied\.
## References
- \[1\]M\. Andriushchenko, A\. Souly, M\. Dziemian, D\. Duenas, et al\.AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents\.arXiv preprint, 2025\.
- \[2\]H\. Wang, C\. M\. Poskitt, and J\. Sun\.AgentSpec: Customizable Runtime Enforcement for Safe and Reliable LLM Agents\.arXiv:2503\.18666, 2025\.
- \[3\]L\. Aroyo and C\. Welty\.Truth Is a Lie: Crowd Truth and the Seven Myths of Human Annotation\.*AI Magazine*, 36\(1\):15–24, 2015\.
- \[4\]C\. Busso, M\. Bulut, C\.\-C\. Lee, A\. Kazemzadeh, E\. Mower, S\. Kim, J\. N\. Chang, S\. Lee, and S\. S\. Narayanan\.IEMOCAP: Interactive Emotional Dyadic Motion Capture Database\.*Language Resources and Evaluation*, 42\(4\):335–359, 2008\.
- \[5\]J\. Cohen\.A Coefficient of Agreement for Nominal Scales\.*Educational and Psychological Measurement*, 20\(1\):37–46, 1960\.
- \[6\]P\. Ekman\.An Argument for Basic Emotions\.*Cognition and Emotion*, 6\(3–4\):169–200, 1992\.
- \[7\]A\. R\. Feinstein and D\. V\. Cicchetti\.High Agreement but Low Kappa: I\. The Problems of Two Paradoxes\.*Journal of Clinical Epidemiology*, 43\(6\):543–549, 1990\.
- \[8\]K\. L\. Gwet\.Computing Inter\-Rater Reliability and Its Variance in the Presence of High Agreement\.*British Journal of Mathematical and Statistical Psychology*, 61\(1\):29–48, 2008\.
- \[9\]T\. Hosking, P\. Blunsom, and M\. Bartolo\.Human Feedback Is Not Gold Standard\.In*International Conference on Learning Representations \(ICLR\)*, 2024\.
- \[10\]Q\. Wang, Z\. Lou, Z\. Tang, N\. Chen, X\. Zhao, and W\. Zhang\.Assessing Judging Bias in Large Reasoning Models: An Empirical Study\.arXiv:2504\.09946, 2025\.
- \[11\]K\. Krippendorff\.*Content Analysis: An Introduction to Its Methodology*\.Sage Publications, 2nd edition, 2004\.
- \[12\]J\. R\. Landis and G\. G\. Koch\.The Measurement of Observer Agreement for Categorical Data\.*Biometrics*, 33\(1\):159–174, 1977\.
- \[13\]A\. Mehrabian\.Pleasure–Arousal–Dominance: A General Framework for Describing and Measuring Individual Differences in Temperament\.*Current Psychology*, 14\(4\):261–292, 1996\.
- \[14\]A\. Panickssery, S\. R\. Bowman, and S\. Feng\.LLM Evaluators Recognize and Favor Their Own Generations\.In*Advances in Neural Information Processing Systems \(NeurIPS\)*, 2024\.
- \[15\]E\. Pavlick and T\. Kwiatkowski\.Inherent Disagreements in Human Textual Inferences\.*Transactions of the Association for Computational Linguistics*, 7:677–694, 2019\.
- \[16\]R\. W\. Picard\.*Affective Computing*\.MIT Press, 1997\.
- \[17\]B\. Plank\.The “Problem” of Human Label Variation: On Ground Truth in Data, Modeling and Evaluation\.In*Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing \(EMNLP\)*, 2022\.
- \[18\]S\. Poria, E\. Cambria, R\. Bajpai, and A\. Hussain\.A Review of Affective Computing: From Unimodal Analysis to Multimodal Fusion\.*Information Fusion*, 37:98–125, 2017\.
- \[19\]J\. Posner, J\. A\. Russell, and B\. S\. Peterson\.The Circumplex Model of Affect: An Integrative Approach to Affective Neuroscience, Cognitive Development, and Psychopathology\.*Development and Psychopathology*, 17\(3\):715–734, 2005\.
- \[20\]H\. Wang, C\. M\. Poskitt, J\. Sun, and J\. Wei\.ProbGuard: Probabilistic Runtime Monitoring for LLM Agent Safety\.arXiv:2508\.00500, 2025\.
- \[21\]J\. A\. Russell\.A Circumplex Model of Affect\.*Journal of Personality and Social Psychology*, 39\(6\):1161–1178, 1980\.
- \[22\]P\. Wang, L\. Li, L\. Chen, Z\. Cai, D\. Zhu, B\. Lin, Y\. Cao, Q\. Liu, T\. Liu, and Z\. Sui\.Large Language Models Are Not Fair Evaluators\.In*Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(ACL\)*, 2024\.
- \[23\]A\. Zapf, S\. Castell, L\. Morawietz, and A\. Karch\.Measuring Inter\-Rater Reliability for Nominal Data — Which Coefficients and Confidence Intervals Are Appropriate?*BMC Medical Research Methodology*, 16:93, 2016\.
- \[24\]L\. Zheng, W\.\-L\. Chiang, Y\. Sheng, S\. Zhuang, Z\. Wu, Y\. Zhuang, Z\. Lin, Z\. Li, D\. Li, E\. P\. Xing, H\. Zhang, J\. E\. Gonzalez, and I\. Stoica\.Judging LLM\-as\-a\-Judge with MT\-Bench and Chatbot Arena\.In*Advances in Neural Information Processing Systems \(NeurIPS\)*, 2023\.Similar Articles
Beyond Autonomy: The Power of an Agent That Knows Its Limits
The COWCORPUS project, a study of 4,200 human-AI interactions, found that agents predicting their own failures and intervention moments are more useful than those simply trying to avoid errors. Researchers identified four stable trust patterns in human-AI collaboration and developed the Perfect Timing Score (PTS) to measure intervention prediction accuracy.
Formal Methods Meet LLMs: Auditing, Monitoring, and Intervention for Compliance of Advanced AI Systems
This paper proposes techniques that combine formal methods (Linear Temporal Logic) with LLMs for auditing, monitoring, and intervening in AI systems to ensure compliance with behavioral constraints, showing that even small-model labelers can match frontier LLM judges in detecting violations.
Are most LLM eval tools still too prompt-focused?
The author questions whether current LLM evaluation tools are too focused on isolated prompts rather than full workflows and agent interactions, noting that step-by-step accuracy can mask overall behavioral drift in production.
What Benchmarks Don't Measure: The Case for Evaluating Abstention Competence in Autonomous Agents
This paper argues that current benchmarks for autonomous agents fail to evaluate whether an agent should have proceeded at all, introducing a 'compliance bias'. The authors propose a taxonomy of abstention-warranted scenarios and new evaluation protocols (Safety Rate, Usability Rate, Informed Refusal Rate) with preliminary results showing tunable safety–usability tradeoffs across model families.
Quoting Andreas Påhlsson-Notini
Andreas Påhlsson-Notini critiques current AI agents for exhibiting frustratingly human traits like lack of focus and constraint negotiation.