When Retrieval Metrics Mislead: Measuring Policy Signal in Long-Horizon Tool-Use Agents


Summary

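This paper tests whether exact-match retrieval recall is a reliable proxy for downstream policy classification performance in long-horizon tool-use agents. In experiments with Qwen2.5-3B/7B classifiers on τ-bench, the exact governing clause is retrieved at rank 1 for only 7% of airline states, yet the primary 3B classifier shows no detectable macro-F1 difference between retrieved and gold clauses (0.58 vs. 0.60), although the confidence interval is too wide to establish non-inferiority. Retrieval recall alone can therefore underestimate policy signal in this benchmark setting.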

arXiv:2606.23937v1 Announce Type: new Abstract: Exact-match retrieval recall is often used as a proxy for whether a retriever supplies useful policy context to a downstream decision model. We test this proxy for pre-action policy classification in tau-bench using Qwen2.5-3B/7B classifiers. Under gold-policy conditioning, a compact structured state improves macro-F1 over raw trajectories by 0.13-0.17 after tuning. We then replace the benchmark-designated policy clause with the top-ranked clause retrieved from decision-time context. Although the exact governing clause is retrieved at rank 1 for only 7% of airline states, the primary 3B classifier obtains macro-F1 0.58 with retrieved clauses versus 0.60 with gold clauses (Delta=-0.02, task-cluster 95% CI [-0.23,+0.21]); mismatched-policy and no-policy controls score 0.32 and 0.21. We do not detect a macro-F1 difference between retrieved and gold clauses in this configuration, although the interval remains too wide to establish non-inferiority. The same qualitative pattern appears with a second retriever and at 7B, while varying across fine-tuning configurations. These results indicate that exact-match clause recall can underestimate downstream policy utility in this benchmark setting, motivating evaluation with retrieved policies in the classification loop rather than recall alone.

# Measuring Policy Signal in Long-Horizon Tool-Use Agents
Source: [https://arxiv.org/html/2606.23937](https://arxiv.org/html/2606.23937)
## When Retrieval Metrics Mislead: Measuring Policy Signal in Long\-Horizon Tool\-Use Agents

Tianyu Ding \(Amazon Web Services, tianyd@amazon\.com\) and Juan Pablo De la Cruz Weinstein \(Amazon Web Services, jcruam@amazon\.com\)

###### Abstract

Exact\-match retrieval recall is often used as a proxy for whether a retriever supplies useful policy context to a downstream decision model\. We test this proxy for pre\-action policy classification in τ\-bench using Qwen2\.5\-3B/7B classifiers\. Under gold\-policy conditioning, a compact structured state improves macro\-F1 over raw trajectories by 0\.13–0\.17 after tuning\. We then replace the benchmark\-designated policy clause with the top\-ranked clause retrieved from decision\-time context\. Although the exact governing clause is retrieved at rank 1 for only 7% of airline states, the primary 3B classifier obtains macro\-F1 0\.58 with retrieved clauses versus 0\.60 with gold clauses \(Δ = −0\.02, task\-cluster 95% CI \[−0\.23, \+0\.21\]\); mismatched\-policy and no\-policy controls score 0\.32 and 0\.21\. We do not detect a macro\-F1 difference between retrieved and gold clauses in this configuration, although the interval remains too wide to establish non\-inferiority\. The same qualitative pattern appears with a second retriever and at 7B, while varying across fine\-tuning configurations\. These results indicate that exact\-match clause recall can underestimate downstream policy utility in this benchmark setting, motivating evaluation with retrieved policies in the classification loop rather than recall alone\.

## 1 Introduction

Policy\-constrained tool\-use systems may retrieve a policy clause before deciding whether a proposed action is permissible, requires more evidence, or should be blocked\. Retrieval is commonly evaluated by whether the benchmark\-designated clause appears among the top\-ranked results\. When exact\-match recall is used as a proxy for downstream utility, it effectively treats nonmatching retrieved clauses as uninformative\. We test that assumption in τ\-bench\.

We train Qwen2\.5\-3B/7B classifiers on pre\-action states from airline trajectories\. The study separates two interventions: how decision context is represented when the benchmark\-designated policy is supplied, and how classifier performance changes when the supplied policy is retrieved rather than gold\. This design distinguishes representation effects from the validity of exact\-match recall as a measure of policy\-retrieval utility:

> *\(Q1\) Under gold\-policy conditioning, how does decision\-state representation affect policy classification?*
>
> *\(Q2\) When policy clauses are retrieved, does exact\-match recall predict downstream classifier performance?*

For Q1 we hold the classifier, training data, and recipe fixed and vary *only* how the decision state is represented\. We find that a compact structured decision state, an explicit elicitation of \{request intent, evidence read so far, applicable policy assertion, pending action class\}, substantially improves macro\-F1 over raw trajectory text and avoids majority\-class prediction\. This is a gold\-policy representation result: the state constructor is given the correct applicable policy\. For Q2 we replace that gold policy with a retrieved policy clause at test time\. Exact\-match retrieval recall is low: off\-the\-shelf retrievers recover the exact clause at rank 1 only 7% of the time \(recall@1 = 0\.07\), whereas an offline gold\-injection diagnostic predicts that the structured advantage needs high exact\-clause access\. That diagnostic, however, is not the pipeline evaluation\. When we run the direct retrieved\-policy intervention, the primary classifier shows no detectable macro\-F1 difference vs\. the benchmark’s gold clause and beats both mismatched\-policy and no\-policy controls by wide margins\. For this reported configuration, exact\-match recall is too pessimistic: nonmatching retrieved clauses can contain decision\-relevant policy information, so exact\-match recall@$k$ underestimates downstream policy utility\. The effect is sensitive to fine\-tuning configuration \(the gold\-policy gap re\-opens for under\-trained and maximally\-discriminating classifiers\), so the supported claim is local to the reported configuration rather than universal\.

#### Contribution and scope\.

We make three contributions\. First, we isolate the effect of decision\-state representation under gold\-policy conditioning and show that structured states improve Qwen2\.5\-3B/7B macro\-F1 over raw trajectories\. Second, we characterize exact\-match policy retrieval across two τ\-bench domains, multiple query constructions, and four retrievers, and use a gold\-injection probe to quantify the prediction implied by treating nonmatching clauses as uninformative\. Third, we test that prediction directly by replacing gold policies with retrieved, mismatched, and absent policy inputs\. The direct intervention shows that low exact\-match recall can coexist with a small observed gold–retrieved performance gap in the primary configuration, while the effect varies with fine\-tuning\. Together, these analyses evaluate the criterion validity of exact\-match recall for policy\-conditioned decision models\. All claims concern benchmark\-authored τ\-bench policy assertions and offline classification\.

## 2 Setup

#### Task and data\.

We use pre\-action decision states extracted from τ\-bench\-airline \(Yao et al\., [2024](https://arxiv.org/html/2606.23937#bib.bib3)\) trajectories\. Each state has a gold decision $a^{\star} \in \{\texttt{allow}, \texttt{verify}, \texttt{refuse}\}$ derived from the task’s policy assertions and the evidence visible at the checkpoint\. We use a fixed split grouped by task\_id \(train 195 / test 85; 15 disjoint test tasks; zero task overlap\), so no test task’s policy strings appear in training\.
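As an illustration of the grouped split described above, the following is a minimal sketch that assigns whole tasks to one side and then checks for leakage. The dictionary schema \(a `task_id` key per state\) and the function names are assumptions for illustration, not the paper's code.

```python
def split_by_task(states, test_task_ids):
    """Split decision states so that every task lands entirely in train or test.

    `states` is a list of dicts with a 'task_id' key (hypothetical schema).
    """
    test_ids = set(test_task_ids)
    train = [s for s in states if s["task_id"] not in test_ids]
    test = [s for s in states if s["task_id"] in test_ids]
    return train, test


def task_overlap(train, test):
    """Return the task IDs present on both sides; empty means no task overlap."""
    return {s["task_id"] for s in train} & {s["task_id"] for s in test}


# Toy data: three tasks, several checkpoints per task.
toy_states = [
    {"task_id": "t1", "step": 0},
    {"task_id": "t1", "step": 1},
    {"task_id": "t2", "step": 0},
    {"task_id": "t3", "step": 0},
    {"task_id": "t3", "step": 1},
]
toy_train, toy_test = split_by_task(toy_states, test_task_ids=["t3"])
```

Grouping by task rather than by state matters here because several checkpoints from one trajectory share the same policy assertions.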

#### Construct scope\.

Throughout the paper, “policy” and “rule” mean the benchmark\-authored natural\-language assertions attached to each τ\-bench task, not naturalistic multi\-page operating policies\. The identification task is therefore a benchmark\-proxy task: recover the governing assertion from decision\-time context over the domain’s assertion pool\. This is the right construct for auditing whether the gold\-policy input used by the classifier is obtainable inside the benchmark, but it is not evidence that real policy\-document retrieval would be equally easy or hard\.

#### Representations \(the independent variable\)\.

Holding everything else fixed, we vary the classifier input: raw \(full conversation \+ raw tool outputs, ∼2118 tokens\); masked \(tool outputs replaced by typed placeholders, ∼717 tokens\); structured \(explicit \{request, evidence, policy, action\-class\} fact extraction, ∼185 tokens\); and raw\+policy \(raw with the elicited policy assertion appended\)\. The structured state is produced by a generator that reads the trajectory and the task policy spec\. Eliciting the applicable policy assertion is an explicit part of the method’s cost: the policy input is supplied by the state\-construction procedure rather than inferred by the classifier itself\.

#### How the structured state is built \(and what it does*not*use\)\.

The generator sees only information available at the checkpoint: the dialogue so far, parsed tool outputs read up to that point, the pending action’s type \(read vs\. write/transfer\), and the task’s natural\-language policy assertions\. *request* is the user’s stated goal; *evidence* lists parsed facts from tool results \(e\.g\. cabin: basic\_economy; insurance: no\); *policy* is the applicable assertion lifted verbatim from the task spec \(e\.g\. “Agent should not offer compensation unless the user asks”\); *action class* is read vs\. write\. Crucially, the generator does not see the gold decision $a^{\star}$, and the policy assertion is a *rule*, not the answer: it keyword\-matches the gold action only 14–33% of the time \(Section [B\.2](https://arxiv.org/html/2606.23937#A2.SS2)\)\. This is *decision\-relevant* state, not *label\-derived* state\.
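The four\-field state described above can be pictured with a small data structure. This is a hedged sketch only: the class name, field names, rendered line format, and the example request text are assumptions for illustration, and the paper's actual serialization may differ. The evidence and policy strings are the examples quoted above.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class StructuredState:
    """Hypothetical container for the {request, evidence, policy, action-class} state."""
    request: str
    evidence: List[str]
    policy: Optional[str]  # None models the no-policy control
    action_class: str      # "read" or "write"

    def render(self) -> str:
        """Serialize the state as one line per field (assumed format)."""
        lines = [f"request: {self.request}", "evidence: " + "; ".join(self.evidence)]
        if self.policy is not None:
            lines.append(f"policy: {self.policy}")
        lines.append(f"action_class: {self.action_class}")
        return "\n".join(lines)


def with_policy(state: StructuredState, policy: Optional[str]) -> StructuredState:
    """Return a copy in which only the policy line changes (gold, retrieved, mismatched, or None)."""
    return replace(state, policy=policy)


example = StructuredState(
    request="User asks about compensation for a delayed flight",  # hypothetical request
    evidence=["cabin: basic_economy", "insurance: no"],
    policy="Agent should not offer compensation unless the user asks",
    action_class="write",
)
no_policy = with_policy(example, None)
```

Keeping the policy as a single swappable field mirrors the later intervention, in which only the policy line changes at test time.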

#### Decision models and evaluation\.

The primary model is a supervised fine\-tuned Qwen2\.5\-3B/7B classifier \(Qwen Team, [2024](https://arxiv.org/html/2606.23937#bib.bib18)\) with LoRA \(Hu et al\., [2022](https://arxiv.org/html/2606.23937#bib.bib17)\) \($r = 16$, 5 epochs, lr $5 \times 10^{-5}$, bf16\), which predicts whether the proposed action may proceed, requires additional verification, or should be blocked\. This is the same fine\-tuning recipe under which the documented majority\-class failure occurred\. For computationally inexpensive diagnostic analyses, we also fit a balanced logistic\-regression classifier over frozen MiniLM embeddings \(Reimers and Gurevych, [2019](https://arxiv.org/html/2606.23937#bib.bib11)\)\. The metric is 3\-way macro\-F1 \(the frozen\-encoder structured \> raw gap given gold policy holds for MiniLM and bge\-large but not e5\-large; the Qwen classifier is our primary representation evidence, see Appendix [C](https://arxiv.org/html/2606.23937#A3)\)\. We operationalize downstream policy utility as macro\-F1 when a retrieved policy clause replaces the benchmark\-designated policy clause in the classifier input\. Confidence intervals are nonparametric bootstraps\. Representation\-table intervals resample the per\-example test set\. For the direct retrieved\-policy intervention, point estimates average per\-seed macro\-F1 values; paired\-delta CIs are task\-cluster bootstraps over the 15 test task IDs \(5000 resamples\), recomputing and averaging seed\-level deltas per replicate\. SFT results pool the 3 seeds’ test predictions in the representation tables\. We call a classifier *collapsed* when it assigns one class to $\geq 90\%$ of test items \(the documented ce\_smoke failure is refuse ≈ 100%\)\.
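The evaluation quantities above \(3\-way macro\-F1, the collapse criterion, and the task\-cluster paired bootstrap\) can be sketched with the standard library alone. This is an illustrative sketch, not the authors' code: the function names are hypothetical, the percentile interval is an assumed choice of bootstrap CI, and scoring a class with no gold or predicted items as F1 = 0 is also an assumption.

```python
import random
from collections import Counter, defaultdict

LABELS = ("allow", "verify", "refuse")


def macro_f1(gold, pred, labels=LABELS):
    """Unweighted mean of per-class F1; a class with no support and no predictions scores 0."""
    scores = []
    for c in labels:
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def is_collapsed(pred, threshold=0.9):
    """True when a single class receives at least `threshold` of all predictions."""
    return Counter(pred).most_common(1)[0][1] / len(pred) >= threshold


def task_cluster_delta_ci(task_ids, gold, preds_a, preds_b, n_boot=5000, seed=0):
    """Percentile 95% CI for mean-over-seeds macro-F1(a) - macro-F1(b).

    Whole tasks are resampled with replacement; `preds_a` and `preds_b`
    are lists of per-seed prediction lists aligned with `gold`.
    """
    by_task = defaultdict(list)
    for i, t in enumerate(task_ids):
        by_task[t].append(i)
    tasks = sorted(by_task)
    rng = random.Random(seed)
    deltas = []
    for _ in range(n_boot):
        idx = [i for t in rng.choices(tasks, k=len(tasks)) for i in by_task[t]]
        g = [gold[i] for i in idx]
        per_seed = [
            macro_f1(g, [pa[i] for i in idx]) - macro_f1(g, [pb[i] for i in idx])
            for pa, pb in zip(preds_a, preds_b)
        ]
        deltas.append(sum(per_seed) / len(per_seed))
    deltas.sort()
    return deltas[int(0.025 * n_boot)], deltas[min(n_boot - 1, int(0.975 * n_boot))]
```

Resampling task IDs rather than individual states keeps correlated checkpoints from the same task together, which is why such intervals tend to be wider than per\-example ones.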

## 3 Policy\-conditioned representation ablation

Holding the classifier, training data, and recipe fixed, we vary *only* the input representation\. In this section the classifier is given the correct applicable policy clause; we replace this gold\-policy condition in §[4](https://arxiv.org/html/2606.23937#S4)\. The main result is that a compact structured decision state \(∼185 tokens\) that surfaces the applicable policy reaches macro\-F1 0\.601 at 3B vs\. 0\.293 for raw trajectory text \(∼2118 tokens\), a paired gap of \+0\.308 \(CI \[\+0\.237, \+0\.380\]\), which is not an end\-to\-end gain but a representation effect *given* the rule\. Under proper per\-cell tuning \(held\-out validation split\), the structured advantage *survives at both scales*: \+0\.173 at 3B, \+0\.133 at 7B \(Table [7](https://arxiv.org/html/2606.23937#A2.T7) in Appendix [B](https://arxiv.org/html/2606.23937#A2)\)\.

The effect is not brevity \(length/info\-matched controls\), not a leaked label \(a mismatched policy sharply reduces performance\), and not output\-format compliance \(it survives parsed\-only rescoring\)\. Among the tested representation interventions, explicit policy access produces the largest observed improvement: appending the applicable policy to raw text helps nearly as much as the full structured state, and a mismatched policy is worse than no policy\. Cross\-actor transfer \(Sonnet → Nova\) is positive but not significant at $n = 50$ \(Appendix [B](https://arxiv.org/html/2606.23937#A2)\)\.

Critical scope: these results are gold\-policy conditioned, meaning the structured state contains the correct benchmark policy clause\. Section [4](https://arxiv.org/html/2606.23937#S4) asks whether the classifier still works when that clause is *retrieved* rather than supplied by the benchmark\.

## 4 Evaluating exact\-match recall as a proxy for downstream policy utility

Every result so far conditions on the classifier receiving the applicable benchmark policy assertion\. The downstream measurement question is whether exact\-match retrieval recall predicts classifier performance when that assertion is *retrieved* from decision\-time context\. The concern is that retrieval may be too weak: off\-the\-shelf retrievers recover the exact governing clause only rarely \(recall@1 ≈ 0\.07\)\. We first run an exact\-recall diagnostic, then *feed retrieved clauses directly into the classifier* rather than inferring downstream utility from recall alone\. We then map broader exact\-match recall patterns across domains, protocols, and retrievers\. The reported Qwen classifier is more robust to imperfect retrieval than exact\-recall diagnostics alone would suggest\.

#### Gold\-policy injection diagnostic\.

We hold the diagnostic probe, data, and recipe fixed and degrade only the policy input: for a fraction $p$ of states the classifier receives the gold clause; otherwise it receives the MiniLM top\-1 retrieved clause from the same retrieval pool\. This gold\-injection sweep makes $p$ an *effective exact\-gold access rate*\. We run this on the frozen\-encoder diagnostic probe \(whose gold\-policy structured − raw gap is \+0\.115\)\. On a coarse grid the gap first becomes robustly positive near $p \approx 0.75$ \(\+0\.104, 95% CI \[\+0\.004, \+0\.205\]\); on a finer grid \(steps of 0\.1, three injection seeds\) the gap’s CI excludes zero only at $p = 1.0$ \(with $p = 0.8$ marginal, CI lower bound −0\.001\), i\.e\. the finer sweep places the threshold *higher*, not lower\. This pattern is robust to the classifier head: a more regularised logistic head and a small MLP never clear zero even at full injection \(Appendix [C](https://arxiv.org/html/2606.23937#A3)\)\. Either way, this diagnostic probe says the structured\-vs\.\-raw advantage becomes robust only when exact gold\-clause access is high, at least ∼0\.75 \(Figure [1](https://arxiv.org/html/2606.23937#S4.F1)\)\. We treat this as an *order\-of\-magnitude* diagnostic, not a sharp pipeline threshold: it is estimated on a frozen\-encoder probe, the finer grid puts it if anything *higher*, and the Qwen classifier’s larger gold\-policy gap \(\+0\.13–0\.17 tuned\) could behave differently\. The direct intervention below asks whether this exact\-recall diagnostic predicts the primary Qwen configuration\.
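The injection step of this sweep can be sketched as follows. The function name and the choice to inject gold into exactly `round(p * n)` randomly chosen states \(rather than flipping an independent coin per state\) are assumptions for illustration; the source states only that a fraction $p$ of states receives the gold clause.

```python
import random


def inject_gold(gold_clauses, retrieved_clauses, p, seed=0):
    """Return a policy input per state: gold for a fraction p, retrieved top-1 otherwise.

    Exactly round(p * n) states receive the gold clause (an assumed reading of
    "a fraction p of states"); `seed` plays the role of an injection seed.
    """
    n = len(gold_clauses)
    if len(retrieved_clauses) != n:
        raise ValueError("gold and retrieved clause lists must align per state")
    rng = random.Random(seed)
    gold_idx = set(rng.sample(range(n), round(p * n)))
    return [gold_clauses[i] if i in gold_idx else retrieved_clauses[i] for i in range(n)]


# Toy sweep over the finer grid (steps of 0.1) with placeholder clause strings.
toy_gold = [f"gold-{i}" for i in range(10)]
toy_retrieved = [f"retrieved-{i}" for i in range(10)]
sweep = {k / 10: inject_gold(toy_gold, toy_retrieved, k / 10, seed=1) for k in range(11)}
```

Each mixed list would then be passed to the fixed probe, and the structured − raw gap recomputed at every value of $p$.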

#### The proxy assumption\.

The diagnostic’s $p$ is an exact\-gold access rate inside a frozen\-encoder probe; the retrieval number is exact\-match recall@$k$; the direct intervention is the Qwen classifier with retrieved clauses in the input\. Treating the diagnostic as a deployment prediction assumes that exact gold\-clause recovery is a close proxy for downstream policy utility\. The direct intervention tests that assumption and finds it too pessimistic for the reported classifier: a non\-gold but decision\-context\-aligned retrieved clause can be useful, so exact\-match recall is a poor stand\-alone proxy for downstream policy utility in this configuration\. The diagnostic is pessimistic not because its non\-gold inputs are random \(they are MiniLM top\-1 retrieved clauses\) but because the probe treats those non\-gold retrieved clauses as low\-utility, while the direct classifier intervention measures their downstream effect\.

We establish the diagnostic crossing point with a *controlled* gold\-injection sweep \(varying identification while holding everything else fixed\) rather than an observational “does retrieval success predict classifier accuracy” correlation; the latter is tiny \(11/85\) and confounded with per\-example difficulty, and we report it only in Appendix [C](https://arxiv.org/html/2606.23937#A3)\.

#### Retrieved\-policy classifier intervention\.

The diagnostic crossing point and the recall measurement are both *indirect*: they infer a downstream outcome from how often retrieval returns the *exact* gold clause\. Table [1](https://arxiv.org/html/2606.23937#S4.T1) reports the central direct test on the retrieved\-policy Qwen classifier: train on the benchmark’s gold structured states, then at test time replace the policy line with the *top\-1 retrieved* clause \(recall@1 0\.07\), the gold clause, a mismatched policy clause, or no policy line at all\. On the primary configuration \(Qwen2\.5\-3B, the same configuration that reaches structured macro\-F1 0\.60 given the gold clause\), the retrieved clause yields macro\-F1 0\.580 versus 0\.604 for the benchmark’s gold clause, a gap of −0\.024 \(95% CI \[−0\.233, \+0\.207\]\): *no detectable degradation in this experiment*\. We prespecify a non\-inferiority margin $\delta = 0.05$ macro\-F1, ∼10% of the gold − raw gap; at $n = 85$ the gap’s CI is too wide to establish non\-inferiority, so we report “no detectable loss,” not equivalence\. The retrieved clause sits far above both a mismatched policy clause \(0\.315; top\-1 retrieved − mismatched \+0\.265, CI \[\+0\.056, \+0\.440\]\) and a classifier given *no* policy line \(0\.206, which falls into a majority\-verify regime\); top\-1 over no\-policy is \+0\.374 \(CI \[\+0\.121, \+0\.569\], excludes zero\)\. So the retrieved clause supplies measurably useful policy information in this classifier evaluation despite recovering the exact gold string only 7% of the time\.
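The distinction between “no detectable loss” and established non\-inferiority can be made concrete with a small check on the reported intervals. The decision rule below \(non\-inferiority holds only if the CI lower bound lies above −δ\) is the conventional one and is an assumption here, as is the function name; the source does not spell out its exact rule.

```python
def interpret_delta_ci(ci_low, ci_high, margin):
    """Interpret a CI for delta = macro-F1(condition A) - macro-F1(condition B).

    difference_detected: the interval excludes zero.
    non_inferior: the whole interval lies above -margin (assumed conventional rule).
    """
    return {
        "difference_detected": ci_low > 0 or ci_high < 0,
        "non_inferior": ci_low > -margin,
    }


# Retrieved vs. gold, using the interval and margin reported in the text.
retrieved_vs_gold = interpret_delta_ci(-0.233, 0.207, margin=0.05)
# Retrieved vs. mismatched, using the reported interval.
retrieved_vs_mismatched = interpret_delta_ci(0.056, 0.440, margin=0.05)
```

Under this rule the retrieved\-versus\-gold interval neither detects a difference nor establishes non\-inferiority, which matches the paper's wording, while the retrieved\-versus\-mismatched interval excludes zero.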

Table 1: Effect of policy input on the primary Qwen2\.5\-3B classifier\. The classifier is trained on gold structured states; at test time only the policy line changes\. Macro\-F1 is the mean of per\-seed macro\-F1 values over three seeds on the same $n = 85$ airline states\. Delta intervals are task\-cluster paired bootstraps over the mean seed\-level delta\. The top\-1 retrieved clause shows no detectable loss vs\. the benchmark gold clause, although the CI is compatible with a meaningful degradation\. Mismatched\-policy and no\-policy controls show that the retrieved clause contributes decision\-relevant policy information\.

#### Analysis of informative nonmatching clauses\.

The mechanism evidence is consistent with decision\-context\-aligned policy information rather than gold recovery\. The top\-1 retrieved clause is no closer to the gold clause in embedding space than a random clause is \(cosine 0\.39 vs\. 0\.40\), but it is far more similar to the request\+evidence context than a random clause \(0\.55 vs\. 0\.26\), and the clause pool is not merely redundant \(mean pairwise cosine 0\.39, 72/122 clusters at cosine ≥ 0\.8\)\. An additional control tempers the mechanism: the retrieved clauses are also somewhat longer than the mismatched ones \(21 vs\. 10 tokens\), so we cannot fully separate context alignment from a length effect; we report this as a partial confound rather than a clean mechanism\. Clause length alone cannot explain the policy\-input effects: the gold and mismatched clauses are essentially equal length \(both ∼10 tokens\) yet differ by \+0\.29 macro\-F1\. However, length may still contribute to the retrieved\-versus\-mismatched contrast, so the context\-alignment analysis should be interpreted as suggestive rather than causal\. The effect is also sensitive to fine\-tuning configuration: the gold\-policy gap stays near zero across the mid\-range classifiers that reach ∼0\.60 given gold \(including a second, different\-family retriever, bge\-large, and at 7B\), but it *re\-opens* for under\-trained classifiers \(gold ≤ 0\.49; top\-1 − gold down to −0\.16\) and for the single most aggressively\-trained classifier we tried \(8 epochs, lr $10^{-4}$; gold 0\.63, top\-1 − gold −0\.15\)\. The result is therefore scoped to the reported configuration: in this regime, low exact\-match recall does not imply low downstream policy utility\. \(The low\-cost frozen\-encoder diagnostic probe shows the same pattern; seed\-level 3B/7B direct\-policy results appear in Appendix Table [11](https://arxiv.org/html/2606.23937#A3.T11)\.\)

#### Exact\-match retrieval performance\.

We then measure how often the applicable clause is exactly retrieved, framed as retrieval over the domain’s full natural\-language policy clause set \(the “haystack”\): for each decision we rank all clauses by similarity to the available context and record whether the gold clause is in the top\-$k$\. We do this for two public domains \(τ\-bench airline, 122 clauses; retail, 51 clauses\), two query\-construction protocols \(the request\+evidence decision state; and pooled pre\-action user context from the trajectory\), and four off\-the\-shelf retrievers: a MiniLM bi\-encoder, two stronger instruction\-tuned bi\-encoders \(bge\-large\-en\-v1\.5, e5\-large\-v2\), and a cross\-encoder reranker \(bge\-reranker\-large\) scoring the *entire* clause pool \(the ceiling of a first\-stage\+rerank pipeline\)\. All evaluated configurations remain below the diagnostic crossing: across the four retrievers, recall@5 is 0\.18–0\.24 on airline and 0\.55–0\.60 on retail \(and 0\.16 on airline under the leaner request\+evidence query protocol\); the strongest retriever \(the cross\-encoder reranker\) does not change the picture \(0\.18 airline, 0\.57 retail\)\. The shortfall persists even at larger $k$: the best recall@10 across retrievers is only 0\.36 \(airline\) and 0\.72 \(retail\), and the full recall@$k$ curves \(Appendix Figure [2](https://arxiv.org/html/2606.23937#A1.F2)\) stay below the diagnostic crossing until $k$ reaches large candidate sets comprising tens of clauses\. The gold clause has a poor median rank of approximately 45 out of 122 on airline\. The identification shortfall is therefore *not* an artifact of a weak bi\-encoder; it is robust to retriever strength\.

We also go beyond off\-the\-shelf retrieval and test three additional retrieval pipelines \(query expansion, a hybrid first stage with cross\-encoder reranking of the top candidates, and a coarse\-to\-fine hierarchical retrieve\), none of which clears the diagnostic crossing either \(best recall@5 0\.38 airline / 0\.58 retail; the hierarchical variant is actually worse, as coarse section routing drops the gold clause; Appendix [C](https://arxiv.org/html/2606.23937#A3)\)\. So the exact\-match shortfall is not merely a weak\-retriever artifact: it survives stronger, structured identification\. Retail is the closest call: at recall@5 ≈ 0\.55–0\.60 it sits closest to the threshold of any configuration, though still below it \(and the higher finer\-grid threshold places retail more clearly below\); we nonetheless flag retail as the near\-miss case in exact\-recall terms rather than a decisive failure\.
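The exact\-match measurement described above reduces to ranking the clause pool per decision and checking where the gold clause lands. The following is a minimal sketch with hypothetical function names and toy similarity scores; breaking ties in the gold clause's favour is an assumption.

```python
from statistics import median


def gold_rank(scores, gold_index):
    """1-based rank of the gold clause when clauses are sorted by descending score.

    Ties are resolved in favour of the gold clause (an assumed convention).
    """
    gold_score = scores[gold_index]
    return 1 + sum(s > gold_score for s in scores)


def recall_at_k(ranks, k):
    """Fraction of decisions whose gold clause appears in the top k."""
    return sum(r <= k for r in ranks) / len(ranks)


# Toy example: three decisions over a pool of three clauses; gold is clause 0 each time.
toy_scores = [
    [0.9, 0.1, 0.5],
    [0.2, 0.8, 0.5],
    [0.3, 0.4, 0.5],
]
toy_ranks = [gold_rank(s, gold_index=0) for s in toy_scores]
toy_median_rank = median(toy_ranks)
```

In the paper's setting the scores would come from a bi\-encoder or cross\-encoder over the 122 airline or 51 retail clauses; here they are placeholders.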

Exact\-match recall also varies substantially by decision class \(Appendix [C](https://arxiv.org/html/2606.23937#A3), Table [10](https://arxiv.org/html/2606.23937#A3.T10)\): recall@5 is 0\.00 for allow, 0\.05 for verify, and 0\.50 for refuse\. Even for refuse, the benchmark\-designated clause is absent from the top five in approximately half of the evaluated states\.

#### Sensitivity to author\-side scenario fields\.

A third query construction, appending the task’s task\_instructions/known\_info scenario fields, reaches higher recall \(up to 0\.58 airline, 0\.78 retail, the latter nominally clearing the diagnostic crossing\)\. We *exclude* it from the primary comparison because those fields are written from the task author’s knowledge of the correct resolution and leak policy\-relevant content into the user\-side query, and because our scoring there credits the best of multiple gold clauses \(an upper bound\)\. It does not reflect what a classifier can see at decision time, so we report it only as a sensitivity analysis; the two main protocols use only decision\-time\-available context\. The two main sweeps differ in construction \(the request\+evidence figure is $n = 85$ airline states, the four\-retriever robustness sweep uses the trajectory\-context protocol at $n = 50$\), but both, and retail, sit below the diagnostic crossing\.

#### Sensitivity to domain and query construction\.

Airline \(≈ 0\.16\) looks much harder than retail \(≈ 0\.55\), which would suggest identification difficulty is *domain\-dependent*; indeed, when the two domains are queried with *different* protocols, their gold\-rank\-percentile confidence intervals separate\. But that separation disappears under matched query construction, suggesting sensitivity to the query protocol rather than evidence for a stable domain difference\. In a controlled 2 × 2 \(domain × query protocol\), applying the *same* trajectory\-context protocol to both domains does not yield a detectable difference \(gold\-rank percentile 0\.10 airline vs\. 0\.09 retail; Mann–Whitney $p = 0.29$, $n = 50/40$\)\. This is a *failure to detect* a difference at this sample size, not proof of equivalence, so the defensible statement is that the domains are *not distinguishable under a matched protocol*, and the large cross\-domain gap one might report appears only when the two domains are queried differently\. The supported statement is one of *consistency* \(not domain\-invariance\): achievable exact\-match recall stays below the offline gold\-injection diagnostic across both domains, both query protocols, and all four retrievers \(Figure [1](https://arxiv.org/html/2606.23937#S4.F1)\), with no stable cross\-domain difference detected under a matched protocol\.

![Refer to caption](https://arxiv.org/html/2606.23937v1/x1.png)

Figure 1: Gold\-injection exact\-match diagnostic\. The curve is the structured − raw classifier macro\-F1 gap as a function of effective *exact\-gold access rate* \(offline gold\-injection sweep: gold clause for a fraction $p$ of states, MiniLM top\-1 retrieved clause otherwise; 95% CI band\); it crosses zero only near recall ≈ 0\.75\. Triangles mark *achievable* exact\-match recall@5 for every configuration \(two domains × two query protocols × four retrievers\), all far left of the crossing, which *predicts* substantially lower downstream performance if nonmatching retrieved clauses are treated as low\-utility\. Table [1](https://arxiv.org/html/2606.23937#S4.T1) tests that prediction directly by replacing the gold policy input with retrieved policy inputs\.

#### Interpretation\.

The representation effect \(Section [3](https://arxiv.org/html/2606.23937#S3), full ablation in Appendix [B](https://arxiv.org/html/2606.23937#A2)\) is conditioned on access to the benchmark\-designated policy clause\. The remaining question is whether retrieved policy inputs preserve this representation advantage, and Steps 1 and 3 confirm that exact\-match recall *is* low\. But the direct test \(Step 2\) shows that the recall shortfall does not translate into a detectable loss on the reported classifier: the top\-1 retrieved clause shows no detectable macro\-F1 loss vs\. the benchmark’s gold clause despite low exact\-match recall, so exact\-match recall@$k$ would mis\-rank the utility of retrieved clauses in this setting if used alone\. Two bounds matter: the effect is sensitive to fine\-tuning configuration \(the gold\-policy gap re\-opens for under\-trained and maximally\-discriminating classifiers, so low exact\-match recall can still translate into detectable losses in those regimes\), and at $n = 85$ we report “no detectable degradation” rather than equivalence or established non\-inferiority\. This reframes the evaluation target: report retrieved\-policy classifier performance alongside exact\-clause recall, and reserve high\-stakes guarantees for systems with verification beyond retrieval alone\. Appendix Table [2](https://arxiv.org/html/2606.23937#A1.T2) maps each main identification number to its result file\.

## 5 Related work

#### Tool\-use agents and intermediate state\.

Our classifier targets the pre\-action checkpoint of the reason\-then\-act loop popularized by ReAct \(Yao et al\., [2023](https://arxiv.org/html/2606.23937#bib.bib15)\), where an agent interleaves reasoning and tool calls\. The idea of decisions riding on an explicit *structured intermediate state* has deep roots in task\-oriented dialogue, where belief/dialogue\-state tracking maintains a structured \(and ideally calibrated\) state distribution that downstream policy decisions consume \(van Niekerk et al\., [2020](https://arxiv.org/html/2606.23937#bib.bib16)\); SSDG’s compliance\-oriented decision state is a close analogue for write\-action gating\. Deciding *whether to act at all* \(to abstain, refuse, or ask\) is itself studied as a first\-class capability: abstention abilities of LLMs \(Madhusudhan and others, [2024](https://arxiv.org/html/2606.23937#bib.bib19)\) and asking clarifying questions before acting \(Wu, [2023](https://arxiv.org/html/2606.23937#bib.bib20)\) both motivate our three\-way allow/verify\-then\-proceed/refuse output rather than a binary block/allow classifier\.

#### Compliance gating\.

Pre\-action compliance gating for tool agents is dominated by two non\-learned paradigms: symbolic/formal runtime enforcement that compiles policies into rules or SMT constraints and intercepts tool calls \(Chen and others, [2025](https://arxiv.org/html/2606.23937#bib.bib13); Winston et al\., [2026](https://arxiv.org/html/2606.23937#bib.bib14)\), and prompt/document\-level policy conditioning, the τ\-bench default\. Closest to our structured\-state setting, Zwerdling et al\. \([2025](https://arxiv.org/html/2606.23937#bib.bib22)\) compile company\-policy documents into per\-tool guard code on τ\-bench\-airline; like the symbolic enforcers, this presupposes the applicable policy is already in hand and does not ask whether the governing clause can be *identified* at decision time, which is the question we isolate here\. Retrieval\-based guards \(Xiang et al\., [2024](https://arxiv.org/html/2606.23937#bib.bib23)\) and agentic\-RAG policy\-as\-code \(Romeo et al\., [2025](https://arxiv.org/html/2606.23937#bib.bib24)\) do retrieve guard knowledge, and rule\-based runtime\-enforcement DSLs \(Wang et al\., [2025](https://arxiv.org/html/2606.23937#bib.bib29)\) sit in our pre\-action allow/refuse design space, but all presuppose the rule and none measures identification recall against downstream classifier utility\. The nearest compliance\-benchmark neighbor, MANTRA \(Anand et al\., [2026](https://arxiv.org/html/2606.23937#bib.bib30)\), synthesizes SMT\-validated compliance traces but verifies behavior *post hoc* rather than gating actions pre\-emptively\. Learned guards also classify over *raw* trajectory text against harm taxonomies \(Inan and others, [2023](https://arxiv.org/html/2606.23937#bib.bib9)\) or raw prefixes\. The closest representation\-axis work, sufficient\-context for RAG \(Joren and others, [2024](https://arxiv.org/html/2606.23937#bib.bib12)\), distills a single sufficiency signal for generate\-vs\-abstain, not a multi\-field policy state for write\-action gating\.

Recent agent\-safety work motivates but does not run our ablation: the “verifier tax” shows runtime enforcement intercepts most unsafe actions yet rarely yields safe completion \(Sah and others, [2026](https://arxiv.org/html/2606.23937#bib.bib7)\), and proxy\-state evaluation infers a structured state *post hoc* for reward/eval rather than as a pre\-action classifier input \(Chuang et al\., [2026](https://arxiv.org/html/2606.23937#bib.bib8)\)\. To our knowledge, among the venues surveyed above, we are the first to run a controlled *decision\-state representation ablation* \(model\+data\+recipe fixed\) specifically for a *learned allow/verify/refuse pre\-action classifier in policy\-constrained tool\-use agents* \(as opposed to guardrails or abstention in general\)\.

#### Exact\-match recall vs\. downstream policy utility\.

A prominent line treats an upstream retrieval/selection step, not the downstream model, as what bounds end\-to\-end quality: for *tools*, Shi et al\. \([2025](https://arxiv.org/html/2606.23937#bib.bib27)\) show strong retrievers select the right tool poorly and that benchmarks mask this by pre\-annotating the relevant tool; for *skills*, Wang et al\. \([2026](https://arxiv.org/html/2606.23937#bib.bib21)\) find the binding constraint on τ\-bench is matching skills to tasks; and task\-aligned retrieval \(Sun et al\., [2026](https://arxiv.org/html/2606.23937#bib.bib26)\) argues one should retrieve the *applicable* item, not the most lexically similar one\. Our direct intervention partially *contrasts* with this line: at the level of *exact* clause recall our retriever is just as weak \(7% at rank 1\), yet the reported classifier does not show a detectable macro\-F1 loss because a non\-gold but decision\-context\-aligned clause can remain useful\. Prior work has established that retrieval relevance and exact\-match metrics can be weak proxies for downstream utility \(sufficient\-context for RAG \(Joren and others, [2024](https://arxiv.org/html/2606.23937#bib.bib12)\); answer\-equivalence shows token\-level exact\-match *underestimates* correct *outputs* \(Bulian et al\., [2022](https://arxiv.org/html/2606.23937#bib.bib1)\); and recall is itself a problematic retrieval\-quality metric whose link to LLM response quality is weak \(Schwartz et al\., [2025](https://arxiv.org/html/2606.23937#bib.bib2)\)\)\. We evaluate this decoupling in a new setting: retrieval of benchmark\-authored policy clauses for pre\-action compliance classification\.
The compliance\-enforcement literature largely *assumes* the rule is in hand: symbolic enforcers compile a *known* policy \(Chen and others, [2025](https://arxiv.org/html/2606.23937#bib.bib13); Winston et al\., [2026](https://arxiv.org/html/2606.23937#bib.bib14)\), and proxy\-state evaluators infer state for a *given* task \(Chuang et al\., [2026](https://arxiv.org/html/2606.23937#bib.bib8)\)\. Where it does retrieve, it \(correctly\) warns that retrieval alone gives no high\-stakes guarantee; our claim is correspondingly about the reported classifier’s *accuracy*, not a safety guarantee\. Closest in spirit to our mechanism, Chen \([2026](https://arxiv.org/html/2606.23937#bib.bib28)\) argues tool\-selection failure is at the decision readout, not identification\. In our setting, the corresponding observation is that the classifier can exploit a context\-aligned nonmatching clause\. To our knowledge, the specific measurement \(that exact\-match clause recall undercounts downstream policy utility for a compliance classifier, and that this varies with fine\-tuning configuration\) has not been reported\.

## 6 Limitations

Construct validity: a benchmark\-native proxy for “the rule,” not real policy documents\. Our rule corpus is the set of each domain’s distinct τ\-bench nl\_assertion strings \(122 airline, 51 retail\), and both the gold labels and the gold\-policy input are drawn from these benchmark\-authored assertions\. This is a *proxy* for rule identification: we measure whether the governing *assertion* can be recovered from decision\-time context, not whether an agent can identify the applicable clause from a naturalistic, multi\-page policy document\. This scope has two implications: \(i\) external validity to real policy corpora is *unproven*, since a longer, redundant, or differently\-structured policy document could be easier or harder; and \(ii\) our stronger\-identifier results \(query expansion, cross\-encoder reranking, hierarchical retrieval\) are robustness checks *within this same surrogate construct*, not evidence about realistic policy\-document identification\. We therefore frame the contribution as a benchmark\-scoped metric\-validity study: exact\-match recall can be too pessimistic a proxy for downstream policy utility, rather than a general law about tool\-use agents or policy documents\.

Exact recovery is not equivalent to downstream utility\. We show exact\-match policy\-clause recall falls below the offline gold\-injection diagnostic across the configurations we test \(two domains, two query protocols, four off\-the\-shelf retrievers\), while the direct classifier evaluation shows that low exact recall need not imply low downstream utility\. The shortfall is scoped to *training\-free, off\-the\-shelf* retrieval; a domain\-fine\-tuned dense retriever, a learned clause selector, or an agent\-in\-the\-loop that asks clarifying questions could improve exact recovery\. More importantly, the direct intervention shows that exact recovery is not the only utility metric: retrieved\-clause\-in\-the\-loop classifier accuracy must be measured directly\. The diagnostic crossing \(≈ 0\.75 on a coarse grid; the finer grid puts it higher\) is therefore an order\-of\-magnitude warning about exact recall, not a sharp deployment threshold\.
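For concreteness, exact\-match recall@k as discussed above can be sketched as follows. This is a minimal illustration, not the paper's evaluation script: the function name `recall_at_k`, the dictionary layout, and the clause and state ids are all hypothetical, and each decision state is assumed to have exactly one gold clause.

```python
from typing import Dict, Sequence


def recall_at_k(rankings: Dict[str, Sequence[str]],
                gold: Dict[str, str],
                k: int) -> float:
    """Fraction of decision states whose gold clause id is in the top-k results.

    rankings maps a state id to clause ids ordered best-first (assumed layout);
    gold maps a state id to its single governing clause id.
    """
    if k < 1:
        raise ValueError("k must be >= 1")
    if not gold:
        raise ValueError("gold must not be empty")
    hits = 0
    for state_id, gold_id in gold.items():
        top_k = list(rankings.get(state_id, []))[:k]
        if gold_id in top_k:
            hits += 1
    return hits / len(gold)


# Toy data (hypothetical ids, not benchmark values).
rankings = {
    "s1": ["c3", "c1", "c7"],
    "s2": ["c2", "c9", "c4"],
    "s3": ["c5", "c6", "c8"],
    "s4": ["c1", "c4", "c2"],
}
gold = {"s1": "c1", "s2": "c2", "s3": "c9", "s4": "c2"}

# Recall profile over k, analogous in shape to a recall@k curve.
profile = {k: recall_at_k(rankings, gold, k) for k in (1, 2, 3)}
```

Exact matching of ids is what makes this metric strict: a retrieved clause that is decision\-relevant but not the designated gold clause counts as a miss, which is the gap the downstream classifier evaluation is meant to expose.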

Scale and scope\. This is a small\-N study of a pre\-action classifier evaluated offline rather than inside a live agent loop\. The representation half is on τ\-bench\-airline \(test 85, 15 tasks; cross\-actor n=50\); the identification half adds retail \(n=40 tasks carrying policy assertions\)\. Two domains is the ceiling of the public benchmark substrate, not a convenience sample: of the available τ²\-bench domains, only airline \(50/50 tasks\) and retail \(40/114\) expose the per\-task natural\-language policy assertions our labels and haystack require; telecom \(0/2285\) and banking \(0/97\) expose none, so a third domain would require generating new trajectories \(reintroducing the actor confound\) and is out of scope for a training\-free study\. The retail trajectories come from different actor models than the airline decision states, so we keep the cross\-domain claim at the actor\-agnostic *identification* level \(retrieval recall and rank\), not at absolute classifier macro\-F1\. We report two model scales \(3B, 7B\), untuned \(shared hyperparameters\) and tuned \(per\-cell selection on a small held\-out split, 55 states / 10 tasks\), reporting “best observed configuration” rather than a tuned optimum; the tuned ranking of *masked* vs\. *structured* should not be over\-read\. The evidence supports a bounded claim: given the benchmark policy, raw trajectory text is a poor classifier input, and in the primary configuration a top\-1 *retrieved* clause showed no detectable degradation vs\. the benchmark gold clause despite low exact\-match recall\. Retrieval is not universally sufficient: the result is sensitive to fine\-tuning configuration \(the gold\-policy gap re\-opens for under\-trained and maximally\-discriminating classifiers\), the test sets are small with benchmark\-derived labels, and the decision\-context\-alignment mechanism is partly confounded with retrieved\-clause length\.
Live\-loop deployment, learned identification, and high\-stakes verification beyond retrieval are left to future work\.

## 7 Conclusion

Exact\-match clause recall is not a reliable stand\-alone measure of policy\-retrieval utility for the classifier studied here\. Under gold\-policy conditioning, structured decision states outperform raw trajectories, confirming that explicit policy context is useful\. However, replacing the gold clause with a top\-ranked retrieved clause yields similar macro\-F1 in the primary Qwen2\.5\-3B configuration despite 7% exact recall@1, whereas mismatched\-policy and no\-policy inputs perform substantially worse\. This result is sensitive to fine\-tuning and does not establish non\-inferiority, but it demonstrates that nonmatching retrieved clauses can contain decision\-relevant policy information\. Evaluations of policy retrieval for tool\-use compliance classifiers should therefore report downstream retrieved\-policy performance alongside exact\-match recall\.

## Data and reproducibility

All experiments use public benchmarks: τ\-bench \(Yao et al\., [2024](https://arxiv.org/html/2606.23937#bib.bib3)\) and the MIT\-licensed τ²\-bench release \(Barres et al\., [2025](https://arxiv.org/html/2606.23937#bib.bib25)\) \(airline and retail domains\)\. All trajectories are generated by publicly\-available models on these open\-source benchmarks\. For the representation half, the airline pre\-action decision states are extracted from τ\-bench\-airline trajectories produced by Claude\-Sonnet\-4 and Amazon\-Nova\-Lite agents; the classifier that is trained/evaluated on them is a separate small model \(Qwen2\.5\-3B/7B, or a frozen MiniLM probe\)\. For the identification half, we use the public τ²\-bench rollouts \(Claude\-3\.7/GPT\-4\.1/o4\-mini actors\) for airline and retail\. In all cases the gold allow/verify/refuse labels and the policy clauses derive from each task’s published natural\-language policy assertions\. No customer, production, or proprietary data is used and no internal models are involved; τ\-bench personas are synthetic and all actor/classifier models are public\. The anonymous artifact contains the model outputs and analysis scripts required to reproduce all reported statistics, including the gold\-injection diagnostic sweep and the cross\-domain identification measurements \(including the excluded leaky\-query protocol, flagged as such\)\.

## References

- MANTRA: synthesizing smt\-validated compliance benchmarks for tool\-using llm agents\.External Links:2605\.06334Cited by:[§5](https://arxiv.org/html/2606.23937#S5.SS0.SSS0.Px2.p1.4)\.
- V\. Barres, H\. Dong, S\. Ray, X\. Si, and K\. Narasimhan \(2025\) τ²\-Bench: evaluating conversational agents in a dual\-control environment\. External Links: 2506\.07982 Cited by: [Data and reproducibility](https://arxiv.org/html/2606.23937#Sx1.p1.5)\.
- J\. Bulian, C\. Buck, W\. Gajewski, B\. Börschinger, and T\. Schuster \(2022\)Tomayto, tomahto\. beyond token\-level answer equivalence for question answering evaluation\.InProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing \(EMNLP\),pp\. 291–305\.Cited by:[§5](https://arxiv.org/html/2606.23937#S5.SS0.SSS0.Px3.p1.2)\.
- S\. Chen \(2026\)Looking is not picking: an attention\-segment account of tool\-selection failures in llm agents\.External Links:2606\.16364Cited by:[§5](https://arxiv.org/html/2606.23937#S5.SS0.SSS0.Px3.p1.2)\.
- Z\. Chenet al\.\(2025\)ShieldAgent: shielding agents via verifiable safety policy reasoning\.External Links:2503\.22738Cited by:[§5](https://arxiv.org/html/2606.23937#S5.SS0.SSS0.Px2.p1.4),[§5](https://arxiv.org/html/2606.23937#S5.SS0.SSS0.Px3.p1.2)\.
- Y\. Chuang, C\. Kulkarni, A\. Chiu,et al\.\(2026\)Toward scalable verifiable reward: proxy state\-based evaluation for multi\-turn tool\-calling llm agents\.External Links:2602\.16246Cited by:[§5](https://arxiv.org/html/2606.23937#S5.SS0.SSS0.Px2.p1.4),[§5](https://arxiv.org/html/2606.23937#S5.SS0.SSS0.Px3.p1.2)\.
- E\. J\. Hu, Y\. Shen, P\. Wallis, Z\. Allen\-Zhu, Y\. Li, S\. Wang, L\. Wang, and W\. Chen \(2022\)LoRA: low\-rank adaptation of large language models\.InInternational Conference on Learning Representations \(ICLR\),External Links:2106\.09685Cited by:[§2](https://arxiv.org/html/2606.23937#S2.SS0.SSS0.Px5.p1.9)\.
- H\. Inanet al\.\(2023\)Llama guard: llm\-based input\-output safeguard for human\-ai conversations\.External Links:2312\.06674Cited by:[§5](https://arxiv.org/html/2606.23937#S5.SS0.SSS0.Px2.p1.4)\.
- H\. Jorenet al\.\(2024\)Sufficient context: a new lens on retrieval augmented generation systems\.External Links:2411\.06037Cited by:[§5](https://arxiv.org/html/2606.23937#S5.SS0.SSS0.Px2.p1.4),[§5](https://arxiv.org/html/2606.23937#S5.SS0.SSS0.Px3.p1.2)\.
- N\. Madhusudhanet al\.\(2024\)Do LLMs know when to not answer? investigating abstention abilities of large language models\.External Links:2407\.16221Cited by:[§5](https://arxiv.org/html/2606.23937#S5.SS0.SSS0.Px1.p1.1)\.
- Qwen Team \(2024\)Qwen2\.5 technical report\.External Links:2412\.15115Cited by:[§2](https://arxiv.org/html/2606.23937#S2.SS0.SSS0.Px5.p1.9)\.
- N\. Reimers and I\. Gurevych \(2019\)Sentence\-bert: sentence embeddings using siamese bert\-networks\.InProceedings of EMNLP\-IJCNLP,Cited by:[§2](https://arxiv.org/html/2606.23937#S2.SS0.SSS0.Px5.p1.9)\.
- F\. Romeo, L\. Arena, F\. Blefari, F\. A\. Pironti, M\. Lupinacci, and A\. Furfaro \(2025\)ARPaCCino: an agentic\-rag for policy as code compliance\.External Links:2507\.10584Cited by:[§5](https://arxiv.org/html/2606.23937#S5.SS0.SSS0.Px2.p1.4)\.
- T\. Sahet al\.\(2026\)The verifier tax: horizon\-dependent safety\-success tradeoffs in tool\-using llm agents\.External Links:2603\.19328Cited by:[§5](https://arxiv.org/html/2606.23937#S5.SS0.SSS0.Px2.p1.4)\.
- S\. Schwartz, O\. Vasilyev, and R\. Sawaya \(2025\)How important is recall for measuring retrieval quality?\.External Links:2512\.20854Cited by:[§5](https://arxiv.org/html/2606.23937#S5.SS0.SSS0.Px3.p1.2)\.
- Z\. Shi, Y\. Wang, L\. Yan, and P\. Ren \(2025\)Retrieval models aren’t tool\-savvy: benchmarking tool retrieval for large language models\.External Links:2503\.01763Cited by:[§5](https://arxiv.org/html/2606.23937#S5.SS0.SSS0.Px3.p1.2)\.
- Z\. Sun, S\. Xu, and T\. Li \(2026\)Beyond similarity: task\-aligned retrieval for language models\.External Links:2605\.27951Cited by:[§5](https://arxiv.org/html/2606.23937#S5.SS0.SSS0.Px3.p1.2)\.
- C\. van Niekerk, M\. Heck, C\. Geishauser, H\. Lin, N\. Lubis, M\. Moresi, and M\. Gašić \(2020\)Knowing what you know: calibrating dialogue belief state distributions via ensembles\.InFindings of EMNLP,External Links:2010\.02586Cited by:[§5](https://arxiv.org/html/2606.23937#S5.SS0.SSS0.Px1.p1.1)\.
- H\. Wang, C\. M\. Poskitt, and J\. Sun \(2025\)AgentSpec: customizable runtime enforcement for safe and reliable llm agents\.External Links:2503\.18666Cited by:[§5](https://arxiv.org/html/2606.23937#S5.SS0.SSS0.Px2.p1.4)\.
- Y\. Wang, Y\. Zhou, Y\. Liang, C\. Zhang, F\. Liu, J\. Zhou, and H\. Yao \(2026\)Not all skills help: measuring and repairing agent knowledge\.External Links:2606\.15390Cited by:[§5](https://arxiv.org/html/2606.23937#S5.SS0.SSS0.Px3.p1.2)\.
- C\. Winston, C\. Winston, and R\. Just \(2026\)Solver\-aided verification of policy compliance in tool\-augmented llm agents\.External Links:2603\.20449Cited by:[§5](https://arxiv.org/html/2606.23937#S5.SS0.SSS0.Px2.p1.4),[§5](https://arxiv.org/html/2606.23937#S5.SS0.SSS0.Px3.p1.2)\.
- J\. J\. Wu \(2023\)Large language models should ask clarifying questions to increase confidence in generated code\.External Links:2308\.13507Cited by:[§5](https://arxiv.org/html/2606.23937#S5.SS0.SSS0.Px1.p1.1)\.
- Z\. Xiang, L\. Zheng, Y\. Li, J\. Hong, Q\. Li, H\. Xie, J\. Zhang, Z\. Xiong, C\. Xie, C\. Yang, D\. Song, and B\. Li \(2024\)GuardAgent: safeguard llm agents by a guard agent via knowledge\-enabled reasoning\.External Links:2406\.09187Cited by:[§5](https://arxiv.org/html/2606.23937#S5.SS0.SSS0.Px2.p1.4)\.
- S\. Yao, N\. Shinn, P\. Razavi, and K\. Narasimhan \(2024\) τ\-Bench: a benchmark for tool\-agent\-user interaction in real\-world domains\. External Links: 2406\.12045 Cited by: [§2](https://arxiv.org/html/2606.23937#S2.SS0.SSS0.Px1.p1.5), [Data and reproducibility](https://arxiv.org/html/2606.23937#Sx1.p1.5)\.
- S\. Yao, J\. Zhao, D\. Yu, N\. Du, I\. Shafran, K\. Narasimhan, and Y\. Cao \(2023\)ReAct: synergizing reasoning and acting in language models\.InInternational Conference on Learning Representations \(ICLR\),External Links:2210\.03629Cited by:[§5](https://arxiv.org/html/2606.23937#S5.SS0.SSS0.Px1.p1.1)\.
- N\. Zwerdling, D\. Boaz, E\. Rabinovich, G\. Uziel, D\. Amid, and A\. Anaby\-Tavor \(2025\)Towards enforcing company policy adherence in agentic workflows\.External Links:2507\.16459Cited by:[§5](https://arxiv.org/html/2606.23937#S5.SS0.SSS0.Px2.p1.4)\.

## Appendix A Identification\-number provenance

Table [2](https://arxiv.org/html/2606.23937#A1.T2) maps the main identification numbers to the result files used to compute them\.

Table 2: Provenance map for the identification\-half headline numbers: each quantity, its value, and the committed result file \(under experiments/\) it is recomputed from\. The two airline R@5 figures use different query protocols \(request \+ evidence vs\. trajectory\-context\), reported separately; all land below the diagnostic crossing\.

![Refer to caption](https://arxiv.org/html/2606.23937v1/x2.png)

Figure 2: Policy\-clause identifiability profile \(recall@k\)\. For each domain, recall@k of the applicable gold policy clause as a function of k \(log scale\), for all four off\-the\-shelf retrievers, under the trajectory\-context query protocol\. Dotted line = the ≈ 0\.75 diagnostic crossing\. On airline \(pool 122\) recall stays far below this crossing until k=50; on retail \(pool 51\) it reaches the crossing only near k=20–50, i\.e\. only with large candidate sets comprising a substantial fraction of the policy pool\. No retriever, including the full\-pool cross\-encoder reranker, identifies the applicable clause at small, ingestible k\.

## Appendix B Full representation ablation

This appendix gives the full representation ablation summarized in Section [3](https://arxiv.org/html/2606.23937#S3)\. Throughout, the classifier is given the correct applicable policy clause; these are representation effects *given* the rule, not end\-to\-end gains \(Section [4](https://arxiv.org/html/2606.23937#S4) replaces this gold\-policy condition with retrieved policies\)\.

### B\.1 Primary 3B result

Table [3](https://arxiv.org/html/2606.23937#A2.T3) reports the primary *3B* representation result \(§[B\.6](https://arxiv.org/html/2606.23937#A2.SS6) shows how it shifts at 7B\)\. Under the *identical* recipe that produced the refuse\-collapse, the structured\-state classifier reaches macro\-F1 0\.601 \(CI \[0\.543, 0\.661\]\) versus 0\.293 for raw, a paired gap of \+0\.308 \(CI \[\+0\.237, \+0\.380\]\), roughly a doubling\. Crucially, *no* trained representation reproduces the degenerate collapse \(0/12 runs collapse, vs\. the ce\_smoke baseline’s refuse\-100%\): the collapse was a property of the raw\-text small\-SFT regime, and a better representation alone escapes it\. Figure [3](https://arxiv.org/html/2606.23937#A2.F3) visualizes the ordering\.

Table 3: Main representation ablation \(Qwen2\.5\-3B, LoRA, identical recipe to the committed ce\_smoke collapse†; 3 seeds; pooled paired bootstrap, 5000 resamples, per\-task split\)\. “tok” = mean input tokens; “coll\.” = seeds that collapsed to one class\. Structured decision\-state roughly doubles raw macro\-F1 \(paired Δ = \+0\.308, CI \[\+0\.237, \+0\.380\]\) and *no* trained variant reproduces the refuse\-100% collapse\.

† ce\_smoke = the companion postmortem’s documented small\-data collapse, recomputed from committed records \(raw state\_text, 3 supervision views\); macro\-F1 ≈ 0\.12 at the refuse\-100% collapse \(trained accuracy 0\.23–0\.30, i\.e\. the majority\-class rate\)\.

![Refer to caption](https://arxiv.org/html/2606.23937v1/x3.png)

Figure 3: Three\-way decision macro\-F1 by classifier input under the identical SFT recipe \(3 seeds, bootstrap CIs\)\. The dashed line is the majority\-class \(collapse\) macro\-F1\. Structured decision\-state roughly doubles raw\.

### B\.2 It is not brevity, not scaffold, not a leaked label, not format compliance

Table 4: Controls \(offline frozen\-encoder gate\)\. The structured advantage is not brevity \(length/info\-matched\), not schema scaffold alone \(shuffled fields\), and not a leaked label: replacing the policy line with a wrong\-but\-plausible real policy collapses the gate\. Policy text keyword\-matches the gold action only 14–33% and the split is task\-disjoint\. All Δ are paired\-bootstrap macro\-F1 gaps \(structured minus the named control\); leading zeros omitted \(e\.g\. \.184 = 0\.184\); all CIs exclude zero\. Rows are offline\-MiniLM\-logreg gaps; the length\-matched row is from an earlier 35\-task pilot run and the others from the 195\-task offline ablation, so the controls are directional rather than from a single jointly\-fitted model\.

Table [4](https://arxiv.org/html/2606.23937#A2.T4) collects the frozen\-encoder diagnostic controls\. The structured advantage survives a *length\-matched* control \(raw truncated to the structured length; \+0\.184\) and an *information\-matched* control \(a random raw window of equal length; \+0\.180\), so it is not merely brevity or denoising\. It survives *field\-shuffling* \(keeping the schema but replacing field contents with another example’s; \+0\.313\), so it is not the schema scaffold alone\.
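The two length controls described above can be sketched roughly as below. This is an illustrative reading only: whitespace tokenization, prefix truncation for the length\-matched control, and the helper names `length_matched` and `information_matched` are assumptions, not details confirmed by the source.

```python
import random
from typing import List, Sequence


def length_matched(raw_tokens: Sequence[str], target_len: int) -> List[str]:
    """Raw input truncated to the structured input's length (prefix kept; assumption)."""
    return list(raw_tokens[:target_len])


def information_matched(raw_tokens: Sequence[str], target_len: int,
                        rng: random.Random) -> List[str]:
    """A random contiguous raw window with the same length as the structured input."""
    if target_len >= len(raw_tokens):
        return list(raw_tokens)
    start = rng.randrange(len(raw_tokens) - target_len + 1)
    return list(raw_tokens[start:start + target_len])


# Toy inputs (hypothetical text, whitespace tokens).
raw = ("user asks to change a flight . agent looks up the reservation . "
       "agent checks fare class . agent prepares the update call").split()
structured = "Request: change flight Policy: toy clause".split()

rng = random.Random(0)  # seeded so the control set is reproducible
lm = length_matched(raw, len(structured))
im = information_matched(raw, len(structured), rng)
```

Both controls hold input length fixed, so a remaining gap against the structured input cannot be attributed to the classifier simply seeing fewer tokens.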

#### Leakage audit\.

Because the structured state contains an “applicable policy” line, a possible alternative explanation is that the classifier simply reads a near\-label rationale\. Three facts rule this out\. First, replacing the policy line with a *mismatched\-but\-plausible* real policy assertion from another example sharply reduces performance \(Δ = \+0\.268, CI \[\+0\.163, \+0\.373\]\): the classifier uses the *correct* policy content, not a generic format cue, and a shuffled real policy is wrong content in the same distribution, so this is not a hidden gold\-label leak\. Second, the policy text keyword\-matches the gold action only 14–33% of the time; the classifier must reason policy \+ evidence → action\. Third, the train/test split is task\-disjoint \(no test task’s policy strings appear in training\), so the gain generalizes to unseen tasks within this domain, though not necessarily to unseen *policy corpora*\.
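A mismatched\-but\-plausible policy control of the kind used in the first check could be built along these lines. The sketch assumes each example is a dictionary with a `policy` field and draws the replacement uniformly from other examples' differing policies; the function name and sampling scheme are hypothetical.

```python
import random
from typing import Dict, List


def mismatched_policy_control(examples: List[Dict[str, str]],
                              seed: int = 0) -> List[Dict[str, str]]:
    """Swap each example's policy for a different real policy from another example.

    All other fields are left untouched, so only the policy content changes.
    """
    rng = random.Random(seed)
    control = []
    for i, example in enumerate(examples):
        donors = [other["policy"] for j, other in enumerate(examples)
                  if j != i and other["policy"] != example["policy"]]
        if not donors:
            raise ValueError("need at least two distinct policies to mismatch")
        swapped = dict(example)  # shallow copy; original example is not mutated
        swapped["policy"] = rng.choice(donors)
        control.append(swapped)
    return control


# Toy examples (hypothetical clauses, not benchmark assertions).
examples = [
    {"request": "r1", "policy": "toy clause A", "label": "refuse"},
    {"request": "r2", "policy": "toy clause B", "label": "allow"},
    {"request": "r3", "policy": "toy clause C", "label": "verify"},
]
control = mismatched_policy_control(examples, seed=0)
```

Because the donor text comes from the same pool of real clauses, the control keeps format and distribution fixed and changes only whether the clause matches the state.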

#### Not output\-format compliance\.

The raw SFT classifier emits some unparseable continuations on its long inputs\. Re\-scoring on the parsed\-only subset *widens* the gap \(raw parsed\-only macro\-F1 0\.311 vs\. structured 0\.602\), so the advantage is decision quality, not the structured classifier’s higher format compliance\.

### B\.3 How much is policy access vs\. structured selection?

We decompose the SFT gain\. Appending the elicited policy to raw text \(raw\+policy\) recovers part of the gap \(\+0\.137 over raw, CI \[\+0\.084, \+0\.191\]\), but the structured state still beats raw\+policy by \+0\.171 \(CI \[\+0\.100, \+0\.241\]\)\. So policy access explains roughly half the gain and the compact structured selection of request/evidence/policy/action\-class explains the other half: *structured decision\-state distillation improves over raw trajectories even when policy access is controlled; the gain is not explained by merely appending policy text\. Generic structure alone does not dominate raw*; indeed, in the frozen\-encoder diagnostic probe a *contentless* generic\-policy structured state underperforms raw, so the structured win depends on real elicited content, not formatting in the abstract\.
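To make the two inputs being compared concrete, the sketch below renders a structured decision state with the four fields named above and, separately, a raw\+policy input. The class name, field labels, line layout, and toy strings are assumptions for illustration; the source does not specify the exact serialization.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecisionState:
    """Compact pre-action state: request, evidence, policy, action class."""
    request: str
    evidence: str
    policy: str
    action_class: str

    def render(self) -> str:
        # One labelled line per field (hypothetical layout).
        return "\n".join([
            f"Request: {self.request}",
            f"Evidence: {self.evidence}",
            f"Policy: {self.policy}",
            f"Action class: {self.action_class}",
        ])


def raw_plus_policy(raw_trajectory: str, policy: str) -> str:
    """Raw trajectory text with the policy clause appended (the raw+policy arm)."""
    return f"{raw_trajectory}\n\nPolicy: {policy}"


# Toy content (hypothetical, not a benchmark example).
state = DecisionState(
    request="toy request text",
    evidence="toy evidence gathered from earlier tool calls",
    policy="toy clause: refunds require manager approval",
    action_class="write",
)
structured_input = state.render()
appended_input = raw_plus_policy("...full transcript text...", state.policy)
```

Both inputs carry the same policy clause; they differ only in whether the surrounding context is a compact selection or the full transcript, which is the contrast the decomposition isolates.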

#### Where the gain concentrates\.

On the easy allow\-vs\-rest contrast, representation barely matters \(0\.786 vs\. 0\.783\); the structured advantage concentrates on the safety\-relevant refuse\-vs\-rest contrast \(0\.681 vs\. 0\.623\), which motivates keeping a distinct verify\-then\-proceed class rather than a binary block/allow classifier\.

### B\.4 Per\-class and per\-task behaviour

To check whether aggregate macro\-F1 masks class\-specific behavior, Table [5](https://arxiv.org/html/2606.23937#A2.T5) reports per\-class precision/recall/F1 \(computed from the SFT predictions, pooled over seeds\)\. The structured classifier’s gains are concentrated and *safety\-aligned*: allow F1 rises 0\.39 → 0\.63 and refuse recall reaches 1\.00 \(it misses no required refusal among the n=18 test refusals\), with raw\+policy sitting in between\. This comes with a visible tradeoff: structured trades some verify recall \(0\.46 → 0\.40\) for the large allow/refuse gains; it is not uniformly better on every class\. Per task \(15 disjoint test tasks, seed 42\), the structured\-over\-raw accuracy delta is positive on 8 tasks, tied on 5, and negative on 2: the gain is broad, not driven by one or two tasks, though the small per\-task counts mean individual tasks are noisy\.
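Per\-class precision, recall, and F1, and the macro\-F1 that averages them, can be computed as in the following sketch. It is a generic implementation on toy labels, not the paper's scoring script; treating an unparseable prediction as a label outside the three classes (so it counts as a miss) is an assumption.

```python
from typing import Dict, Sequence

LABELS = ("allow", "verify", "refuse")


def per_class_prf(gold: Sequence[str], pred: Sequence[str],
                  labels: Sequence[str] = LABELS) -> Dict[str, Dict[str, float]]:
    """Precision, recall and F1 for each label, one-vs-rest."""
    report = {}
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    return report


def macro_f1(gold: Sequence[str], pred: Sequence[str],
             labels: Sequence[str] = LABELS) -> float:
    """Unweighted mean of per-class F1, so rare classes count equally."""
    report = per_class_prf(gold, pred, labels)
    return sum(r["f1"] for r in report.values()) / len(labels)


# Toy labels (hypothetical, not benchmark predictions).
gold = ["allow", "allow", "verify", "verify", "refuse", "refuse"]
pred = ["allow", "verify", "verify", "allow", "refuse", "refuse"]
report = per_class_prf(gold, pred)
score = macro_f1(gold, pred)
```

Reading the per\-class rows alongside the macro average is what reveals tradeoffs such as higher refuse recall paid for with lower verify recall.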

Table 5: Per\-class precision/recall/F1 \(SFT gate, pooled over 3 seeds; computed from committed predictions\)\. The structured gate’s gains concentrate on allow \(F1 0\.39 → 0\.63\) and on *never missing a required refusal* \(refuse recall = 1\.00\); it trades some verify recall for this\. raw\+policy sits between\.

### B\.5 Error analysis: the effect spans policy types, and the mechanism is visible

Table 6: Per\-gold\-class accuracy \(3B SFT, seed 42, committed predictions\)\. The structured\-state advantage is *not* concentrated in one decision type: structured improves over raw on *all three* classes \(\+0\.33 allow, \+0\.37 verify, \+0\.39 refuse\), including the refuse class\. This argues the effect is a general property of the classifier’s decision, not a single\-category artifact\.

Breaking accuracy out by the gold decision \(Table [6](https://arxiv.org/html/2606.23937#A2.T6)\) shows the structured advantage is not a single\-category artifact: structured improves over raw on *all three* classes \(\+0\.33 allow, \+0\.37 verify, \+0\.39 refuse\)\. Inspecting individual cases makes the mechanism concrete\. On a must\-refuse task \(remove a passenger, which the policy forbids\), the raw classifier, given the full ~2,000\-token transcript, predicts allow, while the structured classifier, whose input surfaces the line *“Policy: do not remove passenger…”*, correctly predicts refuse\. The reverse also occurs: on a legitimate\-proceed task the raw classifier over\-refuses while the structured classifier, seeing the satisfied policy condition, correctly allows\. The quantitative gain therefore tracks the proposed mechanism: the classifier performs well when the decision\-relevant policy is explicit in its input, and errs when that signal is buried in raw trajectory text\.

#### On sample size\.

We are explicit about statistical scope rather than inflating it\. The unit of analysis is the per\-task trajectory; our panel is 280 decision states from 280 trajectories \(test split 85 over 15 disjoint tasks\), and all confidence intervals bootstrap that unit \(grouped by task for matched\-condition deltas\)\. We do *not* manufacture larger N by extracting multiple correlated checkpoints per trajectory, which would overstate independence\. Instead, the evidence’s strength comes from *replication*: the same ordering \(raw worst; policy access the lever\) holds across two model scales \(3B, 7B\), tuned and untuned, and two classifier families \(frozen\-encoder probe and generative SFT\)\. A consistent effect across these independent settings is more convincing than a single large but correlated sample\.
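A task\-grouped paired bootstrap for a macro\-F1 difference, of the kind described above, might look like the sketch below. It resamples whole tasks with replacement and takes a percentile interval; the function names, the row layout, and the percentile method are assumptions, and the source's exact resampling details may differ.

```python
import random
from collections import defaultdict
from typing import List, Sequence, Tuple

LABELS = ("allow", "verify", "refuse")
Row = Tuple[str, str, str, str]  # (task_id, gold, pred_a, pred_b)


def macro_f1(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Unweighted mean F1 over the three decision labels."""
    total = 0.0
    for label in LABELS:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        denom = 2 * tp + fp + fn
        total += 2 * tp / denom if denom else 0.0
    return total / len(LABELS)


def paired_delta(rows: Sequence[Row]) -> float:
    """macro-F1 of condition A minus condition B on the same states."""
    gold = [r[1] for r in rows]
    return (macro_f1(gold, [r[2] for r in rows])
            - macro_f1(gold, [r[3] for r in rows]))


def task_cluster_bootstrap(rows: Sequence[Row], n_resamples: int = 5000,
                           alpha: float = 0.05,
                           seed: int = 0) -> Tuple[float, float, float]:
    """Point estimate and percentile CI, resampling tasks rather than states."""
    by_task = defaultdict(list)
    for row in rows:
        by_task[row[0]].append(row)
    tasks = sorted(by_task)
    rng = random.Random(seed)
    deltas: List[float] = []
    for _ in range(n_resamples):
        sample: List[Row] = []
        for task in rng.choices(tasks, k=len(tasks)):
            sample.extend(by_task[task])  # keep each task's states together
        deltas.append(paired_delta(sample))
    deltas.sort()
    lo = deltas[int((alpha / 2) * n_resamples)]
    hi = deltas[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return paired_delta(rows), lo, hi


# Toy rows (hypothetical): both conditions make identical predictions.
rows = [
    ("t1", "allow", "allow", "allow"), ("t1", "refuse", "allow", "allow"),
    ("t2", "verify", "verify", "verify"), ("t3", "refuse", "refuse", "refuse"),
]
point, lo, hi = task_cluster_bootstrap(rows, n_resamples=200, seed=1)
```

Resampling at the task level keeps correlated states from the same task together, which is why intervals from this procedure are wider than a per\-state bootstrap would suggest.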

### B\.6 Does it hold at 7B and across actors?

We re\-run the full ablation at Qwen2\.5\-7B \(Table [7](https://arxiv.org/html/2606.23937#A2.T7)\) and test transfer across the trajectory\-generating actor \(Table [8](https://arxiv.org/html/2606.23937#A2.T8)\)\. This is where the boundary of the effect appears\.

Table 7: Representation ablation across model scale, *untuned* \(3B\-inherited hyperparameters\) vs\. *tuned* \(per\-cell grid over lr × epochs, selected on a held\-out validation split, evaluated once on the untouched test; 3 seeds\)\. The structured \> raw advantage survives proper tuning at both scales \(\+0\.173 at 3B, \+0\.133 at 7B\), directly addressing the concern that the 7B result was an artifact of reused hyperparameters\. Tuning lifts the weak raw baseline \(most at 7B, 0\.306 → 0\.405\) and shrinks the large *untuned* 3B gap \(\+0\.308 → \+0\.173\), so the two scales *converge* to a consistent ~\+0\.13–0\.17 structured advantage once both arms are tuned\. Only raw and structured were tuned \(the decisive contrast\); masked/raw\+policy shown untuned for reference\.

#### What is robust\.

Across *both* scales, raw trajectory text is the worst input \(0\.29–0\.31 untuned\), and getting the applicable policy into the input is the dominant lever: at 7B, simply *appending* the policy to raw text \(raw\+policy, 0\.563\) is the single best representation, and the leakage controls \(Section [B\.2](https://arxiv.org/html/2606.23937#A2.SS2)\) show the classifier uses the policy’s *content*\. Notably, our two methods *triangulate* the same conclusion from opposite ends: a *frozen\-encoder* probe with no fine\-tuning \(Table [4](https://arxiv.org/html/2606.23937#A2.T4), the cross\-actor test\) and a *tuned generative* SFT classifier \(Tables [3](https://arxiv.org/html/2606.23937#A2.T3), [7](https://arxiv.org/html/2606.23937#A2.T7)\) both rank raw last and reward policy access, so the effect is not an artifact of any single training setup\.

#### Is the scale story just untuned hyperparameters?

Our first 7B runs reused the 3B hyperparameters, which raises the possibility that the apparent scale shift is a tuning artifact\. We therefore retuned: for each \(model, representation\) cell we grid\-searched learning rate × epochs \(LoRA rank fixed\), selected the best configuration on a *held\-out validation split* \(task\-disjoint from both train and test\), and evaluated once on the untouched test set \(3 seeds\)\. The structured \> raw advantage survives tuning at both scales: \+0\.173 at 3B and \+0\.133 at 7B \(Table [7](https://arxiv.org/html/2606.23937#A2.T7)\)\. Tuning behaves as expected: it lifts the weak raw baseline, most at 7B \(0\.306 → 0\.405\), and shrinks the very large *untuned* 3B gap \(\+0\.308 → \+0\.173\)\. The supported reading is therefore one of *convergence*, not divergence: once both arms are properly tuned, structured beats raw by a consistent ~\+0\.13–0\.17 at 3B and 7B, and the dramatic “3B doubles raw” figure was partly an artifact of an *untuned* raw baseline\. We report “best observed configuration” rather than a tuned optimum, since the validation split is small \(55 states / 10 tasks\)\.
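The per\-cell selection procedure can be outlined as follows. This sketch assumes the validation score is averaged over seeds before picking a configuration, and it uses a stub scoring function in place of real fine\-tuning; the grid values, the function name `select_config`, and the `val_score` callable are hypothetical.

```python
from itertools import product
from statistics import mean
from typing import Callable, Sequence, Tuple

Config = Tuple[float, int]  # (learning rate, epochs)


def select_config(lrs: Sequence[float], epochs: Sequence[int],
                  seeds: Sequence[int],
                  val_score: Callable[[float, int, int], float]
                  ) -> Tuple[Config, float]:
    """Pick the (lr, epochs) pair with the best mean validation macro-F1.

    val_score(lr, n_epochs, seed) is assumed to train one classifier and return
    its macro-F1 on the held-out validation split; the test set is never used here.
    """
    best_cfg, best_val = None, float("-inf")
    for lr, n_epochs in product(lrs, epochs):
        score = mean(val_score(lr, n_epochs, seed) for seed in seeds)
        if score > best_val:
            best_cfg, best_val = (lr, n_epochs), score
    if best_cfg is None:
        raise ValueError("empty grid")
    return best_cfg, best_val


def stub_val_score(lr: float, n_epochs: int, seed: int) -> float:
    """Stand-in for train-then-validate; peaks at lr=2e-4, epochs=3 by design."""
    return 1.0 - abs(lr - 2e-4) * 1000 - abs(n_epochs - 3) * 0.1


# Hypothetical grid; the source does not list the grid values.
best_cfg, best_val = select_config(
    lrs=(1e-4, 2e-4, 5e-4), epochs=(1, 3, 5), seeds=(0, 1, 2),
    val_score=stub_val_score,
)
```

Running the same selection independently for each (model, representation) cell, then scoring the chosen configuration once on the test set, is what keeps the tuned comparison from favouring whichever arm the shared hyperparameters happened to suit.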

Table 8: Cross\-actor transfer \(offline gate; train on Sonnet\-actor decision states, test on Nova\-actor, n=50\)\. The gate *transfers* across the trajectory\-generating model \(all inputs ≥ 0\.64\), but the structured\-vs\-raw advantage is not robust here \(\+0\.103, CI \[−0\.080, \+0\.288\], crosses zero\) and masked leads, consistent with the scale finding that the *specific* structured edge is regime\-dependent while abstraction\-away\-from\-raw is general\.

#### Cross\-actor\.

Training on Sonnet\-actor decision states and testing on Nova\-actor states \(Table [8](https://arxiv.org/html/2606.23937#A2.T8), frozen\-encoder probe\), the classifier *transfers* \(all inputs score ≥ 0\.64 and raw is again the weakest\), but at this small sample \(n=50\) the structured\-vs\-raw margin is not significant \(\+0\.103, CI \[−0\.080, \+0\.288\]\) and masking leads\. The robust conclusion \(abstraction away from raw text plus policy access\) holds across actors; *which* abstraction wins remains underdetermined at this sample size \(here masking; at small scale structured\), consistent with the tuned\-scale picture where masked and structured are close\.

## Appendix C Robustness checks for the identification analysis

We summarise here the supporting checks referenced in Section [4](https://arxiv.org/html/2606.23937#S4); all use the same frozen-encoder diagnostic probe.

#### Gold-injection threshold is robust to the classifier head.

Re-running the gold-injection sweep with a more regularised logistic head ($C=0.1$) and a small MLP (64 hidden units) on the same MiniLM features, the structured $-$ raw gap fails to clear zero *even at full gold injection* ($p=1.0$): with either head, structured never robustly beats raw. The balanced-logreg head ($C=1.0$) we report is thus the *most* favourable to the structured arm; the high diagnostic crossing is a property of the task, not of the head.

#### Representation advantage across encoders.

On the offline probe, the structured $-$ raw gap given gold policy is $+0.115$ (CI $[+0.014, +0.221]$) for MiniLM and $+0.142$ ($[+0.023, +0.260]$) for `bge-large-en-v1.5`, but only $+0.038$ ($[-0.092, +0.164]$, not significant) for `e5-large-v2`. The offline representation effect is therefore mostly but not universally encoder-robust; our primary representation evidence is the Qwen classifier (Section [3](https://arxiv.org/html/2606.23937#S3)), for which the offline probe is only a low-cost control.

#### Stronger identification pipelines.

Beyond the four off-the-shelf retrievers, Table [9](https://arxiv.org/html/2606.23937#A3.T9) reports three additional retrieval pipelines: query expansion, a hybrid first stage with cross-encoder reranking of the top-20, and coarse-to-fine hierarchical retrieval. None clears the diagnostic crossing (best recall@5 $0.38$ airline / $0.58$ retail); the hierarchical variant is the worst, as the coarse section step drops the gold clause. This is the direct check against "you only tested weak retrieval": stronger, structured identification does not lift recall above the bar.

| Identification pipeline | Airline R@5 | Airline R@10 | Retail R@5 | Retail R@10 |
|---|---|---|---|---|
| Best off-the-shelf retriever | 0.24 | 0.36 | 0.60 | 0.72 |
| Query expansion + hybrid | 0.38 | 0.48 | 0.47 | 0.68 |
| Hybrid top-20 $\to$ cross-enc. rerank | 0.32 | 0.50 | 0.57 | 0.70 |
| Hierarchical (section $\to$ clause) | 0.22 | 0.26 | 0.35 | 0.47 |

*Break-even (struct. beats raw)*: $\gtrsim 0.75$

Table 9: Stronger, additional retrieval pipelines do *not* clear the exact-recall diagnostic. Beyond the four off-the-shelf retrievers, we test query expansion, a hybrid first stage with cross-encoder reranking of the top-20 candidates, and a coarse-to-fine hierarchical retrieve. The best recall@5 reaches only $0.38$ (airline) / $0.58$ (retail), still well below the $\gtrsim 0.75$ diagnostic crossing; hierarchical routing is worse because the coarse step drops the gold clause. The exact-match shortfall is thus not an artifact of weak/off-the-shelf retrieval.
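For reference, exact-match recall@k as used in this comparison can be computed as in the sketch below. The function names and the data layout (a ranked list of clause ids plus one gold clause id per state) are assumptions for illustration; only the $\gtrsim 0.75$ break-even value comes from the table.

```python
def recall_at_k(ranked_ids, gold_id, k):
    """Return 1.0 if the gold clause id is among the top-k ranked ids, else 0.0."""
    return 1.0 if gold_id in ranked_ids[:k] else 0.0


def mean_recall_at_k(examples, k):
    """Average exact-match recall@k over (ranked_ids, gold_id) pairs."""
    if not examples:
        raise ValueError("need at least one example")
    return sum(recall_at_k(ranked, gold, k) for ranked, gold in examples) / len(examples)


def clears_crossing(recall, threshold=0.75):
    """Compare a recall value with the approximate break-even level.

    The source gives the crossing only approximately (about 0.75), so the
    default threshold is an illustrative cut-off, not an exact boundary.
    """
    return recall >= threshold
```

Under this definition a near-duplicate or paraphrased clause counts as a miss, which is why exact-match recall can understate how useful the retrieved text is to the classifier.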
#### Per-class identifiability.

Table [10](https://arxiv.org/html/2606.23937#A3.T10) breaks airline policy-clause identifiability down by gold decision class.

Table 10: Policy identifiability by decision class (airline, MiniLM over the 122-clause pool, request $+$ evidence query). Exact-match recall varies substantially by decision class: recall@5 is $0.00$ for `allow`, $0.05$ for `verify`, and $0.50$ for `refuse`. The `refuse` class has the highest exact-match recall, plausibly because prohibitive clauses use more distinctive language. Per-class $n$ is small, so this analysis is descriptive. (This breakdown uses the MiniLM retriever; the per-class recalls therefore aggregate to $\approx 0.13$, slightly below the $0.16$ headline figure, which comes from the stronger hybrid retriever. The conclusion is unchanged either way.)
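One way to read the aggregation in the caption, assuming the overall figure is a per-state average (so each class is weighted by its number of states), is the following. The symbols $n_c$ (states in class $c$) and $r_c$ (recall@5 for class $c$) are notation introduced here, not the paper's, and the per-class counts are not given in this section.

$$
R@5_{\text{overall}} = \frac{\sum_{c} n_c \, r_c}{\sum_{c} n_c}
= \frac{0.00\, n_{\text{allow}} + 0.05\, n_{\text{verify}} + 0.50\, n_{\text{refuse}}}{n_{\text{allow}} + n_{\text{verify}} + n_{\text{refuse}}}
$$

Under that reading, the aggregate of $\approx 0.13$ sits below the unweighted class mean, $(0.00 + 0.05 + 0.50)/3 \approx 0.18$, which would imply that the low-recall classes contribute more states than `refuse` does.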
#### Direct-policy seed-level results.

Table [11](https://arxiv.org/html/2606.23937#A3.T11) reports the per-seed macro-F1 values behind the 3B direct-policy intervention in Table [1](https://arxiv.org/html/2606.23937#S4.T1) and the corresponding 7B check.

Table 11: Seed-level direct retrieved-policy results. Each cell is macro-F1 on the same $n=85$ states; means average the three seeds. The 3B aggregate CIs are in Table [1](https://arxiv.org/html/2606.23937#S4.T1). At 7B, MiniLM top-1 vs. gold is $-0.017$ (task-cluster CI $[-0.201, +0.158]$), while `bge-large` top-1 vs. gold is $+0.059$ (CI $[-0.189, +0.304]$).
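A task-cluster interval of the kind quoted above can be obtained with a cluster bootstrap: resample whole tasks with replacement and recompute the macro-F1 difference on each resample. The sketch below is one plausible implementation under stated assumptions, not the authors' procedure: the row layout, the percentile interval, the number of resamples, and scoring an absent class as F1 = 0 are all choices made for illustration.

```python
import random
from collections import defaultdict


def macro_f1(gold, pred, labels):
    """Unweighted mean of per-class F1; a class with no support or predictions scores 0."""
    scores = []
    for c in labels:
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def cluster_bootstrap_delta(rows, labels, n_boot=2000, seed=0):
    """Percentile 95% CI for macro-F1(a) - macro-F1(b), resampling tasks.

    `rows` holds (task_id, gold_label, pred_a, pred_b) tuples, one per state.
    Returns (point_estimate, lower, upper).
    """
    by_task = defaultdict(list)
    for task_id, gold, pred_a, pred_b in rows:
        by_task[task_id].append((gold, pred_a, pred_b))
    tasks = sorted(by_task)
    rng = random.Random(seed)

    def delta(sample):
        gold = [g for g, _, _ in sample]
        f1_a = macro_f1(gold, [a for _, a, _ in sample], labels)
        f1_b = macro_f1(gold, [b for _, _, b in sample], labels)
        return f1_a - f1_b

    point = delta([r for t in tasks for r in by_task[t]])
    deltas = []
    for _ in range(n_boot):
        # Draw as many tasks as there are, with replacement; keep all their states.
        drawn = [rng.choice(tasks) for _ in tasks]
        deltas.append(delta([r for t in drawn for r in by_task[t]]))
    deltas.sort()
    lower = deltas[int(0.025 * n_boot)]
    upper = deltas[min(n_boot - 1, int(0.975 * n_boot))]
    return point, lower, upper
```

Resampling tasks rather than states respects the correlation between states from the same task, which is one reason such intervals come out wide when the number of tasks is small.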
#### Observational retrieval–accuracy split (reported, not relied upon).

The retrieved-policy top-1 classifier is not better when the gold clause is in the top-5 ($0.36$, $n=11$) than when it is missed ($0.46$, $n=74$), so we treat this correlation as difficulty-confounded and rely on the controlled injection/direct-policy tests instead.

## Appendix D Reproducibility: fine-tuning configuration

Because several of our main results come from supervised fine-tuning (SFT) of the classifier (Tables [3](https://arxiv.org/html/2606.23937#A2.T3), [5](https://arxiv.org/html/2606.23937#A2.T5), [7](https://arxiv.org/html/2606.23937#A2.T7)), we disclose the training configuration. The frozen-encoder controls (Tables [4](https://arxiv.org/html/2606.23937#A2.T4), [8](https://arxiv.org/html/2606.23937#A2.T8)) use only a balanced logistic-regression head on `all-MiniLM-L6-v2` embeddings.

#### Configuration.

Table [12](https://arxiv.org/html/2606.23937#A4.T12) lists the fine-tuning configuration. Untuned runs share one recipe ($\text{lr}=5\times 10^{-5}$, 5 epochs); tuned runs select lr/epochs on a task-disjoint validation split (Table [13](https://arxiv.org/html/2606.23937#A4.T13)).

Table 12: Gate fine-tuning configuration from the training script and committed per-run records.

Table 13: Tuned configuration per (model, representation) cell. Grid lr $\in \{2,5,10\}\times 10^{-5}$ and epochs $\in \{3,5,8\}$; best validation cell retrained on fit $+$ val and evaluated once on the untouched test set (3 seeds).
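The selection step described for Table 13 can be sketched as a plain grid search over the nine (lr, epochs) cells. This is a minimal sketch, not the authors' training script: `validate` is a hypothetical callback standing in for "fine-tune on the fit split and return validation macro-F1", and `select_config` is an illustrative name.

```python
from itertools import product

# Grid values as stated in the Table 13 caption.
LEARNING_RATES = [2e-5, 5e-5, 1e-4]
EPOCHS = [3, 5, 8]


def select_config(validate, learning_rates=LEARNING_RATES, epochs=EPOCHS):
    """Return (best_config, best_score) over the lr x epochs grid.

    `validate(lr, n_epochs)` is a hypothetical function that trains one
    configuration and returns its validation score (higher is better).
    Ties keep the first configuration encountered.
    """
    best = None
    for lr, n_epochs in product(learning_rates, epochs):
        score = validate(lr, n_epochs)
        if best is None or score > best[1]:
            best = ({"lr": lr, "epochs": n_epochs}, score)
    return best
```

After selection, the chosen cell would be retrained on the combined fit and validation data and scored once on the test set, as the caption describes; that step is omitted here because it depends on the training code.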
