# Beyond the Black Box: Interpretability of Agentic AI Tool Use
Source: [https://arxiv.org/html/2605.06890](https://arxiv.org/html/2605.06890)
Hariom Tatsat (Quantitative Analytics, Barclays), hariom.x.tatsat@barclays.com
Ariye Shater (Quantitative Analytics, Barclays), ariye.shater@barclays.com
###### Abstract
AI agents are promising for high-stakes enterprise workflows, but dependable deployment remains limited because tool-use failures are difficult to diagnose and control. Agents may skip required tool calls, invoke tools unnecessarily, or take actions whose consequence becomes visible only after execution. Existing observability methods are mostly external: prompts reveal correlations, and evaluations score outputs only after the model has already acted. In long-horizon settings, these failures are especially costly because an early tool mistake can alter the rest of the trajectory, increase token consumption, and create downstream safety and security risk.
We introduce a mechanistic-interpretability toolkit built on Sparse Autoencoders (SAEs) and linear probes. The framework reads model states before each action and infers both whether a tool is needed and how consequential the next tool action is likely to be. By decomposing activations into sparse features, it identifies the internal layers and features most associated with tool decisions and tests their functional importance through feature ablation. We train the probes on multi-step trajectories from the NVIDIA Nemotron function-calling dataset and apply the same workflow to GPT-OSS 20B and Gemma 3 27B models.
The goal is not to replace external evaluation, but to add a missing layer: visibility into what the model signaled internally before action. This helps surface deeper causes of agent failure, especially in long-horizon runs where an early mistake can reshape the rest of the agentic interaction. More broadly, the paper shows how mechanistic interpretability can support practical internal observability for monitoring tool calls and risk in agent systems.
## 1 Introduction
An *AI agent* solves a task through repeated decision steps rather than a single response. At each step, it can either answer directly or delegate to an external tool, observe the result, and continue. We study this *tool-decision boundary*: whether the model should call a tool now, and how consequential that action is likely to be.
This boundary is operationally important because tool-use failures are often difficult to diagnose before errors become visible. An agent may skip a required tool call, invoke a tool unnecessarily, or take an action whose consequence becomes visible only after execution. Standard observability methods remain incomplete here. Prompting reveals correlations rather than mechanisms, and behavioral evaluation measures outputs only after the model has already acted. The Berkeley Function Calling Leaderboard (BFCL) reflects this limitation in its multi-turn design by combining *state-based* and *response-based* checks, since read-only tool chains may be invisible to state-only evaluation (Patil et al., 2025). Related work on tool-selection hallucinations points in the same direction: hidden states can contain useful same-pass signals for tool-call errors that are not visible from outputs alone (Healy et al., 2026).
We address this gap with an internal monitoring framework that reads model activations immediately before each action and estimates whether the model is internally preparing to delegate to a tool and, secondarily, whether that action is likely to be low, medium, or high risk.
Mechanistic interpretability provides a mechanism to inspect model internals. Linear probes are classifiers trained on intermediate representations to test whether a target concept is linearly recoverable from them (Alain & Bengio, 2017). Sparse Autoencoders (SAEs) go further, decomposing activations into a sparse basis of more interpretable features (Bricken et al., 2023). We combine both: pre-action activations are first mapped into a sparse feature basis via an SAE, then read out by two task-specific probes: the *Tool-Need Probe* (binary: tool call vs. no tool call) and the *Tool-Risk Probe* (ternary: low / medium / high risk). This pipeline is lightweight yet interpretable, recovering tool-decision signals from internal state, pinpointing the layers where these signals are strongest, and surfacing the individual features most predictive of tool use and risk. Ablating those features then provides causal confirmation. The paper makes four contributions:
- A pre-action internal monitoring framework for repeated tool decisions in agent trajectories.
- Two complementary readouts: the Tool-Need Probe (Probe 1) and the Tool-Risk Probe (Probe 2).
- Localization of tool-decision signals to sparse features and late layers, with feature ablation.
- Evaluation on GPT-OSS 20B and Gemma 3 27B instruction-tuned (IT) models using held-out Nemotron test data and zero-shot BFCL transfer.

More broadly, the paper shows how mechanistic interpretability can support a practical form of internal observability for agent systems: not just explaining failures after the fact, but helping monitor tool calls and risk before execution.
Section [2](https://arxiv.org/html/2605.06890#S2) positions the paper relative to prior work and states the research questions. Section [3](https://arxiv.org/html/2605.06890#S3) defines the decision-point formulation, datasets, internal-state extraction, probe setup, and feature ablation method. Section [4](https://arxiv.org/html/2605.06890#S4) presents the main empirical results, including held-out Nemotron performance, the illustrative financial trace, layer concentration, and ablation. Section [5](https://arxiv.org/html/2605.06890#S5) evaluates held-out replay and zero-shot BFCL transfer, and Section [6](https://arxiv.org/html/2605.06890#S6) discusses implications, deployment considerations, and limitations.
## 2 Related Work and Research Questions
Prior work on tool use has mostly evaluated agents from the outside, through end-task success, function-call correctness, or benchmark-specific response scoring. This external evaluation tradition includes learned tool-use setups such as Toolformer and broader function-calling / API benchmarks such as ToolLLM / ToolBench, ToolACE, HammerBench, and BFCL (Schick et al., 2023; Qin et al., 2023; Liu et al., 2024; Wang et al., 2025; Patil et al., 2025). BFCL is the most relevant benchmark for our setting because it evaluates abstention and multi-turn behavior, and combines *state-based* and *response-based* checks when read-only tool chains are not visible from final state alone. These benchmarks are essential for measuring observable behavior, but they do not reveal whether the model had internally recognized the need to delegate before acting.
A second line of work studies hidden states directly. Activation-probing results show that internal representations can predict downstream behavior before it is externally visible (Li et al., 2025; McKenzie et al., 2025). In tool-selection settings, internal representations have also been shown to distinguish correct from hallucinated tool calls, with calibration playing an important role when such probes are intended for deployment rather than only offline analysis (Healy et al., 2026). In parallel, Sparse Autoencoder work shows that dense activations can be decomposed into more interpretable sparse features, making it possible to localize semantically meaningful internal components rather than operate only on opaque residual vectors (Bricken et al., 2023; Cho et al., 2025).
Our work connects these directions in a multi-step agent setting. Rather than evaluating tool use only from outputs, we monitor model state immediately before each action. Rather than probing dense hidden states alone, we probe SAE features that support layer localization, sparse feature inspection, and ablation. This is especially relevant for long-horizon agents, where early tool or coordination failures can propagate through the rest of the trajectory (Cemri et al., 2025). It is also aligned with the view that external tools should be invoked when they are epistemically necessary, rather than reflexively (Wang et al., 2025). It also complements our earlier finance-focused study of mechanistic interpretability, which examined domain-specific LLM behavior rather than pre-action tool decisions in agent trajectories (Tatsat & Shater, 2025).
The paper is organized around four research questions. *RQ1* asks whether model activations encode whether a tool should be used at a given decision step. *RQ2* asks which sparse features and layers most strongly encode tool-need and tool-risk signals. *RQ3* asks whether internal signals can surface missed and unnecessary tool calls more clearly than output-only monitoring. *RQ4* asks whether these signals remain useful across repeated decision points and under zero-shot transfer to BFCL. RQ1 and RQ2 are addressed mainly in Sections [3](https://arxiv.org/html/2605.06890#S3) and [4](https://arxiv.org/html/2605.06890#S4); RQ3 and RQ4 are addressed mainly in Sections [4](https://arxiv.org/html/2605.06890#S4), [5](https://arxiv.org/html/2605.06890#S5), and [6](https://arxiv.org/html/2605.06890#S6).
## 3 Problem Setup and Method
We study agent behavior at repeated *tool decision points*. At each step, we compare three quantities: what the task requires, what the model internally signals, and what the runtime actually does. This three-way view lets us distinguish the main cases that matter for monitoring: correct tool use, missed tool calls, unnecessary tool calls, high-risk actions, and uncertain decisions. Figure [1](https://arxiv.org/html/2605.06890#S3.F1) summarizes the full decision-boundary pipeline.
Figure 1: Framework overview for mechanistic monitoring of multi-step agent tool decisions. Agent trajectories are transformed into decision-boundary contexts, mapped into pre-action activations, decomposed with Sparse Autoencoders, and used by the Tool-Need Probe (Probe 1) and Tool-Risk Probe (Probe 2) before execution.
Table [1](https://arxiv.org/html/2605.06890#S3.T1) summarizes the operational outcomes used throughout the paper. The Tool-Need Probe provides the internal tool signal, while the Tool-Risk Probe estimates the likely risk tier of the next tool action. This compact framing replaces a longer failure taxonomy while preserving the cases that matter most for runtime monitoring.
Table 1: Operational outcomes used throughout the paper.
### 3.1 Data preparation
We convert raw multi-step agent trajectories into per-step decision rows. Each row contains cumulative context truncated at the decision boundary, a binary label indicating whether a tool is required, and a three-level risk label for the next tool action. This preserves a faithful pre-action view: the probe never sees the current step's output or the future trajectory when computing its prediction.
The training data comes from the NVIDIA Nemotron function-calling dataset (Chandiramani et al., 2026), where each raw row corresponds to one decision point in a multi-step trajectory. We group rows by trajectory, order them by depth, reconstruct the cumulative context available at each step, and assign a binary `tool_needed` label from the gold next action. Tool-call steps are additionally assigned one of three risk tiers: *low*, *medium*, or *high*. Low-risk actions are predominantly read-only retrieval or lookup steps; medium-risk actions involve bounded creation or write operations; high-risk actions include authentication, outbound communication, or dangerous execution actions. Table [13](https://arxiv.org/html/2605.06890#A2.T13) in Appendix [B](https://arxiv.org/html/2605.06890#A2) summarizes the keyword groups used to instantiate this Nemotron risk-tier scheme.
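For concreteness, the sketch below shows one way this per-step reconstruction could look in Python. It is a minimal sketch rather than the paper's actual pipeline: field names such as `depth`, `next_action`, and `text` are assumptions, and `assign_risk_tier` is a placeholder for the keyword-group scheme summarized in Appendix B.

```python
from dataclasses import dataclass

def assign_risk_tier(tool_name: str) -> str:
    # Placeholder; an illustrative keyword-group version appears in Appendix B.
    return "low"

@dataclass
class DecisionRow:
    context: str           # cumulative transcript truncated at the decision boundary
    tool_needed: int       # 1 if the gold next action is a tool call
    risk_tier: str | None  # "low" / "medium" / "high"; None on no-tool steps

def build_decision_rows(trajectory: list[dict]) -> list[DecisionRow]:
    """Turn one multi-step trajectory into per-step, pre-action decision rows."""
    rows: list[DecisionRow] = []
    context = ""
    for step in sorted(trajectory, key=lambda s: s["depth"]):  # order steps by depth
        action = step["next_action"]                 # gold next action (assumed field)
        is_tool = action["type"] == "tool_call"
        rows.append(DecisionRow(
            context=context,                         # never includes this step's output
            tool_needed=int(is_tool),
            risk_tier=assign_risk_tier(action["name"]) if is_tool else None,
        ))
        context += step["text"]                      # only later rows see this step
    return rows
```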
Probes are trained only on Nemotron-derived step rows. BFCL is reserved for zero-shot transfer evaluation, using the same per-step reconstruction and pre-action probe inference but a different benchmark distribution.
### 3.2 Internal state extraction
We apply the same decision-point pipeline to both backbones: identical per-step context, omission of the current step's generated output from the activation prompt, and layer-wise SAE encoding of pre-action hidden states. For both models, hidden states are mean-pooled over the last 32 pre-action tokens before SAE encoding, rather than read from a single token alone. This choice provides a practical balance between capturing enough immediate context to stabilize the decision signal and keeping activation extraction computationally manageable at runtime.
For GPT-OSS 20B, we read six post-residual layers and encode them with public GPT-OSS SAEs. For Gemma 3 27B, we read four post-block residual layers and encode them with Gemma Scope SAEs. The important point for the paper is not the exact dimensionality of each concatenated vector, but that both models are processed with the same decision-boundary logic and the same probe-based monitoring recipe.
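A minimal extraction sketch under these assumptions might look as follows. It assumes a Hugging Face-style model exposing `output_hidden_states`, SAE encoder weights exposed as `W_enc`/`b_enc`, and a ReLU nonlinearity (one common SAE choice); it is illustrative, not the paper's exact code.

```python
import torch

POOL_TOKENS = 32  # mean-pooling window over the last pre-action tokens (Section 3.2)

@torch.no_grad()
def extract_sae_features(model, tokenizer, sae_by_layer: dict, context: str) -> torch.Tensor:
    """Pool pre-action hidden states per layer, encode with that layer's SAE, concatenate."""
    inputs = tokenizer(context, return_tensors="pt")
    out = model(**inputs, output_hidden_states=True)
    feats = []
    for layer_idx, sae in sae_by_layer.items():
        h = out.hidden_states[layer_idx][0]             # (seq_len, d_model)
        pooled = h[-POOL_TOKENS:].mean(dim=0)           # mean over last 32 pre-action tokens
        z = torch.relu(sae.W_enc @ pooled + sae.b_enc)  # sparse code; ReLU nonlinearity assumed
        feats.append(z)
    return torch.cat(feats)  # concatenated SAE features across the monitored layers
```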
### 3.3 Probe training
The *Tool-Need Probe* is the primary probe: it predicts whether a tool call is required at the current decision step. The *Tool-Risk Probe* is secondary: at tool-call steps it predicts whether the next action is low, medium, or high risk. Both probes operate on SAE features rather than raw activations, which makes it possible to inspect layer concentration, identify top sparse features, and test feature necessity through ablation.
Formally, let $\tilde{h}^{(\ell)} \in \mathbb{R}^{d}$ denote the pooled pre-action hidden state at layer $\ell$. For each selected layer, a pretrained SAE maps this hidden state to a sparse feature vector

$$z^{(\ell)} = \phi\!\left(W_{\mathrm{enc}}^{(\ell)} \tilde{h}^{(\ell)} + b_{\mathrm{enc}}^{(\ell)}\right),$$

where $W_{\mathrm{enc}}^{(\ell)}$ and $b_{\mathrm{enc}}^{(\ell)}$ are the SAE encoder weights and bias for layer $\ell$, and $\phi(\cdot)$ denotes the SAE nonlinearity. We concatenate SAE features across the selected layers,

$$z = \left[\, z^{(\ell_1)};\, \cdots;\, z^{(\ell_m)} \,\right],$$

where $m$ is the number of selected layers. We then fit linear probes on $z$ rather than on raw activations. For Tool-Need, with binary label $y \in \{0, 1\}$,

$$p(y = 1 \mid z) = \sigma\!\left(w^{\top} z + b\right),$$

where $w$ and $b$ are probe parameters and $\sigma(\cdot)$ is the logistic sigmoid. Tool-Risk uses a three-way softmax over $\{\mathrm{low}, \mathrm{med}, \mathrm{high}\}$. The two probes are trained independently, with distinct targets and feature-ranking criteria, but are evaluated under the same per-step runtime framework. All remaining notation follows standard linear-algebra conventions.
Each probe is implemented as a sparse logistic classifier over SAE features, with feature selection based on how well each feature separates the target classes and regularization chosen from ridge, lasso, or elastic net. Regularization is applied because the SAE feature space is high-dimensional and often contains correlated latents, so some shrinkage helps control overfitting while keeping the readout interpretable.
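The following sketch shows how such a readout could be fit with scikit-learn, assuming a matrix `Z` of concatenated SAE codes and labels `y` built as in Section 3.1. The selection criterion (an F-test) and hyperparameters are illustrative stand-ins, not the paper's tuned choices.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def fit_probe(Z: np.ndarray, y: np.ndarray, k: int = 200):
    """Sparse logistic readout: rank SAE features by class separation, then fit."""
    probe = make_pipeline(
        SelectKBest(f_classif, k=k),               # keep the k most separating features
        LogisticRegression(penalty="elasticnet",   # elastic-net shrinkage (one of the
                           l1_ratio=0.5,           # ridge/lasso/elastic-net options)
                           solver="saga", max_iter=5000),
    )
    probe.fit(Z, y)   # binary y for Tool-Need; {low, med, high} y for Tool-Risk,
    return probe      # where the saga solver uses a multinomial softmax
```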
To make representative SAE features easier to interpret, we apply an automated feature-labeling step to a small number of selected features. In this workflow, top-activating examples are summarized into short natural-language descriptions using an LLM. More details are deferred to Appendix [A](https://arxiv.org/html/2605.06890#A1).
### 3.4 Feature ranking and ablation
To test whether top sparse features are merely correlated with probe predictions or are functionally necessary, we perform representational ablation directly in SAE latent space. After encoding a step into sparse features, we select a small set of highly ranked latents, zero them, re-run the probe, and compare the ablated prediction with the baseline prediction. If suppressing a small set of latents sharply reduces probe confidence or flips the label, those features are causally important to the probe's prediction.
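A minimal sketch of this check, assuming a plain logistic probe fit directly on the full concatenated feature vector `z` (so `probe.coef_` indexes SAE latents one-to-one) and ranking latents by absolute probe weight as one illustrative criterion:

```python
import numpy as np

def ablate_top_latents(probe, z: np.ndarray, n_ablate: int = 10):
    """Zero the top-ranked SAE latents and compare probe output to baseline."""
    p_base = probe.predict_proba(z[None, :])[0, 1]        # baseline p_tool
    top = np.argsort(-np.abs(probe.coef_[0]))[:n_ablate]  # highest-|weight| latents
    z_ablated = z.copy()
    z_ablated[top] = 0.0                                  # suppress those latents
    p_ablated = probe.predict_proba(z_ablated[None, :])[0, 1]
    flipped = (p_base >= 0.5) != (p_ablated >= 0.5)       # did the label change?
    return abs(p_base - p_ablated), flipped               # |Δp_tool| and flip flag
```

A matching control, per the paper's design, ablates equally many randomly chosen latents and confirms the effect is negligible.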
### 3.5 Evaluation metrics
We report Tool-Need accuracy, precision, recall, and F1; Tool-Risk accuracy and macro-F1; and runtime alignment between expected labels, internal probe decisions, and actual execution. We also report missed-tool warning rates, unnecessary-call warning rates, and risk alerts in replay and transfer settings.
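The headline probe metrics correspond to standard scikit-learn calls, as in this small sketch (assuming arrays of true and predicted labels):

```python
from sklearn.metrics import accuracy_score, f1_score, precision_recall_fscore_support

def tool_need_metrics(y_true, y_pred) -> dict:
    """Accuracy, precision, recall, F1 for the binary Tool-Need Probe."""
    p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="binary")
    return {"accuracy": accuracy_score(y_true, y_pred),
            "precision": p, "recall": r, "f1": f1}

def tool_risk_metrics(y_true, y_pred) -> dict:
    """Accuracy and macro-F1 for the ternary Tool-Risk Probe."""
    return {"accuracy": accuracy_score(y_true, y_pred),
            "macro_f1": f1_score(y_true, y_pred, average="macro")}
```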
## 4 Experiments and Results
Table 2: Main results: GPT-OSS 20B vs Gemma 3 27B instruction-tuned (IT) on the core held-out Nemotron test dimensions.
Table [2](https://arxiv.org/html/2605.06890#S4.T2) gives the headline held-out Nemotron results across both models. The key result is that the Tool-Need Probe (Probe 1) provides the stronger and more stable signal at the tool-call boundary, while the Tool-Risk Probe (Probe 2) becomes useful once a tool call is warranted, where it helps distinguish lower- from higher-risk actions but is more sensitive to risk-class structure and transfer setting. In both models, the strongest tool-decision information is concentrated in later layers. As described in Section [3](https://arxiv.org/html/2605.06890#S3) under *Probe training*, the number of features varies across probes: GPT-OSS Probe 1 uses lasso with 200 features, all other probes use elastic net, Gemma Probe 1 uses 2000 features, and Probe 2 uses 1000 for both models.
### 4.1 Tool-Need Probe results
Both models contain a linearly recoverable tool-decision signal in their SAE representations.
Table 3: Tool-Need Probe confusion matrices on held-out Nemotron test for GPT-OSS and Gemma.
Table 4: Tool-Risk Probe confusion matrices on held-out Nemotron data. Rows = true tier; columns = predicted tier.
GPT-OSS achieves 75.3% accuracy on the held-out Nemotron test, while Gemma achieves 71.4%. The confusion matrices in Table [3](https://arxiv.org/html/2605.06890#S4.T3) show that both probes behave as reasonably balanced classifiers rather than simply exploiting label frequency. This tool-need signal is the paper's primary contribution: it is recoverable with compact feature sets, interpretable in layer space, and later transfers as an omission-auditing signal in runtime settings.
### 4.2 Tool-Risk Probe results
Knowing that a tool should be called is necessary but not sufficient. Two tool actions can differ substantially in risk even when both are valid. Probe 2 therefore asks whether internal representations encode not only the decision to call a tool, but also the severity of the next external action.
Both models show strong held-out Nemotron accuracy: 90.3% for GPT-OSS and 88.5% for Gemma on tool-call rows. The dominant pattern is strong low-tier recovery, weaker medium-tier separation, and moderate high-tier recovery. This asymmetry is consistent across both models, suggesting it reflects the risk-tier structure rather than a model-specific artifact. Table [4](https://arxiv.org/html/2605.06890#S4.T4) makes this pattern explicit.
### 4.3 Illustrative multi-step financial information trace
We retain one illustrative financial trajectory from the Nemotron distribution (trajectory_id 3344) to show how the two probes evolve across repeated decision points. This subsection is intended as a worked trace rather than as a standalone evaluation. The quantitative evidence for the paper remains the held-out metrics, confusion matrices, feature tables, and ablation results; the full step-by-step trace and the corresponding plot appear in Appendix [C](https://arxiv.org/html/2605.06890#A3), specifically Figure [2](https://arxiv.org/html/2605.06890#A3.F2) and Table [15](https://arxiv.org/html/2605.06890#A3.T15).
The main value of this example is that the *Tool-Need Probe (Probe 1)* rises on steps that genuinely require external financial retrieval and falls on follow-up turns where no new tool call is needed, even though the discussion remains financial. The *Tool-Risk Probe (Probe 2)* stays predominantly low on those retrieval steps, which shows that the risk signal is not redundant with tool need.
Table [5](https://arxiv.org/html/2605.06890#S4.T5) summarizes the key steps, abbreviated prompts, and probe outputs.
Table 5: Illustrative steps from the Nemotron financial trajectory (trajectory_id 3344). Probe 2 probabilities are shown as $(p_{\mathrm{low}}, p_{\mathrm{med}}, p_{\mathrm{high}})$.
Table 6: Layer concentration of top-20 Probe-1 features by model.
The key transitions are intuitive. Steps 4, 7, 10, 12, and 14 are read-only financial-information retrieval requests, and these steps coincide with elevated Probe 1 scores while Probe 2 remains predominantly low. Step 15 differs: it is a follow-up turn in the same conversation, and here $p_{\mathrm{tool}}$ drops while the Probe 2 distribution becomes more mixed. Taken on its own, this trace is only illustrative, but it helps show how the probe outputs can evolve differently across successive decision points within a single multi-step interaction.
### 4.4 Layer and feature analysis
Both models concentrate their tool-decision signal in late transformer layers. Table [6](https://arxiv.org/html/2605.06890#S4.T6) shows how the top Probe 1 features cluster toward the final monitored layers in both backbones.
This pattern is consistent with a late-stage decision-encoding hypothesis: the strongest tool-decision features appear near the model's final commitment stage before action.
Table [7](https://arxiv.org/html/2605.06890#S4.T7) reports representative top SAE features for Probe 1 (Tool-Need), and Table [8](https://arxiv.org/html/2605.06890#S4.T8) reports the corresponding representative features for Probe 2 (Tool-Risk).
Table 7: Representative top SAE features for Probe 1 (Tool-Need), shown separately for GPT-OSS and Gemma.
Table 8: Representative top SAE features for Probe 2 (Tool-Risk), shown separately for GPT-OSS and Gemma.
These feature tables matter because they show that the probes are not only accurate but also readable at the feature level. Table [7](https://arxiv.org/html/2605.06890#S4.T7) highlights the numerical and formal-language features associated with tool-call decisions, while Table [8](https://arxiv.org/html/2605.06890#S4.T8) shows that the strongest Probe 2 features emphasize authentication, account, and credential-related concepts rather than tool names alone. This supports the claim that Probe 2 reads action consequence from reasoning context rather than from a static tool-name lookup.
### 4.5 Feature ablation
Representational ablation confirms that the top sparse features are functionally necessary, not merely correlated. Table [9](https://arxiv.org/html/2605.06890#S4.T9) summarizes the effect of ablating high-ranked latent sets.
Table 9: Tool-Need Probe ablation results (10 held-out Nemotron steps per model). $\Delta p$ = mean $|\Delta p_{\mathrm{tool}}|$; Flip = binary prediction flips out of 10.
Table 10: Held-out Nemotron replay runtime profile (760 episodes, GPT-OSS).
For GPT-OSS, a small set of top-ranked features is sufficient to sharply reduce $p_{\mathrm{tool}}$ and flip several predictions, with most of the strongest ablated features coming from the final layer. Gemma shows the same pattern more gradually, requiring larger ablation sets because the probe spans a broader sparse input. Random-feature ablation produces negligible effects, supporting the claim that the identified features are specific components of the signal the probe reads out.
## 5 Runtime Monitoring and Cross-Dataset Generalization
### 5.1 Training and transfer setup
Probes are trained on *Nemotron* data only. Held-out *Nemotron replay* provides the same-distribution runtime check for GPT-OSS 20B, while *BFCL* is used as a strict zero-shot transfer benchmark with no retraining or threshold tuning (Patil et al., 2025). This section therefore separates familiar-distribution replay from cross-benchmark transfer.
### 5.2 Held-out Nemotron replay
We first evaluate the monitor on held-out Nemotron replay using GPT-OSS 20B, the same model family used for activation extraction. The main result is that missed tool calls dominate, while tool naming is comparatively strong once the model commits to a tool call. This supports interpreting Probe 1 primarily as a *tool-call monitor* rather than a formatting checker. Table [10](https://arxiv.org/html/2605.06890#S4.T10) summarizes the replay profile.
Held-out Nemotron replay shows that the central problem is not naming a tool once the model has decided to use one; it is deciding to delegate at all. This is the clearest runtime justification for Probe 1.
### 5.3 BFCL as a zero-shot transfer benchmark
BFCL is used here as a strict zero-shot transfer benchmark and as the clearest cross-benchmark generalization test. No BFCL activations are used for training, retraining, threshold tuning, or calibration. Instead, BFCL multi-turn episodes are mapped into the same step-level format used for Nemotron: the cumulative transcript becomes `context`, gold call annotations determine `tool_needed`, and BFCL tools are heuristically projected into the Nemotron low/medium/high risk-tier scheme for Probe 2. This preserves the same pre-action monitoring setup while exposing the probes to a different benchmark distribution. Table [11](https://arxiv.org/html/2605.06890#S5.T11) summarizes the transfer results, while Appendix [C](https://arxiv.org/html/2605.06890#A3) includes Table [14](https://arxiv.org/html/2605.06890#A3.T14), an illustrative formatting contrast between the two benchmark styles.
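A hedged sketch of this episode-to-row projection is shown below. The field names (`gold_calls`, `text`) are assumptions, and `assign_risk_tier` stands in for the keyword-group heuristic sketched in Appendix B.

```python
def assign_risk_tier(tool_name: str) -> str:
    # Placeholder; an illustrative keyword-group version appears in Appendix B.
    return "low"

def bfcl_episode_to_rows(episode: list[dict]) -> list[dict]:
    """Map one BFCL multi-turn episode into Nemotron-style pre-action step rows."""
    rows, context = [], ""
    for turn in episode:
        gold_calls = turn.get("gold_calls", [])     # gold call annotations (assumed field)
        rows.append({
            "context": context,                     # cumulative pre-action transcript
            "tool_needed": int(bool(gold_calls)),
            # Heuristic projection into the Nemotron low/medium/high tiers:
            "expected_tier": assign_risk_tier(gold_calls[0]["name"])
                             if gold_calls else None,
        })
        context += turn["text"]                     # only later rows see this turn
    return rows
```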
Table 11: BFCL zero-shot transfer summary.
The main out-of-distribution result is tool-call transfer. Both models show expected–internal agreement above 77% on BFCL despite no BFCL activations being used in training, indicating transfer beyond Nemotron. GPT-OSS Probe 1 predicts a tool on nearly every BFCL step, while Gemma retains two-sided behavior with a 0.2% missed-tool rate but still over-triggers on no-tool steps. Probe 1 therefore transfers usefully to BFCL, though the two models show different error profiles under distribution shift.
Probe 2 also transfers usefully, though somewhat less well. This is not surprising, since cross-benchmark risk transfer depends on how well tool categories and risk tiers are specified and aligned across datasets.
Failures occur early in both models: mean first-failure falls within the first one or two turns, including 1.17 turns for Gemma on the merged BFCL slice. This suggests a short but practically relevant intervention window for pre-execution monitoring. Overall, BFCL should be read as a transfer stress test: Probe 1 transfers best as an omission-auditing signal at the tool-call boundary, while Probe 2 remains useful for separating lower- from higher-risk actions once a tool call is in play.
## 6 Discussion and Limitations
The broader contribution is to show that mechanistic interpretability can become *operationally useful* for agent systems: extending our earlier *Beyond the Black Box* study of LLM interpretability (Tatsat & Shater, 2025) into agent settings, this paper shows how internal monitoring can move beyond explaining behavior after failure to helping monitor tool decisions before action in realistic high-stakes workflows.
A second advantage is generality. Because the monitor operates on internal representations rather than tool-specific output patterns, the same probe framework can apply across multiple tools and repeated decision points. Probe 1 asks whether a tool call is needed at all, while Probe 2 asks whether the next tool action appears more consequential; feature tables and ablations show that these signals are both predictive and localized to late sparse features.
This is also where SAE-based monitoring adds something beyond external observability. The framework does not simply emit a scalar warning; it identifies where the signal is concentrated, which sparse features are most associated with the decision, and whether suppressing those features changes the probe output.
The runtime results suggest that the framework is most useful as an oversight layer at high-value decision points. Held-out Nemotron replay shows that the main bottleneck is deciding to delegate at all, not naming a tool once delegation has begun. BFCL then serves as a transfer stress test: Probe 1 transfers best as an omission-auditing signal, while Probe 2 remains useful for risk-tiering once a tool call is underway. We therefore interpret the Nemotron–BFCL gap as reflecting both task transfer and added benchmark-specific instruction-following pressure, not simply loss of the internal signal itself.
#### Limitations.
Tool-Need is the stronger and more stable probe, while Tool-Risk depends more directly on the risk-tier scheme used to label actions. The Nemotron and BFCL risk labels are heuristic rather than a shared industry standard, so out-of-domain risk predictions should be interpreted with caution. More broadly, BFCL is not only a distribution shift in tool tasks; it also changes the amount of benchmark-specific instruction-following needed for a model to be scored as correct. That makes BFCL an intentionally demanding transfer setting, but also means that BFCL degradation can reflect a mixture of capability shift, protocol mismatch, and judge-facing compliance effects. The paper also studies only two open-weight backbones, so broader portability across architectures, scales, and post-training recipes remains an open question. Finally, feature identities may drift with checkpoint choice and SAE recipe, even when late-layer concentration appears robust. Additional examples and per-step tables appear in Appendix [C](https://arxiv.org/html/2605.06890#A3).
## 7 Conclusion
Tool decisions leave readable traces in model state *before* external execution. On held-out Nemotron trajectories, linear probes on SAE-decomposed activations recover Tool-Need and Tool-Risk signals for both GPT-OSS 20B and Gemma 3 27B IT, with late-layer concentration and sparse feature sets that survive ablation: evidence that the readout targets a genuine internal signal, not an arbitrary projection of the residual stream.
The same monitoring recipe transfers across backbones under distinct SAE variants and layer selections and remains informative under live replay and BFCL out-of-distribution evaluation, with Tool-Need acting most reliably as an *omission auditor* and Tool-Risk as a *risk-oriented* layer that requires extra care when the risk scheme shifts across tool namespaces. Selective activation capture at decision boundaries therefore offers a practical complement to external benchmarks for safer, more controllable agent deployment.
The broader contribution is to show that mechanistic interpretability can become *operationally useful* for agent systems: internal monitoring can move beyond explaining behavior after failure to helping monitor tool decisions before action in realistic high-stakes workflows.
## References
Alain, G., & Bengio, Y. (2017). Understanding Intermediate Layers Using Linear Classifier Probes. arXiv preprint arXiv:1610.01644. [https://arxiv.org/abs/1610.01644](https://arxiv.org/abs/1610.01644)
Chandiramani, A., et al. (2026). Nemotron 3 Super: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning. arXiv preprint arXiv:2604.12374. [https://arxiv.org/abs/2604.12374](https://arxiv.org/abs/2604.12374)
Cho, S., Wu, Z., & Koshiyama, A. (2025). CorrSteer: Steering Improves Task Performance and Safety in LLMs through Correlation-based Sparse Autoencoder Feature Selection. arXiv preprint arXiv:2508.12535. [https://arxiv.org/abs/2508.12535](https://arxiv.org/abs/2508.12535)
Healy, K., et al. (2026). Internal Representations as Indicators of Hallucinations in Agent Tool Selection. arXiv preprint arXiv:2601.05214. [https://arxiv.org/abs/2601.05214](https://arxiv.org/abs/2601.05214)
Li, W., et al. (2025). Adaptive Tool Use in Large Language Models with Meta-Cognition Trigger. arXiv preprint arXiv:2502.12961. [https://arxiv.org/abs/2502.12961](https://arxiv.org/abs/2502.12961)
Liu, W., et al. (2024). ToolACE: Winning the Points of LLM Function Calling. arXiv preprint arXiv:2409.00920. [https://arxiv.org/abs/2409.00920](https://arxiv.org/abs/2409.00920)
McKenzie, A., et al. (2025). Detecting High-Stakes Interactions with Activation Probes. arXiv preprint arXiv:2506.10805. [https://arxiv.org/abs/2506.10805](https://arxiv.org/abs/2506.10805)
Patil, S. G., et al. (2025). The Berkeley Function Calling Leaderboard (BFCL): From Tool Use to Agentic Evaluation of Large Language Models. Proceedings of the Forty-Second International Conference on Machine Learning. [https://openreview.net/forum?id=2GmDdhBdDk](https://openreview.net/forum?id=2GmDdhBdDk)
Qin, Y., et al. (2023). ToolLLM: Facilitating Large Language Models to Master 16000+ Real-World APIs. arXiv preprint arXiv:2307.16789. [https://arxiv.org/abs/2307.16789](https://arxiv.org/abs/2307.16789)
Rai, D., et al. (2024). A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models. arXiv preprint arXiv:2407.02646. [https://arxiv.org/abs/2407.02646](https://arxiv.org/abs/2407.02646)
Schick, T., et al. (2023). Toolformer: Language Models Can Teach Themselves to Use Tools. Advances in Neural Information Processing Systems, 36. [https://arxiv.org/abs/2302.04761](https://arxiv.org/abs/2302.04761)
Tatsat, H., & Shater, A. (2025). Beyond the Black Box: Interpretability of LLMs in Finance. arXiv preprint arXiv:2505.24650. [https://arxiv.org/abs/2505.24650](https://arxiv.org/abs/2505.24650)
Wang, J., et al. (2025). HammerBench: Fine-Grained Function-Calling Evaluation in Real Mobile Device Scenarios. arXiv preprint arXiv:2412.16516. [https://arxiv.org/abs/2412.16516](https://arxiv.org/abs/2412.16516)
Wang, H., et al. (2025). Position: Agent Should Invoke External Tools ONLY When Epistemically Necessary. arXiv preprint arXiv:2506.00886. [https://arxiv.org/abs/2506.00886](https://arxiv.org/abs/2506.00886)
## Appendix A FEATURE LABELING METHODOLOGY
Table [12](https://arxiv.org/html/2605.06890#A1.T12) summarizes the paper's *feature labeling methodology* for representative SAE features. We use it only to provide a small number of compact feature labels that help the reader interpret the probe tables.
For each selected feature, we retrieve top-activating examples, convert them into a short evidence summary, and use that evidence to produce a concise candidate label. When available, we also use Neuronpedia as an external automated-interpretability browser for inspecting SAE features and activation examples; its descriptions are treated only as a cross-check rather than as ground truth. The resulting labels are therefore best understood as brief interpretability anchors, not as definitive names for model mechanisms.
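One possible shape for this labeling loop is sketched below. Here `summarize_with_llm` is a hypothetical stand-in for whatever LLM endpoint performs the summarization, and the prompt wording and truncation limits are assumptions.

```python
def label_feature(top_examples: list[str]) -> str:
    """Produce a short natural-language label from a feature's top-activating examples."""
    evidence = "\n".join(f"- {ex[:200]}" for ex in top_examples[:20])  # evidence summary
    prompt = (
        "The following snippets most strongly activate one sparse autoencoder feature.\n"
        f"{evidence}\n"
        "Give a concise label (at most ten words) for what this feature detects."
    )
    return summarize_with_llm(prompt)  # hypothetical LLM call; see lead-in
```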
Table 12: Technical summary of the feature labeling methodology used for representative SAE features.
## Appendix B NEMOTRON RISK-TIER SCHEME
This appendix summarizes the keyword groups used to instantiate the Nemotron risk-tier scheme referenced in Section [3.1](https://arxiv.org/html/2605.06890#S3.SS1). The scheme is heuristic and is intended to capture the operational risk of the next tool action rather than its domain label. Table [13](https://arxiv.org/html/2605.06890#A2.T13) lists the tier definitions and representative keyword groups.
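An illustrative instantiation of such a keyword-group classifier is shown below (this is the `assign_risk_tier` helper assumed in the earlier sketches). The keyword groups here are invented examples for illustration; the paper's actual groups are those in Table 13.

```python
# Assumed example keyword groups; see Table 13 for the paper's actual groups.
HIGH_RISK = ("auth", "login", "token", "send", "email", "post", "execute", "shell")
MEDIUM_RISK = ("create", "write", "update", "upload", "order")

def assign_risk_tier(tool_name: str) -> str:
    """Heuristic keyword-group projection of a tool name onto low/medium/high tiers."""
    name = tool_name.lower()
    if any(k in name for k in HIGH_RISK):
        return "high"    # authentication, outbound communication, dangerous execution
    if any(k in name for k in MEDIUM_RISK):
        return "medium"  # bounded creation or write operations
    return "low"         # read-only retrieval or lookup by default
```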
Table 13: Keyword groups used in the Nemotron risk-tier scheme. Representative tools are illustrative rather than exhaustive.
## Appendix C ADDITIONAL FINANCIAL QUALITATIVE EXAMPLES
This appendix collects the supplementary qualitative traces that support the worked example in Section [4.3](https://arxiv.org/html/2605.06890#S4.SS3). Figure [2](https://arxiv.org/html/2605.06890#A3.F2) covers the multi-ticker fundamentals trace, Figures [3](https://arxiv.org/html/2605.06890#A3.F3) and [4](https://arxiv.org/html/2605.06890#A3.F4) cover the Bitcoin DCA trace, Table [14](https://arxiv.org/html/2605.06890#A3.T14) provides the illustrative Nemotron–BFCL formatting contrast referenced in Section [5](https://arxiv.org/html/2605.06890#S5), and Tables [15](https://arxiv.org/html/2605.06890#A3.T15)–[17](https://arxiv.org/html/2605.06890#A3.T17) provide the corresponding full per-step values.
### C.1 Illustrative cross-benchmark formatting example
Table [14](https://arxiv.org/html/2605.06890#A3.T14) shows the non-evaluative formatting contrast referenced in Section [5](https://arxiv.org/html/2605.06890#S5). It is included only to illustrate how similar tool-use logic can appear under different benchmark representations.
Table 14: Illustrative formatting contrast between Nemotron and BFCL. The table is included only to show how similar tool-use logic can appear under different benchmark representations; it is not itself an evaluation result.
### C.2 Multi-ticker fundamentals trace (trajectory_id 3344)
Figure [2](https://arxiv.org/html/2605.06890#A3.F2) shows the Tool-Need probability curve for the multi-ticker fundamentals trace discussed in Section [4.3](https://arxiv.org/html/2605.06890#S4.SS3); the full step-level values appear in Table [15](https://arxiv.org/html/2605.06890#A3.T15).
Figure 2: Tool-Need Probe (Probe 1) on the multi-ticker fundamentals trajectory. The signal rises on steps that require external financial retrieval and falls on follow-up no-tool steps.
### C.3 Bitcoin DCA scenario (Nemotron, trajectory_id 4592)
Figures [3](https://arxiv.org/html/2605.06890#A3.F3) and [4](https://arxiv.org/html/2605.06890#A3.F4) show the corresponding Tool-Need and Tool-Risk traces for the Bitcoin DCA scenario; the full step-level values appear in Table [16](https://arxiv.org/html/2605.06890#A3.T16).
Figure 3: Tool-Need Probe (Probe 1) on the Bitcoin DCA trajectory. The tool-needed signal rises on calculation-heavy steps and falls on intervening no-tool steps.
Figure 4: Tool-Risk Probe (Probe 2) on the Bitcoin DCA trajectory. Risk probabilities remain overwhelmingly low, consistent with calculator-style actions.
### C.4 Full per-step probe tables
Tables [15](https://arxiv.org/html/2605.06890#A3.T15), [16](https://arxiv.org/html/2605.06890#A3.T16), and [17](https://arxiv.org/html/2605.06890#A3.T17) report the full step-level outputs underlying the qualitative examples and BFCL slice referenced in the main text.
Table 15: Nemotron multi-ticker fundamentals (trajectory_id 3344): all pivot steps (GPT-OSS probes). "Phase" summarizes the latest user intent at each step; tier = Probe-2 argmax.
Table 16: Nemotron Bitcoin DCA scenario (trajectory_id 4592): all pivot steps (GPT-OSS probes).
Table 17: BFCL trading episode (multi_turn_base_102): all steps in the evaluated slice (GPT-OSS probes). Gold risk = heuristic projection used for expected tier.