Localizing RL-Induced Tool Use to a Single Crosscoder Feature

arXiv cs.LG 06/26/26, 04:00 AM Papers
Summary
This paper uses Dedicated Feature Crosscoders to localize RL-induced tool-use capability in Qwen2.5-3B to a single steerable feature, achieving +65pp tool-correctness via feature steering and demonstrating capability spillover to frozen base models.
arXiv:2606.26474v1 Announce Type: new Abstract: Fine-tuning through RL reshapes the internal representations of language models to enable agentic behaviors such as tool use, yet the mechanistic basis of these changes remains poorly understood. While RL substantially improves structured tool-call generation, it is unclear which features emerge, which are preserved, and whether identified features can be leveraged for retraining-free behavioral control. In this work, we show that $\textit{Dedicated Feature Crosscoders (DFC)}$ isolate a compact set of RL-specific features that mediate tool-calling capability in $\texttt{Qwen2.5-3B}$. Across a $48$-crosscoder hyperparameter sweep, encode-decode reconstruction improves the RL model's tool correctness by $+31.1 \pm {9.7}$ pp and passively transfers tool-calling ability to the frozen base model by $+6.8 \pm 5.0$ pp which we call a $\textit{capability spillover}$. Our findings show that DFC partitioning concentrates RL-introduced capability into a minimal, steerable feature set that enables runtime behavioral control of agentic LLMs.
# Localizing RL-Induced Tool Use to a Single Crosscoder Feature
Source: [https://arxiv.org/html/2606.26474](https://arxiv.org/html/2606.26474)
Shubham Bhokare, Ahmed Zeyad A Alzahrani, Bowen Cheng, Gustavo Mercier, Jessica Hullman

###### Abstract

Fine-tuning through RL reshapes the internal representations of language models to enable agentic behaviors such as tool use, yet the mechanistic basis of these changes remains poorly understood. While RL substantially improves structured tool-call generation, it is unclear which features emerge, which are preserved, and whether identified features can be leveraged for retraining-free behavioral control. In this work, we show that *Dedicated Feature Crosscoders (DFC)* isolate a compact set of RL-specific features that mediate tool-calling capability (see Limitations, Section [7](https://arxiv.org/html/2606.26474#S7)) in Qwen2.5-3B. Across a $48$-crosscoder hyperparameter sweep, encode-decode reconstruction improves the RL model's tool correctness by $+31.1 \pm 9.7$ pp and passively transfers tool-calling ability to the frozen base model by $+6.8 \pm 5.0$ pp, which we call *capability spillover*. Our findings show that DFC partitioning concentrates RL-introduced capability into a minimal, steerable feature set that enables runtime behavioral control of agentic LLMs.

## 1 Introduction

Large language models are increasingly trained to perform agentic tasks such as invoking external tools and interacting with external systems (Ouyang et al., [2022](https://arxiv.org/html/2606.26474#bib.bib10)). RL enables these behaviors, yet how RL fine-tuning alters a model's internal representations is not well understood.

Mechanistic interpretability has introduced sparse autoencoders (SAEs) (Bricken et al., [2023](https://arxiv.org/html/2606.26474#bib.bib2); Cunningham et al., [2023](https://arxiv.org/html/2606.26474#bib.bib3); Templeton et al., [2024](https://arxiv.org/html/2606.26474#bib.bib4)) to decompose model activations into interpretable features, and Crosscoders (Lindsey et al., [2024](https://arxiv.org/html/2606.26474#bib.bib5)) extend this decomposition to a dictionary shared across two related models. Dedicated Feature Crosscoders (Jiralerspong and Bricken, [2025](https://arxiv.org/html/2606.26474#bib.bib6)) further partition the joint dictionary into $A$-exclusive, $B$-exclusive, and shared sub-dictionaries, using gradient masking to enforce exclusivity. We test whether capability differences introduced by RL concentrate in the $A$-exclusive partition as intended and whether those features can be steered for retraining-free behavioral control.

We make the following contributions:

1. We train and evaluate 48 crosscoder variants across a systematic hyperparameter sweep on a ToolRL-fine-tuned Qwen2.5-3B pair (§[3.3](https://arxiv.org/html/2606.26474#S3.SS3)).
2. We identify *capability spillover*: passing the frozen base model's activations through a jointly-trained crosscoder passively transfers $+6.8$ pp tool-correctness with zero fine-tuning (§[5.2](https://arxiv.org/html/2606.26474#S5.SS2)).
3. We show that the DFC exclusive partition acts as a *filter*, concentrating the strongest model-specific features, rather than as a *sink* that fully isolates the RL-induced capability difference (a sink is the limiting case of a filter, reached only when the capability is orthogonal to shared structure); penalizing the exclusive partition degrades RL-model fidelity (§[5.4](https://arxiv.org/html/2606.26474#S5.SS4)).
4. We demonstrate that steering a single $A$-exclusive neuron (used interchangeably with "feature") achieves $+65.0$ pp $\Delta$ tool-correctness, generalizing across 7 of 9 probed layers (§[6.1](https://arxiv.org/html/2606.26474#S6.SS1)).
5. We provide geometric and autointerp evidence that $A$-exclusive features occupy a separable "Tool Interaction" region that is distinct from shared and $B$-exclusive features under the DFC, but not under a matched CrossCoder (§[4](https://arxiv.org/html/2606.26474#S4)).

DFC-based model diffing is a promising tool for identifying and modulating the representations introduced by RL post-training. For mechanistic interpretability, our results suggest that runtime behavioral control of agentic LLMs through a small number of identified features is feasible.

## 2Related Work

![Refer to caption](https://arxiv.org/html/2606.26474v1/figures/crosscoders.png) (a) Crosscoders
![Refer to caption](https://arxiv.org/html/2606.26474v1/figures/ties.png) (b) Tied Crosscoders
![Refer to caption](https://arxiv.org/html/2606.26474v1/figures/dfc-1.png) (c) Dedicated Feature Crosscoders

Figure 1: Three architectures for joint sparse decomposition of paired model activations.

#### Sparse autoencoders and mechanistic interpretability.

SAEs are a popular tool for decomposing LLM activations into interpretable features (Bricken et al., [2023](https://arxiv.org/html/2606.26474#bib.bib2); Cunningham et al., [2023](https://arxiv.org/html/2606.26474#bib.bib3); Templeton et al., [2024](https://arxiv.org/html/2606.26474#bib.bib4); Elhage et al., [2022](https://arxiv.org/html/2606.26474#bib.bib1)). These methods are built on the hypothesis that model representations are superpositions of many more features than there are neurons, and that sparse dictionary learning can recover them.

#### Crosscoders and model diffing\.

Crosscoders (Lindsey et al., [2024](https://arxiv.org/html/2606.26474#bib.bib5)) (Fig. [1(a)](https://arxiv.org/html/2606.26474#S2.F1.sf1)) extend SAEs to pairs of models by jointly encoding their activations into one shared sparse dictionary. Dedicated Feature Crosscoders (Jiralerspong and Bricken, [2025](https://arxiv.org/html/2606.26474#bib.bib6)) (Fig. [1(c)](https://arxiv.org/html/2606.26474#S2.F1.sf3)), like Tied Crosscoders (Aranguri, [2025](https://arxiv.org/html/2606.26474#bib.bib15)) (Fig. [1(b)](https://arxiv.org/html/2606.26474#S2.F1.sf2)), further partition this dictionary with gradient masking. The mask forces each model to read from and write to only its designated partition. Our work is the first systematic behavioral evaluation of DFC across a hyperparameter sweep, and the first to demonstrate single-neuron steering saturation.

#### RL fine\-tuning for tool use\.

ToolRL (Qian et al., [2025](https://arxiv.org/html/2606.26474#bib.bib7)) demonstrates that RL from execution feedback substantially improves structured tool-call generation in Qwen2.5-3B. We use the ToolRL-fine-tuned model as Model $A$ throughout our experiments.

#### Activation steering\.

Our steering approach operates in the sparse dictionary of the crosscoder, which allows for feature-level intervention (Templeton et al., [2024](https://arxiv.org/html/2606.26474#bib.bib4)). We identify the minimal steerable subset of neurons and find that a single $A$-exclusive neuron is enough to reach the maximum effect.

#### Automated interpretability\.

Automated interpretability methods (Bills et al., [2023](https://arxiv.org/html/2606.26474#bib.bib16)) use LLM judges to label and score SAE features. We apply this protocol to validate our interpretations of $A$-exclusive features.

## 3 Methodology

### 3.1 Models and Training Data

We study two models sharing the Qwen2.5-3B architecture ($d=2048$, $36$ layers): Model B (Base), Qwen/Qwen2.5-3B (Qwen Team, [2025](https://arxiv.org/html/2606.26474#bib.bib9)), and Model A (RL), chengq9/ToolRL-Qwen2.5-3B (Qian et al., [2025](https://arxiv.org/html/2606.26474#bib.bib7)), RL-fine-tuned for structured <tool\_call\> invocation.

Training uses $40{,}000$ FineWeb (Penedo et al., [2024](https://arxiv.org/html/2606.26474#bib.bib17)) samples (general-domain) and $40{,}000$ ToolRL instruction-output pairs.

### 3.2 DFC Architecture and Training Objective

The DFC dictionary of size $D$ is partitioned into A-exclusive $[0, a_{\text{end}})$, B-exclusive $[a_{\text{end}}, b_{\text{end}})$, and shared $[b_{\text{end}}, D)$ features, with gradient masking enforcing exclusivity. The training objective is given in Equation [1](https://arxiv.org/html/2606.26474#S3.E1):

$$\mathcal{L} = \mathrm{MSE}(h,\hat{h}) + \lambda_{\mathrm{sh}} \cdot \overline{\|f_{\mathrm{sh}}\|_{1}} + \frac{\lambda_{\mathrm{excl}}}{2}\left(\overline{\|f_{A\cup\mathrm{sh}}\|_{1}} + \overline{\|f_{B\cup\mathrm{sh}}\|_{1}}\right) \tag{1}$$

where $h=(h_{A},h_{B})$ is the concatenated post-MLP residual stream and $\overline{\|\cdot\|_{1}}$ denotes the mean of $|f_{i}|$. Top-$k$ sparsity is enforced in the encoder.
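To make the three terms of Equation 1 concrete, the following is a minimal per-sample sketch in plain Python. It assumes the overline means an average over the feature entries of one sample, that `f` is the post-top-$k$ feature vector, and that the partition boundaries `a_end` and `b_end` are given; the function name `dfc_loss` and the argument layout are illustrative and are not taken from the paper's code release. Batching and the gradient mask are omitted.

```python
from statistics import fmean


def dfc_loss(h, h_hat, f, a_end, b_end, lam_sh=1e-3, lam_excl=0.0):
    """Per-sample version of Eq. (1); an illustrative sketch, not the authors' code.

    h, h_hat : concatenated activations (h_A, h_B) and their reconstruction
    f        : feature activations of length D (assumed post-top-k)
    a_end    : end of the A-exclusive slice [0, a_end)
    b_end    : end of the B-exclusive slice [a_end, b_end); shared is [b_end, D)
    """
    mse = fmean((x - y) ** 2 for x, y in zip(h, h_hat))

    abs_f = [abs(v) for v in f]
    f_a, f_b, f_sh = abs_f[:a_end], abs_f[a_end:b_end], abs_f[b_end:]

    def mean_l1(values):
        # Mean of |f_i| over a slice; an empty slice contributes nothing.
        return fmean(values) if values else 0.0

    shared_term = lam_sh * mean_l1(f_sh)
    excl_term = 0.5 * lam_excl * (mean_l1(f_a + f_sh) + mean_l1(f_b + f_sh))
    return mse + shared_term + excl_term


loss = dfc_loss(h=[1.0, 2.0], h_hat=[1.0, 4.0], f=[1.0, 0.0, -2.0, 3.0],
                a_end=1, b_end=2, lam_sh=0.1, lam_excl=0.2)
```

With `lam_excl=0.0` the last term vanishes, which corresponds to the "free" setting of the sweep.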

### 3.3 Hyperparameter Sweep

The sweep varies five axes that determine the decomposition's capacity, sparsity, and partition structure (Table [1](https://arxiv.org/html/2606.26474#S3.T1)):

- Architecture is the top-level choice between a standard Crosscoder (one unpartitioned dictionary jointly encoding both models) and a DFC (the same dictionary explicitly split into A-exclusive, B-exclusive, and shared sub-dictionaries via gradient masking).
- Dictionary size $D$ is the total number of features the dictionary can represent.
- Top-$k$ is the sparsity budget: at each forward pass, only the $k$ highest-activating features are retained.
- Exclusive share $p$ (DFC only) is the fraction of $D$ allocated to each exclusive partition.
- $\lambda_{\text{excl}}$ is the magnitude penalty applied specifically to the exclusive partitions (in the CrossCoder rows of the full sweep evaluation table in Appendix [C](https://arxiv.org/html/2606.26474#A3), we reuse the same coefficient on the equivalent unmasked features; see the code release). Setting it to $0$ ("free") lets exclusive features fire without extra cost, while $10^{-3}$ ("penalised") pushes the model to route signal through shared features unless exclusivity is needed. This lets us test whether the exclusive partition behaves as a sink or a filter.
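As an illustration of how $D$, $p$, and $k$ interact, the sketch below derives partition boundaries from a dictionary size and exclusive share, and applies a top-$k$ mask to a vector of activations. The helper names are hypothetical, and flooring the fractional partition size is an assumption; it does reproduce the $819/819/6{,}554$ split reported in Section 4.1 for $D=8{,}192$ and a $10\%$ share.

```python
def partition_bounds(D, p):
    """Return (a_end, b_end) for slices [0, a_end), [a_end, b_end), [b_end, D).

    Assumption: each exclusive slice holds floor(D * p) features.
    """
    n_excl = int(D * p)
    return n_excl, 2 * n_excl


def top_k(acts, k):
    """Keep the k largest activations and zero out the rest."""
    keep = set(sorted(range(len(acts)), key=lambda i: acts[i], reverse=True)[:k])
    return [a if i in keep else 0.0 for i, a in enumerate(acts)]


a_end, b_end = partition_bounds(8192, 0.10)
n_shared = 8192 - b_end
sparse = top_k([0.1, 2.0, -1.0, 0.5], k=2)
```

Ties at the $k$-th position are broken arbitrarily in this sketch; the paper does not state how its encoder handles them.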

Table 1: $48$ crosscoder variants ($36$ DFC and $12$ CrossCoders) trained for $9{,}000$ steps, batch size $1{,}024$, Adam with $lr=10^{-4}$, $\lambda_{\text{sh}}=10^{-3}$.

### 3.4 Behavioral Scoring Rubric

Each generated response is scored on three metrics (Qian et al., [2025](https://arxiv.org/html/2606.26474#bib.bib7)):

- format\_accuracy: <tool\_call\> is present and the JSON "name":"..." field is present;
- tool\_correctness: the called name fuzzy-matches a numbered tool in the prompt;
- overall\_score $\in \{-1, 0, +1, +2\}$: $+2$ = both correct, $+1$ = tool only, $0$ = format only, $-1$ = neither.

Evaluation: 100 held-out ToolRL prompts per sweep variant and 40 prompts per steering cell (seed 42), greedy decoding, max\_new\_tokens=200, max\_length=2048.
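The rubric above can be sketched as three small functions. This is a hedged illustration only: the regular expression, the `difflib`-based fuzzy match, and the `0.8` threshold are assumptions, since the paper does not specify how the called name is extracted or how fuzzy matching is implemented. The caller is expected to supply the candidate tool name.

```python
import re
from difflib import SequenceMatcher

# Assumed pattern for the JSON "name" field; not taken from the paper.
NAME_RE = re.compile(r'"name"\s*:\s*"[^"]*"')


def format_accuracy(text):
    """True if the <tool_call> tag and a JSON "name" field are both present."""
    return "<tool_call>" in text and NAME_RE.search(text) is not None


def tool_correctness(called_name, tool_names, threshold=0.8):
    """True if the called name fuzzy-matches any tool listed in the prompt."""
    called = called_name.strip().lower()
    return any(
        SequenceMatcher(None, called, name.lower()).ratio() >= threshold
        for name in tool_names
    )


def overall_score(format_ok, tool_ok):
    """+2 both correct, +1 tool only, 0 format only, -1 neither."""
    if format_ok and tool_ok:
        return 2
    if tool_ok:
        return 1
    if format_ok:
        return 0
    return -1


fmt = format_accuracy('<tool_call>{"name": "get_weather"}</tool_call>')
tool = tool_correctness("get_weathr", ["get_weather", "search"])
score = overall_score(fmt, tool)
```

Keeping `tool_correctness` separate from the format check mirrors the rubric, in which a response can name the right tool without producing the required surface form.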

### 3.5 Neuron Identification and Targeted Steering

Features are ranked by Cohen's $d$ on tool-use vs. general-text activations [Eq. [2](https://arxiv.org/html/2606.26474#S3.E2)]:

$$d_{i} = \frac{\mu^{\text{tool}}_{i} - \mu^{\text{gen}}_{i}}{\sqrt{\left((s^{\text{tool}}_{i})^{2} + (s^{\text{gen}}_{i})^{2}\right)/2}} \tag{2}$$

and filtered by firing rate ($\rho^{\text{tool}}_{i} \geq 0.3$, $\rho^{\text{gen}}_{i} \leq 0.1$). Targeted steering applies an additive correction to Model A's residual stream [Eq. [3](https://arxiv.org/html/2606.26474#S3.E3)]:

$$h'_{A} = h_{A} + \sum_{i \in S} (\alpha - 1) \cdot f_{i} \cdot W_{\text{dec}}[i, A, :] \tag{3}$$

for a subset $S$ of A-exclusive features ranked by $d_{i}$, steering coefficient $\alpha$, and decoder column $W_{\text{dec}}[i, A, :] \in \mathbb{R}^{d}$ (Section [5](https://arxiv.org/html/2606.26474#S5)).
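The ranking and steering steps of Eqs. (2) and (3) can be sketched in plain Python as follows. Two points are assumptions rather than statements from the paper: the sample (n-1) variance is used for $s_i^2$, and a feature counts as "firing" when its activation is strictly positive. The function names are illustrative.

```python
from math import sqrt
from statistics import fmean, variance


def cohens_d(tool_acts, gen_acts):
    """Eq. (2) for one feature. Assumes sample variance; undefined if both are 0."""
    pooled = sqrt((variance(tool_acts) + variance(gen_acts)) / 2)
    return (fmean(tool_acts) - fmean(gen_acts)) / pooled


def firing_rate(acts):
    """Fraction of samples on which the feature is active (assumed: act > 0)."""
    return sum(a > 0 for a in acts) / len(acts)


def passes_filter(tool_acts, gen_acts):
    """Firing-rate filter: rho_tool >= 0.3 and rho_gen <= 0.1."""
    return firing_rate(tool_acts) >= 0.3 and firing_rate(gen_acts) <= 0.1


def steer(h_a, f, w_dec_a, S, alpha):
    """Eq. (3): add (alpha - 1) * f_i * W_dec[i, A, :] for every i in S.

    h_a     : Model A residual-stream vector (length d)
    f       : feature activations for this token
    w_dec_a : w_dec_a[i] is feature i's decoder direction into Model A
    """
    out = list(h_a)
    for i in S:
        scale = (alpha - 1.0) * f[i]
        for j, w in enumerate(w_dec_a[i]):
            out[j] += scale * w
    return out


d = cohens_d([2.0, 4.0], [0.0, 2.0])
steered = steer([1.0, 0.0], f=[0.5, 2.0],
                w_dec_a=[[1.0, 1.0], [0.0, 3.0]], S=[0], alpha=3.0)
```

Setting `alpha=1.0` leaves the residual stream unchanged, and `alpha=0.0` subtracts the selected features' decoded contribution, which is the clamping-to-zero case mentioned in the Discussion.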

## 4 Feature Space Analysis

### 4.1 DFC vs CrossCoder Decoder Geometry

To test whether the partitioned geometry observed for the DFC (Figure [2](https://arxiv.org/html/2606.26474#S4.F2)) is a consequence of the gradient-mask architecture or merely of label imbalance, we construct a matched-size proxy partition for the unconstrained CrossCoder. For each dictionary feature $i$, the decoder has two columns: $W_{\text{dec}}[i,0,:]$, the direction the feature decodes into Model A's residual stream, and $W_{\text{dec}}[i,1,:]$, the direction it decodes into Model B's. We define the *mass-ratio*

$$r_{i} = \log \frac{\lVert W_{\text{dec}}[i,0,:] \rVert_{2}}{\lVert W_{\text{dec}}[i,1,:] \rVert_{2}} \tag{4}$$

so that $r_{i} \gg 0$ indicates a feature writing almost entirely into Model A (effectively A-exclusive), $r_{i} \ll 0$ indicates Model B (effectively B-exclusive), and $r_{i} \approx 0$ indicates balanced contribution (shared). For the DFC, the partition mask forces this ratio: the first $819$ features have $W_{\text{dec}}[i,1,:]=0$ (so $r_{i} \to +\infty$), the next $819$ have $W_{\text{dec}}[i,0,:]=0$ ($r_{i} \to -\infty$), and the remaining $6{,}554$ have unconstrained A and B decoders. For the CrossCoder no such constraint exists, so we sort all $8{,}192$ features by $r_{i}$ and label the top $819$ as the A-exclusive proxy, the bottom $819$ as the B-exclusive proxy, and the middle $6{,}554$ as the shared proxy. This produces partition slices of identical size to the DFC's, so any difference in UMAP separation is attributable to the training objective (partition mask vs no mask) rather than to label imbalance.
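The proxy-partition construction can be sketched as below. The small `eps` added to both norms is an assumption to keep the ratio finite when a decoder column is exactly zero (Eq. 4 itself has no such term), and the data layout, with each feature stored as a pair of decoder vectors, is chosen for illustration only.

```python
from math import log, sqrt


def l2_norm(v):
    return sqrt(sum(x * x for x in v))


def mass_ratio(w_a, w_b, eps=1e-12):
    """Eq. (4): log of the ratio of decoder norms into Model A vs Model B."""
    return log((l2_norm(w_a) + eps) / (l2_norm(w_b) + eps))


def proxy_partition(w_dec, n_excl):
    """Sort features by r_i; label the top and bottom n_excl as exclusive proxies.

    w_dec[i] is a pair (w_a, w_b) of decoder directions for feature i.
    Returns index lists (a_proxy, b_proxy, shared_proxy).
    """
    order = sorted(range(len(w_dec)),
                   key=lambda i: mass_ratio(*w_dec[i]), reverse=True)
    cut = len(order) - n_excl
    return order[:n_excl], order[cut:], order[n_excl:cut]


toy_decoder = [
    ([3.0, 4.0], [1.0, 0.0]),  # writes mostly into A, r > 0
    ([1.0, 0.0], [1.0, 0.0]),  # balanced, r = 0
    ([0.1, 0.0], [2.0, 0.0]),  # writes mostly into B, r < 0
    ([1.0, 1.0], [1.0, 1.0]),  # balanced, r = 0
]
a_proxy, b_proxy, shared_proxy = proxy_partition(toy_decoder, n_excl=1)
```

For the matched comparison in the text, `n_excl` would be $819$ with $8{,}192$ features, leaving $6{,}554$ in the shared proxy.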

With matched sizes, the CrossCoder's A-biased features still mix uniformly with B-biased and shared features. The A-exclusive proxy in the CC has a *negative* silhouette ($-0.168$): on average, mass-ratio-A-biased features sit closer to non-A-biased centroids than to their own. $k$-NN purity drops from the DFC's $0.984$ to $0.158$, i.e. the precision of recovering A-exclusive proxy labels among an A-biased feature's nearest neighbours falls close to the prior $|A|/D \approx 10\%$, the chance level for a $10\%$ minority class under random label assignment. HDBSCAN clusters in the CC UMAP recover the mass-ratio labels at ARI $=0.08$ versus $0.93$ for the DFC: the structure simply isn't there to be recovered. This falsifies the alternative explanation that the DFC's clean separation is a UMAP artefact: with matched dictionary size ($D=8{,}192$) and matched sparsity ($k=160$), the same UMAP pipeline produces no separation when the partition mask is removed. The geometric structure in the DFC plot is caused by the architectural partitioning, and mass-asymmetry alone is insufficient to produce it. The gradient mask reorganises feature directions, not just decoder magnitudes.

![Refer to caption](https://arxiv.org/html/2606.26474v1/figures/umap_dfc_cc.png)

Figure 2: Decoder UMAP at matched hyperparameters ($D=8{,}192$, $k=160$; $n_{\text{neighbors}}=30$, $\text{min\_dist}=0.1$, cosine metric, seed $42$). Left: the DFC (dfc-D8k-excl10-freeexcl-k160) yields three spatially distinct regions: A-exclusive (red), B-exclusive (blue), and shared (grey). Right: the CrossCoder (cc-D8k-k160) with a mass-ratio proxy partition of identical sizes ($819/819/6{,}554$) shows A-biased and B-biased features mixing uniformly with shared features throughout the dense core.

## 5 Reconstruction Fidelity and Capability Spillover

### 5.1 Sweep-Level Reconstruction

Table 2: Sweep mean $\pm$ std, $n=48$ crosscoders, $100$ held-out prompts each.

Across the $48$-model sweep, reconstruction substantially improves the RL model's behavioral performance relative to its pre-reconstruction baseline (Table [2](https://arxiv.org/html/2606.26474#S5.T2)). Every one of the $48$ trained variants improves Model $A$ tool correctness ($48/48$); under a one-sided exact-binomial sign test against the chance rate of $0.5$, this corresponds to $p \approx 3.6 \times 10^{-15}$.
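The sign-test figure can be checked directly: with all $48$ of $48$ variants improving, the one-sided exact-binomial p-value under a chance rate of $0.5$ is simply $0.5^{48}$. A minimal sketch (the function name is illustrative):

```python
from math import comb


def sign_test_p_one_sided(successes, n, p0=0.5):
    """P(X >= successes) for X ~ Binomial(n, p0)."""
    return sum(
        comb(n, k) * p0 ** k * (1 - p0) ** (n - k)
        for k in range(successes, n + 1)
    )


p_value = sign_test_p_one_sided(48, 48)
print(f"{p_value:.2e}")  # 3.55e-15
```

This agrees with the reported $p \approx 3.6 \times 10^{-15}$ after rounding.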

At the same time, behavioral preservation (and even the performance boost) does not track raw reconstruction loss in any simple way. Across the sweep, training MSE correlates only weakly with the gain in Model A performance ($r=+0.08$, 95% CI $[-0.21, +0.36]$). In other words, lower MSE does not reliably predict better behavioral reconstruction. This suggests that the information required for successful tool calling occupies a relatively sparse and behaviorally privileged subspace: a model can incur substantial activation-level distortion while still preserving the task-relevant structure needed to emit correct tool calls. One plausible interpretation is that the top-$k$ bottleneck acts as an implicit regulariser, discarding noisy or behaviorally irrelevant components while retaining the sparse features most important for tool-use performance.
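The width of that interval is what one would expect for $n=48$ runs. The sketch below uses the Fisher $z$-transform, which is an assumption (the paper does not state how its interval was computed), but it reproduces the reported bounds to two decimals.

```python
from math import atanh, sqrt, tanh


def pearson_ci(r, n, z_crit=1.959964):
    """Approximate 95% CI for a Pearson correlation via the Fisher z-transform."""
    z = atanh(r)
    se = 1.0 / sqrt(n - 3)
    return tanh(z - z_crit * se), tanh(z + z_crit * se)


lo, hi = pearson_ci(0.08, 48)
print(f"[{lo:+.2f}, {hi:+.2f}]")  # [-0.21, +0.36]
```

Because the interval comfortably spans zero, the sweep cannot distinguish a weak positive relationship from none.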

### 5.2 Capability Spillover

Model B's tool-correctness rises from $0\%$ to $6.8\%$ post-reconstruction (paired-$t$ across the $48$ runs against the always-zero baseline: $t \approx 9.4$, $p \approx 1 \times 10^{-12}$) without any fine-tuning. We call this *capability spillover*: tool-calling intent is delocalized into the shared decoder weights during joint training, and reconstruction passively routes it into Model B's activation stream.
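Because the baseline is zero in every run, the paired test reduces to a one-sample $t$-test on the per-run gains. Assuming the sweep-level $+6.8 \pm 5.0$ pp from the abstract is the mean and sample standard deviation of those $48$ gains, the statistic can be recovered as follows (the function name is illustrative):

```python
from math import sqrt


def one_sample_t(mean, sd, n, mu0=0.0):
    """t statistic for testing a sample mean against mu0."""
    return (mean - mu0) / (sd / sqrt(n))


t_stat = one_sample_t(mean=6.8, sd=5.0, n=48)
print(f"{t_stat:.1f}")  # 9.4
```

The result matches the reported $t \approx 9.4$ on $47$ degrees of freedom.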

Format accuracy never spills over ($\Delta_{B}$ format $=0$ pp across all 48 runs). Model B correctly identifies tools in prose but never produces the exact <tool\_call\>..."name":... surface form. This suggests that semantic tool-selection intent is at least partially represented in shared directions, but the exact surface-form machinery required to produce the <tool\_call\> template remains isolated in narrow RL-specific features. This interpretation is consistent with the later autointerp evidence, which shows that the most discriminative A-exclusive features are structural-template detectors rather than broad abstract tool-use concepts.

### 5.3 Architecture Comparison

Table 3: Welch's $t$-test on $\Delta_{B}$: $t=1.63$, $p=0.12$ (directional, not significant). Train MSE indistinguishable.

Comparing the two architectures at sweep level (Table [3](https://arxiv.org/html/2606.26474#S5.T3)), CrossCoders and DFCs achieve essentially indistinguishable reconstruction error but somewhat different behavioral transfer profiles. CrossCoders ($12$ runs) achieve a mean Model A tool-correctness gain of $+32.9$ pp and a mean Model B gain of $+8.9$ pp, while DFCs (36 runs) achieve $+30.5$ pp on Model A and $+6.1$ pp on Model B. Mean training MSE is the same to three decimal places for both families, so raw reconstruction loss alone does not explain the behavioral differences.
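As a consistency check using only figures quoted in this paper, the run-weighted averages of the per-architecture gains recover the sweep-level means of $+31.1$ pp (Model A) and $+6.8$ pp (Model B) reported in the abstract:

$$\Delta_{A}: \frac{12 \cdot 32.9 + 36 \cdot 30.5}{48} = \frac{394.8 + 1098.0}{48} = 31.1, \qquad \Delta_{B}: \frac{12 \cdot 8.9 + 36 \cdot 6.1}{48} = \frac{106.8 + 219.6}{48} = 6.8$$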

The directional reduction in Model B spillover under DFC is suggestive but not statistically decisive in this sweep. A Welch's $t$-test on $\Delta_{B}$ yields $t=1.63$, $p=0.12$, so we do not claim a significant architecture-level difference here. The more cautious interpretation is that DFC *may* reduce spillover relative to an unpartitioned CrossCoder, but the evidence is currently underpowered and should be treated as directional rather than conclusive.

### 5.4 The DFC Partition as a Filter, Not a Sink

Setting $\lambda_{\text{excl}}=10^{-3}$ reduces Model $A$ fidelity relative to $\lambda_{\text{excl}}=0$ at both the $5\%$ and $10\%$ exclusive shares ($34.8 \to 25.8$ pp $\Delta A$ at $5\%$ and $35.2 \to 32.0$ pp at $10\%$, each averaged over the six matched $(D,k)$ pairs in the full sweep evaluation table, Appendix [C](https://arxiv.org/html/2606.26474#A3)). Sparsifying the exclusive partition therefore pushes tool-specific signal back into the shared partition rather than removing it. Rather than isolating all RL-specific signal into a protected compartment, the exclusive partition appears to act more like a *filter* that concentrates the most model-specific residue.

If the exclusive slice were a true sink for the capability difference, sparsifying it should mostly remove redundant or nonessential signal while leaving shared structure intact. Instead, penalizing the exclusive slice appears to push some tool-relevant information back into the shared partition, where it is both less behaviorally efficient for Model $A$ and more available for spillover into Model $B$. The DFC partition therefore helps localize model-specific features, but it does not perfectly disentangle capability-specific content from the shared representational substrate.

## 6 Targeted-Neuron Steering

### 6.1 DFC Saturates at a Single Feature

![Refer to caption](https://arxiv.org/html/2606.26474v1/figures/ablation_l13/ablation_saturation_curve.png)

Figure 3: Saturation curve: $\Delta$ tool-correctness vs $|S|$ (log-$x$) for all conditions at layer 13. DFC A-excl plateaus from $|S|=1$; CrossCoder needs $|S|=33$ to reach its peak. B-excl steering has zero effect throughout. Devices used: dfc-D8k-excl10-k45 and cc-D8k-k45.

Steering a single A-exclusive neuron ($|S|=1$, $\alpha=32$) achieves $\mathbf{+65.0}$ pp $\Delta$ tool-correctness (95% CI $[+47.9, +82.1]$), matching the performance of steering all 7 available A-exclusive neurons. The CrossCoder requires $|S|=33$ to reach its unbudgeted peak of $+70.0$ pp ($[+53.5, +86.5]$) (Fig. [3](https://arxiv.org/html/2606.26474#S6.F3)). DFC A-exclusive plateaus immediately; the CC climbs slowly and needs $33\times$ the feature budget. The full per-$|S|$ breakdown with CIs is in Table [9](https://arxiv.org/html/2606.26474#A4.T9) (Appendix [D](https://arxiv.org/html/2606.26474#A4)).

### 6.2 Ablation Studies

Table [4](https://arxiv.org/html/2606.26474#S6.T4) isolates four mechanisms. B-exclusive steering has zero effect at every cell. A-exclusive beats shared under a small budget, despite shared holding more tool-relevant features by Cohen's $d$. Signal in A-exclusive is concentrated; signal in shared is diffuse. Steering A-exclusive and shared together underperforms A-exclusive alone. The likely cause is destructive interference between non-orthogonal decoder directions amplified at matched $\alpha$. The unpartitioned CrossCoder eventually matches DFC A-exclusive's peak, but only by recruiting an order of magnitude more features.

| Condition | Best $\Delta$ (%) | 95% CI | $\lvert S \rvert$ | $\alpha$ |
|---|---|---|---|---|
| *Within $\lvert S \rvert \leq 10$* | | | | |
| DFC $\cdot$ A-exclusive | +65.0 | [+47.9, +82.1] | 1 | 32 |
| DFC $\cdot$ shared | +47.5 | [+27.0, +68.0] | 5 | 16 |
| DFC $\cdot$ B-exclusive | +0.0 | [+0.0, +0.0] | 1 | 1 |
| DFC $\cdot$ A-excl $\cup$ shared | +35.0 | [+17.9, +52.1] | 9 | 16 |
| CrossCoder $\cdot$ all | +57.5 | [+41.5, +73.5] | 2 | 32 |
| *Unbudgeted (any $\lvert S \rvert$)* | | | | |
| DFC $\cdot$ A-exclusive | +65.0 | [+47.9, +82.1] | 1 | 32 |
| DFC $\cdot$ shared | +62.5 | [+43.8, +81.2] | 38 | 6 |
| CrossCoder $\cdot$ all | +70.0 | [+53.5, +86.5] | 33 | 6 |

Table 4: Best targeted-steering cell per condition, layer 13, $n=40$ prompts per cell. DFC A-excl achieves its maximum at $|S|=1$; CC requires $|S|=33$.

### 6.3 DFC vs CrossCoder: Budget-Dependent Advantage

DFC A-exclusive significantly outperforms shared, B-exclusive, and the A-excl $\cup$ shared combination, with effect sizes ranging from moderate ($d_{z}=0.40$) to very large ($d_{z}=1.03$) (Table [5](https://arxiv.org/html/2606.26474#S6.T5)). Against the unconstrained CrossCoder, the difference is null ($d_{z}=0.00$, $p=0.51$). Within the power of this study, the two architectures' unbudgeted ceilings are indistinguishable. This is the central tradeoff. The DFC's value is that it reaches *its* ceiling with one neuron, whereas the CrossCoder needs thirty-three to reach *its* ceiling. For interpretability and targeted intervention, single-neuron control is qualitatively different from distributed control; for raw behavioral effect at any cost, the architectures are interchangeable.

Table 5: Cell-level paired tests vs DFC A-excl ($n=30$ shared cells, one-sided $H_{1}$: A-excl $>$ baseline). A-excl beats shared, B-excl, and combo; ties CC unbudgeted.

### 6.4 Cross-Layer Generalisation

![Refer to caption](https://arxiv.org/html/2606.26474v1/figures/k45_layers/top_per_layer_tool_vs_clean_absk.png)

Figure 4: % tool-correctness improvement per layer at $|S| \leq 10$ (DFC A-excl, dfc-D8k-excl10-k45). Layers 18 and 24 are dead (no tool-signed A-excl features).

Across layers (Fig. [4](https://arxiv.org/html/2606.26474#S6.F4)), the mean best-cell effect is $\Delta=+43.3$ pp (95% CI $[+22.7, +64.0]$), with Cohen's $d_{z}=1.61$, paired-$t$ $p=0.0006$, and Wilcoxon $p=0.0078$. Single-neuron saturation is not a layer-$13$ artefact.

### 6.5 B-Exclusive Steering Has No Effect

B-exclusive steering produces $\Delta=+0.0$ pp at every $(|S|,\alpha)$ cell (CI $[+0.0, +0.0]$). We include this comparison as an end-to-end sanity check on the steering harness rather than a behavioural baseline: by construction, B-exclusive decoder columns have their Model-A output zeroed by the gradient mask, so any non-zero effect here would indicate a bug in our patching pipeline (wrong layer, wrong projection, leaked gradient). The flat $+0.0$ pp result is therefore consistent with the harness working as intended, and supports reading the magnitude of the A-excl effect ($d_{z}=1.03$, $p=2.2\times 10^{-6}$, Table [5](https://arxiv.org/html/2606.26474#S6.T5)) at face value.

### 6.6 Combining A-excl and Shared Causes Interference

Steering A-excl $\cup$ shared simultaneously is worse than either alone: the best combo reaches $+35.0$ pp at $|S|=9$ vs A-excl's $+65.0$ pp at $|S|=1$ (prompt-paired one-sided $t$ across the $n=40$ ToolRL prompts shared by both cells, $p=0.002$, Cohen's $d_{z}=0.56$; the $30$ shared $(|S|,\alpha)$ cells in Table [5](https://arxiv.org/html/2606.26474#S6.T5) aggregate these prompt-level paired tests across cells). The most likely mechanism is destructive interference between non-orthogonal A-excl and shared decoder directions boosted at equal $\alpha$.

### 6.7 Autointerp of the most important neuron

$A$-exclusive features (such as feat. $\#136$ from Figure [5](https://arxiv.org/html/2606.26474#A4.F5)) are *structural-template detectors*. They fire on narrow formatting markers (<tool\_call\>, <response\>, parameter boilerplate). The max-activating fragment receives the explanation (pseudocode from Appendix [A](https://arxiv.org/html/2606.26474#A1)): *“the structure of a dialogue system, specifically the interaction between a tool call and a subsequent response.”* DFC partitioning concentrates this detector in A-exclusive parameter space. Cohen's $d$ values larger than $\sim 5$ here reflect a near-zero variance under the general-text condition (the feature simply does not fire there) and should be read as “cleanly separated” rather than as comparable effect sizes; the AUROC numbers, as used in Appendix [B](https://arxiv.org/html/2606.26474#A2) for another device, are the more meaningful ranking signal.

### 6.8 Hypotheses Summary

Table [6](https://arxiv.org/html/2606.26474#S6.T6) summarizes the hypotheses and their status.

Table 6: Hypotheses and status from the cross-layer and ablation sweeps.

## 7 Discussion

#### Joint training delocalises capability\.

Our results partially refute the DFC's design intent, namely that the exclusive partition cleanly contains the model-specific capability. Tool-calling capability survives in the shared decoder weights well enough to produce $+6.8$ pp spillover to Model B and to make $\lambda_{\text{excl}}$ penalisation hurt Model A fidelity. This is empirical evidence that superposition crosses model boundaries in joint sparse decompositions when underlying capabilities are not orthogonal in residual-stream space.

#### A minimal feature set carries most of the signal\.

UMAP+HDBSCAN finds tens of clustered features out of 819 A-exclusive; the discrimination ranking concentrates Cohen's $d$ in single-digit feature indices; and a single feature suffices for maximum steering effect across 7 of 9 probed layers. This suggests that sparse-feature-level interpretability of agentic capabilities is feasible, at least for this model pair and task.

#### Capability transfer is a side channel\.

The fact that a frozen base model gains $+6.8$ pp tool-correctness purely from the decomposition pipeline, with no fine-tuning, suggests that releasing crosscoders trained between model pairs with different capability levels may inadvertently provide a substrate that reintroduces the capability at inference time.

#### Steering as low\-intervention behavioral control\.

We find that a post-top-$k$ delta on a single feature yields $+65$ pp tool-correctness with zero retraining. The same machinery can in principle suppress unwanted behaviors by clamping the corresponding features to zero, positioning DFC-based steering alongside RLHF/DPO/SFT for safety post-training with the advantages of being inference-time, gradient-free, and inspectable.

#### Limitations\.

We use “capability” as shorthand for reliable, structured tool-call generation. We do not claim RL installs a tool-use ability the base model wholly lacks. Our metrics measure the *propensity* to emit correct tool calls under a fixed evaluation prompt, and we did not run a prompted baseline tuned to maximally elicit base-model tool use; our interventions should be read as modulating this propensity. We study a single model pair (Qwen2.5-3B) and a single task (ToolRL). The architecture-effect $t$-test is underpowered (12 CC vs 36 DFC runs). The steering results come from $n=40$ prompts per cell; the best-cell numbers have not been replicated on a fresh sample. The autointerp pass covered only A-exclusive features (Section [6.7](https://arxiv.org/html/2606.26474#S6.SS7)); the shared partition is the natural next target given that it carries most of the spilled tool capability.

#### Future Work\.

Since RL shapes agents that take consequential actions, the behaviors RL installs should become objects we can monitor and steer\. Model diffing is a step toward that because if RL\-induced behaviors reliably condense into small feature sets, then steering them offers runtime control that complements RLHF, DPO, and SFT, with the advantage of being legible\. Realising this vision will depend on having robust diffing methods; *Delta\-Crosscoders* \(Kassem et al\., [2026](https://arxiv.org/html/2606.26474#bib.bib18)\), which explicitly model the asymmetric nature of fine\-tuning, are a promising path toward the fidelity such control would require\. A further frontier is *knowledge\-boundary discovery* \(Li et al\., [2025](https://arxiv.org/html/2606.26474#bib.bib19)\): the most consequential thing an agent decides is often whether to act at all\. Locating the features that govern the boundary between committing to an action and abstaining under uncertainty would let us inspect an agent’s willingness to act\.

## Acknowledgments

We appreciate Prof\. Manling Li’s guidance on the task selection and future work directions\. We are thankful to Thomas Jiralerspong for clarifying our questions about the architecture and setup\. We are grateful to Yuliia Tkachuk for her help with the visualizations \(Fig\.[1](https://arxiv.org/html/2606.26474#S2.F1)\)\. This work was supported in part by the Northwestern University Department of Computer Science\. We also thank Prof\. Hullman for additional support\.

## References

- S\. Aranguri \(2025\) Tied crosscoders: explaining chat behavior from base model\. Note: [https://www\.lesswrong\.com/posts/3T8eKyaPvDDm2wzor/research\-question](https://www.lesswrong.com/posts/3T8eKyaPvDDm2wzor/research-question) LessWrong post\. Accessed: 2025\-08\-20\. Cited by: [§2](https://arxiv.org/html/2606.26474#S2.SS0.SSS0.Px2.p1.1)\.
- S\. Bills, N\. Cammarata, D\. Mossing, et al\. \(2023\) Language models can explain neurons in language models\. OpenAI Blog\. Cited by: [§2](https://arxiv.org/html/2606.26474#S2.SS0.SSS0.Px5.p1.1)\.
- T\. Bricken, A\. Templeton, J\. Batson, et al\. \(2023\) Towards monosemanticity: decomposing language models with dictionary learning\. Transformer Circuits Thread\. External Links: [Link](https://transformer-circuits.pub/2023/monosemantic-features/index.html) Cited by: [§1](https://arxiv.org/html/2606.26474#S1.p2.3), [§2](https://arxiv.org/html/2606.26474#S2.SS0.SSS0.Px1.p1.1)\.
- H\. Cunningham, A\. Ewart, L\. Riggs, R\. Huben, and L\. Sharkey \(2023\) Sparse autoencoders find highly interpretable features in language models\. arXiv preprint arXiv:2309\.08600\. Cited by: [§1](https://arxiv.org/html/2606.26474#S1.p2.3), [§2](https://arxiv.org/html/2606.26474#S2.SS0.SSS0.Px1.p1.1)\.
- N\. Elhage, T\. Hume, C\. Olsson, et al\. \(2022\) Toy models of superposition\. Transformer Circuits Thread\. External Links: [Link](https://transformer-circuits.pub/2022/toy_model/index.html) Cited by: [§2](https://arxiv.org/html/2606.26474#S2.SS0.SSS0.Px1.p1.1)\.
- T\. Jiralerspong and T\. Bricken \(2025\) Cross\-architecture model diffing with crosscoders: unsupervised discovery of differences between llms\. In Mechanistic Interpretability Workshop at NeurIPS 2025, External Links: [Link](https://openreview.net/forum?id=ZB84SvrZB8) Cited by: [§1](https://arxiv.org/html/2606.26474#S1.p2.3), [§2](https://arxiv.org/html/2606.26474#S2.SS0.SSS0.Px2.p1.1)\.
- A\. Kassem, T\. Jiralerspong, N\. Rostamzadeh, and G\. Farnadi \(2026\) Delta\-crosscoder: robust crosscoder model diffing in narrow fine\-tuning regimes\. External Links: 2603\.04426, [Link](https://arxiv.org/abs/2603.04426) Cited by: [§7](https://arxiv.org/html/2606.26474#S7.SS0.SSS0.Px6.p1.1)\.
- M\. Li, Y\. Zhao, W\. Zhang, S\. Li, W\. Xie, S\. Ng, T\. Chua, and Y\. Deng \(2025\) Knowledge boundary of large language models: a survey\. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\), Vienna, Austria, pp\. 5131–5157\. External Links: [Link](https://aclanthology.org/2025.acl-long.256/), [Document](https://dx.doi.org/10.18653/v1/2025.acl-long.256), ISBN 979\-8\-89176\-251\-0 Cited by: [§7](https://arxiv.org/html/2606.26474#S7.SS0.SSS0.Px6.p1.1)\.
- J\. Lindsey, A\. Templeton, J\. Marcus, T\. Conerly, J\. Batson, and C\. Olah \(2024\) Sparse crosscoders for cross\-layer features and model diffing\. Transformer Circuits Thread\. External Links: [Link](https://transformer-circuits.pub/2024/crosscoders/index.html) Cited by: [§1](https://arxiv.org/html/2606.26474#S1.p2.3), [§2](https://arxiv.org/html/2606.26474#S2.SS0.SSS0.Px2.p1.1)\.
- L\. Ouyang, J\. Wu, X\. Jiang, et al\. \(2022\) Training language models to follow instructions with human feedback\. In Advances in Neural Information Processing Systems, Vol\. 35, pp\. 27730–27744\. Cited by: [§1](https://arxiv.org/html/2606.26474#S1.p1.1)\.
- G\. Penedo, H\. Kydlíček, L\. Ben Allal, A\. Lozhkov, M\. Mitchell, C\. Raffel, L\. Von Werra, and T\. Wolf \(2024\) The FineWeb datasets: decanting the web for the finest text data at scale\. In Advances in Neural Information Processing Systems, External Links: [Link](https://arxiv.org/abs/2406.17557) Cited by: [§3\.1](https://arxiv.org/html/2606.26474#S3.SS1.p2.2)\.
- C\. Qian, E\. C\. Acikgoz, Q\. He, H\. Wang, X\. Chen, D\. Hakkani\-Tür, G\. Tur, and H\. Ji \(2025\) ToolRL: reward is all tool learning needs\. External Links: 2504\.13958, [Link](https://arxiv.org/abs/2504.13958) Cited by: [§2](https://arxiv.org/html/2606.26474#S2.SS0.SSS0.Px3.p1.1), [§3\.1](https://arxiv.org/html/2606.26474#S3.SS1.p1.2), [§3\.4](https://arxiv.org/html/2606.26474#S3.SS4.p1.1)\.
- Qwen Team \(2025\) Qwen2\.5 technical report\. arXiv preprint arXiv:2412\.15115\. Cited by: [§3\.1](https://arxiv.org/html/2606.26474#S3.SS1.p1.2)\.
- A\. Templeton, T\. Conerly, J\. Marcus, et al\. \(2024\) Scaling monosemanticity: extracting interpretable features from Claude 3 Sonnet\. Transformer Circuits Thread\. External Links: [Link](https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html) Cited by: [§1](https://arxiv.org/html/2606.26474#S1.p2.3), [§2](https://arxiv.org/html/2606.26474#S2.SS0.SSS0.Px1.p1.1), [§2](https://arxiv.org/html/2606.26474#S2.SS0.SSS0.Px4.p1.1)\.

## Appendix A Autointerp Pseudocode

```
AUTOINTERP(features, texts, K=10):
  # Stage 1: Find top-K activating examples per feature
  for each shard in feature_cache:
    for each feature f:
      maintain min-heap topk[f] of size K by activation value

  for each feature f where topk[f] is non-empty:
    # Stage 2: Explain
    top_texts = lookup texts for topk[f]
    explanation = LLM("what pattern do these texts share?", top_texts)

    # Stage 3: Score (detection task)
    positives = top_texts              # should match explanation
    negatives = sample K random texts  # should not match
    combined  = shuffle(positives + negatives)
    predictions = LLM("MATCH or NO_MATCH?", explanation, combined)
    # ground_truth: positive/negative label of each text in combined
    score = accuracy(predictions, ground_truth)
    interpretable = score >= 0.8
```
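The pseudocode above can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the LLM calls are replaced by two caller-supplied functions (`explain` and `detect`, both hypothetical names), the feature cache is modelled as shards of `(text_id, feature_id, activation)` tuples, and negatives are drawn from texts outside the feature's top-K set, which the pseudocode does not specify.

```python
import heapq
import random
from typing import Callable, Dict, Iterable, List, Tuple

Record = Tuple[int, int, float]  # (text_id, feature_id, activation)


def top_k_examples(shards: Iterable[Iterable[Record]], k: int = 10) -> Dict[int, List[int]]:
    """Stage 1: keep a size-k min-heap of (activation, text_id) per feature."""
    heaps: Dict[int, List[Tuple[float, int]]] = {}
    for shard in shards:
        for text_id, feat, act in shard:
            heap = heaps.setdefault(feat, [])
            if len(heap) < k:
                heapq.heappush(heap, (act, text_id))
            elif act > heap[0][0]:
                heapq.heapreplace(heap, (act, text_id))  # evict the weakest example
    return {f: [tid for _, tid in sorted(h, reverse=True)] for f, h in heaps.items()}


def autointerp(shards: Iterable[Iterable[Record]], texts: Dict[int, str],
               explain: Callable[[List[str]], str],
               detect: Callable[[str, str], bool],
               k: int = 10, threshold: float = 0.8, seed: int = 0):
    """Stages 2 and 3: explain each feature, then score the explanation by detection."""
    rng = random.Random(seed)
    results = {}
    for feat, ids in top_k_examples(shards, k).items():
        positives = [texts[i] for i in ids]
        explanation = explain(positives)              # stand-in for the LLM explainer
        top_ids = set(ids)
        pool = [t for i, t in texts.items() if i not in top_ids]
        negatives = rng.sample(pool, min(k, len(pool)))
        labelled = [(t, True) for t in positives] + [(t, False) for t in negatives]
        rng.shuffle(labelled)
        preds = [detect(explanation, t) for t, _ in labelled]  # MATCH / NO_MATCH
        score = sum(p == y for p, (_, y) in zip(preds, labelled)) / len(labelled)
        results[feat] = (explanation, score, score >= threshold)
    return results


if __name__ == "__main__":
    # Toy corpus and keyword-based stubs in place of real LLM calls.
    texts = {0: "call tool a", 1: "call tool b", 2: "hello", 3: "world"}
    shards = [[(0, 5, 0.9), (2, 5, 0.1)], [(1, 5, 0.8), (3, 5, 0.05)]]
    print(autointerp(shards, texts, lambda ts: "tool", lambda e, t: e in t, k=2))
```

Passing the explainer and detector in as callables keeps the heap and scoring logic testable without any model access; swapping in real LLM calls would only change those two arguments.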

## Appendix B Top A\-Exclusive Feature Interpretations

From dfc\-D8k\-excl10\-freeexcl\-k160, $n=40$ prompts, ToolRL test split\.

Table 7: Top A\-exclusive features by Cohen’s $d$ with Gemma autointerp descriptions\.

## Appendix C Full Sweep Evaluation Table

Table 8 reports the key hyperparameters and post\-reconstruction evaluation outcomes for all 48 evaluated sweep variants\. We include both Model A and Model B post\-reconstruction metrics so that individual model configurations can be cited directly in the appendix without referring back to the JSONL artifact\.

Table 8: Full post\-reconstruction evaluation table across all 48 sweep variants\. $p$ denotes the exclusive share per model, $\lambda_{\mathrm{excl}}$ the exclusive\-partition L1 penalty, and $\lambda_{\mathrm{sh}}$ the shared\-feature L1 penalty\. All rows use 100 held\-out ToolRL prompts\.

| Model | Arch | $D$ | $k$ | $p$ | $\lambda_{\mathrm{excl}}$ | $\lambda_{\mathrm{sh}}$ | A ov | A fmt | A tool | $\Delta$A tool | B ov | B fmt | B tool | $\Delta$B tool |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| cc\-D16k\-k160 | CrossCoder | 16384 | 160 | 0% | 0 | $10^{-3}$ | 1 | 66 | 67 | 48 | \-0\.92 | 0 | 4 | 4 |
| cc\-D16k\-k45 | CrossCoder | 16384 | 45 | 0% | 0 | $10^{-3}$ | 0\.55 | 51 | 52 | 33 | \-0\.76 | 0 | 12 | 12 |
| cc\-D16k\-k90 | CrossCoder | 16384 | 90 | 0% | 0 | $10^{-3}$ | 0\.37 | 45 | 46 | 27 | \-0\.94 | 0 | 3 | 3 |
| cc\-D16k\-nol1\-k160 | CrossCoder | 16384 | 160 | 0% | 0 | 0 | 0\.26 | 42 | 42 | 23 | \-0\.70 | 0 | 15 | 15 |
| cc\-D16k\-nol1\-k45 | CrossCoder | 16384 | 45 | 0% | 0 | 0 | 0\.59 | 53 | 53 | 34 | \-0\.72 | 0 | 14 | 14 |
| cc\-D16k\-nol1\-k90 | CrossCoder | 16384 | 90 | 0% | 0 | 0 | 0\.05 | 35 | 35 | 16 | \-0\.70 | 0 | 15 | 15 |
| cc\-D8k\-k160 | CrossCoder | 8192 | 160 | 0% | 0 | $10^{-3}$ | 0\.47 | 49 | 49 | 30 | \-0\.92 | 0 | 4 | 4 |
| cc\-D8k\-k45 | CrossCoder | 8192 | 45 | 0% | 0 | $10^{-3}$ | 0\.76 | 58 | 59 | 40 | \-0\.76 | 0 | 12 | 12 |
| cc\-D8k\-k90 | CrossCoder | 8192 | 90 | 0% | 0 | $10^{-3}$ | 0\.50 | 50 | 50 | 31 | \-0\.94 | 0 | 3 | 3 |
| cc\-D8k\-nol1\-k160 | CrossCoder | 8192 | 160 | 0% | 0 | 0 | 0\.21 | 41 | 40 | 21 | \-0\.84 | 0 | 8 | 8 |
| cc\-D8k\-nol1\-k45 | CrossCoder | 8192 | 45 | 0% | 0 | 0 | 1\.10 | 70 | 70 | 51 | \-0\.66 | 0 | 17 | 17 |
| cc\-D8k\-nol1\-k90 | CrossCoder | 8192 | 90 | 0% | 0 | 0 | 0\.80 | 60 | 60 | 41 | \-1 | 0 | 0 | 0 |
| dfc\-D16k\-excl10\-freeexcl\-k160 | DFC | 16384 | 160 | 10% | 0 | $10^{-3}$ | 0\.75 | 59 | 58 | 39 | \-0\.92 | 0 | 4 | 4 |
| dfc\-D16k\-excl10\-freeexcl\-k45 | DFC | 16384 | 45 | 10% | 0 | $10^{-3}$ | \-0\.01 | 33 | 33 | 14 | \-0\.98 | 0 | 1 | 1 |
| dfc\-D16k\-excl10\-freeexcl\-k90 | DFC | 16384 | 90 | 10% | 0 | $10^{-3}$ | 0\.69 | 57 | 56 | 37 | \-0\.62 | 0 | 19 | 19 |
| dfc\-D16k\-excl10\-k160 | DFC | 16384 | 160 | 10% | $10^{-3}$ | $10^{-3}$ | 0\.60 | 52 | 54 | 35 | \-0\.90 | 0 | 5 | 5 |
| dfc\-D16k\-excl10\-k45 | DFC | 16384 | 45 | 10% | $10^{-3}$ | $10^{-3}$ | 0\.54 | 50 | 52 | 33 | \-0\.86 | 0 | 7 | 7 |
| dfc\-D16k\-excl10\-k90 | DFC | 16384 | 90 | 10% | $10^{-3}$ | $10^{-3}$ | 0\.54 | 50 | 52 | 27 | \-0\.70 | 0 | 15 | 15 |
| dfc\-D16k\-excl3\-freeexcl\-k160 | DFC | 16384 | 160 | 3% | 0 | $10^{-3}$ | 0\.49 | 49 | 50 | 31 | \-0\.94 | 0 | 3 | 3 |
| dfc\-D16k\-excl3\-freeexcl\-k45 | DFC | 16384 | 45 | 3% | 0 | $10^{-3}$ | 0\.49 | 49 | 50 | 31 | \-0\.88 | 0 | 6 | 6 |
| dfc\-D16k\-excl3\-freeexcl\-k90 | DFC | 16384 | 90 | 3% | 0 | $10^{-3}$ | 0\.23 | 41 | 41 | 22 | \-0\.82 | 0 | 9 | 9 |
| dfc\-D16k\-excl3\-k160 | DFC | 16384 | 160 | 3% | $10^{-3}$ | $10^{-3}$ | 0\.56 | 52 | 52 | 33 | \-0\.84 | 0 | 8 | 8 |
| dfc\-D16k\-excl3\-k45 | DFC | 16384 | 45 | 3% | $10^{-3}$ | $10^{-3}$ | 0\.28 | 42 | 43 | 24 | \-0\.88 | 0 | 6 | 6 |
| dfc\-D16k\-excl3\-k90 | DFC | 16384 | 90 | 3% | $10^{-3}$ | $10^{-3}$ | 0\.26 | 42 | 42 | 23 | \-0\.86 | 0 | 7 | 7 |
| dfc\-D16k\-excl5\-freeexcl\-k160 | DFC | 16384 | 160 | 5% | 0 | $10^{-3}$ | 0\.74 | 58 | 58 | 39 | \-0\.94 | 0 | 3 | 3 |
| dfc\-D16k\-excl5\-freeexcl\-k45 | DFC | 16384 | 45 | 5% | 0 | $10^{-3}$ | 0\.44 | 48 | 48 | 29 | \-0\.88 | 0 | 6 | 6 |
| dfc\-D16k\-excl5\-freeexcl\-k90 | DFC | 16384 | 90 | 5% | 0 | $10^{-3}$ | 0\.33 | 43 | 45 | 26 | \-0\.86 | 0 | 7 | 7 |
| dfc\-D16k\-excl5\-k160 | DFC | 16384 | 160 | 5% | $10^{-3}$ | $10^{-3}$ | 0\.49 | 49 | 50 | 31 | \-0\.92 | 0 | 4 | 4 |
| dfc\-D16k\-excl5\-k45 | DFC | 16384 | 45 | 5% | $10^{-3}$ | $10^{-3}$ | 0\.37 | 45 | 46 | 27 | \-0\.94 | 0 | 3 | 3 |
| dfc\-D16k\-excl5\-k90 | DFC | 16384 | 90 | 5% | $10^{-3}$ | $10^{-3}$ | 0\.49 | 49 | 50 | 31 | \-0\.94 | 0 | 3 | 3 |
| dfc\-D8k\-excl10\-freeexcl\-k160 | DFC | 8192 | 160 | 10% | 0 | $10^{-3}$ | 1\.49 | 83 | 83 | 64 | \-1 | 0 | 0 | 0 |
| dfc\-D8k\-excl10\-freeexcl\-k45 | DFC | 8192 | 45 | 10% | 0 | $10^{-3}$ | 0\.27 | 41 | 43 | 24 | \-0\.88 | 0 | 6 | 6 |
| dfc\-D8k\-excl10\-freeexcl\-k90 | DFC | 8192 | 90 | 10% | 0 | $10^{-3}$ | 0\.56 | 52 | 52 | 33 | \-0\.82 | 0 | 9 | 9 |
| dfc\-D8k\-excl10\-k160 | DFC | 8192 | 160 | 10% | $10^{-3}$ | $10^{-3}$ | 0\.32 | 44 | 44 | 25 | \-0\.84 | 0 | 8 | 8 |
| dfc\-D8k\-excl10\-k45 | DFC | 8192 | 45 | 10% | $10^{-3}$ | $10^{-3}$ | 0\.70 | 56 | 57 | 38 | \-0\.60 | 0 | 20 | 20 |
| dfc\-D8k\-excl10\-k90 | DFC | 8192 | 90 | 10% | $10^{-3}$ | $10^{-3}$ | 0\.60 | 54 | 53 | 34 | \-0\.96 | 0 | 2 | 2 |
| dfc\-D8k\-excl3\-freeexcl\-k160 | DFC | 8192 | 160 | 3% | 0 | $10^{-3}$ | 0\.68 | 56 | 56 | 37 | \-0\.88 | 0 | 6 | 6 |
| dfc\-D8k\-excl3\-freeexcl\-k45 | DFC | 8192 | 45 | 3% | 0 | $10^{-3}$ | 0\.47 | 49 | 49 | 30 | \-0\.88 | 0 | 6 | 6 |
| dfc\-D8k\-excl3\-freeexcl\-k90 | DFC | 8192 | 90 | 3% | 0 | $10^{-3}$ | 0\.10 | 36 | 37 | 18 | \-0\.82 | 0 | 9 | 9 |
| dfc\-D8k\-excl3\-k160 | DFC | 8192 | 160 | 3% | $10^{-3}$ | $10^{-3}$ | 0\.50 | 50 | 50 | 31 | \-0\.90 | 0 | 5 | 5 |
| dfc\-D8k\-excl3\-k45 | DFC | 8192 | 45 | 3% | $10^{-3}$ | $10^{-3}$ | 0\.32 | 44 | 44 | 25 | \-0\.98 | 0 | 1 | 1 |
| dfc\-D8k\-excl3\-k90 | DFC | 8192 | 90 | 3% | $10^{-3}$ | $10^{-3}$ | 0\.32 | 44 | 44 | 25 | \-0\.88 | 0 | 6 | 6 |
| dfc\-D8k\-excl5\-freeexcl\-k160 | DFC | 8192 | 160 | 5% | 0 | $10^{-3}$ | 0\.53 | 51 | 51 | 32 | \-0\.92 | 0 | 4 | 4 |
| dfc\-D8k\-excl5\-freeexcl\-k45 | DFC | 8192 | 45 | 5% | 0 | $10^{-3}$ | 0\.71 | 57 | 57 | 38 | \-0\.96 | 0 | 2 | 2 |
| dfc\-D8k\-excl5\-freeexcl\-k90 | DFC | 8192 | 90 | 5% | 0 | $10^{-3}$ | 0\.92 | 64 | 64 | 45 | \-0\.94 | 0 | 3 | 3 |
| dfc\-D8k\-excl5\-k160 | DFC | 8192 | 160 | 5% | $10^{-3}$ | $10^{-3}$ | \-0\.22 | 26 | 26 | 7 | \-0\.96 | 0 | 2 | 2 |
| dfc\-D8k\-excl5\-k45 | DFC | 8192 | 45 | 5% | $10^{-3}$ | $10^{-3}$ | 0\.35 | 45 | 45 | 26 | \-0\.94 | 0 | 3 | 3 |
| dfc\-D8k\-excl5\-k90 | DFC | 8192 | 90 | 5% | $10^{-3}$ | $10^{-3}$ | 0\.55 | 51 | 52 | 33 | \-0\.74 | 0 | 13 | 13 |

## Appendix D Full Saturation Table

Best $\Delta$ tool\-correctness \(max over $\alpha$\) at each $|S|$ with 95% paired\-$t$ CIs, referenced from Section [6\.1](https://arxiv.org/html/2606.26474#S6.SS1)\. “—” indicates no sweep coverage\.

Table 9: Best $\Delta$ \(max over $\alpha$\) at each $|S|$ with 95% paired\-$t$ CIs\. DFC A\-excl plateaus at \+47–65 pp from $|S|=1$; CC needs $|S|=33$ to reach \+70 pp\. “—” = no sweep coverage\.

Prompt: ToolRL hold\-out \(prompt\_index=2\): system prompt listing seven list/string utilities and asking the model to emit <think\>/<tool\_call\>/<response\>\. Available tools: flatten\_list, split\_list, get\_range, reverse\_words, reverse\_string, merge\_dictionaries, is\_subset\.

Before steering \(ToolRL\-Qwen2\.5\-3B \+ DFC reconstruction, clean\): no <tool\_call\> emitted\.

<think\> The user’s request is not clear from the provided information\. I need more details about what task or information they are looking for\. </think\> <response\> Could you please provide more details about the task you want to accomplish? Are you looking for information, performing a calculation, or need help with a specific process? </response\>

⇒ format\_accuracy=False, tool\_correctness=False

After steering \(same model \+ steer feature \#136, $|S|=1$, $\alpha=32$\): structured tool call\.

User: I have a list of numbers \[1,2,3,4,5\]\. How can I split it into chunks of 2?

<think\> The user wants to split the given list into chunks of 2\. We can use the ’split\_list’ function to achieve this\. The input list is \[1,2,3,4,5\] and the chunk size is 2\. </think\> <tool\_call\> \{"name":"split\_list", "parameters":\{"lst":\[1,2,3,4,5\], "chunk\_size":2\}\} </tool\_call\>

⇒ format\_accuracy=True, tool\_correctness=True

Figure 5: Tool\-use behaviour before vs\. after a single A\-exclusive steering vector\. Both panels show the same ToolRL\-Qwen2\.5\-3B passed through the same DFC \(dfc\-D8k\-excl10\-k45, layer 13\)\. Left: standard reconstruction; the model asks for clarification and never emits a <tool\_call\> \(format\_accuracy=False, tool\_correctness=False\)\. Right: the same model with an additive delta on the single highest\-Cohen’s\-$d$ A\-exclusive feature \(index 136\), $|S|=1$, $\alpha=32$; the model now infers a plausible user goal, reasons through it inside <think\>, and emits a syntactically valid <tool\_call\> naming split\_list with the correct argument schema \(format\_accuracy=True, tool\_correctness=True\)\.

### D\.1 Practical Sweet Spot and $\alpha$ Tradeoff

![Refer to caption](https://arxiv.org/html/2606.26474v1/figures/ablation_l13/ablation_tradeoff_marginals_alpha_only.png)

Figure 6: Best $\Delta$ vs $\alpha$, pooled across DFC A\-excl, shared, and CC\. Moderate $\alpha$ \($\approx 6$–$32$\) is consistently near\-optimal\. Over\-steering \($\alpha=64$, large $|S|$\) degrades performance\.

Key operating points: DFC A\-excl at $|S|=1$, $\alpha=32$ yields \+65\.0 pp; CC at $|S|=33$, $\alpha=6$ yields \+70\.0 pp, i\.e\. $33\times$ the neuron budget for a 5 pp gain\. Over\-steering \($\alpha=64$, $|S|\geq 9$\) degrades performance \(e\.g\. $|S|=33$, $\alpha=64 \to$ \+0\.0 pp\)\.

### D\.2Per\-Condition Heatmaps

![Refer to caption](https://arxiv.org/html/2606.26474v1/figures/heatmap_dfc-D8k-excl10-k45_dA_tool_vs_clean.png) \(a\) dfc\-D8k\-excl10\-k45: sweet spot at low $|S|$, high $\alpha$\.
![Refer to caption](https://arxiv.org/html/2606.26474v1/figures/heatmap_dfc-D8k-excl10-freeexcl-k160_dA_tool_vs_clean.png) \(b\) dfc\-D8k\-excl10\-freeexcl\-k160: sweet spot at high $|S|$, moderate $\alpha$\.

Figure 7: Per\-cell $\Delta$ tool\-correctness heatmaps \($|S| \times \alpha$\) for the two representative DFCs\. Sparse $k=45$ peaks at targeted low\-$|S|$ steering; dense $k=160$ benefits from broader coverage\.

## Appendix E Models and Repository