Adversarial Concept Search: Predicting Compositional Errors From Feature Geometry

arXiv cs.AI 06/15/26, 04:00 AM Papers
adversarial concept-search compositional-generalization feature-geometry llm error-prediction
Summary
This paper proposes Adversarial Concept Search, a method that uses the representational geometry of large language models to predict compositional failures without evaluating specific inputs. The approach identifies high-risk scenarios by measuring interference between salient features.
arXiv:2606.13934v1 Announce Type: new Abstract: Humans cannot always intuit what scenarios are most challenging to LLMs. Hoping to capture challenging edge cases, developers either design problems to be difficult for humans or curate extensive benchmarks. What if we could instead anticipate which scenarios a model will fail on? In this paper, we use an LLM's representational geometry to predict which concept combinations it will fail on. We attribute this compositional failure to interference between salient features. In tasks that require systematic composition - toy programmatic settings, multihop reasoning, multilingual factual recall - we find that when a pair of concepts is encoded near-orthogonally, the model reliably composes them. When their linear encodings are close, producing interference, the model fails to compose them. Our method reliably anticipates failure modes across different compositional tasks, without evaluating specific inputs. These results lay the groundwork to use representational geometry to identify high-risk examples, construct targeted stress tests, and provide a scalable foundation for active learning in real-world deployment.
# Adversarial Concept Search: Predicting Compositional Errors From Feature Geometry
Source: [https://arxiv.org/html/2606.13934](https://arxiv.org/html/2606.13934)
Jennifer Meng Lu (Brown University, meng\_lu@brown.edu), Ruochen Zhang (Brown University, ruochen\_zhang@brown.edu), Isabelle Lee (University of Southern California, lee.isabelle.g@gmail.com), David Alvarez-Melis (Harvard University, dam@seas.harvard.edu), Ellie Pavlick (Brown University, ellie\_pavlick@brown.edu), Naomi Saphra (Boston University, nsaphra@bu.edu)

###### Abstract

Humans cannot always intuit what scenarios are most challenging to LLMs. Hoping to capture challenging edge cases, developers either design problems to be difficult for humans or curate extensive benchmarks. What if we could instead anticipate which scenarios a model will fail on? In this paper, we use an LLM’s representational geometry to predict which concept combinations it will fail on. We attribute this compositional failure to interference between salient features. In tasks that require systematic composition (toy programmatic settings, multihop reasoning, multilingual factual recall), we find that when a pair of concepts is encoded near-orthogonally, the model reliably composes them. When their linear encodings are close, producing interference, the model fails to compose them. Our method reliably anticipates failure modes across different compositional tasks, without evaluating specific inputs. These results lay the groundwork to use representational geometry to identify high-risk examples, construct targeted stress tests, and provide a scalable foundation for active learning in real-world deployment.

## 1 Introduction

As large language models (LLMs) improve across a wide range of tasks and domains, it is increasingly difficult to identify remaining challenges. Since humans cannot reliably predict which concept combinations will be challenging for LLMs, this gap in understanding makes dataset curation inefficient and limits our ability to anticipate LLM failure modes. Developers either design problems that challenge humans (but not LLMs), or they curate broad benchmarks, hoping to cover informative edge cases. What if we could instead anticipate which scenarios an individual model will fail on?

![Refer to caption](https://arxiv.org/html/2606.13934v1/x1.png)

Figure 1: The angle between atomic concept representations identifies the most challenging compositions, enabling failure prediction across the combinatorial space without evaluating specific inputs.

We call this objective Adversarial Concept Search (ACS): the task of identifying meaningful conceptual scenarios that are likely to induce model failure without evaluating the model on specific inputs that instantiate those scenarios. In this paper, we efficiently identify adversarial scenarios by predicting failures of compositional generalization, specifically systematicity \[[21](https://arxiv.org/html/2606.13934#bib.bib4),[29](https://arxiv.org/html/2606.13934#bib.bib81)\]. A system generalizes compositionally when it successfully recombines known atomic concepts in novel configurations that were not observed during training. For example, if the training distribution includes “black dog” and “white cat”, a compositional model will reliably process “black cat”. This capacity is crucial for robust performance in out-of-distribution (OOD) settings.

By mapping these compositional capabilities, we could systematically construct custom challenge sets for efficient, targeted stress testing. As a practical demonstration, we leverage a model’s internal representational geometry to predict its failures in untested scenarios. Unlike existing error prediction methods such as those used for active learning \[[43](https://arxiv.org/html/2606.13934#bib.bib24)\], we do not require a specific input processed by the model. When curating datasets for semi-supervised settings like language modeling, such methods are of limited practical use. In these settings, we lack diverse existing inputs and are constrained by the expense of generating and processing every possible input. Our approach instead predicts errors from only a description of the concepts involved and their atomic representations. This tool allows a developer to prioritize generating or collecting coherent, challenging inputs. For example, it may be prohibitively expensive to professionally translate all English corpora into Russian for a multilingual LLM, but by identifying which concepts will produce Russian errors, we can prioritize collecting Russian data in problem areas like *notable deaths in Russian*. Across multiple tasks, we will predict these compositional errors from the geometry of their atomic concepts.

These predictions rely on our hypothesis that models make mistakes when features stored in superposition cause interference. LLMs rely on superposition \[[6](https://arxiv.org/html/2606.13934#bib.bib136),[18](https://arxiv.org/html/2606.13934#bib.bib82)\] to encode many features in limited dimensions by sharing linear directions. It has been shown that, even without orthogonal feature encodings, this compression can in principle be perfectly lossless (that is, invertible) as long as feature activations are sparse \[[12](https://arxiv.org/html/2606.13934#bib.bib10)\]. In practice, however, we posit that models learn a *lossy* coding of their input features: when multiple non-orthogonal feature encodings are simultaneously active, they can interfere with one another and damage model performance.

This interference is especially salient in compositional settings, where multiple task\-relevant features must be jointly represented\. We hypothesize that LLM failures can arise from geometric interference between atomic concept representations, and that these failures can be predicted from the angles between their encoded feature directions\. Our conceptual model can use representational geometry to proactively identify failure cases, paving the way for dynamic stress testing and scalable active learning\. Our contributions are as follows:

1. Description and analysis of compositional interference in lossy superposition. We attribute compositional errors to interference between non-orthogonal atomic concept representations. Because only a few features activate at a time, prior theory argues that an ideal decoder can recover the active features even when they are encoded in superposition. We posit that when decoding is lossy, recovery error is governed by interference between the features being composed.
2. Proof-of-concept in a controlled setting (SCAN). We validate this hypothesis in SCAN \[[32](https://arxiv.org/html/2606.13934#bib.bib44)\], a synthetic compositional generalization benchmark. In this controlled setting, we measure pairwise interactions between concept representations and show that smaller angles correlate with compositional errors. This result holds across various data conditions and model sizes, confirming the predicted relationship between interference and failure.
3. Predicting compositional failure in real LLMs. We predict compositional generalization in realistic LLM tasks from geometric interference between the constituent concepts. In multihop question answering (QA), the angle between component single-hop representations predicts whether the LLM will successfully compose them. In multilingual factual recall, the angle between fact representations and language subspaces predicts retrieval accuracy. Across both settings, we find that greater separation between atomic representations corresponds to more reliable composition, demonstrating that representation geometry can predict compositional failures without evaluating the composed task itself. This establishes a scalable foundation for identifying challenging input scenarios and guiding active learning in real-world deployment.

## 2 Compositionality and Lossy Superposition

We will first explain why LLMs may fail to compose non-orthogonal feature representations and then explain how to leverage this phenomenon for error prediction. Compositionality has long been associated with orthogonal atomic feature representations \[[45](https://arxiv.org/html/2606.13934#bib.bib50)\], but modern theories of feature superposition permit lossless reconstruction from non-orthogonal encodings, even with exponentially more input features than representation dimensions \[[12](https://arxiv.org/html/2606.13934#bib.bib10),[22](https://arxiv.org/html/2606.13934#bib.bib132)\]. In practice, however, we contend that superposition is lossy and, therefore, that compositional errors are predictable from the geometry of the active features being combined. We then describe how features can be extracted from language models and used to empirically measure angular distance between them across different settings.

### 2.1 Background: Lossy Superposition

Intuitively, one can store $d$ features in $d$ linear dimensions. So why is lossless compression of more than $d$ features possible under current theories of superposition? The key assumption is $k$-sparsity: only $k \ll d$ features can be simultaneously active in any given input. This realistic sparsity assumption permits sufficiency theorems from compressed sensing \[[12](https://arxiv.org/html/2606.13934#bib.bib10)\], guaranteeing that the encoded representation can be inverted, recovering the original features exactly. This sparsity allows us to recover an exponential number of features $m$ from a linear number of dimensions $d$. Specifically, if we know the linear encoder $A \in \mathbb{R}^{d \times m}$, we can exactly recover any specific feature vector $z \in [-1,1]^{m}$ with $k$-sparse support $\textrm{supp}(z) \subset [m]$ from its encoded representation $Az$. Given $Az$, we reconstruct the feature vector $\hat{z}$ with zero recovery error,

$$\|z - \hat{z}\|_{2} = 0. \tag{1}$$

Crucially, this guarantee does not degrade when feature representations are non-orthogonal. Even if we prohibit nonlinear decoding by the next LLM layer, concepts encoded with very similar representations can still be recovered using a biorthogonal decoder dictionary \[[22](https://arxiv.org/html/2606.13934#bib.bib132)\].
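To make the zero-error claim concrete, the following is a minimal sketch for the simplest case, $k = 1$, using only the Python standard library. The encoder columns, the feature vector, and the helper names (`encode`, `recover_one_sparse`) are illustrative assumptions rather than anything from the paper, and the exhaustive support search is chosen for clarity, not scalability.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def encode(columns, z):
    """Compute Az, where `columns` lists the columns a_i of the encoder A."""
    d = len(columns[0])
    return [sum(z_i * a[r] for z_i, a in zip(z, columns)) for r in range(d)]


def recover_one_sparse(columns, y):
    """Recover a 1-sparse z from y = Az by trying every single-feature support."""
    best = None
    for i, a in enumerate(columns):
        coef = dot(a, y) / dot(a, a)  # least-squares coefficient for column i
        residual = sum((y_r - coef * a_r) ** 2 for y_r, a_r in zip(y, a))
        if best is None or residual < best[0]:
            best = (residual, i, coef)
    _, i, coef = best
    z_hat = [0.0] * len(columns)
    z_hat[i] = coef
    return z_hat


# Hypothetical encoder: m = 3 features in d = 2 dimensions.
# The middle column is non-orthogonal to both of the others.
A_columns = [(1.0, 0.0), (0.8, 0.6), (0.0, 1.0)]
z = [0.0, 0.5, 0.0]  # only the middle feature is active
z_hat = recover_one_sparse(A_columns, encode(A_columns, z))
recovery_error = sum((a - b) ** 2 for a, b in zip(z, z_hat)) ** 0.5
print(z_hat, recovery_error)
```

In this noiseless toy case the active feature is identified and its value recovered up to floating-point error, even though three features share two dimensions.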

Although this ideal decoding exists under mild assumptions (for classic compressed sensing, the relevant assumption is the Restricted Isometry Property \[[13](https://arxiv.org/html/2606.13934#bib.bib5)\]), it is unlikely to hold during real LLM inference, where we can assume some noise in the representation. These guarantees only exist in noiseless settings. In compressed sensing theory, recovery is more sensitive to noise when the feature encoding matrix contains encodings with high cosine similarity \[[8](https://arxiv.org/html/2606.13934#bib.bib6)\]. Specifically, robust decoding requires an encoder $A$ with low global coherence, defined as the maximum absolute cosine similarity between any two distinct columns,

$$\rho = \max_{\begin{subarray}{c} i,j \in [m] \\ j \neq i \end{subarray}} \left|\cos(a_{i}, a_{j})\right|.$$

Worst-case recovery error bounds do not depend on low coherence with an ideal decoder, but do when decoding under noise. Robust decoding motivates why correlated features are known to be encoded orthogonally, while anti-correlated features exhibit negative interference \[[18](https://arxiv.org/html/2606.13934#bib.bib82)\].

Global coherence provides bounds on worst-case errors, but we are concerned about errors on specific feature combinations. In a specific scenario, not all interference is equal. We are most concerned with features salient to that scenario: specifically, the sparse support $\textrm{supp}(z)$ and any features that are relevant in the context of the sparse support. (One simple way to limit the salient feature set is by leveraging structured sparsity and assuming features are organized into active blocks of related features. This approach is suggested by Adcock et al. \[[2](https://arxiv.org/html/2606.13934#bib.bib8)\] and inspires our concept-salient subspace construction by SVD in Section [4](https://arxiv.org/html/2606.13934#S4).) Regardless of how the LLM identifies the salient support $\mathcal{S}$, if it restricts decoding to only salient features, then the dimension required for robust recovery is controlled with high probability by the structured bound of Adcock et al. \[[2](https://arxiv.org/html/2606.13934#bib.bib8)\]. In effect, this bound depends on an example’s interaction with the salient support, itself bounded by the salient support’s local cumulative coherence,

$$\alpha(\mathcal{S}) = \max_{i \in \mathcal{S}} \sum_{\begin{subarray}{c} j \in \mathcal{S} \\ j \neq i \end{subarray}} \left|\cos(a_{i}, a_{j})\right|. \tag{2}$$

Further theoretical details on the relevant bound, as well as related bounds for robust linear compressed sensing, are provided in Appendix [A](https://arxiv.org/html/2606.13934#A1).
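The two coherence quantities defined in this section can be computed directly from a list of encoder columns. The sketch below is illustrative only: the four toy columns and the function names are assumptions, and it uses plain Python rather than any particular numerical library.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def global_coherence(columns):
    """rho: largest |cos| over all pairs of distinct encoder columns."""
    return max(abs(cosine(u, v)) for u, v in combinations(columns, 2))


def local_cumulative_coherence(columns, support):
    """alpha(S), Eq. (2): max over i in S of the summed |cos| to every other j in S."""
    return max(
        sum(abs(cosine(columns[i], columns[j])) for j in support if j != i)
        for i in support
    )


# Hypothetical encoder with m = 4 feature directions in d = 2 dimensions.
A_columns = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, -1.0)]
print(global_coherence(A_columns))                       # worst pair anywhere in A
print(local_cumulative_coherence(A_columns, [0, 1]))     # orthogonal salient pair
print(local_cumulative_coherence(A_columns, [0, 2, 3]))  # overlapping salient set
```

The same encoder has one global coherence value but very different local values depending on which features are salient, which is the distinction the text draws between worst-case bounds and scenario-specific interference.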

Why might feature recovery be less robust when the salient support has high cumulative coherence? Intuitively, robust recovery is hampered by destructive interference from the most damaging feature: the salient feature that has the smallest angle with other features salient to the support. This is captured by the local cumulative coherence term, which we operationalize below as Compositional Interference (CI). We will leverage this interference metric to rank concept combinations by likelihood of LLM compositional failure. Our intuition is simple: when an LLM cannot robustly recover a set of features in superposition due to interference, it is more likely to make mistakes during processing.

### 2.2 Measuring compositional interference

The theory above predicts that compositional failures should be more likely when the features active or salient in a compositional input have high cumulative coherence. How can we compute this value without access to an example of the compositional scenario?

In real LLMs, the ground-truth concepts learned by the model are not directly accessible. While methods such as Sparse Autoencoders (SAEs) \[[17](https://arxiv.org/html/2606.13934#bib.bib149)\] have been proposed to disentangle the representation space into discrete features, they introduce additional assumptions and implementation challenges. We instead seek a simple proxy for compositional interference using models’ residual representations.

![Refer to caption](https://arxiv.org/html/2606.13934v1/x2.png) (a) Large-scale structure in representation space.
![Refer to caption](https://arxiv.org/html/2606.13934v1/x3.png) (b) SCAN example.

Figure 2: After controlling large-scale representation structure, compositional generalization can be predicted from feature interference. (a) Representations cluster by global structures that are irrelevant to the concepts of interest. Colors indicate examples from the same topic subset (Layer 9). (b) Hypothesized illustration of compositional failure for Section [3](https://arxiv.org/html/2606.13934#S3). Two commands share the same structure but differ in a single concept (*run* vs. *jump*). The relevant atomic concepts at the top are near-orthogonal, enabling correct composition.

#### Atomic concepts and salient features.

We distinguish between atomic concepts in the input and latent features in the model representation. Let $\mathcal{C}$ denote the set of atomic concepts, and let $C \subseteq \mathcal{C}$ denote a set of active atomic concepts which are instantiated by a specific input $x \in \mathcal{X}(C)$. These concepts correspond to a salient feature support $\mathcal{S}(C)$, i.e., the indices of latent features that are active or relevant for these concepts. (In particular, each input $x$ has a corresponding feature vector $z(x)$; for any $x \in \mathcal{X}(C)$, $\operatorname{supp}(z(x)) \subseteq \mathcal{S}(C)$.) For example, the concept *Spanish* may correspond not to a single direction, but to a set of feature directions that are highly active in Spanish text and encode different aspects of the language.

#### Estimating salient feature encodings from examples\.

In real LLMs, neither the salient support $\mathcal{S}(C)$ nor the salient feature encodings $A_{\mathcal{S}(C)} = \{a_{i} : i \in \mathcal{S}(C)\}$ are directly observed. We therefore estimate $A_{\mathcal{S}(C)}$ using residual-stream representations from examples that instantiate each concept in $C$. Preferably, for an atomic concept $c \in C$, a vector representation would simply use a residual-stream representation from a single input $x \in \mathcal{X}(c)$ that isolates concept $c$. In practice, the best proxy depends on how the concept is instantiated in each empirical setting. Accordingly, we approximate concept representations using a single activation vector, an average across contexts, or a set of high-variance directions across examples of the concept; in the last case, multiple feature encodings in $A_{\mathcal{S}(C)}$ would jointly represent the atomic concept. For example, in multilingual fact recall, the input *la capital de España es* instantiates both a factual concept, *capital of Spain*, and a language concept, *Spanish*. We estimate their representations separately: the factual concept representation is extracted from the English version of the input (*the capital of Spain is*) while the Spanish concept is represented by high-variance feature directions across Spanish texts.

#### Accounting for multiscale structure\.

Raw representations in LLMs are often dominated by large-scale structure irrelevant to the atomic concepts of interest. In our LLM experiments (Section [4](https://arxiv.org/html/2606.13934#S4)), residuals cluster by prompt type, task family, and other contextual factors. (Dominant clusters are highlighted in Figure [2(a)](https://arxiv.org/html/2606.13934#S2.F2.sf1).) As a result, these cluster identities can dominate the inner products used to estimate angles between atomic representations (see Appendix [C](https://arxiv.org/html/2606.13934#A3)). Empirically, angles computed from raw representations are dominated by background cluster identities (Appendix Figure [7](https://arxiv.org/html/2606.13934#A3.F7)(a)).

To mitigate this effect, we apply mean-centering to the dominant clusters in raw representations before estimating salient feature encodings. Let $x \in \mathcal{X}(C)$ be an example, let $\gamma(x)$ denote its background cluster, and let $\mu_{\gamma(x)}$ be the empirical mean residual representation of examples in that cluster. We define the cluster-centered residual representation as $h_{c}(x) = h(x) - \mu_{\gamma(x)}$. We then use these centered residuals to estimate the salient feature encodings $A_{\mathcal{S}(C)}$ used for angle computation. This centering places angle values from different background clusters on a more comparable scale, reducing artificial binning effects caused by dominant clusters (Appendix Figure [7](https://arxiv.org/html/2606.13934#A3.F7)(b)) and improving predictions (Figure [22](https://arxiv.org/html/2606.13934#A6.F22)(a,c)).
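As a rough sketch of the centering step $h_{c}(x) = h(x) - \mu_{\gamma(x)}$, the function below subtracts each cluster's mean from its members. It assumes the background cluster label of every example is already known; the cluster names and two-dimensional residuals are toy values, and how clusters are identified in practice is not covered here.

```python
from collections import defaultdict


def cluster_center(residuals, cluster_ids):
    """Return h_c(x) = h(x) - mu_{gamma(x)} for every example."""
    groups = defaultdict(list)
    for h, g in zip(residuals, cluster_ids):
        groups[g].append(h)
    # Empirical mean residual per background cluster, one value per dimension.
    means = {
        g: [sum(col) / len(rows) for col in zip(*rows)]
        for g, rows in groups.items()
    }
    return [
        [x - m for x, m in zip(h, means[g])]
        for h, g in zip(residuals, cluster_ids)
    ]


# Toy residuals whose first coordinate is dominated by the (hypothetical) cluster.
residuals = [[10.0, 0.0], [12.0, 2.0], [-5.0, 1.0], [-7.0, 3.0]]
cluster_ids = ["multihop", "multihop", "recall", "recall"]
print(cluster_center(residuals, cluster_ids))
```

After centering, the large cluster offset no longer dominates inner products between examples from different clusters.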

#### Estimating compositional interference\.

We define compositional interference (CI) for a composition with active concept set $C$ as a normalized variant of local cumulative coherence over its salient support (by normalizing, we ensure that differences in example difficulty are not solely due to different salient feature counts):

$$\textrm{CI}(C) = \max_{i \in \mathcal{S}(C)} \frac{1}{|\mathcal{S}(C)|} \sum_{j \in \mathcal{S}(C)} |\cos(a_{i}, a_{j})|. \tag{3}$$

Here, $a_{i}$ and $a_{j}$ denote empirical encodings of the salient latent feature directions indexed by $i$ and $j$. CI lets us estimate interference for compositional inputs without evaluating the model on any specific composed input, since $a_{i}$ and $a_{j}$ are estimated from examples of the constituent atomic concepts. (Appendix [B](https://arxiv.org/html/2606.13934#A2) provides alternative metrics for quantifying compositional interference.) According to Section [2.1](https://arxiv.org/html/2606.13934#S2.SS1), higher interference among salient features predicts higher recovery error.
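A minimal sketch of Eq. (3) follows, assuming the salient encodings have already been estimated and are passed in as plain vectors. The sum is implemented exactly as the equation is written, so the $j = i$ term is included and contributes a constant $1/|\mathcal{S}(C)|$; whether the authors' implementation excludes it is not stated in this section. Function names and example vectors are illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def compositional_interference(encodings):
    """CI per Eq. (3): the largest mean |cos| between one encoding and the whole set."""
    n = len(encodings)
    return max(
        sum(abs(cosine(a_i, a_j)) for a_j in encodings) / n
        for a_i in encodings
    )


# Hypothetical concept pairs: one near-orthogonal, one nearly parallel.
low = compositional_interference([(1.0, 0.0), (0.0, 1.0)])
high = compositional_interference([(1.0, 0.0), (1.0, 0.1)])
print(low, high)
```

Concept sets can then be ranked by this score, with higher values flagged as more likely to fail under the hypothesis in the text.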

### 2.3 Using and evaluating predictions for ACS

We hypothesize that recovery error affects model accuracy in systematic compositional scenarios: if the model cannot reliably recover the atomic concepts in representation space, it is less likely to execute the composition correctly. This claim allows us to use CI, computed from examples of atomic concepts, for our proposed objective of Adversarial Concept Search. Generally, ACS aims to answer, “If resource constraints allow you to test inputs for only a small subset of all possible concept combinations, which subset should you collect to maximize error coverage?” To demonstrate the efficacy of CI in choosing these adversarial scenarios, we will use it to predict compositional errors. For each experiment, we measure errors by full-prediction exact-match accuracy. We select a representation layer to provide residual-stream feature representations using a 10% validation set.

In each setting, we illustrate that CI provides a useful ranking of difficulty for the purposes of selecting a challenge set of scenarios under specified resource constraints. Assuming higher CI is associated with more compositional failure, we construct subsets of concept combinations with high CI with varying minimum cutoffs. As shown in Figure [1](https://arxiv.org/html/2606.13934#S1.F1), the left side of the curve contains only the examples predicted to be hardest. The $x$-axis reports decreasing CI cutoffs by percentile, and the $y$-axis reports accuracy on the cumulative challenge set. If we selected the same size of test sets at random, we would expect a flat line at the model’s overall accuracy. If CI is predictive, the curve should instead start low and monotonically increase toward the mean as lower-interference examples are included.
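The cumulative challenge-set curve described above can be sketched as follows. The percentile grid, the rounding rule for subset sizes, and the toy scores are assumptions made for illustration; the paper's plotting code may differ.

```python
def cumulative_accuracy_curve(ci_scores, correct, percentiles=(20, 50, 100)):
    """Accuracy on the top-p% highest-CI examples, for each percentile p."""
    ranked = sorted(zip(ci_scores, correct), key=lambda t: t[0], reverse=True)
    curve = []
    for p in percentiles:
        size = max(1, round(len(ranked) * p / 100))
        subset = ranked[:size]  # the `size` examples predicted to be hardest
        curve.append((p, sum(ok for _, ok in subset) / size))
    return curve


# Toy data: ten concept combinations, with failures concentrated at high CI.
ci = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
correct = [0, 0, 1, 0, 1, 1, 1, 1, 1, 1]
for p, acc in cumulative_accuracy_curve(ci, correct):
    print(f"top {p}% by CI: accuracy {acc:.2f}")
```

On this toy data the curve starts at zero accuracy for the strictest cutoff and rises to the overall accuracy once every example is included, which is the shape expected when CI is predictive.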

We also directly evaluate error predictions by their PR-AUC. We rank all examples by CI, assuming higher CI is associated with higher likelihood of errors. The PR-AUC baseline is the model’s overall failure rate, $1 - \mathrm{accuracy}$, which is the score of a majority-label baseline.
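One way to compute such a score is average precision, a common estimator of the area under the precision-recall curve. The section does not say which estimator the paper uses, so the choice here is an assumption, as are the toy scores and labels.

```python
def average_precision(scores, is_error):
    """Average precision with errors as the positive class, ranked by score."""
    ranked = sorted(zip(scores, is_error), key=lambda t: t[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, err) in enumerate(ranked, start=1):
        if err:
            hits += 1
            total += hits / rank  # precision at the rank of each error
    return total / hits if hits else 0.0


# Toy data: ten examples, three of which the model gets wrong.
ci = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
is_error = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
ap = average_precision(ci, is_error)
failure_rate = sum(is_error) / len(is_error)  # the 1 - accuracy baseline
print(ap, failure_rate)
```

A ranking is informative to the extent that its score exceeds the failure-rate baseline.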

## 3 Adversarial Concept Search for a synthetic task

We first validate our hypothesis in a controlled synthetic setting by training toy models from scratch\. This allows us to control the data distribution and model scale, isolate the role of feature interaction, and directly test whether representation geometry predicts compositional generalization\.

![Refer to caption](https://arxiv.org/html/2606.13934v1/x4.png) (a) CI between atomic concepts.
![Refer to caption](https://arxiv.org/html/2606.13934v1/x5.png) (b) Atomic concept representations (PCA).

Figure 3: Compositional models use orthogonal features. Representations of atomic concepts in 64-dimension SCAN models trained with 8% coverage (left) and 80% coverage (right). Higher training coverage leads to representations that are more (a) pairwise orthogonal and (b) separated along principal components.

#### Dataset

The SCAN benchmark \[[33](https://arxiv.org/html/2606.13934#bib.bib134)\] is a testbed for systematic compositional generalization. SCAN specifies a set of primitive concepts, including actions (e.g., *jump*, *walk*) and operators (e.g., *left*, *twice*), which can be combined into instructions (see Figure [2(b)](https://arxiv.org/html/2606.13934#S2.F2.sf2)). This controlled setting allows us to examine whether the angles between atomic concepts can predict compositional failures.

### 3.1 Experiments

To predict model errors on SCAN, for each example, we use the primitive concepts that appear in the command as the active concept set $C$, and estimate $A_{\mathcal{S}(C)}$ by averaging residual activations for each primitive across contexts containing that concept, e.g., averaging activations at the *jump* token across commands containing *jump*. We compute CI over these estimated salient feature encodings and hypothesize that lower interference corresponds to near-orthogonal representations and reliable composition, while higher interference increases recovery error and leads to compositional failure.
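The averaging step can be sketched as below. The data layout (one activation vector per command token) and the function name are assumptions for illustration; extracting the activations from a trained model is outside the scope of the sketch.

```python
from collections import defaultdict


def mean_concept_vectors(commands, activations):
    """Average per-token activations for each primitive across all commands.

    commands:    list of token lists, e.g. [["jump", "twice"], ...]
    activations: parallel list holding one vector per token
    """
    sums, counts = {}, defaultdict(int)
    for tokens, vectors in zip(commands, activations):
        for tok, vec in zip(tokens, vectors):
            if tok in sums:
                sums[tok] = [s + v for s, v in zip(sums[tok], vec)]
            else:
                sums[tok] = list(vec)
            counts[tok] += 1
    return {tok: [s / counts[tok] for s in vec] for tok, vec in sums.items()}


# Toy two-dimensional "activations" for two short commands.
commands = [["jump", "twice"], ["jump", "left"]]
activations = [
    [[1.0, 0.0], [0.0, 2.0]],
    [[3.0, 0.0], [0.0, 4.0]],
]
concept_vectors = mean_concept_vectors(commands, activations)
print(concept_vectors["jump"])  # mean over both contexts containing "jump"
```

The resulting per-primitive vectors play the role of the estimated salient encodings over which CI is computed for a given command.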

To study compositional generalization under varying difficulty, we follow Lake and Baroni \[[33](https://arxiv.org/html/2606.13934#bib.bib134)\] in restricting the training set coverage to $\{4\%, 8\%, 16\%, 36\%, 64\%, 80\%\}$ of distinct commands while keeping the total number of training examples fixed at 100K. We provide training details in Appendix [D](https://arxiv.org/html/2606.13934#A4). Models trained with lower coverage underperform on novel compositions at test time. To study how model capacity affects CI, we train autoregressive decoder-only Transformers with model sizes $d \in \{8, 12, 32, 64\}$ for each coverage ratio. Intuitively, smaller models enforce stronger compression and superposition, while larger models permit increasingly disentangled representations.

To establish SCAN as a testbed for our hypothesis, we first show that it allows us to indirectly control CI. As shown in Figure [3](https://arxiv.org/html/2606.13934#S3.F3) and Appendix Figure [8](https://arxiv.org/html/2606.13934#A4.F8), representation structure varies with training coverage and model capacity. Figure [3](https://arxiv.org/html/2606.13934#S3.F3) shows that lower coverage causes representations to cluster by functional role (e.g., actions vs. compositional operators), limiting the separation between atomic concepts. As coverage increases, representations become more isotropic, creating a controlled setting in which interference varies systematically by hyperparameter.

Figure [4](https://arxiv.org/html/2606.13934#S4.F4) confirms that example-level CI predicts compositional behavior within each SCAN model. The cumulative accuracy curves show that CI provides a useful ranking of example difficulty: stricter CI cutoffs provide harder test sets with lower model accuracy. Across all models with non-extreme accuracies (defined as $0.2 < \mathrm{acc} < 0.99$; outside of this range, compositional generalization either fails entirely or saturates), this ranking is significantly better than random ordering ($p < 0.01$). Furthermore, the binned curves in Appendix Figure [11](https://arxiv.org/html/2606.13934#A4.F11) show that the effect is not only aggregate, as accuracy is strongly negatively correlated with CI. When we control for command length in Appendix Figures [14](https://arxiv.org/html/2606.13934#A4.F14) and [15](https://arxiv.org/html/2606.13934#A4.F15), the trend continues to hold. The PR-AUC metric for CI ranking consistently exceeds each model’s failure-rate baseline.

Overall, CI is a reliable signal for compositional difficulty across model sizes and training regimes. Using only a single geometric property, and without access to a specific input example, we can rank compositional scenarios by difficulty.

## 4 Adversarial Concept Search for LLMs

We now extend our toy proof-of-concept to predict compositional errors in an LLM, Llama-3.2-3B \[[25](https://arxiv.org/html/2606.13934#bib.bib156)\]. We evaluate our approach on two different tasks: multihop QA and multilingual fact retrieval. This section will demonstrate our power to predict errors across diverse input distributions and surface forms.

### 4.1 Experiments

For both LLM tasks, we focus on examples where the LLM succeeds on the atomic components\. This lets us isolate compositional errors—the model has the pieces but fails to combine them\.

![Refer to caption](https://arxiv.org/html/2606.13934v1/x6.png)

Figure 4: SCAN models fail on examples with high interference. Model accuracy on SCAN test sets with varying cutoffs for minimum CI. (More models in Appendix Figure [10](https://arxiv.org/html/2606.13934#A4.F10).) Horizontal lines show each model’s overall accuracy as a baseline; blue curves show test set accuracies when examples are sorted by CI. Subset accuracy responds near-monotonically to CI, significantly outperforming a random ordering. CI ranking’s PR-AUC beats the error rate baseline for all models.

#### Multihop Reasoning.

We first study two-hop factual reasoning tasks using the dataset and 10-shot prompt setup from Khandelwal and Pavlick \[[31](https://arxiv.org/html/2606.13934#bib.bib133)\] (full dataset details in Appendix [E](https://arxiv.org/html/2606.13934#A5)). In multihop QA, each composed query is built from two constituent atomic concepts, $f$ and $g$, corresponding to the first-hop and second-hop queries. For example, a composed query $g(f(x))$ may combine an atomic first-hop query $f$, such as *author of 1984*, with an atomic second-hop query $g$, such as *birthyear of George Orwell*, into the two-hop prompt *birthyear of the author of 1984*. To focus specifically on compositional failures, we filter to examples for which the model answers the corresponding single-hop queries correctly. We extract residual activations from the last token of each atomic query and cluster-center them to obtain concept representations $a_{f}$ and $a_{g}$, which estimate the salient feature encodings $A_{\mathcal{S}(C)}$ in this setting. We then compute CI to predict failure on the composed query $g(f(x))$, using only the atomic queries for $f$ and $g$.
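As a worked special case, suppose each hop is represented by a single vector, so the salient set contains exactly $a_{f}$ and $a_{g}$, and suppose the sum in Eq. (3) is taken as written, including the $j = i$ term. Both are assumptions about details this section leaves open. Under them, CI reduces to a function of one cosine:

$$\textrm{CI}(\{f, g\}) = \max_{i \in \{f, g\}} \tfrac{1}{2}\left(|\cos(a_{i}, a_{f})| + |\cos(a_{i}, a_{g})|\right) = \tfrac{1}{2}\left(1 + |\cos(a_{f}, a_{g})|\right).$$

Ranking two-hop queries by CI is then equivalent to ranking them by $|\cos(a_{f}, a_{g})|$: a near-orthogonal pair gives a CI close to $\tfrac{1}{2}$, and a nearly parallel pair gives a CI close to $1$.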

#### Multilingual Fact Recall\.

We investigate multilingual factual recall using the KLAR dataset\[[48](https://arxiv.org/html/2606.13934#bib.bib147)\], which tests if models can answer facts across different languages consistently\. When model capabilities differ by language, prior work has linked this inconsistency to a lack of alignment in internal representations\[[4](https://arxiv.org/html/2606.13934#bib.bib161),[35](https://arxiv.org/html/2606.13934#bib.bib162),[11](https://arxiv.org/html/2606.13934#bib.bib163),[10](https://arxiv.org/html/2606.13934#bib.bib101)\]\. Although this setting is not usually framed as a compositional task, prior work\[[48](https://arxiv.org/html/2606.13934#bib.bib147),[36](https://arxiv.org/html/2606.13934#bib.bib51),[49](https://arxiv.org/html/2606.13934#bib.bib55),[50](https://arxiv.org/html/2606.13934#bib.bib146),[44](https://arxiv.org/html/2606.13934#bib.bib102),[9](https://arxiv.org/html/2606.13934#bib.bib98)\]suggests that models implement a multi\-stage process: map a non\-English query onto a language\-agnostic representation, retrieve the answer, then generate it in the target language\. This process induces a compositionality gap analogous to multihop reasoning; even when the model can perform the atomic steps, it may still fail when they are combined\. We therefore treat multilingual factual recall as requiring the composition of two atomic components: a factual representation and a language\-specific representation\.

![Refer to caption](https://arxiv.org/html/2606.13934v1/x7.png)

Figure 5: LLMs fail on examples with high interference. (a) For multihop QA, CI provides a strong ranking of example difficulty. (b) For multilingual fact recall, PR-AUC for the CI-based error prediction outperforms a majority baseline on every language. (c) For multilingual fact recall, CI ranking chooses difficult challenge sets for each language. (Each line is one language; color legend matches that in (b).) PR-AUC result details in Appendix Figures [25](https://arxiv.org/html/2606.13934#A7.F25) and [26](https://arxiv.org/html/2606.13934#A7.F26).

For each query in target language $\ell\in\mathcal{L}$, the active concept set $c$ consists of a factual concept $q$ and a language concept $\ell$. We extract the fact representation from the corresponding English factual query $q\in\mathcal{Q}$ and cluster-center it to obtain $a_q$ (see Appendix [C](https://arxiv.org/html/2606.13934#A3)). Drawing on prior work [[14](https://arxiv.org/html/2606.13934#bib.bib135)], we represent the language concept $\ell$ as a low-rank subspace rather than a single feature vector. For each language $\ell$, we collect residual representations from 8,000 samples in the multilingual OSCAR corpus [[1](https://arxiv.org/html/2606.13934#bib.bib148)] and apply uncentered SVD to obtain an orthonormal basis $B_\ell$. (Footnote 7: We retain the basis capturing $0.99$ of the variance; we tune this variance threshold on the development set by sweeping $0.85$, $0.90$, $0.95$, and $0.99$.) In this setting, we estimate $A_{\mathcal{S}(C)}=\{a_q\}\cup B_\ell$ and compute $\mathrm{CI}$ to predict failure on the multilingual fact query.
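The subspace construction can be sketched as follows. This is a minimal sketch assuming the language activations are stacked in an `(n, d)` matrix; the function names, the synthetic data, and the final score (the fraction of a fact vector's norm that lies inside the language subspace) are illustrative assumptions rather than the paper's CI definition. With an uncentered SVD, "variance" here means the share of total squared singular values.

```python
import numpy as np

def language_basis(reps: np.ndarray, var_threshold: float = 0.99) -> np.ndarray:
    """Orthonormal basis (d x rank) from an uncentered SVD of (n x d) activations."""
    # Uncentered: the matrix is decomposed without subtracting its mean.
    _, s, vt = np.linalg.svd(reps, full_matrices=False)
    energy = np.cumsum(s**2) / np.sum(s**2)
    # Smallest rank whose cumulative share of squared singular values
    # reaches the threshold.
    rank = min(int(np.searchsorted(energy, var_threshold)) + 1, len(s))
    return vt[:rank].T

def overlap_with_subspace(a_q: np.ndarray, basis: np.ndarray) -> float:
    """Cosine of the angle between a_q and span(basis) (stand-in score)."""
    return float(np.linalg.norm(basis.T @ a_q) / np.linalg.norm(a_q))

rng = np.random.default_rng(0)
d = 32  # hypothetical hidden dimension

# Hypothetical language activations that live in a 4-dimensional subspace.
directions = rng.normal(size=(4, d))
reps = rng.normal(size=(500, 4)) @ directions

B = language_basis(reps, var_threshold=0.99)
a_q = rng.normal(size=d)  # hypothetical cluster-centered English fact vector
score = overlap_with_subspace(a_q, B)
print(B.shape, round(score, 3))
```

The translated query never appears: the score depends only on an English fact vector and a basis estimated from unrelated text in the target language.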

This setting makes the combinatorial challenge especially pronounced: each fact can be paired with any language, so the number of fact-language combinations is $|\mathcal{Q}|\times|\mathcal{L}|$. Evaluating all such combinations (and, in a real-world setting, translating them) is expensive. We therefore ask whether CI, computed from an English fact activation vector and a target-language subspace, can predict cross-lingual transfer failure without access to the translated input.

### 4.2 Results

Ranking multihop reasoning examples. As seen in Figure [5](https://arxiv.org/html/2606.13934#S4.F5)(a), CI is highly predictive of compositional failure in multihop reasoning. The curve shows a clear monotonic trend: the LLM has lower accuracy on ACS challenge sets with stricter CI cutoffs. Appendix Figure [22](https://arxiv.org/html/2606.13934#A6.F22)(b) shows the same relationship at the bin level, confirming a strong negative correlation between $\mathrm{CI}$ and accuracy ($r=-0.855$). Failure PR-AUC beats the model's failure-rate baseline, indicating that CI helps identify the specific multihop examples the model gets wrong. Crucially, this prediction is obtained without evaluating the composed query itself: we can anticipate which multihop questions will be difficult solely from interference between their constituent atomic queries.
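The two evaluations described here, accuracy on challenge sets at increasing CI cutoffs and failure PR-AUC against the failure-rate baseline, can be sketched as below. This is an illustrative sketch on synthetic data: the quantile-based cutoffs, the helper names, and the assumption that failure probability grows with CI are ours, and PR-AUC is computed as step-wise average precision without tie handling.

```python
import numpy as np

def challenge_set_accuracy(ci, correct, quantile):
    """Accuracy on examples whose CI is at or above the given quantile cutoff."""
    cutoff = np.quantile(ci, quantile)
    return float(correct[ci >= cutoff].mean())

def average_precision(scores, is_failure):
    """Step-wise area under the precision-recall curve, failures as positives."""
    order = np.argsort(-scores, kind="stable")  # rank examples by descending score
    hits = is_failure[order].astype(float)
    precision_at_k = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    # Average the precision at the rank of each true failure.
    return float((precision_at_k * hits).sum() / hits.sum())

rng = np.random.default_rng(0)
ci = rng.uniform(size=2000)                 # synthetic CI scores
is_failure = rng.uniform(size=2000) < ci    # synthetic: failure chance grows with CI
correct = ~is_failure

for q in (0.0, 0.5, 0.9):  # stricter cutoffs keep only higher-CI examples
    print(f"quantile {q:.1f}: accuracy {challenge_set_accuracy(ci, correct, q):.3f}")

# A random ranking has expected PR-AUC equal to the failure rate.
print("PR-AUC:", round(average_precision(ci, is_failure), 3),
      "baseline:", round(float(is_failure.mean()), 3))
```

The failure rate is the natural PR-AUC baseline because a ranking that ignores the input recovers failures at exactly their base rate.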

Ranking multilingual fact examples. Multilingual factual recall shows the same pattern. We explicitly evaluate trends within each language, rather than only across languages, because multilingual knowledge transfer is determined primarily by language-specific factors like resource level and orthography. These factors identify similarity to English, and therefore directly control interference with the English-language “atomic” fact. For each target language, Figure [5](https://arxiv.org/html/2606.13934#S4.F5)(c) shows that the LLM performs worse on datasets with higher interference between the English fact representation and the target-language subspace. Appendix Figure [27](https://arxiv.org/html/2606.13934#A7.F27) confirms that accuracy also correlates with individual CI bin ranges for each language. PR-AUC metrics (Figure [5](https://arxiv.org/html/2606.13934#S4.F5)(b)) confirm the predictive accuracy of CI: for each language, CI beats the corresponding language-specific baseline.

As an illustration, Figure [6](https://arxiv.org/html/2606.13934#S4.F6)(a) visualizes example fact-language combinations, displayed with CI scores and whether the LLM answers correctly. Qualitatively, we observe that correct model predictions have a lower CI, while model errors are associated with a higher CI. Overall, the multilingual results suggest that compositional interference predicts cross-lingual transfer failures more precisely than assumptions based on the resource level of each language. The prediction uses only the geometry between English fact representations and target-language subspaces, without access to the true translation.

![Refer to caption](https://arxiv.org/html/2606.13934v1/x8.png)

Figure 6: LLMs fail on datasets with high interference. (a) Fact recall across languages: columns are English facts, rows are target languages, marks indicate LLM correctness on each multilingual fact, and color shows normalized CI within each language. CI is broadly predictive of whether the model answers the compositional question correctly. (b,c) Subspace-level CI correlates strongly with accuracy for most languages in multilingual fact recall; Japanese is shown here, with all languages in Appendix Figures [30](https://arxiv.org/html/2606.13934#A8.F30) and [31](https://arxiv.org/html/2606.13934#A8.F31). In multihop reasoning, the correlation is weaker but follows the same trend. Both (b) and (c) show significant correlations ($p<0.01$).

#### Coarse-grained concepts

The results thus far have shown that CI predicts the difficulty of a specific scenario. However, we can also predict the difficulty of more general scenarios: rather than evaluating the likelihood of the model answering incorrectly when prompted for *the capital of Spain* in Spanish, we can predict its overall accuracy on a dataset of national capitals in Spanish. As when we measured the angle between a fact vector and a language subspace, we define coarse categories according to their subspaces. (Footnote 8: Similarly, we extract subspaces by applying uncentered SVD to the cluster-centered residual representations and retaining the basis vectors that explain a fixed proportion of variance. We sweep variance thresholds of 0.85, 0.90, 0.95, and 0.99 across layers on the development set, and report the best-performing configuration. When constructing category subspaces, we discard category groups with fewer than five correct examples.) In multilingual factual recall, CI shows a strong negative correlation with LLM accuracy for most languages (Figure [6](https://arxiv.org/html/2606.13934#S4.F6)(b), Appendix Figures [30](https://arxiv.org/html/2606.13934#A8.F30) and [31](https://arxiv.org/html/2606.13934#A8.F31)). In multihop reasoning, the correlation is also negative and significant (Figure [6](https://arxiv.org/html/2606.13934#S4.F6)(c)). These results suggest CI can predict the general difficulty of broad compositional tasks.
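One standard way to compare two subspaces is through principal angles: for bases with orthonormal columns, the singular values of $B_1^\top B_2$ are the cosines of those angles. The sketch below uses this as a stand-in for a subspace-level interference score. The summary statistic (mean squared cosine) and the function names are assumptions for illustration, since the subspace-level CI definition is not spelled out in this section.

```python
import numpy as np

def principal_cosines(b1: np.ndarray, b2: np.ndarray) -> np.ndarray:
    """Cosines of the principal angles between span(b1) and span(b2).

    Both inputs are (d x r) matrices with orthonormal columns.
    """
    return np.linalg.svd(b1.T @ b2, compute_uv=False)

def subspace_overlap(b1: np.ndarray, b2: np.ndarray) -> float:
    """Mean squared principal cosine (assumed summary, not the paper's CI)."""
    return float(np.mean(principal_cosines(b1, b2) ** 2))

# Toy category subspaces built from standard basis vectors in 4 dimensions.
eye = np.eye(4)
same = subspace_overlap(eye[:, :2], eye[:, :2])      # identical subspaces
partial = subspace_overlap(eye[:, :2], eye[:, 1:3])  # one shared direction
disjoint = subspace_overlap(eye[:, :2], eye[:, 2:])  # orthogonal subspaces
print(same, partial, disjoint)
```

A category-level analysis would then relate such overlap scores to per-category accuracy, for example with a correlation coefficient across categories.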

## 5 Discussion and Future Work

#### Compositionality and orthogonality\.

Orthogonal structures are associated with adaptation to new contexts in brain scans [[20](https://arxiv.org/html/2606.13934#bib.bib9), [37](https://arxiv.org/html/2606.13934#bib.bib47)] and in theory. Smolensky [[45](https://arxiv.org/html/2606.13934#bib.bib50)] first proposed orthogonal representations as a path to variable binding in connectionist systems. Plate [[40](https://arxiv.org/html/2606.13934#bib.bib48)] later proposed a method for encoding systematically compositional atomic properties with quasi-orthogonal vectors in a high-dimensional space, a preview of modern overparameterized neural nets. Recent work in toy models [[47](https://arxiv.org/html/2606.13934#bib.bib46)] suggested that compositional models use a linearly factored representational structure. Relatedly, Olah [[39](https://arxiv.org/html/2606.13934#bib.bib158)] framed composition and superposition as competing strategies for allocating limited capacity among near-orthogonal feature directions. However, these previous works are limited to synthetic or small toy settings and do not attempt to explicitly predict whether a model will succeed at specific systematic combinations based on their geometry. Furthermore, compressed sensing addresses the concern of catastrophic interference between non-orthogonal feature encodings in realistic settings [[7](https://arxiv.org/html/2606.13934#bib.bib7)], permitting efficient compression and, in theory, decoupling orthogonality from compositionality. By assuming superposition to be lossy, we explain the importance of orthogonality in practice and operationalize our hypothesis through testable predictions in realistic and natural settings in LLMs.

#### Adversarial examples\.

Neural networks make significant classification errors after small perturbations of their inputs [[46](https://arxiv.org/html/2606.13934#bib.bib3), [23](https://arxiv.org/html/2606.13934#bib.bib2)]. This phenomenon has long been attributed to compression artifacts. Elhage et al. [[18](https://arxiv.org/html/2606.13934#bib.bib82)] found that a toy model became vulnerable to adversarial attack as superposition emerged during training, and Gorton and Lewis [[24](https://arxiv.org/html/2606.13934#bib.bib79)] further argued that adversarial examples arise partly from superposition-induced feature interference. In their account, adversarial attacks exploit worst-case interference: perturbations can coordinate many superposed features so that small changes accumulate into a large downstream error. Relatedly, Aden-Ali et al. [[3](https://arxiv.org/html/2606.13934#bib.bib164)] show that many individually small similarities between near-orthogonal dataset-example representations and a target behavior can add coherently during fine-tuning, causing the model to behave as if it had been given a hidden system prompt. Our work studies the same underlying failure mode, but focuses on the subset of salient features required by a particular composition. The underlying causes of adversarial examples are therefore closely related to the ones we exploit to identify challenging compositions, though our work explores the discrete space of input descriptions, rather than generating specific inputs.

Traditional adversarial generation requires a ground truth input and a target output\. Adversarial examples are then found by locally perturbing the ground truth input to elicit the target output\. In discrete modalities like language modeling this has limitations: continuous perturbations, however small, rarely map onto valid, coherent input sequences\. By contrast, our method does not take a specific input or target output\. Instead of searching the continuous representation space for specific errors, we search the combinatorial space of atomic concepts to identify difficult scenarios for hypothetical inputs\.

#### Predicting model behavior

In interpretability, hypotheses are often validated by predicting how a model will respond to targeted mechanistic interventions [[42](https://arxiv.org/html/2606.13934#bib.bib99)]. However, the field rarely leverages these mechanistic insights to predict how an *unaltered* model will naturally behave when processing novel, complex *inputs*. If our goal is to understand holistic *computation* rather than merely local *implementation*, we must validate our theories by anticipating a model's edge-case failures purely from its internal representations, without requiring manual perturbations.

Recent work provides evidence that such prediction is possible. Prior studies have shown that OOD performance can be predicted from different forms of internal structure, including loss-landscape geometry [[30](https://arxiv.org/html/2606.13934#bib.bib104)], hidden activation patterns [[34](https://arxiv.org/html/2606.13934#bib.bib110)], and mechanistic accounts of character-counting circuits [[26](https://arxiv.org/html/2606.13934#bib.bib159)]. In contrast, Huang et al. [[28](https://arxiv.org/html/2606.13934#bib.bib138)] show that causal mechanisms are more predictive of OOD behavior, while Chou et al. [[15](https://arxiv.org/html/2606.13934#bib.bib153)] find that task-relevant geometric properties of in-distribution object manifolds forecast poor OOD generalization in image classification. More directly related to compositional generalization, An and Du [[5](https://arxiv.org/html/2606.13934#bib.bib157)] predict OOD compositional generalization on SCAN, and Blum et al. [[9](https://arxiv.org/html/2606.13934#bib.bib98)] show that the degree of representational alignment among training examples predicts cross-lingual generalization in fully trained models. However, both approaches require representations from compositional examples, whereas we only require the atomic concept representations. Overall, these analyses are constrained to simple tasks with well-understood algorithmic structure or to specific task domains. By contrast, our approach applies to any setting where models are expected to systematically compose atomic concepts.

#### Limitations and Future work

We deliberately restrict our method to a single geometric measurement calculated on simple, easily derived atomic concept representations. This minimal setup makes the analysis transparent and suggests that representation geometry contains useful information about compositional performance. At the same time, it leaves open richer ways to identify and analyze concepts, such as SAE features, causal features, or hierarchical manifold structures [[18](https://arxiv.org/html/2606.13934#bib.bib82), [17](https://arxiv.org/html/2606.13934#bib.bib149), [19](https://arxiv.org/html/2606.13934#bib.bib151), [16](https://arxiv.org/html/2606.13934#bib.bib154), [38](https://arxiv.org/html/2606.13934#bib.bib155)].

For our demonstration, we explore compositional scenarios which all have existing inputs. But future work could efficiently search larger combinatorial spaces and generate de novo inputs from arbitrary combinations of concepts, enabling more systematic adversarial data synthesis for language models. The efficiency of the search itself could be improved; our method finds the highest interference pairs from $n$ concept combinations with $O(n)$ vector multiplications, but a carefully designed search could test combinations more selectively. Another gap is our exclusive focus on destructive interference, where overlap between active representations impairs independent recovery. Correlated features can also exhibit *constructive* interference, where overlap supports recovery [[41](https://arxiv.org/html/2606.13934#bib.bib145)]. Future work should clarify when non-orthogonality predicts failure and when it provides useful co-activation structure.
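The exhaustive scoring pass mentioned above, one vector product per candidate combination followed by a ranking, can be sketched as follows. This is a hypothetical sketch: the function name, the random representations, and the use of absolute cosine similarity as the interference score are assumptions for illustration only.

```python
import numpy as np
from itertools import combinations

def top_interfering_pairs(reps: np.ndarray, pairs: np.ndarray, k: int):
    """Score each candidate pair with one dot product and return the top k."""
    unit = reps / np.linalg.norm(reps, axis=1, keepdims=True)
    # One dot product per candidate pair: cost is linear in the number of pairs.
    scores = np.abs(np.einsum("ij,ij->i", unit[pairs[:, 0]], unit[pairs[:, 1]]))
    top = np.argsort(-scores)[:k]
    return pairs[top], scores[top]

rng = np.random.default_rng(0)
reps = rng.normal(size=(6, 32))  # hypothetical atomic concept representations
pairs = np.array(list(combinations(range(6), 2)))  # all candidate combinations

worst_pairs, worst_scores = top_interfering_pairs(reps, pairs, k=3)
for (i, j), s in zip(worst_pairs, worst_scores):
    print(f"concepts ({i}, {j}): score {s:.3f}")
```

A more selective search would avoid scoring every candidate, for instance by pruning pairs whose representations are known to lie in distant regions.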

## 6 Acknowledgements

This work was enabled in part by a gift from the Chan Zuckerberg Initiative Foundation to establish the Kempner Institute for the Study of Natural and Artificial Intelligence. IL and NS are supported by a grant from Coefficient Giving and the Berkeley Existential Risk Initiative (BERI). We thank Melanie Weber, Thomas Fel, Joshua Batson, Annabelle Michael Carrell, Aaron Mueller, Yonatan Belinkov, Jing Huang, Victoria R. Li, Ekdeep Lubana, David Klindt, Sanchit Ahuja, SueYeon Chung, Ekaterina Shutova, Sebastian Ruder, Hadas Orgad, Sweta Karlekar, Apoorv Khandelwal, Michael Lepori, Tianze Hua, Zhuonan Yang and other members of the LUNAR lab for helpful discussion and feedback on this work.

## References

- [1] (2021) Ungoliant: an optimized pipeline for the generation of a very large-scale multilingual web corpus. In CMLC 2021-9th Workshop on Challenges in the Management of Large Corpora.
- [2] B. Adcock, C. Boyer, and S. Brugiapaglia (2021) On oracle-type local recovery guarantees in compressed sensing. Information and Inference: A Journal of the IMA 10(1), pp. 1–49.
- [3] I. Aden-Ali, N. Golowich, A. Liu, A. Shetty, A. Moitra, and N. Haghtalab (2026) Subliminal effects in your data: a general mechanism via log-linearity. External Links: 2602.04863, [Link](https://arxiv.org/abs/2602.04863).
- [4] X. Ai, M. K. Ihsani, and M. Kan (2025) Are knowledge and reference in multilingual language models cross-lingually consistent? External Links: 2507.12838, [Link](https://arxiv.org/abs/2507.12838).
- [5] Z. An and W. Du (2026) Representational homomorphism predicts and improves compositional generalization in transformer language model. External Links: 2601.18858, [Link](https://arxiv.org/abs/2601.18858).
- [6] S. Arora, Y. Li, Y. Liang, T. Ma, and A. Risteski (2018) Linear algebraic structure of word senses, with applications to polysemy. Transactions of the Association for Computational Linguistics 6, pp. 483–495.
- [7] V. Barin Pacela, S. Joshi, I. Camacho, S. Lacoste-Julien, and D. Klindt (2026) Stop probing, start coding: why linear probes and sparse autoencoders fail at compositional generalisation. arXiv e-prints, pp. arXiv–2603.
- [8] Z. Ben-Haim, Y. C. Eldar, and M. Elad (2010-10) Coherence-based performance guarantees for estimating a sparse vector under random noise. IEEE Transactions on Signal Processing 58(10), pp. 5030–5043. External Links: ISSN 1941-0476, [Link](http://dx.doi.org/10.1109/TSP.2010.2052460), [Document](https://dx.doi.org/10.1109/tsp.2010.2052460).
- [9] C. Blum, K. Filippova, A. Yuan, A. Ghandeharioun, J. Zimmert, F. Zhang, J. Hoffmann, T. Linzen, M. Wattenberg, L. Dixon, and M. Geva (2025-08) Beyond the Rosetta Stone: Unification Forces in Generalization Dynamics. arXiv. Note: arXiv:2508.11017 [cs]. External Links: [Link](http://arxiv.org/abs/2508.11017), [Document](https://dx.doi.org/10.48550/arXiv.2508.11017).
- [10] C. Blum, K. Filippova, A. Yuan, A. Ghandeharioun, J. Zimmert, F. Zhang, J. Hoffmann, T. Linzen, M. Wattenberg, L. Dixon, and M. Geva (2025) Beyond the rosetta stone: unification forces in generalization dynamics. External Links: 2508.11017, [Link](https://arxiv.org/abs/2508.11017).
- [11] Y. Bu, X. Liu, Z. Ren, Y. Yang, and J. Dai (2026) Align once, benefit multilingually: enforcing multilingual consistency for llm safety alignment. External Links: 2602.16660, [Link](https://arxiv.org/abs/2602.16660).
- [12] E. J. Candès, J. Romberg, and T. Tao (2006) Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information. IEEE Transactions on information theory 52(2), pp. 489–509.
- [13] E. Candes and T. Tao (2005) Decoding by linear programming. External Links: math/0502327, [Link](https://arxiv.org/abs/math/0502327).
- [14] T. A. Chang, Z. Tu, and B. K. Bergen (2022) The geometry of multilingual language model representations. External Links: 2205.10964, [Link](https://arxiv.org/abs/2205.10964).
- [15] C. Chou, A. Kirsanov, Y. Yang, and S. Chung (2026) Diagnosing generalization failures from representational geometry markers. External Links: 2603.01879, [Link](https://arxiv.org/abs/2603.01879).
- [16] C. Chou, H. Le, Y. Wang, and S. Chung (2025) Feature learning beyond the lazy-rich dichotomy: insights from representational geometry. External Links: 2503.18114, [Link](https://arxiv.org/abs/2503.18114).
- [17] H. Cunningham, A. Ewart, L. Riggs, R. Huben, and L. Sharkey (2023) Sparse autoencoders find highly interpretable features in language models. arXiv preprint arXiv:2309.08600.
- [18] N. Elhage, T. Hume, C. Olsson, N. Schiefer, T. Henighan, S. Kravec, Z. Hatfield-Dodds, R. Lasenby, D. Drain, C. Chen, et al. (2022) Toy models of superposition. arXiv preprint arXiv:2209.10652.
- [19] T. Fel, E. S. Lubana, J. S. Prince, M. Kowal, V. Boutin, I. Papadimitriou, B. Wang, M. Wattenberg, D. Ba, and T. Konkle (2025) Archetypal sae: adaptive and stable dictionary learning for concept extraction in large vision models. External Links: 2502.12892, [Link](https://arxiv.org/abs/2502.12892).
- [20] T. Flesch, K. Juechems, T. Dumbalska, A. Saxe, and C. Summerfield (2022) Orthogonal representations for robust context-dependent task performance in brains and neural networks. Neuron 110(7), pp. 1258–1270.
- [21] J. A. Fodor and Z. W. Pylyshyn (1988) Connectionism and cognitive architecture: a critical analysis. Cognition 28(1-2), pp. 3–71.
- [22] N. Garg, J. Kleinberg, and K. Peng (2026) How many features can a language model store under the linear representation hypothesis? External Links: 2602.11246, [Link](https://arxiv.org/abs/2602.11246).
- [23] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial networks. External Links: 1406.2661, [Link](https://arxiv.org/abs/1406.2661).
- [24] L. Gorton and O. Lewis (2025) Adversarial examples are not bugs, they are superposition. arXiv preprint arXiv:2508.17456.
- [25] A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan, et al. (2024) The llama 3 herd of models. arXiv preprint arXiv:2407.21783.
- [26] W. Gurnee, E. Ameisen, I. Kauvar, J. Tarng, A. Pearce, C. Olah, and J. Batson (2026) When models manipulate manifolds: the geometry of a counting task. External Links: 2601.04480, [Link](https://arxiv.org/abs/2601.04480).
- [27] R. Hendel, M. Geva, and A. Globerson (2023) In-context learning creates task vectors. In Findings of the Association for Computational Linguistics: EMNLP 2023, pp. 9318–9333.
- [28] J. Huang, J. Tao, T. Icard, D. Yang, and C. Potts (2025) Internal causal mechanisms robustly predict language model out-of-distribution behaviors. arXiv preprint arXiv:2505.11770.
- [29] D. Hupkes, V. Dankers, M. Mul, and E. Bruni (2020) Compositionality decomposed: how do neural networks generalise? Journal of Artificial Intelligence Research 67, pp. 757–795.
- [30] J. Juneja, R. Bansal, K. Cho, J. Sedoc, and N. Saphra (2023) Linear connectivity reveals generalization strategies. External Links: 2205.12411, [Link](https://arxiv.org/abs/2205.12411).
- [31] A. Khandelwal and E. Pavlick (2025) How do language models compose functions? External Links: 2510.01685, [Link](https://arxiv.org/abs/2510.01685).
- [32] B. Lake and M. Baroni (2018) Generalization without systematicity: on the compositional skills of sequence-to-sequence recurrent networks. In International conference on machine learning, pp. 2873–2882.
- [33] B. Lake and M. Baroni (2018) Generalization without systematicity: on the compositional skills of sequence-to-sequence recurrent networks. In International conference on machine learning, pp. 2873–2882.
- [34] V. R. Li, J. Kaufmann, M. Wattenberg, D. Alvarez-Melis, and N. Saphra (2025) Can interpretation predict behavior on unseen data? External Links: 2507.06445, [Link](https://arxiv.org/abs/2507.06445).
- [35] Z. W. Lim, A. F. Aji, and T. Cohn (2025) Language-specific latent process hinders cross-lingual performance. External Links: 2505.13141, [Link](https://arxiv.org/abs/2505.13141).
- [36] M. Lu, R. Zhang, C. Eickhoff, and E. Pavlick (2025-05) Paths Not Taken: Understanding and Mending the Multilingual Factual Recall Pipeline. arXiv. Note: arXiv:2505.20546 [cs]. External Links: [Link](http://arxiv.org/abs/2505.20546), [Document](https://dx.doi.org/10.48550/arXiv.2505.20546).
- [37] L. Luettgau, N. Chen, T. Erdmann, S. Veselic, R. Moran, Z. Kurth-Nelson, and R. J. Dolan (2024) A neural mechanism for compositional generalization of structure in humans. bioRxiv, pp. 2024–09.
- [38] M. Nickel and D. Kiela (2017) Poincaré embeddings for learning hierarchical representations. Advances in neural information processing systems 30.
- [39] C. Olah (2023-05) Distributed representations: composition & superposition. Note: Transformer Circuits Thread. Published May 4, 2023. [https://transformer-circuits.pub/2023/superposition-composition/index.html](https://transformer-circuits.pub/2023/superposition-composition/index.html).
- [40] T. A. Plate (1995) Holographic reduced representations. IEEE Transactions on Neural networks 6(3), pp. 623–641.
- [41] L. Prieto, E. Stevinson, M. Barsbey, T. Birdal, and P. A. Mediano (2026) From data statistics to feature geometry: how correlations shape superposition. arXiv preprint arXiv:2603.09972.
- [42] N. Saphra and S. Wiegreffe (2024) Mechanistic? External Links: 2410.09087, [Link](https://arxiv.org/abs/2410.09087).
- [43] B. Settles (2009) Active learning literature survey. Technical report, University of Wisconsin-Madison Department of Computer Sciences.
- [44] N. Shani and A. Basirat (2025-11) Language dominance in multilingual large language models. In Proceedings of the 8th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, Y. Belinkov, A. Mueller, N. Kim, H. Mohebbi, H. Chen, D. Arad, and G. Sarti (Eds.), Suzhou, China, pp. 137–148. External Links: [Link](https://aclanthology.org/2025.blackboxnlp-1.7/), [Document](https://dx.doi.org/10.18653/v1/2025.blackboxnlp-1.7), ISBN 979-8-89176-346-3.
- [45] P. Smolensky (1990) Tensor product variable binding and the representation of symbolic structures in connectionist systems. Artificial intelligence 46(1-2), pp. 159–216.
- [46] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus (2014) Intriguing properties of neural networks. External Links: 1312.6199, [Link](https://arxiv.org/abs/1312.6199).
- [47] A. Uselis, A. Dittadi, and S. J. Oh (2025) Does data scaling lead to visual compositional generalization? External Links: 2507.07102, [Link](https://arxiv.org/abs/2507.07102).
- [48] M. Wang, H. Adel, L. Lange, Y. Liu, E. Nie, J. Strötgen, and H. Schuetze (2025-07) Lost in multilinguality: dissecting cross-lingual factual inconsistency in transformer language models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar (Eds.), Vienna, Austria, pp. 5075–5094. External Links: [Link](https://aclanthology.org/2025.acl-long.253/), [Document](https://dx.doi.org/10.18653/v1/2025.acl-long.253), ISBN 979-8-89176-251-0.
- [49] C. Wendler, V. Veselovsky, G. Monea, and R. West (2024-08) Do llamas work in English? on the latent language of multilingual transformers. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), L. Ku, A. Martins, and V. Srikumar (Eds.), Bangkok, Thailand, pp. 15366–15394. External Links: [Link](https://aclanthology.org/2024.acl-long.820/), [Document](https://dx.doi.org/10.18653/v1/2024.acl-long.820).
- [50] Y. Zhao, W. Zhang, G. Chen, K. Kawaguchi, and L. Bing (2024) How do large language models handle multilingualism? Advances in Neural Information Processing Systems 37, pp. 15296–15319.

## Appendix A Recovery bounds related to local cumulative coherence

This section expands on our claim that robust recovery bounds can be controlled by local cumulative coherence. Specifically, we want to recover an estimate $\hat{z}$ for an $m$-dimensional $k$-sparse feature vector $z$ from noisy conditions. Robust recovery guarantees provide high-probability bounds on the recovery error $\|\hat{z}-z\|_{2}$.

### A.1 Robust compressed sensing

The specific robust compressed sensing bound provided by Adcock et al. \[[2](https://arxiv.org/html/2606.13934#bib.bib8)\] limits the number of measurements required to guarantee a high probability of exact recovery. As is common in the superposition literature, we treat the hidden dimension $d$ as the number of sampled measurements. In robust compressed sensing, we assume that rather than having access to the encoded $Az$, we instead observe the noisy representation $Az+\epsilon$, where the noise is bounded by some constant, $\|\epsilon\|<\eta$. The relevant bound can therefore be paraphrased as guaranteeing that the recovery error of our estimate $\hat{z}$ from a noisy encoding satisfies

$$\|z-\hat{z}\|_{2}\leq(c_{1}+c_{2}\sqrt{k})\,\eta, \tag{4}$$

for constants $c_{1},c_{2}$.

Adcock et al. \[[2](https://arxiv.org/html/2606.13934#bib.bib8)\] guarantee this robust recovery with probability $1-\varepsilon$ if the hidden dimension $d$ is at least

$$d\gtrsim\Theta(\mathrm{supp}(z),A)\,\log^{2}(m/\varepsilon), \tag{5}$$

where we define the interaction term as a positive real number,

$$\Theta(\mathrm{supp}(z),A)=\|A_{\mathcal{S}}^{\top}A_{\mathrm{supp}(z)}\|_{\infty\rightarrow\infty}. \tag{6}$$

In Equation [2](https://arxiv.org/html/2606.13934#S2.E2), we refer to the bound in Equation [5](https://arxiv.org/html/2606.13934#A1.E5) as effectively controlled by mutual coherence in the salient support,

$$\begin{aligned}
\alpha(\mathcal{S}) &= \max_{i\in\mathcal{S}}\sum_{j\in\mathcal{S}}\left|\cos(a_{i},a_{j})\right| && \text{(7)}\\
&\geq \max_{i\in\mathcal{S}}\sum_{j\in\mathrm{supp}(z)}\left|\cos(a_{i},a_{j})\right| && \text{(8)}\\
&= \Theta(\mathrm{supp}(z),A). && \text{(9)}
\end{aligned}$$

Therefore we can satisfy Equation [5](https://arxiv.org/html/2606.13934#A1.E5) if the hidden dimension $d$ is at least

$$d\gtrsim\alpha(\mathcal{S})\,\log^{2}(m/\varepsilon). \tag{10}$$
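To make the quantity in Equation 7 concrete, the following is a minimal sketch of how $\alpha(\mathcal{S})$ could be computed from a list of salient feature encodings. It is an illustration under stated assumptions, not the authors' implementation: the names `cosine` and `cumulative_coherence` and the `include_self` flag are hypothetical, and encodings are assumed to be plain lists of floats. Equation 7 as written sums over all $j\in\mathcal{S}$, so the default keeps the self term, which adds a constant 1 to every row and does not change any ranking.

```python
import math


def cosine(u, v):
    """Cosine similarity between two nonzero vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def cumulative_coherence(encodings, include_self=True):
    """alpha(S): max over i of the row sum of |cos(a_i, a_j)| over j in S.

    include_self is an assumption of this sketch: True follows Equation 7
    as written, False drops the constant self-similarity term.
    """
    best = 0.0
    for i, a_i in enumerate(encodings):
        row = sum(
            abs(cosine(a_i, a_j))
            for j, a_j in enumerate(encodings)
            if include_self or i != j
        )
        best = max(best, row)
    return best


# Toy salient supports (made-up vectors, for illustration only).
orthogonal = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [1.0, 0.1]]
print(cumulative_coherence(orthogonal))  # 1.0: only the self term contributes
print(cumulative_coherence(aligned) > cumulative_coherence(orthogonal))  # True
```

Near-orthogonal supports give the smallest value, so under Equation 10 they place the weakest requirement on the hidden dimension $d$.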

### A.2 Robust linear compressed sensing

We now assume the *linear* compressed sensing setting of Garg et al. \[[22](https://arxiv.org/html/2606.13934#bib.bib132)\], which weakens the lossless recovery guarantee from compressed sensing to *near-lossless* recovery by accepting the Linear Representation Hypothesis (LRH). Under the LRH, Garg et al. \[[22](https://arxiv.org/html/2606.13934#bib.bib132)\] showed that a linear decoder of dimension $d$ can still recover a number of features exponential in $d$ up to a small constant error $\epsilon$.

In theory, the number of $k$-sparse features that can be linearly accessible is exponential in $d$ \[[22](https://arxiv.org/html/2606.13934#bib.bib132)\], even if the encoded features are linearly correlated. However, the theory only guarantees that this near-lossless decoding *exists*, not that it is applied by the neural network. We argue that if the neural network’s next layer does linearly decode the representation, it may not reflect the theoretical ideal probe dictionary. We will treat the LLM’s linear encoding as explicit in its activations, but treat its linear decoding as implicit. We assume this implicit linear decoding to be a perturbation of the ideal decoding, leading to the following argument.

Let $A\in\mathbb{R}^{d\times m}$ be the representation matrix (columns $a_{1},\dots,a_{m}$) and $B\in\mathbb{R}^{d\times m}$ be the probe matrix (columns $b_{1},\dots,b_{m}$). Given a feature vector $z\in[-1,1]^{m}$ with $k$ nonzero features, the decoded estimate is

$$\hat{z}=B^{\top}Az.$$

Due to the results of Garg et al. \[[22](https://arxiv.org/html/2606.13934#bib.bib132)\], we can assume that for all $k$-sparse $z$,

$$\|\hat{z}-z\|_{\infty}\leq\varepsilon.$$

Now assume that the implicit linear decoding during downstream processing, $B'$, is a perturbation of the ideal probe $B$, specifically $B'=B+\Delta B$ where

$$\|\Delta B\|_{2\to\infty}\leq\eta.$$

We will show that this perturbation has its largest impact on recovery error when the sparse active features have highly correlated encodings.

###### Theorem 1 (Perturbation sensitivity bounded by correlations in $A_{\mathcal{S}}$).

Let $B'=B+\Delta B$, and suppose each probe vector changes by at most $\eta$:

$$\|\Delta b_{i}\|_{2}\leq\eta\quad\text{for all }i.$$

Let $z$ be a $k$-sparse feature vector with support $\mathrm{supp}(z)$. Then

$$\|B'^{\top}Az-z\|_{\infty}\leq\varepsilon+\eta\,k\,\|A_{\mathrm{supp}(z)}\|_{2}. \tag{11}$$

In particular, if the product $\langle a_{i},a_{j}\rangle$ for $i,j\in\mathcal{S}$ is large and positive, that feature pair increases the perturbation sensitivity on this input.

###### Proof.

We decompose the new error into the original decoding error plus the effect of changing the probe:

$$B'^{\top}Az-z=(B^{\top}Az-z)+\Delta B^{\top}Az.$$

By assumption, the first term has $\|B^{\top}Az-z\|_{\infty}\leq\varepsilon$, so we only need to bound $\|\Delta B^{\top}Az\|_{\infty}$. Since $z$ is supported on $\mathrm{supp}(z)$, we can write $Az=A_{\mathrm{supp}(z)}z_{\mathrm{supp}(z)}$, where $z_{\mathrm{supp}(z)}$ is the restriction of $z$ to indices in $\mathrm{supp}(z)$. The $i$-th coordinate of this perturbation is bounded using Cauchy–Schwarz, the per-column bound $\|\Delta b_{i}\|_{2}\leq\eta$, and the $k$-sparsity constraint on $z$, which gives $\|z\|_{2}\leq\sqrt{k}\leq k$ because $z\in[-1,1]^{m}$ has at most $k$ nonzero entries:

$$|(\Delta B^{\top}Az)_{i}|\;\leq\;\|\Delta b_{i}\|_{2}\,\|Az\|_{2}\;\leq\;\eta\,\|Az\|_{2}\;\leq\;\eta\,\|A\|_{2}\,\|z\|_{2}\;\leq\;\eta\,k\,\|A\|_{2}.$$

Since $z$ is supported on $\mathrm{supp}(z)$, the same chain holds with $A$ replaced by $A_{\mathrm{supp}(z)}$, so we can express the above using only the nonzero features. Taking the maximum over all coordinates $i$,

$$\|B'^{\top}Az-z\|_{\infty}\;\leq\;\varepsilon\;+\;\eta\,k\,\|A_{\mathrm{supp}(z)}\|_{2}. \tag{12}$$

∎

Under Theorem [1](https://arxiv.org/html/2606.13934#Thmtheorem1), deviations from an ideal decoder can damage feature recovery more on inputs where the encodings of the active salient support are correlated. If an LLM implicitly processes feature encodings with a flawed decoder, it will therefore make more mistakes when composing non-orthogonal encodings.
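As a worked special case of Theorem 1, consider an input whose support contains exactly two features $i$ and $j$ with unit-norm encodings. Both conditions are assumptions made here for illustration and are not part of the theorem. The norm in Equation 11 can then be written directly in terms of the cosine between the two encodings:

```latex
A_{\mathrm{supp}(z)}^{\top} A_{\mathrm{supp}(z)}
  = \begin{pmatrix} 1 & c \\ c & 1 \end{pmatrix},
  \qquad c = \cos(a_i, a_j),
  \qquad \lambda_{\pm} = 1 \pm c,
\\[4pt]
\|A_{\mathrm{supp}(z)}\|_{2} = \sqrt{\lambda_{\max}} = \sqrt{1 + |c|},
\\[4pt]
\|B'^{\top} A z - z\|_{\infty}
  \;\le\; \varepsilon + 2\,\eta\,\sqrt{1 + |\cos(a_i, a_j)|}.
```

Under these assumptions, the perturbation term grows from $2\eta$ for orthogonal encodings to $2\sqrt{2}\,\eta$ for collinear ones.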

Note that the quantity of interest in our robust recovery bound differs depending on whether the model’s intrinsic decoding is linear or nonlinear. If we assume the model decodes nonlinearly, as in compressed sensing, then the quantity of interest is cumulative coherence in the salient support. If we assume the model uses linear decoding, then the quantity of interest is a simple norm on the active support, which is a subset of the salient support and is not identifiable without analyzing specific inputs that instantiate the scenario. Without studying specific inputs, we remain limited to measurements on the salient support, not the active subset.

Either way, cosine similarity within the salient support causes destructive interference\.

## Appendix B Alternative Metrics to Capture Compositional Interference

In the paper, we measure compositional interference with the cumulative coherence derived from Adcock et al. \[[2](https://arxiv.org/html/2606.13934#bib.bib8)\]. Here we consider several alternative geometric metrics for estimating compositional interference among the salient features of a composition. Let $\mathcal{S}(C)$ denote the salient feature support, and let $a_{i}$ be the representation vector associated with feature $i$. All metrics below are computed over pairwise similarities among salient feature directions.

#### Minimum similarity\.

The minimum similarity captures the least aligned pair of salient features:

$$\mathrm{CI}_{\min}(C)=\min_{\substack{i,j\in\mathcal{S}(C)\\ i\neq j}}|\cos(a_{i},a_{j})|. \tag{13}$$

#### Maximum similarity\.

The maximum similarity captures the most aligned pair of salient features:

$$\mathrm{CI}_{\max}(C)=\max_{\substack{i,j\in\mathcal{S}(C)\\ i\neq j}}|\cos(a_{i},a_{j})|. \tag{14}$$

#### Mean similarity\.

The mean similarity averages pairwise alignment across all salient feature pairs:

$$\mathrm{CI}_{\mathrm{mean}}(C)=\frac{1}{|\mathcal{S}(C)|\,(|\mathcal{S}(C)|-1)}\sum_{\substack{i,j\in\mathcal{S}(C)\\ i\neq j}}|\cos(a_{i},a_{j})|. \tag{15}$$

When $|\mathcal{S}(C)|=2$, the different CI aggregation metrics reduce to the same pairwise comparison and therefore yield the same correlational result. When $|\mathcal{S}(C)|>2$, however, these metrics can diverge because they aggregate multiple pairwise interactions differently. Therefore, whenever examples contain more than two salient features, we additionally report results for alternative aggregation metrics to enable comparison across definitions of interference (Figures [16](https://arxiv.org/html/2606.13934#A4.F16), [17](https://arxiv.org/html/2606.13934#A4.F17), [18](https://arxiv.org/html/2606.13934#A4.F18), [19](https://arxiv.org/html/2606.13934#A4.F19), [20](https://arxiv.org/html/2606.13934#A4.F20), [21](https://arxiv.org/html/2606.13934#A4.F21), [28](https://arxiv.org/html/2606.13934#A7.F28), [29](https://arxiv.org/html/2606.13934#A8.F29), and Table [5](https://arxiv.org/html/2606.13934#A8.T5)).

Across these comparisons, most similarity\-based metrics remain predictive in most settings\. Since the alternative metrics are partly intercorrelated, this suggests that the central signal comes from angular similarity among concept representations\. Nevertheless, empirical results in the paper show that the cumulative coherence metric derived from the bounds in Section[2\.2](https://arxiv.org/html/2606.13934#S2.SS2)is the most stable metric to measure compositional interference across all scenarios\.
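The three aggregation metrics of this appendix (minimum, maximum, and mean absolute cosine) can be sketched in a few lines. This is a hedged illustration rather than the paper's code: `abs_cosine` and `pairwise_ci` are hypothetical names, and the encodings are assumed to be lists of floats. Because $|\cos(a_i,a_j)|$ is symmetric, averaging over unordered pairs gives the same value as the ordered-pair normalization $|\mathcal{S}(C)|(|\mathcal{S}(C)|-1)$ in the mean-similarity definition.

```python
import math
from itertools import combinations


def abs_cosine(u, v):
    """Absolute cosine similarity between two nonzero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return abs(dot) / (norm_u * norm_v)


def pairwise_ci(encodings):
    """Min, max, and mean |cos| over all distinct pairs of salient features."""
    sims = [abs_cosine(u, v) for u, v in combinations(encodings, 2)]
    if not sims:
        raise ValueError("need at least two salient features")
    return {"min": min(sims), "max": max(sims), "mean": sum(sims) / len(sims)}


# With two salient features, all three metrics coincide.
two = pairwise_ci([[1.0, 0.0], [1.0, 1.0]])
print(two["min"] == two["max"] == two["mean"])  # True

# With three salient features, the metrics can diverge.
three = pairwise_ci([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(three["min"] < three["mean"] < three["max"])  # True
```

The second example shows the divergence for $|\mathcal{S}(C)|>2$: one orthogonal pair pulls the minimum to zero while the other two pairs keep the maximum high.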

## Appendix C Cluster Mean Centering

We expand on Section [2.2](https://arxiv.org/html/2606.13934#S2.SS2), “Accounting for multiscale structure,” to provide more mathematical intuition and details.

For an input $x\in\mathcal{X}(c)$, let $C$ denote the active concept set, let $\mathcal{S}(C)$ denote its salient support, and let $A_{\mathcal{S}(C)}=\{a_{i}:i\in\mathcal{S}(C)\}$ denote the corresponding salient feature encodings. To provide a simple intuition, we consider the setting in which each salient feature is represented by a single linear direction, and we use the linear representation hypothesis to model residual stream activations. The empirical estimate $\tilde{a}_{i}$ of a salient feature encoding may contain not only the direction of interest, but also background components induced by prompt format, task family, language, or other contextual structure. We write

$$\tilde{a}_{i}=z_{i}a_{i}+\sum_{u\in\mathcal{B}_{i}}z_{u}^{(i)}a_{u},$$

where $z_{i}a_{i}$ is the salient feature contribution we aim to isolate, $z_{i}$ is its activation coefficient, and $\mathcal{B}_{i}$ indexes other background directions mixed into the empirical estimate of feature $i$. Then the raw inner product between two empirical estimates $\tilde{a}_{i}$ and $\tilde{a}_{j}$, for $i,j\in\mathcal{S}(C)$, expands as

$$\begin{aligned}
\langle\tilde{a}_{i},\tilde{a}_{j}\rangle
&= z_{i}z_{j}\langle a_{i},a_{j}\rangle + z_{i}\sum_{\ell\in\mathcal{B}_{j}}z_{\ell}^{(j)}\langle a_{i},a_{\ell}\rangle \\
&\quad + z_{j}\sum_{u\in\mathcal{B}_{i}}z_{u}^{(i)}\langle a_{u},a_{j}\rangle \\
&\quad + \sum_{u\in\mathcal{B}_{i}}\sum_{\ell\in\mathcal{B}_{j}}z_{u}^{(i)}z_{\ell}^{(j)}\langle a_{u},a_{\ell}\rangle.
\end{aligned} \tag{16}$$

The first term is the salient feature interaction we aim to measure, while the remaining terms arise from background structure. Thus, raw inner products between empirical feature estimates can be dominated by shared background structure rather than by the local interaction of interest.

To mitigate this effect, we apply cluster mean-centering at the level of residual representations before estimating salient feature encodings. Let $\gamma(x)$ denote the background cluster associated with example $x$, and let $\mu_{\gamma(x)}$ be the empirical mean residual representation of examples in that cluster. We define the centered residual representation as

$$h_{c}(x)=h(x)-\mu_{\gamma(x)}.$$

We then use these centered residuals to estimate the salient feature encodings $A_{\mathcal{S}(C)}$ used for angle computation. Mean-centering reduces dominant cluster-level structure and isolates residual variation that more directly reflects interactions between salient feature directions.

In this work, we identify background clusters by visualizing empirical representations with SVD and observing that they group by prompt type, task family, language, or other contextual factors\. We then center each representation by the empirical mean of its corresponding cluster\. Automatically discovering these background clusters is an important direction for future work\.
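The centering step $h_{c}(x)=h(x)-\mu_{\gamma(x)}$ can be sketched as follows, assuming the background cluster of every example is already known (for instance, assigned by hand after inspecting an SVD plot). The function name `cluster_mean_center`, the toy vectors, and the cluster labels `"en"` and `"fr"` are hypothetical and only serve the illustration.

```python
def cluster_mean_center(representations, cluster_ids):
    """Subtract from each representation the mean of its background cluster."""
    sums, counts = {}, {}
    for h, g in zip(representations, cluster_ids):
        if g not in sums:
            sums[g] = [0.0] * len(h)
            counts[g] = 0
        sums[g] = [s + x for s, x in zip(sums[g], h)]
        counts[g] += 1
    # Empirical mean residual representation per cluster (mu_gamma).
    means = {g: [s / counts[g] for s in sums[g]] for g in sums}
    return [
        [x - m for x, m in zip(h, means[g])]
        for h, g in zip(representations, cluster_ids)
    ]


# Two toy clusters with very different offsets (made-up numbers).
reps = [[1.0, 2.0], [3.0, 4.0], [10.0, 10.0], [12.0, 14.0]]
clusters = ["en", "en", "fr", "fr"]
print(cluster_mean_center(reps, clusters))
# [[-1.0, -1.0], [1.0, 1.0], [-1.0, -2.0], [1.0, 2.0]]
```

After centering, the large offset that separates the two clusters is gone and only the within-cluster variation remains, which is the part used for angle computation.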

Beyond the mathematical intuition, empirical observations in Appendix Figures [7](https://arxiv.org/html/2606.13934#A3.F7), [22](https://arxiv.org/html/2606.13934#A6.F22), [23](https://arxiv.org/html/2606.13934#A6.F23), and [24](https://arxiv.org/html/2606.13934#A6.F24) further motivate the need for cluster mean-centering.

![Refer to caption](https://arxiv.org/html/2606.13934v1/x9.png)
Figure 7: Angle distributions of concept interactions in the multi-hop dataset before and after mean-centering. The top panel shows that, without mean-centering, angles between interacting concepts from different clusters are dominated by global clustering structure. After mean-centering, shown in the bottom panel, this global effect is reduced, allowing us to better capture local interactions between concepts.

## Appendix D SCAN Additional Details

We expand on the experimental details from Section [3](https://arxiv.org/html/2606.13934#S3).

#### Training Details

We train decoder-only Transformers on the SCAN benchmark using a size-variation data regime. Specifically, we vary the fraction of training commands seen by the model across coverage ratios of 4%, 8%, 16%, 32%, 64%, and 80% of the full SCAN training split. We evaluate four model sizes with hidden dimension $d\in\{8,12,32,64\}$. Each model uses a feed-forward dimension of $4d$, 4 attention heads, and 10 Transformer layers. All models are trained with the Adam optimizer using a learning rate of $10^{-3}$ and batch size 256. Models are trained for up to 3,000–4,000 epochs with early stopping: training halts if development-set exact-match accuracy does not improve for 200 epochs for $d=8$, or for 300 epochs for $d\in\{12,32,64\}$. The development set is drawn as 10% of the held-out test split.

Input sequences are formatted as `[BOS] input [SEP] output [EOS]`, and the training loss is cross-entropy computed only over the output tokens. During evaluation, we use greedy autoregressive decoding and report full-sequence exact-match accuracy. All runs use a single random seed (42).
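The sequence format and the output-only loss can be illustrated with a small sketch. It is not the authors' training code: `format_example` is a hypothetical helper, the mask marks which tokens are scored as prediction targets, and treating `[EOS]` as an output token is an assumption of this sketch that the text does not settle.

```python
def format_example(command_tokens, action_tokens):
    """Build `[BOS] input [SEP] output [EOS]` and a per-token loss mask.

    The mask is 1 for tokens that contribute to the cross-entropy loss
    and 0 for tokens that are only used as context.
    """
    tokens = ["[BOS]"] + list(command_tokens) + ["[SEP]"] + list(action_tokens) + ["[EOS]"]
    n_context = len(command_tokens) + 2  # [BOS], the command, and [SEP]
    # Assumption: [EOS] is scored together with the output actions.
    loss_mask = [0] * n_context + [1] * (len(action_tokens) + 1)
    return tokens, loss_mask


tokens, mask = format_example(["jump", "twice"], ["I_JUMP", "I_JUMP"])
print(tokens)  # ['[BOS]', 'jump', 'twice', '[SEP]', 'I_JUMP', 'I_JUMP', '[EOS]']
print(mask)    # [0, 0, 0, 0, 1, 1, 1]
```

Masking the command tokens means the model is never penalized for failing to predict the input, so the loss reflects only how well it maps commands to action sequences.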

#### Additional Results

Figure [8](https://arxiv.org/html/2606.13934#A4.F8) and Figure [9](https://arxiv.org/html/2606.13934#A4.F9) visualize how atomic concept representations vary with training coverage and model capacity. Figure [10](https://arxiv.org/html/2606.13934#A4.F10) and Figure [11](https://arxiv.org/html/2606.13934#A4.F11) show that examples with higher CI tend to have lower accuracy, supporting CI as a useful ranking signal for challenge-set construction. Figure [12](https://arxiv.org/html/2606.13934#A4.F12) and Figure [13](https://arxiv.org/html/2606.13934#A4.F13) further report the corresponding PR-AUC results, showing that CI provides a predictive signal across coverage and model-size settings. Table [1](https://arxiv.org/html/2606.13934#A4.T1) shows that CI is negatively correlated with correctness under point-biserial correlation in most settings. Finally, to verify that CI is not merely tracking dataset distribution, specifically sequence length, Figure [14](https://arxiv.org/html/2606.13934#A4.F14) and Figure [15](https://arxiv.org/html/2606.13934#A4.F15) break the results down by command length and show that the trend largely persists within length groups.
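For readers unfamiliar with the point-biserial correlation used here, the sketch below computes it as an ordinary Pearson correlation between a continuous score and a 0/1 correctness label. The helper names and the toy numbers are hypothetical and are not taken from the paper's results. The sketch also shows why correlations computed with $1-\mathrm{CI}$ only need a sign flip to be reported against CI.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # For example, when every item is answered correctly.
        raise ValueError("correlation is undefined for a constant variable")
    return cov / math.sqrt(var_x * var_y)


def point_biserial(scores, correct):
    """Point-biserial correlation: Pearson with a binary second variable."""
    return pearson(scores, [1.0 if c else 0.0 for c in correct])


# Made-up toy data: high-CI items are the ones answered incorrectly.
ci = [0.1, 0.2, 0.8, 0.9]
correct = [True, True, False, False]
r_ci = point_biserial(ci, correct)
r_one_minus_ci = point_biserial([1.0 - s for s in ci], correct)
print(r_ci < 0)                              # True
print(abs(r_ci + r_one_minus_ci) < 1e-9)     # True: same magnitude, opposite sign
```

The `ValueError` branch corresponds to saturated regimes: when accuracy is exactly 0% or 100%, the binary variable is constant and the correlation is undefined.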

\(a\)Average accuracy\.
\(b\)Point\-biserial correlation with CI\.

Table 1: SCAN results across training coverage and model dimension. Left: average exact-match accuracy. Right: point-biserial correlation $r_{\mathrm{pb}}$ between CI score and binary correctness. The point-biserial correlation is the Pearson correlation between a continuous variable and a binary variable; here, it measures whether examples with higher CI are less likely to be answered correctly. Since the original correlations were computed with $1-\mathrm{CI}$, we flip their signs when reporting correlations with CI. Except in near-saturated regimes with extreme accuracies, correlations are consistently negative, indicating that higher CI predicts lower item-level correctness. Significance: ${}^{*}p<0.05$, ${}^{**}p<0.01$, ${}^{***}p<10^{-3}$.

![Refer to caption](https://arxiv.org/html/2606.13934v1/x10.png)
Figure 8: Heatmaps of pairwise angular distances between atomic concepts in the SCAN dataset, computed from models with different hidden dimensions and training coverage levels. Columns correspond to training coverage ratios of 4%, 8%, 16%, 32%, and 64%, while rows correspond to model dimensions $d\in\{8,12,32,64\}$. For each setting, we show the layer selected on the development set based on the best AUC.

![Refer to caption](https://arxiv.org/html/2606.13934v1/x11.png)
Figure 9: PCA visualization of atomic concept representations in the SCAN dataset, computed from models with different hidden dimensions and training coverage levels. Columns correspond to training coverage ratios of 4%, 8%, 16%, 32%, and 64%, while rows correspond to model dimensions $d\in\{8,12,32,64\}$. For each setting, we show the layer selected on the development set based on the best AUC.

![Refer to caption](https://arxiv.org/html/2606.13934v1/x12.png)
Figure 10: Cumulative plots for models of all sizes on the SCAN dataset across training coverage ratios.

![Refer to caption](https://arxiv.org/html/2606.13934v1/x13.png)
Figure 11: Noncumulative plots for models of all sizes on the SCAN dataset across training coverage ratios.

![Refer to caption](https://arxiv.org/html/2606.13934v1/x14.png)
Figure 12: PR-AUC for different model variants on the SCAN dataset, where CI is computed as the mean cosine similarity between interacting concepts. Rows correspond to coverage ratios $\mathrm{cov}\in\{4\%,8\%,16\%\}$, and columns correspond to model dimensions $d\in\{8,12,32,64\}$.

![Refer to caption](https://arxiv.org/html/2606.13934v1/x15.png)
Figure 13: PR-AUC for different model variants on the SCAN dataset, where CI is computed as the mean cosine similarity between interacting concepts. Rows correspond to coverage ratios $\mathrm{cov}\in\{32\%,64\%,80\%\}$, and columns correspond to model dimensions $d\in\{8,12,32,64\}$.

![Refer to caption](https://arxiv.org/html/2606.13934v1/x16.png)
Figure 14: SCAN predictive trends broken down by command length for $\mathrm{CI}_{\mathrm{mean}}$. Rows correspond to coverage ratios $\mathrm{cov}\in\{4\%,8\%,16\%\}$, and columns correspond to model dimensions $d\in\{8,12,32,64\}$. To check whether the main trend is driven only by command length, we plot cumulative accuracy curves separately for each length group at the selected layers. The trend is generally consistent across length groups, but is weaker for the shortest commands, especially for 2-concept examples. Manual inspection of this group suggests that many failures reflect operator confusion rather than interference between the action and operator representations: for example, the model may map `jump twice` to three jumps, or produce four `walk` actions for `walk thrice`. Higher-length groups contain more active components and more directly reflect the multi-compositional interference captured by our metric.

![Refer to caption](https://arxiv.org/html/2606.13934v1/x17.png)
Figure 15: SCAN predictive trends broken down by command length for $\mathrm{CI}_{\mathrm{mean}}$. Rows correspond to coverage ratios $\mathrm{cov}\in\{32\%,64\%,80\%\}$, and columns correspond to model dimensions $d\in\{8,12,32,64\}$.

![Refer to caption](https://arxiv.org/html/2606.13934v1/x18.png)
Figure 16: Cumulative plots for models of all sizes on the SCAN dataset across training coverage ratios, where the x-axis is the interference value computed with the maximum cosine.

![Refer to caption](https://arxiv.org/html/2606.13934v1/x19.png)
Figure 17: Noncumulative plots for models of all sizes on the SCAN dataset across training coverage ratios, where the x-axis is the interference value computed with the maximum cosine.

![Refer to caption](https://arxiv.org/html/2606.13934v1/x20.png)
Figure 18: Cumulative plots for models of all sizes on the SCAN dataset across training coverage ratios, where the x-axis is the interference value computed with the mean cosine.

![Refer to caption](https://arxiv.org/html/2606.13934v1/x21.png)
Figure 19: Noncumulative plots for models of all sizes on the SCAN dataset across training coverage ratios, where the x-axis is the interference value computed with the mean cosine.

![Refer to caption](https://arxiv.org/html/2606.13934v1/x22.png)
Figure 20: Cumulative plots for models of all sizes on the SCAN dataset across training coverage ratios, where the x-axis is the interference value computed with the minimum cosine.

![Refer to caption](https://arxiv.org/html/2606.13934v1/x23.png)
Figure 21: Noncumulative plots for models of all sizes on the SCAN dataset across training coverage ratios, where the x-axis is the interference value computed with the minimum cosine.

## Appendix E Multihop Dataset Details

We expand on the dataset details provided in Section [4](https://arxiv.org/html/2606.13934#S4) for the multihop QA task. We study two-hop factual reasoning tasks using the datasets and 10-shot prompting setup from Khandelwal and Pavlick \[[31](https://arxiv.org/html/2606.13934#bib.bib133)\]. To test whether our method generalizes beyond common compositions, we additionally construct several datasets targeting less common multihop compositions that span different representation clusters. These datasets include `int-plus8-parity`, `int-plus2-parity`, `int-plus5-parity`, `int-plus2-str`, `int-plus5-str`, `int-plus8-str`, and `artist-birthyear-times-two`. Since these are structured tasks, we construct the integer-based datasets by sampling 1000 valid input–output combinations from the space of possible cases. The exception is `artist-birthyear-times-two`, which we construct by adapting the original artist birth-year data from Khandelwal and Pavlick \[[31](https://arxiv.org/html/2606.13934#bib.bib133)\].

We filter to examples for which the model answers both constituent single\-hop queries correctly, ensuring that errors reflect compositional failures rather than missing single\-hop knowledge\. In total, our multihop evaluation contains 26 datasets\.
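The filtering step can be sketched as a simple predicate over per-example records. The record layout below (the keys `hop1_correct`, `hop2_correct`, and `two_hop_correct`) is hypothetical and chosen only for illustration; the paper does not specify how these results are stored.

```python
def keep_compositional_cases(examples):
    """Keep examples whose two constituent single-hop queries are both correct."""
    return [ex for ex in examples if ex["hop1_correct"] and ex["hop2_correct"]]


# Made-up records: only the first and third survive the filter.
examples = [
    {"id": 0, "hop1_correct": True, "hop2_correct": True, "two_hop_correct": False},
    {"id": 1, "hop1_correct": True, "hop2_correct": False, "two_hop_correct": False},
    {"id": 2, "hop1_correct": True, "hop2_correct": True, "two_hop_correct": True},
]
kept = keep_compositional_cases(examples)
print([ex["id"] for ex in kept])  # [0, 2]
```

Among the kept examples, any remaining two-hop error (such as example 0) cannot be explained by missing single-hop knowledge, which is what lets it be read as a compositional failure.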

## Appendix F Multihop Additional Results

We expand on the experimental details for the multihop QA task from Section [4](https://arxiv.org/html/2606.13934#S4).

Figure [22](https://arxiv.org/html/2606.13934#A6.F22) shows that CI predicts multihop failures, with both cumulative and noncumulative trends holding after mean-centering. Table [2](https://arxiv.org/html/2606.13934#A6.T2) confirms this quantitatively using point-biserial correlation: CI is negatively correlated with correctness after mean-centering. Figure [22](https://arxiv.org/html/2606.13934#A6.F22) and Table [2](https://arxiv.org/html/2606.13934#A6.T2) also show that, without mean-centering, the trend does not hold, demonstrating the importance of the cluster-based mean-centering procedure from Section [2.2](https://arxiv.org/html/2606.13934#S2.SS2), “Accounting for multiscale structure.” Figure [23](https://arxiv.org/html/2606.13934#A6.F23) further motivates cluster centering by showing that the raw aggregate trend is obscured by group-level structure, even though strong within-group trends remain. Figure [24](https://arxiv.org/html/2606.13934#A6.F24) shows that CI is predictive within each cluster-pair setting, motivating cluster mean-centering for global comparison.

![Refer to caption](https://arxiv.org/html/2606.13934v1/x24.png)
Figure 22: Multihop dataset. (a) Cumulative plots when we extract the atomic representations for concepts without mean-centering. (b) Noncumulative plots when we extract the atomic representations for concepts with mean-centering. (c) Noncumulative plots when we extract the atomic representations for concepts without mean-centering.

Table 2: Point-biserial correlations between CI and binary correctness (before and after mean-centering). With mean-centering, CI is negatively correlated with correctness, while without mean-centering, the sign reverses.

![Refer to caption](https://arxiv.org/html/2606.13934v1/x25.png)
Figure 23: When we average $1-\mathrm{CI}$ over all examples in each dataset and compare this dataset-level score against mean dataset accuracy, the overall correlation is weak and not statistically significant. This is consistent with the cumulative curve without mean-centering, where the PR-AUC signal is also weak (Figure [22](https://arxiv.org/html/2606.13934#A6.F22)). However, the same plot reveals strong within-group correlations: within each group, $1-\mathrm{CI}$ is predictive of model accuracy. This suggests that CI does capture interference and failure likelihood, but the signal is obscured when aggregating across groups. This observation motivates our mean-centering procedure. The weak aggregate trend is driven by discrete orthogonality bands induced by group-level background structure, rather than by the absence of a CI signal. Mean-centering compensates for these group-level offsets, allowing the within-group relationship between CI and model failure to become visible.

![Refer to caption](https://arxiv.org/html/2606.13934v1/x26.png)
Figure 24: As discussed in Section [2.2](https://arxiv.org/html/2606.13934#S2.SS2), model representations can cluster by domain-level structure that is not directly relevant to compositional interference. When we break examples into within-cluster-pair settings, the trends remain strong within each setting, indicating that CI is still predictive of interference and model failure. However, the absolute CI value ranges are shifted across cluster pairs, making raw CI values difficult to compare globally. This motivates our mean-centering procedure: by adjusting for these cluster-level offsets, we can recover a comparable CI signal across different representation clusters.

## Appendix G Multilingual Fact-Recall Additional Results

We expand on the experimental details for the multilingual fact-recall setting from Section [4](https://arxiv.org/html/2606.13934#S4). Table [3](https://arxiv.org/html/2606.13934#A7.T3) provides the mapping from language abbreviations to full language names. Figure [26](https://arxiv.org/html/2606.13934#A7.F26) plots PR-AUC curves across languages, showing that CI provides a predictive signal for multilingual fact-recall errors. Figure [25](https://arxiv.org/html/2606.13934#A7.F25) reports the corresponding PR-AUC values and baselines for each language. Figure [27](https://arxiv.org/html/2606.13934#A7.F27) shows the noncumulative trend, indicating that examples with higher CI are more error-prone across languages.
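As a rough guide to how a PR-AUC value and its baseline can be obtained from CI scores, the sketch below uses average precision as the area estimate. Several choices here are assumptions of the sketch rather than details confirmed by the paper: failures are treated as the positive class, CI is used directly as the ranking score, ties are broken by input order, and the baseline is taken to be the failure rate (the expected precision of a random ranking). The function name and the toy data are hypothetical.

```python
def average_precision(scores, is_failure):
    """Average precision of ranking failures by descending score."""
    ranked = sorted(zip(scores, is_failure), key=lambda pair: pair[0], reverse=True)
    n_failures = sum(1 for _, failed in ranked if failed)
    if n_failures == 0:
        raise ValueError("no failures to rank")
    hits, total = 0, 0.0
    for rank, (_, failed) in enumerate(ranked, start=1):
        if failed:
            hits += 1
            total += hits / rank  # precision at this failure's rank
    return total / n_failures


# Made-up toy data: CI scores and whether the model failed on each example.
ci = [0.9, 0.8, 0.3, 0.1]
failed = [True, False, True, False]
ap = average_precision(ci, failed)
baseline = sum(failed) / len(failed)
print(ap > baseline)  # True: ranking by CI beats the random-ranking baseline
```

A score that carries no information about failures would give an average precision close to the baseline, so the gap between the two is the quantity of interest.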

Table 3: Language abbreviations and their corresponding full names.

Table 4: Point-biserial correlations for multilingual fact-recall. Each row reports $r_{\mathrm{pb}}$, the point-biserial correlation between CI and binary correctness. All reported correlations are significant at $p<0.01$.

![Refer to caption](https://arxiv.org/html/2606.13934v1/figs/appendix/appendix_multilingual_pr_auc_baseline_table.png)
Figure 25: PR-AUC values for different languages.

![Refer to caption](https://arxiv.org/html/2606.13934v1/x27.png)
Figure 26: PR-AUC plots for the multilingual factual recall dataset for different languages.

![Refer to caption](https://arxiv.org/html/2606.13934v1/x28.png)
Figure 27: Noncumulative plot with x-axis sorted from high to low CI on the multilingual factual recall dataset. The trend is monotonically increasing across languages.

![Refer to caption](https://arxiv.org/html/2606.13934v1/x29.png)
Figure 28: We also evaluate alternative interference metrics, as described in Appendix [B](https://arxiv.org/html/2606.13934#A2). We omit the mean-similarity metric because, in this setting, it is mathematically equivalent to CI. Since the subspace bases are orthonormal, averaging signed basis-direction similarities equals zero, so it is the same quantity as captured by CI.

## Appendix H Coarse-grained concepts

We expand on the additional experimental details for the coarse-grained concept analysis from Section [4](https://arxiv.org/html/2606.13934#S4). For multihop QA, Figure [29](https://arxiv.org/html/2606.13934#A8.F29) shows that subspace-level CI also predicts the overall difficulty of broader task categories; it further reports results with alternative interference metrics from Appendix [B](https://arxiv.org/html/2606.13934#A2), showing that the predictive trend is generally robust across metric choices. For multilingual fact-recall, Figure [30](https://arxiv.org/html/2606.13934#A8.F30) and Figure [31](https://arxiv.org/html/2606.13934#A8.F31) show that subspace-level CI also predicts coarse-grained task difficulty across languages. Table [5](https://arxiv.org/html/2606.13934#A8.T5) reports per-language correlations for alternative subspace-level interference metrics. Figure [32](https://arxiv.org/html/2606.13934#A8.F32) shows that when all language-topic pairs are pooled together, the overall CI–accuracy correlation is only weakly negative, likely because language-specific factors shift baseline difficulty and partially obscure the stronger within-language trends.

![Refer to caption](https://arxiv.org/html/2606.13934v1/x30.png) Figure 29: We define coarse task\-level subspaces and use compositional interference between subspaces to predict the general difficulty of broadly described datasets\. In multilingual fact\-recall, CI is negatively correlated with mean accuracy across languages\. We also evaluate alternative interference metrics, as described in Appendix [B](https://arxiv.org/html/2606.13934#A2)\. The other metrics all show moderate negative correlations, which supports the view that the central signal comes from angular similarity among concept representations, while the derived cumulative coherence bound provides the highest predictive power\. We further test a task\-vector variant, in which examples within the same coarse category are averaged to form a mean vector representation, following the intuition of task vectors \[[27](https://arxiv.org/html/2606.13934#bib.bib140)\]\. Similarity between these task vectors also acts as a moderate negative predictor of accuracy\.

![Refer to caption](https://arxiv.org/html/2606.13934v1/x31.png) Figure 30: We define coarse task\-level subspaces and use compositional interference between subspaces to predict the general difficulty of a broadly described dataset\. Correlation between CI and mean accuracy in the multilingual factual recall dataset, shown for Japanese, Russian, Korean, and Dutch\.

![Refer to caption](https://arxiv.org/html/2606.13934v1/x32.png) Figure 31: We define coarse task\-level subspaces and use compositional interference between subspaces to predict the general difficulty of a broadly described dataset\. Correlation between CI and mean accuracy in the multilingual factual recall dataset, shown for Vietnamese, Chinese, French, Hungarian, Spanish, and Ukrainian\.

![Refer to caption](https://arxiv.org/html/2606.13934v1/x33.png) Figure 32: Coarse task\-level correlation between accuracy and CI for multilingual fact recall when all language\-topic pairs are pooled together\. Overall, the pooled correlation between CI and accuracy is weakly negative, even though the within\-language trends remain moderate to strong\. This suggests that cross\-language comparisons are affected by language\-specific factors, such as resource level, orthographic similarity to English, and transfer from English\-language representations\. These factors can shift the baseline difficulty of each language and partially obscure the within\-language relationship between CI and model accuracy\.

Table 5: Per\-language Pearson correlations between task accuracy and subspace\-level interference values computed with different metrics\. Each cell reports $r$, with the corresponding $p$\-value in parentheses\. Correlations are computed over $n=20$ task groups per language\.
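The task-vector variant mentioned in the Figure 29 caption can be sketched as follows: average the representations of the examples in each coarse category, then compare the resulting mean vectors by cosine similarity. The category names and three-dimensional vectors are hypothetical toy values. The paper's actual representations, layer choice, and any preprocessing are not specified here, so this is a minimal illustration rather than the authors' implementation.

```python
import math

def mean_vector(vectors):
    """Average a list of equal-length vectors into one task vector."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy representations for examples in three coarse categories.
category_examples = {
    "cat_a": [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]],
    "cat_b": [[0.9, 0.1, 0.0], [0.7, 0.3, 0.0]],
    "cat_c": [[0.0, 0.0, 1.0], [0.0, 0.2, 0.8]],
}

# One mean vector per category.
task_vectors = {name: mean_vector(vs) for name, vs in category_examples.items()}

# Pairwise similarity: high values indicate closely aligned categories.
sim_ab = cosine(task_vectors["cat_a"], task_vectors["cat_b"])
sim_ac = cosine(task_vectors["cat_a"], task_vectors["cat_c"])
print(f"sim(a, b) = {sim_ab:.3f}, sim(a, c) = {sim_ac:.3f}")
```

Under the paper's reported trend, category pairs with high task-vector similarity (like `cat_a` and `cat_b` here) would be expected to show lower accuracy than near-orthogonal pairs.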
## Appendix I Code and Compute

We will release the code upon the paper decision\. For the SCAN experiments in Section [3](https://arxiv.org/html/2606.13934#S3), each toy model was trained on a single NVIDIA GeForce RTX 3090 GPU\. For the real\-LLM experiments in Section [4](https://arxiv.org/html/2606.13934#S4), all Llama runs were likewise conducted on a single RTX 3090 GPU\.