Principles of Concept Representation in Sentence Encoders

arXiv cs.CL 06/08/26, 04:00 AM Papers
concept-representation sentence-encoders compositionality fine-tuning hard-negatives evaluation nlp
Summary
This paper investigates principles of concept representation in sentence encoders through the lens of compositional semantics, identifying four key principles: fine-tuning recalibrates latent geometry, semantic signal concentrates in the final layer, hard negatives improve discrimination but not ranking, and supervision effectiveness depends on composition type.
arXiv:2606.06994v1 Announce Type: new Abstract: What makes a sentence encoder produce good concept representations? We approach this through the lens of representational compositionality: an encoder supports a concept family only when its latent space admits a low-distortion realization of the corresponding semantic operator. This framing predicts both where current encoders succeed and where they are structurally mismatched to their supervision. Through a controlled ablation over encoder conditions trained on 3.3 million synonym and definition pairs from WordNet and Wiktionary, evaluated on three decontaminated splits and a modifier-labeled noun-phrase benchmark, we identify four principles. Fine-tuning recalibrates the latent geometry rather than expanding it (P1). Semantic signal concentrates in the final transformer layer before concept-specific training begins, making cross-layer pooling redundant (P2). Hard negatives improve discrimination and stress-test robustness without improving retrieval ranking, showing that calibration and ranking are independently addressable (P3). Finally, the effectiveness of supervision depends on the composition type of the target concept. Extensional training helps intersective and subsective families while degrading relational and intensional ones, exposing a structural limitation of current training paradigms (P4). We release two new evaluation datasets: a DBpedia semantic-gap benchmark and a modifier-labeled NP paraphrase suite.
# Principles of Concept Representation in Sentence Encoders
Source: [https://arxiv.org/html/2606.06994](https://arxiv.org/html/2606.06994)
Isabelle Mohr (1, 2), John Dujany (2), Jonathan Souquet (2), Andre Freitas (1). (1) Idiap Research Institute; (2) Merck KGaA. Correspondence: [isabelle.mohr@idiap.ch](mailto:isabelle.mohr@idiap.ch)

###### Abstract

What makes a sentence encoder produce good concept representations? We approach this through the lens of representational compositionality: an encoder supports a concept family only when its latent space admits a low\-distortion realization of the corresponding semantic operator\. This framing predicts both where current encoders succeed and where they are structurally mismatched to their supervision\. Through a controlled ablation over encoder conditions trained on 3\.3 million synonym and definition pairs from WordNet and Wiktionary, evaluated on three decontaminated splits and a modifier\-labeled noun\-phrase benchmark, we identify four principles\. Fine\-tuning recalibrates the latent geometry rather than expanding it \(P1\)\. Semantic signal concentrates in the final transformer layer before concept\-specific training begins, making cross\-layer pooling redundant \(P2\)\. Hard negatives improve discrimination and stress\-test robustness without improving retrieval ranking, showing that calibration and ranking are independently addressable \(P3\)\. Finally, the effectiveness of supervision depends on the composition type of the target concept\. Extensional training helps intersective and subsective families while degrading relational and intensional ones, exposing a structural limitation of current training paradigms \(P4\)\. We release two new evaluation datasets: a DBpedia semantic\-gap benchmark and a modifier\-labeled NP paraphrase suite\.

[Figure 1 diagram. Left: frozen encoder (B0), anisotropy 0.126. Centre: concept-equivalence supervision, InfoNCE + 0.5 BCE over 3.3M pairs. Right: fine-tuned encoder (B1), anisotropy 0.012.]

Figure 1: Concept-equivalence fine-tuning, illustrated. B0: a frozen encoder splits synonymous paraphrases (teal) by surface form, conflates negations (orange), and leaves NL queries far from structured targets (blue). Centre: synonym (syn) and term-definition (t2d) pairs are pulled together via InfoNCE; hard negatives (hn) are pushed apart via BCE. B1: the space is recalibrated, so that paraphrases cluster (P1), the semantic gap is bridged, and negation separates (P3).

## 1 Introduction

Semantic compositionality is the principle in linguistics and philosophy that the meaning of a complex expression arises from the meanings of its individual components together with the way those components are structured and combined (Frege, [1892](https://arxiv.org/html/2606.06994#bib.bib1)). In order for encoders to produce faithful concept representations, they must capture conceptual compositionality. Different modifier types, such as intersective, subsective, relational, modal, and privative, contribute meaning through fundamentally different semantic operators (Carvalho et al., [2025](https://arxiv.org/html/2606.06994#bib.bib8)), yet a sentence encoder must realize all of them inside a single latent geometry scored by one similarity function. Our theoretical lens is that conceptual compositionality in encoders is an approximate homomorphism problem (formalised in Appendix [D](https://arxiv.org/html/2606.06994#A4)). An encoder yields good concept representations only when the typed semantic operators required by a concept family admit low-distortion geometric realization. The empirical question this paper asks and answers is which parts of that structure current sentence encoders already support, and which parts remain mismatched to the supervision used to train them.

We study this through concept retrieval, which makes representational quality of concept compositions directly measurable. The encoder must map a query to the same region as any denotationally equivalent target, regardless of surface form. Complex nominals pose a hard challenge. Intersective, relational, and privative modifiers each realize meaning through different operators, yet all must coexist in a single latent geometry scored by one similarity function. This is a fundamental compositional tension that is illustrated geometrically in Figure [1](https://arxiv.org/html/2606.06994#S0.F1). Evidence that this tension remains an open problem comes from our negation stress test, where the best frozen baseline scores 0.470 ROC-AUC (below chance), assigning higher similarity to negated definitions than to correct ones.

Several design dimensions are natural candidates for closing this gap. Fine-tuning on concept-equivalence pairs may reshape the latent space toward the right geometry. At the readout level, probing studies establish that upper transformer layers encode more semantic information than lower ones (Peters et al., [2018](https://arxiv.org/html/2606.06994#bib.bib17); Jawahar et al., [2019](https://arxiv.org/html/2606.06994#bib.bib7); Tenney et al., [2019](https://arxiv.org/html/2606.06994#bib.bib16)), motivating cross-layer pooling as a candidate improvement. Hard negative supervision may further sharpen discrimination between near-miss distractors. And theoretically, there is no formal account of when a sentence encoder can support a given modifier composition family, nor of the geometric conditions under which current training objectives succeed or fail.

We structure the empirical investigation around three hypotheses. H1 *(fine-tuning is necessary):* concept-equivalence fine-tuning substantially improves complex nominal retrieval over frozen baselines. H2 *(cross-layer pooling):* weighted or input-adaptive mixtures over multiple layers improve over fine-tuned mean pooling. H3 *(hard negatives):* hard negative supervision improves retrieval ranking in addition to calibration.

#### Contributions\.

(T) $\varepsilon_{\tau}$-compositionality framework: We introduce a formal characterisation of when a sentence encoder supports a given modifier composition family: $f_{\theta}$ is $\varepsilon_{\tau}$-compositional if a low-distortion latent operator $\Phi_{\tau}$ exists for modifier type $\tau$. This identifies two interacting bottlenecks (representational and objective) and predicts the modifier-family pattern (P4) from first principles. Formal bounds are derived in Appendix [D](https://arxiv.org/html/2606.06994#A4). We additionally identify four empirical principles of compositional concept representation.

(P1) Recalibration, not expansion: concept-equivalence fine-tuning reshapes the latent geometry, collapsing anisotropy from 0.126 to 0.012 and improving term-to-definition Recall@10 from 0.552 to 0.654, while leaving effective rank unchanged. Fine-tuning recalibrates which regions of the space collapse together, without growing the space.

(P2) Final-layer concentration precedes fine-tuning: sentence-level pre-training already concentrates semantic signal into the final transformer layer, explaining why cross-layer pooling offers no consistent benefit after concept-equivalence fine-tuning. This is confirmed across both NP-paraphrase and term-to-definition tasks.

(P3) Calibration and ranking are dissociable: hard negatives improve discrimination (ROC-AUC +0.19 to +0.46) without improving retrieval ranking, establishing that calibration and ranking are separable training targets.

(P4) Supervision must match composition type: concept-equivalence training improves intersective and subsective families while degrading relational and intensional types, exposing a fundamental mismatch between equivalence-only supervision and typed semantic operators.

We also release two new evaluation datasets: a DBpedia semantic-gap benchmark (3,063 training queries / 245 test pairs, zero lexical overlap) and a modifier-labeled NP paraphrase suite (4,000 pairs across five composition families).

## 2 Related Work

#### Dense retrieval and contrastive learning.

Sentence-BERT (Reimers and Gurevych, [2019](https://arxiv.org/html/2606.06994#bib.bib2)) establishes fine-tuned bi-encoders as practical dense retrievers, and SimCSE (Gao et al., [2021](https://arxiv.org/html/2606.06994#bib.bib3)) shows that in-batch contrastive objectives with hard negatives improve representation geometry. We extend this interface to concept-equivalence retrieval and find that semantic signal dominates the final layer in sentence-fine-tuned encoders, limiting the benefit of cross-layer pooling.

#### Biomedical concept normalisation\.

BioSyn (Sung et al., [2020](https://arxiv.org/html/2606.06994#bib.bib4)), SapBERT (Liu et al., [2021](https://arxiv.org/html/2606.06994#bib.bib5)), and BioLORD (Remy et al., [2022](https://arxiv.org/html/2606.06994#bib.bib6)) demonstrate that synonym-marginalization and definition-aware contrastive training substantially improve biomedical entity retrieval, and Tutubalina et al. ([2020](https://arxiv.org/html/2606.06994#bib.bib12)) show that reported accuracy depends heavily on split design. Training on dictionary definitions as concept supervision has prior support (Hill et al., [2016](https://arxiv.org/html/2606.06994#bib.bib14); Carvalho et al., [2023](https://arxiv.org/html/2606.06994#bib.bib15)); we use WordNet/Wiktionary synonym and definition pairs in the same vein, but centre our analysis on the mechanism of fine-tuning and introduce controlled modifier-type evaluation absent from prior work.

#### Distributional composition and modifier sensitivity\.

Compositional distributional semantics (Mitchell and Lapata, [2010](https://arxiv.org/html/2606.06994#bib.bib21); Baroni and Zamparelli, [2010](https://arxiv.org/html/2606.06994#bib.bib22)) shows that typed composition operators outperform uniform ones and that adjective denotation depends on semantic role, directly motivating the $\varepsilon_{\tau}$-compositionality framework in §[3](https://arxiv.org/html/2606.06994#S3). The modifier typology underlying our benchmark (intersective, subsective, relational, modal, privative) originates in formal semantics (Partee, [1995](https://arxiv.org/html/2606.06994#bib.bib23)), and Ettinger et al. ([2018](https://arxiv.org/html/2606.06994#bib.bib11)), Shwartz ([2019](https://arxiv.org/html/2606.06994#bib.bib13)), and Carvalho et al. ([2025](https://arxiv.org/html/2606.06994#bib.bib8)) show that these distinctions remain problematic for contemporary encoders. Our NP paraphrase benchmark operationalises these concerns as a retrieval task with explicit modifier-family labels, to our knowledge the first to stratify retrieval performance across Montague modifier families.

#### Layer distribution and geometry\.

Probing studies establish that lower layers of pretrained transformers encode syntax while upper layers encode semantics (Peters et al., [2018](https://arxiv.org/html/2606.06994#bib.bib17); Jawahar et al., [2019](https://arxiv.org/html/2606.06994#bib.bib7); Tenney et al., [2019](https://arxiv.org/html/2606.06994#bib.bib16); Rogers et al., [2020](https://arxiv.org/html/2606.06994#bib.bib18)), and Ethayarajh ([2019](https://arxiv.org/html/2606.06994#bib.bib9)) shows this hierarchy is reflected in anisotropy: upper layers are geometrically more uniform and task-ready. We find this distribution collapses in sentence-fine-tuned encoders: prior contrastive training has already concentrated semantic signal into the final layer, and concept-equivalence fine-tuning sharpens it further, leaving nothing for cross-layer readout to exploit. While hyperbolic geometries (Nickel and Kiela, [2017](https://arxiv.org/html/2606.06994#bib.bib10); Valentino et al., [2024](https://arxiv.org/html/2606.06994#bib.bib20)) offer theoretically richer hierarchical structure, our results confirm that geometry choice is secondary to training supervision once the space is well-calibrated.

## 3 Representational Compositionality

### 3.1 Problem Setting

Let $\mathcal{X}$ be a space of texts (terms, noun phrases, definitions, ontology labels). We learn an encoder $f_{\theta}:\mathcal{X}\to\mathbb{R}^{d}$ and a retrieval score $\mathrm{score}_{\theta}(q,y)\in\mathbb{R}$. Given a query $q$ and a candidate pool $\mathcal{D}=\{y_{1},\dots,y_{N}\}$, the objective is to rank semantically equivalent candidates first:

$$\mathrm{score}_{\theta}(q,y^{+})>\mathrm{score}_{\theta}(q,y^{-})\quad\text{whenever }\llbracket q\rrbracket=\llbracket y^{+}\rrbracket\neq\llbracket y^{-}\rrbracket,$$

where $\llbracket\cdot\rrbracket$ denotes the concept denotation. In practice, positives are synonym pairs, term-definition pairs, and cross-source definition pairs from the same concept. The hard case is a zero-overlap triplet where the correct answer $y^{+}$ shares no surface tokens with $q$.
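The ranking objective above is what the retrieval metrics reported later (R@10, MRR) measure. The following is a minimal sketch of how such metrics can be computed from cosine scores; the function names and the toy 2-d vectors are illustrative assumptions, not the paper's evaluation code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def rank_of_positive(query, candidates, positive_idx):
    """1-based rank of the positive candidate under descending cosine score."""
    scores = [cosine(query, c) for c in candidates]
    target = scores[positive_idx]
    # Every candidate scoring strictly higher pushes the positive down one place.
    return 1 + sum(1 for s in scores if s > target)

def recall_at_k(ranks, k):
    """Fraction of queries whose positive appears in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all queries."""
    return sum(1.0 / r for r in ranks) / len(ranks)

# Toy candidate pool and (query, index of equivalent candidate) pairs; values are made up.
pool = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
queries = [([0.9, 0.1], 0), ([0.2, 0.9], 2)]

ranks = [rank_of_positive(q, pool, pos) for q, pos in queries]
print(ranks)                        # [1, 2]
print(recall_at_k(ranks, 1))        # 0.5
print(mean_reciprocal_rank(ranks))  # 0.75
```

The second query shows the failure the objective penalises: a non-equivalent candidate scores above the equivalent one, so the positive drops to rank 2.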

### 3.2 From Semantic Composition to Representational Composition

Semantic compositionality alone does not guarantee compositional representations. Let $f_{\theta}:\mathcal{X}\to\mathbb{R}^{d}$ be a sentence encoder. We say that $f_{\theta}$ is $\varepsilon_{\tau}$-compositional for modifier family $\tau$ if there exists a latent operator

$$\Phi_{\tau}:\mathbb{R}^{d}\times\mathbb{R}^{d}\to\mathbb{R}^{d}$$

such that for all valid modifier–head pairs $(m,h)$ of type $\tau$,

$$\left\lVert f_{\theta}\!\left(m\circ_{\tau}h\right)-\Phi_{\tau}\!\left(f_{\theta}(m),f_{\theta}(h)\right)\right\rVert\leq\varepsilon_{\tau}.$$

This definition makes explicit the bridge between conceptual and geometric compositionality. Semantic composition is typed, but the encoder must realize all such operators in a shared latent space. Concept retrieval then depends on low distortion of the relevant $\Phi_{\tau}$, and a scoring function that ranks denotationally equivalent expressions above non-equivalent distractors.

Appendix [D](https://arxiv.org/html/2606.06994#A4) formalizes these definitions and derives theoretical bounds on retrieval distortion for each composition family.
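To make the definition concrete, the sketch below measures the worst-case distortion of one fixed candidate operator on a sample of (modifier, head, composed phrase) embeddings. The additive operator `additive_phi` and the toy vectors are assumptions chosen for illustration; the paper does not prescribe a particular $\Phi_{\tau}$. Because the definition only requires that some operator exists, the value computed for any single candidate is an upper bound on the best achievable $\varepsilon_{\tau}$ for that sample.

```python
import math

def l2(u):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(a * a for a in u))

def additive_phi(m_vec, h_vec):
    """Hypothetical latent operator: plain vector addition."""
    return [a + b for a, b in zip(m_vec, h_vec)]

def empirical_epsilon(triples, phi):
    """Largest ||f(m∘h) - phi(f(m), f(h))|| over the sampled triples."""
    worst = 0.0
    for m_vec, h_vec, composed_vec in triples:
        predicted = phi(m_vec, h_vec)
        residual = [c - p for c, p in zip(composed_vec, predicted)]
        worst = max(worst, l2(residual))
    return worst

# Each triple is (f(m), f(h), f(m∘h)); values are made up.
triples = [
    ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0]),  # exactly additive, distortion 0
    ([0.5, 0.5], [0.5, 0.0], [1.0, 0.2]),  # residual (0, -0.3), distortion 0.3
]

eps = empirical_epsilon(triples, additive_phi)
print(round(eps, 6))  # 0.3
```

Running the same measurement separately per modifier family would give one empirical $\varepsilon_{\tau}$ per family for the chosen operator.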

### 3.3 Why Pooled Embeddings Struggle

Following Montague semantics (Carvalho et al., [2025](https://arxiv.org/html/2606.06994#bib.bib8)), the denotation of a complex nominal is:

$$\llbracket m\circ_{\tau}h\rrbracket=C_{\tau}\!\bigl(\llbracket m\rrbracket,\,\llbracket h\rrbracket\bigr),$$

where $\tau$ is the modifier composition type and $C_{\tau}$ is a type-specific operator. Different modifier types instantiate materially different $C_{\tau}$. The five types considered here (intersective, subsective, relational, modal, privative) constitute the standard complete typology of modifier composition under Montague generative semantics (Carvalho et al., [2025](https://arxiv.org/html/2606.06994#bib.bib8)), spanning the full range from extensional set-intersection (intersective) to intensional non-instantiation (privative). A pooled vector $z(x)=g(H^{(1)},\dots,H^{(L)})$ cannot expose the type $\tau$ explicitly, but encodes it implicitly in the geometry.

This creates two interacting bottlenecks. The representation bottleneck: structured token-role information is collapsed into one vector, and multiple semantic relations must coexist in one shared neighborhood structure. The objective bottleneck: standard sentence embedding training rewards broad distributional semantic similarity, not operator-sensitive conceptual precision. Our study isolates and investigates these two aspects. Through contrastive fine-tuning on synonym and definition pairs directly, we target the objective bottleneck, and a cross-layer readout ablation tests whether the representation bottleneck matters after fine-tuning.

## 4 Proposed Method

[Figure 2 diagram panels: Training Data, Supervision, Bi-Encoder, Readout (ablation), Inference, Evaluation.]

Figure 2: End-to-end pipeline. WordNet/Wiktionary synonym and definition pairs (3.3M) are combined with five hard-negative rule types in a joint InfoNCE + BCE objective, training a bi-encoder (all-mpnet-base-v2) under eight ablation conditions varying readout strategy and hard-negative supervision. Evaluation covers three decontaminated term-definition splits (in-domain, HH, SH), modifier-sensitive NP paraphrase retrieval, and zero-lexical-overlap DBpedia property retrieval. Coloured badges indicate which component each hypothesis and principle targets.

### 4.1 Model and Conditions

The models used in our ablations are based on the bi-encoder sentence-transformer architecture (Reimers and Gurevych, [2019](https://arxiv.org/html/2606.06994#bib.bib2)) where both query and candidate texts are encoded independently, and similarity is computed via cosine similarity. We define readout (how a single vector is extracted from the transformer’s hidden states) as the primary design variable. Table [1](https://arxiv.org/html/2606.06994#S4.T1) defines the eight conditions evaluated.

Table 1: Encoder conditions (shared backbone: all-mpnet-base-v2, 8k steps, batch 16, single seed). B0/B0-WM are frozen and differ only in pooling, isolating cross-layer averaging without fine-tuning. Weighted mean pools the last $K=4$ layers; M2 uses a per-input MLP gate; M3 attaches per-layer projection heads during training.

The full eight-condition ablation is conducted on a single backbone, all-mpnet-base-v2, to isolate the contribution of each readout and supervision design choice without conflating backbone effects. To test whether the layerwise structure and the B0 $\to$ B1 gain generalise beyond this choice, we additionally run the B0 (frozen) and B1 (fine-tuned) conditions on three further backbones: paraphrase-mpnet-base-v2 (same MPNet architecture, paraphrase-oriented pre-training objective), e5-base-v2 (a different model family with strong general-purpose text embeddings), and jina-embeddings-v5-text-nano (PEFT-based, trained with a distinct contrastive objective outside the standard sentence-transformer paradigm). Together, these four backbones provide same-architecture/different-objective variation (all-mpnet vs. para-mpnet), cross-family generalisation (e5-base), and a deliberate architectural contrast case (jina-nano).

### 4.2 Cross-Layer Readout

Let $H^{(l)}(x)\in\mathbb{R}^{T\times d}$ be the token hidden states at layer $l$ for an input of length $T$. The B1 baseline uses standard mean pooling over the final layer only. The cross-layer variants instead form a weighted mixture over the last $K=4$ layers before pooling:

$$z(x)=P\!\left(\sum_{l=L-K+1}^{L}\alpha_{l}\,H^{(l)}(x)\right),\qquad\alpha=\mathrm{softmax}(w),$$

where $P(\cdot)$ denotes mean pooling, $w\in\mathbb{R}^{K}$ are learned scalar weights shared across all inputs (B3, M1), and $z(x)$ is L2-normalised before scoring.
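The weighted readout can be sketched in a few lines of plain Python. This is an illustrative reading of the formula above, assuming nested lists in place of tensors and a toy $K=2$ rather than the paper's $K=4$; the function names are hypothetical.

```python
import math

def softmax(w):
    """Numerically stable softmax over a list of scalars."""
    m = max(w)
    exps = [math.exp(x - m) for x in w]
    total = sum(exps)
    return [e / total for e in exps]

def cross_layer_readout(hidden_states, w):
    """hidden_states: K layers, each a list of T token vectors of size d.
    Mixes layers with softmax(w), mean-pools over tokens, then L2-normalises."""
    alpha = softmax(w)
    num_tokens = len(hidden_states[0])
    dim = len(hidden_states[0][0])
    # Weighted sum over layers, computed token by token.
    mixed = [
        [sum(alpha[k] * hidden_states[k][t][j] for k in range(len(alpha)))
         for j in range(dim)]
        for t in range(num_tokens)
    ]
    # Mean pooling P(.) over the token axis.
    pooled = [sum(tok[j] for tok in mixed) / num_tokens for j in range(dim)]
    norm = math.sqrt(sum(x * x for x in pooled))
    return [x / norm for x in pooled]

# Two layers, two tokens, d = 2; values are arbitrary.
layers = [
    [[1.0, 0.0], [1.0, 0.0]],  # earlier layer
    [[0.0, 2.0], [0.0, 2.0]],  # final layer
]

z_equal = cross_layer_readout(layers, w=[0.0, 0.0])    # alpha = (0.5, 0.5)
z_final = cross_layer_readout(layers, w=[-50.0, 50.0])  # alpha is almost (0, 1)
```

With nearly all weight on the last layer, `z_final` coincides with final-layer mean pooling, which is the B1 readout as a limiting case of the mixture.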

M2 (input-dependent gating). The global weights $w$ are replaced by a per-input MLP:

$$\alpha(x)=\mathrm{softmax}\!\left(\mathrm{MLP}\!\left(P(H^{(L)}(x))\right)\right),$$

so different inputs can weight layers differently.

M3 (deep supervision). During training, a separate linear projection head $\phi_{l}$ is attached to each of the last $K$ layers, and an InfoNCE loss is computed at every layer independently:

$$\mathcal{L}_{\mathrm{deep}}=\sum_{l=L-K+1}^{L}\mathcal{L}_{\mathrm{InfoNCE}}\!\left(\phi_{l}\!\left(P(H^{(l)})\right)\right).$$

At inference the projection heads are discarded and the standard weighted-mean readout is used, forcing all layers to develop useful representations during training rather than letting the gradient concentrate at the final layer.

### 4.3 Training Objective

The positive loss is a weighted InfoNCE sum over three supervision views:

$$\mathcal{L}_{\mathrm{pos}}=\mathcal{L}_{\mathrm{syn}}+\mathcal{L}_{\mathrm{t2d}}+0.7\,\mathcal{L}_{\mathrm{d2d}},$$

where $\mathcal{L}_{\mathrm{syn}}$, $\mathcal{L}_{\mathrm{t2d}}$, and $\mathcal{L}_{\mathrm{d2d}}$ are InfoNCE losses over synonym pairs, term-definition pairs, and cross-source definition-definition pairs respectively. The 0.7 weight on d2d reflects its noisier cross-source alignment signal.

When hard negatives are used (all conditions except B0 and B3), a binary cross-entropy term targets five rule-based negative types per definition (see Appendix [G](https://arxiv.org/html/2606.06994#A7) for full rule definitions and examples). The complete objective is:

$$\mathcal{L}=\mathcal{L}_{\mathrm{pos}}+0.5\,\mathcal{L}_{\mathrm{neg}},$$

where the 0.5 coefficient moderates the contribution of the hard-negative term relative to the positive InfoNCE objective. Unlike the batch-level InfoNCE loss, the BCE term applies a gradient of fixed magnitude to each negative pair independently of batch size. Without down-weighting, its aggregate contribution would scale with the number of negative pairs and risk overshadowing the ranking objective. The coefficient was not ablated; however, Appendix [F](https://arxiv.org/html/2606.06994#A6) provides evidence of its importance by evaluating a variant (B1-unified) in which hard negatives are folded into the InfoNCE denominator, eliminating the need for an explicit weight at the cost of a substantial reduction in calibration.
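The combined objective can be sketched as follows. Only the 0.7 and 0.5 weights come from the text; the temperature, the use of in-batch negatives with the positive on the diagonal, and the sigmoid mapping from score to probability in the BCE term are assumptions made for illustration.

```python
import math

def info_nce(sim, temperature=0.05):
    """In-batch InfoNCE: row i's positive is column i of the similarity matrix.
    The temperature value is an assumption, not taken from the paper."""
    loss = 0.0
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        loss += log_z - logits[i]
    return loss / len(sim)

def bce(scores, labels, scale=1.0):
    """Binary cross-entropy on sigmoid(scale * score); labels are 0 or 1.
    Hard negatives would carry label 0 under this assumed formulation."""
    total = 0.0
    for s, y in zip(scores, labels):
        p = 1.0 / (1.0 + math.exp(-scale * s))
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # guard against log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(scores)

def total_loss(l_syn, l_t2d, l_d2d, l_neg, w_d2d=0.7, w_neg=0.5):
    """L = L_syn + L_t2d + 0.7 * L_d2d + 0.5 * L_neg."""
    return l_syn + l_t2d + w_d2d * l_d2d + w_neg * l_neg

# Toy similarity matrix for a batch of two pairs, and two hard-negative scores.
sim = [[1.0, 0.0], [0.0, 1.0]]
l_pos_view = info_nce(sim, temperature=1.0)
l_neg = bce(scores=[0.0, -2.0], labels=[0, 0])
loss = total_loss(l_pos_view, l_pos_view, l_pos_view, l_neg)
```

Note that `bce` averages over pairs here; a summed variant would grow with the number of negative pairs, which is the scaling concern the 0.5 coefficient is described as moderating.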

### 4.4 Training Details

Full hyperparameters are listed in Appendix [E](https://arxiv.org/html/2606.06994#A5). All experiments use AdamW with linear warmup (100 steps) and linear decay to zero. A layerwise LR decay of 0.90 per transformer layer mitigates catastrophic forgetting (Howard and Ruder, [2018](https://arxiv.org/html/2606.06994#bib.bib19)); the pooling head and cross-layer weights receive $5\times$ the base rate.
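The learning-rate layout described above can be illustrated with a short sketch. The base rate `2e-5` is a placeholder (the actual value is in Appendix E), and the sketch assumes that the top transformer layer receives the base rate with each layer below it scaled by a further factor of 0.90; the group names are hypothetical.

```python
def layerwise_learning_rates(base_lr, num_layers=12, decay=0.90, head_multiplier=5.0):
    """Map parameter-group names to peak learning rates.
    Assumes the top layer gets base_lr and lower layers decay geometrically."""
    rates = {}
    for layer in range(1, num_layers + 1):
        depth_from_top = num_layers - layer
        rates[f"layer_{layer}"] = base_lr * decay ** depth_from_top
    # Pooling head and cross-layer weights train at 5x the base rate.
    rates["readout_head"] = base_lr * head_multiplier
    return rates

def schedule_factor(step, warmup_steps=100, total_steps=8000):
    """Linear warmup to 1.0, then linear decay to zero at total_steps."""
    if step < warmup_steps:
        return step / warmup_steps
    return max(0.0, (total_steps - step) / (total_steps - warmup_steps))

rates = layerwise_learning_rates(base_lr=2e-5)  # placeholder base rate
lr_layer1_midway = rates["layer_1"] * schedule_factor(4050)
```

Under these assumptions the lowest layer trains at about 31% of the base rate (0.9 to the power 11), so early layers move far less than the readout parameters.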

### 4.5 Data

#### Training.

3.3 million synonym and term-to-definition pairs from WordNet (527K: 315K syn + 212K t2d) and Wiktionary (2.8M: 590K syn + 2.2M t2d). Cross-source definition–definition pairs (d2d) are generated on-the-fly by grouping definitions of the same term across both sources (up to 6 per term; not counted in the 3.3M total). A headword-masking transform removes the defined term from each definition before training to prevent lexical shortcutting. Representative examples of all three pair types are given in Appendix [B](https://arxiv.org/html/2606.06994#A2).
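A headword-masking transform of the kind described could look like the sketch below. The text does not specify the exact rule, so the whole-word, case-insensitive matching and the `[MASK]` replacement token are assumptions; passing an empty string as `mask_token` would remove the term outright.

```python
import re

def mask_headword(term, definition, mask_token="[MASK]"):
    """Replace whole-word, case-insensitive occurrences of `term` in `definition`.
    Inflected forms (e.g. plurals) are deliberately left untouched in this sketch."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", flags=re.IGNORECASE)
    return pattern.sub(mask_token, definition)

masked = mask_headword(
    "mammal",
    "A mammal is a warm-blooded vertebrate; mammals nurse their young.",
)
print(masked)
# A [MASK] is a warm-blooded vertebrate; mammals nurse their young.
```

The surviving plural shows why a production transform would likely need lemmatisation or morphological matching to close the lexical shortcut fully.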

#### Evaluation splits\.

Three decontaminated splits, each with test concepts strictly absent from training. In-domain (t2d / d2t): held-out WordNet/Wiktionary pairs partitioned at the synset level. Head-holdout (HH): the most frequent synsets withheld entirely from training, testing generalisation to high-frequency unseen concepts. Source-holdout (SH): trained on Wiktionary, evaluated on WordNet, testing cross-source generalisation. Each split includes hard-negative stress tests (Appendix [G](https://arxiv.org/html/2606.06994#A7)).

#### DBpedia semantic\-gap benchmark\.

A new evaluation dataset of \(query, property label\) pairs over the DBpedia ontology, consisting of 3,063 training queries generated by GPT\-4o\-mini with zero lexical overlap with property labels, and 245 manually reviewed test pairs with strictly disjoint train/test properties\.

#### NP paraphrase benchmark\.

A new modifier-sensitive evaluation suite of 4,000 noun-phrase paraphrase pairs, roughly balanced across five composition families: intersective (900), subsective (800), modal (800), privative (800), and relational (700). These cover the complete Montague modifier typology (Section [3](https://arxiv.org/html/2606.06994#S3)). Pairs are drawn from two equivalence regimes of 2,000 each: strict (GPT-4o-mini-generated paraphrases with zero lexical overlap enforced after head masking) and near (rule-based head-broadening or relation-abstraction transforms applied to the strict pairs).

## 5 Empirical Analysis

### 5.1 Layer-12 Dominance: The Mechanism Behind H2

[Figure 3 plots. Left: retrieval quality (MRR) by layer index, layers 5 to 12. Right: geometry (anisotropy) by layer index, layers 5 to 12. Series: B0 and B1 for all-mpnet, para-mpnet, e5-base, and jina-nano.]

Figure 3: Per-layer MRR (left) and anisotropy (right) for transformer layers 5–12 on NP paraphrase pairs (solid = B0 frozen, dashed = B1 fine-tuned). Final-layer MRR dominance pre-exists fine-tuning across all tested backbones: layer-12 MRR far exceeds earlier layers, with anisotropy dropping sharply at the final layer. Fine-tuning sharpens both effects without redistributing signal to earlier layers, leaving nothing for cross-layer readout to exploit.

We present the layerwise analysis before the H1 ablation to establish the mechanistic explanation that directly predicts the H2 null result and motivates the $K=4$ window choice in the cross-layer conditions. B1 results appear here alongside B0 for contrast; that B1 represents a substantial retrieval improvement over B0 is confirmed in §[5.2](https://arxiv.org/html/2606.06994#S5.SS2).

Figure [3](https://arxiv.org/html/2606.06994#S5.F3) shows that the final layer dominates in both B0 and B1. MRR at layer 12 far exceeds layers 5–7, while anisotropy drops sharply at the last two layers. Fine-tuning sharpens this concentration, in that earlier layers see lower MRR and higher anisotropy after training, but training does not redistribute information across layers. The pattern holds across all tested sentence-transformer backbones.

Probing studies (Jawahar et al., [2019](https://arxiv.org/html/2606.06994#bib.bib7); Tenney et al., [2019](https://arxiv.org/html/2606.06994#bib.bib16)) established the layer hierarchy on raw pretrained models, where signal *is* distributed and cross-layer readout could plausibly help. In an already-sentence-fine-tuned encoder, that distribution collapses before our fine-tuning even begins, and concept-equivalence training sharpens it further. There is no distributed multi-layer signal for a cross-layer readout to exploit. The H2 ablation (Table [3](https://arxiv.org/html/2606.06994#S5.T3), §[5.3](https://arxiv.org/html/2606.06994#S5.SS3)) confirms this directly.

### 5.2 H1: Fine-Tuning Substantially Improves Concept Retrieval

Table [2](https://arxiv.org/html/2606.06994#S5.T2) shows that B0 (frozen all-mpnet-base-v2 with mean pooling) achieves t2d R@10 = 0.552 and negate-stress ROC-AUC = 0.470 (below chance). We highlight this failure mode in the introduction, where the frozen encoder assigns *higher* similarity to negated definitions than to correct ones. Concept-equivalence fine-tuning resolves both, as B1 reaches t2d R@10 = 0.654 and negate ROC-AUC = 0.980, confirming that supervision is necessary and that it specifically addresses the negation failure. However, fine-tuning is not uniformly beneficial, given that B0 already achieves NP paraphrase R@10 = 0.704, exceeding all fine-tuned conditions (B1: 0.656, B3: 0.655). The backbone’s prior sentence-level training is well-suited to paraphrase retrieval. Concept-equivalence fine-tuning improves cross-modal concept matching precisely where the backbone is weak, at the cost of paraphrase-level similarity where it was already well-calibrated.

The B3 vs\. M1 comparison reveals that B3 matches M1 on retrieval ranking \(0\.657 vs\. 0\.651 t2d R@10\) but is far weaker on calibration \(pair ROC\-AUC 0\.769 vs\. 0\.961\) and semantic stress tests \(negate ROC\-AUC 0\.528 vs\. 0\.984\)\. Hard negatives train the model to*discriminate*, not to*rank*\. Whether hard negatives are necessary depends entirely on the downstream use case\.

Table 2: H1 ablation results. t2d = term-to-definition (in-domain); HH = head-holdout; Negate = negation stress-test ROC-AUC; NP R@10 = noun-phrase paraphrase ($n$ = 4,000). Bold: best fine-tuned condition per column.

Fine-tuning collapses anisotropy from 0.126 to 0.012 while leaving effective rank essentially unchanged (247.0 → 247.2), confirming that concept-equivalence training reorganises the representation space rather than expanding it. Hard negatives are the dominant factor: B3, trained without hard negatives under an otherwise identical regime, has an anisotropy of 0.166. Full geometry statistics, condition-level comparisons, and an anisotropy–effective-rank scatter are in Appendix [H](https://arxiv.org/html/2606.06994#A8) (Figures [4](https://arxiv.org/html/2606.06994#A8.F4) and [5](https://arxiv.org/html/2606.06994#A8.F5)).

### 5.3 H2: Cross-Layer Pooling Offers No Consistent Benefit

Table [3](https://arxiv.org/html/2606.06994#S5.T3) reveals cross-layer pooling to be redundant. No pooling variant consistently outperforms the fine-tuned mean-pool baseline B1. CLS pooling (B2) and the wider $K=6$ window underperform B1 on all splits. Input-dependent gating (M2) is the only variant that matches or marginally exceeds B1 (+0.005 t2d R@10), and the margin is <1% absolute. Deep supervision (M3) improves over the static weighted mixture M1 but does not reach B1. This is the expected consequence of the layer-concentration principle (Section [5.1](https://arxiv.org/html/2606.06994#S5.SS1)).
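The readout variants compared here can be sketched in a few lines. The following is a minimal NumPy illustration, not the paper's implementation: the function names, the softmax parameterisation of the layer weights, and the array shapes are all assumptions made for the example.

```python
import numpy as np

def mean_pool(hidden, mask):
    """Average token vectors, ignoring padding. hidden: (T, d); mask: (T,) of 0/1."""
    m = np.asarray(mask, dtype=hidden.dtype)[:, None]
    return (hidden * m).sum(axis=0) / np.maximum(m.sum(), 1.0)

def cls_pool(hidden):
    """Use the first token's vector as the sentence representation."""
    return hidden[0]

def layer_mixture(layers, logits, mask):
    """Static weighted mixture of mean-pooled layers (assumed form).
    layers: (K, T, d) hidden states of the last K layers; logits: (K,) layer scores."""
    w = np.exp(logits - np.max(logits))
    w = w / w.sum()                                            # softmax over layers
    pooled = np.stack([mean_pool(h, mask) for h in layers])   # (K, d)
    return w @ pooled
```

If the layers carry nearly the same content, `layer_mixture` returns nearly the same vector as `mean_pool` on the final layer regardless of the weights, which is one way to read the null result above.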

Table 3: H2 retrieval results (R@10). HH = head-holdout; t2d = term-to-definition; d2t = definition-to-term. No cross-layer variant consistently beats B1; M2 matches or marginally exceeds it by <1% absolute. Bold: best per column.

### 5.4 H3: Hard Negatives Improve Calibration, Not Ranking

Table 4: H3 stress-test results (ROC-AUC, in-domain). Upper: retrieval ranking and calibration; lower: per-type stress tests. B3 vs. M1 isolates hard-negative supervision (same pooling and budget). Bold: best per row.

Table [4](https://arxiv.org/html/2606.06994#S5.T4) shows that hard negatives improve discrimination, not ranking. B3 and M1 provide the cleanest isolation, differing only in whether the BCE hard-negative term is active. B3 achieves t2d R@10 = 0.657 vs. M1's 0.651 (−0.006, effectively unchanged), while negate ROC-AUC jumps from 0.528 to 0.984 (+0.456) and type-swap from 0.472 to 0.976 (+0.504). Ranking and calibration are separable properties of the concept representation space, governed by different components of the training objective.
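To make the "different components of the training objective" concrete, here is a rough sketch of an objective with a ranking term and a separate hard-negative term. It assumes an in-batch InfoNCE loss for ranking and a sigmoid BCE on explicit pairs for discrimination; the exact losses, the temperature, the score scale, and the weighting are assumptions for illustration, since this section does not specify them.

```python
import numpy as np

def info_nce(sim, temperature=0.05):
    """In-batch contrastive (ranking) loss; row i's positive is column i. sim: (B, B)."""
    logits = np.asarray(sim, dtype=float) / temperature
    logits = logits - logits.max(axis=1, keepdims=True)      # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))

def bce_pairs(scores, labels, scale=10.0):
    """Binary cross-entropy on explicit pairs; label 1 = equivalent, 0 = hard negative."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=float)
    p = 1.0 / (1.0 + np.exp(-scale * scores))
    eps = 1e-12
    return float(-np.mean(labels * np.log(p + eps) + (1 - labels) * np.log(1 - p + eps)))

def total_loss(sim, pair_scores, pair_labels, bce_weight=1.0):
    """bce_weight = 0 switches the hard-negative term off (hypothetical knob)."""
    return info_nce(sim) + bce_weight * bce_pairs(pair_scores, pair_labels)
```

Under these assumptions, the ranking term only sees relative order within a batch, while the BCE term pushes absolute scores of hard negatives down, which is consistent with the two properties moving independently.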

We conclude that hard negatives are warranted when the downstream task requires semantic discrimination\. These tasks include pair scoring, similarity thresholding, and robustness to adversarial paraphrase\. For ranking\-only pipelines, where the goal is Recall@K rather than a calibrated similarity score, B3 matches or exceeds M1, and the additional complexity of rule\-based hard\-negative generation and BCE weighting is unnecessary\.
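The distinction between ranking and calibrated scoring can be illustrated with the two metric families used in this section. The sketch below assumes a dense query-by-candidate similarity matrix and a pooled set of positive and negative pair scores; the function names and input layout are hypothetical.

```python
import numpy as np

def recall_at_k(sim, gold, k=10):
    """sim: (Q, C) query-candidate scores; gold: (Q,) index of the correct candidate."""
    sim = np.asarray(sim, dtype=float)
    gold = np.asarray(gold)
    topk = np.argsort(-sim, axis=1)[:, :k]
    return float(np.mean((topk == gold[:, None]).any(axis=1)))

def roc_auc(pos_scores, neg_scores):
    """Probability that a random positive pair outscores a random negative pair.
    Ties count as 0.5 (Mann-Whitney formulation)."""
    pos = np.asarray(pos_scores, dtype=float)[:, None]
    neg = np.asarray(neg_scores, dtype=float)[None, :]
    return float(np.mean((pos > neg) + 0.5 * (pos == neg)))
```

As a toy case, take `sim = [[0.9, 0.8], [0.2, 0.3]]` with `gold = [0, 1]`: each query ranks its own target first, so `recall_at_k(sim, gold, k=1)` is 1.0, yet the pooled scores give `roc_auc([0.9, 0.3], [0.8, 0.2])` of 0.75 because the score scales differ across queries. Perfect ranking does not imply a usable global threshold.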

### 5.5 Modifier-Family Analysis: Extensional Supervision Helps Extensional Families

Table [5](https://arxiv.org/html/2606.06994#S5.T5) breaks down NP Recall@1 by modifier type, revealing a pattern that mirrors the semantic type hierarchy in Section [3](https://arxiv.org/html/2606.06994#S3) (see Appendix [C](https://arxiv.org/html/2606.06994#A3) for a plain-language description of each family). Extensional families (intersective, subsective) allow the correct paraphrase to be recovered from proximity to the head-noun class, while relational, modal, and privative families introduce implicit arguments, possible-world reference, or head-extension negation that make head-noun neighbours actively misleading. Privative and modal accordingly remain the hardest families (below 0.08 and 0.21 respectively) across all conditions.

B1 shows the largest intersective gain over B0 (+0.071 R@1) but substantially hurts relational recall (−0.162); no fine-tuned condition recovers relational performance. This follows directly from the supervision signal, as synonym and definition pairs encode extensional equivalence, improving families whose $C_\tau$ is extensional (intersective, subsective) while degrading or leaving unchanged those with relational or intensional $C_\tau$. We call this the supervision–composition matching principle: a concept embedding benefits from fine-tuning only when the supervision encodes the same semantic composition structure as the target concept class. Closing the gap for relational and intensional families requires type-matched supervision, which remains an open problem.

Table 5: NP paraphrase R@1 by modifier family ($n$ = 4,000 pool). Bold: best per row. ROC-AUC is uniformly high (0.945–0.999) and omitted.

Table 6: Cross-domain transfer to DBpedia (245 queries, 2,849 candidates). Non-italic = WordNet/Wiktionary-only training; italic = 300 in-domain DBpedia steps (upper bound). Bold: best cross-domain.

The aggregate results in Table [5](https://arxiv.org/html/2606.06994#S5.T5) (mechanistic breakdown in Appendix [I](https://arxiv.org/html/2606.06994#A9)) confirm that extensional supervision improves intersective and subsective families (+0.064 and +0.041 R@1) while degrading relational (−0.162, 311 regressions) and leaving modal and privative largely unchanged, consistent with P4 and the geometric account in Section [3.2](https://arxiv.org/html/2606.06994#S3.SS2).

### 5.6 Cross-Domain Transfer: P1 and P4 Generalise Beyond the Training Domain

Table [6](https://arxiv.org/html/2606.06994#S5.T6) confirms both principles in a new domain. Generic fine-tuning (B1, trained on WordNet/Wiktionary) transfers partially: DBpedia R@10 improves from 0.608 to 0.649, consistent with P1, since the recalibrated geometry clusters concept-equivalent expressions more tightly regardless of domain. But 300 steps of DBpedia-specific supervision reach R@10 = 0.845, a further +0.196 gain that generic fine-tuning cannot close. This gap is the cross-domain expression of P4, showing that WordNet/Wiktionary encodes lexicographic equivalence between synonyms and dictionary definitions, while DBpedia requires mapping natural-language queries to terse property labels (deathPlace, populationTotal) whose structure never appears in lexicographic training. We validate that P1 generalises across encoder architectures in Appendix [J](https://arxiv.org/html/2606.06994#A10). Contrastive fine-tuning improves every tested backbone, with larger gains where the frozen baseline is weaker.

## 6 Conclusion

We identify four principles that govern concept-equivalent retrieval with sentence encoders. They concern the training signal, the readout strategy, the use of hard negatives, and the match between supervision and semantic composition type. These principles offer concrete guidance to practitioners: use a sentence-fine-tuned encoder with mean pooling, add hard negatives only when calibrated scoring matters more than ranking, and match supervision to the semantic composition structure of the target concept class.

## Limitations

Sentence-encoder backbones. All H1/H2 ablations use all-mpnet-base-v2, which is already sentence-level contrastively trained; the H2 null result may therefore be specific to already-fine-tuned backbones. On a raw pretrained transformer, cross-layer readout may still recover meaningful signal.

Single seed. All conditions use one training seed. H1 margins are large enough to be robust; H2 margins (<1% absolute) may not be stable across seeds.

Type-matched supervision. The supervision–composition matching principle (Section [5](https://arxiv.org/html/2606.06994#S5)) indicates the need for relation-typed and negation-aware supervision pairs, which do not currently exist at scale, so relational, modal, and privative concept types remain systematically underserved. Constructing such resources may be a valuable next step to extend this work.

## References

- Nouns are vectors, adjectives are functions: experiments with compositional models of meaning. In Proceedings of EMNLP 2010, pp. 1183–1193.
- D. S. Carvalho, E. Manino, J. Rozanova, L. Cordeiro, and A. Freitas (2025) Montague semantics and modifier consistency measurement in neural language models. In Proceedings of COLING 2025.
- D. S. Carvalho, G. Mercatali, Y. Zhang, and A. Freitas (2023) Learning disentangled representations for natural language definitions. In Findings of EACL 2023.
- K. Ethayarajh (2019) How contextual are contextualized word representations? In Proceedings of EMNLP 2019.
- A. Ettinger, A. Elgohary, C. Phillips, and P. Resnik (2018) Assessing composition in sentence vector representations. In Proceedings of COLING 2018.
- G. Frege (1892) Über Sinn und Bedeutung. Zeitschrift für Philosophie und philosophische Kritik 100, pp. 25–50.
- T. Gao, X. Yao, and D. Chen (2021) SimCSE: simple contrastive learning of sentence embeddings. In Proceedings of EMNLP 2021.
- F. Hill, K. Cho, A. Korhonen, and Y. Bengio (2016) Learning to understand phrases by embedding the dictionary. Transactions of the Association for Computational Linguistics 4, pp. 17–30.
- J. Howard and S. Ruder (2018) Universal language model fine-tuning for text classification. In Proceedings of ACL 2018.
- G. Jawahar, B. Sagot, and D. Seddah (2019) What does BERT learn about the structure of language? In Proceedings of ACL 2019.
- F. Liu, E. Shareghi, Z. Meng, M. Basaldella, and N. Collier (2021) Self-alignment pretraining for biomedical entity representations. In Proceedings of NAACL 2021.
- J. Mitchell and M. Lapata (2010) Composition in distributional models of semantics. Cognitive Science 34(8), pp. 1388–1429.
- M. Nickel and D. Kiela (2017) Poincaré embeddings for learning hierarchical representations. In Advances in Neural Information Processing Systems 30.
- B. H. Partee (1995) Lexical semantics and compositionality. In An Invitation to Cognitive Science: Language, L. Gleitman and M. Liberman (Eds.), Vol. 1, pp. 311–360.
- M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer (2018) Deep contextualized word representations. In Proceedings of NAACL 2018.
- N. Reimers and I. Gurevych (2019) Sentence-BERT: sentence embeddings using Siamese BERT-networks. In Proceedings of EMNLP-IJCNLP 2019.
- F. Remy, K. Demuynck, and T. Demeester (2022) BioLORD: learning ontological representations from definitions for biomedical concepts. In Findings of EMNLP 2022.
- A. Rogers, O. Kovaleva, and A. Rumshisky (2020) A primer in BERTology: what we know about how BERT works. Transactions of the Association for Computational Linguistics 8, pp. 842–866.
- V. Shwartz (2019) A systematic comparison of English noun compound representations. In Proceedings of the MWE-WN Workshop, ACL 2019.
- M. Sung, H. Jeon, J. Lee, and J. Kang (2020) Biomedical entity representations with synonym marginalization. In Proceedings of ACL 2020.
- I. Tenney, D. Das, and E. Pavlick (2019) BERT rediscovers the classical NLP pipeline. In Proceedings of ACL 2019.
- E. Tutubalina, A. Kadurin, and Z. Miftahutdinov (2020) Fair evaluation in concept normalization: a large-scale comparative analysis for BERT-based models. In Proceedings of COLING 2020.
- M. Valentino, D. Carvalho, and A. Freitas (2024) Multi-relational hyperbolic word embeddings from natural language definitions. In Proceedings of EACL 2024.

## Appendix A Reproducibility

All backbone models used in this work (all-mpnet-base-v2, paraphrase-mpnet-base-v2, e5-base-v2, jina-embeddings-v5-text-nano, and paraphrase-MiniLM-L6) are publicly available from the Sentence-Transformers library (Reimers and Gurevych, [2019](https://arxiv.org/html/2606.06994#bib.bib2)). Training data are drawn exclusively from WordNet and Wiktionary, both publicly available; the exact pair-generation procedure is described in the Method section and Appendix [B](https://arxiv.org/html/2606.06994#A2). Full hyperparameters (learning rate, batch size, warmup, layerwise decay schedule, pooling window $K$) are listed in Appendix [E](https://arxiv.org/html/2606.06994#A5). The two evaluation benchmarks introduced in this work, the DBpedia semantic-gap benchmark and the modifier-labeled NP paraphrase suite, are released alongside the paper. Training code, configuration files, and model checkpoints for all eight conditions are released publicly. (Footnote 1: Code release is withheld during double-blind review to preserve author anonymity.)

## Appendix B Training Data Examples

Table [7](https://arxiv.org/html/2606.06994#A2.T7) illustrates the three pair types used during training. Synonym pairs align near-synonymous terms from the same source. Term-to-definition (t2d) pairs align a term with its dictionary definition; the headword is masked from the definition before training to prevent lexical shortcutting. Cross-source definition–definition (d2d) pairs are generated on-the-fly by pairing definitions of the same concept across WordNet and Wiktionary.
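The headword-masking step for t2d pairs might look like the sketch below. The mask token, the whole-word case-insensitive matching rule, and the function name are assumptions for illustration; the paper's actual procedure may differ (for example in how inflected forms are handled).

```python
import re

def mask_headword(definition, headword, mask_token="[MASK]"):
    """Replace whole-word, case-insensitive occurrences of the headword in its definition.
    Inflected forms (e.g. plurals) are NOT matched by this simple rule."""
    pattern = r"\b" + re.escape(headword) + r"\b"
    return re.sub(pattern, mask_token, definition, flags=re.IGNORECASE)
```

Without such masking, a term-to-definition pair could be matched by simple string overlap between the term and its own definition, which is the lexical shortcut the text refers to.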

Table 7: Representative training pair examples for each supervision type. WN = WordNet; Wkt = Wiktionary. Headwords are masked from definitions during training.

## Appendix C Modifier Composition Types

The five modifier families used throughout this paper originate in formal semantics (Partee, [1995](https://arxiv.org/html/2606.06994#bib.bib23); Carvalho et al., [2025](https://arxiv.org/html/2606.06994#bib.bib8)). Table [8](https://arxiv.org/html/2606.06994#A3.T8) provides an accessible reference with concrete adjective–noun examples and plain-language descriptions of the underlying meaning mechanism, for readers less familiar with this typology.

Table 8: The five modifier composition families with adjective–noun examples and plain-language meaning descriptions. Each family requires a structurally different semantic operator $C_\tau$ (Section [3](https://arxiv.org/html/2606.06994#S3)), which is why a single latent geometry cannot support all of them equally well under uniform supervision.

## Appendix D Theoretical Foundations of Conceptual Compositionality

This appendix formalizes the theoretical perspective used in the main text\. It states the structural assumptions under which the observed modifier\-family differences, pooling null result, and fine\-tuning effects follow naturally\.

### D.1 Setup and Notation

Let $(\mathcal{X}, \circ_\tau)$ denote a typed expression algebra, where $\tau$ indexes composition types such as intersective, subsective, relational, modal, and privative modification. Let $\llbracket \cdot \rrbracket : \mathcal{X} \to \mathcal{C}$ be a semantic interpretation function into a concept domain $\mathcal{C}$. Let $f : \mathcal{X} \to \mathbb{R}^d$ be a sentence encoder, and let $s : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ be the similarity score used for retrieval. We write $x \sim y$ when $\llbracket x \rrbracket = \llbracket y \rrbracket$.

We use the $\varepsilon_\tau$-compositionality definition from §[3](https://arxiv.org/html/2606.06994#S3): encoder $f$ is $\varepsilon_\tau$-compositional for type $\tau$ if there exists $\Phi_\tau : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}^d$ such that $\|f(m \circ_\tau h) - \Phi_\tau(f(m), f(h))\| \leq \varepsilon_\tau$ for all valid modifier–head pairs $(m, h)$ of type $\tau$. The quantity $\varepsilon_\tau$ measures the distortion incurred when a typed semantic operator is realized in a single geometric representation space. Small $\varepsilon_\tau$ indicates that the encoder supports the corresponding composition family; large $\varepsilon_\tau$ indicates an operator–geometry mismatch.
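One way to probe $\varepsilon_\tau$ empirically is to fix a function class for $\Phi_\tau$ and measure the worst-case residual. The sketch below assumes a linear operator on the concatenated constituent embeddings, fitted by least squares; the linear form, the function names, and the use of the same pairs for fitting and scoring are simplifying assumptions, not the paper's protocol.

```python
import numpy as np

def fit_linear_operator(fm, fh, fc):
    """Least-squares fit of Phi(a, b) = [a; b] @ W.
    fm, fh, fc: (N, d) embeddings of modifiers, heads, and composed phrases."""
    X = np.concatenate([fm, fh], axis=1)           # (N, 2d)
    W, *_ = np.linalg.lstsq(X, fc, rcond=None)     # (2d, d)
    return W

def empirical_distortion(W, fm, fh, fc):
    """Worst-case residual norm over the given pairs: an estimate of eps_tau for this Phi."""
    pred = np.concatenate([fm, fh], axis=1) @ W
    return float(np.linalg.norm(fc - pred, axis=1).max())
```

A family whose composed embeddings are close to a linear function of the constituents yields a small value; a family that needs information absent from `fm` and `fh` yields a large one. In practice the distortion would be reported on held-out pairs.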

### D.2 Low-Distortion Homomorphism Requirement

###### Theorem D.1 (Approximate Homomorphism Requirement).

Suppose retrieval over a composition family $\tau$ is stable under denotational equivalence: for any $x = m \circ_\tau h$ and any $y$ such that $x \sim y$, the representations $f(x)$ and $f(y)$ lie within radius $\delta$ of one another. Then there exists a latent operator $\Phi_\tau$ whose empirical distortion over observed compositions is at most the optimal prediction error from constituent embeddings:

$$\sup_{(m,h)} \left\| f(m \circ_\tau h) - \Phi_\tau(f(m), f(h)) \right\| \leq \varepsilon_\tau^{*},$$

where $\varepsilon_\tau^{*}$ is the minimum achievable reconstruction error over operators on $\mathbb{R}^d \times \mathbb{R}^d$.

###### Proof.

Let $Z_\tau = \{(f(m), f(h), f(m \circ_\tau h))\}$ be the set of observed constituent–composition triples. Define $\Phi_\tau$ as any minimizer of worst-case reconstruction error over $Z_\tau$; existence on finite support is immediate by table-lookup construction, assigning each observed input pair to a centroid of the corresponding target set. It remains to bound $\varepsilon_\tau^{*}$ using the stability assumption. For each pair $(f(m), f(h))$, the target $f(m \circ_\tau h)$ may vary across surface realizations of the same composition. Stability under equivalence ensures all such realizations lie within a $\delta$-ball of one another: $\|f(x) - f(y)\| \leq \delta$ whenever $\llbracket x \rrbracket = \llbracket y \rrbracket$. Any operator that maps each input pair to a representative point of the corresponding $\delta$-ball therefore achieves worst-case reconstruction error at most $\delta$ over the observed support. Hence $\varepsilon_\tau^{*} \leq \delta$, establishing the bound. Conversely, if no low-distortion $\Phi_\tau$ exists ($\varepsilon_\tau^{*} \gg \delta$), then the composed representations are not determined by their constituents, and expressions with identical semantic structure but different surface forms scatter to unrelated regions, contradicting retrieval stability. ∎

#### Interpretation\.

The theorem makes precise why compositional retrieval is stronger than ordinary sentence similarity: success requires not only clustering equivalent expressions, but also realizing a predictable latent operator for the relevant composition family\.

### D.3 Identifiability from Equivalence Supervision

###### Theorem D.2 (Supervision Identifiability Boundary).

If training supervision consists only of equivalence constraints of the form $x \sim y \Rightarrow f(x) \approx f(y)$, then latent operators $\Phi_\tau$ are identifiable only up to equivalence-preserving transformations. In particular, relation-typed and intensional parameters not expressed in the equivalence labels are underdetermined.

###### Proof.

Let $[x]_\sim = \{y \in \mathcal{X} : \llbracket y \rrbracket = \llbracket x \rrbracket\}$ denote the denotational equivalence class of $x$. Equivalence supervision constrains $f$ only through labeled pairs: it enforces $f(x) \approx f(y)$ when $x \sim y$, and $f(x) \not\approx f(z)$ when $x \not\sim z$. This determines the quotient geometry of $f$ over $\sim$ (the relative placement of equivalence class centroids) but leaves the internal compositional structure of $f$ underdetermined.

Formally, let $T : \mathbb{R}^d \to \mathbb{R}^d$ be any invertible map satisfying: $T(v)$ lies within the $\delta$-ball of $[x]_\sim$ whenever $v = f(x)$. Such transformations form a non-trivial family; for example, any orthogonal rotation within each class neighborhood qualifies. The transformed encoder $\tilde{f} = T \circ f$ satisfies all equivalence constraints, since $\|\tilde{f}(x) - \tilde{f}(y)\| = \|Tf(x) - Tf(y)\| \leq \delta$ whenever $f(x)$ and $f(y)$ already lie within $\delta$ of one another. For any operator $\Phi_\tau$ compatible with $f$, the conjugated operator

$$\tilde{\Phi}_\tau = T \circ \Phi_\tau \circ (T^{-1} \times T^{-1})$$

satisfies $\tilde{\Phi}_\tau(\tilde{f}(m), \tilde{f}(h)) = T(\Phi_\tau(f(m), f(h)))$, matching $\tilde{f}(m \circ_\tau h)$ whenever $\Phi_\tau$ matched $f(m \circ_\tau h)$. Since $T$ can be chosen to permute the internal geometry arbitrarily within equivalence neighborhoods, the composition operator is identifiable only up to the family of such $T$. In particular, whether a match arises from intersection, relation binding, modal evaluation, or privative exclusion cannot be inferred from equivalence labels alone, unless those distinctions are explicitly represented in the training signal. ∎

#### Interpretation\.

This result formalizes the supervision–composition matching principle: synonym and definition pairs identify quotient geometry over denotational equivalence, but they do not identify all typed semantic operators\.
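The conjugation argument can be checked numerically in a simplified setting. The sketch below takes $T$ to be a single global orthogonal matrix (a special case that preserves every pairwise distance, rather than the per-neighborhood maps in the proof) and uses a toy nonlinear operator; both choices are assumptions made only to keep the example small.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))   # random orthogonal matrix, T(v) = Q v

def phi(a, b):
    """Toy composition operator (hypothetical); nonlinear so it does not commute with Q."""
    return np.tanh(a) + b

def phi_tilde(a, b):
    """Conjugated operator T . Phi . (T^{-1} x T^{-1}); Q^{-1} = Q.T for orthogonal Q."""
    return Q @ phi(Q.T @ a, Q.T @ b)

fm, fh = rng.standard_normal(d), rng.standard_normal(d)
fx, fy = rng.standard_normal(d), rng.standard_normal(d)

# The transformed encoder with the conjugated operator fits compositions equally well...
same_fit = np.allclose(phi_tilde(Q @ fm, Q @ fh), Q @ phi(fm, fh))
# ...and pairwise distances, hence all equivalence constraints, are unchanged...
same_dist = np.isclose(np.linalg.norm(Q @ fx - Q @ fy), np.linalg.norm(fx - fy))
# ...yet the two operators are different functions on the same inputs.
operators_differ = not np.allclose(phi_tilde(fm, fh), phi(fm, fh))
```

Equivalence labels cannot distinguish the pair $(f, \Phi_\tau)$ from $(\tilde{f}, \tilde{\Phi}_\tau)$ in this toy case, which is the sense of "identifiable only up to" in the theorem.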

### D.4 Head-Preservation Bias

###### Theorem D.3 (Geometric Head Bias).

Consider a retrieval geometry in which composed noun phrases are encouraged to remain close to their heads: $s(f(m \circ_\tau h), f(h))$ is high relative to unrelated heads. Then composition families whose denotations preserve head membership admit lower distortion under this geometry than composition families requiring head exclusion or non-local relation binding.

###### Proof sketch.

The argument proceeds by examining the semantic relationship between composition and head extension for each family type. For intersective modification, $\llbracket m \circ h \rrbracket = \llbracket m \rrbracket \cap \llbracket h \rrbracket \subseteq \llbracket h \rrbracket$: every instance of the composition is also an instance of the head. Geometric proximity $s(f(m \circ h), f(h)) \geq \alpha$ is therefore semantically coherent: the representation should lie in the same region as the head. For subsective modification, the composition is head-preserving in extension ($\llbracket m \circ h \rrbracket \subseteq \llbracket h \rrbracket$), so the same argument applies. For relational modification, $\llbracket m \circ h \rrbracket$ depends on an implicit relation argument $R$ not encoded in either $f(m)$ or $f(h)$ alone; head proximity provides no information about $R$, so the prediction from constituent embeddings incurs additional residual. For modal modification, the composition introduces an evidential variable indicating uncertainty about head membership; proximity to $f(h)$ conflates semantically certain and uncertain instances. For privative modification, $\llbracket m \circ_{\mathrm{priv}} h \rrbracket \cap \llbracket h \rrbracket = \emptyset$: proximity to $f(h)$ is maximally misleading. The distortion $\varepsilon_\tau$ therefore increases monotonically across the hierarchy intersective, subsective, relational, modal, privative, following the operator–geometry alignment. ∎

#### Interpretation\.

This theorem explains why the modifier hierarchy in the experiments is not merely empirical: the ranking follows from how well each operator family aligns with head\-centered similarity\.

### D.5 Privative Incompatibility under a Single Global Metric

###### Theorem D.4 (Privative Metric Conflict).

Assume the encoder is consistent with respect to head instances: for all genuine instances $y \in \llbracket h \rrbracket$, $s(f(y), f(h)) \geq \alpha - \eta$ for small $\eta > 0$ (instances cluster near their category head). Then no single global metric can simultaneously satisfy, for all privative constructions $m \circ_{\mathrm{priv}} h$, the following three constraints: (i) $s(f(m \circ_{\mathrm{priv}} h),\, f(h)) \geq \alpha$ (high proximity to lexical head); (ii) $\llbracket m \circ_{\mathrm{priv}} h \rrbracket \cap \llbracket h \rrbracket = \emptyset$ (semantic exclusion from head extension); (iii) $s(f(m \circ_{\mathrm{priv}} h),\, f(y)) < \beta$ for all $y \in \llbracket h \rrbracket$, with $\beta < \alpha - \eta$ (retrieval isolation from true head instances).

###### Proof.

Suppose (i) and the consistency assumption both hold. Let $y \in \llbracket h \rrbracket$ be any genuine head instance. By the consistency assumption, $f(y)$ lies within distance $\eta$ (in the metric induced by $s$) of $f(h)$. By the triangle inequality, proximity of $f(m \circ_{\mathrm{priv}} h)$ to $f(h)$ implies proximity to $f(y)$:

$$s\!\left(f(m \circ_{\mathrm{priv}} h),\, f(y)\right) \geq \alpha - \eta - \epsilon,$$

where $\epsilon > 0$ is determined by the metric geometry. For $\eta$ and $\epsilon$ small relative to the gap $\alpha - \beta > 0$, this contradicts constraint (iii), which requires $s(f(m \circ_{\mathrm{priv}} h), f(y)) < \beta < \alpha - \eta$. Hence (i) and (iii) cannot hold simultaneously. Relaxing (i) to allow $s(f(m \circ_{\mathrm{priv}} h), f(h)) < \alpha$ sacrifices lexical anchoring. Since the same global score $s$ governs both, no uniform choice satisfies all three constraints without additional type-discriminative structure. ∎

#### Interpretation\.

Privative failures are therefore not just cases of insufficient training\. They expose a representational conflict in ordinary embedding spaces: the model must be close to the head in one sense and far from it in another\.
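The conflict can be made concrete for cosine similarity, where angles on the unit sphere obey a triangle inequality. The sketch below assumes $s$ is cosine similarity and samples vectors at fixed angles from a head vector; the function names and the sampling scheme are illustrative assumptions, not part of the paper.

```python
import numpy as np

def unit(v):
    return v / np.linalg.norm(v)

def at_angle(h, theta, rng):
    """Unit vector at angle theta (radians) from unit vector h, in a random direction."""
    r = rng.standard_normal(h.shape)
    r = unit(r - (r @ h) * h)                  # random unit vector orthogonal to h
    return np.cos(theta) * h + np.sin(theta) * r

def min_cross_similarity(alpha, head_sim, d=32, trials=1000, seed=0):
    """Smallest observed cos(p, y) when cos(p, h) = alpha and cos(y, h) = head_sim,
    together with the triangle-inequality floor cos(arccos(alpha) + arccos(head_sim))."""
    rng = np.random.default_rng(seed)
    h = unit(rng.standard_normal(d))
    a, b = np.arccos(alpha), np.arccos(head_sim)
    sims = [at_angle(h, a, rng) @ at_angle(h, b, rng) for _ in range(trials)]
    return float(min(sims)), float(np.cos(a + b))
```

With `alpha = 0.9` and `head_sim = 0.85` (that is, $\alpha - \eta$ with $\eta = 0.05$), the floor is about 0.535: a privative phrase anchored to its head cannot score below that against any genuine instance, so constraint (iii) is unsatisfiable for any $\beta$ under the floor.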

### D.6 Equivalence Training Reorganizes Geometry

###### Theorem D.5 (Capacity-Reorganization Principle).

Let contrastive equivalence training update an encoder primarily through pairwise attraction of positives and repulsion of negatives in a fixed $d$-dimensional embedding space. Then, absent architectural expansion or new representational channels, improvements in retrieval arise from reorganization of the existing geometry rather than from increased representational capacity.

###### Proof.

The encoder output dimension remains $d$ throughout fine-tuning, so the ambient embedding space $\mathbb{R}^d$ and its capacity are fixed. Contrastive updates are gradient steps on pairwise cosine similarities: positive pairs receive attraction gradients and negative pairs receive repulsion gradients, modifying the weight matrices of the transformer but not adding independent output coordinates. The effect on the embedding geometry is a change in the covariance structure of $\{f(x)\}_{x \in \mathcal{X}}$: specifically, the singular value distribution of the embedding matrix shifts. Anisotropy (the fraction of variance captured by the leading singular direction) decreases as attraction and repulsion spread the embedding cloud more uniformly across directions. Crucially, the effective rank (number of dimensions carrying non-negligible variance, measured for example by the participation ratio $(\sum_i \lambda_i)^2 / \sum_i \lambda_i^2$ over singular values $\lambda_i$) may remain approximately stable even as the distribution across directions changes. Contrastive training therefore reorganizes the geometry (changing which directions carry semantic signal) without expanding it (the number of independent representational axes is bounded by $d$ before and after). ∎

#### Interpretation\.

The theorem supports the claim that concept\-equivalence fine\-tuning recalibrates a pre\-existing semantic space rather than expanding it\.
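The two geometry statistics named in the proof can be computed directly from an embedding matrix. The sketch below follows the definitions given there (leading-direction variance fraction, and participation ratio over singular values); centring the embeddings first and the function name are assumptions, and the statistics reported in the main text may be computed differently.

```python
import numpy as np

def geometry_stats(E):
    """E: (N, d) embedding matrix. Returns (anisotropy, effective_rank)."""
    X = E - E.mean(axis=0, keepdims=True)            # centre (assumed preprocessing)
    s = np.linalg.svd(X, compute_uv=False)           # singular values, descending
    var = s ** 2                                     # variance along each singular direction
    anisotropy = var[0] / var.sum()                  # share of the leading direction
    effective_rank = s.sum() ** 2 / (s ** 2).sum()   # participation ratio
    return float(anisotropy), float(effective_rank)
```

A perfectly isotropic cloud in $d$ dimensions gives anisotropy $1/d$ and effective rank $d$, while a cloud lying on a single line gives 1 and 1. The two statistics need not move together, which is why a drop in anisotropy with stable effective rank is read as reorganization rather than expansion.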

### D.7 Distortion and Operator Complexity

###### Theorem D.6 (Operator Distortion Lower Bound).

Let $\kappa(\tau)$ denote the number of independent semantic degrees of freedom required to realize composition type $\tau$, including latent relation variables, modal parameters, and exclusion constraints. If these degrees of freedom are not encoded in the supervision or representation, then any realization of $\tau$ in a $d$-dimensional embedding space incurs nonzero distortion. Moreover, distortion must increase as the unsupervised degrees of freedom of $\tau$ increase relative to the information preserved by $f$.

###### Proof sketch.

An encoder can faithfully realize $C_\tau$ only to the extent that the $\kappa(\tau)$ semantic variables determining that operator are preserved in the representation. If two expressions differ in a latent variable required by $C_\tau$ (a relation argument, an intensional parameter, or a privative exclusion flag) but the encoder maps them to near-identical vectors, then no downstream $\Phi_\tau$ can recover the distinction. Each such collapsed variable represents at least one dimension of the target space that is rendered inaccessible. By a counting argument over the required variable assignments: if $k$ independent binary distinctions collapse (each mapped to indistinguishable vectors), then at least $k$ semantic contrasts cannot be expressed at the output, and the resulting retrieval errors are bounded away from zero for each. Since the number of collapsed variables is bounded below by $\kappa(\tau)$ minus the number of variables identified by the training signal, distortion increases monotonically as $\kappa(\tau)$ grows relative to supervised information. Making this argument tight requires an information-theoretic lower bound on the minimum encoding error for a $d$-dimensional representation with partially-observed supervision, which we defer to future work. ∎

#### Interpretation\.

This result gives a formal reason why relational, modal, and privative families remain difficult even when extensional concept equivalence improves\.

### D.8 Readout Invariance under Final-Layer Concentration

###### Theorem D.7 (Readout Invariance).

Let $h_l(x) \in \mathbb{R}^{d}$ denote the output of transformer layer $l$ for expression $x$, and let $f_r(x) = \sum_{l=1}^{L} w_l\, h_l(x)$ be the pooled representation under readout $r = \{w_l\}_{l=1}^{L}$ with $w_l \ge 0$ and $\sum_l w_l = 1$. Suppose semantic content concentrates in the final layer: $h_l(x) = h_L(x) + \xi_l(x)$, where $\|\xi_l(x)\| \le \gamma$ for all $l < L$ and all $x$ (so that $\xi_L(x) = 0$). Then for any two readout strategies $r$ and $r'$,

$$\|f_r(x) - f_{r'}(x)\| \le 2\gamma.$$

In particular, when $\gamma$ is small, the choice of readout has negligible effect on retrieval.

###### Proof\.

$$
\begin{aligned}
\|f_r(x) - f_{r'}(x)\|
&= \left\| \sum_l (w_l - w'_l)\, h_l(x) \right\| \\
&= \left\| \sum_l (w_l - w'_l)\bigl(h_L(x) + \xi_l(x)\bigr) \right\| \\
&= \left\| h_L(x) \underbrace{\sum_l (w_l - w'_l)}_{=\,0} + \sum_l (w_l - w'_l)\, \xi_l(x) \right\| \\
&\le \sum_l |w_l - w'_l| \cdot \|\xi_l(x)\| \;\le\; 2\gamma,
\end{aligned}
$$

where $\sum_l (w_l - w'_l) = 0$ because both weight vectors sum to 1, and $\sum_l |w_l - w'_l| \le 2$ because the $L_1$ distance between two probability vectors is at most 2. ∎
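The bound can be checked numerically on synthetic data. The sketch below is illustrative only: it does not use the paper's encoder, and the names `pooled` and `make_layers`, the layer count, the dimension, and the value of $\gamma$ are all assumptions chosen for the example. It builds layer outputs of the form $h_l = h_L + \xi_l$ with $\|\xi_l\| \le \gamma$ and compares three convex readouts.

```python
import numpy as np

def pooled(layers: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Linear readout f_r(x) = sum_l w_l h_l(x); `layers` has shape (L, d)."""
    return weights @ layers

def make_layers(rng, num_layers=12, dim=768, gamma=0.05):
    """Synthetic layer outputs h_l = h_L + xi_l with ||xi_l|| <= gamma for l < L."""
    h_final = rng.normal(size=dim)
    xi = rng.normal(size=(num_layers, dim))
    # Rescale each residual to a random norm in [0, gamma).
    xi *= gamma * rng.uniform(size=(num_layers, 1)) / np.linalg.norm(xi, axis=1, keepdims=True)
    xi[-1] = 0.0  # the final layer carries no residual by definition
    return h_final + xi

rng = np.random.default_rng(0)
gamma = 0.05
layers = make_layers(rng, gamma=gamma)
L = layers.shape[0]

final_only = np.eye(L)[-1]                             # read out layer L only
last_four = np.r_[np.zeros(L - 4), np.full(4, 0.25)]   # uniform mean of the last four layers
random_w = rng.dirichlet(np.ones(L))                   # arbitrary convex weights

readouts = [final_only, last_four, random_w]
max_gap = max(
    np.linalg.norm(pooled(layers, a) - pooled(layers, b))
    for a in readouts for b in readouts
)
print(f"max pairwise gap = {max_gap:.4f}, bound 2*gamma = {2 * gamma:.4f}")
```

In this construction the observed gap is typically well below $2\gamma$, since the bound is attained only when the two weight vectors have disjoint support and the residuals are aligned adversarially.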

#### Interpretation\.

This theorem formalizes the P2 finding that pooling strategy is a null factor in fine-tuned encoders. Equivalence training concentrates semantic content in the final transformer layer, driving $\gamma$ toward zero: intermediate layers carry only low-variance residuals, so the choice among mean pooling, CLS, weighted mean, or gated readout produces embeddings within $2\gamma$ of one another. Pre-fine-tuning, layer representations are more heterogeneous ($\gamma$ larger) and readout strategies diverge correspondingly. The theorem also predicts that readout sensitivity should correlate with anisotropy: both reflect the distribution of semantic content across encoder layers and directions.

### D.9 Summary

The theoretical picture is that sentence encoders approximate a homomorphic mapping from a typed semantic algebra into a fixed geometric space. The seven theorems above provide formal grounding for the four empirical principles of the main paper.

P1 (recalibration, not expansion) is supported by the Capacity-Reorganization Principle (Theorem 5). Contrastive fine-tuning changes the covariance structure of the space but not its dimension, so improvements arise from redistribution rather than augmentation.

P2 (final-layer concentration) is supported by the Readout Invariance theorem. Once semantic content concentrates in the final layer, all linear pooling strategies converge to within $2\gamma$ of one another.

P3 (calibration and ranking separable via hard negatives) is supported by the Approximate Homomorphism and Head-Preservation theorems (Theorems 1, 3). Stable retrieval requires a low-distortion composition operator, and head-preserving families achieve this under the existing geometry while operator-complex families require explicit negative supervision.

P4 (supervision must match composition type) is supported by the Supervision Identifiability theorem (Theorem 2). Equivalence labels identify geometry only up to equivalence-preserving transformations; composition-type-specific operators are underdetermined without targeted supervision. The Privative Metric Conflict (Theorem 4) and Operator Distortion Lower Bound (Theorem 6) additionally explain why privative, modal, and relational families remain hard even after equivalence fine-tuning. They require representational structure that a single global metric over synonym-definition pairs cannot provide.

## Appendix E Training Hyperparameters

Table [9](https://arxiv.org/html/2606.06994#A5.T9) lists the full set of hyperparameters used across all conditions.

Table 9: Hyperparameters shared across all fine-tuned conditions.

## Appendix F Loss Variant: Unified InfoNCE

A simplified variant (B1-unified) folds hard negatives directly into the InfoNCE denominator, eliminating the BCE term. Results are shown in Table [10](https://arxiv.org/html/2606.06994#A6.T10). B1-unified improves retrieval ranking (t2d R@10 +0.012, t2d MRR +0.014) but substantially reduces calibration (pair ROC-AUC −0.058) and hard-negative stress-test robustness (negate ROC-AUC −0.182). The BCE term provides a fixed-magnitude gradient per hard negative regardless of batch size; the unified formulation dilutes the hard-negative signal in proportion to batch size, reducing its effectiveness for calibration. The primary BCE variant is preferred when downstream use requires calibrated pair scoring or robustness to semantic perturbations.
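The dilution argument can be illustrated by comparing the gradient that a single hard negative receives under each formulation. The sketch below is not the paper's loss implementation: the functional forms, the temperature `tau`, the BCE logit `scale`, and the similarity values are assumptions made for illustration. Under the unified loss the gradient with respect to the hard-negative score is its softmax weight divided by the temperature, which shrinks as more in-batch negatives enter the denominator; under a separate BCE term with target 0 it depends only on the hard-negative score itself.

```python
import math

def unified_hard_neg_grad(s_pos, s_hard, in_batch, tau=0.05):
    """d(loss)/d(s_hard) when the hard negative sits in the InfoNCE denominator.

    loss = -s_pos/tau + log(sum of exp(score/tau) over positive, hard negative,
    and in-batch negatives), so the gradient is softmax(hard) / tau.
    """
    logits = [s_pos / tau, s_hard / tau] + [s / tau for s in in_batch]
    m = max(logits)  # subtract the max for numerical stability
    z = sum(math.exp(v - m) for v in logits)
    return math.exp(s_hard / tau - m) / (z * tau)

def bce_hard_neg_grad(s_hard, scale=20.0):
    """d(loss)/d(s_hard) for BCE with target 0 on the logit scale * s_hard.

    loss = -log(1 - sigmoid(scale * s_hard)), so the gradient is
    scale * sigmoid(scale * s_hard), independent of batch size.
    """
    return scale / (1.0 + math.exp(-scale * s_hard))

# Hypothetical similarities: positive 0.8, hard negative 0.7, in-batch negatives 0.5.
for batch_size in (8, 64, 512):
    in_batch = [0.5] * (batch_size - 1)
    g_unified = unified_hard_neg_grad(0.8, 0.7, in_batch)
    g_bce = bce_hard_neg_grad(0.7)
    print(f"batch={batch_size:4d}  unified={g_unified:.4f}  bce={g_bce:.4f}")
```

How quickly the unified gradient shrinks depends on how similar the in-batch negatives are to the query; the example only shows that it decreases monotonically with batch size while the BCE gradient stays fixed.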

Table 10: B1 vs. B1-unified across the same evaluation columns as Table [2](https://arxiv.org/html/2606.06994#S5.T2). Unified InfoNCE improves retrieval ranking but substantially reduces calibration and stress-test robustness. Bold: best per column.

## Appendix G Hard-Negative Rules

Five rule\-based transforms are applied to positive definitions to generate hard negatives\. The same five rules are used during both training \(BCE hard\-negative term\) and evaluation \(stress\-test ROC\-AUC\), so the stress tests directly probe the discriminations the model was trained on\. Each rule is applied once per positive definition; all generated negatives are used\.

#### Antonym flip\.

A key content word in the definition is identified and replaced by its antonym using a WordNet antonymy lookup. This tests whether the model can distinguish conceptually opposite definitions that share the same syntactic frame (*“a sustained increase in prices”* → *“a sustained decrease in prices”*).

#### Lexical negation\.

The first finite verb or copula in the definition is negated by inserting *not* (*“is a warm-blooded vertebrate”* → *“is not a warm-blooded vertebrate”*). This is the most challenging type: the negative differs from the positive by a single token, and the model must assign lower similarity to what is linguistically a denial of the positive concept.
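A toy version of this transform is sketched below. The paper does not say how the finite verb is detected, so the hard-coded `COPULAS_AND_MODALS` list and the function name `negate_first_copula` are hypothetical stand-ins; a full implementation would more plausibly rely on a POS tagger or parser.

```python
from typing import Optional

# Assumption: a small closed list stands in for real finite-verb detection.
COPULAS_AND_MODALS = ("is", "are", "was", "were", "can", "will", "must")

def negate_first_copula(definition: str) -> Optional[str]:
    """Insert 'not' after the first copula or modal; return None if none is found."""
    tokens = definition.split()
    for i, tok in enumerate(tokens):
        if tok.lower() in COPULAS_AND_MODALS:
            return " ".join(tokens[: i + 1] + ["not"] + tokens[i + 1 :])
    return None  # rule does not apply to this definition

print(negate_first_copula("is a warm-blooded vertebrate"))
print(negate_first_copula("a sustained increase in prices"))
```

Returning `None` when no candidate verb is found makes it explicit that the rule does not apply to every definition, for example a bare noun phrase.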

#### Same\-POS random swap\.

The positive definition is replaced with a randomly sampled definition from a different word sharing the same coarse POS tag \(noun, verb, adjective, adverb\)\. This tests broad categorical discrimination: the model must not rely solely on POS\-level similarity cues to rank candidates\.

#### Same\-POS prefix swap\.

As above, but the replacement definition is drawn from a word sharing the same morphological prefix (*un-*, *re-*, *over-*, etc.). The prefix overlap makes this harder than random same-POS swaps, as it introduces surface cues that can mislead lexical-similarity-based models.

#### Ontological type swap\.

The definition is replaced with one from a concept of a different ontological type, where type is drawn from a small closed set (person, organization, location, event, artifact, biological entity, abstract concept). This tests whether the model has encoded the coarse ontological category of a concept rather than merely its surface distribution (*“the capital city of France”* → *“a multinational technology corporation”*).

#### Evaluation procedure\.

For each test pair (query $q$, positive definition $d^{+}$), a negative $d^{-}$ is generated by applying one rule to $d^{+}$. The model scores $\cos(f(q), f(d^{+}))$ and $\cos(f(q), f(d^{-}))$; ROC-AUC is computed over the resulting binary discrimination across all test pairs for that rule. A score of 1.0 indicates perfect discrimination; 0.5 is chance.
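One way to read this procedure is as a pooled two-class problem: all positive cosines are labeled 1, all negative cosines are labeled 0, and ROC-AUC is the probability that a randomly chosen positive outscores a randomly chosen negative. The sketch below follows that reading using only the standard library; the function names and the tie-handling convention (ties count one half) are assumptions, and the quadratic pairwise loop is meant for clarity rather than speed.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def roc_auc(pos_scores, neg_scores):
    """P(positive score > negative score), counting ties as one half."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

def stress_test_auc(triples):
    """`triples` is a list of (f(q), f(d+), f(d-)) embedding vectors for one rule."""
    pos = [cosine(q, d_pos) for q, d_pos, _ in triples]
    neg = [cosine(q, d_neg) for q, _, d_neg in triples]
    return roc_auc(pos, neg)

# Hypothetical 2-d embeddings for two test pairs.
toy = [
    ([1.0, 0.0], [0.9, 0.1], [0.2, 0.8]),
    ([0.0, 1.0], [0.1, 0.9], [0.7, 0.3]),
]
print(stress_test_auc(toy))
```

Because scores are pooled across test pairs, a model can rank $d^{+}$ above $d^{-}$ for every individual query and still score below 1.0 if its similarity scale varies across queries, which is why this metric reflects calibration rather than ranking alone.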

## Appendix H Geometry Diagnostics

We measure two complementary geometry properties on the in-domain split. Anisotropy is the mean cosine similarity over 10,000 random pairs of encoded texts: lower values indicate a more isotropic, uniformly spread space. Effective rank is $\exp(H(\sigma))$, the exponential Shannon entropy of the normalised singular-value spectrum: higher values indicate that more dimensions carry meaningful variance. Together these characterise whether the model is using the available dimensions efficiently and without degenerate clustering.
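Both diagnostics are short to compute from a matrix of embeddings. The sketch below is a minimal NumPy version under stated assumptions: the function names are illustrative, pairs are sampled with replacement and self-pairs discarded, and the embedding matrix is not mean-centered before the SVD (the text does not specify either choice).

```python
import numpy as np

def anisotropy(emb: np.ndarray, num_pairs: int = 10_000, seed: int = 0) -> float:
    """Mean cosine similarity over randomly sampled pairs of rows of `emb`."""
    rng = np.random.default_rng(seed)
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    i = rng.integers(0, len(unit), size=num_pairs)
    j = rng.integers(0, len(unit), size=num_pairs)
    keep = i != j  # drop self-pairs, whose cosine is trivially 1
    return float(np.mean(np.sum(unit[i[keep]] * unit[j[keep]], axis=1)))

def effective_rank(emb: np.ndarray) -> float:
    """exp(H(sigma)): exponentiated Shannon entropy of the normalised singular values."""
    sigma = np.linalg.svd(emb, compute_uv=False)
    p = sigma / sigma.sum()
    p = p[p > 0]  # treat 0 * log(0) as 0
    return float(np.exp(-np.sum(p * np.log(p))))

# Synthetic check: an isotropic cloud versus the same cloud with a shared offset.
rng = np.random.default_rng(1)
iso = rng.normal(size=(2000, 64))
shifted = iso + 5.0
print("isotropic:", anisotropy(iso), effective_rank(iso))
print("shifted:  ", anisotropy(shifted), effective_rank(shifted))
```

The shared offset gives every vector a common direction, which raises the mean pairwise cosine sharply and concentrates the spectrum in one singular value, the kind of degenerate clustering that these two diagnostics are meant to expose.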

Table 11: Geometry diagnostics (in-domain split). Anisotropy: mean cosine over random pairs (lower = better). Effective rank: $\exp(H(\sigma))$, entropy of the normalised eigenspectrum (higher = better). Hard negatives are the dominant geometric factor (B1 vs. B3).

![Refer to caption](https://arxiv.org/html/2606.06994v1/x1.png)

Figure 4: Anisotropy and effective rank across encoder conditions. Hard-negative supervision (B1, M1) is the dominant factor driving anisotropy collapse; pooling strategy alone has a secondary effect.

![Refer to caption](https://arxiv.org/html/2606.06994v1/x2.png)

Figure 5: Scatter of anisotropy vs. effective rank across encoder conditions. Fine-tuning with hard negatives (B1, M1) achieves low anisotropy and high effective rank simultaneously; the frozen weighted mean (B0-WM) is the geometric outlier.

#### Why B0 already has good geometry.

The frozen mean-pool baseline (B0) achieves surprisingly well-calibrated geometry (anisotropy 0.126, effective rank 247.0) without any task-specific training. This is a direct consequence of the backbone’s history: all-mpnet-base-v2 was itself contrastively trained on 1 billion sentence pairs for sentence-level similarity (Reimers and Gurevych, [2019](https://arxiv.org/html/2606.06994#bib.bib2)), which spreads representations isotropically and concentrates signal in the final layer. The per-layer anisotropy plot (Figure [3](https://arxiv.org/html/2606.06994#S5.F3)) makes this concrete: layer-12 anisotropy in the frozen encoder (0.146) is far below layers 5–11 (0.45–0.50). We inherit this structure without modification; the B0 baseline is strong not because of anything we did, but because the backbone was already well-calibrated.

#### Why B0\-WM is substantially worse\.

B0-WM (frozen uniform weighted mean over the last four layers) has anisotropy 0.303 and effective rank only 177.0, substantially worse than B0 despite using the same frozen backbone. The reason is straightforward: averaging layers 9–11 into the layer-12 output introduces high-anisotropy (0.45–0.50) intermediate representations into the mixture, diluting the well-calibrated final layer. This is the same mechanism as layer-12 dominance, operating at the pooling level: even without gradient flow, choosing the wrong readout is geometrically costly. The retrieval penalty mirrors the geometry penalty: B0-WM achieves only 0.392 t2d R@10 vs. B0’s 0.552 (Table [2](https://arxiv.org/html/2606.06994#S5.T2)).

#### Fine\-tuning reorganises, not expands\.

The most striking geometry finding is that fine-tuning on 3.3 million concept pairs leaves effective rank essentially unchanged (B0: 247.0 → B1: 247.2) while collapsing anisotropy (0.126 → 0.012). Contrastive training does not recruit new dimensions; it redistributes variance across existing ones, rotating and rescaling directions so that equivalent concepts land in the same neighbourhood and hard negatives are pushed apart. The representational capacity (effective rank ≈ 247) was already present in the frozen encoder, contributed by the prior sentence-transformer training. Concept-equivalence fine-tuning directs it, but cannot create it.

#### Hard negatives are the dominant geometric factor\.

The B3 vs\. B1 comparison isolates the contribution of hard\-negative supervision \(same backbone, same pooling, same budget; only the BCE hard\-negative term differs\): B3 anisotropy is 0\.166, effective rank 193\.9; B1 achieves 0\.012 and 247\.2\. Hard negatives require the model to push semantically distinct but geometrically close representations apart, which directly reduces anisotropy and expands the effective rank\. By contrast, varying the pooling architecture while holding supervision fixed has a much smaller effect: M1 \(weighted mean \+ HN\) achieves anisotropy 0\.015 and rank 239\.4, only marginally below B1 \(mean pool \+ HN\) despite a different readout\. This confirms that the geometry improvement is driven by the training signal, not the readout architecture—consistent with the H2 retrieval null result\.

## Appendix I NP Paraphrase Error Examples

Table [12](https://arxiv.org/html/2606.06994#A9.T12) lists the three hardest B1 failures per modifier family, sorted by rank of the correct candidate (descending). Rank is computed over the full 4,000-item pool. “Correct sim” and “Retrieved sim” are cosine similarities to the correct paraphrase and to the rank-1 retrieved candidate respectively. The final column classifies the failure mode as discussed in Section [I](https://arxiv.org/html/2606.06994#A9).

| Family | Query | Correct paraphrase | Cor. sim | Rank | Retrieved rank-1 | Ret. sim | Failure mode |
|---|---|---|---|---|---|---|---|
| intersective | clean plate | free of dirt plate | 0.487 | 83 | clean dish | 0.832 | adj. synonym |
| intersective | dirty temple | not clean temple | 0.577 | 59 | filthy temple | 0.937 | adj. synonym |
| intersective | dirty cup | not clean cup | 0.583 | 58 | filthy cup | 0.838 | adj. synonym |
| subsective | effective student | high impact student | 0.689 | 48 | effective teacher | 0.892 | head-noun confusion |
| subsective | effective trainer | high impact trainer | 0.726 | 42 | highly effective trainer | 0.933 | degree-modifier variant |
| subsective | effective manager | high impact manager | 0.739 | 30 | highly effective manager | 0.918 | degree-modifier variant |
| relational | community cup | container owned by local communities | 0.253 | 864 | city cup | 0.899 | compound cross-retrieval |
| relational | city cup | container associated with municipal service | 0.264 | 637 | community cup | 0.899 | compound cross-retrieval |
| relational | summer account | profile related to the warm season | 0.335 | 580 | student account | 0.771 | compound cross-retrieval |
| modal | alleged culprit | not proven culprit | 0.633 | 94 | purported culprit† | 0.929 | annotation limit |
| privative | toy ticket | pass lookalike | 0.347 | 497 | play ticket‡ | 0.807 | annotation limit |
| privative | fraudulent account | profile lookalike | 0.349 | 470 | forged account | 0.912 | adj. synonym |
| privative | spurious ticket | pass lookalike | 0.372 | 437 | bogus ticket | 0.927 | adj. synonym |

Table 12: Hardest B1 failures per modifier family (up to 3; 4,000-item pool). Rank = position of the correct paraphrase in the retrieved list; a larger rank value indicates a harder failure. Failure modes: adj. synonym = same-adjective surface variant ranked above definitional paraphrase; head-noun confusion = correct modifier, wrong noun; degree-modifier variant = degree form ranked above definition; compound cross-retrieval = adjacent compound preferred over relational-clause paraphrase; annotation limit = retrieved item is a valid paraphrase under multi-reference annotation. † *purported culprit* is a valid paraphrase of *alleged culprit*; counted wrong due to single-reference annotation. ‡ *play ticket* is a valid paraphrase of *toy ticket*; counted wrong because the annotated reference is *pass lookalike*.
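The quantities reported per row (rank of the correct paraphrase, its similarity, and the rank-1 candidate with its similarity) can be computed as in the sketch below. It assumes cosine scoring over a pool of candidate embeddings and defines rank as one plus the number of candidates scoring strictly higher than the correct one; the function names, the tie convention, and the toy vectors are assumptions for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def rank_of_correct(query, pool, correct_idx):
    """Return (rank, correct_sim, top_idx, top_sim) for one query over a candidate pool."""
    sims = [cosine(query, cand) for cand in pool]
    # Rank 1 means the correct paraphrase is the top-scoring candidate.
    rank = 1 + sum(s > sims[correct_idx] for s in sims)
    top_idx = max(range(len(pool)), key=sims.__getitem__)
    return rank, sims[correct_idx], top_idx, sims[top_idx]

# Hypothetical 2-d pool: candidate 0 is a near-duplicate of the query,
# candidate 2 plays the role of the annotated correct paraphrase.
query = [1.0, 0.0]
pool = [[0.9, 0.1], [0.0, 1.0], [0.5, 0.5]]
rank, correct_sim, top_idx, top_sim = rank_of_correct(query, pool, correct_idx=2)
print(rank, round(correct_sim, 3), top_idx, round(top_sim, 3))
```

In the toy pool the near-duplicate outranks the annotated paraphrase, which mirrors the adj. synonym pattern in the table, where a surface variant of the query scores higher than the definitional paraphrase.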
## Appendix J Backbone Grid on DBpedia

Table [13](https://arxiv.org/html/2606.06994#A10.T13) reports R@10 and MRR for all five backbones on DBpedia (245 test queries), both frozen and after 300 in-domain fine-tuning steps.

Table 13: Backbone grid on DBpedia (245 test queries, 2,849 candidates). All fine-tuned models use 300 DBpedia-specific steps. Bold: best per column. Contrastive fine-tuning improves every backbone tested.

## Appendix K Use of AI Writing Assistance

During the preparation of this work the author\(s\) used Claude \(Anthropic\) in order to assist with experiment implementation \(coding\), writing, editing, and LaTeX preparation of the manuscript\. After using this tool, the author\(s\) reviewed and edited the content as needed and take\(s\) full responsibility for the content of the publication\.