When Does Complexity Conditioning Help a Frozen Sentence Embedding? A Controlled Study of Per-Sentence and Pair-Level Difficulty Adaptation

arXiv cs.CL Papers

Summary

This paper presents a controlled, multi-seed study testing whether adapting frozen sentence embeddings to input difficulty improves performance. It finds that per-sentence complexity conditioning fails, while a pair-level residual gated by a cross-encoder difficulty signal yields consistent gains on semantic similarity tasks.

arXiv:2606.03244v1 Announce Type: new Abstract: A common intuition is that sentence embeddings should adapt to the difficulty of the input. We test this intuition in a controlled, multi-seed setting: a lightweight post-encoder adapter attaches to a frozen Qwen3-Embedding-0.6B encoder, accessing only its final pooled embedding, and is evaluated on four paraphrase and semantic-similarity tasks (PAWS, MRPC, QQP, STS-B). The naive form of the idea fails: surface-based per-sentence complexity is nearly uncorrelated with frozen-baseline error (Pearson approximately 0.05) and provides no advantage over constant or shuffled controls, while degrading a saturated baseline. Even when the target is aligned to a non-circular pair-difficulty signal, the per-sentence gate still cannot reliably capture difficulty because difficulty is primarily a property of the pair, not the individual sentence. In contrast, a small pair-level residual gated by a held-out cross-encoder difficulty signal yields consistent gains on the larger and graded tasks, including +0.022 Spearman on STS-B and +0.037 on QQP, while remaining anchored to the frozen baseline across all seeds. Because this useful form operates on sentence pairs rather than individual sentences, the resulting model is best understood as a lightweight re-ranker over cached frozen embeddings, not a replacement single-vector embedding; we make no state-of-the-art claim. Our contribution is a controlled account of when difficulty-aware adaptation helps and when it fails, together with a pre-training diagnostic that predicts the available headroom.
Original Article
View Cached Full Text

Cached at: 06/03/26, 09:37 AM

# When Does Complexity Conditioning Help a Frozen Sentence Embedding? A Controlled Study of Per-Sentence and Pair-Level Difficulty Adaptation
Source: [https://arxiv.org/html/2606.03244](https://arxiv.org/html/2606.03244)
Suhwan Hwang

###### Abstract

A common intuition is that sentence embeddings should adapt to the difficulty of the input\. We test this intuition in a controlled, multi\-seed setting: a lightweight*post\-encoder*adapter attaches to a frozenQwen3\-Embedding\-0\.6Bencoder, accessing only its final pooled embedding, and is evaluated on four paraphrase and semantic\-similarity tasks \(PAWS, MRPC, QQP, STS\-B\)\. The naïve form of the idea fails: surface\-based per\-sentence complexity is nearly uncorrelated with frozen\-baseline error \(Pearson≈0\.05\\approx 0\.05\) and provides no advantage over constant or shuffled controls, while degrading a saturated baseline\. Even when the target is aligned to a non\-circular pair\-difficulty signal, the per\-sentence gate still cannot reliably capture difficulty because difficulty is primarily a property of the pair, not the individual sentence\. In contrast, a small pair\-level residual gated by a held\-out cross\-encoder difficulty signal yields consistent gains on the larger and graded tasks, including\+0\.022\+0\.022Spearman on STS\-B and\+0\.037\+0\.037on QQP, while remaining anchored to the frozen baseline across all seeds\. Because this useful form operates on sentence pairs rather than individual sentences, the resulting model is best understood as a lightweight re\-ranker over cached frozen embeddings, not a replacement single\-vector embedding; we make no state\-of\-the\-art claim\. Our contribution is a controlled account of when difficulty\-aware adaptation helps and when it fails, together with a pre\-training diagnostic that predicts the available headroom\.

## 1Introduction

Single\-vector sentence embeddings underpin semantic similarity, retrieval, clustering, and classification\[[12](https://arxiv.org/html/2606.03244#bib.bib12),[5](https://arxiv.org/html/2606.03244#bib.bib5),[8](https://arxiv.org/html/2606.03244#bib.bib8)\]\. A recurring intuition is that not all inputs should be treated identically: longer, syntactically involved, or semantically ambiguous sentences are “harder,” and an embedding model might benefit from adapting its geometry to this difficulty\. This paper asks a narrow but practically important question:*does conditioning a frozen sentence embedding on a measure of input difficulty actually improve it, and if so, under what conditions?*

We study this with a deliberately conservative design\. Rather than fine\-tuning the encoder, we attach a lightweight*post\-encoder*adapter that consumes only the baseline’s final pooled embeddingxx\. This frozen, embedding\-only interface has two advantages\. First, it is maximally portable: it admits any sentence encoder—including ones behind an API—and the embeddings can be cached once, so the cost of the study is dominated by a single forward pass\. Second, it isolates the contribution of difficulty conditioning from the much larger effect of end\-to\-end representation learning, which is what a controlled study requires\.

Within this interface we examine a progression of increasingly faithful instantiations of the difficulty\-conditioning idea, each evaluated against tight controls and across multiple seeds:

1. 1\.a*per\-sentence*complexity scalar, derived from surface syntax, that conditions an elementwise scaling of the embedding;
2. 2\.the same per\-sentence gate, but with the complexity target replaced by a non\-circular, error\-aligned hard\-pair difficulty signal;
3. 3\.a*pair\-level*conditioning that preserves the frozen\-baseline anchor and adds a small, difficulty\-gated residual to the pair similarity;
4. 4\.the pair\-level model with a*held\-out cross\-encoder margin*as the difficulty target, confirmed across four tasks\.

The central methodological device throughout is*signal isolation*: for every claimed gain we ask whether a control with the difficulty signal removed \(constant scale, shuffled complexity, or an ungated residual of matched capacity\) achieves the same effect\. This guards against attributing to “complexity” what is in fact extra parameters or a generic residual\.

Our findings are summarized as follows\. The naïve per\-sentence form does not work: the surface complexity signal is essentially uncorrelated \(Pearson≈0\.05\\approx 0\.05\) with where the baseline actually errs, and conditioning on it matches a constant\-scale control while hurting a near\-saturated baseline\. Aligning the target to a non\-circular pair\-difficulty measure restores the correlation \(up to0\.980\.98\) but still yields no advantage over the constant control, because hard\-pair difficulty is a property of the*pair*and is only weakly recoverable from a single sentence’s frozen embedding\. The mechanism becomes effective only when three conditions hold simultaneously: pair\-level conditioning, a small baseline\-preserving residual, and a difficulty target supplied by an independent cross\-encoder rather than a surface proxy\. Under these conditions, a cross\-encoder\-gated pair residual yields consistent, sign\-stable gains on the graded and large tasks \(e\.g\.\+0\.022\+0\.022Spearman on STS\-B and\+0\.037\+0\.037on QQP, with every seed positive\), never falls below the frozen\-baseline Spearman in the multi\-task confirmation, and does not degrade retrieval\. Because this useful form necessarily operates on sentence*pairs*, the final model is best understood as a minimal pair\-level adapter—a lightweight re\-ranker over cached frozen bi\-encoder embeddings—rather than a drop\-in replacement for the single\-vector embedding\. The study thus begins with an embedding\-only question and arrives at a pair\-scoring answer: the negative result rules out pure per\-sentence adaptation, and the only consistently useful positive result lives at the pair\-scoring stage, which changes the deployment interface from single\-vector embedding to lightweight re\-ranking\.

#### Contributions\.

\(1\) A controlled, multi\-seed study that isolates the conditions under which difficulty conditioning of a frozen sentence embedding helps, rather than a performance\-maximizing system\. \(2\) A clear negative result—per\-sentence complexity scaling does not help even when the target is well aligned—together with the mechanistic reason \(difficulty is a pair property\)\. \(3\) A positive result—a small, baseline\-preserving, cross\-encoder\-gated pair residual that improves the graded/large tasks and never falls below the frozen\-baseline Spearman in the multi\-task confirmation, with the explicit caveat that this useful form is pair\-level and is therefore a lightweight re\-ranker rather than a pure single\-vector embedding—and an analysis of why the cross\-encoder margin generalizes where surface proxies do not\. \(4\) A cheap pre\-training diagnostic, computed from frozen embeddings alone, that predicts the available headroom and correctly forecasts where adaptation helps or hurts; in practice, before training a difficulty\-aware adapter one should first check whether the proposed difficulty signal correlates with frozen\-baseline error, otherwise the adapter tends to learn a constant or self\-suppressing correction\. We make no state\-of\-the\-art claim; the contribution is the controlled understanding and its reproducibility\.

## 2Related Work

#### Sentence embeddings\.

Siamese\-encoder methods such as Sentence\-BERT\[[12](https://arxiv.org/html/2606.03244#bib.bib12)\]and contrastive methods such as SimCSE\[[5](https://arxiv.org/html/2606.03244#bib.bib5)\]produce single\-vector representations that are scored by cosine similarity\. The Massive Text Embedding Benchmark\[[8](https://arxiv.org/html/2606.03244#bib.bib8)\]evaluates such models across retrieval, similarity, classification, clustering, and reranking\. Modern open\-weight encoders, including the Qwen3 embedding family\[[11](https://arxiv.org/html/2606.03244#bib.bib11)\], occupy the top of these leaderboards\. We treat such an encoder as a fixed, black\-box producer of pooled vectors and study what a thin adapter can add on top of it without touching its weights\.

#### Bi\-encoders versus cross\-encoders\.

Bi\-encoders embed each input independently, enabling fast nearest\-neighbor retrieval but limiting cross\-input interaction; cross\-encoders jointly encode a pair and are substantially more accurate at fine\-grained discrimination, at the cost ofO​\(n2\)O\(n^\{2\}\)scoring\[[9](https://arxiv.org/html/2606.03244#bib.bib9)\]\. A standard pattern is to retrieve with a bi\-encoder and re\-rank with a cross\-encoder, or to distill the cross\-encoder into the bi\-encoder\[[7](https://arxiv.org/html/2606.03244#bib.bib7)\]\. Our positive result sits between these regimes: it keeps the frozen bi\-encoder but adds a small pair\-conditioned correction and, crucially, uses a cross\-encoder only as a*difficulty signal*rather than as the scorer\.

#### Difficulty\- and hard\-negative\-aware training\.

The notion that examples differ in difficulty drives hard\-negative mining in dense retrieval, where harder negatives yield stronger encoders\[[6](https://arxiv.org/html/2606.03244#bib.bib6),[16](https://arxiv.org/html/2606.03244#bib.bib16)\]\. These methods change the training distribution of a model that is being learned end\-to\-end\. We instead ask whether an explicit, predicted per\-input or per\-pair difficulty can usefully*condition*a frozen representation at inference time, and we measure this against controls that remove the difficulty signal while keeping the added capacity\.

#### Operational complexity targets\.

Surface and syntax\-inspired readability features \(length, clause counts, subordination markers\) provide cheap, reproducible scalars\. We use such a feature engine to define an operational complexity label, but we do not claim it is a gold linguistic annotation; its role is to serve as one candidate difficulty signal among several, and our experiments show it is poorly aligned with embedding error except on adversarial paraphrase data\. We contrast it with a held\-out cross\-encoder margin, which we find to be a far more robust difficulty signal\.

#### Tasks\.

We evaluate on semantic textual similarity \(STS\-B\[[3](https://arxiv.org/html/2606.03244#bib.bib3),[14](https://arxiv.org/html/2606.03244#bib.bib14)\]\), and three paraphrase\-style tasks: PAWS\[[17](https://arxiv.org/html/2606.03244#bib.bib17)\], whose adversarial pairs decouple lexical overlap from meaning, and MRPC\[[4](https://arxiv.org/html/2606.03244#bib.bib4)\]and QQP\[[10](https://arxiv.org/html/2606.03244#bib.bib10)\]from GLUE\[[15](https://arxiv.org/html/2606.03244#bib.bib15)\]\. SNLI\[[2](https://arxiv.org/html/2606.03244#bib.bib2)\]is used only in preliminary pipeline checks\. These tasks span a wide range of sentence length, label granularity \(graded vs\. binary\), and baseline saturation, which is exactly the variation our study requires\.

## 3Method

We describe the frozen\-baseline interface and then the four instantiations of difficulty conditioning that the experiments compare\. Throughout, a sentenceaahas a frozen baseline embeddingxa∈ℝdx\_\{a\}\\in\\mathbb\{R\}^\{d\}produced by the baseline encoder, andcos⁡\(⋅,⋅\)\\cos\(\\cdot,\\cdot\)denotes cosine similarity\.

#### Notation\.

A few symbols are shared across arms because the arms share one training objective \(Section[3\.6](https://arxiv.org/html/2606.03244#S3.SS6)\); we make their per\-arm meaning explicit here\. We writessfor the score produced by the*active*arm: the cosine between adapted vectors for the per\-sentence arms \(Section[3\.2](https://arxiv.org/html/2606.03244#S3.SS2)\), andsungateds\_\{\\mathrm\{ungated\}\}orsgateds\_\{\\mathrm\{gated\}\}for the pair\-level arms \(Section[3\.4](https://arxiv.org/html/2606.03244#S3.SS4)\)\. We writeccfor the predicted complexity or hardness—cac\_\{a\}for a per\-sentence prediction andca​bc\_\{ab\}for a pair\-level prediction—andc¯\\bar\{c\}for the corresponding supervision target\. We usex^a\\hat\{x\}\_\{a\}for the dimension\-aligned frozen baseline embedding andγ∈\(0,1\)\\gamma\\in\(0,1\)for the \(small\) residual scale\.

### 3\.1Frozen post\-encoder interface

The baseline encoder is called only through its public encode method and is never differentiated:xax\_\{a\}is computed underno\_gradand detached\. A trainable linear map and layer normalization produce a working representation

ha=LayerNorm​\(W​xa\+b\)\.h\_\{a\}=\\mathrm\{LayerNorm\}\(Wx\_\{a\}\+b\)\.\(1\)All trainable parameters live downstream ofxax\_\{a\}; gradients never reach the encoder\. Becausexax\_\{a\}is fixed, embeddings are computed once and cached\. This interface requires only a pooled vector and so admits any encoder, open or closed\.

### 3\.2Per\-sentence complexity scaling

The first instantiation follows the literal “complexity\-conditioned” idea\. A small predictor estimates a scalar complexity from the working representation,ca=σ​\(MLP​\(ha\)\)∈\(0,1\)c\_\{a\}=\\sigma\(\\mathrm\{MLP\}\(h\_\{a\}\)\)\\in\(0,1\), supervised by a targetc¯a\\bar\{c\}\_\{a\}viaMSE​\(ca,c¯a\)\\mathrm\{MSE\}\(c\_\{a\},\\bar\{c\}\_\{a\}\)\. The scalar conditions an elementwise scale that preserves the identity at initialization,

αa=1\+α0​tanh⁡\(g​\(ha,ca\)\),ua=Normalize​\(αa⊙ha\),\\alpha\_\{a\}=1\+\\alpha\_\{0\}\\tanh\\\!\\big\(g\(h\_\{a\},c\_\{a\}\)\\big\),\\qquad u\_\{a\}=\\mathrm\{Normalize\}\(\\alpha\_\{a\}\\odot h\_\{a\}\),\(2\)withα0\\alpha\_\{0\}a small range\. The default targetc¯a\\bar\{c\}\_\{a\}is an operational complexity label∑iwi​f~i/∑iwi∈\[0,1\]\\sum\_\{i\}w\_\{i\}\\tilde\{f\}\_\{i\}/\\sum\_\{i\}w\_\{i\}\\in\[0,1\]computed from capped, normalized surface featuresf~i\\tilde\{f\}\_\{i\}\(token and character counts, punctuation, conjunctions, subordination markers, approximate clause count, and a proxy verb/argument count\)\.

#### Baseline\-preserving variants and controls\.

To stay close to the frozen baseline we also consider a residual/interpolation formua=Normalize​\(\(1−γ\)​x^a\+γ​uasem\)u\_\{a\}=\\mathrm\{Normalize\}\\big\(\(1\-\\gamma\)\\,\\hat\{x\}\_\{a\}\+\\gamma\\,u^\{\\mathrm\{sem\}\}\_\{a\}\\big\)with a smallγ\\gamma, wherex^a\\hat\{x\}\_\{a\}is the \(aligned\) baseline embedding\. The key controls are:baseline\-only\(ua=x^au\_\{a\}=\\hat\{x\}\_\{a\}\);hh\-only\(α\\alphafromhah\_\{a\}with the complexity input zeroed\);fixed/global scalar\(a constant or globally learnedcc\); andshuffled complexity\(ccpermuted across the batch beforeα\\alpha\)\. The contrast between a complexity\-conditioned scale and a constant scale is our signal\-isolation test\.

### 3\.3Aligned hard\-pair difficulty target

Surface complexity need not align with where the baseline errs\. We therefore define a non\-circular, error\-aligned per\-pair difficulty from text and gold labels only,

hardlex​\(a,b\)=\|overlap​\(a,b\)−ya​b\|,\\mathrm\{hard\}\_\{\\mathrm\{lex\}\}\(a,b\)=\\big\|\\,\\mathrm\{overlap\}\(a,b\)\-y\_\{ab\}\\,\\big\|,\(3\)whereoverlap\\mathrm\{overlap\}is token Jaccard andya​by\_\{ab\}is the gold score \(the label for binary tasks\)\. A non\-paraphrase with high lexical overlap, or a paraphrase with low overlap, is hard\. This never uses the baseline embedding, so it cannot be circular, yet by construction it tracks the surface\-similarity axis on which bi\-encoders fail\. We aggregate it to a per\-sentence target \(mean over pairs containing the sentence\) for the per\-sentence gate of Section[3\.2](https://arxiv.org/html/2606.03244#S3.SS2)\.

### 3\.4Pair\-level conditioning

Because difficulty is a property of the pair, we move the conditioning to the pair while keeping the frozen\-baseline cosine as an anchor\. From the two working representations we form the pair featuresϕa​b=\[\|ha−hb\|,ha⊙hb\]\\phi\_\{ab\}=\[\\,\|h\_\{a\}\-h\_\{b\}\|,\\;h\_\{a\}\\odot h\_\{b\}\\,\]and a*raw*\(gate\-free\) residual

δraw​\(a,b\)=tanh⁡\(MLP​\(ϕa​b\)\)\.\\delta\_\{\\mathrm\{raw\}\}\(a,b\)=\\tanh\\\!\\big\(\\mathrm\{MLP\}\(\\phi\_\{ab\}\)\\big\)\.\(4\)Theungatedarm adds this raw residual, scaled by a smallγ\\gamma, to the frozen cosine, so the correction is a bounded perturbation of the baseline:

sungated​\(a,b\)=cos⁡\(xa,xb\)⏟frozen anchor\+γ​δraw​\(a,b\)\.s\_\{\\mathrm\{ungated\}\}\(a,b\)=\\underbrace\{\\cos\(x\_\{a\},x\_\{b\}\)\}\_\{\\text\{frozen anchor\}\}\\;\+\\;\\gamma\\,\\delta\_\{\\mathrm\{raw\}\}\(a,b\)\.\(5\)A separate head predicts a pair hardnessca​b=σ​\(MLP​\(ϕa​b\)\)∈\(0,1\)c\_\{ab\}=\\sigma\\\!\\big\(\\mathrm\{MLP\}\(\\phi\_\{ab\}\)\\big\)\\in\(0,1\), supervised by a difficulty target \(Sections[3\.3](https://arxiv.org/html/2606.03244#S3.SS3)and[3\.5](https://arxiv.org/html/2606.03244#S3.SS5)\)\. Thegatedarm multiplies the raw residual by this predicted hardness and*uses the resulting gated residual in place of the raw one in the score*:

δgated​\(a,b\)=ca​b⋅δraw​\(a,b\),sgated​\(a,b\)=cos⁡\(xa,xb\)\+γ​δgated​\(a,b\)=cos⁡\(xa,xb\)\+γ​ca​b​δraw​\(a,b\),\\delta\_\{\\mathrm\{gated\}\}\(a,b\)=c\_\{ab\}\\cdot\\delta\_\{\\mathrm\{raw\}\}\(a,b\),\\qquad s\_\{\\mathrm\{gated\}\}\(a,b\)=\\cos\(x\_\{a\},x\_\{b\}\)\+\\gamma\\,\\delta\_\{\\mathrm\{gated\}\}\(a,b\)=\\cos\(x\_\{a\},x\_\{b\}\)\+\\gamma\\,c\_\{ab\}\\,\\delta\_\{\\mathrm\{raw\}\}\(a,b\),\(6\)so that easy pairs \(low predicted hardness\) receive little correction\. The two arms use identically sized heads; the only difference is whether the predicted hardness multiplies the residual \(and is supervised\), which makes “does the difficulty signal help” directly testable\.

### 3\.5Held\-out cross\-encoder difficulty

Finally we replace the surface proxy of Eq\. \([3](https://arxiv.org/html/2606.03244#S3.E3)\) with a held\-out cross\-encoder margin\. LetCE​\(a,b\)∈\[0,1\]\\mathrm\{CE\}\(a,b\)\\in\[0,1\]be the similarity from an independent cross\-encoder\. We define

hardCE​\(a,b\)=\|CE​\(a,b\)−12​\(cos⁡\(xa,xb\)\+1\)\|,\\mathrm\{hard\}\_\{\\mathrm\{CE\}\}\(a,b\)=\\big\|\\,\\mathrm\{CE\}\(a,b\)\-\\tfrac\{1\}\{2\}\(\\cos\(x\_\{a\},x\_\{b\}\)\+1\)\\,\\big\|,\(7\)the gap between a strong independent judge and the cheap bi\-encoder cosine\. This estimates “where the embedding is likely wrong” without using gold labels, using a separate model; it is the difficulty target used in the gated pair scorer of Eqs\. \([5](https://arxiv.org/html/2606.03244#S3.E5)\)–\([6](https://arxiv.org/html/2606.03244#S3.E6)\)\.

### 3\.6Training objective

Trained models minimize a similarity\-regression term on the score, a batchwise logistic rank term, a difficulty\-prediction term \(where a hardness head is present\), and—for the per\-sentence vector\-scaling arms—a vector anchor term:

ℒ=MSE​\(s,y\)\+λrank​ℒrank\+λanc​ℒanchor\+λhard​MSE​\(c,c¯\)\.\\mathcal\{L\}=\\mathrm\{MSE\}\(s,y\)\+\\lambda\_\{\\mathrm\{rank\}\}\\,\\mathcal\{L\}\_\{\\mathrm\{rank\}\}\+\\lambda\_\{\\mathrm\{anc\}\}\\,\\mathcal\{L\}\_\{\\mathrm\{anchor\}\}\+\\lambda\_\{\\mathrm\{hard\}\}\\,\\mathrm\{MSE\}\(c,\\bar\{c\}\)\.\(8\)
Symbols\.ssis the score of the active arm—the cosine between adapted vectors for the per\-sentence arms, andsungateds\_\{\\mathrm\{ungated\}\}orsgateds\_\{\\mathrm\{gated\}\}for the pair\-level arms \(Eqs\. \([5](https://arxiv.org/html/2606.03244#S3.E5)\)–\([6](https://arxiv.org/html/2606.03244#S3.E6)\)\)\.yyis the gold similarity score, mapped to\{0,1\}\\\{0,1\\\}for the binary paraphrase tasks \(PAWS, MRPC, QQP\) and to\[0,1\]\[0,1\]for graded STS\-B\.ccis the predicted complexity/hardness \(cac\_\{a\}for per\-sentence arms,ca​bc\_\{ab\}for pair\-level arms\), andc¯\\bar\{c\}is the corresponding supervision target: the surface complexity labelc¯a\\bar\{c\}\_\{a\}of Section[3\.2](https://arxiv.org/html/2606.03244#S3.SS2)in the surface experiments, the aggregatedhardlex\\mathrm\{hard\}\_\{\\mathrm\{lex\}\}target of Section[3\.3](https://arxiv.org/html/2606.03244#S3.SS3)in the aligned per\-sentence experiments, andhardCE​\(a,b\)\\mathrm\{hard\}\_\{\\mathrm\{CE\}\}\(a,b\)of Eq\. \([7](https://arxiv.org/html/2606.03244#S3.E7)\) in the cross\-encoder\-gated pair experiments\. The hardness term is present only when a hardness head is used \(the gated pair arm and the supervised per\-sentence arms\)\.

Rank term\.Within each mini\-batch, over all ordered pairs\(i,j\)\(i,j\)whose gold scores satisfyyi\>yjy\_\{i\}\>y\_\{j\}, the rank term appliesℒrank=softplus​\(−\(si−sj\)/τ\)\\mathcal\{L\}\_\{\\mathrm\{rank\}\}=\\mathrm\{softplus\}\\\!\\big\(\-\(s\_\{i\}\-s\_\{j\}\)/\\tau\\big\)\(averaged over such pairs, temperatureτ=0\.05\\tau=0\.05\), encouraging pairs with larger gold scores to receive larger predicted scores\.

Anchor term\.For the per\-sentence vector\-scaling arms,ℒanchor=1−cos⁡\(u,x^\)\\mathcal\{L\}\_\{\\mathrm\{anchor\}\}=1\-\\cos\(u,\\hat\{x\}\)averaged over the batch is a*vector*anchor that keeps each adapted outputuuclose to its frozen baseline embeddingx^\\hat\{x\}\. The pair\-level arms carry no explicit anchor term \(λanc=0\\lambda\_\{\\mathrm\{anc\}\}=0\): there, baseline preservation is structural, coming from the frozen cosine and the smallγ\\gammain Eqs\. \([5](https://arxiv.org/html/2606.03244#S3.E5)\)–\([6](https://arxiv.org/html/2606.03244#S3.E6)\)\.

Weights\.λrank,λanc,λhard\\lambda\_\{\\mathrm\{rank\}\},\\lambda\_\{\\mathrm\{anc\}\},\\lambda\_\{\\mathrm\{hard\}\}weight the corresponding terms; we use the values set in the experiment scripts \(λrank=0\.1\\lambda\_\{\\mathrm\{rank\}\}=0\.1andλhard=0\.5\\lambda\_\{\\mathrm\{hard\}\}=0\.5throughout, andλanc=0\.05\\lambda\_\{\\mathrm\{anc\}\}=0\.05for the per\-sentence arms\)\. The rank term targets the ordering metrics used in evaluation; the anchor term and the smallγ\\gammajointly enforce baseline preservation\.

## 4Experimental Setup

#### Baseline encoder\.

Unless noted, the frozen baseline is the open\-weightQwen3\-Embedding\-0\.6B\[[11](https://arxiv.org/html/2606.03244#bib.bib11)\]\(d=1024d=1024\), used through its standard encode interface withℓ2\\ell\_\{2\}\-normalized outputs\. Embeddings are computed once per task and cached; no encoder weights are updated\. Preliminary checks \(Section[5\.1](https://arxiv.org/html/2606.03244#S5.SS1)\) also referencebge\-base\-en\-v1\.5\[[1](https://arxiv.org/html/2606.03244#bib.bib1)\]\.

#### Tasks and splits\.

We use PAWS \(labeled\_final\)\[[17](https://arxiv.org/html/2606.03244#bib.bib17)\], MRPC\[[4](https://arxiv.org/html/2606.03244#bib.bib4)\]and QQP\[[10](https://arxiv.org/html/2606.03244#bib.bib10)\]from GLUE\[[15](https://arxiv.org/html/2606.03244#bib.bib15)\], and STS\-B\[[3](https://arxiv.org/html/2606.03244#bib.bib3),[14](https://arxiv.org/html/2606.03244#bib.bib14)\]\. Binary labels are mapped to scores\{0,1\}\\\{0,1\\\}and graded scores to\[0,1\]\[0,1\]\. The per\-sentence and pair\-level studies use20002000training and10001000evaluation pairs; the multi\-task confirmation uses up to40004000training and15001500evaluation pairs \(capped by split size, e\.g\. MRPC validation has408408pairs\)\. All pairs within a task share one cached sentence table\.

#### Cross\-encoder\.

The held\-out difficulty judge in Eq\. \([7](https://arxiv.org/html/2606.03244#S3.E7)\) iscross\-encoder/stsb\-distilroberta\-base\[[13](https://arxiv.org/html/2606.03244#bib.bib13)\], a semantic\-similarity cross\-encoder; its scores are cached per pair and used only to define the difficulty target, never as the evaluation scorer\.

#### Training\.

Adapters are trained with AdamW \(learning rate10−310^\{\-3\}, weight decay10−210^\{\-2\}\), batch size6464–128128, for400400–800800steps depending on the study, with gradient clipping at1\.01\.0\. The baseline\-preserving residual usesγ=0\.25\\gamma=0\.25unless a ceiling regime \(γ=2\.0\\gamma=2\.0\) is stated\. Per\-sentence and pair\-level studies use seeds\{0,1,2\}\\\{0,1,2\\\}; the multi\-task confirmation uses five seeds\{0,1,2,3,4\}\\\{0,1,2,3,4\\\}\.

#### Metrics\.

We report Spearman and Pearson correlation between the \(pair\) score and gold, and two retrieval proxies on the evaluation pairs—Recall@1 and MRR—computed by ranking each query’s gold partner against all candidates\. For pair\-level scorers the candidate matrix is scored by the pair model\. All comparisons are*paired by seed*: we report per\-seed deltas and whether every seed agrees in sign\.

#### Headroom diagnostic\.

Before training, we compute from the frozen embeddings alone: the baseline Spearman, the spread of the per\-pair error\|cos−y\|\|\\cos\-y\|, and the Pearson correlation between each candidate difficulty signal and that error\. This diagnostic predicts whether adaptation has aligned headroom to exploit, and we report it alongside the results\.

#### Reproducibility\.

Each study is a self\-contained script that caches embeddings and cross\-encoder scores, runs all arms and seeds, and writes per\-run metrics and paired deltas\. No metric is hand\-entered; all numbers in Section[5](https://arxiv.org/html/2606.03244#S5)are read from these logs\.

## 5Results

We present the four studies in the order of Section[3\.2](https://arxiv.org/html/2606.03244#S3.SS2)–[3\.5](https://arxiv.org/html/2606.03244#S3.SS5)\. All correlations are Spearman unless stated; deltas are paired by seed\.

#### Preliminary check\.

Earlier bounded ablations onbge\-base\-en\-v1\.5with STS\-B \(the per\-sentence scaling and small interpolation/metric variants of Section[3\.2](https://arxiv.org/html/2606.03244#S3.SS2)\) showed only sub\-0\.0010\.001Spearman movements over a fixed\-interpolation reference and no reliable advantage for sentence\-level complexity supervision overhh\-only and fixed/global controls\. This motivated moving to a modern baseline and to tasks with a wider difficulty range\.

### 5\.1Per\-sentence complexity scaling does not help

We pair the frozenQwen3\-Embedding\-0\.6Bwith PAWS \(long, multi\-clause, adversarial\) and STS\-B \(short, near\-saturated\),2000/10002000/1000pairs,400400steps, seeds\{0,1,2\}\\\{0,1,2\\\}\. The pre\-training diagnostic already separates the regimes: on PAWS the frozen baseline reaches Spearman0\.3380\.338with a large error spread \(std​\|cos−y\|=0\.456\\mathrm\{std\}\\,\|\\cos\-y\|=0\.456\); on STS\-B it reaches0\.9110\.911with a small spread \(0\.1350\.135\)\. However, the surface complexity signal is misaligned: the correlation between sentence length and per\-sentence baseline error is only\+0\.05\+0\.05on both tasks\.

Consistent with this, conditioning gives no leverage\. The signal\-isolation contrast \(complexity\-conditioned scale minus constant\-γ\\gammascale\) is−0\.0007\-0\.0007on PAWS, i\.e\. the complexity signal adds nothing over a constant; on the saturated STS\-B every adaptation arm*reduces*Spearman \(adaptive−0\.0019\-0\.0019vs\. baseline, and−0\.0073\-0\.0073Recall@1\)\. Dropping the baseline and keeping only the learned projection costs≈0\.02\\approx 0\.02Spearman on both tasks, confirming that baseline preservation is load\-bearing but that the trained head carries no information the frozen embedding lacks\.

### 5\.2Aligning the target restores correlation but not gains

Replacing the surface label with the non\-circular hard\-pair target of Eq\. \([3](https://arxiv.org/html/2606.03244#S3.E3)\) \(aggregated per sentence\) raises the correlation with baseline error from\+0\.05\+0\.05to\+0\.98\+0\.98on PAWS—the alignment problem is solved without ever using the baseline embedding\. The full model now improves over the frozen baseline on PAWS by\+0\.0024\+0\.0024Spearman with all three seeds positive, larger than the surface target\. Yet the signal\-isolation contrast remains≈0\\approx 0: conditioning still matches a constant scale\. The reason is structural: the aligned target’s development MSE \(∼0\.205\\sim 0\.205\) does not fall below its label variance \(∼0\.14\\sim 0\.14\), so a single sentence’s frozen embedding only weakly predicts hard\-pair difficulty\. Difficulty is a property of the*pair*; a per\-sentence gate cannot represent it\.

### 5\.3Pair\-level conditioning makes the difficulty signal active

Moving to the pair scorer of Eq\. \([5](https://arxiv.org/html/2606.03244#S3.E5)\) \(2000/10002000/1000pairs,600600steps, seeds\{0,1,2\}\\\{0,1,2\\\}\) changes the picture \(Table[1](https://arxiv.org/html/2606.03244#S5.T1)\)\. In the baseline\-preserving regime \(γ=0\.25\\gamma=0\.25\), an*ungated*residual collapses to a rank\-invariant shift \(\+0\.0000\+0\.0000Spearman\), whereas gating the same residual by predicted hardness yields\+0\.0103\+0\.0103Spearman on PAWS with all seeds positive while largely preserving retrieval \(a small Recall@1 change,0\.6760\.676vs\.0\.6820\.682\)\. On the saturated STS\-B the gated residual self\-suppresses to the baseline, while the ungated residual slightly hurts\. At a near\-unconstrained budget \(γ=2\.0\\gamma=2\.0\) the raw residual already captures the Spearman gain \(\+0\.0105\+0\.0105\) but degrades retrieval \(Recall@10\.6390\.639\), and the hardness signal becomes redundant for Spearman while harming retrieval further\. The difficulty signal is thus useful specifically in the small, baseline\-preserving regime—which is the method’s premise\.

Table 1:Pair\-level conditioning, baseline\-preserving regime, mean over three seeds\. The hardness gate is load\-bearing: it turns an inert residual into a consistent \(all\-seeds\-positive\) Spearman gain, with only a small Recall@1 change \(0\.6760\.676vs\.0\.6820\.682\)\.Figure[1](https://arxiv.org/html/2606.03244#S5.F1)summarizes the signal\-isolation delta on PAWS across the design stages: the difficulty signal contributes nothing per\-sentence \(−0\.0007\-0\.0007\) and nothing at the near\-unconstrained budget \(\+0\.0003\+0\.0003\), and becomes active only for the small, baseline\-preserving pair residual \(\+0\.0103\+0\.0103\)\.

![Refer to caption](https://arxiv.org/html/2606.03244v1/x1.png)Figure 1:Spearman gain attributable to the difficulty signal on PAWS \(treatment minus matched control: complexity\-conditioned vs\. constant scale for the per\-sentence stage; hardness\-gated vs\. ungated residual for the pair\-level stages\)\. The signal is active only for the small, baseline\-preserving pair residual\.
### 5\.4A held\-out cross\-encoder margin confirms across tasks

We replace the surface proxy with the cross\-encoder margin of Eq\. \([7](https://arxiv.org/html/2606.03244#S3.E7)\) and confirm at multi\-task scale \(PAWS, MRPC, QQP, STS\-B; up to4000/15004000/1500pairs;800800steps; five seeds;γ=0\.25\\gamma=0\.25\)\. Figure[2](https://arxiv.org/html/2606.03244#S5.F2)shows that the cross\-encoder margin is positively aligned with the true baseline error on*all four*tasks \(\+0\.14\+0\.14,\+0\.50\+0\.50,\+0\.42\+0\.42,\+0\.83\+0\.83\), whereas the surface proxy is aligned only on PAWS \(which is built around lexical overlap,\+0\.98\+0\.98\) and*anti*\-aligned on QQP \(−0\.43\-0\.43\) and STS\-B \(−0\.63\-0\.63\): the surface proxy is a PAWS\-specific artifact\.

![Refer to caption](https://arxiv.org/html/2606.03244v1/x2.png)Figure 2:Pearson correlation of each candidate difficulty signal with the true per\-pair baseline error\|cos−y\|\|\\cos\-y\|on the evaluation split\. The cross\-encoder margin is robustly aligned across all four tasks; the surface proxy is aligned only on PAWS and anti\-aligned on QQP and STS\-B\.Table[2](https://arxiv.org/html/2606.03244#S5.T2)and Figure[3](https://arxiv.org/html/2606.03244#S5.F3)report Spearman by arm\.111The absolute STS\-B frozen baseline here \(0\.87310\.8731\) differs from the0\.9110\.911of Section[5\.1](https://arxiv.org/html/2606.03244#S5.SS1)only because the two studies score different capped evaluation subsets of the same frozen cosine—the first15001500vs\. the first10001000STS\-B validation pairs\. All claims are therefore made through within\-study paired comparisons rather than cross\-study absolute values\.The cross\-encoder\-gated residual never falls below the frozen\-baseline Spearman on any task: it self\-suppresses to the baseline on PAWS and MRPC and improves QQP \(\+0\.0372\+0\.0372\) and STS\-B \(\+0\.0220\+0\.0220, all five seeds positive\)\. It also does not degrade retrieval: its Recall@1 equals the baseline on PAWS and MRPC and is marginally higher on QQP \(\+0\.0012\+0\.0012\) and STS\-B \(\+0\.0030\+0\.0030\)\.

We caution against reading Table[2](https://arxiv.org/html/2606.03244#S5.T2)by average Spearman alone\. The surface\-gated arm attains the highest average, but this is driven almost entirely by PAWS, whose adversarial construction makes the lexical\-overlap proxy naturally aligned with where the baseline errs \(Fig\.[2](https://arxiv.org/html/2606.03244#S5.F2)\); off PAWS the surface proxy is unreliable or anti\-aligned, the surface gate degrades MRPC retrieval \(from0\.9610\.961to0\.9110\.911\), and it cannot help on the anti\-aligned STS\-B and QQP\. The appropriate comparison under our diagnostic is thus not the average alone but cross\-task alignment, sign stability, and baseline\-preserving behavior\. By those criteria the cross\-encoder gate is preferred: it is positively aligned with baseline error on all four tasks, never falls below the frozen\-baseline Spearman, and does not degrade retrieval\. We do not claim it is uniformly best per task—on PAWS the surface proxy is better—only that it is the more robust and general choice under the paper’s diagnostic criterion\.

Table 2:Multi\-task Spearman by arm \(mean over five seeds,γ=0\.25\\gamma=0\.25\)\. The cross\-encoder gate never falls below the frozen\-baseline Spearman and does not degrade retrieval; the surface gate attains the highest average but is driven by PAWS and is unreliable off it\. “Pair resid\.” is the ungated residual\.![Refer to caption](https://arxiv.org/html/2606.03244v1/x3.png)Figure 3:Multi\-task Spearman by arm \(mean over five seeds,γ=0\.25\\gamma=0\.25; error bars are one standard deviation across seeds\)\. The cross\-encoder\-gated residual never falls below the frozen\-baseline Spearman and improves the graded/large tasks \(QQP, STS\-B\); the surface gate helps only on its PAWS home turf\.

## 6Discussion

#### Three conditions for difficulty conditioning to help\.

Across the five studies the same picture emerges\. Difficulty conditioning of a frozen sentence embedding becomes a real, non\-harmful, and*generalizing*effect only when three conditions hold together: \(i\) the conditioning is*pair\-level*, not per\-sentence, because hard\-pair difficulty is not recoverable from a single sentence’s frozen embedding; \(ii\) the correction is a*small, baseline\-preserving residual*, because at large budgets a generic pair residual already captures the correlation gain and the difficulty signal becomes redundant while harming retrieval; and \(iii\) the difficulty target is an*independent cross\-encoder margin*, not a surface proxy or the circular self\-error, because only the cross\-encoder margin is aligned with baseline error across tasks\. Removing any one condition removes the effect\.

#### Why the cross\-encoder margin generalizes\.

A surface proxy encodes a particular hypothesis about difficulty \(lexical\-overlap mismatch\)\. On PAWS, which is constructed around that exact phenomenon, the proxy is near\-perfectly aligned; elsewhere it is uninformative or anti\-aligned, and gating by an anti\-aligned signal can only hurt or self\-suppress\. The cross\-encoder margin instead measures the gap between a strong independent judge and the cheap bi\-encoder, which is a task\-agnostic estimate of “where the embedding is wrong\.” This is why it is positively aligned on all four tasks and why the gated residual it drives never falls below the frozen\-baseline Spearman\.

#### A useful safety property\.

The predicted\-difficulty gate confers an automatic “do no harm” behavior: where the difficulty signal has no aligned headroom \(saturated STS\-B for the surface proxy; PAWS/MRPC for the cross\-encoder margin\), the gate drives the residual toward zero and the model reverts to the frozen\-baseline score\. Combined with the smallγ\\gammaand the anchor loss, this means that in our multi\-task confirmation the cross\-encoder\-gated adapter never reduced the frozen\-baseline Spearman and did not degrade retrieval—an attractive property for a drop\-in post\-encoder module, although we verify it only at the bounded scale of this study\.

#### Relationship to re\-ranking and distillation\.

Our positive model is a minimal learned re\-ranker over frozen bi\-encoder embeddings, and it is worth stating precisely how it differs from two neighbours\. Standard cross\-encoder re\-ranking uses the cross\-encoder*itself*as the inference\-time scorer\[[9](https://arxiv.org/html/2606.03244#bib.bib9)\]; standard distillation trains a student bi\-encoder or scorer to*imitate*the cross\-encoder’s output\[[7](https://arxiv.org/html/2606.03244#bib.bib7)\]\. We do neither: the cross\-encoder is used only to*define a difficulty target*—the gap between the cross\-encoder similarity and the frozen bi\-encoder cosine \(Eq\. \([7](https://arxiv.org/html/2606.03244#S3.E7)\)\)—while the inference score remains anchored to the frozen cosine plus a bounded residual \(Eq\. \([5](https://arxiv.org/html/2606.03244#S3.E5)\)\)\. The cross\-encoder is therefore neither the scorer nor a target to be imitated; it is an independent estimator of*where*the frozen embedding needs correction\. The method is genuinely related to re\-ranking and distillation—it is a pair\-level scorer, not a single\-vector embedding—but the experimental question is different: whether an independently supplied difficulty signal can localize the corrections a frozen embedding needs\. We make no claim that it matches the accuracy of full cross\-encoder re\-ranking, which remains an upper reference\.

#### The headroom diagnostic\.

The pre\-training diagnostic predicted, from frozen embeddings alone, that STS\-B was saturated \(high baseline Spearman, small error spread\) and that adaptation would not help there, and that PAWS had large but structurally specific headroom\. These predictions matched the outcomes\. We suggest reporting such a diagnostic before training any post\-encoder adapter, as it cheaply separates “no headroom” from “headroom along an unrecoverable axis\.” Concretely, the practical takeaway is: before training a difficulty\-aware adapter, first check whether the proposed difficulty signal is correlated with frozen\-baseline error; otherwise the adapter is likely to learn a constant or self\-suppressing correction, exactly as we observe for the surface proxy off PAWS\.

## 7Limitations

The effects we report are modest in magnitude \(∼0\.02\\sim 0\.02–0\.040\.04Spearman on the graded and large tasks\) and are demonstrated at a bounded scale: a single open\-weight bi\-encoder, a single cross\-encoder, four English tasks, thousands rather than full\-corpus pairs, and a few hundred optimization steps\. We have not tested larger bi\-encoders, multilingual data, or full benchmark\-scale evaluation, and we make no state\-of\-the\-art claim\.

Two observations caution against over\-reading the positive result\. First, several cross\-encoder\-gated cells only revert to the frozen baseline rather than improving; the gain is concentrated on the graded/large tasks\. Second, on QQP even the anti\-aligned surface gate improved Spearman, which indicates that part of the measured gain comes from generic pair\-residual capacity rather than from correct difficulty ranking\. Our signal\-isolation controls bound but do not entirely eliminate this confound\.

The positive configuration changes the interface: the model is a pair scorer \(re\-ranker\), not a single\-vector embedding, so it does not directly produce vectors for approximate nearest\-neighbor indices\. The difficulty target also depends on the choice and calibration of the cross\-encoder; a poorly matched judge would supply a weaker signal\. Finally, the frozen\-baseline design isolates the adapter but does not speak to whether difficulty conditioning would help under full end\-to\-end fine\-tuning, which is a different and larger question\.

## 8Conclusion

We asked whether conditioning a frozen sentence embedding on input difficulty improves it, and under what conditions\. Through five controlled, multi\-seed studies with strict signal\-isolation controls, we showed that the naïve per\-sentence form does not help—even when its target is made well aligned with baseline error—because difficulty is a property of the pair that a per\-sentence gate cannot represent\. The mechanism becomes effective only when pair\-level conditioning, a small baseline\-preserving residual, and a held\-out cross\-encoder difficulty target are combined: the resulting gated residual produces consistent, sign\-stable gains on graded and large tasks, never falls below the frozen\-baseline Spearman in the multi\-task confirmation, and does not degrade retrieval\. We also provide a cheap pre\-training diagnostic that predicts the available headroom from frozen embeddings alone\. The gains are modest, and because the useful form is pair\-level the deployment object is a lightweight re\-ranker over cached frozen embeddings rather than a replacement single\-vector embedding, so we make no state\-of\-the\-art claim; the contribution is a precise, reproducible account of when difficulty\-aware adaptation of frozen embeddings is worthwhile\. Natural next steps are to scale the confirmation to larger encoders and full benchmarks, to replace the cross\-encoder margin with a difficulty signal estimable on unlabeled corpora, and to test whether the same three conditions govern difficulty conditioning under end\-to\-end fine\-tuning\.

## References

- \[1\]Beijing Academy of Artificial Intelligence\.BAAI/bge\-base\-en\-v1\.5 model card\.[https://huggingface\.co/BAAI/bge\-base\-en\-v1\.5](https://huggingface.co/BAAI/bge-base-en-v1.5)\.Accessed 2026\-05\-28\.
- \[2\]Samuel R\. Bowman, Gabor Angeli, Christopher Potts, and Christopher D\. Manning\.A large annotated corpus for learning natural language inference\.InProceedings of the 2015 Conference on Empirical Methods in Natural Language Processing \(EMNLP\), 2015\.
- \[3\]Daniel Cer, Mona Diab, Eneko Agirre, Iñigo Lopez\-Gazpio, and Lucia Specia\.SemEval\-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation\.InProceedings of the 11th International Workshop on Semantic Evaluation \(SemEval\-2017\), 2017\.
- \[4\]William B\. Dolan and Chris Brockett\.Automatically constructing a corpus of sentential paraphrases\.InProceedings of the Third International Workshop on Paraphrasing \(IWP\), 2005\.
- \[5\]Tianyu Gao, Xingcheng Yao, and Danqi Chen\.SimCSE: Simple contrastive learning of sentence embeddings\.InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing \(EMNLP\), 2021\.
- \[6\]Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen\-tau Yih\.Dense passage retrieval for open\-domain question answering\.InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing \(EMNLP\), 2020\.
- \[7\]Sheng\-Chieh Lin, Jheng\-Hong Yang, and Jimmy Lin\.Distilling dense representations for ranking using tightly\-coupled teachers\.arXiv preprint arXiv:2010\.11386, 2020\.
- \[8\]Niklas Muennighoff, Nouamane Tazi, Loic Magne, and Nils Reimers\.MTEB: Massive text embedding benchmark\.arXiv preprint arXiv:2210\.07316, 2022\.
- \[9\]Rodrigo Nogueira and Kyunghyun Cho\.Passage re\-ranking with BERT\.arXiv preprint arXiv:1901\.04085, 2019\.
- \[10\]Quora\.Quora question pairs\.[https://quoradata\.quora\.com/First\-Quora\-Dataset\-Release\-Question\-Pairs](https://quoradata.quora.com/First-Quora-Dataset-Release-Question-Pairs)\.Accessed 2026\-05\-29\.
- \[11\]Qwen Team\.Qwen3 embedding models\.[https://huggingface\.co/Qwen/Qwen3\-Embedding\-0\.6B](https://huggingface.co/Qwen/Qwen3-Embedding-0.6B)\.Open\-weight embedding model \(Apache\-2\.0\)\. Accessed 2026\-05\-29\.
- \[12\]Nils Reimers and Iryna Gurevych\.Sentence\-BERT: Sentence embeddings using Siamese BERT\-networks\.InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing \(EMNLP\-IJCNLP\), 2019\.
- \[13\]Nils Reimers and Iryna Gurevych\.cross\-encoder/stsb\-distilroberta\-base\.[https://huggingface\.co/cross\-encoder/stsb\-distilroberta\-base](https://huggingface.co/cross-encoder/stsb-distilroberta-base)\.Sentence\-Transformers cross\-encoder\. Accessed 2026\-05\-29\.
- \[14\]Sentence\-Transformers\.sentence\-transformers/stsb dataset\.[https://huggingface\.co/datasets/sentence\-transformers/stsb](https://huggingface.co/datasets/sentence-transformers/stsb)\.Accessed 2026\-05\-29\.
- \[15\]Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R\. Bowman\.GLUE: A multi\-task benchmark and analysis platform for natural language understanding\.InProceedings of the 2018 EMNLP Workshop BlackboxNLP, 2018\.
- \[16\]Lee Xiong, Chenyan Xiong, Ye Li, Kwok\-Fung Tang, Jialin Liu, Paul N\. Bennett, Junaid Ahmed, and Arnold Overwijk\.Approximate nearest neighbor negative contrastive learning for dense text retrieval\.InInternational Conference on Learning Representations \(ICLR\), 2021\.
- \[17\]Yuan Zhang, Jason Baldridge, and Luheng He\.PAWS: Paraphrase adversaries from word scrambling\.InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics \(NAACL\-HLT\), 2019\.

Similar Articles

Principles of Concept Representation in Sentence Encoders

arXiv cs.CL

This paper investigates principles of concept representation in sentence encoders through the lens of compositional semantics, identifying four key principles: fine-tuning recalibrates latent geometry, semantic signal concentrates in the final layer, hard negatives improve discrimination but not ranking, and supervision effectiveness depends on composition type.

Self-Improving In-Context Learning

arXiv cs.CL

This paper proposes a method to improve in-context learning by optimizing the continuous embeddings of a fixed few-shot prompt at test time, using a self-supervised confidence proxy derived from the model's log-probabilities without requiring fine-tuning or token generation.