Where Does Authorship Signal Emerge in Encoder-Based Language Models?

arXiv cs.CL Papers

Summary

This paper uses mechanistic interpretability to explain why authorship attribution models fine-tuned with the same encoder, data, and loss can differ four-fold in performance depending on the scoring mechanism. It finds that the scorer determines where the encoder consolidates authorship signal, with mean pooling forcing early consolidation and late interaction allowing late consolidation.

arXiv:2605.19908v1 Announce Type: new Abstract: Authorship attribution models fine-tuned with the same pretrained encoder, data, and loss can differ four-fold in performance depending only on their scoring mechanism. We use mechanistic interpretability tools to explain this gap. Stylistic features such as word length, punctuation density, and function-word frequency are equally available at every layer in every model, including in an off-the-shelf control encoder, so the gap does not come from representation quality. Instead, causal intervention shows that the scorer determines where the encoder consolidates authorship signal. Mean pooling forces consolidation by early to mid layers, while late interaction defers it to later layers. We further derive this difference from the gradient structure of each scorer, and training dynamics reveal distinct learning trajectories that follow from that difference.

Cached at: 05/20/26, 08:27 AM

# Where Does Authorship Signal Emerge in Encoder-Based Language Models?
Source: [https://arxiv.org/html/2605.19908](https://arxiv.org/html/2605.19908)
Francis Kulumba (Inria Paris, Sorbonne Université; francis.kulumba@inria.fr), Guillaume Vimont (IRIF)

###### Abstract

Authorship attribution models fine-tuned with the same pretrained encoder, data, and loss can differ four-fold in performance depending only on their scoring mechanism. We use mechanistic interpretability tools to explain this gap. Stylistic features such as word length, punctuation density, and function-word frequency are equally available at every layer in every model, including in an off-the-shelf control encoder, so the gap does not come from representation quality. Instead, causal intervention shows that the scorer determines where the encoder consolidates authorship signal. Mean pooling forces consolidation by early to mid layers, while late interaction defers it to later layers. We further derive this difference from the gradient structure of each scorer, and training dynamics reveal distinct learning trajectories that follow from that difference.


Laurent Romary (Inria Paris), Florian Cafiero (LRE, EPITA; Ecole nationale des chartes – PSL)

## 1 Introduction

![Refer to caption](https://arxiv.org/html/2605.19908v1/x1.png)

Figure 1: Conceptual overview. Left: The pretrained language model encodes stylistic features at every layer, regardless of fine-tuning. Center: Two scoring mechanisms read out these features differently. Mean pooling averages all tokens into a single vector. Late interaction (LI) (Khattab and Zaharia, [2020](https://arxiv.org/html/2605.19908#bib.bib5)) compares tokens directly. Right: Causal intervention reveals that the scoring mechanism determines where the encoder consolidates authorship signal. Mean pooling forces early consolidation while $\mathrm{MaxSim}$ allows for late consolidation.

Every author leaves traces in their writing. Sentence length, punctuation habits, function-word preferences, and word-length distributions all carry information about who wrote a text, even when two authors write about the same topic (Mosteller and Wallace, [1963](https://arxiv.org/html/2605.19908#bib.bib34); Burrows, [2002](https://arxiv.org/html/2605.19908#bib.bib21); Kešelj et al., [2003](https://arxiv.org/html/2605.19908#bib.bib35)). Authorship attribution (AA) is the task of deciding, given two passages, whether they were written by the same person or group. It is useful for forensic linguistics (Dauber et al., [2019](https://arxiv.org/html/2605.19908#bib.bib29)) and historical document analysis (Cafiero and Camps, [2019](https://arxiv.org/html/2605.19908#bib.bib30)), among other applications.

Modern AA systems follow a contrastive learning paradigm: a pretrained text encoder produces a representation for each passage (Vaswani et al., [2017](https://arxiv.org/html/2605.19908#bib.bib31); Devlin et al., [2019](https://arxiv.org/html/2605.19908#bib.bib32)), and a scoring function compares the representations to produce a similarity score (Wegmann et al., [2022](https://arxiv.org/html/2605.19908#bib.bib1); Ai et al., [2022](https://arxiv.org/html/2605.19908#bib.bib3); Huertas-Tato et al., [2024](https://arxiv.org/html/2605.19908#bib.bib4); Kantharuban et al., [2026](https://arxiv.org/html/2605.19908#bib.bib2)). The encoder is fine-tuned so that same-author passages score high and different-author passages score low. This setup works well, but recent work has revealed a striking puzzle about the scoring function. Kulumba et al. ([2025](https://arxiv.org/html/2605.19908#bib.bib33)) trained multiple models on a scholarly corpus in which topic is decorrelated from authorship, and found that the choice of scoring mechanism alone explains much of the observed four-fold performance gap. All the models share the same pretrained backbone, the same training data, and the same contrastive loss. The only difference is the pooling/scoring mechanism: one family of models averages all token representations into a single vector before scoring (mean pooling), while another compares token representations directly via late interaction (LI) (Khattab and Zaharia, [2020](https://arxiv.org/html/2605.19908#bib.bib5)).

Why does such a large gap emerge from what is, in principle, only a difference in the final comparison step? There are at least two plausible explanations. The first is that different scoring mechanisms cause the encoder to learn different internal representations during fine-tuning: mean pooling forces the encoder to discard fine-grained stylistic information that LI preserves. The second is that the encoder learns similar representations regardless of the scorer, and the gap arises purely from how those representations are read out at inference time. This paper uses the interpretability toolkit (Alain and Bengio, [2017](https://arxiv.org/html/2605.19908#bib.bib9); Vig et al., [2020](https://arxiv.org/html/2605.19908#bib.bib7); Belinkov, [2022](https://arxiv.org/html/2605.19908#bib.bib10); Goldowsky-Dill et al., [2023](https://arxiv.org/html/2605.19908#bib.bib6); Zhang and Nanda, [2023](https://arxiv.org/html/2605.19908#bib.bib8)) on the fine-tuned encoders from Kulumba et al. ([2025](https://arxiv.org/html/2605.19908#bib.bib33)) to distinguish between these two explanations. This allows us to test a dissociation between feature *availability* and feature *use* (Figure [1](https://arxiv.org/html/2605.19908#S1.F1)):

- Availability is invariant: the same stylistic features (word length, capitalization, punctuation density, etc.) are linearly readable from the hidden states of all models at all layers, including an off-the-shelf control encoder. The pretrained backbone already encodes these features; contrastive fine-tuning does not create them.
- Use depends on the scoring mechanism, which determines where in the encoder authorship signal becomes causally necessary. Mean pooling consolidates authorship signal by mid layers, while LI defers consolidation to late ones. This gap can be explained by the gradient structure of the scoring functions.

Our results show that the choice of scoring function determines the effective depth of the encoder, the information the model can exploit, and the trajectory it follows during training\. Understanding this mechanism clarifies why LI\-based systems consistently outperform pooled representations in AA, despite relying on the same pretrained backbone\.

## 2 Background

This section defines the building blocks of the contrastive AA pipeline and the analysis tools we use to study it\.

### 2.1 Contrastive authorship attribution

In the contrastive formulation, training data consists of triplets $(a,p,n)$: an anchor passage $a$, a same-author positive $p$, and a different-author negative $n$. The encoder $f_{\theta}$ maps each passage to a sequence of token-level representations. A scoring function $s$ then compares the anchor’s representation to the positive’s and to the negative’s, producing scalar similarity scores. Training minimizes the InfoNCE loss (van den Oord et al., [2019](https://arxiv.org/html/2605.19908#bib.bib12)):

$$\mathcal{L}=-\log\frac{\exp\bigl(s(a,p)/\tau\bigr)}{\exp\bigl(s(a,p)/\tau\bigr)+\displaystyle\sum_{n^{\prime}\in\mathcal{N}}\exp\bigl(s(a,n^{\prime})/\tau\bigr)}\tag{1}$$

where $\tau$ is a temperature parameter and $\mathcal{N}$ is the set of in-batch negatives: every non-positive passage in the batch serves as a negative. This loss pushes the anchor closer to the positive and farther from all negatives in the scoring space.
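To make Equation 1 concrete, the following is a minimal sketch of the loss for a single anchor, written in plain Python. The function name `info_nce`, the temperature value, and the toy scores are illustrative assumptions, not values taken from the paper.

```python
import math

def info_nce(s_pos, s_negs, tau=0.05):
    """InfoNCE loss (Equation 1) for one anchor.

    s_pos  -- score s(a, p) of the anchor with its positive
    s_negs -- scores s(a, n') for every in-batch negative n'
    tau    -- temperature (0.05 is an illustrative value, not from the paper)
    """
    logits = [s_pos / tau] + [s / tau for s in s_negs]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    top = max(logits)
    log_denominator = top + math.log(sum(math.exp(z - top) for z in logits))
    # -log(exp(pos) / sum(exp(all))) = log_denominator - pos
    return log_denominator - logits[0]

# Hypothetical scores: the second batch contains one hard negative (0.85).
easy = info_nce(0.9, [0.1, 0.0, -0.2])
hard = info_nce(0.9, [0.85, 0.0, -0.2])
print(easy, hard)
```

The batch with a hard negative yields the larger loss, which is the sense in which the loss concentrates pressure on negatives that score close to the positive.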

### 2.2 Scoring mechanisms

The encoder produces a sequence of token representations $\mathbf{H}^{a}=[\mathbf{h}_{1}^{a},\ldots,\mathbf{h}_{m}^{a}]\in\mathbb{R}^{m\times d}$ for a passage of $m$ tokens with hidden dimension $d$. The scoring function determines how this matrix is turned into a scalar similarity. We study three families.

#### Mean pooling with cosine similarity\.

The passage representation is the mean of its token embeddings and the score is the cosine similarity between mean vectors. Mean pooling is the standard AA baseline (Rivera-Soto et al., [2021](https://arxiv.org/html/2605.19908#bib.bib13); Wegmann et al., [2022](https://arxiv.org/html/2605.19908#bib.bib1); Kantharuban et al., [2026](https://arxiv.org/html/2605.19908#bib.bib2)). It compresses the entire token sequence into a single $d$-dimensional vector before scoring.

#### Late interaction ($\mathrm{MaxSim}$).

The passage is represented by its full sequence of token embeddings, and the score is the sum over anchor tokens of the maximum cosine similarity to any candidate token (Khattab and Zaharia, [2020](https://arxiv.org/html/2605.19908#bib.bib5)):

$$s_{\text{LI}}(a,p)=\sum_{i=1}^{m_{a}}\max_{j\in[m_{p}]}\cos(\mathbf{h}_{i}^{a},\mathbf{h}_{j}^{p})\tag{2}$$

Unlike mean pooling, LI preserves per-token structure through the scoring function: the encoder does not need to compress all the information.
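The two scorers defined so far can be sketched side by side. This is a minimal plain-Python illustration on toy 2-dimensional token embeddings; the helper names and the embedding values are hypothetical, and masking of punctuation or padding is omitted.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def mean_pool(tokens):
    """Average an m x d list of token vectors into one d-dimensional vector."""
    return [sum(column) / len(tokens) for column in zip(*tokens)]

def score_mean(H_a, H_p):
    """Mean pooling followed by cosine similarity."""
    return cosine(mean_pool(H_a), mean_pool(H_p))

def score_maxsim(H_a, H_p):
    """Equation 2: sum over anchor tokens of the best cosine match in the candidate."""
    return sum(max(cosine(h_a, h_p) for h_p in H_p) for h_a in H_a)

# Toy 2-d token embeddings (hypothetical values).
H_a = [[1.0, 0.0], [0.0, 1.0]]   # anchor
H_p = [[0.0, 1.0], [1.0, 0.0]]   # candidate made of the same token vectors
H_n = [[1.0, 1.0], [1.0, 1.0]]   # different token vectors, same mean direction

print(score_mean(H_a, H_p), score_mean(H_a, H_n))      # both close to 1
print(score_maxsim(H_a, H_p), score_maxsim(H_a, H_n))  # close to 2 and to sqrt(2)
```

In this toy case the mean-pooled scorer assigns the same score to both candidates, while the token-level scorer separates them, because the second candidate shares the anchor's mean direction but none of its individual token vectors.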

#### Patch\-level late interaction \(PLI\)\.

A middle ground. The token sequence is partitioned into contiguous patches of size $n$. Each patch is mean-pooled, and $\mathrm{MaxSim}$ is applied at the patch level:

$$s_{\text{PLI}}(a,p)=\sum_{i=1}^{P_{a}}\max_{j\in[P_{p}]}\cos(\mathbf{p}_{i}^{a},\mathbf{p}_{j}^{p})\tag{3}$$

where $\mathbf{p}_{i}=\frac{1}{n}\sum_{t\in\text{patch}_{i}}\mathbf{h}_{t}$ is the mean of the tokens within patch $i$, and $P_{a}$ and $P_{p}$ are the numbers of patches in the anchor and the positive. We use $n=2$ (bigram patches) in this study.
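A minimal sketch of Equation 3 in plain Python follows. It assumes non-overlapping patches and averages a shorter trailing patch over the tokens it contains; the text does not specify how remainders are handled, so that choice, like the helper names and toy values, is an assumption.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

def patches(tokens, n=2):
    """Mean-pool contiguous, non-overlapping patches of n tokens.

    Assumption: a shorter trailing patch is averaged over its own tokens.
    """
    pooled = []
    for start in range(0, len(tokens), n):
        chunk = tokens[start:start + n]
        pooled.append([sum(column) / len(chunk) for column in zip(*chunk)])
    return pooled

def score_pli(H_a, H_p, n=2):
    """Equation 3: MaxSim applied to mean-pooled patches instead of tokens."""
    P_a, P_p = patches(H_a, n), patches(H_p, n)
    return sum(max(cosine(p_a, p_p) for p_p in P_p) for p_a in P_a)

# Toy 2-d token embeddings (hypothetical values), four tokens per passage.
H_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, 0.0]]
H_p = [[0.0, 1.0], [1.0, 0.0], [1.0, 0.0], [1.0, 1.0]]

print(patches(H_a, 2))        # two bigram patches
print(score_pli(H_a, H_p))    # sum of two best patch matches
```

Setting `n=1` reduces this sketch to token-level $\mathrm{MaxSim}$, and setting `n` to the passage length reduces it to mean pooling with cosine similarity, which is why PLI can be read as a middle ground.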

### 2.3 Alignment and uniformity

We use the alignment–uniformity framework of Wang and Isola ([2020](https://arxiv.org/html/2605.19908#bib.bib14)), where alignment $\alpha$ measures closeness of same-author pairs and uniformity $u$ measures how evenly representations spread on the hypersphere (lower is better for both).
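As an illustration, the two quantities can be sketched for single-vector embeddings using the usual definitions from Wang and Isola: mean squared distance between positive pairs for alignment, and the log of the mean Gaussian potential over all pairs for uniformity. The exponent of 2 and the value `t=2.0` are the commonly used defaults and are assumptions here, as is the restriction to single-vector representations; how the quantities are computed for multi-vector (LI) models is not stated in this section.

```python
import math
from itertools import combinations

def normalize(v):
    """Project a vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def sq_dist(u, v):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def alignment(pairs):
    """Mean squared distance between same-author pairs (lower is tighter)."""
    return sum(sq_dist(normalize(x), normalize(y)) for x, y in pairs) / len(pairs)

def uniformity(embeddings, t=2.0):
    """Log of the mean Gaussian potential over all pairs (lower is more spread out)."""
    unit = [normalize(e) for e in embeddings]
    potentials = [math.exp(-t * sq_dist(u, v)) for u, v in combinations(unit, 2)]
    return math.log(sum(potentials) / len(potentials))

# Hypothetical 2-d embeddings: one spread-out set and one clustered set.
spread = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
clustered = [[1.0, 0.0], [1.0, 0.01], [1.0, -0.01], [1.0, 0.02]]
print(uniformity(spread), uniformity(clustered))
```

The spread-out set receives the lower (better) uniformity value, and pairs that point in the same direction receive an alignment of zero.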

### 2.4 Residual stream patching

Residual stream patching (Vig et al., [2020](https://arxiv.org/html/2605.19908#bib.bib7); Meng et al., [2022](https://arxiv.org/html/2605.19908#bib.bib15)) is a causal intervention that measures the contribution of each encoder layer to the model’s output. If we corrupt the input of the encoder and then restore one layer’s activations to their clean values, how much of the model’s correct behavior is recovered?

Concretely, given a triplet $(a,p,n)$, we define three forward passes. A *clean pass* encodes the positive $p$ normally, producing hidden states $\mathbf{h}^{(\ell)}_{\text{clean}}$ at each layer $\ell\in\{0,1,\ldots,L\}$. A *corrupt pass* encodes the negative $n$ normally, producing $\mathbf{h}^{(\ell)}_{\text{corrupt}}$. A *patched pass* at layer $\ell$ encodes the negative, but at layer $\ell$ replaces the negative’s hidden states with those from the positive. The patched hidden state then propagates through the remaining encoder layers to produce a patched score $s_{\text{patched}}^{(\ell)}$.

The clean score is $s_{\text{clean}}=s(a,p)$ and the corrupt score is $s_{\text{corrupt}}=s(a,n)$. If patching at layer $\ell$ recovers the clean score, it means layer $\ell$ carries the information needed for correct authorship scoring. If patching makes no difference, the information was not yet consolidated at that layer.

### 2.5 Recovery metrics

We quantify recovery with two metrics\.

#### Percentage recovery

is a standard metric introduced by Meng et al. ([2022](https://arxiv.org/html/2605.19908#bib.bib15)):

$$\text{Recovery}^{(\ell)}(\%)=\frac{s_{\text{patched}}^{(\ell)}-s_{\text{corrupt}}}{s_{\text{clean}}-s_{\text{corrupt}}}\times 100\tag{4}$$

A value of 0% means no recovery while 100% means full recovery. Values can go outside $[0,100]$ in some particular cases. The problem with this metric is that the denominator $s_{\text{clean}}-s_{\text{corrupt}}$ can be very small, especially for scoring functions like PLI whose scores are more compressed. When the denominator is near zero, even tiny score changes produce enormous percentage values.

#### Rank recovery

avoids this problem by asking a binary question: after patching at layer $\ell$, does the model still rank the positive above the negative?

$$r_{\text{rank}}^{(\ell)}=\frac{1}{|\mathcal{T}_{+}|}\sum_{t\in\mathcal{T}_{+}}\mathbf{1}\!\bigl[s_{\text{patched}}^{(\ell)}(a_{t},p_{t})>s_{\text{patched}}^{(\ell)}(a_{t},n_{t})\bigr]\tag{5}$$

where $\mathcal{T}_{+}$ is the set of triplets the clean model ranks correctly. This gives a value in $[0,1]$ with 0.5 being chance. We use rank recovery for all main-text figures and report percentage recovery in the appendix.
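The two metrics can be sketched directly from Equations 4 and 5. The functions below take already-computed scores as input; the function names and the numeric scores are hypothetical and chosen only to show the small-denominator problem.

```python
def percentage_recovery(s_patched, s_clean, s_corrupt):
    """Equation 4 for one triplet; undefined when s_clean equals s_corrupt."""
    return 100.0 * (s_patched - s_corrupt) / (s_clean - s_corrupt)

def rank_recovery(patched_pos_scores, patched_neg_scores):
    """Equation 5: fraction of clean-correct triplets whose patched positive
    score still exceeds the patched negative score."""
    wins = sum(1 for s_p, s_n in zip(patched_pos_scores, patched_neg_scores) if s_p > s_n)
    return wins / len(patched_pos_scores)

# Well-separated clean and corrupt scores: the percentage is easy to read.
print(percentage_recovery(0.7, s_clean=0.9, s_corrupt=0.5))    # about 50
# Nearly identical clean and corrupt scores: a 0.02 shift explodes.
print(percentage_recovery(0.52, s_clean=0.501, s_corrupt=0.5))  # about 2000

# Rank recovery over four hypothetical triplets stays within [0, 1].
print(rank_recovery([0.8, 0.4, 0.9, 0.3], [0.5, 0.6, 0.1, 0.2]))
```

The second call shows why the percentage metric becomes unstable for compressed score ranges, while the rank-based metric only counts whether the ordering survives.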

### 2.6 LISA probes

To separate feature availability from feature use, we train linear probes (Alain and Bengio, [2017](https://arxiv.org/html/2605.19908#bib.bib9); Belinkov, [2022](https://arxiv.org/html/2605.19908#bib.bib10)) at each encoder layer. The probes are regression models mapping the mean-pooled hidden state at layer $\ell$ to scalar stylistic features. We report the coefficient of determination $R^{2}$ on a held-out set. The feature targets are inspired by the LISA framework from Kantharuban et al. ([2026](https://arxiv.org/html/2605.19908#bib.bib2)) and include nine categories: word length, capitalization rate, type–token ratio, punctuation density, function-word frequency, sentence length, hedging markers, citation density, and discourse connectives. A high $R^{2}$ at layer $\ell$ means the feature is linearly decodable from the representation. This is a necessary but not sufficient condition for the model to actually use that feature for scoring.
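For reference, the coefficient of determination reported for each probe can be sketched as follows. The function name and the toy targets (hypothetical mean word lengths) are assumptions; fitting the probe itself is not shown.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical held-out targets, e.g. mean word length per passage.
y_true = [4.0, 5.0, 6.0, 7.0]

print(r_squared(y_true, y_true))                 # perfect probe
print(r_squared(y_true, [5.5, 5.5, 5.5, 5.5]))   # probe that always predicts the mean
print(r_squared(y_true, [7.0, 6.0, 5.0, 4.0]))   # probe worse than the mean
```

A probe that only predicts the mean of the targets scores zero, and a probe that does worse than that scores below zero, so held-out $R^{2}$ is not bounded below.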

## 3 Gradient Structure and the Consolidation Bottleneck

This section develops a theory of what we expect to find, before any experiment is run\. The theory starts from the gradient of the scoring function and derives a prediction about where in the encoder authorship signal should be consolidated\.

### 3.1 How the gradient distributes across tokens

The end-to-end gradient of the InfoNCE loss with respect to a single token representation $\mathbf{h}_{j}^{a}$ factors into two parts:

$$\frac{\partial\mathcal{L}}{\partial\mathbf{h}_{j}^{a}}=\underbrace{\frac{\partial\mathcal{L}}{\partial s}}_{\text{InfoNCE term}}\cdot\underbrace{\frac{\partial s}{\partial\mathbf{h}_{j}^{a}}}_{\text{Scorer term}}\tag{6}$$

The InfoNCE term concentrates gradient on hard negatives. This term is identical across scoring mechanisms: it depends on the score values, not on how the scores were computed. The scorer term determines how that gradient distributes across individual tokens, and this is where the three mechanisms diverge.

#### Mean pooling: dense, uniform gradient\.

Under mean pooling, the score depends on each token only through the mean\. The partial derivative is:

$$\frac{\partial s_{\text{mean}}}{\partial\mathbf{h}_{j}^{a}}=\frac{1}{m}\cdot\frac{\partial\cos(\bar{\mathbf{h}}^{a},\bar{\mathbf{h}}^{p})}{\partial\bar{\mathbf{h}}^{a}}\tag{7}$$

The $1/m$ factor means every token receives the same gradient magnitude. The gradient is dense and uniform (no token is preferentially updated). The model has no mechanism to selectively strengthen discriminative tokens: a function word, a punctuation mark, and a content word all receive the same gradient signal.
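Equation 7 can be checked numerically. The sketch below estimates the gradient of the mean-pooled cosine score with respect to each anchor token by central finite differences and compares the per-token gradients; the helper names, step size, and toy embeddings are illustrative assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

def mean_pool(tokens):
    """Average an m x d list of token vectors into one d-dimensional vector."""
    return [sum(column) / len(tokens) for column in zip(*tokens)]

def score_mean(H_a, H_p):
    """Mean pooling followed by cosine similarity."""
    return cosine(mean_pool(H_a), mean_pool(H_p))

def token_gradients(score, H_a, H_p, eps=1e-6):
    """Central finite-difference gradient of score(H_a, H_p) w.r.t. each anchor token."""
    grads = []
    for j, token in enumerate(H_a):
        grad_j = []
        for k in range(len(token)):
            plus = [list(t) for t in H_a]
            minus = [list(t) for t in H_a]
            plus[j][k] += eps
            minus[j][k] -= eps
            grad_j.append((score(plus, H_p) - score(minus, H_p)) / (2 * eps))
        grads.append(grad_j)
    return grads

# Toy 2-d token embeddings (hypothetical values).
H_a = [[1.0, 0.2], [0.1, 0.9], [0.9, 0.1]]
H_p = [[0.3, 1.0], [0.2, 0.6]]

grads = token_gradients(score_mean, H_a, H_p)
for j, g in enumerate(grads):
    print(j, g)
```

Up to finite-difference error, all three anchor tokens receive the same gradient vector even though their embeddings differ, which is the dense, uniform pattern described above.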

#### $\mathrm{MaxSim}$: sparse, selective gradient.

Under late interaction (Equation [2](https://arxiv.org/html/2605.19908#S2.E2)), the gradient with respect to anchor token $j$ is:

$$\frac{\partial s_{\text{LI}}}{\partial\mathbf{h}_{j}^{a}}=\sum_{i=1}^{m_{p}}\mathbf{1}\!\bigl[j=\operatorname*{argmax}_{j^{\prime}}\cos(\mathbf{h}_{j^{\prime}}^{a},\mathbf{h}_{i}^{p})\bigr]\cdot\frac{\partial\cos}{\partial\mathbf{h}_{j}^{a}}\tag{8}$$

Only the tokens selected via $\operatorname{argmax}$ receive a gradient. Most tokens are not updated at all. The encoder learns which tokens carry discriminative signal because only those tokens participate in the backward pass.
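The selection pattern in Equation 8 can be illustrated by listing which anchor tokens are picked by the $\operatorname{argmax}$. The sketch follows the indexing of Equation 8 as written, where the maximum runs over anchor tokens $j^{\prime}$ for each positive token $i$; the helper names and toy embeddings are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

def selected_anchor_tokens(H_a, H_p):
    """Indicator of Equation 8: anchor indices j that are the argmax
    for at least one positive token i. Only these receive a gradient."""
    selected = set()
    for h_p in H_p:
        sims = [cosine(h_a, h_p) for h_a in H_a]
        selected.add(sims.index(max(sims)))
    return selected

# Toy 2-d token embeddings (hypothetical values): five anchor tokens, two positive tokens.
H_a = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.5, 0.5], [0.2, 0.8]]
H_p = [[1.0, 0.05], [0.05, 1.0]]

selected = selected_anchor_tokens(H_a, H_p)
print(sorted(selected), "of", len(H_a), "anchor tokens receive a gradient")
```

In this toy case two of the five anchor tokens are selected, so the remaining three would get no update from this pair, in contrast to the uniform $1/m$ share under mean pooling.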

#### PLI: intermediate density\.

Under PLI with patch size $n$ (Equation [3](https://arxiv.org/html/2605.19908#S2.E3)), the gradient combines both regimes:

$$\frac{\partial s_{\text{PLI}}}{\partial\mathbf{h}_{j}^{a}}=\frac{1}{n}\cdot\mathbf{1}\!\bigl[\text{patch}(j)\in\operatorname{argmax}\bigr]\cdot\frac{\partial\cos}{\partial\mathbf{h}_{j}^{a}}\tag{9}$$

The gradient is sparse between patches (only selected patches get gradient) and dense within patches (each of the $n$ tokens in a selected patch gets $1/n$).

### 3.2 The consolidation bottleneck

Mean pooling’s dense gradient creates what we call a consolidation bottleneck\. The scoring function only accesses the mean of all tokens\. For the encoder to produce a score that distinguishes same\-author from different\-author passages, it must arrange the hidden states so that their mean already points in a direction that encodes authorship\. The encoder must coordinate information across the entire sequence, compressing authorship\-relevant features into a form that survives averaging\. This compression must happen at some intermediate layer, which we call the*consolidation layer*\.

$\mathrm{MaxSim}$ has no such bottleneck. The scoring function accesses individual token representations directly, so the encoder can keep refining per-token features through the upper layers without needing to consolidate them into a single direction. The upper layers of a transformer encode more abstract, context-dependent features (Tenney et al., [2019](https://arxiv.org/html/2605.19908#bib.bib16)), so the ability to defer consolidation gives $\mathrm{MaxSim}$ access to richer representations.

If our analysis is correct, mean pooling should show a recovery inflection at an earlier layer than $\mathrm{MaxSim}$ when we perform causal patching. Patching below the consolidation layer should destroy the signal (the representation has not yet been compressed). Patching above it should preserve the signal (consolidation is complete). $\mathrm{MaxSim}$ should show a later inflection because there is no pressure to consolidate early.

### 3.3 Why mean pooling loses information

We can view mean pooling through an information-theoretic lens to explain why it has less capacity to encode authorship. Mean pooling maps the $m\times d$ token matrix $\mathbf{H}$ to a $d$-dimensional vector $\bar{\mathbf{h}}$. By the data processing inequality, the mean carries at most as much mutual information about the author identity $Y$ as the full token matrix:

$$I(Y;\bar{\mathbf{h}})\leq I(Y;\mathbf{H})\tag{10}$$

The information loss is strictly positive whenever $\bar{\mathbf{h}}$ is not a sufficient statistic for $Y$. For instance, two passages with identical function-word frequencies but different function-word orderings are indistinguishable under mean pooling (which is permutation-invariant) but distinguishable under $\mathrm{MaxSim}$ (which preserves positional structure). The information loss is therefore not only theoretical.

Table 1: Alignment $\alpha$ and uniformity $u$ per model, from Kulumba et al. ([2025](https://arxiv.org/html/2605.19908#bib.bib33)). Lower is better for both.

This capacity gap is reflected in the alignment–uniformity tradeoff (Table [1](https://arxiv.org/html/2605.19908#S3.T1)). Mean pooling achieves the best uniformity because averaging naturally spreads representations. But it achieves the weakest alignment because it destroys the fine-grained signal needed to cluster same-author passages tightly. LI achieves the tightest alignment because token-level comparison preserves discriminative detail, but the weakest uniformity because the sparse gradient does not prevent representation collapse as aggressively.

## 4 Experimental Setup

We design a controlled analysis that isolates the scoring mechanism: every model shares one backbone, one corpus, and one loss, differing only in how they turn token representations into a scalar similarity\.

### 4.1 Models

Every model shares a ModernBERT-base backbone (Warner et al., [2025](https://arxiv.org/html/2605.19908#bib.bib17)) with 23 transformer layers, 149M parameters, and a hidden size of 768. Unless stated otherwise, we use the base-4 split of HALvest-Contrastive (Kulumba et al., [2025](https://arxiv.org/html/2605.19908#bib.bib33)), a scholarly corpus in which the anchor and positive are drawn from different papers by the same author-set, and the negative is mined from within the same disciplinary field. This design ensures that topical similarity does not confound authorship signal: the model cannot rely on vocabulary overlap to distinguish positives from negatives.

- Layerwise uses layerwise attention pooling followed by mean pooling and cosine scoring. We use layerwise attention in addition to mean pooling to match the state of the art (Kantharuban et al., [2026](https://arxiv.org/html/2605.19908#bib.bib2)). In prior work, layerwise attention adds only a marginal performance gain over raw mean pooling, indicating that the learned layer weights do not overcome the single-vector bottleneck analyzed in §[3.2](https://arxiv.org/html/2605.19908#S3.SS2). The gradient with respect to each token still passes through the mean, so the $1/m$ uniform-gradient analysis applies up to a layer-dependent reweighting factor.
- LI uses token-level $\mathrm{MaxSim}$ with punctuation and padding masked.
- PLI $n=2$ uses bigram patch-level $\mathrm{MaxSim}$.
- E5 zero-shot (Wang et al., [2024](https://arxiv.org/html/2605.19908#bib.bib26)) is included as an off-the-shelf control model. E5 was trained for retrieval, and to a greater extent semantic matching, so its similarity scores are decorrelated from those of models trained for AA (Kulumba et al., [2025](https://arxiv.org/html/2605.19908#bib.bib33); Kantharuban et al., [2026](https://arxiv.org/html/2605.19908#bib.bib2)).

Table 2: HALvest-Contrastive base-4 retrieval performance from Kulumba et al. ([2025](https://arxiv.org/html/2605.19908#bib.bib33)). E5 was tested using mean pooling.

Table [2](https://arxiv.org/html/2605.19908#S4.T2) summarizes retrieval performance. The four-fold Recall@20 gap between mean pooling and LI is the empirical observation we aim to study.

### 4.2 Probe set construction

![Refer to caption](https://arxiv.org/html/2605.19908v1/x2.png)

Figure 2: Token length distributions for positive (blue) and negative (orange) passages across the three tiers. All passages cluster around the 130-token target.

We conduct our analysis on a small, controlled set of 148 triplets rather than on the full retrieval benchmark. Using a curated probe set allows us to control for confounds (passage length, domain overlap). Triplets are drawn from the HALvest-Contrastive base-4 validation split, from the ten most frequent author-sets that have at least four distinct documents. Passages target a fixed length of 130 tokens (Figure [2](https://arxiv.org/html/2605.19908#S4.F2)), and the positive and negative within each triplet are constrained to differ by at most five tokens after tokenization. Triplets are stratified into three tiers that vary the relationship between the anchor and the negative:

- Tier A ($n=50$): the anchor and positive share the same author-set. The negative is written by a completely disjoint author-set from the same scholarly domain. This is the baseline: the model must rely on stylistic signal to distinguish the positive from a topically similar negative written by entirely different authors.
- Tier B ($n=50$): the anchor and positive share the same author-set. The negative is written by a partially overlapping author-set that shares at least one author with the anchor’s team but is not identical to it. The shared author contributes stylistic signal to both passages, creating a confound. This tier tests whether the model can distinguish full author-set matches from partial ones.
- Tier C ($n=48$): the anchor and positive share the same author-set but come from different scholarly domains (anchor in domain $D_{1}$, positive in domain $D_{2}$). The negative is written by a disjoint author-set from the anchor’s domain $D_{1}$. This tests cross-domain authorship recognition: can the model identify the same authors when the vocabulary and conventions shift between disciplines?

Table 3: Failure rates and effective sample sizes ($n_{+}$ = correctly-ranked triplets used for patching) per tier. Tier B is the hardest due to the shared-author confound. Interaction models are more robust than mean pooling across all tiers.

Residual patching is only applied to triplets that are correctly ranked (those where the clean model scores the positive above the negative). The effective sample sizes therefore vary by tier and model (Table [3](https://arxiv.org/html/2605.19908#S4.T3)).

### 4.3 Analyses

We apply four analyses to all three fine\-tuned models\.

1. LISA probes are linear regressors trained on a separate 10,000-passage corpus and evaluated on a 2,000-passage held-out set; they measure feature availability at each of the 23 layers.
2. Residual stream patching measures the causal contribution of each layer via rank recovery (Equation [5](https://arxiv.org/html/2605.19908#S2.E5)) across the 148 probe-set triplets.
3. Score sensitivity computes the average absolute score change $\overline{|s_{\text{patched}}^{(\ell)}-s_{\text{corrupt}}|}$ per layer, a raw measure of how much the scoring function’s output responds to restoring a single layer.
4. Training dynamics apply patching to eight checkpoints per model (steps 0, 500, 1500, 3000, 5000, 10000, 20000, and final) to track how the depth profile develops during training. This isolates what contrastive fine-tuning adds.
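The score sensitivity measure in the list above reduces to a per-layer mean of absolute differences. A minimal sketch, assuming the patched and corrupt scores have already been computed and stored in nested lists; the function name, data layout, and numbers are hypothetical.

```python
def score_sensitivity(patched_by_layer, corrupt_scores):
    """Mean |s_patched - s_corrupt| per layer.

    patched_by_layer[l][t] -- patched score of triplet t when layer l is restored
    corrupt_scores[t]      -- corrupt score s(a, n) of triplet t
    """
    profile = []
    for patched in patched_by_layer:
        diffs = [abs(p - c) for p, c in zip(patched, corrupt_scores)]
        profile.append(sum(diffs) / len(diffs))
    return profile

# Hypothetical scores for two triplets and three layers.
corrupt = [0.20, 0.30]
patched = [
    [0.20, 0.30],   # layer 0: restoring it changes nothing
    [0.30, 0.20],   # layer 1: small shifts in opposite directions
    [0.60, 0.70],   # layer 2: large shift
]
profile = score_sensitivity(patched, corrupt)
print(profile)
```

Because the absolute value is taken before averaging, shifts in opposite directions (layer 1 in the toy data) do not cancel out.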

## 5 Results

Probing, causal patching, score sensitivity, and training dynamics point to the same conclusion: the performance gap does not arise from what the encoder learns, but from where and how the scorer reads it out\.

### 5.1 Feature availability is invariant across models

![Refer to caption](https://arxiv.org/html/2605.19908v1/x3.png)(a) Layerwise (mean pooling)
![Refer to caption](https://arxiv.org/html/2605.19908v1/x4.png)(b) Late Interaction
![Refer to caption](https://arxiv.org/html/2605.19908v1/x5.png)(c) PLI $n=2$

Figure 3: LISA probe $R^{2}$ heatmaps at the final checkpoint. Rows are stylistic feature categories. Columns are encoder layers. The three fine-tuned models produce nearly identical heatmaps. Word length is the most readable feature ($R^{2}\approx 0.57$), followed by capitalization rate, type–token ratio, and punctuation density.

We begin with the question of availability. If the four-fold performance gap between mean pooling and LI arises because LI causes the encoder to learn better stylistic representations, then the LISA probes should show higher $R^{2}$ for LI than for mean pooling, at least at some layers. This is not the case. Figure [3](https://arxiv.org/html/2605.19908#S5.F3) shows the probe heatmaps for all three fine-tuned models. The heatmaps are visually indistinguishable. The top features (word length, capitalization, type–token ratio, punctuation density, and function-word frequency) achieve the same $R^{2}$ at the same layers across all models. The E5 control produces a visually indistinguishable pattern. Stylistic readability is a property of the pretrained backbone.

This rules out the first hypothesis from the introduction\. The encoder does not learn different stylistic representations under different scorers\. The pretrained ModernBERT backbone already encodes these features and contrastive fine\-tuning does not create them, regardless of the scoring function\. The four\-fold performance gap is therefore more plausibly explained by differences in how these features are used than by differences in what the encoder learned\.

### 5.2 Causal patching reveals a scoring-dependent depth profile

![Refer to caption](https://arxiv.org/html/2605.19908v1/x6.png)(a) Tier A (same domain)
![Refer to caption](https://arxiv.org/html/2605.19908v1/x7.png)(b) Tier B (shared-author confound)
![Refer to caption](https://arxiv.org/html/2605.19908v1/x8.png)(c) Tier C (cross-domain)

Figure 4: Rank recovery across the three models. Each panel shows one tier. Purple: layerwise (mean pooling), orange: LI, green: PLI $n=2$. Dashed line: chance (0.5). Mean pooling crosses chance at layer 9, while both interaction models cross at layers 14–16. The six-layer gap is consistent across all three tiers.

#### Layerwise (mean pooling)

follows an S-shape. The curve crosses chance at approximately layer 9 and reaches near-perfect recovery by layer 13. This pattern is consistent across all three tiers. On Tier C, all models show slightly above-chance performance at the very first layers (0–2). This is consistent with early layers encoding shallow syntactic statistics (Jawahar et al., [2019](https://arxiv.org/html/2605.19908#bib.bib11)) that carry distributional authorship signal even when domain-specific vocabulary shifts. In Tiers A and B, topical overlap between anchor and negative may mask this early signal.

#### Late interaction

shows a qualitatively similar S-curve but with a later inflection. Rank recovery stays below random guess until approximately layer 15, then steeply rises to $\geq 0.90$ by layer 20. The below-chance dip at layers 3–12 is deeper than for layerwise (recovery $\approx 0.3$–$0.4$): corrupting these layers actively misleads the token-level $\mathrm{MaxSim}$ scoring.

#### PLI $n{=}2$

tracks LI closely\. The inflection falls at layers 14–16, effectively indistinguishable from LI given the sample size\.

#### We define the consolidation point

as the earliest layer at which rank recovery exceeds 0.75. By this criterion, mean pooling consolidates at layer 10, while LI and PLI consolidate at layers 16 and 15 respectively. This matches the prediction from §[3.2](https://arxiv.org/html/2605.19908#S3.SS2): dense, uniform gradients force early consolidation, while sparse, selective gradients allow for late consolidation. PLI $n{=}2$ does not interpolate between the two. It falls squarely in the interaction regime, consistent with the patch $\operatorname{argmax}$'s selection dominating the intra-patch averaging (Equation [9](https://arxiv.org/html/2605.19908#S3.E9)).
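The consolidation-point criterion can be sketched in a few lines. The function name `consolidation_point` and the toy curve are illustrative assumptions. The 0.75 threshold is the one stated above.

```python
from typing import Optional, Sequence


def consolidation_point(rank_recovery: Sequence[float],
                        threshold: float = 0.75) -> Optional[int]:
    """Return the earliest layer index whose rank recovery exceeds `threshold`.

    `rank_recovery[l]` is the mean rank recovery when patching layer l.
    Returns None if no layer exceeds the threshold.
    """
    for layer, value in enumerate(rank_recovery):
        if value > threshold:
            return layer
    return None


# Made-up S-shaped curve for illustration only (not data from the paper).
toy_curve = [0.45, 0.40, 0.35, 0.42, 0.55, 0.70, 0.80, 0.95, 1.00]
print(consolidation_point(toy_curve))  # -> 6
```

Because the criterion is a strict threshold crossing, a curve that dips back below 0.75 after its first crossing would still be assigned the earlier layer. The curves described above are S-shaped, so this should matter little in practice.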

### 5.3 Score sensitivity confirms two regimes

![Refer to caption](https://arxiv.org/html/2605.19908v1/x9.png) (a) Tier A
![Refer to caption](https://arxiv.org/html/2605.19908v1/x10.png) (b) Tier B
![Refer to caption](https://arxiv.org/html/2605.19908v1/x11.png) (c) Tier C

Figure 5: Score sensitivity per layer. Mean $|s_{\text{patched}}^{(\ell)} - s_{\text{corrupt}}|$ when restoring clean activations at layer $\ell$. LI (orange) is most sensitive, PLI (green) is intermediate, layerwise (purple) is an order of magnitude lower.

Score sensitivity provides a complementary view: rather than asking whether patching recovers the correct ranking, it asks how much the score changes in absolute terms (Figure [5](https://arxiv.org/html/2605.19908#S5.F5)). The ordering is consistent across all tiers: LI is most sensitive, PLI is intermediate, and layerwise is an order of magnitude lower. Mean pooling compresses representations so heavily that restoring a single layer barely moves the mean. $\mathrm{MaxSim}$ reads individual tokens, so a layer-level perturbation can change which tokens are selected by the $\operatorname{argmax}$, producing a large score shift. PLI sits 10–20% below LI, consistent with intra-patch averaging partially smoothing perturbations before the patch-level $\operatorname{argmax}$.
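As a sketch of the quantity plotted in Figure 5, the per-layer score sensitivity is the mean absolute difference between patched and corrupted scores. The array layout and the toy numbers below are assumptions made for illustration.

```python
import numpy as np


def score_sensitivity(s_patched: np.ndarray, s_corrupt: np.ndarray) -> np.ndarray:
    """Mean |s_patched^(l) - s_corrupt| over triplets, one value per layer.

    s_patched: (n_triplets, n_layers) score after restoring clean activations
               at layer l (assumed layout).
    s_corrupt: (n_triplets,) score of the fully corrupted run.
    """
    return np.abs(s_patched - s_corrupt[:, None]).mean(axis=0)


# Toy values for two triplets and two layers (not data from the paper).
toy_corrupt = np.array([0.2, 0.4])
toy_patched = np.array([[0.2, 0.5],
                        [0.4, 0.9]])
toy_sensitivity = score_sensitivity(toy_patched, toy_corrupt)
print(toy_sensitivity)
```

In this toy case, patching the first layer leaves both scores unchanged and patching the second shifts them by 0.3 and 0.5, so the per-layer sensitivities are 0 and 0.4.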

### 5.4 Training dynamics reveal three learning trajectories

![Refer to caption](https://arxiv.org/html/2605.19908v1/x12.png) (a) Layerwise (mean pooling): top-down monotonic. Upper layers learn first, the inflection migrates downward during training.
![Refer to caption](https://arxiv.org/html/2605.19908v1/x13.png) (b) Late Interaction: transient early-layer spike at step 1500, then signal migration to upper layers. The model initially exploits shallow lexical matches, then suppresses them.
![Refer to caption](https://arxiv.org/html/2605.19908v1/x14.png) (c) PLI $n{=}2$: gradual emergence, no transient spike. The final checkpoint shows a distinctive mid-layer hump (layers 10–15) absent in both other models.

Figure 6: Training dynamics. Mean percentage recovery across Tier A triplets at eight checkpoints. Each subplot is one checkpoint. x-axis: layer index; y-axis: mean recovery. Percentage recovery is used here because rank recovery is binary and too coarse to track gradual signal emergence at early checkpoints. The y-axis extremes reflect the known instability of percentage recovery (§[2.5](https://arxiv.org/html/2605.19908#S2.SS5)).

The patching analysis so far shows the final-checkpoint depth profile. To understand how that profile develops, we apply the same analysis to intermediate checkpoints (Figure [6](https://arxiv.org/html/2605.19908#S5.F6)).
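The contrast between the two metrics mentioned in the Figure 6 caption can be illustrated with a small sketch. The paper's own definitions are in §2.5 and are not reproduced here. The code below assumes a common normalization from the activation-patching literature, applied to a hypothetical triplet margin (positive score minus negative score), and all numbers are made up.

```python
import numpy as np


def percentage_recovery(m_clean, m_corrupt, m_patched):
    """Fraction of the clean-vs-corrupt margin gap restored by patching.

    Assumed form: (patched - corrupt) / (clean - corrupt). The ratio is
    unstable when the clean and corrupt margins are close to each other.
    """
    return (m_patched - m_corrupt) / (m_clean - m_corrupt)


def rank_recovery(m_patched):
    """Binary per-triplet outcome: 1 if the positive outranks the negative."""
    return (m_patched > 0).astype(float)


# Two hypothetical triplets; the second has a near-zero clean/corrupt gap.
margin_clean = np.array([0.50, 0.02])
margin_corrupt = np.array([-0.30, 0.01])
margin_patched = np.array([0.10, 0.06])

pct = percentage_recovery(margin_clean, margin_corrupt, margin_patched)
rank = rank_recovery(margin_patched)
print(pct, rank)
```

The first triplet recovers half of its gap, while the second yields a ratio of about 5 because its denominator is only 0.01. This is the kind of extreme value the caption attributes to the instability of percentage recovery. Rank recovery reports 1 for both and cannot distinguish them.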

#### Mean pooling \(Figure[6\(a\)](https://arxiv.org/html/2605.19908#S5.F6.sf1)\)

learns top-down. At step 500, recovery is concentrated at the uppermost layers. As training progresses, the inflection migrates downward: layer 15 by step 3000, layer 13 by step 10,000, layer 9 at the final checkpoint. The model progressively recruits earlier layers so that consolidation happens sooner in the stack, consistent with the consolidation bottleneck. The dense gradient initially refines the layers closest to the scoring function, then gradually shapes earlier layers.

#### Late interaction \(Figure[6\(b\)](https://arxiv.org/html/2605.19908#S5.F6.sf2)\)

shows a distinctive behavior. At step 1500, recovery spikes at layers 5–10. This suggests the model initially exploits shallow lexical matches: $\mathrm{MaxSim}$ can propagate gradient through exact token matches at negligible cost, providing a cheap authorship signal from lower layers. As hard negatives increase in difficulty during training, this shortcut becomes insufficient, and the model shifts to deeper, more contextualized representations. By step 5000, this transient behavior is suppressed and recovery concentrates at layers 19+. The model learns to defer to deeper, more abstract representations, abandoning the shallow shortcut.

#### PLI $n{=}2$ (Figure [6(c)](https://arxiv.org/html/2605.19908#S5.F6.sf3))

Bigram-patch $\mathrm{MaxSim}$ shows a third pattern with no early spike: the intra-patch averaging smooths out the shallow matches that LI exploits. Recovery emerges gradually at the upper layers. The final checkpoint shows a mid-layer hump (layers 10–15) unique to PLI, possibly reflecting the two-level structure of its gradient (Equation [9](https://arxiv.org/html/2605.19908#S3.E9)). Mid-layer patch representations carry authorship signal that neither the token-level first moment (mean pooling) nor the individual tokens ($\mathrm{MaxSim}$) would use.

## 6 Related Work

#### Authorship attribution\.

Neural AA has evolved from classification (Burrows, [2002](https://arxiv.org/html/2605.19908#bib.bib21); Schler et al., [2006](https://arxiv.org/html/2605.19908#bib.bib22)) to contrastive learning (Wegmann et al., [2022](https://arxiv.org/html/2605.19908#bib.bib1); Kantharuban et al., [2026](https://arxiv.org/html/2605.19908#bib.bib2); Huertas-Tato et al., [2024](https://arxiv.org/html/2605.19908#bib.bib4)), with increasing focus on topic confounding (Wegmann and Nguyen, [2021](https://arxiv.org/html/2605.19908#bib.bib23); Rivera-Soto et al., [2021](https://arxiv.org/html/2605.19908#bib.bib13)). Our work is not the AA community's first attempt at interpretability (Alshomary et al., [2025b](https://arxiv.org/html/2605.19908#bib.bib24), [a](https://arxiv.org/html/2605.19908#bib.bib25)), but it is, to the best of our knowledge, the first to use mechanistic interpretability tools and gradient analysis to account for the performance and training behavior of encoder models.

#### Probing versus causal analysis\.

Linear probes (Belinkov, [2022](https://arxiv.org/html/2605.19908#bib.bib10)) are widely used to study what information neural representations encode, but the link between probe accuracy and actual model behavior is contested (Hewitt and Liang, [2019](https://arxiv.org/html/2605.19908#bib.bib18); Ravichander et al., [2021](https://arxiv.org/html/2605.19908#bib.bib19)). Activation patching (Vig et al., [2020](https://arxiv.org/html/2605.19908#bib.bib7); Meng et al., [2022](https://arxiv.org/html/2605.19908#bib.bib15); Wang et al., [2023](https://arxiv.org/html/2605.19908#bib.bib20)) provides a causal alternative: it asks whether information is necessary, not merely decodable. Our *availability* against *use* dissociation contributes to this debate by showing that all probed features are equally available across models with very different task performance.

## 7 Discussion

The availability–use dissociation reframes AA as an information readout problem\. In this setup, the pretrained encoder already encodes the stylistic features we probe\. What differs is whether the scoring function can access them at the right depth and with enough capacity\.

#### Availability against use\.

Probing accuracy is a poor proxy for task performance when models differ in their scoring mechanism. All four models, three fine-tuned and one off-the-shelf control, produce nearly identical probe heatmaps (Figure [3](https://arxiv.org/html/2605.19908#S5.F3)) while differing dramatically in retrieval performance (Table [2](https://arxiv.org/html/2605.19908#S4.T2)). The main question is not, therefore, which model encodes more stylistic information, but which scoring mechanism can effectively read it out. Path patching at the attention-head level (Goldowsky-Dill et al., [2023](https://arxiv.org/html/2605.19908#bib.bib6)) could further localize how stylistic signal flows through the encoder, though this addresses a finer-grained question than the one considered here.

#### Why interaction beats pooling\.

The gradient analysis (§[3.1](https://arxiv.org/html/2605.19908#S3.SS1)) and the information-theoretic argument (§[3.3](https://arxiv.org/html/2605.19908#S3.SS3)) converge: mean pooling discards higher-order structure by compressing to a single vector, while $\mathrm{MaxSim}$ preserves token-level granularity. The causal depth profiles confirm this: consolidation at layer 9 versus layers 15–16. The probe results (Figure [3](https://arxiv.org/html/2605.19908#S5.F3)), patching curves (Figure [4](https://arxiv.org/html/2605.19908#S5.F4)), score sensitivity analysis (Figure [5](https://arxiv.org/html/2605.19908#S5.F5)), and training dynamics (Figure [6](https://arxiv.org/html/2605.19908#S5.F6)) all point to the same explanation.
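The structural difference between the three scorers can be sketched as follows. This is not the paper's implementation (its definitions are in §2.2 and Equation 9). The sketch assumes a ColBERT-style $\mathrm{MaxSim}$ over cosine similarities, averages rather than sums the per-token maxima so that scores are on a comparable scale, and forms patches from non-overlapping consecutive tokens. All function names are hypothetical.

```python
import numpy as np


def l2_normalize(x, axis=-1):
    """Scale vectors to unit length along `axis`."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)


def mean_pool_score(a, b):
    """Cosine similarity between the mean token vectors of two texts."""
    return float(l2_normalize(a.mean(axis=0)) @ l2_normalize(b.mean(axis=0)))


def maxsim_score(a, b):
    """For each token of `a`, keep its best-matching token of `b`, then average."""
    sim = l2_normalize(a) @ l2_normalize(b).T   # (len_a, len_b) cosine matrix
    return float(sim.max(axis=1).mean())


def to_patches(x, n=2):
    """Average each run of n consecutive tokens (assumed non-overlapping)."""
    usable = (x.shape[0] // n) * n              # drop a trailing remainder
    return x[:usable].reshape(-1, n, x.shape[1]).mean(axis=1)


def patch_maxsim_score(a, b, n=2):
    """MaxSim applied to patch vectors instead of token vectors."""
    return maxsim_score(to_patches(a, n), to_patches(b, n))


# Random stand-ins for token embeddings: 6 and 8 tokens, hidden size 4.
rng = np.random.default_rng(0)
text_a = rng.normal(size=(6, 4))
text_b = rng.normal(size=(8, 4))
print(mean_pool_score(text_a, text_b),
      maxsim_score(text_a, text_b),
      patch_maxsim_score(text_a, text_b))
```

Mean pooling sees each text only through one vector, whereas the two interaction scorers select a best match per token or per patch, which is the selection step the gradient analysis attributes the sparse gradient to.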

#### PLI in the interaction regime\.

PLI $n=2$ falls in the same causal regime as LI, with nearly identical recovery inflections. This suggests that the patch-level $\operatorname{argmax}$ dominates the effect of local averaging inside each patch. The alignment and uniformity results (Table [1](https://arxiv.org/html/2605.19908#S3.T1)) are also consistent with this interpretation, since PLI remains much closer to LI than to mean pooling in embedding-space geometry. Whether larger patches shift consolidation earlier remains open, though the theory predicts that they should gradually approach the pooling regime.

Overall, the results suggest that the main bottleneck in contrastive authorship attribution is not whether stylistic information exists in the encoder, but whether the scoring mechanism can preserve and exploit it\. The pretrained backbone already contains strong stylistic structure before fine\-tuning\. The scoring mechanism determines where that structure becomes causally necessary for the task\.

## Limitations

#### Backbone choice\.

We fix the backbone to ModernBERT to control for architecture, since our goal is to isolate how the scoring mechanism shapes signal consolidation. The specific inflection layers we observe, such as layer 9 for pooling and layers 15 to 16 for interaction, may shift in other architectures. The qualitative gap between early consolidation under mean pooling and later consolidation under interaction should, however, transfer. Testing a second backbone, such as RoBERTa (Liu et al., [2019](https://arxiv.org/html/2605.19908#bib.bib27)), would be useful for architectural generality, but it is orthogonal to the main question of this paper.

#### Patch\-level interaction\.

We study only $n=2$ for PLI. This keeps the analysis focused on the contrast between pooling and interaction while still giving us a middle regime to compare against LI and mean pooling. The theory suggests that larger patches should move the inflection earlier, closer to the pooling regime. Exploring $n=3,4,5$ would be a natural extension, but it is not necessary for the main result reported here.

#### Probe set size\.

The 148 triplets are enough to resolve the six\-layer gap, but they are too small for fine\-grained LI versus PLI comparisons\. Bootstrap confidence intervals may not separate a one to two layer difference cleanly\. The high failure rate on Tier B also leaves only 28 to 33 correctly ranked triplets, which makes those curves noisier than Tiers A and C\. The main qualitative result is stable across all three tiers, but finer distinctions between LI and PLI remain below our statistical resolution\.
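A percentile bootstrap over triplets is one way to attach an interval to a consolidation layer. The sketch below is an assumption about how such an interval could be computed, not the procedure used in the paper. The function names, the number of resamples, and the toy recovery matrix are hypothetical.

```python
import numpy as np


def first_layer_above(curve, threshold=0.75):
    """Earliest layer whose value exceeds `threshold`.

    Returns len(curve) as a sentinel when no layer crosses the threshold.
    """
    above = np.flatnonzero(curve > threshold)
    return int(above[0]) if above.size else len(curve)


def bootstrap_consolidation_ci(recovery, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the consolidation layer.

    recovery: (n_triplets, n_layers) binary rank-recovery outcomes.
    Triplets are resampled with replacement; layers are kept fixed.
    """
    rng = np.random.default_rng(seed)
    n_triplets = recovery.shape[0]
    points = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n_triplets, size=n_triplets)
        points[b] = first_layer_above(recovery[idx].mean(axis=0))
    low, high = np.quantile(points, [alpha / 2, 1 - alpha / 2])
    return float(low), float(high)


# Degenerate toy case: every triplet recovers from layer 5 onward.
toy_recovery = np.zeros((30, 22))
toy_recovery[:, 5:] = 1.0
toy_ci = bootstrap_consolidation_ci(toy_recovery)
print(toy_ci)  # -> (5.0, 5.0)
```

With noisy curves and only a few dozen triplets, intervals from two models can easily overlap when their consolidation layers differ by one or two, which is the resolution limit described above.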

## Acknowledgments

The authors are grateful to Djamé Seddah who indirectly inspired this work\. We also thank Wissam Antoun, Rian Touchent and Théo Lasnier for the productive discussions\. This work was partially realized on computing HPC and storage resources provided by IDRIS thanks to the grant GCDA1016807 on the DALIA supercomputer\.

## References

- B. Ai, Y. Wang, Y. Tan, and S. Tan (2022). Whodunit? learning to contrast for authorship attribution. In Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing, Volume 1: Long Papers, Y. He, H. Ji, S. Li, Y. Liu, and C. Chang (Eds.), Online only, pp. 1142–1157. [Link](https://aclanthology.org/2022.aacl-main.84/), [Document](https://dx.doi.org/10.18653/v1/2022.aacl-main.84)
- G. Alain and Y. Bengio (2017). Understanding intermediate layers using linear classifier probes. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Workshop Track Proceedings. [Link](https://openreview.net/forum?id=HJ4-rAVtl)
- M. Alshomary, N. Ri, M. Apidianaki, A. Patel, S. Muresan, and K. McKeown (2025a). Latent space interpretation for stylistic analysis and explainable authorship attribution. In Proceedings of the 31st International Conference on Computational Linguistics, O. Rambow, L. Wanner, M. Apidianaki, H. Al-Khalifa, B. D. Eugenio, and S. Schockaert (Eds.), Abu Dhabi, UAE, pp. 1124–1135. [Link](https://aclanthology.org/2025.coling-main.75/)
- M. Alshomary, N. R. Varimalla, V. Anand, S. Muresan, and K. McKeown (2025b). Layered insights: generalizable analysis of human authorial style by leveraging all transformer layers. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, C. Christodoulopoulos, T. Chakraborty, C. Rose, and V. Peng (Eds.), Suzhou, China, pp. 10279–10292. ISBN 979-8-89176-332-6. [Link](https://aclanthology.org/2025.emnlp-main.521/), [Document](https://dx.doi.org/10.18653/v1/2025.emnlp-main.521)
- Y. Belinkov (2022). Probing Classifiers: Promises, Shortcomings, and Advances. Computational Linguistics 48(1), pp. 207–219. [Link](https://aclanthology.org/2022.cl-1.7/), [Document](https://dx.doi.org/10.1162/coli%5Fa%5F00422)
- J. Burrows (2002). Delta: a measure of stylistic difference and a guide to likely authorship. Literary and Linguistic Computing 17(3), pp. 267–287. ISSN 0268-1145. [Link](https://doi.org/10.1093/llc/17.3.267), [Document](https://dx.doi.org/10.1093/llc/17.3.267)
- F. Cafiero and J. Camps (2019). Why molière most likely did write his plays. Science Advances 5(11), pp. eaax5489. [Document](https://dx.doi.org/10.1126/sciadv.aax5489)
- E. Dauber, A. Caliskan, R. Harang, G. Shearer, M. Weisman, F. Nelson, and R. Greenstadt (2019). Git blame who? stylistic authorship attribution of small, incomplete source code fragments. Proceedings on Privacy Enhancing Technologies 2019(3), pp. 389–408. ISSN 2299-0984. [Link](https://petsymposium.org/popets/2019/popets-2019-0053.php), [Document](https://dx.doi.org/10.2478/popets-2019-0053)
- J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019). BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1: Long and Short Papers, J. Burstein, C. Doran, and T. Solorio (Eds.), Minneapolis, Minnesota, pp. 4171–4186. [Link](https://aclanthology.org/N19-1423), [Document](https://dx.doi.org/10.18653/v1/N19-1423)
- N. Goldowsky-Dill, C. MacLeod, L. Sato, and A. Arora (2023). Localizing Model Behavior with Path Patching. arXiv. Note: arXiv:2304.05969 [cs]. [Link](http://arxiv.org/abs/2304.05969), [Document](https://dx.doi.org/10.48550/arXiv.2304.05969)
- J. Hewitt and P. Liang (2019). Designing and interpreting probes with control tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), K. Inui, J. Jiang, V. Ng, and X. Wan (Eds.), Hong Kong, China, pp. 2733–2743. [Link](https://aclanthology.org/D19-1275/), [Document](https://dx.doi.org/10.18653/v1/D19-1275)
- J. Huertas-Tato, A. Girón-Jiménez, A. Martín, and D. Camacho (2024). Isolating authorship from content with semantic embeddings and contrastive learning. arXiv. Note: arXiv:2411.18472 [cs]. [Link](http://arxiv.org/abs/2411.18472), [Document](https://dx.doi.org/10.48550/arXiv.2411.18472)
- G. Jawahar, B. Sagot, and D. Seddah (2019). What does BERT learn about the structure of language?. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, A. Korhonen, D. Traum, and L. Màrquez (Eds.), Florence, Italy, pp. 3651–3657. [Link](https://aclanthology.org/P19-1356/), [Document](https://dx.doi.org/10.18653/v1/P19-1356)
- A. Kantharuban, A. Srivastava, F. Faisal, O. Ahia, A. Anastasopoulos, D. Chiang, Y. Tsvetkov, and G. Neubig (2026). IDIOLEX: unified and continuous representations for idiolectal and stylistic variation. arXiv. Note: arXiv:2604.04704 [cs]. [Link](http://arxiv.org/abs/2604.04704), [Document](https://dx.doi.org/10.48550/arXiv.2604.04704)
- V. Kešelj, F. Peng, N. Cercone, and C. Thomas (2003). N-gram-based author profiles for authorship attribution. In Proceedings of the Conference of the Pacific Association for Computational Linguistics, pp. 255–264.
- O. Khattab and M. Zaharia (2020). ColBERT: efficient and effective passage search via contextualized late interaction over BERT. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '20, New York, NY, USA, pp. 39–48. ISBN 978-1-4503-8016-4. [Link](https://dl.acm.org/doi/10.1145/3397271.3401075), [Document](https://dx.doi.org/10.1145/3397271.3401075)
- F. Kulumba, W. Antoun, G. Vimont, L. Romary, and F. Cafiero (2025). HALvest-contrastive: retrieval-like authorship attribution with patch-level late interaction. arXiv:2407.20595. [Link](https://arxiv.org/abs/2407.20595)
- Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019). RoBERTa: a robustly optimized BERT pretraining approach. arXiv:1907.11692. [Link](https://arxiv.org/abs/1907.11692)
- K. Meng, D. Bau, A. Andonian, and Y. Belinkov (2022). Locating and editing factual associations in GPT. In Proceedings of the 36th International Conference on Neural Information Processing Systems, NIPS '22, Red Hook, NY, USA, pp. 17359–17372. ISBN 978-1-7138-7108-8.
- F. Mosteller and D. L. Wallace (1963). Inference in an authorship problem: a comparative study of discrimination methods applied to the authorship of the disputed federalist papers. Journal of the American Statistical Association 58(302), pp. 275–309.
- A. Ravichander, Y. Belinkov, and E. Hovy (2021). Probing the probing paradigm: does probing accuracy entail task relevance?. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, P. Merlo, J. Tiedemann, and R. Tsarfaty (Eds.), Online, pp. 3363–3377. [Link](https://aclanthology.org/2021.eacl-main.295/), [Document](https://dx.doi.org/10.18653/v1/2021.eacl-main.295)
- R. A. Rivera-Soto, O. E. Miano, J. Ordonez, B. Y. Chen, A. Khan, M. Bishop, and N. Andrews (2021). Learning universal authorship representations. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, M. Moens, X. Huang, L. Specia, and S. W. Yih (Eds.), Online and Punta Cana, Dominican Republic, pp. 913–919. [Link](https://aclanthology.org/2021.emnlp-main.70/), [Document](https://dx.doi.org/10.18653/v1/2021.emnlp-main.70)
- J. Schler, M. Koppel, S. Argamon, and J. W. Pennebaker (2006). Effects of age and gender on blogging. In AAAI Spring Symposium: Computational Approaches to Analyzing Weblogs, Vol. 6, pp. 199–205.
- I. Tenney, D. Das, and E. Pavlick (2019). BERT rediscovers the classical NLP pipeline. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, A. Korhonen, D. Traum, and L. Màrquez (Eds.), Florence, Italy, pp. 4593–4601. [Link](https://aclanthology.org/P19-1452/), [Document](https://dx.doi.org/10.18653/v1/P19-1452)
- A. van den Oord, Y. Li, and O. Vinyals (2019). Representation learning with contrastive predictive coding. arXiv:1807.03748. [Link](https://arxiv.org/abs/1807.03748)
- A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017). Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS'17, Red Hook, NY, USA, pp. 6000–6010. ISBN 9781510860964.
- J. Vig, S. Gehrmann, Y. Belinkov, S. Qian, D. Nevo, Y. Singer, and S. Shieber (2020). Investigating gender bias in language models using causal mediation analysis. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (Eds.), Vol. 33, pp. 12388–12401. [Link](https://proceedings.neurips.cc/paper_files/paper/2020/file/92650b2e92217715fe312e6fa7b90d82-Paper.pdf)
- K. R. Wang, A. Variengien, A. Conmy, B. Shlegeris, and J. Steinhardt (2023). Interpretability in the wild: a circuit for indirect object identification in GPT-2 small. In The Eleventh International Conference on Learning Representations. [Link](https://openreview.net/forum?id=NpsVSN6o4ul)
- L. Wang, N. Yang, X. Huang, B. Jiao, L. Yang, D. Jiang, R. Majumder, and F. Wei (2024). Text embeddings by weakly-supervised contrastive pre-training. arXiv:2212.03533. [Link](https://arxiv.org/abs/2212.03533)
- T. Wang and P. Isola (2020). Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In Proceedings of the 37th International Conference on Machine Learning, ICML '20.
- B. Warner, A. Chaffin, B. Clavié, O. Weller, O. Hallström, S. Taghadouini, A. Gallagher, R. Biswas, F. Ladhak, T. Aarsen, G. T. Adams, J. Howard, and I. Poli (2025). Smarter, better, faster, longer: a modern bidirectional encoder for fast, memory efficient, and long context finetuning and inference. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, Volume 1: Long Papers, W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar (Eds.), Vienna, Austria, pp. 2526–2547. ISBN 979-8-89176-251-0. [Link](https://aclanthology.org/2025.acl-long.127/), [Document](https://dx.doi.org/10.18653/v1/2025.acl-long.127)
- A. Wegmann and D. Nguyen (2021). Does it capture Stel? a modular, similarity-based linguistic style evaluation framework. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, M. Moens, X. Huang, L. Specia, and S. W. Yih (Eds.), Online and Punta Cana, Dominican Republic, pp. 7109–7130. [Link](https://aclanthology.org/2021.emnlp-main.569/), [Document](https://dx.doi.org/10.18653/v1/2021.emnlp-main.569)
- A. Wegmann, M. Schraagen, and D. Nguyen (2022). Same author or just same topic? towards content-independent style representations. In Proceedings of the 7th Workshop on Representation Learning for NLP, S. Gella, H. He, B. P. Majumder, B. Can, E. Giunchiglia, S. Cahyawijaya, S. Min, M. Mozes, X. L. Li, I. Augenstein, A. Rogers, K. Cho, E. Grefenstette, L. Rimell, and C. Dyer (Eds.), Dublin, Ireland, pp. 249–268. [Link](https://aclanthology.org/2022.repl4nlp-1.26/), [Document](https://dx.doi.org/10.18653/v1/2022.repl4nlp-1.26)
- F. Zhang and N. Nanda (2023). Towards Best Practices of Activation Patching in Language Models: Metrics and Methods. In The Twelfth International Conference on Learning Representations. [Link](https://openreview.net/forum?id=Hf17y6u9BC)

## Appendix A Top LISA features across models

Table 4: Top-5 LISA features by peak $R^2$ across layers. All four models surface the same feature family with highly similar probe performance.

Table [4](https://arxiv.org/html/2605.19908#A1.T4) reports the top-5 LISA features by peak $R^2$ for each model. The rankings are nearly identical: mean word length dominates in all four models ($R^2 \approx 0.576$–$0.580$), followed by function-word frequencies and punctuation density. The control E5 encoder, which has never been trained on authorship data, achieves the same $R^2$ values as the three fine-tuned models, confirming that these stylistic features are linearly readable from the pretrained backbone and are not created via fine-tuning.
