Style or Content? Evaluating Style Classifiers with Controlled Content Overlap

Summary

This paper introduces a controlled content overlap setup using parallel Bible translations to evaluate how much style classifiers rely on content cues rather than actual style features. Results show that low-overlap models degrade when content cues are removed, while high-overlap models transfer more robustly.

arXiv:2606.07103v1

# Style or Content? Evaluating Style Classifiers with Controlled Content Overlap
Source: [https://arxiv.org/html/2606.07103](https://arxiv.org/html/2606.07103)
###### Abstract

Style classifiers can use content cues that correlate with style labels in naturally collected data, yet we lack a systematic way to measure this reliance. We study this problem with a controlled content overlap setup built on parallel Bible translations. Specifically, we define the overlap parameter $\alpha$ as the normalized residual of mutual information between content identity and style label, so that it measures how much content is shared across style classes: from no shared content ($\alpha=0$) to fully shared content ($\alpha=1$). Cross-overlap evaluation of RoBERTa-based classifiers shows that low-overlap models degrade when content cues are removed, while high-overlap models transfer more robustly. A cross-style content retrieval probe further shows that content becomes less recoverable as $\alpha$ increases, with training dynamics showing this removal occurs gradually. Together, these results suggest that controlled overlap provides a simple diagnostic for separating style learning from content shortcuts.

Zhuo Liu, Haozheng Du, Xiangxiang Xu, Hangfeng He

University of Rochester

zhuo.liu@rochester.edu

## 1 Introduction

Style classification aims to identify how a text is written rather than what it is about. It supports a wide range of natural language processing (NLP) applications (Stamatatos, [2009](https://arxiv.org/html/2606.07103#bib.bib1); Jin et al., [2022](https://arxiv.org/html/2606.07103#bib.bib9)), including authorship analysis (Mikros and Argiri, [2007](https://arxiv.org/html/2606.07103#bib.bib14)), genre classification, stylistic rewriting (Briakou et al., [2021](https://arxiv.org/html/2606.07103#bib.bib16)), and controllable text generation. Recent classifiers based on pretrained language models achieve strong results on standard benchmarks (Carlson et al., [2018](https://arxiv.org/html/2606.07103#bib.bib4)), yet high performance alone does not mean that they have learned transferable stylistic patterns (Geirhos et al., [2020](https://arxiv.org/html/2606.07103#bib.bib2); Altakrori et al., [2021](https://arxiv.org/html/2606.07103#bib.bib3)).

A key challenge is that style and content are often entangled in naturally collected data. Texts with different style labels may also differ in topics, entities, events, or domains, which allows a classifier to exploit content cues that are predictive of the label (Altakrori et al., [2021](https://arxiv.org/html/2606.07103#bib.bib3)). Standard held-out evaluation may fail to diagnose this behavior: when the same content–style association is preserved across training and test splits, a shortcut-based model (Geirhos et al., [2020](https://arxiv.org/html/2606.07103#bib.bib2); Niven and Kao, [2019](https://arxiv.org/html/2606.07103#bib.bib5)) can perform well while failing to learn style features that generalize across content.

Prior work has shown that NLP models can rely on unintended correlations in the data, including shortcut learning in general (Geirhos et al., [2020](https://arxiv.org/html/2606.07103#bib.bib2); McCoy et al., [2019](https://arxiv.org/html/2606.07103#bib.bib6); Zhou et al., [2024](https://arxiv.org/html/2606.07103#bib.bib13)) and topic-based shortcuts in style-related tasks (Altakrori et al., [2021](https://arxiv.org/html/2606.07103#bib.bib3)). Probing studies have further examined what information is captured by learned representations (Niven and Kao, [2019](https://arxiv.org/html/2606.07103#bib.bib5); Conneau et al., [2018](https://arxiv.org/html/2606.07103#bib.bib10)). Existing evaluations can reveal shortcut behavior, but they usually do not control the strength of the content shortcut. As a result, it remains unclear when a model moves from relying on content cues to learning content-invariant style representations.

To address this gap, we study style classification under controlled content overlap. The evaluation is built from parallel texts, where the same content can be expressed in different styles, while the amount of content shared across style labels is systematically varied. We use an overlap parameter $\alpha = 1 - I(C;S)/H(S)$ to control this variation, where $C$ is content identity and $S$ is the style label. This parameter measures how much content is shared across style classes, and is independent of the number of style classes. At $\alpha=0$, each style is associated with distinct content, so content alone can predict the style label. At $\alpha=1$, all styles share the same content, so content no longer provides label information. English Bible translations (Carlson et al., [2018](https://arxiv.org/html/2606.07103#bib.bib4); Christodouloupoulos and Steedman, [2015](https://arxiv.org/html/2606.07103#bib.bib15)) provide an aligned testbed for this setup, with shared content expressed across multiple stylistic variants.

Our contributions are threefold. First, we introduce an information-based measure of controlled content overlap $\alpha$ that quantifies the strength of content shortcuts independently of the number of style classes. Second, we introduce a cross-overlap evaluation and show that matched accuracy can hide content shortcuts: low-overlap models perform well under matched conditions but degrade sharply when content cues are no longer predictive, whereas high-overlap training leads to more stable transfer. Third, we introduce a cross-style content retrieval probe showing that content identity becomes less recoverable as $\alpha$ increases and that this change happens gradually during training.

Code. The code is available at [https://github.com/joeliuz6/content_overlap_eval](https://github.com/joeliuz6/content_overlap_eval).

## 2 Evaluation Setup

### 2.1 Task and Data Structure

We consider a parallel corpus where the same content (indexed by a *chunk ID* $c$) appears in $k$ different style versions. Each data point is a triple $(x, s, c)$: the text $x$, the style label $s \in \{1, \dots, k\}$, and the chunk ID $c$. A style classifier takes $x$ as input and predicts $s$. The content shortcut arises when different styles see different chunks at training time.

### 2.2 Controlled Overlap Sampling

We define content overlap using an information-based parameter:

$$\alpha = 1 - \frac{I(C;S)}{H(S)}, \tag{1}$$

where $C$ denotes chunk identity and $S$ denotes the style label. This parameter measures how much style information is not explained by content identity. When $\alpha=0$, content identity fully determines the style label; when $\alpha=1$, content identity gives no information about the style label.

We use a simple sampling scheme to construct datasets with a target value of $\alpha$. Given $k$ style versions and a pool of $C$ chunks available in all versions, we set the number of chunks per version to $n = \lfloor C/k \rfloor$. Each version's chunks come from two pools:

- Shared pool ($n_p = \lfloor n\alpha \rfloor$ chunks): chunks that are included in all versions.
- Exclusive pool ($n_r = n - n_p$ chunks): chunks that are assigned to only one version.

This sampling scheme realizes the target information-based overlap. Intuitively, shared chunks do not predict the style label, while exclusive chunks fully predict it. Under this sampling, the conditional entropy is $H(S \mid C) = \alpha \log_2 k$, and therefore $I(C;S) = (1-\alpha)\log_2 k$. Substituting this into Eq. [1](https://arxiv.org/html/2606.07103#S2.E1) recovers the target value of $\alpha$, independent of the number of style classes. A full derivation is provided in Appendix [A](https://arxiv.org/html/2606.07103#A1).
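
The sampling scheme and the definition of $\alpha$ can be checked against each other with a small script. The sketch below is illustrative only and is not taken from the authors' repository: the function names `sample_overlap` and `empirical_alpha`, the shuffling step, and the seed argument are assumptions. It builds (chunk ID, style) rows from a shared and an exclusive pool, then estimates $\alpha = 1 - I(C;S)/H(S)$ from the empirical counts.

```python
import math
import random
from collections import Counter


def sample_overlap(chunk_ids, k, alpha, seed=0):
    """Return (chunk_id, style) rows for k styles at target overlap alpha.

    Hypothetical helper: shuffling and seeding are assumptions, not the
    paper's exact procedure.
    """
    rng = random.Random(seed)
    pool = list(chunk_ids)
    rng.shuffle(pool)
    n = len(pool) // k            # chunks per version, n = floor(C / k)
    n_p = math.floor(n * alpha)   # shared chunks, included in every version
    n_r = n - n_p                 # exclusive chunks per version
    shared, rest = pool[:n_p], pool[n_p:]
    rows = []
    for s in range(k):
        exclusive = rest[s * n_r:(s + 1) * n_r]
        rows += [(c, s) for c in shared + exclusive]
    return rows


def entropy(counts):
    """Shannon entropy in bits of a distribution given by raw counts."""
    counts = list(counts)
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def empirical_alpha(rows):
    """Estimate alpha = 1 - I(C;S) / H(S), with I(C;S) = H(S) - H(S|C)."""
    total = len(rows)
    h_s = entropy(Counter(s for _, s in rows).values())
    by_chunk = {}
    for c, s in rows:
        by_chunk.setdefault(c, Counter())[s] += 1
    # H(S|C) is the average per-chunk style entropy, weighted by P(c).
    h_s_given_c = sum(
        (sum(cnt.values()) / total) * entropy(cnt.values())
        for cnt in by_chunk.values()
    )
    return 1 - (h_s - h_s_given_c) / h_s
```

For instance, `empirical_alpha(sample_overlap(range(40), k=2, alpha=0.5))` should come out close to 0.5. Because $n_p$ is floored, the realized overlap is $n_p/n$, which can fall slightly below the target when $n\alpha$ is not an integer.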

Table 1: Cross-evaluation test accuracy (%) for $k=2$. Rows: $\alpha_{\text{train}}$; columns: $\alpha_{\text{eval}}$. Bold: matched conditions. Results are reported as mean $\pm$ std over three random seeds. Results for $k=3,4,5$ are shown in Table [2](https://arxiv.org/html/2606.07103#A1.T2) in the Appendix.

![Refer to caption](https://arxiv.org/html/2606.07103v1/x1.png)

Figure 1: Higher-Overlap Advantage $\Delta_{\mathrm{HOA}}$ $\pm$ std (Eq. [2](https://arxiv.org/html/2606.07103#S2.E2)) for all pairs of training and evaluation overlaps, for $k \in \{2,3,4,5\}$. Each cell shows the accuracy gain of the higher-overlap model relative to the lower-overlap model.

### 2.3 Cross-Overlap Evaluation

We use cross-overlap evaluation to measure how well learned style features transfer when the content–style association changes. We train a RoBERTa-large (Liu et al., [2019](https://arxiv.org/html/2606.07103#bib.bib7)) classifier with an MLP classification head for each training overlap level $\alpha_{\text{train}}$, fine-tuning all model parameters, and then evaluate it on datasets at every overlap level $\alpha_{\text{eval}}$.

Raw cross-overlap accuracy is not directly comparable across different values of $\alpha_{\text{eval}}$, since evaluation overlap levels can differ in difficulty. We therefore compare models under the same overlap shift. For $\alpha_i < \alpha_j$, we compare a model trained at $\alpha_j$ and tested at $\alpha_i$ against a model trained at $\alpha_i$ and tested at $\alpha_j$. This isolates the effect of training overlap, which we measure with the *higher-overlap advantage*:

$$\Delta_{\mathrm{HOA}}(\alpha_i, \alpha_j) = \text{Acc}(\alpha_{\text{train}}{=}\alpha_j,\; \alpha_{\text{eval}}{=}\alpha_i) - \text{Acc}(\alpha_{\text{train}}{=}\alpha_i,\; \alpha_{\text{eval}}{=}\alpha_j), \tag{2}$$

for $\alpha_i < \alpha_j$. A positive $\Delta_{\mathrm{HOA}}(\alpha_i, \alpha_j)$ indicates that the model trained with higher overlap transfers better under the same shift. For example, $\Delta_{\mathrm{HOA}}(0.0, 1.0)$ compares the $\alpha_{\text{train}}{=}1.0$ model evaluated at $\alpha_{\text{eval}}{=}0.0$ against the $\alpha_{\text{train}}{=}0.0$ model evaluated at $\alpha_{\text{eval}}{=}1.0$; a positive value means the high-overlap model generalizes downward better than the low-overlap model generalizes upward.
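
As a minimal sketch of Eq. 2, the function below computes the higher-overlap advantage for every pair $\alpha_i < \alpha_j$ from a cross-overlap accuracy matrix. The dictionary layout keyed by `(alpha_train, alpha_eval)` and the function name are assumptions made for illustration, and the accuracy values in the example are made-up placeholders, not numbers from Table 1.

```python
def higher_overlap_advantage(acc, alphas):
    """Compute Delta_HOA(a_i, a_j) for all a_i < a_j.

    acc maps (alpha_train, alpha_eval) -> accuracy.
    alphas must be sorted in increasing order.
    """
    out = {}
    for i, a_i in enumerate(alphas):
        for a_j in alphas[i + 1:]:
            # Higher-overlap model tested downward, minus
            # lower-overlap model tested upward.
            out[(a_i, a_j)] = acc[(a_j, a_i)] - acc[(a_i, a_j)]
    return out


# Placeholder accuracies for illustration only (not results from the paper).
example_acc = {
    (0.0, 0.0): 0.90, (0.0, 1.0): 0.60,
    (1.0, 0.0): 0.80, (1.0, 1.0): 0.85,
}
example_delta = higher_overlap_advantage(example_acc, [0.0, 1.0])
```

Only off-diagonal entries enter the computation; the matched (diagonal) accuracies are deliberately ignored, which is what allows the measure to expose shortcuts that matched evaluation hides.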

### 2.4 Cross-Style Content Retrieval Probe

To measure how much content information is retained in the learned representation, we introduce a *cross-style content retrieval probe*. Our style classifier uses RoBERTa-large with an MLP classification head. The [CLS] representation from RoBERTa is passed into the MLP head, and we use the intermediate representation $\mathbf{h}$ from this MLP head for probing. For each trained style classifier, we freeze the model and extract the intermediate representation $\mathbf{h}$. We then train lightweight linear projectors that map representations from two styles into a shared embedding space. The projectors are trained with a CLIP-style contrastive loss (Radford et al., [2021](https://arxiv.org/html/2606.07103#bib.bib8)), where aligned pairs from the same chunk but different versions are pulled together, and non-aligned pairs are pushed apart. Thus, the probe measures whether content is recoverable from $\mathbf{h}$ by a linear projection.

We evaluate content recoverability using top-1 bidirectional retrieval accuracy with cosine similarity, following Radford et al. ([2021](https://arxiv.org/html/2606.07103#bib.bib8)). Given a chunk representation from one style, the probe retrieves its nearest neighbor from another style; the prediction is correct if the retrieved text has the same chunk ID. High retrieval accuracy indicates that chunk identity remains recoverable from the representation. In addition to probing fully trained models, we apply the same probe at different epochs during classifier training, allowing us to track how content recoverability changes. Full details of the probe are provided in Appendix [B](https://arxiv.org/html/2606.07103#A2).
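
The retrieval metric can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than the authors' implementation: the inputs are taken to be already-projected embeddings in which row `i` of each list corresponds to chunk `i`, and "bidirectional" is assumed to mean the average of the two retrieval directions, which the text does not spell out.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top1_accuracy(queries, keys):
    """Fraction of queries whose most similar key has the same index."""
    hits = 0
    for i, q in enumerate(queries):
        sims = [cosine(q, key) for key in keys]
        best = max(range(len(keys)), key=sims.__getitem__)
        hits += int(best == i)
    return hits / len(queries)


def bidirectional_top1(z_a, z_b):
    """Average of a->b and b->a top-1 retrieval accuracy (assumed definition)."""
    return 0.5 * (top1_accuracy(z_a, z_b) + top1_accuracy(z_b, z_a))
```

With `z_a = [[1.0, 0.0], [0.0, 1.0]]` and a slightly perturbed but aligned `z_b`, every query retrieves its own chunk; swapping the rows of `z_b` makes every retrieval wrong.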

## 3 Experimental Setup

### 3.1 Data

We use seven English Bible translations as our parallel corpus. Each translation is segmented into aligned chunks of consecutive verses. For classification, we use a subset of five translations, yielding style class settings $k \in \{2,3,4,5\}$. The other two translations are reserved exclusively for the unseen cross-style content retrieval probe. Full dataset statistics and processing details are in Appendix [C](https://arxiv.org/html/2606.07103#A3).

### 3.2 Implementation Details

We fine-tune all RoBERTa-large (Liu et al., [2019](https://arxiv.org/html/2606.07103#bib.bib7)) parameters over six overlap levels, from $\alpha=0.0$ to $\alpha=1.0$ in increments of $0.2$. Main results are averaged over three random seeds. Full details are in Appendix [D](https://arxiv.org/html/2606.07103#A4).

## 4 Results

We present three complementary analyses: cross-overlap evaluation tests whether accuracy transfers when content–style associations change, the retrieval probe measures whether content remains recoverable, and training dynamics show when this content information is removed.

### 4.1 High-Overlap Training Improves Cross-Overlap Style Transfer

Table [1](https://arxiv.org/html/2606.07103#S2.T1) shows the cross-overlap evaluation matrix for $k=2$. The diagonal entries are high for all training overlaps, but matched accuracy alone can be misleading. For example, the model trained at $\alpha_{\text{train}}=0.0$ reaches 94.9% under matched evaluation, but drops to 55.3% when evaluated at $\alpha_{\text{eval}}=1.0$. This shows that low-overlap training encourages content shortcuts: when content cues are removed at evaluation time, performance collapses. In contrast, the model trained at $\alpha_{\text{train}}=1.0$ remains stable across all $\alpha_{\text{eval}}$ values (85–88%). Since content is fully shared during training, content identity no longer predicts the style label, forcing the model to rely more on stylistic features. This pattern holds across different numbers of classes, as shown in Table [2](https://arxiv.org/html/2606.07103#A1.T2) in the Appendix.

![Refer to caption](https://arxiv.org/html/2606.07103v1/x2.png) (a) Seen versions
![Refer to caption](https://arxiv.org/html/2606.07103v1/x3.png) (b) Unseen versions

Figure 2: Cross-style content retrieval probe accuracy vs. training overlap $\alpha$ for different $k$. Shaded regions indicate standard deviation across three random seeds.

Figure [1](https://arxiv.org/html/2606.07103#S2.F1) visualizes the higher-overlap advantage in Eq. [2](https://arxiv.org/html/2606.07103#S2.E2) for each $k$. Nearly all entries are positive, showing that higher-overlap training consistently produces representations that transfer better. This advantage is statistically reliable: a one-sided paired $t$-test over all off-diagonal pairs rejects the null hypothesis of no directional advantage ($p<0.001$ for all $k$), with large effect sizes (Cohen's $d>1.1$) and over 88% of pairs favoring the higher-overlap model (Appendix [E](https://arxiv.org/html/2606.07103#A5)).

### 4.2 Content Decreases with Overlap

The content retrieval probe measures how much content information remains in the learned representations. Figure [2](https://arxiv.org/html/2606.07103#S4.F2) reports top-1 retrieval accuracy across classifiers trained with different overlap levels $\alpha$. For the retrieval evaluation, we use the fully shared setting ($\alpha=1.0$). Thus, the training overlap of the style classifier varies from $\alpha=0.0$ to $\alpha=1.0$, while the retrieval probe is evaluated on shared-content pairs.

Retrieval accuracy decreases monotonically as $\alpha$ increases for all $k$. At $\alpha=0$, retrieval accuracy is high (above 0.7), showing that low-overlap models strongly encode content. At $\alpha=1$, it drops sharply (below 0.1), indicating that most content information has been removed. The same trend holds on unseen versions, although seen versions retain slightly higher retrieval accuracy. Style classification accuracy remains high even at $\alpha=1$ (Table [2](https://arxiv.org/html/2606.07103#A1.T2) in the Appendix), showing that the model shifts from encoding content to encoding style. These results support the information view: as $\alpha$ increases, content becomes less predictive of style, and the learned representation suppresses content features.

![Refer to caption](https://arxiv.org/html/2606.07103v1/x4.png) (a) Seen versions
![Refer to caption](https://arxiv.org/html/2606.07103v1/x5.png) (b) Unseen versions

Figure 3: Cross-style content retrieval probe accuracy during training for different $\alpha_{\text{train}}$ with $k=2$.

### 4.3 Training Dynamics of Content Removal

Figure [3](https://arxiv.org/html/2606.07103#S4.F3) shows retrieval accuracy over training for different $\alpha$ values with $k=2$. At the beginning, all models have similarly high retrieval accuracy, reflecting the pretrained RoBERTa initialization. As training proceeds, the trajectories separate clearly. High-overlap models remove content much faster: by epoch 20, the $\alpha=1.0$ model drops below 0.2, while the $\alpha=0$ model remains above 0.8. This indicates that low-overlap training continues to preserve content because it is useful for predicting the style label.

Across all $\alpha$, we observe the same pattern: larger $\alpha$ leads to faster and deeper content removal. Thus, content removal is gradual rather than sudden, and its speed is controlled by the overlap structure of the training data. Since style classification accuracy remains high, this process reflects a shift from content-based shortcuts toward style-based representations rather than representational collapse.

## 5 Conclusion

Our findings point to a practical diagnostic setup: when content and style co-vary in training data, standard held-out accuracy is unreliable as evidence of style learning. The overlap parameter $\alpha$ offers a straightforward way to test whether a classifier relies on content cues. We hope this encourages the community to treat content bias as a measurable, adjustable variable rather than an unquantified source of noise in style-related benchmarks.

## Limitations

#### Limited model coverage.

We evaluate RoBERTa-large with an MLP classification head. Although this setting is sufficient for testing the controlled-overlap setting, future work should examine newer encoder models, decoder-only language models, and instruction-tuned models.

#### Limited dataset scope.

We use English Bible translations because they provide aligned content across stylistic variants. However, this domain does not cover all forms of style variation, such as authorship, register, genre, or social media style. Testing additional aligned or semi-aligned corpora would strengthen the generality of our findings.

#### No semantic-distance modeling between chunks.

Our setup treats chunk identity as discrete: chunks are either identical or different. It therefore does not capture semantic relatedness between different chunks. The mutual-information analysis should be interpreted as measuring exact content overlap, not broader semantic overlap.

#### Limited overlap schedule.

We use six overlap levels from $\alpha=0.0$ to $\alpha=1.0$ in steps of 0.2. A finer grid could reveal more detailed transition patterns between content-based shortcuts and more content-invariant style representations.

## References

- M. H. Altakrori, J. C. K. Cheung, and B. Fung (2021). The topic confusion task: a novel scenario for authorship attribution. arXiv preprint arXiv:2104.08530.
- E. Briakou, S. Agrawal, J. Tetreault, and M. Carpuat (2021). Evaluating the evaluation metrics for style transfer: a case study in multilingual formality transfer. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 1321–1336.
- K. Carlson, A. Riddell, and D. Rockmore (2018). Evaluating prose style transfer with the Bible. Royal Society Open Science 5(10), pp. 171920.
- C. Christodouloupoulos and M. Steedman (2015). A massively parallel corpus: the Bible in 100 languages. Language Resources and Evaluation 49(2), pp. 375–395.
- A. Conneau, G. Kruszewski, G. Lample, L. Barrault, and M. Baroni (2018). What you can cram into a single \$&!#\* vector: probing sentence embeddings for linguistic properties. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), I. Gurevych and Y. Miyao (Eds.), Melbourne, Australia, pp. 2126–2136. External Links: [Link](https://aclanthology.org/P18-1198/), [Document](https://dx.doi.org/10.18653/v1/P18-1198).
- R. Geirhos, J. Jacobsen, C. Michaelis, R. Zemel, W. Brendel, M. Bethge, and F. A. Wichmann (2020). Shortcut learning in deep neural networks. Nature Machine Intelligence 2(11), pp. 665–673.
- D. Jin, Z. Jin, Z. Hu, O. Vechtomova, and R. Mihalcea (2022). Deep learning for text style transfer: a survey. Computational Linguistics 48(1), pp. 155–205.
- Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019). RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
- R. T. McCoy, E. Pavlick, and T. Linzen (2019). Right for the wrong reasons: diagnosing syntactic heuristics in natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 3428–3448.
- G. K. Mikros and E. K. Argiri (2007). Investigating topic influence in authorship attribution. In PAN.
- T. Niven and H. Kao (2019). Probing neural network comprehension of natural language arguments. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 4658–4664.
- A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. (2021). Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pp. 8748–8763.
- O. Roy and M. Vetterli (2007). The effective rank: a measure of effective dimensionality. In 2007 15th European Signal Processing Conference, pp. 606–610.
- E. Stamatatos (2009). A survey of modern authorship attribution methods. Journal of the American Society for Information Science and Technology 60(3), pp. 538–556.
- Y. Zhou, P. Xu, X. Liu, B. An, W. Ai, and F. Huang (2024). Explore spurious correlations at the concept level in language models for text classification. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 478–492.

## Appendix A Derivation of the Information-Based Overlap

We provide the complete derivation for the deterministic sampling scheme ($p=\alpha$, $r=1-\alpha$).

#### Chunk structure.

Under this scheme, every chunk falls into exactly one of two categories: (1) Shared chunks: there are $n_p = n\alpha$ such chunks, each appearing in all $k$ versions (multiplicity $m_c = k$). (2) Exclusive chunks: there are $k \cdot n_r = kn(1-\alpha)$ such chunks, each appearing in exactly one version (multiplicity $m_c = 1$).

#### Verification.

Total rows: $n\alpha \cdot k + kn(1-\alpha) \cdot 1 = kn = N$.

#### Style entropy.

Each version has exactly $n$ rows out of $N = kn$ total, so $P(s) = 1/k$ and $H(S) = \log_2 k$.

#### Conditional entropy.

For a shared chunk ($m_c = k$): $P(s \mid c) = 1/k$, so $H(S \mid C=c) = \log_2 k$. Each such chunk has weight $P(c) = k/(kn) = 1/n$. For an exclusive chunk ($m_c = 1$): $H(S \mid C=c) = 0$. Each such chunk has weight $P(c) = 1/(kn)$. Summing:

$$H(S \mid C) = \sum_{\text{shared}} \frac{k}{kn} \log_2 k + \sum_{\text{excl.}} \frac{1}{kn} \cdot 0 = n\alpha \cdot \frac{1}{n} \cdot \log_2 k = \alpha \log_2 k. \tag{3}$$

#### Mutual information.

$$I(C;S) = H(S) - H(S \mid C) = (1-\alpha)\log_2 k.$$

#### Normalized residual.

$$\frac{I(C;S)}{H(S)} = \frac{(1-\alpha)\log_2 k}{\log_2 k} = 1-\alpha,$$

hence $1 - I(C;S)/H(S) = \alpha$.
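
As a worked instance of the derivation, consider illustrative values chosen here (not a configuration reported in the paper): $k=2$, $n=10$, and $\alpha=0.4$. There are $n_p = 4$ shared chunks and $k \cdot n_r = 2 \cdot 6 = 12$ exclusive chunks, for $N = 4 \cdot 2 + 12 \cdot 1 = 20$ rows. Then:

$$
\begin{aligned}
H(S) &= \log_2 2 = 1, \\
H(S \mid C) &= 4 \cdot \frac{2}{20} \cdot \log_2 2 + 12 \cdot \frac{1}{20} \cdot 0 = 0.4, \\
I(C;S) &= 1 - 0.4 = 0.6, \\
1 - \frac{I(C;S)}{H(S)} &= 1 - 0.6 = 0.4 = \alpha.
\end{aligned}
$$

The recovered value matches the target because $n\alpha$ is an integer here; the cancellation of $\log_2 k$ in the last step is what makes the result independent of the number of classes.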

Table 2: Cross-evaluation test accuracy (%) for $k \in \{2,3,4,5\}$. Results are reported as mean $\pm$ std over 3 random seeds.

## Appendix B Cross-Style Content Retrieval Probe Details

The cross-style content retrieval probe measures how much content information is retained in the style classifier's representation $\mathbf{h}$. For all retrieval-probe evaluations, we construct the probe dataset under the fully shared setting ($\alpha=1.0$), so that aligned chunks are available across the two versions used for retrieval. The overlap value varied in the main analysis refers to the training overlap of the style classifier. For every value of $k$, we use two seen translations and two unseen translations.

Given a trained and frozen style classifier, we extract $\mathbf{h} \in \mathbb{R}^{256}$ for every chunk in both the seen and unseen translations. Two linear projectors $W^a, W^b \in \mathbb{R}^{64 \times 256}$ (one per translation) map $\mathbf{h}$ into a shared 64-dimensional space, trained with a symmetric contrastive loss:

$$\mathcal{L} = -\frac{1}{2N} \sum_{i=1}^{N} \bigg[ \log \frac{e^{s(\mathbf{z}_i^a, \mathbf{z}_i^b)/\tau}}{\sum_j e^{s(\mathbf{z}_i^a, \mathbf{z}_j^b)/\tau}} + \log \frac{e^{s(\mathbf{z}_i^b, \mathbf{z}_i^a)/\tau}}{\sum_j e^{s(\mathbf{z}_i^b, \mathbf{z}_j^a)/\tau}} \bigg], \tag{4}$$

where $\mathbf{z}_i^a = W^a \mathbf{h}_i^a$ and $\mathbf{z}_i^b = W^b \mathbf{h}_i^b$ are projected embeddings of an aligned pair (same chunk in two different translations), $s(\cdot,\cdot)$ is cosine similarity, and $\tau = 0.07$ is the temperature. We report top-1 bidirectional matching accuracy: high accuracy means the representation retains content information; low accuracy means content has been stripped away. For the training dynamics experiments (§[4.3](https://arxiv.org/html/2606.07103#S4.SS3)), the same probe is trained from scratch at regular epoch intervals using the classifier checkpoint at that point.
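
To make Eq. 4 concrete, the following sketch evaluates the symmetric contrastive loss on a small batch in plain Python. It is a forward-pass illustration only, with hypothetical function names: the projectors $W^a$ and $W^b$ are omitted (the inputs are assumed to be already-projected embeddings $\mathbf{z}^a$ and $\mathbf{z}^b$), and no gradient step is shown.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def mean_diag_log_softmax(sim, tau):
    """Mean over i of log softmax_j(sim[i][j] / tau), evaluated at j = i."""
    total = 0.0
    for i, row in enumerate(sim):
        logits = [s / tau for s in row]
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += logits[i] - log_z
    return total / len(sim)


def symmetric_contrastive_loss(z_a, z_b, tau=0.07):
    """Eq. 4: average of the a->b and b->a cross-entropy terms."""
    a = [normalize(v) for v in z_a]
    b = [normalize(v) for v in z_b]
    sim_ab = [[sum(x * y for x, y in zip(u, v)) for v in b] for u in a]
    sim_ba = [list(col) for col in zip(*sim_ab)]  # transpose for b -> a
    return -0.5 * (mean_diag_log_softmax(sim_ab, tau)
                   + mean_diag_log_softmax(sim_ba, tau))
```

When aligned pairs are the most similar pairs, the loss approaches zero; when all embeddings coincide, it equals $\log N$, the value for a uniform guess over $N$ candidates.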

## Appendix C Dataset

| Translation | Abbr. | Role |
| --- | --- | --- |
| American Standard Version | ASV | Classification + Seen Probe |
| Darby Bible Translation | DBY | Classification + Seen Probe |
| World English Bible | WEB | Classification |
| King James Version | KJV | Classification |
| Webster's Bible Translation | WBT | Classification |
| English Revised Version | ERV | Unseen Probe |
| World Messianic Bible | WMB | Unseen Probe |

Table 3: Bible translation versions used in this study. The top five are used for style classification ($k \in \{2,3,4,5\}$ in listed order); the bottom two are reserved for the unseen cross-style content retrieval probe.

We selected the seven English translations from the English Bible versions available through eBible.org, which provides multiple downloadable English translations including ASV, DBY, ERV, KJV, WEB, WBT, and WMB. We first applied an alignment-coverage filter: a candidate translation was retained only if it covered enough of the shared verse/chapter inventory to support strict parallel chunking across versions. This step removed translations with substantial missing books, chapters, or verse ranges, since such gaps would make chunk-level alignment unreliable and would introduce systematic missingness unrelated to style.

We then filtered candidates using two complementary diagnostic criteria. First, we excluded versions that were too easily distinguishable from the others because of obvious surface artifacts, such as highly archaic or simplified language, distinctive orthography, paraphrastic expansions, naming conventions, or formatting/editorial conventions. Such versions could allow a classifier to rely on superficial cues rather than on the subtler relationship between content and style. Second, we removed translations that were extremely similar to another candidate version, since near-duplicate translations would make the classification setting artificially dependent on minute textual differences and would reduce the diversity of stylistic variation in the corpus. This near-duplication is common in Bible translation corpora because many English versions are revisions, editions, or light modernizations of earlier translations rather than fully independent translations. As a result, two versions may share extensive verse-level wording even when they are distributed as separate translations. We therefore aimed to remove both outliers and near-duplicates, retaining translations that were similar enough to support controlled aligned comparison but different enough to provide meaningful stylistic variation.

Table [3](https://arxiv.org/html/2606.07103#A3.T3) shows the seven English Bible translations used in this study. For each $k \in \{2,3,4,5\}$, we use a fixed nested subset of five classification translations (ASV, DBY, WEB, KJV, WBT): $k=2$ uses the first two, $k=3$ the first three, and so on. The remaining two translations (ERV, WMB) are reserved exclusively for the unseen cross-style content retrieval probe. For the seen version probe, we use ASV and DBY. Each translation is segmented into chunks of $L=5$ consecutive verses, producing aligned chunk IDs across all seven versions. Chunks where all translations produce identical text are removed, since they carry no stylistic signal.
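The chunking and nested-subset rules above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, the input is assumed to be already verse-aligned (one list of verse strings per translation, equal lengths), and dropping a trailing partial chunk is an assumption, since the text does not say how incomplete chunks are handled.

```python
# Order in which classification translations are added as k grows (from Table 3).
CLASSIFICATION_ORDER = ["ASV", "DBY", "WEB", "KJV", "WBT"]


def classification_versions(k):
    """Return the fixed nested subset of translations used for a given k."""
    if not 2 <= k <= len(CLASSIFICATION_ORDER):
        raise ValueError("k must be in {2, 3, 4, 5}")
    return CLASSIFICATION_ORDER[:k]


def build_aligned_chunks(versions, chunk_len=5):
    """Group verse-aligned translations into chunks of `chunk_len` verses.

    `versions` maps an abbreviation to a list of verse strings; all lists are
    assumed to have the same length and verse order. Returns a dict mapping
    chunk ID -> {abbreviation: chunk text}.
    """
    lengths = {len(verses) for verses in versions.values()}
    if len(lengths) != 1:
        raise ValueError("all versions must be verse-aligned and equally long")
    n_verses = lengths.pop()

    chunks = {}
    # Assumption: only full chunks are kept; a trailing partial chunk is dropped.
    starts = range(0, n_verses - chunk_len + 1, chunk_len)
    for chunk_id, start in enumerate(starts):
        texts = {
            abbr: " ".join(verses[start:start + chunk_len])
            for abbr, verses in versions.items()
        }
        # Chunks that are identical in every translation carry no style signal.
        if len(set(texts.values())) == 1:
            continue
        chunks[chunk_id] = texts
    return chunks
```

Chunk IDs are assigned before filtering, so a removed chunk leaves a gap in the ID sequence rather than shifting later IDs.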

Data is split by chunk ID into train, validation, and test sets with an 80/10/10 ratio, so that no content chunk appears in more than one split. For each split, overlap-controlled subsets are constructed independently using the same target $\alpha \in \{0.0, 0.2, 0.4, 0.6, 0.8, 1.0\}$. For a given $\alpha$ and $k$ versions, each version receives $n = \lfloor C_{\text{split}}/k \rfloor$ chunks, of which $n_p = \lfloor n\alpha \rfloor$ are shared across all versions and the remaining $n_r = n - n_p$ are exclusive to that version. Exclusive chunks are assigned sequentially; shared chunks are drawn randomly from the remaining pool.
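A minimal sketch of this allocation is given below. It assumes that $C_{\text{split}}$ is the number of chunk IDs in the split, that "sequentially" means consecutive blocks of $n_r$ IDs per version, and that the "remaining pool" is whatever is left after the exclusive blocks; the function name and the `seed` argument are hypothetical.

```python
import math
import random


def build_overlap_subset(chunk_ids, k, alpha, seed=0):
    """Assign chunk IDs to k versions with a target overlap alpha.

    Returns a list of k sorted ID lists, each of length n = floor(C / k),
    where n_p = floor(n * alpha) IDs are shared by every version and the
    other n_r = n - n_p IDs are exclusive to one version.
    """
    chunk_ids = list(chunk_ids)
    n = len(chunk_ids) // k
    # The small epsilon only guards against float error such as 0.6 * 5 < 3;
    # it is an implementation detail, not part of the stated formula.
    n_p = math.floor(n * alpha + 1e-9)
    n_r = n - n_p

    # Exclusive chunks: consecutive, non-overlapping blocks (assumed reading
    # of "assigned sequentially").
    exclusive = [chunk_ids[v * n_r:(v + 1) * n_r] for v in range(k)]

    # Shared chunks: sampled without replacement from the unassigned IDs.
    remaining = chunk_ids[k * n_r:]
    shared = random.Random(seed).sample(remaining, n_p)

    return [sorted(block + shared) for block in exclusive]
```

With $\alpha = 0$ the versions see disjoint content, and with $\alpha = 1$ every version sees the same $n$ chunks. The pool is always large enough, because $k\,n_r + n_p \leq k\,n \leq C_{\text{split}}$.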

## Appendix D Training Details

#### Style classifier\.

The classifier is based on RoBERTa-Large (Liu et al., [2019](https://arxiv.org/html/2606.07103#bib.bib7)) with an MLP classification head: (1) a dense layer $\mathbf{h} = \tanh(W_1 \cdot \mathbf{h}_{\texttt{[CLS]}} + b_1)$, where $\mathbf{h} \in \mathbb{R}^{256}$; (2) a linear classifier $\hat{y} = W_2 \cdot \mathbf{h} + b_2$, where $\hat{y} \in \mathbb{R}^{k}$. The model is trained end-to-end with cross-entropy loss using AdamW with learning rate $2 \times 10^{-5}$, weight decay $0.01$, linear warmup for 10% of total steps, batch size 32 with gradient accumulation over 4 steps (effective batch size 128), and 50 epochs. The maximum input length is 512 tokens.
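The batch-size and warmup arithmetic can be made concrete with a short sketch. Two points are assumptions rather than statements from the paper: the text specifies only the warmup phase, so the linear decay to zero afterwards is a common default chosen for illustration, and the step count assumes one optimizer update per effective batch with the last partial batch rounded up. All names are hypothetical.

```python
import math

PER_STEP_BATCH = 32     # examples per forward/backward pass
ACCUM_STEPS = 4         # gradient accumulation steps
EPOCHS = 50
BASE_LR = 2e-5
WARMUP_FRACTION = 0.10  # linear warmup over 10% of total steps


def total_optimizer_steps(num_train_examples):
    """Number of optimizer updates over the whole run (assumed counting)."""
    effective_batch = PER_STEP_BATCH * ACCUM_STEPS  # 32 * 4 = 128
    return math.ceil(num_train_examples / effective_batch) * EPOCHS


def learning_rate(step, total_steps):
    """Learning rate at a given optimizer step."""
    warmup_steps = int(WARMUP_FRACTION * total_steps)
    if step < warmup_steps:
        # Linear warmup from 0 to BASE_LR.
        return BASE_LR * step / max(1, warmup_steps)
    # Assumption: linear decay to 0 after warmup (not stated in the paper).
    decay_steps = max(1, total_steps - warmup_steps)
    return BASE_LR * max(0.0, (total_steps - step) / decay_steps)
```

For a hypothetical training set of 12,800 examples, this gives 100 updates per epoch, 5,000 updates in total, and a warmup phase of 500 updates.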

#### Content retrieval probe\.

The projectors are linear layers mapping from the 256-dimensional representation $\mathbf{h}$ to a 64-dimensional embedding space (one projector per version). They are trained with AdamW (learning rate $2 \times 10^{-3}$, weight decay $10^{-3}$, $\beta = (0.9, 0.98)$), batch size 256, 100 epochs, cosine annealing, and a temperature of $\tau = 0.07$ for the contrastive loss. The best checkpoint is selected by validation matching accuracy. For the training dynamics experiments (§[4.3](https://arxiv.org/html/2606.07103#S4.SS3)), the same probe is trained from scratch at regular epoch intervals using the classifier checkpoint at that point.
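To illustrate how the temperature and the matching accuracy interact, the sketch below implements a generic one-directional InfoNCE loss over cosine similarities. This appendix does not give the exact form of the contrastive loss, so InfoNCE is an assumption, as are the function names; the inputs stand for already projected embeddings of the same chunks in two versions, where row $i$ of one list should match row $i$ of the other.

```python
import math


def _normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def similarity_matrix(emb_a, emb_b, tau=0.07):
    """Cosine similarities between all pairs, divided by the temperature."""
    a = [_normalize(v) for v in emb_a]
    b = [_normalize(v) for v in emb_b]
    return [[sum(x * y for x, y in zip(u, w)) / tau for w in b] for u in a]


def info_nce(emb_a, emb_b, tau=0.07):
    """Mean cross-entropy of picking the aligned chunk in emb_b for each row of emb_a."""
    logits = similarity_matrix(emb_a, emb_b, tau)
    loss = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for a stable log-sum-exp
        log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
        loss += log_z - row[i]
    return loss / len(logits)


def matching_accuracy(emb_a, emb_b):
    """Fraction of rows in emb_a whose nearest neighbour in emb_b is the aligned chunk."""
    sims = similarity_matrix(emb_a, emb_b, tau=1.0)
    hits = sum(
        1 for i, row in enumerate(sims)
        if max(range(len(row)), key=row.__getitem__) == i
    )
    return hits / len(sims)
```

A small $\tau$ such as 0.07 sharpens the softmax, so the loss penalizes even moderately similar non-matching chunks. The matching accuracy depends only on the ranking of similarities and is therefore unaffected by $\tau$.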

All classification and probe results are averaged over three random seeds\.

## Appendix E Statistical Significance of the High-Overlap Advantage

To test whether the directional pattern in the cross-evaluation matrices is statistically significant, we pool all off-diagonal pairs $(\alpha_i, \alpha_j)$ with $\alpha_i < \alpha_j$ and compute the high-overlap advantage $\Delta(\alpha_i, \alpha_j)$ (Eq. [2](https://arxiv.org/html/2606.07103#S2.E2)) for each pair under each random seed. This yields $\binom{6}{2} \times 3 = 45$ paired differences per $k$. We test the null hypothesis $H_0\colon \mathbb{E}[\Delta] \leq 0$ (i.e., higher overlap confers no directional advantage) using a one-sided $t$-test and a Wilcoxon signed-rank test.

Table [4](https://arxiv.org/html/2606.07103#A5.T4) reports the results on the test set. Both tests reject $H_0$ at $p < 0.001$ for every $k$. The effect is large and consistent: Cohen's $d$ exceeds 1.1 in all conditions, and over 88% of individual pair differences are positive, confirming that the benefit of higher-overlap training is systematic rather than driven by a few extreme pairs.
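The pooling and summary statistics can be sketched with the standard library alone. The code below enumerates the 45 (pair, seed) combinations and computes the one-sample $t$ statistic, an effect size, and the share of positive differences for a list of $\Delta$ values. Two assumptions apply: the seed values are placeholders, and Cohen's $d$ is computed as the mean difference divided by the standard deviation of the differences, a variant the paper does not name. The values of $\Delta$ themselves come from Eq. 2 and are not reproduced here.

```python
import math
from itertools import combinations
from statistics import mean, stdev

ALPHAS = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
SEEDS = [0, 1, 2]  # placeholder seed values; the paper only states "three seeds"


def pooled_keys():
    """All (alpha_i, alpha_j, seed) combinations with alpha_i < alpha_j."""
    return [
        (a_i, a_j, seed)
        for a_i, a_j in combinations(ALPHAS, 2)
        for seed in SEEDS
    ]


def summarize(deltas):
    """One-sample summary of paired differences against a mean of zero."""
    n = len(deltas)
    m = mean(deltas)
    sd = stdev(deltas)  # sample standard deviation (n - 1 in the denominator)
    return {
        "n": n,
        "t": m / (sd / math.sqrt(n)),  # compare with a t distribution, n - 1 d.o.f.
        "cohens_d": m / sd,            # assumed variant: mean / sd of differences
        "frac_positive": sum(d > 0 for d in deltas) / n,
    }
```

The one-sided $p$-values are not computed here because they need distribution functions outside the standard library; with SciPy installed, `scipy.stats.ttest_1samp` and `scipy.stats.wilcoxon` with `alternative="greater"` are one way to obtain them.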

Table 4: Statistical significance of the high-overlap advantage $\Delta$ on the test set. For each $k$, 45 paired differences are pooled across 15 off-diagonal pairs and 3 seeds. Both the $t$-test and Wilcoxon test yield $p < 0.001$ in all conditions.

## Appendix F Additional Representation Geometry Analysis

![Refer to caption](https://arxiv.org/html/2606.07103v1/x6.png)

Figure 4: Spectrum analysis of the 256-dimensional pre-logit representations for $k=2$. Left: variance explained by PC1 and the remaining components across different training overlaps. Right: variance explained by the top 10 principal components. Across all overlap levels, the spectrum is dominated by PC1, suggesting that the classifier representation is highly anisotropic.

Table 5: Effective rank and participation ratio of the mean-centered representation covariance matrix on the test set for $k=2$. Both metrics are close to 1 across all overlap levels, indicating that most variance is concentrated in a single dominant direction.

We include an additional analysis of the global geometry of the learned representation. For each model trained with $k=2$, we extract the 256-dimensional pre-logit representation $\mathbf{h}$ on the test set. Before computing the covariance matrix, we mean-center the representations. Let $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_d$ be the eigenvalues of this covariance matrix, with $d = 256$.

We summarize the spectrum using two standard measures. The *effective rank* (Roy and Vetterli, [2007](https://arxiv.org/html/2606.07103#bib.bib12)) is defined as $\exp(H)$, where $H = -\sum_i \hat{\lambda}_i \log \hat{\lambda}_i$ and $\hat{\lambda}_i = \lambda_i / \sum_j \lambda_j$. The *participation ratio* is defined as $(\sum_i \lambda_i)^2 / \sum_i \lambda_i^2$. Both quantities are close to 1 when variance is concentrated in a single direction, and approach $d$ when variance is spread uniformly across dimensions.
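Both measures follow directly from the eigenvalues, as the sketch below shows. It takes a list of non-negative eigenvalues as input (how they are obtained from the covariance matrix is left out), and the function names are illustrative.

```python
import math


def effective_rank(eigenvalues):
    """exp(H), with H the Shannon entropy of the normalized eigenvalues."""
    total = sum(eigenvalues)
    # Zero eigenvalues contribute nothing, using the convention 0 * log 0 = 0.
    probs = [lam / total for lam in eigenvalues if lam > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)


def participation_ratio(eigenvalues):
    """(sum of eigenvalues)^2 divided by the sum of squared eigenvalues."""
    total = sum(eigenvalues)
    return total ** 2 / sum(lam * lam for lam in eigenvalues)
```

The two limiting cases in the text can be checked directly: a spectrum with a single non-zero eigenvalue gives 1 for both measures, and a flat spectrum of $d = 256$ equal eigenvalues gives 256 for both.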

Figure [4](https://arxiv.org/html/2606.07103#A6.F4) shows that the spectrum is dominated by the first principal component across all overlap levels. PC1 explains over 94% of the variance, while the remaining components together account for less than 6%. Table [5](https://arxiv.org/html/2606.07103#A6.T5) shows the same pattern numerically: both effective rank and participation ratio remain close to 1 for all values of $\alpha$.

These results should be interpreted as a coarse geometry check rather than a direct measure of style or content information. The strong dominance of PC1 likely reflects the anisotropy of the representation and the low-dimensional structure induced by the binary classification head. Importantly, this global spectrum changes only slightly with $\alpha$, while the cross-style content retrieval probe shows large changes in content recoverability. Thus, content information can become less recoverable even when coarse spectral statistics of the representation remain similar.

## Appendix G Artifact Use and Licenses

We use publicly available English Bible translations from eBible\.org and the RoBERTa\-large model\. Before release, we will include the license information for each translation and ensure that our redistribution of processed metadata follows the corresponding license terms\. RoBERTa\-large is released under the MIT License\. Our use of these artifacts is consistent with their intended purposes: the Bible corpus was created as a multilingual parallel resource for NLP research, and RoBERTa\-large is a general\-purpose pretrained model designed for fine\-tuning on downstream tasks\. The dataset consists of publicly available religious texts and does not contain personally identifiable information or offensive content; no additional anonymization was required\.
