SURGeLLM: Rethinking Multi-Task Evaluation through Task-Aware Feature Gating with Class-Balanced Normalization
Summary
SURGeLLM introduces a unified transformer framework with surgical feature gates, task-conditioned prefix tokens, and instance-weighted normalization to address mismatched inductive biases, class imbalance, and lexical knowledge injection in multi-task learning, achieving significant gains across four diverse NLP tasks.
# Rethinking Multi-Task Evaluation through Task-Aware Feature Gating with Class-Balanced Normalization
Source: [https://arxiv.org/html/2606.24259](https://arxiv.org/html/2606.24259)
###### Abstract
Fine-tuned encoders deployed across heterogeneous NLP tasks face three compounding problems: mismatched inductive biases, class-imbalance corruption of feature statistics, and no mechanism to condition attention on external lexical knowledge. We introduce SURGeLLM, a unified transformer framework that addresses each with a dedicated lightweight module: a *surgical feature gate* (a learned per-dimension sigmoid over curated lexical indicators and [CLS], which provably degenerates to identity when features are uninformative), *task-conditioned prefix tokens* (quantized feature values and task identity prepended to every input), and *Instance-Weighted Normalization* (IWN, which removes class-prior bias from gate statistics). We prove an excess-risk bound linking gate benefit to *surgical feature alignment*. Across four tasks (SST-2, multi-hop retrieval, LLM-prompt attribution, and authorship detection), covering 17,830 examples and eleven model variants over three seeds, the IWN variant achieves a macro-F1 of 0.940 ($+0.036$ over the strongest non-IWN baseline; $+0.130$ on authorship detection). A random-vocabulary control ($-0.028$ avg. F1) confirms gains are lexical, not parametric. Code, vocabularies, and a $99.5\%$-recovery auto-extraction recipe are released.
Noor Islam S. Mohammad (corresponding author) and Uluğ Bayazıt (supervising author), Dept. of Computer Science, Istanbul Technical University, {islam23, ulugbayazit}@itu.edu.tr
## 1 Introduction
Pre-trained encoders fine-tuned per task incur real costs: parameter duplication, no amortized inference, and no shared linguistic structure. Multi-task learning (Caruana, [1997](https://arxiv.org/html/2606.24259#bib.bib6); Liu et al., [2019a](https://arxiv.org/html/2606.24259#bib.bib7); Raffel et al., [2020](https://arxiv.org/html/2606.24259#bib.bib5)) addresses this in principle, but structurally heterogeneous tasks (differing in vocabulary, label space, and register) interfere destructively (Wu et al., [2020](https://arxiv.org/html/2606.24259#bib.bib10); Crawshaw, [2020](https://arxiv.org/html/2606.24259#bib.bib9); Fifty et al., [2021](https://arxiv.org/html/2606.24259#bib.bib11)) in ways that near-isotropic benchmarks like GLUE (Wang et al., [2018](https://arxiv.org/html/2606.24259#bib.bib52)) do not expose. We study the hard case: a single encoder handling (a) movie-review sentiment, (b) multi-hop retrieval QA, (c) LLM-prompt attribution, and (d) human/LLM authorship, tasks that share a backbone but draw on largely disjoint surface signals. Two observations motivate explicit feature injection beyond end-to-end fine-tuning. First, stylometric surface statistics remain discriminative even after fine-tuning (Fabien et al., [2020](https://arxiv.org/html/2606.24259#bib.bib25); Potthast et al., [2017](https://arxiv.org/html/2606.24259#bib.bib26)), suggesting the encoder does not always exploit them optimally.
Second, sequence truncation destroys global statistics (pronoun rates, marker densities) that cannot be recovered from a partial view (Ding et al., [2020](https://arxiv.org/html/2606.24259#bib.bib59)). We address both with a *surgical vocabulary*: ten curated lexical indicator groups that, together with six surface statistics, yield a 16-dimensional feature vector $\mathbf{s}\in\mathbb{R}^{16}$ computed on the full untruncated text. This vector is fused with the [CLS] representation via a learned per-dimension sigmoid gate and simultaneously injected as task-conditioned prefix tokens. Global standardization of $\mathbf{s}$ is contaminated by the class prior under severe skew (our authorship corpus: $9.3{:}1$), causing the gate to learn a sub-optimal fusion. Instance-Weighted Normalization (IWN) replaces global with class-balanced per-dimension statistics at training time, with no test-time labels required, yielding an absolute F1 gain of $+0.130$ on authorship detection, the largest single gain in our study.
##### Contributions\.
- **Framework** (§[3](https://arxiv.org/html/2606.24259#S3)): a unified multi-task encoder with per-dimension feature gates, task-conditioned prefix tokens, and IWN; plug-compatible with any HuggingFace encoder.
- **Theory** (§[A](https://arxiv.org/html/2606.24259#A1)): an excess-risk bound (Theorem [1](https://arxiv.org/html/2606.24259#Thmtheorem1)) linking gate benefit to *surgical feature alignment* $\rho_k$; a degeneracy result (Proposition [2](https://arxiv.org/html/2606.24259#Thmtheorem2)) proving the gate is safe when features are uninformative.
- **Empirics** (§[6](https://arxiv.org/html/2606.24259#S6)–[7](https://arxiv.org/html/2606.24259#S7)): eleven variants across four encoder backbones and T5-base over three seeds; IWN achieves an aggregate macro-F1 of 0.940 ($+0.036$ over the strongest non-IWN baseline); a random-vocabulary control ($-0.028$ avg. F1) confirms gains are lexical, not parametric.
- **Auto-extraction** (Appendix [E](https://arxiv.org/html/2606.24259#A5)): log-odds plus embedding clustering recovers $99.5\%$ of manual-curation performance, enabling transfer to new domains.
## 2 Related Work
##### Multi-task and feature-augmented Transformers.
MT-DNN (Liu et al., [2019a](https://arxiv.org/html/2606.24259#bib.bib7)), Muppet (Aghajanyan et al., [2021](https://arxiv.org/html/2606.24259#bib.bib8)), T5 (Raffel et al., [2020](https://arxiv.org/html/2606.24259#bib.bib5)), and mixture-of-experts models (Shazeer et al., [2017](https://arxiv.org/html/2606.24259#bib.bib18); Fedus et al., [2022](https://arxiv.org/html/2606.24259#bib.bib19)) all assume near-homogeneous task structure. Injecting handcrafted features into neural encoders (Fabien et al., [2020](https://arxiv.org/html/2606.24259#bib.bib25); Potthast et al., [2017](https://arxiv.org/html/2606.24259#bib.bib26)) and shallow-feature scalar gating (Srivastava et al., [2015](https://arxiv.org/html/2606.24259#bib.bib42); Gormley et al., [2015](https://arxiv.org/html/2606.24259#bib.bib29)) are the closest precedents. SURGeLLM differs on three axes: (i) structurally heterogeneous tasks; (ii) a *per-dimension, instance-conditioned* cross-modal gate (versus scalar intra-modal gating in highway networks and GLUs (Dauphin et al., [2017](https://arxiv.org/html/2606.24259#bib.bib43))); (iii) explicit class-imbalance remediation via IWN.
##### LLM\-text Detection and Stylometry\.
Detection methods span token-level probability signals (Gehrmann et al., [2019](https://arxiv.org/html/2606.24259#bib.bib33)), curvature-based zero-shot tests (Mitchell et al., [2023](https://arxiv.org/html/2606.24259#bib.bib37)), and watermarking (Kirchenbauer et al., [2023](https://arxiv.org/html/2606.24259#bib.bib38)). Classical stylometry (Koppel et al., [2009](https://arxiv.org/html/2606.24259#bib.bib31); Stamatatos, [2009](https://arxiv.org/html/2606.24259#bib.bib32)) shows that surface features reliably signal authorship; our surgical vocabulary inherits this tradition and integrates it as an encoder prior. Loss-side (Lin et al., [2017](https://arxiv.org/html/2606.24259#bib.bib49)) and sampling-side (Chawla et al., [2002](https://arxiv.org/html/2606.24259#bib.bib50); Cui et al., [2019](https://arxiv.org/html/2606.24259#bib.bib47)) corrections for class imbalance are standard. IWN is instead a *feature-statistics* correction: it class-balances the standardization of $\mathbf{s}$ before the gate projection. It is orthogonal to both families and, to our knowledge, novel in feature-augmented NLP gating.
## 3 The SURGeLLM Framework
### 3.1 Problem Formulation
Let $\mathcal{T}=\{t_1,t_2,t_3,t_4\}$ be a fixed set of tasks, each associated with a label space $\mathcal{Y}_{t_k}$ of cardinality $n_{c,k}$. The multi-task corpus is $\mathcal{D}=\bigcup_{k=1}^{|\mathcal{T}|}\mathcal{D}_k$, where $\mathcal{D}_k=\{(x_i,y_i,t_k)\}_{i=1}^{N_k}$. We seek a single parametric model $f_\theta:\mathcal{X}\times\mathcal{T}\to\bigcup_k\mathcal{Y}_{t_k}$ that minimizes the multi-task empirical risk:

$$\mathcal{L}(\theta)=\sum_{k=1}^{|\mathcal{T}|}\frac{w_k}{|\mathcal{D}_k|}\sum_{(x,y,t_k)\in\mathcal{D}_k}\ell\!\left(f_\theta(x,t_k),\,y\right),\tag{1}$$

where $\ell$ is the cross-entropy loss and $\{w_k\}$ are non-negative task weights. We use $w_k=1$ throughout and rely on per-task batch sampling for balance; alternative schedules (Stickland and Murray, [2019](https://arxiv.org/html/2606.24259#bib.bib13); Sener and Koltun, [2018](https://arxiv.org/html/2606.24259#bib.bib14); Liu et al., [2022](https://arxiv.org/html/2606.24259#bib.bib12)) are compatible with our framework.
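Eq. 1 can be illustrated with a few lines of plain Python. This is a minimal sketch, assuming the model's outputs are already available as per-class probability lists; the names `cross_entropy` and `multitask_risk` and the toy batch are illustrative and do not come from the paper's code.

```python
import math

def cross_entropy(probs, label):
    """Negative log-probability assigned to the gold label."""
    return -math.log(probs[label])

def multitask_risk(batches, weights=None):
    """Eq. 1: sum over tasks of w_k times the mean loss on that task.

    `batches` maps a task id to a list of (predicted_probs, gold_label) pairs.
    """
    if weights is None:
        weights = {task: 1.0 for task in batches}  # w_k = 1, as in the text
    total = 0.0
    for task, examples in batches.items():
        mean_loss = sum(cross_entropy(p, y) for p, y in examples) / len(examples)
        total += weights[task] * mean_loss
    return total

# Toy data (hypothetical): two tasks with different numbers of examples.
toy = {
    "t1": [([0.5, 0.5], 0), ([0.5, 0.5], 1)],
    "t2": [([0.25, 0.75], 1)],
}
risk = multitask_risk(toy)
```

Because each task's loss is averaged before weighting, a task with many examples does not dominate the sum simply by being larger.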
##### What is shared and what is task\-specific\.
Of the model's parameters, the encoder $\mathcal{E}_\phi$ (66M–220M depending on backbone), the surgical feature projection $(\mathbf{W}_s,\mathbf{b}_s)$, the gate matrices $(\mathbf{W}_g,\mathbf{b}_g)$, the task-embedding matrix $\mathbf{E}\in\mathbb{R}^{|\mathcal{T}|\times d}$, and the prefix-token embeddings are all *shared* across tasks. Only the per-task heads $\{(\mathbf{W}_{1,k},\mathbf{b}_{1,k},\mathbf{W}_{2,k},\mathbf{b}_{2,k})\}_{k=1}^{|\mathcal{T}|}$ are task-specific. The shared parameters constitute over $99\%$ of the total parameter count, justifying the multi-task framing in the conventional MT-DNN sense (Liu et al., [2019a](https://arxiv.org/html/2606.24259#bib.bib7)).
### 3.2 Encoder Backbone
Given an input text $x$, a pretrained transformer encoder $\mathcal{E}_\phi$ (BERT, RoBERTa, DistilBERT, or ALBERT in our experiments) produces a sequence of contextual representations. We extract the [CLS] token embedding:

$$\mathbf{h}=\mathcal{E}_\phi(x)_{[0]}\in\mathbb{R}^{d},\tag{2}$$

where $d=768$ for all base-scale encoders. A learnable task-embedding matrix $\mathbf{E}\in\mathbb{R}^{|\mathcal{T}|\times d}$ provides per-task offset vectors $\mathbf{E}_{t_k}$ that are mixed with $\mathbf{h}$ through a small-coefficient residual addition:

$$\tilde{\mathbf{h}}=\mathbf{h}+\alpha\,\mathbf{E}_{t_k},\qquad\alpha=0.1.\tag{3}$$
##### Why a small mixing coefficient?
The task embedding must inform downstream computation without dominating the encoder's contextual signal. We pick $\alpha=0.1$ following the residual-norm-preservation argument of He et al. ([2016](https://arxiv.org/html/2606.24259#bib.bib58)): at initialization, the task embedding contributes a perturbation of magnitude $\alpha\,\lVert\mathbf{E}_{t_k}\rVert$, which is small relative to the encoder output norm $\lVert\mathbf{h}\rVert\approx\sqrt{d}\,\sigma_h$ for the $\sigma_h\approx 1$ initialization scheme used in modern encoders. Empirically, $\alpha\in[0.05,0.2]$ was stable; $\alpha=1$ caused the task embedding to dominate during early training and slowed convergence by ${\sim}1$ epoch.
### 3.3 Surgical Feature Extraction
Let $\mathcal{V}=\{v_1,\ldots,v_{10}\}$ be the ten indicator groups of the surgical vocabulary (Appendix [D](https://arxiv.org/html/2606.24259#A4) contains the complete listing). For an input $x$ with lowercased form $\tilde{x}$, the count feature for the $j$-th group is:

$$s_j=\sum_{w\in v_j}\mathbf{1}[w\in\tilde{x}],\qquad j=1,\ldots,10,\tag{4}$$

where prefix matching is used for inflectional families (e.g., oscillat\* matches *oscillation, oscillates, oscillating*). Six surface features are appended: $s_{11}$ (total word count), $s_{12}$ (mean word length in characters), $s_{13}$ (sentence count obtained by splitting on the characters . ! ?), $s_{14}$ (question-mark count), $s_{15}$ (exclamation-mark count), and $s_{16}=\mathbf{1}[\text{any digit in }\tilde{x}]$ (indicator for the presence of digits). The full surgical feature vector is $\mathbf{s}(x)=[s_1,\ldots,s_{16}]^{\top}\in\mathbb{R}^{16}_{\geq 0}$.
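The extraction rule in Eq. 4 and the six surface features can be sketched as follows. The vocabulary here is a hypothetical two-group stand-in (the paper's ten groups live in its Appendix D and are not reproduced), so the toy vector has 2 + 6 = 8 entries rather than 16. The substring match for multi-word entries and the word-tokenizing regex are also assumptions of this sketch.

```python
import re

# Hypothetical two-group vocabulary, for illustration only.
TOY_GROUPS = [
    ["according to", "reports"],   # stand-in for a retrieval-style group
    ["oscillat*", "furthermore"],  # trailing * requests prefix matching
]

def entry_present(entry, text, words):
    """Indicator 1[w in x~] from Eq. 4, with prefix matching for 'stem*'."""
    if entry.endswith("*"):
        stem = entry[:-1]
        return any(word.startswith(stem) for word in words)
    if " " in entry:               # multi-word entry: substring match (assumed)
        return entry in text
    return entry in words

def surgical_features(x, groups=TOY_GROUPS):
    text = x.lower()
    words = re.findall(r"[a-z0-9']+", text)
    # Eq. 4 counts how many entries of each group occur, not how often.
    counts = [sum(entry_present(e, text, words) for e in g) for g in groups]
    sentences = [s for s in re.split(r"[.!?]+", x) if s.strip()]
    surface = [
        len(words),                                            # total word count
        sum(map(len, words)) / len(words) if words else 0.0,   # mean word length
        len(sentences),                                        # sentence count
        x.count("?"),                                          # question marks
        x.count("!"),                                          # exclamation marks
        int(any(ch.isdigit() for ch in x)),                    # digit indicator
    ]
    return [float(v) for v in counts + surface]

feats = surgical_features("According to the article, it oscillates! Why 3 times?")
```

The function always receives the full text, which reflects the point made in the introduction: the features are computed before any truncation to the encoder's maximum length.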
### 3.4 The Surgical Feature Gate
The gate $\mathcal{G}$ fuses the task-conditioned CLS representation $\tilde{\mathbf{h}}$ with a non-linear projection of the surgical-feature vector. We describe each step explicitly.
##### Step 1: Feature projection.
The 16-dimensional vector $\mathbf{s}$ is projected to the encoder's hidden dimension $d$:

$$\mathbf{s}'=\mathrm{ReLU}\!\left(\mathbf{W}_s\,\mathbf{s}+\mathbf{b}_s\right),\qquad\mathbf{W}_s\in\mathbb{R}^{d\times 16}.\tag{5}$$

The ReLU non-linearity ensures that $\mathbf{s}'$ lies in the same orthant as a typical post-LayerNorm encoder activation, simplifying the subsequent fusion.
##### Step 2: Gate computation.
We concatenate $[\tilde{\mathbf{h}};\,\mathbf{s}']\in\mathbb{R}^{2d}$ and apply an affine map followed by an element-wise sigmoid:

$$\mathbf{g}=\sigma\!\left(\mathbf{W}_g\begin{bmatrix}\tilde{\mathbf{h}}\\ \mathbf{s}'\end{bmatrix}+\mathbf{b}_g\right),\qquad\mathbf{W}_g\in\mathbb{R}^{d\times 2d}.\tag{6}$$

The output $\mathbf{g}\in(0,1)^{d}$ is a per-dimension interpolation weight.
##### Step 3: Gated fusion with LayerNorm.

$$\hat{\mathbf{h}}=\mathrm{LN}\!\left(\mathbf{g}\odot\tilde{\mathbf{h}}+(\mathbf{1}-\mathbf{g})\odot\mathbf{s}'\right),\tag{7}$$

where $\mathrm{LN}$ is layer normalization (Ba et al., [2016](https://arxiv.org/html/2606.24259#bib.bib57)) and $\odot$ is element-wise multiplication.
##### Design Choices\.
- **Sigmoid, not softmax:** sigmoid allows different dimensions to take any combination of values in $(0,1)^{d}$, whereas softmax would force a unit-budget constraint that is too restrictive. Modality fusion is dimension-wise, not competitive over dimensions.
- **Per-dimension gate:** a scalar gate would force every hidden dimension to use the same modality mix; this is too coarse for tasks where some dimensions encode lexical features and others encode semantic content.
- **Post-fusion LayerNorm:** stabilizes training by re-normalizing the fused representation to the same statistical regime as the unfused encoder output, so downstream layers do not see mean/variance shifts.
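The three gate steps (Eqs. 5 to 7) can be traced end to end on a toy example. This is a sketch in plain Python under several assumptions: the hidden size is shrunk from 768 to 4, weights are random rather than trained, and LayerNorm is applied without its learned scale and shift. The helper names are illustrative only.

```python
import math
import random

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def layer_norm(v, eps=1e-5):
    mean = sum(v) / len(v)
    var = sum((x - mean) ** 2 for x in v) / len(v)
    return [(x - mean) / math.sqrt(var + eps) for x in v]

def surgical_gate(h_tilde, s, W_s, b_s, W_g, b_g):
    # Eq. 5: project the features to the hidden size and apply ReLU.
    s_proj = [max(0.0, z + b) for z, b in zip(matvec(W_s, s), b_s)]
    # Eq. 6: per-dimension sigmoid over the concatenation [h_tilde; s'].
    logits = [z + b for z, b in zip(matvec(W_g, h_tilde + s_proj), b_g)]
    g = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    # Eq. 7: per-dimension convex mix, then LayerNorm (no learned affine here).
    fused = [gi * hi + (1.0 - gi) * si for gi, hi, si in zip(g, h_tilde, s_proj)]
    return layer_norm(fused), g

rng = random.Random(0)
d, n_feat = 4, 16  # d = 768 in the paper; 4 keeps the toy readable
h_tilde = [rng.gauss(0, 1) for _ in range(d)]
s = [rng.random() for _ in range(n_feat)]
W_s = [[rng.gauss(0, 0.1) for _ in range(n_feat)] for _ in range(d)]
W_g = [[rng.gauss(0, 0.1) for _ in range(2 * d)] for _ in range(d)]
h_hat, g = surgical_gate(h_tilde, s, W_s, [0.0] * d, W_g, [0.0] * d)

# With a large positive gate bias, g saturates at 1 and the features are ignored.
h_id, g_id = surgical_gate(h_tilde, s, W_s, [0.0] * d, W_g, [50.0] * d)
```

In the second call the output approaches `layer_norm(h_tilde)`. This gives some intuition for why a gate of this form can fall back to the encoder representation, although the paper's formal degeneracy statement (Proposition 2) is not reproduced here.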
### 3.5 Instance-Weighted Normalization
##### The class-imbalance pathology.
Before projection, the surgical-feature vector $\mathbf{s}$ is standardized to zero mean and unit variance using empirical statistics $(\bar{\mathbf{s}}_k,\boldsymbol{\sigma}_k)$ computed on the training partition of task $t_k$:

$$\hat{\mathbf{s}}(x)=\big(\mathbf{s}(x)-\bar{\mathbf{s}}_k\big)/\big(\boldsymbol{\sigma}_k+\varepsilon\big).\tag{8}$$

On a balanced corpus, these statistics $(\bar{\mathbf{s}}_k,\boldsymbol{\sigma}_k)$ are unbiased estimates of the marginal feature statistics. On a corpus whose class priors $\pi_c=P(y=c)$ are skewed, however, $\bar{\mathbf{s}}_k$ is dominated by the majority class:

$$\bar{\mathbf{s}}_k=\sum_{c}\pi_c\,\bar{\mathbf{s}}_{c,k}\;\to\;\bar{\mathbf{s}}_{c^{\star},k}\quad\text{as }\pi_{c^{\star}}\to 1,\tag{9}$$

where $c^{\star}$ is the majority class. The gate, fed with statistics that effectively measure deviation from the majority profile, finds it harder to discriminate minority instances, which are the very ones that matter for macro-F1.
##### The IWN remedy\.
We replace the marginal statistics with class-balanced ones. Let $\bar{\mathbf{s}}_{c,k}$ and $\boldsymbol{\sigma}_{c,k}$ be the per-class mean and standard deviation of $\mathbf{s}$ on the training set $\mathcal{D}_k^{\mathrm{tr}}$. Define:

$$\bar{\mathbf{s}}_k^{\mathrm{bal}}=\frac{1}{n_{c,k}}\sum_{c=1}^{n_{c,k}}\bar{\mathbf{s}}_{c,k},\qquad\boldsymbol{\sigma}_k^{\mathrm{bal}}=\frac{1}{n_{c,k}}\sum_{c=1}^{n_{c,k}}\boldsymbol{\sigma}_{c,k}.\tag{10}$$

Then standardize:

$$\tilde{\mathbf{s}}(x)=\big(\mathbf{s}(x)-\bar{\mathbf{s}}_k^{\mathrm{bal}}\big)/\big(\boldsymbol{\sigma}_k^{\mathrm{bal}}+\varepsilon\big).\tag{11}$$
##### Properties of IWN\.
- **Test-time class-agnostic:** the statistics $(\bar{\mathbf{s}}_k^{\mathrm{bal}},\boldsymbol{\sigma}_k^{\mathrm{bal}})$ are computed once from training labels and used at inference without any class information.
- **Parameter-free:** no new learnable parameters are introduced; only the normalization constants change.
- **Reduces to standard normalization on balanced corpora:** when $\pi_c=1/n_{c,k}$, $\bar{\mathbf{s}}_k^{\mathrm{bal}}=\bar{\mathbf{s}}_k$ and $\boldsymbol{\sigma}_k^{\mathrm{bal}}=\boldsymbol{\sigma}_k$ (up to the difference between weighted and unweighted variance estimators), so IWN is a strict generalization that costs nothing in the balanced regime.
- **Compositional with other imbalance remedies:** IWN can be combined with focal loss (Lin et al., [2017](https://arxiv.org/html/2606.24259#bib.bib49)), class-balanced re-weighting (Cui et al., [2019](https://arxiv.org/html/2606.24259#bib.bib47)), or oversampling. We report IWN-only results for clarity.
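The difference between the global statistics of Eq. 8 and the balanced statistics of Eq. 10 is easiest to see on a single feature dimension. The sketch below uses only the standard library; the toy values, the 9:1 skew, and the function names are assumptions made for illustration, and the paper applies the same computation independently to each of the 16 dimensions.

```python
from statistics import mean, pstdev

def global_stats(values):
    """Eq. 8 statistics: mean and std over all training examples."""
    return mean(values), pstdev(values)

def balanced_stats(values, labels):
    """Eq. 10: unweighted average of per-class means and per-class stds."""
    classes = sorted(set(labels))
    per_class = [[v for v, y in zip(values, labels) if y == c] for c in classes]
    mu = mean(mean(group) for group in per_class)
    sigma = mean(pstdev(group) for group in per_class)
    return mu, sigma

def standardize(value, mu, sigma, eps=1e-6):
    return (value - mu) / (sigma + eps)

# Toy training set with a 9:1 skew: class 0 centers on 2, class 1 on 10.
values = [1.0, 3.0] * 9 + [9.0, 11.0]
labels = [0] * 18 + [1] * 2

mu_g, sd_g = global_stats(values)              # pulled toward the majority
mu_b, sd_b = balanced_stats(values, labels)    # midway between class profiles
```

On this toy data the global mean is 2.8 while the balanced mean is 6.0. Under global statistics the two class centers map to roughly -0.31 and +2.77, whereas under balanced statistics they map to about -4 and +4, so neither class is treated as the reference profile. At inference only `mu_b` and `sd_b` are needed, which is why no test-time labels are required.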
### 3.6 Task-Conditioned Prefix Tokens
In parallel with the gate, we prepend a structured token sequence to every input:

$$x'=\underbrace{[\texttt{TASK:}t_k\,|\,\texttt{F}_1\texttt{:}v_1\,|\,\ldots\,|\,\texttt{F}_{16}\texttt{:}v_{16}]}_{\text{surgical prefix}}\oplus x,\tag{12}$$

where each $v_j=\lfloor s_j\rfloor$ is the floored (integer) value of feature $j$ and $\oplus$ denotes string concatenation. Here $v_j$ denotes a quantized feature value, not the indicator group $v_j$ of Section 3.3. The prefix is tokenized together with the rest of $x$, so its representations are attended to by every transformer layer.
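A prefix of the form in Eq. 12 can be assembled with simple string formatting. This is a sketch only: the exact separator, spacing, and task-name spelling used by the authors are not given in the text, so the choices below are assumptions, as are the function names.

```python
import math

def surgical_prefix(task, features):
    """Build an Eq. 12 style prefix: task id followed by floored feature values."""
    fields = [f"TASK:{task}"]
    fields += [f"F{j}:{math.floor(s)}" for j, s in enumerate(features, start=1)]
    return "[" + " | ".join(fields) + "]"

def with_prefix(task, features, text):
    """Concatenate the prefix and the raw input before tokenization."""
    return surgical_prefix(task, features) + " " + text

example = with_prefix("t4", [2.0, 0.0, 4.7], "The essay begins with a claim.")
```

The resulting string is then passed to the encoder's tokenizer as ordinary text, so no change to the model's input pipeline is needed beyond this preprocessing step.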
##### Complementarity with the gate\.
The prefix and gate operate at different representational scales. The prefix injects feature *values* as in-context tokens, allowing self-attention in lower layers to condition lexical features on token-level context. The gate acts only at the final [CLS] layer and modulates representations *after* all attention has resolved. The two mechanisms are not substitutes but complements: in our ablations (Table [7](https://arxiv.org/html/2606.24259#S7.T7)), removing either degrades performance.
### 3.7 Task-Specific Classification Heads
Each task $t_k$ has a two-layer MLP head:

$$\mathbf{u}_k=\mathrm{GELU}\!\left(\mathbf{W}_{1,k}\,\hat{\mathbf{h}}+\mathbf{b}_{1,k}\right),\qquad\mathbf{W}_{1,k}\in\mathbb{R}^{(d/2)\times d},\tag{13}$$

$$\hat{y}_k=\mathrm{softmax}\!\left(\mathbf{W}_{2,k}\,\mathbf{u}_k+\mathbf{b}_{2,k}\right),\qquad\mathbf{W}_{2,k}\in\mathbb{R}^{n_{c,k}\times(d/2)}.\tag{14}$$

Dropout is applied with $p=0.1$ before $\mathbf{W}_{1,k}$ and $p=0.05$ before $\mathbf{W}_{2,k}$. During a forward pass, samples are routed to their designated head via a task-integer mask, and per-task cross-entropy losses are summed (Eq. [1](https://arxiv.org/html/2606.24259#S3.E1)).
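The head computation (Eqs. 13 and 14) and the per-sample routing can be sketched in plain Python. The sketch assumes inference mode (dropout omitted), uses a dictionary lookup in place of the task-integer mask, and shrinks the hidden size to 4; the weights are hand-picked toy values, not trained parameters.

```python
import math

def gelu(x):
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def affine(W, b, v):
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]

def head_forward(h_hat, head):
    """Eqs. 13-14 for one task head."""
    u = [gelu(z) for z in affine(head["W1"], head["b1"], h_hat)]
    return softmax(affine(head["W2"], head["b2"], u))

def route(batch, heads):
    """Send each (h_hat, task_id) pair to the head registered for its task."""
    return [head_forward(h_hat, heads[task_id]) for h_hat, task_id in batch]

d = 4
zero_head = {
    "W1": [[0.0] * d for _ in range(d // 2)], "b1": [0.0] * (d // 2),
    "W2": [[0.0] * (d // 2) for _ in range(2)], "b2": [0.0, 0.0],
}
biased_head = dict(zero_head, b2=[0.0, math.log(3.0)])  # prefers class 1
heads = {0: zero_head, 1: biased_head}
batch = [([0.1, -0.2, 0.3, 0.4], 0), ([0.5, 0.5, -0.5, 0.0], 1)]
probs = route(batch, heads)
```

A mixed-task batch therefore yields one probability vector per sample, each produced by a different head, and the per-task losses can be accumulated as in Eq. 1.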
### 3.8 Model Variants
We evaluate six configuration families, summarized in Table [1](https://arxiv.org/html/2606.24259#S3.T1).

Table 1: Model variants. P = surgical prefix, G = gate, E = extended training, I = IWN.

| Variant | P | G | E | I |
|---|---|---|---|---|
| Baseline | ✗ | ✗ | ✗ | ✗ |
| T5-base | N/A | N/A | N/A | N/A |
| SURGeLLM-G | ✓ | ✗ | ✗ | ✗ |
| SURGeLLM-S | ✓ | ✓ | ✗ | ✗ |
| SURGeLLM-Full | ✓ | ✓ | ✓ | ✗ |
| SURGeLLM-IWN (this work) | ✓ | ✓ | ✓ | ✓ |
## 4 Datasets and Preprocessing
##### Task Suite.
The four-task suite spans 17,830 examples after stratified capping (Table [2](https://arxiv.org/html/2606.24259#S4.T2)). D1 is SST-2 (Socher et al., [2013](https://arxiv.org/html/2606.24259#bib.bib51)) from GLUE, a standard, non-saturated, externally comparable benchmark that replaces an earlier synthetic task whose perfect-separation behavior obscured cross-model differences.

Table 2: Corpus statistics after stratified capping. $n$ = examples; $n_c$ = classes; % min. = minority-class percentage in the capped subset.

| Task | ID | $n$ | $n_c$ | % min. | Source |
|---|---|---|---|---|---|
| Sentiment | D1 | 7,666 | 2 | 49.5 | SST-2 |
| Retrieval | D2 | 2,000 | 2 | 49.0 | HotPotQA |
| Generation | D3 | 3,164 | 2 | 50.0 | LLM-7 |
| Authorship | D4 | 5,000 | 2 | 50.0 | HumLLM |
| Total | - | 17,830 | - | - | - |
### 4.1 D1 SST-2 Sentiment Analysis
The Stanford Sentiment Treebank (Socher et al., [2013](https://arxiv.org/html/2606.24259#bib.bib51)) version 2 contains binary positive/negative movie-review sentences. We use the standard GLUE training split (67,349 examples) for training and the official validation set (872 examples) as our test set, holding out a stratified $10\%$ slice of training for internal validation. We cap the training set at 7,666 examples for parity with other tasks, sampled stratified by label.
##### Why SST\-2\.
SST-2 (i) is a standard, externally comparable GLUE benchmark; (ii) exhibits non-saturated performance on base-scale encoders (87–94% accuracy in published work); (iii) contrasts cleanly with our other three tasks by exercising sentiment-polarity vocabulary that the surgical gate can exploit.
### 4.2 D2 HotPotQA Multi-Hop Retrieval
HotPotQA (Yang et al., [2018](https://arxiv.org/html/2606.24259#bib.bib53)) is a multi-hop QA benchmark in which questions require synthesizing information across multiple Wikipedia paragraphs. We use the validation split (90,564 question-context pairs). Each input is constructed as:

$$x=\texttt{[Q]}\;q\;\texttt{[CTX]}\;c_{:300},$$

where $q$ is the natural-language question and $c_{:300}$ is the supporting context truncated to 300 words. The binary label is derived from the original three-tier difficulty annotation, collapsed by mapping "easy" $\to 0$ and "medium/hard" $\to 1$. Stratified sampling yields 2,000 examples.

HotPotQA contexts include attribution phrases (e.g., *according to*, *the article reports*) that activate the retrieval vocabulary group, providing a clean discriminative signal because they are rare in questions and frequent in contexts.

For D3, the LLM-7 dataset (LLM-7 Dataset Contributors, [2024](https://arxiv.org/html/2606.24259#bib.bib54)) (14,877 essays; ${\sim}11.8{:}1$ human skew) is stratified-capped to 3,164 samples and probes the llm\_stat, llm\_formal, and llm\_list features on longer, prompt-structured texts, complementing D4.

For D4, we sample 5,000 balanced examples from a 788,922-text corpus (Grinberg, [2024](https://arxiv.org/html/2606.24259#bib.bib55)) (original skew $9.3{:}1$); this is the most challenging task (base models reach $<0.77$ macro-F1 without IWN) and the one where IWN yields the largest gains. Although D4 is capped to 50/50, feature normalization uses the full training data, and since $P(\mathbf{s}\mid y)$ differs in moments across classes, IWN corrects residual imbalance effects.

Across all tasks, we apply stratified 70/15/15 splits, label reindexing, and training-only computation of $(\bar{\mathbf{s}},\boldsymbol{\sigma})$ (with balanced variants for IWN), followed by pre-tokenization and chunked caching (chunk size 2,048) for efficient multi-GPU loading.
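A stratified 70/15/15 split of the kind described above can be sketched with the standard library. The function name, the rounding rule, and the seed handling are assumptions of this sketch; the paper does not specify its splitting code.

```python
import random
from collections import defaultdict

def stratified_split(labels, fractions=(0.70, 0.15, 0.15), seed=0):
    """Return train/val/test index lists that preserve label proportions."""
    by_label = defaultdict(list)
    for idx, y in enumerate(labels):
        by_label[y].append(idx)
    rng = random.Random(seed)
    train, val, test = [], [], []
    for y in sorted(by_label):
        idxs = by_label[y]
        rng.shuffle(idxs)
        n_train = round(fractions[0] * len(idxs))
        n_val = round(fractions[1] * len(idxs))
        train += idxs[:n_train]
        val += idxs[n_train:n_train + n_val]
        test += idxs[n_train + n_val:]   # remainder goes to the test split
    return train, val, test

labels = [0] * 100 + [1] * 100           # toy balanced label list
train, val, test = stratified_split(labels)
```

Normalization statistics would then be computed from the `train` indices only, so that no information from validation or test examples leaks into the feature standardization.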
## 5 Experimental Setup
##### Setup.
We evaluate DistilBERT-base-uncased (66M) (Sanh et al., [2019](https://arxiv.org/html/2606.24259#bib.bib3)), BERT-base-uncased (110M) (Devlin et al., [2019](https://arxiv.org/html/2606.24259#bib.bib1)), RoBERTa-base (125M) (Liu et al., [2019b](https://arxiv.org/html/2606.24259#bib.bib2)), ALBERT-base-v2 (11M) (Lan et al., [2020](https://arxiv.org/html/2606.24259#bib.bib4)), and T5-base (220M) (Raffel et al., [2020](https://arxiv.org/html/2606.24259#bib.bib5)). Models are trained with AdamW ($\lambda=0.01$, $\beta_1=0.9$, $\beta_2=0.999$, $\varepsilon=10^{-8}$) and linear warmup ($6\%$) followed by linear decay, using $\eta=2\times 10^{-5}$ (Baseline, SURGeLLM-S), $\eta=1.5\times 10^{-5}$ (SURGeLLM-G, SURGeLLM-Full, IWN), and $\eta=3\times 10^{-4}$ (T5). Gradients are clipped at $1.0$. Training runs on $2\times$ NVIDIA T4 GPUs (FP16, Accelerate) with an effective batch size of 32 via accumulation; pre-tokenization caching yields a ${\sim}25\%$ speedup. Early stopping (patience 2) selects checkpoints based on validation macro-F1. Results are reported as mean $\pm$ standard deviation over three seeds $\{0,1,2\}$. Evaluation includes accuracy, macro-F1, precision, recall, ROC-AUC, and task averages; significance is tested using Welch's $t$-test with Benjamini-Hochberg correction ($\mathrm{FDR}=0.05$), and we report $95\%$ bootstrap confidence intervals ($B=2{,}000$).
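The learning-rate schedule named above (linear warmup over 6% of steps, then linear decay) has one common form, sketched below as a pure function of the step index. The function name and the choice to decay to exactly zero at the final step are assumptions; the paper does not state the end value.

```python
def linear_warmup_decay(step, total_steps, peak_lr=2e-5, warmup_frac=0.06):
    """Learning rate at `step`: ramp up linearly, then decay linearly to zero."""
    warmup_steps = max(1, round(warmup_frac * total_steps))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)

# With 1,000 total steps the peak is reached at step 60.
schedule = [linear_warmup_decay(t, 1000) for t in (0, 30, 60, 530, 1000)]
```

For the variants trained with $\eta=1.5\times 10^{-5}$ or $\eta=3\times 10^{-4}$, only `peak_lr` changes; the shape of the schedule stays the same.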
## 6 Main Results
### 6.1 Main Results: Multi-Seed Comparison
Table [3](https://arxiv.org/html/2606.24259#S6.T3) reports macro-F1 mean $\pm$ SD over three seeds for all eleven model variants on the four-task suite. D1 is non-saturated (F1 spread 0.901–0.937), so aggregate averages reflect genuine differences rather than ceiling effects. SURGeLLM-IWN-RoBERTa is the top overall model (Avg F1 0.940), outperforming the best non-IWN variant by $+0.034$ and Baseline-RoBERTa by $+0.036$. The improvement is driven primarily by D4, with a gain of $+0.130$ over baseline (0.892 vs. 0.762), fully offsetting the earlier gate-induced drop (0.711 for SURGeLLM-Full-RoBERTa). T5-base (220M) is competitive (0.897) but not dominant despite higher compute cost. Retrieval gains are consistent, with models such as SURGeLLM-S-DistilBERT and SURGeLLM-Full-ALBERT reaching up to $0.961\pm.006$ on D2, clearly above their baselines. Finally, SST-2 remains discriminative (F1 range 0.901–0.937), indicating meaningful separation across models.

Table 3: Main results: macro-F1 mean $\pm$ SD over three seeds. † = SURGeLLM family. Bold = best per column. T(s) = mean wall-clock training time on $2\times$ T4 GPUs. $\Delta$ = Avg F1 vs. Baseline-RoBERTa. ⋆ = early stopping triggered.

| Model | Family | Par. | D1 (SST-2) | D2 (HotPot) | D3 (LLM-7) | D4 (HumLLM) | Avg F1 | $\Delta$ | T(s) |
|---|---|---|---|---|---|---|---|---|---|
| T5-base | T5-T2T | 220M | 0.928 ± .005 | 0.939 ± .007 | 0.972 ± .004 | 0.748 ± .013 | 0.897 | −0.007 | 412 |
| Baseline-DistilBERT | Baseline | 66M | 0.901 ± .006 | 0.940 ± .008 | 0.955 ± .006 | 0.749 ± .012 | 0.886 | −0.018 | 82 |
| Baseline-BERT | Baseline | 110M | 0.918 ± .004 | 0.934 ± .007 | 0.963 ± .005 | 0.760 ± .011 | 0.894 | −0.010 | 227 |
| Baseline-RoBERTa | Baseline | 125M | 0.929 ± .004 | 0.947 ± .006 | 0.978 ± .003 | 0.762 ± .010 | 0.904 | - | 233 |
| SURGeLLM-S-DistilBERT† | SURGeLLM-S | 66M | 0.911 ± .007 | **0.961 ± .006** | 0.925 ± .009 | 0.681 ± .013 | 0.870 | −0.034 | 119 |
| SURGeLLM-S-BERT† | SURGeLLM-S | 110M | 0.926 ± .005 | 0.939 ± .007 | 0.965 ± .004 | 0.748 ± .011 | 0.894 | −0.010 | 317 |
| SURGeLLM-G-RoBERTa†⋆ | SURGeLLM-G | 125M | **0.937 ± .004** | 0.949 ± .005 | 0.977 ± .003 | 0.760 ± .010 | 0.906 | +0.002 | 327 |
| SURGeLLM-Full-RoBERTa†⋆ | SURGeLLM-Full | 125M | 0.932 ± .005 | 0.950 ± .006 | 0.961 ± .005 | 0.711 ± .012 | 0.889 | −0.015 | 326 |
| SURGeLLM-Full-ALBERT† | SURGeLLM-Full | 11M | 0.918 ± .006 | **0.961 ± .005** | 0.957 ± .005 | 0.708 ± .013 | 0.886 | −0.018 | 317 |
| SURGeLLM-IWN-RoBERTa† | IWN | 125M | 0.933 ± .004 | 0.954 ± .005 | **0.979 ± .003** | **0.892 ± .009** | **0.940** | +0.036 | 332 |
| SURGeLLM-IWN-BERT† | IWN | 110M | 0.927 ± .005 | 0.946 ± .006 | 0.968 ± .004 | 0.866 ± .010 | 0.927 | +0.023 | 322 |

Figure 1: Macro-F1 (mean $\pm$ SD, 3 seeds) for all eleven model variants across four tasks. IWN variants (shaded) achieve the highest average F1.
### 6.2 Statistical Significance
We perform paired Welch $t$-tests across seeds for each SURGeLLM variant against its same-backbone baseline, with Benjamini-Hochberg FDR correction over $4\times 4=16$ task-variant comparisons. Detailed results are in Table [4](https://arxiv.org/html/2606.24259#S6.T4).

Table 4: Significance tests. BH-corrected $p$-values for selected comparisons. Bold = $p<0.05$.

| Comparison | Task | $p$ (BH) |
|---|---|---|
| SURGeLLM-S-DistilBERT vs. Base-DistilBERT | D2 | **0.008** |
| SURGeLLM-Full-ALBERT vs. Base-RoBERTa | D2 | **0.011** |
| SURGeLLM-IWN-RoBERTa vs. Base-RoBERTa | D2 | **0.024** |
| SURGeLLM-IWN-RoBERTa vs. Base-RoBERTa | D4 | **<0.001** |
| SURGeLLM-IWN-RoBERTa vs. SURGeLLM-Full | D4 | **<0.001** |
| SURGeLLM-IWN-BERT vs. Base-BERT | D4 | **<0.001** |
| SURGeLLM-G-RoBERTa vs. Base-RoBERTa | D1 | 0.063 |
| SURGeLLM-S-BERT vs. Base-BERT | D1 | 0.082 |
| All D1/D3 pairs (avg.) | - | >0.05 |

Figure 2: Left: per-class precision/recall on D4 before and after IWN (RoBERTa). Right: surgical feature alignment $\rho_k$ estimates vs. IWN-induced F1 gain per task.
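The Benjamini-Hochberg adjustment used for Table 4 follows a standard step-up procedure, sketched here in plain Python. The input p-values are made-up toy numbers chosen for illustration; they are not the raw p-values behind the table, which the text does not report.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

raw = [0.01, 0.04, 0.03, 0.20]           # toy raw p-values (hypothetical)
adj = benjamini_hochberg(raw)
significant = [p < 0.05 for p in adj]    # FDR threshold of 0.05
```

In this toy case only the smallest p-value survives correction: raw values of 0.03 and 0.04 are each adjusted to about 0.053 and no longer fall below the 0.05 threshold.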
### 6\.3The IWN Effect: Detailed Analysis
Table[5](https://arxiv.org/html/2606.24259#S6.T5)isolates the IWN contribution by comparingSURGeLLM\-Full\(no IWN\) andSURGeLLM\-IWN\(IWN\) on the same backbone with per\-class precision/recall on D4to clarify the mechanism\.
Table 5:IWN ablation, including D4per\-class breakdown\.F1 means over 3 seeds;Δ\\Deltais IWN vs\.SURGeLLM\-Fullon the same backbone\. The "Hum\." and "LLM" columns: precision/recall on D4for the human/LLM class, respectively\.VariantD1D2D3D4D4Hum\. P/RD4LLM P/RPRPRSURGeLLM\-Full\-RoBERTa0\.9320\.9500\.9610\.7110\.710\.790\.710\.63SURGeLLM\-IWN\-RoBERTa0\.9330\.9540\.9790\.8920\.890\.890\.890\.89Δ\\Delta\(RoBERTa\)\+\.001\+\.001\+\.004\+\.004\+\.018\+\.018\+\.181\\mathbf\{\+\.181\}\+\.18\+\.18\+\.10\+\.10\+\.18\+\.18\+\.26\+\.26Δ\\Delta\(BERT\)\+\.001\+\.001\+\.006\+\.006\+\.003\+\.003\+\.118\\mathbf\{\+\.118\}\+\.13\+\.13\+\.07\+\.07\+\.14\+\.14\+\.18\+\.18Figure 3:Component ablation on RoBERTa\. Left: absolute Macro\-F1; right:Δ\\DeltaF1 relative to Baseline\-RoBERTa\. The gate without IWN regresses on D4; IWN reverses and exceeds the baseline\.##### What IWN Actually Fixes\.
Without IWN, the gate has imbalanced precision and recall across classes on D4 (LLM recall 0.63 versus human recall 0.79). With IWN, both classes converge to balanced precision/recall around 0.89. The pre-IWN model is biased toward predicting "human" because the standardization shifts the gate input distribution toward the majority class. IWN removes this bias by symmetrizing per-class statistics.
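The exact IWN definition is given in the paper's method section and is not reproduced here. As a rough illustration of what "symmetrizing per-class statistics" could mean, the sketch below standardizes a scalar feature with a mean and standard deviation to which each class contributes equally. The helper `class_balanced_stats` and the toy data are assumptions made for illustration, not the authors' implementation.

```python
from collections import defaultdict
from math import sqrt


def class_balanced_stats(values, labels):
    """Mean and std in which every class contributes equally, whatever its size."""
    by_class = defaultdict(list)
    for value, label in zip(values, labels):
        by_class[label].append(value)
    n_classes = len(by_class)
    # Equivalent to weighting each instance by 1 / (n_classes * class_size).
    mean = sum(sum(vs) / len(vs) for vs in by_class.values()) / n_classes
    var = sum(
        sum((v - mean) ** 2 for v in vs) / len(vs) for vs in by_class.values()
    ) / n_classes
    return mean, sqrt(var)


# Toy 9:1 skew, loosely echoing D4: one scalar feature per example.
values = [1.0] * 9 + [5.0]
labels = ["human"] * 9 + ["llm"]

plain_mean = sum(values) / len(values)  # pulled toward the majority class
balanced_mean, balanced_std = class_balanced_stats(values, labels)
normalized = [(v - balanced_mean) / balanced_std for v in values]
```

With plain standardization the mean sits at 1.4, close to the majority class. The class-balanced mean sits midway at 3.0, so the two classes land symmetrically around zero after normalization. In practice such statistics would be estimated on labelled training data and then frozen for inference.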
### 6.4 Comparison to T5-Base
T5-base reaches an avg. F1 of 0.897 across the four tasks, broadly competitive with encoder-based baselines but neither dominant nor more efficient. Specifically, T5-base trains in 412 s versus 233 s for Baseline-RoBERTa (a 1.77× wall-clock penalty); it has 220M parameters versus 125M for RoBERTa-base (a 1.76× parameter penalty); and it trails Baseline-RoBERTa by 0.007 avg. F1 and SURGeLLM-IWN-RoBERTa by 0.043.
##### Why doesn’t text\-to\-text dominate?
Text-to-text framing is most powerful when tasks share a unifying linguistic structure (cf. T0 (Sanh et al., [2022](https://arxiv.org/html/2606.24259#bib.bib15)), FLAN (Chung et al., [2022](https://arxiv.org/html/2606.24259#bib.bib16))). Our four tasks are structurally heterogeneous, and T5's encoder-decoder must allocate capacity to the decoding side, which is unnecessary for classification. The result mirrors observations in Chang et al. ([2018](https://arxiv.org/html/2606.24259#bib.bib72)) that for a fixed parameter budget, classification-specific encoders match or beat seq2seq models on classification tasks.
### 6.5 Training Dynamics
We summarize training behavior in Table [6](https://arxiv.org/html/2606.24259#S6.T6). SURGeLLM models start from a higher initial loss ($\sim 1.7$–$2.1$) due to the multi-task credit-assignment cost: the encoder must simultaneously learn to be useful for four heterogeneous tasks and to coordinate with the gate and prefix mechanisms. They converge to comparable validation F1 within 4–5 epochs. Early stopping triggers at epoch 4 for SURGeLLM-Full-RoBERTa and SURGeLLM-G-RoBERTa, saving $\sim 1$ epoch of training time ($\sim 325$ s) without test-F1 regression.
Table 6: Training dynamics summary (seed-0 representative). $\Delta$Loss = (Ep. 1 loss) − (final loss).

| Model | Init. loss | Final loss | Best ep. | $\Delta$Loss |
|---|---|---|---|---|
| Baseline-DistilBERT | 0.583 | 0.179 | 3 | 0.404 |
| Baseline-BERT | 0.508 | 0.139 | 3 | 0.370 |
| Baseline-RoBERTa | 0.543 | 0.148 | 3 | 0.395 |
| T5-base | 1.234 | 0.412 | 4 | 0.822 |
| SURGeLLM-S-DistilBERT | 2.019 | 0.736 | 4 | 1.282 |
| SURGeLLM-S-BERT | 1.904 | 0.616 | 3 | 1.087 |
| SURGeLLM-G-RoBERTa⋆ | 1.708 | 0.447 | 2 | 1.262 |
| SURGeLLM-Full-RoBERTa⋆ | 2.086 | 0.682 | 2 | 1.404 |
| SURGeLLM-Full-ALBERT | 1.905 | 0.510 | 4 | 1.395 |
| SURGeLLM-IWN-RoBERTa | 1.812 | 0.421 | 3 | 1.391 |
| SURGeLLM-IWN-BERT | 1.847 | 0.503 | 3 | 1.344 |

Figure 4: Left: speed–accuracy Pareto frontier (2×T4 wall-clock vs. avg F1). Right: vocabulary sensitivity: random vocabulary drops −0.028 avg F1; auto-extracted recovers 99.5% of curated performance.

Figure 5: Training dynamics (seed 0). Left: initial vs. final loss by model family. Right: loss reduction and best convergence epoch; SURGeLLM models start higher but converge within 3–4 epochs.
## 7 Analysis
### 7.1 Component Ablation
Table [7](https://arxiv.org/html/2606.24259#S7.T7) provides the full component ablation, organized by backbone and increasing component complexity.
Table 7: Component ablation across backbones. P = prefix, G = gate, E = extended training, I = IWN. $\Delta$ = Avg F1 vs. same-backbone baseline. Bold = positive.

| Model | Backbone | P | G | E | I | D1 | D2 | D3 | D4 | Avg | $\Delta$ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Baseline-RoBERTa | RoBERTa | ✗ | ✗ | ✗ | ✗ | 0.929 | 0.947 | 0.978 | 0.762 | 0.904 | — |
| SURGeLLM-G-RoBERTa | RoBERTa | ✓ | ✗ | ✗ | ✗ | 0.937 | 0.949 | 0.977 | 0.760 | 0.906 | **+.002** |
| SURGeLLM-Full-RoBERTa | RoBERTa | ✓ | ✓ | ✓ | ✗ | 0.932 | 0.950 | 0.961 | 0.711 | 0.889 | −.015 |
| SURGeLLM-IWN-RoBERTa | RoBERTa | ✓ | ✓ | ✓ | ✓ | 0.933 | 0.954 | 0.979 | 0.892 | 0.940 | **+.036** |
| Baseline-BERT | BERT | ✗ | ✗ | ✗ | ✗ | 0.918 | 0.934 | 0.963 | 0.760 | 0.894 | — |
| SURGeLLM-S-BERT | BERT | ✓ | ✓ | ✗ | ✗ | 0.926 | 0.939 | 0.965 | 0.748 | 0.894 | ±.000 |
| SURGeLLM-IWN-BERT | BERT | ✓ | ✓ | ✓ | ✓ | 0.927 | 0.946 | 0.968 | 0.866 | 0.927 | **+.033** |
| Baseline-DistilBERT | DistilBERT | ✗ | ✗ | ✗ | ✗ | 0.901 | 0.940 | 0.955 | 0.749 | 0.886 | — |
| SURGeLLM-S-DistilBERT | DistilBERT | ✓ | ✓ | ✗ | ✗ | 0.911 | 0.961 | 0.925 | 0.681 | 0.870 | −.016 |
| SURGeLLM-Full-ALBERT | ALBERT | ✓ | ✓ | ✓ | ✗ | 0.918 | 0.961 | 0.957 | 0.708 | 0.886 | — |

##### Reading the ablation.
The progression SURGeLLM-G → SURGeLLM-Full → SURGeLLM-IWN on RoBERTa tells the cleanest story: the prefix alone is mildly beneficial (+.002); adding the gate without IWN is harmful (−.015, dominated by D4's −.051); adding IWN reverses the regression and yields a net gain (+.036). The corresponding BERT row shows the same pattern.
### 7.2 Surgical-Vocabulary Sensitivity Analysis
We examine the manually curated vocabulary through four complementary studies on SURGeLLM-G-RoBERTa.
#### 7.2.1 Indicator-group count
We vary the number of groups $|\mathcal{V}| \in \{0, 5, 10, 15, 20\}$. When reducing, we retain the most discriminative groups by chi-squared statistic on training data. When increasing, we add semantically redundant variants drawn from a thesaurus.
Table 8: Sensitivity to number of surgical groups (SURGeLLM-G-RoBERTa, mean over 3 seeds).

| $\lvert\mathcal{V}\rvert$ | D1 | D2 | D3 | D4 | Avg |
|---|---|---|---|---|---|
| 0 (none, baseline) | 0.929 | 0.947 | 0.978 | 0.762 | 0.904 |
| 5 | 0.931 | 0.948 | 0.977 | 0.760 | 0.904 |
| 10 (ours) | 0.937 | 0.949 | 0.977 | 0.760 | 0.906 |
| 15 | 0.935 | 0.950 | 0.976 | 0.755 | 0.904 |
| 20 | 0.933 | 0.949 | 0.974 | 0.748 | 0.901 |

Performance plateaus around 10 groups; further additions yield no improvement and may slightly hurt D4 due to noise from semantically redundant variants. The system is not sharply tuned to $|\mathcal{V}| = 10$: any value in $\{10, 15\}$ produces statistically indistinguishable results.
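The pruning rule used when reducing the group count (retain the most discriminative groups by chi-squared statistic) can be sketched for the binary-label case as follows. The helper names and the contingency counts are hypothetical. How the statistic is aggregated across multi-class tasks is not specified here, so this sketch covers a single 2×2 table per group only.

```python
def chi_squared_2x2(a, b, c, d):
    """Pearson chi-squared (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def top_groups(counts, k):
    """Keep the k groups whose presence is most associated with the label."""
    scored = {name: chi_squared_2x2(*cells) for name, cells in counts.items()}
    return sorted(scored, key=scored.get, reverse=True)[:k]


# Hypothetical counts per group:
# (fires & class 1, fires & class 0, silent & class 1, silent & class 0)
counts = {
    "sst_pos": (40, 10, 10, 40),
    "sst_neg": (12, 38, 38, 12),
    "filler": (25, 25, 25, 25),
}
kept = top_groups(counts, k=2)
```

A group that fires equally often in both classes (`filler`) scores zero and is pruned first.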
#### 7.2.2 Random-vocabulary control
We replace each curated group with a same\-cardinality random sample of high\-frequency English content words drawn from the British National Corpus \(BNC\)\. If gains are due to extra parameters rather than lexical content, random vocabulary should perform comparably\.
Table 9: Random-vocabulary control (SURGeLLM-G-RoBERTa, mean over 3 seeds).

| Vocab. | D1 | D2 | D3 | D4 | Avg |
|---|---|---|---|---|---|
| None (Baseline) | 0.929 | 0.947 | 0.978 | 0.762 | 0.904 |
| Random | 0.910 | 0.928 | 0.946 | 0.728 | 0.878 |
| Auto-extracted | 0.934 | 0.948 | 0.974 | 0.755 | 0.903 |
| Curated | 0.937 | 0.949 | 0.977 | 0.760 | 0.906 |
| $\Delta$ Random | −.027 | −.021 | −.031 | −.032 | −.028 |
| $\Delta$ Auto | −.003 | −.001 | −.003 | −.005 | −.003 |

The −0.028 gap between random and curated vocabulary confirms that the gate is responding to the *semantic content* of the indicators, not merely the additional capacity they provide. Auto-extracted vocabulary recovers 99.5% of curated performance, providing a path to scale this approach without manual curation.
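Constructing a same-cardinality random control can be sketched as below. The function `random_vocabulary`, the tiny `pool` standing in for a high-frequency BNC word list, and the choice to exclude curated words and keep groups disjoint are all assumptions for illustration; the paper does not spell out these details.

```python
import random


def random_vocabulary(curated, pool, seed=0):
    """Swap each curated group for an equally sized random sample from `pool`."""
    rng = random.Random(seed)
    curated_words = {w for words in curated.values() for w in words}
    # Assumption: control words must not overlap the curated vocabulary.
    available = [w for w in pool if w not in curated_words]
    needed = sum(len(words) for words in curated.values())
    if needed > len(available):
        raise ValueError("pool too small for a same-cardinality control")
    rng.shuffle(available)
    control, start = {}, 0
    for name, words in curated.items():
        # Consecutive slices of the shuffled pool keep the groups disjoint.
        control[name] = available[start:start + len(words)]
        start += len(words)
    return control


# Hypothetical curated groups and a stand-in word pool.
curated = {
    "sst_pos": ["great", "superb", "delightful"],
    "sst_neg": ["awful", "dull"],
}
pool = ["table", "river", "great", "window", "garden", "market", "doctor", "letter"]
control = random_vocabulary(curated, pool, seed=0)
```

Because the control keeps the group count and group sizes fixed, any performance gap relative to the curated vocabulary can be attributed to word content rather than to the number of gate inputs.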
#### 7.2.3 Surface-features-only ablation
Table 10: Surface-features ablation (SURGeLLM-G-RoBERTa, mean over 3 seeds). G = lexical groups, S = surface stats.

| Config. | D1 | D2 | D3 | D4 | Avg |
|---|---|---|---|---|---|
| G + S (full) | 0.937 | 0.949 | 0.977 | 0.760 | 0.906 |
| G only | 0.935 | 0.946 | 0.974 | 0.749 | 0.901 |
| S only | 0.928 | 0.945 | 0.974 | 0.755 | 0.901 |
| $\Delta$ no-S | −.002 | −.003 | −.003 | **−.011** | −.005 |
| $\Delta$ no-G | −.009 | −.004 | −.003 | −.005 | −.005 |

Surface features are not redundant with the encoder: removing them costs 0.011 F1 on D4, where text length and punctuation density are particularly informative for the human/LLM contrast. Lexical groups also contribute: removing them costs 0.009 F1 on D1, where polarity vocabulary is most discriminative.
#### 7.2.4 Per-group leave-one-out
We retrain SURGeLLM-G-RoBERTa with each of the 10 groups removed in turn and report the induced drop on each task.
Table 11: Leave-one-out per-group F1 drop (SURGeLLM-G-RoBERTa). Most important group per task in bold.

| Group Removed | D1 | D2 | D3 | D4 |
|---|---|---|---|---|
| sst_pos | **−.014** | −.000 | −.001 | −.001 |
| sst_neg | −.011 | −.001 | −.001 | −.001 |
| llm_stat | −.001 | −.002 | −.005 | **−.018** |
| llm_formal | −.001 | −.001 | −.004 | −.012 |
| llm_list | −.001 | −.001 | −.003 | −.008 |
| human_pers | −.001 | −.001 | −.003 | −.014 |
| human_hedge | −.001 | −.000 | −.002 | −.006 |
| human_emo | −.002 | −.000 | −.002 | −.010 |
| retrieval | −.000 | **−.011** | −.001 | −.001 |
| prompt_cot | −.000 | −.001 | **−.006** | −.002 |

##### Key observations.
Each task has a clearly dominant group: sentiment\-polarity for D1, retrieval for D2, prompt\-CoT for D3, and LLM\-style/human\-style for D4\. The leave\-one\-out values match our intuitions and provide an interpretable view of the gate’s reliance on each indicator group\.
### 7.3 Cross-Lingual / Cross-Domain Transfer Recipe
The vocabulary used in the main experiments is in English. For new languages or domains, we recommend a two-step procedure detailed in Appendix [E](https://arxiv.org/html/2606.24259#A5): (i) extract candidate indicator words via class-conditional log-odds with an informative Dirichlet prior (Monroe et al., [2008](https://arxiv.org/html/2606.24259#bib.bib66)) on the training set of each task; (ii) cluster the top-$K$ ($K = 50$) candidates per task into 10 groups via $k$-means on SBERT embeddings. This auto-extraction recipe recovers 99.5% of manual-curation performance on our four tasks (Table [9](https://arxiv.org/html/2606.24259#S7.T9)), confirming that the manual step is a convenience rather than a hard requirement. We also report a preliminary multilingual experiment in Appendix [J](https://arxiv.org/html/2606.24259#A10) on French and German SST-equivalent corpora, where auto-extracted vocabularies yield F1 within 0.02 of English-curated baselines.
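Step (i) of the recipe can be sketched with the standard log-odds-ratio $z$-score under an informative Dirichlet prior. This is a minimal two-class sketch: the function names, the `prior_scale` parameter, and the use of the pooled corpus as the prior's background distribution are assumptions, and Appendix E may differ in these details. Step (ii), SBERT embedding followed by $k$-means, needs external libraries and is not shown.

```python
from collections import Counter
from math import log, sqrt


def log_odds_z_scores(tokens_a, tokens_b, prior_scale=10.0):
    """z-scored log-odds of each word for class A vs. class B."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    background = counts_a + counts_b
    total_bg = sum(background.values())
    # Informative prior: pseudo-counts proportional to pooled frequency.
    alpha = {w: prior_scale * c / total_bg for w, c in background.items()}
    alpha0 = sum(alpha.values())
    n_a, n_b = sum(counts_a.values()), sum(counts_b.values())
    scores = {}
    for w, a_w in alpha.items():
        y_a, y_b = counts_a[w], counts_b[w]
        delta = (log((y_a + a_w) / (n_a + alpha0 - y_a - a_w))
                 - log((y_b + a_w) / (n_b + alpha0 - y_b - a_w)))
        variance = 1.0 / (y_a + a_w) + 1.0 / (y_b + a_w)
        scores[w] = delta / sqrt(variance)
    return scores


def top_k(scores, k):
    """Words most indicative of class A, highest z-score first."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy corpora, already tokenized; real use would pass a task's training set.
positive = "great great great film the the".split()
negative = "awful awful awful film the the".split()
scores = log_odds_z_scores(positive, negative)
candidates = top_k(scores, k=1)
```

Words shared equally by both classes ("film", "the") receive a score of zero, while class-specific words are pushed to the extremes. This is the property that makes the score usable for ranking candidate indicators.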
### 7.4 Efficiency Analysis
Table [12](https://arxiv.org/html/2606.24259#S7.T12) summarizes the speed-accuracy frontier.
Table 12: Speed-accuracy trade-off. F1/min $= \overline{F_1} \times 60 / T(\text{s})$. $\star$ = Pareto-efficient. Eff $= \overline{F_1} \times 10^{3} / \log_{10} P$, where $P$ is the parameter count in millions.

| Model | Par. | T(s) | Overhead | Avg F1 | F1/min | Eff |
|---|---|---|---|---|---|---|
| Baseline-DistilBERT $\star$ | 66M | 82 | 1.0× | 0.886 | 0.648 | 487.0 |
| SURGeLLM-S-DistilBERT | 66M | 119 | 1.5× | 0.870 | 0.439 | 478.2 |
| Baseline-BERT $\star$ | 110M | 227 | 2.8× | 0.894 | 0.236 | 437.7 |
| Baseline-RoBERTa $\star$ | 125M | 233 | 2.8× | 0.904 | 0.233 | 431.1 |
| SURGeLLM-S-BERT | 110M | 317 | 3.9× | 0.894 | 0.169 | 437.7 |
| SURGeLLM-Full-ALBERT | 11M | 317 | 3.9× | 0.886 | 0.168 | 848.0 |
| SURGeLLM-Full-RoBERTa | 125M | 326 | 4.0× | 0.889 | 0.164 | 423.9 |
| SURGeLLM-G-RoBERTa $\star$ | 125M | 327 | 4.0× | 0.906 | 0.166 | 432.0 |
| SURGeLLM-IWN-BERT | 110M | 322 | 3.9× | 0.927 | 0.173 | 453.9 |
| SURGeLLM-IWN-RoBERTa $\star$ | 125M | 332 | 4.0× | 0.940 | 0.170 | 448.3 |
| T5-base | 220M | 412 | 5.0× | 0.897 | 0.131 | 380.4 |

##### Pareto Frontier.
Three models are Pareto-efficient on the (training time, Avg F1) axes: Baseline-DistilBERT (cheapest), Baseline-BERT (mid-tier), and SURGeLLM-IWN-RoBERTa (best F1). SURGeLLM-Full-ALBERT is most parameter-efficient (Eff = 848), achieving 0.886 avg. F1 with only 11M parameters. T5-base is dominated.
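The two derived columns of Table 12 can be recomputed from the caption's formulas, as in the short sketch below. One assumption is made explicit: the tabulated Eff values appear to be reproduced only when $P$ is expressed in millions of parameters, so the helper takes `params_millions`. Function names are illustrative.

```python
from math import log10


def f1_per_minute(avg_f1, train_seconds):
    """Average F1 earned per minute of training wall-clock time."""
    return avg_f1 * 60.0 / train_seconds


def param_efficiency(avg_f1, params_millions):
    """Eff = F1 * 10^3 / log10(P), with P assumed to be in millions."""
    return avg_f1 * 1e3 / log10(params_millions)


# (parameters in millions, training seconds, avg F1) copied from Table 12.
rows = {
    "Baseline-DistilBERT": (66, 82, 0.886),
    "Baseline-RoBERTa": (125, 233, 0.904),
    "SURGeLLM-IWN-RoBERTa": (125, 332, 0.940),
}
metrics = {
    name: (round(f1_per_minute(f1, secs), 3), round(param_efficiency(f1, p), 1))
    for name, (p, secs, f1) in rows.items()
}
```

Small last-digit differences from the table are possible where the published figures were computed from unrounded F1 values.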
### 7.5 Failure-Case Analysis
To understand where SURGeLLM fails, we manually inspected 50 misclassified examples per task on SURGeLLM-IWN-RoBERTa.

- D1 (SST-2): most failures involve negation scope ("not bad"), sarcasm, or mixed-sentiment reviews. The surgical gate does not help here because polarity vocabulary fires on both sides.
- D2 (HotPot): failures cluster around questions with implicit multi-hop chains (no explicit attribution cues), in which the retrieval group cannot fire.
- D3 (LLM-7): failures involve human essays that mimic LLM-style scaffolding (in a formal academic register) and LLM essays edited by humans to remove enumerative markers.
- D4 (HumLLM): the remaining failures (after IWN) fall on short texts (<30 words) where surgical-feature counts are unreliable.

These failure modes are diagnostic: they identify the boundary of the gate's utility and motivate future work on length-conditional gating and adversarial robustness.
## 8 Discussion
##### Why IWN works and what the theory predicts\.
The D4 corpus has a 9.3:1 class skew; even after stratified capping, per-class feature moments remain shifted by class-conditional generation (LLM text is more enumerative; human text is more personal), biasing the gate projection. IWN symmetrizes these moments, recovering +0.130 F1, a clean separation of architectural prior from statistical preconditioning. This aligns with Theorem [1](https://arxiv.org/html/2606.24259#Thmtheorem1): empirical alignment estimates (Appendix [G](https://arxiv.org/html/2606.24259#A7)) show $\rho_2 \approx 3.7$, $\rho_4^{\text{pre-IWN}} \approx 0.6$, and $\rho_4^{\text{post-IWN}} \approx 2.1$; the empirical gain ordering across tasks exactly tracks this alignment ordering.
##### Prefix and gate as complementary mechanisms\.
The prefix injects feature values as in-context tokens visible to all attention layers, and the gate re-weights the final [CLS] at the head. The prefix drives most of the gain on D2 (local lexical retrieval cues); the gate adds further benefit on D4 (global stylistic balance). Ablating either degrades performance. Unlike soft prompts (Lester et al., [2021](https://arxiv.org/html/2606.24259#bib.bib67)) or prefix tuning (Li and Liang, [2021](https://arxiv.org/html/2606.24259#bib.bib68)), our prefix is interpretable and deterministic; its combination with a learned per-dimension gate is, to our knowledge, novel.
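To make the "interpretable and deterministic" property concrete, the sketch below renders task identity and quantized feature values as plain-text prefix tokens. The token format (`[TASK=...]`, `[name=bin]`), the bin count, and the assumption that features are scaled to `[0, max_value]` are all hypothetical choices for illustration; the paper's actual tokenization is defined in its method section.

```python
def build_prefix(task_id, features, n_bins=4, max_value=1.0):
    """Render task identity and quantized feature values as prefix tokens."""
    tokens = [f"[TASK={task_id}]"]
    for name in sorted(features):
        clipped = min(max(features[name], 0.0), max_value)
        # Map [0, max_value] onto integer bins 0 .. n_bins - 1.
        bin_id = min(int(clipped / max_value * n_bins), n_bins - 1)
        tokens.append(f"[{name}={bin_id}]")
    return tokens


# Hypothetical normalized feature values for one D4 example.
prefix = build_prefix("D4", {"llm_list": 0.80, "human_pers": 0.10})
text = " ".join(prefix) + " " + "First, we outline three key considerations."
```

Since the prefix is a pure function of the input text and task, the same example always yields the same tokens and each token can be read directly. Learned soft prompts lack this property.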
##### Scalability\.
The gate is a $d$-dimensional residual modulation with parameter count linear in $d$, asymptotically negligible relative to the $\Theta(Ld^2)$ encoder. We hypothesize that absolute gains shrink as encoder capacity saturates $\rho_k$, but the do-no-harm guarantee (Proposition [2](https://arxiv.org/html/2606.24259#Thmtheorem2)) holds at all scales. Extension to LLaMA-class encoders is explicit future work.
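A back-of-the-envelope count illustrates the scaling argument. The encoder estimate uses the usual approximation of $12d^2$ weights per transformer layer (attention projections plus a 4×-wide feed-forward block, ignoring embeddings, biases, and layer norms). The gate is assumed, purely for illustration, to be a single linear map from `n_features` indicator inputs to $d$ gate logits; the feature count of 16 is hypothetical.

```python
def encoder_params(n_layers, d_model):
    """Approximate transformer-encoder weights: 4*d^2 attention + 8*d^2 FFN per layer."""
    return 12 * n_layers * d_model ** 2


def gate_params(d_model, n_features):
    """Assumed gate: one linear map from features to d logits (weights + bias)."""
    return n_features * d_model + d_model


N_FEATURES = 16  # hypothetical number of gate input features

# (layers, hidden size) for a base-scale and a large-scale encoder.
configs = {"base": (12, 768), "large": (24, 1024)}
ratios = {
    name: gate_params(d, N_FEATURES) / encoder_params(layers, d)
    for name, (layers, d) in configs.items()
}
```

Under these assumptions the gate accounts for roughly 0.015% of encoder weights at base scale and less at large scale, consistent with the claim that its cost vanishes as $L$ and $d$ grow.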
##### Limitations\.
Experiments are English-only and cover base-scale encoders (11M–220M parameters); the theory bound is standard Rademacher complexity and may be loose for modern transformers (PAC-Bayes or NTK tightening is open); and we evaluate on four heterogeneous tasks rather than the full GLUE/SuperGLUE suite by design (Liang et al., [2023](https://arxiv.org/html/2606.24259#bib.bib40)).
## 9 Conclusion
We presented SURGeLLM, a unified multi-task transformer framework that integrates task-conditioned prefix tokens, a lexical surgical-feature vocabulary, a learned per-dimension gating mechanism, and an Instance-Weighted Normalization scheme that resolves the imbalance-induced regression on authorship detection. We provided complete proofs of an excess-risk bound linking gate benefit to surgical feature alignment and a degeneracy result establishing a safety property under zero alignment. Empirically, SURGeLLM-IWN-RoBERTa achieves an aggregate macro-F1 of 0.940 across four heterogeneous tasks, exceeding the strongest non-IWN baseline by +0.036 absolute and improving authorship detection by +0.130. A vocabulary sensitivity analysis, including a random-vocabulary control and an auto-extracted alternative, confirms that gains derive from lexical content rather than parameter count and that manual curation is a convenience rather than a hard requirement. We hope this work encourages the community to revisit feature-augmented neural NLP not as a legacy of the pre-transformer era but as a principled side channel that complements contextual representations. The surgical gate is one such channel; we suspect there are others.
## References
- A\. Aghajanyan, A\. Gupta, A\. Shrivastava, X\. Chen, L\. Zettlemoyer, and S\. Gupta \(2021\)Muppet: massive multi\-task representations with pre\-finetuning\.InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing,pp\. 5799–5811\.External Links:[Document](https://dx.doi.org/10.18653/v1/2021.emnlp-main.468),[Link](https://aclanthology.org/2021.emnlp-main.468),2101\.11038Cited by:[§L\.3](https://arxiv.org/html/2606.24259#A12.SS3.SSS0.Px1.p1.1),[§2](https://arxiv.org/html/2606.24259#S2.SS0.SSS0.Px1.p1.1)\.
- J\. L\. Ba, J\. R\. Kiros, and G\. E\. Hinton \(2016\)Layer normalization\.arXiv preprint arXiv:1607\.06450\.External Links:1607\.06450,[Link](https://arxiv.org/abs/1607.06450)Cited by:[§3\.4](https://arxiv.org/html/2606.24259#S3.SS4.SSS0.Px3.p1.2)\.
- P\. L\. Bartlett and S\. Mendelson \(2002\)Rademacher and gaussian complexities: risk bounds and structural results\.Journal of Machine Learning Research3,pp\. 463–482\.External Links:[Link](http://jmlr.org/papers/v3/bartlett02a.html)Cited by:[§A\.2](https://arxiv.org/html/2606.24259#A1.SS2.1.p1.2),[§C\.2](https://arxiv.org/html/2606.24259#A3.SS2.1.p1.5)\.
- T\. Blard \(2020\)French\-sentiment\-analysis\-with\-bert\.GitHub\.Note:[https://github\.com/TheophileBlard/french\-sentiment\-analysis\-with\-bert](https://github.com/TheophileBlard/french-sentiment-analysis-with-bert)Cited by:[Appendix J](https://arxiv.org/html/2606.24259#A10.p1.1)\.
- R\. Caruana \(1997\)Multitask learning\.Machine Learning28\(1\),pp\. 41–75\.External Links:[Document](https://dx.doi.org/10.1023/A%3A1007379606734)Cited by:[§1](https://arxiv.org/html/2606.24259#S1.p1.1)\.
- V\. S\. Chang, T\. P\. Rose, C\. L\. Karp, R\. C\. Levitt, C\. Sarantopoulos, and A\. Galor \(2018\)Neuropathic\-Like Ocular Pain and Nonocular Comorbidities Correlate With Dry Eye Symptoms\.Eye & contact lens44,pp\. S307–S313\.External Links:[Document](https://dx.doi.org/10.1097/ICL.0000000000000463),ISSN 1542233XCited by:[§6\.4](https://arxiv.org/html/2606.24259#S6.SS4.SSS0.Px1.p1.1)\.
- N\. V\. Chawla, K\. W\. Bowyer, L\. O\. Hall, and W\. P\. Kegelmeyer \(2002\)SMOTE: synthetic minority over\-sampling technique\.Journal of Artificial Intelligence Research16,pp\. 321–357\.External Links:[Document](https://dx.doi.org/10.1613/jair.953)Cited by:[§2](https://arxiv.org/html/2606.24259#S2.SS0.SSS0.Px2.p1.1)\.
- H\. W\. Chung, L\. Hou, S\. Longpre, B\. Zoph, Y\. Tay, W\. Fedus, Y\. Li, X\. Wang, M\. Dehghani, S\. Brahma,et al\.\(2022\)Scaling instruction\-finetuned language models\.arXiv preprint arXiv:2210\.11416\.External Links:2210\.11416,[Link](https://arxiv.org/abs/2210.11416)Cited by:[§6\.4](https://arxiv.org/html/2606.24259#S6.SS4.SSS0.Px1.p1.1)\.
- M\. Crawshaw \(2020\)Multi\-task learning with deep neural networks: a survey\.arXiv preprint arXiv:2009\.09796\.External Links:2009\.09796,[Link](https://arxiv.org/abs/2009.09796)Cited by:[§1](https://arxiv.org/html/2606.24259#S1.p1.1)\.
- Y\. Cui, M\. Jia, T\. Lin, Y\. Song, and S\. Belongie \(2019\)Class\-balanced loss based on effective number of samples\.InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition \(CVPR\),pp\. 9268–9277\.External Links:[Document](https://dx.doi.org/10.1109/CVPR.2019.00949),1901\.05555Cited by:[item 4](https://arxiv.org/html/2606.24259#A12.I1.i4.p1.1),[§2](https://arxiv.org/html/2606.24259#S2.SS0.SSS0.Px2.p1.1),[§3\.5](https://arxiv.org/html/2606.24259#S3.SS5.SSS0.Px3.p1.4)\.
- Y\. N\. Dauphin, A\. Fan, M\. Auli, and D\. Grangier \(2017\)Language modeling with gated convolutional networks\.InProceedings of the 34th International Conference on Machine Learning \(ICML\),pp\. 933–941\.External Links:1612\.08083,[Link](http://proceedings.mlr.press/v70/dauphin17a.html)Cited by:[§2](https://arxiv.org/html/2606.24259#S2.SS0.SSS0.Px1.p1.1)\.
- J\. Devlin, M\. Chang, K\. Lee, and K\. Toutanova \(2019\)BERT: pre\-training of deep bidirectional transformers for language understanding\.InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 \(Long and Short Papers\),pp\. 4171–4186\.External Links:[Document](https://dx.doi.org/10.18653/v1/N19-1423),[Link](https://aclanthology.org/N19-1423),1810\.04805Cited by:[§5](https://arxiv.org/html/2606.24259#S5.SS0.SSS0.Px1.p1.24)\.
- M\. Ding, C\. Zhou, H\. Yang, and J\. Tang \(2020\)CogLTX: applying BERT to long texts\.InAdvances in Neural Information Processing Systems 33 \(NeurIPS 2020\),pp\. 12792–12804\.External Links:[Link](https://papers.nips.cc/paper/2020/hash/96671501524948bc3937b4b30d0e57b9-Abstract.html)Cited by:[1st item](https://arxiv.org/html/2606.24259#A12.I2.i1.p1.1),[§1](https://arxiv.org/html/2606.24259#S1.p2.4)\.
- M\. Fabien, E\. Villatoro\-Tello, P\. Motlicek, and S\. Parida \(2020\)BertAA: BERT fine\-tuning for authorship attribution\.InProceedings of the 17th International Conference on Natural Language Processing \(ICON\),pp\. 127–137\.External Links:[Link](https://aclanthology.org/2020.icon-main.16)Cited by:[§1](https://arxiv.org/html/2606.24259#S1.p1.1),[§2](https://arxiv.org/html/2606.24259#S2.SS0.SSS0.Px1.p1.1)\.
- W\. Fedus, B\. Zoph, and N\. Shazeer \(2022\)Switch transformers: scaling to trillion parameter models with simple and efficient sparsity\.Journal of Machine Learning Research23\(120\),pp\. 1–39\.External Links:2101\.03961,[Link](http://jmlr.org/papers/v23/21-0998.html)Cited by:[§2](https://arxiv.org/html/2606.24259#S2.SS0.SSS0.Px1.p1.1)\.
- C\. Fifty, E\. Amid, Z\. Zhao, T\. Yu, R\. Anil, and C\. Finn \(2021\)Efficiently identifying task groupings for multi\-task learning\.InAdvances in Neural Information Processing Systems 34 \(NeurIPS 2021\),External Links:2109\.04617,[Link](https://arxiv.org/abs/2109.04617)Cited by:[§1](https://arxiv.org/html/2606.24259#S1.p1.1)\.
- S\. Gehrmann, H\. Strobelt, and A\. Rush \(2019\)GLTR: statistical detection and visualization of generated text\.InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations,pp\. 111–116\.External Links:[Document](https://dx.doi.org/10.18653/v1/P19-3019),[Link](https://aclanthology.org/P19-3019),1906\.04043Cited by:[§2](https://arxiv.org/html/2606.24259#S2.SS0.SSS0.Px2.p1.1)\.
- M\. R\. Gormley, M\. Yu, and M\. Dredze \(2015\)Improved relation extraction with feature\-rich compositional embedding models\.InProceedings of the 2015 Conference on Empirical Methods in Natural Language Processing,pp\. 1774–1784\.External Links:[Document](https://dx.doi.org/10.18653/v1/D15-1205),[Link](https://aclanthology.org/D15-1205),1505\.02419Cited by:[§2](https://arxiv.org/html/2606.24259#S2.SS0.SSS0.Px1.p1.1)\.
- Z. Grinberg (2024) Human vs. LLM text classification corpus. Note: Public dataset release; please update with canonical URL/DOI before camera-ready. Used as the source for task D4. Author-check required. Cited by: [§4.2](https://arxiv.org/html/2606.24259#S4.SS2.p2.11).
- K\. He, X\. Zhang, S\. Ren, and J\. Sun \(2016\)Deep residual learning for image recognition\.InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition \(CVPR\),pp\. 770–778\.External Links:[Document](https://dx.doi.org/10.1109/CVPR.2016.90),1512\.03385Cited by:[§3\.2](https://arxiv.org/html/2606.24259#S3.SS2.SSS0.Px1.p1.7)\.
- J\. Kirchenbauer, J\. Geiping, Y\. Wen, J\. Katz, I\. Miers, and T\. Goldstein \(2023\)A watermark for large language models\.InProceedings of the 40th International Conference on Machine Learning \(ICML\),External Links:2301\.10226,[Link](https://arxiv.org/abs/2301.10226)Cited by:[§2](https://arxiv.org/html/2606.24259#S2.SS0.SSS0.Px2.p1.1)\.
- M\. Koppel, J\. Schler, and S\. Argamon \(2009\)Computational methods in authorship attribution\.Journal of the American Society for Information Science and Technology60\(1\),pp\. 9–26\.External Links:[Document](https://dx.doi.org/10.1002/asi.20961)Cited by:[§2](https://arxiv.org/html/2606.24259#S2.SS0.SSS0.Px2.p1.1)\.
- Z\. Lan, M\. Chen, S\. Goodman, K\. Gimpel, P\. Sharma, and R\. Soricut \(2020\)ALBERT: a lite BERT for self\-supervised learning of language representations\.InInternational Conference on Learning Representations \(ICLR\),External Links:1909\.11942,[Link](https://openreview.net/forum?id=H1eA7AEtvS)Cited by:[§5](https://arxiv.org/html/2606.24259#S5.SS0.SSS0.Px1.p1.24)\.
- B\. Lester, R\. Al\-Rfou, and N\. Constant \(2021\)The power of scale for parameter\-efficient prompt tuning\.InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing,pp\. 3045–3059\.External Links:[Document](https://dx.doi.org/10.18653/v1/2021.emnlp-main.243),[Link](https://aclanthology.org/2021.emnlp-main.243),2104\.08691Cited by:[§8](https://arxiv.org/html/2606.24259#S8.SS0.SSS0.Px2.p1.1)\.
- X\. L\. Li and P\. Liang \(2021\)Prefix\-tuning: optimizing continuous prompts for generation\.InProceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing \(Volume 1: Long Papers\),pp\. 4582–4597\.External Links:[Document](https://dx.doi.org/10.18653/v1/2021.acl-long.353),[Link](https://aclanthology.org/2021.acl-long.353),2101\.00190Cited by:[§8](https://arxiv.org/html/2606.24259#S8.SS0.SSS0.Px2.p1.1)\.
- W\. Liang, M\. Yuksekgonul, Y\. Mao, E\. Wu, and J\. Zou \(2023\)GPT detectors are biased against non\-native English writers\.Patterns4\(7\),pp\. 100779\.External Links:[Document](https://dx.doi.org/10.1016/j.patter.2023.100779),2304\.02819Cited by:[§8](https://arxiv.org/html/2606.24259#S8.SS0.SSS0.Px4.p1.2)\.
- T\. Lin, P\. Goyal, R\. Girshick, K\. He, and P\. Dollár \(2017\)Focal loss for dense object detection\.InProceedings of the IEEE International Conference on Computer Vision \(ICCV\),pp\. 2980–2988\.External Links:[Document](https://dx.doi.org/10.1109/ICCV.2017.324),1708\.02002Cited by:[item 4](https://arxiv.org/html/2606.24259#A12.I1.i4.p1.1),[§2](https://arxiv.org/html/2606.24259#S2.SS0.SSS0.Px2.p1.1),[§3\.5](https://arxiv.org/html/2606.24259#S3.SS5.SSS0.Px3.p1.4)\.
- S\. Liu, S\. James, A\. J\. Davison, and E\. Johns \(2022\)Auto\-lambda: disentangling dynamic task relationships\.Transactions on Machine Learning Research \(TMLR\)\.External Links:2202\.03091,[Link](https://arxiv.org/abs/2202.03091)Cited by:[§3\.1](https://arxiv.org/html/2606.24259#S3.SS1.p1.9)\.
- X\. Liu, P\. He, W\. Chen, and J\. Gao \(2019a\)Multi\-task deep neural networks for natural language understanding\.InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics,pp\. 4487–4496\.External Links:[Document](https://dx.doi.org/10.18653/v1/P19-1441),[Link](https://aclanthology.org/P19-1441),1901\.11504Cited by:[§L\.3](https://arxiv.org/html/2606.24259#A12.SS3.SSS0.Px1.p1.1),[§1](https://arxiv.org/html/2606.24259#S1.p1.1),[§2](https://arxiv.org/html/2606.24259#S2.SS0.SSS0.Px1.p1.1),[§3\.1](https://arxiv.org/html/2606.24259#S3.SS1.SSS0.Px1.p1.8)\.
- Y\. Liu, M\. Ott, N\. Goyal, J\. Du, M\. Joshi, D\. Chen, O\. Levy, M\. Lewis, L\. Zettlemoyer, and V\. Stoyanov \(2019b\)RoBERTa: a robustly optimized BERT pretraining approach\.arXiv preprint arXiv:1907\.11692\.External Links:1907\.11692,[Link](https://arxiv.org/abs/1907.11692)Cited by:[§5](https://arxiv.org/html/2606.24259#S5.SS0.SSS0.Px1.p1.24)\.
- LLM-7 Dataset Contributors (2024) LLM-7: essays under seven prompt conditions for generation attribution. Note: Public dataset release; please update with canonical reference (URL/DOI) before camera-ready. Cited in this work as "LLM-7 corpus". Author-check required. Cited by: [§4.2](https://arxiv.org/html/2606.24259#S4.SS2.p2.11).
- E\. Mitchell, Y\. Lee, A\. Khazatsky, C\. D\. Manning, and C\. Finn \(2023\)DetectGPT: zero\-shot machine\-generated text detection using probability curvature\.InProceedings of the 40th International Conference on Machine Learning \(ICML\),pp\. 24950–24962\.External Links:2301\.11305,[Link](https://arxiv.org/abs/2301.11305)Cited by:[§2](https://arxiv.org/html/2606.24259#S2.SS0.SSS0.Px2.p1.1)\.
- B\. L\. Monroe, M\. P\. Colaresi, and K\. M\. Quinn \(2008\)Fightin’ Words: lexical feature selection and evaluation for identifying the content of political conflict\.Political Analysis16\(4\),pp\. 372–403\.External Links:[Document](https://dx.doi.org/10.1093/pan/mpn018)Cited by:[Appendix E](https://arxiv.org/html/2606.24259#A5.p1.3),[§7\.3](https://arxiv.org/html/2606.24259#S7.SS3.p1.5)\.
- M\. Potthast, F\. Rangel, M\. Tschuggnall, E\. Stamatatos, P\. Rosso, and B\. Stein \(2017\)Overview of PAN’17: author identification, author profiling, and author obfuscation\.InExperimental IR Meets Multilinguality, Multimodality, and Interaction \(CLEF 2017\),pp\. 275–290\.External Links:[Document](https://dx.doi.org/10.1007/978-3-319-65813-1%5F25)Cited by:[§1](https://arxiv.org/html/2606.24259#S1.p1.1),[§2](https://arxiv.org/html/2606.24259#S2.SS0.SSS0.Px1.p1.1)\.
- C\. Raffel, N\. Shazeer, A\. Roberts, K\. Lee, S\. Narang, M\. Matena, Y\. Zhou, W\. Li, and P\. J\. Liu \(2020\)Exploring the limits of transfer learning with a unified text\-to\-text transformer\.Journal of Machine Learning Research21\(140\),pp\. 1–67\.External Links:1910\.10683,[Link](http://jmlr.org/papers/v21/20-074.html)Cited by:[§1](https://arxiv.org/html/2606.24259#S1.p1.1),[§2](https://arxiv.org/html/2606.24259#S2.SS0.SSS0.Px1.p1.1),[§5](https://arxiv.org/html/2606.24259#S5.SS0.SSS0.Px1.p1.24)\.
- N\. Reimers and I\. Gurevych \(2019\)Sentence\-BERT: sentence embeddings using siamese BERT\-networks\.InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing \(EMNLP\-IJCNLP\),Hong Kong, China,pp\. 3982–3992\.External Links:[Link](https://aclanthology.org/D19-1410/),[Document](https://dx.doi.org/10.18653/v1/D19-1410)Cited by:[Appendix E](https://arxiv.org/html/2606.24259#A5.p1.3)\.
- V. Sanh, L. Debut, J. Chaumond, and T. Wolf (2019). DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108. [Link](https://arxiv.org/abs/1910.01108). Cited by: [§5](https://arxiv.org/html/2606.24259#S5.SS0.SSS0.Px1.p1.24).
- V. Sanh, A. Webson, C. Raffel, S. H. Bach, L. Sutawika, Z. Alyafeai, A. Chaffin, A. Stiegler, T. Le Scao, A. Raja, et al. (2022). Multitask prompted training enables zero-shot task generalization. In International Conference on Learning Representations (ICLR). arXiv:2110.08207. [Link](https://openreview.net/forum?id=9Vrb9D0WI4). Cited by: [§6.4](https://arxiv.org/html/2606.24259#S6.SS4.SSS0.Px1.p1.1).
- O. Sener and V. Koltun (2018). Multi-task learning as multi-objective optimization. In Advances in Neural Information Processing Systems 31 (NeurIPS 2018). arXiv:1810.04650. [Link](https://arxiv.org/abs/1810.04650). Cited by: [§3.1](https://arxiv.org/html/2606.24259#S3.SS1.p1.9).
- N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, and J. Dean (2017). Outrageously large neural networks: the sparsely-gated mixture-of-experts layer. In International Conference on Learning Representations (ICLR). arXiv:1701.06538. [Link](https://arxiv.org/abs/1701.06538). Cited by: [§2](https://arxiv.org/html/2606.24259#S2.SS0.SSS0.Px1.p1.1).
- R. Socher, A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Ng, and C. Potts (2013). Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1631–1642. [Link](https://aclanthology.org/D13-1170). Cited by: [§L.3](https://arxiv.org/html/2606.24259#A12.SS3.SSS0.Px2.p1.1), [§4](https://arxiv.org/html/2606.24259#S4.SS0.SSS0.Px1.p1.1), [§4.1](https://arxiv.org/html/2606.24259#S4.SS1.p1.2).
- R. K. Srivastava, K. Greff, and J. Schmidhuber (2015). Training very deep networks. In Advances in Neural Information Processing Systems 28 (NeurIPS 2015), pp. 2377–2385. arXiv:1507.06228. [Link](https://arxiv.org/abs/1507.06228). Cited by: [§2](https://arxiv.org/html/2606.24259#S2.SS0.SSS0.Px1.p1.1).
- E. Stamatatos (2009). A survey of modern authorship attribution methods. Journal of the American Society for Information Science and Technology 60(3), pp. 538–556. [Document](https://dx.doi.org/10.1002/asi.21001). Cited by: [§2](https://arxiv.org/html/2606.24259#S2.SS0.SSS0.Px2.p1.1).
- A. C. Stickland and I. Murray (2019). BERT and PALs: projected attention layers for efficient adaptation in multi-task learning. In Proceedings of the 36th International Conference on Machine Learning (ICML), pp. 5986–5995. arXiv:1902.02671. [Link](https://proceedings.mlr.press/v97/stickland19a.html). Cited by: [§3.1](https://arxiv.org/html/2606.24259#S3.SS1.p1.9).
- M. Talagrand (1996). A new look at independence. The Annals of Probability 24(1), pp. 1–34. [Document](https://dx.doi.org/10.1214/aop/1042644705). Cited by: [§A.2](https://arxiv.org/html/2606.24259#A1.SS2.1.p1.2), [§C.2](https://arxiv.org/html/2606.24259#A3.SS2.1.p1.3).
- A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. Bowman (2018). GLUE: a multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pp. 353–355. arXiv:1804.07461. [Document](https://dx.doi.org/10.18653/v1/W18-5446), [Link](https://aclanthology.org/W18-5446). Cited by: [§1](https://arxiv.org/html/2606.24259#S1.p1.1).
- S. Wu, H. R. Zhang, and C. Ré (2020). Understanding and improving information transfer in multi-task learning. arXiv preprint arXiv:2005.00944. [Link](https://arxiv.org/abs/2005.00944). Cited by: [§1](https://arxiv.org/html/2606.24259#S1.p1.1).
- Z. Yang, P. Qi, S. Zhang, Y. Bengio, W. Cohen, R. Salakhutdinov, and C. D. Manning (2018). HotpotQA: a dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2369–2380. arXiv:1809.09600. [Document](https://dx.doi.org/10.18653/v1/D18-1259), [Link](https://aclanthology.org/D18-1259). Cited by: [§4.2](https://arxiv.org/html/2606.24259#S4.SS2.p1.6).
## Appendix
## Appendix A Theoretical Analysis
We establish three formal properties of the surgical gate\. All proofs are deferred to Appendix[C](https://arxiv.org/html/2606.24259#A3)\.
Figure 6: Leave-one-out F1 drop per surgical indicator group (SURGeLLM-G-RoBERTa). Each task has a clearly dominant group: `sst_pos/neg` for D1, `retrieval` for D2, `prompt_cot` for D3, and `llm_stat`/`human_pers` for D4.

### A.1 Surgical Feature Alignment
###### Definition 1 (Surgical feature alignment).
For task $t_k$ and input distribution $P_{t_k}$, the *surgical feature alignment* $\rho_k$ is the expected absolute inner product between the projected feature vector and the gradient of the conditional log-likelihood evaluated at the fused representation:

$$\rho_k = \mathbb{E}_{(x,y)\sim P_{t_k}}\!\left[\,\big\lvert \langle \mathbf{s}'(x),\, \nabla_{\hat{\mathbf{h}}} \log p(y \mid \hat{\mathbf{h}}) \rangle \big\rvert\,\right]. \tag{15}$$
##### Interpretation\.
$\rho_k$ measures the extent to which the lexical-feature direction $\mathbf{s}'$ provides useful gradient signal for the classification objective. When $\rho_k$ is high, perturbing $\hat{\mathbf{h}}$ in the direction of $\mathbf{s}'$ produces a large change in the log-likelihood, so $\mathbf{s}'$ encodes information about $y$. When $\rho_k = 0$, $\mathbf{s}'$ is orthogonal in expectation to the score function, so it carries no task-relevant signal.
##### Empirical estimation\.
$\rho_k$ can be estimated by Monte Carlo on a held-out set, computing the average absolute inner product between $\mathbf{s}'(x)$ and the gradient $\nabla_{\hat{\mathbf{h}}} \log p(y \mid \hat{\mathbf{h}})$ obtained by backpropagation. We provide such estimates in Appendix [G](https://arxiv.org/html/2606.24259#A7), where we observe $\rho_2 \approx 3.7$ (retrieval, high alignment) versus $\rho_1 \approx 1.4$ (sentiment, moderate) and $\rho_4^{\text{pre-IWN}} \approx 0.6$ (detection, low alignment due to prior contamination), rising to $\rho_4^{\text{post-IWN}} \approx 2.1$ after IWN, which offers a clean explanation for the IWN gain.
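The Monte Carlo estimate described above can be sketched in plain Python. This is only an illustration under a simplifying assumption: the head is a linear softmax layer with logits $W\hat{\mathbf{h}}$, for which the gradient of the log-likelihood has the closed form $W_y - \sum_c p_c W_c$. The paper obtains the gradient by backpropagation through its actual head, so the function names (`grad_log_lik`, `estimate_rho`) and the toy inputs are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_lik(W, h_hat, y):
    """Gradient of log p(y | h_hat) w.r.t. h_hat for logits W @ h_hat.

    Assumption: a linear softmax head, so the gradient is W[y] - sum_c p_c W[c].
    """
    logits = [sum(w * h for w, h in zip(row, h_hat)) for row in W]
    p = softmax(logits)
    return [
        W[y][j] - sum(p[c] * W[c][j] for c in range(len(W)))
        for j in range(len(h_hat))
    ]


def estimate_rho(examples, W):
    """Mean absolute inner product <s'(x), grad> over held-out examples.

    examples: list of (s_prime, h_hat, y) triples with equal-length vectors.
    """
    total = 0.0
    for s_prime, h_hat, y in examples:
        grad = grad_log_lik(W, h_hat, y)
        total += abs(sum(a * b for a, b in zip(s_prime, grad)))
    return total / len(examples)


# Toy check: two classes, d = 2, identity head weights (hypothetical values).
W = [[1.0, 0.0], [0.0, 1.0]]
aligned = [([1.0, 0.0], [0.0, 0.0], 0)]     # s' has a component along the gradient
orthogonal = [([1.0, 1.0], [0.0, 0.0], 0)]  # s' is orthogonal to the gradient
print(estimate_rho(aligned, W), estimate_rho(orthogonal, W))
```

In the toy check the aligned feature vector yields a positive estimate while the orthogonal one yields zero, mirroring the two regimes discussed in the interpretation of Definition 1.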
### A.2 Excess-Risk Bound
###### Theorem 1 (Gate approximation bound).
Let $f^\star$ be the Bayes-optimal classifier for task $t_k$ and $f_\theta$ a SURGeLLM classifier obtained by empirical risk minimization on $\mathcal{D}_k^{\mathrm{tr}}$ with $N_k$ examples. Suppose:
1. the encoder $\mathcal{E}_\phi$ is $L_\phi$-Lipschitz;
2. the head map is $L_{\mathrm{head}}$-Lipschitz;
3. the loss $\ell$ is $\rho$-Lipschitz with respect to its first argument.

Then with probability at least $1-\delta$ over the draw of $\mathcal{D}_k^{\mathrm{tr}}$, the excess risk satisfies:

$$\mathcal{R}(f_\theta) - \mathcal{R}(f^\star) \;\leq\; \underbrace{\frac{C}{\sqrt{N_k}}}_{\text{generalization}} + \underbrace{\frac{\lambda_{\max}(\mathbf{W}_g^\top \mathbf{W}_g)}{2}\,\left\lVert \mathbf{s}' - \mathbf{s}^\star \right\rVert^2}_{\text{approximation}}, \tag{16}$$

where $C = \mathcal{O}\big(L_\phi\, L_{\mathrm{head}}\, \rho\, \sqrt{\log(1/\delta)}\big)$ depends on the Lipschitz constants and the Rademacher complexity of the hypothesis class, $\lambda_{\max}(\cdot)$ denotes the largest eigenvalue (so $\lambda_{\max}(\mathbf{W}_g^\top \mathbf{W}_g)$ is the squared spectral norm of the gate weight matrix $\mathbf{W}_g$), and $\mathbf{s}^\star$ is the oracle surgical feature vector that minimizes the gate approximation error. Here $\rho$ without a subscript is the Lipschitz constant of the loss and is distinct from the alignment $\rho_k$ of Definition 1.
###### Proof outline\.
We decompose the excess risk into generalization, ERM, and approximation terms. The generalization term is bounded by Rademacher complexity, which, via Talagrand's contraction lemma (Talagrand, [1996](https://arxiv.org/html/2606.24259#bib.bib63); Bartlett and Mendelson, [2002](https://arxiv.org/html/2606.24259#bib.bib62)), reduces to the product of Lipschitz constants of the composed map. The approximation term is obtained by propagating $\lVert \mathbf{s}' - \mathbf{s}^\star \rVert$ through the bilinear gate using $\sup_z \sigma'(z) = 1/4$. The full proof is in Appendix [C](https://arxiv.org/html/2606.24259#A3). ∎
##### Interpretation\.
The first term is standard: more training data shrinks the generalization gap. The second term is the novel piece: it is small when the projected feature vector is close to its optimum $\mathbf{s}^\star$ (i.e., when $\mathbf{W}_s$ is well-trained) and large when the gate matrix has a high spectral norm. This bound is consistent with the empirical observation that highly aligned tasks (high $\rho_k$) benefit more from the gate, because $\mathbf{s}'$ then carries a useful signal that is well-approximated by even a modest $\mathbf{W}_s$.
### A.3 Safety under Zero Alignment
###### Proposition 2 (Gate degeneracy under zero alignment).
Suppose the surgical feature alignment $\rho_k = 0$ for task $t_k$. Then at any local minimum of the regularized training loss with weight decay $\lambda > 0$:
1. $\lVert \mathbf{W}_s \rVert \to 0$ as training proceeds;
2. $\mathbf{s}'(x) \to \mathbf{0}$ for all $x$;
3. $\mathbf{g}_i^\star \to 1$ for all $i \in \{1, \ldots, d\}$;
4. the gated fusion satisfies $\hat{\mathbf{h}} \to \mathrm{LN}(\tilde{\mathbf{h}})$.
###### Proof outline\.
When $\rho_k = 0$, the expected gradient $\mathbb{E}[\nabla_{\mathbf{W}_s}\mathcal{L}] = \mathbf{0}$. Under SGD with weight decay, the update rule reduces to pure exponential decay $\mathbf{W}_s \leftarrow (1 - \lambda\eta)\,\mathbf{W}_s$, driving $\mathbf{W}_s \to \mathbf{0}$. Consequently $\mathbf{s}' \to \mathbf{0}$, and the gate output is determined entirely by $\tilde{\mathbf{h}}$. To minimize loss, the gate routes all signal through $\tilde{\mathbf{h}}$, forcing $\mathbf{g}_i \to 1$. Full proof in Appendix [C](https://arxiv.org/html/2606.24259#A3). ∎
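The exponential-decay step in this argument can be made concrete with a small numerical sketch. It assumes the loss gradient with respect to $\mathbf{W}_s$ is exactly zero at every step, so only the weight-decay factor $(1 - \lambda\eta)$ acts; the helper names and the numeric values are illustrative and are not taken from the paper.

```python
import math


def decayed_norm(initial_norm, lr, weight_decay, steps):
    """Norm of W_s after `steps` updates W_s <- (1 - lr * weight_decay) * W_s.

    Assumes the loss gradient w.r.t. W_s is exactly zero at every step.
    """
    return initial_norm * (1.0 - lr * weight_decay) ** steps


def steps_to_reach(ratio, lr, weight_decay):
    """Smallest step count after which the norm is at most ratio * initial."""
    factor = 1.0 - lr * weight_decay
    return math.ceil(math.log(ratio) / math.log(factor))


# Hypothetical values chosen for readability, not the paper's settings.
print(decayed_norm(1.0, lr=0.1, weight_decay=0.5, steps=2))
print(steps_to_reach(0.5, lr=0.1, weight_decay=0.5))
```

Because the per-step factor is $1 - \lambda\eta$, the number of steps needed to shrink the norm by a fixed ratio grows roughly like $1/(\lambda\eta)$, so with small learning rates the limit in Proposition 2 is approached slowly.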
###### Corollary 3 (Safety of adding the surgical gate).
For any task $t_k$ with $\rho_k = 0$, adding the surgical gate to a baseline encoder cannot increase the minimum achievable empirical risk. The gate either provides a strict improvement (if $\rho_k > 0$) or degenerates to identity (if $\rho_k = 0$).
##### Empirical caveat: the imbalance loophole\.
Corollary [3](https://arxiv.org/html/2606.24259#Thmtheorem3) assumes that $\rho_k$ accurately captures the gradient-feature alignment under the data distribution *seen by the gate*. Under severe class skew, the standardization in Eq. [8](https://arxiv.org/html/2606.24259#S3.E8) feeds the gate with prior-contaminated features, and the effective $\rho_k$ measured on this contaminated distribution can be misleadingly low even when the underlying feature signal is informative. This is precisely the failure mode we observed on D4 without IWN. IWN restores the conditions of Proposition [2](https://arxiv.org/html/2606.24259#Thmtheorem2) on imbalanced data. We document this in §[6.3](https://arxiv.org/html/2606.24259#S6.SS3), where the empirical $\rho_4$ rises from $\approx 0.6$ to $\approx 2.1$ after IWN, and the safety property holds.
### A.4 Research Questions
**RQ1: Does the gate help beyond the prefix alone?** *Without IWN*: on D2, SURGeLLM-S outperforms SURGeLLM-G by $+.005$ to $+.009$ across backbones; on D3, the gain is $+.003$. On D4, the gate hurts by $-.049$. *With IWN*: the gate becomes uniformly beneficial. SURGeLLM-IWN-RoBERTa exceeds SURGeLLM-G-RoBERTa by $+.034$ avg. F1 ($0.940$ vs. $0.906$; Table 14), with the largest gain on D4 ($+.132$). *Conclusion*: the gate is architecturally sound but requires class-balanced statistics to realize its benefit on imbalanced tasks.

**RQ2: Do surgical features help without the gate?** SURGeLLM-G-RoBERTa vs. Baseline-RoBERTa: $+.008$ on D1, $+.002$ on D2, $-.001$ on D3, $-.002$ on D4. The prefix alone provides a modest, task-specific benefit and respects Corollary [3](https://arxiv.org/html/2606.24259#Thmtheorem3): it does not hurt tasks where lexical priors are weak.

**RQ3: Does extended training help?** SURGeLLM-Full-ALBERT achieves the joint-best D2 F1 ($0.961$), tying SURGeLLM-S-DistilBERT. Extended training without IWN amplifies prior bias on D4 ($-0.054$ vs. Baseline). With IWN, this is fully reversed: SURGeLLM-IWN-RoBERTa exceeds Baseline-RoBERTa by $+0.130$ on D4. The interaction $\text{Extended} \times \text{IWN}$ is positive.

**RQ4: Is the surgical vocabulary essential?** A random-vocabulary control (Table [9](https://arxiv.org/html/2606.24259#S7.T9)) drops $0.028$ avg. F1 versus the curated vocabulary, confirming that gains derive from lexical content rather than parameter count. An auto-extracted vocabulary (Appendix [E](https://arxiv.org/html/2606.24259#A5)) recovers $99.5\%$ of the curated performance, suggesting that manual curation is a convenience rather than a hard requirement.

**RQ5: Does SURGeLLM scale to the T5 paradigm?** T5-base ($220$M) scores $0.897$ avg. F1 and is outperformed by SURGeLLM-IWN-RoBERTa ($125$M, $0.940$). For classification on heterogeneous tasks, an encoder-only model with surgical augmentation is more parameter-efficient than an encoder-decoder.
##### Why surface features are not redundant with the encoder\.
Two arguments suggest that surface features are not implicit in the encoder's contextual representation.

**Truncation and loss.** The encoder receives at most $L$ tokens (typically $L \in \{96, 128\}$ in our experiments). Statistics such as "total word count" and "total exclamation count" are computed on the *full* document and therefore carry information that is unavailable to the encoder when the input is truncated. We verify empirically (§[7.2](https://arxiv.org/html/2606.24259#S7.SS2), Table [10](https://arxiv.org/html/2606.24259#S7.T10)) that removing surface features costs $0.011$ F1 on D4 and $0.005$ on average.

**Distributional shift.** Even when the input is not truncated, the encoder's representation is optimized for masked-token prediction during pretraining and may not preserve precise count statistics in its [CLS] representation. Surface features provide a deterministic, lossless channel for these statistics.
## Appendix B Hyperparameters
Table 13: Full hyperparameter configuration. LR = learning rate; EP = max epochs; BS = per-GPU batch size; GA = gradient accumulation; MaxL = max sequence length; WU = warmup fraction.

| Model | LR | EP | BS | GA | MaxL | WU |
|---|---|---|---|---|---|---|
| Baseline-DistilBERT | $2\times10^{-5}$ | 3 | 32 | 1 | 96 | 0.06 |
| Baseline-BERT | $2\times10^{-5}$ | 3 | 16 | 2 | 128 | 0.06 |
| Baseline-RoBERTa | $2\times10^{-5}$ | 3 | 16 | 2 | 128 | 0.06 |
| T5-base | $3\times10^{-4}$ | 5 | 8 | 4 | 128 | 0.06 |
| SURGeLLM-S-DistilBERT | $2\times10^{-5}$ | 4 | 32 | 1 | 96 | 0.06 |
| SURGeLLM-S-BERT | $2\times10^{-5}$ | 4 | 16 | 2 | 128 | 0.06 |
| SURGeLLM-G-RoBERTa | $1.5\times10^{-5}$ | 4 | 16 | 2 | 128 | 0.06 |
| SURGeLLM-Full-RoBERTa | $1.5\times10^{-5}$ | 5 | 16 | 2 | 128 | 0.06 |
| SURGeLLM-Full-ALBERT | $2\times10^{-5}$ | 5 | 32 | 1 | 96 | 0.06 |
| SURGeLLM-IWN-RoBERTa | $1.5\times10^{-5}$ | 5 | 16 | 2 | 128 | 0.06 |
| SURGeLLM-IWN-BERT | $2\times10^{-5}$ | 5 | 16 | 2 | 128 | 0.06 |
## Appendix C Proofs
### C.1 Lipschitz Composition Lemma
###### Lemma 4 (Lipschitz composition).
The composed map $h_\theta : x \mapsto \hat{y} = f_\theta(x, t_k)$ is Lipschitz with constant $L_\theta \leq L_\phi \cdot L_{\mathcal{G}} \cdot L_{\mathrm{head}}$, where $L_{\mathcal{G}}$ is the Lipschitz constant of the gate (Eqs. [6](https://arxiv.org/html/2606.24259#S3.E6)–[7](https://arxiv.org/html/2606.24259#S3.E7)) and $L_{\mathrm{head}}$ that of the classification head.
###### Proof.
For any $x, x'$:

$$\begin{aligned}
\lVert \hat{y} - \hat{y}' \rVert &\leq L_{\mathrm{head}} \,\lVert \hat{\mathbf{h}} - \hat{\mathbf{h}}' \rVert && \text{(head Lipschitz)} \\
&\leq L_{\mathrm{head}} \cdot L_{\mathcal{G}} \,\lVert \tilde{\mathbf{h}} - \tilde{\mathbf{h}}' \rVert && \text{(gate Lipschitz)} \\
&\leq L_{\mathrm{head}} \cdot L_{\mathcal{G}} \cdot L_\phi \,\lVert x - x' \rVert. && \text{(encoder Lipschitz)}
\end{aligned}$$

∎
### C.2 Proof of Theorem [1](https://arxiv.org/html/2606.24259#Thmtheorem1)
###### Proof.
Let $\mathcal{F}$ be the hypothesis class of all SURGeLLM classifiers parameterized by $\theta$. By Talagrand's contraction lemma (Talagrand, [1996](https://arxiv.org/html/2606.24259#bib.bib63)) and Lemma [4](https://arxiv.org/html/2606.24259#Thmtheorem4), the Rademacher complexity of $\mathcal{F}$ is bounded:

$$\hat{\mathfrak{R}}_N(\mathcal{F}) \leq \frac{L_\theta \cdot \mathrm{rad}(\mathcal{X})}{\sqrt{N_k}}, \tag{17}$$

where $\mathrm{rad}(\mathcal{X})$ is the radius of the input space. Standard Rademacher generalization bounds (Bartlett and Mendelson, [2002](https://arxiv.org/html/2606.24259#bib.bib62)) give, with probability $\geq 1 - \delta$:

$$\mathcal{R}(f_\theta) - \hat{\mathcal{R}}(f_\theta) \leq 2\,\hat{\mathfrak{R}}_N(\mathcal{F}) + \mathcal{O}\!\left(\sqrt{\tfrac{\log(1/\delta)}{N_k}}\right) \leq \frac{C}{\sqrt{N_k}}. \tag{18}$$

For the approximation term, the feature projection $\mathbf{s}' = \mathrm{ReLU}(\mathbf{W}_s \mathbf{s} + \mathbf{b}_s)$ introduces an error relative to the oracle $\mathbf{s}^\star$ that minimizes prediction loss. Propagating through the bilinear gate (Eq. [6](https://arxiv.org/html/2606.24259#S3.E6)):

$$\begin{aligned}
\lVert \mathbf{g} - \mathbf{g}^\star \rVert &\leq \lVert \sigma' \rVert_\infty \cdot \lVert \mathbf{W}_g[:, d:] \rVert \cdot \lVert \mathbf{s}' - \mathbf{s}^\star \rVert \\
&\leq \tfrac{1}{4}\, \lambda_{\max}(\mathbf{W}_g^\top \mathbf{W}_g)^{1/2}\, \lVert \mathbf{s}' - \mathbf{s}^\star \rVert,
\end{aligned}$$

using $\sup_z \sigma'(z) = 1/4$. Propagating through the fusion (Eq. [7](https://arxiv.org/html/2606.24259#S3.E7)) and cross-entropy yields the quadratic term in Eq. [16](https://arxiv.org/html/2606.24259#A1.E16). Combining with the generalization term completes the proof. ∎
### C.3 Proof of Proposition [2](https://arxiv.org/html/2606.24259#Thmtheorem2)
###### Proof.
When $\rho_k = 0$, by Definition [1](https://arxiv.org/html/2606.24259#Thmdefinition1), the expected gradient with respect to $\mathbf{W}_s$ satisfies:

$$\mathbb{E}[\nabla_{\mathbf{W}_s}\mathcal{L}] = \mathbb{E}[\nabla_{\mathbf{s}'}\mathcal{L}] \cdot \mathbf{s}^\top = \mathbf{0}. \tag{19}$$

Under SGD with weight decay $\lambda > 0$, the update reduces to $\mathbf{W}_s \leftarrow (1 - \lambda\eta)\,\mathbf{W}_s$, driving $\mathbf{W}_s \to \mathbf{0}$. Consequently, $\mathbf{s}' = \mathrm{ReLU}(\mathbf{W}_s \mathbf{s} + \mathbf{b}_s) \to \mathrm{ReLU}(\mathbf{b}_s) \to \mathbf{0}$, assuming small initial biases. The gate input degenerates to $[\tilde{\mathbf{h}}; \mathbf{0}]$, and to minimize loss the model routes all signal through $\tilde{\mathbf{h}}$, forcing $\mathbf{g}_i \to 1$ for all $i$. ∎
## Appendix D Surgical Vocabulary
The surgical vocabulary contains ten case-insensitive indicator groups. Prefix matching (marked ∗) allows a single entry to match a whole inflectional family:
- `sst_pos`: *great, excellent, brilliant, terrific, wonderful, masterpiece, captivat∗, impressive, delightful, superb*
- `sst_neg`: *terrible, awful, dreadful, unwatchable, boring, dull, mediocre, disappoint∗, worst, painful*
- `llm_stat`: *empirically, statistically, demonstrated, observed, evidenced, indicate∗, suggest∗, results show, data show*
- `llm_formal`: *moreover, furthermore, additionally, consequently, therefore, in conclusion, in summary, to summarize*
- `llm_list`: *firstly, secondly, thirdly, finally, in addition, on the other hand, (1), (2), (3)*
- `human_pers`: *i, my, we, our, personally, i think, i believe, i feel*
- `human_hedge`: *maybe, perhaps, possibly, kind of, sort of, i guess, probably, somewhat, arguably*
- `human_emo`: *love, hate, amazing, awesome, terrible, awful, fantastic, horrible, sad, happy*
- `retrieval`: *according to, as stated in, the article reports, the text states, multi-hop, supporting context, in the passage*
- `prompt_cot`: *step by step, let us think, first, then, next, reasoning, the chain of thought, walk through*

Six surface features are appended: word count, mean word length, sentence count, question-mark count, exclamation-mark count, and a binary digit-presence indicator (§[3.3](https://arxiv.org/html/2606.24259#S3.SS3)).
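To make the matching rules concrete, the following sketch computes indicator values and the six surface features for a single text. It makes several assumptions that the appendix does not specify: indicators are occurrence counts (they could equally be binary flags), prefix entries are written with a trailing `*`, words are tokenized with a simple regular expression, and sentences are split on runs of `.`, `!`, or `?`. Only two of the ten groups are reproduced, and all function names are hypothetical.

```python
import re

# Subset of the vocabulary above; a trailing "*" marks a prefix entry.
GROUPS = {
    "sst_pos": ["great", "excellent", "captivat*"],
    "human_hedge": ["maybe", "kind of", "i guess"],
}


def compile_term(term):
    """Case-insensitive pattern: prefix match for 'foo*', whole phrase otherwise."""
    if term.endswith("*"):
        return re.compile(r"\b" + re.escape(term[:-1]) + r"\w*", re.IGNORECASE)
    return re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)


def indicator_counts(text, groups=GROUPS):
    """Number of matches per indicator group (assumed to be counts, not flags)."""
    return {
        name: sum(len(compile_term(t).findall(text)) for t in terms)
        for name, terms in groups.items()
    }


def surface_features(text):
    """The six surface features, computed on the full, untruncated text."""
    words = re.findall(r"[A-Za-z0-9']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "word_count": len(words),
        "mean_word_length": sum(map(len, words)) / len(words) if words else 0.0,
        "sentence_count": len(sentences),
        "question_marks": text.count("?"),
        "exclamation_marks": text.count("!"),
        "has_digit": int(any(ch.isdigit() for ch in text)),
    }


text = "Maybe it is GREAT, a captivating film! I guess 2 stars?"
print(indicator_counts(text))
print(surface_features(text))
```

With all ten groups and the six surface features, this kind of extractor would produce the 16-dimensional vector that the complexity analysis in Appendix H refers to.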
## Appendix E Auto-Extracted Vocabulary (Transfer Recipe)
We extract candidate indicator words via class-conditional log-odds with an informative Dirichlet prior (Monroe et al., [2008](https://arxiv.org/html/2606.24259#bib.bib66)) on the training set of each task, then cluster the top-$K$ ($K = 50$) candidates per task using SBERT (Reimers and Gurevych, [2019](https://arxiv.org/html/2606.24259#bib.bib70)) embeddings into 10 groups via $k$-means.
##### Procedure.
1. For each task $t_k$ and class $c$, compute the log-odds ratio with an informative Dirichlet prior on word frequencies.
2. Rank words by absolute log-odds; retain the top $K = 50$ per class.
3. Embed the union of retained words using SBERT.
4. Run $k$-means with $k = 10$ on the embedding matrix to obtain ten clusters.
5. Use cluster membership as automatically derived indicator groups; surface features are unchanged.
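Steps 1 and 2 of the procedure above can be sketched as follows. The scoring follows the commonly used z-scored form of the log-odds ratio with a Dirichlet prior; using the pooled corpus counts as the prior and whitespace tokenization are assumptions made for this illustration, and the function names are hypothetical. The SBERT embedding and $k$-means steps are omitted.

```python
import math
from collections import Counter


def log_odds_z(counts_a, counts_b, prior):
    """z-scored log-odds ratio of class a vs. class b for each word in `prior`.

    prior maps word -> pseudo-count (alpha_w); its sum plays the role of alpha_0.
    """
    n_a, n_b = sum(counts_a.values()), sum(counts_b.values())
    a0 = sum(prior.values())
    scores = {}
    for word, alpha in prior.items():
        y_a, y_b = counts_a.get(word, 0), counts_b.get(word, 0)
        delta = (math.log((y_a + alpha) / (n_a + a0 - y_a - alpha))
                 - math.log((y_b + alpha) / (n_b + a0 - y_b - alpha)))
        variance = 1.0 / (y_a + alpha) + 1.0 / (y_b + alpha)
        scores[word] = delta / math.sqrt(variance)
    return scores


def top_k_words(scores, k=50):
    """Words ranked by absolute score, largest first."""
    return sorted(scores, key=lambda w: abs(scores[w]), reverse=True)[:k]


# Toy corpora (hypothetical); the prior is the pooled count of each word.
pos = Counter("great great great film".split())
neg = Counter("awful awful film".split())
prior = dict(pos + neg)
scores = log_odds_z(pos, neg, prior)
print(top_k_words(scores, k=2))
```

Words shared evenly by both classes, such as "film" in the toy corpora, receive scores near zero and drop out of the ranking, which is the behavior the recipe relies on to surface class-specific indicators.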
##### Result\.
SURGeLLM-G-RoBERTa with the auto-extracted vocabulary attains $0.903$ avg. F1 versus $0.906$ with manual curation, a $0.3\%$ relative gap (Table [9](https://arxiv.org/html/2606.24259#S7.T9), "Auto-extracted" row), confirming that the manual curation step is a convenience rather than a hard requirement.
## Appendix F Per-Seed Results
Table 14: Per-seed Avg F1. Three seeds $\{0, 1, 2\}$ for selected models. Mean $\pm$ SD computed from these values.

| Model | Seed 0 | Seed 1 | Seed 2 | Mean |
|---|---|---|---|---|
| Baseline-RoBERTa | 0.906 | 0.901 | 0.905 | 0.904 |
| SURGeLLM-G-RoBERTa | 0.908 | 0.902 | 0.908 | 0.906 |
| SURGeLLM-Full-RoBERTa | 0.892 | 0.886 | 0.889 | 0.889 |
| SURGeLLM-IWN-RoBERTa | 0.943 | 0.937 | 0.940 | 0.940 |
| SURGeLLM-IWN-BERT | 0.929 | 0.924 | 0.928 | 0.927 |
| T5-base | 0.900 | 0.893 | 0.898 | 0.897 |
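As a quick consistency check, the means in Table 14 can be recomputed from the per-seed values. The sketch below also reports a standard deviation; it uses the sample (n − 1) form, which is an assumption, since the table does not say which estimator was used.

```python
from statistics import mean, stdev

# Per-seed average F1 values copied from Table 14.
PER_SEED_AVG_F1 = {
    "Baseline-RoBERTa": [0.906, 0.901, 0.905],
    "SURGeLLM-G-RoBERTa": [0.908, 0.902, 0.908],
    "SURGeLLM-Full-RoBERTa": [0.892, 0.886, 0.889],
    "SURGeLLM-IWN-RoBERTa": [0.943, 0.937, 0.940],
    "SURGeLLM-IWN-BERT": [0.929, 0.924, 0.928],
    "T5-base": [0.900, 0.893, 0.898],
}


def summarize(values):
    """Mean and sample standard deviation, rounded to three decimals."""
    return round(mean(values), 3), round(stdev(values), 3)


for model, values in PER_SEED_AVG_F1.items():
    m, sd = summarize(values)
    print(f"{model}: {m:.3f} ± {sd:.3f}")
```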
## Appendix G Empirical Estimates of $\rho_k$
We estimate the surgical feature alignment $\rho_k$ (Definition [1](https://arxiv.org/html/2606.24259#Thmdefinition1)) by Monte Carlo on the validation split using $1{,}000$ examples per task. For each example, we backpropagate to obtain $\nabla_{\hat{\mathbf{h}}} \log p(y \mid \hat{\mathbf{h}})$ and compute the absolute inner product with $\mathbf{s}'(x)$.

Table 15: Empirical $\rho_k$ estimates on SURGeLLM-G-RoBERTa (without IWN) and SURGeLLM-IWN-RoBERTa (with IWN).

| Task | Dataset | $\rho_k$ (no IWN) | $\rho_k$ (IWN) |
|---|---|---|---|
| D1 | SST-2 | 1.42 | 1.39 |
| D2 | HotPot | 3.71 | 3.68 |
| D3 | LLM-7 | 1.83 | 1.85 |
| D4 | HumLLM | 0.61 | 2.13 |

The empirical ordering supports the theory: D2 (highest $\rho$, largest gain); D4 after IWN (recovered $\rho$, IWN gain); D1 and D3 (moderate $\rho$, small gains).
## Appendix H Computational Complexity
##### Per-example forward cost.
The encoder dominates with $\Theta(L \cdot d^2)$ for an $L$-layer transformer of hidden dimension $d$. The surgical components add: (i) $\Theta(d \cdot 16)$ for feature projection; (ii) $\Theta(d \cdot 2d) = \Theta(d^2)$ for the gate; (iii) $\Theta(d^2/2)$ per task head. The total SURGeLLM overhead is $\Theta(d^2)$, asymptotically negligible compared to the encoder's $\Theta(L \cdot d^2)$ for $L \gg 1$.
##### Memory\.
The gate adds $2d^2 + d = 2 \cdot 768^2 + 768 \approx 1.18$M parameters; the feature projection adds $16d + d = 13{,}056 \approx 13.1$K parameters; the task embedding adds $|\mathcal{T}| \cdot d \approx 3$K. Total SURGeLLM overhead is $\sim 1.2$M parameters per backbone, about $1\%$ of RoBERTa-base.
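The parameter counts above follow directly from the stated formulas, as the short calculation below shows. It assumes $d = 768$, 16 input features (ten indicator groups plus six surface features), four tasks, and a 125M-parameter backbone; the function name is hypothetical.

```python
def surgellm_overhead(d=768, n_features=16, n_tasks=4):
    """Parameter counts for the added modules, using the formulas in the text."""
    gate = 2 * d * d + d             # gate weights over [h; s'] plus bias
    projection = n_features * d + d  # feature projection weights plus bias
    task_embedding = n_tasks * d     # one d-dimensional embedding per task
    return {
        "gate": gate,
        "projection": projection,
        "task_embedding": task_embedding,
        "total": gate + projection + task_embedding,
    }


counts = surgellm_overhead()
print(counts)
print(f"{counts['total'] / 125e6:.2%} of a 125M-parameter backbone")
```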
##### Wall\-clock\.
On $2\times$ T4 GPUs, SURGeLLM-RoBERTa adds $\sim 100$ s versus Baseline-RoBERTa ($233 \to 332$ s for the same five-epoch budget), a $43\%$ overhead driven primarily by extended training and prefix-token tokenization.
## Appendix I Reproducibility Checklist
- Data: all four corpora are publicly available; we provide preprocessing scripts that reproduce our stratified splits.
- Random seeds: all results from seeds $\{0, 1, 2\}$; data splits, weight initialization, dropout masks, and CUDA determinism are seeded.
- Software versions: PyTorch 2.1, Hugging Face Transformers 4.35, Accelerate 0.24, scikit-learn 1.3, sentence-transformers 2.2.
- Hardware: $2\times$ NVIDIA T4 (16 GB) with FP16 mixed precision via Accelerate.
- Hyperparameters: listed in Table [13](https://arxiv.org/html/2606.24259#A2.T13).
- Statistical tests: bootstrap ($B = 2{,}000$, seed 0); paired Welch $t$-tests with Benjamini-Hochberg FDR = 0.05.
- Estimated total compute: $\sim 38$ GPU-hours on T4 to reproduce all main and ablation results.
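For readers unfamiliar with the multiple-comparison correction named in the checklist, the Benjamini-Hochberg step-up procedure can be sketched as follows. This is a generic textbook implementation, not the authors' evaluation script, and the p-values in the example are made up for illustration.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return a list of booleans: True where the hypothesis is rejected.

    Step-up rule: find the largest rank r (1-based, ascending p-values) with
    p_(r) <= r / m * fdr, then reject every hypothesis with rank <= r.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            cutoff = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff:
            rejected[idx] = True
    return rejected


# Hypothetical p-values: the third-ranked one passes its threshold, so the two
# smaller ones are rejected as well even though they miss their own thresholds.
print(benjamini_hochberg([0.02, 0.03, 0.035, 0.5]))
```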
## Appendix J Preliminary Multilingual Experiment
To probe cross-lingual transfer of the auto-extraction recipe, we evaluate SURGeLLM-G-XLM-R-base on French (Allocine (Blard, [2020](https://arxiv.org/html/2606.24259#bib.bib71))) and German (GermanSentiment) sentiment corpora using auto-extracted vocabularies built per language. We cap training at $5{,}000$ examples and evaluate on the official test splits with three seeds:

Table 16: Preliminary multilingual results. SURGeLLM-G-XLM-R-base with auto-extracted per-language vocabularies vs. baseline.

| Configuration | French | German |
|---|---|---|
| XLM-R-base baseline | 0.917 | 0.872 |
| SURGeLLM-G-XLM-R-base (auto) | 0.926 | 0.881 |
| $\Delta$ | +.009 | +.009 |

The auto-extracted French and German vocabularies yield gains within $0.01$ F1 of the English-curated baseline gain ($+0.008$ on D1), suggesting the recipe transfers without per-language manual curation. A full-scale multilingual study is left to future work.
##### Interpretation\.
The IWN gains on D4 are highly significant ($p < 0.001$ for both backbones). The retrieval improvements on D2 are significant for three configurations. Differences on D1 and D3 are mostly within seed noise, consistent with the gate-degeneracy result of Proposition [2](https://arxiv.org/html/2606.24259#Thmtheorem2): when surgical alignment is moderate, the gate degenerates harmlessly to a near-identity map, and observed differences are dominated by SGD noise.
## Appendix K Training Algorithm
Algorithm [1](https://arxiv.org/html/2606.24259#alg1) presents the full SURGeLLM training procedure with multi-GPU execution, pre-tokenization caching, optional IWN normalization, and early stopping.

Algorithm 1: SURGeLLM Multi-GPU Training (with optional IWN)

1. **Input:** corpus $\mathcal{D}$; model config $\mathrm{cfg}$; accelerator $\mathcal{A}$; flag $\mathrm{IWN} \in \{0, 1\}$
2. **Output:** trained model $f_\theta$
3. **Split:** for each task $t_k$, stratify $\mathcal{D}_k$ into $\mathcal{D}^{\mathrm{tr}}_k, \mathcal{D}^{\mathrm{v}}_k, \mathcal{D}^{\mathrm{te}}_k$ (70/15/15%)
4. **if** $\mathrm{IWN}$ **then**
5. Compute per-class $(\bar{\mathbf{s}}_{c,k}, \bm{\sigma}_{c,k})$ on $\mathcal{D}^{\mathrm{tr}}_k$
6. Form class-balanced $(\bar{\mathbf{s}}_k^{\mathrm{bal}}, \bm{\sigma}_k^{\mathrm{bal}})$ via Eq. [10](https://arxiv.org/html/2606.24259#S3.E10)
7. **else**
8. Compute marginal $(\bar{\mathbf{s}}_k, \bm{\sigma}_k)$ on $\mathcal{D}^{\mathrm{tr}}_k$
9. **end if**
10. **Pre-tokenize:** cache training/val texts as tensors (chunk $C = 2{,}048$)
11. Construct $f_\theta$ (§[3.4](https://arxiv.org/html/2606.24259#S3.SS4)–[3.7](https://arxiv.org/html/2606.24259#S3.SS7)); optimizer AdamW; scheduler $\gamma$
12. $f_\theta, \mathrm{Adam}, \gamma, \mathrm{DL}^{\mathrm{tr}}, \mathrm{DL}^{\mathrm{v}} \leftarrow \mathcal{A}.\texttt{prepare}(\ldots)$ $\triangleright$ DDP + FP16
13. $F_1^\star \leftarrow -\infty$; $p \leftarrow 0$; $\theta^\star \leftarrow \theta$
14. **for** $e = 1, \ldots, E_{\max}$ **do**
15. $f_\theta.\texttt{train}()$
16. **for each** mini-batch $B = \{(x_i, y_i, t_i)\}$ **do**
17. Compute $\mathbf{s}(x_i)$ (Eq. [4](https://arxiv.org/html/2606.24259#S3.E4)); standardize via Eq. [8](https://arxiv.org/html/2606.24259#S3.E8) or Eq. [11](https://arxiv.org/html/2606.24259#S3.E11)
18. Build prefix $x'_i$ (Eq. [12](https://arxiv.org/html/2606.24259#S3.E12))
19. $\hat{y}_i, \ell_i \leftarrow f_\theta(x'_i, t_i, \mathbf{s}(x_i), y_i)$ $\triangleright$ Eq. [1](https://arxiv.org/html/2606.24259#S3.E1)
20. $\mathcal{A}.\texttt{backward}(\ell_i / \tau)$ $\triangleright$ $\tau$ = grad. accum. steps
21. **if** step $\equiv 0 \pmod{\tau}$ **then**
22. $\mathcal{A}.\texttt{clip\_grad\_norm}(1.0)$
23. $\mathrm{Adam}.\texttt{step}()$; $\gamma.\texttt{step}()$; $\mathrm{Adam}.\texttt{zero\_grad}()$
24. **end if**
25. **end for**
26. $F_1^e \leftarrow \texttt{QuickVal}(f_\theta, \mathrm{DL}^{\mathrm{v}}, \mathcal{A})$
27. **if** $F_1^e > F_1^\star$ **then**
28. $F_1^\star \leftarrow F_1^e$; $\theta^\star \leftarrow \mathcal{A}.\texttt{unwrap}(f_\theta).\theta$; $p \leftarrow 0$
29. **else**
30. $p \leftarrow p + 1$
31. **if** $p \geq P$ **then break**
32. **end if** $\triangleright$ patience $P = 2$
33. **end if**
34. **end for**
35. $f_\theta \leftarrow \theta^\star$; evaluate on $\mathcal{D}^{\mathrm{te}}_k$
36. **return** $f_\theta$
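The early-stopping logic of Algorithm 1 (best-checkpoint tracking with patience $P = 2$) can be isolated in a few lines. The sketch below replays a list of per-epoch validation F1 scores instead of training a model; the function name and the example scores are hypothetical.

```python
def early_stopping(val_f1_per_epoch, patience=2):
    """Replay validation scores and apply the patience rule from Algorithm 1.

    Returns (best_epoch, best_f1, last_epoch_run), with 1-based epochs.
    """
    best_f1 = float("-inf")
    best_epoch = None
    bad_epochs = 0
    for epoch, f1 in enumerate(val_f1_per_epoch, start=1):
        if f1 > best_f1:
            # New best: remember it and reset the patience counter.
            best_f1, best_epoch, bad_epochs = f1, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return best_epoch, best_f1, epoch
    return best_epoch, best_f1, len(val_f1_per_epoch)


# Hypothetical validation curve: two non-improving epochs after epoch 2
# stop training at epoch 4, so the later 0.93 is never reached.
print(early_stopping([0.90, 0.92, 0.91, 0.915, 0.93], patience=2))
```

In the full algorithm the remembered quantity is the parameter snapshot $\theta^\star$ rather than just the score, and that snapshot is restored before evaluating on the test split.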
## Appendix L Meta Review and Paper Updates
This appendix documents the three principal changes made in response to the meta\-review and the four reviewer reports \(Qs1u, 4Pvq, idHo, EVkC\) for the KnowFM 2026 Workshop and ARR\. For each concern, we state \(i\) the exact reviewer criticism, \(ii\) what was changed in the paper, and \(iii\) where to find the updated material\.
### Crosswalk Table
Table [17](https://arxiv.org/html/2606.24259#A12.T17) provides a compact mapping from each reviewer comment to the corresponding manuscript change.

Table 17: Reviewer-to-revision crosswalk. R = revision implemented in this camera-ready version. ✓ = fully addressed; $\sim$ = partially addressed with future work note.

| Reviewer | Concern (verbatim summary) | Change in paper | Status |
|---|---|---|---|
| Qs1u, EVkC, Meta | Class imbalance on D4 corrupts gate statistics; IWN deferred to future work | IWN fully implemented (§[3.5](https://arxiv.org/html/2606.24259#S3.SS5), §[6.3](https://arxiv.org/html/2606.24259#S6.SS3), Appendix [L.1](https://arxiv.org/html/2606.24259#A12.SS1)) | ✓ |
| Qs1u, idHo | No sensitivity analysis of the surgical vocabulary; unclear why exactly 10 groups; surface features may be redundant | Four-part sensitivity analysis added (§[7.2](https://arxiv.org/html/2606.24259#S7.SS2), Appendix [L.2](https://arxiv.org/html/2606.24259#A12.SS2)) | ✓ |
| 4Pvq, idHo, Meta | D1 (physics oscillation) saturates at F1 $= 1.000$; inflates reported averages; should be replaced with a GLUE task | D1 replaced with SST-2; all aggregates recomputed over {SST-2, D2, D3, D4} (Appendix [L.3](https://arxiv.org/html/2606.24259#A12.SS3)) | ✓ |
| idHo | No comparison to T5 / text-to-text unified models | T5-base added as 11th model variant; see Table [3](https://arxiv.org/html/2606.24259#S6.T3) and §[6.4](https://arxiv.org/html/2606.24259#S6.SS4) | ✓ |
| Qs1u, 4Pvq | Single-seed results weaken confidence in small F1 differences | All results re-run over three seeds $\{0, 1, 2\}$; mean $\pm$ SD reported throughout; per-seed breakdown in Appendix [F](https://arxiv.org/html/2606.24259#A6) | ✓ |
| Qs1u | Abstract overclaims "state-of-the-art performance" | Abstract revised to "competitive parameter-efficient multi-task performance" with exact CI overlap stated | ✓ |
| idHo | No multilingual or cross-domain evaluation | Preliminary French/German experiment added (Appendix [J](https://arxiv.org/html/2606.24259#A10)); full-scale study left to future work | $\sim$ |
### L\.1 R1 — Class Imbalance on D4: Instance\-Weighted Normalization
##### Reviewer concern\.
Reviewers Qs1u and EVkC, and the meta-reviewer, identified the $9.3{:}1$ raw class skew in the authorship corpus as the root cause of SURGeLLM’s underperformance on D4. In the original submission, Table 8 showed the gate degrading D4 by $\Delta=-0.046$ on average across backbone pairs (worst case: SURGeLLM-Full-RoBERTa vs. Baseline-RoBERTa, $\Delta=-0.052$). The proposed fix, class-conditional or instance-weighted normalization, was deferred to future work even though D4 is the most practically relevant task in the suite.
##### What changed\.
We implement Instance-Weighted Normalization (IWN), a parameter-free correction applied to the surgical-feature standardization step (Eq. [8](https://arxiv.org/html/2606.24259#S3.E8) in the main paper). Instead of computing global per-dimension statistics over the entire training partition of task $t_k$:

$$\bar{\mathbf{s}}_{k}=\frac{1}{N_{k}}\sum_{i=1}^{N_{k}}\mathbf{s}(x_{i}),\qquad \bm{\sigma}_{k}=\sqrt{\frac{1}{N_{k}}\sum_{i=1}^{N_{k}}\bigl(\mathbf{s}(x_{i})-\bar{\mathbf{s}}_{k}\bigr)^{2}},$$

(Eq. [8](https://arxiv.org/html/2606.24259#S3.E8), original) we replace these with class-balanced statistics:

$$\bar{\mathbf{s}}_{k}^{\mathrm{bal}}=\frac{1}{n_{c,k}}\sum_{c=1}^{n_{c,k}}\bar{\mathbf{s}}_{c,k},\qquad \bm{\sigma}_{k}^{\mathrm{bal}}=\frac{1}{n_{c,k}}\sum_{c=1}^{n_{c,k}}\bm{\sigma}_{c,k},$$

(Eq. [10](https://arxiv.org/html/2606.24259#S3.E10)) where $\bar{\mathbf{s}}_{c,k}$ and $\bm{\sigma}_{c,k}$ are the per-class mean and standard deviation of $\mathbf{s}$ on the training set, and $n_{c,k}$ is the number of classes in task $t_k$. At inference, these statistics are used directly without any class label (test-time class-agnostic).
##### Key properties of IWN\.
1. Parameter-free: no new learnable parameters; only the normalization constants change.
2. Test-time agnostic: $(\bar{\mathbf{s}}_{k}^{\mathrm{bal}},\bm{\sigma}_{k}^{\mathrm{bal}})$ are computed once from training labels and applied at inference without requiring class information.
3. Reduces to standard normalization on balanced corpora: when the class priors are uniform, $\pi_{c}=1/n_{c,k}$, the two estimators coincide (up to the difference between weighted and unweighted variance), so IWN is a strict generalization at zero cost in the balanced regime.
4. Compositional: IWN can be combined with focal loss (Lin et al., [2017](https://arxiv.org/html/2606.24259#bib.bib49)) or class-balanced re-weighting (Cui et al., [2019](https://arxiv.org/html/2606.24259#bib.bib47)) without conflict.
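The class-balanced statistics of Eq. 10 can be illustrated with a minimal NumPy sketch. It is not the released implementation: the function names `class_balanced_stats` and `iwn_normalize`, the `eps` floor on the standard deviation, and the synthetic 9:1 toy data are all assumptions made for illustration.

```python
import numpy as np

def class_balanced_stats(features, labels, eps=1e-8):
    """Unweighted average of per-class means and per-class stds (Eq. 10).

    `eps` is an assumed numerical floor, not something stated in the paper.
    """
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    # Per-class statistics; np.std uses the 1/N convention, as in Eq. 8.
    class_means = np.stack([features[labels == c].mean(axis=0) for c in classes])
    class_stds = np.stack([features[labels == c].std(axis=0) for c in classes])
    mean_bal = class_means.mean(axis=0)
    std_bal = np.maximum(class_stds.mean(axis=0), eps)
    return mean_bal, std_bal

def iwn_normalize(features, mean_bal, std_bal):
    """Standardize features with fixed constants; no labels are needed here."""
    return (np.asarray(features, dtype=float) - mean_bal) / std_bal

# Hypothetical toy data with a 9:1 class skew (not the paper's corpus).
rng = np.random.default_rng(0)
majority = rng.normal(0.0, 1.0, size=(900, 3))
minority = rng.normal(2.0, 1.0, size=(100, 3))
X = np.vstack([majority, minority])
y = np.array([0] * 900 + [1] * 100)

mean_bal, std_bal = class_balanced_stats(X, y)
global_mean = X.mean(axis=0)          # pulled toward the majority class
Z = iwn_normalize(X, mean_bal, std_bal)
```

On this toy data the global mean sits close to the majority-class mean, while the balanced mean lies midway between the two class means, which is the prior-removal effect described above.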
##### Empirical outcome\.
SURGeLLM-IWN-RoBERTa achieves D4 macro-F1 $=0.892$ versus Baseline-RoBERTa $0.762$ ($\Delta=+0.130$, $p<0.001$, BH-corrected Welch $t$-test; Table [4](https://arxiv.org/html/2606.24259#S6.T4)), fully reversing the original gate-induced regression and exceeding the baseline by the largest single margin in our study. The per-class breakdown in Table [5](https://arxiv.org/html/2606.24259#S6.T5) shows that IWN symmetrizes human and LLM precision/recall around $0.89$ (from the unbalanced $0.63$ LLM recall vs. $0.79$ human recall without IWN).
##### Connection to theory\.
Empirical estimates of surgical feature alignment $\rho_{k}$ (Appendix [G](https://arxiv.org/html/2606.24259#A7), Table [15](https://arxiv.org/html/2606.24259#A7.T15)) show $\rho_{4}^{\text{pre-IWN}}\approx 0.61$ rising to $\rho_{4}^{\text{post-IWN}}\approx 2.13$ after IWN. This rise in alignment directly reduces the approximation term in Theorem [1](https://arxiv.org/html/2606.24259#Thmtheorem1) (Eq. [16](https://arxiv.org/html/2606.24259#A1.E16)), explaining why IWN converts a harmful gate into a beneficial one: the gate was architecturally sound but was being fed prior-contaminated features.
### L\.2 R2 — Surgical Vocabulary Sensitivity Analysis
##### Reviewer concern\.
Reviewer Qs1u raised the absence of any analysis of sensitivity to the manually curated 10\-group surgical vocabulary\. Reviewer idHo asked specifically: \(a\) why exactly 10 indicator groups were selected; \(b\) whether an ablation over group count exists; and \(c\) why surface features \(word count, mean word length, question\-mark count\) are provided explicitly when they might be implicit in the raw text\.
##### What changed\.
We added a four-part sensitivity study in §[7.2](https://arxiv.org/html/2606.24259#S7.SS2) of the main paper, using SURGeLLM-G-RoBERTa across three seeds as the reference configuration.
#### R2a — Group\-Count Sweep
We vary $|\mathcal{V}|\in\{0,5,10,15,20\}$. When reducing, we retain the most discriminative groups by chi-squared statistic on training data; when increasing, we add semantically redundant thesaurus-derived variants. Table [8](https://arxiv.org/html/2606.24259#S7.T8) in the main paper shows that performance plateaus at $|\mathcal{V}|=10$: any value in $\{10,15\}$ produces statistically indistinguishable results (paired Welch $p>0.05$, three seeds). Larger vocabularies ($|\mathcal{V}|=20$) incur a small D4 drop ($-0.012$) from noise introduced by redundant variants. The system is therefore not sharply tuned to the exact group count, but 10 groups achieve the best precision-to-effort trade-off.
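The group-retention step can be sketched as a chi-squared ranking. This is a minimal illustration that assumes each group is reduced to a binary "fires / does not fire" indicator per training example; the paper does not state how group features are discretized, and the names `chi2_score`, `top_groups`, and the toy groups are hypothetical.

```python
import numpy as np

def chi2_score(indicator, labels):
    """Pearson chi-squared statistic of a binary indicator against class labels."""
    indicator = np.asarray(indicator).astype(bool)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    # 2 x n_classes contingency table: rows = indicator off/on, cols = classes.
    observed = np.array(
        [[np.sum(indicator[labels == c] == v) for c in classes] for v in (False, True)],
        dtype=float,
    )
    row = observed.sum(axis=1, keepdims=True)
    col = observed.sum(axis=0, keepdims=True)
    expected = row @ col / observed.sum()
    mask = expected > 0  # skip empty rows, e.g. a group that never fires
    return float(np.sum((observed[mask] - expected[mask]) ** 2 / expected[mask]))

def top_groups(indicators, labels, k):
    """Names of the k groups with the highest chi-squared score."""
    scores = {name: chi2_score(vals, labels) for name, vals in indicators.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Hypothetical toy data: eight examples, two classes, three indicator groups.
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])
indicators = {
    "group_a": [0, 0, 0, 0, 1, 1, 1, 1],  # fires only on class 1
    "group_b": [1, 0, 1, 0, 1, 0, 1, 0],  # independent of the class
    "group_c": [0, 0, 0, 1, 1, 1, 1, 0],  # partially associated
}
kept = top_groups(indicators, labels, k=2)
```

Here `kept` contains the perfectly associated group first and the partially associated one second, while the class-independent group is dropped.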
#### R2b — Random\-Vocabulary Control
To determine whether gains are lexical or merely parametric, we replace each curated group with a same-cardinality random sample of high-frequency English content words from the British National Corpus (BNC). Table [9](https://arxiv.org/html/2606.24259#S7.T9) shows a $-0.028$ average F1 drop versus curated vocabulary ($p=0.003$, three seeds), confirming that the gate responds to semantic content, not extra parameters. An auto-extracted vocabulary (log-odds ranking + $k$-means on SBERT embeddings; Appendix [E](https://arxiv.org/html/2606.24259#A5)) recovers $99.5\%$ of curated performance ($\Delta=-0.003$ avg. F1), providing a path to new domains without manual curation.
#### R2c — Per\-Group Leave\-One\-Out
We retrain SURGeLLM-G-RoBERTa with each of the 10 groups removed in turn. Table [11](https://arxiv.org/html/2606.24259#S7.T11) shows that each task has a clearly dominant group: sst\_pos/neg for D1 ($-0.014$), retrieval for D2 ($-0.011$), prompt\_cot for D3 ($-0.006$), and llm\_stat for D4 ($-0.018$). Cross-task leakage is minimal: removing a task-specific group rarely affects other tasks by more than $0.002$.
#### R2d — Surface\-Features\-Only Ablation
To address reviewer idHo’s concern that surface statistics may be implicit in the encoder, Table [10](https://arxiv.org/html/2606.24259#S7.T10) shows that removing them changes F1 by $-0.011$ on D4 and by $-0.005$ on average. Two arguments explain why they are not redundant with the encoder:
- Truncation loss. The encoder receives at most $L\in\{96,128\}$ tokens; global statistics (total word count, exclamation-mark count) are computed on the full untruncated document and carry information the encoder cannot recover from a partial view (Ding et al., [2020](https://arxiv.org/html/2606.24259#bib.bib59)).
- Distributional shift. Even without truncation, the \[CLS\] representation is optimized for masked-token prediction and may not preserve count statistics; the surgical channel provides a deterministic, lossless path for these.
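The truncation argument can be made concrete with a small sketch that computes surface statistics on a full document and on a truncated view. The tokenization regex, the feature names, and the whitespace-based `truncate_words` helper are assumptions for illustration; the actual encoder truncates subword tokens, and the paper's feature extractor may differ.

```python
import re

def surface_features(text):
    """Deterministic surface statistics computed on whatever text is passed in."""
    words = re.findall(r"[A-Za-z0-9']+", text)
    n_words = len(words)
    mean_len = sum(len(w) for w in words) / n_words if n_words else 0.0
    return {
        "word_count": n_words,
        "mean_word_length": mean_len,
        "question_marks": text.count("?"),
        "exclamation_marks": text.count("!"),
    }

def truncate_words(text, max_words):
    """Crude stand-in for encoder truncation: keep the first max_words tokens."""
    return " ".join(text.split()[:max_words])

# Hypothetical long document whose only exclamation mark is at the very end.
doc = "Is this real? " * 100 + "Yes!"
full = surface_features(doc)                          # what the surgical channel sees
partial = surface_features(truncate_words(doc, 96))   # what a truncated view retains
```

The full-document statistics record 301 words and the final exclamation mark, while the 96-word view loses both, which is the information gap the surgical channel is meant to close.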
### L\.3 R3 — Replacement of D1 with SST\-2
##### Reviewer concern\.
Reviewers 4Pvq and idHo, and the meta-reviewer, noted that D1 (synthetic physics oscillation classification) attains F1 $=1.000$ for every model variant in both the single-seed and multi-seed settings. This saturated task contributes zero discriminative signal to any model comparison while inflating reported average scores. The meta-reviewer recommended replacing D1 with a standard GLUE benchmark task to improve comparability with MT-DNN (Liu et al., [2019a](https://arxiv.org/html/2606.24259#bib.bib7)) and Muppet (Aghajanyan et al., [2021](https://arxiv.org/html/2606.24259#bib.bib8)).
##### What changed\.
D1 is removed from the main evaluation suite. In its place we incorporate SST-2 (Socher et al., [2013](https://arxiv.org/html/2606.24259#bib.bib51)) (binary movie-review sentiment; 7,666 capped training examples; the standard 872-example GLUE validation split, used as the test set), referred to as D1 throughout the revised paper.
##### Rationale for SST\-2 specifically\.
1. Non-saturated: published base-encoder accuracy on SST-2 spans 87–94%; in our multi-seed evaluation, F1 ranges from $0.901$ to $0.937$ across model variants (Table [3](https://arxiv.org/html/2606.24259#S6.T3)), providing genuine discriminative signal.
2. Standard benchmark: SST-2 is part of GLUE, enabling direct comparison with MT-DNN, Muppet, and related multi-task work.
3. Surgical vocabulary coverage: the sst\_pos and sst\_neg indicator groups (Appendix [D](https://arxiv.org/html/2606.24259#A4)) fire reliably on sentiment-polarity vocabulary, making SST-2 the task most sensitive to the gate’s lexical prior, a complementary role that the original D1 failed to provide.
##### Impact on aggregate metrics\.
Removing the uniformly saturated original D1 task narrows bootstrap CI widths from $\approx 0.17$ (original paper, §8.4) to $\approx 0.12$ in the revised four-task suite, sharpening statistical comparisons. All aggregate F1 values in Tables [3](https://arxiv.org/html/2606.24259#S6.T3)–[7](https://arxiv.org/html/2606.24259#S7.T7) are recomputed over {SST-2, D2, D3, D4}. The revised leaderboard (Table [3](https://arxiv.org/html/2606.24259#S6.T3)) shows SURGeLLM-IWN-RoBERTa at $0.940$ avg. F1 versus Baseline-RoBERTa at $0.904$ ($\Delta=+0.036$, $p<0.001$), a substantially clearer separation than the original within-CI gap of $\Delta=0.001$.
### L\.4 Additional Changes: Multi\-Seed Evaluation and Abstract Revision
##### Three\-seed evaluation \(Reviewers 4Pvq, Qs1u\)\.
The original submission used a single random seed, which reviewers correctly identified as insufficient for interpreting small F1 differences. All experiments are re-run with seeds $\{0,1,2\}$; results are reported as mean $\pm$ SD throughout. Per-seed breakdowns for selected models are in Appendix [F](https://arxiv.org/html/2606.24259#A6) (Table [14](https://arxiv.org/html/2606.24259#A6.T14)). Key comparisons remain significant: IWN gains on D4 hold across all three seeds ($p<0.001$); retrieval gains on D2 are significant for three configurations (Table [4](https://arxiv.org/html/2606.24259#S6.T4)).
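The multi-seed comparison procedure (Welch $t$-test followed by Benjamini-Hochberg correction) can be sketched in NumPy as follows. The per-seed scores and raw p-values below are hypothetical placeholders, not results from the paper; the function names are illustrative, and the sketch stops at the Welch statistic and degrees of freedom because converting them to a p-value needs a Student-t CDF (for example from SciPy).

```python
import numpy as np

def welch_t(a, b):
    """Welch t statistic and Welch-Satterthwaite degrees of freedom."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    va = a.var(ddof=1) / a.size  # squared standard error of each mean
    vb = b.var(ddof=1) / b.size
    t = (a.mean() - b.mean()) / np.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (a.size - 1) + vb ** 2 / (b.size - 1))
    return float(t), float(df)

def benjamini_hochberg(pvals):
    """BH-adjusted p-values, returned in the original order."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    ranked = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity from the largest p-value downwards.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.minimum(ranked, 1.0)
    return adjusted

# Hypothetical per-seed F1 scores for two model variants (three seeds each).
model_scores = [0.80, 0.82, 0.84]
baseline_scores = [0.70, 0.72, 0.74]
t_stat, dof = welch_t(model_scores, baseline_scores)

# Hypothetical raw p-values from several task-level comparisons.
adjusted = benjamini_hochberg([0.001, 0.02, 0.03, 0.5])
```

With only three seeds per variant the degrees of freedom are small (four in this toy case), which is why small F1 differences need both the variance estimate and the multiple-comparison correction before they are read as significant.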
##### T5\-base comparison \(Reviewer idHo\)\.
Reviewer idHo asked for a comparison against unified text-to-text models (T5, FLAN-style). We add T5-base (220M parameters) as an 11th model variant. T5-base achieves $0.897$ avg. F1, which is competitive with encoder baselines but dominated by SURGeLLM-IWN-RoBERTa ($0.940$) at a lower parameter count (125M) and $1.24\times$ faster training (§[6.4](https://arxiv.org/html/2606.24259#S6.SS4)).
##### Abstract revision \(Reviewer Qs1u\)\.
The phrase “state-of-the-art multi-task performance” is replaced with “competitive parameter-efficient multi-task performance,” and the headline comparison now explicitly states the bootstrap CI overlap: SURGeLLM-IWN-RoBERTa $0.940\pm.003$ (95% CI $[0.934,0.946]$) versus Baseline-RoBERTa $0.904\pm.003$.
##### Multilingual preliminary \(Reviewer idHo\)\.
A preliminary experiment on French and German sentiment corpora using auto-extracted per-language vocabularies is reported in Appendix [J](https://arxiv.org/html/2606.24259#A10) (Table [16](https://arxiv.org/html/2606.24259#A10.T16)). SURGeLLM-G-XLM-R-base with auto-extracted vocabulary gains $+0.009$ F1 in both languages, within $0.001$ of the English-curated gain on D1, suggesting the recipe transfers without per-language manual curation. A full-scale multilingual study is left to future work.