READER: Robust Evidence-based Authorship Decoding via Extracted Representations

arXiv cs.AI Papers

Summary

Introduces READER, a lightweight framework for dynamic black-box LLM provenance that uses a frozen proxy LLM to extract authorship evidence from responses and performs Bayesian evidence accumulation across multiple queries, achieving high accuracy on the Agent500 dataset.

arXiv:2606.10794v1 Announce Type: new Abstract: As agentic applications increasingly route user tasks through official and third-party LLM APIs, provenance becomes an operational question: which model generated a given black-box response? We study Dynamic Black-Box LLM Provenance: identifying the source LLM from generations elicited by query-varying, non-predefined prompts rather than a fixed input set or benchmark suite. This setting is difficult because prompt semantics dominate the text, while model-specific authorship traces are weak and inconsistent at the surface level. We introduce READER (Robust Evidence-based Authorship Decoding via Extracted Representations), a lightweight provenance framework that treats a frozen proxy LLM as a reader of hidden authorship evidence. READER maps black-box outputs into proxy activation space, temporally filters token states within each response, and performs Bayesian Evidence Accumulation by summing single-response log-posterior evidence across independently sampled prompts. This avoids fragile mean-pooling of prompt-specific representations while preserving the query-wise evidence needed for calibrated confidence. On Agent500, a 50-target dataset built from agent-style prompts, READER reaches $31.0$-$42.4\%$ top-1 accuracy from a single response and $70.0$-$84.0\%$ from 50 responses, substantially outperforming sentence-encoder fingerprints. Scaling across nine proxy readers further shows that stronger LLMs expose more linearly decodable authorship structure, suggesting that authorship perception is already present in frozen LLM representations and can be converted into reliable multi-query attribution.

Cached at: 06/10/26, 06:17 AM

# READER: Robust Evidence-based Authorship Decoding via Extracted Representations
Source: [https://arxiv.org/html/2606.10794](https://arxiv.org/html/2606.10794)
Jiaxu Liu¹,⁴, Sunnan Mu², Dong Huang¹, Liuyin Wang³, Jing Shao⁴, Jie Zhang⁴

¹National University of Singapore, ²Xidian University, ³Tsinghua University, ⁴Shanghai Artificial Intelligence Laboratory

{jiaxu.liu, dong.huang}@u.nus.edu, snmu@stu.xidian.edu.cn, liuyinwangthu@gmail.com, {shaojing, zhangjie1}@pjlab.org.cn

###### Abstract

As agentic applications increasingly route user tasks through official and third-party LLM APIs, provenance becomes an operational question: which model generated a given black-box response? We study Dynamic Black-Box LLM Provenance: identifying the source LLM from generations elicited by query-varying, non-predefined prompts rather than a fixed input set or benchmark suite. This setting is difficult because prompt semantics dominate the text, while model-specific authorship traces are weak and inconsistent at the surface level. We introduce READER (Robust Evidence-based Authorship Decoding via Extracted Representations), a lightweight provenance framework that treats a frozen proxy LLM as a reader of hidden authorship evidence. READER maps black-box outputs into proxy activation space, temporally filters token states within each response, and performs Bayesian Evidence Accumulation by summing single-response log-posterior evidence across independently sampled prompts. This avoids fragile mean-pooling of prompt-specific representations while preserving the query-wise evidence needed for calibrated confidence. On Agent500, a 50-target dataset built from agent-style prompts, READER reaches $31.0$–$42.4\%$ top-1 accuracy from a single response and $70.0$–$84.0\%$ from 50 responses, substantially outperforming sentence-encoder fingerprints. Scaling across nine proxy readers further shows that stronger LLMs expose more linearly decodable authorship structure, suggesting that authorship perception is already present in frozen LLM representations and can be converted into reliable multi-query attribution.

## 1 Introduction

Large language models have shifted from standalone chatbots to infrastructure behind agents, workflow automation, and third\-party API services\[[30](https://arxiv.org/html/2606.10794#bib.bib9)\]\. In this setting, model identity becomes an operational property: systems may need to verify whether a response came from a licensed model, an unauthorized wrapper, a silently substituted backend, or a model family with known safety and compliance risks\. Prior work has framed model ownership as a deploy\-time protection problem\[[7](https://arxiv.org/html/2606.10794#bib.bib12),[9](https://arxiv.org/html/2606.10794#bib.bib13)\], while system cards and recent variability reports show that deployed API behavior can carry safety, copyright, and stability concerns\[[15](https://arxiv.org/html/2606.10794#bib.bib10),[24](https://arxiv.org/html/2606.10794#bib.bib8)\]\. We ask a practical provenance question: given only generated text and query access, can we identify which LLM produced a response?

Existing black-box provenance methods usually compare outputs under a controlled input distribution, such as predefined prompts, common prompt sets, or fixed benchmark suites \[[14](https://arxiv.org/html/2606.10794#bib.bib19),[32](https://arxiv.org/html/2606.10794#bib.bib23),[17](https://arxiv.org/html/2606.10794#bib.bib21),[27](https://arxiv.org/html/2606.10794#bib.bib6)\]. This is useful for controlled comparison, but live API auditing observes user- or task-specific prompts whose semantics vary across queries. To the best of our knowledge, we are the first to formulate and study Dynamic Black-Box LLM Provenance: identifying the source LLM from generations elicited by query-varying, non-predefined prompts. We instantiate this setting with Agent500, a 50-target agent-style prompt corpus. The challenge is that prompt semantics dominate surface text, leaving source-model evidence as a weak and inconsistent signal.

Our approach is to use a frozen proxy LLM as a provenance reader\. Rather than matching generated text in sentence\-embedding space, the proxy maps a black\-box response into activation space, where subtle generation habits may become more linearly accessible\. This uses activation evidence pragmatically, consistent with mechanistic\-interpretability work while avoiding causal claims from decodability alone\[[34](https://arxiv.org/html/2606.10794#bib.bib7)\]\. It requires no access to the target model’s weights, logits, gradients, or decoding internals\.

We propose READER: Robust Evidence-based Authorship Decoding via Extracted Representations. READER reads each response with a frozen proxy LLM, averages response-token hidden states into a single-response representation, and maps it to a posterior over candidate source models with a linear probe. For multi-query attribution, READER performs Bayesian Evidence Accumulation, summing calibrated log-posterior evidence across independently sampled responses.

This design separates single-response evidence from multi-query reliability. On Agent500, READER achieves $31.0$–$42.4\%$ top-1 accuracy from a single response ($K=1$), far above the 2% chance level and sentence-encoder baselines. With 50 independently sampled responses ($K=50$), it reaches $70.0$–$84.0\%$ accuracy across the four main proxy readers. Confusion matrices and t-SNE visualisations further reveal family-level structure in frozen proxy representations, with stronger proxy readers exposing cleaner authorship geometry.

Our contributions are:

- Dynamic Black-Box LLM Provenance. We formulate provenance from query-varying black-box generations and instantiate it with Agent500, a 50-target agent-style dataset.
- READER, a proxy-LLM authorship reader. We show that frozen proxy activations contain linearly decodable source-model evidence that outperforms sentence-encoder fingerprints from one response.
- Bayesian Evidence Accumulation. We aggregate calibrated log-posterior evidence across independent prompts, avoiding brittle geometric pooling of prompt-specific hidden states.
- Ecosystem-scale evidence. Across Agent500 and nine proxy readers, stronger proxy LLMs expose more useful authorship structure and yield substantially better attribution.

## 2 Related Work

##### Model Provenance: from white\-box to black\-box\.

Early LLM provenance and ownership\-verification methods often make models identifiable by design\. Decoding\-time watermarks add detectable statistical structure to generated text\[[7](https://arxiv.org/html/2606.10794#bib.bib12),[9](https://arxiv.org/html/2606.10794#bib.bib13)\], while training\-time or instruction\-time fingerprints use triggerable behaviors as verification keys\[[18](https://arxiv.org/html/2606.10794#bib.bib14),[28](https://arxiv.org/html/2606.10794#bib.bib15)\]\. More recent fingerprinting variants improve scalability, semantic conditioning, robustness to model merging, or black\-box identity verification through targeted adversarial probes\[[3](https://arxiv.org/html/2606.10794#bib.bib27),[4](https://arxiv.org/html/2606.10794#bib.bib22),[13](https://arxiv.org/html/2606.10794#bib.bib26),[29](https://arxiv.org/html/2606.10794#bib.bib28)\]\. These methods are effective when the model owner can instrument the system before release or design verification\-specific probes, but retrospective API auditing often lacks this option\. A second line therefore searches for intrinsic evidence of model identity\. White\-box methods use parameters, gradients, or internal representations, including human\-readable fingerprints\[[33](https://arxiv.org/html/2606.10794#bib.bib30)\], gradient\-based fingerprints\[[26](https://arxiv.org/html/2606.10794#bib.bib17)\], and representation\-similarity methods such as REEF, which compares suspect and victim activations with centered\-kernel alignment\[[8](https://arxiv.org/html/2606.10794#bib.bib34),[35](https://arxiv.org/html/2606.10794#bib.bib18)\]\. Related black\-box fingerprints exploit generated\-text style\[[12](https://arxiv.org/html/2606.10794#bib.bib29)\]or output\-space behavior in API\-protected settings\[[31](https://arxiv.org/html/2606.10794#bib.bib16)\]\. 
More recent black-box provenance methods avoid target internals by comparing behavior under controlled input sets: Model Provenance Testing compares next-token similarity against unrelated controls \[[14](https://arxiv.org/html/2606.10794#bib.bib19)\], Model Provenance Set returns statistically valid candidate sets \[[19](https://arxiv.org/html/2606.10794#bib.bib20)\], LLMmap uses crafted probing queries and external text features to infer model identity \[[17](https://arxiv.org/html/2606.10794#bib.bib21)\], and PhyloLM and LLM DNA infer model relationships from outputs on common prompt sets \[[32](https://arxiv.org/html/2606.10794#bib.bib23),[27](https://arxiv.org/html/2606.10794#bib.bib6)\]. Together, these black-box methods establish that model identity can be inferred from generated outputs under carefully controlled query protocols. Their evidence, however, is often tied to the semantic distribution induced by the probes, making dynamic inputs a harder setting. READER follows the black-box direction but changes the evidence source: it maps generated text into a frozen proxy model’s activation space and accumulates calibrated log-posterior evidence. Figure [1](https://arxiv.org/html/2606.10794#S2.F1) summarizes the resulting progression from white-box access to static and dynamic black-box auditing.

![Refer to caption](https://arxiv.org/html/2606.10794v1/x1.png)

Figure 1: Provenance settings from white-box to dynamic black-box auditing. White-box methods compare model internals directly, static black-box methods query shared or controlled prompt sets, and dynamic black-box auditing must attribute sources from generated responses under query-varying prompts without target internals.

##### Mechanistic Interpretability and Proxy Activation Evidence\.

Mechanistic interpretability provides the representation\-level lens behind READER\. Early probing work established that labeled properties can be decoded from frozen neural states, while also warning that probe accuracy must be interpreted with controls and does not by itself prove causal use\[[5](https://arxiv.org/html/2606.10794#bib.bib35),[20](https://arxiv.org/html/2606.10794#bib.bib36)\]\. Work on superposition and dictionary learning then showed why many latent features may share high\-dimensional activation space and how more interpretable feature directions can be recovered\[[2](https://arxiv.org/html/2606.10794#bib.bib2),[1](https://arxiv.org/html/2606.10794#bib.bib37)\]\. Building on this view, activation steering and representation engineering demonstrated that high\-level behaviors can often be exposed or modulated through activation directions\[[36](https://arxiv.org/html/2606.10794#bib.bib3),[21](https://arxiv.org/html/2606.10794#bib.bib38)\]\. Recent work further localizes concrete attributes such as linguistic style and emotion inference inside LLM activations\[[10](https://arxiv.org/html/2606.10794#bib.bib39),[22](https://arxiv.org/html/2606.10794#bib.bib40)\]\. The linear representation hypothesis and recent analyses of its origins provide a more formal account of when such linear structure should emerge in LLM representations\[[16](https://arxiv.org/html/2606.10794#bib.bib32),[6](https://arxiv.org/html/2606.10794#bib.bib33)\], while a recent survey emphasizes that decodability remains correlational unless supported by interventions\[[34](https://arxiv.org/html/2606.10794#bib.bib7)\]\. READER uses this correlational setting as a provenance signal: it tests whether frozen proxy activations contain weak but repeatable source\-model evidence and whether Bayesian Evidence Accumulation can turn that evidence into reliable attribution\.

## 3 Methodology: The READER Framework

READER (Robust Evidence-based Authorship Decoding via Extracted Representations) treats a frozen proxy LLM as a reader of model-specific generation traces. Rather than forcing a global geometric disentanglement of semantics and authorship, READER uses two lightweight operations: temporal filtering within each response and Bayesian Evidence Accumulation across independently prompted responses. Figure [2](https://arxiv.org/html/2606.10794#S3.F2) gives the end-to-end pipeline.

![Refer to caption](https://arxiv.org/html/2606.10794v1/x2.png)

Figure 2: Overview of the READER pipeline. A frozen proxy LLM reads black-box target responses, READER temporally aggregates selected hidden states within each response, and Bayesian Evidence Accumulation combines per-response posterior evidence across multiple prompts for final source-model attribution.

### 3.1 Authorship Signal in Proxy Representations

Let $\mathbf{h}_t^{(c,p)} \in \mathbb{R}^{d}$ denote a proxy hidden state when reading text generated by target model $c$ under prompt $p$. Following the linear representation view \[[2](https://arxiv.org/html/2606.10794#bib.bib2)\], we model this state as

$$\mathbf{h}_t^{(c,p)} = \mathbf{S}^{(p)} + \Delta\mathbf{s}_t^{(p)} + \mathbf{a}^{(c)} + \boldsymbol{\epsilon}_t,$$

where $\mathbf{S}^{(p)}$ is the prompt-level semantic component, $\Delta\mathbf{s}_t^{(p)}$ is local contextual drift, $\mathbf{a}^{(c)}$ is the target-model authorship signature, and $\boldsymbol{\epsilon}_t$ is high-frequency decoding noise. Dynamic provenance is difficult because semantic variation usually dominates the weaker authorship component. READER makes authorship evidence more accessible by filtering token-level noise and accumulating weak per-response evidence, without explicitly estimating $\mathbf{a}^{(c)}$.

### 3.2 Stage 1: Temporal Low-Pass Filtering

Within one generated response, target-model habits can appear at multiple positions, while token states remain correlated through the autoregressive prefix. We sample $M$ positions from the response and use their arithmetic mean as a sequence-level representation:

$$\mathbf{u}^{(c,p)} = \frac{1}{M}\sum_{m=1}^{M}\mathbf{h}_{t_m}^{(c,p)}.$$

This averaging is a windowed temporal low-pass filter. It reduces high-frequency decoding noise and local drift, yielding a more stable representation that is still prompt-specific but more suitable for single-response attribution.
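The averaging step can be illustrated with a few lines of plain Python. This is a minimal sketch rather than the authors' implementation: the function name `temporal_mean`, the list-of-lists layout for hidden states, and the toy values are all assumptions made for illustration.

```python
def temporal_mean(hidden_states, positions):
    """Average the hidden-state vectors at the given token positions.

    hidden_states: list of per-token vectors (each a list of floats).
    positions: token indices t_1..t_M to include in the average.
    """
    if not positions:
        raise ValueError("need at least one position")
    dim = len(hidden_states[0])
    total = [0.0] * dim
    for t in positions:
        for j, value in enumerate(hidden_states[t]):
            total[j] += value
    return [x / len(positions) for x in total]


# Toy example: 4 token states of dimension 2 (hypothetical values).
states = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [7.0, 6.0]]
u = temporal_mean(states, positions=[1, 3])
print(u)  # [5.0, 4.0]
```

In a real pipeline the vectors would come from one layer of the frozen proxy, and the result plays the role of the single-response representation passed to the probe.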

### 3.3 Stage 2: Bayesian Evidence Accumulation

A filtered vector $\mathbf{u}^{(c,p)}$ can still be dominated by prompt semantics, making vector averaging across query-varying prompts fragile. READER aggregates in decision space. For $K$ independently prompted responses from the same unknown target, let $\mathcal{U} = \{\mathbf{u}_1, \ldots, \mathbf{u}_K\}$. Under conditional independence given the source model and a uniform class prior, MAP inference accumulates per-response log likelihoods $\sum_{k} \log p(\mathbf{u}_k \mid c)$.
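The step from the two stated assumptions to a sum of log likelihoods can be written out explicitly. The following is a brief sketch that uses only the symbols defined above; the paper's own derivation is in its Appendix B and may be organised differently.

$$
\begin{aligned}
\hat{y} &= \arg\max_{c \in \mathcal{C}} \; p(c \mid \mathcal{U}) \\
&= \arg\max_{c \in \mathcal{C}} \; p(c) \prod_{k=1}^{K} p(\mathbf{u}_k \mid c)
  && \text{(Bayes' rule, conditional independence)} \\
&= \arg\max_{c \in \mathcal{C}} \; \sum_{k=1}^{K} \log p(\mathbf{u}_k \mid c)
  && \text{(uniform prior, monotonicity of } \log\text{)}
\end{aligned}
$$

The normaliser $p(\mathcal{U})$ is dropped in the second line because it does not depend on $c$.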

We avoid explicit density estimation by training a discriminative probe $q_\theta(c \mid \mathbf{u})$. With a uniform prior, Bayes’ rule makes $\log q_\theta(c \mid \mathbf{u}_k)$ a class-dependent surrogate for $\log p(\mathbf{u}_k \mid c)$ up to terms independent of $c$. The resulting discriminative product-of-experts decision rule is

$$\hat{y} = \arg\max_{c \in \mathcal{C}} S_c, \qquad S_c = \frac{1}{K}\sum_{k=1}^{K} \log q_\theta(c \mid \mathbf{u}_k). \tag{1}$$

The factor $1/K$ keeps score scales comparable across query budgets without affecting the MAP class. Ambiguous prompts contribute low-margin evidence, while prompts that expose stronger authorship traces contribute sharper log-posterior evidence. Appendix [B](https://arxiv.org/html/2606.10794#A2) gives the full derivation.
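The decision rule in Eq. 1 amounts to averaging per-response log-posteriors class by class and taking the argmax. The sketch below assumes the probe outputs are already available as log-probabilities; the function name `accumulate_evidence` and the three-class toy numbers are hypothetical.

```python
import math


def accumulate_evidence(log_posteriors):
    """Return (scores, predicted_class) for K per-response log-posteriors.

    log_posteriors: list of K lists, where log_posteriors[k][c] is
    log q(c | u_k) from a single-response probe.
    """
    k = len(log_posteriors)
    n_classes = len(log_posteriors[0])
    # S_c = (1/K) * sum_k log q(c | u_k), as in Eq. 1.
    scores = [sum(lp[c] for lp in log_posteriors) / k for c in range(n_classes)]
    predicted = max(range(n_classes), key=lambda c: scores[c])
    return scores, predicted


# Three responses, three candidate models (hypothetical probabilities).
probs = [
    [0.5, 0.3, 0.2],
    [0.2, 0.5, 0.3],
    [0.6, 0.3, 0.1],
]
log_posts = [[math.log(p) for p in row] for row in probs]
scores, predicted = accumulate_evidence(log_posts)
print(predicted)  # 0
```

In this toy session the second response on its own favours class 1, yet the accumulated evidence still selects class 0, because the other two responses carry sharper evidence for it.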

For calibrated confidence, we apply a scalar evidence scale $\alpha > 0$ to the accumulated scores:

$$\widetilde{P}(c \mid \mathcal{U}) = \mathrm{softmax}(\alpha \mathbf{S})_c, \tag{2}$$

where $\alpha$ is fitted on a validation split by minimizing NLL. Since $\alpha$ is positive, calibration changes confidence but not the MAP prediction in Eq. [1](https://arxiv.org/html/2606.10794#S3.E1).
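One simple way to fit the scale in Eq. 2 is a grid search over positive values of $\alpha$, keeping the one with the lowest validation NLL. The text does not say which optimiser is used, so the grid search, the helper names, and the validation scores below are all assumptions for illustration.

```python
import math


def softmax(values):
    m = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in values]
    z = sum(exps)
    return [e / z for e in exps]


def nll(alpha, score_rows, labels):
    """Mean negative log-likelihood of the true labels under softmax(alpha * S)."""
    total = 0.0
    for scores, y in zip(score_rows, labels):
        probs = softmax([alpha * s for s in scores])
        total -= math.log(probs[y])
    return total / len(labels)


def fit_alpha(score_rows, labels, grid):
    """Pick the positive scale from `grid` with the lowest validation NLL."""
    return min(grid, key=lambda a: nll(a, score_rows, labels))


# Hypothetical validation sessions: accumulated scores S and true labels.
val_scores = [[-1.0, -2.0, -3.0], [-2.5, -1.5, -3.0], [-1.2, -1.0, -2.0]]
val_labels = [0, 1, 0]
grid = [0.25 * i for i in range(1, 41)]  # 0.25, 0.5, ..., 10.0
alpha = fit_alpha(val_scores, val_labels, grid)
print(alpha, nll(alpha, val_scores, val_labels))
```

Because every candidate scale is positive, the ranking of classes within each session is unchanged; only the sharpness of the reported confidence moves.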

### 3.4 Linear Probe Implementation

We instantiate $q_\theta$ as multinomial logistic regression over frozen proxy representations, training only the probe weights and biases. The probe is optimized on single-response examples with cross-entropy and an $L_2$ penalty on the weight matrix:

$$\mathcal{L} = -\mathbb{E}_{(\mathbf{u},y)\sim\mathcal{D}_{\mathrm{train}}}\left[\log q_\theta(y \mid \mathbf{u})\right] + \frac{\lambda_{\mathrm{LR}}}{2}\|\mathbf{W}\|_F^2, \tag{3}$$

where $\lambda_{\mathrm{LR}}$ is controlled by the inverse regularisation parameter $C_{\mathrm{LR}}$ in the logistic-regression implementation. Appendix [C.2](https://arxiv.org/html/2606.10794#A3.SS2) gives the value used in all reported experiments. At inference time, READER extracts one filtered representation per target response, computes $q_\theta(\cdot \mid \mathbf{u}_k)$ independently, and accumulates log evidence using Eq. [1](https://arxiv.org/html/2606.10794#S3.E1). Pseudocode is provided in Appendix [A](https://arxiv.org/html/2606.10794#A1).
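To make the objective in Eq. 3 concrete, here is a small softmax-regression probe trained with full-batch gradient descent in pure Python. It is a sketch under stated assumptions, not the authors' code: a library solver is presumably used in practice, and the names `train_probe` and `predict_proba`, the step size, the penalty `lam`, and the 2-D toy data are hypothetical.

```python
import math


def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    z = sum(exps)
    return [e / z for e in exps]


def predict_proba(W, b, u):
    """q(. | u): softmax over one linear score per class."""
    logits = [sum(w * x for w, x in zip(W[c], u)) + b[c] for c in range(len(W))]
    return softmax(logits)


def train_probe(X, y, n_classes, lam=1e-3, lr=0.2, steps=300):
    """Minimise mean cross-entropy + (lam / 2) * ||W||_F^2 by gradient descent."""
    dim, n = len(X[0]), len(X)
    W = [[0.0] * dim for _ in range(n_classes)]
    b = [0.0] * n_classes
    for _ in range(steps):
        # The L2 penalty applies to the weights only, not the biases.
        grad_W = [[lam * W[c][j] for j in range(dim)] for c in range(n_classes)]
        grad_b = [0.0] * n_classes
        for u, label in zip(X, y):
            q = predict_proba(W, b, u)
            for c in range(n_classes):
                err = (q[c] - (1.0 if c == label else 0.0)) / n
                grad_b[c] += err
                for j in range(dim):
                    grad_W[c][j] += err * u[j]
        for c in range(n_classes):
            b[c] -= lr * grad_b[c]
            for j in range(dim):
                W[c][j] -= lr * grad_W[c][j]
    return W, b


# Toy data: three well-separated 2-D clusters standing in for source models.
centers = [(2.0, 0.0), (-2.0, 0.0), (0.0, 2.0)]
offsets = (-0.3, 0.0, 0.3)
X, y = [], []
for label, (cx, cy) in enumerate(centers):
    for dx in offsets:
        for dy in offsets:
            X.append([cx + dx, cy + dy])
            y.append(label)

W, b = train_probe(X, y, n_classes=3)
predictions = [max(range(3), key=lambda c: predict_proba(W, b, u)[c]) for u in X]
accuracy = sum(p == t for p, t in zip(predictions, y)) / len(y)
print(accuracy)
```

The log of `predict_proba` is the per-response quantity that Eq. 1 accumulates across a query session.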

## 4 Experiment

### 4.1 Experimental Setup

##### Target ecosystem\.

We evaluate on Agent500, a 50-way dynamic provenance benchmark dataset built from an in-house corpus of 500 agent-style prompts (Appendix [C.4](https://arxiv.org/html/2606.10794#A3.SS4)). The target set spans the Llama, Qwen, Mistral, Phi, Gemma, DeepSeek and related open-model families, with parameter scales from 3B to 122B and both dense and MoE architectures. For every target $c \in \mathcal{C}$, we collect one response per prompt, yielding $25{,}000$ trajectories. The target models are treated strictly as black boxes: downstream methods observe only the generated text.

##### Proxy models and main\-text scope\.

The main text reports four representative proxy readers: Llama-3.1-8B, Qwen3-8B, Qwen3.5-9B, and Qwen3-32B. For the scaling analysis in §[4.7](https://arxiv.org/html/2606.10794#S4.SS7), we additionally evaluate five larger Qwen-3.5/3.6 dense and MoE proxies up to 122B. For each proxy $\phi$, we extract hidden states from the best layer selected by the validation procedure in §[4.6](https://arxiv.org/html/2606.10794#S4.SS6). Intra-sequence filtering uses hidden states stored at $M_{\max} = 16$ uniformly spaced positions over the first $128$ generated tokens; $M = 1$ uses the last stored prefix position, while $M > 1$ uses $M$ uniformly spaced stored positions.
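The position scheme can be sketched as index arithmetic. The exact spacing rule is not spelled out in the text, so the formula below is one plausible reading (each stored position closes an equal-width block of the 128-token prefix), and both function names are hypothetical.

```python
def stored_positions(n_tokens=128, m_max=16):
    """Token indices (0-based) of m_max evenly spaced prefix positions."""
    return [(i + 1) * n_tokens // m_max - 1 for i in range(m_max)]


def select_positions(stored, m):
    """Pick m evenly spaced entries from the stored positions.

    With m == 1 this returns only the last stored position.
    """
    if not 1 <= m <= len(stored):
        raise ValueError("m must be between 1 and len(stored)")
    return [stored[(j + 1) * len(stored) // m - 1] for j in range(m)]


stored = stored_positions()
print(select_positions(stored, 1))  # [127]
print(select_positions(stored, 4))  # [31, 63, 95, 127]
```

Under this reading the $M = 1$ case falls out of the same formula as $M > 1$, which matches the statement that a single position means the last stored prefix state. How responses shorter than 128 tokens are handled is not covered here.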

##### Baselines\.

We include a uniform random 50-way baseline and three LLM-DNA-style black-box sentence-encoder fingerprints \[[27](https://arxiv.org/html/2606.10794#bib.bib6)\]: all-mpnet-base-v2 (110M), bge-large-en-v1.5 (335M), and Qwen3-Embedding-8B (8B). To isolate the representation rather than the classifier, the sentence-encoder baselines use the same downstream linear probe and Bayesian Evidence Accumulation as READER.

##### Evaluation protocol\.

All supervised metrics are computed with prompt-level 5-fold cross-validation on Agent500, so train and test folds contain disjoint prompts for every target. We evaluate query budgets $K \in \{1, 5, 10, 20, 50\}$ and emphasize two operating points: $K = 1$ for single-response attribution and $K = 50$ for the practical multi-query setting. Top-1 accuracy is averaged over random query sessions per target. Pair-AUC and mAP@10 are diagnostic metrics on the same grouped fingerprint space: mAP@10 uses cosine retrieval, while Pair-AUC averages one-vs-one linear separability over target pairs. Neither metric is used by READER’s 50-way attribution head. Unless otherwise noted, READER uses $M = 4$ and Bayesian Evidence Accumulation. Appendix [D.5](https://arxiv.org/html/2606.10794#A4.SS5)–[D.6](https://arxiv.org/html/2606.10794#A4.SS6) reports aggregation ablations, including geometric mean-pooling with logistic regression and Gaussian discriminant scoring.
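Prompt-level cross-validation means that folds are formed over prompt identifiers, so every target's response to a held-out prompt lands in the test split together. A minimal sketch follows; the `(target, prompt_id)` tuple layout, the function names, and the tiny corpus are assumptions for illustration.

```python
import random


def prompt_level_folds(prompt_ids, n_folds=5, seed=0):
    """Partition unique prompt ids into n_folds disjoint test folds."""
    unique_ids = sorted(set(prompt_ids))
    random.Random(seed).shuffle(unique_ids)
    return [unique_ids[i::n_folds] for i in range(n_folds)]


def split_by_prompt(examples, test_prompts):
    """Split (target, prompt_id) examples so test prompts never appear in train."""
    held_out = set(test_prompts)
    train = [e for e in examples if e[1] not in held_out]
    test = [e for e in examples if e[1] in held_out]
    return train, test


# Toy corpus: 3 targets x 10 prompts (hypothetical, far smaller than Agent500).
examples = [(target, prompt) for target in ("A", "B", "C") for prompt in range(10)]
folds = prompt_level_folds([p for _, p in examples])
train, test = split_by_prompt(examples, folds[0])
print(len(train), len(test))  # 24 6
```

Splitting by response instead of by prompt would let the probe see the same prompt semantics at train and test time, which is the leakage this protocol is designed to avoid.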

### 4.2 Main Results: Dynamic Provenance Accuracy

![Refer to caption](https://arxiv.org/html/2606.10794v1/x3.png)

Figure 3: Dynamic provenance versus sentence-encoder baselines. Solid lines show READER on the four main-text proxies (Bayesian Evidence Accumulation, $M=4$). Dashed lines show three LLM-DNA-style sentence encoders under the same downstream pipeline. READER provides substantially higher top-1 accuracy, while Pair-AUC and mAP@10 diagnose separability and retrieval quality in the same grouped fingerprint space. The full nine-proxy version is reported in Appendix [C.5](https://arxiv.org/html/2606.10794#A3.SS5).

Figure [3](https://arxiv.org/html/2606.10794#S4.F3) and Table [1](https://arxiv.org/html/2606.10794#S4.T1) give the main comparison. A uniform random classifier obtains $2\%$ top-1 accuracy on this 50-way task. With a single response ($K=1$), READER reaches $31.0$–$42.4\%$ accuracy, whereas the strongest sentence-encoder baseline is $11.9\%$. This single-response result is important because it shows that proxy hidden states contain linearly decodable authorship evidence before any multi-query averaging. At $K=50$, READER reaches $70.0$–$84.0\%$ across the four proxy readers, while the sentence-encoder baselines remain at or below $18.8\%$, showing that Bayesian Evidence Accumulation can turn this weak but repeatable evidence into reliable multi-query attribution. Residual errors remain concentrated among related checkpoints, but hidden states from a capable frozen proxy expose authorship evidence that generic sentence embeddings largely discard.

Table 1: Single-query and multi-query provenance on the 50-target ecosystem. Each entry is mean $\pm$ std where available. Accuracy is computed with READER’s 50-way Bayesian Evidence Accumulation. Pair-AUC and mAP@10 are representation diagnostics on the same grouped fingerprint space: mAP@10 uses cosine retrieval, while Pair-AUC averages one-vs-one linear separability over target pairs. They are not used for READER inference. Random is the 50-way chance baseline. Bold = best per column, underlined = second best.

### 4.3 Hyperparameter Analysis: $M \times K$

![Refer to caption](https://arxiv.org/html/2606.10794v1/x4.png)

Figure 4: Joint sweep of $M$ (temporal filter width) and $K$ (Bayesian budget) on Llama-3.1-8B, Qwen3-8B, Qwen3.5-9B and Qwen3-32B. The $M=4$ setting captures most of the benefit from intra-sequence filtering; larger values provide limited additional accuracy while increasing feature extraction cost.

Figure [4](https://arxiv.org/html/2606.10794#S4.F4) jointly varies the intra-sequence sample count $M$ and the cross-prompt budget $K$. Two practical findings inform our default. First, temporal filtering saturates early: $M=4$ is competitive with $M=8$ and $M=16$ across the plotted budgets, which suggests that a small number of evenly spaced response states is sufficient for this benchmark. Second, most of the multi-query gain is realized by $K=50$; larger budgets give smaller and less consistent returns. We therefore use $(M,K)=(4,50)$ as the main operating point.

### 4.4 Per-Target Prediction Behaviour

![Refer to caption](https://arxiv.org/html/2606.10794v1/x5.png)

Figure 5: $50 \times 50$ confusion matrices at $K=50$, $M=4$ (BEA). One panel per main-text proxy. Rows are grouped by family. The main off-diagonal mass stays inside a few related families, especially Qwen3, Qwen2.5 and DeepSeek. The Llama block is comparatively weak under the Llama-3.1-8B proxy but becomes more diagonal under the three Qwen proxies. The full nine-proxy panel and the $K=10$ counterpart are deferred to Appendix [C.5](https://arxiv.org/html/2606.10794#A3.SS5).

Figure [5](https://arxiv.org/html/2606.10794#S4.F5) visualises where the remaining mistakes occur. The Qwen3, Qwen2.5, and DeepSeek blocks contain the lightest diagonals and the clearest within-block off-diagonal structure, indicating that checkpoints inside these families are the hardest to separate. This is expected: many of these targets share nearby base weights and differ mainly by scale or post-training recipe. The proxy choice also matters. With Llama-3.1-8B as reader, even the Llama family block is only weakly resolved. Replacing the reader with Qwen3-8B, Qwen3.5-9B, or Qwen3-32B makes the Llama diagonal visibly darker and reduces within-family leakage, matching the quantitative improvement in Table [1](https://arxiv.org/html/2606.10794#S4.T1).

### 4.5 Authorship Geometry: t-SNE Visualisation

![Refer to caption](https://arxiv.org/html/2606.10794v1/x6.png)

Figure 6: t-SNE projection of randomly grouped $K=10$ proxy fingerprints (one panel per proxy, $M=4$). Each point is a mean-pooled proxy-hidden-state fingerprint before the supervised provenance head; colours denote model families. Even without using classifier predictions in the visualization, the representation exhibits family-level organization.

To visualise the geometry before the supervised head, we randomly group $K=10$ responses per target, average their proxy hidden-state fingerprints, and run t-SNE on these aggregated vectors (Fig. [6](https://arxiv.org/html/2606.10794#S4.F6)). This representation-level visualization probes whether frozen proxy activations already contain family structure. It suggests that frozen proxy LLM representations encode source-model-related structure before supervised attribution, and that stronger proxy readers expose clearer family-level organization. The Llama-3.1-8B reader gives the weakest geometry: Qwen2.5 and Llama points overlap substantially, and part of the Qwen3 mass is close to DeepSeek. The three Qwen readers show clearer family organization, with Qwen3-8B separating several coarse groups and Qwen3.5-9B producing the cleanest family-level layout among the four panels. This mirrors the main accuracy trend at small budgets, where Qwen3.5-9B is the strongest reader at $K=1$ and remains among the best at $K=10$.

### 4.6 Best-Layer Localisation

![Refer to caption](https://arxiv.org/html/2606.10794v1/x7.png)

Figure 7: Layer-wise probe accuracy heatmap across the full nine-proxy roster, plotted along relative depth (0 = embedding, 1 = final layer). The sweep is run at $M=1$, $K=1$: each layer is evaluated with the same simple single-response representation, so the diagnostic reflects layer choice itself rather than its interaction with intra-response temporal averaging. White stars mark the selected layer. Llama-3.1-8B peaks at the final layer, whereas the Qwen proxies usually peak in the middle-to-late stack before the final layer (e.g., Qwen3-8B $23/36$, Qwen3.5-9B $19/32$, Qwen3-32B $50/64$).

We probe each residual-stream layer with a per-layer linear classifier and record top-1 accuracy across all nine proxies. To make this diagnostic cleaner, the sweep fixes $M=1$ and $K=1$, so every layer is scored from the same simple single-response representation. Figure [7](https://arxiv.org/html/2606.10794#S4.F7) shows that the useful layer is architecture dependent. Llama-3.1-8B peaks at the final layer, while the Qwen proxies typically peak earlier in the middle-to-late stack. This agrees qualitatively with analyses of internal policy formation in decoder LLMs \[[23](https://arxiv.org/html/2606.10794#bib.bib5)\], but here the observation is operational: layer selection matters, and the final layer is not uniformly optimal.

##### Outlier: Qwen3\.6\-35B\-A3B\.

One Qwen variant has an argmax near the final layer ($L=39/40$, accuracy $0.319$), but this is a shallow plateau rather than a clear final-layer optimum. The earliest layer within one percentage point of the best score is already $L=29$ (accuracy $0.314$), only $0.005$ below the selected layer and within binomial noise. Per-proxy accuracy curves with macro-F1 overlays are reported in Appendix [C.5](https://arxiv.org/html/2606.10794#A3.SS5).
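The plateau check can be expressed as a short helper that finds the shallowest layer whose accuracy is close to the best one. The sketch reads the 1% margin as an absolute tolerance of $0.01$, an assumption that is consistent with the reported $0.005$ gap; the function name and every accuracy other than the two quoted above are hypothetical.

```python
def earliest_layer_within(acc_by_layer, tol=0.01):
    """Return the shallowest layer whose accuracy is within `tol` of the best."""
    best = max(acc_by_layer.values())
    return min(layer for layer, acc in acc_by_layer.items() if best - acc <= tol)


# Layers 29 and 39 use the accuracies quoted above; the rest are hypothetical.
acc_by_layer = {20: 0.280, 25: 0.301, 29: 0.314, 34: 0.316, 39: 0.319, 40: 0.317}
print(earliest_layer_within(acc_by_layer))  # 29
```

A gap of ten layers between the argmax and the first layer inside the tolerance is what distinguishes a plateau from a sharp final-layer peak.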

### 4.7 Proxy Scaling: Authorship Evidence Tracks Proxy Capability

![Refer to caption](https://arxiv.org/html/2606.10794v1/x8.png)

Figure 8: Single-query accuracy tracks proxy capability. Each point is one frozen proxy reader evaluated on Agent500 at $(M=1, K=1)$, before any intra-response averaging or multi-query evidence accumulation. Stronger benchmark capability is tightly associated with more linearly decodable authorship evidence (Pearson $r=0.942$, Spearman $\rho=0.917$).

Going beyond the four main-text proxies, we extract READER fingerprints from nine proxies spanning $8$B–$122$B parameters and correlate their $K=1$ identification accuracy with reported MMLU-Pro score \[[25](https://arxiv.org/html/2606.10794#bib.bib4)\]. Figure [8](https://arxiv.org/html/2606.10794#S4.F8) shows a strong positive association: higher-capability proxy readers yield more separable authorship evidence from the same black-box responses (Pearson $r=0.942$, Spearman $\rho=0.917$). We use $M=1$ here to keep the diagnostic aligned with the layer-localisation analysis and to avoid conflating proxy choice with intra-response temporal averaging. The association is also present under the main-paper representation with $M=4$ (Pearson $r=0.844$, Spearman $\rho=0.783$), indicating that the trend is not specific to a single feature-aggregation setting.
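For readers who want to reproduce this kind of analysis on their own proxy roster, both coefficients can be computed in a few lines. The sketch below implements Pearson's $r$ directly and Spearman's $\rho$ as Pearson on average ranks; the data points are hypothetical and are not the values behind Figure 8.

```python
import math


def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """Average ranks (1-based), so ties share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson(ranks(xs), ranks(ys))


# Hypothetical (capability score, K=1 accuracy) pairs, not the paper's data.
capability = [40.0, 48.0, 55.0, 61.0, 70.0]
accuracy = [0.22, 0.27, 0.26, 0.33, 0.38]
print(pearson(capability, accuracy), spearman(capability, accuracy))
```

With only nine proxies in the actual study, both coefficients rest on a small sample, so reporting the rank-based value alongside the linear one is a sensible guard against a single outlying reader.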

### 4.8 Robustness under Realistic API Masking

![Refer to caption](https://arxiv.org/html/2606.10794v1/x9.png)

Figure 9: Robustness to PII redaction (agent-style API masking). For each ratio $R \in \{10, 20, 30, 40, 50\}\%$, randomly selected words in every response are replaced with [REDACTED] prior to feature extraction. READER degrades with heavier masking, but multi-query aggregation keeps the proxy curves well above the sentence-encoder baselines across the tested redaction levels.

Real-world agent pipelines often interpose PII-scrubbing or content-filtering middleware that removes fragments of text before it reaches an auditor. We simulate this by replacing randomly selected response words with [REDACTED], then re-extract features and evaluate at $M{=}4$ across multiple aggregation budgets. Figure [9](https://arxiv.org/html/2606.10794#S4.F9) shows a smooth but non-catastrophic degradation as the redaction ratio increases. At $K{=}50$, the four main proxies retain $54.8$–$62.4\%$ top-1 accuracy even at $R{=}50\%$, compared with $70.0$–$84.0\%$ without masking. Increasing $K$ lifts the entire curve and makes the degradation less abrupt: at $R{=}50\%$, mean proxy accuracy rises from $43.5\%$ at $K{=}10$ to $58.9\%$ at $K{=}50$ and $64.3\%$ at $K{=}100$. The slope with respect to masking is therefore smaller after sufficient multi-query accumulation, consistent with READER aggregating residual authorship evidence rather than relying on any single unmasked lexical cue. Heavy redaction remains a real failure mode, but the gap to sentence-encoder baselines persists across the tested settings. Numerical details are reported in Appendix Table [7](https://arxiv.org/html/2606.10794#A3.T7).

## 5 Limitations

##### Scope and closed-set assumption.

READER is evaluated as a closed-set provenance system: the true source model is assumed to belong to the candidate ecosystem $\mathcal{C}$. Although different model families naturally induce distinct distributions in proxy-model activations, this separability does not by itself solve open-world detection. A deployment would need rejection and calibration for unknown models, and adding a new candidate requires training a new lightweight linear-probe head over the expanded label set. Our benchmark covers 50 targets and nine proxies, but remains finite and does not include closed-source API targets. Broader claims require wider coverage across model ecosystems, languages, tasks, decoding policies, and deployment settings.

##### Single\-source and adaptive settings\.

We assume each observed response is generated by a single source model\. This excludes multi\-source settings where an agent routes sub\-steps through different backends, stitches together outputs from multiple models, or post\-processes one model’s response with another model\. We also evaluate realistic text masking, but not fully adaptive attacks: a target provider aware of READER could paraphrase outputs, randomize decoding style, route requests through multiple backends, or optimize against a proxy\-based detector\. Extending READER to mixture provenance and adaptive provenance games is an important direction for future work\.

##### Task\-specific training\.

One could train a supervised detector specifically for this attribution task, potentially improving the final metrics\. Our goal is different: we ask whether a frozen LLM already carries useful authorship evidence in its activations, and whether a lightweight probe can expose and aggregate that evidence\. Stronger task\-specific architectures are therefore complementary to, rather than replacements for, the core finding that proxy LLMs possess usable provenance sensitivity\.

## 6 Conclusion

We presented READER, an evidence\-based framework for black\-box LLM provenance\. Rather than matching static output distributions or relying on target\-side instrumentation, READER uses a frozen proxy LLM to convert generated text into hidden\-state evidence and aggregates that evidence across prompts with Bayesian Evidence Accumulation\. On a 50\-model dynamic provenance benchmark, this simple proxy\-reader design substantially outperforms sentence\-encoder fingerprint baselines and remains effective under multi\-query aggregation and realistic API masking\.

Beyond its practical attribution performance, READER suggests that capable proxy LLMs encode useful information about the source model behind a text sample\. The strong relationship between proxy capability and single\-query attribution accuracy indicates that stronger models may serve as better provenance readers because they expose cleaner authorship evidence from the same black\-box text\. We view this as a step toward non\-intrusive auditing tools for the black\-box LLM ecosystem, while leaving open\-world detection, calibration, and adaptive robustness as important next challenges\.

## References

- \[1\]T\. Bricken, A\. Templeton, J\. Batson, B\. Chen, A\. Jermyn, T\. Conerly, N\. Turner, C\. Anil, C\. Denison, A\. Askell, R\. Lasenby, Y\. Wu, S\. Kravec, N\. Schiefer, T\. Maxwell,et al\.\(2023\)Towards monosemanticity: decomposing language models with dictionary learning\.Transformer Circuits Thread\.Note:[https://transformer\-circuits\.pub/2023/monosemantic\-features/index\.html](https://transformer-circuits.pub/2023/monosemantic-features/index.html)Cited by:[§2](https://arxiv.org/html/2606.10794#S2.SS0.SSS0.Px2.p1.1)\.
- \[2\]N\. Elhage, T\. Hume, C\. Olsson, N\. Schiefer, T\. Henighan, S\. Kravec, Z\. Hatfield\-Dodds, R\. Lasenby, D\. Drain, C\. Chen,et al\.\(2022\)Toy models of superposition\.arXiv preprint arXiv:2209\.10652\.Cited by:[§2](https://arxiv.org/html/2606.10794#S2.SS0.SSS0.Px2.p1.1),[§3\.1](https://arxiv.org/html/2606.10794#S3.SS1.p1.3)\.
- \[3\]T\. Gloaguen, R\. Staab, N\. Jovanović, and M\. Vechev\(2026\)LLM fingerprinting via semantically conditioned watermarks\.InThe Fourteenth International Conference on Learning Representations,Cited by:[§2](https://arxiv.org/html/2606.10794#S2.SS0.SSS0.Px1.p1.1)\.
- \[4\]M\. Gubri, D\. Ulmer, H\. Lee, S\. Yun, and S\. J\. Oh\(2024\)TRAP: targeted random adversarial prompt honeypot for black\-box identification\.InFindings of the Association for Computational Linguistics: ACL 2024,pp\. 11496–11517\.Cited by:[§2](https://arxiv.org/html/2606.10794#S2.SS0.SSS0.Px1.p1.1)\.
- \[5\]J\. Hewitt and P\. Liang\(2019\)Designing and interpreting probes with control tasks\.InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing,pp\. 2733–2743\.Cited by:[§2](https://arxiv.org/html/2606.10794#S2.SS0.SSS0.Px2.p1.1)\.
- \[6\]Y\. Jiang, G\. Rajendran, P\. K\. Ravikumar, B\. Aragam, and V\. Veitch\(2024\)On the origins of linear representations in large language models\.InProceedings of the 41st International Conference on Machine Learning,pp\. 21879–21911\.Cited by:[§2](https://arxiv.org/html/2606.10794#S2.SS0.SSS0.Px2.p1.1)\.
- \[7\]J\. Kirchenbauer, J\. Geiping, Y\. Wen, J\. Katz, I\. Miers, and T\. Goldstein\(2023\)A watermark for large language models\.InInternational Conference on Machine Learning,pp\. 17061–17084\.Cited by:[§1](https://arxiv.org/html/2606.10794#S1.p1.1),[§2](https://arxiv.org/html/2606.10794#S2.SS0.SSS0.Px1.p1.1)\.
- \[8\]S\. Kornblith, M\. Norouzi, H\. Lee, and G\. Hinton\(2019\)Similarity of neural network representations revisited\.InProceedings of the 36th International Conference on Machine Learning,pp\. 3519–3529\.Cited by:[§2](https://arxiv.org/html/2606.10794#S2.SS0.SSS0.Px1.p1.1)\.
- \[9\]R\. Kuditipudi, J\. Thickstun, T\. Hashimoto, and P\. Liang\(2024\)Robust distortion\-free watermarks for language models\.Transactions on Machine Learning Research\.Cited by:[§1](https://arxiv.org/html/2606.10794#S1.p1.1),[§2](https://arxiv.org/html/2606.10794#S2.SS0.SSS0.Px1.p1.1)\.
- \[10\]W\. Lai, V\. Hangya, and A\. Fraser\(2024\)Style\-specific neurons for steering LLMs in text style transfer\.InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing,pp\. 13427–13443\.Cited by:[§2](https://arxiv.org/html/2606.10794#S2.SS0.SSS0.Px2.p1.1)\.
- \[11\]Z\. Lin, M\. Feng, C\. N\. dos Santos, M\. Yu, B\. Xiang, B\. Zhou, and Y\. Bengio\(2017\)A structured self\-attentive sentence embedding\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=BJC_jUqxe)Cited by:[§D\.2](https://arxiv.org/html/2606.10794#A4.SS2.SSS0.Px1.p1.2)\.
- \[12\]H\. McGovern, R\. Stureborg, Y\. Suhara, and D\. Alikaniotis\(2025\)Your large language models are leaving fingerprints\.InProceedings of the 1st Workshop on GenAI Content Detection,pp\. 85–95\.Cited by:[§2](https://arxiv.org/html/2606.10794#S2.SS0.SSS0.Px1.p1.1)\.
- \[13\]A\. Nasery, J\. Hayase, C\. Brooks, P\. Sheng, H\. Tyagi, P\. Viswanath, and S\. Oh\(2025\)Scalable fingerprinting of large language models\.InThe Thirty\-ninth Annual Conference on Neural Information Processing Systems,Cited by:[§2](https://arxiv.org/html/2606.10794#S2.SS0.SSS0.Px1.p1.1)\.
- \[14\]I\. Nikolic, T\. Baluta, and P\. Saxena\(2025\)Model provenance testing for large language models\.InThe Thirty\-ninth Annual Conference on Neural Information Processing Systems,Cited by:[§1](https://arxiv.org/html/2606.10794#S1.p2.1),[§2](https://arxiv.org/html/2606.10794#S2.SS0.SSS0.Px1.p1.1)\.
- \[15\]OpenAI\(2024\)GPT\-4o system card\.Note:[https://openai\.com/index/gpt\-4o\-system\-card/](https://openai.com/index/gpt-4o-system-card/)Cited by:[§1](https://arxiv.org/html/2606.10794#S1.p1.1)\.
- \[16\]K\. Park, Y\. J\. Choe, and V\. Veitch\(2024\)The linear representation hypothesis and the geometry of large language models\.InProceedings of the 41st International Conference on Machine Learning,pp\. 39643–39666\.Cited by:[§2](https://arxiv.org/html/2606.10794#S2.SS0.SSS0.Px2.p1.1)\.
- \[17\]D\. Pasquini, E\. M\. Kornaropoulos, and G\. Ateniese\(2025\)LLMmap: fingerprinting for large language models\.In34th USENIX Security Symposium,pp\. 299–318\.Cited by:[§1](https://arxiv.org/html/2606.10794#S1.p2.1),[§2](https://arxiv.org/html/2606.10794#S2.SS0.SSS0.Px1.p1.1)\.
- \[18\]W\. Peng, J\. Yi, F\. Wu, S\. Wu, B\. B\. Zhu, L\. Lyu, B\. Jiao, T\. Xu, G\. Sun, and X\. Xie\(2023\)Are you copying my model? protecting the copyright of large language models for eaas via backdoor watermark\.InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics,pp\. 7653–7668\.Cited by:[§2](https://arxiv.org/html/2606.10794#S2.SS0.SSS0.Px1.p1.1)\.
- \[19\]X\. Qiu, H\. Zeng, Z\. Hou, and H\. Wei\(2026\)Provable model provenance set for large language models\.arXiv preprint arXiv:2602\.00772\.Cited by:[§2](https://arxiv.org/html/2606.10794#S2.SS0.SSS0.Px1.p1.1)\.
- \[20\]A\. Ravichander, Y\. Belinkov, and E\. H\. Hovy\(2021\)Probing the probing paradigm: does probing accuracy entail task relevance?\.InProceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics,pp\. 3363–3377\.Cited by:[§2](https://arxiv.org/html/2606.10794#S2.SS0.SSS0.Px2.p1.1)\.
- \[21\]N\. Rimsky, N\. Gabrieli, J\. Schulz, M\. Tong, E\. Hubinger, and A\. Turner\(2024\)Steering llama 2 via contrastive activation addition\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics,pp\. 15504–15522\.Cited by:[§2](https://arxiv.org/html/2606.10794#S2.SS0.SSS0.Px2.p1.1)\.
- \[22\]A\. N\. Tak, A\. Banayeeanzade, A\. Bolourani, M\. Kian, R\. Jia, and J\. Gratch\(2025\)Mechanistic interpretability of emotion inference in large language models\.InFindings of the Association for Computational Linguistics: ACL 2025,pp\. 13090–13120\.Cited by:[§2](https://arxiv.org/html/2606.10794#S2.SS0.SSS0.Px2.p1.1)\.
- \[23\]Y\. Tan, M\. Wang, S\. He, H\. Liao, C\. Zhao, Q\. Lu, T\. Liang, J\. Zhao, and K\. Liu\(2025\)Bottom\-up policy optimization: your language model policy secretly contains internal policies\.arXiv preprint arXiv:2512\.19673\.Cited by:[§4\.6](https://arxiv.org/html/2606.10794#S4.SS6.p1.2)\.
- \[24\]P\. Tschisgale and P\. Wulff\(2026\)Evidence for daily and weekly periodic variability in gpt\-4o performance\.arXiv preprint arXiv:2602\.15889\.Cited by:[§1](https://arxiv.org/html/2606.10794#S1.p1.1)\.
- \[25\]Y\. Wang, X\. Ma, G\. Zhang, Y\. Ni, A\. Chandra, S\. Guo, W\. Ren, A\. Arulraj, X\. He, Z\. Jiang,et al\.\(2024\)Mmlu\-pro: a more robust and challenging multi\-task language understanding benchmark\.arXiv preprint arXiv:2406\.01574\.Cited by:[§4\.7](https://arxiv.org/html/2606.10794#S4.SS7.p1.9)\.
- \[26\]Z\. Wu, Y\. Zhao, and H\. Wang\(2025\)Gradient\-based model fingerprinting for LLM similarity detection and family classification\.arXiv preprint arXiv:2506\.01631\.Cited by:[§2](https://arxiv.org/html/2606.10794#S2.SS0.SSS0.Px1.p1.1)\.
- \[27\]Z\. Wu, H\. Zhao, Z\. Wang, J\. Guo, Q\. Wang, and B\. He\(2026\)LLM dna: tracing model evolution via functional representations\.InThe Fourteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/pdf?id=UIxHaAqFqQ)Cited by:[Table 5](https://arxiv.org/html/2606.10794#A3.T5.65.60.1.1.1),[§1](https://arxiv.org/html/2606.10794#S1.p2.1),[§2](https://arxiv.org/html/2606.10794#S2.SS0.SSS0.Px1.p1.1),[§4\.1](https://arxiv.org/html/2606.10794#S4.SS1.SSS0.Px3.p1.1),[Table 1](https://arxiv.org/html/2606.10794#S4.T1.48.46.48.2.1.1)\.
- \[28\]J\. Xu, F\. Wang, M\. Ma, P\. W\. Koh, C\. Xiao, and M\. Chen\(2024\)Instructional fingerprinting of large language models\.InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies,pp\. 3277–3306\.Cited by:[§2](https://arxiv.org/html/2606.10794#S2.SS0.SSS0.Px1.p1.1)\.
- \[29\]S\. Yamabe, F\. K\. Waseda, T\. Takahashi, and K\. Wataoka\(2025\)MergePrint: merge\-resistant fingerprints for robust black\-box ownership verification of large language models\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics,pp\. 6894–6916\.Cited by:[§2](https://arxiv.org/html/2606.10794#S2.SS0.SSS0.Px1.p1.1)\.
- \[30\]X\. Yang, L\. Li, H\. Zhou, T\. Zhu, X\. Qu, Y\. Fan, Q\. Wei, R\. Ye, L\. Kang, Y\. Qin,et al\.\(2026\)Toward efficient agents: memory, tool learning, and planning\.arXiv preprint arXiv:2601\.14192\.Cited by:[§1](https://arxiv.org/html/2606.10794#S1.p1.1)\.
- \[31\]Z\. Yang and H\. Wu\(2024\)A fingerprint for large language models\.arXiv preprint arXiv:2407\.01235\.Cited by:[§2](https://arxiv.org/html/2606.10794#S2.SS0.SSS0.Px1.p1.1)\.
- \[32\]N\. Yax, P\. Oudeyer, and S\. Palminteri\(2025\)PhyloLM: inferring the phylogeny of large language models and predicting their performances in benchmarks\.InThe Thirteenth International Conference on Learning Representations,Cited by:[§1](https://arxiv.org/html/2606.10794#S1.p2.1),[§2](https://arxiv.org/html/2606.10794#S2.SS0.SSS0.Px1.p1.1)\.
- \[33\]B\. Zeng, L\. Wang, Y\. Hu, Y\. Xu, C\. Zhou, X\. Wang, Y\. Yu, and Z\. Lin\(2024\)HuRef: human\-readable fingerprint for large language models\.InAdvances in Neural Information Processing Systems,Vol\.37,pp\. 126332–126362\.Cited by:[§2](https://arxiv.org/html/2606.10794#S2.SS0.SSS0.Px1.p1.1)\.
- \[34\]H\. Zhang, Z\. Zhang, M\. Wang, Z\. Su, Y\. Wang, Q\. Wang, S\. Yuan, E\. Nie, X\. Duan, F\. Han,et al\.\(2026\)Locate, steer, and improve: a practical survey of actionable mechanistic interpretability in large language models\.arXiv preprint arXiv:2601\.14004\.Cited by:[§1](https://arxiv.org/html/2606.10794#S1.p3.1),[§2](https://arxiv.org/html/2606.10794#S2.SS0.SSS0.Px2.p1.1)\.
- \[35\]J\. Zhang, D\. Liu, C\. Qian, L\. Zhang, Y\. Liu, Y\. Qiao, and J\. Shao\(2025\)REEF: representation encoding fingerprints for large language models\.InThe Thirteenth International Conference on Learning Representations,Cited by:[§2](https://arxiv.org/html/2606.10794#S2.SS0.SSS0.Px1.p1.1)\.
- \[36\]A\. Zou, L\. Phan, S\. Chen, J\. Campbell, P\. Guo, R\. Ren, A\. Pan, X\. Yin, M\. Mazeika, A\. Dombrowski,et al\.\(2023\)Representation engineering: a top\-down approach to ai transparency\.arXiv preprint arXiv:2310\.01405\.Cited by:[§2](https://arxiv.org/html/2606.10794#S2.SS0.SSS0.Px2.p1.1)\.

## Appendix A READER Inference Algorithm

Algorithm 1: READER inference with Bayesian Evidence Accumulation

Input: black-box target API $f$, frozen proxy LLM $\phi$, prompt distribution $\mathcal{P}$, trained $L_2$-regularised probe $q_\theta$ with parameters $(\mathbf{W}, \mathbf{b})$.

Hyperparameters: query budget $K$, intra-sequence budget $M$, proxy layer $\ell$, numerical floor $\varepsilon$.

Output: predicted source model $\hat{y}$ and optional calibrated confidence $\widetilde{P}$.

1: $\mathbf{S} \leftarrow \mathbf{0} \in \mathbb{R}^{|\mathcal{C}|}$

2: for $k = 1$ to $K$ do

3: Sample prompt $p_k \sim \mathcal{P}$ and query $x_k \leftarrow f(p_k)$

4: Run the frozen proxy on the observed text and collect layer-$\ell$ response hidden states $\{\mathbf{h}^{(k)}_t\}_{t=1}^{T_k}$

5: $\mathcal{I} \leftarrow \mathrm{round}(\mathrm{linspace}(1, T_k, M))$ {uniformly spaced response positions}

6: $\mathbf{u}_k \leftarrow |\mathcal{I}|^{-1} \sum_{t \in \mathcal{I}} \mathbf{h}^{(k)}_t$

7: $\mathbf{p}_k \leftarrow \mathrm{softmax}(\mathbf{W}^{\top} \mathbf{u}_k + \mathbf{b})$

8: $\mathbf{S} \leftarrow \mathbf{S} + \log(\mathbf{p}_k + \varepsilon)$

9: end for

10: $\mathbf{S} \leftarrow \mathbf{S} / K$

11: $\hat{y} \leftarrow \arg\max_{c \in \mathcal{C}} S_c$

12: if calibrated confidence is required then

13: $\widetilde{P}(c \mid \{x_k\}_{k=1}^{K}) \leftarrow \mathrm{softmax}(\alpha \mathbf{S})_c$ {$\alpha$ is validation-fitted}

14: end if

15: return $\hat{y}$ and, if requested, $\widetilde{P}$
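Steps 5 to 11 of Algorithm 1 can be sketched in dependency-free Python as follows. This is an illustrative sketch, not the released implementation: hidden states are passed in as plain lists (the proxy forward pass in steps 3 and 4 is assumed to have happened elsewhere), the feature standardiser is omitted, the function names and the value of `eps` are assumptions, and positions are 0-based.

```python
import math


def uniform_positions(T, M):
    """round(linspace(1, T, M)) as 0-based, de-duplicated indices.

    Assumption: for M == 1 this follows the NumPy linspace convention and
    returns the first position; the paper does not spell out this edge case.
    """
    if M == 1:
        return [0]
    return sorted({round((T - 1) * i / (M - 1)) for i in range(M)})


def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def probe_posterior(u, W, b):
    """softmax(W^T u + b), with W stored as d rows of |C| columns."""
    logits = [b[c] + sum(u[j] * W[j][c] for j in range(len(u)))
              for c in range(len(b))]
    return softmax(logits)


def reader_predict(responses_hidden, W, b, M=4, eps=1e-12):
    """Accumulate per-response log-posteriors and return (y_hat, S)."""
    S = [0.0] * len(b)
    for hidden in responses_hidden:  # hidden: T_k vectors of dimension d
        idx = uniform_positions(len(hidden), M)
        d = len(hidden[0])
        u = [sum(hidden[t][j] for t in idx) / len(idx) for j in range(d)]
        p = probe_posterior(u, W, b)
        S = [s + math.log(pc + eps) for s, pc in zip(S, p)]
    S = [s / len(responses_hidden) for s in S]  # step 10: normalise by K
    y_hat = max(range(len(S)), key=lambda c: S[c])
    return y_hat, S


# Toy example: 2-dimensional "hidden states", 2 candidate models.
W_toy, b_toy = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
toy_responses = [[[2.0, 0.0]] * 6, [[1.0, 0.5]] * 3]
y_hat, S = reader_predict(toy_responses, W_toy, b_toy)
```

Because evidence is summed in log space, one response with a near-zero probability for the true class can dominate the total; the floor $\varepsilon$ bounds that penalty at $\log \varepsilon$ per response.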

## Appendix B Derivation of Bayesian Evidence Accumulation

This appendix expands the probabilistic justification for the decision rule in Sec. [3.3](https://arxiv.org/html/2606.10794#S3.SS3). For a fixed unknown target model, let $\mathcal{U} = \{\mathbf{u}_1, \ldots, \mathbf{u}_K\}$ denote the filtered proxy representations extracted from $K$ independently prompted responses. The desired multi-query attribution rule is the MAP estimator

$$\hat{y} = \arg\max_{c \in \mathcal{C}} P(c \mid \mathcal{U}).$$

By Bayes' rule,

$$P(c \mid \mathcal{U}) \propto P(c)\, p(\mathcal{U} \mid c).$$

Assuming that prompts are sampled independently and that the filtered representations are conditionally independent given the source model, we obtain

$$p(\mathcal{U} \mid c) = \prod_{k=1}^{K} p(\mathbf{u}_k \mid c).$$

With a uniform prior over candidate source models, the MAP estimator becomes

$$\hat{y} = \arg\max_{c \in \mathcal{C}} \sum_{k=1}^{K} \log p(\mathbf{u}_k \mid c).$$

Directly estimating $p(\mathbf{u} \mid c)$ in the proxy hidden-state space is statistically unattractive because the representation dimension is high and the number of source models is large. READER instead trains a discriminative probe $q_\theta(c \mid \mathbf{u})$ on single-response examples. Under a uniform class prior, Bayes' rule gives

$$\log p(\mathbf{u}_k \mid c) = \log q_\theta(c \mid \mathbf{u}_k) + \log p(\mathbf{u}_k) + \mathrm{const},$$

where $\log p(\mathbf{u}_k)$ and the prior-dependent constant are independent of $c$. Therefore, replacing the class-dependent likelihood term with the probe log posterior yields the decision rule used in the main paper:

$$\hat{y} = \arg\max_{c \in \mathcal{C}} \frac{1}{K} \sum_{k=1}^{K} \log q_\theta(c \mid \mathbf{u}_k).$$

This should be interpreted as a discriminative product-of-experts approximation: each response contributes one piece of posterior evidence, and independent prompts allow weak but repeatable single-response evidence to accumulate. The normalization by $K$ leaves the MAP prediction unchanged while keeping score magnitudes comparable across different query budgets.

For confidence reporting, READER applies the scalar calibration step in Eq. [2](https://arxiv.org/html/2606.10794#S3.E2). The fitted $\alpha > 0$ rescales the accumulated evidence before softmax, improving NLL/ECE calibration without changing the predicted class.
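One way to fit the scalar $\alpha$ is to minimise validation NLL of $\mathrm{softmax}(\alpha \mathbf{S})$. The sketch below uses a simple log-spaced grid search; the fitting procedure, the grid, and the function names are assumptions, since this section does not describe how $\alpha$ is optimised. The toy scores are hypothetical.

```python
import math


def log_softmax(scores, alpha):
    """Log of softmax(alpha * scores), computed stably."""
    z = [alpha * s for s in scores]
    m = max(z)
    lse = m + math.log(sum(math.exp(v - m) for v in z))
    return [v - lse for v in z]


def mean_nll(alpha, val_scores, val_labels):
    """Average negative log-likelihood of the true class on validation data."""
    total = sum(log_softmax(S, alpha)[y] for S, y in zip(val_scores, val_labels))
    return -total / len(val_labels)


def fit_alpha(val_scores, val_labels, grid=None):
    """Pick alpha > 0 from a log-spaced grid (0.01 to 100) by validation NLL."""
    grid = grid or [10 ** (e / 10) for e in range(-20, 21)]
    return min(grid, key=lambda a: mean_nll(a, val_scores, val_labels))


# Hypothetical averaged log-evidence vectors S and true labels.
val_scores = [[-0.1, -2.0], [-0.2, -1.5], [-1.0, -0.9]]
val_labels = [0, 0, 0]
alpha = fit_alpha(val_scores, val_labels)
```

Multiplying every entry of $\mathbf{S}$ by the same positive constant preserves their ordering, which is why calibration changes the reported confidence but never the predicted class.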

## Appendix C Experimental Details

This appendix consolidates everything required to reproduce the experiments in the main paper: hardware/software stack, hyperparameter table, the full 50\-target ecosystem, the agent prompt corpus, and full\-ecosystem versions of the main analyses\.

##### Code and data release\.

An anonymized repository containing the READER implementation, experiment scripts, and the Agent500 prompt corpus is available at[https://anonymous\.4open\.science/r/READER/](https://anonymous.4open.science/r/READER/)\. The repository includes the code and data needed to reproduce the main experiments reported in this paper\.

### C.1 Hardware and Software Stack

- Compute. All experiments are executed on a single H200-140G node. Inference of the largest proxy (Qwen3.5-122B-A10B MoE, $\approx 10$B active parameters) fits in bf16 on two 140 GB GPUs.
- Software. Experiments use the `zero` conda environment with Python 3.10.8, PyTorch 2.10.0 with CUDA 12.8, `transformers` 5.5.4, `scikit-learn` 1.7.2, NumPy 2.2.6, and SciPy 1.15.3. We force `HF_HUB_OFFLINE=1` and `TRANSFORMERS_OFFLINE=1` so all checkpoints are resolved from a local Hugging Face cache snapshot.
- Random seeds. Target generation uses temperature $0.7$ and `top_p` $= 0.95$ with seed $42$. All downstream classifiers, $K$-pool aggregators and t-SNE projections are seeded by $42$ for reproducibility; per-class $K$-sample subsampling is repeated $5{,}000$ times per target during evaluation.

### C.2 Hyperparameter Settings

Table [2](https://arxiv.org/html/2606.10794#A3.T2) lists every hyperparameter introduced by READER. Defaults marked with $\star$ are used in the main-text figures; sweeps are reported in Section [4.3](https://arxiv.org/html/2606.10794#S4.SS3) and Section [C.5](https://arxiv.org/html/2606.10794#A3.SS5).

| Symbol | Description | Setting |
| --- | --- | --- |
| $\lvert\mathcal{C}\rvert$ | Size of target ecosystem | $50$ |
| $N_c$ | Per-target query budget | $500$ |
| $N_{\text{prefix}}$ | Prefix length used for hidden-state extraction | $128$ tokens |
| $M$ | # token positions averaged per sequence (Stage 1) | $\{1, 4^{\star}, 8, 16\}$ |
| $K$ | # prompts per query session (Stage 2) | $\{1, 5, 10, 20, 50^{\star}, 100\}$ |
| $\ell$ | Proxy layer used for feature extraction | best layer (probe-selected) |
| **Probe / aggregator** | | |
| Probe family | Multinomial logistic regression | $L_2$ regularisation, $C_{\mathrm{LR}}{=}1.0$ |
| Standardiser | Per-feature `StandardScaler` fit on train fold | - |
| Aggregator (default) | Bayesian Evidence Accumulator (`logposterior`) | - |
| Aggregator (ablations) | Mean-pool + LR; Gaussian discriminant | - |
| Optimiser | `lbfgs` | `max_iter` $= 2000$ |
| Cross-validation | Stratified K-fold (probe sweep) | $5$ folds |
| **Generation (target side)** | | |
| Decoding | Sampling | temperature $0.7$, top-$p$ $0.95$ |
| Max new tokens | Per-response cap | $512$ |
| Chat template | Native HF `apply_chat_template` | per-target |
| **Mask robustness** | | |
| Redaction unit | Whitespace-tokenised words | - |
| Redaction token | `[REDACTED]` | - |
| Redaction ratio $R$ | % of words randomly replaced | $\{10, 20, 30, 40, 50\}\%$ |

Table 2: All hyperparameters introduced by READER. Stars mark the defaults used in the main paper. Here $C_{\mathrm{LR}}$ denotes the inverse strength of the $L_2$ regularisation penalty in the multinomial logistic-regression probe.

### C.3 The 50-Target Ecosystem

Table [3](https://arxiv.org/html/2606.10794#A3.T3) enumerates every target LLM evaluated in the main paper. The ecosystem is intentionally heterogeneous: it covers $9$ families (Qwen-2.5/3/3.5/3.6/1.5, Llama-3/3.1/3.2/4, Mistral-v0.3/Nemo/Mixtral, Gemma-3/4, DeepSeek-R1-Distill / V2, Phi-3, Hunyuan, GPT-OSS, GLM-4.5, Seed-OSS, ERNIE-4.5, LFM-2), parameter scales from $1.7$B to $122$B, dense and Mixture-of-Experts architectures, base/instruct/thinking/coder variants, and two reasoning-distilled families. This breadth is what makes the 50-way attribution problem genuinely challenging: many targets share the same base weights and differ only by post-training recipe.

Table 3: The 50 target LLMs forming our dynamic provenance ecosystem.

### C.4 Agent500 Prompt Corpus

The dynamic prompt distribution $\mathcal{P}$ is realised by Agent500, an in-house corpus of 500 agent-style probes that simulate realistic black-box API traffic. Each probe is a self-contained natural-language request expressing one of: a software-engineering task (debugging, refactoring, testing, deployment), a tool-use plan (kubernetes, git, npm, postgres, monitoring), a code-search request, an architectural-decision question, a coding-style or PR description draft, an open-ended troubleshooting dialogue, or a meta-cognitive request (“ask me clarifying questions”). The corpus was authored to be out-of-distribution relative to standard pretraining benchmarks, so that the resulting target responses are dominated by unpredictable task-specific semantics, precisely the regime in which static-input distribution-matching baselines fail.

For every target $c$, we sample $N_c{=}500$ responses, one per probe, with identical chat templating and decoding parameters (Section [C.2](https://arxiv.org/html/2606.10794#A3.SS2)). Table [4](https://arxiv.org/html/2606.10794#A3.T4) lists the first 50 probes verbatim; the remaining $450$ follow the same stylistic distribution.

Table 4: First $50$ of the $500$ agent-style probes used as the dynamic prompt distribution. The full corpus is included in the anonymized code and data release.

### C.5 Full-Ecosystem Figures (All Nine Proxies)

Most main-text figures render the four representative proxies (Llama-3.1-8B, Qwen3-8B, Qwen3.5-9B, Qwen3-32B) to keep panels readable. This section reports the same analyses computed over the full nine-proxy palette (the four above plus Qwen3.5-27B, Qwen3.6-27B, Qwen3.5-35B-A3B, Qwen3.6-35B-A3B, Qwen3.5-122B-A10B). For the starred Qwen3.5-9B⋆ and Qwen3-32B⋆ proxies, only publicly known parameter counts are available; all other entries use the counts disclosed in their technical reports.

![Refer to caption](https://arxiv.org/html/2606.10794v1/x10.png)

Figure 10: Full-ecosystem cross-$K$ baseline comparison. Same axes as Fig. [3](https://arxiv.org/html/2606.10794#S4.F3), but with all nine proxies. Larger Qwen-3.5/3.6 dense and MoE proxies further widen the margin over the LLM-DNA sentence-encoder baselines.

![Refer to caption](https://arxiv.org/html/2606.10794v1/x11.png)

Figure 11: Per-proxy layer accuracy curves ($3 \times 3$ grid). Solid line: top-1 accuracy; dotted: macro-F1; gold star: best layer chosen by READER.

![Refer to caption](https://arxiv.org/html/2606.10794v1/x12.png)

Figure 12: Full $M \times K$ accuracy heatmap, one panel per proxy. The $M{=}4$ saturation observed in the main text holds for every proxy, including the largest 122B-A10B MoE.

![Refer to caption](https://arxiv.org/html/2606.10794v1/x13.png)

Figure 13: Full nine-proxy confusion matrices at $K{=}10$, $M{=}4$. Red blocks delineate model families; near-block off-diagonal mass corresponds to within-family siblings.

![Refer to caption](https://arxiv.org/html/2606.10794v1/x14.png)

Figure 14: Full nine-proxy confusion matrices at $K{=}50$, $M{=}4$. The diagonal sharpens further; family blocks are almost completely resolved on Qwen-3.5 and Qwen-3.6 proxies.

![Refer to caption](https://arxiv.org/html/2606.10794v1/x15.png)

Figure 15: Full nine-proxy t-SNE projections at $K{=}10$. Family-level clusters and per-target tight clusters are visible across all proxies.

![Refer to caption](https://arxiv.org/html/2606.10794v1/x16.png)

Figure 16: Sorted per-target macro-F1 at $K{=}10$. Each row is one proxy; bars sorted descending by F1. Coloured by family. Targets that consistently fall in the bottom decile are typically same-family base/instruct/thinking variants of a single backbone (e.g., Qwen3.5-4B vs Qwen3.5-4B-Base).

![Refer to caption](https://arxiv.org/html/2606.10794v1/x17.png)

Figure 17: Per-class F1 heatmap (proxies $\times$ targets, family-grouped) at $M{=}4$, $K{=}10$. Cell colour is per-class F1; vertical bands reveal target families that remain hard before the larger multi-query budget is available.

![Refer to caption](https://arxiv.org/html/2606.10794v1/x18.png)

Figure 18: Per-class F1 heatmap (proxies $\times$ targets, family-grouped) at $M{=}4$, $K{=}50$. Cell colour is per-class F1; vertical bands indicate within-family attribution difficulty.

![Refer to caption](https://arxiv.org/html/2606.10794v1/x19.png)

Figure 19: Aggregator ablation grid. Comparison of three aggregators, mean-pool + LR (`meanpool_lr`), Bayesian Evidence Accumulator (`logposterior`), and class-conditional Gaussian discriminant (`gaussian_disc`), across all proxies and $K$ values. The log-posterior aggregator is uniformly competitive with or strictly better than mean-pool and matches the more expressive Gaussian discriminant within $1$–$2$ accuracy points; we therefore adopt it as the default.

![Refer to caption](https://arxiv.org/html/2606.10794v1/x20.png)

Figure 20: Reliability diagrams for the log-posterior aggregator, one panel per proxy. The raw averaged log-evidence scores are useful for MAP ranking; calibrated confidence is obtained by fitting the scalar evidence temperature described in Sec. [3.3](https://arxiv.org/html/2606.10794#S3.SS3).

![Refer to caption](https://arxiv.org/html/2606.10794v1/x21.png)

Figure 21: Optimal-$(M, K)$ contour: best-achievable accuracy in the $M \times K$ plane, averaged over the four main proxies. The contour line marks the “saturation frontier”; READER reaches $\geq 80\%$ accuracy with as little as $(M{=}4, K{=}50)$ on Qwen-3.5/3.6 proxies.

![Refer to caption](https://arxiv.org/html/2606.10794v1/x22.png)

Figure 22: Mask robustness, per ratio breakdown. Six panels for $R \in \{0, 10, 20, 30, 40, 50\}\%$, each plotting accuracy versus $K$. Solid lines: proxies; dashed: sentence-encoder baselines.

### C.6 Detailed Numerical Tables

This section reports the raw numbers underlying every full-ecosystem figure, so that readers reproducing READER can compare against exact values rather than reading off curves. All numbers below use the paper's default configuration: Stage 1 mean-pooling over $M{=}4$ token positions and Stage 2 Bayesian log-posterior accumulation over $K \in \{1, 50\}$ trajectories. Standard errors on accuracy are the binomial estimate $\sqrt{p(1-p)/N}$, where $N$ is the number of decision-time fingerprints ($25{,}000$ at $K{=}1$, $500$ at $K{=}50$). Pair-AUC is averaged over independent binary probes for all target pairs; its standard deviation is reported by the evaluator. All values are rounded to three decimals.
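As a quick worked instance of the binomial standard error, the snippet below evaluates $\sqrt{p(1-p)/N}$ for the two evaluation sizes. The accuracies plugged in are the upper ends of the ranges quoted in the abstract ($42.4\%$ from one response, $84.0\%$ from 50), used here purely for illustration; the function name is an assumption.

```python
import math


def binomial_se(p, n):
    """Binomial standard error of an accuracy p estimated from n trials."""
    return math.sqrt(p * (1.0 - p) / n)


se_k1 = binomial_se(0.424, 25_000)  # K = 1: 50 targets x 500 responses
se_k50 = binomial_se(0.840, 500)    # K = 50: 25,000 / 50 grouped decisions
```

The two values come out at roughly $0.003$ and $0.016$, so the $K{=}50$ accuracies carry about five times the sampling uncertainty of the $K{=}1$ accuracies even though the point estimates are much higher.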

Table 5: Full-ecosystem per-system detailed results (default configuration, all nine proxies). For each system we report Acc, macro-F1, mean Pair-AUC, and mAP@10 at $K{=}1$ and $K{=}50$. Acc/F1 are produced by READER's 50-way log-posterior aggregator. Pair-AUC and mAP@10 are diagnostic metrics computed on the corresponding $K$-grouped fingerprint space; mAP@10 uses cosine retrieval and Pair-AUC uses independent binary probes for each target pair.

Table 6: Aggregator comparison (top-1 accuracy). We compare two cross-query aggregation rules on the same per-response features: mean-pool + mean-pool (MP) averages the $K$ response fingerprints before classification, while mean-pool + log-posterior (LP) applies READER's Bayesian Evidence Accumulation. At $K{=}1$ the two rules reduce to the same single-response evaluation; differences appear only when multiple responses are aggregated.

Table 7: Mask-robustness numerical detail at $K{=}50$, $M{=}4$ under the canonical mean-pool-intra + log-posterior pipeline. We report redaction ratios $R \in \{0, 10, 20, 30, 40, 50\}\%$.

## Appendix D Supplementary Ablations and Diagnostics

### D.1 Input Form of the Proxy Model

READER uses a frozen proxy LLM to read the target response and extract hidden states\. A natural implementation choice is whether the proxy should read only the generated response, or the concatenation of the user prompt and the response\. The latter gives the proxy explicit access to the task condition, but it can also increase semantic dominance: prompt content may become easier to encode than the subtler model\-specific generation trace\.

We compare two proxy input forms under the same canonical evaluator used in the main paper: Stage 1 mean-pool over intra-response positions and Stage 2 log-posterior accumulation across $K$ responses. The response-only setting feeds the generated response to the proxy. The user+response setting feeds

$$\texttt{Prompt: }\{p\}\quad\texttt{Response: }\{x\},$$

while still extracting response-side hidden states. Table [8](https://arxiv.org/html/2606.10794#A4.T8) reports top-1 accuracy for the four main proxy readers at $M{=}4$.

Table 8: Proxy input-form ablation under the canonical mean-pool-intra + log-posterior pipeline. “Resp.” is the response-only default used in the main paper; “User+Resp.” additionally provides the user prompt to the proxy reader. The prompt is not uniformly helpful, so we keep response-only as the default.

The ablation shows that prompt access is not a prerequisite for attribution. For Qwen3-8B, adding the prompt improves accuracy, suggesting that this proxy can use prompt-response alignment as additional evidence. However, the effect does not generalize across readers: Llama-3.1-8B drops at $K{=}50$, Qwen3-32B drops across all three budgets, and Qwen3.5-9B remains effectively tied at $K{=}50$. This pattern supports the conservative response-only design in the main paper. It avoids relying on prompt availability, reduces semantic shortcuts through the task description, and still exposes strong authorship evidence once Bayesian Evidence Accumulation aggregates multiple responses.

### D.2 Stage 1 Aggregator: Mean-Pool vs. Learnable Attention-Pool

The intra-sequence stage of READER (Sec. [3.2](https://arxiv.org/html/2606.10794#S3.SS2)) replaces the $M$ sampled hidden states $\{\mathbf{h}_{t_m}^{(c,p)}\}_{m=1}^{M}$ with their arithmetic mean $\mathbf{u}^{(c,p)} = \tfrac{1}{M}\sum_{m=1}^{M}\mathbf{h}_{t_m}^{(c,p)}$, and the cross-$K$ stage (Sec. [3.3](https://arxiv.org/html/2606.10794#S3.SS3)) accumulates per-prompt evidence scores over $K$ probes, $S_c = \frac{1}{K}\sum_{k=1}^{K}\log q_{\theta}(c \mid \mathbf{u}^{(c,p_k)})$, and predicts $\hat{y} = \arg\max_c S_c$. A natural alternative for Stage 1 is to make the intra-$M$ pooling weights data-dependent: if certain response-internal positions (e.g. formatting boundaries, code-block markers, or system-prompt-leaking tokens) carried disproportionate authorship signal, a learnable attention head should sharpen the centroid without otherwise touching the Stage-2 evidence aggregator. This appendix tests that hypothesis empirically under the READER default cross-$K$ aggregator and reports a uniformly negative result: across the Agent500 main set ($C{=}50$, $P{=}500$) under three independent proxies (Qwen3-8B, Qwen3.5-9B, Llama-3.1-8B), a single-head linear attention pool fails to beat the parameter-free mean-pool in every non-degenerate cell.

##### Architecture\.

Following the minimal recipe of \[[11](https://arxiv.org/html/2606.10794#bib.bib1)\], we replace the uniform $1/M$ weighting by a softmax over a single linear scoring head $\mathbf{w}_{\text{attn}} \in \mathbb{R}^{d}$:

$$\alpha_m^{(c,p)} = \frac{\exp\big(\mathbf{w}_{\text{attn}}^{\top}\,\mathbf{h}_{t_m}^{(c,p)}\big)}{\sum_{m'=1}^{M}\exp\big(\mathbf{w}_{\text{attn}}^{\top}\,\mathbf{h}_{t_{m'}}^{(c,p)}\big)},\qquad \tilde{\mathbf{u}}^{(c,p)} = \sum_{m=1}^{M}\alpha_m^{(c,p)}\,\mathbf{h}_{t_m}^{(c,p)}. \tag{4}$$

The pooled vector $\tilde{\mathbf{u}}^{(c,p)}$ is consumed by the same linear authorship probe $\mathbf{W} \in \mathbb{R}^{d \times C}$ as in Sec. [3.4](https://arxiv.org/html/2606.10794#S3.SS4); $\mathbf{w}_{\text{attn}}$ and $\mathbf{W}$ are jointly optimised under the cross-entropy objective of Eq. [3](https://arxiv.org/html/2606.10794#S3.E3). At inference time both pools feed the same Stage-2 evidence accumulator (Algorithm [1](https://arxiv.org/html/2606.10794#alg1)): only the intra-$M$ map is changed. Compared with $K{=}1$ mean-pool ($\mathbf{u}^{(c,p)}$ fixed), this adds only $d$ trainable scalars on top of the probe, so any gain (or collapse) is attributable to the learnable re-weighting itself rather than to head capacity or to the cross-prompt aggregator.
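The forward pass of Eq. 4 can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the paper's training code: the function names, the toy shapes ($M{=}4$, $d{=}8$), and the random inputs are all hypothetical. It also makes two limiting cases concrete: a zero scoring head recovers the uniform mean-pool, and a single sampled position forces $\alpha_1 = 1$.

```python
import numpy as np


def mean_pool(h: np.ndarray) -> np.ndarray:
    """Uniform pooling over positions: h has shape (M, d), result has shape (d,)."""
    return h.mean(axis=0)


def attn_pool(h: np.ndarray, w_attn: np.ndarray):
    """Single-head linear attention pooling (forward pass of Eq. 4).

    h: (M, d) sampled hidden states; w_attn: (d,) scoring head.
    Returns the pooled vector (d,) and the attention weights alpha (M,).
    """
    scores = h @ w_attn              # one scalar score per position, shape (M,)
    scores = scores - scores.max()   # shift for a numerically stable softmax
    alpha = np.exp(scores)
    alpha = alpha / alpha.sum()      # softmax over the M positions
    return alpha @ h, alpha


# Toy inputs (hypothetical sizes, random values).
rng = np.random.default_rng(0)
M, d = 4, 8
h = rng.normal(size=(M, d))
w = rng.normal(size=d)

pooled, alpha = attn_pool(h, w)
pooled_zero, alpha_zero = attn_pool(h, np.zeros(d))   # zero head -> uniform weights
pooled_single, alpha_single = attn_pool(h[:1], w)     # M = 1 -> alpha_1 = 1

print("alpha sums to", alpha.sum())
print("zero head equals mean-pool:", np.allclose(pooled_zero, mean_pool(h)))
print("M=1 returns the single state:", np.allclose(pooled_single, h[0]))
```

In this sketch the uniform solution is always reachable by the attention head (set the scoring vector to zero), which is why any accuracy loss relative to mean-pool points at the optimisation rather than at expressiveness.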

##### Protocol\.

We re-use the proxy/feature pipeline of Sec. [4](https://arxiv.org/html/2606.10794#S4): proxy $\phi{=}\texttt{Qwen3-8B}$ at $\ell^{\star}{=}23$ on Agent500, with $M_{\max}{=}16$ uniformly-spaced response-internal positions and the same StandardScaler fitted on the training-fold mean-pool features (applied per position to keep both pooling heads in a comparable basis). We use 5-fold prompt-level cross-validation, so test prompts are disjoint from those used to fit $\mathbf{W}$ (and $\mathbf{w}_{\text{attn}}$). For attn-pool, the joint $(\mathbf{w}_{\text{attn}}, \mathbf{W})$ pair is trained with AdamW (lr $10^{-3}$, weight decay $10^{-4}$, batch size 256, 60 epochs, cross-entropy loss) on the GPU; for mean-pool, $\mathbf{W}$ is fitted with multinomial LR-LBFGS as in the main paper. Test-time cross-prompt aggregation at $K$ uses the READER evidence accumulator (Sec. [3.3](https://arxiv.org/html/2606.10794#S3.SS3)): per-prompt softmax probabilities are converted to log-probabilities, summed within each class, and then arg-maxed. The same RNG seed pattern is reused across both pools so the $K$-grouping is identical.
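The test-time aggregation step described above (per-prompt log-probabilities summed within each class, then arg-maxed) can be sketched as follows. This is a minimal illustration, not the released evaluator: the function names are hypothetical, and the probe is a toy identity matrix with three classes so that the arithmetic can be checked by hand.

```python
import numpy as np


def log_softmax(logits: np.ndarray) -> np.ndarray:
    """Row-wise log-softmax, computed stably."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def accumulate_evidence(u: np.ndarray, W: np.ndarray, b: np.ndarray):
    """Sum per-response log-posteriors over K responses and take the arg-max.

    u: (K, d) pooled features, W: (d, C) linear probe, b: (C,) bias.
    Returns the per-class evidence scores (C,) and the predicted class index.
    """
    log_post = log_softmax(u @ W + b)   # (K, C) log q(c | u_k)
    scores = log_post.sum(axis=0)       # accumulate evidence across responses
    return scores, int(np.argmax(scores))


# Toy setting (assumption): identity probe, 3 classes, K = 4 responses.
W = np.eye(3)
b = np.zeros(3)
u = np.array([
    [0.2, 1.0, 0.1],   # favours class 1
    [0.9, 0.8, 0.0],   # narrowly favours class 0
    [0.1, 0.7, 0.3],   # favours class 1
    [0.0, 0.6, 0.5],   # favours class 1
])

scores, pred = accumulate_evidence(u, W, b)
per_response = log_softmax(u @ W + b).argmax(axis=1)
print("per-response votes:", per_response.tolist())
print("accumulated prediction:", pred)
```

Dividing the summed score by $K$, as in the $S_c$ definition of Sec. 3.3, rescales every class equally and therefore leaves the arg-max unchanged.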

##### Cross-$K$ aggregator parity in the high-confidence regime.

The probe heads converge to near-zero training loss under both pools (Tab. [11](https://arxiv.org/html/2606.10794#A4.T11)), so the per-prompt posteriors $P(c \mid \mathbf{u}^{(c,p)})$ are sharply peaked. In this regime the log-evidence MAP and the feature-mean argmax agree on every cell of every proxy we tested: $\big|\,\text{acc}_{\text{LP}} - \text{acc}_{\text{feat-mean}}\,\big| < 10^{-9}$ across the tested $(M,K)$ cells. This is the standard observation that for a confident classifier, the arg-max of $\sum_{k}\log q_{\theta}(c \mid \mathbf{u}^{(c,p_k)})$ is determined by the plurality of per-prompt argmaxes, which in turn equals the argmax of the $K$-averaged feature plus a higher-order correction that vanishes when the LR head is locally affine. The attn-pool failure reported below is therefore not an artefact of the cross-$K$ aggregator: it persists identically under the legacy feature-mean aggregator and the READER default evidence aggregator.

##### Grids\.

$M \in \{1,4,8,16\}$, $K \in \{1,5,10,20,50\}$ on Agent500; each cell is reported as mean $\pm$ 1 standard deviation across the 5 folds. All accuracies and F1 scores below are computed under the BEA cross-$K$ accumulator.

#### D.2.1 Main result: Agent500 (Qwen3-8B, $\ell^{\star}{=}23$, 50 classes)

Tab. [9](https://arxiv.org/html/2606.10794#A4.T9) reports per-cell test accuracy for both pooling heads and the attn-minus-mean delta $\Delta_{\text{acc}}$. Macro-F1 follows the same qualitative pattern but degrades $\sim$10–20% faster than accuracy, indicating the failure is concentrated on per-class collapse rather than uniform calibration drift; representative numbers are in Tab. [10](https://arxiv.org/html/2606.10794#A4.T10). Fig. [23](https://arxiv.org/html/2606.10794#A4.F23) visualises the same data as accuracy curves and per-cell deltas.

![Refer to caption](https://arxiv.org/html/2606.10794v1/x23.png)

![Refer to caption](https://arxiv.org/html/2606.10794v1/x24.png)

Figure 23: Agent500, Qwen3-8B, L23. Left: test accuracy of mean-pool vs. attn-pool across the $(M,K)$ grid under BEA cross-$K$. Right: $\Delta_{\text{acc}}$ curves; every non-degenerate cell is negative.

Table 9: Test accuracy (mean $\pm$ 1$\sigma$ across 5 folds) on Agent500 Qwen3-8B L23, under BEA cross-$K$ aggregation. Mean-pool is the default in the main paper; attn-pool follows Eq. [4](https://arxiv.org/html/2606.10794#A4.E4). $\Delta_{\text{acc}} = \text{attn} - \text{mean}$. Bold cells mark $\Delta_{\text{acc}} > 0.005$.

Table 10: Macro-F1 (mean $\pm$ 1$\sigma$) corresponding to Tab. [9](https://arxiv.org/html/2606.10794#A4.T9) (BEA cross-$K$). $\Delta_{\text{F1}} = \text{attn} - \text{mean}$. Boundary cells $K \in \{1, 50\}$ shown to bracket the operating regime.

Three structural patterns emerge:

1. $M{=}1$ is structurally degenerate. With a single sampled position, Eq. [4](https://arxiv.org/html/2606.10794#A4.E4) reduces to $\alpha_1 \equiv 1$, so the attention head can only re-weight in a trivial way. The observed $\Delta_{\text{acc}} \in [-0.015, -0.004]$ at $M{=}1$ is residual optimisation noise from joint AdamW vs. closed-form LR; the gap disappears at the highest $K$. The Stage-2 evidence aggregator does not rescue the joint-trained head here either, since the only learnable quantity at $M{=}1$ is a constant scalar absorbed by the LR head.
2. The $M{=}4$ regime collapses outright. Across all five $K$-values, attn-pool loses 14–35 percentage points of test accuracy. The training cross-entropy stalls at $\approx 2.28$ (Tab. [11](https://arxiv.org/html/2606.10794#A4.T11)), close to $\ln(C/5) \approx 2.30$: in other words, the joint head fails to converge in 60 epochs, and the softmax over $M{=}4$ positions never finds a sparse weighting that outperforms the uniform one. F1 degrades $\sim$30 pp more than accuracy at this $M$, signalling that the errors are concentrated in systematic confusions rather than near-boundary fluctuations.
3. Larger $M$ partly mitigates but never reverses the loss. At $M{=}8$ and $M{=}16$ the attn-pool head does converge ($\mathcal{L} \approx 1.96$ and $1.09$ respectively), but its test accuracy is uniformly below the uniform-pool accuracy by up to 23.8 pp. Crucially, the fold-to-fold standard deviation grows by 3–7$\times$ (e.g. $M{=}8$, $K{=}50$: 0.017 mean-pool vs. 0.065 attn-pool). The attn-pool solution is also less stable: different train/val splits land on different local minima of $\mathbf{w}_{\text{attn}}$, and the evidence accumulator preserves this inter-fold variance because each fold trains a fresh head.

Table 11: Final training cross-entropy (averaged over 5 folds) for the attn-pool head on Agent500. Reference: $\ln(50) \approx 3.91$ is the chance loss; $\ln(50/5) \approx 2.30$ is the loss of a head that effectively groups classes into $\sim 5$-class meta-clusters.
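
The reference losses in the caption above are plain logarithms and can be checked directly. The sketch below also converts the three training losses quoted in the text (2.28, 1.96, 1.09) into $\exp(\mathcal{L})$, the number of equally likely classes that would produce the same cross-entropy. That conversion is a standard perplexity-style reading added here as an interpretive aid, not a quantity reported in the paper, and the variable names are hypothetical.

```python
import math

C = 50  # number of target classes on Agent500

chance_loss = math.log(C)            # uniform guess over all 50 classes
meta_cluster_loss = math.log(C / 5)  # uniform guess over 10 remaining candidates

# Training cross-entropies quoted in the text for M = 4, 8, 16.
quoted_losses = {4: 2.28, 8: 1.96, 16: 1.09}

# exp(loss) = size of a uniform distribution with the same cross-entropy.
effective_classes = {m: math.exp(loss) for m, loss in quoted_losses.items()}

print(f"ln(50)   = {chance_loss:.3f}")
print(f"ln(50/5) = {meta_cluster_loss:.3f}")
for m, n_eff in effective_classes.items():
    print(f"M={m:<2d}: loss {quoted_losses[m]:.2f} ~ {n_eff:.1f} equally likely classes")
```

Read this way, the $M{=}4$ head ends training roughly as uncertain as a uniform guess over ten classes, which is why its loss sits so close to the $\ln(50/5)$ reference.
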
#### D.2.2 Cross-proxy robustness: Qwen3.5-9B and Llama-3.1-8B on Agent500

A reviewer-facing concern with Sec. [D.2.1](https://arxiv.org/html/2606.10794#A4.SS2.SSS1) is that Qwen3-8B happens to be the proxy with the most degenerate authorship/semantic geometry: its intra-view leading principal angle is $\theta_1 \approx 0.05^{\circ}$ (Tab. [14](https://arxiv.org/html/2606.10794#A4.T14)), so the $M{=}16$ positions are nearly co-linear in the relevant subspaces and a data-driven re-weighting may have little to learn by construction. Under the two other proxies of the main paper the leading angle is materially larger: $\theta_1 = 53.1^{\circ}$ for Qwen3.5-9B (intra, L19) and $\theta_1 = 43.6^{\circ}$ for Llama-3.1-8B (intra, L31). These proxies should therefore be the ones most hospitable to a learnable attention head if any signal exists. We re-ran the full $M \times K$ sweep under both with the protocol of Sec. [D.2.1](https://arxiv.org/html/2606.10794#A4.SS2.SSS1) unchanged (BEA cross-$K$ throughout). Tab. [12](https://arxiv.org/html/2606.10794#A4.T12) reports the per-cell $\Delta_{\text{acc}}$ for all three proxies, and Fig. [24](https://arxiv.org/html/2606.10794#A4.F24) shows the corresponding accuracy and $\Delta$ curves for the two new proxies.

![Refer to caption](https://arxiv.org/html/2606.10794v1/x25.png)

![Refer to caption](https://arxiv.org/html/2606.10794v1/x26.png)

![Refer to caption](https://arxiv.org/html/2606.10794v1/x27.png)

![Refer to caption](https://arxiv.org/html/2606.10794v1/x28.png)

Figure 24: Per-proxy accuracy and $\Delta_{\text{acc}}$ on Agent500 under BEA cross-$K$. Top: Qwen3.5-9B L19. Bottom: Llama-3.1-8B L31. The qualitative pattern is identical to Fig. [23](https://arxiv.org/html/2606.10794#A4.F23) despite the principal-angle geometry differing by $40$–$53^{\circ}$.

Table 12: Per-cell $\Delta_{\text{acc}} = \text{attn-pool} - \text{mean-pool}$ on Agent500 under BEA cross-$K$, three proxies at their best probe layer. Bold cells mark $\Delta_{\text{acc}} > 0.005$. The qualitative pattern is identical across proxies despite the principal-angle geometry differing by orders of magnitude: $M{=}1$ ties (or marginally wins by joint-training noise), $M \in \{4,8,16\}$ uniformly loses by 4–35 pp.

Three observations make the cross-proxy result stronger than the single-proxy claim:

1. The negative result is independent of subspace entanglement and of the cross-$K$ aggregator. Under the principal-angle reading, the Qwen3-8B intra-view is degenerate ($\theta_1 \approx 0^{\circ}$) and should be the worst case for a learnable position weighting; the two other proxies have $\theta_1 \in [43.6^{\circ}, 53.1^{\circ}]$ and should be the best cases. Empirically all three behave the same: 45/45 non-degenerate cells across the three proxies are $\Delta_{\text{acc}} \leq -0.054$. Furthermore, swapping the cross-$K$ aggregator from feature-mean to BEA leaves every per-cell $\Delta_{\text{acc}}$ unchanged to floating-point precision (the high-confidence-classifier parity discussed above), so the bottleneck of attn-pool is neither the proxy’s authorship/semantic alignment nor the cross-prompt aggregator, but the joint-training optimisation surface discussed in Sec. [D.2.3](https://arxiv.org/html/2606.10794#A4.SS2.SSS3).
2. The only positive cell confirms $M{=}1$ is degenerate, not useful. The only $\Delta > 0$ cell in the $K \leq 50$ sweep (Tab. [12](https://arxiv.org/html/2606.10794#A4.T12)) sits at $M{=}1$, where Eq. [4](https://arxiv.org/html/2606.10794#A4.E4) reduces to identity and the gain is bounded by $+0.001$. This is the expected $\sigma$-level fluctuation of joint AdamW vs. closed-form LR under finite folds, not evidence of a learnable signal: at $M{=}1$ the only thing $\mathbf{w}_{\text{attn}}$ controls is a constant scalar multiplier on the input, which is fully absorbed by the downstream linear probe.
3. The $M{=}4$ collapse is the deepest under every proxy. The worst $\Delta_{\text{acc}}$ per proxy within the $K \leq 50$ operating regime is $-0.334$ (qwen3-8b), $-0.224$ (qwen3.5-9b), and $-0.250$ (llama-3.1-8b-base), all at $M{=}4$. This pinpoints the optimisation pathology in Sec. [D.2.3](https://arxiv.org/html/2606.10794#A4.SS2.SSS3), item 3, as proxy-independent: a 4-position softmax with $\sim$3 effective degrees of freedom is precisely the regime where joint AdamW is most prone to landing on a sparse non-uniform pattern that the LR head cannot recover from, and that the evidence accumulator cannot rescue downstream.

##### Does increasing $M$ eventually let attn-pool overtake mean-pool?

A natural follow-up reads Tab. [12](https://arxiv.org/html/2606.10794#A4.T12) as evidence that attn-pool is gradually catching up: the worst-case gap shrinks from $-0.334$ at $M{=}4$ to $-0.088$ at $M{=}16$ on Qwen3-8B, and similar contraction is visible on the other two proxies. Fig. [25](https://arxiv.org/html/2606.10794#A4.F25) plots $\Delta_{\text{acc}}$ as a function of $M$ on a log axis: under all three proxies and all $K \in \{1,5,10,20,50\}$, the curves are monotone in $M$ but converge from below to $\Delta = 0$. Three lines of evidence support this asymptotic interpretation:

1. Empirical trajectory. The $M{=}4{\to}16$ contraction is exactly the shape predicted by “$M$ large $\Rightarrow$ uniform $\alpha_m \equiv 1/M$ approaches the optimum $\Rightarrow$ optimiser converges back to mean-pool faster”. Linear extrapolation gives $\Delta(M{=}32) \sim -2$ to $-5$ pp and $\Delta(M{=}\infty) \to 0^{-}$, never positive.
2. Hidden-state redundancy at the chosen layer. At layers $\ell^{\star} \in \{19, 23, 31\}$ each response-internal token has already integrated the full prefix via self-attention; cross-position information is strongly overlapping (cf. Sec. [D.3.2](https://arxiv.org/html/2606.10794#A4.SS3.SSS2), where $\theta_8 \geq 85.2^{\circ}$ for every proxy/view). There is no large “information-poor token” mass in the $M$-window for an attention head to suppress, unlike in classical attention pooling on raw tokens.
3. Fisher-ratio profile under $M$. Tab. [13](https://arxiv.org/html/2606.10794#A4.T13) shows the per-position authorship signal is either flat in $M$ (Qwen3-8B) or decreasing ($R$ drops $6\times$–$22\times$ between $M{=}1$ and $M{=}16$ on the other two proxies), so the upper bound for any data-driven re-weighting is at best the $M{=}1$ Fisher ratio and the expected return from “find the few high-$R$ positions” is small.

![Refer to caption](https://arxiv.org/html/2606.10794v1/x29.png)

Figure 25: $\Delta_{\text{acc}} = \text{attn-pool} - \text{mean-pool}$ as a function of intra-$M$ on a log axis (BEA cross-$K$), three proxies, and five $K$ values. Reference line: $\Delta = 0$. Under every proxy and every $K$, $\Delta_{\text{acc}}$ is monotone non-decreasing in $M$ and converges from below to zero: attn-pool asymptotes to mean-pool at large $M$ and never overtakes it.

We did not extend the $M$-grid further because the marginal information return is bounded by the contraction rate visible in Fig. [25](https://arxiv.org/html/2606.10794#A4.F25), while the joint-training cost grows linearly in $M$. The reasonable engineering conclusion is the same one we already adopt in the main paper: keep mean-pool as the production intra-sequence aggregator, paired with the BEA cross-$K$ accumulator.

#### D.2.3 Why the parameter-free mean-pool wins

The empirical pattern of Sec. [D.2.1](https://arxiv.org/html/2606.10794#A4.SS2.SSS1) aligns directly with the additive-superposition model of Sec. [3.1](https://arxiv.org/html/2606.10794#S3.SS1) and the Stage-1/Stage-2 separation of Algorithm [1](https://arxiv.org/html/2606.10794#alg1).

1. Proxy hidden states are already heavily contextualised. At $\ell^{\star}{=}23$ of Qwen3-8B (60% depth), each of the $M_{\max}{=}16$ response-internal tokens has, via self-attention inside the proxy, integrated information from the full 128-token suffix. Following the principal-angle analysis of Tab. [14](https://arxiv.org/html/2606.10794#A4.T14), the leading semantic direction is partially shared across positions while the authorship-bearing directions sit in the subordinate, near-orthogonal regime. The cross-position variance is therefore small relative to the inter-class authorship variance, so a uniform mean preserves more signal than any data-driven sparse re-weighting the optimiser can find from $\sim 20\,\text{k}$ training examples. This is the temporal-low-pass interpretation of Sec. [3.2](https://arxiv.org/html/2606.10794#S3.SS2): the optimal $\alpha$ is essentially uniform on the relevant subspace, so a learnable head can only deviate downwards.
2. Joint training distorts the StandardScaler basis. The mean-pool baseline is trained as $\mathbf{W} \cdot \mathrm{StandardScaler}(\mathbf{u}^{(c,p)})$, with closed-form per-feature moments. Joint AdamW on $(\mathbf{w}_{\text{attn}}, \mathbf{W})$ freely shifts the implicit feature mean as $\alpha$ moves, which unscales the LR head’s input distribution; this is consistent with the heavy F1 collapse on a few classes (Tab. [10](https://arxiv.org/html/2606.10794#A4.T10)) being driven by $\mathbf{W}$ losing calibration on classes whose attention pattern drifts off-axis. The Stage-2 evidence aggregator does not undo this: per-prompt posteriors of a miscalibrated head remain miscalibrated, and summing their logarithms reinforces rather than cancels the systematic bias.
3. The $M{=}4$ pathology is an optimisation, not a capacity, failure. With only four positions to attend over and 50 classes, the softmax has $3$ effective degrees of freedom per sample; combined with the small number of joint training steps (60 epochs $\times$ 80 batches/epoch) this leaves the head firmly in the local-minimum regime where it collapses neither to uniform $\alpha_m \equiv 1/M$ (which would recover mean-pool) nor to a useful sparse pattern. Tab. [11](https://arxiv.org/html/2606.10794#A4.T11) confirms the head is still learning at the cut-off, not converged.
4. Mean-pool has no such optimisation surface. $\tfrac{1}{M}\sum_{m}$ is a fixed linear map; $\mathbf{W}$ is trained by convex $L_2$-regularised multinomial logistic regression on its output. There is no joint local minimum to fall into, no fold-to-fold attention drift, and no calibration distortion. This is why mean-pool’s fold standard deviations (Tab. [9](https://arxiv.org/html/2606.10794#A4.T9)) are 3–7$\times$ tighter than attn-pool’s at every non-degenerate cell.

##### Take\-away\.

A learnable position-attention head is the most natural extension of the intra-sequence aggregator in Sec. [3.2](https://arxiv.org/html/2606.10794#S3.SS2); we tested it under matched scaling, regularisation, and CV protocol on Agent500 with three proxies (Qwen3-8B, Qwen3.5-9B, Llama-3.1-8B), and across the full $(M,K)$ grid, with the READER BEA cross-$K$ accumulator (Sec. [3.3](https://arxiv.org/html/2606.10794#S3.SS3)) downstream throughout. The result is uniformly negative: attn-pool either ties (degenerate $M{=}1$, with one fold-noise win among 15 $M{=}1$ cells) or loses across all 45 non-degenerate cells, by up to 35 pp. Combined with the Fisher-ratio analysis of Sec. [D.3.1](https://arxiv.org/html/2606.10794#A4.SS3.SSS1), which shows that $M$-averaging does not amplify between/within-class variance, and the empirical parity between feature-mean and BEA aggregators in the high-confidence regime documented above, this supports a low-pass-filter interpretation of READER’s intra-sequence stage. In this regime, the added $\mathcal{O}(d)$ parameters of $\mathbf{w}_{\text{attn}}$ mainly increase optimization variance relative to the parameter-free mean. We therefore retain the mean as the production intra-sequence aggregator, paired with the BEA cross-$K$ accumulator described in Sec. [3.3](https://arxiv.org/html/2606.10794#S3.SS3).

### D.3 Stage 1 Diagnostic: Temporal Low-Pass Filtering

This section empirically interrogates the role of the intra-sequence expectation introduced in Sec. [3.2](https://arxiv.org/html/2606.10794#S3.SS2). We model $\mathbf{u}^{(c,p)} = \tfrac{1}{M}\sum_{m=1}^{M}\mathbf{h}_{t_m}^{(c,p)}$ as a temporal low-pass filter on the proxy’s hidden states. Two questions follow:

1. Q-LP. Does $M$-averaging by itself amplify the authorship-to-semantic signal-to-noise ratio?
2. Q-Geom. How are the semantic and authorship subspaces geometrically arranged in the proxy at the chosen layer?

##### Setup\.

We cross three proxies $\phi \in \{\texttt{Qwen3-8B}, \texttt{Qwen3.5-9B}, \texttt{Llama-3.1-8B}\}$, each at its best probe layer $\ell^{\star} \in \{23, 19, 31\}$ respectively. The dataset comprises $C{=}50$ target LLMs queried on $P{=}500$ shared agent-domain probe prompts. We extract two feature views at the same layer. The intra-mean view averages $M_{\max}{=}16$ uniformly spaced response-internal hidden states, whereas the last-token view uses the hidden state at the final response token.
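The two feature views can be sketched as follows. This is an illustration only: the text says the positions are uniformly spaced but does not give the exact index rule, so the `np.linspace`-based rule below, the function names, and the toy hidden-state matrix are all assumptions.

```python
import numpy as np


def uniform_positions(num_tokens: int, m: int) -> np.ndarray:
    """Return m uniformly spaced token indices in [0, num_tokens - 1].

    Assumed spacing rule: rounded np.linspace from the first to the last
    response token. For m = 1 this returns the earliest position; if m
    exceeds num_tokens, some indices repeat.
    """
    if num_tokens < 1 or m < 1:
        raise ValueError("num_tokens and m must be positive")
    return np.round(np.linspace(0, num_tokens - 1, num=m)).astype(int)


def intra_mean_view(hidden: np.ndarray, m: int) -> np.ndarray:
    """Average m uniformly spaced hidden states; hidden has shape (T, d)."""
    return hidden[uniform_positions(len(hidden), m)].mean(axis=0)


def last_token_view(hidden: np.ndarray) -> np.ndarray:
    """Hidden state at the final response token."""
    return hidden[-1]


# Toy hidden states: row t equals [t, t], so the views are easy to verify.
T, d = 31, 2
hidden = np.arange(T, dtype=float)[:, None] * np.ones((1, d))

positions = uniform_positions(T, 16)
intra = intra_mean_view(hidden, 16)
last = last_token_view(hidden)
print("positions:", positions.tolist())
print("intra-mean view:", intra, "last-token view:", last)
```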

##### Metric: Fisher ratio\.

Define the per-model centroid $\bm{\mu}^{(c)} = \tfrac{1}{P}\sum_{p}\mathbf{u}^{(c,p)}$, the global centroid $\bm{\mu}_g$, and

$$R \;=\; \frac{\frac{1}{C}\sum_{c}\|\bm{\mu}^{(c)} - \bm{\mu}_g\|_2^2}{\frac{1}{CP}\sum_{c,p}\|\mathbf{u}^{(c,p)} - \bm{\mu}^{(c)}\|_2^2} \;=\; \frac{\mathrm{Var}_{\text{between}}}{\mathrm{Var}_{\text{within}}}.$$

$R \gg 1$ would imply authorship variance dominates after filtering.
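A minimal NumPy sketch of this ratio is given below. It assumes the features are stored as an array of shape $(C, P, d)$ with the same number of prompts per model, and it takes the global centroid to be the mean of the per-model centroids (which equals the grand mean in this balanced setting); both choices, like the function name, are assumptions made for illustration.

```python
import numpy as np


def fisher_ratio(u: np.ndarray) -> float:
    """Between/within variance ratio R for features u of shape (C, P, d)."""
    mu_c = u.mean(axis=1)    # per-model centroids, shape (C, d)
    mu_g = mu_c.mean(axis=0)  # global centroid, shape (d,) (balanced classes)
    var_between = np.mean(np.sum((mu_c - mu_g) ** 2, axis=-1))
    var_within = np.mean(np.sum((u - mu_c[:, None, :]) ** 2, axis=-1))
    return float(var_between / var_within)


# Hand-checkable toy case: C = 2 models, P = 2 prompts, d = 1.
u = np.array([
    [[0.0], [2.0]],   # model 0: centroid 1, within-class spread 1
    [[4.0], [6.0]],   # model 1: centroid 5, within-class spread 1
])

r = fisher_ratio(u)
r_scaled = fisher_ratio(0.25 * u)   # uniform rescaling of every feature
print("R =", r, " R after rescaling =", r_scaled)
```

The rescaling check illustrates a general property of the ratio: if an operation shrinks the between-class and within-class terms by the same factor, $R$ does not move.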

#### D.3.1 Fisher ratio under intra-position averaging

We compute $\mathbf{u}^{(c,p)} = \tfrac{1}{M}\sum_{m=1}^{M}\mathbf{h}_{t_m}^{(c,p)}$ for $M \in \{1,2,4,8,16\}$ and report $R$ in Tab. [13](https://arxiv.org/html/2606.10794#A4.T13).

Table 13: Fisher ratio under intra-sequence filtering, three proxies at their best layers. The Qwen3-8B case is exactly flat under $M$; the other two proxies exhibit strong attenuation as $M$ grows, consistent with $M$-averaging acting as a windowed low-pass filter that smooths early-position author cues rather than amplifying them.

For Qwen3-8B, both $\mathrm{Var}_{\text{between}}$ and $\mathrm{Var}_{\text{within}}$ contract by exactly $1/M^2$, so $R$ is invariant to four significant figures: the $M$ sampled positions are highly co-linear in the authorship/semantic split. For the two other proxies, the $M{=}1$ position (the earliest sampled response-internal token) carries a much larger authorship-coherent component, and averaging it against later positions actively destroys this signal: $R$ drops by $6\times$–$22\times$ between $M{=}1$ and $M{=}16$. In all three cases the conclusion is identical: $M$-averaging is not a Fisher-ratio amplifier.

This empirically refutes the naive interpretation of $M$-aggregation as a semantic Law-of-Large-Numbers operation. Instead, it confirms the alternative mechanism postulated in Sec. [3.2](https://arxiv.org/html/2606.10794#S3.SS2): arithmetic averaging acts as a windowed low-pass filter on per-token decoding noise $\bm{\epsilon}_t$ and local drift $\Delta\mathbf{s}_t^{(p)}$. The practical role of intra-sequence filtering is therefore to produce a numerically stable, magnitude-coherent input $\mathbf{u}^{(c,p)} \approx \mathbf{S}^{(p)} + \mathbf{a}^{(c)}$ for the downstream Bayesian probe in Stage 2: an important pre-conditioner whose payoff materializes only once per-prompt log evidence scores are accumulated (cf. Sec. [D.6](https://arxiv.org/html/2606.10794#A4.SS6)).

#### D.3.2 Principal-angle analysis between authorship and semantic subspaces

To geometrically locate the authorship signature inside the proxy’s representation space, we compute the principal angles between the empirical authorship subspace $\mathcal{V}_A$ and the empirical semantic subspace $\mathcal{V}_S$. $\mathcal{V}_A$ is spanned by the top-$k_1$ left singular vectors of the de-meaned matrix of model centroids $\mathbf{X}_A = \big[\bm{\mu}^{(c)} - \bm{\mu}_g\big]_{c=1}^{C}$; $\mathcal{V}_S$ is spanned by the top-$k_2$ left singular vectors of $\mathbf{X}_S = \big[\bar{\mathbf{h}}^{(p)} - \bar{\mathbf{h}}_g\big]_{p=1}^{P}$ with $\bar{\mathbf{h}}^{(p)} = \tfrac{1}{C}\sum_{c}\mathbf{u}^{(c,p)}$. With $k_1 = k_2 = 10$, the principal angles $\{\theta_i\}_{i=1}^{10}$ are obtained from the singular values $\sigma_i$ of $\mathbf{U}_A^{\top}\mathbf{U}_S$ via $\cos\theta_i = \sigma_i$, where $\mathbf{U}_A$ and $\mathbf{U}_S$ hold the orthonormal bases of the two subspaces. Robustness to the subspace dimension is verified at $(k_1,k_2) \in \{(5,5),(10,10),(15,15),(20,20)\}$; the mean-of-angles changes by $\leq 4^{\circ}$ across this grid in every run.
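The principal-angle computation can be sketched with two SVD calls. The helper names and the toy point sets below are hypothetical; the sketch de-means each point set by its own mean (an assumption standing in for $\bm{\mu}_g$ and $\bar{\mathbf{h}}_g$) and uses the standard identity that the singular values of $\mathbf{U}_A^{\top}\mathbf{U}_S$ are the cosines of the principal angles.

```python
import numpy as np


def centered_basis(points: np.ndarray, k: int) -> np.ndarray:
    """Orthonormal basis (d, k) of the top-k directions of de-meaned points (n, d)."""
    centered = points - points.mean(axis=0, keepdims=True)
    # Columns of `centered.T` are samples, so left singular vectors live in R^d.
    U, _, _ = np.linalg.svd(centered.T, full_matrices=False)
    return U[:, :k]


def principal_angles_deg(U_a: np.ndarray, U_s: np.ndarray) -> np.ndarray:
    """Principal angles (degrees, ascending) between span(U_a) and span(U_s)."""
    sigma = np.linalg.svd(U_a.T @ U_s, compute_uv=False)
    sigma = np.clip(sigma, -1.0, 1.0)   # guard against round-off outside [-1, 1]
    return np.sort(np.degrees(np.arccos(sigma)))


# Toy point sets in R^3: the first spans the x-y plane, the second the x-z plane.
authorship_points = np.array([
    [1.0, 0.0, 0.0], [-1.0, 0.0, 0.0], [0.0, 0.5, 0.0], [0.0, -0.5, 0.0],
])
semantic_points = np.array([
    [2.0, 0.0, 0.0], [-2.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, -1.0],
])

U_a = centered_basis(authorship_points, k=2)
U_s = centered_basis(semantic_points, k=2)
angles = principal_angles_deg(U_a, U_s)
print("principal angles (deg):", angles)
```

In this toy case the two planes share the x-axis (leading angle near $0^{\circ}$) and are otherwise orthogonal (second angle near $90^{\circ}$), the same qualitative split between an entangled leading direction and near-orthogonal subordinate directions that the analysis above looks for.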

Table 14: Principal angles (deg., sorted ascending) at $k_1{=}k_2{=}10$. Subordinate directions ($i \geq 8$) are universally near-orthogonal; leading directions are partially entangled with magnitude that varies by proxy.

Tab. [14](https://arxiv.org/html/2606.10794#A4.T14) establishes two universal facts: (i) subordinate directions ($i \geq 8$) are near-orthogonal at $\theta_i \in [85.2^{\circ}, 89.9^{\circ}]$, with $\theta_{10}$ within one degree of $90^{\circ}$ in 5 of 6 cases; (ii) the leading direction $\theta_1$ is partially entangled with proxy-dependent magnitude. Only the Qwen3 intra-mean row exhibits the dramatic $\theta_1 \approx 0^{\circ}$ collapse; the other proxies show $\theta_1 \in [38.6^{\circ}, 53.1^{\circ}]$ under both views.

This stratification has two consequences for the design choices in Sec. [3](https://arxiv.org/html/2606.10794#S3): (i) the per-prompt feature $\mathbf{u}^{(c,p)}$ should retain the full activation dimensionality, since the leading components are dominated by entangled energy; (ii) an $L_2$-regularised per-prompt linear classifier is a suitable discriminative evidence unit, $q_{\theta}(c \mid \mathbf{u}^{(c,p)}) = \mathrm{softmax}(\mathbf{W}^{\top}\mathbf{u}^{(c,p)} + \mathbf{b})_c$, because the orthogonal authorship-bearing directions live in a low-energy regime that is not aligned with the cosine geometry of the raw activations. This per-prompt classifier is the unit of evidence that Stage 2 (Sec. [3.3](https://arxiv.org/html/2606.10794#S3.SS3)) accumulates in log-space.

#### D.3.3 Joint takeaway

Intra-sequence filtering does not, on its own, separate authorship from semantics: $R$ is invariant or actively suppressed under $M$ (Tab. [13](https://arxiv.org/html/2606.10794#A4.T13)) and the leading authorship/semantic directions remain partially entangled across proxies (Tab. [14](https://arxiv.org/html/2606.10794#A4.T14)). Both observations converge on the interpretation given in Sec. [3.2](https://arxiv.org/html/2606.10794#S3.SS2): the $\frac{1}{M}\sum_{m}$ operator is a numerically stabilizing low-pass filter whose payoff materializes only when its output feeds the per-prompt posterior from a frozen linear probe, the unit of evidence for Stage 2’s Bayesian accumulation. Subordinate-direction near-orthogonality holds universally and motivates the full-dimensional per-prompt probe used downstream.

### D.4 Why Intra-Response Sampling Uses a Bounded Window

A counter-intuitive empirical observation in our framework is how performance depends on the intra-sequence sampling size $M$. From a naive statistical perspective, a larger $M$ should yield a better estimate of the temporal expectation. However, Tab. [13](https://arxiv.org/html/2606.10794#A4.T13) and Fig. [4](https://arxiv.org/html/2606.10794#S4.F4) show that the most useful operating region is bounded: performance peaks at small values such as $M \in \{4, 8\}$ and can degrade when $M$ is too large (e.g., $M \geq 16$) or too small ($M = 1$).

We attribute this phenomenon to the non-independent and identically distributed (non-i.i.d.) nature of autoregressive generation. Unlike cross-prompt sampling ($K$), tokens within a single sequence form a highly correlated, non-stationary Markov process.

- Under-smoothing ($M = 1$): A single token capture suffers heavily from high-frequency decoding noise ($\bm{\epsilon}_t$) and local contextual spikes, masking the underlying authorship signature.
- Optimal filtering ($M \in \{4, 8\}$): A small, bounded window effectively acts as a short-term temporal low-pass filter, neutralizing the random noise while preserving the structural stylistic signal.
- Over-smoothing and signal decay ($M \geq 16$): Authorship signatures (such as system prompt inertia, format adherence, and specific introductory phrasing) are empirically front-loaded. As the generation extends, the constraints of the generated context increasingly dominate the deep representations, diluting the target model’s intrinsic signature. Furthermore, taking an expectation over a broad, non-stationary temporal window leads to over-smoothing, flattening the distinct linear features into a generic, inseparable centroid.

We therefore interpret the intra-sequence operation as a bounded windowed expectation: statistical marginalization over the temporal dimension must be tightly bounded to avoid diluting front-loaded stylistic priors.
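The bounded windowed expectation can be illustrated with a minimal sketch. It assumes hidden states are plain lists of floats, that the $M$ positions are spread uniformly over the response with both endpoints included, and that $M = 1$ falls back to the last token; the function names `uniform_positions` and `windowed_mean` are hypothetical and not taken from the paper.

```python
def uniform_positions(num_tokens, m):
    """Return up to m uniformly spaced token indices in [0, num_tokens - 1]."""
    if num_tokens < 1 or m < 1:
        raise ValueError("num_tokens and m must be positive")
    m = min(m, num_tokens)        # cannot sample more positions than tokens
    if m == 1:
        return [num_tokens - 1]   # assumption: M = 1 falls back to the last token
    step = (num_tokens - 1) / (m - 1)
    return [round(i * step) for i in range(m)]


def windowed_mean(hidden_states, m):
    """Average the hidden states found at m uniformly spaced positions."""
    idx = uniform_positions(len(hidden_states), m)
    dim = len(hidden_states[0])
    return [sum(hidden_states[t][d] for t in idx) / len(idx) for d in range(dim)]


# Toy response: 9 tokens with 2-dimensional hidden states.
states = [[float(t), 1.0] for t in range(9)]
print(uniform_positions(9, 4))
print(windowed_mean(states, 4))
```

Keeping $M$ as an explicit argument makes it easy to sweep the window size and check where the bounded operating region lies for a given proxy.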

### D.5 Stage 2 Motivation: Limits of Geometric Mean-Pooling

This section quantifies the limitations of the naive geometric centroid $\bar{\mathbf{u}}^{(c)} = \tfrac{1}{K}\sum_k \mathbf{u}^{(c,p_k)}$ that Stage 2 (Sec. [3.3](https://arxiv.org/html/2606.10794#S3.SS3)) deliberately abandons. The purpose is twofold: (i) establish that mean-pooling does indeed compress the prompt-induced semantic variance, i.e. that a geometric LLN regime exists, so that Stage 2’s improvement is measured against a functioning geometric baseline; (ii) quantify the assumption violation that makes mean-pooling fragile and thus unsuitable for the final operating point. Two questions follow:

1. Q-LLN. Does averaging $K$ filtered features $\mathbf{u}^{(c,p_k)}$ contract the semantic variance and lift the Fisher ratio $R = \mathrm{Var}_{\text{between}} / \mathrm{Var}_{\text{within}}$ monotonically with $K$?
2. Q-Csem. Is the limiting semantic centroid $\mathbb{E}_p[\mathbf{S}^{(p)}]$ a rigid, model-independent constant, which is the closed-form premise that would let mean-pooling recover $\mathbf{a}^{(c)}$ exactly?

##### Setup.

Three proxies $\phi \in \{\texttt{Qwen3-8B}, \texttt{Qwen3.5-9B}, \texttt{Llama-3.1-8B}\}$ at their best probe layers $\ell^\star \in \{23, 19, 31\}$. We use Agent500 ($P = 500$) with $C = 50$ target LLMs. For each proxy, we evaluate two feature views at the same layer: the last-token view uses the hidden state at the final response token, whereas the intra-mean view averages $M_{\max} = 16$ uniformly spaced response-internal hidden states before any cross-prompt pooling.

#### D.5.1 Fisher ratio under cross-prompt mean-pooling

For each $K$ we partition the $P$ prompts into disjoint groups of size $K$, average $\mathbf{u}^{(c,p)}$ within each group to obtain $\mathbf{v}^{(c,g)}$, and recompute the Fisher ratio over the $C \times \lfloor P/K \rfloor$ grouped centroids.
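A minimal sketch of this grouping procedure follows. It assumes a trace-style Fisher ratio (mean squared distance of class means to the grand mean, divided by mean squared distance of samples to their class mean) and balanced classes; the exact variance estimator used for the reported numbers is not specified here, and the helper names are hypothetical.

```python
def mean_vec(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def fisher_ratio(features):
    """features: dict mapping class label -> list of feature vectors."""
    class_means = {c: mean_vec(v) for c, v in features.items()}
    grand_mean = mean_vec(list(class_means.values()))  # assumes balanced classes
    between = sum(sq_dist(m, grand_mean) for m in class_means.values()) / len(class_means)
    n_total = sum(len(v) for v in features.values())
    within = sum(
        sq_dist(x, class_means[c]) for c, v in features.items() for x in v
    ) / n_total
    return between / within


def group_pool(vectors, k):
    """Average disjoint consecutive groups of size k; leftovers are dropped."""
    n_groups = len(vectors) // k
    return [mean_vec(vectors[g * k:(g + 1) * k]) for g in range(n_groups)]


# Toy data: two target models, four prompts each, 1-D features.
raw = {"a": [[0.0], [2.0], [1.0], [3.0]], "b": [[4.0], [6.0], [5.0], [7.0]]}
pooled = {c: group_pool(v, 2) for c, v in raw.items()}
print(fisher_ratio(raw), fisher_ratio(pooled))
```

In the toy data the class means are unchanged by pooling while the within-class spread shrinks, which is the variance-contraction effect that Q-LLN asks about.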

Table 15: Fisher ratio $R$ as a function of mean-pool aggregation size $K$, across three proxies and two feature views on Agent500. Bold cells mark $R \geq 1$. Mean-pooling does compress $\mathrm{Var}_{\text{within}}$ monotonically: a geometric LLN regime exists, but as a sub-optimal baseline (cf. Sec. [D.6](https://arxiv.org/html/2606.10794#A4.SS6)).

Tab. [15](https://arxiv.org/html/2606.10794#A4.T15) establishes that the geometric LLN regime is real: $R$ rises monotonically with $K$ in every one of the six runs and crosses unity for all but one view by $K = 50$. This is the upper bound on what mean-pooling alone can achieve. Sec. [D.6](https://arxiv.org/html/2606.10794#A4.SS6) will show that Bayesian Evidence Accumulation extracts strictly more signal from the same $K$ samples; the gap quantifies the suboptimality of geometric fusion.

#### D.5.2 Cross-model semantic-shift ratio: rigidity is violated

The clean closed form $\mathbf{v}^{(c)} - \mathbf{v}^{(c')} = \mathbf{a}^{(c)} - \mathbf{a}^{(c')}$ for the mean-pooled centroid would require the per-prompt residual $\mathbf{e}^{(c,p)} = \mathbf{u}^{(c,p)} - \bm{\mu}^{(c)}$ to be model-invariant on the same prompt: $\mathbf{e}^{(c,p)} \approx \mathbf{e}^{(c',p)}$ for every $c \neq c'$. We test this directly. Compare:

$$D_A = \mathop{\mathbb{E}}_{c \neq c'} \big[\, \|\bm{\mu}^{(c)} - \bm{\mu}^{(c')}\|_2 \,\big], \qquad D_S = \mathop{\mathbb{E}}_{p} \mathop{\mathbb{E}}_{c \neq c'} \big[\, \|\mathbf{e}^{(c,p)} - \mathbf{e}^{(c',p)}\|_2 \,\big],$$

and report $\rho = D_S / D_A$. Strict rigidity demands $\rho \ll 1$.
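The ratio can be computed directly from per-prompt features, as in the sketch below. It assumes every target model is evaluated on the same ordered prompt list and averages over unordered model pairs (which gives the same value as ordered pairs because the distances are symmetric); the function name `semantic_shift_ratio` is hypothetical.

```python
import math
from itertools import combinations


def mean_vec(vectors):
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]


def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def semantic_shift_ratio(features):
    """features[c][p] is the feature of model c on prompt p (prompts aligned)."""
    classes = list(features)
    n_prompts = len(features[classes[0]])
    mu = {c: mean_vec(features[c]) for c in classes}
    # Per-prompt residual e^(c,p) = u^(c,p) - mu^(c).
    resid = {
        c: [[x - m for x, m in zip(u, mu[c])] for u in features[c]]
        for c in classes
    }
    pairs = list(combinations(classes, 2))
    d_a = sum(l2(mu[c1], mu[c2]) for c1, c2 in pairs) / len(pairs)
    d_s = sum(
        l2(resid[c1][p], resid[c2][p]) for p in range(n_prompts) for c1, c2 in pairs
    ) / (n_prompts * len(pairs))
    return d_s / d_a


# Rigid case: both models share the same per-prompt residuals.
rigid = {"a": [[0.0], [2.0]], "b": [[10.0], [12.0]]}
# Non-rigid case: the residuals of model "b" are flipped across prompts.
shifted = {"a": [[0.0], [2.0]], "b": [[12.0], [10.0]]}
print(semantic_shift_ratio(rigid), semantic_shift_ratio(shifted))
```

The rigid toy case yields a ratio of zero, which is the idealised limit the closed form relies on; any model-dependent change in the residuals pushes the ratio above zero.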

Table 16: Cross-model semantic-shift ratio $\rho = D_S / D_A$. Strict rigidity ($\rho \ll 1$) fails in every cell, but the violation is bounded and remarkably consistent: last-token $\rho \in [4.7, 5.9]$, intra-mean $\rho \in [1.8, 2.1]$. Mean-pooling’s clean form is therefore unphysical, motivating Stage 2’s replacement.

Two observations consolidate the case against geometric mean-pooling. First, $\rho > 1$ in every cell: the per-prompt residual rotates non-trivially across target models. The closed-form identity does not hold, so a mean-pooled centroid still carries model-specific semantic distortion. Second, the violation is sharply bounded across proxies: last-token $\rho \in [4.7, 5.9]$, intra-mean $\rho \in [1.8, 2.1]$. A geometric prediction rule that ignores this anisotropy is therefore systematically biased, especially when an atypical high-magnitude prompt drags the mean-pooled centroid away from the authorship signature in proportion to the prompt’s semantic magnitude.

#### D.5.3 Why this motivates the Stage 2 redesign

Tab. [15](https://arxiv.org/html/2606.10794#A4.T15) confirms the geometric LLN regime exists (positive control). Tab. [16](https://arxiv.org/html/2606.10794#A4.T16) confirms the rigidity assumption underlying its closed form does not hold (failure of the naive premise). A mean-pooled centroid is therefore (i) guaranteed to be systematically rotated by per-prompt residuals and (ii) unable to gate out atypical high-magnitude prompts whose magnitudes overwhelm the geometric average. The Stage 2 redesign in Sec. [3.3](https://arxiv.org/html/2606.10794#S3.SS3) replaces the geometric centroid with a Bayesian Evidence Accumulation in the decision space, and Sec. [D.6](https://arxiv.org/html/2606.10794#A4.SS6) verifies empirically that this replacement extracts strictly more signal from the same $K$ samples.

#### D.5.4 Joint takeaway

Mean-pooling does compress prompt-induced semantic variance monotonically with $K$ across all 9 proxy/view/domain configurations, but its closed-form identity is unphysical: the cross-model rigidity assumption is universally violated with $\rho > 1$. This combination, a working geometric baseline that nevertheless rests on a broken premise, is exactly the situation in which a probabilistic aggregation in the calibrated decision space outperforms direct geometric averaging. We turn to that experiment next (Sec. [D.6](https://arxiv.org/html/2606.10794#A4.SS6)).

### D.6 Stage 2 Validation: Bayesian Evidence Accumulation

This section directly compares the Stage 2 design choice, Bayesian Evidence Accumulation in the decision space, against the geometric mean-pool baseline whose limitations were established in Sec. [D.5](https://arxiv.org/html/2606.10794#A4.SS5). The hypothesis under test is that, given the same per-prompt filtered features $\mathbf{u}^{(c,p)} = \tfrac{1}{M}\sum_m \mathbf{h}_{t_m}^{(c,p)}$ from Stage 1, aggregating evidence in the decision space strictly out-performs aggregating in the activation space.

##### Setup.

Three proxies $\phi \in \{\texttt{Qwen3-8B}, \texttt{Qwen3.5-9B}, \texttt{Llama-3.1-8B}\}$ at their best probe layers $\ell^\star \in \{23, 19, 31\}$ on the agent-domain probe pool ($P = 500$, $C = 50$ target LLMs). For each $K \in \{1, 5, 10, 20, 50\}$ we partition the $P$ prompts of every target into disjoint groups of size $K$, yielding $\lfloor P/K \rfloor \cdot C$ fingerprints per setting; classification is evaluated under stratified 5-fold cross-validation. Three aggregators share the same Stage 1 features and the same StandardScaler+PCA preprocessing:

`meanpool_lr`: Geometric Stage 2 baseline. Form $\bar{\mathbf{u}}^{(c,g)} = \tfrac{1}{K}\sum_k \mathbf{u}^{(c,p_k)}$ over each fingerprint group $g$, fit the same $L_2$-regularised multinomial logistic-regression head on the pooled centroids, and predict with $q_\theta(c \mid \bar{\mathbf{u}}^{(c,g)}) = \mathrm{softmax}(\mathbf{W}^\top \bar{\mathbf{u}}^{(c,g)} + \mathbf{b})_c$.

`logposterior`: Stage 2 of READER. Fit the same logistic-regression head on single-prompt features $\mathbf{u}^{(c,p)}$, then for each fingerprint group $g$ predict by $\arg\max_c \tfrac{1}{K}\sum_k \log q_\theta(c \mid \mathbf{u}^{(c,p_k)})$.

`gaussian_disc`: Reference: Ledoit–Wolf shrunk LDA on pooled centroids with $K$-scaled decision function. Acts as a distributional sanity check.
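The mechanics of the first two aggregators can be sketched as follows, assuming an already-fitted linear head stored as one weight row per class. The weights, features, and function names below are illustrative placeholders, not values or code from the paper.

```python
import math


def logits(W, b, u):
    """Linear head: W[c] is the weight row for class c (hypothetical layout)."""
    return [sum(w * x for w, x in zip(row, u)) + bias for row, bias in zip(W, b)]


def log_softmax(z):
    m = max(z)
    lse = m + math.log(sum(math.exp(v - m) for v in z))
    return [v - lse for v in z]


def meanpool_scores(W, b, group):
    """Average the features first, then apply the head once."""
    dim = len(group[0])
    centroid = [sum(u[d] for u in group) / len(group) for d in range(dim)]
    return log_softmax(logits(W, b, centroid))


def logposterior_scores(W, b, group):
    """Apply the head per prompt, then average the log-posteriors."""
    per_prompt = [log_softmax(logits(W, b, u)) for u in group]
    return [sum(lp[c] for lp in per_prompt) / len(group) for c in range(len(b))]


def argmax(xs):
    return max(range(len(xs)), key=xs.__getitem__)


# Illustrative 3-class head on 2-D features and one group of K = 3 prompts.
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b = [0.0, 0.0, 0.0]
group = [[2.0, 0.5], [1.0, 1.5], [0.5, 0.0]]
print(argmax(meanpool_scores(W, b, group)), argmax(logposterior_scores(W, b, group)))
```

For one fixed linear head the two score vectors differ only by a class-independent term, so this sketch shows the mechanics rather than an accuracy gap; in the setup above the two heads are fit on different inputs (pooled centroids versus single-prompt features), which is where the rules can diverge.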

We report top-1 accuracy, expected calibration error (ECE), and negative log-likelihood (NLL) of the predicted class distribution against the true target.

#### D.6.1 Top-1 accuracy: log evidence dominates at every $K \geq 5$

Table 17: Response-view diagnostic of top-1 provenance accuracy as a function of evidence-aggregation size $K$. This controlled ablation isolates the Stage-2 aggregation rule before the final intra-response mean-pool pipeline: `logposterior` accumulates per-response posterior evidence, while `meanpool_lr` first averages response fingerprints geometrically. `gaussian_disc` is included as a distributional reference.

Three regularities hold in this last-token-view diagnostic:

- Decision-space accumulation helps once $K > 1$. `logposterior` out-performs `meanpool_lr` in every one of the $3 \times 5 = 15$ non-trivial $(K, \text{proxy})$ cells; the inequality is strict in this diagnostic setting.
- Gap widens through the operating point. The accuracy gap $\Delta = \mathrm{acc}_{\texttt{logposterior}} - \mathrm{acc}_{\texttt{meanpool\_lr}}$ grows from $\Delta \in [0.02, 0.05]$ at $K = 5$ to a clear margin at $K = 50$, the multi-query budget used throughout the main paper. This comparison is the one we emphasize because it captures most of the achievable gain without doubling the query cost.
- Diagnostic operating point. At $K = 50$, log evidence reaches $0.69$–$0.79$ across the three proxies in this last-token-view ablation, exceeding the geometric baseline while using the same underlying per-prompt features.

The $K = 1$ identity is structural: with one prompt the two methods reduce to the same softmax over $\mathbf{u}^{(c,p_1)}$ (the mean-pool of one element equals that element). The non-trivial divergence appears only when there are multiple pieces of evidence to aggregate, i.e. exactly the regime in which the choice between geometric pooling and log-space accumulation is tested.

#### D.6.2 NLL, calibrated confidence, and the soft-gate effect

Table 18: Negative log-likelihood (NLL, lower is better) at $K = 5$ and $K = 50$. Mean-pool’s NLL spikes to 3.9–4.8 at small $K$ because a single semantically extreme prompt can pull the pooled centroid into a region where the probe assigns high confidence to a wrong class. Log-posterior absorbs the same prompts as low-information evidence (a near-uniform per-prompt softmax adds a near-constant in log-space), keeping NLL bounded at 1.3–2.0.

The NLL pattern in Tab. [18](https://arxiv.org/html/2606.10794#A4.T18) provides the mechanistic signature of the soft-gate behavior asserted in Sec. [3.3](https://arxiv.org/html/2606.10794#S3.SS3). At $K = 5$, geometric mean-pool produces NLL $\in [3.95, 4.81]$ across the three proxies, roughly two to three times higher than log-posterior’s $[1.34, 1.98]$. The mean-pool head sometimes classifies confidently wrong: a small $K$ leaves enough room for a single atypical prompt to dominate the pooled centroid and induce a high-confidence incorrect prediction. Log-posterior, by contrast, treats each prompt independently: an atypical prompt that produces a near-uniform per-prompt softmax contributes a vector close to $\log(1/|\mathcal{C}|) \cdot \mathbf{1}$, which adds a label-independent constant to the running log-sum and cannot bias the $\arg\max$. This is precisely the “information-theoretic soft-gate” described in the methodology.
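The soft-gate argument can be written out as a short worked step. The derivation below is a sketch in the notation of this section; the index $j$ for the uninformative prompt and the unnormalised sum $S_c$ are introduced here only for illustration, and the idealised case of an exactly uniform softmax is assumed.

```latex
\begin{aligned}
S_c &= \sum_{k=1}^{K} \log q_\theta(c \mid \mathbf{u}_k)
     = \sum_{k \neq j} \log q_\theta(c \mid \mathbf{u}_k)
       + \log q_\theta(c \mid \mathbf{u}_j), \\[4pt]
q_\theta(c \mid \mathbf{u}_j) = \tfrac{1}{|\mathcal{C}|}\ \ \forall c
  \;&\Longrightarrow\;
  S_c = \sum_{k \neq j} \log q_\theta(c \mid \mathbf{u}_k) - \log |\mathcal{C}|, \\[4pt]
\arg\max_c S_c &= \arg\max_c \sum_{k \neq j} \log q_\theta(c \mid \mathbf{u}_k).
\end{aligned}
```

With $|\mathcal{C}| = 50$ targets the added constant is $-\log 50 \approx -3.91$ for every class, so the uninformative prompt shifts all scores equally and leaves the ranking unchanged.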

The ECE columns in Tab. [18](https://arxiv.org/html/2606.10794#A4.T18) reveal that the raw score scale is not itself a calibrated group-level posterior. In our implementation the accumulated evidence is stored as the average log-posterior score $S_c = \frac{1}{K}\sum_k \log q_\theta(c \mid \mathbf{u}_k)$, which preserves the MAP ranking but can be under-confident because the $1/K$ factor flattens the final softmax. We therefore report calibrated confidence by fitting a scalar evidence scale $\alpha$ on the validation split and evaluating $\mathrm{softmax}(\alpha \mathbf{S})$ on the held-out fold. Since $\alpha > 0$, this calibration preserves top-1 accuracy and affects only NLL, ECE, and any downstream confidence threshold.
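A minimal sketch of fitting such a scalar scale is given below. It assumes $\alpha$ is chosen to minimise validation NLL and uses a simple log-spaced grid search; the fitting procedure actually used is not described here, and the function names and toy scores are hypothetical.

```python
import math


def nll(alpha, scores, labels):
    """Mean negative log-likelihood of softmax(alpha * S) against true labels."""
    total = 0.0
    for s, y in zip(scores, labels):
        z = [alpha * v for v in s]
        m = max(z)
        lse = m + math.log(sum(math.exp(v - m) for v in z))
        total += lse - z[y]
    return total / len(labels)


def fit_alpha(scores, labels, lo=1e-2, hi=1e2, steps=201):
    """Grid search over positive alpha values (assumed fitting procedure)."""
    grid = [lo * (hi / lo) ** (i / (steps - 1)) for i in range(steps)]
    candidates = [1.0] + grid  # always include the uncalibrated scale
    return min(candidates, key=lambda a: nll(a, scores, labels))


# Toy averaged log-posterior scores for a 2-class problem; the last
# validation group is misclassified by a small margin.
val_scores = [[-0.9, -1.1], [-1.1, -0.9], [-0.9, -1.1], [-1.05, -1.0]]
val_labels = [0, 1, 0, 0]
alpha = fit_alpha(val_scores, val_labels)
print(alpha, nll(1.0, val_scores, val_labels), nll(alpha, val_scores, val_labels))
```

Because every candidate is positive, multiplying the scores by the fitted scale cannot change which class has the largest score; only the sharpness of the reported probabilities changes.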

Table 19: Raw confidence scale and validation-fitted scalar evidence scale for READER at $M = 4$, using the canonical mean-pool-intra + log-posterior cross-$K$ pipeline. The fitted $\alpha$ is positive, so MAP accuracy is unchanged. The value indicates the evidence rescaling used for calibrated confidence reporting.

Tab. [19](https://arxiv.org/html/2606.10794#A4.T19) confirms the calibration interpretation. At $K = 50$, the raw BEA scores are accurate but not directly calibrated as group-level probabilities: ECE lies in $[0.350, 0.503]$ while top-1 accuracy reaches $0.700$–$0.840$. The validation-fitted $\alpha$ values increase from roughly $0.6$–$1.1$ at $K = 1$ to $7$–$12$ at $K = 50$, matching the need to rescale averaged log-evidence before confidence reporting. We therefore use raw BEA scores for MAP attribution and calibrated BEA probabilities only when reporting confidence metrics or applying confidence thresholds.

#### D.6.3 Joint takeaway

The Stage 2 design choice is empirically justified by two complementary observations. First, in the last-token-view diagnostic above, log-posterior accumulation consistently improves over geometric mean-pooling once multiple responses are available. Second, in the canonical pipeline used by the main paper, the same decision-space accumulator remains competitive with geometric pooling on the strongest Qwen proxies while improving several proxies and all sentence-encoder baselines (Tab. [6](https://arxiv.org/html/2606.10794#A3.T6)). NLL further supports the soft-gate interpretation: log-posterior treats low-information prompts as near-constant evidence, whereas geometric pooling can move a shared centroid through feature space. Combined with Sec. [D.3](https://arxiv.org/html/2606.10794#A4.SS3)’s validation of Stage 1, this supports the two-stage READER pipeline: a temporal low-pass filter that produces stable per-prompt features, followed by Bayesian evidence accumulation in the decision space.

## Appendix E Static Relationship Evaluation on Bench-A

We evaluate whether different black-box signals can recover static relationships between LLMs when all models are queried on the same input set. This appendix experiment is a controlled comparison against prior static relationship detection methods, not evidence for the dynamic input-set setting considered in our main method. All entries below use the same cached Bench-A prompt set and the same generated responses.

##### Task construction.

We construct a balanced pairwise relationship classification task from Bench-A. Each positive example is a parent–derived model pair, and each negative example is sampled from two different parent families. We use only models with complete cached generations on all prompts. This yields 60 model pairs with a 1:1 ratio between related and unrelated pairs. Each model has responses to 600 prompts with a maximum generation length of 128 tokens. For evaluation, we use 20 fixed stratified 4:1 train/test splits over model pairs. For every method we compute a scalar similarity score for each model pair, train a linear SVM on the training pairs, and report mean and standard deviation over the 20 test splits. The classifier and the data split are identical across methods; only the pairwise similarity function changes.

##### Per-sample comparison in the binary pairwise setting.

The READER framework introduced in Sec. [3](https://arxiv.org/html/2606.10794#S3) aggregates evidence across $K$ prompts via Bayesian log-posterior accumulation in order to identify a single target LLM out of a $C$-class ecosystem. The relationship classification task here is structurally different: each instance is already a *pair* of models, and the SVM consumes a single scalar similarity score per pair. There is no $C$-way posterior to accumulate. Our method therefore reduces to its natural binary-pairwise form, which we call *per-sample proxy comparison*: for each of the 600 aligned prompts $p$ we compute the cosine similarity between the two models’ Stage-1 filtered features $\mathbf{u}^{(c,p)} = \tfrac{1}{M}\sum_{m=1}^{M} \mathbf{h}_{t_m}^{(c,p)}$ ($M = 4$ uniformly spaced response positions, last layer of the proxy LM), and average the 600 prompt-level cosines into one pair score. No multi-sample posterior aggregation is needed because the binary decision is encoded in the per-pair score itself; the SVM absorbs the residual scaling.
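The pair score described above amounts to a mean of prompt-aligned cosines, as in this minimal sketch. It assumes the Stage-1 features of both models are already extracted and listed in the same prompt order; the function names are hypothetical.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def pair_score(feats_a, feats_b):
    """Average the per-prompt cosine between two models' aligned features."""
    if len(feats_a) != len(feats_b):
        raise ValueError("both models need features for the same prompts")
    return sum(cosine(a, b) for a, b in zip(feats_a, feats_b)) / len(feats_a)


# Toy example with two aligned prompts and 2-D features.
model_a = [[1.0, 0.0], [0.0, 2.0]]
model_b = [[2.0, 0.0], [0.0, -1.0]]
print(pair_score(model_a, model_b))
```

The resulting scalar is what a downstream binary classifier, such as the linear SVM in the task construction, would consume as its only input feature.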

##### Compared methods.

We compare three families of black-box relationship signals on the same task. Local output statistics use prompt-aligned surface agreement on the generated text: MPT (agreement of the first non-space generated token) and PhyloLM (agreement of the first four output characters). LLM-DNA embeds responses with a sentence encoder; the original variant concatenates the 600 response embeddings of a model, reduces the result with a random Gaussian projection (RGP) to 128 dimensions, and compares model pairs by cosine similarity over the model-level vector; we additionally evaluate a *prompt-aligned* variant (LLM-DNA-split) that computes per-prompt cosines and averages over the 600 prompts. Both variants are instantiated with MPNet, BGE, and Qwen3-Embedding-8B encoders. Ours is the per-sample proxy comparison described above, instantiated with Qwen3-8B and Qwen3.5-9B as proxies.
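The two local output statistics, as summarised in this paragraph, can be sketched as simple agreement rates. This is only an illustration of the one-line descriptions above, not the original MPT or PhyloLM implementations; whitespace splitting stands in for a real tokenizer, and the function names are hypothetical.

```python
def first_token(text):
    """First whitespace-delimited token (assumption: stands in for a tokenizer)."""
    parts = text.split()
    return parts[0] if parts else ""


def first_token_agreement(resp_a, resp_b):
    """Fraction of aligned prompts whose responses share the first token."""
    hits = sum(first_token(a) == first_token(b) for a, b in zip(resp_a, resp_b))
    return hits / len(resp_a)


def prefix_agreement(resp_a, resp_b, n_chars=4):
    """Fraction of aligned prompts whose responses share the first n_chars."""
    hits = sum(a[:n_chars] == b[:n_chars] for a, b in zip(resp_a, resp_b))
    return hits / len(resp_a)


# Toy prompt-aligned responses from two models.
responses_a = ["Sure, here", "Sure thing", "No"]
responses_b = ["Sure, there", "Surely", "Yes"]
print(first_token_agreement(responses_a, responses_b))
print(prefix_agreement(responses_a, responses_b))
```

Both scores depend on the two models having answered exactly the same prompts, which is why they are only meaningful in the prompt-aligned static setting.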

Table 20: Static model relationship recognition on Bench-A. All methods are evaluated on the same 60 balanced model pairs and the same 20 fixed train/test splits. LLM-DNA is reported under both its original model-level RGP aggregation and a prompt-aligned variant (LLM-DNA-split) that averages per-prompt cosines. Our method is the per-sample proxy comparison described in the text, evaluated with two proxies. Best per metric in bold, second-best underlined.

##### Main observations.

Table [20](https://arxiv.org/html/2606.10794#A5.T20) groups the 10 methods into three horizontal bands: prefix statistics (MPT, PhyloLM), sentence-embedding DNA (LLM-DNA and its prompt-aligned variant LLM-DNA-split with three encoders), and the per-sample proxy method (Ours).

Prefix statistics dominate the raw scoreboard. PhyloLM and MPT achieve the top two positions on accuracy, precision, F1, and AUC. This should not be read as a general dominance of local string statistics: the setting here is unusually favorable to them. Both models in each pair are evaluated on exactly the same prompts and the score is computed from prompt-aligned output prefixes, so any shared formatting, refusal pattern, or short boilerplate is counted directly as relationship evidence. Bench-A’s valid model pool is dominated by small parent–derived families, which further inflates this local-agreement signal.

Sentence-embedding DNA does not transfer cleanly to this regime. Among LLM-DNA encoders, Qwen3-Embedding-8B reaches the best F1 ($0.721$) and BGE the best AUC ($0.853$), both well below the prefix baselines. Switching from the original RGP aggregation to the prompt-aligned variant (LLM-DNA-split) yields only a modest and inconsistent change (e.g. MPNet F1 improves $0.590 \to 0.686$, whereas BGE F1 drops $0.696 \to 0.664$). Sentence embeddings are tuned for semantic equivalence, so on prompts where two models produce semantically equivalent but stylistically different responses they collapse the very signal we want.

Per-sample proxy comparison is the strongest representation-based method. The Qwen3-8B proxy reaches F1 $0.820 \pm 0.094$ and AUC $0.896 \pm 0.085$. Compared with the best LLM-DNA variant on each metric (F1 $0.721$ for Qwen3-Emb-8B, AUC $0.853$ for BGE), this is an absolute improvement of roughly $+10$ F1 points and $+4$ AUC points. The Qwen3-8B proxy also achieves the best recall ($0.908$) of any method, including the two prefix baselines. Variance is also lower for the proxy method: F1 $\sigma \approx 0.094$ vs $\sigma \geq 0.130$ for every LLM-DNA variant, indicating that the proxy similarity is more stable across the 20 splits. The Qwen3.5-9B proxy is slightly weaker on recall but trades it for higher precision ($0.836$), suggesting a proxy-family effect on the operating point rather than on aggregate quality.

##### Scope of the result.

We draw three restrained conclusions. First, local prefix statistics are unusually strong on this Bench-A subset and should be treated as important static baselines whenever the two models are queried on the same prompts. Second, under faithful black-box aggregation, sentence-embedding DNA does not match its originally reported advantage in our small-model Bench-A regime, likely because semantic encoders dilute the fine-grained local cues that carry parent–derived information. Third, our per-sample proxy comparison provides a representation-based relationship signal that improves over LLM-DNA by a non-trivial margin on F1, AUC, and recall, while remaining below the prefix baselines. This is consistent with the intended role of the method: in settings where prompt-aligned surface statistics are unavailable (different prompt sets per model, different sampling temperatures, or arbitrary decoding configurations), the proxy comparison retains its score-formula validity, whereas MPT and PhyloLM lose theirs.

### E.1 Bench-A Subset Details for Static Relationship Evaluation

This appendix gives the exact Bench-A subset used in Section [E](https://arxiv.org/html/2606.10794#A5). We start from the cached Bench-A generations and keep only models with complete outputs for all 600 prompts. The resulting generation pool contains 61 models. For relationship evaluation, a model can form a positive pair only if its resolved parent model is also present in the valid generation pool. This leaves five parent families with complete parent models. We construct all available parent–derived positive pairs from these families and sample an equal number of cross-family negative pairs, giving 60 model pairs in total: 30 related and 30 unrelated.

Table 21: Parent families used to construct positive pairs in the static relationship experiment. “Family size” includes the parent model itself.

Table 22: Summary of the Bench-A subset used for static relationship recognition.

Table 23: Complete model-pair list used in the static relationship evaluation. The table contains all 60 balanced pairs used by the 20 repeated train/test splits. Related pairs are listed first, followed by unrelated pairs.

## Appendix F Broader Impact

READER is intended as an auditing tool for black-box model provenance. Its positive uses include license governance, detection of unauthorized model wrapping, verification of API model identity, and post-incident attribution when generated content must be traced back to a likely source model. Because the method requires only generated text from the target model and a frozen proxy reader, it may help third-party auditors evaluate deployed systems without requiring access to proprietary model weights or logits.

The same capability can be misused. A reliable provenance system may reveal information about model families, deployment choices, or downstream service providers that an operator did not intend to disclose. It may also be used for competitive intelligence or for selectively evading model-specific monitoring once an attacker understands which source model is likely to be detected. We therefore view READER as appropriate for legitimate auditing, compliance, and research settings, and not as a tool for deanonymizing private users or inferring sensitive attributes from human-written text.

The method is probabilistic and closed-set: it reports the most likely source among candidate LLMs seen during probe training. Deployment should therefore include uncertainty reporting, calibrated confidence thresholds, and an “unknown model” handling policy rather than treating every prediction as a conclusive attribution. Dataset governance is also important: target outputs used for training and evaluation should respect model terms of service and avoid prompts that solicit private, copyrighted, or harmful content.

Finally, READER does not require additional target-model queries beyond the natural interaction trace being audited. Its main operational cost is local proxy inference, so responsible deployments should use it for legitimate audits where provenance accuracy matters.

## Appendix G Large Language Model Usage

We used Claude Opus 4.6/4.7 as coding assistance tools for implementation, debugging, and experiment-running support. We used GPT-5.5 to assist with manuscript polishing and English proofreading. All technical claims, experimental results, tables, and final manuscript content were reviewed and approved by the authors.
