AEyeDE: An Attention-Based Attribution Framework for AI-Generated Text Detection

arXiv cs.CL Papers

Summary

AEyeDE is an attention-based attribution framework that uses a proxy Transformer model to extract attention maps from text and trains a lightweight CNN to distinguish human-written from AI-generated text, outperforming text-only baselines and showing robustness across settings.

arXiv:2606.00016v1 Announce Type: new

# AEyeDE: An Attention-Based Attribution Framework for AI-Generated Text Detection
Source: [https://arxiv.org/html/2606.00016](https://arxiv.org/html/2606.00016)
Aria Nourbakhsh (aria.nourbakhsh@uni.lu), Adelaide Danilov (adelaide.danilov.002@student.uni.lu), Christoph Schommer (christoph.schommer@uni.lu), and Salima Lamsiyah (salima.lamsiyah@uni.lu), Department of Computer Science, University of Luxembourg

###### Abstract

Detecting AI-generated text is becoming increasingly challenging as modern language models approach human-level fluency and can evade detectors that rely on surface statistics or likelihood-based signals. We propose AEyeDE, an attribution-driven approach to human-AI authorship detection that leverages model attention as a discriminative signal. Specifically, we extract attention-based attribution matrices for both human- and AI-generated text using a *proxy* Transformer model with white-box access and train a lightweight Convolutional Neural Network to learn representations from these attribution maps. Across encoder-decoder translation settings, our method consistently outperforms a text-only baseline. In decoder-only settings, it performs strongly in generator-specific detection, remains competitive on standard benchmarks, and shows robustness under cross-dataset transfer and alternative-spelling perturbations. We further show that attention maps exhibit recurring local structures whose relative frequencies differ consistently between human- and AI-generated text across datasets and proxy models. These findings suggest that attention-based attribution maps provide a complementary and interpretable signal for AI-generated text detection. We will make the code publicly available to support future research.

## 1 Introduction

The advent of large language models (LLMs) has enabled the generation of coherent, context-aware, and human-like text across a wide range of domains and languages (Naveed et al., [2023](https://arxiv.org/html/2606.00016#bib.bib39); Chang et al., [2024](https://arxiv.org/html/2606.00016#bib.bib40)). These advances unlock substantial benefits, but they also raise critical challenges related to information integrity, authorship, and misuse. Examples include large-scale misinformation in journalism and academic dishonesty in education, where automated content generation threatens societal trust, originality, and assessment validity (Dugan et al., [2023](https://arxiv.org/html/2606.00016#bib.bib26); Wu et al., [2025](https://arxiv.org/html/2606.00016#bib.bib42); Liu et al., [2024b](https://arxiv.org/html/2606.00016#bib.bib25); Ali et al., [2025](https://arxiv.org/html/2606.00016#bib.bib62); Huang et al., [2025b](https://arxiv.org/html/2606.00016#bib.bib32); Bittle and El-Gayar, [2025](https://arxiv.org/html/2606.00016#bib.bib43)).

In response, several AI-generated text detection methods have been proposed: surface-statistical approaches that exploit cues such as perplexity, burstiness, and n-gram repetition (Gehrmann et al., [2019](https://arxiv.org/html/2606.00016#bib.bib31); Ippolito et al., [2020](https://arxiv.org/html/2606.00016#bib.bib53)); likelihood-based methods that probe variation in a model's probability landscape via perturbation or re-sampling (Mitchell et al., [2023](https://arxiv.org/html/2606.00016#bib.bib36); Bao et al., [2023](https://arxiv.org/html/2606.00016#bib.bib12)); supervised classifiers that fine-tune Transformer encoders on labeled data (Li et al., [2025](https://arxiv.org/html/2606.00016#bib.bib49); Zhu et al., [2025](https://arxiv.org/html/2606.00016#bib.bib63)); and watermarking techniques that embed detectable signals for source attribution (Kirchenbauer et al., [2023](https://arxiv.org/html/2606.00016#bib.bib48); Liu et al., [2024a](https://arxiv.org/html/2606.00016#bib.bib19)). However, each paradigm has intrinsic limitations: statistical and likelihood-based detectors degrade as LLMs are optimized to mimic human distributions through techniques such as reinforcement learning from human feedback (RLHF) (Christiano et al., [2017](https://arxiv.org/html/2606.00016#bib.bib54)); supervised classifiers suffer under domain shift and unseen generators (Uchendu et al., [2021](https://arxiv.org/html/2606.00016#bib.bib29)); and watermarking requires model-side cooperation and is fragile to paraphrasing, post-editing, or partial reuse (Liu et al., [2024a](https://arxiv.org/html/2606.00016#bib.bib19); Wang et al., [2025](https://arxiv.org/html/2606.00016#bib.bib21); Niess and Kern, [2025](https://arxiv.org/html/2606.00016#bib.bib23); Ahn et al., [2025](https://arxiv.org/html/2606.00016#bib.bib5)). As a result, robust detection remains an open challenge, continually undermined by advances in generation quality and adversarial evasion strategies (Wu et al., [2025](https://arxiv.org/html/2606.00016#bib.bib42); [2024](https://arxiv.org/html/2606.00016#bib.bib34)).

These challenges motivate a shift away from detecting *what* is written toward analyzing *how* text is produced. In particular, we hypothesize that token-token interaction structure provides a richer detection signal than one-dimensional token-level statistics such as rank, likelihood, or perplexity. Text-only detectors, whether statistical or neural, operate on the generated sequence itself. Attention-based attribution maps, by contrast, capture how a proxy Transformer internally processes that sequence, revealing a two-dimensional relational structure over token interactions. This representation can preserve higher-order local regularities that are not directly observable from surface-form statistics alone. We therefore investigate whether these attribution patterns provide a more robust and transferable signal for distinguishing human- and AI-generated text. To test this hypothesis, we introduce AEyeDE, an attribution-based detection framework that operates directly on attention attribution maps extracted from Transformer models (Vaswani et al., [2017](https://arxiv.org/html/2606.00016#bib.bib38)). Given an observed text $x$ (human- or AI-generated), AEyeDE passes $x$ through a fixed *proxy* model $G_{\theta}$ with white-box access and derives an attention attribution matrix (Sec. [3](https://arxiv.org/html/2606.00016#S3)). We process attribution maps using a multi-scale convolutional encoder with attention pooling to obtain compact embeddings for authorship classification (Figure [1](https://arxiv.org/html/2606.00016#S3.F1)), making the detector less sensitive to purely lexical or stylistic variation.

Beyond predictive performance, this approach also enables an interpretable analysis of the learned attribution structure. To this end, we analyze what the CNN attribution encoder captures in attention maps. Clustering $8\times 8$ patches in the feature space of its last convolutional stage reveals recurring local patterns (*motifs*) whose prevalence differs between human- and AI-generated text across datasets and proxy models. This indicates that authorship leaves a localized, repeatable signature in proxy-model attention maps that our detector can exploit.

We evaluate AEyeDE across both encoder-decoder and decoder-only settings, using machine translation benchmarks (WMT14 and the UN Parallel Corpus) and open-ended generation datasets (HC3, RAID, and Beemo). Together, these experiments cover multiple languages, domains, and model families.

Our main contributions are summarized as follows:

- We introduce AEyeDE, an attribution-based framework for AI-generated text detection that uses attention attribution maps from a proxy Transformer as structured input to a lightweight CNN classifier.
- We provide a broad empirical evaluation across encoder-decoder and decoder-only settings, including generator-specific detection, mixed-generator generalization, adversarial perturbations, and cross-dataset transfer. The results show that attention-based attribution maps provide a competitive and complementary detection signal, with particularly strong performance in generator-specific settings and robustness to alternative-spelling attacks.
- We analyze the learned attribution representations and identify recurring local attention patterns, or "motifs," whose relative frequencies differ systematically between human- and AI-generated text. These motifs provide an interpretable and localized signature of authorship.

## 2 Related Work

#### AI\-Generated Text Detection\.

Research on detecting machine-generated text has accelerated alongside the rapid progress and deployment of LLMs (Wu et al., [2025](https://arxiv.org/html/2606.00016#bib.bib42)). Existing approaches can be grouped into (i) *surface-statistical* and (ii) *likelihood-based* detectors, (iii) *supervised neural classifiers*, (iv) *watermarking and source attribution*, and (v) *LLM-based* meta-detectors. Surface-statistical methods exploit distributional artifacts such as perplexity, burstiness, or n-gram irregularities; they are lightweight but become increasingly fragile as generators improve (Gehrmann et al., [2019](https://arxiv.org/html/2606.00016#bib.bib31); Ippolito et al., [2020](https://arxiv.org/html/2606.00016#bib.bib53); Shen et al., [2023](https://arxiv.org/html/2606.00016#bib.bib72); Tassopoulou et al., [2021](https://arxiv.org/html/2606.00016#bib.bib73); Krishna et al., [2022](https://arxiv.org/html/2606.00016#bib.bib74)). Complementarily, likelihood-based methods probe the generator's probability landscape: DetectGPT identifies machine-generated text by measuring the curvature of the log-probability under perturbations (Mitchell et al., [2023](https://arxiv.org/html/2606.00016#bib.bib36)), and related work improves efficiency and robustness through faster perturbation schemes (Bao et al., [2023](https://arxiv.org/html/2606.00016#bib.bib12)). These lines of work capture model-specific statistical footprints but can degrade as LLMs are optimized to match human-like distributions.

#### Neural Detectors, Robustness, and Generalization\.

Supervised detectors typically fine-tune Transformer encoders (e.g., BERT (Devlin et al., [2019](https://arxiv.org/html/2606.00016#bib.bib75)) and RoBERTa (Liu et al., [2019](https://arxiv.org/html/2606.00016#bib.bib90))) on labeled human vs. machine text, achieving strong in-domain performance but often suffering under domain shift and unseen generators (Uchendu et al., [2021](https://arxiv.org/html/2606.00016#bib.bib29); Wang et al., [2024b](https://arxiv.org/html/2606.00016#bib.bib47)). Robustness has become a central focus: training on diverse decoding strategies improves resilience (Ippolito et al., [2020](https://arxiv.org/html/2606.00016#bib.bib53)), adversarial training frameworks such as IRON harden detectors against evasion (Li et al., [2025](https://arxiv.org/html/2606.00016#bib.bib49)), and RADAR explicitly targets robustness via adversarial learning (Hu et al., [2023](https://arxiv.org/html/2606.00016#bib.bib18)). Recent methods further aim to improve out-of-distribution behavior and reliability guarantees, e.g., by shaping attention over multiple receptive ranges (Jiao et al., [2025](https://arxiv.org/html/2606.00016#bib.bib70)) or by bounding false positives with conformal prediction in zero-shot settings (Zhu et al., [2025](https://arxiv.org/html/2606.00016#bib.bib63)). Interpretability of detectors is also receiving attention: feature-level analyses using sparse autoencoders help reveal which latent patterns separate machine and human text (Kuznetsov et al., [2025](https://arxiv.org/html/2606.00016#bib.bib71)). Meanwhile, downstream applications increasingly require multilingual and domain-specific robustness (Ali et al., [2025](https://arxiv.org/html/2606.00016#bib.bib62)) and support for fine-grained settings such as human-AI co-authorship (Su et al., [2025](https://arxiv.org/html/2606.00016#bib.bib64)).

#### Watermarking and Source Attribution\.

Watermarking aims to embed detectable signals into generated text, enabling attribution when generation-side cooperation is available (Liu et al., [2024a](https://arxiv.org/html/2606.00016#bib.bib19)). Early and widely adopted schemes include token-list or "soft" watermarks that bias sampling (Kirchenbauer et al., [2023](https://arxiv.org/html/2606.00016#bib.bib48)), while subsequent work explored alternative embedding mechanisms and detection rules, including entropy- or Bayesian-inspired detectors (Lu et al., [2024](https://arxiv.org/html/2606.00016#bib.bib24); Huang et al., [2025a](https://arxiv.org/html/2606.00016#bib.bib20)) and more adaptive watermark designs such as MorphMark (Wang et al., [2025](https://arxiv.org/html/2606.00016#bib.bib21)). Recent studies further examine watermark ensembles (Niess and Kern, [2025](https://arxiv.org/html/2606.00016#bib.bib23)), watermark-based source attribution (e.g., WASA) (Lu et al., [2025](https://arxiv.org/html/2606.00016#bib.bib16)), and approaches that reduce bias and risk (Mao et al., [2025](https://arxiv.org/html/2606.00016#bib.bib17)). However, watermarking remains challenged by post-editing and paraphrasing (Liu et al., [2024a](https://arxiv.org/html/2606.00016#bib.bib19)), motivating defenses such as paraphrase inversion (Rivera Soto et al., [2025](https://arxiv.org/html/2606.00016#bib.bib69)) and robustness through injected "fictitious knowledge" signals (Cui et al., [2025](https://arxiv.org/html/2606.00016#bib.bib22)). Adversarial settings also reveal vulnerabilities: DITTO formalizes spoofing attacks against watermarked LLMs via knowledge distillation (Ahn et al., [2025](https://arxiv.org/html/2606.00016#bib.bib5)), underscoring the need for evaluation under realistic transformation and attack pipelines.

#### LLMs as Detectors and Explainable Attribution\.

Beyond classical detectors, LLMs are increasingly used as meta-detectors and critics of generated content, reflecting a trend toward black-box and instruction-following detection pipelines (Wang et al., [2024b](https://arxiv.org/html/2606.00016#bib.bib47)). Recent work expands from binary detection to attribution and explanation; e.g., XDAC provides XAI-driven detection and attribution for Korean news comments (Go et al., [2025a](https://arxiv.org/html/2606.00016#bib.bib4); [2025b](https://arxiv.org/html/2606.00016#bib.bib88)), and studies of detectability highlight how author intent and role can affect detection outcomes (Li and Wan, [2025](https://arxiv.org/html/2606.00016#bib.bib66)). Together, these directions emphasize that practical detection increasingly requires robustness, reliability, and interpretable evidence, not only raw accuracy.

#### Benchmarks and Shared Tasks\.

Progress in AI-text detection is tightly coupled with benchmarks that stress generalization across domains, languages, and attack conditions. Widely used datasets include HC3 (Guo et al., [2023](https://arxiv.org/html/2606.00016#bib.bib9)), MGTBench (He et al., [2024](https://arxiv.org/html/2606.00016#bib.bib10)), WritingPrompts (Bao et al., [2023](https://arxiv.org/html/2606.00016#bib.bib12)), RAID (Dugan et al., [2024](https://arxiv.org/html/2606.00016#bib.bib52)), and adversarial extensions such as Adv-HC3 (Peng et al., [2023](https://arxiv.org/html/2606.00016#bib.bib11)); additional resources target broader settings, such as BUST (Cornelius et al., [2024](https://arxiv.org/html/2606.00016#bib.bib13)) and LLMTRACE (Tolstykh et al., [2025](https://arxiv.org/html/2606.00016#bib.bib15)). Beyond text-only benchmarks, MultiSocial supports multilingual social-media detection (Macko et al., [2025](https://arxiv.org/html/2606.00016#bib.bib61)), Double Entendre introduces a multimodal audio-lyrics setting (Frohmann et al., [2025](https://arxiv.org/html/2606.00016#bib.bib67)), and stress-test benchmarks systematically perturb style to probe brittleness (Pedrotti et al., [2025](https://arxiv.org/html/2606.00016#bib.bib68)). Shared tasks further standardize evaluation and accelerate methodology: SemEval-2024 Task 8 targets black-box, multilingual, and multidomain detection (Wang et al., [2024a](https://arxiv.org/html/2606.00016#bib.bib82)), with system analyses such as TrustAI highlighting practical modeling choices (Urlana et al., [2024](https://arxiv.org/html/2606.00016#bib.bib45)). Community efforts such as the GenAIDetect workshop at COLING 2025 (Alam et al., [2025](https://arxiv.org/html/2606.00016#bib.bib83)) and domain-focused shared tasks and datasets (e.g., M-DAIGT for news and academic writing (Lamsiyah et al., [2025](https://arxiv.org/html/2606.00016#bib.bib86)) and AraGenEval for Arabic settings (Abudalfa et al., [2025](https://arxiv.org/html/2606.00016#bib.bib87))) reflect increasing emphasis on robustness, multilinguality, and real-world constraints. These benchmarks and tasks collectively motivate detectors that generalize across generator families while offering transparent, verifiable evidence for their decisions.

#### Positioning of Our Work\.

Our method differs from prior AI\-text detectors by using attention\-based attribution maps from a proxy Transformer as the main detection signal, rather than relying only on surface statistics, likelihood perturbations, or end\-to\-end text representations\. It does not require watermarking or access to the true generator, and it is less dependent on lexical or stylistic cues alone\. More specifically, our framework uses attribution maps as structured inputs to a dedicated detection model\. The experiments suggest that this signal is useful for this task: it performs especially well in generator\-specific settings, remains competitive on standard benchmarks, and shows robustness under alternative\-spelling attacks and cross\-dataset transfer\.

## 3 AEyeDE Framework

![Figure 1](https://arxiv.org/html/2606.00016v1/x1.png)

Figure 1: Overview of AEyeDE. Given a text sample and white-box access to a proxy generator model $G_{\theta}$, we extract an attention-derived attribution matrix $A$ (top left). For decoder-only models, we summarize $A$ by sampling fixed-size square blocks (e.g., $128\times 128$) along the main diagonal, where the strongest local token-token interactions are concentrated (orange boxes) (Xiao et al., [2023](https://arxiv.org/html/2606.00016#bib.bib100); [Ivanitskiy et al.](https://arxiv.org/html/2606.00016#bib.bib98); Qi et al., [2025](https://arxiv.org/html/2606.00016#bib.bib99)); for encoder-decoder models, we use the full cross-attention map. Each block is encoded by a CNN (bottom), producing one embedding per block. These embeddings are aggregated with learnable attention pooling to form a global attribution representation, which is passed to a lightweight MLP classifier to predict human versus LLM authorship (top right). In our implementation, the final low-resolution CNN feature map is $16\times 16$; its spatial cells correspond to regions of the original attribution block and are later used for motif clustering.

We cast AI-text detection as a binary classification problem. The detector has white-box access to a proxy language model $G_{\theta}$, which may be either the suspected generator itself or a surrogate model. Given an observed text sample $x$, we pass $x$ through $G_{\theta}$ and extract an attention-derived attribution map. Within each experiment, the same proxy model is used to compute attributions for all inputs, regardless of whether $x$ is human-written or AI-generated. Our hypothesis is that the internal processing dynamics of $G_{\theta}$ induce systematic differences in these attribution maps for human-written versus machine-generated text.

For decoder\-only models, the attribution map reflects the influence of previous tokens on each current token under causal self\-attention\. For encoder\-decoder models, it reflects the influence of source tokens on each target token under cross\-attention\. In both cases, the attribution map serves as the primary input to the downstream detector\.

Let $\mathcal{V}$ denote the vocabulary, and let $x=(x_1,\ldots,x_T)\in\mathcal{V}^{T}$ denote the text whose authorship is to be predicted. Let $y\in\{0,1\}$ be the corresponding label, where $y=1$ indicates AI-generated text and $y=0$ indicates human-written text. When the proxy model is an encoder-decoder, the target text $x$ is paired with its associated source sequence to compute cross-attention. The detector is a conditional classifier

$$D_{\phi}(x;\theta)=f_{\phi}\big(x,\,A(x;\theta)\big)\in[0,1],$$

where $A(x;\theta)$ is the attention-derived attribution map extracted from $G_{\theta}$, and $f_{\phi}$ is a learnable decision function with parameters $\phi$.

Assume that $G_{\theta}$ is an $L$-layer, $H$-head Transformer. At layer $\ell\in\{1,\ldots,L\}$ and head $h\in\{1,\ldots,H\}$, let $Q^{(\ell,h)}\in\mathbb{R}^{T_t\times d_k}$ and $K^{(\ell,h)}\in\mathbb{R}^{T_s\times d_k}$ denote the query and key matrices. The attention matrix for that head is

$$\widetilde{A}^{(\ell,h)}(x;\theta)=\operatorname{softmax}\!\left(\frac{Q^{(\ell,h)}\big(K^{(\ell,h)}\big)^{\top}}{\sqrt{d_k}}+C\right)\in\mathbb{R}^{T_t\times T_s},$$

where the softmax is applied row-wise and $C\in\mathbb{R}^{T_t\times T_s}$ is the causal mask in the decoder-only setting ($C_{t,s}=0$ for $s\le t$ and $C_{t,s}=-\infty$ for $s>t$). In the encoder-decoder setting, $\widetilde{A}^{(\ell,h)}$ denotes cross-attention with the same target-by-source orientation and no causal mask ($C=0$). We then average across heads and layers to obtain a single attribution map:

$$A(x;\theta)=\frac{1}{LH}\sum_{\ell=1}^{L}\sum_{h=1}^{H}\widetilde{A}^{(\ell,h)}(x;\theta)\in\mathbb{R}^{T_t\times T_s}.$$

Each entry

$$a_{t,s}=\big[A(x;\theta)\big]_{t,s}$$

gives the average attention mass assigned from target position $t$ to source/previous position $s$ under the proxy model. In the decoder-only setting, $T_s=T_t=T$; in the encoder-decoder setting, $T_s$ and $T_t$ may differ. We use the Inseq library to extract these attention-derived attributions (Sarti et al., [2023](https://arxiv.org/html/2606.00016#bib.bib89)).
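To make the two formulas above concrete, here is a minimal pure-Python sketch on toy nested-list matrices. It computes one decoder-only head's weights $\widetilde{A}^{(\ell,h)}$ from query and key matrices with the causal mask $C$, then averages a list of such maps over all $(\ell,h)$ pairs to obtain $A$. The function names are our own and hypothetical; the paper extracts these maps with Inseq rather than recomputing them like this.

```python
import math

def causal_attention(Q, K):
    """One head of softmax(Q K^T / sqrt(d_k) + C) with a causal mask C.

    Q and K are T x d_k nested lists (decoder-only, so T_t = T_s = T).
    Entries with s > t receive -inf from C, i.e. exactly zero weight.
    """
    d_k = len(Q[0])
    T = len(K)
    A = []
    for t, q in enumerate(Q):
        # Scaled dot products for the unmasked positions s = 0..t only.
        logits = [sum(a * b for a, b in zip(q, K[s])) / math.sqrt(d_k)
                  for s in range(t + 1)]
        top = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(v - top) for v in logits]
        total = sum(exps)
        A.append([e / total for e in exps] + [0.0] * (T - t - 1))
    return A

def average_maps(maps):
    """A = (1 / (L * H)) * sum of all (layer, head) maps of equal shape."""
    n = len(maps)
    rows, cols = len(maps[0]), len(maps[0][0])
    return [[sum(M[i][j] for M in maps) / n for j in range(cols)]
            for i in range(rows)]

# Toy example: L * H = 2 head maps over T = 3 tokens.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
zeros = [[0.0, 0.0] for _ in range(3)]
A = average_maps([causal_attention(Q, K), causal_attention(zeros, zeros)])
for row in A:
    print([round(v, 3) for v in row])
```

Because every head's row is a probability distribution over $s\le t$, the averaged map keeps that property: each row of $A$ sums to one and its upper triangle stays zero.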

The detector architecture is shown in Fig. [1](https://arxiv.org/html/2606.00016#S3.F1). For decoder-only models, we summarize $A$ by extracting square blocks of size $w_a$ along a diagonal traversal; in all experiments, we set $w_a=128$. In encoder-decoder experiments, inputs are limited to at most 128 tokens, so we use a single block that covers the full cross-attention map.[^1] This design is motivated by the empirical observation that, in decoder-only models, the most informative attribution mass is concentrated near the main diagonal (Xiao et al., [2023](https://arxiv.org/html/2606.00016#bib.bib100); [Ivanitskiy et al.](https://arxiv.org/html/2606.00016#bib.bib98); Qi et al., [2025](https://arxiv.org/html/2606.00016#bib.bib99)).

[^1]: For example, on HC3 (see Section [4.1](https://arxiv.org/html/2606.00016#S4.SS1)), the $128\times 128$ diagonal band exhibits higher entropy than the off-diagonal region (8.89 vs. 7.78), together with a higher mean attribution value (0.007 vs. 0.001) and a larger standard deviation (0.01 vs. 0.002). Processing four $128\times 128$ patches is also considerably more efficient than processing a full $1024\times 1024$ matrix.
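The diagonal-versus-off-diagonal comparison that motivates this design can be sketched as below. This is an illustration under an explicit assumption, not the authors' procedure: we take the "band" to be the union of the $w\times w$ blocks on the main diagonal and the "off-diagonal region" to be every other causal entry ($s\le t$). Only the mean attribution is computed; the function name is hypothetical.

```python
def band_vs_offdiag_means(A, w):
    """Mean attribution inside vs. outside the block-diagonal band.

    Assumed definitions (illustrative only): entry (t, s) is in the band
    when t and s fall in the same w x w diagonal block (t // w == s // w);
    every other causal entry (s <= t) counts as off-diagonal.
    """
    band, off = [], []
    for t, row in enumerate(A):
        for s in range(t + 1):  # causal map: entries with s > t are zero
            (band if t // w == s // w else off).append(row[s])

    def mean(xs):
        return sum(xs) / len(xs) if xs else 0.0

    return mean(band), mean(off)

# Toy 4 x 4 causal attribution map with w = 2 (two diagonal blocks).
A = [
    [1.0, 0.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
    [0.1, 0.1, 0.8, 0.0],
    [0.1, 0.1, 0.4, 0.4],
]
print(band_vs_offdiag_means(A, w=2))  # band mean ~0.6, off-diagonal mean ~0.1
```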

Let $(p_k,q_k)$ denote the top-left corner of the $k$-th block. We define

$$A_k=A[p_k:p_k+w_a,\;q_k:q_k+w_a]\in\mathbb{R}^{w_a\times w_a}. \tag{1}$$

Let $m_t\in\{0,1\}^{T_t}$ and $m_s\in\{0,1\}^{T_s}$ denote the target- and source-side padding masks, respectively, and define the validity mask

$$P=m_t\,m_s^{\top}\in\{0,1\}^{T_t\times T_s}, \tag{2}$$

which marks non-padding attribution entries. A block is considered valid if it contains at least one non-padding entry:

$$u_k=\mathbb{I}\!\left(\sum_{i,j}P[p_k+i,\;q_k+j]>0\right)\in\{0,1\}. \tag{3}$$

Each valid block is encoded by a CNN-based attribution encoder

$$b_k=E_{\text{attr}}(A_k)\in\mathbb{R}^{d_{\text{attr}}}. \tag{4}$$
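Equations (1) to (3) can be sketched in pure Python as follows. This is a minimal illustration under stated assumptions: matrices are nested lists, blocks that run past the matrix edge are zero-padded, and the non-overlapping corners $(p_k,q_k)=(k\,w_a,\,k\,w_a)$ are a hypothetical choice of diagonal traversal, since the exact traversal is not specified here. All function names are our own.

```python
def diagonal_corners(T, w):
    """Hypothetical non-overlapping diagonal traversal: (p_k, q_k) = (k*w, k*w)."""
    n_blocks = -(-T // w)  # ceil(T / w)
    return [(k * w, k * w) for k in range(n_blocks)]

def extract_block(A, p, q, w):
    """Eq. (1): A[p:p+w, q:q+w], zero-padded where the block leaves the matrix."""
    rows, cols = len(A), len(A[0])
    return [[A[p + i][q + j] if p + i < rows and q + j < cols else 0.0
             for j in range(w)]
            for i in range(w)]

def validity_mask(m_t, m_s):
    """Eq. (2): P = m_t m_s^T, so P[t][s] = 1 iff neither position is padding."""
    return [[a * b for b in m_s] for a in m_t]

def block_is_valid(P, p, q, w):
    """Eq. (3): u_k = 1 iff the block holds at least one non-padding entry."""
    rows, cols = len(P), len(P[0])
    return int(any(P[p + i][q + j]
                   for i in range(w) for j in range(w)
                   if p + i < rows and q + j < cols))

# Toy example: 6 positions, the last 3 are padding, blocks of width w = 4.
m = [1, 1, 1, 0, 0, 0]
P = validity_mask(m, m)
A = [[0.1] * 6 for _ in range(6)]  # placeholder attribution values
for p, q in diagonal_corners(len(m), w=4):
    block = extract_block(A, p, q, 4)
    print((p, q), len(block), block_is_valid(P, p, q, 4))
```

In this toy case the second block covers only padding positions, so its $u_k=0$ and it would be excluded from pooling.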
The attribution encoder $E_{\text{attr}}$ treats $A_k$ as a single-channel image. It applies a sequence of convolutional blocks with channel progression

$$1\rightarrow 32\rightarrow 64\rightarrow 128\rightarrow 128\rightarrow 256,$$

interleaved with $2\times 2$ max-pooling, followed by global average pooling and a two-layer MLP. Writing $g(\cdot)$ for the convolutional pipeline plus global average pooling, we obtain

$$r_k=g(A_k)\in\mathbb{R}^{256}, \tag{5}$$

and then

$$b_k=W_2\,\rho\big(\mathrm{BN}(W_1 r_k+b_1)\big)+b_2, \tag{6}$$

where $\rho(\cdot)$ denotes ReLU and $\mathrm{BN}$ denotes batch normalization. In our implementation, the final convolutional feature map is $16\times 16$; its spatial cells can be mapped back to local regions of the original attribution block for motif clustering (see Sec. [5](https://arxiv.org/html/2606.00016#S5)).
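The relation between the $16\times 16$ feature map and the $8\times 8$ patches used for motif clustering can be checked with simple arithmetic. The sketch assumes (our inference, not stated in the text) that the convolutions use "same" padding, so only the $2\times 2$ max-pooling layers change the spatial size; a $128\times 128$ block then needs three pooling steps to reach $16\times 16$. The helper names are hypothetical, and the mapping gives each cell's nominal footprint only, ignoring the larger receptive field of stacked convolutions.

```python
def feature_map_size(w_a, num_pools):
    """Spatial size after num_pools non-overlapping 2x2 max-pools.

    Assumes 'same'-padded convolutions, so the convolutions keep the size.
    """
    size = w_a
    for _ in range(num_pools):
        size //= 2
    return size

def cell_to_region(i, j, w_a, fmap):
    """Row and column ranges (half-open) of the w_a x w_a attribution block
    nominally covered by cell (i, j) of an fmap x fmap feature map."""
    stride = w_a // fmap
    return (i * stride, (i + 1) * stride), (j * stride, (j + 1) * stride)

fmap = feature_map_size(128, num_pools=3)  # num_pools = 3 is an assumption
print(fmap)                                # 16
print(128 // fmap)                         # 8, i.e. one 8x8 patch per cell
print(cell_to_region(2, 5, 128, fmap))     # ((16, 24), (40, 48))
```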

[^2]: We evaluate this text-augmented variant for encoder-decoder models. In practice, attention-only representations are consistently competitive with the text-augmented variant, so we emphasize the attention-based results in the main paper.

Optionally, we augment the attribution representation with a text representation.[^2] To handle long sequences, we pad the input to a multiple of a fixed window size $w_t$ and partition it into

$$N=\left\lceil\frac{T}{w_t}\right\rceil$$

contiguous chunks:

$$x^{(i)}=x_{(i-1)w_t+1\,:\,i\,w_t},\qquad i=1,\ldots,N. \tag{7}$$

A text encoder $E_{\text{text}}$ maps each chunk to a vector

$$c_i=E_{\text{text}}\big(x^{(i)}\big)\in\mathbb{R}^{d_{\text{text}}}. \tag{8}$$

We ignore fully padded chunks via a validity indicator $v_i\in\{0,1\}$.
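A minimal sketch of the chunking in Eq. (7) and the validity indicators $v_i$ follows. It assumes token IDs in a Python list, a hypothetical padding ID of 0, and optional padding to a batch-level length `max_len` (our assumption about how fully padded chunks arise). Indices are 0-based, unlike Eq. (7), and the function name is our own.

```python
PAD_ID = 0  # hypothetical padding token id

def chunk_tokens(token_ids, w_t, max_len=None):
    """Split a token sequence into contiguous chunks of width w_t.

    Returns the chunks and validity flags v_i (1 if the chunk contains at
    least one real token, 0 if it consists entirely of padding).
    """
    T = len(token_ids)
    length = T if max_len is None else max_len
    if length < T:
        raise ValueError("max_len must be at least len(token_ids)")
    n_chunks = -(-length // w_t)  # N = ceil(length / w_t)
    padded = list(token_ids) + [PAD_ID] * (n_chunks * w_t - T)
    chunks = [padded[i * w_t:(i + 1) * w_t] for i in range(n_chunks)]
    valid = [1 if i * w_t < T else 0 for i in range(n_chunks)]
    return chunks, valid

chunks, valid = chunk_tokens(list(range(1, 11)), w_t=4, max_len=16)
print(chunks)  # [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 0, 0], [0, 0, 0, 0]]
print(valid)   # [1, 1, 1, 0]
```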

We aggregate sets of per-block (or per-chunk) vectors using learnable attention pooling. Given vectors $\{z_i\}_{i=1}^{n}$ and a corresponding validity mask $\{m_i\}_{i=1}^{n}$, we define

$$\begin{aligned}
s_i &= \omega^{\top}\tanh(W_p z_i), && (9)\\
\alpha_i &= \frac{m_i\exp(s_i)}{\sum_j m_j\exp(s_j)}, && (10)\\
\mathrm{Pool}\big(\{z_i\},\{m_i\}\big) &= \sum_i \alpha_i z_i, && (11)
\end{aligned}$$

where $\omega$ and $W_p$ are learnable parameters. Applying this operator to the text and attribution branches gives

$$h_{\text{text}}=\mathrm{Pool}\big(\{c_i\},\{v_i\}\big),\qquad h_{\text{attr}}=\mathrm{Pool}\big(\{b_k\},\{u_k\}\big). \tag{12}$$
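Equations (9) to (11) amount to a masked softmax over per-element scores. The pure-Python sketch below is illustrative only: small nested-list weights stand in for the learnable $W_p$ and $\omega$, the function name is hypothetical, and the largest valid score is subtracted before exponentiating, which leaves $\alpha_i$ unchanged but avoids overflow.

```python
import math

def attention_pool(z, mask, W_p, omega):
    """Masked attention pooling, Eqs. (9)-(11).

    z: n vectors of length d; mask: n flags in {0, 1};
    W_p: d' x d matrix (list of rows); omega: vector of length d'.
    Returns the pooled vector and the weights alpha_i.
    """
    # Eq. (9): s_i = omega^T tanh(W_p z_i)
    scores = []
    for z_i in z:
        hidden = [math.tanh(sum(w * x for w, x in zip(row, z_i))) for row in W_p]
        scores.append(sum(o * h for o, h in zip(omega, hidden)))
    # Eq. (10): softmax restricted to valid elements (m_i = 1).
    valid = [s for s, m in zip(scores, mask) if m]
    if not valid:
        raise ValueError("at least one element must be valid")
    top = max(valid)
    weights = [math.exp(s - top) if m else 0.0 for s, m in zip(scores, mask)]
    total = sum(weights)
    alphas = [w / total for w in weights]
    # Eq. (11): weighted sum of the input vectors.
    pooled = [sum(a * z_i[k] for a, z_i in zip(alphas, z)) for k in range(len(z[0]))]
    return pooled, alphas

# Three block embeddings; the third comes from an invalid (padding) block.
z = [[1.0, 0.0], [0.0, 1.0], [100.0, 100.0]]
mask = [1, 1, 0]
pooled, alphas = attention_pool(z, mask, W_p=[[1.0, -1.0]], omega=[2.0])
print(alphas, pooled)
```

The masked element gets exactly zero weight, so its (large) embedding has no influence on the pooled vector.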
The final representation is

$$h=\begin{cases}[\,h_{\text{text}};\,h_{\text{attr}}\,], & \text{if the text branch is used},\\[5pt] h_{\text{attr}}, & \text{otherwise}.\end{cases} \tag{13}$$

A two-layer MLP then produces a scalar logit $\ell$ and the corresponding probability of AI authorship:

$$\ell=w^{\top}\rho(W_h h+b_h)+b, \tag{14}$$

$$p(y=1\mid x,A)=\sigma(\ell), \tag{15}$$

where $\sigma$ denotes the logistic sigmoid. We train the detector using binary cross-entropy on labeled examples.
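As a rough illustration of Eqs. (13) to (15) and the training objective, the sketch below concatenates two pooled vectors, applies the two-layer MLP head with ReLU, and evaluates binary cross-entropy from the logit. The weights are toy values and the function names are hypothetical; the loss uses a standard numerically stable rewriting of $-[y\log\sigma(\ell)+(1-y)\log(1-\sigma(\ell))]$, which is our choice rather than something specified in the text.

```python
import math

def mlp_logit(h, W_h, b_h, w, b):
    """Eq. (14): ell = w^T relu(W_h h + b_h) + b."""
    hidden = [max(0.0, sum(wij * hj for wij, hj in zip(row, h)) + bi)
              for row, bi in zip(W_h, b_h)]
    return sum(wi * xi for wi, xi in zip(w, hidden)) + b

def sigmoid(x):
    """Eq. (15): p(y = 1 | x, A) = sigma(ell), computed without overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def bce_from_logit(ell, y):
    """-[y log sigma(ell) + (1 - y) log(1 - sigma(ell))] in a stable form."""
    return max(ell, 0.0) - ell * y + math.log1p(math.exp(-abs(ell)))

# Eq. (13) with the text branch: h = [h_text; h_attr] (list concatenation).
h_text, h_attr = [1.0], [-2.0]
h = h_text + h_attr
ell = mlp_logit(h, W_h=[[1.0, 0.0], [0.0, 1.0]], b_h=[0.0, 0.0], w=[2.0, 3.0], b=-1.0)
print(ell, sigmoid(ell), bce_from_logit(ell, y=1))
```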

In this work, prompts are discarded in the decoder-only setting. Thus, the detector conditions only on the observed continuation $x_{1:T}$ and the attribution map derived from it, rather than on the full prompt-response pair.

## 4 Experimental Results

### 4.1 Datasets

In this study, we evaluate both encoder-decoder and decoder-only language models. For the encoder-decoder setting, we conduct experiments on three language pairs: French–English (fr-en) and German–English (de-en) using WMT14 (Bojar et al., [2014](https://arxiv.org/html/2606.00016#bib.bib51)), and Arabic–English (ar-en) using the UN Parallel Corpus (Ziemski et al., [2016](https://arxiv.org/html/2606.00016#bib.bib50)). For each language pair, we construct a dataset of 200k source–target examples consisting of gold (human) reference translations and the corresponding model-generated translations produced by the Marian-MT model (Tiedemann and Thottingal, [2020](https://arxiv.org/html/2606.00016#bib.bib91)). We keep only pairs in which both the source and the target contain at most 128 tokens.

For the decoder-only setting, we use HC3 (Guo et al., [2023](https://arxiv.org/html/2606.00016#bib.bib9)), RAID (Dugan et al., [2024](https://arxiv.org/html/2606.00016#bib.bib52)), and Beemo (Artemova et al., [2025](https://arxiv.org/html/2606.00016#bib.bib103)). HC3 provides paired human and ChatGPT (OpenAI, [2023](https://arxiv.org/html/2606.00016#bib.bib92)) responses; we remove examples exceeding 1,024 tokens and retain approximately 24k samples per class. RAID contains human-written and model-generated text spanning multiple domains. From RAID, we use outputs from Cohere, Llama 2 70B (chat), GPT-2 XL, and Mistral-7B, together with the corresponding human responses to the same prompts. Because RAID is imbalanced (with fewer human than model-generated examples), we downsample each model's generated subset to 26,700 examples (the minimum across the selected models) and use all available human examples (12,900). We use Beemo as an external test set for cross-dataset out-of-distribution experiments; after our preprocessing, it contains 2,178 human-written samples and an equal number of machine-generated samples from multiple LLMs. In all experiments, we use a class-balanced cross-entropy loss to mitigate the imbalance between human- and AI-generated samples.
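The exact form of the class-balanced loss is not spelled out in the text. One common choice, shown here purely as an assumption, weights each class $c$ by $N/(K\,N_c)$, where $N_c$ is its number of training examples and $K=2$ is the number of classes, so that both classes contribute the same total weight. The counts in the example are illustrative, pairing the 12,900 RAID human examples with a single 26,700-example generator subset, and the function names are hypothetical.

```python
import math

def balanced_class_weights(counts):
    """w_c = N / (K * N_c): inverse-frequency weights (an assumed scheme)."""
    total = sum(counts.values())
    k = len(counts)
    return {label: total / (k * n) for label, n in counts.items()}

def weighted_bce(p, y, weights, eps=1e-12):
    """Binary cross-entropy for one example, scaled by its class weight."""
    p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
    loss = -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return weights[y] * loss

# Illustrative counts: label 0 = human, label 1 = AI-generated.
weights = balanced_class_weights({0: 12_900, 1: 26_700})
print(weights)
print(weighted_bce(0.9, 1, weights), weighted_bce(0.9, 0, weights))
```

With these weights, the rarer human class is up-weighted (about 1.53) and the AI class down-weighted (about 0.74), so misclassifying a human text costs more per example.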

We split each dataset into 90% training, 5% validation, and 5% test. In all decoder-only experiments, we discard the prompts, since in practice a text sample is usually observed without the prompt that generated it. We use Llama 3.1 8B (Grattafiori et al., [2024](https://arxiv.org/html/2606.00016#bib.bib93)), Cohere (c4ai-command-r7b-12-2024) (Cohere et al., [2025](https://arxiv.org/html/2606.00016#bib.bib94)), GPT-neo (Black et al., [2021](https://arxiv.org/html/2606.00016#bib.bib95)), and Mistral (Ministral-3-8B-Instruct-2512) (Jiang et al., [2023](https://arxiv.org/html/2606.00016#bib.bib97)) as proxy models from the same families as the generators of the RAID and HC3 datasets, and extract the attributions from them.
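
A 90/5/5 split can be drawn as sketched below. Splitting per class (stratification) is our assumption, chosen so that each split keeps the dataset's human/AI ratio; the text does not say how the split was drawn, and the function name is hypothetical.

```python
import random

def split_90_5_5(labels, seed=0):
    """Return (train, val, test) index lists with a 90/5/5 split drawn per class."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    train, val, test = [], [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        n_hold = round(0.05 * len(idx))  # size of each held-out split for this class
        val += idx[:n_hold]
        test += idx[n_hold:2 * n_hold]
        train += idx[2 * n_hold:]
    return train, val, test

labels = [0] * 200 + [1] * 400  # toy data: 200 human, 400 AI
train, val, test = split_90_5_5(labels)
print(len(train), len(val), len(test))  # prints: 540 30 30
```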

### 4.2 Evaluation Metrics

For evaluation, we report Accuracy, Precision, Recall, F1, Area Under the ROC Curve (AUC), and the true-positive rate at a fixed low false-positive rate, namely $\mathrm{TPR}@\mathrm{FPR}=0.01$. Here $\mathrm{TPR}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}}$ and $\mathrm{FPR}=\frac{\mathrm{FP}}{\mathrm{FP}+\mathrm{TN}}$, with AI-generated text as the positive class; thus $\mathrm{FPR}=0.01$ corresponds to falsely labeling 1% of human-written samples as AI-generated. This is the critical "high-precision" regime: in academic or forensic settings, false positives (accusing a human of using AI) are unacceptable, so a detector must achieve a high TPR at a very low FPR to be deployable (Ayoobi et al., [2025](https://arxiv.org/html/2606.00016#bib.bib14)). For all threshold-dependent metrics, thresholds were selected on the validation split and then fixed for test evaluation.
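
The fixed-threshold protocol can be sketched as follows. This minimal NumPy illustration assumes that higher scores mean "more likely AI-generated" and that the threshold is chosen so that at most 1% of validation human samples score above it; the function names are hypothetical and not taken from the AEyeDE code.

```python
import numpy as np

def threshold_at_fpr(human_scores, target_fpr=0.01):
    """Threshold t such that at most target_fpr of the given human (negative)
    scores lie strictly above t; samples with score > t are flagged as AI."""
    neg = np.sort(np.asarray(human_scores, dtype=float))
    n = neg.size
    k = min(int(np.floor(target_fpr * n)), n - 1)  # negatives allowed above t
    return float(neg[n - k - 1])

def tpr_fpr(scores, is_ai, threshold):
    """TPR = TP / (TP + FN) and FPR = FP / (FP + TN) at a fixed threshold."""
    scores = np.asarray(scores, dtype=float)
    is_ai = np.asarray(is_ai, dtype=bool)
    flagged = scores > threshold
    return float(flagged[is_ai].mean()), float(flagged[~is_ai].mean())
```

In use, `threshold_at_fpr` is called once on the validation human scores, and the returned value is passed unchanged to `tpr_fpr` on the test split, so the test FPR may drift slightly from 1%.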

### 4.3 Baseline Models

For the encoder-decoder setting, our baseline is a custom three-layer Transformer-based classifier trained on paired source-target text. For the decoder-only setting, we report results for Fast-DetectGPT (Curvature) (Bao et al., [2023](https://arxiv.org/html/2606.00016#bib.bib12)), Binoculars (Hans et al., [2024](https://arxiv.org/html/2606.00016#bib.bib104)), Rank and LogRank (Su et al., [2023](https://arxiv.org/html/2606.00016#bib.bib37)), GLTR (Gehrmann et al., [2019](https://arxiv.org/html/2606.00016#bib.bib31)), and a RoBERTa-based detector released by SuperAnnotate ([https://huggingface.co/SuperAnnotate/ai-detector](https://huggingface.co/SuperAnnotate/ai-detector)).

Curvature compares the conditional log-probability of each observed token with that of plausible alternative tokens in the same context and aggregates these comparisons into a sample-level statistic called conditional probability curvature. Machine-generated text tends to have higher curvature because its token choices are more often locally optimal under the model. Binoculars computes the ratio between a text's log-perplexity under one model and its cross-perplexity between two models, i.e., the average cross-entropy between their next-token distributions (we use Llama-3.1-8B and Llama-3.1-8B-Instruct as the model pair). The intuition is that machine-generated text not only has low perplexity under an LLM but also shows unusually high agreement across related models. GLTR detects machine-generated text by scoring each token under a reference language model using its probability, rank, and predictive entropy. Rank is a token-level measure that extracts the rank of each token under a language model and averages it across the sequence; lower Rank values indicate that the sample is more likely to be machine-generated. LogRank instead averages the logarithm of each token's rank.
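
To make these token-level scores concrete, the NumPy sketch below computes them from next-token logits of shape (T, V). It assumes `logits[t]` is the model's prediction for token `ids[t]` (logits already shifted), uses the analytic form of curvature in which the same model both samples and scores, and follows our reading of the Binoculars formulation (log-perplexity under the "performer" divided by the observer-performer cross-entropy). It is an illustration, not the baselines' reference implementations, and all function names are hypothetical.

```python
import numpy as np

def log_softmax(logits):
    """Row-wise log-softmax over the vocabulary axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def token_ranks(logits, ids):
    """1-based rank of each observed token (1 = the model's top prediction)."""
    obs = logits[np.arange(len(ids)), ids]
    return (logits > obs[:, None]).sum(axis=-1) + 1

def rank_score(logits, ids):
    """Rank: mean token rank; lower values suggest machine-generated text."""
    return float(token_ranks(logits, ids).mean())

def log_rank_score(logits, ids):
    """LogRank: mean of the per-token log-ranks."""
    return float(np.log(token_ranks(logits, ids)).mean())

def predictive_entropy(logits):
    """Per-position entropy of the next-token distribution (a GLTR-style feature)."""
    lp = log_softmax(logits)
    return -(np.exp(lp) * lp).sum(axis=-1)

def curvature(logits, ids):
    """Analytic conditional probability curvature with one model for sampling and
    scoring: (log p(x) - E[log p(x~)]) / sqrt(Var[log p(x~)]), summed over positions."""
    lp = log_softmax(logits)
    p = np.exp(lp)
    log_lik = lp[np.arange(len(ids)), ids].sum()
    mean = (p * lp).sum(axis=-1)                  # expected log-prob per position
    var = (p * lp ** 2).sum(axis=-1) - mean ** 2  # its variance per position
    return float((log_lik - mean.sum()) / np.sqrt(var.sum()))

def binoculars(observer_logits, performer_logits, ids):
    """Log-perplexity under the performer divided by the observer-performer
    cross-entropy (log cross-perplexity); lower values suggest machine text."""
    lp_perf = log_softmax(performer_logits)
    log_ppl = -lp_perf[np.arange(len(ids)), ids].mean()
    p_obs = np.exp(log_softmax(observer_logits))
    log_xppl = -(p_obs * lp_perf).sum(axis=-1).mean()
    return float(log_ppl / log_xppl)
```

In practice the logit arrays would come from the proxy model's forward pass over one text; the sketch only covers the scoring arithmetic.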

### 4.4 Results

We evaluate the proposed framework in five experimental settings, each addressing a research question about the use of attention attribution for this task, followed by an analysis of attribution-region motifs. First, we apply AEyeDE to encoder-decoder attention maps and examine the effect of combining attribution maps with the text signal. Second, we assess in-domain detection performance, where the proxy model matches the generator of the evaluated text. Third, we consider a unified setting in which samples from multiple LLMs are mixed during training and evaluation includes a held-out generator. Fourth, we evaluate robustness under adversarial attacks. Fifth, we examine cross-dataset generalization on an external benchmark.

#### Do attribution maps provide useful detection signals in encoder\-decoder models?

In encoder–decoder architectures, cross-attention models the dependencies between source and target tokens by letting the decoder focus selectively on the most relevant encoder states during generation. In our task, we use the entire 128×128 cross-attention matrix as the input signal (since we selected samples of at most 128 tokens, shorter pairs are padded). Table [1](https://arxiv.org/html/2606.00016#S4.T1) reports results for AI-translation detection across the three Marian-MT language pairs. Using attribution maps alone (CNN), our method consistently outperforms the text-only baseline on all metrics, with F1 gains of +3.6 for ar-en (74.6 vs. 71.0), +6.7 for de-en (81.5 vs. 74.8), and +8.3 for fr-en (85.1 vs. 76.8). Adding target-text features (CNN+text) yields a further but smaller improvement over CNN in every case (+2.0, +1.7, and +1.4 F1, respectively), suggesting that most of the discriminative signal is already captured by the attribution structure, while the text provides complementary information.
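
One way to build such a fixed-size input is sketched below. In Hugging Face `transformers`, calling a Marian model with `output_attentions=True` returns `cross_attentions`, one tensor per decoder layer shaped (batch, heads, target_len, source_len). The NumPy function takes one example's per-layer arrays, averages over layers and heads (the Limitations section states that attributions are averaged over all layers and heads), zero-pads to 128×128, and returns the padding mask as an outer product of target and source token masks. The function name and the use of zero padding are assumptions for illustration.

```python
import numpy as np

def aggregate_cross_attention(layer_attns, tgt_len, src_len, size=128):
    """Average cross-attention over layers and heads, zero-pad to (size, size),
    and return the map together with its padding mask.

    layer_attns: sequence of arrays shaped (heads, tgt, src), one per decoder layer
    (e.g. one example's slice of each tensor in `outputs.cross_attentions`).
    """
    if tgt_len > size or src_len > size:
        raise ValueError("sequence longer than the fixed map size")
    stacked = np.stack([np.asarray(a, dtype=float) for a in layer_attns])  # (L, H, T, S)
    mean_map = stacked.mean(axis=(0, 1))[:tgt_len, :src_len]
    out = np.zeros((size, size))
    out[:tgt_len, :src_len] = mean_map
    m_tgt = (np.arange(size) < tgt_len).astype(float)  # target-token mask
    m_src = (np.arange(size) < src_len).astype(float)  # source-token mask
    return out, np.outer(m_tgt, m_src)                  # mask = outer product of the two
```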

Table 1: Performance on Marian-MT generated translations and their attribution maps. CNN and CNN+text configurations are based on AEyeDE.

Table 2: Results in the individual setting, where the attention attributions for both human- and AI-generated text are extracted from the model corresponding to the generator specified in the dataset and used to train AEyeDE.

Table 3: Results in the unified setting, where samples corresponding to the Mistral model were excluded from the training of AEyeDE.

Table 4: Results on HC3 using GPT-neo to extract attention attributions.

#### How effective is AEyeDE when the proxy model matches the generator family?

Using the RAID dataset, we train a separate detector for each generator family using only samples generated by that family, along with the corresponding matched human texts (e.g., the detector for Llama-generated samples is trained on attributions extracted with the Llama proxy model). We refer to this configuration as the individual setting. For the model-specific baselines (Curvature, GLTR, Rank, and LogRank), the underlying scores are also extracted from the corresponding generator model. Note that both LLMs used for Binoculars are from the Llama family.

In the individual setting (Table [2](https://arxiv.org/html/2606.00016#S4.T2)), AEyeDE achieves near-ceiling performance across all four generator families, with F1 ranging from 96.34 to 99.25 and AUC from 98.81 to 99.86, and it is the top-performing method on every generator. The gains are especially large for GPT-neo and Mistral. For GPT-neo, the strongest baseline reaches 83.70 F1 (RoBERTa), whereas AEyeDE attains 97.96. For Mistral, the strongest baseline reaches 82.71 F1 (RoBERTa), compared with 96.34 for AEyeDE. On Cohere, the strongest baselines are more competitive, with RoBERTa reaching 88.17 F1 and Curvature 87.81, but AEyeDE still improves substantially to 97.95. Baseline behavior is more heterogeneous for Llama, where several methods perform strongly, including Binoculars (98.58 F1, 99.66 AUC), RoBERTa (96.98 F1), and GLTR (90.70 F1). Even in this setting, which is more favorable to the baselines, AEyeDE remains best overall, achieving 99.25 F1, 99.86 AUC, and 99.33 TPR@FPR=0.01.

More broadly, the likelihood- and rank-based baselines, especially on Cohere, GPT-neo, and Mistral, yield recall near 100% but precision close to the class prior (∼67%), i.e., they flag almost every sample as AI-generated. In contrast, AEyeDE maintains both high precision and high recall across all generators, which suggests that attention-based attribution signals are highly discriminative when the proxy model matches the generator family.

We likewise train and test AEyeDE on the HC3 dataset. Table [4](https://arxiv.org/html/2606.00016#S4.T4) reports results on HC3 using attribution maps extracted with GPT-neo as the proxy model, since HC3's machine-generated responses come only from a GPT-family model (ChatGPT). Overall, HC3 appears substantially less challenging than RAID, as nearly all methods achieve very high performance. AEyeDE remains competitive, achieving 96.85 accuracy, 95.68 precision, 98.27 recall, 96.96 F1, and 99.51 AUC. Although Curvature obtains the best F1 and RoBERTa achieves the highest AUC, AEyeDE is notable because it reaches near-ceiling performance using a fundamentally different signal. Moreover, the comparison with RoBERTa should be interpreted cautiously, since HC3 was part of that model's training data. We use the models trained on RAID, together with the model trained on HC3, to analyze informative motif patterns for human vs. AI-generated text (see Section [5](https://arxiv.org/html/2606.00016#S5)).

#### Do attribution\-based representations transfer to unseen generators?

In a different setup, we evaluate how well our model generalizes to a generator unseen at training time. We train on a unified mixture of all RAID generators except Mistral, while the test set includes Mistral's generations. To control for training set size, we subsample each included generator's generated text to match the per-model training budget used in the individual setting. This split assesses whether representations learned from a subset of generator families transfer to an unseen model at test time.

In the unified setting (Table [3](https://arxiv.org/html/2606.00016#S4.T3)), performance is lower than in the generator-matched individual setting, reflecting the increased heterogeneity of pooled training and the greater difficulty of cross-generator transfer. Even so, AEyeDE achieves the best accuracy, AUC, and TPR@FPR=0.01 for every generator family. Its AUC ranges from 78.86 to 80.67, and its TPR@FPR=0.01 ranges from 56.80 to 58.97, consistently exceeding all baselines. At the same time, AEyeDE adopts a markedly more conservative operating point than most competing methods. Across all generators, it maintains very high precision (90.98–96.57) but substantially lower recall (61.06–65.55), which leads to F1 scores in the 74.72–76.19 range. In contrast, Binoculars, GLTR, LogRank, and Rank operate in an almost-all-positive regime, with recall near 100% but precision close to the class prior (∼67%). Their F1 values therefore appear superficially strong (∼80.2–80.7), but this comes at the cost of many false positives. RoBERTa is the strongest balanced baseline, with F1 between 76.21 and 78.21 and AUC between 77.17 and 78.49, yet it still trails AEyeDE on the threshold-independent metrics and at the low-FPR operating point.
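
The "superficially strong" F1 of near-all-positive detectors follows directly from the class prior: a detector that flags every sample as AI-generated has recall 1 and precision equal to the AI share $p$, so $F1 = 2p/(1+p)$. The sketch below checks this arithmetic, assuming the test split keeps the RAID per-generator ratio of 26,700 AI to 12,900 human samples; the helper name is hypothetical.

```python
def all_positive_metrics(n_ai, n_human):
    """Precision, recall, F1 and accuracy of a detector that flags every sample as AI."""
    precision = n_ai / (n_ai + n_human)  # equals the AI class prior p
    recall = 1.0
    f1 = 2 * precision * recall / (precision + recall)  # = 2p / (1 + p)
    accuracy = precision  # only the AI samples are classified correctly
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

raid = all_positive_metrics(26_700, 12_900)  # assumed RAID per-generator ratio
beemo = all_positive_metrics(2_178, 2_178)   # balanced Beemo test set
print(round(raid["f1"], 4), round(beemo["f1"], 4))  # prints: 0.8054 0.6667
```

With $p \approx 0.674$ a trivial detector already reaches $F1 \approx 0.805$, in line with the ∼80.2–80.7 F1 of Binoculars, GLTR, LogRank, and Rank, which is why AUC and TPR@FPR=0.01 are the more informative comparisons here. On a balanced set such as Beemo, the same trivial detector would score $F1 = 2/3$.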

These results suggest that attribution-based representations do transfer across generator families, but conservatively: when trained on mixed generators, AEyeDE becomes more selective, flagging fewer human texts incorrectly while missing a larger fraction of AI-generated samples at the default threshold. Importantly, its consistent advantage in AUC and TPR@FPR=0.01 indicates that it provides the best overall separation between human and machine text under cross-generator transfer.

#### How robust is AEyeDE to adversarial perturbations?

AI-generated text detectors are known to be brittle under adversarial perturbations (Krishna et al., [2023](https://arxiv.org/html/2606.00016#bib.bib105)). We evaluate robustness to the paraphrasing and alternative-spelling attacks provided in the RAID dataset for both the individual and unified settings. Due to computational constraints, we report these adversarial experiments for GPT-neo only.

In the individual setting (Table [5](https://arxiv.org/html/2606.00016#S4.T5)), AEyeDE achieves the highest F1 (81.50) and AUC (77.64), with a substantial precision advantage over all baselines (74.78 vs. 68.87 for the next-best method, RoBERTa). The paraphrases are generated by a T5 model, so the adversarial samples remain machine-generated, although they are out-of-distribution relative to the training data; adversarial samples are used only at test time, so AEyeDE is never trained on them. Binoculars, GLTR, LogRank, and Rank exhibit near-perfect recall (≥99.48%) but precision close to the class prior (∼67%), indicating near-all-positive prediction behaviour. Curvature attains the highest accuracy (76.13), but its recall drops sharply (54.13), yielding by far the lowest F1 (55.69). In contrast, AEyeDE achieves the best precision-recall trade-off (74.78 / 89.54), suggesting that attribution-based structure is more robust to paraphrastic surface perturbations in the individual setting.

In the unified adversarial setting (Table [6](https://arxiv.org/html/2606.00016#S4.T6)), the differences between methods become much smaller on the thresholded metrics: accuracy remains near the majority-class baseline (∼67%) and F1 values cluster tightly (78.92–80.45). AEyeDE still attains the highest F1 (80.45), but only by a narrow margin, and its precision/recall profile (67.50 / 99.55) indicates that it too shifts toward near-all-positive behaviour. Moreover, its AUC drops to 62.21, below RoBERTa (72.87) and Curvature (70.38). Overall, these results show that paraphrasing remains a difficult adversarial condition for all detectors: AEyeDE retains a clear advantage in the individual setting, but that advantage disappears in the unified setting.

We also evaluate robustness to alternative-spelling perturbations, motivated by the fact that spelling errors are relatively uncommon in clean LLM outputs. In the RAID dataset, this attack is constructed by randomly modifying characters in the original samples. In the individual setting (Table [7](https://arxiv.org/html/2606.00016#S4.T7)), AEyeDE remains highly effective under this attack, achieving the best F1 (98.47), AUC (99.75), and TPR@FPR=0.01 (96.26). It outperforms all baselines by wide margins, with the next-best F1 being 84.06 for RoBERTa. Several baselines, especially Binoculars and GLTR, operate close to the all-positive regime, with near-perfect recall but substantially weaker precision and poor low-FPR detection. These results suggest that character-level spelling perturbations affect surface-level statistics more than the attribution patterns used by AEyeDE.

In the unified setting (Table [8](https://arxiv.org/html/2606.00016#S4.T8)), performance decreases across methods, but AEyeDE still achieves the best accuracy (70.34), F1 (81.33), AUC (83.48), and TPR@FPR=0.01 (59.10). Its low-FPR true-positive rate exceeds that of the next-best method, Curvature (48.87), by more than 10 points. Curvature attains the highest precision (93.88), but its recall drops to 56.47, yielding the lowest F1 among the reported methods (70.52). Overall, these results suggest that attribution-based detection works particularly well against this type of character-level perturbation. One possible explanation is that subword tokenization does not fully obscure the token-level relationships captured by the attribution maps.

Table 5: Results under the RAID paraphrasing attack in the individual setting with GPT-neo.

Table 6: Results under the RAID paraphrasing attack in the unified setting with GPT-neo.

Table 7: Results under the RAID alternative-spelling attack in the individual setting with GPT-neo.

Table 8: Results under the RAID alternative-spelling attack in the unified setting with GPT-neo.

#### Does AEyeDE generalize to a fully external dataset?

To complement the adversarial evaluations and assess generalization to a fully external test set, we evaluate the methods on Beemo (Table [9](https://arxiv.org/html/2606.00016#S4.T9)), which is unseen during training, using Mistral-7B-Instruct as the proxy model (in our experiments, Mistral performed best among the candidate proxy models for this task). AEyeDE achieves the highest accuracy (66.36), precision (62.16), F1 (71.32), and AUC (74.43), outperforming all other methods on these metrics. Several baselines exhibit thresholded behavior close to an all-positive classifier, with accuracy near 50–51%, precision near the class prior (∼50%), and recall above 96%, indicating poor calibration at the chosen operating point. RoBERTa performs somewhat better (56.03 accuracy, 68.66 F1) but still trails AEyeDE by 2.66 F1 points. Notably, GLTR achieves a nearly identical AUC (74.38) and the highest TPR@FPR=0.01 (15.46, compared with 14.55 for AEyeDE), although its accuracy and precision remain close to chance at the selected threshold. Overall, these results indicate that the attribution-based signal transfers to an unseen dataset and provides the strongest overall cross-dataset performance, even though some baselines remain competitive on ranking-based metrics such as AUC and low-FPR TPR. These results are also obtained with a relatively small training set, suggesting that the proposed framework can extract useful discriminative structure even with limited data.

Table 9: Cross-dataset evaluation on Beemo using Mistral as the proxy model.

## 5 Analysis of Patch-Level Motifs in Attribution Maps

Building on the attribution encoder $E_{\text{attr}}$ (Sec. [3](https://arxiv.org/html/2606.00016#S3)), we investigate whether the detector exploits *localized* and *repeatable* visual motifs in attribution maps that are characteristic of human-written versus AI-generated text. Recall that after the final convolutional stage, each block $A_k \in \mathbb{R}^{128\times 128}$ is mapped to a feature map of spatial size $16\times 16$ with $256$ channels. We denote this last feature map by

$$F_k \in \mathbb{R}^{256\times 16\times 16}.$$

Because $128/16 = 8$, each feature-map cell $(u,v)$ corresponds to an $8\times 8$ patch of the original block $A_k$ (shown in Fig. [1](https://arxiv.org/html/2606.00016#S3.F1)). Let $P_{k,u,v} \in \mathbb{R}^{8\times 8}$ be this patch, and define its representation, obtained after the last convolutional stage of the CNN, as

$$z_{k,u,v} = F_k[:,u,v] \in \mathbb{R}^{256}.$$

We refer to the set $\{z_{k,u,v}\}$ as the patch embedding space produced by the detector model.
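
The cell-to-patch correspondence can be made explicit with a short NumPy sketch, assuming a CNN whose total stride is 8, so that cell $(u,v)$ is aligned with rows $8u$ to $8u+7$ and columns $8v$ to $8v+7$ of $A_k$. The receptive field of a cell generally extends beyond this aligned patch because of overlapping kernels, so the mapping is an alignment rather than a strict footprint. The function names are hypothetical.

```python
import numpy as np

def patch_for_cell(A, u, v, stride=8):
    """The stride x stride patch P_{k,u,v} of block A aligned with feature cell (u, v)."""
    return A[u * stride:(u + 1) * stride, v * stride:(v + 1) * stride]

def patch_embeddings(F):
    """Turn a (C, H, W) feature map into H*W embeddings z_{u,v} = F[:, u, v].

    Returns a (H*W, C) array and the (u, v) coordinates of each row (row-major order).
    """
    C, H, W = F.shape
    z = F.reshape(C, H * W).T
    coords = [(u, v) for u in range(H) for v in range(W)]
    return z, coords
```

For a $256\times16\times16$ feature map this yields 256 embeddings of dimension 256 per block, each paired with the $8\times 8$ patch returned by `patch_for_cell`.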

![Figure 2(a)](https://arxiv.org/html/2606.00016v1/latex/figures/motifs/hc3_std_gpt_machine.png) (a) HC3 (individual), proxy: GPT-neo. Top cluster by $\Delta\bar{r}$ (machine-skewed).
![Figure 2(b)](https://arxiv.org/html/2606.00016v1/latex/figures/motifs/unified_cohere_human.png) (b) RAID (unified), proxy: Cohere. Top cluster by $\Delta\bar{r}$ (human-skewed).
![Figure 2(c)](https://arxiv.org/html/2606.00016v1/latex/figures/motifs/std_llama_human.png) (c) RAID (individual), proxy: Llama. Top cluster by $\Delta\bar{r}$ (human-skewed).
![Figure 2(d)](https://arxiv.org/html/2606.00016v1/latex/figures/motifs/std_mistral_human.png) (d) RAID (individual), proxy: Mistral. Top cluster by $\Delta\bar{r}$ (human-skewed).
![Figure 2(e)](https://arxiv.org/html/2606.00016v1/latex/figures/motifs/std_cohere_human.png) (e) RAID (individual), proxy: Cohere. Top cluster by $\Delta\bar{r}$ (human-skewed).
![Figure 2(f)](https://arxiv.org/html/2606.00016v1/latex/figures/motifs/unified_gpt_human.png) (f) RAID (unified), proxy: GPT-neo. Top cluster by $\Delta\bar{r}$ (human-skewed).
![Figure 2(g)](https://arxiv.org/html/2606.00016v1/latex/figures/motifs/unified_llama_human.png) (g) RAID (unified), proxy: Llama. Top cluster by $\Delta\bar{r}$ (human-skewed).
![Figure 2(h)](https://arxiv.org/html/2606.00016v1/latex/figures/motifs/unified_mistral_human.png) (h) RAID (unified), proxy: Mistral. Top cluster by $\Delta\bar{r}$ (human-skewed).
![Figure 2(i)](https://arxiv.org/html/2606.00016v1/latex/figures/motifs/std_gpt_human.png) (i) RAID (individual), proxy: GPT-neo. Top cluster by $\Delta\bar{r}$ (human-skewed).

Figure 2: Examples of the top motif cluster (by absolute mean prevalence gap $\Delta\bar{r}$ between gold-machine and gold-human samples) for each dataset/proxy-model configuration. Each panel shows representative $8\times 8$ patches (z-normalized) from the corresponding cluster.

#### Patch selection and clustering.

To avoid padding-only and non-informative constant patches, we keep only patches that (i) correspond to non-padding entries under the mask $M = m_x m_y^\top$, and (ii) pass a minimal informativeness threshold based on the mean and standard deviation inside the patch: we keep patches with $|\mu_{k,u,v}| > \tau_\mu$ ($\tau_\mu = 0.01$) and $\sigma_{k,u,v} > \tau_\sigma$ ($\tau_\sigma = 0.001$). We then cluster the retained embeddings $\{z_{k,u,v}\}$ using HDBSCAN, which assigns each patch to a cluster

$$c_{k,u,v} \in \{1,\dots,C\} \cup \{-1\}, \tag{16}$$

where $-1$ denotes unclustered noise. Each cluster can be interpreted as a *motif family*: a set of patches that the trained encoder maps to nearby representations.
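
The two selection criteria and the clustering step can be sketched as follows. The sketch assumes that "non-padding" means the whole $8\times 8$ patch lies inside $M$, uses $\tau_\mu = 0.01$ and $\tau_\sigma = 0.001$ from the text, and calls scikit-learn's `HDBSCAN` (available from version 1.3) with an illustrative `min_cluster_size` that the text does not report. scikit-learn labels noise as -1 but numbers clusters from 0 rather than 1 to $C$. The function names are hypothetical.

```python
import numpy as np

def select_patches(A, M, stride=8, tau_mu=0.01, tau_sigma=0.001):
    """Grid cells (u, v) whose patch lies fully inside the non-padding mask M and
    is informative: |mean| > tau_mu and std > tau_sigma."""
    keep = []
    rows_n, cols_n = A.shape[0] // stride, A.shape[1] // stride
    for u in range(rows_n):
        for v in range(cols_n):
            rs = slice(u * stride, (u + 1) * stride)
            cs = slice(v * stride, (v + 1) * stride)
            if not np.all(M[rs, cs]):
                continue  # patch overlaps padding
            P = A[rs, cs]
            if abs(P.mean()) > tau_mu and P.std() > tau_sigma:
                keep.append((u, v))
    return keep

def cluster_embeddings(z, min_cluster_size=50):
    """HDBSCAN labels for retained embeddings z (n_patches, 256); -1 marks noise.
    Requires scikit-learn >= 1.3; min_cluster_size=50 is an assumed value."""
    from sklearn.cluster import HDBSCAN
    return HDBSCAN(min_cluster_size=min_cluster_size).fit_predict(z)
```

The embeddings passed to `cluster_embeddings` would be the rows $z_{k,u,v}$ for the cells returned by `select_patches`, pooled over all blocks and samples.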

#### Sample\-normalized motif rates\.

A direct comparison of raw motif counts is complicated by two factors: (i) datasets can be class-imbalanced (e.g., RAID), and (ii) the number of patches retained after preprocessing varies greatly across samples because of text length and informativeness filtering. To obtain a metric that is comparable across samples, for each sample $s$ we consider all patches retained from that sample (across all diagonal blocks) and define the per-sample *motif rate* for cluster $c$ as

$$r_s(c) = \frac{1}{|\mathcal{I}_s|} \sum_{(k,u,v)\in\mathcal{I}_s} \mathbb{I}\left[c_{k,u,v} = c\right], \tag{17}$$

where $\mathcal{I}_s$ is the set of retained patch indices $(k,u,v)$ from sample $s$ and $c_{k,u,v}$ is the patch-cluster assignment. That is, $r_s(c)$ is the fraction of all retained patches in sample $s$ that belong to cluster $c$.

For a group of samples $\mathcal{S}_g$ (defined below), we summarize prevalence by the mean rate

$$\bar{r}_g(c) = \frac{1}{|\mathcal{S}_g|} \sum_{s\in\mathcal{S}_g} r_s(c). \tag{18}$$

In our analysis, we report $\bar{r}_g(c)$ for two groups: $\mathcal{S}_{\text{gold human}}$ and $\mathcal{S}_{\text{gold machine}}$. Intuitively, a value such as $\bar{r}_{\text{gold machine}}(c) = 0.01$ means that, on average, *1%* of all retained patches in a machine-labeled sample belong to cluster $c$; a difference of $0.15$ corresponds to a shift of 15 percentage points in the average patch share.
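
Equations (17) and (18) amount to a normalized histogram per sample followed by a per-group average. A minimal NumPy sketch, assuming 0-based cluster labels as produced by scikit-learn's HDBSCAN (noise = -1, counted in the denominator $|\mathcal{I}_s|$ but never as a motif) and skipping samples with no retained patches, for which $r_s(c)$ is undefined; the function names are hypothetical.

```python
import numpy as np

def motif_rates(assignments, n_clusters):
    """r_s(c) for one sample: share of its retained patches in each cluster 0..n_clusters-1.
    Noise patches (label -1) count in the denominator |I_s| but in no cluster."""
    a = np.asarray(assignments)
    return np.array([(a == c).sum() for c in range(n_clusters)], dtype=float) / a.size

def group_mean_rates(samples, n_clusters):
    """bar r_g(c): mean of r_s(c) over a group's samples (samples without patches are skipped)."""
    rates = [motif_rates(a, n_clusters) for a in samples if len(a) > 0]
    return np.mean(rates, axis=0)
```

Because every sample contributes one rate vector regardless of how many patches it has, long texts do not dominate the group average.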

#### Motif discriminativeness across datasets and proxy models\.

Given the gold-labeled groups $\mathcal{S}_{\text{gold human}}$ and $\mathcal{S}_{\text{gold machine}}$, we quantify how strongly a motif cluster separates the classes via the difference in mean motif rates

$$\Delta r(c) = \bar{r}_{\text{gold machine}}(c) - \bar{r}_{\text{gold human}}(c). \tag{19}$$

Using the aggregate statistics (top-3 clusters by $|\Delta r(c)|$ per setting), we find that the most discriminative motif often shifts by several percentage points in average patch share, and in some cases by more than 10 points. Table [10](https://arxiv.org/html/2606.00016#S5.T10) reports the top-1 cluster per dataset/proxy/preprocessing variant, including bootstrap confidence intervals for $\Delta r(c)$ and a permutation-test $p$-value. In HC3, the strongest motif (cid=24) is *machine-enriched*, increasing from 9.7% (gold human) to 20.7% (gold machine), i.e., $\Delta r = +11.0$ points with a tight 95% CI $[9.3, 12.7]$. In contrast, for RAID the top motifs are consistently *human-enriched* under all tested proxy models and both preprocessing variants (e.g., $\Delta r = -10.5$ points for Cohere under the unified variant). Qualitatively, the corresponding example patches (Fig. [2](https://arxiv.org/html/2606.00016#S5.F2)) show similar, repeatable local structures: human-leaning clusters show "islands" and isolated flashes of attention across all datasets and models, while the only strongly machine-leaning cluster we identified (HC3/GPT-neo) is dominated by horizontal bands. The notable outlier is GPT-neo on RAID, which we attribute to HDBSCAN producing very few dense clusters (n = 4), resulting in high intra-cluster variance. These observations support interpreting clusters as visually coherent "motif families".
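
The confidence intervals and $p$-values in Table 10 can be obtained with standard resampling. The sketch below is one plausible version, assuming a percentile bootstrap that resamples samples within each gold group and a two-sided label-permutation test on $|\Delta r(c)|$ with the usual +1 correction. Inputs are the per-sample rates $r_s(c)$ of a single cluster; the numbers of resamples and the function names are illustrative, not the paper's settings.

```python
import numpy as np

def delta_r(machine_rates, human_rates):
    """Delta r(c): difference in mean per-sample motif rate (machine minus human)."""
    return float(np.mean(machine_rates) - np.mean(human_rates))

def bootstrap_ci(machine_rates, human_rates, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for Delta r(c), resampling samples within each group."""
    rng = np.random.default_rng(seed)
    m = np.asarray(machine_rates, dtype=float)
    h = np.asarray(human_rates, dtype=float)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        stats[b] = rng.choice(m, m.size).mean() - rng.choice(h, h.size).mean()
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)

def permutation_pvalue(machine_rates, human_rates, n_perm=2000, seed=0):
    """Two-sided label-permutation p-value for |Delta r(c)| (with +1 correction)."""
    rng = np.random.default_rng(seed)
    m = np.asarray(machine_rates, dtype=float)
    h = np.asarray(human_rates, dtype=float)
    pooled = np.concatenate([m, h])
    observed = abs(m.mean() - h.mean())
    hits = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)
        hits += int(abs(perm[:m.size].mean() - perm[m.size:].mean()) >= observed)
    return (hits + 1) / (n_perm + 1)
```

With the +1 correction, the smallest attainable $p$-value is $1/(n_{\text{perm}}+1)$, so the number of permutations bounds how small a reported $p$ can be.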

Table 10: Top-1 motif cluster by absolute class-rate gap $|\Delta r(c)|$ for each dataset/setting. Rates are mean per-sample motif shares (%), and $\Delta r$ is reported in percentage points (machine minus human). Confidence intervals are 95% bootstrap CIs over samples; $p$ is from a label-permutation test.
#### Patch\-wise saliency and ablation are computed per retained patch\.

To relate motif *prevalence* to motif *importance*, we assign each retained patch (i) a Grad-CAM (Selvaraju et al., [2019](https://arxiv.org/html/2606.00016#bib.bib102)) score and (ii) a zeroing-ablation score, and then aggregate these scores by cluster. Concretely, for each retained patch index $(k,u,v) \in \mathcal{I}_s$ we compute (i) a Grad-CAM activation $g_{s,k,u,v}$ at the last convolutional stage (the same $16\times 16$ grid that defines the patches), and (ii) an ablation-induced logit change $\delta_{s,k,u,v}$, obtained by zeroing the corresponding $8\times 8$ region in the input block $A_k$ and re-evaluating the detector. We then summarize cluster-level importance by averaging over all occurrences assigned to cluster $c$:

$$\begin{aligned} \bar{g}(c) &= \mathbb{E}\left[g_{s,k,u,v} \mid c_{k,u,v} = c\right], \\ \bar{\delta}(c) &= \mathbb{E}\left[\delta_{s,k,u,v} \mid c_{k,u,v} = c\right]. \end{aligned} \tag{20}$$

Here $\bar{\delta}(c)$ captures the *average marginal effect* of removing a single motif instance, while $\bar{r}_g(c)$ captures *how frequently* the motif occurs in a given group.
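
The two per-patch scores can be sketched in NumPy as follows, assuming access to the last-stage activations $F_k$, their gradients with respect to the target logit, and a callable `detector` that maps a block to a logit. Grad-CAM here is the standard formulation (channel weights are spatially averaged gradients, followed by a ReLU of the weighted sum); because the $16\times16$ feature grid already defines the patches, no upsampling is needed. We define $\delta$ as the ablated minus the original logit, a sign convention the text does not specify. The function names are hypothetical.

```python
import numpy as np

def grad_cam(F, dF):
    """Grad-CAM on the feature grid. F, dF: (C, H, W) last-stage activations and
    gradients of the target logit; returns ReLU(sum_c w_c F_c) with w_c = mean(dF_c)."""
    w = dF.mean(axis=(1, 2))                            # one weight per channel
    return np.maximum(np.tensordot(w, F, axes=1), 0.0)  # (H, W) map, one value per patch

def zero_ablation_delta(detector, A, u, v, stride=8):
    """Logit change from zeroing the patch at cell (u, v): detector(ablated) - detector(A)."""
    ablated = A.copy()
    ablated[u * stride:(u + 1) * stride, v * stride:(v + 1) * stride] = 0.0
    return float(detector(ablated) - detector(A))

def cluster_means(scores, labels):
    """Average a per-patch score over each cluster's occurrences (noise -1 skipped)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    return {int(c): float(scores[labels == c].mean()) for c in np.unique(labels) if c != -1}
```

Applying `cluster_means` to the Grad-CAM values and to the ablation deltas of all retained patches gives $\bar{g}(c)$ and $\bar{\delta}(c)$ in Eq. (20).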

#### Why prevalence need not correlate with saliency or ablation\.

Empirically, we observe that the clusters with the largest $|\Delta r(c)|$ are not necessarily those with the highest $\bar{g}(c)$ or $|\bar{\delta}(c)|$. Correlations aggregated across all datasets and models are weak and unstable. Across the top-15 clusters, Grad-CAM yields Pearson $\rho = 0.143 \pm 0.23$ and Spearman $\rho = 0.10 \pm 0.29$, while zeroing-ablation yields Pearson $\rho = -0.03 \pm 0.36$ and Spearman $\rho = -0.09 \pm 0.25$. Across the top-3 clusters, the correlation becomes moderate for Grad-CAM but remains unstable (Pearson $\rho = 0.373 \pm 0.66$, Spearman $\rho = 0.37 \pm 0.54$), and it improves only slightly for zeroing-ablation (Pearson $\rho = 0.09 \pm 0.67$, Spearman $\rho = 0.12 \pm 0.64$). We attribute this mismatch to two factors. First, class imbalance can decouple prevalence from importance: given the imbalance in the datasets we use (notably the RAID subset), training under a skewed prior biases the detector toward features that minimize empirical risk for the majority class, so a motif may be strongly pronounced in $\Delta r(c)$ yet have muted average Grad-CAM and ablation effects. Second, the same motif family may be decisive only in certain positions or in combination with other motifs; averaging $\bar{g}(c)$ and $\bar{\delta}(c)$ across all occurrences can dilute these conditional effects.
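
The correlation analysis can be sketched as follows: for one dataset/proxy setting, take the $k$ clusters with the largest $|\Delta r(c)|$ and correlate $|\Delta r(c)|$ with an importance score such as $\bar{g}(c)$ or $|\bar{\delta}(c)|$. We assume the reported mean ± standard deviation is taken over such per-setting correlations; Spearman is computed as Pearson on average ranks, the usual tie handling. With $k = 3$, each correlation rests on only three points, which is consistent with the large spread reported above. The function names are hypothetical.

```python
import numpy as np

def average_ranks(x):
    """Ranks 1..n, with tied values sharing their average rank."""
    x = np.asarray(x, dtype=float)
    ranks = np.empty(len(x))
    ranks[np.argsort(x, kind="mergesort")] = np.arange(1, len(x) + 1)
    for value in np.unique(x):
        tied = x == value
        ranks[tied] = ranks[tied].mean()
    return ranks

def pearson(x, y):
    return float(np.corrcoef(x, y)[0, 1])

def spearman(x, y):
    return pearson(average_ranks(x), average_ranks(y))

def topk_correlations(delta_r, importance, k):
    """Pearson and Spearman correlation between |Delta r(c)| and an importance score
    (e.g. g_bar(c) or |delta_bar(c)|) over the k clusters with the largest |Delta r(c)|."""
    gap = np.abs(np.asarray(delta_r, dtype=float))
    top = np.argsort(-gap)[:k]
    imp = np.asarray(importance, dtype=float)[top]
    return pearson(gap[top], imp), spearman(gap[top], imp)
```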

Taken together, these findings suggest that motif analysis is best interpreted as a complementary view: $\Delta r(c)$ reveals dataset- and proxy-model-dependent differences in local attention-map structure, whereas $\bar{g}(c)$ and $\bar{\delta}(c)$ reflect how the trained detector *uses* (or ignores) individual motif instances at decision time, conditioned on the training data.

#### Implications\.

Overall, the statistically reliable shifts in patch motif rates between the gold machine and gold human groups (Table [10](https://arxiv.org/html/2606.00016#S5.T10)) support our hypothesis that the internal dynamics of a proxy $G_\theta$ induce detectable structure in attention-based attribution maps beyond surface-level text statistics. At the same time, the weak alignment between prevalence and saliency cautions against interpreting frequent motifs as predictors of the model's behaviour; instead, they appear to function as stable, repeatable signatures that the detector can exploit in combination with other cues.

## 6 Conclusion

We presented AEyeDE, an attribution-based framework for AI-generated text detection that uses attention-derived attribution maps from a proxy Transformer as the input signal for a model that distinguishes human-written from AI-generated text. Across encoder-decoder translation benchmarks (WMT14, UN) and decoder-only generation datasets (HC3, RAID), as well as under adversarial attacks and on an unseen external dataset, AEyeDE is competitive with, and in most settings better than, the baselines. This is achieved with a lightweight CNN trained on a relatively small amount of data.

Beyond evaluation metrics, we provide an analysis of localized attribution patterns and show that they systematically differ between human and AI\-generated text\. Overall, our results suggest that internal attribution behavior offers a complementary and effective signal for reliable authorship detection, motivating further work on broader robustness settings and alternative attribution sources\.

#### Limitations

Our study has some limitations. First, the proposed framework assumes white-box access to a proxy Transformer model in order to extract attention-based attribution maps. While this assumption may limit applicability in fully black-box settings, our experiments indicate that attribution patterns generalize across generator families, suggesting that access to the exact generator is not strictly required.

Furthermore, our implementation focuses on attention-based attributions, which offer a favorable trade-off between informativeness and computational cost for large models and long sequences. Alternative attribution methods, such as gradient-based saliency, may further enrich the analysis but are left for future work because of their higher computational overhead. Additionally, we compute attention attributions by averaging over all layers and heads; a more fine-grained layer and head selection strategy for this task is also left for future work.

## References

- S. Abudalfa, S. Ezzini, A. Abdelali, H. Alami, A. Benlahbib, S. Chafik, M. El-Haj, A. El Mahdaouy, M. Jarrar, S. Lamsiyah, et al. (2025). The AraGenEval shared task on Arabic authorship style transfer and AI generated text detection. In Proceedings of The Third Arabic Natural Language Processing Conference: Shared Tasks, pp. 1–13.
- H. Ahn, S. Park, S. Woo, and Y. Han (2025). DITTO: a spoofing attack framework on watermarked LLMs via knowledge distillation. arXiv preprint arXiv:2510.10987.
- F. Alam, P. Nakov, N. Habash, I. Gurevych, S. Chowdhury, A. Shelmanov, Y. Wang, E. Artemova, M. Kutlu, and G. Mikros (Eds.) (2025). Proceedings of the 1st Workshop on GenAI Content Detection (GenAIDetect). International Conference on Computational Linguistics, Abu Dhabi, UAE. [Link](https://aclanthology.org/2025.genaidetect-1.0/).
- M. Z. Ali, Y. Wang, B. Pfahringer, and T. C. Smith (2025). Detection of human and machine-authored fake news in Urdu. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar (Eds.), Vienna, Austria, pp. 3419–3428. ISBN 979-8-89176-251-0. [Link](https://aclanthology.org/2025.acl-long.170/).
- E. Artemova, J. S. Lucas, S. Venkatraman, J. Lee, S. Tilga, A. Uchendu, and V. Mikhailov (2025). Beemo: benchmark of expert-edited machine-generated outputs. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), L. Chiruzzo, A. Ritter, and L. Wang (Eds.), Albuquerque, New Mexico, pp. 6992–7018. ISBN 979-8-89176-189-6. [Link](https://aclanthology.org/2025.naacl-long.357/), [DOI](https://dx.doi.org/10.18653/v1/2025.naacl-long.357).
- N. Ayoobi, S. Shahriar, and A. Mukherjee (2025). Beyond easy wins: a text hardness-aware benchmark for LLM-generated text detection. arXiv preprint arXiv:2507.15286.
- G. Bao, Y. Zhao, Z. Teng, L. Yang, and Y. Zhang (2023). Fast-DetectGPT: efficient zero-shot detection of machine-generated text via conditional probability curvature. arXiv preprint arXiv:2310.05130.
- K. Bittle and O. El-Gayar (2025). Generative AI and academic integrity in higher education: a systematic review and research agenda. Information 16(4), pp. 296.
- S. Black, G. Leo, P. Wang, C. Leahy, and S. Biderman (2021). GPT-Neo: Large Scale Autoregressive Language Modeling with Mesh-Tensorflow. [Link](https://doi.org/10.5281/zenodo.5297715), [DOI](https://dx.doi.org/10.5281/zenodo.5297715).
- O. Bojar, C. Buck, C. Federmann, B. Haddow, P. Koehn, J. Leveling, C. Monz, P. Pecina, M. Post, H. Saint-Amand, et al. (2014). Findings of the 2014 Workshop on Statistical Machine Translation. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pp. 12–58.
- Y. Chang, X. Wang, J. Wang, Y. Wu, L. Yang, K. Zhu, H. Chen, X. Yi, C. Wang, Y. Wang, et al. (2024). A survey on evaluation of large language models. ACM Transactions on Intelligent Systems and Technology 15(3), pp. 1–45.
- P. F. Christiano, J. Leike, T. Brown, M. Martic, S. Legg, and D. Amodei (2017). Deep reinforcement learning from human preferences. Advances in Neural Information Processing Systems 30.
- Team Cohere, Aakanksha, A. Ahmadian, M. Ahmed, J. Alammar, Y. Alnumay, S. Althammer, A. Arkhangorodsky, V. Aryabumi, D. Aumiller, et al. (2025). Command A: an enterprise-ready large language model. arXiv:2504.00698. [Link](https://arxiv.org/abs/2504.00698).
- J. Cornelius, O. Lithgow-Serrano, S. Mitrović, L. Dolamic, and F. Rinaldi (2024). BUST: benchmark for the evaluation of detectors of LLM-generated text. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 8029–8057.
- X. Cui, J. Wei, S. Swayamdipta, and R. Jia (2025). Robust data watermarking in language models by injecting fictitious knowledge. In Findings of the Association for Computational Linguistics: ACL 2025, pp. 14292–14306.
- J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019). BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), J. Burstein, C. Doran, and T. Solorio (Eds.), Minneapolis, Minnesota, pp. 4171–4186. [Link](https://aclanthology.org/N19-1423/), [DOI](https://dx.doi.org/10.18653/v1/N19-1423).
- L. Dugan, A. Hwang, F. Trhlík, A. Zhu, J. M. Ludan, H. Xu, D. Ippolito, and C. Callison-Burch (2024). RAID: a shared benchmark for robust evaluation of machine-generated text detectors. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), L. Ku, A. Martins, and V. Srikumar (Eds.), Bangkok, Thailand, pp. 12463–12492. [Link](https://aclanthology.org/2024.acl-long.674/), [DOI](https://dx.doi.org/10.18653/v1/2024.acl-long.674).
- L. Dugan, D. Ippolito, A. Kirubarajan, S. Shi, and C. Callison-Burch (2023). Real or fake text?: investigating human ability to detect boundaries between human-written and machine-generated text. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 37, pp. 12763–12771.
- M. Frohmann, G. Meseguer-Brocal, M. Schedl, and E. V. Epure (2025). Double entendre: robust audio-based AI-generated lyrics detection via multi-view fusion. In Findings of the Association for Computational Linguistics: ACL 2025, W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar (Eds.), Vienna, Austria, pp. 1914–1926. ISBN 979-8-89176-256-5. [Link](https://aclanthology.org/2025.findings-acl.98/).
- S. Gehrmann, H. Strobelt, and A. Rush (2019). GLTR: statistical detection and visualization of generated text. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, M. R. Costa-jussà and E. Alfonseca (Eds.), Florence, Italy, pp. 111–116. [Link](https://aclanthology.org/P19-3019/), [DOI](https://dx.doi.org/10.18653/v1/P19-3019).
- W. Go, H. Kim, A. Oh, and Y. Kim (2025a). XDAC: XAI-driven detection and attribution of LLM-generated news comments in Korean. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar (Eds.), Vienna, Austria, pp. 22728–22750. ISBN 979-8-89176-251-0. [Link](https://aclanthology.org/2025.acl-long.1108/), [DOI](https://dx.doi.org/10.18653/v1/2025.acl-long.1108).
- W. Go, H. Kim, A. Oh, and Y. Kim (2025b). XDAC: XAI-driven detection and attribution of LLM-generated news comments in Korean. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 22728–22750.
- A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan, et al. (2024). The Llama 3 herd of models. arXiv:2407.21783. [Link](https://arxiv.org/abs/2407.21783).
- B. Guo, X. Zhang, Z. Wang, M. Jiang, J. Nie, Y. Ding, J. Yue, and Y. Wu (2023). How close is ChatGPT to human experts? Comparison corpus, evaluation, and detection. arXiv preprint arXiv:2301.07597.
- A. Hans, A. Schwarzschild, V. Cherepanova, H. Kazemi, A. Saha, M. Goldblum, J. Geiping, and T. Goldstein (2024). Spotting LLMs with Binoculars: zero-shot detection of machine-generated text. arXiv:2401.12070. [Link](https://arxiv.org/abs/2401.12070).
- X. He, X. Shen, Z. Chen, M. Backes, and Y. Zhang (2024). MGTBench: benchmarking machine-generated text detection. In Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security, pp. 2251–2265.
- X. Hu, P. Chen, and T. Ho (2023). RADAR: robust AI-text detection via adversarial learning. Advances in Neural Information Processing Systems 36, pp. 15077–15095.
- B. Huang, D. Su, F. Sun, Q. Cao, H. Shen, and X. Cheng (2025a). Low-entropy watermark detection via Bayes' rule derived detector. In Findings of the Association for Computational Linguistics: ACL 2025, pp. 14330–14344.
- H. Huang, N. Sun, M. Tani, Y. Zhang, J. Jiang, and S. Jha (2025b). Can LLM-generated misinformation be detected: a study on cyber threat intelligence. Future Generation Computer Systems 173, pp. 107877. ISSN 0167-739X. [Link](https://www.sciencedirect.com/science/article/pii/S0167739X25001724), [DOI](https://doi.org/10.1016/j.future.2025.107877).
- D. Ippolito, D. Duckworth, C. Callison-Burch, and D. Eck (2020). Automatic detection of generated text is easiest when humans are fooled. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, D. Jurafsky, J. Chai, N. Schluter, and J. Tetreault (Eds.), Online, pp. 1808–1822. [Link](https://aclanthology.org/2020.acl-main.164/), [DOI](https://dx.doi.org/10.18653/v1/2020.acl-main.164).
- M. Ivanitskiy, C. D. Behn, and S. W. Fung. Motifs in attention patterns of large language models. In Mechanistic Interpretability Workshop at NeurIPS 2025.
- A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud, M. Lachaux, P. Stock, T. L. Scao, T. Lavril, T. Wang, T. Lacroix, and W. E. Sayed (2023). Mistral 7B. arXiv:2310.06825. [Link](https://arxiv.org/abs/2310.06825).
- K. Jiao, Q. Wang, L. Zhang, Z. Guo, and Z. Mao (2025). M-RangeDetector: enhancing generalization in machine-generated text detection through multi-range attention masks. In Findings of the Association for Computational Linguistics: ACL 2025, W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar (Eds.), Vienna, Austria, pp. 8971–8983. ISBN 979-8-89176-256-5. [Link](https://aclanthology.org/2025.findings-acl.469/).
- J. Kirchenbauer, J. Geiping, Y. Wen, J. Katz, I. Miers, and T. Goldstein (2023). A watermark for large language models. In International Conference on Machine Learning, pp. 17061–17084.
- K. Krishna, Y. Chang, J. Wieting, and M. Iyyer (2022). RankGen: improving text generation with large ranking models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Y. Goldberg, Z. Kozareva, and Y. Zhang (Eds.), Abu Dhabi, United Arab Emirates, pp. 199–232. [Link](https://aclanthology.org/2022.emnlp-main.15/), [DOI](https://dx.doi.org/10.18653/v1/2022.emnlp-main.15).
- K. Krishna, Y. Song, M. Karpinska, J. Wieting, and M. Iyyer (2023). Paraphrasing evades detectors of AI-generated text, but retrieval is an effective defense. In Advances in Neural Information Processing Systems, A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (Eds.), Vol. 36, pp. 27469–27500. [Link](https://proceedings.neurips.cc/paper_files/paper/2023/file/575c450013d0e99e4b0ecf82bd1afaa4-Paper-Conference.pdf).
- K. Kuznetsov, L. Kushnareva, A. Razzhigaev, P. Druzhinina, A. Voznyuk, I. Piontkovskaya, E. Burnaev, and S. Barannikov (2025). Feature-level insights into artificial text detection with sparse autoencoders. In Findings of the Association for Computational Linguistics: ACL 2025, W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar (Eds.), Vienna, Austria, pp. 25727–25748. ISBN 979-8-89176-256-5. [Link](https://aclanthology.org/2025.findings-acl.1321/).
- S. Lamsiyah, S. Ezzini, A. E. Mahdaouy, H. Alami, A. Benlahbib, S. E. Amrany, S. Chafik, and H. Hammouchi (2025). M-DAIGT: a shared task on multi-domain detection of AI-generated text. arXiv preprint arXiv:2511.11340.
- J. Li and X. Wan (2025). Who writes what: unveiling the impact of author roles on AI-generated text detection. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar (Eds.), Vienna, Austria, pp. 26620–26658. ISBN 979-8-89176-251-0. [Link](https://aclanthology.org/2025.acl-long.1292/).
- Y. Li, Z. Zhang, C. Li, C. Shen, and X. Liu (2025). Iron sharpens iron: defending against attacks in machine-generated text detection with adversarial training. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar (Eds.), Vienna, Austria, pp. 3091–3113. ISBN 979-8-89176-251-0. [Link](https://aclanthology.org/2025.acl-long.155/).
- A. Liu, L. Pan, Y. Lu, J. Li, X. Hu, X. Zhang, L. Wen, I. King, H. Xiong, and P. Yu (2024a). A survey of text watermarking in the era of large language models. ACM Computing Surveys 57(2), pp. 1–36.
- Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019). RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
- Z. Liu, Z. Yao, F. Li, and B. Luo (2024b). On the detectability of ChatGPT content: benchmarking, methodology, and evaluation through the lens of academic writing. In Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security, pp. 2236–2250.
- X. Lu, J. Wang, Z. Zhao, Z. Dai, C. Foo, S. K. Ng, and B. K. H. Low (2025). WASA: watermark-based source attribution for large language model-generated data. In Findings of the Association for Computational Linguistics: ACL 2025, pp. 23791–23824.
- Y. Lu, A. Liu, D. Yu, J. Li, and I. King (2024). An entropy-based text watermarking detection method. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), L. Ku, A. Martins, and V. Srikumar (Eds.), Bangkok, Thailand, pp. 11724–11735. [Link](https://aclanthology.org/2024.acl-long.630/), [DOI](https://dx.doi.org/10.18653/v1/2024.acl-long.630).
- D. Macko, J. Kopál, R. Moro, and I. Srba (2025). MultiSocial: multilingual benchmark of machine-generated text detection of social-media texts. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar (Eds.), Vienna, Austria, pp. 727–752. ISBN 979-8-89176-251-0. [Link](https://aclanthology.org/2025.acl-long.36/).
- M. Mao, D. Wei, Z. Chen, X. Fang, and M. Chau (2025). Watermarking large language models: an unbiased and low-risk method. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 7939–7960.
- E. Mitchell, Y. Lee, A. Khazatsky, C. D. Manning, and C. Finn (2023). DetectGPT: zero-shot machine-generated text detection using probability curvature. In International Conference on Machine Learning, pp. 24950–24962.
- H. Naveed, A. U. Khan, S. Qiu, M. Saqib, S. Anwar, M. Usman, N. Akhtar, N. Barnes, and A. Mian (2023). A comprehensive overview of large language models. ACM Transactions on Intelligent Systems and Technology.
- G. Niess and R. Kern (2025). Ensemble watermarks for large language models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2903–2916.
- OpenAI (2023). ChatGPT. Large language model. [https://chat.openai.com](https://chat.openai.com/).
- A. Pedrotti, M. Papucci, C. Ciaccio, A. Miaschi, G. Puccetti, F. Dell'Orletta, and A. Esuli (2025). Stress-testing machine generated text detection: shifting language models writing style to fool detectors. In Findings of the Association for Computational Linguistics: ACL 2025, W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar (Eds.), Vienna, Austria, pp. 3010–3031. ISBN 979-8-89176-256-5. [Link](https://aclanthology.org/2025.findings-acl.156/).
- X. Peng, Y. Zhou, B. He, L. Sun, and Y. Sun (2023). Hidding the ghostwriters: an adversarial evaluation of AI-generated student essay detection. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 10406–10419.
- J. Qi, C. Gao, Z. Ren, and Q. Chen (2025). DeltaLLM: a training-free framework exploiting temporal sparsity for efficient edge LLM inference. arXiv preprint arXiv:2507.19608.
- R. A. Rivera Soto, B. Y. Chen, and N. Andrews (2025). Mitigating paraphrase attacks on machine-text detection via paraphrase inversion. In Findings of the Association for Computational Linguistics: ACL 2025, W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar (Eds.), Vienna, Austria, pp. 4421–4433. ISBN 979-8-89176-256-5. [Link](https://aclanthology.org/2025.findings-acl.227/).
- G. Sarti, N. Feldhus, L. Sickert, and O. van der Wal (2023). Inseq: an interpretability toolkit for sequence generation models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), D. Bollegala, R. Huang, and A. Ritter (Eds.), Toronto, Canada, pp. 421–435. [Link](https://aclanthology.org/2023.acl-demo.40/), [DOI](https://dx.doi.org/10.18653/v1/2023.acl-demo.40).
- R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra (2019). Grad-CAM: visual explanations from deep networks via gradient-based localization. International Journal of Computer Vision 128(2), pp. 336–359. ISSN 1573-1405. [DOI](https://dx.doi.org/10.1007/s11263-019-01228-7).
- L. Shen, X. Zhang, S. Ji, Y. Pu, C. Ge, X. Yang, and Y. Feng (2023). TextDefense: adversarial text detection based on word importance entropy. arXiv preprint arXiv:2302.05892.
- J. Su, T. Zhuo, D. Wang, and P. Nakov (2023). DetectLLM: leveraging log rank information for zero-shot detection of machine-generated text. In Findings of the Association for Computational Linguistics: EMNLP 2023, H. Bouamor, J. Pino, and K. Bali (Eds.), Singapore, pp. 12395–12412. [Link](https://aclanthology.org/2023.findings-emnlp.827/), [DOI](https://dx.doi.org/10.18653/v1/2023.findings-emnlp.827).
- Z. Su, Y. Wang, H. Wan, Z. Zhang, and M. Luo (2025). HACo-Det: a study towards fine-grained machine-generated text detection under human-AI coauthoring. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar (Eds.), Vienna, Austria, pp. 22015–22036. ISBN 979-8-89176-251-0. [Link](https://aclanthology.org/2025.acl-long.1069/).
- V. Tassopoulou, G. Retsinas, and P. Maragos (2021). Enhancing handwritten text recognition with n-gram sequence decomposition and multitask learning. In 2020 25th International Conference on Pattern Recognition (ICPR), pp. 10555–10560.
- J\. Tiedemann and S\. Thottingal \(2020\)OPUS\-MT – building open translation services for the world\.InProceedings of the 22nd Annual Conference of the European Association for Machine Translation,A\. Martins, H\. Moniz, S\. Fumega, B\. Martins, F\. Batista, L\. Coheur, C\. Parra, I\. Trancoso, M\. Turchi, A\. Bisazza, J\. Moorkens, A\. Guerberof, M\. Nurminen, L\. Marg, and M\. L\. Forcada \(Eds\.\),Lisboa, Portugal,pp\. 479–480\.External Links:[Link](https://aclanthology.org/2020.eamt-1.61/)Cited by:[§4\.1](https://arxiv.org/html/2606.00016#S4.SS1.p1.1)\.
- I\. Tolstykh, A\. Tsybina, S\. Yakubson, and M\. Kuprashevich \(2025\)LLMTrace: a corpus for classification and fine\-grained localization of ai\-written text\.arXiv preprint arXiv:2509\.21269\.Cited by:[§2](https://arxiv.org/html/2606.00016#S2.SS0.SSS0.Px5.p1.1)\.
- A\. Uchendu, Z\. Ma, T\. Le, R\. Zhang, and D\. Lee \(2021\)Turingbench: a benchmark environment for turing test in the age of neural text generation\.arXiv preprint arXiv:2109\.13296\.Cited by:[§1](https://arxiv.org/html/2606.00016#S1.p2.1),[§2](https://arxiv.org/html/2606.00016#S2.SS0.SSS0.Px2.p1.1)\.
- A\. Urlana, A\. Saibewar, B\. M\. Garlapati, C\. Vinayak Kumar, A\. Singh, and S\. R\. Chalamala \(2024\)TrustAI at SemEval\-2024 task 8: a comprehensive analysis of multi\-domain machine generated text detection techniques\.InProceedings of the 18th International Workshop on Semantic Evaluation \(SemEval\-2024\),A\. Kr\. Ojha, A\. S\. Doğruöz, H\. Tayyar Madabushi, G\. Da San Martino, S\. Rosenthal, and A\. Rosá \(Eds\.\),Mexico City, Mexico,pp\. 927–934\.External Links:[Link](https://aclanthology.org/2024.semeval-1.134/),[Document](https://dx.doi.org/10.18653/v1/2024.semeval-1.134)Cited by:[§2](https://arxiv.org/html/2606.00016#S2.SS0.SSS0.Px5.p1.1)\.
- A\. Vaswani, N\. Shazeer, N\. Parmar, J\. Uszkoreit, L\. Jones, A\. N\. Gomez, Ł\. Kaiser, and I\. Polosukhin \(2017\)Attention is all you need\.InProceedings of the 31st International Conference on Neural Information Processing Systems,NIPS’17,Red Hook, NY, USA,pp\. 6000–6010\.External Links:ISBN 9781510860964Cited by:[§1](https://arxiv.org/html/2606.00016#S1.p3.3)\.
- Y\. Wang, J\. Mansurov, P\. Ivanov, J\. Su, A\. Shelmanov, A\. Tsvigun, O\. Mohammed Afzal, T\. Mahmoud, G\. Puccetti, and T\. Arnold \(2024a\)SemEval\-2024 task 8: multidomain, multimodel and multilingual machine\-generated text detection\.InProceedings of the 18th International Workshop on Semantic Evaluation \(SemEval\-2024\),A\. Kr\. Ojha, A\. S\. Doğruöz, H\. Tayyar Madabushi, G\. Da San Martino, S\. Rosenthal, and A\. Rosá \(Eds\.\),Mexico City, Mexico,pp\. 2057–2079\.External Links:[Link](https://aclanthology.org/2024.semeval-1.279/),[Document](https://dx.doi.org/10.18653/v1/2024.semeval-1.279)Cited by:[§2](https://arxiv.org/html/2606.00016#S2.SS0.SSS0.Px5.p1.1)\.
- Y\. Wang, J\. Mansurov, P\. Ivanov, J\. Su, A\. Shelmanov, A\. Tsvigun, C\. Whitehouse, O\. Mohammed Afzal, T\. Mahmoud, T\. Sasaki, T\. Arnold, A\. F\. Aji, N\. Habash, I\. Gurevych, and P\. Nakov \(2024b\)M4: multi\-generator, multi\-domain, and multi\-lingual black\-box machine\-generated text detection\.InProceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics \(Volume 1: Long Papers\),Y\. Graham and M\. Purver \(Eds\.\),St\. Julian’s, Malta,pp\. 1369–1407\.External Links:[Link](https://aclanthology.org/2024.eacl-long.83/),[Document](https://dx.doi.org/10.18653/v1/2024.eacl-long.83)Cited by:[§2](https://arxiv.org/html/2606.00016#S2.SS0.SSS0.Px2.p1.1),[§2](https://arxiv.org/html/2606.00016#S2.SS0.SSS0.Px4.p1.1)\.
- Z\. Wang, T\. Gu, B\. Wu, and Y\. Yang \(2025\)MorphMark: flexible adaptive watermarking for large language models\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\),Vienna, Austria,pp\. 4842–4860\.External Links:[Link](https://aclanthology.org/2025.acl-long.240/),[Document](https://dx.doi.org/10.18653/v1/2025.acl-long.240),ISBN 979\-8\-89176\-251\-0Cited by:[§1](https://arxiv.org/html/2606.00016#S1.p2.1),[§2](https://arxiv.org/html/2606.00016#S2.SS0.SSS0.Px3.p1.1)\.
- J\. Wu, S\. Yang, R\. Zhan, Y\. Yuan, L\. S\. Chao, and D\. F\. Wong \(2025\)A survey on LLM\-generated text detection: necessity, methods, and future directions\.Computational Linguistics51\(1\),pp\. 275–338\.External Links:[Link](https://aclanthology.org/2025.cl-1.8/),[Document](https://dx.doi.org/10.1162/coli%5Fa%5F00549)Cited by:[§1](https://arxiv.org/html/2606.00016#S1.p1.1),[§1](https://arxiv.org/html/2606.00016#S1.p2.1),[§2](https://arxiv.org/html/2606.00016#S2.SS0.SSS0.Px1.p1.1)\.
- J\. Wu, R\. Zhan, D\. Wong, S\. Yang, X\. Yang, Y\. Yuan, and L\. Chao \(2024\)Detectrl: benchmarking llm\-generated text detection in real\-world scenarios\.Advances in Neural Information Processing Systems37,pp\. 100369–100401\.Cited by:[§1](https://arxiv.org/html/2606.00016#S1.p2.1)\.
- G\. Xiao, Y\. Tian, B\. Chen, S\. Han, and M\. Lewis \(2023\)Efficient streaming language models with attention sinks\.arXiv preprint arXiv:2309\.17453\.Cited by:[Figure 1](https://arxiv.org/html/2606.00016#S3.F1),[Figure 1](https://arxiv.org/html/2606.00016#S3.F1.10.5),[§3](https://arxiv.org/html/2606.00016#S3.p6.3)\.
- X\. Zhu, Y\. Ren, Y\. Cao, X\. Lin, F\. Fang, and Y\. Li \(2025\)Reliably bounding false positives: a zero\-shot machine\-generated text detection framework via multiscaled conformal prediction\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\),Vienna, Austria,pp\. 12298–12319\.External Links:[Link](https://aclanthology.org/2025.acl-long.601/),ISBN 979\-8\-89176\-251\-0Cited by:[§1](https://arxiv.org/html/2606.00016#S1.p2.1),[§2](https://arxiv.org/html/2606.00016#S2.SS0.SSS0.Px2.p1.1)\.
- M\. Ziemski, M\. Junczys\-Dowmunt, and B\. Pouliquen \(2016\)The United Nations parallel corpus v1\.0\.InProceedings of the Tenth International Conference on Language Resources and Evaluation \(LREC‘16\),N\. Calzolari, K\. Choukri, T\. Declerck, S\. Goggi, M\. Grobelnik, B\. Maegaard, J\. Mariani, H\. Mazo, A\. Moreno, J\. Odijk, and S\. Piperidis \(Eds\.\),Portorož, Slovenia,pp\. 3530–3534\.External Links:[Link](https://aclanthology.org/L16-1561/)Cited by:[§4\.1](https://arxiv.org/html/2606.00016#S4.SS1.p1.1)\.
