Multi-Level Contextual Token Relation Modeling for Machine-Generated Text Detection
Summary
This paper proposes a multi-level contextual token relation modeling framework for machine-generated text detection, integrating local Markov-informed calibration and global rule-support reasoning to improve detection across cross-LLM and cross-domain settings with low computational overhead.
# Multi-Level Contextual Token Relation Modeling for Machine-Generated Text Detection
Source: [https://arxiv.org/html/2605.16107](https://arxiv.org/html/2605.16107)
Chenwang Wu, Yiu\-ming Cheung, Bo Han, Shuhai Zhang, and Defu Lian

Chenwang Wu, Yiu\-ming Cheung, and Bo Han are affiliated with the Department of Computer Science, Hong Kong Baptist University, Hong Kong, China\. E\-mail: \{cscwwu, ymc, bhanml\}@comp\.hkbu\.edu\.hk\. Shuhai Zhang is affiliated with the School of Software Engineering, South China University of Technology, Guangzhou, Guangdong 510000, China\. E\-mail: shuhaizhangshz@gmail\.com\. Defu Lian is affiliated with the School of Computer Science and Technology, University of Science and Technology of China, Hefei, Anhui 230000, China\. He is also affiliated with the State Key Laboratory of Cognitive Intelligence\. E\-mail: liandefu@ustc\.edu\.cn\. Corresponding author: Yiu\-ming Cheung\.
###### Abstract
Machine\-generated texts \(MGTs\) pose risks such as disinformation and phishing, underscoring the need for reliable detection\. Metric\-based methods, which extract statistically distinguishable features of MGTs, are often more practical than complex model\-based methods that are prone to overfitting\. Given their diverse designs, we first place representative metric\-based methods within a unified framework, enabling a clear assessment of their advantages and limitations\. Our analysis identifies a core challenge across these methods: the token\-level detection score is easily biased by the inherent randomness of the MGTs generation process\. Then, we theoretically derive the multi\-hop transitions of the token\-level detection score and explore their local and global relations\. Based on these findings, we propose a multi\-level contextual token relation modeling framework for MGT detection\. Specifically, for local relations, we model them through a lightweight Markov\-informed calibration module that refines token\-level evidence before aggregation\. For global relations, we introduce a rule\-support reasoning module that uses explicit logical rules derived from contextual score statistics\. Finally, we combine the local calibrated score and the global rule\-support reasoning signal in a joint multi\-level inference framework\. Extensive experiments show broad and substantial improvements across various real\-world scenarios, including cross\-LLM and cross\-domain settings, with low computational overhead\.
## I Introduction
Generative AI, represented by large language models \(LLMs\)\[[1](https://arxiv.org/html/2605.16107#bib.bib42),[26](https://arxiv.org/html/2605.16107#bib.bib43)\], has been advancing rapidly, and the machine\-generated texts \(MGTs\) they produce often match human writing in fluency, coherence, and diversity\. While this technological breakthrough offers immense opportunities, it has also triggered widespread societal concerns, including the spread of disinformation\[[34](https://arxiv.org/html/2605.16107#bib.bib80)\], violations of intellectual property rights\[[47](https://arxiv.org/html/2605.16107#bib.bib81)\], and phishing attacks\[[14](https://arxiv.org/html/2605.16107#bib.bib45)\]\. Therefore, the research and development of MGT detection technologies hold significant theoretical and practical value in uncovering the distinct patterns of generated text and ensuring a trustworthy AI environment\.
An effective detection method is to identify LLM watermarks\[[15](https://arxiv.org/html/2605.16107#bib.bib71)\], but this requires injecting watermarks into the LLM, which is often impractical due to high access permissions\. Therefore, passive detection methods, including model\- and metric\-based approaches, have attracted significant attention\. Model\-based methods use a set of human\- and machine\-generated texts to train a binary classifier, such as OpenAI detector\[[29](https://arxiv.org/html/2605.16107#bib.bib11)\], ChatGPT detector\[[9](https://arxiv.org/html/2605.16107#bib.bib1)\], SeqXGPT\[[35](https://arxiv.org/html/2605.16107#bib.bib19)\], and CoCo\[[20](https://arxiv.org/html/2605.16107#bib.bib56)\]\. However, such models are often too complex, leading to overfitting to the training data\. Instead, metric\-based methods exploit the inherent statistical biases of LLMs to discriminate MGTs, which are model\-agnostic and have better generalization properties\. These methods use metrics such as log\-likelihood, log\-rank, and entropy\. Furthermore, methods such as DetectGPT\[[24](https://arxiv.org/html/2605.16107#bib.bib10)\], FastDetectGPT\[[2](https://arxiv.org/html/2605.16107#bib.bib49)\], and Binoculars\[[12](https://arxiv.org/html/2605.16107#bib.bib63)\]detect MGTs by comparing the differences between a given text and a perturbed, regenerated, or continued text from an alternative model\.
Despite their diverse designs, this paper first systematically examines several representative approaches, including Log\-Likelihood\[[29](https://arxiv.org/html/2605.16107#bib.bib11)\], Entropy\[[7](https://arxiv.org/html/2605.16107#bib.bib6)\], Binoculars\[[10](https://arxiv.org/html/2605.16107#bib.bib54)\], DetectGPT\[[24](https://arxiv.org/html/2605.16107#bib.bib10)\], FastDetectGPT\[[2](https://arxiv.org/html/2605.16107#bib.bib49)\], and DNA\-DetectLLM\[[54](https://arxiv.org/html/2605.16107#bib.bib85)\], and situates them within a unified framework, thereby revealing their commonalities: They first compute token\-level detection scores, and then employ various carefully designed strategies to aggregate them into text\-level scores to make threshold\-based decisions\. This unified view reveals a common challenge across the existing methods: The token\-level score is easily biased by the inherent randomness of the LLM generation process, while subsequent aggregation steps fail to correct for the underlying imprecision\. As a result, the detection performance is tightly constrained by the precision of token\-level scores\. Given that token\-level scores are tied to the generation process and context\-dependent, a natural question arises: can we explicitly reveal and exploit contextual relationships among token\-level detection scores to improve detection?
In our preliminary work \[[36](https://arxiv.org/html/2605.16107#bib.bib90)\], we attempted to answer this question from a local perspective\. Starting from a theoretical bound on attention\-score evolution in a simplified transformer, we arrived at two findings regarding local contextual relations: *Neighbor Similarity*, namely, adjacent tokens tend to exhibit similar detection scores, and *Initial Instability*, namely, early\-position token scores are more unstable than later ones\. Building on these two observations, we proposed a Markov\-informed score calibration method that models local contextual dependence through a pairwise Markov random field and implements it efficiently via a mean\-field approximation\. This calibration module can be stacked on top of existing detectors to refine token\-level scores before final aggregation, thereby improving detection with negligible computational overhead\.
Nevertheless, purely local contextual modeling is still insufficient\. While it can correct short\-range score bias, it cannot adequately capture the global organization of token\-level scores across the text\. In this paper, we therefore go beyond local contextual relations and propose multi\-level contextual token relation modeling for MGT detection\. Specifically, we extend the theoretical results from attention scores to more direct token scores and from single\-hop local transitions to multi\-hop contextual relations\. This extension reveals that the score differences between distant positions are also structurally bounded rather than arbitrary\. This implies global contextual relations: within non\-initial text segments, the token scores of MGT exhibit *score stability*, *adjacent\-difference stability*, and *long\-range stability*\. Furthermore, this paper introduces a rule\-support reasoning module to model these relations\. Specifically, based on these relations, we extract corresponding global statistics from the text and construct logical rules; we then employ rule\-support reasoning to derive a confidence score, which complements the locally calibrated score from our preliminary work\. In this way, local Markov calibration improves the quality of token\-level scores, whereas the global rule\-support reasoning module provides a confidence score to enhance detection, together forming a unified multi\-level contextual token relation modeling framework for MGT detection\.
Beyond the contributions of the preliminary work \[[36](https://arxiv.org/html/2605.16107#bib.bib90)\] on score calibration, this paper makes the following contributions\.
- We extend the theoretical analysis from attention scores to more direct token scores and from single\-hop to more general multi\-hop transitions, thereby revealing global contextual relations among token\-level scores\.
- We propose a rule\-support reasoning module to explicitly capture these global relations by constructing logical rules from global statistics\.
- We propose a multi\-level framework for modeling contextual token relations to enhance MGT detection by integrating local Markov calibration and global rule\-support reasoning\.
- Extensive experiments demonstrate the effectiveness of the proposed approach in various real\-world scenarios, including cross\-LLM generalization, cross\-domain transfer, adversarial/paraphrasing settings, and mixed text detection\.
## II Related Work
This section provides an overview of the existing detection methods, which can be categorized into active watermark\-based methods and passive model\- and metric\-based methods\.
### II\-A Watermark\-based Detection
Watermarking is a proactive defense technique that embeds verifiable information during text generation, thereby enabling simple and reliable detection\. RedList\[[17](https://arxiv.org/html/2605.16107#bib.bib9)\]is a model\-agnostic watermarking method that dynamically partitions the vocabulary into a “greenlist” and “redlist” based on preceding context, slightly increasing the probability of sampling tokens from the greenlist\. Subsequent works have made various improvements to this approach\. For instance, SemStamp\[[15](https://arxiv.org/html/2605.16107#bib.bib71)\]introduces a sentence\-level semantic hashing watermark to enhance robustness against paraphrasing attacks; DiPmark\[[40](https://arxiv.org/html/2605.16107#bib.bib72)\]designs an unbiased watermark that does not alter the original output distribution\. REMARK\-LLM\[[50](https://arxiv.org/html/2605.16107#bib.bib73)\]is a training\-based watermarking method that employs a message encoding module to generate an encrypted token distribution for watermark embedding prior to inference\. Beyond manually designed watermarks, directly leveraging language models to learn to generate watermarked text is also promising\[[21](https://arxiv.org/html/2605.16107#bib.bib75)\]\.
### II\-B Model\-based Detection
Model\-based methods represent a classical paradigm in detection, training a binary classifier on a dataset containing both human\- and machine\-generated texts\. A series of works, such as OpenAI Detector\[[29](https://arxiv.org/html/2605.16107#bib.bib11)\], ChatGPT Detector\[[9](https://arxiv.org/html/2605.16107#bib.bib1)\], GPTZero\[[8](https://arxiv.org/html/2605.16107#bib.bib14)\], and G3 Detector\[[49](https://arxiv.org/html/2605.16107#bib.bib2)\], collect texts generated by various LLMs to train a unified classifier\. GPT\-Pat\[[46](https://arxiv.org/html/2605.16107#bib.bib5)\]finds that detectors trained solely on a single decoding strategy generalize poorly, thereby enhancing performance by utilizing mixed decoding strategies\. In addition to original data, GLTR\[[7](https://arxiv.org/html/2605.16107#bib.bib6)\]trains a simple logistic regression classifier by analyzing the predicted ranking of each word within its context\. SeqXGPT\[[35](https://arxiv.org/html/2605.16107#bib.bib19)\]treats the sequence of logits as waveform signals for detection\. Beyond the data level, recent works have explored more advanced training strategies\. For example, LLMDet\[[39](https://arxiv.org/html/2605.16107#bib.bib57)\]leverages the perplexity of surrogate models as additional features; MPU\[[31](https://arxiv.org/html/2605.16107#bib.bib41)\]adopts a positive\-unlabeled learning paradigm; and RADAR\[[16](https://arxiv.org/html/2605.16107#bib.bib50)\]incorporates adversarial training to enhance model robustness\. The above methods generally assume a known text source, but when the source is unknown, Ghostbuster\[[33](https://arxiv.org/html/2605.16107#bib.bib23)\]proposes training classifiers directly on texts generated by known surrogate models\. Besides, DGM4\[[28](https://arxiv.org/html/2605.16107#bib.bib92)\]incorporated contrastive learning with the image modality to capture more fine\-grained data features\.
TABLE I: Comparing existing metric\-based methods from a unified view\. Here, $s$ is the text to be detected containing $N$ tokens, $s^{\prime}$ is the perturbed text generated by DetectGPT, $\tilde{s}$ is the regenerated text of Fast\-DetectGPT, and $s^{*}$ is the ideal text of DNA\-DetectLLM\. The functions $\mu(\cdot)$ and $\sigma(\cdot)$ represent the mean and standard deviation of the given set, respectively\. For methods whose original score directions differ, we apply sign normalization so that a larger score consistently indicates a higher likelihood of MGT\.
### II\-C Metric\-based Detection
Metric\-based methods do not require training on specific datasets; instead, they directly leverage the inherent statistical biases or intrinsic properties of language model\-generated text to distinguish it\. Early studies mainly relied on token\-level probability statistics, such as Log\-Likelihood\[[29](https://arxiv.org/html/2605.16107#bib.bib11)\], Log\-Rank\[[24](https://arxiv.org/html/2605.16107#bib.bib10)\], and Entropy\[[7](https://arxiv.org/html/2605.16107#bib.bib6)\], and their variants\[[30](https://arxiv.org/html/2605.16107#bib.bib58),[51](https://arxiv.org/html/2605.16107#bib.bib91)\]\. Beyond these direct scoring methods, perturbation\- or rewrite\-based approaches detect machine\-generated text by comparing the original text with perturbed, continued, or rewritten variants, including DetectGPT\[[24](https://arxiv.org/html/2605.16107#bib.bib10)\], Fast\-DetectGPT\[[2](https://arxiv.org/html/2605.16107#bib.bib49)\], DNA\-GPT\[[43](https://arxiv.org/html/2605.16107#bib.bib59)\], DetectGPT4Code\[[44](https://arxiv.org/html/2605.16107#bib.bib61)\], SimLLM\[[25](https://arxiv.org/html/2605.16107#bib.bib67)\], and L2D\[[52](https://arxiv.org/html/2605.16107#bib.bib89)\]\. A growing line of work explores deeper intrinsic signals in text representations\. These include intrinsic dimensionality\[[32](https://arxiv.org/html/2605.16107#bib.bib15)\], token coherence\[[23](https://arxiv.org/html/2605.16107#bib.bib68)\], vocabulary\-space distribution gaps\[[45](https://arxiv.org/html/2605.16107#bib.bib69)\], surrogate\-model activation features\[[3](https://arxiv.org/html/2605.16107#bib.bib3)\], temporal patterns of token probabilities\[[42](https://arxiv.org/html/2605.16107#bib.bib65)\], relative probability spectra\[[41](https://arxiv.org/html/2605.16107#bib.bib64)\], and uncertainty in style perception\[[38](https://arxiv.org/html/2605.16107#bib.bib66)\]\. 
More recent methods further model higher\-level structure and robustness: DETree captures hierarchical clustering relations in hybrid human–AI text\[[13](https://arxiv.org/html/2605.16107#bib.bib84)\], DNA\-DetectLLM measures the repair effort needed to transform text into an ideal machine\-generated sequence\[[54](https://arxiv.org/html/2605.16107#bib.bib85)\], OOD\-based methods improve generalization by framing human text as out\-of\-distribution\[[48](https://arxiv.org/html/2605.16107#bib.bib86)\], IPAD enhances interpretability by inferring likely prompts\[[4](https://arxiv.org/html/2605.16107#bib.bib87)\], and HLD\-Detector models human and machine linguistic distributions across lexical, syntactic, and semantic levels\[[11](https://arxiv.org/html/2605.16107#bib.bib88)\]\.
## III A Unified Perspective of Metric\-based Detection
Although model\-based methods have shown competitive potential in specific domains, they are often too complex, leading to a tendency to overfit their training data\. This limitation hinders their generalizability\. In contrast, metric\-based methods extract discriminative features from MGT, and their model\-agnostic nature provides superior generalization potential\. Given the diverse implementations of representative metric\-based methods such as Log\-Likelihood\[[29](https://arxiv.org/html/2605.16107#bib.bib11)\], Entropy\[[7](https://arxiv.org/html/2605.16107#bib.bib6)\], Binoculars\[[10](https://arxiv.org/html/2605.16107#bib.bib54)\], DetectGPT\[[24](https://arxiv.org/html/2605.16107#bib.bib10)\], FastDetectGPT\[[2](https://arxiv.org/html/2605.16107#bib.bib49)\], and DNA\-DetectLLM\[[54](https://arxiv.org/html/2605.16107#bib.bib85)\], we first provide a systematic examination of them from a unified perspective\. This facilitates a deeper understanding of their mechanisms and allows for a fair comparison between them\. As summarized in Table[I](https://arxiv.org/html/2605.16107#S2.T1), we compare these methods across data, score aggregation, and detection dimensions\. Note that we do not discuss their diverse core metric designs here, because our purpose is to identify common structural properties that may support a detector\-agnostic enhancement framework\.
- Data\. Some methods, such as Log\-Likelihood and Entropy, operate directly on the input text $s$ and are therefore computationally efficient\. However, the randomness inherent in the LLM sampling mechanism may cause the MGT to deviate from these methods’ underlying assumptions, e\.g\., Log\-Likelihood assumes that the generated tokens have a high likelihood\. This makes it difficult for methods relying on single samples to fully exploit the potential of their core mechanisms\. In contrast, DetectGPT, Fast\-DetectGPT, and DNA\-DetectLLM incorporate multiple perturbed \(i\.e\., $s^{\prime}$\) or regenerated \(i\.e\., $\tilde{s}$ and $\hat{s}$\) samples, which mitigates the errors caused by randomness\. However, this may increase computational overhead compared to single\-text\-based methods\.
- Score Aggregation\. Although these methods appear to calculate scores differently, they all tend to directly aggregate token scores to obtain the final text score\. As discussed, the randomness introduced by the LLM generation process may bias token\-level scores\. Therefore, aggregating these potentially imprecise token scores directly may not fully reflect their core detection advantages\.
- Detection\. These methods employ threshold\-based detection mechanisms, whose effectiveness relies heavily on the accuracy of their calculated scores\. Including uncalibrated, high\-noise scores in threshold\-based decision\-making may lead to poor performance\.
In summary, existing metric\-based methods improve detection in various ways, e\.g\., by introducing auxiliary text or redesigning scoring functions\. However, they fail to address the underlying token\-level errors arising from inherent randomness, thereby limiting their detection potential\. Considering that detection scores are tied to tokens and LLMs’ generative mechanisms induce dependencies among tokens, revealing and modeling contextual relations among token scores may help correct score errors and thus improve detection effectiveness\.
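The shared pipeline described above (token-level scores, direct aggregation, threshold decision) can be illustrated with a minimal sketch. It assumes a Log-Likelihood style score and a plain mean as the aggregator. The function names, the probability values, and the threshold are hypothetical and chosen only for illustration; they are not taken from the paper or from any detector's implementation.

```python
import math
from statistics import mean


def token_scores(token_probs):
    """Token-level scores d(s_t): here, the log-probability of each observed token."""
    return [math.log(p) for p in token_probs]


def text_score(scores):
    """Aggregate token-level scores into one text-level score (plain mean)."""
    return mean(scores)


def is_machine_generated(token_probs, threshold):
    """Threshold-based decision: a larger score indicates a higher likelihood of MGT."""
    return text_score(token_scores(token_probs)) > threshold


# Hypothetical per-token probabilities, chosen only for illustration.
mgt_probs = [0.60, 0.55, 0.70, 0.05, 0.65]  # 0.05 mimics one unlucky sampled token
hgt_probs = [0.20, 0.10, 0.30, 0.15, 0.25]

clean = text_score(token_scores([0.60, 0.55, 0.70, 0.65]))
noisy = text_score(token_scores(mgt_probs))
print(f"without the noisy token: {clean:.3f}, with it: {noisy:.3f}")
print(is_machine_generated(mgt_probs, threshold=-1.2))
print(is_machine_generated(hgt_probs, threshold=-1.2))
```

In this toy setting a single low-probability token pulls the aggregated score toward the threshold, which is the kind of token-level error that direct aggregation does not correct.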
## IV Contextual Relations among Token\-level Scores in MGT Detection
To understand the relationship between context tokens’ detection scores, we follow existing work \[[22](https://arxiv.org/html/2605.16107#bib.bib77)\] and consider the token generation process of a simplified single\-layer transformer model with single\-head attention \(footnote 1\):

$$
\begin{gathered}x_{t+1}=\mathcal{F}\left(\alpha_t X_{t-1}W_V W_O\right),\text{ where }\\ \alpha_t=\operatorname{softmax}\left(1/t\cdot x_t W_Q W_K^{\top}X_{t-1}^{\top}\right).\end{gathered} \tag{1}
$$

Here, $\alpha_t$ denotes the attention scores and $x_t$ is the embedding of token $s_t$\. The matrix $X_{t-1}$ stacks the embeddings $x_1,\ldots,x_{t-1}$, with $x_j$ as its $j$\-th row\. $W_Q, W_K, W_V$, and $W_O$ are the attention parameters\. The attention block is followed by an MLP block, denoted as $\mathcal{F}(\cdot)$, which is a two\-layer network with skip connections:

$$
\mathcal{F}(x)=x+W_2\operatorname{relu}\left(W_1 x\right).
$$

Footnote 1: The theoretical framework is not intended to precisely characterize the full\-scale LLM, but rather to reveal potential contextual relations inherent in token\-level detection scores\. Compared to the intractable multi\-layer, multi\-head Transformer, the single\-layer, single\-head setting serves as an analytically tractable surrogate, and our empirical observations \(i\.e\., Figs\. [1](https://arxiv.org/html/2605.16107#S4.F1), [2](https://arxiv.org/html/2605.16107#S4.F2), and [3](https://arxiv.org/html/2605.16107#S4.F3)\) further support such relations\.
Based on this definition, we can derive the following local and global relations regarding the context detection score\.
### IV\-A Local Contextual Relations
The following result characterizes how token\-level detection scores vary between adjacent positions\.
###### Theorem 1
Let $d(s_t)=\mathcal{G}(\alpha_t)$ denote the token\-level detection score at the $t$\-th step, where $\mathcal{G}$ is $L$\-Lipschitz with respect to the $L_\infty$\-norm\. Let $\lambda_K,\lambda_Q,\lambda_V,\lambda_O$ be the largest singular values of parameters $W_K,W_Q,W_V,W_O$, respectively, and let $W=W_V W_O W_Q W_K^{\top}$\. For the transformer defined in Eq\. \([1](https://arxiv.org/html/2605.16107#S4.E1)\), assuming normalized inputs \($\left\|x_t\right\|_2=1$ for all $t$\) and constants $c,\epsilon>0$, consider $a_t x_{t+1}^{\top}\geq(1-\delta)\left\|a_t\right\|_2$ with $\delta\leq\left(\frac{c\epsilon}{\lambda_Q\lambda_K\lambda_V\lambda_O}\right)^{2}$, where $a_t=\alpha_t X_{t-1}W_V W_O$\. If $x_\ell$ satisfies $x_\ell W x_\ell^{\top}\geq c$ and $x_\ell W x_\ell\geq\epsilon^{-1}\max_{j\in[\ell],j\neq\ell}x_j W x_\ell^{\top}$, then

$$
\left|d(s_{t+1})-d(s_t)\right|\leq\frac{L(1+\sqrt{2})\epsilon\cdot\max_j\left(x_j W x_j^{T}\right)}{(t+1)\left\|a_t\right\|_2} \tag{2}
$$
The proof is in Section I of the Supplementary Material \(https://github\.com/Daftstone/Multi\-level\-MGT\-Detection\)\. Theorem 1 \(Eq\. \([2](https://arxiv.org/html/2605.16107#S4.E2)\)\) characterizes how token\-level detection scores evolve across local positions by directly bounding the score difference $|d(s_{t+1})-d(s_t)|$\. Therefore, adjacent token scores are not independent; instead, they are locally coupled through the autoregressive generation process, and this reveals two key properties:
- Neighbor Similarity\. A direct implication of Theorem 1 \(Eq\. \([2](https://arxiv.org/html/2605.16107#S4.E2)\)\) is that token\-level evidence is expected to evolve smoothly over short contextual ranges rather than fluctuate arbitrarily from one step to the next, since the score difference between adjacent tokens is explicitly bounded\. To empirically validate this property, we evaluated the distance \(mean absolute difference\) in detection scores across $k$ hops, i\.e\., $\frac{1}{|S|}\sum_{s\in S}\frac{1}{N-k}\sum_{t=1}^{N-k}|d(s_{t+k})-d(s_t)|$, where $S$ is the text set and $d(s_t)$ is provided in Table [I](https://arxiv.org/html/2605.16107#S2.T1)\. As illustrated in Fig\. [1](https://arxiv.org/html/2605.16107#S4.F1) \(more results can be found in Section V\-A of the Supplementary Material\), the score distance increases consistently with hop size, and adjacent tokens always exhibit the strongest similarity, thereby providing empirical evidence for this neighbor similarity property\.
- Initial Instability\. This theorem also suggests that the detection scores of initial tokens are statistically less stable than those of subsequent tokens\. Because the bound in Eq\. \([2](https://arxiv.org/html/2605.16107#S4.E2)\) carries a factor of $1/(t+1)$, it is looser at the beginning of a sequence\. To validate the initial instability property, we analyzed the distance in detection score between adjacent tokens at position $t$, i\.e\., $\frac{1}{|S|}\sum_{s\in S}|d(s_{t+1})-d(s_t)|$\. As shown in Fig\. [2](https://arxiv.org/html/2605.16107#S4.F2) \(more results are provided in Section V\-B of the Supplementary Material\), the adjacent score difference is much larger near the beginning of the text and then gradually decreases before stabilizing\. Combined with the neighbor\-similarity result, this confirms that early token\-level evidence is substantially less reliable\.
Figure 1: The detection score distances \(Mean Absolute Difference\) of neighbors with different hops in the Essay dataset\. Log\-likelihood scores are used\.

Figure 2: The detection score distances \(Mean Absolute Difference\) of 1\-hop neighbors at different positions in Essay\. Log\-likelihood and Log\-Rank scores are used\.

These results reveal a clear local structure of token\-level evidence: nearby scores tend to be smooth, but this smoothness is significantly weaker at early positions\. Notably, although the theoretical results impose no constraints on human texts, our empirical results suggest that they exhibit similar relations, given that LLMs mimic human text\. This motivates the local relation modeling introduced in Section [V\-A](https://arxiv.org/html/2605.16107#S5.SS1)\.
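The two empirical checks used in this subsection (the k-hop mean absolute difference and the position-wise adjacent difference) can be sketched as follows. This is a minimal illustration under stated assumptions: each text is given as a list of token-level scores, positions are 0-based, and texts too short for a given hop or position are skipped. The function names and `toy_scores` are hypothetical and do not come from the paper or its released code.

```python
def k_hop_distance(texts_scores, k):
    """Mean absolute k-hop score difference, averaged per text and then over texts."""
    per_text = []
    for d in texts_scores:
        n = len(d)
        if n <= k:
            continue  # this text is too short for hop size k
        total = sum(abs(d[t + k] - d[t]) for t in range(n - k))
        per_text.append(total / (n - k))
    return sum(per_text) / len(per_text)


def adjacent_distance_at(texts_scores, t):
    """Mean |d(s_{t+1}) - d(s_t)| at 0-based position t over texts that contain it."""
    diffs = [abs(d[t + 1] - d[t]) for d in texts_scores if len(d) > t + 1]
    return sum(diffs) / len(diffs)


# Hypothetical token-level scores for two short texts (illustrative values only).
toy_scores = [
    [-3.0, -1.0, -0.8, -0.7, -0.65, -0.6],
    [-2.5, -1.2, -1.0, -0.9, -0.85, -0.8],
]

for k in (1, 2, 3):
    print(k, k_hop_distance(toy_scores, k))
for t in (0, 1, 3):
    print(t, adjacent_distance_at(toy_scores, t))
```

On these toy values the distance grows with the hop size and the adjacent difference is largest at the first position, mirroring the qualitative trends reported for Figs. 1 and 2. Both helpers raise `ZeroDivisionError` if no text is long enough, which a real analysis script would need to handle.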
### IV\-B Global Contextual Relations
Figure 3: Visualization of the global statistics for detection scores of MGT and HGT\. Here, DetectLLM scores are used and $t_0=20$\. \(a\) demonstrates Score Stability; \(b\)–\(c\) demonstrate Adjacent Difference Stability; and \(d\)–\(e\) demonstrate Long\-Range Difference Stability\.

The local analysis above focuses on the adjacent case, i\.e\., one\-step contextual transitions\. However, contextual dependence in MGT is not limited to immediate neighbors\. Token\-level scores may also exhibit structured relations over longer ranges\. The following theorem extends the local result to multi\-hop score transitions\.
###### Theorem 2
Let $d(s_t)=\mathcal{G}(\alpha_t)$ denote the token\-level detection score at the $t$\-th step, where $\mathcal{G}$ is $L$\-Lipschitz with respect to the $L_\infty$\-norm\. Let $\lambda_K,\lambda_Q,\lambda_V,\lambda_O$ be the largest singular values of parameters $W_K,W_Q,W_V,W_O$, respectively, and let $W=W_V W_O W_Q W_K^{\top}$\. For the transformer defined in Eq\. \([1](https://arxiv.org/html/2605.16107#S4.E1)\), assuming normalized inputs \($\left\|x_t\right\|_2=1$ for all $t$\) and constants $c,\epsilon>0$, consider $a_t x_{t+1}^{\top}\geq(1-\delta)\left\|a_t\right\|_2$ with $\delta\leq\left(\frac{c\epsilon}{\lambda_Q\lambda_K\lambda_V\lambda_O}\right)^{2}$, where $a_t=\alpha_t X_{t-1}W_V W_O$\. If $x_\ell$ satisfies $x_\ell W x_\ell^{\top}\geq c$ and $x_\ell W x_\ell\geq\epsilon^{-1}\max_{j\in[\ell],j\neq\ell}x_j W x_\ell^{\top}$, then

$$
\left|d(s_{t+k})-d(s_t)\right|\leq\frac{L(1+\sqrt{2})\epsilon\cdot\max_j\left(x_j W x_j^{T}\right)\cdot\ln\left(1+\frac{k}{t}\right)}{\min_{0\leq i<k}\|a_{t+i}\|_2} \tag{3}
$$
Theorem 2 \(Eq\. \([3](https://arxiv.org/html/2605.16107#S4.E3)\)\) extends Theorem 1 \(Eq\. \([2](https://arxiv.org/html/2605.16107#S4.E2)\)\) from one\-step transitions to multi\-hop score differences\. Its key implication is that even long\-range score changes remain structurally bounded rather than arbitrary\. This motivates us to move beyond local smoothness and examine how token\-level evidence is globally organized across the entire text\.
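One way to see where the logarithmic factor in Eq. (3) can come from is to chain the one-step bound of Eq. (2) through the triangle inequality. The following is only a sketch under the assumption that the one-step conditions hold at every intermediate step $t+i$; the shorthand $C$ is introduced here for illustration, and the authors' actual proof in the Supplementary Material may proceed differently.

```latex
% Shorthand (our notation): C = L(1+\sqrt{2})\,\epsilon \cdot \max_j \left(x_j W x_j^{T}\right)
\begin{aligned}
\left|d(s_{t+k}) - d(s_t)\right|
  &\le \sum_{i=0}^{k-1} \left|d(s_{t+i+1}) - d(s_{t+i})\right|
     && \text{(triangle inequality)} \\
  &\le \sum_{i=0}^{k-1} \frac{C}{(t+i+1)\,\|a_{t+i}\|_2}
     && \text{(Eq. (2) applied at step } t+i\text{)} \\
  &\le \frac{C}{\min_{0 \le i < k} \|a_{t+i}\|_2} \sum_{j=t+1}^{t+k} \frac{1}{j} \\
  &\le \frac{C}{\min_{0 \le i < k} \|a_{t+i}\|_2} \int_{t}^{t+k} \frac{dx}{x}
   = \frac{C \,\ln\left(1 + \frac{k}{t}\right)}{\min_{0 \le i < k} \|a_{t+i}\|_2}
\end{aligned}
```

The last step uses $\frac{1}{j} \le \int_{j-1}^{j} \frac{dx}{x}$ for each $j \ge 2$, so the harmonic partial sum is bounded by the integral.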
Ignoring the unstable initial part, i\.e\., when $t$ is large, $\ln(1+k/t)$ in Eq\. \([3](https://arxiv.org/html/2605.16107#S4.E3)\) tends toward 0, implying that the score transitions of MGT are expected to be more tightly constrained\. In contrast, human\-generated text \(HGT\) is not constrained by the same autoregressive generation trajectory of a fixed LLM and may therefore exhibit more dispersed score variations\. Guided by this observation, we focus on three global properties\.
- Score Stability, which is measured by the variance of token\-level scores in the latter part: $\sigma^{2}_{\mathrm{late}}(s)=\mathrm{Var}(\{d(s_t)\}_{t>t_0})$, where $t_0$ is the predefined initial part length\. The theorem motivates the expectation that score differences in the latter part of MGT tend to be more constrained \(since $\ln(1+k/t)\to 0$ as $t$ grows\) and thus exhibit lower variance\. Empirically, Fig\. [3](https://arxiv.org/html/2605.16107#S4.F3)\(a\) shows that MGT indeed exhibits lower score variance\.
- Adjacent Difference Stability, which is measured by the variance and mean of the adjacent score difference in the latter part: $\sigma^{2}_{\mathrm{late,adj}}(s)=\mathrm{Var}(\Delta_{t>t_0}^{(1)})$ and $\mu_{\mathrm{late,adj}}(s)=\mathrm{Mean}(\Delta_{t>t_0}^{(1)})$, where $\Delta_t^{(1)}=\left|d(s_{t+1})-d(s_t)\right|$\. Since the local case \($k=1$\) of Theorem 2 \(Eq\. \([3](https://arxiv.org/html/2605.16107#S4.E3)\)\) is also tighter at later positions \($t$ is large\), these statistics are expected to take lower values in MGT\. As with Score Stability, the diversity inherent in HGT results in higher values for these metrics\. As shown in Fig\. [3](https://arxiv.org/html/2605.16107#S4.F3)\(b\)–\(c\), MGT again exhibits lower values, indicating more stable evolution in the latter part of the sequence\.
- Long\-Range Difference Stability, which is measured by the variance and mean of the long\-range score difference in the latter part: $\sigma^{2}_{\mathrm{late,long}}(s)=\mathrm{Var}(\Delta_{t>t_0}^{(|N-t_0|/2)})$ and $\mu_{\mathrm{late,long}}(s)=\mathrm{Mean}(\Delta_{t>t_0}^{(|N-t_0|/2)})$, where $\Delta_t^{(|N-t_0|/2)}=\left|d(s_{t+|N-t_0|/2})-d(s_t)\right|$\. The theorem suggests that, even when the hop size \(i\.e\., $k$\) is large, the fact that $t$ is also large makes these long\-range differences relatively small and stable\. Fig\. [3](https://arxiv.org/html/2605.16107#S4.F3)\(d\)–\(e\) confirms that MGT displays smaller and more stable long\-range score differences than HGT, revealing a distinctive global organization of token\-level scores\.
These findings complement the local results in Subsection[IV\-A](https://arxiv.org/html/2605.16107#S4.SS1)\. Local contextual relations describe short\-range smoothness and early\-position instability, whereas global contextual relations reveal how token\-level evidence is organized over the entire text\.
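The three global statistics above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `global_statistics`, the use of population variance, and the integer rounding of the hop size $|N-t_0|/2$ are assumptions made here for concreteness.

```python
from statistics import mean, pvariance

def global_statistics(d, t0=20):
    """Late-part statistics of token-level scores d = [d(s_1), ..., d(s_N)]."""
    late = d[t0:]                # scores at positions t > t0 (1-indexed)
    n = len(late)                # equals N - t0
    if n < 4:
        raise ValueError("need at least 4 tokens after position t0")

    def abs_diffs(k):
        # Delta_t^(k) = |d(s_{t+k}) - d(s_t)| wherever both ends exist
        return [abs(late[i + k] - late[i]) for i in range(n - k)]

    adj = abs_diffs(1)               # adjacent differences (k = 1)
    long_range = abs_diffs(n // 2)   # long-range hop, k = |N - t0| / 2 (rounded down: assumption)
    return {
        "var_late": pvariance(late),        # Score Stability
        "var_adj": pvariance(adj),          # Adjacent Difference Stability (variance)
        "mean_adj": mean(adj),              # Adjacent Difference Stability (mean)
        "var_long": pvariance(long_range),  # Long-Range Difference Stability (variance)
        "mean_long": mean(long_range),      # Long-Range Difference Stability (mean)
    }
```

Under the analysis above, MGT would be expected to yield smaller values than HGT for each of the five entries.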
## V Multi-Level Contextual Relation Modeling
Figure 4: The workflow of the multi-level contextual token relation modeling framework.

The previous section shows that token-level detection scores exhibit both local and global contextual relations. Motivated by these observations, we develop a multi-level framework with three components: a local calibration module for short-range score refinement, a global rule-support reasoning module for sequence-level pattern modeling, and a joint inference mechanism that combines the two signals. The framework is shown in Fig. [4](https://arxiv.org/html/2605.16107#S5.F4).
### V-A Local Relation Modeling via Markov Calibration
Section[IV\-A](https://arxiv.org/html/2605.16107#S4.SS1)reveals two local properties of token\-level detection scores: Neighbor Similarity and Initial Instability\. To capture these properties, we adopt a lightweight Markov calibration module that refines token scores before token\-level score aggregation\.
#### V-A1 Markov Random Field for MGT Detection
For each token $s_t$ in text $s$, we assign a binary random variable $y_{s_t}$, where $y_{s_t}=0$ and $y_{s_t}=1$ indicate a human- or machine-generated token, respectively, as measured by the detection score of the token. (Note that token labels are not absolute but depend on the context in which they appear. For example, "the" can be a human token or a machine token depending on the text.) Let $y_s$ denote the label set for all tokens in text $s$. The pMRF over these tokens can be formalized as a Gibbs distribution [[6](https://arxiv.org/html/2605.16107#bib.bib95)]:

$$P(y_s)=\frac{1}{Z}\exp(-E(s,y_s)),$$

where $Z$ is a normalizing constant and $E(s,y_s)$ is the energy function. Our objective is to maximize the posterior probability of the token labels $y_s$ by minimizing the global energy function $E(s,y_s)$. The energy function typically consists of two components, the unary potential $\Psi_U$ and the pairwise potential $\Psi_P$:

$$E(s,y_s)=\sum_{t=1}^{N}\Psi_U\left(s_t,y_{s_t}\right)+\sum_{t=1}^{N}\sum_{s_j\in\mathcal{N}(s_t)}\Psi_P\left(y_{s_t},y_{s_j}\right),$$

where $\mathcal{N}(s_t)=\{s_{t-1},s_{t+1}\}$ denotes the adjacent tokens of token $s_t$.
Unary potential $\Psi_U\left(s_t,y_{s_t}\right)$ quantifies the cost of assigning label $y_{s_t}$ to token $s_t$. We let $\Psi_U\left(s_t,y_{s_t}\right)=-\log p(s_t)$, where $p(s_t)$ is the output probability from the original detector, which is measured by the 0-1 normalized detection score of token $s_t$, i.e., the normalized $d(s_t)$.
Pairwise potential $\Psi_P\left(y_{s_t},y_{s_j}\right)$ models the similarity in detection scores between adjacent tokens. A penalty is applied if two adjacent tokens are assigned different labels; otherwise, a reward is given. This enforces label smoothness and captures the neighbor similarity property:

$$\Psi_P\left(y_{s_t},y_{s_j}\right)=w\cdot\left(2\cdot I(y_{s_t}\neq y_{s_j})-1\right),\tag{4}$$

where $I(\cdot)$ is the indicator function and $w\geq 0$ is the reward and penalty factor. This implies an energy penalty of $w$ when adjacent tokens have different labels; otherwise, the reward is $-w$.
To model the initial instability property, we introduce a positional weighting function $\beta(t)$ in the pairwise potential. This function assigns lower weights to pairwise potentials at earlier positions, thereby mitigating the errors caused by unstable initial neighbor tokens. In this paper, we define the positional weighting function $\beta(t)$ as a sigmoid function to ensure a smooth transition of weights, and the revised pairwise potential is then given by:

$$\begin{gathered}\Psi_P\left(y_{s_i},y_{s_j}\right)=\beta(j)\cdot w\cdot\left(2\cdot I\left(y_{s_i}\neq y_{s_j}\right)-1\right),\\ \text{with }\beta(j)=\frac{1}{1+\exp\left(-\left(j-t_0\right)\right)},\end{gathered}\tag{5}$$

where $t_0$ is the predefined initial part length, effectively suppressing the pairwise potential of tokens before $t_0$.
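As a concrete reading of the energy function with the position-weighted pairwise potential, the sketch below evaluates $E(s,y_s)$ for a given labeling. It is illustrative only. The function names are hypothetical, and the label-dependent unary cost ($-\log p(s_t)$ for the machine label and $-\log(1-p(s_t))$ for the human label) is an assumption chosen to match the initialization $Q^{(0)}=[1-p(s),p(s)]$ used later in the mean-field update.

```python
import math

def beta(j, t0):
    """Sigmoid positional weight of Eq. 5 for a 1-indexed position j."""
    return 1.0 / (1.0 + math.exp(-(j - t0)))

def energy(p, y, w=1.0, t0=20, eps=1e-8):
    """Energy of labeling y (0 = human, 1 = machine) given normalized scores p."""
    n = len(p)
    total = 0.0
    for t in range(1, n + 1):
        # Unary term (ASSUMPTION): -log p for label 1, -log(1 - p) for label 0
        prob = p[t - 1] if y[t - 1] == 1 else 1.0 - p[t - 1]
        total += -math.log(max(prob, eps))
        # Pairwise terms over the neighbors N(s_t) = {s_{t-1}, s_{t+1}}
        for j in (t - 1, t + 1):
            if 1 <= j <= n:
                disagree = 1 if y[t - 1] != y[j - 1] else 0
                total += beta(j, t0) * w * (2 * disagree - 1)
    return total
```

With equal unary costs, a labeling whose neighbors agree has lower energy than one whose neighbors disagree, and the gap shrinks for positions well before $t_0$ because $\beta$ is close to zero there.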
#### V-A2 Mean Field Approximation in MGT Detection
Exact inference in this MRF is intractable. We therefore adopt a mean-field approximation with a fully factorized variational distribution $Q(y)=\prod_{t=1}^{N}Q_t(y_t)$. After standard derivation (details can be found in Section II of the Supplementary Material), the resulting update can be written in matrix form as

$$Q^{(r)}=\mathrm{softmax}\!\left(\log Q^{(r-1)}-AQ^{(r-1)}\Big(W\odot\begin{bmatrix}-1&1\\ 1&-1\end{bmatrix}\Big)\right),\tag{6}$$

with $Q^{(0)}=[1-p(s),p(s)]$, where $p(s)=[p(s_1),\ldots,p(s_N)]^{\mathrm{T}}$. For the adjacency matrix $A$, $A_{t,t+1}=\beta(t+1)$ for all $t=0,\ldots,N-2$, $A_{t,t-1}=\beta(t-1)$ for all $t=1,\ldots,N-1$, and 0 otherwise, so that each neighbor $j$ is weighted by $\beta(j)$ as in Eq. [5](https://arxiv.org/html/2605.16107#S5.E5). $W\in\mathbb{R}_{+}^{2\times 2}$ contains the reward and penalty weights. This iterative refinement propagates reliable local evidence across neighboring tokens while suppressing unstable early interactions.
After $T$ iterations, we further downweight early positions in the final calibrated scores:

$$Q^{\mathrm{final}}=\mathrm{Diag}\!\big(\beta(1),\dots,\beta(N)\big)\,Q^{(T)}.\tag{7}$$

The calibrated token scores are then passed to the original detector's aggregation component. If the base detector is decomposed into a token-level scoring module $f_{\mathrm{tok}}$ and a text-level decision module $f_{\mathrm{dec}}$, the enhanced detector becomes

$$f_{\mathrm{enh}}(s)=f_{\mathrm{dec}}\!\big(f_{\mathrm{mrf}}(f_{\mathrm{tok}}(s))\big),$$

where $f_{\mathrm{mrf}}$ denotes the Markov calibration module of Eq. [7](https://arxiv.org/html/2605.16107#S5.E7).
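The calibration of Eqs. 6 and 7 can be sketched without matrix libraries, since each token only interacts with its two neighbors. The code below is a minimal sketch under stated assumptions: `mean_field_calibrate` is a hypothetical name, the weight matrix $W$ defaults to all ones (in the paper it is learned), each neighbor $j$ is weighted by $\beta(j)$ following Eq. 5, and a small `eps` is added to keep the logarithm finite.

```python
import math

def beta(j, t0):
    """Sigmoid positional weight for a 1-indexed position j."""
    return 1.0 / (1.0 + math.exp(-(j - t0)))

def mean_field_calibrate(p, w=((1.0, 1.0), (1.0, 1.0)), t0=20, T=10, eps=1e-8):
    """Return Q_final (one [human, machine] row per token) from normalized scores p."""
    n = len(p)
    q = [[1.0 - pt, pt] for pt in p]                      # Q^(0) = [1 - p(s), p(s)]
    c = [[-w[0][0], w[0][1]], [w[1][0], -w[1][1]]]        # W ⊙ [[-1, 1], [1, -1]]
    for _ in range(T):
        new_q = []
        for t in range(n):
            m = [0.0, 0.0]                                # row t of A Q^(r-1)
            for j in (t - 1, t + 1):
                if 0 <= j < n:
                    b = beta(j + 1, t0)                   # neighbor weight beta(j), 1-indexed
                    m[0] += b * q[j][0]
                    m[1] += b * q[j][1]
            logits = [
                math.log(max(q[t][l], eps)) - (m[0] * c[0][l] + m[1] * c[1][l])
                for l in (0, 1)
            ]
            mx = max(logits)                              # numerically stable softmax
            ex = [math.exp(x - mx) for x in logits]
            z = sum(ex)
            new_q.append([e / z for e in ex])
        q = new_q
    # Eq. 7: downweight early positions with Diag(beta(1), ..., beta(N))
    return [[beta(t + 1, t0) * q[t][0], beta(t + 1, t0) * q[t][1]] for t in range(n)]
```

In this sketch, an isolated low score surrounded by high-scoring neighbors in the later part of the text is pulled toward its neighbors, while the rows for the first few positions are scaled close to zero.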
### V-B Global Relation Modeling via Rule-support Reasoning
Subsection[IV\-B](https://arxiv.org/html/2605.16107#S4.SS2)shows that MGT exhibits global relations, including lower score variance, lower adjacent\-difference variance, and lower long\-range difference variance\. These global patterns are difficult to capture by local smoothing alone\. Given the inferential power of symbolic logic\[[5](https://arxiv.org/html/2605.16107#bib.bib94)\], we introduce a rule\-support reasoning module\.
For a detector estimating the probability $p(y\mid s)$, we introduce a latent variable $\alpha$ to denote the logical rule. This yields the following ideal rule-reasoning formulation:

$$p(y\mid s,b)=\sum\nolimits_{\alpha}p(y\mid\alpha,s,b)\,p(\alpha\mid s,b),\tag{8}$$

where $b$ represents the prior knowledge about the rules, e.g., the desirable form of rules. We can further decompose it as follows (proof in Section III of the Supplementary Material):

$$p(y\mid s,b)\propto\sum_{\alpha}\underbrace{p(b\mid\alpha)}_{\text{Rule Prior}}\cdot\underbrace{p(y\mid\alpha)}_{\text{Detection}}\cdot\underbrace{p(\alpha\mid s)}_{\text{Rule Generation}}.\tag{9}$$
The three derived terms correspond to three main parts of the rule\-support reasoning module\.
#### V-B1 Rule Prior $p(b\mid\alpha)$
The rule prior constrains the rules used for detection to feasible ones. In this paper, rules are built from the global statistics identified in the previous section, e.g., $\sigma^{2}_{\mathrm{late}}(s)$ and $\sigma^{2}_{\mathrm{late,adj}}(s)$. More generally, suppose each text $s$ is represented by $M$ global statistics:

$$z(s)=\big[z_1(s),z_2(s),\dots,z_M(s)\big].$$

For each statistic $z_m(s)\in z(s)$, we compute its empirical range on the training set $\mathcal{D}_{\mathrm{train}}$ and uniformly divide it into $K$ intervals using $K-1$ thresholds $\{\tau_{m,j}\}_{j=1}^{K-1}$:

$$\tau_{m,j}=a_m+\frac{j-1}{K-1}(b_m-a_m),\qquad j=1,\dots,K-1,$$

where

$$a_m=\min_{s\in\mathcal{D}_{\mathrm{train}}}z_m(s),\qquad b_m=\max_{s\in\mathcal{D}_{\mathrm{train}}}z_m(s).$$
Then, for each statistic $z_m(s)$, we obtain $K$ threshold atoms (the smallest unit of a rule):

$$z_m(s)\leq\tau_{m,1},\;\tau_{m,1}<z_m(s)\leq\tau_{m,2},\;\dots,\;z_m(s)>\tau_{m,K-1},$$

and similarly for the other statistics. Based on these threshold atoms, we define the feasible rule space $\Omega$ using "AND" conjunction, i.e., a rule takes the form "atom 1 AND atom 2 AND …".
Naturally, a candidate rule $\alpha$ is feasible (i.e., $\alpha\in\Omega(\alpha)$) only if it satisfies the following constraints: (1) each atom in $\alpha$ must be one of the threshold atoms defined above; (2) exactly one threshold atom is selected from each statistic, so as to avoid impossible rules such as $z_m(s)\leq\tau_{m,1}$ AND $\tau_{m,1}<z_m(s)\leq\tau_{m,2}$; (3) the number of threshold atoms in the rule $\alpha$ is $M$. Therefore, the rule prior is

$$p_h(b\mid\alpha)=\begin{cases}1,&\alpha\in\Omega(\alpha),\\ 0,&\text{otherwise}.\end{cases}\tag{10}$$
#### V-B2 Rule Generation $p(\alpha\mid s)$
For a given input text $s$, each statistic $z_m(s)$ falls into one interval and therefore activates exactly one threshold atom, denoted by $o_m(s)$. We then generate the rule for $s$ deterministically as

$$\alpha(s)=o_1(s)\ \text{AND}\ o_2(s)\ \text{AND}\ \dots\ \text{AND}\ o_M(s).$$

Thus, rule generation is a direct mapping from the global statistics to a deterministic rule, and correspondingly, the rule-generation distribution degenerates to a one-hot form centered at $\alpha(s)$, that is,

$$p(\alpha\mid s)=\begin{cases}1,&\alpha=\alpha(s),\\ 0,&\text{otherwise}.\end{cases}\tag{11}$$
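The threshold construction and the deterministic rule generation can be sketched as follows. This is an illustration under assumptions: the function names are hypothetical, atoms are encoded as integer indices $0,\dots,K-1$, the thresholds follow the printed formula $\tau_{m,j}=a_m+\frac{j-1}{K-1}(b_m-a_m)$, and a statistic that equals a threshold is assigned to the atom below it.

```python
from bisect import bisect_left

def fit_thresholds(train_values, K=10):
    """K-1 thresholds for one statistic, computed from its training-set range."""
    a, b = min(train_values), max(train_values)
    return [a + (j - 1) * (b - a) / (K - 1) for j in range(1, K)]

def activated_atom(z, thresholds):
    """Index of the single atom activated by z.

    0      means z <= tau_1,
    j      means tau_j < z <= tau_{j+1},
    K - 1  means z > tau_{K-1}.
    """
    return bisect_left(thresholds, z)

def generate_rule(stats, thresholds_per_stat):
    """Deterministic rule alpha(s): one activated atom per global statistic."""
    return tuple(activated_atom(z, th) for z, th in zip(stats, thresholds_per_stat))
```

Because every statistic maps to exactly one atom, the returned tuple always has length $M$ and satisfies the three feasibility constraints by construction.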
#### V-B3 Detection $p(y\mid\alpha)$
Ideally, the detection term $p(y\mid\alpha)$ can be estimated by collecting all samples that satisfy rule $\alpha$ and then calculating the fraction of them that have label $y$, i.e.,

$$\hat{p}(y\mid\alpha)=\frac{n_{\alpha,y}}{n_{\alpha}},$$

where $n_{\alpha}$ is the number of training texts satisfying rule $\alpha$, and $n_{\alpha,y}$ is the number of such texts with class label $y$.

However, this ideal estimation is unreliable in our setting. Because the training set contains only a few texts while the number of possible conjunction rules grows combinatorially after discretization ($K^{M}$), many specific rules are supported by very few samples. As a result, the exact estimate of $p(y\mid\alpha)$ becomes highly sparse and unstable.
Therefore, in the actual implementation, we approximate the detection term by a stable rule-support score derived from the activated atomic rules. For an atom $o$, we compute its detection-support score as

$$r(y\mid o)=\begin{cases}\frac{n_{o,y}}{n_{o}},&n_{o}>0,\\[4pt] \frac{1}{2},&n_{o}=0,\end{cases}$$

where $n_{o}$ is the number of training texts satisfying atom $o$, and $n_{o,y}$ is the number of such texts with label $y$. For unseen atoms, we use a neutral score.

Given the generated rule $\alpha(s)=\bigwedge_{m=1}^{M}o_m(s)$, we define the practical approximation of the detection term by aggregating the support scores of its activated atoms:

$$r(y\mid\alpha(s))=\frac{1}{M}\sum_{m=1}^{M}r(y\mid o_m(s)).$$
For the rule $\alpha(s)$ extracted from $s$, it is clear that $p(b\mid\alpha)=1$ and $p(\alpha\mid s)=1$, so the rule-support reasoning formulation is approximated in practice as:

$$\tilde{p}(y\mid s,b)\propto r(y\mid\alpha(s)).$$

In the binary setting of MGT detection, we use

$$r_{\mathrm{rule}}(s)=r(\mathrm{MGT}\mid\alpha(s))\tag{12}$$

as the output of the rule-support reasoning module.
Notably, the rule score is not intended to be a calibrated posterior probability\. Instead, it is a support score induced by global statistics\. This design avoids the sparsity of full conjunction rules while preserving the semantics of rule activation\. We also compare with strict probabilistic rules in Section[VI\-D](https://arxiv.org/html/2605.16107#S6.SS4)to demonstrate the rationality of our design\.
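The atom-level support scores and their aggregation into $r_{\mathrm{rule}}(s)$ can be sketched as below. The names `fit_atom_support` and `rule_score` are hypothetical, rules are assumed to be tuples of integer atom indices (one per statistic), and label 1 is assumed to denote MGT.

```python
from collections import defaultdict

def fit_atom_support(train_rules, labels):
    """Count n_o and n_{o,MGT} for every (statistic index, atom index) pair."""
    counts = defaultdict(lambda: [0, 0])      # (m, atom) -> [n_o, n_{o,MGT}]
    for rule, y in zip(train_rules, labels):
        for m, atom in enumerate(rule):
            entry = counts[(m, atom)]
            entry[0] += 1
            entry[1] += int(y == 1)
    return dict(counts)

def rule_score(rule, counts):
    """r_rule(s): mean MGT support over the M activated atoms of the rule."""
    supports = []
    for m, atom in enumerate(rule):
        n_o, n_mgt = counts.get((m, atom), (0, 0))
        # Neutral score 1/2 for atoms never seen in training
        supports.append(n_mgt / n_o if n_o > 0 else 0.5)
    return sum(supports) / len(supports)
```

Averaging per-atom supports means a test text receives a usable score even when its full conjunction of $M$ atoms never appeared in the training set, which is the sparsity issue the atom-level approximation is designed to avoid.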
### V-C Joint Multi-Level Inference
The local and global modules play complementary roles\. The local branch refines token\-level evidence by exploiting local contextual consistency, while the global branch captures global contextual relations to provide a confident score\.
Since $r_{\mathrm{rule}}(s)$ is a rule-support score rather than a strict probability, we treat it as a complementary confidence that adjusts the final detector score, rather than as a standalone prediction score. The final detection score is

$$F(s)=f_{\mathrm{enh}}(s)+\lambda\,r_{\mathrm{rule}}(s),\tag{13}$$

where $\lambda\geq 0$ is a learnable coefficient that controls the contribution of the global rule confidence.
Complexity Analysis. The parameters of our method are learned from text-level supervision in the training set. For the local calibration module, since the update only involves sparse-dense matrix multiplications over adjacent positions, its computational complexity is $\mathcal{O}(NT)$ for a text of length $N$ and $T$ refinement iterations. For the global rule-support reasoning module, the complexity of computing global statistics is $\mathcal{O}(NM)$, determining the threshold atom to which each statistic belongs is $\mathcal{O}(MK)$, and aggregating the rule support score is $\mathcal{O}(M)$. Overall, the complexity is $\mathcal{O}(NT+NM+MK)$.
Overall Framework. Alg. [1](https://arxiv.org/html/2605.16107#alg1) summarizes the overall inference procedure of the proposed framework. Specifically, the Local Branch (Lines 5-13) first computes raw token-level scores from the base detector, and then refines them through the Markov-informed calibration module. This produces the locally enhanced detection score $f_{\mathrm{enh}}(s)$. The Global Branch (Lines 14-23) then computes the global statistics of the input text, maps each statistic to its corresponding threshold atom, constructs a deterministic logical rule, and aggregates the activated atom supports into the rule score $r_{\mathrm{rule}}(s)$. Finally, the two branches are fused through a simple additive form to obtain the final score $F(s)$ (Line 3).
Algorithm 1 Joint Multi-Level Inference for MGT Detection

Require: Text $s=\{s_t\}_{t=1}^{N}$; base detector $f_{\mathrm{tok}},f_{\mathrm{dec}}$.
Require: Local calibration parameters $W_{\mathrm{mrf}},t_0,T$.
Require: Global statistic functions $\{z_m(\cdot)\}_{m=1}^{M}$; thresholds $\{\tau_{m,j}\}_{m=1,j=1}^{M,K-1}$; atom support scores $r(y\mid o)$; fusion weight $\lambda$.
1: $f_{\mathrm{enh}}(s)\leftarrow$ Local\_Calibration($s$).
2: $r_{\mathrm{rule}}(s)\leftarrow$ Global\_Reasoning($s$).
3: $F(s)\leftarrow f_{\mathrm{enh}}(s)+\lambda\,r_{\mathrm{rule}}(s)$.
4: return detection score $F(s)$ of text $s$.
5: function Local\_Calibration($s$)
6: Compute raw token-level scores $p(s)=f_{\mathrm{tok}}(s)$.
7: Set $Q^{(0)}=[1-p(s),p(s)]$.
8: for $r=1$ to $T$ do
9: Update $Q^{(r)}$ according to Eq. [6](https://arxiv.org/html/2605.16107#S5.E6).
10: end for
11: Calculate $Q^{\mathrm{final}}$ according to Eq. [7](https://arxiv.org/html/2605.16107#S5.E7).
12: return calibrated score $f_{\mathrm{enh}}(s)=f_{\mathrm{dec}}(Q^{\mathrm{final}})$.
13: end function
14: function Global\_Reasoning($s$)
15: $z(s)\leftarrow[z_1(s),z_2(s),\dots,z_M(s)]$
16: for $m=1$ to $M$ do
17: Determine the activated threshold atom of $z_m(s)$ according to $\{\tau_{m,j}\}_{j=1}^{K-1}$.
18: Activate the corresponding atom $o_m(s)$.
19: end for
20: $\alpha(s)\leftarrow\bigwedge_{m=1}^{M}o_m(s)$.
21: $r_{\mathrm{rule}}(s)\leftarrow\frac{1}{M}\sum_{m=1}^{M}r(\mathrm{MGT}\mid o_m(s))$.
22: return $r_{\mathrm{rule}}(s)$.
23: end function
## VI Experiments
### VI-A Experimental Settings
#### VI-A1 Datasets
We evaluate the proposed method on four public datasets:
- Essay [[33](https://arxiv.org/html/2605.16107#bib.bib23)]. Each source contains 1,000 samples. The HGT portion comprises IvyPanda essays across various academic levels. The corresponding MGTs were generated by prompting multiple LLMs (including GPT4All, ChatGPT, ChatGPT-turbo, ChatGLM, Dolly, and Claude) based on topics from the source documents.
- Reuters [[33](https://arxiv.org/html/2605.16107#bib.bib23)]. Built on the Reuters 50–50 authorship benchmark, this dataset contains 1,000 articles from 50 writers. ChatGPT-turbo was used to invent a headline for every article. Those auto-generated headlines were then embedded into prompts and submitted to multiple LLMs, including ChatGPT, GPT-4, ChatGPT-turbo, ChatGLM, Dolly, and Claude, to create the machine-generated texts.
- TruthfulQA [[19](https://arxiv.org/html/2605.16107#bib.bib4)]. It contains 817 questions covering 38 categories, including health, law, finance, and politics. The generated answers were produced by several large language models, including GPT4, ChatGPT-turbo, ChatGLM, Dolly, ChatGPT, and StableLM.
- DetectRL [[37](https://arxiv.org/html/2605.16107#bib.bib48)]. HGTs include 2,800 samples each from arXiv, XSum, Writing Prompts, and Yelp. MGTs were generated by ChatGPT, PaLM-2, and Llama-2. The dataset further models practical adversarial settings: (1) paraphrasing attacks that rewrite MGTs with the Dipper [[18](https://arxiv.org/html/2605.16107#bib.bib51)] and Polish paraphrasers, and (2) a mixed-text condition where 1/4 of the machine-generated sentences are randomly replaced with human-written content while the label remains MGT.
#### VI-A2 Baselines
We choose the following detection methods as the base detectors to be enhanced.
- Likelihood [[29](https://arxiv.org/html/2605.16107#bib.bib11)]. It uses an LLM to calculate the log probability of each token in a text. The average of these log probabilities gives a detection score. A higher score indicates a greater likelihood that the text was generated by an LLM.
- Log-Rank [[7](https://arxiv.org/html/2605.16107#bib.bib6)]. Its detection score is computed by first using an LLM to rank each token in a text according to its predicted order in the given context. The logarithm of each token's predicted rank is then calculated. The final score is the average of these values, and a lower score is a strong indicator of machine-generated text.
- Entropy [[7](https://arxiv.org/html/2605.16107#bib.bib6)]. Similar to Log-Rank, it calculates a score for a text by taking the average of each token's conditional entropy within its given context. A lower score indicates a higher likelihood that the text was generated by an LLM.
- DetectGPT [[24](https://arxiv.org/html/2605.16107#bib.bib10)]. Its underlying idea is that text created by LLMs is already a high-probability output. So, when it is slightly altered, the new version is likely to have a lower log probability. In contrast, making similar small changes to human-written text does not consistently lower the log probability; it can just as easily stay the same or increase.
- Fast-DetectGPT (FastGPT) [[2](https://arxiv.org/html/2605.16107#bib.bib49)]. To overcome the major computational expense of DetectGPT, this approach replaces DetectGPT's resource-intensive perturbation step with a more efficient sampling process. It identifies differences in token selection between humans and LLMs using a conditional probability curvature metric.
- Binoculars [[12](https://arxiv.org/html/2605.16107#bib.bib63)]. It is a detection algorithm that requires no training data and distinguishes between human- and machine-generated text by comparing the score differences between a pair of pre-trained LLMs.
- FourierGPT [[41](https://arxiv.org/html/2605.16107#bib.bib64)]. It proposes a detection method from a likelihood spectrum perspective, capturing subtle differences between MGT and HGT by analyzing relative changes in text likelihood values rather than their absolute values.
- AdaDetectGPT (AdaGPT) [[51](https://arxiv.org/html/2605.16107#bib.bib91)]. To overcome the suboptimality of existing logits-based detectors, which rely solely on log-probabilities, this method introduces a classifier that adaptively learns witness functions from training data.
- DNA-DetectLLM (DetectLLM) [[54](https://arxiv.org/html/2605.16107#bib.bib85)]. Based on a DNA-inspired perspective, it detects MGT through a repair-based process. It constructs an ideal AI-generated sequence, iteratively repairs non-optimal tokens, and quantifies the cumulative repair effort as an interpretable detection signal.
The versions utilizing the preliminary work and the proposed method are designated by the suffixes ’M’ and ’Mult’, respectively, e\.g\., DetectGPT\-M and DetectGPT\-Mult\.
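For intuition, the three simplest token-level metrics among the baselines above (Likelihood, Log-Rank, and Entropy) can be sketched as follows. The sketch assumes that a proxy LLM has already produced, for each position, a full next-token probability distribution; the function name `token_metrics` and the list-of-lists input format are hypothetical, and ties in rank are resolved by counting only strictly more probable tokens.

```python
import math

def token_metrics(dists, token_ids):
    """Average log-probability, log-rank, and entropy over a text.

    dists[t]     : next-token probabilities at position t (sums to 1)
    token_ids[t] : index of the token that actually appears at position t
    """
    log_probs, log_ranks, entropies = [], [], []
    for probs, tok in zip(dists, token_ids):
        log_probs.append(math.log(probs[tok]))
        # Rank 1 means the observed token is the most probable one
        rank = 1 + sum(1 for q in probs if q > probs[tok])
        log_ranks.append(math.log(rank))
        entropies.append(-sum(q * math.log(q) for q in probs if q > 0))
    n = len(token_ids)
    return sum(log_probs) / n, sum(log_ranks) / n, sum(entropies) / n
```

The per-position values collected inside the loop play the role of the token-level detection scores $d(s_t)$ that the local calibration module refines before aggregation.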
TABLE II:Performance concerning TPR@FPR\-1% \(%\) on Essay \(left\) and TruthfulQA \(right\)\. The detectors are trained on GPT4All texts on Essay and GPT4 texts on TruthfulQA\.
#### VI-A3 Evaluation Metrics
We evaluate detection performance using two metrics. First, we report the area under the ROC curve (AUROC) as the main metric for binary classification. Second, following recent MGT detection literature, we additionally report the true positive rate at a low false positive rate. It is particularly important in practice because falsely classifying human-written text as machine-generated can be highly undesirable. Specifically, we measure the TPR at an FPR of 1%, denoting this as TPR@FPR-1%.
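Both metrics can be computed directly from detector scores. The following is a minimal sketch, assuming that higher scores indicate machine-generated text; the function names are hypothetical, the AUROC uses the pairwise-comparison definition (quadratic in the number of samples, which is acceptable for illustration), and the threshold rule for TPR@FPR-1% is one conservative choice among several.

```python
def auroc(pos, neg):
    """Probability that a random MGT score exceeds a random HGT score (ties count 1/2)."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def tpr_at_fpr(pos, neg, max_fpr=0.01):
    """TPR at the strictest threshold whose FPR does not exceed max_fpr."""
    k = int(max_fpr * len(neg))                 # number of false positives allowed
    threshold = sorted(neg, reverse=True)[k]    # (k+1)-th largest human score
    # A text is flagged as MGT only if its score is strictly above the threshold
    return sum(p > threshold for p in pos) / len(pos)
```

Here `pos` holds the scores of machine-generated texts and `neg` the scores of human-written texts.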
#### VI-A4 Experimental Protocol and Proxy Models
We follow a strict black-box threat model for the source LLMs: the target generative model is assumed to be entirely unknown to the detector. Accordingly, we employ a proxy model to compute token-level scores for metric-based baselines. Here, GPT-2 is used as the proxy model for all baselines, while GPT-2-XL is used as the scoring model for Fast-DetectGPT and DetectLLM. For the trainable components of our method, we learn the local calibration and global rule-support reasoning parameters from the specified LLM-generated texts, then test on candidate texts generated by different LLMs, in order to evaluate both in-LLM and cross-LLM generalization.
#### VI-A5 Parameter Settings
We conduct five independent runs using fixed random seeds {1,2,3,4,5}. For each dataset, 10% of the data is used for training, and the remaining 90% is evenly split into validation and test sets. The thresholds $\{\tau_{m,j}\}$ and atom-support scores $r(y\mid o)$ are estimated only on the training split. The fusion coefficient $\lambda$ and other tunable parameters are selected on the validation split, and the test split is used only for final evaluation. To ensure fair comparison, the enhanced detectors share the same hyperparameters as their corresponding base detectors. For the local calibration module, the number of MRF refinement iterations is set to $T=10$ by default. For the global rule-support reasoning module, the number of intervals $K$ is set to 10. In addition, the initial part length is set to $t_0=20$. These values are kept fixed across datasets unless otherwise specified.
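The split protocol can be reproduced with a seeded shuffle. The sketch below is an assumption about the mechanics (the paper specifies only the proportions and the seeds), and `split_indices` is a hypothetical helper.

```python
import random

def split_indices(n, seed, train_frac=0.1):
    """Shuffle indices with a fixed seed; 10% train, rest split evenly into val/test."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)       # local RNG, so global state is untouched
    n_train = int(round(train_frac * n))
    rest = idx[n_train:]
    half = len(rest) // 2
    return idx[:n_train], rest[:half], rest[half:]

# One split per run, using the five fixed seeds
splits = [split_indices(1000, seed) for seed in (1, 2, 3, 4, 5)]
```

Thresholds and atom-support scores would then be fitted on the first part only, $\lambda$ tuned on the second, and results reported on the third.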
### VI-B Performance Comparison
In this section, we evaluate the effectiveness of the proposed method’s enhancements in various real\-world scenarios, including cross\-LLM, cross\-domain, mixed\-text detection, and resistance to paraphrasing and adversarial attacks\. The details of these scenarios are provided in Section IV of the Supplementary Material\.
TABLE III: Performance concerning AUROC (%) on Essay (left) and TruthfulQA (right). The detectors are trained on GPT4All texts on Essay and GPT4 texts on TruthfulQA.

Figure 5: The performance improvement compared with original detector (left) and our preliminary work. Here the base detector is DetectLLM. Values greater than 0 indicate an enhanced effect.

#### VI-B1 Performance across Different LLMs
Tables [II](https://arxiv.org/html/2605.16107#S6.T2) and [III](https://arxiv.org/html/2605.16107#S6.T3) report the cross-LLM results under TPR@FPR-1% and AUROC when detectors are trained on GPT4All texts (Essay) and GPT4 texts (TruthfulQA). More results for training on other LLMs' texts and for other datasets can be found in Section V-C of the Supplementary Material. While our preliminary work (-M) already improves most base detectors, the proposed method (-Mult) delivers still stronger cross-LLM generalization, especially under the stricter TPR@FPR-1% metric. For example, on Essay, the average TPR@FPR-1% for Likelihood rises from 52.41% (base) to 77.87% (-M) to 82.65% (-Mult), and for DetectGPT from 0.15% to 37.17% to 82.32%; similar gains are also observed for other detectors. On TruthfulQA, which consists of short texts, our advantage is also evident for most detectors; e.g., Likelihood improves from 46.38% to 50.82% to 69.30%. The same pattern is reflected by AUROC. Compared with TPR@FPR-1%, the absolute gains in AUROC are generally smaller, which is expected because several baselines already achieve relatively high AUROC scores. Even so, the proposed method still provides consistent improvements. Overall, extensive cross-LLM experiments demonstrate the effectiveness of our multi-level contextual relation modeling in capturing MGT features.
#### VI-B2 Performance across Different Domains
We further evaluate cross-domain generalization on the DetectRL benchmark, which comprises four high-risk domains: *arXiv*, *Writing Prompts*, *XSum*, and *Yelp Reviews*. Each detector is trained on one domain and tested on the remaining domains, and Fig. [5](https://arxiv.org/html/2605.16107#S6.F5) summarizes the performance gains of the DetectLLM detector (additional enhanced results for more detectors are provided in Section V-D of the Supplementary Material). The proposed framework improves cross-domain detection in most settings, indicating that contextual modeling is not limited to a specific content domain and transfers well across heterogeneous writing styles and topics. For example, the average AUROC gain compared with the original detector across the 16 train–test pairs is about $+19.1$, and the average AUROC gain compared with our preliminary work is about $+7.2$. We attribute this out-of-domain generalization to the fact that, building upon preliminary local relation modeling, the proposed rule-support reasoning module further leverages global contextual relations, making the enhanced detector less dependent on superficial domain cues and therefore more transferable across domains.
Figure 6: Detection performance concerning AUROC under different mixed texts. All detectors are trained on pure Llama-2-70b texts.

Figure 7: Detection performance under Dipper and Polish paraphrasing texts. All detectors are trained on Llama-2-70b texts.
#### VI-B3 Performance against Mixed Texts
In practice, human–AI collaboration is pervasive, so a detector must distinguish not only pure MGT from HGT, but also mixed texts in which machine-generated content is partially blended with human-written sentences. We therefore evaluate the proposed framework on the mixed-text setting of DetectRL, where 1/4 of the machine-generated sentences are replaced by human-written ones while the label remains MGT. The results are shown in Fig. [6](https://arxiv.org/html/2605.16107#S6.F6). The proposed method improves AUROC over both the original detector and our preliminary work in most settings. This indicates that the proposed multi-level design is effective even when the machine signal is partially diluted by human-written content. Compared with our preliminary work, our additional gains suggest that global contextual relation modeling is helpful in mixed-text scenarios, since mixed texts are harder to distinguish using only local score smoothing and require more global structural cues. Admittedly, the proposed method performs poorly in detecting ChatGPT text. This is because the original detector's performance in this specific context approaches random chance (e.g., an AUROC close to 0.5 and TPR@FPR-1% $<0.1$), indicating that its detection scores are completely unreliable, which in turn leads to suboptimal performance in the enhanced detector. However, as demonstrated by the detection of Llama-2-70b and Google-PaLM texts, the proposed method proves effective provided that the underlying detection scores have a certain degree of predictive value. We emphasize that enhancement efforts should focus on effective detectors rather than weak, nearly random ones, as the latter lack practical utility.
#### VI-B4 Performance against Paraphrasing Attacks
Prior work [[27](https://arxiv.org/html/2605.16107#bib.bib27)] has shown that MGT detectors are particularly vulnerable to paraphrasing attacks, because paraphrasing can preserve the original semantics while concealing surface-level MGT patterns. We therefore evaluate the robustness of the proposed method on two widely used paraphrasing attacks in DetectRL, namely *Polish* and *Dipper*, where the detector is trained on clean texts from Llama-2-70b and tested on paraphrased texts from different source LLMs. The results w.r.t. AUROC are shown in Fig. [7](https://arxiv.org/html/2605.16107#S6.F7). The proposed method outperforms both the original detector and our preliminary work in most settings. This result indicates that the proposed multi-level framework remains effective even when the original MGT has been substantially rewritten. In particular, the gains over the preliminary work suggest that global contextual relation modeling provides additional robustness beyond local calibration alone, since paraphrasing not only introduces local token-level perturbations but may also alter broader sequence-level score patterns.
Figure 8: Detection performance in terms of AUROC on adversarial texts with character perturbation\. All detectors are trained on Llama\-2\-70b texts\.

Figure 9: The empirical MGT probability associated with different threshold atoms induced by the global statistics\. DetectLLM scores are used here\.
#### VI\-B5 Performance against Adversarial Attacks
Beyond paraphrasing, MGT detection is also vulnerable to adversarial attacks \[[53](https://arxiv.org/html/2605.16107#bib.bib93)\]\. Therefore, we further evaluate robustness to stronger adversarial perturbations at the *character* and *word* granularities\. The corresponding comparison results are shown in Fig\. [8](https://arxiv.org/html/2605.16107#S6.F8)\. A highly consistent trend can be observed: across nearly all settings, the proposed method outperforms both the original detector and our preliminary local\-only version, indicating that the proposed multi\-level design substantially improves adversarial robustness\. A plausible explanation is that such attacks act as structured noise on detector scores: the local calibration module suppresses token\-level corruption by exploiting short\-range consistency, while the global rule\-support reasoning module further recovers more stable text\-level cues from score statistics\. Their combination therefore provides stronger robustness than either the original detector or the preliminary local\-only framework\.
### VI\-C Rule Visualization
To provide an intuitive understanding of the effectiveness of the proposed rule\-support reasoning module, Fig\. [9](https://arxiv.org/html/2605.16107#S6.F9) visualizes the empirical MGT probability associated with different rule patterns induced by the global statistics of DetectLLM\. More results can be found in Section V\-E of the Supplementary Material\. A clear trend can be observed: rules corresponding to stronger latter\-part stability are consistently assigned higher MGT probability, whereas rules associated with larger score fluctuations tend to be less indicative of MGT\. This observation is highly consistent with the distributional evidence in Fig\. [3](https://arxiv.org/html/2605.16107#S4.F3), where MGT is concentrated in the low\-variance and low\-difference regions, while HGT exhibits a broader and more dispersed distribution\. Therefore, Fig\. [9](https://arxiv.org/html/2605.16107#S6.F9) provides a qualitative validation of the rule design: the rule\-support reasoning module transforms empirically supported global regularities into explicit rules whose support scores remain aligned with the actual class tendency in the data\. This result further justifies incorporating rule\-support reasoning into our framework, as it captures text\-level cues that complement local calibration and are genuinely informative for distinguishing MGT from HGT\.
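The quantity visualized here, the empirical MGT probability attached to each atom of a global statistic, can be sketched as below\. This is an illustrative reading rather than the paper's implementation: it assumes an atom is "the statistic falls into bucket k", that bucket edges are empirical quantiles of the training values, and that Laplace smoothing is applied\. The names `quantile_edges`, `bucket_index`, and `atom_support` are hypothetical\.

```python
from bisect import bisect_right


def quantile_edges(values, num_buckets):
    """Interior bucket edges placed at evenly spaced empirical quantiles."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(k * n) // num_buckets] for k in range(1, num_buckets)]


def bucket_index(value, edges):
    """Index of the bucket that contains value (0 .. len(edges))."""
    return bisect_right(edges, value)


def atom_support(values, labels, num_buckets, smoothing=1.0):
    """Laplace-smoothed empirical P(MGT | statistic in bucket k) for each k."""
    edges = quantile_edges(values, num_buckets)
    mgt = [0] * num_buckets
    total = [0] * num_buckets
    for v, y in zip(values, labels):
        k = bucket_index(v, edges)
        total[k] += 1
        mgt[k] += y  # label 1 = MGT, label 0 = HGT (assumed encoding)
    support = [(m + smoothing) / (t + 2 * smoothing) for m, t in zip(mgt, total)]
    return edges, support
```

If the statistic is a latter\-part score variance and MGT concentrates in the low\-variance region, the low\-variance buckets receive the higher support values, which matches the trend described for Fig\. 9\.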
### VI\-D Comparison with Probabilistic Rule\-based Method
Figure 10: Performance comparison with the probabilistic rule\-based method on the Essay dataset\. The reported results are the average performance across all LLM\-generated texts\.

Figure 11: Ablation results in terms of AUROC on the Essay dataset\. The reported results are the average performance across all LLM\-generated texts\.

We further compare the proposed rule\-support reasoning with a more direct probabilistic rule\-based alternative that estimates the detection score from the full conjunction rule\. As shown in Fig\. [10](https://arxiv.org/html/2605.16107#S6.F10), our rule\-support reasoning consistently achieves better AUROC across different training sources and base detectors\. This result is consistent with the motivation in Section V\-B\. Although the probabilistic formulation is ideal in principle, directly estimating the conditional probability of a full rule is unreliable in our setting, because the training set is relatively small while the number of possible conjunction rules grows combinatorially after discretization\. As a result, many full rules are supported by very few samples, making the resulting estimates sparse and unstable\.
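The sparsity argument can be made concrete with a small counting sketch\. It assumes each training text has already been mapped to one bucket index per global statistic; the helper names and the toy data layout are hypothetical and chosen only so that the counts are easy to verify by hand\.

```python
from collections import Counter


def full_rule_counts(bucketed_samples):
    """Number of training samples matching each full conjunction rule."""
    return Counter(tuple(sample) for sample in bucketed_samples)


def atom_counts(bucketed_samples):
    """Number of training samples matching each atom (statistic j, bucket k)."""
    counts = Counter()
    for sample in bucketed_samples:
        for j, k in enumerate(sample):
            counts[(j, k)] += 1
    return counts


def toy_samples(num_samples=100, num_buckets=10):
    """Deterministic toy data: three statistics, each spread evenly over the buckets."""
    samples = []
    for i in range(num_samples):
        a = i % num_buckets
        b = (i // num_buckets) % num_buckets
        c = (3 * a + b) % num_buckets
        samples.append((a, b, c))
    return samples
```

With 100 toy samples, three statistics, and ten buckets, there are 1000 possible full rules, so each observed full rule is backed by a single sample, whereas each of the 30 atoms is backed by ten\. Estimates built from atom\-level counts therefore rest on far more evidence than estimates built from full\-rule counts\.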
### VI\-E Ablation Study
We introduce local calibration and global reasoning modules to model multi\-level relations among contextual tokens\. In this section, we verify their effectiveness through ablation studies, with the two ablated variants denoted as "w/o global" and "w/o local"\. Results on the Essay dataset are shown in Fig\. [11](https://arxiv.org/html/2605.16107#S6.F11), with additional results available in Section V\-F of the Supplementary Material\. Since the preliminary work has already established the effectiveness of local calibration, we focus here on the additional value brought by the proposed global rule\-support reasoning module\. The full model consistently achieves the best performance, whereas removing the global reasoning branch leads to a noticeable drop across almost every setting, especially for challenging detectors such as FastGPT and DetectLLM\. This directly verifies that our gains are not merely inherited from the preliminary local module, but are substantially supported by the newly introduced global reasoning component\. Moreover, the variant without local calibration remains competitive in many settings, indicating that the global rule\-support reasoning module itself already contributes strong discriminative power\.
Figure 12: Detection performance at different numbers of rules on the Essay dataset\. All detectors are trained on ChatGPT texts\.

Figure 13: Detection performance at different initial part lengths on the Essay dataset\. All detectors are trained on ChatGPT texts\.
### VI\-F Sensitivity Analysis
Since the preliminary work has already provided a sensitivity analysis for the local calibration module, we here focus on the key hyperparameter introduced by the global rule\-support reasoning module\.
Bucket Number $K$, which is related to the number of rules\. As shown in Fig\. [12](https://arxiv.org/html/2605.16107#S6.F12) \(more results are in Section V\-G of the Supplementary Material\), the proposed method remains relatively stable over a broad range of $K$ values, indicating that the global reasoning branch is not overly sensitive to this discretization choice and thus enjoys good practical robustness\. Admittedly, when $K$ becomes too large, performance tends to decline\. This is because finer discretization partitions the statistic space into too many small buckets, so that each bucket contains fewer training samples\. This weakens the reliability of the corresponding atom\-level support estimates and makes the resulting rule\-support scores more sparse and unstable\. Based on this trade\-off, we use a moderate default setting, i\.e\., $K=10$, in all experiments\.
Initial Part Length $t_0$, which determines the starting position of the latter part used in the global statistics\. As shown in Fig\. [13](https://arxiv.org/html/2605.16107#S6.F13), the proposed method remains stable over a broad range of $t_0$ values, indicating that the global rule\-support reasoning module is not overly sensitive to the exact choice of the latter\-part starting point\. Even with a very small value of $t_0$, the detection performance remains clearly better than that obtained without global rule\-support reasoning \($t_0=0$\)\. This highlights the practicality of our method, since it does not require extensive parameter tuning\.
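To illustrate the role of the initial part length, the sketch below computes two stability statistics over the token\-level scores from position t0 onward\. The specific statistics \(variance and mean absolute adjacent difference\) are an assumption suggested by the "low\-variance and low\-difference" description of MGT in Section VI\-C; the paper's exact definitions are given in its method section and may differ\. The function name `latter_part_statistics` is hypothetical, and the helper does not model the figure's baseline, where a value of zero denotes disabling global reasoning altogether\.

```python
def latter_part_statistics(token_scores, t0):
    """Variance and mean absolute adjacent difference of scores from index t0 on."""
    latter = token_scores[t0:]
    if len(latter) < 2:
        raise ValueError("need at least two token scores after t0")
    mean = sum(latter) / len(latter)
    variance = sum((s - mean) ** 2 for s in latter) / len(latter)
    diffs = [abs(b - a) for a, b in zip(latter, latter[1:])]
    mean_abs_diff = sum(diffs) / len(diffs)
    return variance, mean_abs_diff
```

For a score sequence such as `[5.0, 1.0, 2.0, 2.0, 2.0, 2.0]`, skipping only the first unstable position already shrinks both statistics considerably, and skipping the first two makes them zero\. This is in line with the observation that even a small initial part length is enough to expose latter\-part stability\.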
### VI\-G Running Time
TABLE IV: Running time \(s\) of training and inference phases\.

Table [IV](https://arxiv.org/html/2605.16107#S6.T4) reports the training and inference time on four datasets\. Overall, the proposed method introduces only a small runtime overhead while providing substantially stronger detection performance\. This trend is particularly clear for detectors whose dominant cost already lies in score computation, such as DetectGPT\. These results are consistent with our design: the local calibration mainly involves sparse\-dense operations over adjacent positions, while the global rule\-support reasoning branch only computes a small number of low\-dimensional statistics and bucket\-based scores\. Therefore, our method preserves the practical efficiency advantage of metric\-based detectors and achieves improved performance with only negligible\-to\-moderate additional computational cost\. This observation is also consistent with the complexity analysis in Section [V\-C](https://arxiv.org/html/2605.16107#S5.SS3), where the time complexity is only $O(NT+NM+MK)$\.
## VII Conclusion
In this paper, we have proposed a multi\-level contextual token relation modeling framework for machine\-generated text detection\. By revisiting representative metric\-based detectors under a unified view, we have identified a shared limitation: token\-level detection scores can be biased by the stochasticity of LLM generation, while direct aggregation cannot explicitly correct such imprecision\. To this end, we have modeled contextual relations among token\-level scores from both local and global perspectives\. Locally, we have captured Neighbor Similarity and Initial Instability through a lightweight Markov\-informed calibration module\. Globally, we have characterized score stability patterns of MGTs and introduced a rule\-support reasoning module to model them explicitly\. The two modules are integrated into a joint multi\-level inference framework, leading to improved detection performance across multiple datasets and practical scenarios with low computational overhead\. Nevertheless, the proposed framework is mainly applicable to metric\-based detectors whose scores can be decomposed at the token level\. In addition, future adaptive attacks may deliberately disrupt score stability patterns, which remains an important direction for further study\.
## References
- \[1\]J\. Achiam, S\. Adler, S\. Agarwal, L\. Ahmad, I\. Akkaya, F\. L\. Aleman, D\. Almeida, J\. Altenschmidt, S\. Altman, S\. Anadkat,et al\.\(2023\)Gpt\-4 technical report\.arXiv preprint arXiv:2303\.08774\.Cited by:[§I](https://arxiv.org/html/2605.16107#S1.p1.1)\.
- \[2\]G\. Bao, Y\. Zhao, Z\. Teng, L\. Yang, and Y\. Zhang\(2024\)Fast\-detectgpt: efficient zero\-shot detection of machine\-generated text via conditional probability curvature\.InProceedings of the Twelfth International Conference on Learning Representations,pp\. 1–9\.Cited by:[§I](https://arxiv.org/html/2605.16107#S1.p2.1),[§I](https://arxiv.org/html/2605.16107#S1.p3.1),[§II\-C](https://arxiv.org/html/2605.16107#S2.SS3.p1.1),[§III](https://arxiv.org/html/2605.16107#S3.p1.1),[5th item](https://arxiv.org/html/2605.16107#S6.I2.i5.p1.1)\.
- \[3\]X\. Chen, J\. Wu, S\. Yang, R\. Zhan, Z\. Wu, Z\. Luo, D\. Wang, M\. Yang, L\. S\. Chao, and D\. F\. Wong\(2025\)RepreGuard: detecting llm\-generated text by revealing hidden representation patterns\.arXiv preprint arXiv:2508\.13152\.Cited by:[§II\-C](https://arxiv.org/html/2605.16107#S2.SS3.p1.1)\.
- \[4\]Z\. Chen, Y\. Feng, C\. He, Y\. Deng, H\. Pu, and B\. Li\(2025\)IPAD: inverse prompt for ai detection–a robust and explainable llm\-generated text detector\.arXiv e\-prints,pp\. arXiv–2502\.Cited by:[§II\-C](https://arxiv.org/html/2605.16107#S2.SS3.p1.1)\.
- \[5\]X\. Duan, X\. Wang, P\. Zhao, G\. Shen, and W\. Zhu\(2022\)Deeplogic: joint learning of neural perception and logical reasoning\.IEEE Transactions on Pattern Analysis and Machine Intelligence45\(4\),pp\. 4321–4334\.Cited by:[§V\-B](https://arxiv.org/html/2605.16107#S5.SS2.p1.1)\.
- \[6\]S\. Gao and X\. Zhuang\(2022\)Bayesian image super\-resolution with deep modeling of image statistics\.IEEE Transactions on Pattern Analysis and Machine Intelligence45\(2\),pp\. 1405–1423\.Cited by:[§V\-A1](https://arxiv.org/html/2605.16107#S5.SS1.SSS1.p1.14)\.
- \[7\]S\. Gehrmann, H\. Strobelt, and A\. M\. Rush\(2019\)Gltr: statistical detection and visualization of generated text\.arXiv preprint arXiv:1906\.04043\.Cited by:[§I](https://arxiv.org/html/2605.16107#S1.p3.1),[§II\-B](https://arxiv.org/html/2605.16107#S2.SS2.p1.1),[§II\-C](https://arxiv.org/html/2605.16107#S2.SS3.p1.1),[§III](https://arxiv.org/html/2605.16107#S3.p1.1),[2nd item](https://arxiv.org/html/2605.16107#S6.I2.i2.p1.1),[3rd item](https://arxiv.org/html/2605.16107#S6.I2.i3.p1.1)\.
- \[8\]GPTZero\(2023\)GPTZero official website\.Note:\[Online\]https://gptzero\.meCited by:[§II\-B](https://arxiv.org/html/2605.16107#S2.SS2.p1.1)\.
- \[9\]B\. Guo, X\. Zhang, Z\. Wang, M\. Jiang, J\. Nie, Y\. Ding, J\. Yue, and Y\. Wu\(2023\)How close is chatgpt to human experts? comparison corpus, evaluation, and detection\.arXiv preprint arXiv:2301\.07597\.Cited by:[§I](https://arxiv.org/html/2605.16107#S1.p2.1),[§II\-B](https://arxiv.org/html/2605.16107#S2.SS2.p1.1)\.
- \[10\]H\. Guo, S\. Cheng, X\. Jin, Z\. Zhang, K\. Zhang, G\. Tao, G\. Shen, and X\. Zhang\(2024\)BiScope: ai\-generated text detection by checking memorization of preceding tokens\.Proceedings of the Advances in Neural Information Processing Systems37,pp\. 104065–104090\.Cited by:[§I](https://arxiv.org/html/2605.16107#S1.p3.1),[§III](https://arxiv.org/html/2605.16107#S3.p1.1)\.
- \[11\]R\. Guo, W\. Zeng, F\. Wu, Y\. Kong, Y\. Wu, W\. Dong,et al\.\(2026\)HLD: approximate hierarchical linguistic distribution modeling for llm\-generated text detection\.InProceedings of the Fourteenth International Conference on Learning Representations,pp\. 1–10\.Cited by:[§II\-C](https://arxiv.org/html/2605.16107#S2.SS3.p1.1)\.
- \[12\]A\. Hans, A\. Schwarzschild, V\. Cherepanova, H\. Kazemi, A\. Saha, M\. Goldblum, J\. Geiping, and T\. Goldstein\(2024\)Spotting llms with binoculars: zero\-shot detection of machine\-generated text\.InProceedings of the International Conference on Machine Learning,pp\. 17519–17537\.Cited by:[§I](https://arxiv.org/html/2605.16107#S1.p2.1),[6th item](https://arxiv.org/html/2605.16107#S6.I2.i6.p1.1)\.
- \[13\]Y\. He, S\. Zhang, Y\. Cao, L\. Ma, and P\. Luo\(2025\)DETree: detecting human\-ai collaborative texts via tree\-structured hierarchical representation learning\.InProceedings of the Thirty\-ninth Annual Conference on Neural Information Processing Systems,pp\. 1–10\.Cited by:[§II\-C](https://arxiv.org/html/2605.16107#S2.SS3.p1.1)\.
- \[14\]J\. Hong\(2012\)The state of phishing attacks\.Communications of the ACM55\(1\),pp\. 74–81\.Cited by:[§I](https://arxiv.org/html/2605.16107#S1.p1.1)\.
- \[15\]A\. Hou, J\. Zhang, T\. He, Y\. Wang, Y\. Chuang, H\. Wang, L\. Shen, B\. Van Durme, D\. Khashabi, and Y\. Tsvetkov\(2024\)SemStamp: a semantic watermark with paraphrastic robustness for text generation\.InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\),pp\. 4067–4082\.Cited by:[§I](https://arxiv.org/html/2605.16107#S1.p2.1),[§II\-A](https://arxiv.org/html/2605.16107#S2.SS1.p1.1)\.
- \[16\]X\. Hu, P\. Chen, and T\. Ho\(2023\)Radar: robust ai\-text detection via adversarial learning\.Advances in neural information processing systems36,pp\. 15077–15095\.Cited by:[§II\-B](https://arxiv.org/html/2605.16107#S2.SS2.p1.1)\.
- \[17\]J\. Kirchenbauer, J\. Geiping, Y\. Wen, J\. Katz, I\. Miers, and T\. Goldstein\(2023\)A watermark for large language models\.InProceedings of the International Conference on Machine Learning,pp\. 17061–17084\.Cited by:[§II\-A](https://arxiv.org/html/2605.16107#S2.SS1.p1.1)\.
- \[18\]K\. Krishna, Y\. Song, M\. Karpinska, J\. Wieting, and M\. Iyyer\(2023\)Paraphrasing evades detectors of ai\-generated text, but retrieval is an effective defense\.Proceedings of Advances in Neural Information Processing Systems36,pp\. 27469–27500\.Cited by:[4th item](https://arxiv.org/html/2605.16107#S6.I1.i4.p1.1)\.
- \[19\]S\. Lin, J\. Hilton, and O\. Evans\(2022\)Truthfulqa: measuring how models mimic human falsehoods\.InProceedings of the 60th Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 3214–3252\.Cited by:[3rd item](https://arxiv.org/html/2605.16107#S6.I1.i3.p1.1)\.
- \[20\]X\. Liu, Z\. Zhang, Y\. Wang, H\. Pu, Y\. Lan, and C\. Shen\(2022\)Coco: coherence\-enhanced machine\-generated text detection under data limitation with contrastive learning\.arXiv preprint arXiv:2212\.10341\.Cited by:[§I](https://arxiv.org/html/2605.16107#S1.p2.1)\.
- \[21\]Y\. Liu and Y\. Bu\(2024\)Adaptive text watermark for large language models\.InProceedings of the 41st International Conference on Machine Learning,pp\. 30718–30737\.Cited by:[§II\-A](https://arxiv.org/html/2605.16107#S2.SS1.p1.1)\.
- \[22\]Z\. Liu, A\. Desai, F\. Liao, W\. Wang, V\. Xie, Z\. Xu, A\. Kyrillidis, and A\. Shrivastava\(2023\)Scissorhands: exploiting the persistence of importance hypothesis for llm kv cache compression at test time\.Advances in Neural Information Processing Systems36,pp\. 52342–52364\.Cited by:[§IV](https://arxiv.org/html/2605.16107#S4.p1.11)\.
- \[23\]S\. Ma and Q\. Wang\(2024\)Zero\-shot detection of llm\-generated text using token cohesiveness\.InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing,pp\. 17538–17553\.Cited by:[§II\-C](https://arxiv.org/html/2605.16107#S2.SS3.p1.1)\.
- \[24\]E\. Mitchell, Y\. Lee, A\. Khazatsky, C\. D\. Manning, and C\. Finn\(2023\)Detectgpt: zero\-shot machine\-generated text detection using probability curvature\.InProceedings of the International Conference on Machine Learning,pp\. 24950–24962\.Cited by:[§I](https://arxiv.org/html/2605.16107#S1.p2.1),[§I](https://arxiv.org/html/2605.16107#S1.p3.1),[§II\-C](https://arxiv.org/html/2605.16107#S2.SS3.p1.1),[§III](https://arxiv.org/html/2605.16107#S3.p1.1),[4th item](https://arxiv.org/html/2605.16107#S6.I2.i4.p1.1)\.
- \[25\]H\. Nguyen\-Son, M\. Dao, and K\. Zettsu\(2024\)SimLLM: detecting sentences generated by large language models using similarity between the generation and its re\-generation\.InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing,pp\. 22340–22352\.Cited by:[§II\-C](https://arxiv.org/html/2605.16107#S2.SS3.p1.1)\.
- \[26\]A\. Radford, J\. Wu, R\. Child, D\. Luan, D\. Amodei, I\. Sutskever,et al\.\(2019\)Language models are unsupervised multitask learners\.OpenAI blog1\(8\),pp\. 9\.Cited by:[§I](https://arxiv.org/html/2605.16107#S1.p1.1)\.
- \[27\]V\. S\. Sadasivan, A\. Kumar, S\. Balasubramanian, W\. Wang, and S\. Feizi\(2023\)Can ai\-generated text be reliably detected?\.arXiv preprint arXiv:2303\.11156\.Cited by:[§VI\-B4](https://arxiv.org/html/2605.16107#S6.SS2.SSS4.p1.1)\.
- \[28\]R\. Shao, T\. Wu, J\. Wu, L\. Nie, and Z\. Liu\(2024\)Detecting and grounding multi\-modal media manipulation and beyond\.IEEE Transactions on Pattern Analysis and Machine Intelligence46\(8\),pp\. 5556–5574\.Cited by:[§II\-B](https://arxiv.org/html/2605.16107#S2.SS2.p1.1)\.
- \[29\]I\. Solaiman, M\. Brundage, J\. Clark, A\. Askell, A\. Herbert\-Voss, J\. Wu, A\. Radford, G\. Krueger, J\. W\. Kim, S\. Kreps,et al\.\(2019\)Release strategies and the social impacts of language models\.arXiv preprint arXiv:1908\.09203\.Cited by:[§I](https://arxiv.org/html/2605.16107#S1.p2.1),[§I](https://arxiv.org/html/2605.16107#S1.p3.1),[§II\-B](https://arxiv.org/html/2605.16107#S2.SS2.p1.1),[§II\-C](https://arxiv.org/html/2605.16107#S2.SS3.p1.1),[§III](https://arxiv.org/html/2605.16107#S3.p1.1),[1st item](https://arxiv.org/html/2605.16107#S6.I2.i1.p1.1)\.
- \[30\]J\. Su, T\. Zhuo, D\. Wang, and P\. Nakov\(2023\)Detectllm: leveraging log rank information for zero\-shot detection of machine\-generated text\.InProceedings of the Findings of the Association for Computational Linguistics: EMNLP 2023,pp\. 12395–12412\.Cited by:[§II\-C](https://arxiv.org/html/2605.16107#S2.SS3.p1.1)\.
- \[31\]Y\. Tian, H\. Chen, X\. Wang, Z\. Bai, Q\. ZHANG, R\. Li, C\. Xu, and Y\. Wang\(2024\)Multiscale positive\-unlabeled detection of ai\-generated texts\.InProceedings of the Twelfth International Conference on Learning Representations,pp\. 1–9\.Cited by:[§II\-B](https://arxiv.org/html/2605.16107#S2.SS2.p1.1)\.
- \[32\]E\. Tulchinskii, K\. Kuznetsov, L\. Kushnareva, D\. Cherniavskii, S\. Nikolenko, E\. Burnaev, S\. Barannikov, and I\. Piontkovskaya\(2024\)Intrinsic dimension estimation for robust detection of ai\-generated texts\.Proceedings of the Advances in Neural Information Processing Systems36\.Cited by:[§II\-C](https://arxiv.org/html/2605.16107#S2.SS3.p1.1)\.
- \[33\]V\. Verma, E\. Fleisig, N\. Tomlin, and D\. Klein\(2024\)Ghostbuster: detecting text ghostwritten by large language models\.InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\),pp\. 1702–1717\.Cited by:[§II\-B](https://arxiv.org/html/2605.16107#S2.SS2.p1.1),[1st item](https://arxiv.org/html/2605.16107#S6.I1.i1.p1.1),[2nd item](https://arxiv.org/html/2605.16107#S6.I1.i2.p1.1)\.
- \[34\]I\. Vykopal, M\. Pikuliak, I\. Srba, R\. Moro, D\. Macko, and M\. Bieliková\(2024\)Disinformation capabilities of large language models\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 14830–14847\.Cited by:[§I](https://arxiv.org/html/2605.16107#S1.p1.1)\.
- \[35\]P\. Wang, L\. Li, K\. Ren, B\. Jiang, D\. Zhang, and X\. Qiu\(2023\)SeqXGPT: sentence\-level ai\-generated text detection\.InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing,pp\. 1144–1156\.Cited by:[§I](https://arxiv.org/html/2605.16107#S1.p2.1),[§II\-B](https://arxiv.org/html/2605.16107#S2.SS2.p1.1)\.
- \[36\]C\. Wu, Y\. Cheung, S\. Zhang, B\. Han, and D\. Lian\(2026\)Beyond raw detection scores: markov\-informed calibration for boosting machine\-generated text detection\.InProceedings of the Fourteenth International Conference on Learning Representations,pp\. 1–10\.Cited by:[§I](https://arxiv.org/html/2605.16107#S1.p4.1),[§I](https://arxiv.org/html/2605.16107#S1.p6.1)\.
- \[37\]J\. Wu, R\. Zhan, D\. Wong, S\. Yang, X\. Yang, Y\. Yuan, and L\. Chao\(2024\)Detectrl: benchmarking llm\-generated text detection in real\-world scenarios\.Proceedings of the Advances in Neural Information Processing Systems37,pp\. 100369–100401\.Cited by:[4th item](https://arxiv.org/html/2605.16107#S6.I1.i4.p1.1)\.
- \[38\]J\. Wu, J\. Wang, Z\. Liu, B\. Chen, D\. Hu, H\. Wu, and S\. Xia\(2025\)Moses: uncertainty\-aware ai\-generated text detection via mixture of stylistics experts with conditional thresholds\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,pp\. 5797–5816\.Cited by:[§II\-C](https://arxiv.org/html/2605.16107#S2.SS3.p1.1)\.
- \[39\]K\. Wu, L\. Pang, H\. Shen, X\. Cheng, and T\. Chua\(2023\)LLMDet: a third party large language models generated text detection tool\.InProceedings of the Findings of the Association for Computational Linguistics: EMNLP 2023,pp\. 2113–2133\.Cited by:[§II\-B](https://arxiv.org/html/2605.16107#S2.SS2.p1.1)\.
- \[40\]Y\. Wu, Z\. Hu, J\. Guo, H\. Zhang, and H\. Huang\(2024\)A resilient and accessible distribution\-preserving watermark for large language models\.InProceedings of the International Conference on Machine Learning,pp\. 53443–53470\.Cited by:[§II\-A](https://arxiv.org/html/2605.16107#S2.SS1.p1.1)\.
- \[41\]Y\. Xu, Y\. Wang, H\. An, Z\. Liu, and Y\. Li\(2024\)Detecting subtle differences between human and model languages using spectrum of relative likelihood\.InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing,pp\. 10108–10121\.Cited by:[§II\-C](https://arxiv.org/html/2605.16107#S2.SS3.p1.1),[7th item](https://arxiv.org/html/2605.16107#S6.I2.i7.p1.1)\.
- \[42\]Y\. Xu, Y\. Wang, Y\. Bi, H\. Cao, Z\. Lin, Y\. Zhao, and F\. Wu\(2025\)Training\-free llm\-generated text detection by mining token probability sequences\.InProceedings of the Thirteenth International Conference on Learning Representations,pp\. 1–10\.Cited by:[§II\-C](https://arxiv.org/html/2605.16107#S2.SS3.p1.1)\.
- \[43\]X\. Yang, W\. Cheng, Y\. Wu, L\. R\. Petzold, W\. Y\. Wang, and H\. Chen\(2024\)DNA\-gpt: divergent n\-gram analysis for training\-free detection of gpt\-generated text\.InProceedings of the Twelfth International Conference on Learning Representations,pp\. 1–9\.Cited by:[§II\-C](https://arxiv.org/html/2605.16107#S2.SS3.p1.1)\.
- \[44\]X\. Yang, K\. Zhang, H\. Chen, L\. Petzold, W\. Y\. Wang, and W\. Cheng\(2023\)Zero\-shot detection of machine\-generated codes\.arXiv preprint arXiv:2310\.05103\.Cited by:[§II\-C](https://arxiv.org/html/2605.16107#S2.SS3.p1.1)\.
- \[45\]X\. Yu, K\. Chen, Q\. Yang, W\. Zhang, and N\. Yu\(2024\)Text fluoroscopy: detecting llm\-generated text through intrinsic features\.InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing,pp\. 15838–15846\.Cited by:[§II\-C](https://arxiv.org/html/2605.16107#S2.SS3.p1.1)\.
- \[46\]X\. Yu, Y\. Qi, K\. Chen, G\. Chen, X\. Yang, P\. Zhu, W\. Zhang, and N\. Yu\(2023\)LLM paternity test: generated text detection with llm genetic inheritance\.arXiv preprint arXiv:2305\.12519\.Cited by:[§II\-B](https://arxiv.org/html/2605.16107#S2.SS2.p1.1)\.
- \[47\]Z\. Yu, Y\. Wu, N\. Zhang, C\. Wang, Y\. Vorobeychik, and C\. Xiao\(2023\)Codeipprompt: intellectual property infringement assessment of code language models\.InProceedings of the International Conference on Machine Learning,pp\. 40373–40389\.Cited by:[§I](https://arxiv.org/html/2605.16107#S1.p1.1)\.
- \[48\]C\. Zeng, S\. Tang, Y\. Chen, Z\. Shen, W\. Yu, X\. Zhao, H\. Chen, W\. Cheng,et al\.\(2025\)Human texts are outliers: detecting llm\-generated texts via out\-of\-distribution detection\.InProceedings of the Thirty\-ninth Annual Conference on Neural Information Processing Systems,pp\. 1–10\.Cited by:[§II\-C](https://arxiv.org/html/2605.16107#S2.SS3.p1.1)\.
- \[49\]H\. Zhan, X\. He, Q\. Xu, Y\. Wu, and P\. Stenetorp\(2023\)G3detector: general gpt\-generated text detector\.arXiv preprint arXiv:2305\.12680\.Cited by:[§II\-B](https://arxiv.org/html/2605.16107#S2.SS2.p1.1)\.
- \[50\]R\. Zhang, S\. S\. Hussain, P\. Neekhara, and F\. Koushanfar\(2024\)REMARK\-LLM: a robust and efficient watermarking framework for generative large language models\.InProceedings of the 33rd USENIX Security Symposium \(USENIX Security 24\),pp\. 1813–1830\.Cited by:[§II\-A](https://arxiv.org/html/2605.16107#S2.SS1.p1.1)\.
- \[51\]H\. Zhou, J\. Zhu, P\. Su, K\. Ye, Y\. Yang, S\. Gavioli\-Akilagun, and C\. Shi\(2025\)AdaDetectGPT: adaptive detection of llm\-generated text with statistical guarantees\.InProceedings of the Thirty\-ninth Annual Conference on Neural Information Processing Systems,pp\. 1–10\.Cited by:[§II\-C](https://arxiv.org/html/2605.16107#S2.SS3.p1.1),[8th item](https://arxiv.org/html/2605.16107#S6.I2.i8.p1.1.1)\.
- \[52\]H\. Zhou, J\. Zhu, K\. Ye, Y\. Yang, E\. Xu, and C\. Shi\(2026\)Learn\-to\-distance: distance learning for detecting llm\-generated text\.arXiv preprint arXiv:2601\.21895\.Cited by:[§II\-C](https://arxiv.org/html/2605.16107#S2.SS3.p1.1)\.
- \[53\]J\. Zhu, B\. Han, J\. Yao, Q\. Yao, T\. Liu, and J\. Xu\(2025\)Slack federated adversarial training\.IEEE Transactions on Pattern Analysis and Machine Intelligence\.Cited by:[§VI\-B5](https://arxiv.org/html/2605.16107#S6.SS2.SSS5.p1.1)\.
- \[54\]X\. Zhu, Y\. Ren, F\. Fang, Q\. Tan, S\. Wang, and Y\. Cao\(2025\)DNA\-detectllm: unveiling ai\-generated text via a dna\-inspired mutation\-repair paradigm\.InProceedings of the Thirty\-ninth Annual Conference on Neural Information Processing Systems,pp\. 1–13\.Cited by:[§I](https://arxiv.org/html/2605.16107#S1.p3.1),[§II\-C](https://arxiv.org/html/2605.16107#S2.SS3.p1.1),[§III](https://arxiv.org/html/2605.16107#S3.p1.1),[9th item](https://arxiv.org/html/2605.16107#S6.I2.i9.p1.1.1)\.