Sem-Detect: Semantic Level Detection of AI Generated Peer-Reviews


Summary

Sem-Detect introduces a method to distinguish AI-generated peer reviews from human-written ones by combining textual features with claim-level semantic analysis. It achieves a 25.5% improvement in true positive rate at 0.1% false positive rate over baselines, and shows that LLM-refined human reviews retain distinct semantic signals, with fewer than 3.5% misclassified as AI-generated.

arXiv:2605.21713v1 (announce type: new)

# Sem-Detect: Semantic Level Detection of AI Generated Peer-Reviews
Source: [https://arxiv.org/html/2605.21713](https://arxiv.org/html/2605.21713)
###### Abstract

How can we distinguish whether a peer review was written by a human or generated by an AI model? We argue that, in this setting, authorship should not be attributed solely from the textual features of a review, but also from the ideas, judgments, and claims it expresses\. To this end, we propose Sem\-Detect, an authorship detection method for peer reviews that operationalizes this principle by combining textual features with claim\-level semantic analysis\. Sem\-Detect compares a target review against multiple AI\-generated reviews of the same paper, leveraging the observation that different AI models tend to converge on similar points, while human reviewers introduce more unique and diverse ones\. As a result, Sem\-Detect is able to distinguish fully AI reviews from authentic human\-written ones, including those that have been refined using an LLM but still reflect human judgment\. Across a dataset of over 20,000 peer reviews from ICLR and NeurIPS conferences, Sem\-Detect improves over the strongest baseline by 25\.5% in TPR@0\.1% FPR in the binary setting\. Moreover, in the three\-class scenario, we empirically show that LLM refinement preserves the semantic signals of human reviews, which remain distinct from the patterns exhibited by fully AI\-generated text; as a result, fewer than 3\.5% of LLM\-refined human reviews are misclassified as AI\-generated\.

Machine Learning, ICML

## 1 Introduction

Peer review is fundamental to scientific progress\. When researchers submit a paper, they expect substantive feedback from domain experts; feedback that can clarify the work for future readers and guide authors in strengthening their contributions\. However, with the rapid advancement of large language models \(LLMs\), there is growing evidence of AI\-generated content appearing in peer reviews \(Liang et al\., [2024](https://arxiv.org/html/2605.21713#bib.bib1); Zhou et al\., [2025](https://arxiv.org/html/2605.21713#bib.bib3)\)\. This trend raises a serious concern: authors may no longer know whether the feedback they receive reflects genuine human judgment\.

![Refer to caption](https://arxiv.org/html/2605.21713v1/x1.png)

Figure 1: Classical AI\-text detectors rely on textual features to decide whether a review was written by a human\. Sem\-Detect instead infers authorship by leveraging the semantic content of expressed ideas, thereby distinguishing fully AI\-generated reviews from LLM\-refined human ones\.

While initial responses from the research community were strict, as exemplified by ICML 2025’s ban on any use of LLMs in the review process \(ICML Conference Chairs, [2025](https://arxiv.org/html/2605.21713#bib.bib4)\), there has since been a notable policy shift\. ICML 2026 now allows LLM assistance for editing and improving the clarity of reviews \(ICML Conference Chairs, [2026](https://arxiv.org/html/2605.21713#bib.bib5)\)\. This shift reflects a recognition that the appropriate boundary lies not in whether an LLM touched the text, but in whether the expressed ideas originated from a human or from a machine\. A reviewer who drafts an assessment and later uses an LLM to improve its readability is engaging in a qualitatively different activity than one who prompts an LLM to generate an entire review\. Detecting this distinction, however, poses a technical challenge that existing methods are not well\-equipped to address \(Fitzgibbon et al\., [2024](https://arxiv.org/html/2605.21713#bib.bib33)\)\.

Current approaches to AI\-text detection can be broadly organized along two axes: \(i\) general\-purpose methods designed to work across diverse domains, and \(ii\) domain\-specific methods tailored to particular contexts such as peer review\.

General\-purpose methods range from zero\-shot statistical approaches such as Fast\-DetectGPT \(Bao et al\., [2024](https://arxiv.org/html/2605.21713#bib.bib6)\), which leverages conditional probability curvature to identify machine\-generated content, to more sophisticated techniques like RADAR \(Hu et al\., [2023](https://arxiv.org/html/2605.21713#bib.bib7)\), which uses adversarial training to achieve increased robustness against LLM\-based paraphrasing\. However, because these approaches rely on surface\-level textual signals, when applied to the peer\-review domain they struggle to distinguish human\-authored judgments that have been linguistically refined by an LLM from content generated end\-to\-end by an LLM\.

Domain\-specific methods, by contrast, leverage contextual information unique to the task\. For example, Yu et al\. \([2026](https://arxiv.org/html/2605.21713#bib.bib8)\) generate synthetic AI reviews from research papers and train Anchor, which embeds entire reviews and compares them to a reference AI review using cosine similarity to infer authorship\. However, operating at the full\-review level limits interpretability, making it difficult to identify which claims drive a given classification\.

To address these limitations while building on the strengths of existing approaches, we propose Sem\-Detect\. Like general\-purpose methods, Sem\-Detect extracts textual features from the target review, as these remain fundamental for distinguishing purely human text from fully AI\-generated content\. However, inspired by domain\-specific approaches such as Anchor \(Yu et al\., [2026](https://arxiv.org/html/2605.21713#bib.bib8)\), Sem\-Detect moves beyond text\-level analysis by explicitly modeling the semantic content of reviews\. Rather than embedding entire reviews and comparing them as a whole, our method operates at the claim level: it pairs each target review with multiple AI\-generated reviews of the same paper and measures semantic similarity at a finer granularity\. This design exploits the observation that different AI models tend to converge on similar points when reviewing the same paper, while human reviewers introduce more unique judgments\. As a result, we can distinguish not only between human and AI authorship, but also identify cases in which a human assessment has been refined by an LLM, treating such reviews as a separate class rather than mixing them with fully AI\-generated text\.

Using a corpus of over 20,000 reviews \(human\-written, LLM\-refined, and AI\-generated\) constructed from 800 papers across ICLR and NeurIPS conferences, we train and evaluate Sem\-Detect\. Human reviews collected up to 2022 serve as clean baselines\. To assess robustness beyond these controlled conditions, we further evaluate the method on: AI\-generated reviews produced by unseen models and prompting strategies; cross\-domain reviews from a medical imaging venue; and recent submissions from ICLR 2026\.

Our main contributions are as follows:

- We identify a consistent pattern in peer reviews: when reviewing the same paper, AI\-generated reviews exhibit higher claim\-level overlap with one another than human\-written reviews, including those refined using LLMs\.
- We operationalize this insight in Sem\-Detect, a practical detection framework that combines textual features with claim\-level semantic analysis to distinguish human\-written, LLM\-refined, and fully AI\-generated reviews\.
- We construct and release a dataset of over 20,000 peer reviews spanning human\-written, AI\-generated, and LLM\-refined variants from ICLR and NeurIPS \(pre\-2022\), with additional evaluation data from a medical imaging venue and ICLR 2026\.
- Experiments show that Sem\-Detect improves over the strongest prior detector by 25\.5% in TPR@0\.1% FPR in binary detection, with fewer than 3\.5% of LLM\-refined human reviews misclassified as AI\-generated\. We further validate robustness to unseen models, cross\-domain transfer, and temporal generalization\.

## 2 Related Work

Detecting machine\-generated text has become a central challenge in the NLP community, with methods spanning watermarking, zero\-shot detection, and supervised classification \(Jawahar et al\., [2020](https://arxiv.org/html/2605.21713#bib.bib42); Ghosal et al\., [2023](https://arxiv.org/html/2605.21713#bib.bib40); Wu et al\., [2025](https://arxiv.org/html/2605.21713#bib.bib11); Rao et al\., [2025](https://arxiv.org/html/2605.21713#bib.bib53)\)\. We organize prior work along two axes: general\-purpose methods designed for broad applicability, and domain\-specific approaches for peer review\.

### 2\.1 General\-Purpose AI\-Text Detection

##### Watermarking\.

Watermarking embeds detectable statistical signals during text generation, with some methods offering provable guarantees on false positive rates \(Kirchenbauer et al\., [2023](https://arxiv.org/html/2605.21713#bib.bib34); Zhao et al\., [2024](https://arxiv.org/html/2605.21713#bib.bib41)\)\. However, watermarking requires control over the generation process and therefore has limited applicability in settings where the source model is unknown\.

##### Zero\-shot methods\.

Zero\-shot detectors operate without task\-specific training data by exploiting statistical properties of LLM outputs \(Hans et al\., [2024](https://arxiv.org/html/2605.21713#bib.bib18)\)\. DetectGPT \(Mitchell et al\., [2023](https://arxiv.org/html/2605.21713#bib.bib2)\) introduced the concept of probability curvature, observing that perturbations of LLM\-generated text tend to reduce its log\-probability in the source model\. In contrast, human\-written text does not exhibit the same systematic behavior\. Follow\-up work such as Fast\-DetectGPT \(Bao et al\., [2024](https://arxiv.org/html/2605.21713#bib.bib6)\) achieves comparable accuracy with reduced computational cost\. Other approaches rely on simpler statistical metrics, including perplexity \(Gutiérrez Megías et al\., [2024](https://arxiv.org/html/2605.21713#bib.bib49)\) and entropy \(Lavergne et al\., [2008](https://arxiv.org/html/2605.21713#bib.bib38)\)\.

##### Trained detectors\.

Supervised methods train classifiers on human and AI\-generated text\. Early approaches fine\-tuned models like RoBERTa \(Liu et al\., [2019](https://arxiv.org/html/2605.21713#bib.bib14)\) on detection datasets \(Zellers et al\., [2019](https://arxiv.org/html/2605.21713#bib.bib37); Solaiman et al\., [2019](https://arxiv.org/html/2605.21713#bib.bib39)\), but these methods are often sensitive to adversarial scenarios such as LLM\-based paraphrasing\. To address this, recent work like RADAR \(Hu et al\., [2023](https://arxiv.org/html/2605.21713#bib.bib7)\) jointly trains a detector and a paraphraser in an adversarial framework, where the paraphraser learns to generate evasive rewrites while the detector learns to remain robust against them\. However, even robust trained detectors operate solely on the target text, without access to contextual information \(e\.g\., the manuscript under review\) that could provide additional discriminative signal\.

### 2\.2 Domain\-Specific Detection in Peer Review

While general\-purpose detectors focus only on the target text, peer review methods can exploit the relationship between reviews and manuscripts, as well as the structured nature of review writing\.

##### Leveraging domain signals\.

Liang et al\. \([2024](https://arxiv.org/html/2605.21713#bib.bib1)\) provided early evidence of LLM\-generated content in peer reviews by tracking the surge of adjectives characteristic of ChatGPT \(OpenAI, [2022](https://arxiv.org/html/2605.21713#bib.bib15)\) outputs\. Building on this, the Term Frequency \(TF\) model introduced by Kumar et al\. \([2024](https://arxiv.org/html/2605.21713#bib.bib12)\) exploits repetitive token usage patterns in AI\-generated text and demonstrates that even simple domain\-tailored signals can outperform more generic detection strategies\.

##### Manuscript\-conditioned detection\.

Anchor \(Yu et al\., [2026](https://arxiv.org/html/2605.21713#bib.bib8)\) conditions detection on the paper under review\. The method generates a synthetic AI review for the target paper and compares it with the candidate review using embedding\-based cosine similarity: reviews that closely resemble the AI reference are flagged as machine\-generated\. However, Anchor operates at the full\-review level, embedding entire reviews as single vectors, limiting the method’s ability to disentangle partial semantic overlap from end\-to\-end AI authorship\. In a complementary direction, Rao et al\. \([2025](https://arxiv.org/html/2605.21713#bib.bib53)\) embed hidden instructions in submitted PDFs that induce LLMs to insert detectable watermarks into generated reviews\. However, this requires venue\-level adoption, which limits practical deployment\.

##### Beyond binary detection\.

Most recently, EditLens \(Thai et al\., [2026](https://arxiv.org/html/2605.21713#bib.bib9)\) re\-frames the task by moving beyond binary classification to quantify the extent of AI editing on a continuous scale\. This represents an important conceptual shift, acknowledging that the boundary between human and AI authorship is not always sharp\. However, EditLens focuses on estimating edit intensity rather than distinguishing the origin of the underlying ideas\. As a consequence, a human review fully polished by an LLM and an AI\-generated review may receive similar scores, despite representing fundamentally different authorship scenarios\.

### 2\.3 Granularity in Semantic Comparison

Our approach is inspired by work in the retrieval literature showing that the granularity of text representation has a strong impact on downstream performance\. Dense X Retrieval \(Chen et al\., [2024](https://arxiv.org/html/2605.21713#bib.bib16)\) adopts atomic propositions as retrieval units, ensuring that each representation corresponds to a single, semantically independent claim\. Similarly, LumberChunker \(Duarte et al\., [2024](https://arxiv.org/html/2605.21713#bib.bib17)\) shows that segmenting text along semantic boundaries is more effective than arbitrary chunking strategies\. Together, these findings highlight a common principle: large document\-level representations mix multiple semantic units, which reduces precision in similarity\-based comparison\. For the same reason, Sem\-Detect operates at the claim level, allowing us to better isolate the semantic patterns that distinguish AI\-generated content from human\-written reviews\.

## 3 Sem\-Detect

![Refer to caption](https://arxiv.org/html/2605.21713v1/x2.png)

Figure 2: Sem\-Detect pipeline\. We construct our dataset by prompting LLMs to generate fully AI reviews from conference papers and to refine authentic human reviews, creating three classes\. For classification, each target review \(from any class\) is paired with multiple AI\-generated reference reviews of the same paper\. We extract textual features from the target review and semantic features from the target\-reference comparisons\. These combined features train a LightGBM classifier to distinguish between human\-written, LLM\-refined, and fully AI\-generated reviews\.

Sem\-Detect addresses the problem of peer\-review authorship attribution by distinguishing between fully human\-written reviews, human reviews refined by an LLM, and end\-to\-end machine\-generated ones\. As illustrated in Figure [2](https://arxiv.org/html/2605.21713#S3.F2), the pipeline consists of two main stages: \(i\) the construction of a peer\-review dataset spanning these three classes, and \(ii\) the extraction of textual and claim\-level semantic features from this data to train a detection model\. We describe the key design choices of each stage below\. Further details are provided in Appendices [A\.1](https://arxiv.org/html/2605.21713#A1.SS1)–[A\.5](https://arxiv.org/html/2605.21713#A1.SS5)\.

### 3\.1 Training Data Construction

##### Human reviews\.

We randomly sample 200 papers from each of ICLR and NeurIPS for the years 2021 and 2022, resulting in a total of 800 papers\. We crawl both papers and their associated reviews from OpenReview \([https://openreview\.net](https://openreview.net/)\), retrieving the blind submission version for each paper to ensure consistency with what reviewers saw at the time of writing\. In total we have 3,065 human\-written reviews\.

##### Fully AI\-generated reviews\.

Using the sampled papers, we generate a set of fully AI\-written reviews\. While every conference has its own reviewing guidelines, peer reviews across venues generally follow a common structure consisting of: \(1\) a summary of the paper, \(2\) a discussion of strengths, \(3\) a discussion of weaknesses, and \(4\) clarification questions for the authors\. We leverage this structure to prompt four different LLMs to generate their reviews\.

A second consideration concerns the distribution of review scores\. To avoid the optimism bias documented in Russo et al\. \([2025](https://arxiv.org/html/2605.21713#bib.bib48)\), we explicitly specify the target score during generation\. As such, for each paper, LLMs generate reviews corresponding to the distinct scores assigned by human reviewers, ensuring balanced coverage of evaluation outcomes, and resulting in a total of 6,768 AI\-generated reviews\.

##### LLM\-refined reviews\.

In contrast to fully AI\-generated reviews, this class originates from human\-written assessments\. It reflects the realistic scenario in which a reviewer drafts an initial evaluation and subsequently uses an LLM to improve its clarity\. As such, during this refinement step, the LLM is explicitly instructed to preserve all original judgments and to avoid introducing new content\. This procedure is applied to each human review using the four LLMs, and results in 12,332 LLM\-refined reviews\.

##### Post\-processing\.

Both fully AI\-generated and LLM\-refined reviews can include elements that directly reveal how they were produced, such as sentences like “Here is the review of …”\. We use an LLM to remove these artifacts through a post\-processing step, resulting in plain\-text reviews that follow the same format as human ones\.

##### Claim extraction\.

A central premise of Sem\-Detect is that authorship signals are reflected not only in writing style, but also in the content of a review\. To capture this information, we use an LLM to extract structured claim\-level representations from each text\. Specifically, we semantically segment each review into bullet points belonging to five categories: factual restatement, evaluation, constructive input, clarification dialogue, and meta\-commentary\. Each bullet point is designed to capture a single claim while preserving the reviewer’s original phrasing whenever possible\.

### 3\.2 Model Training and Classification

Let $t$ denote a target review and let $p$ be the paper it evaluates\. We assume access to a set of AI\-generated reference reviews $\mathcal{A}_p = \{a_1, \ldots, a_k\}$ for the same paper, produced by prompting $k$ different LLMs\. Our goal is to learn a function $f(t, \mathcal{A}_p) \to \{0, 1, 2\}$ that maps the target review and its references to one of three classes: human\-written, LLM\-refined, or fully AI\-generated\. Additional details are reported in Appendices [B\.1](https://arxiv.org/html/2605.21713#A2.SS1)–[B\.6](https://arxiv.org/html/2605.21713#A2.SS6)\.

##### Reference review pairing\.

For each target review $t$, we pair it with $k = 3$ AI\-generated reference reviews of the same paper\. Reference reviews are selected under two conditions: \(i\) they share the same evaluation score as $t$, so that semantic comparisons are not affected by differences in overall judgment; and \(ii\) when $t$ is AI\-generated, they are produced by different models, to avoid inflated similarity scores from model\-specific patterns \(Xu et al\., [2024](https://arxiv.org/html/2605.21713#bib.bib19)\)\.
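The two pairing conditions can be sketched as a small selection routine. This is an illustrative sketch rather than the authors' code: the `Review` record, its field names, the label strings, and the random choice among eligible references are all assumptions made for the example.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Review:
    # Hypothetical record layout; field names are assumptions for this sketch.
    paper_id: str
    score: int
    label: str             # "human", "refined", or "ai"
    model: Optional[str]   # generating/refining LLM, None for human reviews
    text: str


def pair_references(target: Review, ai_pool: List[Review],
                    k: int = 3, seed: int = 0) -> List[Review]:
    """Select k AI-generated references for `target`.

    Condition (i): same paper and same evaluation score as the target.
    Condition (ii): if the target is itself AI-generated, exclude
    references produced by the target's own model.
    """
    candidates = [
        r for r in ai_pool
        if r.label == "ai"
        and r is not target
        and r.paper_id == target.paper_id
        and r.score == target.score
        and not (target.label == "ai" and r.model == target.model)
    ]
    # Keep one candidate per model so the k references come from distinct LLMs.
    by_model = {}
    for r in candidates:
        by_model.setdefault(r.model, r)
    eligible = list(by_model.values())
    if len(eligible) < k:
        raise ValueError("not enough eligible reference reviews")
    return random.Random(seed).sample(eligible, k)
```

With four generating models, an AI-generated target leaves exactly three eligible models, which matches $k = 3$; for human or LLM-refined targets the sketch samples three of the four.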

##### Claim filtering and embedding\.

As described in Section [3\.1](https://arxiv.org/html/2605.21713#S3.SS1), each review is segmented into five claim categories, but only a subset is informative for authorship attribution\. For semantic analysis, we consider only claims from categories that reflect evaluative judgment, namely \(i\) evaluation, \(ii\) constructive input, and \(iii\) clarification dialogue\.

##### Feature extraction and classifier training\.

For each target review $t$, we extract a nine\-dimensional feature vector comprising five semantic features and four textual features\. Semantic features are computed from claim embeddings and their comparisons to AI\-generated reference reviews, while textual features come directly from the raw text of $t$\.

Let $\mathcal{C}_t = \{c_1, \dots, c_n\}$ denote the set of claims extracted from $t$, and let $\mathcal{A}_p = \{a_1, \dots, a_k\}$ denote the set of AI\-generated reference reviews for the same paper\. For each target claim $c_i$ and each reference review $a_j$, with claim set $\mathcal{C}_{a_j}$, we compute the best\-match similarity

$$s_{i,j} = \max_{c \in \mathcal{C}_{a_j}} \cos\!\left(\phi(c_i), \phi(c)\right),$$

where $\phi(\cdot)$ denotes a claim embedding function\. We further define $s_i = \max_j s_{i,j}$ as the best\-match similarity of $c_i$ across all reference reviews\.
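As a minimal sketch of the best-match similarity, the function below takes precomputed claim embeddings and returns both the per-reference scores $s_{i,j}$ and the per-claim maxima $s_i$. The function names and the array layout are assumptions for illustration; the embedding model itself is out of scope here.

```python
import numpy as np


def _unit_rows(x):
    """L2-normalize each row so that dot products equal cosine similarities."""
    x = np.asarray(x, dtype=float)
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)


def best_match_similarities(target_emb, reference_embs):
    """Compute s_ij and s_i from claim embeddings.

    target_emb:     (n, d) embeddings of the n target claims.
    reference_embs: list of k arrays, the j-th of shape (m_j, d), holding
                    the claim embeddings of reference review a_j.
    Returns (s_ij, s_i) with shapes (n, k) and (n,).
    """
    t = _unit_rows(target_emb)
    columns = []
    for ref in reference_embs:
        r = _unit_rows(ref)
        # Cosine of every target claim against every claim of this reference,
        # then the best match per target claim.
        columns.append((t @ r.T).max(axis=1))
    s_ij = np.stack(columns, axis=1)
    s_i = s_ij.max(axis=1)
    return s_ij, s_i
```

Taking the maximum over reference claims means a target claim counts as matched when any single reference claim is close to it, regardless of how the rest of that reference review reads.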

Semantic features include: \(i\) the proportion of target claims whose similarity to at least one AI\-generated reference review exceeds a threshold $\tau$, i\.e\., $\frac{1}{n}\sum_i \mathbb{I}[s_i > \tau]$; \(ii\) the mean of $s_{i,j}$ over all claim\-reference pairs with $s_{i,j} > \tau$; \(iii\) the mean best\-match similarity $\frac{1}{n}\sum_i s_i$; \(iv\) intra\-review semantic diversity, defined as one minus the mean pairwise cosine similarity between claim embeddings within $\mathcal{C}_t$; and \(v\) the log\-length of extracted claims: $\log(1 + |\mathcal{C}_t|)$\.
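The five semantic features can be assembled from the similarity matrix and the target claim embeddings roughly as follows. This is a sketch under assumptions: the feature names are invented for readability, the threshold is passed in because its value is not given in this section, and the fallbacks for empty cases (no pair above the threshold, fewer than two claims) are choices made here.

```python
import numpy as np


def semantic_features(s_ij, target_emb, tau):
    """Five claim-level features for one target review.

    s_ij:       (n, k) best-match similarities of n claims vs k references.
    target_emb: (n, d) embeddings of the target claims.
    tau:        similarity threshold (value is an assumption of the caller).
    """
    s_ij = np.asarray(s_ij, dtype=float)
    emb = np.asarray(target_emb, dtype=float)
    n = s_ij.shape[0]
    s_i = s_ij.max(axis=1)

    # (ii) mean over claim-reference pairs above the threshold; 0.0 if none.
    above = s_ij[s_ij > tau]
    mean_above = float(above.mean()) if above.size else 0.0

    # (iv) one minus mean pairwise cosine similarity among target claims.
    unit = emb / np.clip(np.linalg.norm(emb, axis=1, keepdims=True), 1e-12, None)
    upper = np.triu_indices(n, k=1)
    diversity = 1.0 - float((unit @ unit.T)[upper].mean()) if n > 1 else 0.0

    return {
        "frac_claims_matched": float((s_i > tau).mean()),   # (i)
        "mean_sim_above_tau": mean_above,                   # (ii)
        "mean_best_match": float(s_i.mean()),               # (iii)
        "intra_review_diversity": diversity,                # (iv)
        "log_num_claims": float(np.log1p(n)),               # (v)
    }
```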

Textual features capture token\-level statistical properties of $t$, including perplexity, entropy, the proportion of tokens whose likelihood falls within the top\-$k$ predictions of a language model, and the Fast\-DetectGPT score\.
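To make the token-level statistics concrete, the sketch below derives them from a matrix of next-token logits, assumed to have been produced by some reference language model and aligned with the observed tokens. The function and key names are hypothetical. The last quantity is only a curvature-style statistic modeled on the analytic form used by Fast-DetectGPT when the sampling and scoring models coincide; it should be read as an approximation, not as the reference implementation.

```python
import numpy as np


def _log_softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))


def textual_features(logits, token_ids, top_k=10):
    """Token-level statistics for one text.

    logits:    (T, V) next-token logits, row j predicting token_ids[j].
    token_ids: (T,) observed token ids.
    top_k:     rank cutoff for the top-k fraction (assumed value).
    """
    logp = _log_softmax(np.asarray(logits, dtype=float))
    p = np.exp(logp)
    token_ids = np.asarray(token_ids)
    tok_logp = logp[np.arange(len(token_ids)), token_ids]

    perplexity = float(np.exp(-tok_logp.mean()))
    entropy = float(-(p * logp).sum(axis=-1).mean())

    # Rank 0 means the observed token was the model's most likely prediction.
    ranks = (logp > tok_logp[:, None]).sum(axis=-1)
    top_k_fraction = float((ranks < top_k).mean())

    # Curvature-style score: observed log-likelihood compared with its
    # expectation under the model, normalized by the standard deviation.
    mu = (p * logp).sum(axis=-1)
    var = np.clip((p * logp ** 2).sum(axis=-1) - mu ** 2, 0.0, None)
    denom = np.sqrt(var.sum())
    curvature = float((tok_logp.sum() - mu.sum()) / denom) if denom > 1e-6 else 0.0

    return {
        "perplexity": perplexity,
        "entropy": entropy,
        "top_k_fraction": top_k_fraction,
        "curvature_score": curvature,
    }
```

Text whose tokens sit consistently above the model's expected log-likelihood receives a positive curvature-style score, which is the pattern typically associated with machine-generated text.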

Finally, we train a gradient\-boosted decision tree classifier using the LightGBM framework \(Ke et al\., [2017](https://arxiv.org/html/2605.21713#bib.bib20)\)\. Hyperparameters are selected via randomized search with five\-fold stratified cross\-validation, optimizing macro\-F1 to ensure balanced performance across the three classes\.

## 4 Experiments

### 4\.1 Implementation and Evaluation Setup

##### Implementation\.

We generate fully AI\-written and LLM\-refined reviews using four models: Gemini\-2\.5\-Flash, Gemini\-2\.5\-Pro, DeepSeek\-V3\.1, and Qwen3\-235B\-A22B \(Comanici et al\., [2025](https://arxiv.org/html/2605.21713#bib.bib23); Liu et al\., [2024](https://arxiv.org/html/2605.21713#bib.bib50); Yang et al\., [2025](https://arxiv.org/html/2605.21713#bib.bib22)\)\. Review cleaning and claim extraction are performed with Gemini\-2\.5\-Flash; claim embeddings are obtained using Qwen3\-0\.6B \(Zhang et al\., [2025](https://arxiv.org/html/2605.21713#bib.bib21)\); and textual features are computed with Mistral\-7B\-Instruct\-v0\.3 as the reference model \(Jiang et al\., [2023](https://arxiv.org/html/2605.21713#bib.bib24)\)\. We use an 80%\-20% train/test split, stratified by class and performed at the paper level, ensuring that all reviews of a given paper appear exclusively in either the training or test set\.
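A paper-level split can be sketched by shuffling paper identifiers rather than individual reviews, so that no paper contributes to both sets. The dictionary layout and the `paper_id` key are assumptions for the example, and class stratification is deliberately left out to keep the sketch short.

```python
import random


def paper_level_split(reviews, test_fraction=0.2, seed=0):
    """Split reviews so that each paper lands entirely in train or test.

    reviews: list of dicts, each with a "paper_id" key (assumed layout).
    Returns (train_reviews, test_reviews).
    Note: stratification by class is not handled in this sketch.
    """
    paper_ids = sorted({r["paper_id"] for r in reviews})
    random.Random(seed).shuffle(paper_ids)
    n_test = max(1, round(test_fraction * len(paper_ids)))
    test_papers = set(paper_ids[:n_test])
    train = [r for r in reviews if r["paper_id"] not in test_papers]
    test = [r for r in reviews if r["paper_id"] in test_papers]
    return train, test
```

Splitting by paper matters for this method in particular: reference reviews are paper-specific, so a review-level split would let near-identical claims about the same paper leak from training into evaluation.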

##### Evaluation\.

We evaluate Sem\-Detect under two problem framings: binary classification, which distinguishes AI\-generated reviews from non\-AI ones, and three\-class classification, which additionally separates LLM\-refined human reviews as a distinct category\. We report ROC curves, AUC, and True Positive Rates at 0\.1% and 1% False Positive Rates for binary settings, and macro F1 for three\-class\. Where reported, uncertainty is estimated via bootstrap resampling \(1,000 iterations\)\.
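For reference, the low-FPR operating point and its bootstrap uncertainty can be computed along these lines. This is a sketch with assumptions: the threshold is chosen conservatively so the empirical FPR never exceeds the target, and the percentile interval is one common way to summarize bootstrap resamples; the section does not specify either detail.

```python
import numpy as np


def tpr_at_fpr(y_true, scores, max_fpr):
    """TPR at the loosest threshold whose empirical FPR stays <= max_fpr.

    y_true: 1 for AI-generated (positive), 0 otherwise.
    scores: higher means more likely AI-generated.
    """
    y_true = np.asarray(y_true).astype(bool)
    scores = np.asarray(scores, dtype=float)
    neg = np.sort(scores[~y_true])[::-1]          # negatives, descending
    allowed_fp = int(np.floor(max_fpr * len(neg)))
    # Flag only scores strictly above the (allowed_fp + 1)-th highest negative.
    threshold = neg[allowed_fp] if allowed_fp < len(neg) else -np.inf
    return float((scores[y_true] > threshold).mean())


def bootstrap_interval(y_true, scores, max_fpr, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for TPR at a fixed FPR."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)
        yb = y_true[idx].astype(bool)
        if yb.all() or not yb.any():
            continue  # a resample needs both classes to define TPR and FPR
        stats.append(tpr_at_fpr(yb, scores[idx], max_fpr))
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)
```

At an FPR of 0\.1%, fewer than 1,000 negative examples means no false positive is tolerated at all, which is why this metric is far more demanding than AUC.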

### 4\.2 Baselines

We compare Sem\-Detect to general\-purpose and domain\-specific peer\-review detectors\.

On the general\-purpose side, we evaluate LogRank \(Ippolito et al\., [2020](https://arxiv.org/html/2605.21713#bib.bib47)\), Fast\-DetectGPT \(Bao et al\., [2024](https://arxiv.org/html/2605.21713#bib.bib6)\), Binoculars \(Hans et al\., [2024](https://arxiv.org/html/2605.21713#bib.bib18)\), MAGE \(Li et al\., [2024](https://arxiv.org/html/2605.21713#bib.bib35)\), and RADAR \(Hu et al\., [2023](https://arxiv.org/html/2605.21713#bib.bib7)\), spanning zero\-shot, supervised, and adversarially\-trained methods\. Domain\-specific baselines are the TF model \(Kumar et al\., [2024](https://arxiv.org/html/2605.21713#bib.bib12)\), Anchor \(Yu et al\., [2026](https://arxiv.org/html/2605.21713#bib.bib8)\), and EditLens \(Thai et al\., [2026](https://arxiv.org/html/2605.21713#bib.bib9)\)\. See Appendix [C](https://arxiv.org/html/2605.21713#A3) for details\.

### 4\.3 Research Questions

We evaluate Sem\-Detect through experiments that address the following questions:

- How competitive is Sem\-Detect on the standard human vs\. fully AI\-generated task? Since most prior works target binary authorship attribution, we first evaluate in a setting that excludes LLM\-refined reviews \(Section [5\.1](https://arxiv.org/html/2605.21713#S5.SS1)\)\.
- Can detectors flag fully AI\-generated reviews without misclassifying legitimate LLM\-assisted writing? We study the three\-class setting and quantify the trade\-off between detecting fully AI reviews and avoiding false positives on LLM\-refined reviews \(Section [5\.2](https://arxiv.org/html/2605.21713#S5.SS2)\)\.
- Can confidence\-based filtering improve Sem\-Detect’s reliability in practice? We analyze the accuracy/coverage trade\-off when low\-confidence predictions are flagged for manual review, rather than auto\-classified \(Section [5\.3](https://arxiv.org/html/2605.21713#S5.SS3)\)\.
- How robust is Sem\-Detect to shifts in generation conditions? We analyze out\-of\-distribution behavior under generation shifts by testing on both fully AI\-generated and LLM\-refined reviews produced by LLMs and prompting templates not used during training, and measure degradation relative to in\-distribution evaluation \(Section [5\.4](https://arxiv.org/html/2605.21713#S5.SS4)\)\.
- Does Sem\-Detect generalize to a new peer\-review domain without modification? We apply Sem\-Detect as\-is to reviews from a medical imaging venue and measure cross\-domain transfer relative to the standard ML\-conferences test data \(Section [5\.5](https://arxiv.org/html/2605.21713#S5.SS5)\)\.
- What does Sem\-Detect predict on recent peer\-review data? We analyze authorship distributions on ICLR 2026 reviews, and compare trends with existing claims about AI prevalence in top\-tier ML conferences \(Section [5\.6](https://arxiv.org/html/2605.21713#S5.SS6)\)\.

Table 1: Two\-class detection \(Human vs\. AI\)\. We report AUC and true positive rates \(TPR\) at fixed false positive rates \(FPR\) of 0\.1% and 1%\. †Domain\-specific detectors trained or tuned on peer\-review data\.

![Refer to caption](https://arxiv.org/html/2605.21713v1/x3.png)

Figure 3: ROC curves in the binary setting\. LLM\-refined reviews are not considered in this experiment\.

## 5 Results

### 5\.1 Main Results: Binary Classification

Table [1](https://arxiv.org/html/2605.21713#S4.T1) and Figure [3](https://arxiv.org/html/2605.21713#S4.F3) summarize performance on the binary classification task, where LLM\-refined reviews are not yet considered\. In this setting, general\-purpose detectors such as Binoculars and RADAR achieve moderate to strong AUC scores \(0\.751 and 0\.965, respectively\)\. However, their effectiveness declines at low false positive rates \(FPR\), which is critical for practical deployment\. By contrast, domain\-specific approaches are more robust in this region\. The TF Model, Anchor, and EditLens maintain competitive AUC while achieving higher true positive rates \(TPR\) at low FPR thresholds, underscoring the value of using signals specific to the peer\-review domain\.

Sem\-Detect further improves on these results and performs best across all metrics\. With an AUC of 0\.999 and a TPR@0\.1% FPR of 0\.760 \(a 25\.5% relative improvement over EditLens\), the results indicate that, even in the binary setting, combining claim\-level semantic analysis with textual features improves performance over prior methods\.

### 5\.2 Main Results: Multi\-Class Classification

The central contribution of Sem\-Detect lies in its ability not only to distinguish between human and AI authorship, but also to identify human reviews polished with an LLM\.

##### Comparison with binary detectors\.

Most existing detectors produce only binary predictions\. To compare against them, we first evaluate all methods under a simplified setting: we group LLM\-refined and human reviews together as the non\-AI class, while fully AI\-generated reviews form the positive class\. Figure [4](https://arxiv.org/html/2605.21713#S5.F4) shows the results of this comparison\.

![Refer to caption](https://arxiv.org/html/2605.21713v1/x4.png)

Figure 4: ROC curves for the collapsed binary task\. Human and LLM\-Refined reviews are grouped against fully AI reviews\.

As shown in Figure [4](https://arxiv.org/html/2605.21713#S5.F4), this setting proves challenging for most general\-purpose detectors: LogRank, MAGE, Binoculars, and Fast\-DetectGPT all collapse to near\-random performance \(AUC $\leq 0.513$\)\. This outcome is expected: LLM\-refined text shares many surface\-level characteristics with fully AI\-generated text, making it hard to separate the two classes based on textual features alone\. The TF Model, despite being tailored to the peer\-review domain, also suffers a substantial drop \(AUC = 0\.674\), as its reliance on token frequency patterns is disrupted by LLM refinement\.

Two methods stand out as more robust\. RADAR achieves an AUC of 0\.966, suggesting that adversarial training helps the detector learn subtle differences between polished and fully generated text\. Anchor also performs well \(AUC = 0\.980\), which aligns with its emphasis on semantic similarity rather than surface\-level patterns\. However, neither method can distinguish among the three classes directly\. Sem\-Detect achieves the highest AUC \(0\.990\) while also providing full three\-class predictions\.

##### Three\-class classification results\.

We now turn to the main evaluation setting\. Figure [5](https://arxiv.org/html/2605.21713#S5.F5) reports the confusion matrix for Sem\-Detect on the three\-class task\.

Overall, the classifier performs well on both AI\-generated and LLM\-refined reviews, correctly identifying 91\.18% of AI reviews and 91\.61% of LLM\-refined ones\.

The main source of error involves human\-written reviews being classified as LLM\-refined \(35\.38%\), likely reflecting both the inherent difficulty of separating polished human writing from LLM\-assisted text and the class imbalance in training, where LLM\-refined reviews are more prevalent\.

![Refer to caption](https://arxiv.org/html/2605.21713v1/x5.png) Figure 5: Sem\-Detect multi\-class confusion matrix \(%\)\.

We view this error pattern as acceptable because the resulting bias is conservative: when uncertain, the model tends to predict LLM\-refined rather than fully AI\-generated\. As a result, hard misclassifications from human to AI remain very rare \(0\.66%\), which is desirable in practice\.
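The percentages quoted from the confusion matrix are row-normalized: each row is a true class and its entries sum to 100%. A minimal sketch of that computation is shown below; the helper name `confusion_percent`, the class names, and the toy labels are assumptions for illustration.

```python
from typing import Dict, List, Sequence


def confusion_percent(
    y_true: Sequence[str], y_pred: Sequence[str], classes: Sequence[str]
) -> Dict[str, Dict[str, float]]:
    """Row-normalized confusion matrix: result[true][pred] is a percentage."""
    counts = {t: {p: 0 for p in classes} for t in classes}
    for t, p in zip(y_true, y_pred):
        counts[t][p] += 1
    result: Dict[str, Dict[str, float]] = {}
    for t in classes:
        row_total = sum(counts[t].values())
        # A true class with no examples gets an all-zero row.
        result[t] = {
            p: (100.0 * counts[t][p] / row_total if row_total else 0.0) for p in classes
        }
    return result


# Toy labels (made up), showing the Human -> LLM-refined cell discussed above.
classes: List[str] = ["human", "llm_refined", "ai"]
y_true = ["human", "human", "human", "human", "ai", "ai", "llm_refined"]
y_pred = ["human", "human", "llm_refined", "human", "ai", "llm_refined", "llm_refined"]
matrix = confusion_percent(y_true, y_pred, classes)
human_to_refined = matrix["human"]["llm_refined"]
human_to_ai = matrix["human"]["ai"]
```

Reading the matrix this way separates the frequent but low-stakes Human to LLM-refined confusion from the rare Human to AI error.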

##### The role of semantic similarity\.

Figure [6](https://arxiv.org/html/2605.21713#S5.F6) illustrates why claim\-level analysis proves effective\. The plot displays the mean best\-match claim similarity for each class \(the most discriminative feature in our classifier\)\. AI\-generated reviews show consistently high similarity to reference AI reviews \(median ≈ 0\.73\)\. Human and LLM\-refined reviews, by contrast, cluster together at lower values \(median ≈ 0\.64\), supporting our premise: AI models converge on similar claims, but LLM refinement preserves the distinctiveness of human judgments\.

![Refer to caption](https://arxiv.org/html/2605.21713v1/x6.png) Figure 6: Mean best\-match claim similarity by class \(test set\)\.
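One plausible reading of the mean best-match claim similarity is sketched below: each claim of the target review is matched to its most similar claim among the reference AI reviews, and the best-match scores are averaged. This is an assumption about the feature's form; the use of cosine similarity, the function names, and the two-dimensional toy vectors (standing in for real claim embeddings) are illustrative only.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_best_match(target_claims: Sequence[Vector], reference_claims: Sequence[Vector]) -> float:
    """Average, over target claims, of the highest similarity to any reference claim."""
    if not target_claims or not reference_claims:
        raise ValueError("both claim sets must be non-empty")
    best = [max(cosine(t, r) for r in reference_claims) for t in target_claims]
    return sum(best) / len(best)


# Toy embeddings: the first target claim has an exact match, the second only a partial one.
target = [[1.0, 0.0], [0.0, 1.0]]
reference = [[1.0, 0.0], [1.0, 1.0]]
similarity = mean_best_match(target, reference)
```

Under this reading, a review whose claims all have close counterparts in the reference AI reviews scores high, while a review containing points that no reference review raises is pulled towards lower values.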

### 5\.3 Deployment via Confidence Thresholding

By default, Sem\-Detect predicts the highest\-probability class regardless of certainty\. For example, probabilities of 0\.51 AI\-generated, 0\.48 human\-written, and 0\.01 LLM\-refined still yield an AI\-generated label, despite near\-tie uncertainty between human and AI, which is undesirable when false accusations are costlier than missed detections\.

![Refer to caption](https://arxiv.org/html/2605.21713v1/x7.png) Figure 7: Prediction confidence calibration by predicted class\.

Fortunately, as Figure [7](https://arxiv.org/html/2605.21713#S5.F7) shows, Sem\-Detect’s confidence scores are well\-calibrated: correct predictions average 0\.91 confidence while incorrect ones average 0\.72\.

![Refer to caption](https://arxiv.org/html/2605.21713v1/x8.png) \(a\) Accuracy\-coverage trade\-off\.
![Refer to caption](https://arxiv.org/html/2605.21713v1/x9.png) \(b\) Human review misclassification rates\.

Figure 8: Effect of confidence thresholding on classification accuracy, coverage, and Human → LLM\-refined error rate\.

We can therefore introduce a confidence threshold θ that flags low\-confidence predictions for manual review, trading coverage for accuracy on the rest\.

Figure [8](https://arxiv.org/html/2605.21713#S5.F8) further quantifies the trade\-off\. At θ = 0\.80, 79% of reviews are still classified automatically and accuracy on that set rises to 94\.7%, while the Human → LLM\-refined error, the main failure mode in Figure [5](https://arxiv.org/html/2605.21713#S5.F5), drops substantially\.
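The thresholding rule can be sketched in a few lines: predict the highest-probability class only when its probability reaches θ, and otherwise defer to manual review. The function names and the toy batch are hypothetical; only the near-tie example (0.51 / 0.48 / 0.01) comes from the text above.

```python
from typing import Dict, Optional, Sequence, Tuple


def predict_or_defer(probs: Dict[str, float], theta: float) -> Optional[str]:
    """Return the top class if its probability is >= theta, else None (manual review)."""
    label = max(probs, key=lambda name: probs[name])
    return label if probs[label] >= theta else None


def accuracy_and_coverage(
    all_probs: Sequence[Dict[str, float]], y_true: Sequence[str], theta: float
) -> Tuple[Optional[float], float]:
    """Accuracy on automatically classified reviews, and the fraction classified."""
    if len(all_probs) != len(y_true) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    decisions = [(predict_or_defer(p, theta), t) for p, t in zip(all_probs, y_true)]
    kept = [(pred, true) for pred, true in decisions if pred is not None]
    coverage = len(kept) / len(y_true)
    # Accuracy is undefined when every prediction is deferred.
    accuracy = sum(pred == true for pred, true in kept) / len(kept) if kept else None
    return accuracy, coverage


# The near-tie case from the text: labelled AI by default, deferred at theta = 0.80.
near_tie = {"ai": 0.51, "human": 0.48, "llm_refined": 0.01}
default_label = predict_or_defer(near_tie, theta=0.0)
thresholded_label = predict_or_defer(near_tie, theta=0.80)

# Made-up batch to show the accuracy-coverage trade-off.
batch = [
    near_tie,
    {"ai": 0.95, "human": 0.03, "llm_refined": 0.02},
    {"human": 0.85, "llm_refined": 0.10, "ai": 0.05},
    {"llm_refined": 0.60, "human": 0.35, "ai": 0.05},
]
truth = ["human", "ai", "human", "human"]
acc_all, cov_all = accuracy_and_coverage(batch, truth, theta=0.0)
acc_080, cov_080 = accuracy_and_coverage(batch, truth, theta=0.80)
```

In this toy batch, raising θ removes exactly the two uncertain predictions, both of which were wrong, so accuracy on the remaining reviews rises while coverage halves.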

### 5\.4 Robustness to Generation Conditions

In practice, reviewers may use diverse models and prompts to generate or refine reviews, raising the question of whether Sem\-Detect generalizes beyond its training conditions\. We evaluate two out\-of\-distribution settings: \(i\) OOD\-M, where reviews are generated by unseen model families using the same prompt template, and \(ii\) OOD\-M\+P, where both models and prompts differ\. For OOD\-M, we use Mistral\-Large\-3 \(Mistral, [2025](https://arxiv.org/html/2605.21713#bib.bib27)\), Claude\-Sonnet\-4 \(Anthropic, [2025](https://arxiv.org/html/2605.21713#bib.bib25)\), and GPT\-oss\-120b \(Agarwal et al\., [2025](https://arxiv.org/html/2605.21713#bib.bib28)\); for OOD\-M\+P, we additionally vary prompt structure, specificity, and review format \(further details in Appendix [D\.1](https://arxiv.org/html/2605.21713#A4.SS1)\)\. Table [2](https://arxiv.org/html/2605.21713#S5.T2) reports three\-class performance under these conditions\.

Table 2: Sem\-Detect under distribution shift, for two settings: \(i\) different models \(M\) and \(ii\) different models and prompts \(M\+P\)\.

#### 5\.4\.1 Performance Under Distribution Shift

We expected performance to drop under distribution shift, and it does: 3\-class Macro\-F1 falls from 0\.84 to 0\.71 \(OOD\-M\) and 0\.68 \(OOD\-M\+P\)\. What matters, though, is how the model fails\. Rather than making high\-stakes errors, Sem\-Detect routes uncertain samples to the LLM\-refined class, and overall, AI precision actually increases to 0\.97, meaning predictions of “AI\-generated” are highly reliable\.

This conservative behavior raises a natural question: is LLM\-refined merely an uncertainty bucket? The OOD class\-wise metrics suggest otherwise\. In fact, under OOD\-M\+P, the LLM\-refined class achieves a recall of 0\.769 and a precision of 0\.759, a pattern inconsistent with a catch\-all category, which would typically show degradation in at least one of these metrics \(further details in Appendices [D\.2](https://arxiv.org/html/2605.21713#A4.SS2)\-[D\.5](https://arxiv.org/html/2605.21713#A4.SS5)\)\.
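For reference, the class-wise precision, recall, and Macro-F1 discussed here follow the standard definitions, sketched below. The function names and the toy labels are hypothetical. Plugging the reported LLM-refined precision (0.759) and recall (0.769) into the F1 formula gives roughly 0.764; this is plain arithmetic on the two reported values, not a figure quoted from the paper.

```python
from typing import Dict, Sequence


def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


def per_class_metrics(
    y_true: Sequence[str], y_pred: Sequence[str], classes: Sequence[str]
) -> Dict[str, Dict[str, float]]:
    """One-vs-rest precision, recall and F1 for every class."""
    metrics: Dict[str, Dict[str, float]] = {}
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        metrics[c] = {"precision": precision, "recall": recall, "f1": f1_from(precision, recall)}
    return metrics


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str], classes: Sequence[str]) -> float:
    """Unweighted mean of the per-class F1 scores."""
    metrics = per_class_metrics(y_true, y_pred, classes)
    return sum(m["f1"] for m in metrics.values()) / len(classes)


# F1 implied by the reported LLM-refined precision and recall under OOD-M+P.
refined_f1 = f1_from(0.759, 0.769)

# Toy labels (made up) for the three-class case.
classes = ["ai", "llm_refined", "human"]
y_true = ["ai", "ai", "llm_refined", "llm_refined", "human", "human"]
y_pred = ["ai", "llm_refined", "llm_refined", "llm_refined", "human", "ai"]
toy_macro_f1 = macro_f1(y_true, y_pred, classes)
```

A catch-all class would absorb many samples from other classes, lowering its precision; balanced precision and recall, as reported for LLM-refined, is what argues against that reading.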

### 5\.5 Cross\-Domain Generalization

We now extend our evaluation to a different field: medical imaging\. We select MIDL 2022 for this analysis because, like ICLR and NeurIPS, it hosts its reviews on OpenReview, allowing us to collect authentic human reviews under the same conditions\. Specifically, we sample approximately 100 random papers from this venue, generate AI\-written and LLM\-refined reviews using our standard pipeline, and run Sem\-Detect without any modifications\.

The results are encouraging\. As Figure [9](https://arxiv.org/html/2605.21713#S5.F9) shows, Sem\-Detect achieves comparable or slightly higher F1 scores on MIDL than on the ML conferences test set, and this holds across all three classes\. That said, one limitation deserves mention: MIDL, while medically oriented, still centers on deep learning methods\. Evaluating on more distant fields would be ideal, but open peer\-review data remains limited outside of computer science\.

![Refer to caption](https://arxiv.org/html/2605.21713v1/x10.png) Figure 9: Cross\-domain generalization results\. F1 scores for Sem\-Detect on the ML test set and the medical imaging venue MIDL 2022\. No domain\-specific retraining is performed\.

### 5\.6 ICLR 2026 Comparison

Our evaluations so far have relied on data where ground truth labels are known due to temporal constraints\. To examine how Sem\-Detect behaves in a contemporary setting, we turn to ICLR 2026, sampling approximately 600 papers at random\. This analysis is motivated by recent claims from Pangram Labs \(Thai et al\., [2026](https://arxiv.org/html/2605.21713#bib.bib9)\), whose EditLens detector suggests that more than 20% of ICLR 2026 reviews were fully AI\-generated \(Emi, [2025](https://arxiv.org/html/2605.21713#bib.bib29)\)\. Without ground truth, our goal is not to establish which method is correct\. Instead, we examine whether Sem\-Detect produces a reasonable distribution of review categories\.

Figure [10](https://arxiv.org/html/2605.21713#S5.F10) shows that the two methods produce quite different distributions: EditLens classifies 24% of reviews as AI\-generated, 32% as LLM\-refined, and 44% as human; Sem\-Detect predicts 5%, 61%, and 34%, respectively\. The divergence appears primarily in how each method handles the middle ground: while EditLens places predictions more liberally on the extreme classes, Sem\-Detect favors LLM\-refined classifications for ambiguous cases\. This conservative behavior is desirable in practice because, in high\-stakes settings, false accusations carry greater cost than missed detections\.

That said, both distributions appear plausible, and for reviews that Sem\-Detect classifies as either fully AI\-generated or fully human, EditLens agrees with the prediction approximately 70% of the time\. This suggests that, despite their different design philosophies, both methods capture meaningful signal about AI presence in peer review\.
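The agreement figure above is conditional: it is computed only over reviews that Sem-Detect assigns to one of the two extreme classes. A minimal sketch of such a conditional agreement rate follows; the function name, class names, and toy predictions are assumptions for illustration.

```python
from typing import Sequence, Tuple


def agreement_on_extremes(
    sem_preds: Sequence[str],
    other_preds: Sequence[str],
    extremes: Tuple[str, ...] = ("ai", "human"),
) -> float:
    """Share of reviews on which both detectors agree, restricted to reviews
    that the first detector labels with one of the extreme classes."""
    pairs = [(s, o) for s, o in zip(sem_preds, other_preds) if s in extremes]
    if not pairs:
        raise ValueError("first detector made no extreme predictions")
    return sum(1 for s, o in pairs if s == o) / len(pairs)


# Toy predictions (made up): the third review is ignored because the first
# detector labels it LLM-refined.
sem_detect_preds = ["ai", "human", "llm_refined", "human", "ai"]
editlens_preds = ["ai", "llm_refined", "ai", "human", "human"]
agreement = agreement_on_extremes(sem_detect_preds, editlens_preds)
```

Because the rate conditions on one detector's predictions, it is not symmetric: restricting to the reviews the other detector labels as extreme would generally give a different number.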

![Refer to caption](https://arxiv.org/html/2605.21713v1/x11.png) Figure 10: Sem\-Detect and EditLens \(Pangram Labs\) review authorship predictions on ICLR 2026 data\.

## 6 Conclusions

In this paper, we propose Sem\-Detect, a detection framework for peer\-review authorship attribution that distinguishes fully human\-written reviews from those refined using an LLM and those generated end\-to\-end by a machine\. Our approach exploits the fact that authorship signals reside not only in textual features of the review, but also in the semantic content of expressed ideas\. While different AI models tend to converge on similar claims when reviewing the same paper, human reviewers introduce more unique and diverse judgments\.

We validate Sem\-Detect on reviews from top\-tier ML conferences and find that it outperforms all baselines in both binary and three\-class settings\. At the same time, fewer than 3\.5% of LLM\-refined human reviews are mistakenly flagged as AI\-generated\.

Beyond these controlled conditions, Sem\-Detect also shows reasonable behavior under distribution shift\. The method generalizes to unseen models, transfers to medical imaging reviews without retraining, and produces plausible predictions on recent ICLR 2026 data\. This shows that effective detection and fairness to legitimate LLM use can coexist\.

## Impact Statement

This work contributes to the ongoing effort to preserve integrity in peer review\. By distinguishing fully AI\-generated reviews from those where humans used an LLM only to improve clarity, our framework supports policies that can detect problematic content without penalizing responsible AI assistance\.

That said, we recognize an important limitation in our approach\. Our method assumes that the originality of ideas can help distinguish human from AI authorship\. As models continue to improve, they may eventually produce reviews with novel, high\-quality insights that are indistinguishable from, or even better than, those of human experts\. If that happens, the line between human and AI authorship may blur, raising a deeper question: does the origin of an idea matter if its quality is sound?

Finally, we note that any detection system risks false accusations, which can harm reviewers’ reputations\. While our results show very low rates of misclassification between true human entries and AI, we emphasize that our method should be used as one signal among many, not as a definitive judgment\.

## Reproducibility

We release the following artifacts:

- [Code](https://github.com/avduarte333/Sem-Detect): full pipeline, two pre\-trained classifiers, and a self\-hosted Flask web demo\.
- [Data](https://huggingface.co/datasets/Sem-Detect/ML_Conferences-Peer-Reviews): complete set of reviews for the 800 papers from ICLR and NeurIPS 2021\-2022\.

Further details on prompts, data construction, and additional analyses are provided in the appendices\.

## Acknowledgements

We acknowledge support from national funds through Fundação para a Ciência e a Tecnologia, I\.P\. \(FCT\), under projects UID/50021/2025 and UID/PRR/50021/2025\.

This work is also co\-financed by FCT through the Carnegie Mellon Portugal Program under the fellowship PRT/BD/155049/2024\.

Lei Li is partly supported by the CMU CyLab seed grant\.

## References

- S\. Agarwal, L\. Ahmad, J\. Ai, S\. Altman, A\. Applebaum, E\. Arbus, R\. K\. Arora, Y\. Bai, B\. Baker, H\. Bao,et al\.\(2025\)GPT\-oss\-120b & GPT\-oss\-20b model card\.arXiv preprint arXiv:2508\.10925\.Cited by:[§5\.4](https://arxiv.org/html/2605.21713#S5.SS4.p1.1)\.
- A\. Anthropic \(2025\)System card: Claude opus 4 & claude sonnet 4\.Claude\-4 Model Card\.Cited by:[§5\.4](https://arxiv.org/html/2605.21713#S5.SS4.p1.1)\.
- A\. Anthropic \(2026\)System Card:Claude Opus 4\.6\.Note:[https://www\-cdn\.anthropic\.com/6a5fa276ac68b9aeb0c8b6af5fa36326e0e166dd\.pdf](https://www-cdn.anthropic.com/6a5fa276ac68b9aeb0c8b6af5fa36326e0e166dd.pdf)Cited by:[§D\.6](https://arxiv.org/html/2605.21713#A4.SS6.p2.1)\.
- G\. Bao, Y\. Zhao, Z\. Teng, L\. Yang, and Y\. Zhang \(2024\)Fast\-DetectGPT: Efficient Zero\-Shot Detection of Machine\-Generated Text via Conditional Probability Curvature\.InInternational Conference on Representation Learning,B\. Kim, Y\. Yue, S\. Chaudhuri, K\. Fragkiadaki, M\. Khan, and Y\. Sun \(Eds\.\),Vol\.2024,pp\. 24814–24836\.Cited by:[§1](https://arxiv.org/html/2605.21713#S1.p4.1),[§2\.1](https://arxiv.org/html/2605.21713#S2.SS1.SSS0.Px2.p1.1),[§4\.2](https://arxiv.org/html/2605.21713#S4.SS2.p2.1)\.
- T\. Chen, H\. Wang, S\. Chen, W\. Yu, K\. Ma, X\. Zhao, H\. Zhang, and D\. Yu \(2024\)Dense X Retrieval: What Retrieval Granularity Should We Use?\.InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing,Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\),Miami, Florida, USA,pp\. 15159–15177\.External Links:[Document](https://dx.doi.org/10.18653/v1/2024.emnlp-main.845)Cited by:[§2\.3](https://arxiv.org/html/2605.21713#S2.SS3.p1.1)\.
- G\. Comanici, E\. Bieber, M\. Schaekermann, I\. Pasupat, N\. Sachdeva, I\. Dhillon, M\. Blistein, O\. Ram, D\. Zhang, E\. Rosen,et al\.\(2025\)Gemini 2\.5: pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities\.arXiv preprint arXiv:2507\.06261\.Cited by:[§4\.1](https://arxiv.org/html/2605.21713#S4.SS1.SSS0.Px1.p1.1)\.
- A\. V\. Duarte, J\. D\. Marques, M\. Graça, M\. Freire, L\. Li, and A\. L\. Oliveira \(2024\)LumberChunker: Long\-Form Narrative Document Segmentation\.InFindings of the Association for Computational Linguistics: EMNLP 2024,Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\),Miami, Florida, USA,pp\. 6473–6486\.External Links:[Document](https://dx.doi.org/10.18653/v1/2024.findings-emnlp.377)Cited by:[§2\.3](https://arxiv.org/html/2605.21713#S2.SS3.p1.1)\.
- B\. Emi \(2025\)Pangram Predicts 21% of ICLR Reviews are AI\-Generated\.Note:[https://www\.pangram\.com/blog/pangram\-predicts\-21\-of\-iclr\-reviews\-are\-ai\-generated](https://www.pangram.com/blog/pangram-predicts-21-of-iclr-reviews-are-ai-generated)Accessed: 2025\-12\-01Cited by:[§5\.6](https://arxiv.org/html/2605.21713#S5.SS6.p1.1)\.
- A\. Fitzgibbon, L\. Leal\-Taixé, and V\. Murino \(2024\)Opening ceremony slides at the European Conference on Computer Vision \(ECCV 2024\)\.Note:Slide 31 of 67External Links:[Link](https://eccv2024.ecva.net/media/eccv-2024/Slides/2822.pdf)Cited by:[§1](https://arxiv.org/html/2605.21713#S1.p2.1)\.
- S\. S\. Ghosal, S\. Chakraborty, J\. Geiping, F\. Huang, D\. Manocha, and A\. Bedi \(2023\)A survey on the possibilities & impossibilities of AI\-generated text detection\.Transactions on Machine Learning Research\.Note:Survey CertificationExternal Links:ISSN 2835\-8856,[Link](https://openreview.net/forum?id=AXtFeYjboj)Cited by:[§2](https://arxiv.org/html/2605.21713#S2.p1.1)\.
- GLM\-5\-Team \(2026\)GLM\-5: from Vibe Coding to Agentic Engineering\.External Links:2602\.15763Cited by:[§D\.5](https://arxiv.org/html/2605.21713#A4.SS5.p1.1)\.
- A\. J\. Gutiérrez Megías, L\. A\. Ureña\-López, and E\. Martínez Cámara \(2024\)The influence of the perplexity score in the detection of machine\-generated texts\.InProceedings of the First International Conference on Natural Language Processing and Artificial Intelligence for Cyber Security,R\. Mitkov, S\. Ezzini, T\. Ranasinghe, I\. Ezeani, N\. Khallaf, C\. Acarturk, M\. Bradbury, M\. El\-Haj, and P\. Rayson \(Eds\.\),Lancaster, UK,pp\. 80–85\.External Links:[Link](https://aclanthology.org/2024.nlpaics-1.10/)Cited by:[§2\.1](https://arxiv.org/html/2605.21713#S2.SS1.SSS0.Px2.p1.1)\.
- A\. Hans, A\. Schwarzschild, V\. Cherepanova, H\. Kazemi, A\. Saha, M\. Goldblum, J\. Geiping, and T\. Goldstein \(2024\)Spotting LLMs With Binoculars: Zero\-Shot Detection of Machine\-Generated Text\.InProceedings of the 41st International Conference on Machine Learning,R\. Salakhutdinov, Z\. Kolter, K\. Heller, A\. Weller, N\. Oliver, J\. Scarlett, and F\. Berkenkamp \(Eds\.\),Proceedings of Machine Learning Research, Vol\.235,pp\. 17519–17537\.Cited by:[§2\.1](https://arxiv.org/html/2605.21713#S2.SS1.SSS0.Px2.p1.1),[§4\.2](https://arxiv.org/html/2605.21713#S4.SS2.p2.1)\.
- X\. Hu, P\. Chen, and T\. Ho \(2023\)RADAR: Robust AI\-Text Detection via Adversarial Learning\.InAdvances in Neural Information Processing Systems,A\. Oh, T\. Naumann, A\. Globerson, K\. Saenko, M\. Hardt, and S\. Levine \(Eds\.\),Vol\.36,pp\. 15077–15095\.Cited by:[§1](https://arxiv.org/html/2605.21713#S1.p4.1),[§2\.1](https://arxiv.org/html/2605.21713#S2.SS1.SSS0.Px3.p1.1),[§4\.2](https://arxiv.org/html/2605.21713#S4.SS2.p2.1)\.
- ICML Conference Chairs \(2025\)ICML 2025 Reviewer Instructions\.Note:[https://icml\.cc/Conferences/2025/ReviewerInstructions](https://icml.cc/Conferences/2025/ReviewerInstructions)Accessed: 2025\-06\-04Cited by:[§1](https://arxiv.org/html/2605.21713#S1.p2.1)\.
- ICML Conference Chairs \(2026\)ICML 2026 LLM\-Policy Instructions\.Note:[https://icml\.cc/Conferences/2026/LLM\-Policy](https://icml.cc/Conferences/2026/LLM-Policy)Accessed: 2026\-01\-08Cited by:[§1](https://arxiv.org/html/2605.21713#S1.p2.1)\.
- D\. Ippolito, D\. Duckworth, C\. Callison\-Burch, and D\. Eck \(2020\)Automatic Detection of Generated Text is Easiest when Humans are Fooled\.InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics,D\. Jurafsky, J\. Chai, N\. Schluter, and J\. Tetreault \(Eds\.\),Online,pp\. 1808–1822\.External Links:[Document](https://dx.doi.org/10.18653/v1/2020.acl-main.164)Cited by:[§4\.2](https://arxiv.org/html/2605.21713#S4.SS2.p2.1)\.
- G\. Jawahar, M\. Abdul\-Mageed, and L\. Lakshmanan \(2020\)Automatic detection of machine generated text: a critical survey\.InProceedings of the 28th International Conference on Computational Linguistics,D\. Scott, N\. Bel, and C\. Zong \(Eds\.\),pp\. 2296–2309\.Cited by:[§2](https://arxiv.org/html/2605.21713#S2.p1.1)\.
- A\. Q\. Jiang, A\. Sablayrolles, A\. Mensch, C\. Bamford, D\. S\. Chaplot, D\. de las Casas, F\. Bressand, G\. Lengyel, G\. Lample, L\. Saulnier, L\. R\. Lavaud, M\. Lachaux, P\. Stock, T\. L\. Scao, T\. Lavril, T\. Wang, T\. Lacroix, and W\. E\. Sayed \(2023\)Mistral 7B\.External Links:2310\.06825Cited by:[§4\.1](https://arxiv.org/html/2605.21713#S4.SS1.SSS0.Px1.p1.1)\.
- G\. Ke, Q\. Meng, T\. Finley, T\. Wang, W\. Chen, W\. Ma, Q\. Ye, and T\. Liu \(2017\)LightGBM: A Highly Efficient Gradient Boosting Decision Tree\.InAdvances in Neural Information Processing Systems,I\. Guyon, U\. V\. Luxburg, S\. Bengio, H\. Wallach, R\. Fergus, S\. Vishwanathan, and R\. Garnett \(Eds\.\),Vol\.30,pp\.\.Cited by:[§3\.2](https://arxiv.org/html/2605.21713#S3.SS2.SSS0.Px3.p5.1)\.
- Kimi\-Team \(2026\)Kimi K2\.5: Visual Agentic Intelligence\.External Links:2602\.02276Cited by:[§D\.5](https://arxiv.org/html/2605.21713#A4.SS5.p1.1)\.
- J\. Kirchenbauer, J\. Geiping, Y\. Wen, J\. Katz, I\. Miers, and T\. Goldstein \(2023\)A watermark for large language models\.InProc\. of ICML,A\. Krause, E\. Brunskill, K\. Cho, B\. Engelhardt, S\. Sabato, and J\. Scarlett \(Eds\.\),Proceedings of Machine Learning Research, Vol\.202,pp\. 17061–17084\.Cited by:[§2\.1](https://arxiv.org/html/2605.21713#S2.SS1.SSS0.Px1.p1.1)\.
- S\. Kumar, M\. Sahu, V\. Gacche, T\. Ghosal, and A\. Ekbal \(2024\)‘Quis custodiet ipsos custodes?’ who will watch the watchmen? on detecting AI\-generated peer\-reviews\.InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing,Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\),Miami, Florida, USA,pp\. 22663–22679\.External Links:[Link](https://aclanthology.org/2024.emnlp-main.1262/),[Document](https://dx.doi.org/10.18653/v1/2024.emnlp-main.1262)Cited by:[§2\.2](https://arxiv.org/html/2605.21713#S2.SS2.SSS0.Px1.p1.1),[§4\.2](https://arxiv.org/html/2605.21713#S4.SS2.p2.1)\.
- T\. Lavergne, T\. Urvoy, and F\. Yvon \(2008\)Detecting fake content with relative entropy scoring\.InProceedings of the 2008 International Conference on Uncovering Plagiarism, Authorship and Social Software Misuse \- Volume 377,PAN’08,Aachen, DEU,pp\. 27–31\.Cited by:[§2\.1](https://arxiv.org/html/2605.21713#S2.SS1.SSS0.Px2.p1.1)\.
- Y\. Li, Q\. Li, L\. Cui, W\. Bi, Z\. Wang, L\. Wang, L\. Yang, S\. Shi, and Y\. Zhang \(2024\)MAGE: machine\-generated text detection in the wild\.InProc\. of ACL,L\. Ku, A\. Martins, and V\. Srikumar \(Eds\.\),pp\. 36–53\.Cited by:[§4\.2](https://arxiv.org/html/2605.21713#S4.SS2.p2.1)\.
- W\. Liang, Z\. Izzo, Y\. Zhang, H\. Lepp, H\. Cao, X\. Zhao, L\. Chen, H\. Ye, S\. Liu, Z\. Huang, D\. A\. McFarland, and J\. Y\. Zou \(2024\)Monitoring AI\-Modified Content at Scale: A Case Study on the Impact of ChatGPT on AI Conference Peer Reviews\.InProc\. of ICML,Proceedings of Machine Learning Research, Vol\.235,pp\. 29575–29620\.Cited by:[§1](https://arxiv.org/html/2605.21713#S1.p1.1),[§2\.2](https://arxiv.org/html/2605.21713#S2.SS2.SSS0.Px1.p1.1)\.
- A\. Liu, B\. Feng, B\. Xue, B\. Wang, B\. Wu, C\. Lu, C\. Zhao, C\. Deng, C\. Zhang, C\. Ruan,et al\.\(2024\)Deepseek\-v3 technical report\.arXiv preprint arXiv:2412\.19437\.Cited by:[§4\.1](https://arxiv.org/html/2605.21713#S4.SS1.SSS0.Px1.p1.1)\.
- Y\. Liu, M\. Ott, N\. Goyal, J\. Du, M\. Joshi, D\. Chen, O\. Levy, M\. Lewis, L\. Zettlemoyer, and V\. Stoyanov \(2019\)Roberta: A robustly optimized bert pretraining approach\.arXiv preprint arXiv:1907\.11692\.Cited by:[§2\.1](https://arxiv.org/html/2605.21713#S2.SS1.SSS0.Px3.p1.1)\.
- Mistral \(2025\)Introducing Mistral 3\.Note:[https://mistral\.ai/news/mistral\-3](https://mistral.ai/news/mistral-3)Accessed: 2025\-12\-20Cited by:[§5\.4](https://arxiv.org/html/2605.21713#S5.SS4.p1.1)\.
- E\. Mitchell, Y\. Lee, A\. Khazatsky, C\. D\. Manning, and C\. Finn \(2023\)DetectGPT: zero\-shot machine\-generated text detection using probability curvature\.InProc\. of ICML,A\. Krause, E\. Brunskill, K\. Cho, B\. Engelhardt, S\. Sabato, and J\. Scarlett \(Eds\.\),Proceedings of Machine Learning Research, Vol\.202,pp\. 24950–24962\.Cited by:[§2\.1](https://arxiv.org/html/2605.21713#S2.SS1.SSS0.Px2.p1.1)\.
- OpenAI \(2022\)Introducing Chat\-GPT\.Note:[https://openai\.com/blog/chatgpt](https://openai.com/blog/chatgpt)Accessed: 2022\-11\-30Cited by:[§2\.2](https://arxiv.org/html/2605.21713#S2.SS2.SSS0.Px1.p1.1)\.
- V\. S\. Rao, A\. Kumar, H\. Lakkaraju, and N\. B\. Shah \(2025\)Detecting LLM\-generated peer reviews\.PLoS One20\(9\),pp\. e0331871\.Cited by:[§2\.2](https://arxiv.org/html/2605.21713#S2.SS2.SSS0.Px2.p1.1),[§2](https://arxiv.org/html/2605.21713#S2.p1.1)\.
- G\. Russo, M\. Horta Ribeiro, T\. R\. Davidson, V\. Veselovsky, and R\. West \(2025\)The ai review lottery: widespread ai\-assisted peer reviews boost paper scores and acceptance rates\.Proc\. ACM Hum\.\-Comput\. Interact\.9\(7\)\.External Links:[Link](https://doi.org/10.1145/3757667),[Document](https://dx.doi.org/10.1145/3757667)Cited by:[§3\.1](https://arxiv.org/html/2605.21713#S3.SS1.SSS0.Px2.p2.1)\.
- A\. Singh, A\. Fry, A\. Perelman, A\. Tart, A\. Ganesh, A\. El\-Kishky, A\. McLaughlin, A\. Low, A\. Ostrow, A\. Ananthram,et al\.\(2025\)Openai gpt\-5 system card\.arXiv preprint arXiv:2601\.03267\.Cited by:[Appendix C](https://arxiv.org/html/2605.21713#A3.p2.1)\.
- I\. Solaiman, M\. Brundage, J\. Clark, A\. Askell, A\. Herbert\-Voss, J\. Wu, A\. Radford, G\. Krueger, J\. W\. Kim, S\. Kreps, M\. McCain, A\. Newhouse, J\. Blazakis, K\. McGuffie, and J\. Wang \(2019\)Release strategies and the social impacts of language models\.ArXiv preprintabs/1908\.09203\.Cited by:[§2\.1](https://arxiv.org/html/2605.21713#S2.SS1.SSS0.Px3.p1.1)\.
- S\. Sturua, I\. Mohr, M\. Kalim Akram, M\. Günther, B\. Wang, M\. Krimmel, F\. Wang, G\. Mastrapas, A\. Koukounas, N\. Wang, and H\. Xiao \(2025\)Jina Embeddings V3: Multilingual Text Encoder with Low\-Rank Adaptations\.InAdvances in Information Retrieval: 47th European Conference on Information Retrieval, ECIR 2025, Lucca, Italy, April 6–10, 2025, Proceedings, Part V,Berlin, Heidelberg,pp\. 123–129\.External Links:ISBN 978\-3\-031\-88719\-2,[Document](https://dx.doi.org/10.1007/978-3-031-88720-8%5F21)Cited by:[§B\.3\.2](https://arxiv.org/html/2605.21713#A2.SS3.SSS2.p1.1)\.
- K\. Thai, B\. Emi, E\. Masrour, and M\. Iyyer \(2026\)EditLens: Quantifying the Extent of AI Editing in Text\.InInternational Conference on Learning Representations \(ICLR\) 2026,External Links:[Link](https://openreview.net/forum?id=gOkitaPCfZ)Cited by:[§2\.2](https://arxiv.org/html/2605.21713#S2.SS2.SSS0.Px3.p1.1),[§4\.2](https://arxiv.org/html/2605.21713#S4.SS2.p2.1),[§5\.6](https://arxiv.org/html/2605.21713#S5.SS6.p1.1)\.
- L\. Wang, N\. Yang, X\. Huang, L\. Yang, R\. Majumder, and F\. Wei \(2024\)Multilingual e5 text embeddings: A technical report\.arXiv preprint arXiv:2402\.05672\.Cited by:[§B\.3\.2](https://arxiv.org/html/2605.21713#A2.SS3.SSS2.p1.1)\.
- J\. Wu, S\. Yang, R\. Zhan, Y\. Yuan, L\. S\. Chao, and D\. F\. Wong \(2025\)A survey on LLM\-generated text detection: necessity, methods, and future directions\.Computational Linguistics51\(1\),pp\. 275–338\.External Links:[Document](https://dx.doi.org/10.1162/coli%5Fa%5F00549)Cited by:[§2](https://arxiv.org/html/2605.21713#S2.p1.1)\.
- W\. Xu, G\. Zhu, X\. Zhao, L\. Pan, L\. Li, and W\. Wang \(2024\)Pride and Prejudice: LLM Amplifies Self\-Bias in Self\-Refinement\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),L\. Ku, A\. Martins, and V\. Srikumar \(Eds\.\),Bangkok, Thailand,pp\. 15474–15492\.External Links:[Document](https://dx.doi.org/10.18653/v1/2024.acl-long.826)Cited by:[§3\.2](https://arxiv.org/html/2605.21713#S3.SS2.SSS0.Px1.p1.4)\.
- A\. Yang, A\. Li, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Gao, C\. Huang, C\. Lv,et al\.\(2025\)Qwen3 technical report\.arXiv preprint arXiv:2505\.09388\.Cited by:[§B\.3\.2](https://arxiv.org/html/2605.21713#A2.SS3.SSS2.p1.1),[§4\.1](https://arxiv.org/html/2605.21713#S4.SS1.SSS0.Px1.p1.1)\.
- S\. Yu, M\. Luo, A\. Madasu, V\. Lal, and P\. Howard \(2026\)Is Your Paper Being Reviewed by an LLM? Investigating AI Text Detectability in Peer Review\.InInternational Conference on Learning Representations \(ICLR\) 2026,External Links:[Link](https://openreview.net/forum?id=HyZwf1rt4s)Cited by:[§B\.4](https://arxiv.org/html/2605.21713#A2.SS4.p2.1),[Appendix C](https://arxiv.org/html/2605.21713#A3.p2.1),[§1](https://arxiv.org/html/2605.21713#S1.p5.1),[§1](https://arxiv.org/html/2605.21713#S1.p6.1),[§2\.2](https://arxiv.org/html/2605.21713#S2.SS2.SSS0.Px2.p1.1),[§4\.2](https://arxiv.org/html/2605.21713#S4.SS2.p2.1)\.
- R\. Zellers, A\. Holtzman, H\. Rashkin, Y\. Bisk, A\. Farhadi, F\. Roesner, and Y\. Choi \(2019\)Defending against neural fake news\.InProc\. of NeurIPS,H\. M\. Wallach, H\. Larochelle, A\. Beygelzimer, F\. d’Alché\-Buc, E\. B\. Fox, and R\. Garnett \(Eds\.\),pp\. 9051–9062\.Cited by:[§2\.1](https://arxiv.org/html/2605.21713#S2.SS1.SSS0.Px3.p1.1)\.
- Y\. Zhang, M\. Li, D\. Long, X\. Zhang, H\. Lin, B\. Yang, P\. Xie, A\. Yang, D\. Liu, J\. Lin, F\. Huang, and J\. Zhou \(2025\)Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models\.External Links:2506\.05176Cited by:[§B\.3\.1](https://arxiv.org/html/2605.21713#A2.SS3.SSS1.p1.1),[§4\.1](https://arxiv.org/html/2605.21713#S4.SS1.SSS0.Px1.p1.1)\.
- X\. Zhao, P\. Ananth, L\. Li, and Y\. Wang \(2024\)Provable Robust Watermarking for AI\-Generated Text\.InInternational Conference on Learning Representations,B\. Kim, Y\. Yue, S\. Chaudhuri, K\. Fragkiadaki, M\. Khan, and Y\. Sun \(Eds\.\),Vol\.2024,pp\. 43738–43772\.External Links:[Link](https://proceedings.iclr.cc/paper_files/paper/2024/file/beae9ed5316bcc48e616754c06c11875-Paper-Conference.pdf)Cited by:[§2\.1](https://arxiv.org/html/2605.21713#S2.SS1.SSS0.Px1.p1.1)\.
- L\. Zhou, R\. Zhang, X\. Dai, D\. Hershcovich, and H\. Li \(2025\)Large Language Models Penetration in Scholarly Writing and Peer Review\.External Links:2502\.11193,[Link](https://arxiv.org/abs/2502.11193)Cited by:[§1](https://arxiv.org/html/2605.21713#S1.p1.1)\.

## Appendix A Dataset Creation Details

### A\.1 Data Statistics

Table [3](https://arxiv.org/html/2605.21713#A1.T3) summarizes the scale and composition of our dataset across venues and years\. The average number of extracted claims per review is stable between human and LLM\-refined reviews, indicating that refinement preserves the underlying semantic structure\. Fully AI\-generated reviews, in contrast, consistently contain more claims per review, reflecting their tendency to produce longer, more exhaustive feedback\. Figure [11](https://arxiv.org/html/2605.21713#A1.F11) complements these statistics by showing that claim type distributions are largely consistent across conferences and years, with evaluation and constructive input forming the majority of content\.

Table 3: Dataset statistics by review class\.

![Refer to caption](https://arxiv.org/html/2605.21713v1/x12.png) Figure 11: Distribution of claim types across venues and years\.

### A\.2 Generating Fully AI\-Reviews

Table [4](https://arxiv.org/html/2605.21713#A1.T4) presents the prompt template used to generate AI\-Reviews\. We set a maximum output length of 3,072 tokens and a temperature of 1\.0 to encourage diversity in the generated reviews while still producing coherent evaluations\.

Table 4: System Prompt used to Generate the AI Reviews\.

Generating AI Reviews \- System Prompt

Review the given paper for a top AI conference\. Please be concise, critical, focused, and constructive so that the authors find the review convincing and improve their manuscript accordingly\. Your final recommendation should be “\{score\}”\. Please write a review that includes: \(1\) Summary of the paper; \(2\) Strengths; \(3\) Weaknesses; \(4\) Questions for authors \(if any\) and \(5\) Final Judgement\.

### A\.3 Generating LLM\-Refined Reviews

Table [5](https://arxiv.org/html/2605.21713#A1.T5) presents the prompt template used for this task\. As with the fully AI\-generated reviews, we use a maximum output length of 3,072 tokens, but we lower the temperature to 0\.8\. The slightly lower temperature \(compared to 1\.0 for fully AI\-generated reviews\) encourages the model to stay closer to the source text while still allowing stylistic variation\.

Table 5: System Prompt used to Generate the LLM\-Refined Reviews\.

Generating LLM\-Refined Reviews \- System Prompt

You are a professional writing assistant\. Your task is to take user\-provided text and rewrite it to be more polished, professional, and effective\. Ensure the tone is appropriate for academic communication\. Do not modify the content of the review or suggest any improvements\.

### A\.4 Extracting Claims from Reviews

As described in Section [3\.1](https://arxiv.org/html/2605.21713#S3.SS1), we extract structured claim\-level representations from each review using Gemini\-2\.5 Flash\. Table [6](https://arxiv.org/html/2605.21713#A1.T6) presents the prompt template used to perform the extraction\.

Table 6: System Prompt used to Extract Claims from Reviews\.

Extracting Claims from Reviews \- System Prompt

You are given the full text of a peer review\. Your task is to extract and organize the reviewer’s comments into bullet points under the following categories:

1. Factual Restatement – Summaries or descriptions of what the paper does, its methods, contributions, or results\.
2. Evaluation – Judgments of quality, including both strengths \(positive evaluations\) and weaknesses/limitations \(negative evaluations\)\.
3. Constructive Input – Actionable suggestions or recommendations for improvement\.
4. Clarification Dialogue – Questions directed to the authors or requests for clarification\.
5. Meta\-Commentary – Remarks about the broader context, such as fit for the venue, clarity of writing, novelty, or overall recommendation\.

Tables [7](https://arxiv.org/html/2605.21713#A1.T7) and [8](https://arxiv.org/html/2605.21713#A1.T8) illustrate the claim extraction process with a real example: the full original review text is shown first, followed by its extracted claims, with color coding to highlight the correspondence between source passages and their derived claims\.

Table 7: Complete example of an original human\-written peer review from ICLR 2021\.

Original Review \- Fully Human

This paper analyses the random shooting control strategy in combination with various generative models, which model observations conditioned on the history of observation\-action pairs\. The authors select two variants of the Acrobot environment to make requirements like multimodal posteriors more explicit\. Both statistics on the distribution associated with the generative model under a fixed policy and reward\-dependent metrics were defined and analysed for a range of models\.

The paper’s contribution are twofold\. First, the suggested experimental protocol for model evaluation and benchmarking extends the usual evaluation process in reinforcement learning, which often focuses purely on the cumulative reward\. The framework described introduces a range of “static” likelihood\-based metrics\. These static metrics are evaluated under a fixed \(potentially stochastic\) policy and allow the separation of the model\-based control strategy and the underlying model for evaluation purposes\. Reward\-based “dynamic” metrics are evaluated under a random\-shooting control mechanism\. This framework heavily simplifies model evaluation by providing evaluation metrics and fixing the environment and control strategy\. Under these assumptions this approach allows direct comparison and even visualisation of various quantities of interest, including trajectories of the one\-step forward model\. Second, this work applies the conceived framework to evaluate a range of models on two different environments\. These environments are intended to make requirements like probabilistic posteriors explicit\. The authors conclude by claiming that 1\. “Probabilistic models are needed when the system benefits from multimodal predictive uncertainty”, and 2\. “Deterministic models are sufficient if trained with a loss allowing heteroscedasticity”\.

There’s an inherent trade\-off between simplicity of the study and generality its conclusion\. While some of the simplifying assumptions made make this kind of study possible in the first place, they also raise a range of questions:

- How appropriate are these metrics for problems with higher observation spaces? Can we expect the variance of estimates to increase and ratios and likelihoods to diminish by multiple orders of magnitude?
- Do the claims presented as “important findings” generalise to other environments?
- Does the correlation of the explained variance with the dynamic metrics hold on other environments?

A clear definition of micro\-data reinforcement learning is missing, and MBRL is introduced twice with conflicting definitions\. The future directions outlined by the authors of extending the results to larger systems and other planning strategies are very relevant to reduce the concerns of generalisability of the results and applications to problems of higher dimensions\. At the same time these modification will increase the experimental complexity\. This paper serves as a suitable baseline for reference for future work to answer these questions\. Therefore I consider this paper a valid contribution to ICLR\.

Table 8: The same review after the claim extraction process\. For readability, some factual\-restatement claims are omitted for space\.

Claim Extraction Example of Fully Human Review

Factual Restatement

- This paper analyses the random shooting control strategy in combination with various generative models which model observations conditioned on the history of observation\-action pairs\.
- The authors select two variants of the Acrobot environment to make requirements like multimodal posteriors more explicit\.
- Both statistics on the distribution associated with the generative model under a fixed policy and reward\-dependent metrics were defined and analysed for a range of models\.
- The paper’s contribution is twofold: first, the suggested experimental protocol for model evaluation and benchmarking extends the usual evaluation process in reinforcement learning, which often focuses purely on the cumulative reward\.
- The framework described introduces a range of “static” likelihood\-based metrics, which are evaluated under a fixed policy and allow the separation of the model\-based control strategy and the underlying model for evaluation\.
- Reward\-based dynamic metrics are evaluated under a random\-shooting control mechanism\.
- Second, this work applies the conceived framework to evaluate a range of models on two different environments, which are intended to make requirements like probabilistic posteriors explicit\.
- The authors conclude by claiming that 1\. Probabilistic models are needed when the system benefits from multimodal predictive uncertainty, and 2\. Deterministic models are sufficient if trained with a loss allowing heteroscedasticity\.

Evaluation

- There’s an inherent trade\-off between simplicity of the study and generality its conclusion\.
- While some of the simplifying assumptions made make this kind of study possible, they also raise a range of questions\.
- The future directions outlined by the authors of extending the results to larger systems and other planning strategies are very relevant to reduce the concerns of generalisability of the results and applications to problems of higher dimensions\.
- This paper serves as a suitable baseline for reference for future work to answer these questions\.

Constructive Input

- A clear definition of micro\-data reinforcement learning is missing\.
- MBRL is introduced twice with conflicting definitions\.

Clarification Dialogue

- How appropriate are these metrics for problems with higher observation spaces?
- Can we expect the variance of estimates to increase and ratios and likelihoods to diminish by multiple orders of magnitude in higher observation spaces?
- Do the claims presented as “important findings” generalise to other environments?
- Does the correlation of the explained variance with the dynamic metrics hold on other environments?

Meta Commentary

- I consider this paper a valid contribution to ICLR\.

### A\.5 Cost Analysis: Review Generation, Cleaning and Claim Extraction

Figure [12](https://arxiv.org/html/2605.21713#A1.F12) summarizes the computational costs for review generation and cleaning\. Whenever possible, we use batch API calls to reduce latency and cost\. In our case, Gemini models are queried through the Gemini Batch API \([https://ai\.google\.dev/gemini\-api/docs/batch\-api](https://ai.google.dev/gemini-api/docs/batch-api)\), while DeepSeek and Qwen\-3 use synchronous requests through AWS \([https://docs\.aws\.amazon\.com/bedrock/latest/userguide/models\-supported\.html](https://docs.aws.amazon.com/bedrock/latest/userguide/models-supported.html)\)\. The generation step \(Figure [12\(a\)](https://arxiv.org/html/2605.21713#A1.F12.sf1)\) represents the largest expense, with total costs approaching $170\. These costs, however, are distributed roughly evenly between fully AI\-generated and LLM\-refined reviews, despite the latter being far more numerous\. This happens because AI reviews require the parsed PDF as input, while LLM\-refined reviews only receive the shorter human review\. The cleaning stage \(Figure [12\(b\)](https://arxiv.org/html/2605.21713#A1.F12.sf2)\) results in lower costs, as it involves only rewriting reviews to remove formatting artifacts that would otherwise reveal their LLM origin\.

![Refer to caption](https://arxiv.org/html/2605.21713v1/x13.png) \(a\) Review Generation Cost\.
![Refer to caption](https://arxiv.org/html/2605.21713v1/x14.png) \(b\) Review Cleaning Cost\.

Figure 12: Computational costs for \(a\) review generation and \(b\) review cleaning, broken down by venue and year\.

![Refer to caption](https://arxiv.org/html/2605.21713v1/x15.png)

Figure 13: Computational cost for claim extraction across review classes, broken down by venue and year\.

Figure [13](https://arxiv.org/html/2605.21713#A1.F13) reports claim extraction costs across all three review classes\. As expected, LLM\-refined reviews dominate expenses due to their larger volume in our dataset \(four refinements per human review\)\. This cost structure, however, applies only to classifier training\. In a future deployment setting, claim extraction would only be needed for incoming reviews and AI\-generated references, which would reduce inference\-time overhead\.

## Appendix B Design Choice Ablations

### B\.1 Selecting the Right Classifier

To combine our nine features into final predictions, we compared three classifiers: XGBoost, LightGBM, and Random Forest\. For each one, we performed randomized hyperparameter search with five\-fold stratified cross\-validation, using macro\-F1 as the optimization target\. Figure [14](https://arxiv.org/html/2605.21713#A2.F14) shows the resulting test\-set performance across all configurations\. As the boxplots indicate, median scores are similar across the three models, all falling between 0\.83 and 0\.84\. However, the models differ in how sensitive they are to hyperparameter choices\. Random Forest, in particular, produces several outliers below 0\.79, while XGBoost and LightGBM remain more stable\.

Based on these results, we selected LightGBM for our final model\. Although its median performance is only slightly higher than that of XGBoost, its outliers stay closer to the central distribution, suggesting more consistent behavior regardless of the specific hyperparameter configuration\. Table [9](https://arxiv.org/html/2605.21713#A2.T9) lists the hyperparameters of the best\-performing model\.

![Refer to caption](https://arxiv.org/html/2605.21713v1/x16.png)

Figure 14: Comparison of classifier performance across hyperparameter configurations\. Each boxplot shows the distribution of macro\-F1 scores on the test set obtained during randomized search\.

Table 9: LightGBM hyperparameters for the best\-performing model\.

### B\.2 Number of Reference Reviews

Sem\-Detect, by default, pairs each target review with $k=3$ AI\-generated reference reviews of the same paper\. In this section, we ablate whether three references are necessary or whether fewer would suffice\. To this end, we train and evaluate Sem\-Detect with $k$ ranging from 1 to 3, keeping all other settings fixed\.

![Refer to caption](https://arxiv.org/html/2605.21713v1/x17.png)

Figure 15: Effect of the number of reference reviews \($k$\) on three\-class detection performance\.

From Figure [15](https://arxiv.org/html/2605.21713#A2.F15) we observe that performance improves monotonically with $k$, but even a single reference review achieves a Macro\-F1 of 0\.819\. We use $k=3$ in our main experiments as it offers the best performance, but users seeking lower inference cost or latency could train with smaller values of $k$, knowing this trade\-off\.

### B\.3 Embedding Model Choice

In this section, we describe two sets of experiments that guided our choice of embedding model\. First, we examine how model size affects performance within a single model family\. Second, we compare different embedding model families to assess whether our choice generalizes across architectures\.

![Refer to caption](https://arxiv.org/html/2605.21713v1/x18.png) \(a\) Effect of model size within the Qwen\-3 family\.
![Refer to caption](https://arxiv.org/html/2605.21713v1/x19.png) \(b\) Comparison across embedding model families\.

Figure 16: Analysis of embedding model choice on the three\-class detection performance\.

#### B\.3\.1 Scaling Embedding Model Size

We evaluate three variants of Qwen\-3 Embedding \(Zhang et al\., [2025](https://arxiv.org/html/2605.21713#bib.bib21)\) at different scales: 0\.6B, 4B, and 8B parameters\. Figure [16\(a\)](https://arxiv.org/html/2605.21713#A2.F16.sf1) presents the results\.

While performance improves as model size increases, the benefits plateau at the 4B scale, as the 8B model performs on par with the 4B variant\. Despite these findings, we report our main experiments using the 0\.6B model\.

This decision reflects practical considerations: the smaller model is substantially faster to run and requires less storage, making it more accessible for reproducibility\.

#### B\.3\.2 Testing Different Embedding Models

Beyond model size, we also investigate whether our method is sensitive to the choice of the embedding model family\. To this end, we compare three top\-performing models of similar size according to the MTEB Leaderboard \([https://huggingface\.co/spaces/mteb/leaderboard](https://huggingface.co/spaces/mteb/leaderboard)\): Qwen\-3 Embedding, JINA\-V3, and Multilingual\-E5, all at approximately 0\.6B parameters \(Yang et al\., [2025](https://arxiv.org/html/2605.21713#bib.bib22); Sturua et al\., [2025](https://arxiv.org/html/2605.21713#bib.bib30); Wang et al\., [2024](https://arxiv.org/html/2605.21713#bib.bib31)\)\. As shown in Figure [16\(b\)](https://arxiv.org/html/2605.21713#A2.F16.sf2), all three models perform similarly, with Macro\-F1 scores ranging from 0\.84 to 0\.85\. JINA\-V3 and Multilingual\-E5 nevertheless achieve marginally higher scores than Qwen\-3\.

Given that we had already conducted multiple experiments before running this comparison, and since the performance gap is minimal, we present our main results using Qwen\-3\. This comparison, however, demonstrates that Sem\-Detect generalizes well across embedding architectures, giving users the freedom to select models that best fit their preferences\. One important detail: since we retrain the classifier from scratch for each embedding model, users who wish to use Sem\-Detect with an alternative architecture should expect to repeat the training step\.

### B\.4 Claim Extraction vs Raw Review Text with Sentence\-Level Chunks

A key design choice in Sem\-Detect is the use of LLM\-based claim extraction from the reviews\. This approach, however, adds computational cost, as each review must be processed an additional time by another LLM\. A natural question is whether this step is truly necessary, or whether other segmentation strategies could achieve better results\.

In Section [5](https://arxiv.org/html/2605.21713#S5), we show that not segmenting at all, as Anchor \(Yu et al\., [2026](https://arxiv.org/html/2605.21713#bib.bib8)\) does by embedding entire reviews, performs quite well, but not as well as Sem\-Detect\. Here, we explore the opposite direction: segmenting at a finer granularity using sentence\-level chunking, which splits reviews at default sentence boundaries and produces chunks of more comparable length to our LLM\-extracted claims\. As shown in Figure [17](https://arxiv.org/html/2605.21713#A2.F17), the two approaches perform similarly on human reviews\. However, the gap becomes substantial for the other two classes\. In particular, for fully AI\-generated reviews, claim\-level segmentation proves far more effective than its sentence\-level counterpart\.

We believe that this gap arises because sentence boundaries do not always align with semantic boundaries\. Table [10](https://arxiv.org/html/2605.21713#A2.T10) illustrates one such failure case: over\-segmentation\. In this example, two consecutive sentences from a single argument get split by sentence\-level chunking, which breaks their semantic relation\. Our claim\-level approach, by contrast, recognizes they belong together and groups them as one unit\. When claims are fragmented in this way, similarity comparisons become noisier, which in turn reduces the method’s performance\.

![Refer to caption](https://arxiv.org/html/2605.21713v1/x20.png)

Figure 17: F1 score comparison between claim\-level and sentence\-level segmentation strategies\.

Table 10: Sem\-Detect understands both sentences belong to the same idea and groups them together\.

Example of Failure Case for Sentence\-Level Chunks: Over\-segmentation

Review Verbatim

Non\-training data is limited to 5 books released in 2025, with no diversity in genre or timeframes\. This makes it hard to validate the method’s robustness against false positives across non\-training scenarios\.

Sentence\-Level Chunking

- Non\-training data is limited to 5 books released in 2025, with no diversity in genre or timeframes\.
- This makes it hard to validate the method’s robustness against false positives across non\-training scenarios\.

Sem\-Detect Claim\-Level Segmentation

Non\-training data is limited to 5 books released in 2025, with no diversity in genre or timeframes, making it hard to validate RECAP’s robustness against false positives across non\-training scenarios\.
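The over\-segmentation in Table 10 can be reproduced with a naive sentence splitter\. The sketch below is only an illustration: the exact splitter used for the sentence\-level baseline is not specified here, so a simple regular expression on sentence\-final punctuation stands in for it, and `sentence_chunks` is a hypothetical helper name\.

```python
import re


def sentence_chunks(text: str) -> list[str]:
    """Split text after sentence-final punctuation followed by whitespace.

    Assumption: a naive regex stands in for whichever default sentence
    splitter the sentence-level baseline actually uses.
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


review = (
    "Non-training data is limited to 5 books released in 2025, with no "
    "diversity in genre or timeframes. This makes it hard to validate the "
    "method's robustness against false positives across non-training scenarios."
)

chunks = sentence_chunks(review)
for chunk in chunks:
    print("-", chunk)
```

The second chunk begins with "This", whose referent now lives in a different chunk, so its embedding no longer carries the full argument\. That is the fragmentation effect the claim\-level segmentation avoids\.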

### B\.5 Feature Selection and Interpretability

#### B\.5\.1 Feature Importance and Distribution

As introduced in Section [3\.2](https://arxiv.org/html/2605.21713#S3.SS2), our classifier is based on a total of nine discriminative features\. Figure [18](https://arxiv.org/html/2605.21713#A2.F18) reports the feature importance scores assigned by LightGBM\. While the main paper provides the formal definitions of these features, this section offers additional intuition for their inclusion\. We first discuss each feature, and then analyze the empirical distributions of the most discriminative ones using box plots\.

![Refer to caption](https://arxiv.org/html/2605.21713v1/x21.png)

Figure 18: Relative importance of the nine features as learned by the LightGBM classifier\.

1. Proportion of High\-Similarity: Captures what proportion of a review is highly aligned with the AI\-generated references\. For each target claim, we check whether its maximum semantic similarity to any claim in the AI\-generated references exceeds a threshold $\tau$, and report the fraction of claims that do so\. We tune $\tau$ via a linear sweep from 0\.7 to 0\.9 during training, and fix $\tau = 0.8$ for all reported results\.
2. Mean Similarities Above Threshold: For the subset of claims previously identified as having strong overlap with the AI\-generated references, this feature captures how strong that overlap is on average\. For all target claims whose maximum semantic similarity to any AI claim exceeds $\tau$, we compute the mean of these maximum similarity values\.
3. Mean Best\-Match Claim Similarity: Captures the overall semantic proximity of a review to AI\-generated content\. For each target claim, we compute its best\-match semantic similarity to any claim in the AI\-generated reference reviews, and then average these best\-match similarities across all target claims\.
4. Intra\-Review Semantic Diversity: Captures how semantically varied the claims within a review are\. We compute all pairwise cosine similarities between claim embeddings within the target review and define the feature as one minus their mean, so that higher values correspond to greater semantic diversity and lower redundancy\.
5. Log Review Length: Captures the effective length of a review while reducing the influence of very long outliers\. We compute the natural logarithm of one plus the number of claims extracted from the target review\.
6. Entropy: Captures uncertainty in the language model’s next\-token predictions along the review\. We average the entropy of the model’s next\-token distribution over all positions in the text\.
7. Perplexity: Captures how predictable the review text is under a given language model\. Although entropy and perplexity are closely related, we include both features as they capture complementary aspects of the model’s behavior, and we find that including both consistently improves classification performance in practice\.
8. Top\-$k$ Token Percentage: Captures how often the review follows highly probable token choices under a language model\. We compute the fraction of tokens in the target review whose next\-token probability lies within the model’s top\-$k$ predictions, using $k=200$\.
9. Fast\-Detect Score: Captures token\-level statistical signals associated with machine\-generated text\. As an additional textual feature, we include the score produced by Fast\-DetectGPT when applied to the target review\.
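The five claim\-based features \(items 1 to 5\) reduce to a few operations on claim embeddings\. The following is a minimal sketch under stated assumptions, not the authors' implementation: embeddings are plain Python lists, reference claims from all AI reviews are pooled into one list, the function and key names are hypothetical, and the fallback values for empty cases are our own choice\.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def semantic_features(target, references, tau=0.8):
    """Compute the five claim-based features for one target review.

    target:     claim embeddings of the review being classified
    references: claim embeddings pooled from all AI reference reviews
    tau:        high-similarity threshold (0.8 in the reported results)
    """
    # Best-match similarity of each target claim to any reference claim.
    best = [max(cosine(t, r) for r in references) for t in target]
    above = [s for s in best if s > tau]
    # Pairwise similarities between claims of the same review.
    pairs = [cosine(a, b) for a, b in combinations(target, 2)]
    return {
        "prop_high_sim": len(above) / len(best),
        # Assumption: fall back to 0.0 when no claim exceeds tau.
        "mean_sim_above_tau": sum(above) / len(above) if above else 0.0,
        "mean_best_match": sum(best) / len(best),
        # Assumption: a single-claim review gets diversity 0.0.
        "intra_diversity": 1.0 - sum(pairs) / len(pairs) if pairs else 0.0,
        "log_length": math.log1p(len(target)),
    }


# Toy 2-D embeddings, for illustration only.
target = [[1.0, 0.0], [0.0, 1.0]]
references = [[1.0, 0.0], [1.0, 1.0]]
features = semantic_features(target, references)
print(features)
```

In this toy case, one of the two target claims has an exact match among the references, so half of the claims exceed $\tau$, while the two target claims are orthogonal and therefore maximally diverse\.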

While Figure [18](https://arxiv.org/html/2605.21713#A2.F18) reveals which features the classifier relies on most, it does not explain why these features are discriminative\. To address this, Figure [19](https://arxiv.org/html/2605.21713#A2.F19) presents the distributions of the four most important features across the three classes\.

![Refer to caption](https://arxiv.org/html/2605.21713v1/x22.png) \(a\) Mean\-Max Similarities
![Refer to caption](https://arxiv.org/html/2605.21713v1/x23.png) \(b\) Entropy
![Refer to caption](https://arxiv.org/html/2605.21713v1/x24.png) \(c\) Mean Pairwise Cosine Distance within Target Review
![Refer to caption](https://arxiv.org/html/2605.21713v1/x25.png) \(d\) Log Review Length

Figure 19: Distribution of semantic and surface\-level features across review types\.

A clear pattern emerges from these distributions\. Semantic features primarily separate fully AI\-generated reviews from the other two classes, with LLM\-refined reviews remaining close to human\-written ones\. This supports our core hypothesis: polishing a review with an LLM preserves the original human ideas\.

Textual features, by contrast, reveal a different pattern\. Entropy shows that LLM\-refined reviews occupy an intermediate position between the other two classes: closer to AI\-generated text than to human\-written text, yet still somewhat distinguishable from both\. This explains why, in our ablation study in Appendix [B\.6](https://arxiv.org/html/2605.21713#A2.SS6), textual features alone outperform semantic features for three\-class classification\.

#### B\.5\.2 Claim\-Level Interpretation of Semantic Overlap

By examining Figure [19\(a\)](https://arxiv.org/html/2605.21713#A2.F19.sf1), we confirm that AI\-generated reviews exhibit higher semantic similarity to AI references than human\-written ones\. But what does this overlap look like in practice? To answer this, we present an illustrative example for Mean Best\-Match Claim Similarity, the most discriminative feature identified by the classifier\.

Table 11: Example of a target AI\-generated review with high semantic overlap with AI reference reviews\. Highlighted claims contribute most strongly to this overlap, while the remaining claims show lower semantic similarity\.

Target Review \(DeepSeek\-V3\.1\) Exhibiting High Semantic Overlap with Other AI Reviews

- \(1\) The derived policy gradient theorems for off\-policy learning are rigorous and provide exact gradients\.
- \(2\) The paper demonstrates promising results in training policies without environment interaction\.
- The paper acknowledges scalability concerns but does not thoroughly address them\.
- Results vary significantly across environments, suggesting task\-dependent effectiveness\.
- \(3\) The dimensionality of PVFs grows with policy parameters, which could hinder the performance of large policies\.
- Baselines like SAC or PPO are not included, and the paper focuses mainly on ARS and DDPG\.
- \(4\) How might PVFs scale to policies with millions of parameters \(e\.g\., in modern deep RL\)?

Closest\-Matching Claims from the AI\-Generated Reference Reviews

- \(1\) The derivation of policy gradient theorems for both stochastic and deterministic cases is rigorous and well\-presented, with clear algorithmic instantiations\. ⇒ Qwen\-3\-235BA22B
- \(2\) The zero\-shot learning and offline learning experiments are compelling, showing the ability to train entirely new policies from scratch using a frozen PVF\. ⇒ Gemini\-2\.5 Flash
- \(3\) A major concern with PVFs is the curse of dimensionality with respect to the policy parameters theta\. ⇒ Gemini\-2\.5 Pro
- \(4\) Addressing the scalability of PVFs for very large policy networks, either through learned embeddings or other dimensionality reduction techniques, would substantially elevate the work\. ⇒ Gemini\-2\.5 Flash

Table [11](https://arxiv.org/html/2605.21713#A2.T11) shows an AI\-generated review of an ICLR 2021 paper, shortened for clarity\. While no single model matches all claims, together they provide broad coverage of the target review’s points\. This overlap produces a Mean Best\-Match Claim Similarity of 0\.8269, which is far above the 0\.637 average observed for human reviews of the same paper\.

### B\.6 Feature Type Selection: Textual vs\. Semantic vs\. Combined

A core premise of our work is that robust three\-class classification requires moving beyond purely textual or purely semantic features\. Here, we provide empirical evidence supporting this design choice\.

We trained three variants of our classifier: one using only the four textual features, another using only the five semantic features, and a third combining both sets, which constitutes Sem\-Detect\.

Figure [20](https://arxiv.org/html/2605.21713#A2.F20) presents the results\. The combined approach achieves a Macro\-F1 score of approximately 0\.84, outperforming both the textual\-only variant \(0\.76\) and the semantic\-only variant \(0\.59\)\. This performance gap highlights why neither feature type alone is sufficient for reliable three\-class detection\.

![Refer to caption](https://arxiv.org/html/2605.21713v1/x26.png)

Figure 20: Impact of feature type on classification performance\.

### B\.7 Exhaustive Feature Subset Evaluation

The previous sections established that combining semantic and textual features is necessary for reliable three\-class detection\. A natural follow\-up question is whether all nine features are needed, or whether a smaller subset could achieve comparable or even better performance\. To answer this, we evaluate every possible feature combination: with nine features, there are $2^{9}-1=511$ non\-empty subsets, and we test each one\.
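The size of this search space can be checked by direct enumeration\. The sketch below lists every non\-empty subset of nine features and groups them by composition\. The short feature identifiers are hypothetical, and the split into five semantic and four textual features follows our reading of the feature list in Appendix B\.5\.1\.

```python
from collections import Counter
from itertools import combinations

# Hypothetical identifiers; the grouping is an assumption based on the
# "five semantic, four textual" split described in the text.
SEMANTIC = ["prop_high_sim", "mean_sim_above_tau", "mean_best_match",
            "intra_diversity", "log_length"]
TEXTUAL = ["entropy", "perplexity", "top_k_pct", "fast_detect"]


def all_subsets(features):
    """Yield every non-empty subset of the given feature list."""
    for size in range(1, len(features) + 1):
        yield from combinations(features, size)


def composition(subset):
    """Label a subset as semantic-only, textual-only, or mixed."""
    has_semantic = any(f in SEMANTIC for f in subset)
    has_textual = any(f in TEXTUAL for f in subset)
    if has_semantic and has_textual:
        return "mixed"
    return "semantic-only" if has_semantic else "textual-only"


subsets = list(all_subsets(SEMANTIC + TEXTUAL))
counts = Counter(composition(s) for s in subsets)
print(len(subsets), dict(counts))
```

Under this grouping, only $2^{5}-1=31$ subsets are purely semantic and $2^{4}-1=15$ are purely textual, so the remaining 465 of the 511 subsets mix both families\.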

For each subset, we perform randomized hyperparameter search with five\-fold stratified cross\-validation, using Macro\-F1 as the optimization target \(the same protocol described in Appendix [B\.1](https://arxiv.org/html/2605.21713#A2.SS1)\) for selecting the final model\.

Table 12: Selected results from the exhaustive evaluation of all 511 feature subsets\. S = semantic features, T = textual features\. Each subset is individually optimized via randomized hyperparameter search\.

Table [12](https://arxiv.org/html/2605.21713#A2.T12) summarizes the search space\. Two patterns stand out\. First, every top\-300 subset mixes semantic and textual features: the best pure\-textual combination ranks only 334th, and the best pure\-semantic ranks 453rd, confirming that neither family alone is sufficient regardless of how features are combined\. Second, performance among the top mixed subsets is tightly clustered: the gap between rank 1 \(0\.8416\) and our full 9\-feature model at rank 18 \(0\.8354\) is just 0\.0062, meaning the choice of which mixed subset to use matters far less than ensuring both types are present\.

We further visualize the full search space in Figure [21](https://arxiv.org/html/2605.21713#A2.F21), where we plot the Macro\-F1 of every subset, grouped by whether it contains only semantic features, only textual features, or a mix of both\. The separation is clear: mixed subsets occupy the upper region of the distribution, while pure\-type subsets are concentrated in the lower ranks, with virtually no overlap between the two\.

![Refer to caption](https://arxiv.org/html/2605.21713v1/x27.png)

Figure 21: Distribution of Macro\-F1 scores across all 511 feature subsets, grouped by feature composition: textual\-only, semantic\-only, and both \(semantic \+ textual\)\.

Together, these analyses provide comprehensive evidence that \(i\) the combination of semantic and textual features is structurally necessary and not an artifact of our particular selection, and \(ii\) the full feature set performs near\-optimally within the space of possible subsets\.

## Appendix C Baseline Algorithm Details

We use the official implementations of all baseline detectors whenever they are available\. In all cases, we follow the configurations and recommendations provided by the original authors\.

For Anchor, we adopt the anchor\-prompting strategy proposed by Yu et al\. \([2026](https://arxiv.org/html/2605.21713#bib.bib8)\)\. This approach requires a paper\-specific prompt conditioned on the paper’s content for each submission, which we generate using GPT\-5 \(Singh et al\., [2025](https://arxiv.org/html/2605.21713#bib.bib32)\)\. We then tune the cosine\-similarity threshold \($\theta$\) on the training set at fixed TPR@% FPR values, and finally evaluate the method on the test set\.
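Tuning a score threshold at a fixed false positive rate can be sketched as follows\. This is an illustrative reading rather than the baseline's official procedure: we assume that a higher cosine similarity means "more AI\-like", that negatives are non\-AI reviews, and that a review is flagged when its score is strictly above $\theta$\. The function names and toy scores are hypothetical\.

```python
import math


def threshold_at_fpr(negative_scores, target_fpr):
    """Pick a threshold whose FPR on the given negatives is <= target_fpr.

    A review is flagged as AI when its score is strictly greater than the
    returned threshold (assumed decision rule).
    """
    ranked = sorted(negative_scores, reverse=True)
    # Number of negatives we may tolerate above the threshold.
    allowed = math.floor(target_fpr * len(ranked) + 1e-9)
    if allowed >= len(ranked):
        return float("-inf")
    return ranked[allowed]


def tpr_at_threshold(positive_scores, threshold):
    """Fraction of positives (AI reviews) scoring above the threshold."""
    return sum(s > threshold for s in positive_scores) / len(positive_scores)


# Toy similarity scores, for illustration only.
train_negatives = [0.10, 0.20, 0.30, 0.40, 0.50, 0.55, 0.60, 0.65, 0.70, 0.90]
test_positives = [0.95, 0.85, 0.75, 0.60]

theta = threshold_at_fpr(train_negatives, target_fpr=0.10)
tpr = tpr_at_threshold(test_positives, theta)
print(theta, tpr)
```

Because the threshold is fixed on training negatives and then applied unchanged to held\-out data, the realized false positive rate on the test set may differ from the target\.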

For EditLens, we use the authors’ RoBERTa\-Large model \([https://huggingface\.co/pangram/editlens\_roberta\-large](https://huggingface.co/pangram/editlens_roberta-large)\) to obtain the results in Figure [10](https://arxiv.org/html/2605.21713#S5.F10)\. For the ICLR 2026 analysis in Section [5\.6](https://arxiv.org/html/2605.21713#S5.SS6), we use the official predictions released by Pangram Labs and intersect them with our dataset to obtain EditLens scores for the overlapping reviews\.

## Appendix D Additional Details on Robustness to Generation Conditions

### D\.1 Construction of Out\-of\-Distribution Evaluation Sets

As described in Section [3](https://arxiv.org/html/2605.21713#S3), although peer reviews follow a broadly shared structure, different AI conferences adopt distinct reviewing guidelines and templates\. These differences may affect how AI\-generated reviews are written, and therefore how well our method generalizes\. To study this effect, we consider two OOD evaluation settings: OOD\-M and OOD\-M\+P\.

The OOD\-M setting, where reviews are generated by unseen model families using the same prompt template as in training, is fully described in the main paper \(Section [5\.4](https://arxiv.org/html/2605.21713#S5.SS4)\)\. In this appendix, we therefore focus on the construction of the OOD\-M\+P setting, which introduces additional prompt variations not used during training\.

OOD\-M\+P evaluation\. We use the same three out\-of\-distribution models as in OOD\-M \(Claude\-Sonnet\-4, Mistral\-Large\-3, and GPT\-oss\-120B\), but combine them with prompt templates that differ from those used during training\.

Reviewer Personality Variations\. We define five distinct reviewer personalities that capture different reviewing styles commonly observed in academic peer review\. Table [13](https://arxiv.org/html/2605.21713#A4.T13) presents the full personality prompts used in our experiments\.

Table 13: Reviewer personality prompts used for generating AI reviews in the OOD\-M\+P setting\.

##### Main body and prompt combination\.

In addition to reviewer personality, we also vary the structural format of the review\. Specifically, we use three different main body templates: one matching the official ICLR 2021 reviewing guidelines, one matching the NeurIPS 2021 guidelines, and a general template suitable for ML conferences but distinct from our default prompt\. For each review, we randomly select one reviewer personality and one main body template\.
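The prompt combination step can be sketched as an independent draw of one personality and one main body template per review\. The identifiers below are placeholders rather than the actual prompt names, and uniform sampling with a seeded generator is our assumption about what "randomly select" means here\.

```python
import random
from itertools import product

# Placeholder identifiers: the real personality prompts are given in Table 13.
PERSONALITIES = [f"personality_{i}" for i in range(1, 6)]
# Hypothetical labels for the three main body templates described above.
BODY_TEMPLATES = ["iclr_2021", "neurips_2021", "general_ml"]

# Every personality/template pairing that can be drawn.
ALL_COMBINATIONS = list(product(PERSONALITIES, BODY_TEMPLATES))


def sample_prompt_config(rng):
    """Draw one personality and one body template uniformly at random."""
    return {
        "personality": rng.choice(PERSONALITIES),
        "body_template": rng.choice(BODY_TEMPLATES),
    }


rng = random.Random(0)  # Seeded so that the draw is reproducible.
config = sample_prompt_config(rng)
print(len(ALL_COMBINATIONS), config)
```

With five personalities and three templates, each review is generated under one of 15 possible prompt configurations, none of which matches the default training prompt\.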

### D\.2 Extended Analysis of Performance Under Distribution Shift

Table [14](https://arxiv.org/html/2605.21713#A4.T14) presents the class\-wise performance of Sem\-Detect across three settings: In\-Dist \(same models and prompts as used in training\), OOD\-M \(unseen models, same prompts\), and OOD\-M\+P \(unseen models and prompts\)\.

Table 14: Class\-wise performance under distribution shift\.

The results reveal a consistent pattern across settings\. AI\-generated reviews maintain high precision \(0\.93\-0\.97\) in all conditions, confirming that Sem\-Detect’s positive predictions for this class are reliable\. The drop in AI recall under OOD \(from 0\.91 to 0\.67 and 0\.65\) reflects conservative behavior: uncertain samples are routed away from the AI class rather than risking false accusations\. LLM\-refined performance remains stable under distribution shift, with both precision and recall staying above 0\.76 across all settings\. Human precision sees a moderate decrease; recall, however, remains stable \(0\.63\-0\.64\), indicating that the model continues to identify a majority of true human reviews\.
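The interplay between high precision and reduced recall for the AI class can be made concrete with a small class\-wise computation\. The labels and predictions below are toy data chosen for illustration, not results from our experiments, and the function name is hypothetical\.

```python
from collections import Counter


def classwise_precision_recall(y_true, y_pred, labels):
    """Return {label: (precision, recall)} for a multi-class prediction."""
    true_pos = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    predicted = Counter(y_pred)
    actual = Counter(y_true)
    return {
        lab: (
            true_pos[lab] / predicted[lab] if predicted[lab] else 0.0,
            true_pos[lab] / actual[lab] if actual[lab] else 0.0,
        )
        for lab in labels
    }


# Toy example: a conservative classifier that routes uncertain AI reviews
# to the other classes instead of flagging them.
y_true = ["ai"] * 4 + ["refined"] * 3 + ["human"] * 3
y_pred = ["ai", "ai", "refined", "human",
          "refined", "refined", "refined",
          "human", "human", "refined"]

metrics = classwise_precision_recall(y_true, y_pred, ["ai", "refined", "human"])
print(metrics)
```

In this toy case every review flagged as AI is truly AI \(precision 1\.0\), but only half of the AI reviews are flagged \(recall 0\.5\), mirroring the conservative behavior described above\.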

### D\.3 Comparison with Binary Baseline Detectors Under Distribution Shift

We extend the results of Section [5\.4](https://arxiv.org/html/2605.21713#S5.SS4) by studying the impact of OOD data on baselines other than Sem\-Detect\. To enable comparison, we collapse the three classes into a binary setting and evaluate RADAR and Anchor under the same conditions\.

![Refer to caption](https://arxiv.org/html/2605.21713v1/x28.png)

Figure 22: Binary generalization under distribution shift for Sem\-Detect and baselines\.

Figure [22](https://arxiv.org/html/2605.21713#A4.F22) reveals that both baselines experience performance drops, but the most unexpected finding concerns RADAR\. Unlike Sem\-Detect, which explicitly uses the training models as reference points, RADAR was not optimized for any specific set of generators\. From its perspective, reviews from training models should be no easier to classify than reviews from unseen ones\. Yet RADAR shows the largest decline, with TPR at 1% FPR dropping substantially under OOD conditions\. Anchor, by contrast, proves more stable, likely due to its reliance on semantic comparison rather than surface\-level patterns\. Sem\-Detect, despite the performance drop observed in the three\-class setting, maintains its TPR when evaluated from this binary perspective\.
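The binary evaluation can be sketched as follows. Two points are assumptions made for illustration, since the text does not spell them out: the three classes are collapsed into "fully AI" versus everything else (human and LLM\-refined merged), and each detector emits a single score where higher means more likely AI. The function names are hypothetical.

```python
import math


def collapse_to_binary(label: str) -> bool:
    """Assumed mapping: only fully AI-generated reviews count as positives."""
    return label == "ai"


def tpr_at_fpr(scores, is_ai, target_fpr=0.01):
    """True positive rate at the strictest threshold whose FPR stays <= target_fpr."""
    neg = sorted((s for s, y in zip(scores, is_ai) if not y), reverse=True)
    pos = [s for s, y in zip(scores, is_ai) if y]
    if not pos:
        raise ValueError("need at least one positive example")
    # Number of negatives we may flag; the epsilon guards against float rounding.
    allowed_fp = math.floor(target_fpr * len(neg) + 1e-9)
    # A review is flagged when its score is strictly above the threshold,
    # so at most `allowed_fp` negatives are flagged.
    threshold = neg[allowed_fp] if allowed_fp < len(neg) else float("-inf")
    return sum(s > threshold for s in pos) / len(pos)


# Toy data: 100 non-AI reviews with scores 0.00..0.99 and three AI reviews.
labels = ["human"] * 60 + ["refined"] * 40 + ["ai"] * 3
scores = [i / 100 for i in range(100)] + [0.995, 0.985, 0.5]
is_ai = [collapse_to_binary(label) for label in labels]
tpr = tpr_at_fpr(scores, is_ai, target_fpr=0.01)
```

Fixing the false positive rate first and then reading off the true positive rate matches the deployment concern stated elsewhere in the paper, where a false accusation is treated as the more costly error.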

### D\.4 Robustness to Reference Model Choice

The experiments in Section [5\.4](https://arxiv.org/html/2605.21713#S5.SS4) evaluate Sem\-Detect when the *target* reviews \(i\.e\., those being classified\) are produced by unseen models, while the reference reviews used for semantic comparison come from the same models as in training\. In practice, however, a user deploying Sem\-Detect may not have access to the specific LLMs used during training to generate reference reviews\. We therefore study the complementary scenario: the target reviews come from models seen during training \(Gemini\-2\.5\-Pro, Qwen\-3, and DeepSeek\-V3\.1\), but the $k=3$ reference reviews are produced by GPT\-oss\-120B, Mistral\-Large\-3, and Claude\-Sonnet\-4, none of which played any role in classifier training\.

Table [15](https://arxiv.org/html/2605.21713#A4.T15) shows that changing reference models \(OOD\-Ref\) leads to a more modest degradation than changing target models \(OOD\-M\): Macro\-F1 drops from 0\.84 to 0\.79, compared to 0\.71 under OOD\-M\. Most notably, AI recall remains at 0\.92 under OOD\-Ref versus 0\.67 under OOD\-M, while AI precision stays at 0\.96\. This suggests that the choice of reference models is less critical than the choice of target models for overall detection performance, and that users deploying Sem\-Detect can substitute the reference models with whichever LLMs they have available, with only a moderate effect on overall performance\.

Table 15: Comparison of distribution shift settings\. OOD\-M uses unseen target models with training reference models; OOD\-Ref uses training target models with unseen reference models\.

### D\.5 Expanding the Training Generator Pool

The out\-of\-distribution results in Section [5\.4](https://arxiv.org/html/2605.21713#S5.SS4) raise a natural question: would exposing Sem\-Detect to a wider variety of generator families during training improve its performance under distribution shift? We investigate this through two experiments, both evaluated on the unchanged OOD\-M\+P test set\. In each, we extend the original four\-model generator set \(Gemini\-2\.5\-Flash, Gemini\-2\.5\-Pro, DeepSeek\-V3\.1, and Qwen3\-235B\) with two new families: GLM\-5 \(GLM\-5\-Team, [2026](https://arxiv.org/html/2605.21713#bib.bib54)\) and Kimi\-K2\.5 \(Kimi\-Team, [2026](https://arxiv.org/html/2605.21713#bib.bib55)\)\.

Table 16: Effect of expanding the training generator pool on the OOD\-M\+P test set\. Variant 1 uses all six families per paper\. Variant 2 samples four of the six families per paper, keeping the original class distribution while exposing the classifier to all six families overall\.

Variant 1: Full six\-model pool\. We first use all six model families for every paper\. This increases the number of AI and LLM\-refined reviews and creates more training instances by allowing each target review to be paired with multiple non\-overlapping sets of three reference reviews\. As Table [16](https://arxiv.org/html/2605.21713#A4.T16) shows, Macro\-F1 increases from 0\.68 to 0\.70\. However, this gain reflects a precision/recall trade\-off: AI recall rises from 0\.65 to 0\.73, while AI precision drops from 0\.96 to 0\.88\. Thus, the model detects more AI reviews, but also produces more false positives\.

Variant 2: Per\-paper subset\. Variant 1 also changes the class balance, since adding two model families increases the number of AI and LLM\-refined reviews relative to human reviews\. To separate this effect from generator diversity, we run a second experiment where, for each paper, we randomly select four of the six model families\. This keeps the original class distribution, sample counts, and $k=3$ pairing scheme unchanged, while still exposing the classifier to all six generators across the dataset\.
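The per\-paper subset selection of Variant 2 can be sketched as below. The family names are taken from the text; the function name, the seed, and the choice of uniform sampling without replacement are assumptions for illustration.

```python
import random

FAMILIES = [
    "Gemini-2.5-Flash", "Gemini-2.5-Pro", "DeepSeek-V3.1",
    "Qwen3-235B", "GLM-5", "Kimi-K2.5",
]


def sample_families_per_paper(paper_ids, n_per_paper=4, seed=0):
    """Assign each paper a random subset of generator families (assumed uniform)."""
    rng = random.Random(seed)
    return {pid: rng.sample(FAMILIES, n_per_paper) for pid in paper_ids}


assignment = sample_families_per_paper(range(50))
# Each paper still sees four generators, as in the original configuration,
# while the union over papers covers all six families.
covered = set().union(*assignment.values())
```

Because every paper keeps exactly four generators, the number of AI and LLM\-refined reviews per paper is the same as in the original setup, which is what isolates generator diversity from class balance.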

Variant 2 gives a Macro\-F1 of 0\.69, with AI precision of 0\.89 and AI recall of 0\.72\. This closely matches Variant 1\. Since the class distribution is now unchanged, the precision/recall trade\-off cannot be explained by class imbalance; it is instead caused by broader generator exposure\.

Discussion\. Both variants produce only a small Macro\-F1 gain but a large drop in AI precision\. This is undesirable for our deployment setting, where falsely accusing a human reviewer is more costly than missing an AI\-generated review\. We therefore keep the original four\-model configuration in the main paper\. Still, the higher AI recall suggests that larger generator pools could be useful when combined with precision\-preserving mechanisms, such as the confidence\-based filtering in Section [5\.3](https://arxiv.org/html/2605.21713#S5.SS3)\.
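To make the trade\-off concrete, the AI\-class F1 implied by the precision and recall values quoted above can be computed directly. This is a back\-of\-the\-envelope check on rounded numbers from the text, not a figure reported in the paper.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# (AI precision, AI recall) on the OOD-M+P test set, as quoted in this section.
reported = {
    "Original (4 families)": (0.96, 0.65),
    "Variant 1 (all 6)": (0.88, 0.73),
    "Variant 2 (4 of 6)": (0.89, 0.72),
}

ai_f1 = {name: f1(p, r) for name, (p, r) in reported.items()}
for name, value in ai_f1.items():
    print(f"{name}: AI F1 = {value:.2f}")
```

With these inputs the AI\-class F1 moves from roughly 0\.78 to roughly 0\.80 in both variants, a change of about 0\.02, while AI precision falls by 0\.07 to 0\.08. An F1\-style summary therefore hides the shift toward more false positives.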

### D\.6 Sensitivity to Partial AI Content

In practice, a reviewer might selectively incorporate specific observations from an AI\-generated review into their own assessment, instead of generating the final review end\-to\-end\. We tested how Sem\-Detect, despite not being trained for this setting, would behave under this scenario\.

Starting from 90 papers in our test set, we construct synthetic hybrid reviews by systematically replacing human claims with AI\-generated ones at controlled ratios\. For each paper, we select the human review with the most substantive claims and a matching AI review \(same evaluation score, different model family\)\. We then use Claude\-4\.6\-Opus \(Anthropic, [2026](https://arxiv.org/html/2605.21713#bib.bib26)\) to replace 25%, 50%, or 75% of the human’s evaluation, constructive input, and clarification claims with claims from the AI review, while preserving all factual restatements and meta\-commentary from the original human review\. The resulting mixed reviews are assembled to read as coherent single\-author texts\.

In total, each paper yields five contamination levels: 0% \(the original human review\), 25%, 50%, 75%, and 100% \(the source AI review\), with exactly 90 reviews per level\. To avoid self\-reference bias, the three AI reference reviews used for computing semantic features are drawn from model families that exclude the source AI review’s author\.
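The bookkeeping behind this construction can be sketched as follows. In the paper the replacement itself is carried out by an LLM; the snippet only illustrates how many claims would be swapped at a given ratio and how reference families could be chosen so that the source family is excluded. The claim\-type strings, the function names, the rounding rule, and the random choice of claims are all assumptions.

```python
import random

# Claim types eligible for replacement (names paraphrase the text).
REPLACEABLE = {"evaluation", "constructive_input", "clarification"}


def pick_claims_to_replace(claims, ratio, rng):
    """Return indices of replaceable claims to swap; restatements and meta-commentary stay."""
    eligible = [i for i, claim in enumerate(claims) if claim["type"] in REPLACEABLE]
    n_replace = round(ratio * len(eligible))  # assumed rounding rule
    return sorted(rng.sample(eligible, n_replace))


def pick_reference_families(families, source_family, k=3, rng=None):
    """Choose k reference families, never the one that wrote the source AI review."""
    rng = rng or random.Random()
    candidates = [f for f in families if f != source_family]
    return rng.sample(candidates, k)


rng = random.Random(0)
claims = [
    {"type": "evaluation"}, {"type": "factual_restatement"},
    {"type": "clarification"}, {"type": "constructive_input"},
    {"type": "evaluation"}, {"type": "meta_commentary"},
]
to_replace = pick_claims_to_replace(claims, ratio=0.5, rng=rng)
families = ["Gemini-2.5-Flash", "Gemini-2.5-Pro", "DeepSeek-V3.1", "Qwen3-235B"]
references = pick_reference_families(families, "DeepSeek-V3.1", k=3, rng=rng)
```

Excluding the source family from the references keeps the semantic features from trivially matching a review against near\-copies produced by the same model.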

![Refer to caption](https://arxiv.org/html/2605.21713v1/x29.png)

Figure 23: Fraction of reviews classified as AI\-generated when human claims are increasingly replaced with AI\-generated ones\.

As shown in Figure [23](https://arxiv.org/html/2605.21713#A4.F23), the fraction of reviews classified as AI\-generated increases monotonically with contamination, confirming that Sem\-Detect’s semantic features are sensitive to the proportion of AI\-originated claims\. However, the full model remains conservative at low\-to\-moderate ratios: because the writing style is still predominantly human, the textual features \(computed over the entire review text\) anchor predictions toward the non\-AI classes\. The tipping point occurs at 75%, where the claim\-level signal becomes strong enough to shift predictions substantially\.

In the end, a reviewer who contributes genuine evaluative points alongside AI\-suggested observations has, by definition, exercised human judgment over part of the review\. Sem\-Detect’s decision boundary is not ambiguous in these cases; it correctly recognizes that human intellectual contribution is present and classifies accordingly\. We note, however, that quantifying the precise degree of AI contamination within a single review is a distinct and complementary problem that falls outside the scope of this work\.

## Appendix E Factual Verification of Claims

Throughout this work, we have focused on modeling review authorship partially through the semantic content of expressed ideas\. In particular, we examined whether these ideas are original, repetitive, or aligned with AI\-generated references\. However, this perspective captures only part of what makes a high\-quality review\. In fact, conference organizers have increasingly emphasized another essential aspect: factual accuracy\. This growing concern is reflected in recent policy statements\. For example, the ICLR 2026 Guidelines \([https://blog\.iclr\.cc/2025/11/19/iclr\-2026\-response\-to\-llm\-generated\-papers\-and\-reviews](https://blog.iclr.cc/2025/11/19/iclr-2026-response-to-llm-generated-papers-and-reviews)\) explicitly state that “Reviews that feature false claims are a code of ethics violation\.” This motivates the hypothesis that AI\-generated reviews contain more factual errors than human\-written ones, which would provide an additional detection signal\.

To test this, we randomly sample 30 papers from ICLR 2021 and classify every extracted claim for factual accuracy using an LLM\-as\-a\-judge pipeline \(Gemini\-3\.0\-Flash\), incorporating two key considerations:

1. Not all claims are verifiable\. Generic statements like “the document is well\-written” cannot be checked against the paper content\. We therefore fine\-tune a BERT classifier to filter out such claims \($\approx 25\%$ of claims are discarded\)\.
2. We target hallucinations rather than subjective assessment errors\. Human reviewers may legitimately misinterpret aspects of a paper\. Hallucinations, by contrast, are clear false statements that directly contradict the paper, for example, claiming “The method has not been evaluated on open\-source LLMs” when there are experiments clearly reporting them\. Our pipeline classifies each specific claim as either hallucinated or unverifiable\.
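The two considerations above amount to a filter\-then\-judge loop, sketched here under stated assumptions. `is_verifiable` stands in for the fine\-tuned BERT filter and `judge` for the LLM\-as\-a\-judge call; both are hypothetical callables supplied by the caller, and the review dictionary layout is invented for illustration. The label strings follow the prompt in Table 17.

```python
from statistics import mean


def hallucinations_per_review(reviews, is_verifiable, judge):
    """Average number of claims labeled CONTRADICTED per review."""
    counts = []
    for review in reviews:
        # Step 1: drop generic claims that cannot be checked against the paper.
        checkable = [claim for claim in review["claims"] if is_verifiable(claim)]
        # Step 2: count only clear hallucinations; everything else is PASS.
        counts.append(sum(judge(review, claim) == "CONTRADICTED" for claim in checkable))
    return mean(counts)


# Toy stand-ins so the sketch runs end to end.
toy_reviews = [
    {"claims": ["uses dataset X", "ablation is missing", "well-written"]},
    {"claims": ["method has three stages"]},
]
toy_filter = lambda claim: claim != "well-written"
toy_judge = lambda review, claim: "CONTRADICTED" if claim == "uses dataset X" else "PASS"
average = hallucinations_per_review(toy_reviews, toy_filter, toy_judge)
```

Averaging per review, rather than per claim, matches the quantity plotted in Figure 24.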

![Refer to caption](https://arxiv.org/html/2605.21713v1/x30.png)

Figure 24: Average number of hallucinated claims per review across fully human and AI\-generated reviews\.

The results, shown in Figure [24](https://arxiv.org/html/2605.21713#A5.F24), do not support our initial hypothesis\. On average, human reviews contain 0\.32 hallucinations per review, while all AI models produce fewer factual errors, ranging from 0\.03 for DeepSeek\-V3\.1 to 0\.17 for Qwen3\-235B\. Although the sample size is modest, manual verification indicates that the LLM\-as\-a\-judge assessments are largely accurate\.

One possible explanation is that AI models tend to be more conservative in their comments and engage with the paper at a more superficial level than human reviewers\. As a result, they may be less likely to make specific factual claims and, therefore, less prone to hallucinations\. This suggests that factual accuracy alone is not a reliable signal for distinguishing AI\-generated reviews from human\-written ones, as it could unfairly penalize careful human reviewers who simply avoid making mistakes\.

Nevertheless, factual accuracy remains an important aspect of review quality and warrants further study\. Future work could explore more advanced verification pipelines, for example by leveraging external document sources to validate factual claims more reliably\.

Table [17](https://arxiv.org/html/2605.21713#A5.T17) presents the LLM\-as\-a\-Judge prompt for factual verification\. We emphasize a conservative approach, flagging only clear hallucinations while treating misinterpretations or reasoning errors as acceptable\.

Table 17: System and User Prompts used for Hallucination Detection in Peer Reviews\.

**Hallucination Detection Prompt \(Adapted for Brevity\)**

**System Prompt:** You are an **extremely conservative** hallucination detector for peer reviews\. You are given:

1. The full text of an academic paper\.
2. The full peer review\.
3. One claim extracted from the peer review\.

Your task is to determine whether the reviewer has made a **HALLUCINATED claim** about the paper\.

**Core Distinction**

You **must distinguish** between:

- Hallucination: The reviewer asserts false content *as if it were explicitly stated in the paper*\.
- Possible Incorrect Claim: The reviewer may be wrong due to interpretation, reasoning, assumptions, or normative judgment\. These are not hallucinations and must be labeled PASS\.

If a claim could plausibly result from misunderstanding or reasoning, it **must be** PASS\.

**What Counts as a Hallucination**

A claim is a hallucination **only if** the reviewer asserts an *objectively false fact* about the paper itself, like:

- Fabricated names, acronyms, or expansions
- Invented methods, stages, datasets, or definitions
- Claims that the paper introduces, defines, or uses something nonexistent
- Incorrect numeric values **only when**:
  - the number defines a *critical* component of the method or experiments, and
  - the reviewer presents it as a fact stated in the paper, and
  - the paper explicitly states a different value

**Label Definitions**

- CONTRADICTED: Use **only** for clear hallucinations\.
- PASS: Use in **all other cases**, including possible incorrect claims\.

**User Prompt:** Paper Content: \[parsed paper PDF\] Peer Review: \[full review text\] Claim: \[single extracted claim\]
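A minimal sketch of how the user prompt could be filled in and the judge's reply mapped to a label is given below. The template mirrors the three fields of the user prompt; the function names, the line breaks between fields, and the reply\-parsing rule are assumptions, since the abbreviated prompt does not specify an output format. Ambiguous replies default to PASS, in keeping with the conservative stance of the prompt.

```python
import re

# Field names follow the user prompt in Table 17; newlines between fields are assumed.
USER_TEMPLATE = "Paper Content: {paper}\nPeer Review: {review}\nClaim: {claim}"


def build_user_prompt(paper: str, review: str, claim: str) -> str:
    """Fill the three slots of the user prompt for a single extracted claim."""
    return USER_TEMPLATE.format(paper=paper, review=review, claim=claim)


def parse_label(reply: str) -> str:
    """Map a free-text judge reply to CONTRADICTED or PASS (assumed parsing rule)."""
    text = reply.upper()
    says_contradicted = re.search(r"\bCONTRADICTED\b", text) is not None
    says_pass = re.search(r"\bPASS\b", text) is not None
    # Only an unambiguous CONTRADICTED counts as a hallucination.
    return "CONTRADICTED" if says_contradicted and not says_pass else "PASS"


prompt = build_user_prompt("<parsed paper text>", "<full review text>", "<one claim>")
```

One judge call is issued per claim, so a review with several verifiable claims produces several prompts that share the same paper and review text.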
