Benchmarking Agentic Review Systems

arXiv cs.AI 06/20/26, 04:00 AM Papers
benchmarking agentic-review peer-review llm evaluation open-source proprietary
Summary
This paper benchmarks agentic review systems for peer review, evaluating open-source and proprietary systems on research papers. The best configuration achieves 83.0% pairwise accuracy and catches 71.6% of injected errors, but user feedback highlights issues with false positives and nitpicks.
arXiv:2606.19749v1 Announce Type: new Abstract: A new class of agentic review systems are emerging as a remedy to the pressure placed on peer review systems by AI-assisted research, but it is unclear how they should be evaluated. We evaluate two open-source systems (OpenAIReview and coarse), one proprietary system (Reviewer3), and a zero-shot baseline, across six LLMs spanning frontier and efficient models. First, we study whether AI reviews on ICLR/NeurIPS papers track with papers' quality as approximated by external signals such as citations and acceptance decisions. Every system performs above chance in pairwise accuracy, and the best is OpenAIReview + GPT-5.5 at 83.0%. Second, to test whether systems can catch errors with known ground truth, we construct a perturbation benchmark that injects four categories of errors into papers across eight arXiv subject classes and measure detection recall. The strongest configuration (OpenAIReview + GPT-5.5) catches 71.6% of injected errors, leaving substantial room for improvement. The union of detections across six models reaches 83.3% recall, suggesting different models detect different errors and better harness design can potentially increase performance. Beyond these benchmarks, we study a public deployment of OpenAIReview with real users. Votes on its comments skew positive at 1.44 to 1, and the most common complaints are about false positives and minor nitpicks. Together, by evaluating full review systems backed by state-of-the-art models on real research papers, we show that while AI reviews still have room for improvement, they can already track human quality judgments well, catch important errors, and earn positive feedback from real users.
# Benchmarking Agentic Review Systems
Source: [https://arxiv.org/html/2606.19749](https://arxiv.org/html/2606.19749)
Dang Nguyen¹, Wanqing Hao¹, Yanai Elazar², Chenhao Tan¹. ¹University of Chicago, ²Bar-Ilan University

###### Abstract

A new class of agentic review systems are emerging as a remedy to the pressure placed on peer review systems by AI-assisted research, but it is unclear how they should be evaluated. We evaluate two open-source systems (OpenAIReview and coarse), one proprietary system (Reviewer3), and a zero-shot baseline, across six LLMs spanning frontier and efficient models. First, we study whether AI reviews on ICLR/NeurIPS papers track with papers' quality as approximated by external signals such as citations and acceptance decisions. Every system performs above chance in pairwise accuracy, and the best is OpenAIReview + GPT-5.5 at 83.0%. Second, to test whether systems can catch errors with known ground truth, we construct a perturbation benchmark that injects four categories of errors into papers across eight arXiv subject classes and measure detection recall. The strongest configuration (OpenAIReview + GPT-5.5) catches 71.6% of injected errors, leaving substantial room for improvement. The union of detections across six models reaches 83.3% recall, suggesting different models detect different errors and better harness design can potentially increase performance. Beyond these benchmarks, we study a public deployment of OpenAIReview with real users. Votes on its comments skew positive at 1.44 to 1, and the most common complaints are about false positives and minor nitpicks. Together, by evaluating full review systems backed by state-of-the-art models on real research papers, we show that while AI reviews still have room for improvement, they can already track human quality judgments well, catch important errors, and earn positive feedback from real users.

Corresponding author: dangnguyen@uchicago.edu

## 1 Introduction

![Refer to caption](https://arxiv.org/html/2606.19749v1/x1.png)

Figure 1: AI reviewer systems can generate useful reviews. (1) On ICLR/NeurIPS papers, the best backend for each system produces different comment volumes (top), yet every system separates low- from high-quality papers above chance (bottom). (2) We inject controlled errors into papers and check whether the reviewer flags the perturbed span. (3) The strongest configuration (OpenAIReview + GPT-5.5) catches 71.6% of injected errors.

Peer review is under increasing pressure as AI-assisted papers increase submission volumes and flood the system (Liu and Tan, [2026](https://arxiv.org/html/2606.19749#bib.bib19); Lu et al., [2024](https://arxiv.org/html/2606.19749#bib.bib7)). Liu and Tan ([2026](https://arxiv.org/html/2606.19749#bib.bib19)) formalize this dynamic as a *review death spiral*: as submissions overwhelm reviewer capacity, review accuracy degrades, acceptance becomes more random, and weaker papers are drawn in to "try their luck," further overloading reviewers and pushing the system toward a sharp collapse. Their analysis finds that only improvements in *review precision* (the ability to discriminate high- from low-quality papers) can stabilize the system, making AI-assisted reviewing a necessary remedy. Indeed, major conferences have begun integrating LLMs into the reviewing pipeline and report promising results (ICML Program Committee, [2026](https://arxiv.org/html/2606.19749#bib.bib18); Biswas et al., [2026](https://arxiv.org/html/2606.19749#bib.bib23)).

At the same time, LLM-based automated reviewer systems have emerged, with proprietary ones like Refine (Calvó López and Golub, [2025](https://arxiv.org/html/2606.19749#bib.bib1)) and Reviewer3 (Reviewer3, [2025](https://arxiv.org/html/2606.19749#bib.bib20)), and open-source ones like OpenAIReview (Chicago Human+AI Lab, [2025](https://arxiv.org/html/2606.19749#bib.bib16)) and coarse (Van Dijcke, [2025](https://arxiv.org/html/2606.19749#bib.bib15)). These systems support multi-agent setups, structured prompts, and different forms of context management to emit detailed feedback on paper snippets rather than a score or acceptance decision. Although there are encouraging anecdotes about their helpfulness, it is unclear how reviewer systems compare to each other. Prior benchmarks have either been small in scale or evaluated raw LLMs rather than the system as a whole (Liu and Shah, [2023](https://arxiv.org/html/2606.19749#bib.bib11); Tyser et al., [2024](https://arxiv.org/html/2606.19749#bib.bib12); Xi et al., [2025](https://arxiv.org/html/2606.19749#bib.bib21)), and models have improved substantially since then. This marks a timely moment to revisit benchmarking AI reviewer systems with stronger models, system-level comparisons, and real papers.

We first study whether the output of AI review systems such as OpenAIReview, coarse, and Reviewer3 correlates with paper quality as measured by different sources, including community citations, conference decisions, reviewer scores, and a composite of the three. For instance, high citation counts can signal a strong paper while low counts can signal a weaker paper. We find that AI review systems register this signal on ICLR/NeurIPS papers despite not being explicitly trained to approximate acceptance decisions. Under the assumption that weaker papers (according to the proxies) should incur more comments, we compute the accuracy of systems on randomly sampled (low, high) quality paper pairs. Every system assigns more comments to the low-quality paper at an above-chance rate, and the signal is strongest on frontier models (OpenAIReview + GPT-5.5 reaches 0.83 pairwise accuracy). This trend also holds across models and increases with model strength, providing evidence that today's models can provide useful signal in reviewing.

However, models are far from perfect when it comes to identifying all errors in a paper. Whereas the quality-proxy study above is confined to ICLR/NeurIPS submissions, here we move beyond a single field and construct a comprehensive *perturbation benchmark*, where we introduce errors into otherwise-good papers and evaluate AI review systems' recall of these errors. The benchmark spans eight arXiv subject classes ranging from Econometrics to Genomics, with four error categories: local math edits, false claims, faulty reasoning, and experimental design or analysis errors. The strongest configuration (OpenAIReview backed by GPT-5.5) catches 71.6% of injected errors (Figure [1](https://arxiv.org/html/2606.19749#S1.F1), right). OpenAIReview's largest gains over a zero-shot baseline are on prose-level errors (e.g., recall on faulty reasoning rises from 36.8% to 68.4%), while math-token edits show smaller gains. This is in line with OpenAIReview's running summarization design to support reviewing longer papers, since prose-level errors can often span a longer context than local math edits.

Despite the not-yet-perfect recall, gains are possible with better system design. Different backend models under OpenAIReview are complementary: the recall from all models combined reaches 83.3%, 11.7 points above GPT-5.5 alone. Together, these findings point at the potential for AI review systems to be adopted in conference reviewing and for new systems to be designed that can do well on all evaluation dimensions.
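The combined recall mentioned above is a set union over per-model detections. The following is a minimal sketch of that computation, assuming each injected error has an identifier and each model reports the set of identifiers it detected; the function name, model names, and toy data are hypothetical and not taken from the paper.

```python
from typing import Dict, Set


def union_recall(detected_by_model: Dict[str, Set[str]], injected: Set[str]) -> float:
    """Fraction of injected errors caught by at least one model."""
    caught = set().union(*detected_by_model.values())
    return len(caught & injected) / len(injected)


# Toy data (made up): six injected errors, two models with partial overlap.
injected = {"e1", "e2", "e3", "e4", "e5", "e6"}
detections = {
    "model_a": {"e1", "e2", "e3"},
    "model_b": {"e3", "e4"},
}
best_single = max(len(d & injected) for d in detections.values()) / len(injected)
combined = union_recall(detections, injected)
```

In this toy case the union exceeds the best single model because the two detection sets only partly overlap, which is the kind of complementarity the paragraph describes.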

We then move from controlled benchmarks to real use. We deploy OpenAIReview as a public web tool and collect feedback on 1,360 reviews of 1,100 papers. Users vote positively on the comments, with likes outnumbering dislikes 1.44 to 1, and many mark comments as addressed, evidence that the reviews carry value in practice. Sorting the downvoted comments by reason shows the main weakness is comment precision: most complaints are about false positives and minor nitpicks.

To summarize, our contributions are as follows:

- We introduce a perturbation-based benchmark and find that the strongest system catches 71.6% of injected errors.
- We show that today's AI reviewer systems pick up on signals of paper quality at 66–83% accuracy without being explicitly trained to, with a stronger signal in stronger models.
- We find that models often comment on disjoint sets of paragraphs and combining them can result in higher recall.
- In a public deployment of OpenAIReview, users vote positively on its comments (1.44 to 1 likes to dislikes), and the main weakness is comment precision, with most complaints being false positives and minor nitpicks.

## 2 Related Work

#### Automated review systems\.

Earlier work targeted LLM-assisted reviewing with smaller backbone models, ranging from aspect-level summarization (Yuan et al., [2022](https://arxiv.org/html/2606.19749#bib.bib4)) to direct GPT-4 feedback on full papers (Liang et al., [2024](https://arxiv.org/html/2606.19749#bib.bib3)). Recent systems wrap frontier LLMs in multi-stage harnesses (section-level multi-agent pipelines (D'Arcy et al., [2024](https://arxiv.org/html/2606.19749#bib.bib8)), rubric-scored sequential prompts (Tyser et al., [2024](https://arxiv.org/html/2606.19749#bib.bib12)), structured-reasoning agents trained on a dedicated review chain-of-thought corpus (Gao et al., [2025](https://arxiv.org/html/2606.19749#bib.bib9)), and full research pipelines that embed reviewing as a downstream stage (Lu et al., [2024](https://arxiv.org/html/2606.19749#bib.bib7))) to emit detailed advisory feedback rather than accept/reject decisions. Our benchmark targets a different slice of this space: rather than the academic prototypes cited above, we evaluate the publicly available reviewer systems that authors can actually run today (OpenAIReview (Chicago Human+AI Lab, [2025](https://arxiv.org/html/2606.19749#bib.bib16)), coarse (Van Dijcke, [2025](https://arxiv.org/html/2606.19749#bib.bib15)), and the commercial Reviewer3 (Reviewer3, [2025](https://arxiv.org/html/2606.19749#bib.bib20))), together with a zero-shot single-prompt baseline, and we score each as a whole pipeline rather than swapping in its underlying LLM.

#### Perturbation\-based error detection\.

Injecting controlled errors and measuring whether reviewers catch them is a recurring evaluation idea with a long NLP lineage (Gardner et al., [2020](https://arxiv.org/html/2606.19749#bib.bib26); Kaushik et al., [2020](https://arxiv.org/html/2606.19749#bib.bib27); Talmor et al., [2020](https://arxiv.org/html/2606.19749#bib.bib30); Ribeiro et al., [2020](https://arxiv.org/html/2606.19749#bib.bib29); Kassner and Schütze, [2020](https://arxiv.org/html/2606.19749#bib.bib28); Sai et al., [2021](https://arxiv.org/html/2606.19749#bib.bib6)). For paper review specifically, early work hand-injected small numbers of errors into short papers (Liu and Shah, [2023](https://arxiv.org/html/2606.19749#bib.bib11); Tyser et al., [2024](https://arxiv.org/html/2606.19749#bib.bib12)), and concurrent benchmarks FLAWS (Xi et al., [2025](https://arxiv.org/html/2606.19749#bib.bib21)) and SPECS (Biswas et al., [2026](https://arxiv.org/html/2606.19749#bib.bib23)) scale this up to ICML and AAAI papers. We differ in perturbing by error *type* (math, claim, reasoning, experimental) rather than review aspect, varying the model across six LLMs, and benchmarking independently-developed third-party systems head-to-head rather than raw LLM prompts.

#### LLMs vs\. humans\.

Several recent works compare LLM and human reviews: Liang et al. ([2024](https://arxiv.org/html/2606.19749#bib.bib3)) find GPT-4 captures consensus critiques but emphasizes broader implications while underweighting novelty. Li et al. ([2025](https://arxiv.org/html/2606.19749#bib.bib10)) report LLM reviews skew toward strengths over weaknesses and barely scale critique complexity with paper quality. Gao et al. ([2025](https://arxiv.org/html/2606.19749#bib.bib9)) introduce ReviewBench for direct LLM-vs.-human comparison. Our quality-proxy correlation study contributes a complementary signal, namely comment behavior on real ICLR/NeurIPS papers across four quality proxies, three reviewer systems, and six models.

## 3 OpenAIReview

OpenAIReview (Chicago Human+AI Lab, [2025](https://arxiv.org/html/2606.19749#bib.bib16)) is an open-source system that takes a full paper as input and returns a list of comments, each connected to a quoted passage. We benchmark it against existing systems in Sections [5](https://arxiv.org/html/2606.19749#S5) and [6](https://arxiv.org/html/2606.19749#S6) and analyze real user feedback from a deployed version in Section [8](https://arxiv.org/html/2606.19749#S8).

![Refer to caption](https://arxiv.org/html/2606.19749v1/x2.png)

Figure 2: OpenAIReview reviews a paper one passage at a time, checking each passage against its neighbors and a running summary of the paper so far. The summary is updated after each passage, and a final call merges the collected comments.

#### Paper processing pipeline.

The system accepts papers in different formats, such as PDF, markdown, or LaTeX. The review begins with an overall feedback section, followed by a list of comments, each containing a quoted passage and an explanation of the issue (Figure [8](https://arxiv.org/html/2606.19749#A1.F8) in Appendix [A](https://arxiv.org/html/2606.19749#A1) shows an example and Figure [3(a)](https://arxiv.org/html/2606.19749#S4.F3.sf1) shows the web interface). Figure [2](https://arxiv.org/html/2606.19749#S3.F2) illustrates how the system produces a review: the paper is first split into passages of roughly equal length, and the system reviews one passage at a time. Each passage is checked against two sources of context: a window of neighboring passages and a running summary of everything read so far. The running summary accumulates notation, definitions, key equations, theorems, assumptions, and claims. After each passage is checked, a separate model call updates the summary with any new content from it. This lets the model catch issues that span long distances, such as a symbol used inconsistently with a definition given several sections earlier, without placing the entire paper into a single prompt. Finally, the system drops duplicates and merges comments that point to the same underlying issue. We use the same backbone model across the pipeline.
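The control flow described above can be sketched as follows. This is an illustrative outline only, not the OpenAIReview implementation: the function names, the paragraph-packing heuristic, the `window` and `target_chars` parameters, and the three callables standing in for model calls (`review_fn`, `update_summary_fn`, `merge_fn`) are all assumptions introduced for this example.

```python
from typing import Callable, List


def split_into_passages(text: str, target_chars: int = 2000) -> List[str]:
    """Greedily pack paragraphs into passages of roughly equal length."""
    passages: List[str] = []
    current = ""
    for para in text.split("\n\n"):
        if current and len(current) + len(para) > target_chars:
            passages.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        passages.append(current)
    return passages


def review_paper(
    text: str,
    review_fn: Callable[[str, List[str], str], list],  # hypothetical model call
    update_summary_fn: Callable[[str, str], str],      # hypothetical model call
    merge_fn: Callable[[list], list],                  # hypothetical model call
    window: int = 1,
    target_chars: int = 2000,
) -> list:
    """Review one passage at a time against its neighbors and a running summary."""
    passages = split_into_passages(text, target_chars)
    summary = ""  # accumulates notation, definitions, claims read so far
    comments: list = []
    for i, passage in enumerate(passages):
        neighbors = passages[max(0, i - window): i] + passages[i + 1: i + 1 + window]
        comments.extend(review_fn(passage, neighbors, summary))
        # The summary is updated only after the passage has been checked.
        summary = update_summary_fn(summary, passage)
    # A final step drops duplicates and merges overlapping comments.
    return merge_fn(comments)
```

Passing the model calls in as plain callables keeps the loop testable with stubs and makes it clear that the prompt sees only the current passage, its neighbors, and the summary, never the whole paper.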

#### Review prompts\.

The main review prompt gives the model a fixed list of checks to run on each passage, e.g., mathematical and formula errors, inconsistent notations, overstated claims, and methods described too vaguely to reproduce. To reduce false positives, the prompt tells the model to first check whether a concern is resolved by the surrounding context, and to skip categories such as formatting issues or forward references. Figure [5](https://arxiv.org/html/2606.19749#A1.F5) in Appendix [A](https://arxiv.org/html/2606.19749#A1) shows the full review prompt, and Figures [6](https://arxiv.org/html/2606.19749#A1.F6) and [7](https://arxiv.org/html/2606.19749#A1.F7) give the prompts for the summary update, consolidation, and overall feedback.

#### Evaluating a review\.

As mentioned, a review has two parts: an overall feedback section and a list of individual comments. This is the format that recent review systems such as Refine (Calvó López and Golub, [2025](https://arxiv.org/html/2606.19749#bib.bib1)) and coarse (Van Dijcke, [2025](https://arxiv.org/html/2606.19749#bib.bib15)) share, so how to evaluate it is a general question beyond just OpenAIReview. The overall feedback gives a high-level assessment of the paper's quality, clarity, and main issues. Judging whether it is accurate and helpful is a separate question, likely best left to an LLM judge, and we set it aside here. We focus on the comments, which make concrete claims about particular passages. A comment is only useful if it points to a real problem, so we evaluate comments based on two criteria: whether weaker papers draw more, and more serious, comments (Section [5](https://arxiv.org/html/2606.19749#S5)), and whether the comments catch errors we know are present (Section [6](https://arxiv.org/html/2606.19749#S6)).

## 4 Systems Overview

Before presenting our two evaluations, we describe the review systems we benchmark and the LLMs that power them. The same set of systems and models is used in both Section [5](https://arxiv.org/html/2606.19749#S5) and Section [6](https://arxiv.org/html/2606.19749#S6).

#### Review methods\.

We compare four review systems, all of which take the full paper as input and emit a list of comments, where each comment contains a quoted passage and an explanation of the issue (two of the open systems' interfaces are shown in Figure [3](https://arxiv.org/html/2606.19749#S4.F3)):

- *Zero-shot.* A single prompt asks the model to review the entire paper in one pass and return all issues it finds.
- *OpenAIReview* (Chicago Human+AI Lab, [2025](https://arxiv.org/html/2606.19749#bib.bib16)). A system that processes the paper sequentially, maintaining a running summary of definitions, equations, and key claims. For each passage it checks the current text against the running summary and surrounding context to flag inconsistencies.
- *coarse* (Van Dijcke, [2025](https://arxiv.org/html/2606.19749#bib.bib15)). An open-source multi-agent paper reviewer that combines a macro-level overview agent with parallel per-section agents (and adversarial proof verification on math-heavy sections), followed by an editorial pass that deduplicates and filters the merged comment set.
- *Reviewer3* (Reviewer3, [2025](https://arxiv.org/html/2606.19749#bib.bib20)). A closed-source commercial reviewer system that emits a small set of high-priority comments per paper. Its internal model and prompts are not exposed, so it is treated as a single fixed system and not paired with the LLM sweep below.

![Refer to caption](https://arxiv.org/html/2606.19749v1/plots/openaireview.png) (a) OpenAIReview
![Refer to caption](https://arxiv.org/html/2606.19749v1/plots/coarse.png) (b) coarse

Figure 3: Review interfaces of the two automated systems we benchmark on the same paper. Both surface a list of comments, each anchored to a passage in the paper, that we use as the unit of analysis throughout this work.

#### Models.

Zero-shot, OpenAIReview, and coarse can each be backed by a different LLM. We evaluate the three open systems across two frontier models, GPT-5.5 (OpenAI, [2026](https://arxiv.org/html/2606.19749#bib.bib32)) and Claude Opus 4.7 (Anthropic, [2026](https://arxiv.org/html/2606.19749#bib.bib33)), and four efficient models, DeepSeek-V4-Flash (DeepSeek-AI, [2026](https://arxiv.org/html/2606.19749#bib.bib34)), Qwen3.6-35B-A3B (Qwen Team, [2026](https://arxiv.org/html/2606.19749#bib.bib35)), Gemini-3.1-Flash-Lite (Google, [2026](https://arxiv.org/html/2606.19749#bib.bib36)), and Grok-4.1-Fast (xAI, [2026](https://arxiv.org/html/2606.19749#bib.bib37)), all accessed via OpenRouter without reasoning mode.

#### Evaluation studies\.

With the methods and models fixed, we evaluate them along two complementary axes. The first evaluation (Section [5](https://arxiv.org/html/2606.19749#S5)) feeds each (method, model) pair real ICLR/NeurIPS papers and tests whether the number of issues raised tracks paper quality as approximated by reception signals (citations, awards, review scores). The perturbation benchmark (Section [6](https://arxiv.org/html/2606.19749#S6)) takes the same systems and tests whether they detect controlled errors injected into otherwise-clean papers. Beyond ICLR/NeurIPS papers, it spans eight arXiv subject classes, from Econometrics to Genomics, in order to provide diverse ground-truth errors across different theoretical and empirical fields.

## 5 AI Review Systems Correlate with Human Quality Signals

### 5.1 Method

We compare automated review systems' outputs on real ICLR/NeurIPS papers to test whether they correlate with paper quality as judged by the research community. Because paper quality has no gold-standard label, we instead use four *quality proxies* at different evaluation scales, derived from publication, citation, and review score signals. Within each proxy we select 30 high-quality and 30 low-quality papers from ICLR/NeurIPS. We filter for papers with substantive reviews, meaning at least three official reviews with a non-null average score. The four quality proxies are defined as follows:

- Community-level: the top 30 papers by citations-per-year versus 30 randomly sampled rejected papers with no subsequent publication venue (construction details for all proxies in Appendix [B.1](https://arxiv.org/html/2606.19749#A2.SS1)).
- Conference-level: 30 randomly sampled accepted papers that were highlighted (Outstanding, Best, Oral, Spotlight) versus 30 randomly sampled rejected papers.
- Reviewer-level: among papers with substantive reviews, the top 30 by mean review score versus the bottom 30.
- Composite: the top 30 awarded papers ranked by the sum of citation rank and score rank, versus the bottom 30 papers (rejected, never-published) ranked by mean review score.
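The rank-sum used for the high-quality side of the composite proxy can be illustrated with a small sketch. The exact construction is given in the paper's Appendix B.1, which is not reproduced here, so the tie-breaking rule, function names, and toy values below are assumptions made for illustration.

```python
from typing import Dict, List


def ranks_desc(values: Dict[str, float]) -> Dict[str, int]:
    """Rank 1 = largest value; ties are broken by paper id (an assumption)."""
    ordered = sorted(values, key=lambda pid: (-values[pid], pid))
    return {pid: rank for rank, pid in enumerate(ordered, start=1)}


def composite_top_k(
    cites_per_year: Dict[str, float], mean_score: Dict[str, float], k: int
) -> List[str]:
    """Top-k papers by the sum of citation rank and review-score rank."""
    cite_rank = ranks_desc(cites_per_year)
    score_rank = ranks_desc(mean_score)
    rank_sum = {pid: cite_rank[pid] + score_rank[pid] for pid in cites_per_year}
    return sorted(rank_sum, key=lambda pid: (rank_sum[pid], pid))[:k]


# Toy values (made up) for four awarded papers.
cites = {"a": 100.0, "b": 50.0, "c": 10.0, "d": 80.0}
scores = {"a": 6.0, "b": 8.0, "c": 5.0, "d": 8.5}
top_two = composite_top_k(cites, scores, k=2)
```

Summing ranks rather than raw values avoids having to put citations and review scores on a common scale.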

We emphasize that citations, awards, and review scores are noisy proxies of quality\. We select the top and bottom groups as a tractable approximation, not as ground\-truth measures of paper quality\.

#### Paper selection\.

We draw papers from SNOR (Neumann, [2025](https://arxiv.org/html/2606.19749#bib.bib31)), a dataset that links OpenReview submissions for ICLR (2017–2025) and NeurIPS (2021–2025) to Semantic Scholar with citation counts, decisions, and reviewer scores. We restrict to ICLR and NeurIPS 2021–2022, the earliest years for which both venues posted submissions on OpenReview, and far enough in the past that citation counts have had time to accumulate. Since each low/high quality group has 30 papers, we get 60 papers per proxy and 240 in total (197 unique, since some papers satisfy more than one criterion). For frontier models too expensive to run on the full set, we use a *frontier subset* of 74 unique papers and mainly report results on this subset. Each paper is reviewed by every (method, model) pair from Section [4](https://arxiv.org/html/2606.19749#S4) on the first 10 pages. Sampling design and full results are in Appendix [B](https://arxiv.org/html/2606.19749#A2).

#### Metrics\.

For each (method, model, quality proxy) triple, let $\bar{c}_{\rm high}$ and $\bar{c}_{\rm low}$ be the mean comments per paper on the high- and low-quality groups. We report the *pairwise accuracy*: the percentage of (low, high) paper pairs in which the low-quality paper incurs more comments than the high-quality one, with ties (equal comment counts) counting as 0.5 (Appendix [B.2](https://arxiv.org/html/2606.19749#A2.SS2)). We also report $\Delta = \bar{c}_{\rm low} - \bar{c}_{\rm high}$ and the percentage increase $\Delta / \bar{c}_{\rm high}$, expecting $\Delta > 0$ if the system tracks real issues. Bracketed quantities are 95% confidence intervals from a cluster bootstrap over papers (Appendix [B.3](https://arxiv.org/html/2606.19749#A2.SS3)).
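The two metrics above can be written out in a few lines. This sketch assumes every (low, high) pair is enumerated exhaustively; the paper's exact pairing and pooling procedure is in its Appendix B.2, and the function names and toy counts here are hypothetical.

```python
from itertools import product
from typing import Sequence, Tuple


def pairwise_accuracy(low_counts: Sequence[int], high_counts: Sequence[int]) -> float:
    """Share of (low, high) pairs where the low-quality paper gets more comments.

    Ties (equal comment counts) contribute 0.5.
    """
    total = 0.0
    for lo, hi in product(low_counts, high_counts):
        if lo > hi:
            total += 1.0
        elif lo == hi:
            total += 0.5
    return total / (len(low_counts) * len(high_counts))


def delta_and_relative_increase(
    low_counts: Sequence[int], high_counts: Sequence[int]
) -> Tuple[float, float]:
    """Return (delta, delta / c_high) for mean comments per paper."""
    c_low = sum(low_counts) / len(low_counts)
    c_high = sum(high_counts) / len(high_counts)
    delta = c_low - c_high
    return delta, delta / c_high


# Toy comment counts (made up): two low-quality and two high-quality papers.
acc = pairwise_accuracy([5, 3], [3, 1])
delta, rel = delta_and_relative_increase([5, 3], [3, 1])
```

With these toy counts, three of the four pairs favor the low-quality paper and one is a tie, so the accuracy is (3 + 0.5) / 4; a system that ignores quality entirely would sit near 0.5.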

### 5.2 Results

Table 1: Under OpenAIReview, GPT-5.5 picks up on paper quality signal the best. $\bar{c}_{\rm high}$ and $\bar{c}_{\rm low}$ are mean comments per paper on the high- and low-quality groups (averaged across the four proxies).

Table 2: OpenAIReview + GPT-5.5 is the best model and system combination on overall pairwise accuracy. $\bar{c}$ is mean comments per paper. Overall is the pooled accuracy across all four proxies.

#### GPT-5.5 picks up the quality signal best overall while Grok-4.1-Fast leads efficient models.

Table [1](https://arxiv.org/html/2606.19749#S5.T1) ranks models under OpenAIReview on the 74-paper frontier subset. GPT-5.5 leads at 0.83 pairwise accuracy and $\Delta = +6.38$ comments, with Claude Opus 4.7 next at 0.74. Among efficient models, Grok-4.1-Fast tops the rest at 0.80 on only $\bar{c} \approx 5$ comments per paper, with the other three models clustered between 0.60 and 0.62. The same ranking holds on the full 240-paper set (Appendix [B](https://arxiv.org/html/2606.19749#A2)). Severity stratification confirms that weaker papers receive more *severe* comments, not merely more: every model under OpenAIReview performs above chance on both major (0.59–0.78) and moderate (0.61–0.91) tiers, with GPT-5.5 again leading (Major 0.78, Moderate 0.91). We omit the Minor tier, which correlates poorly with quality. Per-system, per-tier breakdowns are in Appendix [B.5](https://arxiv.org/html/2606.19749#A2.SS5).

#### OpenAIReview \+ GPT\-5\.5 is the strongest configuration\.

Table [2](https://arxiv.org/html/2606.19749#S5.T2) compares each system's best (method, model) pair, plus Reviewer3. OpenAIReview + GPT-5.5 tops the overall accuracy at 0.83, just ahead of zero-shot + DeepSeek-V4-Flash and Reviewer3 (both 0.80). coarse trails at 0.66 even with its best backend. Zero-shot's accuracy comes on only $\bar{c} \approx 3.7$ comments per paper, which is directionally correct but likely too sparse to be useful in practice. Per proxy, OpenAIReview and Reviewer3 are the most balanced (every proxy $\geq 0.66$ and $\geq 0.70$ respectively), while zero-shot peaks at the Reviewer proxy (1.00) but falls to 0.60 on Conference. Full breakdowns are in Appendix [B](https://arxiv.org/html/2606.19749#A2).

#### Takeaway\.

Comment volume tracks paper quality above chance across every system, model, and quality proxy, and the pattern holds when we restrict to major and moderate comments only\. This is despite none of these systems being trained or prompted to predict acceptance decisions: they are simply asked to surface issues\. Pairwise accuracy on comment counts is admittedly a coarse metric, and may miss cases where a concise review points out a single fatal issue\. Still, the consistency across systems, models, proxies, and severity tiers suggests today’s models pick up real quality signals as a by\-product of issue\-finding, without task\-specific finetuning\.

## 6 Perturbation Benchmark

Quality proxies tell us whether comment volume tracks paper quality, but not whether individual comments are correct\. Real papers lack such ground truth\. We complement the previous analysis with a controlled benchmark: inject known errors into clean papers and measure per\-comment recall\.

### 6.1 Method

We inject different types of errors into clean papers. Surface errors are local math edits (sign flips, index/subscript changes, numeric edits, computation errors) that a reader can catch within a single equation. The other three categories are *prose-level* errors that require understanding context across a paragraph or paper: Claim (false theoretical or empirical statements), Logic (broken reasoning in proofs and arguments: circular reasoning, invalid implication, induction errors, missing cases), and Experimental (flawed experimental design: reversed causality, misinterpreted results, p-hacking). Examples of each error are in Table [13](https://arxiv.org/html/2606.19749#A3.T13), Appendix [C](https://arxiv.org/html/2606.19749#A3). Benchmark construction proceeds in five stages: extract, generate, validate, verify, inject, described below.

Table 3: OpenAIReview + GPT-5.5 attains the best recall out of all systems and models. Recall on the 24-paper frontier subset; brackets are 95% bootstrap CIs over papers (Appendix [C.7](https://arxiv.org/html/2606.19749#A3.SS7)). Dashes indicate runs not executed for that model. Reviewer3 has no model selector and is reported as a separate row.

Table 4: OpenAIReview wins on every error category. For each system we report its best-performing backend on the 24-paper frontier subset (where GPT-5.5 and Claude-Opus-4.7 were also run). Reviewer3 has no model selector. Brackets are 95% bootstrap CIs over papers (Appendix [C.7](https://arxiv.org/html/2606.19749#A3.SS7)).

#### Paper selection.

We sample 74 papers from 8 arXiv subject classes spanning theoretical and empirical research: Computational Complexity, Machine Learning, Econometrics, Experimental High-Energy Physics, Mathematics, Atomic and Cluster Physics, Genomics, and Applied Statistics. For each class, we collect 10 arXiv submissions with full LaTeX source and discard those whose source fails to compile or lacks the structural cues (equations, theorems, claim/argument paragraphs) that the extractor needs. Filtering yields 5–10 papers per class for a total of 74. For frontier models too expensive to run on the full set, we use a 24-paper subset (3 per domain) and report main-text results on this subset.

(1) Extract. A deterministic extractor scans the LaTeX source for candidate perturbation sites: equations, definitions, theorems, and proofs in theoretical papers, or equations, claim/argument paragraphs, and experimental paragraphs in empirical papers. Each is annotated with the error categories admissible for that span (Table [14](https://arxiv.org/html/2606.19749#A3.T14)).

(2) Generate. A generator LLM picks a subset of candidates and emits the replacement LaTeX and an explanation of why the perturbation results in an error. An audit of this model-driven selection against the full candidate pool finds only modest bias relative to random selection, mainly a preference for longer equations (Appendix [C.4](https://arxiv.org/html/2606.19749#A3.SS4)).

(3) Validate. A structural validator drops perturbations that are identical to the original, overlap an accepted edit, or break LaTeX at the span boundaries.

(4) Verify. A checklist verifier then filters out edits that are indistinguishable from typos (e.g., the value is not used downstream) or that do not constitute an error (e.g., changing a parameter in a way that still satisfies its bounds).

(5) Inject. Surviving edits are applied to yield a single corrupted paper.
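The validate and inject stages lend themselves to a short sketch. The version below is an illustration under stated assumptions, not the authors' code: edits are modeled as character spans over the LaTeX source, the `Edit` class and function names are hypothetical, and "breaks LaTeX at the span boundaries" is approximated by a simple check that the replacement keeps the same net count of unescaped braces as the text it replaces.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Edit:
    start: int        # inclusive character offset in the source
    end: int          # exclusive character offset in the source
    replacement: str  # perturbed LaTeX for source[start:end]


def net_braces(s: str) -> int:
    """Count unescaped '{' minus unescaped '}' (a crude well-formedness proxy)."""
    depth, i = 0, 0
    while i < len(s):
        if s[i] == "\\":
            i += 2  # skip the escaped character, e.g. \{ or \}
            continue
        if s[i] == "{":
            depth += 1
        elif s[i] == "}":
            depth -= 1
        i += 1
    return depth


def validate(source: str, edits: List[Edit]) -> List[Edit]:
    """Drop edits that are no-ops, overlap an accepted edit, or unbalance braces."""
    accepted: List[Edit] = []
    for e in edits:
        original = source[e.start:e.end]
        if e.replacement == original:
            continue
        if any(e.start < a.end and a.start < e.end for a in accepted):
            continue
        if net_braces(e.replacement) != net_braces(original):
            continue
        accepted.append(e)
    return accepted


def inject(source: str, edits: List[Edit]) -> str:
    """Apply non-overlapping edits right to left so earlier offsets stay valid."""
    for e in sorted(edits, key=lambda e: e.start, reverse=True):
        source = source[:e.start] + e.replacement + source[e.end:]
    return source
```

Applying edits from the end of the source backward is a design choice that avoids recomputing offsets after each replacement changes the string length.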

To confirm that the kept perturbations are genuine, well\-formed errors, we manually audited a stratified sample of 40 injected perturbations \(10 per error type, covering every subtype and all 8 domains\)\. We judge 33/40 \(82\.5%\) to be valid errors, 2 \(5%\) to be not true errors, and 5 \(12\.5%\) to be ambiguous, with the latter two categories concentrated in the surface\-numeric and empirical\-claim subtypes\. Models, prompts, automated and manual validation details are in Appendix [C\.3](https://arxiv.org/html/2606.19749#A3.SS3)\.

### 6\.2Evaluation

For each ground\-truth perturbation, we determine detection via a two\-stage filter applied to the reviewer’s emitted comments: first, a *fuzzy substring match* requires the perturbed text to be approximately contained in a review comment’s quote above a threshold τ \(which we validated in Table [16](https://arxiv.org/html/2606.19749#A3.T16)\)\. Second, an *LLM judge* rates whether the reviewer’s explanation identifies the same error as the ground\-truth perturbation, and passing comments are counted as detections\. Recall is then computed as the fraction of injected perturbations detected\. Threshold values, judge model, and rating cutoff are given in Appendix [C\.6](https://arxiv.org/html/2606.19749#A3.SS6)\.
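A minimal sketch of this two-stage filter follows. Several pieces are assumptions made for illustration: the sliding-window similarity built on `difflib.SequenceMatcher`, the dictionary field names, the placeholder threshold `tau=0.8`, and the `judge` callable that stands in for the LLM judge. Only the 1–5 rating scale with a cutoff of 3 comes from the paper.

```python
from difflib import SequenceMatcher
from typing import Callable


def fuzzy_containment(needle: str, haystack: str) -> float:
    """Best similarity between `needle` and any same-length window of `haystack`."""
    if not needle:
        return 0.0
    if len(haystack) <= len(needle):
        return SequenceMatcher(None, needle, haystack).ratio()
    best = 0.0
    for i in range(len(haystack) - len(needle) + 1):
        window = haystack[i:i + len(needle)]
        best = max(best, SequenceMatcher(None, needle, window).ratio())
    return best


def recall(
    perturbations: list[dict],
    comments: list[dict],
    judge: Callable[[str, str], int],
    tau: float = 0.8,   # placeholder; the paper's value is in its Appendix C.6
    cutoff: int = 3,    # rating cutoff on the 1-5 judge scale
) -> float:
    """Fraction of injected perturbations matched by at least one comment."""
    detected = 0
    for p in perturbations:
        for c in comments:
            # Stage 1: the comment must quote (approximately) the perturbed text.
            if fuzzy_containment(p["text"], c["quote"]) < tau:
                continue
            # Stage 2: the judge must agree the explanation names the same error.
            if judge(p["explanation"], c["explanation"]) >= cutoff:
                detected += 1
                break
    return detected / len(perturbations) if perturbations else 0.0
```

Running the cheap string filter first means the judge is only called on comments that already point at the perturbed span.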

### 6\.3Results

#### OpenAIReview outperforms both zero\-shot and coarse with different backend models\.

Table [3](https://arxiv.org/html/2606.19749#S6.T3) shows overall recall on the 24\-paper frontier subset, with 95% bootstrap CIs over papers \(see CI calculation in Appendix [C\.7](https://arxiv.org/html/2606.19749#A3.SS7)\)\. OpenAIReview wins across the board: the largest absolute gain over zero\-shot comes from DeepSeek\-V4\-Flash \(\+24\.8 points\), and the smallest gap is for GPT\-5\.5 \(\+11\.8 points\), which already attains 59\.8% under zero\-shot\. The same pattern holds on the full 74\-paper benchmark, where the four efficient models ran on every paper \(Table [18](https://arxiv.org/html/2606.19749#A3.T18) in Appendix [C\.8](https://arxiv.org/html/2606.19749#A3.SS8)\)\. Notably, coarse and Reviewer3 both underperform zero\-shot: coarse falls below zero\-shot on every shared backend \(20\.7% vs\. 31\.0% on DeepSeek\-V4\-Flash\), and Reviewer3’s 26\.5% sits below every zero\-shot row except Gemini\-3\.1\-Flash\-Lite\. This is likely because these two systems optimize for editorial prioritization rather than exhaustive enumeration\. Both systems emit 5–10 highest\-priority comments per paper, so recall on injected errors might not capture their full value as reviewers\.
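For readers unfamiliar with bootstrap CIs "over papers", the sketch below resamples whole papers with replacement and recomputes pooled recall each time. It is one common percentile-bootstrap recipe and an assumption on our part; the exact procedure is specified in the paper's Appendix C.7. The function name, the input layout, and the defaults are hypothetical.

```python
import random


def bootstrap_recall_ci(
    per_paper: list[tuple[int, int]],
    n_boot: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> tuple[float, float]:
    """Percentile bootstrap CI for pooled recall.

    per_paper holds one (n_detected, n_injected) pair per paper, and every
    paper is assumed to contain at least one injected error.
    """
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        # Resample papers, not individual perturbations, so that errors
        # from the same paper stay together.
        sample = [rng.choice(per_paper) for _ in per_paper]
        detected = sum(d for d, _ in sample)
        injected = sum(n for _, n in sample)
        stats.append(detected / injected)
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

Resampling at the paper level keeps within-paper correlation intact, which generally gives wider and more honest intervals than treating each injected error as independent.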

#### Prose\-level errors show the largest OpenAIReview gains\.

Table [4](https://arxiv.org/html/2606.19749#S6.T4) breaks down recall by error category on the same setup\. OpenAIReview’s gains over zero\-shot are concentrated on prose\-level errors \(claim, reasoning, experimental, each \+10–18 points\), while surface math\-token recall remains similar \(45\.8% → 47\.3%\), suggesting that the running summarization fails to help with errors that can already be detected locally\. coarse is weaker than every other system on prose\-level errors, at 19\.6% on claim errors and 26\.3% on experimental errors\. Its best category is reasoning \(40\.0%\), where it still trails OpenAIReview by 30 points\. Reviewer3 also trails on prose errors, particularly reasoning \(10\.0%\), with its strongest category being experimental \(35\.4%\)\. Per\-subtype recall is in Table [19](https://arxiv.org/html/2606.19749#A3.T19) in Appendix [C\.8](https://arxiv.org/html/2606.19749#A3.SS8)\.

## 7Review Analysis

![Refer to caption](https://arxiv.org/html/2606.19749v1/plots/venn_human_vs_ai.png) \(a\) Across sources: human reviewers vs\. union of the three AI systems using their best backend models \(Jaccard 0\.230\)\.
![Refer to caption](https://arxiv.org/html/2606.19749v1/plots/venn_claude_gpt_efficient.png) \(b\) Across models: Claude Opus 4\.7, GPT\-5\.5, and the union of the efficient models under OpenAIReview \(3\-way Jaccard 0\.316\)\.

Figure 4: Reviewers are complementary at the paragraph level\. Human reviewers and the AI union flag largely disjoint paragraphs \(left\), while different backend models under OpenAIReview are decently complementary \(right\)\. Values are average unique paragraphs per paper in each region\. The left panel is computed on the frontier subset of papers used in Section [5](https://arxiv.org/html/2606.19749#S5)\. The right panel uses the perturbation benchmark frontier subset\.

Table 5: Humans concentrate in paper\-level / novelty critiques, while AI over\-indexes on surface\-level, claim, experimental, and formal\-math categories\. Percentages refer to who wrote the comments in each cluster: of all comments in a row’s cluster, the share contributed by human reviewers \(Humans %\) vs\. the AI union \(AI %\), so each row sums to 100%\. The *Overall baseline* row gives the same split over all comments pooled across clusters \(39% / 61%\); a cell above its column’s baseline means that source is over\-represented in that cluster\.

Beyond aggregate recall, we want to know where in the paper each system flags issues, how they overlap or differ, and what kinds of issues those are\. We check for paragraph\-level location overlap \(across AI systems, models, and between AI and human reviewers\) and cluster comment content to characterize how systems complement each other\.

#### Method\.

For each \(method, model\) pair we record the set of paragraphs each reviewer commented on per paper\. We present the results using Venn diagrams \(Figure [4](https://arxiv.org/html/2606.19749#S7.F4)\) in which each intersection and exclusive region reports the average number of paragraphs per paper, together with the Jaccard similarity of each comparison group, computed per paper and then averaged across papers\. We run this overlap analysis on both the perturbation benchmark and the quality\-proxy papers of Section [5](https://arxiv.org/html/2606.19749#S5)\. To characterize the comments, we embed each comment \(title and explanation\) with all\-MiniLM\-L6\-v2 and run k\-means clustering \(k = 10\)\. Clusters are labeled by their top TF\-IDF keywords and manually merged into five interpretable groups, for which we report each source’s share\. These content groups are defined independently of the injected error types and need not align with them one\-to\-one\. For instance, reviewers also raise issues unrelated to any perturbation, such as table or figure problems\. However, in practice the two overlap substantially\. Per\-model paragraph\-overlap numbers and embedding details are in Appendix [D](https://arxiv.org/html/2606.19749#A4)\.
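The per-paper Jaccard averaging described above can be sketched as follows. The function name, the input layout (one dictionary per source, mapping a paper id to the set of flagged paragraph ids), and the choice to skip papers where no source flagged anything are assumptions for illustration.

```python
def mean_jaccard(*sources: dict) -> float:
    """Average over papers of |intersection| / |union| of flagged paragraphs.

    Works for two sources (e.g. human vs. AI) or more (e.g. a 3-way
    comparison across backend models).
    """
    if not sources:
        return 0.0
    # Only papers reviewed by every source are compared.
    shared = set(sources[0]).intersection(*sources[1:])
    scores = []
    for paper_id in shared:
        sets = [s[paper_id] for s in sources]
        union = set().union(*sets)
        if not union:
            continue  # no source flagged anything in this paper
        scores.append(len(set.intersection(*sets)) / len(union))
    return sum(scores) / len(scores) if scores else 0.0
```

Averaging per-paper scores, rather than pooling paragraphs across papers, keeps long papers with many flagged paragraphs from dominating the result.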

#### Humans and AI are complementary in the issues they find\.

We compare the union of the three AI systems’ paragraphs to those extracted from official OpenReview reviews over 70 papers \(extraction and matching details in Appendix [D](https://arxiv.org/html/2606.19749#A4)\)\. In general, AI review systems give many more comments than humans, at 61% versus 39%, respectively \(Table [5](https://arxiv.org/html/2606.19749#S7.T5)\)\. Humans and AI agree on 7\.16 paragraphs per paper on average \(Jaccard 0\.230\), with humans raising an additional 9\.03 paper\-level concerns that no AI flags, such as scope, novelty, and motivation\. In contrast, AI raises 14\.67 exclusive comments focusing more on claim validity and technical details\. Clustering all comments \(Table [5](https://arxiv.org/html/2606.19749#S7.T5)\) shows that humans take 80% of paper\-level / novelty critiques \(vs\. a 39% baseline\), while AI leads on surface, claim, experimental, and formal\-math issues, which is consistent with human reviewers triaging to high\-leverage abstract concerns and AI exhaustively surfacing technical ones\. This suggests that AI review systems are positioned to serve peer review as a complement to human reviewers, handling the exhaustive enumeration of concrete issues while human reviewers focus on the higher\-level judgments where they remain the dominant source of feedback\. More examples of human and AI comments from each Venn diagram region are in Table [25](https://arxiv.org/html/2606.19749#A4.T25) in Appendix [D\.6](https://arxiv.org/html/2606.19749#A4.SS6)\.

#### Models surface complementary errors\.

In Figure [4\(b\)](https://arxiv.org/html/2606.19749#S7.F4.sf2), under the OpenAIReview harness the backend models cover overlapping but not identical paragraphs\. The union of the efficient models already covers most of what each frontier model flags: Claude Opus 4\.7 leaves only 0\.73 paragraphs per paper that no other model raises, against a 9\.43\-paragraph all\-models intersection\. GPT\-5\.5 still contributes 6\.36 exclusive paragraphs per paper and efficient models contribute 4\.21\. The choice of backend also shapes *what* kind of issue is raised: Claude focuses on surface\-level comments while GPT leans toward claim and experimental critiques \(Table [24](https://arxiv.org/html/2606.19749#A4.T24) in Appendix [D](https://arxiv.org/html/2606.19749#A4)\)\. Because the models catch partly different errors, collectively their detections cover 83\.3% of injected errors \(95% CI \[79\.7, 86\.9\]\), 11\.7 points above the best single model \(GPT\-5\.5 at 71\.6%\)\. This gain holds across a cluster bootstrap over papers \(95% CI \[8\.1, 15\.1\], as in Appendix [C\.7](https://arxiv.org/html/2606.19749#A3.SS7)\), suggesting that better harness design can potentially increase performance\. The three *systems* \(coarse, OpenAIReview, Reviewer3\) are likewise somewhat complementary, though less so than models \(Appendix [D](https://arxiv.org/html/2606.19749#A4)\): OpenAIReview accounts for most flagged paragraphs, while coarse and Reviewer3 each add only about one exclusive paragraph per paper\.
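The union-of-detections figure is a simple set computation. The sketch below assumes each model's detections are available as a set of perturbation ids; the function name and input layout are hypothetical.

```python
def union_recall(detections: dict[str, set], n_injected: int) -> dict:
    """Compare recall of the pooled detections with the best single model.

    detections maps a model name to the set of perturbation ids it caught.
    """
    union = set().union(*detections.values())
    best = max(detections, key=lambda m: len(detections[m]))
    best_recall = len(detections[best]) / n_injected
    pooled_recall = len(union) / n_injected
    return {
        "best_model": best,
        "best_recall": best_recall,
        "union_recall": pooled_recall,
        "gain": pooled_recall - best_recall,
    }
```

With the reported figures, the gain is 83.3 − 71.6 = 11.7 points. The union is an upper bound on what a harness that combines models could recover, since it ignores the extra false positives that pooling also brings.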

## 8User feedback

### 8\.1Method

We deploy the review system as a web application\. The system runs OpenAIReview with Claude Opus 4\.6 as the backend model\. Users can interact with the review output in three ways: like or dislike each individual comment, click a resolve button on a comment to mark it as addressed, or leave optional free\-text feedback on the review as a whole\. We analyze 1,360 completed reviews for 1,100 distinct papers\. This is an observational deployment open to the public, so the users are anonymous people on the web, most likely researchers\. The uploaded papers span many fields \(we assign fields by clustering sentence embeddings of each paper’s title and abstract and labeling the clusters by inspection\)\. The largest share are in computer science and AI, with the rest spread across the social sciences, life sciences, and physics\. For privacy, we do not track user information beyond these feedback signals, and we inspect paper content only as each analysis requires: titles and abstracts for the fields above, and the reviewed text behind the downvoted comments we examine below\.

### 8\.2Results

#### Users vote positively and act on the comments\.

Of 27,587 comments shown to users, 690 received a vote for a 2\.5% engagement rate\. Likes outnumber dislikes 407 to 283, a 1\.44:1 ratio \(Table[6](https://arxiv.org/html/2606.19749#S8.T6)\)\. Users also marked 1,348 comments as resolved \(4\.9% of all comments shown\)\. Among the 109 papers with at least one resolved comment, an average of 50% of the comments were resolved\. We read this as promising evidence for reviews translating into author action\. The resolution rate is 24% for upvoted comments and 41% for downvoted ones \(Table[6](https://arxiv.org/html/2606.19749#S8.T6)\)\. The higher rate on downvoted comments suggests that beyond marking an issue as fixed, users also click it to dismiss a comment they disagree with\. The negative votes concentrate on specific failure modes, which we examine below\.
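The headline rates in this paragraph follow directly from the reported counts, as this short check shows. Only the counts come from the paper; the variable names are ours.

```python
# Counts reported for the deployment.
shown, votes, likes, dislikes, resolved = 27_587, 690, 407, 283, 1_348

engagement = votes / shown         # share of comments that received a vote
like_ratio = likes / dislikes      # likes per dislike
resolved_share = resolved / shown  # share of comments marked resolved

print(f"engagement {engagement:.1%}, "
      f"likes:dislikes {like_ratio:.2f}:1, "
      f"resolved {resolved_share:.1%}")
```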

Table 6: Votes skew positive \(1\.44:1 up to down\), and resolution tracking shows comments being acted on\. The higher resolution rate on downvoted comments is consistent with resolution doubling as a dismissal mechanism\. Comment voting and resolution over thirteen weeks of production use\.

#### Most downvoted comments are unhelpful\.

We categorized 283 downvoted comments by the likely reason for the downvote using an LLM judge, and manually validated a sample of the labels \(Appendix [E](https://arxiv.org/html/2606.19749#A5)\)\. Most go to comments that are unhelpful: a false positive flags a non\-issue, a trivial nitpick is valid but minor, and an unreasonable ask demands detail that the paper reasonably omits\. The full prompt given to the labeler can be found in Figure [13](https://arxiv.org/html/2606.19749#A5.F13) in Appendix [E](https://arxiv.org/html/2606.19749#A5)\. These three modes together account for about 70% of downvotes \(Table [7](https://arxiv.org/html/2606.19749#S8.T7)\)\. Parsing and optical character recognition \(OCR\) artifacts make up another 6%\. The rest are correct points that the author dismissed, but our manual checks reveal that some of these can also be considered minor\. The taxonomy separates distinct failure modes that all point toward the need to make the tool more targeted to substantial issues and avoid nitpicks, which is in line with results in Section [5](https://arxiv.org/html/2606.19749#S5)\.

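| Group | Downvote reason | n | % |
|---|---|---|---|
| Unhelpful \(71\.0%\) | False positive | 109 | 38\.5 |
| | Trivial nitpick | 49 | 17\.3 |
| | Unreasonable asks for detail | 43 | 15\.2 |
| Other \(29\.1%\) | Parsing / OCR artifact | 18 | 6\.4 |
| | Correct, author dismissed | 63 | 22\.3 |
| | Unclear | 1 | 0\.4 |
| Total | | 283 | 100 |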

Table 7: Most downvoted comments are unhelpful, with false positives, nitpicks, and unreasonable detail asks together making up 71%\. Reason each of the 283 downvoted comments was downvoted, assigned by an LLM judge \(Appendix [E](https://arxiv.org/html/2606.19749#A5)\)\.

## 9Discussion

Together, our studies show that current AI review systems already track human quality signals without additional post training, that the strongest configuration catches 71\.6% of injected errors, and that users of a deployed system find the comments worth acting on\. Two observations stand out\. First, in terms of raw recall, base model strength matters more than harness design: zero\-shot with GPT\-5\.5 already reaches 59\.8%, and the OpenAIReview harness adds ∼12 points\. The main advantage of harnesses over zero\-shot is that models can output more comments, such as in the case of OpenAIReview which averages 19 comments per paper compared to zero\-shot’s 3\.7\. Second, under the same harness, models are complementary in their comments: under OpenAIReview, the union of all models reaches 83\.3% recall, 11\.7 points above the best single model \(95% CI \[8\.1, 15\.1\]\)\. This suggests combining models via better harness design is a promising near\-term direction\.

Evidence from a deployed version of the system \(Section[8](https://arxiv.org/html/2606.19749#S8)\) suggests that the capabilities our benchmarks measure are also valuable to real users\. Users who vote judge the comments helpful more often than not, and they mark comments as resolved, suggesting the reviews translate into author action\. These signals are early, and engagement rates are modest, but they point in the same direction as the benchmarks: AI review systems are already useful in practice\. Reading the downvoted comments shows where the systems fall short: most are unhelpful, raising non\-issues, minor nitpicks, or over\-demanding asks \(Section[8](https://arxiv.org/html/2606.19749#S8)\)\. Where the benchmarks stress recall, deployment surfaces precision as the more pressing limitation, which points to calibration and tighter prompting as concrete near\-term fixes\. More generally, given that models pick up on quality signals and detect errors well, this opens up an exciting space for harness design, where systems can be tailored to different audiences or use cases, such as author\-facing, area chair\-facing, or domain\-specific\.

While we have included the best systems currently available to us, many more will likely be created in the future\. Extending the benchmark to additional systems will require standardizing their outputs to the format used by the evaluated systems, e\.g\., comments tagged with their source paragraph or span, which is only one useful design among many\. As more diverse systems are created, there will be a need for more benchmarks to measure their performance\.

## Limitations

Our benchmark measures recall but does not directly assess precision: a comment flagging an unperturbed passage may correspond to a pre\-existing issue or a hallucinated false positive, and these cases are indistinguishable without expert annotation, which we leave to future work\. The deployment study \(Section [8](https://arxiv.org/html/2606.19749#S8)\) gives a partial view of precision through the downvoted comments as judged by users, but it lacks expert ground truth errors on the papers themselves\. Effective reviewing also extends beyond identifying individual errors: a useful review organizes feedback, prioritizes consequential issues, calibrates severity, and engages with the paper’s broader contribution\. Thus, we view recall as a necessary but not sufficient condition for reviewer quality\. Finally, the perturbation pipeline is itself LLM\-driven, which introduces a concern for bias\. The generator and verifier LLMs may share distributional biases with the LLM\-based reviewers under evaluation, so the injected errors can skew toward mistakes that are salient to LLMs\. Recall measured on these perturbations could then overstate performance on the important errors that human experts would flag but that lie outside the LLM\-generated distribution\. A manual audit of a stratified sample \(Section [6](https://arxiv.org/html/2606.19749#S6)\) finds 82\.5% of kept perturbations to be valid errors, so they are genuine mistakes, but this speaks only to their validity and not to whether they represent the errors experts most care about\. A larger audit with subject\-matter experts is left to future work\.

## Declaration of LLM Usage

LLMs are integral to the methodology of this paper, both as components of the benchmark pipeline and as the systems under evaluation\.

#### Benchmark construction\.

The perturbation pipeline \(Appendix[C\.3](https://arxiv.org/html/2606.19749#A3.SS3)\) uses two LLMs in fixed roles: Gemini\-3 Flash Preview as the generator that proposes candidate errors from extracted spans, and Claude Sonnet 4\.6 as the checklist verifier that filters those candidates down to substantive perturbations\.

#### Scoring\.

The detection\-scoring stage \(Appendix [C\.6](https://arxiv.org/html/2606.19749#A3.SS6)\) uses Gemini\-3 Flash Preview as an explanation\-matching judge that rates whether a reviewer’s explanation identifies the same error as the ground\-truth perturbation, on a 1–5 scale with a ≥3 cutoff\.

#### Systems under evaluation\.

The systems compared in this paper \(OpenAIReview, coarse, and the zero\-shot baseline\) are themselves LLM\-based pipelines, evaluated across six backbone LLMs spanning frontier and efficient models \(Section [4](https://arxiv.org/html/2606.19749#S4)\)\.

#### Writing assistance\.

LLM\-based tools were also used for editing, formatting, and prose polishing during manuscript preparation\. This usage did not affect the methodology, results, or claims of the paper and is reported here only for completeness\.

## References

- Anthropic \(2026\) Introducing Claude Opus 4\.7\. Note: [https://www\.anthropic\.com/news/claude\-opus\-4\-7](https://www.anthropic.com/news/claude-opus-4-7) Accessed: 2026\-05\-26\. Cited by: [§4](https://arxiv.org/html/2606.19749#S4.SS0.SSS0.Px2.p1.1)\.
- J\. Biswas, S\. Schoepp, G\. Vasan, A\. Opipari, A\. Zhang, Z\. Hu, S\. Joseph, M\. Lease, J\. J\. Li, P\. Stone, K\. L\. Wagstaff, M\. E\. Taylor, and O\. C\. Jenkins \(2026\) AI\-assisted peer review at scale: the AAAI\-26 AI review pilot\. arXiv preprint arXiv:2604\.13940\. Cited by: [§1](https://arxiv.org/html/2606.19749#S1.p1.1), [§2](https://arxiv.org/html/2606.19749#S2.SS0.SSS0.Px2.p1.1)\.
- Y\. Calvó López and B\. Golub \(2025\) Refine: AI\-powered research assistant\. Note: [https://www\.refine\.ink/](https://www.refine.ink/) Cited by: [§1](https://arxiv.org/html/2606.19749#S1.p2.1), [§3](https://arxiv.org/html/2606.19749#S3.SS0.SSS0.Px3.p1.1)\.
- Chicago Human\+AI Lab \(2025\) OpenAIReview: AI\-powered academic paper reviewer\. Note: [https://github\.com/ChicagoHAI/OpenAIReview](https://github.com/ChicagoHAI/OpenAIReview) Cited by: [§1](https://arxiv.org/html/2606.19749#S1.p2.1), [§2](https://arxiv.org/html/2606.19749#S2.SS0.SSS0.Px1.p1.1), [§3](https://arxiv.org/html/2606.19749#S3.p1.1), [2nd item](https://arxiv.org/html/2606.19749#S4.I1.i2.p1.1)\.
- M\. D’Arcy, T\. Hope, L\. Birnbaum, and D\. Downey \(2024\) MARG: multi\-agent review generation for scientific papers\. arXiv preprint arXiv:2401\.04259\. Cited by: [§2](https://arxiv.org/html/2606.19749#S2.SS0.SSS0.Px1.p1.1)\.
- DeepSeek\-AI \(2026\) DeepSeek\-V4\-Flash\. Note: [https://huggingface\.co/deepseek\-ai/DeepSeek\-V4\-Flash](https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash) Accessed: 2026\-05\-26\. Cited by: [§4](https://arxiv.org/html/2606.19749#S4.SS0.SSS0.Px2.p1.1)\.
- X\. Gao, J\. Ruan, Z\. Zhang, J\. Gao, T\. Liu, and Y\. Fu \(2025\) ReviewAgents: bridging the gap between human and AI\-generated paper reviews\. arXiv preprint arXiv:2503\.08506\. Cited by: [§2](https://arxiv.org/html/2606.19749#S2.SS0.SSS0.Px1.p1.1), [§2](https://arxiv.org/html/2606.19749#S2.SS0.SSS0.Px3.p1.1)\.
- M\. Gardner, Y\. Artzi, V\. Basmov, J\. Berant, B\. Bogin, S\. Chen, P\. Dasigi, D\. Dua, Y\. Elazar, A\. Gottumukkala, N\. Gupta, H\. Hajishirzi, G\. Ilharco, D\. Khashabi, K\. Lin, J\. Liu, N\. F\. Liu, P\. Mulcaire, Q\. Ning, S\. Singh, N\. A\. Smith, S\. Subramanian, R\. Tsarfaty, E\. Wallace, A\. Zhang, and B\. Zhou \(2020\) Evaluating models’ local decision boundaries via contrast sets\. In Findings of EMNLP, pp\. 1307–1323\. Cited by: [§2](https://arxiv.org/html/2606.19749#S2.SS0.SSS0.Px2.p1.1)\.
- Google \(2026\) Gemini 3\.1 Flash Lite: our most cost\-effective AI model yet\. Note: [https://blog\.google/innovation\-and\-ai/models\-and\-research/gemini\-models/gemini\-3\-1\-flash\-lite/](https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-flash-lite/) Accessed: 2026\-05\-26\. Cited by: [§4](https://arxiv.org/html/2606.19749#S4.SS0.SSS0.Px2.p1.1)\.
- J\. A\. Hanley and B\. J\. McNeil \(1982\) The meaning and use of the area under a receiver operating characteristic \(ROC\) curve\. Radiology 143\(1\), pp\. 29–36\. Cited by: [§B\.2](https://arxiv.org/html/2606.19749#A2.SS2.p1.3)\.
- ICML Program Committee \(2026\) ICML experimental program: using Google’s paper assistant tool \(PAT\)\. Note: [blog\.icml\.cc](https://blog.icml.cc/2026/01/14/icml-experimental-program-using-googles-paper-assistant-tool-pat/) Cited by: [§1](https://arxiv.org/html/2606.19749#S1.p1.1)\.
- N\. Kassner and H\. Schütze \(2020\) Negated and misprimed probes for pretrained language models: birds can talk, but cannot fly\. In Proceedings of ACL, pp\. 7811–7818\. Cited by: [§2](https://arxiv.org/html/2606.19749#S2.SS0.SSS0.Px2.p1.1)\.
- D\. Kaushik, E\. Hovy, and Z\. C\. Lipton \(2020\) Learning the difference that makes a difference with counterfactually\-augmented data\. In ICLR\. Cited by: [§2](https://arxiv.org/html/2606.19749#S2.SS0.SSS0.Px2.p1.1)\.
- R\. Li, H\. Zhang, E\. Gehringer, T\. Xiao, J\. Ding, and H\. Chen \(2025\) Unveiling the merits and defects of LLMs in automatic review generation for scientific papers\. arXiv preprint arXiv:2509\.19326\. Cited by: [§2](https://arxiv.org/html/2606.19749#S2.SS0.SSS0.Px3.p1.1)\.
- W\. Liang, Y\. Zhang, H\. Cao, B\. Wang, D\. Ding, X\. Yang, K\. Vodrahalli, S\. He, D\. Smith, Y\. Yin, D\. McFarland, and J\. Zou \(2024\) Can large language models provide useful feedback on research papers? A large\-scale empirical analysis\. NEJM AI 1\(8\)\. External Links: [Document](https://dx.doi.org/10.1056/AIoa2400196)\. Cited by: [§2](https://arxiv.org/html/2606.19749#S2.SS0.SSS0.Px1.p1.1), [§2](https://arxiv.org/html/2606.19749#S2.SS0.SSS0.Px3.p1.1)\.
- H\. Liu and C\. Tan \(2026\) AI\-assisted reviewing is necessary for avoiding the review death spiral\. Note: Preprint, [https://openaireview\.org/assets/review\-death\-spiral\.pdf](https://openaireview.org/assets/review-death-spiral.pdf) Cited by: [§1](https://arxiv.org/html/2606.19749#S1.p1.1)\.
- R\. Liu and N\. B\. Shah \(2023\) ReviewerGPT? An exploratory study on using large language models for paper reviewing\. arXiv preprint arXiv:2306\.00622\. Cited by: [§1](https://arxiv.org/html/2606.19749#S1.p2.1), [§2](https://arxiv.org/html/2606.19749#S2.SS0.SSS0.Px2.p1.1)\.
- C\. Lu, C\. Lu, R\. T\. Lange, J\. Foerster, J\. Clune, and D\. Ha \(2024\) The AI scientist: towards fully automated open\-ended scientific discovery\. arXiv preprint arXiv:2408\.06292\. Cited by: [§1](https://arxiv.org/html/2606.19749#S1.p1.1), [§2](https://arxiv.org/html/2606.19749#S2.SS0.SSS0.Px1.p1.1)\.
- H\. B\. Mann and D\. R\. Whitney \(1947\) On a test of whether one of two random variables is stochastically larger than the other\. Annals of Mathematical Statistics 18\(1\), pp\. 50–60\. Cited by: [§B\.2](https://arxiv.org/html/2606.19749#A2.SS2.p1.3)\.
- M\. Neumann \(2025\) Cited by: [Table 27](https://arxiv.org/html/2606.19749#A6.T27.1.1.9.8.1), [§5\.1](https://arxiv.org/html/2606.19749#S5.SS1.SSS0.Px1.p1.1)\.
- OpenAI \(2026\) Introducing GPT\-5\.5\. Note: [https://openai\.com/index/introducing\-gpt\-5\-5/](https://openai.com/index/introducing-gpt-5-5/) Accessed: 2026\-05\-26\. Cited by: [§4](https://arxiv.org/html/2606.19749#S4.SS0.SSS0.Px2.p1.1)\.
- Qwen Team \(2026\) Qwen3\.6\-35B\-A3B\. Note: [https://huggingface\.co/Qwen/Qwen3\.6\-35B\-A3B](https://huggingface.co/Qwen/Qwen3.6-35B-A3B) Accessed: 2026\-05\-26\. Cited by: [§4](https://arxiv.org/html/2606.19749#S4.SS0.SSS0.Px2.p1.1)\.
- Reviewer3 \(2025\) Reviewer3: AI\-powered paper reviewer\. Cited by: [§1](https://arxiv.org/html/2606.19749#S1.p2.1), [§2](https://arxiv.org/html/2606.19749#S2.SS0.SSS0.Px1.p1.1), [4th item](https://arxiv.org/html/2606.19749#S4.I1.i4.p1.1)\.
- M\. T\. Ribeiro, T\. Wu, C\. Guestrin, and S\. Singh \(2020\) Beyond accuracy: behavioral testing of NLP models with CheckList\. In Proceedings of ACL, pp\. 4902–4912\. Cited by: [§2](https://arxiv.org/html/2606.19749#S2.SS0.SSS0.Px2.p1.1)\.
- A\. B\. Sai, T\. Dixit, D\. Y\. Sheth, S\. Mohan, and M\. M\. Khapra \(2021\) Perturbation CheckLists for evaluating NLG evaluation metrics\. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing \(EMNLP\), pp\. 7219–7234\. External Links: [Link](https://aclanthology.org/2021.emnlp-main.575/)\. Cited by: [§2](https://arxiv.org/html/2606.19749#S2.SS0.SSS0.Px2.p1.1)\.
- A\. Talmor, Y\. Elazar, Y\. Goldberg, and J\. Berant \(2020\) OLMpics\-on what language model pre\-training captures\. Transactions of the Association for Computational Linguistics 8, pp\. 743–758\. External Links: [Document](https://dx.doi.org/10.1162/tacl%5Fa%5F00342), [Link](https://doi.org/10.1162/tacl_a_00342)\. Cited by: [§2](https://arxiv.org/html/2606.19749#S2.SS0.SSS0.Px2.p1.1)\.
- K\. Tyser, B\. Segev, G\. Longhitano, X\. Zhang, Z\. Meeks, J\. Lee, U\. Garg, N\. Belsten, A\. Shporer, M\. Udell, D\. Te’eni, and I\. Drori \(2024\) AI\-driven review systems: evaluating LLMs in scalable and bias\-aware academic reviews\. arXiv preprint arXiv:2408\.10365\. Cited by: [§1](https://arxiv.org/html/2606.19749#S1.p2.1), [§2](https://arxiv.org/html/2606.19749#S2.SS0.SSS0.Px1.p1.1), [§2](https://arxiv.org/html/2606.19749#S2.SS0.SSS0.Px2.p1.1)\.
- D\. Van Dijcke \(2025\) Coarse: an open\-source AI academic paper reviewer\. Note: [https://github\.com/Davidvandijcke/coarse](https://github.com/Davidvandijcke/coarse) Cited by: [§1](https://arxiv.org/html/2606.19749#S1.p2.1), [§2](https://arxiv.org/html/2606.19749#S2.SS0.SSS0.Px1.p1.1), [§3](https://arxiv.org/html/2606.19749#S3.SS0.SSS0.Px3.p1.1), [3rd item](https://arxiv.org/html/2606.19749#S4.I1.i3.p1.1)\.
- xAI \(2026\) Grok 4\.1 Fast and Agent Tools API\. Note: [https://x\.ai/news/grok\-4\-1\-fast](https://x.ai/news/grok-4-1-fast) Accessed: 2026\-05\-26\. Cited by: [§4](https://arxiv.org/html/2606.19749#S4.SS0.SSS0.Px2.p1.1)\.
- S\. Xi, V\. Rao, J\. Payan, and N\. B\. Shah \(2025\) FLAWS: a benchmark for error identification and localization in scientific papers\. arXiv preprint arXiv:2511\.21843\. Cited by: [§1](https://arxiv.org/html/2606.19749#S1.p2.1), [§2](https://arxiv.org/html/2606.19749#S2.SS0.SSS0.Px2.p1.1)\.
- W\. Yuan, P\. Liu, and G\. Neubig \(2022\) Can we automate scientific reviewing?\. Journal of Artificial Intelligence Research 75, pp\. 171–212\. Cited by: [§2](https://arxiv.org/html/2606.19749#S2.SS0.SSS0.Px1.p1.1)\.

## Appendix AOpenAIReview Implementation Details

This appendix records the configuration used by OpenAIReview \(Section[3](https://arxiv.org/html/2606.19749#S3)\) in enough detail to reproduce the harness\. All values are defaults of the reference implementation, and our experiments vary only the backbone model \(Section[4](https://arxiv.org/html/2606.19749#S4)\)\.

#### Passage segmentation\.

The extracted text is split on paragraph boundaries, and adjacent paragraphs are merged greedily into passages of up to roughly 8,000 characters, so that passage lengths are comparable across papers\.
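A minimal sketch of this greedy merge is shown below. Treating a blank line as the paragraph boundary and letting a single over-long paragraph form its own passage are assumptions; the reference implementation may handle both differently.

```python
def segment(text: str, max_chars: int = 8000) -> list[str]:
    """Greedily merge adjacent paragraphs into passages of at most max_chars.

    A paragraph that is longer than max_chars on its own is kept whole as
    a single passage (assumption).
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    passages: list[str] = []
    current = ""
    for p in paragraphs:
        candidate = f"{current}\n\n{p}" if current else p
        if current and len(candidate) > max_chars:
            # Adding this paragraph would overflow: close the passage.
            passages.append(current)
            current = p
        else:
            current = candidate
    if current:
        passages.append(current)
    return passages
```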

#### Context window\.

When the model reviews a passage, it also sees a fixed window of neighbors: by default the five preceding and two following passages\. The asymmetry gives more weight to earlier material, where definitions and notation are introduced\.
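The asymmetric window amounts to two list slices. In this sketch the function name and signature are hypothetical; only the defaults of five preceding and two following passages come from the text.

```python
def context_window(passages: list[str], i: int, before: int = 5, after: int = 2):
    """Return (preceding, following) neighbors of passage i, clipped at the ends."""
    start = max(0, i - before)
    preceding = passages[start:i]
    following = passages[i + 1:i + 1 + after]
    return preceding, following
```

Near the start or end of a paper the window simply shrinks rather than wrapping around.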

#### Running summary\.

A separate model call updates the running summary after each passage\. The summary is held to a token budget of max\(4,000, T/10\), where T is the document’s length in tokens, so the budget grows with longer papers but never falls below roughly four thousand tokens\.
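As a quick illustration of the budget rule, the floor of 4,000 tokens applies until the document reaches 40,000 tokens, after which the budget is one tenth of the document length. The function name and the use of integer division are assumptions.

```python
def summary_budget(doc_tokens: int) -> int:
    """Token budget for the running summary: max(4000, T / 10)."""
    return max(4000, doc_tokens // 10)
```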

#### Review prompt\.

For each passage, the model receives the prompt in Figure[5](https://arxiv.org/html/2606.19749#A1.F5): the neighboring passages and running summary as context, the passage to check, the list of checks to run, what not to flag, and the required output format\. Angle\-bracketed fields are filled in at runtime, and the optical character recognition \(OCR\) caveat is included only when the passage text was extracted via OCR\.

You are a thoughtful reviewer checking a passage from an academic paper\. Today’s date is <date\>\. Engage deeply with the material\. For each potential issue, first try to understand the authors’ intent and check whether your concern is resolved by context before flagging it\.

<OCR caveat: included only when the passage text was extracted via OCR\>

FULL PAPER CONTEXT \(relevant sections\):

<neighboring passages and the running summary of the paper so far\>

\-\-\-

PASSAGE TO CHECK:

<the current passage\>

\-\-\-

Check for:

1\. Mathematical/formula errors: wrong formulas, sign errors, missing factors, incorrect derivations, subscript or index errors

2\. Notation inconsistencies: symbols used in a way that contradicts their earlier definition

3\. Inconsistency between text and formal definitions: prose says one thing but the equation says another

4\. Parameter/numerical inconsistencies: stated values contradict what can be derived from definitions or tables elsewhere

5\. Insufficient justification: a key derivation step is skipped where the result is non\-trivial

6\. Questionable claims: statements that overstate what has actually been shown

7\. Ambiguity that could mislead: flag only if a careful reader could reasonably reach an incorrect conclusion

8\. Underspecified methods: an algorithm, procedure, or modification is described too vaguely for a reader to reproduce \-\- key choices, boundary conditions, or parameter settings are left implicit

For each issue, write like a careful reader thinking aloud\. Describe what initially confused or concerned you, what you checked to resolve it, and what specifically remains problematic\. Acknowledge what the authors got right before noting the issue\. Reference standard results or conventions in the field when relevant\.

Be lenient with:

\- Introductory and overview sections, which intentionally simplify or gloss over details

\- Forward references \-\- symbols or claims that may be defined or justified later in the paper

\- Informal prose that paraphrases a formal result without repeating every qualifier

Do NOT flag:

\- Formatting, typesetting, or capitalization issues

\- References to equations or sections not shown in the context \(they exist elsewhere\)

\- Trivial observations that any reader in the field would immediately resolve

\- Incomplete text at passage boundaries

\- Notation not yet in the summary \-\- it may be introduced later

Return ONLY a JSON array \(can be \[\]\)\. Each item:

\- "title": concise title of the issue

\- "quote": the exact verbatim text \(preserving LaTeX\)

\- "explanation": deep reasoning \-\- what you initially thought, whether context resolves it, and what specifically remains problematic

\- "type": "technical" or "logical"

Figure 5: The full prompt used to review a single passage, reproduced verbatim\. Angle\-bracketed fields are substituted at runtime\.
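As an illustration of the output contract in Figure 5, the sketch below parses a model response and keeps only items that carry the four required string fields with an allowed `type`\. The function name `parse_comments` and the choice to drop malformed items silently are assumptions made for this example; the text does not state how OpenAIReview handles malformed output\.

```python
import json

# Field names and allowed "type" values are taken from the prompt in Figure 5.
REQUIRED_FIELDS = ("title", "quote", "explanation", "type")
ALLOWED_TYPES = {"technical", "logical"}


def parse_comments(raw_response):
    """Parse a JSON array of review comments, keeping only well-formed items.

    Hypothetical helper: dropping malformed items is an assumption,
    not documented pipeline behavior.
    """
    data = json.loads(raw_response)
    if not isinstance(data, list):
        raise ValueError("expected a JSON array")
    comments = []
    for item in data:
        if not isinstance(item, dict):
            continue
        has_fields = all(isinstance(item.get(k), str) for k in REQUIRED_FIELDS)
        if has_fields and item["type"] in ALLOWED_TYPES:
            comments.append(item)
    return comments
```

An empty array \(`[]`\) parses to an empty list, matching the prompt's allowance for passages with no issues\.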
#### Other prompts\.

The running summary is maintained by a separate prompt \(Figure [6](https://arxiv.org/html/2606.19749#A1.F6)\)\. Two further steps each use a single prompt, both shown in Figure [7](https://arxiv.org/html/2606.19749#A1.F7): the consolidation call that merges and deduplicates the collected comments, and the opening overall\-feedback paragraph, which the model writes from the first 8,000 characters of the paper\.

You are maintaining a concise running summary of an academic paper’s key technical content\. This summary will be used as context when reviewing later sections of the paper\.

CURRENT SUMMARY:

<the running summary so far\>

\-\-\-

NEW PASSAGE \(section <i\> of <n\>\):

<the current passage\>

\-\-\-

Update the summary to incorporate any NEW information from this passage\. Keep the summary structured and concise\. Include:

1\. \*\*Notation & Definitions\*\*: Any new symbols, variables, or terms defined

2\. \*\*Key Equations\*\*: Important equations or formulas introduced \(write them out, preserving LaTeX\)

3\. \*\*Theorems & Propositions\*\*: Statements of theorems, lemmas, corollaries \(brief statement, not proof\)

4\. \*\*Assumptions\*\*: Any stated assumptions or conditions

5\. \*\*Key Claims\*\*: Important results or conclusions established

Rules:

\- PRESERVE all existing summary content unless it is superseded by new information

\- ADD new items from the passage

\- Do NOT include commentary, proof details, or experimental results

\- Do NOT include information not in the passage or existing summary

\- Keep entries brief \-\- one line per item where possible

\- If the passage contains no new definitions, equations, or key claims, return the summary unchanged

Return the updated summary directly \(no JSON, no code fences\)\.

Figure 6: The prompt that updates the running summary after each passage, reproduced verbatim\. Angle\-bracketed fields are substituted at runtime\.

\(a\) Consolidation\. Merges and deduplicates the collected comments\.

You are reviewing the complete list of issues found in an academic paper\. Your job is to consolidate this list: remove duplicates and merge closely related issues\.

Remove issues that:

\- Flag the same underlying problem as another issue \(keep the better\-explained one\)

\- Flag standard conventions, notational shorthands, or well\-known results

ISSUES FOUND:

<all collected comments, as JSON\>

Return a JSON array containing the consolidated issues \(same format as input\)\. Return \[\] if none survive filtering\.

\(b\) Overall feedback\. Writes the opening paragraph of the review\.

You are an expert academic reviewer\. Based on the beginning of the paper below, write one paragraph of high\-level feedback on the paper’s quality, clarity, and most significant issues\.

PAPER \(first 8000 characters\):

<the start of the paper\>

Figure 7: The two remaining prompts in the pipeline, reproduced verbatim\. Angle\-bracketed fields are substituted at runtime\.
#### Consolidation\.

Once every passage is reviewed, the collected comments are serialized into a single list and passed to one model call that returns a deduplicated list, merging comments that flag the same underlying issue and dropping ones that restate standard conventions\. Each surviving comment is re\-anchored to its source passage by matching its quoted text\.
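The re\-anchoring step described above can be sketched as follows\. The matching rule used here \(substring containment after whitespace normalization\) is an assumption, since the text says only that comments are matched by their quoted text; the names `reanchor` and `passage_index` are hypothetical\.

```python
def _normalize(text):
    """Collapse runs of whitespace so line breaks do not block a match."""
    return " ".join(text.split())


def reanchor(comments, passages):
    """Attach the index of the first passage containing each comment's quote.

    Assumption: a comment matches a passage when its normalized quote is a
    substring of the normalized passage. Unmatched comments get None.
    """
    normalized = [_normalize(p) for p in passages]
    anchored = []
    for comment in comments:
        quote = _normalize(comment["quote"])
        index = next(
            (i for i, passage in enumerate(normalized) if quote and quote in passage),
            None,
        )
        anchored.append({**comment, "passage_index": index})
    return anchored
```

A comment whose quote was paraphrased during consolidation would be left unanchored under this rule, which is one reason a real implementation might use a fuzzier match\.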

#### Comment fields\.

Each comment carries a quoted passage, a one\-line title, an explanation, a source passage index, and an optional severity \(minor, moderate, or major\)\. Figure [8](https://arxiv.org/html/2606.19749#A1.F8) shows an example of the resulting output\.

Paper: *Finetuned Language Models Are Zero\-Shot Learners*

Overall feedback\. The paper is well written, clearly motivated, and presents a simple but important idea \(finetuning large language models on many instruction\-formatted tasks to improve zero\-shot generalization\) that is likely to be influential and practically useful\. The framing is strong: the contrast between standard finetuning, prompting, and instruction tuning is intuitive, and the leave\-one\-task\-cluster\-out evaluation is a reasonable first attempt to test generalization to unseen task types\. The empirical claims are compelling, especially the reported gains over the untuned 137B model and comparisons to GPT\-3, and the ablation directions address key factors such as scale, number of tasks, and the role of natural\-language instructions\. That said, the most significant concerns are about the strength and interpretation of the “zero\-shot” claim: FLAN is not zero\-shot in the broad sense, but rather multitask\-supervised and evaluated on held\-out task clusters, so the paper should be careful about possible leakage through related datasets, shared label spaces, similar input\-output formats, or pretraining contamination\. The comparison to GPT\-3 is also difficult to interpret because model architectures, pretraining data, prompt choices, decoding methods, and evaluation protocols may differ, making it less than a controlled comparison\. Reproducibility is another major issue, since the central experiments depend on a 137B proprietary model and a large curated instruction mixture, so full details of templates, mixture weights, dataset filtering, and prompt selection are essential\. Overall, the paper appears high quality and clearly presented, with a simple method and strong empirical results, but its main claims would be strengthened by more careful discussion of evaluation leakage, fair baselines, significance, and the precise sense in which the resulting model should be considered a zero\-shot learner\.

Comment 1\. *Quoted passage:* “FLAN substantially improves the performance of its unmodified counterpart and surpasses zero\-shot 175B GPT\-3 on 20 of 25 datasets that we evaluate\.” *Explanation:* The paper defines both mean\-template and best\-dev\-template evaluation, but the strongest headline comparisons are not presented consistently with that qualifier: the abstract and introduction omit the best\-dev\-template caveat, while nearby figures use mean\-template FLAN results\. \[…\] Since template choice can materially affect zero\-shot performance, readers could mistake the 20\-of\-25 claims for fully template\-free zero\-shot results\.

Comment 2\. *Quoted passage:* “Translation \(8 datasets\): ParaCrawl EN/DE, EN/ES, EN/FR; WMT\-16 EN/CS, EN/DE, EN/FI, EN/RO, EN/RU, EN/TR\.” *Explanation:* The figure lists three ParaCrawl language pairs plus six WMT\-16 language pairs, which would total nine translation datasets, while the header says eight\. This may be a formatting problem, or it may reflect a convention about what counts as one dataset, but as written it creates ambiguity about the exact instruction\-tuning mixture and the stated total dataset count\.

Comment 3\. *Quoted passage:* “performance improvements from instruction tuning emerge only with sufficient model scale\.” *Explanation:* The section reports a useful model\-size ablation, but it does not estimate a scaling law in the usual sense: no functional relationship is fit, no uncertainty or extrapolation analysis is provided, and the evidence comes from one model family and one instruction\-tuning recipe\. The conclusion that improvements emerge “only” with sufficient scale therefore risks overgeneralizing; a more precise phrasing would say that, in these experiments, gains appeared only for the larger tested models\.

Figure 8: An example OpenAIReview output, with one paragraph of overall feedback followed by comments, each pairing a quoted passage with an explanation\. “\[…\]” marks elisions in the longer comments\.
#### Document parsing\.

Review quality depends on clean text, since math symbols, tables, and reading order are easy to corrupt during extraction\. PDFs are converted through a fallback chain of parsers, tried in order until one succeeds: a cloud OCR service \(Mistral OCR\), a local OCR model \(DeepSeek OCR\), a layout\-preserving local parser \(Marker\), and finally an offline extractor \(pymupdf4llm\)\. LaTeX, Word, and arXiv HTML inputs bypass OCR and are converted directly\.
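The fallback chain above can be sketched generically as below\. Each parser is modeled as a hypothetical callable that takes a PDF path and returns text; the sketch does not call the real Mistral OCR, DeepSeek OCR, Marker, or pymupdf4llm APIs, and treating empty output as a failure is an assumption\.

```python
def parse_with_fallback(pdf_path, parsers):
    """Try each (name, parser) pair in order and return the first usable result.

    `parsers` is an ordered list of (name, callable) pairs. The callables are
    placeholders for the real backends; whitespace-only output is treated as
    a failure (an assumption for this sketch).
    """
    failures = []
    for name, parser in parsers:
        try:
            text = parser(pdf_path)
        except Exception as exc:  # any backend error moves on to the next parser
            failures.append(f"{name}: {exc}")
            continue
        if text and text.strip():
            return name, text
        failures.append(f"{name}: empty output")
    raise RuntimeError("all parsers failed: " + "; ".join(failures))
```

Returning the name of the parser that succeeded makes it possible to decide later whether the OCR caveat should be added to the review prompt\.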

## Appendix BCorrelation with Human Quality Signals

### B\.1Sampling design

We randomly subsample within the selection criteria so that papers in each proxy minimally overlap\. This applies to both groups of the Conference label and to the rejected group of the Community label\. The other groups are selected deterministically by their ranking criterion\. By design, none of the four labels is a uniform sample of the conference papers\. Each group is a deliberately selected contrast meant to surface a quality signal\.

#### Quality\-proxy construction\.

All signals come from the SNOR records \(Semantic Scholar citation counts, OpenReview decisions, publication venue, and review scores\)\. We compute *citations\-per\-year* as the citation count divided by the number of years since the paper’s conference year\. The four proxies’ groups are then:

- Community: high: the top papers by citations\-per\-year \(deterministic, ties broken by forum id\); low \(“never published”\): papers that were not accepted, have ≥ 3 reviews, are ≥ 2 years past their conference year, and have no formal publication venue \(empty or arXiv\-only\), sampled at random\.
- Conference: high: papers whose normalized decision matches an award keyword \(Outstanding/Best/Oral/Spotlight\); low: not\-accepted papers with ≥ 3 reviews; both sampled at random\.
- Reviewer: the top and bottom papers by mean review score among papers with ≥ 3 reviews\.
- Composite: high: awarded *and* top\-cited *and* top\-scored; low: rejected *and* never\-published *and* bottom\-scored\.

Random samples use a fixed seed for reproducibility, and every group requires ≥ 3 official reviews \(our “substantive\-reviews” filter\)\.

#### Frontier subset cost\.

Running the two frontier models \(GPT\-5\.5 and Claude Opus 4\.7\) under the zero\-shot and OpenAIReview methods on the 80\-paper subset cost approximately \$204 in API charges \(\$177 for OpenAIReview and \$27 for zero\-shot\)\. Extending the frontier sweep to the full 240\-paper set would scale this cost roughly in proportion \(about threefold\), which is why we restrict frontier runs to the subset\.

#### Review accounting\.

The benchmark evaluates 197 unique papers × 4 efficient models × 3 methods = 2,364 reviews on the full 240\-paper set, plus 74 unique papers × 2 frontier models × 2 methods \(zero\-shot and OpenAIReview, no coarse\) = 296 reviews on the 80\-paper frontier subset, for 2,660 reviews in total\.
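The review counts above can be re\-derived with a few lines of arithmetic; the variable names are illustrative only\.

```python
# Full 240-paper set: unique papers x efficient models x methods.
full_set_reviews = 197 * 4 * 3

# 80-paper frontier subset: unique papers x frontier models x methods
# (zero-shot and OpenAIReview only).
frontier_reviews = 74 * 2 * 2

total_reviews = full_set_reviews + frontier_reviews
print(full_set_reviews, frontier_reviews, total_reviews)  # 2364 296 2660
```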

### B\.2Interpreting the pairwise accuracy

The main text \(Section [5](https://arxiv.org/html/2606.19749#S5)\) reports the *pairwise accuracy* \(equivalently, the Mann–Whitney AUC \(Mann and Whitney, [1947](https://arxiv.org/html/2606.19749#bib.bib24); Hanley and McNeil, [1982](https://arxiv.org/html/2606.19749#bib.bib25)\)\)\. Concretely, for each quality proxy we form every \(low\-group paper, high\-group paper\) pair and count it as a hit if the low\-group paper receives more comments than the high\-group one \(ties contribute 0\.5\)\. Pairwise accuracy is the hit fraction\. Aggregating across the four proxies gives 400 pairs in the frontier subset and 3,600 pairs in the full\-set appendix tables\.

Three intuitions help anchor the numbers:

- Probabilistic reading\. The pairwise accuracy is exactly the probability that, if you sample one low\-group paper and one high\-group paper uniformly at random within the same proxy, the low\-group paper will have received more comments, under random tiebreaking when the two papers have identical counts \(the 0\.5 contribution per tie is the expected outcome of that coin flip\)\. A pairwise accuracy of 0\.83 \(GPT\-5\.5\) means that this happens 83% of the time\.
- Chance baseline\. Under the null hypothesis that comment volume is independent of paper quality, the expected pairwise accuracy is 0\.5\. Values \> 0\.5 mean the system tracks the proxy direction\. Values < 0\.5 mean it inverts it\.
- Δ vs\. pairwise accuracy\. Δ measures the average gap between the two groups’ means and is sensitive to outlier papers with very high comment counts\. Pairwise accuracy measures the per\-pair ordering and is robust to those outliers\. The two metrics usually agree on direction but can disagree in magnitude\.
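A minimal sketch of the pairwise accuracy defined in this section, assuming each paper is represented by its comment count and that pairs are pooled across proxies by summing hits and pair counts\. The function names and input layout are illustrative, not taken from the benchmark code\.

```python
from itertools import product


def pairwise_hits(low_counts, high_counts):
    """Count hits over all (low, high) pairs; a tie contributes 0.5."""
    hits = 0.0
    for low, high in product(low_counts, high_counts):
        if low > high:
            hits += 1.0
        elif low == high:
            hits += 0.5
    return hits


def pairwise_accuracy(groups):
    """Pooled hit fraction over a list of (low_counts, high_counts) per proxy."""
    total_hits = sum(pairwise_hits(low, high) for low, high in groups)
    total_pairs = sum(len(low) * len(high) for low, high in groups)
    return total_hits / total_pairs


# One proxy, two papers per group: three strict hits and one tie out of four pairs.
print(pairwise_accuracy([([5, 3], [1, 3])]))  # 0.875
```

With 10 low\-group and 10 high\-group papers per proxy, four proxies yield the 400 pooled pairs quoted for the frontier subset\.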

### B\.3Confidence intervals for pairwise accuracy

All pairwise\-accuracy intervals reported in Section [5](https://arxiv.org/html/2606.19749#S5) are 95% intervals from a nonparametric *cluster bootstrap* over papers\. The same paper appears in many \(low, high\) pair comparisons across the four quality proxies \(and across the per\-severity tiers\), so those pairs are not independent draws of comments\. A naive interval that treated each pair as an independent Bernoulli trial would overstate precision\. For each \(method, model\) cell, each bootstrap draw works as follows: within each proxy we form a new high\-quality group by drawing papers *with replacement* from the original high\-quality papers \(so some papers appear two or three times and others not at all, but the group size stays the same\), and similarly form a new low\-quality group\. We then recompute the pooled AUC on these resampled groups\. Per\-severity tier AUCs \(Appendix [B\.5](https://arxiv.org/html/2606.19749#A2.SS5)\) reuse the same per\-bootstrap resample\. Repeating this B = 5000 times gives the bootstrap distribution of the AUC; we report its 2\.5th and 97\.5th percentiles\.
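The resampling procedure above can be sketched as follows, assuming each list element is one paper's comment count, so that resampling elements resamples papers\. The percentile lookup uses a simple index rule, which is an assumption; the function names are hypothetical\.

```python
import random


def pooled_accuracy(groups):
    """Pooled pairwise accuracy over (low_counts, high_counts) per proxy."""
    hits = 0.0
    pairs = 0
    for low_counts, high_counts in groups:
        for low in low_counts:
            for high in high_counts:
                hits += 1.0 if low > high else 0.5 if low == high else 0.0
                pairs += 1
    return hits / pairs


def cluster_bootstrap_ci(groups, n_boot=5000, seed=0):
    """95% percentile interval from resampling papers within each group."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        # Draw papers with replacement, keeping each group's size fixed.
        resampled = [
            (rng.choices(low, k=len(low)), rng.choices(high, k=len(high)))
            for low, high in groups
        ]
        stats.append(pooled_accuracy(resampled))
    stats.sort()
    lower = stats[int(0.025 * n_boot)]
    upper = stats[min(n_boot - 1, int(0.975 * n_boot))]
    return lower, upper
```

Because whole papers are resampled, every pair involving a drawn paper moves together, which is what keeps the interval from overstating precision\.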

### B\.4Results on the full 240\-paper set

The main\-text conference tables \(Section [5](https://arxiv.org/html/2606.19749#S5)\) restrict all six models to the 80\-paper frontier subset \(74 unique\)\. Tables [8](https://arxiv.org/html/2606.19749#A2.T8) and [9](https://arxiv.org/html/2606.19749#A2.T9) report the same breakdowns under OpenAIReview on the full 240\-paper set \(197 unique\)\.

Table 8: Under OpenAIReview, every efficient model and Reviewer3 emit more comments on low\-quality papers than high\-quality ones on the full 240\-paper set \(197 unique\), with Grok\-4\.1\-Fast attaining the highest pairwise accuracy \(0\.81\)\. Brackets are 95% bootstrap CIs over papers \(Appendix [B\.3](https://arxiv.org/html/2606.19749#A2.SS3)\)\.

#### The signal concentrates in the major and moderate tiers\.

Decomposing per\-model pairwise accuracy by severity tier \(Table [9](https://arxiv.org/html/2606.19749#A2.T9)\) shows that the quality signal concentrates in the major and moderate tiers\. The minor tier, omitted from the table for consistency with the per\-system tier breakdown in Appendix [B\.5](https://arxiv.org/html/2606.19749#A2.SS5), sits at or below chance for several of the efficient models \(DeepSeek 0\.47, Qwen 0\.48, Gemini 0\.34\)\.

Table 9: Under OpenAIReview, per\-tier pairwise accuracy concentrates in the major and moderate tiers on the full 240\-paper set \(197 unique\)\. The minor tier is omitted \(see prose\); $\bar{c}_{\rm l}$ is the mean number of comments on the low\-quality group; brackets are 95% bootstrap CIs over papers \(Appendix [B\.3](https://arxiv.org/html/2606.19749#A2.SS3)\)\.

### B\.5Severity\-tier pairwise accuracy

The main paper reports overall pairwise accuracy only\. We retain the per\-tier breakdown here as a robustness check, separately for each system \(zero\-shot, OpenAIReview, coarse\) across the six backbone models on the 74\-paper frontier subset\. We omit the minor tier because it is a much noisier quality signal than the major or moderate tiers\. Tables [10](https://arxiv.org/html/2606.19749#A2.T10), [11](https://arxiv.org/html/2606.19749#A2.T11), and [12](https://arxiv.org/html/2606.19749#A2.T12) show that the rank order of models is largely preserved across tiers, but the Major and Moderate tiers do not always agree on the per\-model ordering\. Bracketed quantities are 95% bootstrap CIs over papers \(Appendix [B\.3](https://arxiv.org/html/2606.19749#A2.SS3)\)\.

Table 10: Under zero\-shot on the frontier subset, DeepSeek\-V4\-Flash leads overall \(0\.80\) and Gemini\-3\.1\-Flash\-Lite has the highest moderate\-tier accuracy \(0\.87\)\. $\bar{c}_{\rm l}$ is the mean number of comments on the low\-quality group\.

Table 11: Under OpenAIReview on the frontier subset, GPT\-5\.5 leads every tier \(Major 0\.78, Moderate 0\.91, Overall 0\.83\)\. $\bar{c}_{\rm l}$ is the mean number of comments on the low\-quality group\.

Table 12: Under coarse on the frontier subset, Grok\-4\.1\-Fast attains the best overall \(0\.66\) and moderate\-tier \(0\.71\) accuracy\. GPT\-5\.5 and Claude Opus 4\.7 were not run on coarse\. $\bar{c}_{\rm l}$ is the mean number of comments on the low\-quality group\.
### B\.6OpenAIReview: raw vs\. consolidated output

OpenAIReview emits two comment lists: the *raw* pre\-consolidation output \(one comment per passage, before deduplication\) and the *consolidated* output \(after the post\-hoc consolidation step that merges duplicates and re\-tiers comments\)\. The main text uses the raw output throughout\. Here we report the comparison\. On DeepSeek and Qwen, the consolidated lists average 9\.3 and 10\.1 comments per paper, respectively, against 13\.3 and 12\.6 raw, roughly a 20–30% reduction\. Per\-proxy Δ values shift modestly: on DeepSeek, raw \{3\.57, 0\.55, 1\.80, 3\.10\} vs\. consolidated \{2\.34, 0\.93, 1\.70, 3\.13\} \(community, conference, reviewer, and composite\)\. On Qwen, raw \{3\.55, 1\.97, 0\.80, 0\.67\} vs\. consolidated \{1\.63, 0\.07, 0\.57, 0\.33\}, with consolidation tending to reduce the gap, especially on Qwen\.

## Appendix CPerturbation Benchmark

### C\.1Full perturbation taxonomy with examples

Table [13](https://arxiv.org/html/2606.19749#A3.T13) summarizes the taxonomy of injected perturbations, spanning surface\-level edits and higher\-level claim, logical, and experimental errors\.

| Category | Subtype | Example |
| --- | --- | --- |
| Surface | Operator / sign | $+ \to -$, $\leq \to \geq$, $\cup \to \cap$ |
| | Index / subscript | $x_{i} \to x_{i+1}$, $A^{n} \to A^{n-1}$ |
| | Numeric | $0.5 \to 0.25$, $n=100 \to n=10$ |
| | Computation | $2+3=3$; $\frac{1-\rho}{\rho} \to \frac{\rho}{1-\rho}$ |
| Claim | False theoretical claim | Dropping a domain restriction \(“If $f$ is continuous on a *compact* set, then $f$ is bounded” $\to$ “If $f$ is continuous, then $f$ is bounded”\) |
| | False empirical claim | Inverting a standard empirical fact \(“Lower mean squared error \(MSE\) indicates better predictive performance” $\to$ “Higher mean squared error \(MSE\) indicates better predictive performance\.”\) |
| Logic | Circular reasoning | The proof concludes that $f$ is injective because it has an inverse, but the existence of the inverse is justified by assuming $f$ is injective |
| | Invalid implication | If $ab=0$, then $a=0$ |
| | Induction error | Skipping the inductive step or using an incorrect base case |
| | Missing case | The argument considers $x>0$ and concludes the result holds for all $x$, ignoring the case $x \leq 0$ |
| Experimental | Reversed causality | “Increasing the sample size reduces the variance of the estimator” $\to$ “Lower variance in the estimator causes an increase in sample size” |
| | Misinterpretation of results | “A $p$\-value of $0.20$ provides strong evidence against the null hypothesis” |
| | P\-hacking | “We repeated the experiment under different random seeds and report the run with the strongest performance\.” |

Table 13: Taxonomy of injected perturbations with examples\. Surface errors are local math\-token edits\. Claim, reasoning, and experimental errors are paragraph\-level edits that introduce a more abstract error\. Examples are loosely based on perturbations in our benchmark\.
### C\.2Error\-type to span mapping

Table [14](https://arxiv.org/html/2606.19749#A3.T14) defines the span types used in the perturbation pipeline and the corresponding eligible error categories\.

Table 14: Span types and the error categories admissible for each\. Math\-token edits \(Surface\) target equations\. Prose\-level edits \(Claim, Logic, Experimental\) target spans appropriate to the paper type\.
### C\.3Pipeline\-stage implementation details

The main\-text Section [6](https://arxiv.org/html/2606.19749#S6) summarizes the five\-stage pipeline \(extract, generate, validate, verify, inject\)\. We give the per\-stage details here\.

#### Extract\.

A deterministic LaTeX\-source scanner finds places where errors can be injected\. For theoretical papers, it extracts equations, definitions/theorems, and proofs\. For empirical papers, it extracts equations, paragraphs that make claims or arguments, and paragraphs describing experimental setup or results\. These candidate spans are mapped to allowed perturbations according to Table [14](https://arxiv.org/html/2606.19749#A3.T14)\.

#### Generate\.

A generator LLM is first prompted with the abstract to identify the paper’s field and write a short list of plausible field\-specific errors, which is appended to the main generation prompt as guidance\. Each candidate’s input record includes the span’s LaTeX text, its type \(display/inline/named equation, definition/theorem, proof, or paragraph\), a ±200\-character window of surrounding prose for context, and the set of error subtypes compatible with that span type\. The generator is shown all candidates and selects a subset, writing one perturbation per chosen candidate\. For each, it emits \(i\) the specific error label \(one of the admissible subtypes\), \(ii\) the replacement LaTeX text, \(iii\) a short explanation of how a careful reader could verify the error from the paper alone, and \(iv\) optionally, a verbatim quote from elsewhere in the paper that the perturbation directly contradicts\. Candidates are batched and the generator is asked to aim for 20 valid perturbations per paper\. We use Gemini\-3 Flash Preview as the perturbation generator\.

#### Validate\.

A structural validation step rejects perturbations where \(i\) the perturbed text is identical to the original, \(ii\) the span overlaps with an already\-accepted perturbation, or \(iii\) the replacement would create garbled LaTeX at the span boundaries\. Duplicate perturbations on the same span are also dropped here\.
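A rough sketch of the structural checks, assuming each candidate is a dict with character offsets `start`/`end` \(end exclusive\) plus `original` and `replacement` strings\. The text does not specify how garbled LaTeX is detected, so a crude brace\-balance comparison stands in for check \(iii\) here; that stand\-in and the record layout are assumptions\.

```python
def brace_balance(text):
    """Net count of unmatched '{' (positive) or '}' (negative)."""
    return text.count("{") - text.count("}")


def validate(candidates):
    """Keep candidates that pass three structural checks, in input order."""
    accepted = []
    for cand in candidates:
        # (i) The perturbation must actually change the text.
        if cand["replacement"] == cand["original"]:
            continue
        # (ii) The span must not overlap an already-accepted span; this also
        # drops duplicate perturbations on the same span.
        if any(cand["start"] < a["end"] and a["start"] < cand["end"] for a in accepted):
            continue
        # (iii) Stand-in for the garbled-LaTeX check (assumption): the edit
        # must not change the span's brace balance.
        if brace_balance(cand["replacement"]) != brace_balance(cand["original"]):
            continue
        accepted.append(cand)
    return accepted
```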

#### Verify\.

Surviving perturbations next pass through a *checklist verifier* that filters out edits too local to constitute a substantive error or those that are technically not errors\. To do this, we find other passages in the paper related to the perturbed span and check whether the perturbation contradicts them\. Related passages are computed by tokenizing the span \(LaTeX commands, scripted identifiers like $W_{ij}$ or $x_{t+1}$, variable assignments, capitalized noun phrases\) and searching the rest of the document for hits, returning up to five ±100\-character snippets around the matches\. A deterministic precheck first rejects structurally typo\-shaped edits without any model call, such as a dummy\-variable rename or a symbol swap whose replacement letter is bound nowhere in the paper\. The remaining perturbations from one run \(one paper and one error family\) are then judged in a single batched call, with the verifier instructed to judge each perturbation independently\. Each perturbation’s record contains the original span, the perturbed span, the error label, a ±200\-character window of surrounding prose, and the span’s related passages\. For each perturbation, the verifier answers a 4\-item yes/no checklist and returns a short verbatim quote from the perturbed text pinpointing the issue\. We use Claude Sonnet 4\.6 as the verifier\.

The answer pattern maps deterministically to one of three verdicts:

- if item 1 = N *or* item 4 = Y ⇒ *typo\-shaped* \(reject\)\.
- else if item 2 = N ⇒ *not\-an\-error* \(reject\)\.
- else if item 3 = N ⇒ *not\-an\-error* \(reject\)\.
- else ⇒ *substantive* \(keep\)\.
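The verdict rules above translate directly into a small function\. Each argument is the verifier's yes/no answer to the corresponding checklist item, encoded as a boolean; the function name and the string labels are illustrative\.

```python
def verdict(item1, item2, item3, item4):
    """Map four yes/no checklist answers (True = Y) to one of three verdicts."""
    if not item1 or item4:
        return "typo-shaped"   # reject: malformed or cosmetic edit
    if not item2:
        return "not-an-error"  # reject: no evidence to judge against
    if not item3:
        return "not-an-error"  # reject: evidence does not contradict the edit
    return "substantive"       # keep


print(verdict(True, True, True, False))  # substantive
```

Only the answer pattern Y, Y, Y, N is kept; every other combination is rejected\.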

Item 1 and item 4 are local well\-formedness / cosmetic\-edit checks that pre\-empt typo\-shaped artifacts\. Item 2 verifies that the available evidence pins down the original content\. Item 3 asks whether the perturbation breaks something concrete that the evidence relies on\. The checklist is shared across error families except item 3, whose contradiction criterion is specific to each family \(Surface, Reasoning, Claim, Empirical\)\. The items are:

1. *Well\-formed:* Is the perturbed span well\-formed when read in isolation \(no mixed\-direction inequalities, broken sandwich expressions, operator salad, type/unit mismatches, garbled grammar, unbound symbols, or definitions with undefined symbols, missing required quantifiers, or mismatched arity\)?
2. *Evidence available:* Is there a concrete basis to judge the perturbation against: either the surrounding context or related passages establish something specific about the same object \(a stated value, definition, applied theorem, named result, or downstream use of the same quantity\), or the perturbed text alone introduces a self\-evidencing methodological flaw \(e\.g\., post\-hoc selection, a removed multiple\-testing correction, treating a p\-value as the probability of the null\)? Mentioning the topic abstractly does not count\.
3. *Contradiction confirmed:* Does the perturbation contradict that evidence, judged per error family? Surface: the perturbed value, symbol, or operator now disagrees with how the same quantity appears elsewhere\. Reasoning: the perturbed step breaks the chain of inference \(a missing case whose conclusion is not forced by the remaining cases, an induction whose base case is now wrong or whose step no longer reduces $n+1$ to $n$, a step that now invokes the very claim being proved, or a reversed/dropped implication\)\. Claim: the altered definition or theorem no longer supports an application of it visible in the evidence, e\.g\., weakened/strengthened quantifiers \($\forall \leftrightarrow \exists$\) or a dropped/reversed hypothesis the application uses\. Empirical: the perturbed claim disagrees with the established content \(a number/dataset/method contradiction, a misread of what a result means, a flipped causal direction\) or introduces a methodological flaw the original explicitly avoided\.
4. *Typo\-shaped or cosmetic:* Is the change typo\-shaped or cosmetically equivalent regardless of evidence \(a bare symbol swap whose replacement letter is not bound anywhere in the paper, a re\-indexing that leaves the expression algebraically unchanged such as the dummy\-variable rename $\sum_{i} \to \sum_{j}$, synonym swaps, reordering equivalent clauses, renaming bound variables, or hedging tweaks that do not flip the conclusion\)?

#### Inject\.

The previous stages produce a list of approved \(span, replacement\) records but leave the source paper untouched\. Injection applies all surviving edits to the LaTeX source\. Because a replacement may change the length of the text, perturbations are sorted by offset in descending order and applied right\-to\-left: each replacement shifts only text that lies after every perturbation still to be applied, so the remaining \(lower\-offset\) perturbations keep the offsets recorded during extraction without any recalculation\. The result is a single corrupted paper carrying all of its accepted perturbations\.
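The right\-to\-left application can be sketched as below, assuming each edit is a `(start, end, replacement)` tuple with non\-overlapping, end\-exclusive character offsets into the original source\. The tuple layout and function name are assumptions for illustration\.

```python
def inject(source, edits):
    """Apply non-overlapping (start, end, replacement) edits right-to-left.

    Processing the highest offset first means each splice only moves text
    that lies after all edits still to be applied, so their recorded
    offsets remain valid.
    """
    for start, end, replacement in sorted(edits, key=lambda e: e[0], reverse=True):
        source = source[:start] + replacement + source[end:]
    return source


# The second edit lengthens the text, yet the first edit's offsets still hold.
print(inject("a + b = c", [(2, 3, "-"), (6, 7, "\\neq")]))  # a - b \neq c
```

Applying the same edits left\-to\-right would require adding each replacement's length change to every later offset\.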

#### Manual validation\.

Beyond the automated validate/verify filters, one of the authors manually audited a stratified sample of 40 kept perturbations to confirm they are genuine, well\-formed errors\. The sample takes 10 perturbations per error type \(Surface, Claim, Logic, Experimental\), drawn so that every subtype appears at least once and all eight domains are represented, and spread across papers \(fixed random seed\)\. For each perturbation the annotator was shown the original span, the injected replacement, the generator’s why\-wrong explanation, and the verifier’s contradicting evidence, and labeled it a *valid error*, *not an error*, or *ambiguous*\. We judge 33/40 \(82\.5%\) to be valid, 2 \(5%\) not true errors, and 5 \(12\.5%\) ambiguous, with per\-type valid rates of Surface 8/10, Claim 7/10, Logic 9/10, and Experimental 9/10\. The non\-valid and ambiguous cases concentrate in two subtypes\. \(i\) *Surface\-numeric* edits where the extracted span bounds only part of a number, so injection can produce a malformed expression, e\.g\., replacing the “$\times$” token of “$5\times 10^{-6}$” yields the duplicated exponent “$5\times 10^{-12}10^{-6}$” rather than a clean magnitude change\. \(ii\) *Empirical\-claim* edits whose contradiction with the rest of the paper is not concretely pinned down by an identifiable passage, so the injected statement is unsupported rather than provably wrong\. These cases motivate the larger expert audit we leave to future work \(Section [Limitations](https://arxiv.org/html/2606.19749#Sx1)\)\.

### C\.4Generator span selection vs\. random selection

Because the generator chooses which candidate spans to perturb, the injected errors could be biased toward spans the model prefers, relative to random selection from the same candidates\. Since extraction is deterministic, we can reconstruct the exact candidate pool each generation run saw and compare the chosen spans against that pool; under random selection the two distributions would match\. We run this audit over all 222 generation runs \(74 papers, one run per error family\)\. Reconstruction reproduces every pool exactly, and all injected perturbations map back to their source spans verbatim\. Selection is measured at the earliest recorded stage, after structural validation, which retains 3,577 of 3,670 generated perturbations \(97\.5%\)\.

#### Selection freedom is limited\.

In 96 of 222 runs \(43%\), the generator perturbed every candidate it was shown, leaving no room for selection bias; the analysis below uses the remaining runs\. Moreover, candidates are presented in batches of 10 that are nearly always homogeneous in span type, with a fixed perturbation target per batch, so the allocation of errors across span types is largely set by the pipeline rather than chosen by the model\.

#### Where the generator’s preferences show\.

For equation spans, selection rate rises with span length, from 22% in the shortest within\-run quartile to 44% in the longest, so the generator favors substantial equations over short inline math\. Named equation environments are selected somewhat more often than inline math \(40% vs\. 33%; rate difference 0\.08, 95% CI \[0\.03, 0\.13\], bootstrap over papers\)\. Equations later in the paper are mildly favored \(29% in the earliest within\-run quartile vs\. 43% in the latest\)\. Among the admissible equation subtypes, the generator prefers operator/sign edits \(37%\) over index/subscript \(27%\), numeric \(23%\), and computation \(13%\) edits\. We find no position\-in\-prompt bias: a within\-batch permutation test shows no preference for candidates listed earlier or later in the prompt \(p ≥ 0\.18 for every error family\)\.

### C\.5Corpus details

The benchmark draws on 74 recent arXiv papers spanning eight subject classes: cs\.CC, cs\.LG, econ\.EM, hep\-ex, math\.\* \(covering math\.AG, math\.CO, math\.NT, math\.PR, and math\.ST\), physics\.atm\-clus, q\-bio\.GN, and stat\.AP\. We aim for 10 papers per class and accept 5 from physics\.atm\-clus and 9 from math\.\* due to availability and LaTeX compilation constraints\. The perturbation pipeline yields 29–60 retained perturbations per paper \(median 46\) for a total of 3,365 injected edits after validation, verification, and deduplication\. Per\-cell denominators in the recall tables \(Section [6](https://arxiv.org/html/2606.19749#S6), Appendix [C\.8](https://arxiv.org/html/2606.19749#A3.SS8)\) are smaller than these raw counts because they restrict to the \(model, method\) cells that completed scoring\. We report the exact scored counts in each table cell\.

Table 15: The corpus is balanced across domains, with reasoning perturbations concentrated in cs\.CC and math\.\* and experimental perturbations in the empirical sciences\. Counts are retained perturbations after validation, verification, and deduplication\. “Claim” aggregates false theoretical claims \(perturbed definitions and theorem statements\) and false empirical claims \(perturbed prose statements\)\. “Reasoning” errors are injected only into proofs, so they appear only in the proof\-heavy classes\.

### C\.6 Scoring details

The substring\-match stage normalizes whitespace and capitalization and requires that the perturbed text cover at least 75% of the comment’s quoted span \(or vice versa\)\. The LLM judge is Gemini\-3 Flash Preview\. For each comment passing the substring stage it rates the explanation against the ground truth on a 1–5 scale, and any rating $\geq 3$ counts as a detection\. A perturbation is counted as detected if any emitted comment passes both stages\.
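The two\-stage rule above can be sketched as follows\. This is a minimal illustration rather than the paper's scoring code: the coverage measure \(longest common contiguous substring divided by the length of either string\) is an assumption, since the text does not spell out how "cover" is computed, and the function names `normalize`, `coverage`, and `is_detected` are hypothetical\. The judge rating is passed in as a plain integer instead of being obtained from an LLM call\.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase and collapse runs of whitespace, as in the substring stage."""
    return " ".join(text.lower().split())


def coverage(perturbed: str, quote: str) -> float:
    """Share of either string covered by their longest common substring.

    Assumption: "cover" is measured with the longest contiguous match;
    the paper does not state the exact measure.
    """
    a, b = normalize(perturbed), normalize(quote)
    if not a or not b:
        return 0.0
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    # "or vice versa": take the better of the two directions.
    return max(match.size / len(a), match.size / len(b))


def is_detected(perturbed, comments, tau=0.75, min_rating=3):
    """comments: iterable of (quoted_span, judge_rating) for one perturbation.

    A perturbation counts as detected if any comment passes both stages.
    """
    return any(
        coverage(perturbed, quote) >= tau and rating >= min_rating
        for quote, rating in comments
    )
```

For example, `is_detected("x = y + 1", [("x = y + 1", 3)])` holds, while the same quote with a judge rating of 2 does not count as a detection\.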

#### Threshold sensitivity\.

Detection recall is nearly flat around the chosen threshold of 0\.75: moving the coverage threshold $\tau$ anywhere in $[0.6, 0.9]$ changes recall by at most one detection out of 58 \(25 detections for $\tau \leq 0.75$, 24 above\), so results are not sensitive to the threshold \(Table [16](https://arxiv.org/html/2606.19749#A3.T16)\)\. Lowering $\tau$ to 0\.5 admits roughly 3× more candidate pairs into the LLM\-judge stage but recovers only two additional detections, suggesting the looser threshold mainly admits noisy near\-matches\. Raising $\tau$ above 0\.9 begins to reject otherwise\-valid detections where the reviewer paraphrases the perturbed string\. The sweep is run on the math domain with Claude\-Opus\-4\.7 as the reviewer \(58 perturbations across the OpenAIReview and zero\-shot methods\) and uses Gemini\-3 Flash Preview as the explanation judge, with judge results cached per \(perturbation, comment\) pair so each pair is judged once and re\-aggregated at every threshold\.
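Because judge ratings are cached per pair, the sweep reduces to re\-filtering and re\-aggregating the same records at each threshold\. A minimal sketch, assuming each cached record is a `(perturbation_id, coverage, judge_rating)` tuple and that every pair passing the loosest threshold has already been judged; the function name `sweep` and the record layout are hypothetical:

```python
def sweep(pairs, thresholds, min_rating=3):
    """Re-aggregate cached (perturbation_id, coverage, judge_rating) records.

    Returns, per threshold, the number of pairs passing the coverage stage
    and the number of distinct perturbations detected.
    """
    results = {}
    for tau in thresholds:
        passing = [p for p in pairs if p[1] >= tau]
        detected = {pid for pid, _, rating in passing if rating >= min_rating}
        results[tau] = {"pairs_passing": len(passing), "detected": len(detected)}
    return results


# Toy records for illustration only; these are not values from the paper.
toy_pairs = [("p1", 0.95, 4), ("p1", 0.55, 2), ("p2", 0.70, 3), ("p3", 0.80, 1)]
toy_result = sweep(toy_pairs, [0.5, 0.75, 0.9])
```

Keeping the judge call out of the loop means the threshold can be varied freely without changing which ratings were issued\.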

Table 16: Detection recall changes by at most one detection \(of 58\) anywhere in $\tau \in [0.6, 0.9]$, supporting the 0\.75 default\. “Pairs passing” counts \(perturbation, comment\) pairs whose normalized coverage is at least $\tau$\. “Detected” counts perturbations for which at least one such pair also passes the LLM judge\. Sweep slice: math papers reviewed by Claude\-Opus\-4\.7, $n = 58$ injected perturbations\.

### C\.7 Confidence intervals for recall

All recall intervals reported in Section [6](https://arxiv.org/html/2606.19749#S6) are 95% intervals from a nonparametric *cluster bootstrap* over papers\. Two perturbations injected into the same paper share the paper’s writing style, domain, and exposition, so a reviewer that catches one is more likely to catch the others; they are not statistically independent draws\. A naive interval that treated each perturbation as an independent Bernoulli trial would overstate precision\. For each \(method, model\) cell, each bootstrap draw works as follows: we form a new set of papers by drawing *with replacement* from the scored papers \(so some papers appear two or three times and others not at all, but the total number of papers stays the same\), and recompute the pooled recall $\sum_{p}\mathrm{detected}_{p}/\sum_{p}\mathrm{injected}_{p}$ on that resample\. Repeating $B = 5000$ times gives the bootstrap distribution; we report its 2\.5th and 97\.5th percentiles\. Intervals are correspondingly wide for cells backed by few papers \(e\.g\., the Reasoning category, present only in cs\.CC and math\)\.
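The resampling procedure described above can be sketched in a few lines\. This is an illustrative version under stated assumptions: the input is a list of per\-paper `(detected, injected)` counts, percentiles use a simple nearest\-rank rule \(library routines may interpolate\), and the name `cluster_bootstrap_ci` and the fixed seed are hypothetical choices, not the authors' code\.

```python
import random


def cluster_bootstrap_ci(papers, n_boot=5000, alpha=0.05, seed=0):
    """papers: list of (detected, injected) counts, one tuple per paper.

    Resamples whole papers with replacement and recomputes pooled recall.
    Returns (point_estimate, lower, upper). Assumes every paper has at
    least one injected perturbation, so the denominator is never zero.
    """
    rng = random.Random(seed)
    n = len(papers)

    def pooled(sample):
        return sum(det for det, _ in sample) / sum(inj for _, inj in sample)

    stats = sorted(
        pooled([papers[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    )
    # Nearest-rank percentiles of the bootstrap distribution.
    lower = stats[int((alpha / 2) * (n_boot - 1))]
    upper = stats[int((1 - alpha / 2) * (n_boot - 1))]
    return pooled(papers), lower, upper
```

Resampling at the paper level keeps all perturbations from one paper together, which is what lets the interval reflect within\-paper correlation\.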

### C\.8 Perturbation benchmark results

The main\-text recall tables in Section [6](https://arxiv.org/html/2606.19749#S6) all restrict to the 24\-paper frontier subset, so that the frontier\-model rows \(Claude\-Opus\-4\.7 and GPT\-5\.5\) and the four efficient\-model rows share the same papers\. Tables [18](https://arxiv.org/html/2606.19749#A3.T18), [19](https://arxiv.org/html/2606.19749#A3.T19), and [20](https://arxiv.org/html/2606.19749#A3.T20) below report parallel breakdowns on the full 74\-paper benchmark, aggregated across the four efficient models that ran on every paper, as a robustness check on the frontier\-subset main\-text view\.

#### OpenAIReview outperforms zero\-shot and coarse on nearly all domains\.

Table [17](https://arxiv.org/html/2606.19749#A3.T17) reports recall by paper domain on the 24\-paper frontier subset, using each system’s best\-performing backend \(GPT\-5\.5 for zero\-shot and OpenAIReview, DeepSeek\-V4\-Flash for coarse\)\. The ordering OpenAIReview \> zero\-shot \> coarse holds in every domain except Math, where zero\-shot with the same GPT\-5\.5 backend leads OpenAIReview \(79\.3% vs\. 62\.1%\)\. Empirical\-science domains \(Econometrics 91\.8%, ML 84\.7%, Atomic Phys\. 72\.0%, HEP 71\.8%\) reach the highest OpenAIReview recall, while the theoretical Complexity domain lags at 45\.7%\.

Table 17: OpenAIReview tops every arXiv domain except Math, where zero\-shot with the same backend leads\. Per\-domain columns ordered by OpenAIReview recall \(descending\)\. All cells use the 24\-paper frontier subset \(where GPT\-5\.5 and Claude\-Opus\-4\.7 were also run\), with each system’s best\-performing backend\. Reviewer3 has no model selector, and a few of its runs on Econometrics, HEP, and Genomics did not complete scoring\.

#### Recall on the full set preserves the main\-text ordering\.

Aggregated across the four efficient models on all 74 papers, the system ordering OpenAIReview \> zero\-shot \> coarse holds without exception: by model \(Table [18](https://arxiv.org/html/2606.19749#A3.T18)\), by error type \(Table [19](https://arxiv.org/html/2606.19749#A3.T19)\), and by domain \(Table [20](https://arxiv.org/html/2606.19749#A3.T20)\)\. OpenAIReview improves on zero\-shot for every efficient backend, with the largest margin on DeepSeek\-V4\-Flash \(29\.4% → 55\.4%\), and coarse trails zero\-shot in every cell, mirroring the frontier\-subset result in the main text\. The error\-type and domain breakdowns echo the main\-text findings: OpenAIReview’s gains are largest on the prose\-level categories \(claim 48\.6%, reasoning 54\.4%, experimental 48\.5% overall\) and on the empirical\-science domains \(cs\.LG 54\.0%, econ\.EM 53\.2%\), while the theoretical cs\.CC domain remains the hardest \(32\.7%\)\. Absolute recall is lower than in the main\-text tables because the full set omits the two frontier models, on which every system scores highest\.

Table 18: OpenAIReview beats zero\-shot and coarse on every efficient backend on the full 74\-paper benchmark\.

Table 19: OpenAIReview wins on every error type on the full 74\-paper benchmark; the prose\-level categories \(Claim, Reasoning, Experimental\) show the largest absolute gains over zero\-shot\. Each cell pools the four efficient backends \(Grok\-4\.1\-Fast, DeepSeek\-V4\-Flash, Qwen3\.6\-35B\-A3B, and Gemini\-3\.1\-Flash\-Lite\) under that system; the frontier models are excluded\. Denominators differ across systems because only runs that completed scoring are counted\.

Table 20: OpenAIReview wins on every domain on the full 74\-paper benchmark, with the strongest leads on cs\.LG \(54\.0%\) and econ\.EM \(53\.2%\); cs\.CC remains the hardest \(32\.7%\)\.

## Appendix D Review Analysis

### D\.1 Embedding and clustering details

We characterize what each method talks about by clustering comment text in a shared embedding space\. We embed every comment with the sentence\-transformers/all\-MiniLM\-L6\-v2 encoder \(384\-d, default pooling\) and run k\-means with $k = 10$ and random\_state=42 on the raw embeddings\. For interpretability we fit a TF\-IDF vectorizer \(max\_features=10000, English stop\-words\) on the same texts and label each cluster with its top\-15 average\-TF\-IDF terms\. We additionally surface the five comments closest to each centroid as exemplars\. Two of the authors then merge the 10 raw clusters into 5 interpretable groups by inspecting keywords and exemplars\. Each comment is concatenated as title \+ " " \+ explanation before embedding\.

We run two clustering passes\. The first \(Appendix [D\.3](https://arxiv.org/html/2606.19749#A4.SS3)\) clusters coarse and OpenAIReview comments from the efficient models over all 197 papers, totalling 12,704 comments \(7,526 coarse \+ 5,178 OpenAIReview\)\. The second \(Appendix [D\.4](https://arxiv.org/html/2606.19749#A4.SS4)\) adds zero\-shot, totalling 14,456 comments \(7,526 \+ 5,178 \+ 1,752\)\. Per\-paper comment volumes are uneven across models and methods: coarse averages 18\.1 comments/paper for DeepSeek and 14\.7 for Qwen but only 5\.4 for Gemini, while OpenAIReview is more uniform \(6\.0–10\.1 comments/paper across models\)\. This volume gap matters when reading the overlap statistics below, because Jaccard scales with set size\.

### D\.2 Across\-systems paragraph overlap

![Refer to caption](https://arxiv.org/html/2606.19749v1/plots/venn_three_systems.png)
Figure 9: Across systems on the perturbation benchmark frontier subset, using each system’s best backend \(coarse / DeepSeek\-V4\-Flash, OpenAIReview / GPT\-5\.5, Reviewer3\): the three systems target largely disjoint paragraphs \(3\-way Jaccard 0\.046\)\. Values are average unique paragraphs per paper in each region\.

#### The three AI systems target different paragraphs \(Figure [9](https://arxiv.org/html/2606.19749#A4.F9)\)\.

Across all papers in the frontier subset and all error types, the three\-way intersection averages only 1\.16 paragraphs per paper out of $\approx 25$ paragraphs touched by at least one system \(3\-way Jaccard 0\.046\)\. OpenAIReview and Reviewer3 overlap the most among pairs \(4\.3 shared paragraphs per paper\), while coarse contributes only 1\.0 paragraph per paper that no other system raises\. In terms of volume, OpenAIReview emits the most comments, accounting for 69% of all comments vs\. 13% for coarse and 19% for Reviewer3, and once normalized against this baseline no system shows a clear specialization in its comment content \(Table [23](https://arxiv.org/html/2606.19749#A4.T23)\)\. Looking at the type of comments, the systems mostly agree on surface notation and formal\-math issues but diverge on the higher\-level categories: coarse leans toward claim\-related critiques and Reviewer3 toward experimental and statistical issues, while OpenAIReview spreads more evenly across categories\.
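For reference, the overlap statistic used throughout this appendix is the size of the intersection divided by the size of the union of the flagged\-paragraph sets\. The sketch below is illustrative only: the function names are hypothetical, paragraphs are represented by arbitrary hashable ids, and averaging per\-paper values is one possible aggregation \(the text states per\-paper averaging for some tables but not for every figure\)\.

```python
def jaccard(*sets):
    """|intersection| / |union| over any number of sets; 0.0 if all are empty."""
    union = set().union(*sets)
    if not union:
        return 0.0
    return len(set.intersection(*map(set, sets))) / len(union)


def mean_jaccard(per_paper):
    """per_paper: list of tuples of paragraph-id sets, one tuple per paper."""
    return sum(jaccard(*sets) for sets in per_paper) / len(per_paper)
```

The statistic is bounded by the size imbalance between sets: if one system flags 20 paragraphs in a paper and another flags 5, their pairwise Jaccard cannot exceed 5/20 = 0\.25 even when the smaller set is fully contained in the larger one\.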

![Refer to caption](https://arxiv.org/html/2606.19749v1/plots/venn_union_models.png)
Figure 10: Across systems on the quality\-proxy papers, aggregating each system over its backbone models \(coarse and OpenAIReview each unioned over the three efficient models run on the full set, while Reviewer3 has no model selector\)\. Unioning over models enlarges every region relative to the best\-model view \(Figure [9](https://arxiv.org/html/2606.19749#A4.F9)\) but the systems still occupy largely distinct paragraphs \(3\-way Jaccard 0\.053\)\. Values are average unique paragraphs per paper in each region\.

#### Aggregating over all models per system \(Figure [10](https://arxiv.org/html/2606.19749#A4.F10)\)\.

Pinning one best model per system understates each system’s footprint\. Taking the union of paragraphs over each system’s backbone models \(the three efficient models each for coarse and OpenAIReview on the quality\-proxy papers, with Reviewer3 unchanged\) enlarges every region but leaves the qualitative picture intact: coarse and OpenAIReview share the largest pairwise region \(6\.8 paragraphs/paper\) yet the three\-way intersection is still only 1\.6 paragraphs/paper \(3\-way Jaccard 0\.053\), confirming that the systems address complementary slices of each paper rather than converging as models are added\.

### D\.3 coarse and OpenAIReview

The two methods comment on largely disjoint sets of paragraphs: average per\-paper Jaccard ranges from 0\.106 \(Gemini\) to 0\.164 \(Qwen\), and intersection sizes are small in absolute terms \(Table [21](https://arxiv.org/html/2606.19749#A4.T21)\)\. DeepSeek and Qwen flag many coarse\-only paragraphs \(10\.5 and 8\.9 per paper\) because they are the highest\-volume coarse models\. Even so, the shared region is at most 2\.9 paragraphs per paper\. Across all three models, fewer than 20% of flagged paragraphs are shared, indicating that coarse and OpenAIReview consistently address different issues\.

Table 21: coarse and OpenAIReview flag largely disjoint paragraphs across all three efficient models\. Per\-paper averages over 197 papers\. “coarse only” and “OpenAIReview only” count paragraphs flagged by exactly one method, “Both” counts the intersection, and Jaccard is averaged across papers\.

### D\.4 coarse, OpenAIReview, and zero\-shot

Adding zero\-shot leaves the picture essentially unchanged: three\-way agreement is rare and zero\-shot contributes almost nothing that the other two methods do not already cover \(Table [22](https://arxiv.org/html/2606.19749#A4.T22)\)\. The three\-way intersection peaks at 0\.58 paragraphs per paper for DeepSeek\. Zero\-shot’s volume is small to begin with \(1\.0–3\.3 paragraphs/paper\) and most of it lands outside the coarse $\cap$ OpenAIReview region, so the three\-way Jaccard collapses below 0\.03 for every model\. The largest pairwise intersection that excludes zero\-shot \(coarse $\cap$ OpenAIReview only\) is 1\.8–2\.4 paragraphs per paper for the high\-volume DeepSeek and Qwen, mirroring the two\-way result above\. Together with Appendix [D\.3](https://arxiv.org/html/2606.19749#A4.SS3), this confirms that the three methods address complementary slices of each paper rather than rediscovering the same issues\.

Table 22: Three\-way paragraph\-level overlap is sparse: zero\-shot adds little beyond what coarse and OpenAIReview already cover, and the three\-way intersection never exceeds 0\.6 paragraphs/paper\. Per\-paper averages over 197 papers\. C = coarse, O = OpenAIReview, Z = zero\-shot\. “C only” counts paragraphs flagged only by coarse, “C $\cap$ O only” counts paragraphs flagged by both coarse and OpenAIReview but not zero\-shot, etc\.

### D\.5 Cluster breakdowns referenced from Section [7](https://arxiv.org/html/2606.19749#S7)

Tables [23](https://arxiv.org/html/2606.19749#A4.T23) and [24](https://arxiv.org/html/2606.19749#A4.T24) give the cluster\-share breakdowns referenced from the results discussion in Section [7](https://arxiv.org/html/2606.19749#S7); the humans\-vs\-AI breakdown \(Table [5](https://arxiv.org/html/2606.19749#S7.T5)\) is in the main text\.

Table 23: On the perturbation frontier subset \(50 corrupted papers, one per source paper and error family\), OpenAIReview emits the majority of comments in every cluster, staying within a few points of its overall share \(69%\)\. No system shows a sharp topic specialization\. Each row sums to $\sim 100\%$ across the three systems\. The overall comment\-volume baselines are 13/69/19 \(coarse / OpenAIReview / Reviewer3\)\.

Table 24: Relative to the 33/67 Claude/GPT overall baseline, Claude over\-indexes sharply on Surface\-Level \(64%\) and GPT over\-indexes on Claims and Experimental \(79%, 78%\)\. Formal math and tables/figures track the baseline\. Each row sums to 100% across the two models\.

### D\.6 Example human and AI comments by overlap region

To illustrate the categorical split reported in Table [5](https://arxiv.org/html/2606.19749#S7.T5), Table [25](https://arxiv.org/html/2606.19749#A4.T25) shows representative comments in three regions: paragraphs flagged by both humans and at least one AI system, paragraphs flagged only by humans, and paragraphs flagged only by AI\. The intersection and AI\-only examples are drawn from one paper \(ICLR 2021, “Autoregressive Entity Retrieval”\)\. The human\-only examples are drawn from two other ICLR 2021 papers in our set, chosen so that the comments fall in the *paper\-level / cross\-cutting* cluster \(the cluster where humans most over\-index against the 39% baseline, see Table [5](https://arxiv.org/html/2606.19749#S7.T5)\)\. Even in the intersection region, humans and AI tend to comment on different aspects of the same paragraph: humans raise broader concerns about scope or methodology, while AI surfaces local notation, claim, or terminology issues\.

Table 25: Representative human and AI comments in three overlap regions\. Intersection rows pair each human comment with the AI comment grounded to the same paragraph\. In both cases, the AI comment addresses a local notation or grammar issue while the human raises a broader methodological concern\. Human\-only comments are drawn from the *paper\-level / cross\-cutting* cluster \(novelty, contributions, related\-work positioning\)\. AI\-only comments are substantive but localized \(cold\-start overclaim, terminology error\)\. Intersection and AI\-only rows are from “Autoregressive Entity Retrieval” \(ICLR 2021\)\. Human\-only rows are from “Unpacking Information Bottlenecks” and “Neural Time\-Dependent PDE” \(both ICLR 2021\)\. Excerpts are lightly abridged for length\.

### D\.7 Quality\-proxy comment overlap

The main text reports comment\-overlap analyses on the perturbation benchmark \(Section [7](https://arxiv.org/html/2606.19749#S7)\)\. We also ran the same analyses on the quality\-proxy papers \(70 ICLR/NeurIPS papers from Section [5](https://arxiv.org/html/2606.19749#S5)\)\. The patterns there are different and worth recording for completeness\. The two main differences are: \(i\) each system contributes far more evenly to total comment volume on this corpus \(overall baseline 47/33/20 for coarse / OpenAIReview / Reviewer3, versus 13/69/19 on the perturbation subset\), and \(ii\) the three systems show clearer topic specialization \(Table [26](https://arxiv.org/html/2606.19749#A4.T26)\)\. coarse is over\-represented on concrete paragraph\-localized issues \(notation, tables/figures, experimental setup\), while OpenAIReview is concentrated on argument\-level critiques \(claims, formal math\) and Reviewer3 on experimental and statistical\-reporting issues\. The across\-models comparison under OpenAIReview also flips: on the quality\-proxy papers, Claude, GPT\-5\.5, and the union of the efficient models cover largely distinct paragraphs \(3\-way Jaccard 0\.101, Figure [12](https://arxiv.org/html/2606.19749#A4.F12)\), whereas on perturbed papers they overlap heavily \(3\-way Jaccard 0\.316, cf\. the main\-text models Venn, Figure [4\(b\)](https://arxiv.org/html/2606.19749#S7.F4.sf2)\)\.

![Refer to caption](https://arxiv.org/html/2606.19749v1/plots/venn_three_systems_outcomes.png)
Figure 11: Three\-system paragraph\-level overlap on the quality\-proxy papers \(70 ICLR/NeurIPS papers from Section [5](https://arxiv.org/html/2606.19749#S5)\); 3\-way Jaccard 0\.046\. Compare to Figure [9](https://arxiv.org/html/2606.19749#A4.F9) \(App\. [D\.2](https://arxiv.org/html/2606.19749#A4.SS2)\)\. Values are average unique paragraphs per paper in each region\.

![Refer to caption](https://arxiv.org/html/2606.19749v1/plots/venn_claude_gpt_efficient_outcomes.png)
Figure 12: Claude Opus 4\.7, GPT\-5\.5, and the union of the efficient models under OpenAIReview, paragraph\-level overlap on the quality\-proxy papers \(70 ICLR/NeurIPS papers from Section [5](https://arxiv.org/html/2606.19749#S5)\); 3\-way Jaccard 0\.101\. Compare to Figure [4\(b\)](https://arxiv.org/html/2606.19749#S7.F4.sf2) \(main text\), which shows the same three sets on the perturbation benchmark \(3\-way Jaccard 0\.316\)\. Values are average unique paragraphs per paper in each region\.

Table 26: Quality\-proxy version of Table [23](https://arxiv.org/html/2606.19749#A4.T23): coarse is broad and concentrated on concrete paragraph\-localized issues\. OpenAIReview leans toward argument\-level critiques \(claims, formal math\)\. Reviewer3 is narrowly focused on experimental and statistical issues\. Each row sums to $\sim 100\%$ across the three systems\. The baseline volumes are 47/33/20\. Cluster definitions are the same as Table [23](https://arxiv.org/html/2606.19749#A4.T23)\. Only the underlying corpus differs \(quality\-proxy papers from Section [5](https://arxiv.org/html/2606.19749#S5), 70 papers\)\.

## Appendix E User Feedback: Downvote Error Analysis

This appendix gives the method and validation behind the downvote breakdown in Table [7](https://arxiv.org/html/2606.19749#S8.T7) \(Section [8](https://arxiv.org/html/2606.19749#S8)\)\. We categorized all 283 downvoted comments from the deployment with Gemini 3 Flash at temperature 0, providing each comment together with the capped paper text the reviewer model saw, so that the judge has the same context the model had \(Figure [13](https://arxiv.org/html/2606.19749#A5.F13)\)\. The labels capture the judge’s reading of the likely problem under our taxonomy; we never observe the user’s actual reason for downvoting\. One paper is an outlier in this set: a single upload drew 41 of the 283 downvotes, far more than any other, because one user downvoted heavily\. That one user could skew the breakdown, so we recompute it with that paper removed\. The ranking is unchanged, and false positives remain the largest category at 33\.9%\.

#### Validation\.

We manually checked a stratified sample of 26 of these comments, covering all six categories, by reading each comment against its paper\. We agreed with the judge on 23\. All three disagreements sit on one boundary: whether a correct comment is trivial or substantive\. The judge filed one substantive catch as a false positive, and promoted two trivial points to substantive catches\. So the broad split is reliable: the dominant modes \(false positives, nitpicks, and over\-asking\) are robust, but the line between a substantive catch and a minor one is noisy\. We therefore read the correct\-comment share as an upper bound on how often a downvote lands on a genuinely useful comment\.

You are auditing comments produced by an automated paper\-review model\. Each comment below was DOWNVOTED by a real user\. Your job is to assign each downvoted comment to EXACTLY ONE category from this taxonomy, judging ONLY against the paper text provided \(that text is exactly what the review model itself saw; do not assume anything beyond it\)\.

TAXONOMY \(assign exactly one integer category per comment\):

1\. Parsing/OCR artifact \-\- the comment is about a garbled glyph/typesetting error from PDF conversion with NO content meaning \(broken headers, image\-insertion markers like img\-0\.jpeg, mangled line breaks\)\. This is pure rendering noise\. \(NOT a transcription error that changes a number/name/citation \-\- that is category 5\.\)

2\. False positive \(model error\) \-\- the model flagged a non\-issue, or its own reasoning is wrong or "imprecise but actually fine\." The flagged thing is NOT a real defect\.

3\. Trivial nitpick \-\- valid but minor: notation/caption consistency, checklist items, stylistic/formatting points\.

4\. Underspecified/deferred \-\- a complaint that something is too vague/underspecified, or that is actually addressed in an appendix/elsewhere the model didn’t account for \(reproducibility/completeness complaints\)\.

5\. Good catch, user disagreed \-\- the comment is correct and substantive: a real content error worth fixing \(arithmetic\-vs\-table mismatch, wrong citation year/author, misleading aggregate, a real bug\)\. A transcription/rendering error that CHANGES a number, name, or citation that the comment correctly flags belongs HERE\.

6\. Other/unclear \-\- does not fit the above, or genuinely ambiguous\.

DECISION RULES:

\(a\) Pure glyph noise with no semantic content \-\> 1; but a rendering error that changes a number/name/citation that the comment correctly flags \-\> 5\.

\(b\) False\-positive vs nitpick vs good\-catch turns on one question: is the flagged thing a REAL defect? No \-\> 2\. Yes but minor \-\> 3\. Yes and substantive \-\> 5\.

\(c\) Judge ONLY against the provided paper text\.

OUTPUT FORMAT \-\- return STRICT JSON and nothing else: a JSON array, one object per comment, each object exactly:

\{"comment\_id": "<the id string\>", "category": <int 1\-6\>, "justification": "<one short sentence\>"\}

Include every comment\_id given, exactly once\. No markdown, no prose outside the JSON array\.

<the paper text\>

<the downvoted comments\>

Figure 13: The prompt used to classify each downvoted comment into the error taxonomy\. Angle\-bracketed fields are filled in per review: the paper text \(the reviewer’s capped paragraphs\) and the downvoted comments to classify \(one JSON object per comment\)\.
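Because the prompt demands strict JSON with every comment\_id exactly once, the judge's reply can be checked mechanically before tallying\. The sketch below is one way to do that, not the authors' pipeline: the function names `parse_judge_output` and `category_shares` are hypothetical, and the short category names are abbreviations of the taxonomy in the prompt\.

```python
import json
from collections import Counter

# Short names for the six taxonomy categories in the prompt.
CATEGORIES = {
    1: "Parsing/OCR artifact",
    2: "False positive",
    3: "Trivial nitpick",
    4: "Underspecified/deferred",
    5: "Good catch, user disagreed",
    6: "Other/unclear",
}
REQUIRED_KEYS = {"comment_id", "category", "justification"}


def parse_judge_output(raw, expected_ids):
    """Validate the judge's JSON array and return {comment_id: category}."""
    items = json.loads(raw)
    if not isinstance(items, list):
        raise ValueError("expected a JSON array")
    labels = {}
    for item in items:
        if not isinstance(item, dict) or set(item) != REQUIRED_KEYS:
            raise ValueError(f"malformed object: {item!r}")
        cid, cat = item["comment_id"], item["category"]
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(cat, bool) or not isinstance(cat, int) or cat not in CATEGORIES:
            raise ValueError(f"invalid category for {cid}: {cat!r}")
        if cid in labels:
            raise ValueError(f"duplicate comment_id: {cid}")
        labels[cid] = cat
    if set(labels) != set(expected_ids):
        raise ValueError("comment_ids do not match the request")
    return labels


def category_shares(labels):
    """Fraction of labelled comments per category name."""
    counts = Counter(labels.values())
    return {CATEGORIES[cat]: n / len(labels) for cat, n in counts.items()}
```

Recomputing the breakdown without an outlier paper then amounts to dropping that paper's comment ids from the label dictionary before calling `category_shares`\.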

## Appendix F Artifacts, Licenses, and Compute

### F\.1 Licenses and terms of use

Table [27](https://arxiv.org/html/2606.19749#A6.T27) lists the license or terms of use for every model and dataset artifact we use\. The reviewer backbones and the two pipeline LLMs \(Gemini\-3\-Flash\-Preview as generator/judge, Claude\-Sonnet\-4\.6 as verifier\) split into closed, API\-only models \(governed by their providers’ commercial/API terms, with no released weights\) and open\-weight models under permissive licenses\. The clustering encoder all\-MiniLM\-L6\-v2 is Apache\-2\.0\. Our corpora derive from the SNOR dataset \(CC\-BY\-4\.0\) and from arXiv LaTeX sources, which carry each paper’s individual arXiv license \(most under arXiv’s default perpetual, non\-exclusive license, which permits research use but restricts redistribution by third parties\)\.

Table 27: Licenses / terms of use for the models and datasets used\. Closed models are accessed only through their providers’ APIs under the listed commercial terms; open\-weight models and datasets are under the listed licenses\.

### F\.2 Consistency with intended use

Our use of every artifact is consistent with its intended use\. The closed models are accessed through their official APIs for research evaluation, which the providers’ commercial/API terms permit; the open\-weight models and all\-MiniLM\-L6\-v2 are used under permissive licenses \(MIT, Apache\-2\.0\) that allow research use; and SNOR is used under CC\-BY\-4\.0 with attribution\. The perturbation corpus is built from publicly available arXiv LaTeX sources accessed for research\. The artifact we create \(the perturbation benchmark\) is intended for research use only\. Because it derives from data accessed for research \(arXiv sources and OpenReview\-derived signals via SNOR\), we restrict it to research contexts and, to respect the default arXiv license, intend to release the injected edits and metadata keyed by arXiv identifier rather than redistributing full paper text\.

### F\.3 Model sizes, infrastructure, and cost

The closed API models \(GPT\-5\.5, Claude Opus 4\.7 / Sonnet 4\.6, Gemini 3\.1 Flash\-Lite / 3 Flash, Grok\-4\.1\-Fast\) do not have publicly disclosed parameter counts\. Among the open\-weight models, Qwen3\.6\-35B\-A3B is a mixture\-of\-experts model with 35B total \(∼3B active\) parameters, DeepSeek\-V4\-Flash is an open\-weight mixture\-of\-experts model \(parameter count per its model card\), and all\-MiniLM\-L6\-v2 has 22\.7M parameters\. We did not train or fine\-tune any model: all reviewer systems were run through hosted inference APIs \(OpenRouter and the providers’ own endpoints\), so our compute budget is API inference rather than local GPU\-hours\. The only local computation is sentence embedding and k\-means clustering for the comment analysis, which runs in minutes on a single machine\.

Table [28](https://arxiv.org/html/2606.19749#A6.T28) reports the API cost of the *evaluation* \(review\) runs, totalling roughly $2,500 and 353M tokens\. Benchmark construction \(perturbation generation, verification, and scoring\) and the comment\-clustering analysis incur additional, smaller API costs not included here\.

Table 28: Approximate API cost and token usage of the evaluation \(review\) runs, from logged per\-call costs \(OpenRouter / provider billing\)\. Excludes benchmark\-construction and clustering costs\.