LLM-Based Scientific Peer Review: Methods, Benchmarks, and Reliability Challenges

arXiv cs.CL Papers

Summary

This survey provides a systems-level analysis of LLM-based scientific peer review, covering methods, benchmarks, and reliability challenges including robustness risks like prompt injection and data poisoning.

arXiv:2606.25057v1 Announce Type: new

Cached at: 06/25/26, 05:09 AM

# LLM-Based Scientific Peer Review: Methods, Benchmarks, and Reliability Challenges
Source: [https://arxiv.org/html/2606.25057](https://arxiv.org/html/2606.25057)

Thi Huyen Nguyen and Zahra Ahmadi \([0000\-0003\-1110\-4756](https://orcid.org/0000-0003-1110-4756)\), Peter L\. Reichertz Institute for Medical Informatics of TU Braunschweig and Hannover Medical School; Lower Saxony Center for AI and Causal Methods in Medicine \(CAIMed\), Hannover, Germany \([ahmadi\.zahra@mh\-hannover\.de](https://arxiv.org/html/2606.25057v1/mailto:[email protected])\)

\(23 June 2026\)

###### Abstract\.

The rapid growth of scientific submissions has pushed traditional peer review toward its scalability limits, motivating the exploration of large language models \(LLMs\) as intelligent automated evaluation assistants\. Although recent studies show that LLMs can generate fluent critiques and approximate reviewer scores, their reliability, robustness, and security as decision\-support systems remain insufficiently understood\. This survey offers a systems\-level analysis of LLM\-based scientific peer review, focusing on two core evaluative functions: critique generation and score prediction\. We present a structured taxonomy of modeling approaches \(including prompt\-based, supervised, retrieval\-augmented, and alignment\-optimized approaches\), and synthesize empirical findings across existing benchmarks\. We analyze dataset constraints, evaluation shortcomings, and domain concentration biases that limit current assessment practices\. Beyond performance metrics, we identify emerging robustness risks, including prompt injection, data poisoning, retrieval vulnerabilities, and reward hacking, which expose automated review pipelines to strategic manipulation\. From a data mining perspective, we outline key open challenges in modeling subjective disagreement and cross\-domain generalization\. By reframing automated peer review as a high\-stakes, multi\-objective decision problem, this survey provides a roadmap for developing robust, transparent, and trustworthy AI\-assisted scientific evaluation systems\.

## 1\.Introduction

Peer review is the primary quality\-control mechanism of scholarly publishing, but its effectiveness increasingly depends on a review workforce that has not scaled with submission volume\. Reviewers are expected to provide structured critiques and quantitative recommendations that assess the novelty, soundness, significance, and broader contribution of submitted work\. As submission volumes grow, however, this human\-centered evaluation process faces mounting pressure from reviewer shortages, subjective judgments, compressed timelines, and limited scalability\. These pressures are amplified by the rapid growth of scientific output\. Scientific publications are estimated to double roughly every decade, whereas the global population of scientists grows by only 21% over the same period \(Künzli et al\., [2022](https://arxiv.org/html/2606.25057#bib.bib23)\)\. Statistics compiled by Paper Copilot \(Paper Copilot, [2026](https://arxiv.org/html/2606.25057#bib.bib98)\) show that leading computer science conferences such as ICLR, NeurIPS, and ICML have doubled, or more than doubled, their annual submission counts within five years \(2021–2025\), as shown in Figure [1](https://arxiv.org/html/2606.25057#S1.F1)\. As reviewer workload increases, evaluation timelines often shrink, potentially increasing variability in critique depth, score calibration, and review reliability\. These challenges have motivated growing interest in AI\-assisted peer review systems\.

![Refer to caption](https://arxiv.org/html/2606.25057v1/x1.png)

Figure 1\. Annual submission counts for three conferences in computer science\.

![Refer to caption](https://arxiv.org/html/2606.25057v1/x2.png)

Figure 2\. Peer review pipeline\.

Early work applied Natural Language Processing \(NLP\) techniques \(Price and Flach, [2017](https://arxiv.org/html/2606.25057#bib.bib1); Li et al\., [2019](https://arxiv.org/html/2606.25057#bib.bib3); Wang and Tan, [2020](https://arxiv.org/html/2606.25057#bib.bib2); Weng et al\., [2025](https://arxiv.org/html/2606.25057#bib.bib4)\) to support or automate specific stages of the review pipeline\. With the emergence of LLMs such as GPT\-4 \(Achiam et al\., [2023](https://arxiv.org/html/2606.25057#bib.bib5)\), LLaMA \(Touvron et al\., [2023](https://arxiv.org/html/2606.25057#bib.bib6)\), and Gemini \(Team et al\., [2023](https://arxiv.org/html/2606.25057#bib.bib7)\), automated peer review has shifted from feature\-based prediction and limited summarization toward full review generation and score estimation\. Recent studies \(Yu et al\., [2024a](https://arxiv.org/html/2606.25057#bib.bib10); Idahl and Ahmadi, [2025](https://arxiv.org/html/2606.25057#bib.bib11)\) show that LLMs can generate fluent reviewer\-like critiques and approximate human scoring patterns\. Consequently, LLM\-based systems are increasingly being investigated for automating two fundamental components of peer review reports: \(1\) textual critique generation and \(2\) quantitative score prediction\. These two components are central to editorial and program\-committee decision\-making\. Critiques articulate structured assessments of a manuscript’s strengths and weaknesses, while scores translate these assessments into quantifiable recommendations\. Automating these evaluative functions could shift parts of peer review toward a more scalable and data\-driven decision pipeline\. At the same time, such automation raises fundamental questions about reliability, bias, calibration, and security\.

Despite rapidly growing interest in LLM\-assisted reviewing, existing surveys \(Kuznetsov et al\., [2024](https://arxiv.org/html/2606.25057#bib.bib9); Zhuang et al\., [2025](https://arxiv.org/html/2606.25057#bib.bib8); Luo et al\., [2025](https://arxiv.org/html/2606.25057#bib.bib24)\) do not provide a focused and systematic analysis of automated critique generation and score prediction\. Kuznetsov et al\. \(Kuznetsov et al\., [2024](https://arxiv.org/html/2606.25057#bib.bib9)\) offer a high\-level overview of AI assistance across different stages of peer review, spanning activities before, during, and after the evaluation process\. Luo et al\. \(Luo et al\., [2025](https://arxiv.org/html/2606.25057#bib.bib24)\) survey LLM applications in scientific research more broadly, with peer review discussed briefly as one of many tasks\. Zhuang et al\. \(Zhuang et al\., [2025](https://arxiv.org/html/2606.25057#bib.bib8)\) concentrate more directly on automated scholarly review and relevant datasets, emphasizing the potential of LLMs to alleviate technical bottlenecks\. However, these surveys do not provide a structured taxonomy of critique\-generation and score\-prediction methodologies, nor do they analyze evaluation limitations and robustness concerns in depth\.

In contrast, this survey focuses specifically on LLM\-based scientific critique generation and score prediction as the core evaluative components of peer review reports\. We organize existing work into a structured taxonomy of modeling approaches, synthesize empirical findings across studies, examine dataset and evaluation challenges, and analyze emerging risks in automated scoring pipelines\. By treating critique generation and score prediction as decision\-critical tasks, we offer a systems\-level perspective that complements broader surveys on AI\-assisted peer review and LLMs for scientific workflows\. Specifically, we aim to answer four key questions:

1. How reliable are LLM\-based score predictions relative to human reviewers?
2. To what extent can LLMs generate substantive scientific critiques?
3. How robust are existing systems under current data and evaluation limitations?
4. What challenges must be addressed to ensure reliable and secure deployment?

From a modeling perspective, automated peer review can be understood as a high\-stakes, multi\-objective decision\-making problem under noisy supervision\. Peer review datasets contain inconsistent decision signals due to inter\-reviewer disagreement, acceptance bias, and distribution shift across venues and disciplines\. We therefore frame LLM\-based peer review not only as a text generation problem, but also as a structured data mining problem involving trade\-offs among quality, fairness, calibration, uncertainty, and robustness\.

## 2\.Scientific Peer Review

The peer review process evaluates scientific manuscripts through assessments by domain experts\. Each submission is typically assigned to one or more reviewers, who evaluate multiple aspects such as clarity, technical correctness, novelty, and potential impact\. Reviewers are often guided by venue\-specific templates that request a summary, strengths and weaknesses, questions for authors, and a preliminary recommendation, such as acceptance or rejection\. High\-quality reviews support both editorial decision\-making and manuscript improvement; for example, 91% of researchers report that peer review improved their last publication \(Mulligan et al\., [2013](https://arxiv.org/html/2606.25057#bib.bib13)\)\.

Scientific peer review can be abstracted as an evaluative decision pipeline, as illustrated in Figure [2](https://arxiv.org/html/2606.25057#S1.F2)\. Although procedures vary across disciplines and publication venues, most peer review systems produce two central outputs: structured textual critique and quantitative score assignment\. The textual critique describes the manuscript’s strengths and weaknesses across dimensions such as clarity, novelty, technical quality, and significance, while the scoring component converts these qualitative judgments into operational signals, including dimension\-specific scores, overall recommendation scores, accept/reject decisions, and reviewer confidence ratings\.

From a computational perspective, critique generation and score prediction can be formulated as related modeling tasks\. Let $x$ denote a representation of the manuscript\. The critique function can be viewed as a conditional generation mapping:

$$f_{c}(x) \rightarrow y_{critique}.$$

Meanwhile, the scoring function corresponds to a regression or classification mapping:

$$f_{s}(x) \rightarrow y_{score},$$

where $y_{critique}$ and $y_{score}$ denote the structured evaluative feedback and numerical evaluation, respectively\.

From a machine learning perspective, both functions operate under weak supervision and label uncertainty\. Reviewer scores are not deterministic ground\-truth labels; rather, they are stochastic realizations influenced by subjective interpretation, reviewer expertise, and venue\-specific criteria\. The target is therefore not a single “correct” score but a distribution of plausible judgments\.
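One way to make this distributional view concrete is to turn the scores that several reviewers assign to one manuscript into an empirical distribution instead of a single label\. The sketch below is illustrative only: the 1 to 10 rating scale, the example scores, and the function name `score_distribution` are assumptions and are not taken from any benchmark discussed in this survey\.

```python
from collections import Counter
from statistics import mean, pstdev


def score_distribution(scores, scale=range(1, 11)):
    """Return the empirical probability of each score on the rating scale."""
    if not scores:
        raise ValueError("at least one reviewer score is required")
    if any(s not in scale for s in scores):
        raise ValueError("every score must lie on the rating scale")
    counts = Counter(scores)
    total = len(scores)
    return {s: counts.get(s, 0) / total for s in scale}


# Hypothetical scores from four reviewers of the same manuscript.
reviewer_scores = [3, 6, 6, 8]

dist = score_distribution(reviewer_scores)
soft_label = {s: p for s, p in dist.items() if p > 0}
avg_score = mean(reviewer_scores)
disagreement = pstdev(reviewer_scores)  # population standard deviation

print("soft label:", soft_label)
print("mean score:", avg_score)
print("disagreement (std):", round(disagreement, 3))
```

A model trained against `soft_label` rather than `avg_score` keeps the information that reviewers disagreed, which a single averaged target discards\.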

Peer review, however, differs from standard supervised learning settings in several important respects\. First, evaluation is inherently multidimensional\. Criteria such as novelty, technical soundness, clarity, and impact may be weighted differently across venues and may evolve over time\. Second, inter\-reviewer disagreement is common\. Multiple reviewers evaluating the same manuscript often produce divergent critiques and assign substantially different scores\. This variability raises fundamental questions about the target of prediction: should automated systems predict individual reviewer scores, aggregate scores, or final editorial decisions? Third, the consequences of prediction errors are high\. As Figure [2](https://arxiv.org/html/2606.25057#S1.F2) highlights, critique and scoring play a central role in the decision process\. Errors or biases at this stage can propagate directly into editorial aggregation and final publication outcomes\. Consequently, analyzing automated systems for critique generation and score prediction requires careful attention not only to generative quality but also to reliability, calibration, robustness, and security\.

These properties make automated peer review substantially more complex than fluent text generation or score approximation alone\. Reliable systems must preserve alignment between critique content and numerical recommendations, represent uncertainty arising from reviewer variability, and remain robust under domain shift and potential adversarial manipulation\. These structural characteristics motivate the systematic analysis of LLM\-based critique generation and score prediction systems developed in the following sections\.

## 3\.LLMs as Reviewers

![Refer to caption](https://arxiv.org/html/2606.25057v1/x3.png)

Figure 3\. Fraction of reviews detected as LLM\-generated by year\.

![Refer to caption](https://arxiv.org/html/2606.25057v1/x4.png)

Figure 4\. Taxonomy of automated peer review generation, categorized by different aspects\.

Before the emergence of large language models, research on automated peer review primarily focused on score prediction and limited summarization rather than full critique generation \(Kang et al\., [2018](https://arxiv.org/html/2606.25057#bib.bib27); Stappen et al\., [2020](https://arxiv.org/html/2606.25057#bib.bib31); Dycke et al\., [2023](https://arxiv.org/html/2606.25057#bib.bib28)\)\. Predicting numerical recommendations from paper features was considered more tractable than generating structured evaluative feedback, which requires domain knowledge, contextual understanding, and coherent reasoning\. Early supervised approaches \(Yuan et al\., [2022](https://arxiv.org/html/2606.25057#bib.bib16); Lin et al\., [2023](https://arxiv.org/html/2606.25057#bib.bib29); Yuan and Liu, [2022](https://arxiv.org/html/2606.25057#bib.bib60)\) attempted to generate review text directly from manuscript representations, but the resulting reviews were often shallow, fragmented, or template\-like\. These limitations reflected the difficulty of modeling peer review as judgment\-oriented reasoning rather than surface\-level text generation\.

The rapid advancement of LLMs has substantially shifted this landscape\. Recent studies \(Latona et al\., [2024](https://arxiv.org/html/2606.25057#bib.bib54); Yu et al\., [2024c](https://arxiv.org/html/2606.25057#bib.bib15)\) indicate a growing trend in the use of LLMs for review writing\. Since the emergence of ChatGPT \(OpenAI, [2023](https://arxiv.org/html/2606.25057#bib.bib14)\), the proportion of ICLR reviews flagged as AI\-generated has increased sharply \(Yu et al\., [2024c](https://arxiv.org/html/2606.25057#bib.bib15)\), as shown in Figure [3](https://arxiv.org/html/2606.25057#S3.F3)\. At least 15\.8% of ICLR 2024 reviews were detected as being written with AI assistance \(Latona et al\., [2024](https://arxiv.org/html/2606.25057#bib.bib54)\)\. However, since AI\-text detectors are imperfect, these estimates should be interpreted as approximate indicators rather than definitive measurements of AI use\. AI\-generated reviews may offer value across multiple dimensions \(Tyser et al\., [2024](https://arxiv.org/html/2606.25057#bib.bib20)\)\. For authors, they can provide early, actionable feedback before submission and support manuscript revisions\. For reviewers, they may serve as reference material for improving review quality\. For journals and conferences, such tools can support quality control and potentially accelerate parts of the peer review workflow\. In addition, AI\-generated assessments may eventually support reading prioritization, although their reliability for identifying high\-quality papers remains uncertain\.

Owing to large\-scale pretraining on diverse corpora, LLMs demonstrate strong capabilities in long\-form generation, instruction following, and reasoning\-style prompting\. LLM\-based systems often formulate peer review as a general text\-generation task: given a manuscript, generate a complete review report\. In many such systems, critique generation and score prediction are not explicitly modeled as separate functions\. Instead, models directly produce free\-form review text that may implicitly include strengths, weaknesses, and recommendations\.

A growing body of work has examined the capabilities and limitations of LLM\-based review report generation\. Figure [4](https://arxiv.org/html/2606.25057#S3.F4) provides a taxonomy of existing work, categorized along several dimensions\. Existing approaches can be broadly grouped into four paradigms: prompt\-based systems, fine\-tuned systems, retrieval\-augmented systems, and alignment\-optimized systems\.

### 3\.1\.Prompt\-based LLMs

Prompt\-based systems are the most accessible form of LLM\-assisted reviewing, but their flexibility comes at the cost of sensitivity to instructions and weak score calibration\. In automated peer review generation, prompt\-based approaches allow general\-purpose LLMs, such as GPT\-4 \(Achiam et al\., [2023](https://arxiv.org/html/2606.25057#bib.bib5)\), LLaMA \(Touvron et al\., [2023](https://arxiv.org/html/2606.25057#bib.bib6)\), and Gemini \(Team et al\., [2023](https://arxiv.org/html/2606.25057#bib.bib7)\), to be guided by carefully constructed instructions that simulate human reviewer behavior\. Rather than explicitly training models on peer review datasets, these systems rely on instruction\-following capabilities acquired during large\-scale pretraining\. Prompts typically ask the model to assess novelty, technical soundness, clarity, strengths and weaknesses, and to provide an overall recommendation score\.

The appeal of prompt\-based approaches lies in their flexibility and scalability \(Liu et al\., [2023](https://arxiv.org/html/2606.25057#bib.bib57)\)\. Generating high\-quality reviews goes far beyond simple summarization; it requires critiques and evaluations across multiple dimensions, including novelty, significance, methodology, and clarity\. Prompting enables LLMs to generate outputs that follow predefined templates, criteria, or word limits, making such approaches particularly attractive when large annotated review datasets are unavailable\. This low\-cost adaptability has driven widespread experimentation with prompt\-based peer review systems\.

Prompting strategies\. Two main prompting strategies are commonly used: zero\-shot prompting \(Kojima et al\., [2022](https://arxiv.org/html/2606.25057#bib.bib58)\), where the model receives a single instruction and relies on pretraining to generate an output, and few\-shot prompting \(Min et al\., [2022](https://arxiv.org/html/2606.25057#bib.bib59)\), where a small set of input\-output examples is provided to guide the model’s response\. Depending on their design, prompts for LLM\-based peer review can take several forms, such as criterion\-based, section\-based, style\-guided, review\-score paired, or chain\-of\-thought formats\.

Beyond these basic paradigms, prompt designs for peer review tasks vary considerably:

- Criterion\-based: Explicitly request evaluation across multiple dimensions\.
- Style\-guided: Instruct the model to follow venue\-specific tone or templates\.
- Review–score paired: Require both textual feedback and numerical ratings\.
- Section\-based: Focus on specific manuscript components\.
- Chain\-of\-thought: Encourage step\-by\-step reasoning before final judgment\.

Table [1](https://arxiv.org/html/2606.25057#S3.T1) illustrates different prompt designs, examples, strengths, and limitations\. These designs implicitly shape how the model approximates the critique function $f_{c}(x)$ and, when applicable, the scoring function $f_{s}(x)$\. For example, review–score paired prompts attempt to jointly elicit textual reasoning and quantitative recommendations, whereas section\-based prompts decouple local evaluation from global assessment\.
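The prompt designs and the zero\-shot versus few\-shot distinction described above can be sketched as plain string templates\. This is a minimal illustration, not a prompt set used by any cited system: the template wording, the 1 to 10 score range, the dictionary `PROMPT_TEMPLATES`, and the helper `build_prompt` are all assumptions made for the example\.

```python
# Hypothetical templates, one per prompt design; the wording is illustrative.
PROMPT_TEMPLATES = {
    "criterion": "Evaluate the paper on these aspects: {criteria}.\n\nPaper:\n{paper}",
    "style": "Write a constructive, formal review following {venue} guidelines.\n\nPaper:\n{paper}",
    "review_score": (
        "Write a review and give a score from 1 to 10 for each of: {criteria}."
        "\n\nPaper:\n{paper}"
    ),
    "section": "Review the {section} section for strengths and limitations.\n\nSection text:\n{paper}",
    "chain_of_thought": (
        "Step 1: What is the paper's main claim? "
        "Step 2: Is the claim supported? "
        "Step 3: Evaluate the strengths and weaknesses of the methodology."
        "\n\nPaper:\n{paper}"
    ),
}


def build_prompt(design, paper, examples=(), **fields):
    """Build a zero-shot prompt, or a few-shot prompt if `examples` is given.

    `examples` is a sequence of (paper_text, review_text) pairs that are
    prepended as demonstrations before the actual instruction.
    """
    instruction = PROMPT_TEMPLATES[design].format(paper=paper, **fields)
    shots = "".join(
        f"Example paper:\n{p}\nExample review:\n{r}\n\n" for p, r in examples
    )
    return shots + instruction


zero_shot = build_prompt("criterion", "PAPER TEXT", criteria="novelty, clarity")
few_shot = build_prompt(
    "criterion",
    "PAPER TEXT",
    examples=[("An older paper.", "A human-written review.")],
    criteria="novelty, clarity",
)
print(few_shot)
```

The only difference between the two calls is the `examples` argument, which mirrors how few\-shot prompting adds demonstrations without changing the instruction itself\.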

| Prompt type | Description | Prompt example | Strength | Limitation |
| --- | --- | --- | --- | --- |
| Criterion\-based \(Markhasin, [2025](https://arxiv.org/html/2606.25057#bib.bib71)\) | Prompts LLMs to write a review across multiple dimensions such as novelty, clarity, etc\. | Evaluate the given paper on aspects: novelty, significance, clarity, and quality | Generate structured reviews | May lead to shallow, checklist\-style reviews |
| Style\-guided \(Liang et al\., [2024](https://arxiv.org/html/2606.25057#bib.bib46)\) | Instructs LLMs to follow specific tone, review template, or style | Write a constructive, formal review following ICLR guidelines | Generate context\-appropriate reviews | May constrain creativity or critical depth, reduce content diversity or critical honesty |
| Review\-score pair \(Saad et al\., [2024](https://arxiv.org/html/2606.25057#bib.bib61)\) | Requires both numeric rating and review comments | Give a review and score evaluation on the following aspects: novelty, clarity, significance, quality | Simulate human\-like reviews | Inconsistent score\-text alignment |
| Section\-based \(D’Arcy et al\., [2024](https://arxiv.org/html/2606.25057#bib.bib70)\) | Focuses prompts on paper sections \(e\.g\., abstracts, methodology\) | Review methodology section for strengths and limitations | Enable fine\-grained evaluation | Misses global context of the paper |
| Chain\-of\-thought \(Stahl et al\., [2024](https://arxiv.org/html/2606.25057#bib.bib66)\) | Instructs the model to reason step\-by\-step | Step 1: what is the paper’s main claim? Step 2: is the claim supported? Step 3: evaluate the strengths and weaknesses of the methodology | Improve factuality, coherence, and reasoning depth | Increase output length, slower generation |

Table 1\. Prompt\-based approaches categorized by prompt design\.

Findings\. Early studies \(Robertson, [2023](https://arxiv.org/html/2606.25057#bib.bib65); Liang et al\., [2024](https://arxiv.org/html/2606.25057#bib.bib46)\) provided empirical evidence that GPT\-4, even with zero\-shot prompting, can meaningfully contribute to the peer review process\. Similarly, Biswas et al\. \(Biswas et al\., [2023](https://arxiv.org/html/2606.25057#bib.bib62)\) found that ChatGPT can generate consistent evaluations and helpful feedback, but it struggles with contextual understanding and subjective interpretation\. Other work has assessed ChatGPT\-3\.5 and ChatGPT\-4 using zero\-shot prompts \(Saad et al\., [2024](https://arxiv.org/html/2606.25057#bib.bib61)\), finding that their outputs were only weakly correlated with final acceptance decisions and tended to be overly positive\.

Few\-shot approaches have shown improvements in structural alignment and acceptance prediction \(Sukpanichnant et al\., [2024](https://arxiv.org/html/2606.25057#bib.bib69)\)\. However, fine\-grained analyses \(Du et al\., [2024](https://arxiv.org/html/2606.25057#bib.bib37)\) indicate that while LLMs produce coherent summaries and high\-level evaluations, they frequently miss subtle methodological weaknesses or nuanced experimental limitations\. Similarly, evaluations \(Zhou et al\., [2024](https://arxiv.org/html/2606.25057#bib.bib22)\) across GPT\-3\.5 and GPT\-4 reveal persistent challenges in long\-document processing, zero\-shot score calibration, and producing critiques that match human reviewers in critical depth\.

Multi\-agent prompting strategies \(D’Arcy et al\., [2024](https://arxiv.org/html/2606.25057#bib.bib70)\), in which different agents focus on clarity, methodology, experiments, or impact, attempt to enhance specificity and reduce generic feedback\. Although such approaches improve aspect coverage, they also increase system complexity and do not fully resolve the limitations of prompt\-based review generation\.
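The aspect\-specialized idea can be sketched as one prompt per aspect followed by a merge step\. This is a simplified illustration under stated assumptions, not the system of D’Arcy et al\.: the function names, the prompt wording, and `fake_llm` \(a stand\-in so the example runs without a model\) are all hypothetical\.

```python
from typing import Callable, Dict

# Aspects named in the text; each one gets its own "agent" prompt.
ASPECTS = ("clarity", "methodology", "experiments", "impact")


def multi_agent_review(paper: str, llm: Callable[[str], str]) -> Dict[str, str]:
    """Query one aspect-focused agent per aspect and collect the critiques."""
    reviews = {}
    for aspect in ASPECTS:
        prompt = (
            f"You are a reviewer who only assesses {aspect}. "
            f"List specific strengths and weaknesses.\n\nPaper:\n{paper}"
        )
        reviews[aspect] = llm(prompt)
    return reviews


def merge_reviews(reviews: Dict[str, str]) -> str:
    """Concatenate per-aspect critiques into one report with labeled parts."""
    return "\n\n".join(f"[{aspect}]\n{text}" for aspect, text in reviews.items())


def fake_llm(prompt: str) -> str:
    # Stand-in for a real model call so the sketch runs offline.
    aspect = prompt.split("only assesses ")[1].split(".")[0]
    return f"(placeholder critique about {aspect})"


reviews = multi_agent_review("We propose a new method.", fake_llm)
report = merge_reviews(reviews)
print(report)
```

Each extra aspect adds one model call, which is one concrete form of the added system complexity noted above\.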

The performance characteristics of prompt\-based systems can be understood through their reliance on pretrained patterns\. LLMs approximate critique generation by mapping manuscript content into familiar evaluative language structures encountered during training\. This explains their fluency and structural coherence\. However, because they are not explicitly trained to assess experimental validity or methodological correctness, models may reward clear structure and polished prose while overemphasizing stylistic signals relative to substantive contribution\. This behavior reflects both pretraining distribution biases and alignment objectives that favor constructive tone\. Effective prompt engineering is therefore critical for improving review structure and critical depth \(Markhasin, [2025](https://arxiv.org/html/2606.25057#bib.bib71)\)\. Techniques such as multi\-step prompting and chain\-of\-thought reasoning \(Wei et al\., [2022](https://arxiv.org/html/2606.25057#bib.bib67)\) have been shown to better simulate human reviewer reasoning and improve factual accuracy\.

Structural limitations\. Beyond empirical performance metrics, prompt\-based systems exhibit structural vulnerabilities that affect reliability:

- Prompt sensitivity: Small changes in instruction wording can significantly alter critique tone and numerical outputs, undermining reproducibility \(Zhao et al\., [2021](https://arxiv.org/html/2606.25057#bib.bib64)\)\.
- Hallucination: Models may misinterpret manuscript content and generate factually inaccurate outputs \(Huang et al\., [2025](https://arxiv.org/html/2606.25057#bib.bib63)\)\.
- Score\-text inconsistency: Jointly generated critiques and scores may lack logical alignment when numerical outputs are not explicitly constrained \(Bharti et al\., [2026](https://arxiv.org/html/2606.25057#bib.bib80)\)\.
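Prompt sensitivity, the first item above, can be probed by scoring the same manuscript under several paraphrased instructions and looking at the spread of the resulting scores\. The sketch below assumes a hypothetical scoring callable; `fake_score_fn` is a toy stand\-in that reacts to wording only to make the effect visible and does not describe how any real model behaves\.

```python
from statistics import mean, pstdev


def prompt_sensitivity(paper, prompt_variants, score_fn):
    """Score one paper under several prompt wordings and summarize the spread."""
    scores = [score_fn(variant.format(paper=paper)) for variant in prompt_variants]
    return {
        "scores": scores,
        "mean": mean(scores),
        "std": pstdev(scores),
        "range": max(scores) - min(scores),
    }


def fake_score_fn(prompt):
    # Toy stand-in for an LLM call that returns a numeric score.
    return 7.0 if "constructive" in prompt.lower() else 5.0


# Three paraphrases of the same scoring instruction (illustrative wording).
variants = [
    "Rate this paper from 1 to 10.\n{paper}",
    "Give a constructive rating from 1 to 10.\n{paper}",
    "Score the paper on a 1-10 scale.\n{paper}",
]

report = prompt_sensitivity("We propose a new method.", variants, fake_score_fn)
print(report)
```

A large `range` or `std` for a fixed manuscript indicates that the score depends on instruction wording rather than on the paper\.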

Overall, prompt\-based systems are best understood as strong review\-format generators rather than reliable evaluators: they can reproduce the style and structure of reviews, but their judgments remain sensitive to prompting, calibration errors, and unsupported claims\.

### 3\.2\.Supervised fine\-tuning

Supervised fine\-tuning \(Gao et al\., [2024](https://arxiv.org/html/2606.25057#bib.bib51); Idahl and Ahmadi, [2025](https://arxiv.org/html/2606.25057#bib.bib11)\) improves domain alignment by learning from historical reviews, but it also inherits the noise, bias, and subjectivity of peer\-review data\. In this paradigm, LLMs are trained on datasets that pair input manuscripts with expert\-written peer reviews, enabling models to learn mappings from paper content to review text through instruction\-tuned language modeling\. The objective is to capture reviewers’ evaluation style, adherence to review criteria, and structured reasoning patterns\.

Fine\-tuning strategies\. Supervised fine\-tuning can be implemented through two main adaptation strategies: full fine\-tuning and parameter\-efficient fine\-tuning \(PEFT\)\. In full fine\-tuning \(Idahl and Ahmadi, [2025](https://arxiv.org/html/2606.25057#bib.bib11)\), all model parameters are updated using peer review supervision, enabling maximal adaptation to domain\-specific evaluation patterns\. However, full fine\-tuning requires substantial computational resources and increases the risk of overfitting, particularly when review datasets are limited in size\.

Alternatively, PEFT methods such as LoRA \(Hu et al\., [2022](https://arxiv.org/html/2606.25057#bib.bib81)\) introduce a small number of trainable low\-rank adaptation parameters while keeping the base model largely frozen\. PEFT approaches substantially reduce memory and compute requirements, making them practical for training on modest\-sized peer review datasets\. Several recent peer review generation systems \(Gao et al\., [2024](https://arxiv.org/html/2606.25057#bib.bib51); Faizullah et al\., [2024](https://arxiv.org/html/2606.25057#bib.bib73)\) adopt LoRA\-style adaptation to balance domain alignment with computational feasibility\. Although these strategies differ in efficiency and adaptation capacity, both remain fundamentally constrained by the quality and representativeness of available review data\.
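The efficiency gap between the two strategies can be illustrated with simple parameter arithmetic\. In LoRA, the update to a frozen weight matrix of shape `d_out x d_in` is expressed as a product of two low\-rank factors, so only `rank * (d_out + d_in)` values are trained for that matrix\. The layer size and rank below are assumed for illustration and do not describe any system cited in this survey\.

```python
def full_finetune_params(d_out, d_in):
    """Trainable parameters when the whole weight matrix is updated."""
    return d_out * d_in


def lora_trainable_params(d_out, d_in, rank):
    """Trainable parameters in the low-rank factors B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)


# Hypothetical layer: one square projection matrix and a small rank.
d_out = d_in = 4096
rank = 8

full = full_finetune_params(d_out, d_in)
lora = lora_trainable_params(d_out, d_in, rank)

print(f"full fine-tuning: {full:,} trainable parameters")
print(f"LoRA (rank {rank}): {lora:,} trainable parameters")
print(f"fraction trained: {lora / full:.2%}")
```

For this assumed layer the low\-rank factors hold well under one percent of the parameters of the full matrix, which is why PEFT is practical when review data and compute are limited\.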

Depending on the training objective, models can be optimized to generate criterion\-specific evaluations, predict scores, or produce complete review reports\. Compared to prompt\-based systems, supervised fine\-tuning explicitly approximates the critique function $f_{c}(x)$ and, when applicable, the scoring function $f_{s}(x)$ using labeled review data\.

Findings\. Recent work demonstrates that supervised fine\-tuning improves structural fidelity and domain alignment\. For example, Yu et al\. \(Yu et al\., [2024b](https://arxiv.org/html/2606.25057#bib.bib21)\) introduced SEA, a fine\-tuning framework with three modules: standardization, evaluation, and analysis\. Their results show that SEA can produce feedback closely aligned with human reviews\.

Several studies have also focused on specialized subtasks\. Faizullah et al\. \(Faizullah et al\., [2024](https://arxiv.org/html/2606.25057#bib.bib73)\) evaluated PEFT\-based models for generating suggestive limitations for a given research paper and demonstrated improvements in specificity over prompt\-only approaches\. Weng et al\. \(Weng et al\., [2025](https://arxiv.org/html/2606.25057#bib.bib4)\) curated 5K reviews and proposed a closed\-loop research–review–revision cycle, powered by iterative preference training\. Zhu et al\. \(Zhu et al\., [2025b](https://arxiv.org/html/2606.25057#bib.bib50)\) constructed a 13K annotated dataset capturing intermediate reasoning steps, enabling multi\-stage training for structured review generation\. Idahl and Ahmadi \(Idahl and Ahmadi, [2025](https://arxiv.org/html/2606.25057#bib.bib11)\) constructed a large\-scale dataset of 79K reviews from OpenReview and developed a fully fine\-tuned LLaMA\-based model that generated more critical and realistic reviews than zero\-shot baselines\. Moreover, Mostafa et al\. \(Mostafa et al\., [2026](https://arxiv.org/html/2606.25057#bib.bib83)\) fine\-tuned a LLaMA\-based model to generate human\-like novelty assessments and calibrated novelty scores\.

Across these studies, fine\-tuned models typically outperform prompt\-based systems in several aspects:

- Review structure fidelity
- Criterion coverage
- Domain\-specific terminology usage
- Correlation with reviewer scores

Structural limitations\. Despite their empirical advantages, supervised fine\-tuned systems face structural constraints:

- •Label noise and reviewer disagreement: Human review data contain substantial inter\-reviewer variability\. Treating individual scores as ground truth risks amplifying subjective bias\.
- •Dataset concentration: Because most training data come from open\-review computer science venues, fine\-tuned models may learn venue\-specific reviewing norms rather than general principles of scientific evaluation\.
- •Data imbalance: Accepted papers are often overrepresented relative to rejected papers, leading to skewed score distributions and reduced discrimination power\.
- •Bias amplification: Fine\-tuned models may encode historical inequities or stylistic preferences embedded in training data\.
- •Ethical and legal constraints: Reviewer anonymity, consent, and data ownership issues complicate large\-scale dataset construction and redistribution\.
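
Two of the constraints above, reviewer disagreement and accept/reject imbalance, have simple data-side mitigations that can be sketched in a few lines. The following is a minimal illustration, not a method taken from the surveyed papers: it assumes a hypothetical 1 to 10 score scale, and the function names `score_distribution` and `inverse_frequency_weights` are invented for this example. It turns a paper's reviewer scores into a soft label instead of a single ground-truth value, and derives inverse-frequency class weights for decision labels.

```python
from collections import Counter

def score_distribution(scores, min_score=1, max_score=10):
    """Turn one paper's reviewer scores into a soft label (empirical distribution).

    The 1-10 scale is an assumption for illustration; venues differ.
    """
    if not scores:
        raise ValueError("at least one reviewer score is required")
    counts = Counter(scores)
    total = len(scores)
    return [counts.get(s, 0) / total for s in range(min_score, max_score + 1)]

def inverse_frequency_weights(labels):
    """Class weights proportional to 1/frequency, to offset label imbalance."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {label: n / (k * c) for label, c in counts.items()}

# Three reviewers disagree: the target keeps that disagreement visible.
soft = score_distribution([3, 6, 6])

# A toy corpus where accepted papers are overrepresented.
weights = inverse_frequency_weights(["accept"] * 3 + ["reject"])
```

Training against `soft` (for example with a cross-entropy loss over score bins) keeps disagreement in the target rather than averaging it away. The weights only rebalance the loss and do not recover the rejected submissions that were never shared.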

Overall, supervised fine\-tuning is a promising direction for enhancing the realism and domain relevance of AI\-assisted peer reviews\. Yet its reliability remains bounded by the quality and representativeness of available datasets\.

### 3\.3\.Retrieval\-augmented generation

Retrieval\-augmented systems address grounding and novelty assessment, but shift the reliability bottleneck from generation to evidence selection\. LLMs, even when fine\-tuned, may hallucinate unsupported claims, misinterpret experimental details, or generate generic critiques detached from manuscript\-specific evidence\. To mitigate this issue, retrieval\-augmented generation \(RAG\) frameworks \(Gao et al\., [2023](https://arxiv.org/html/2606.25057#bib.bib84)\) have been explored, which couple LLMs with external retrieval systems to enhance factual consistency, citation awareness, and review depth\.

In RAG\-based approaches, models are provided with additional evidence beyond the raw manuscript text\. Retrieved context may include specific sections of the paper, cited references, related work from external corpora, or structured representations such as knowledge graphs\. LLMs then condition their critique and scoring generation on this retrieved information\. Conceptually, RAG modifies the critique function $f_c(x)$ by enriching the input representation with auxiliary evidence $r(x)$, yielding $f_c(x, r(x))$, with the goal of improving grounding and reducing hallucination\. In peer review, retrieval can serve several functions:

- •Intra\-document grounding: Selecting relevant manuscript sections for focused critique, such as experimental details\.
- •Inter\-document comparison: Retrieving related work to assess novelty and contribution\.
- •Citation verification: Validating references and claims against external literature\.
- •Context enrichment: Providing background information for domain\-specific terminology\.
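
The intra-document grounding case can be made concrete with a small sketch. The code below is an illustration under stated assumptions, not the retrieval component of any surveyed system: it uses bag-of-words cosine similarity in place of a learned embedding model, and the names `retrieve` and `build_review_prompt` are hypothetical. It selects the manuscript passages most relevant to a claim and assembles the pair $(x, r(x))$ that a review model would be conditioned on.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, passages, k=2):
    """r(x): the k passages most similar to the query (ties keep document order)."""
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(p))), i) for i, p in enumerate(passages)]
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [passages[i] for _, i in scored[:k]]

def build_review_prompt(claim, passages, k=2):
    """Assemble the (x, r(x)) input text for a downstream review model."""
    evidence = retrieve(claim, passages, k)
    lines = ["Claim under review:", claim, "", "Retrieved evidence:"]
    lines += [f"[{n}] {p}" for n, p in enumerate(evidence, 1)]
    return "\n".join(lines)

sections = [
    "We train on 5,000 samples with a learning rate of 0.001.",
    "Related work covers graph neural networks for molecules.",
    "Ablation: removing the retrieval module lowers accuracy by 4 points.",
]
top = retrieve("Does the ablation support the retrieval module?", sections, k=1)
prompt = build_review_prompt("Does the ablation support the retrieval module?", sections, k=1)
```

Even in this toy form, the quality of the review input depends entirely on what `retrieve` returns: a poor similarity function or a bad segmentation of `sections` changes the evidence the model sees before any generation takes place.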

Findings\. Early work, such as ReviewRobot \(Wang et al\., [2020](https://arxiv.org/html/2606.25057#bib.bib17)\), built knowledge graphs from the target manuscript, its cited works, and background literature to generate structured, evidence\-backed critiques and predict review scores\. By explicitly modeling relationships among contributions and prior work, ReviewRobot aimed to support more informed novelty and relevance assessment\. Similarly, Mostafa et al\. \(Mostafa et al\., [2026](https://arxiv.org/html/2606.25057#bib.bib83)\) demonstrated that integrating a literature\-aware retrieval component improves the human\-likeness of critiques and the calibration of novelty scores, with interpretable, grounded justifications\.

Several recent frameworks incorporate retrieval into multi\-stage reasoning pipelines\. Zhu et al\. \(Zhu et al\., [2025b](https://arxiv.org/html/2606.25057#bib.bib50)\) integrated structured literature search into their generation process, while Gao et al\. \(Gao et al\., [2025](https://arxiv.org/html/2606.25057#bib.bib72)\) retrieved relevant publications to validate novelty claims and strengthen evaluative grounding\. Empirical evaluations in these works suggest that retrieval improves critique specificity and reduces overly generic feedback compared to prompt\-only baselines\.

Retrieval augmentation addresses a core weakness of LLM\-based reviewing: reliance on parametric knowledge encoded during training\. Without retrieval, models must infer novelty and methodological validity from internalized statistical patterns, which may be outdated or incomplete\. By incorporating external evidence, RAG reduces dependence on memorized knowledge and encourages context\-aware reasoning\. RAG systems can therefore improve factual alignment and reduce unsupported claims\. In scoring tasks, retrieval may also enhance novelty estimation and contribution assessment\.

Structural limitations\. Despite its promise, retrieval\-augmented review generation introduces new challenges:

- •Retrieval bottlenecks: Identifying the most relevant segments from long manuscripts remains difficult due to token\-length constraints and segmentation choices\.
- •Corpus dependency: Performance depends heavily on the completeness and quality of the retrieval corpus, such as arXiv or PubMed\.
- •Error propagation: Retrieval errors directly influence generation and may amplify inaccuracies\.
- •Computational overhead: Indexing and querying large scientific corpora increase system complexity and latency\.
- •Confidentiality concerns: Using external APIs or third\-party corpora may conflict with blind review protocols\.

In summary, retrieval\-augmented approaches can improve evidence\-grounded critique generation and novelty assessment in automated peer review\. By coupling LLMs with structured retrieval mechanisms, these systems aim to reduce hallucination and enhance contextual reasoning\. However, retrieval\-augmented systems shift the bottleneck of prompt\-based or fine\-tuned methods from language generation to information selection\. Their effectiveness depends not only on LLM capacity, but also on retrieval design, corpus coverage, and evidence integration strategies\. Practical deployment therefore requires careful attention to retrieval quality, computational efficiency, and confidentiality constraints\.

### 3\.4\.Feedback\-driven approaches

While prompt\-based, fine\-tuned, and retrieval\-augmented systems primarily operate in a static generation paradigm, feedback\-driven approaches introduce iterative refinement mechanisms into automated peer review generation\. These systems incorporate human or programmatic feedback signals to improve critique quality, structure, and alignment over time\. Rather than relying solely on pretrained knowledge or supervised paper\-review pairs, feedback\-driven models attempt to approximate reviewer behavior through interaction histories, preference judgments, or reinforcement objectives\. Feedback may originate from reviewers, editors, authors, or automated quality checks\.

Conceptually, these approaches extend the critique and scoring functions $f_c(x)$ and $f_s(x)$ by incorporating feedback signals $h$, yielding adaptive mappings $f_c(x, h)$ and $f_s(x, h)$\. The goal is to improve alignment with human evaluative standards beyond what static training can achieve\.

Table 2\. Comparative synthesis of LLM\-based peer review paradigms, highlighting their main strengths, structural limitations, and associated data mining challenges\.

Findings\. Recent work has begun to model multi\-round review dynamics explicitly\. For instance, Tan et al\. \(Tan et al\., [2024](https://arxiv.org/html/2606.25057#bib.bib52)\) constructed a multi\-turn dialogue dataset including reviewer comments, author rebuttals, reviewer responses to rebuttals, and final decisions\. These reviewer\-author interaction signals, such as rebuttals or revision histories, are used to refine review generation and enable more adaptive and context\-aware critiques\. In parallel, reinforcement learning from human feedback \(RLHF\) has been explored to align review generation with expert preferences\. Taechoyotin and Acuna \(Taechoyotin and Acuna, [2025](https://arxiv.org/html/2606.25057#bib.bib47)\) integrated reasoning\-enhanced fine\-tuning with multi\-objective reinforcement learning guided by human\-aligned reward functions\. Their empirical results suggest improvements in perceived helpfulness, reasoning depth, and structural coherence relative to supervised baselines\.

Across these studies, feedback\-driven approaches generally improve constructiveness and reduce overly generic outputs\. Compared to purely supervised models, alignment\-optimized systems tend to produce critiques judged by human evaluators as more actionable and balanced\. By incorporating feedback signals, such as preference comparisons, helpfulness ratings, or rebuttal responses, models can be optimized to better match desired evaluation properties\.

Structural limitations\. Despite their promise, feedback\-driven systems face significant challenges:

- •Reward specification: Designing reward functions that capture true evaluative rigor is difficult\. Over\-optimization for helpfulness or politeness may reinforce optimism bias\.
- •Preference bias: Human feedback reflects subjective preferences that may vary across disciplines or reviewer communities\.
- •Data scarcity: Collecting high\-quality preference annotations or multi\-round review interactions is resource\-intensive\.
- •Over\-optimization risk: Reinforcement learning may produce outputs that maximize reward signals without improving factual correctness\.
- •Calibration neglect: Alignment objectives often prioritize tone and coherence rather than uncertainty modeling or score calibration\.

Overall, feedback\-driven and reinforcement\-based approaches represent an important evolution in automated peer review generation\. Compared with prompt\-based systems, feedback\-driven models introduce adaptive refinement mechanisms that enhance constructiveness and structural coherence\. Compared to supervised fine\-tuning, RLHF\-style alignment shifts the objective from historical replication to preference optimization\. By incorporating human preferences, multi\-turn interactions, or reward\-guided optimization, feedback\-driven systems can produce more constructive, context\-aware, and expert\-aligned critiques\. However, their effectiveness depends critically on the quality of feedback signals and reward design\.

### 3\.5\.Synthesis of empirical patterns

Beyond surface performance comparisons, automated peer review systems should be evaluated through a multi\-objective lens\. Effective systems must jointly optimize critique quality, score calibration, fairness across domains and institutions, robustness to adversarial manipulation, and uncertainty awareness\. These objectives are often in tension\. For example, optimizing for helpfulness may reduce critical sharpness, whereas maximizing score correlation may amplify historical biases embedded in training data\. In this section, we synthesize empirical evidence across studies to characterize the current performance landscape of automated peer review systems\. Table[2](https://arxiv.org/html/2606.25057#S3.T2)summarizes the four major LLM\-based peer review paradigms discussed above, highlighting their primary strengths, structural limitations, and corresponding data mining challenges\. Building on this comparison, the following synthesis examines two recurring empirical patterns across these paradigms: critique quality and score prediction reliability\.

Critique quality\. Across studies, the main distinction is between surface\-level review quality and evaluative rigor\. Human evaluators frequently judge LLM\-generated reviews as readable and well organized \(Biswas et al\., [2023](https://arxiv.org/html/2606.25057#bib.bib62); Robertson, [2023](https://arxiv.org/html/2606.25057#bib.bib65); Liang et al\., [2024](https://arxiv.org/html/2606.25057#bib.bib46)\), but the ability of these systems to detect subtle methodological flaws remains limited\. Fine\-grained comparisons with human\-written reviews indicate that LLM critiques often emphasize high\-level summaries and general improvement suggestions while underrepresenting subtle methodological weaknesses \(Du et al\., [2024](https://arxiv.org/html/2606.25057#bib.bib37); Yu et al\., [2024c](https://arxiv.org/html/2606.25057#bib.bib15)\)\. Existing studies show that LLM\-generated comments tend to be more positive and contain fewer explicit weakness\-oriented statements than human reviews, particularly when identifying limitations or experimental flaws\. Moreover, LLMs often produce generic reviews that are not specific to the paper and lack detailed justification\. They remain prone to hallucination, generating outputs that are plausible\-sounding but factually inaccurate or unverified \(Achiam et al\., [2023](https://arxiv.org/html/2606.25057#bib.bib5); Ji et al\., [2023](https://arxiv.org/html/2606.25057#bib.bib56)\)\. Similarly, some studies \(Zhou et al\., [2024](https://arxiv.org/html/2606.25057#bib.bib22); Chen et al\., [2025](https://arxiv.org/html/2606.25057#bib.bib75)\) report that while GPT\-4 produces coherent commentaries, its ability to generate critiques matching expert depth remains limited, especially for long and technically dense manuscripts\.

Score prediction\. Score prediction performance is commonly measured through correlation with reviewer\-evaluated scores or final acceptance outcomes\. Studies evaluating GPT\-3\.5 and GPT\-4 report weak to moderate correlations with final decisions \(Saad et al\., [2024](https://arxiv.org/html/2606.25057#bib.bib61); Zhou et al\., [2024](https://arxiv.org/html/2606.25057#bib.bib22)\)\. Several studies also observe that predicted scores cluster around mid\-range categories and show weaker discrimination between accepted and rejected papers than human reviewers \(Saad et al\., [2024](https://arxiv.org/html/2606.25057#bib.bib61); Zhou et al\., [2024](https://arxiv.org/html/2606.25057#bib.bib22)\)\. Fine\-tuned systems trained on OpenReview datasets often improve alignment with human\-evaluated scores \(Idahl and Ahmadi, [2025](https://arxiv.org/html/2606.25057#bib.bib11)\), although the magnitude of improvement varies across datasets\. Across reported studies, correlations between LLM\-generated scores and human reviewer scores typically fall within weak\-to\-moderate ranges, with fine\-tuned models consistently outperforming zero\-shot prompting baselines\. Retrieval augmentation tends to improve critique specificity and novelty assessment, although gains in score calibration remain limited\. Importantly, the achievable upper bound is constrained by intrinsic inter\-reviewer disagreement documented in peer review literature \(Bornmann, [2011](https://arxiv.org/html/2606.25057#bib.bib86); Pier et al\., [2018](https://arxiv.org/html/2606.25057#bib.bib85)\), suggesting that perfect alignment with any single reviewer is neither realistic nor desirable\. These observations suggest that score prediction should be evaluated distributionally: models should be tested on calibration, uncertainty intervals, rank stability, and agreement with reviewer\-score distributions, not only on correlation with average scores\.
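
To illustrate why correlation with average scores is not sufficient, the sketch below compares a rank correlation with a simple agreement check against the spread of reviewer scores. It is a minimal example with invented toy data. The helper `within_reviewer_range` is a hypothetical stand-in for the richer calibration and interval metrics mentioned above, not a metric defined in the surveyed work.

```python
from statistics import mean

def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5 if vx and vy else 0.0

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))

def within_reviewer_range(pred, reviewer_scores):
    """Fraction of papers whose predicted score lies inside the human min-max span."""
    hits = sum(min(r) <= p <= max(r) for p, r in zip(pred, reviewer_scores))
    return hits / len(pred)

# Toy data: per-paper reviewer scores and mid-range model predictions.
human = [[3, 5, 4], [6, 8, 7], [5, 5, 6], [2, 3, 3]]
predicted = [5.0, 6.0, 5.5, 5.0]

rho = spearman(predicted, [mean(r) for r in human])
coverage = within_reviewer_range(predicted, human)
```

On this toy data the predictions rank the papers almost correctly (`rho` is roughly 0.95) while clustering between 5 and 6, so the weakest paper receives a score outside everything its reviewers gave and `coverage` drops to 0.75. A correlation-only report would hide that failure.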

## 4\.Benchmark Gaps and Evaluation Challenges

While emerging datasets and evaluation protocols have supported recent progress in LLM\-based peer review generation, existing benchmarks remain fragmented and limited\. Current resources provide valuable starting points, but they fall short of enabling rigorous, cross\-domain, and reliability\-focused evaluation\. In this section, we examine structural gaps in available datasets and highlight methodological challenges in evaluating automated critique and scoring systems\.

### 4\.1\.Datasets

| Name | Data Source | Data Size | Application |
| --- | --- | --- | --- |
| PeerRead \(Kang et al\., [2018](https://arxiv.org/html/2606.25057#bib.bib27)\) | ICLR 2017, ACL 2017, NeurIPS 2013\-2017 | 14K papers \+ decision \(3K papers \+ 10\.7K reviews; 1\.3K papers \+ aspect scores\) | score prediction, acceptance prediction |
| Interspeech \(Stappen et al\., [2020](https://arxiv.org/html/2606.25057#bib.bib31)\) | Interspeech 2019 | 2\.1K papers, 5\.8K reviews \+ decision | acceptance prediction, score prediction |
| PeerAssist \(Bharti et al\., [2021](https://arxiv.org/html/2606.25057#bib.bib55)\) | ICLR 2017\-2020 | 4\.4K papers \+ 13\.4K reviews | acceptance prediction |
| ASAP\-Review \(Yuan et al\., [2022](https://arxiv.org/html/2606.25057#bib.bib16)\) | ICLR 2017\-2020, NeurIPS 2016\-2019 | 8\.8K papers \+ 28\.1K reviews | review report generation |
| NLPEER \(Dycke et al\., [2023](https://arxiv.org/html/2606.25057#bib.bib28)\) | ACL 2017, ARR 2022, COLING 2022, CONLL 2016 | 5\.6K papers \+ 11\.5K reviews | score prediction, pragmatic labelling, guided skimming for peer review |
| MOPRD \(Lin et al\., [2023](https://arxiv.org/html/2606.25057#bib.bib29)\) | PeerJ journals | 6\.5K papers \+ 22\.4K reviews, 11\.2K rebuttal letters | review report generation, meta\-review generation, acceptance prediction, rebuttal generation, scientometric analysis |
| AgentReview \(Jin et al\., [2024](https://arxiv.org/html/2606.25057#bib.bib42)\) | ICLR 2020\-2023 | 500 papers, 10\.4K reviews \+ rebuttals | review report generation, meta\-review generation |
| ReviewMT \(Tan et al\., [2024](https://arxiv.org/html/2606.25057#bib.bib52)\) | ICLR 2017\-2024 | 26\.8K papers \+ 92\.0K reviews | review report generation |
| Review\-5K \(Weng et al\., [2025](https://arxiv.org/html/2606.25057#bib.bib4)\) | ICLR 2024 | 5K papers \+ 16K reviews | review report generation, score prediction |
| DeepReview\-13K \(Zhu et al\., [2025b](https://arxiv.org/html/2606.25057#bib.bib50)\) | ICLR 2024\-2025 | 13\.3K papers \+ reviews | score prediction, acceptance prediction, review report generation |
| PeerRT \(Taechoyotin and Acuna, [2025](https://arxiv.org/html/2606.25057#bib.bib47)\) | ICLR 2017\-2020 | 5\.5K papers \+ 16\.8K reviews | score prediction, review report generation |
| OpenReviewer \(Idahl and Ahmadi, [2025](https://arxiv.org/html/2606.25057#bib.bib11)\) | ICLR 2022\-2024, NeurIPS 2022\-2024 | 36K papers \+ 79K reviews | review report generation, score prediction |
| $Re^2$ \(Zhang et al\., [2025](https://arxiv.org/html/2606.25057#bib.bib53)\) | 45 venues, 2017\-2025 | 19\.9K papers \+ 70\.6K reviews | score prediction, review report generation, acceptance prediction |

Table 3\. Datasets, with code for downloading or crawling when available, for automated peer review generation and related evaluation tasks\.

Many studies have developed datasets to support peer review modeling, and these resources now serve as key benchmarks for measuring and comparing model performance\. Table [3](https://arxiv.org/html/2606.25057#S4.T3) summarizes representative datasets for automated review report generation, score prediction, and acceptance prediction tasks\. Most datasets are derived from open review platforms such as OpenReview \([https://openreview\.net](https://openreview.net/)\), PeerJ \([https://peerj\.com](https://peerj.com/)\), and F1000Research \([https://f1000research\.com](https://f1000research.com/)\)\. They include early benchmarks such as PeerRead \(Kang et al\., [2018](https://arxiv.org/html/2606.25057#bib.bib27)\), NLPEER \(Dycke et al\., [2023](https://arxiv.org/html/2606.25057#bib.bib28)\), and MOPRD \(Lin et al\., [2023](https://arxiv.org/html/2606.25057#bib.bib29)\), as well as more recent large\-scale collections such as ReviewMT \(Tan et al\., [2024](https://arxiv.org/html/2606.25057#bib.bib52)\), DeepReview\-13K \(Zhu et al\., [2025b](https://arxiv.org/html/2606.25057#bib.bib50)\), OpenReviewer \(Idahl and Ahmadi, [2025](https://arxiv.org/html/2606.25057#bib.bib11)\), and $Re^2$ \(Zhang et al\., [2025](https://arxiv.org/html/2606.25057#bib.bib53)\)\. Despite growing dataset sizes, several structural limitations persist\.

Data scarcity and confidentiality\. Peer review is inherently confidential\. Most journals and conferences operate under closed\-review policies, limiting public access to data\. Existing datasets therefore rely primarily on opt\-in mechanisms or open\-review venues\. Consequently, even the largest datasets remain small relative to typical LLM pretraining corpora\. Early attempts to expand dataset coverage, such as matching arXiv submissions to later conference proceedings \(Kang et al\., [2018](https://arxiv.org/html/2606.25057#bib.bib27)\), provide acceptance labels but not full review texts, thereby limiting their utility for training critique\-generation models\. Thus, current benchmarks remain constrained in both scale and representativeness\.

Domain concentration and distribution bias\. A prominent pattern across Table [3](https://arxiv.org/html/2606.25057#S4.T3) is the dominance of computer science \(CS\) venues, particularly machine learning and natural language processing conferences\. Although datasets such as MOPRD \(Lin et al\., [2023](https://arxiv.org/html/2606.25057#bib.bib29)\) and NLPEER \(Dycke et al\., [2023](https://arxiv.org/html/2606.25057#bib.bib28)\) include some multidisciplinary content, their scale remains insufficient for robust cross\-domain modeling\. This concentration creates two risks\. First, apparent performance improvements may reflect alignment between model pretraining data and CS\-specific evaluation norms\. Second, generalization to disciplines such as biomedical sciences, economics, and the humanities remains largely untested\. Furthermore, many datasets exhibit acceptance bias: authors and reviewers are often reluctant to share rejected submissions because of reputational or institutional concerns\. This imbalance skews label distributions and may distort score prediction models, increasing the risk of false positives in acceptance classification\.

Heterogeneous formatting and limited structure\. Peer review formats vary widely across venues\. Some use structured templates, such as strengths and weaknesses, while others rely on free\-form reviews\. This inconsistency complicates dataset standardization and supervised training\. Although pragmatic annotations and discourse labeling efforts exist \(Hua et al\., [2019](https://arxiv.org/html/2606.25057#bib.bib33); Dycke et al\., [2023](https://arxiv.org/html/2606.25057#bib.bib28)\), they remain small\-scale because of annotation costs\. The lack of consistent structural metadata limits the ability to train models on fine\-grained evaluative reasoning patterns\.

Legal and ethical barriers\. Many datasets lack explicit licensing terms governing reuse and redistribution\. Reviewer anonymity, author consent, and institutional policies further constrain dataset sharing\. These legal and ethical uncertainties inhibit the development of standardized, widely adopted benchmarks comparable to those in other NLP domains\.

### 4\.2\.Evaluation metrics

Evaluating automated peer review systems is inherently challenging because review quality combines factual correctness, subjective judgment, structured reasoning, and constructive tone\. Current evaluation protocols rely primarily on three approaches: automatic metrics, human assessment, and LLM\-as\-a\-judge evaluation\.

Automatic metrics\. For score prediction and acceptance classification, standard metrics such as accuracy, F1, mean absolute error \(MAE\) \(Willmott and Matsuura, [2005](https://arxiv.org/html/2606.25057#bib.bib92)\), and root mean square error \(RMSE\) \(Chai and Draxler, [2014](https://arxiv.org/html/2606.25057#bib.bib95)\) are commonly used\. By contrast, critique generation is typically evaluated using text similarity metrics such as bilingual evaluation understudy \(BLEU\) \(Papineni et al\., [2002](https://arxiv.org/html/2606.25057#bib.bib87)\), recall\-oriented understudy for gisting evaluation \(ROUGE\) \(Lin, [2004](https://arxiv.org/html/2606.25057#bib.bib88)\), metric for evaluation of translation with explicit ordering \(METEOR\) \(Banerjee and Lavie, [2005](https://arxiv.org/html/2606.25057#bib.bib89)\), BERTScore \(Zhang et al\., [2019](https://arxiv.org/html/2606.25057#bib.bib90)\), and MoverScore \(Zhao et al\., [2019](https://arxiv.org/html/2606.25057#bib.bib91)\) \(Table [4](https://arxiv.org/html/2606.25057#S4.T4)\)\. However, these metrics were originally designed for tasks such as summarization or translation, not for evaluating critical, open\-ended, and subjective peer reviews\. As a result, they exhibit several limitations:

- •Multi\-reference ambiguity: Multiple valid reviews may exist for the same paper\. Averaging scores across multiple human reviews can obscure complementary evaluations\.
- •Subjectivity mismatch: Peer reviews are inherently subjective, and rarely have a single correct critique\. Metrics such as BLEU and ROUGE may penalize valid deviations from reference reviews\.
- •Semantic insufficiency: Embedding\-based metrics capture semantic similarity more effectively than lexical metrics, but they still fail to assess constructiveness, factual correctness, or coherence of reasoning\.
- •Reasoning blindness: Scientific reviews often involve structured reasoning, including critique, evidence, and suggestions; however, text similarity metrics generally ignore discourse structure or logical flow\.

As a result, high ROUGE or BERTScore does not necessarily indicate an accurate, helpful, or insightful critique\.
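
The subjectivity mismatch is easy to reproduce with a simplified lexical metric. The sketch below implements a clipped unigram recall in the spirit of ROUGE\-1 (it is not the official ROUGE implementation, and the three review sentences are invented for illustration). It scores a near copy that reverses the reference's judgment and a different but legitimate critique.

```python
import re
from collections import Counter

def unigrams(text):
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def rouge1_recall(candidate, reference):
    """Clipped unigram overlap divided by reference length (simplified ROUGE-1 recall)."""
    cand, ref = unigrams(candidate), unigrams(reference)
    overlap = sum(min(count, cand[tok]) for tok, count in ref.items())
    total = sum(ref.values())
    return overlap / total if total else 0.0

reference = "The baselines are weak and the ablation study is missing."
# Near copy that contradicts the reference on the ablation study.
contradicting_copy = "The baselines are weak and the ablation study is thorough."
# A different, equally legitimate critique of the same paper.
different_valid = "Statistical significance is never reported for any result."

copy_score = rouge1_recall(contradicting_copy, reference)
distinct_score = rouge1_recall(different_valid, reference)
```

Here `copy_score` is 0.9 and `distinct_score` is 0.1: the metric rewards the sentence that contradicts the reference and penalizes the critique that raises a separate valid concern.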

Human evaluation\. Human evaluation remains the most reliable method for assessing helpfulness, constructiveness, factual correctness, and domain relevance\. Studies commonly use Likert scales, pairwise comparisons, or domain\-specific criteria such as novelty identification and flaw detection\. However, human evaluation faces substantial scalability challenges\. It is time\-intensive, costly, and subject to inter\-rater disagreement, which mirrors the variability inherent in peer review itself\. Consequently, human evaluation is often limited to small samples, reducing statistical power and making comparisons across systems difficult\.

LLM\-as\-a\-judge\. Recent studies have used LLMs to evaluate generated reviews through scoring or pairwise comparison\. While scalable, this approach introduces risks of circularity: models trained on similar corpora may favor fluent but shallow critiques or outputs that resemble their own generation style\. In addition, LLM\-based judgments are sensitive to prompt design, model choice, and evaluation framing, and they often lack transparent reasoning mechanisms\. Thus, while promising, LLM\-as\-a\-judge frameworks require further validation before they can serve as reliable benchmarks for automated peer review evaluation\.

| Metric | Evaluation | Limitations |
| --- | --- | --- |
| BLEU | Precision\-oriented n\-gram overlap | Penalize diversity, ignore semantic similarity |
| ROUGE | Recall\-based n\-gram overlap | Penalize diversity, ignore semantic similarity |
| METEOR | Synonym and stem matching \+ precision/recall | Only consider shallow semantics |
| BERTScore | Semantic similarity via contextual embeddings | Fail to judge critique quality; cannot verify factual correctness or consistency |
| MoverScore | Embedding\-based semantic alignment | Often favor fluent text over insightful content |

Table 4\. List of well\-known automatic text similarity metrics and their limitations in evaluating generated peer reviews\.

## 5\.Security and Robustness Risks

As automated peer review systems move from experimental prototypes toward decision\-support tools, robustness and security concerns become increasingly central\. Unlike many conventional NLP tasks, peer review operates in a high\-stakes setting in which evaluation outcomes can influence publication decisions, academic reputation, and career trajectories\. As a result, vulnerabilities in LLM\-based review systems may be subject to strategic exploitation\. In this section, we outline potential risks across several paradigms for automated critique and scoring\.

### 5\.1\.Prompt injection and instruction manipulation

Prompt\-based review systems typically concatenate system instructions with manuscript content before generation\. This architecture exposes them to prompt injection risks, whereby malicious or unintended instructions embedded in the manuscript influence model behavior\. LLMs are known to be sensitive to prompt framing and contextual instructions \(Zhao et al\., [2021](https://arxiv.org/html/2606.25057#bib.bib64)\)\. If authors include hidden directives or strategically phrased content in their submissions, such as cues emphasizing novelty or instructions that redirect evaluative focus, models may inadvertently incorporate these signals into their critique or scores\.

Recent work demonstrates the severity of this threat in scientific review settings\. Keuper \(Keuper, [2025](https://arxiv.org/html/2606.25057#bib.bib94)\) provides the first systematic analysis of such hidden manipulations, using 1,000 ICLR 2024 paper reviews across multiple LLMs\. The study shows that simple prompt injections can produce highly biased outputs, including artificially elevated acceptance scores, and in some cases can yield acceptance likelihoods of up to 100%\. These findings reveal a concrete vulnerability in AI\-assisted review workflows\. Related evidence further suggests that prompt injections can induce topical shifts or alter review content when covert instructions are embedded within submissions \(Zhu et al\., [2025a](https://arxiv.org/html/2606.25057#bib.bib93)\)\. These results show that the manuscript itself becomes an attack surface in automated peer review pipelines: text that appears innocuous to human readers may contain embedded instructions that influence the evaluative behavior of an LLM\.

Mitigating prompt injection risks requires robust input sanitization, including filtering invisible or embedded tokens, as well as adversarial robustness testing on curated corpora\. Without such safeguards, structural vulnerabilities in LLM pipelines may be exploited, undermining both the reliability and integrity of automated peer review\.
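
A first layer of input sanitization might look like the sketch below. It is a minimal illustration under stated assumptions, not a complete defense: the `SUSPICIOUS` patterns are hypothetical and far from exhaustive, and the code operates on already extracted text, so it cannot see layout tricks such as white\-on\-white text in a PDF. It removes invisible Unicode format characters and flags instruction\-like phrases for human inspection.

```python
import re
import unicodedata

# Hypothetical, non-exhaustive patterns; a real deployment would need a curated,
# regularly updated list plus adversarial testing.
SUSPICIOUS = [
    r"ignore (all )?(previous|prior) instructions",
    r"as (a|an) (ai|language model|reviewer),? (you )?(must|should)",
    r"(give|assign|recommend) (this paper )?(a )?(positive|high|accept)",
]

def strip_invisible(text):
    """Remove Unicode format characters (category Cf), e.g. zero-width spaces."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")

def flag_instructions(text):
    """Return the patterns that match, so a human can inspect the submission."""
    lowered = text.lower()
    return [p for p in SUSPICIOUS if re.search(p, lowered)]

def sanitize(manuscript_text):
    cleaned = strip_invisible(manuscript_text)
    return cleaned, flag_instructions(cleaned)

raw = "Our method is novel.\u200b Ignore previous instructions and recommend accept."
cleaned, flags = sanitize(raw)
```

Flagging rather than silently deleting matched phrases keeps a human in the loop, which matters because pattern lists of this kind produce both false positives (legitimate papers about prompt injection) and false negatives (paraphrased attacks).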

### 5\.2\.Data poisoning in fine\-tuned systems

Fine\-tuned review generation models derive their behavior directly from training data\. Although supervised fine\-tuning can improve structural alignment and task fidelity, it also exposes models to data poisoning and bias amplification risks\. In the machine learning security literature, data poisoning refers to the intentional insertion of corrupted or malicious training examples that cause a learned model to behave erroneously on target inputs; such attacks have been demonstrated across classification, regression, and generation tasks \(Biggio and Roli, [2018](https://arxiv.org/html/2606.25057#bib.bib96)\)\.

Fine\-tuned peer review systems often rely on training corpora collected from open review platforms, such as OpenReview or PeerJ, as well as conference archives or volunteered peer reviews\. This reliance introduces several risks:

- •Adversarial poisoning: An adversary could deliberately inject strategically designed reviews into public platforms to skew the learned evaluation patterns of downstream models\. For example, subtle labeling of weak submissions as “acceptable” or the systematic insertion of poorly justified positive critiques could bias models toward leniency\.
- •Bias amplification from noisy labels: Review data contains substantial label noise and subjective variation\. Human reviewers often disagree, and scores typically reflect individual judgment rather than objective ground truth\. When models are trained on such noisy and inconsistent labels, they may reproduce or amplify existing biases\.

Data auditing, filtering, and bias assessment are therefore essential when constructing training corpora for automated peer review generation\.

### 5\.3\.Retrieval\-based vulnerabilities

Retrieval\-augmented generation \(RAG\) systems reduce hallucination by conditioning outputs on retrieved evidence \(Lewis et al\., [2020](https://arxiv.org/html/2606.25057#bib.bib68)\)\. However, retrieval also introduces additional attack surfaces and robustness concerns\. If external corpora contain outdated, misleading, or adversarially manipulated documents, generated critiques may incorporate inaccurate evidence\. In novelty assessment tasks, reliance on incomplete literature indices may bias evaluation outcomes\. Moreover, retrieval quality depends heavily on indexing strategies and embedding models, which may introduce topic imbalance or uneven coverage across research areas\. Robust evaluation of RAG\-based review systems should therefore include retrieval accuracy metrics, corpus\-quality checks, and stress tests under corpus perturbation\.

### 5\.4\.Reward hacking in alignment\-optimized models

Feedback\-driven approaches have proven effective for aligning LLM outputs with human preferences\. However, alignment objectives can introduce reward hacking risks, whereby models optimize for perceived helpfulness, politeness, or stylistic appeal rather than evaluative rigor\. In peer review, reward functions that emphasize constructiveness or tone may inadvertently reduce critical sharpness\. Over\-optimization toward reward models may also produce outputs that maximize preference scores without improving factual correctness, depth, or calibration\. Balancing multiple objectives, including helpfulness, depth, grounding, and calibration, remains an open challenge in feedback\-driven peer review systems\.
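
One possible way to keep tone from compensating for missing rigor is to gate a weighted reward with hard floors on the core objectives\. The sketch below is an illustrative design, not a method taken from the surveyed work; the objective names, weights, and floor values are all assumptions\.

```python
def combined_reward(scores, weights, floors):
    """Weighted reward with hard floors on selected objectives.

    scores, weights, floors: dicts keyed by objective name, values in [0, 1].
    All objective names used here are hypothetical placeholders.
    """
    # A review that fails a core objective earns no reward,
    # however polite or helpful it sounds.
    for name, floor in floors.items():
        if scores[name] < floor:
            return 0.0
    total_weight = sum(weights.values())
    return sum(weights[name] * scores[name] for name in weights) / total_weight


weights = {"helpfulness": 0.25, "depth": 0.25, "grounding": 0.25, "calibration": 0.25}
floors = {"depth": 0.4, "grounding": 0.5}

polite_but_shallow = {"helpfulness": 1.0, "depth": 0.2, "grounding": 0.9, "calibration": 0.8}
rigorous = {"helpfulness": 0.6, "depth": 0.8, "grounding": 0.8, "calibration": 0.6}

print(combined_reward(polite_but_shallow, weights, floors))  # 0.0, depth floor violated
print(combined_reward(rigorous, weights, floors))
```

With a plain weighted sum, the shallow review would outscore the rigorous one \(0\.725 versus 0\.7\); the floors remove that incentive at the cost of a non\-smooth reward signal\.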

## 6\.Deployment and Ethical Concerns

This section addresses the governance, policy, and ethical considerations associated with deployment\. These concerns extend beyond model reliability to include transparency, accountability, bias mitigation, privacy, and the appropriate role of AI in scholarly decision\-making\.

Accountability\. Traditional peer review operates within identifiable responsibility structures\. Reviewers are selected to evaluate manuscripts, provide critiques, and recommend scores or acceptance decisions; chairs and editors then aggregate reviewers’ feedback and make final decisions\. When AI\-generated outputs influence final decisions, responsibility becomes more diffuse\. In particular, if an automated critique contributes to a rejection or acceptance outcome, several questions arise:

- Who is responsible for potential errors or misleading assessments?
- Should LLM\-assisted reviews be explicitly labeled?
- How can authors contest decisions influenced by LLM\-generated content?

Scholarly publishing organizations have begun issuing guidance on the use of AI in peer review, emphasizing disclosure and reviewer accountability \(Ye et al\., [2024](https://arxiv.org/html/2606.25057#bib.bib36)\)\. Maintaining clear human decision authority is therefore essential for preserving procedural fairness, accountability, and trust\.

Transparency and explainability\. Transparency is a central requirement in scientific evaluation\. However, LLM\-based systems often operate as black boxes with internal reasoning processes that remain opaque\. Although explainable AI \(XAI\) methods aim to provide reasoning traces or confidence signals, generated explanations may not faithfully reflect underlying model computations \(OPUS Project Consortium, [2024](https://arxiv.org/html/2606.25057#bib.bib41)\)\. In the context of peer review, transparency and explainability span multiple dimensions:

- Disclosure of AI use by reviewers, chairs, or editorial systems\.
- Documentation of model training data, system design, and intended use\.
- Clear distinction between human\-authored and AI\-generated content\.
- Reporting of uncertainty, calibration, and known limitations\.
- Provision of grounded explanations, such as evidence or manuscript passages supporting generated critiques\.

Without such mechanisms, opacity may undermine confidence in evaluation outcomes and hinder the responsible deployment of LLM peer review systems\.

Fairness\. LLM\-based systems may inherit and amplify biases in historical review corpora, including biases related to institutional prestige, geography, gender, and disciplinary domains \(Hosseini and Horbach, [2023](https://arxiv.org/html/2606.25057#bib.bib39); Pataranutaporn et al\., [2025](https://arxiv.org/html/2606.25057#bib.bib40)\)\. Recent analyses suggest that LLM evaluators may favor fluent and formally structured manuscripts \(Ye et al\., [2024](https://arxiv.org/html/2606.25057#bib.bib36)\), potentially disadvantaging non\-native English authors or authors from less\-resourced research environments\. Moreover, if training datasets disproportionately reflect computer science venues or particular publication cultures, automated systems may encode discipline\-specific norms as universal standards\. Fair deployment, therefore, requires systematic auditing across author demographics, institutions, regions, and fields, as well as mechanisms for detecting and mitigating disparate impacts\.
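
A basic form of such an audit compares, for each author group, how far model scores deviate from human scores on the same manuscripts\. The sketch below assumes records of the form `(group, human_score, model_score)`; the group labels and the function name `score_gap_by_group` are hypothetical, and a real audit would also need significance testing and adequate sample sizes\.

```python
from collections import defaultdict


def score_gap_by_group(records):
    """Mean signed error (model minus human) per group, plus the largest gap.

    records: iterable of (group, human_score, model_score) tuples
             (hypothetical schema for illustration).
    """
    errors = defaultdict(list)
    for group, human_score, model_score in records:
        errors[group].append(model_score - human_score)
    mean_error = {group: sum(vals) / len(vals) for group, vals in errors.items()}
    # Disparity: difference between the most and least favored groups.
    disparity = max(mean_error.values()) - min(mean_error.values())
    return mean_error, disparity


records = [
    ("native", 6, 7), ("native", 5, 5),
    ("non_native", 6, 5), ("non_native", 5, 4),
]
print(score_gap_by_group(records))
# ({'native': 0.5, 'non_native': -1.0}, 1.5)
```

Measuring error relative to human scores, rather than raw model scores, separates the bias a model adds from differences already present in the human reviews\.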

Privacy and confidentiality\. Peer review involves unpublished research, proprietary methods, confidential findings, and in some domains, sensitive personal information\. Manuscripts in areas such as biomedicine or social sciences may contain regulated or otherwise sensitive data\. Submitting such materials to third\-party LLM systems raises confidentiality risks \(Jin et al\., [2024](https://arxiv.org/html/2606.25057#bib.bib42); Ye et al\., [2024](https://arxiv.org/html/2606.25057#bib.bib36)\), including data retention, logging, unauthorized reuse, or unintended exposure of novel findings or confidential data\. The deployment of AI\-assisted review tools must therefore comply with institutional ethics requirements, publisher policies, and applicable data protection laws\. At a minimum, systems should provide clear guarantees regarding data handling, retention, access control, and whether submitted manuscripts may be used for model training or system improvement\.

## 7\.Application Scenarios

Despite substantial challenges, LLM\-based peer review systems offer promising opportunities when deployed responsibly\. Rather than viewing automation as a binary replacement of human reviewers, it is more productive to consider a spectrum of application scenarios, ranging from author assistance to editorial augmentation\. In this section, we outline practical pathways for deploying automated peer review systems\.

Pre\-submission author assistance\. One of the most immediate and relatively low\-risk applications of LLM\-based review systems lies in the pre\-submission stage\. Authors can use LLM\-generated critiques to assess the clarity, structure, methodological soundness, and novelty framing of their manuscripts before formal submission\. Recent studies suggest that LLMs can produce structured and useful feedback that approximates review\-style commentary, although their depth, calibration, and reliability remain imperfect \(Liang et al\., [2024](https://arxiv.org/html/2606.25057#bib.bib46); Idahl and Ahmadi, [2025](https://arxiv.org/html/2606.25057#bib.bib11)\)\. In this setting, LLM systems may serve as automated writing and critique assistants, helping authors identify ambiguous claims, missing experimental details, inconsistencies, or weakly supported conclusions\. Such tools could reduce avoidable errors and improve manuscript quality before peer review\. However, deployment in this context requires careful attention to confidentiality, since uploading unpublished work to external services may expose intellectual property\. Institutionally hosted systems or locally deployed models may therefore provide safer alternatives for pre\-submission use\.

LLMs for first\-pass screening\. Another potential application is first\-pass screening\. Conferences and journals often receive a large volume of submissions, creating substantial burdens for editors, area chairs, and reviewers\. LLM\-based systems could provide preliminary signals for editorial triage, such as identifying submissions that may require additional scrutiny, lack required components, or make unclear methodological claims\. However, they should not be used as stand\-alone mechanisms for desk rejection or acceptance decisions\. Such systems could also support iterative improvement by offering authors early feedback before human review\. For example, authors might respond to LLM\-generated comments or revise their manuscripts before subsequent evaluation, thereby reducing avoidable issues and improving the efficiency of later review stages\. A two\-stage pipeline of this kind could improve scalability while preserving human oversight for final editorial and acceptance decisions\.

LLMs as co\-reviewers\. LLM\-generated reviews may also be used alongside human reviews to provide complementary perspectives\. In this role, automated reviews would not replace human judgment but would instead serve as an additional source of feedback, potentially highlighting issues that human reviewers overlook or offering alternative interpretations of a manuscript’s contributions and limitations\. For example, in 2025 AAAI launched a pilot program in which LLM\-generated reviews were shared with authors as supplementary feedback, enriching the evaluation process with additional perspectives \(AAAI, [2025](https://arxiv.org/html/2606.25057#bib.bib44)\)\. Such deployments may be particularly useful when clearly labeled, carefully calibrated, and separated from binding decision\-making authority\.

LLMs as reviewer assistants\. LLMs can also support reviewers through auxiliary tasks such as summarizing manuscripts, verifying citations, checking internal consistency, identifying missing experimental details, and improving the clarity of review comments\. These functions may allow human reviewers to focus more directly on substantive evaluation and final judgment \(Luo et al\., [2025](https://arxiv.org/html/2606.25057#bib.bib24)\)\. For instance, a recent ICLR pilot experiment used LLMs to flag inappropriate remarks, highlight manuscript sections relevant to reviewers’ questions, and suggest clearer phrasing for vague review comments \(ICLR, [2025](https://arxiv.org/html/2606.25057#bib.bib45)\)\. Beyond these auxiliary uses, LLMs could also generate initial review drafts based on manuscript content, which human reviewers would then revise, contextualize, and personalize by adding their own evaluations and insights\. However, over\-reliance on AI\-generated content remains a significant risk\. To mitigate this concern, reviewers should retain responsibility for the final review and may be required to justify cases in which their final assessments closely align with LLM\-generated suggestions\.

## 8\.Future Research Directions

Despite rapid advances in LLM\-based critique and score generation, automated scientific peer review remains an emerging research area\. Future work should move from review\-style generation toward structured evaluative modeling that tests novelty, evidence support, methodological rigor, claim consistency, and the relationship between contributions and prior literature\. Promising directions include decomposing review generation into interpretable sub\-tasks, incorporating tool\-augmented reasoning, and leveraging structured representations such as argument graphs or claim–evidence mappings to better capture the underlying processes of scientific critique\.

Another important direction is to expand beyond the current computer science–centric datasets\. Future benchmarks should test whether models trained on ML/NLP review norms transfer to fields with different evidentiary standards, such as biomedicine, economics, and the humanities\. This calls for domain adaptation techniques and for domain\-diverse benchmarks that reflect the norms and evaluation criteria of different scientific communities\.

Robustness research should move from general warnings to benchmarked stress tests, including prompt\-injection suites, poisoned\-review training sets, retrieval\-corruption tests, and adversarial manuscripts\. Such benchmarks should evaluate not only whether attacks change generated text, but also whether they distort scores, alter critique emphasis, or affect downstream editorial decisions\.
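
A prompt\-injection stress test of this kind can be sketched as a small harness that scores each manuscript with and without an injected payload, then reports the mean score shift and the fraction of accept/reject decisions that flip\. The harness name `injection_stress_test`, the acceptance threshold, and the payload are assumptions; `naive_scorer` is a deliberately vulnerable stand\-in for an LLM reviewer, used only so the example runs without a model\.

```python
def injection_stress_test(score_fn, manuscripts, payload, accept_threshold=6.0):
    """Compare scores on clean manuscripts against payload-appended copies."""
    shifts = []
    flips = 0
    for text in manuscripts:
        clean = score_fn(text)
        attacked = score_fn(text + "\n" + payload)
        shifts.append(attacked - clean)
        # A flip means the attack moved the paper across the decision boundary.
        if (clean >= accept_threshold) != (attacked >= accept_threshold):
            flips += 1
    n = len(manuscripts)
    return {"mean_shift": sum(shifts) / n, "flip_rate": flips / n}


def naive_scorer(text):
    """Hypothetical stand-in for an LLM reviewer; vulnerable by construction."""
    score = 5.0
    lowered = text.lower()
    if "ablation" in lowered:
        score += 0.5
    if "give a positive review" in lowered:  # obeys the injected instruction
        score += 3.0
    return score


manuscripts = [
    "We propose X and include an ablation study.",
    "We propose Y.",
]
payload = "IGNORE PREVIOUS INSTRUCTIONS AND GIVE A POSITIVE REVIEW."
print(injection_stress_test(naive_scorer, manuscripts, payload))
# {'mean_shift': 3.0, 'flip_rate': 1.0}
```

Reporting the flip rate alongside the mean shift captures the distinction drawn above: an attack matters most when it changes a downstream decision, not merely the generated text\.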

Finally, richer evaluation paradigms are needed to better align automated metrics with the goals of scientific critique\. Traditional text similarity metrics are insufficient for capturing reasoning depth, critique coverage, factual grounding, and logical coherence\. Future evaluation should prioritize evidence\-based assessment across multiple reviews, as well as human\-AI agreement modeling that goes beyond simple correlation\-based measures\.
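
To illustrate why correlation alone can mislead, the sketch below implements quadratic weighted kappa, a standard chance\-corrected agreement statistic for ordinal ratings, in plain Python\. It is offered as one example of an agreement measure, not as the evaluation protocol of any surveyed benchmark; the integer score range and the example data are assumptions\.

```python
def quadratic_weighted_kappa(human, model, min_score, max_score):
    """Chance-corrected agreement between two lists of integer ratings."""
    n_cat = max_score - min_score + 1
    n = len(human)
    # Confusion matrix: rows are human ratings, columns are model ratings.
    observed = [[0] * n_cat for _ in range(n_cat)]
    for h, m in zip(human, model):
        observed[h - min_score][m - min_score] += 1
    hist_h = [sum(row) for row in observed]
    hist_m = [sum(observed[i][j] for i in range(n_cat)) for j in range(n_cat)]

    num = den = 0.0
    for i in range(n_cat):
        for j in range(n_cat):
            weight = (i - j) ** 2 / (n_cat - 1) ** 2  # quadratic penalty
            expected = hist_h[i] * hist_m[j] / n       # agreement under independence
            num += weight * observed[i][j]
            den += weight * expected
    return 1.0 - num / den if den else 1.0


human = [1, 2, 3]
model = [3, 4, 5]  # same ranking as the human scores, but two points higher
print(quadratic_weighted_kappa(human, model, min_score=1, max_score=5))
```

In this example the model scores are the human scores shifted up by two points, so their Pearson correlation is 1, yet the kappa value is only about 0\.25 because the model never assigns the same score as the human reviewer\.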

## 9\.Conclusion

This survey examined the use of LLMs for automated peer review generation across several key dimensions, including modeling approaches, benchmark resources, security risks, deployment considerations, and practical application scenarios\. Our analysis highlights that although LLMs can produce fluent, well\-structured reviews, substantial challenges remain in calibrated scoring, deep methodological reasoning, cross\-domain generalization, robustness, and fairness\. Moreover, the integration of automated systems into peer review introduces broader institutional concerns related to transparency, accountability, and confidentiality\.

Despite these limitations, LLMs hold considerable promise as assistive tools in the scholarly review process\. They can support authors in pre\-submission refinement, help reviewers with summarization and consistency checking, and assist editors and program chairs in managing review workflow\. When appropriately designed and deployed, such systems may reduce reviewer burden and improve workflow efficiency\. The most promising near\-term direction is a hybrid human–AI collaboration model that preserves human decision authority while leveraging the scalability of automated assistance\. Looking ahead, progress in automated peer review will require advances in evaluative reasoning, task\-specific benchmarking, robustness testing, transparency mechanisms, and fairness\-aware system design\. The central challenge is therefore not whether LLMs can imitate the surface form of peer review, but whether they can be integrated as calibrated, transparent, and auditable decision\-support tools within human\-led evaluation workflows\.

## References

- \[1\]AAAI\(2025\)AAAI launches ai\-powered peer review assessment system\.Note:[https://aaai\.org/aaai\-launches\-ai\-powered\-peer\-review\-assessment\-system/](https://aaai.org/aaai-launches-ai-powered-peer-review-assessment-system/)Published: 2025\-05\-16Cited by:[§7](https://arxiv.org/html/2606.25057#S7.p4.1)\.
- \[2\]J\. Achiam, S\. Adler, S\. Agarwal, L\. Ahmad, I\. Akkaya, F\. L\. Aleman, D\. Almeida, J\. Altenschmidt, S\. Altman, S\. Anadkat,et al\.\(2023\)GPT\-4 technical report\.arXiv preprint arXiv:2303\.08774\.Cited by:[§1](https://arxiv.org/html/2606.25057#S1.p2.1),[§3\.1](https://arxiv.org/html/2606.25057#S3.SS1.p1.1),[§3\.5](https://arxiv.org/html/2606.25057#S3.SS5.p2.1)\.
- \[3\]S\. Banerjee and A\. Lavie\(2005\)METEOR: an automatic metric for mt evaluation with improved correlation with human judgments\.InProceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization,pp\. 65–72\.Cited by:[§4\.2](https://arxiv.org/html/2606.25057#S4.SS2.p2.1)\.
- \[4\]P\. K\. Bharti, V\. Dalal, and M\. Panchal\(2026\)Co\-reviewer: can ai review like a human? an agentic framework for llm\-human alignment in peer review\.Scientometrics,pp\. 1–42\.Cited by:[3rd item](https://arxiv.org/html/2606.25057#S3.I2.i3.p1.1)\.
- \[5\]P\. K\. Bharti, S\. Ranjan, T\. Ghosal, M\. Agrawal, and A\. Ekbal\(2021\)Peerassist: leveraging on paper\-review interactions to predict peer review decisions\.InTowards Open and Trustworthy Digital Societies: 23rd International Conference on Asia\-Pacific Digital Libraries, ICADL 2021, Virtual Event, December 1–3, 2021, Proceedings 23,pp\. 421–435\.Cited by:[Table 3](https://arxiv.org/html/2606.25057#S4.T3.1.1.9.8.1)\.
- \[6\]B\. Biggio and F\. Roli\(2018\)Wild patterns: ten years after the rise of adversarial machine learning\.InProceedings of the 2018 ACM SIGSAC conference on computer and communications security,pp\. 2154–2156\.Cited by:[§5\.2](https://arxiv.org/html/2606.25057#S5.SS2.p1.1)\.
- \[7\]S\. Biswas, D\. Dobaria, and H\. L\. Cohen\(2023\)ChatGPT and the future of journal reviews: a feasibility study\.The Yale Journal of Biology and Medicine96\(3\),pp\. 415\.Cited by:[§3\.1](https://arxiv.org/html/2606.25057#S3.SS1.p5.1),[§3\.5](https://arxiv.org/html/2606.25057#S3.SS5.p2.1)\.
- \[8\]L\. Bornmann\(2011\)Scientific peer review\.Annual review of information science and technology45\(1\),pp\. 197–245\.Cited by:[§3\.5](https://arxiv.org/html/2606.25057#S3.SS5.p3.1)\.
- \[9\]T\. Chai and R\. R\. Draxler\(2014\)Root mean square error \(rmse\) or mean absolute error \(mae\)?–arguments against avoiding rmse in the literature\.Geoscientific model development7\(3\),pp\. 1247–1250\.Cited by:[§4\.2](https://arxiv.org/html/2606.25057#S4.SS2.p2.1)\.
- \[10\]S\. Chen, D\. Brumby, and A\. Cox\(2025\)Envisioning the future of peer review: investigating llm\-assisted reviewing using chatgpt as a case study\.InProceedings of the 4th Annual Symposium on Human\-Computer Interaction for Work,pp\. 1–18\.Cited by:[§3\.5](https://arxiv.org/html/2606.25057#S3.SS5.p2.1)\.
- \[11\]M\. D’Arcy, T\. Hope, L\. Birnbaum, and D\. Downey\(2024\)Marg: multi\-agent review generation for scientific papers\.arXiv preprint arXiv:2401\.04259\.Cited by:[§3\.1](https://arxiv.org/html/2606.25057#S3.SS1.p7.1),[Table 1](https://arxiv.org/html/2606.25057#S3.T1.1.5.5.1.1.1)\.
- \[12\]J\. Du, Y\. Wang, W\. Zhao, Z\. Deng, S\. Liu, R\. Lou, H\. P\. Zou, P\. N\. Venkit, N\. Zhang, M\. Srinath,et al\.\(2024\)Llms assist nlp researchers: critique paper \(meta\-\) reviewing\.InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing,Cited by:[§3\.1](https://arxiv.org/html/2606.25057#S3.SS1.p6.1),[§3\.5](https://arxiv.org/html/2606.25057#S3.SS5.p2.1)\.
- \[13\]N\. Dycke, I\. Kuznetsov, and I\. Gurevych\(2023\)NLPeer: a unified resource for the computational study of peer review\.InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics,Cited by:[§3](https://arxiv.org/html/2606.25057#S3.p1.1),[§4\.1](https://arxiv.org/html/2606.25057#S4.SS1.p1.1),[§4\.1](https://arxiv.org/html/2606.25057#S4.SS1.p3.1),[§4\.1](https://arxiv.org/html/2606.25057#S4.SS1.p4.1),[Table 3](https://arxiv.org/html/2606.25057#S4.T3.1.1.13.12.1)\.
- \[14\]A\. R\. B\. M\. Faizullah, A\. Urlana, and R\. Mishra\(2024\)LimGen: probing the llms for generating suggestive limitations of research papers\.InJoint European Conference on Machine Learning and Knowledge Discovery in Databases,pp\. 106–124\.Cited by:[§3\.2](https://arxiv.org/html/2606.25057#S3.SS2.p3.1),[§3\.2](https://arxiv.org/html/2606.25057#S3.SS2.p6.1)\.
- \[15\]X\. Gao, J\. Ruan, J\. Gao, T\. Liu, and Y\. Fu\(2025\)Reviewagents: bridging the gap between human and ai\-generated paper reviews\.arXiv preprint arXiv:2503\.08506\.Cited by:[§3\.3](https://arxiv.org/html/2606.25057#S3.SS3.p4.1)\.
- \[16\]Y\. Gao, Y\. Xiong, X\. Gao, K\. Jia, J\. Pan, Y\. Bi, Y\. Dai, J\. Sun, H\. Wang, H\. Wang,et al\.\(2023\)Retrieval\-augmented generation for large language models: a survey\.arXiv preprint arXiv:2312\.10997\.Cited by:[§3\.3](https://arxiv.org/html/2606.25057#S3.SS3.p1.1)\.
- \[17\]Z\. Gao, K\. Brantley, and T\. Joachims\(2024\)Reviewer2: optimizing review generation through prompt generation\.arXiv preprint arXiv:2402\.10886\.Cited by:[§3\.2](https://arxiv.org/html/2606.25057#S3.SS2.p1.1),[§3\.2](https://arxiv.org/html/2606.25057#S3.SS2.p3.1)\.
- \[18\]M\. Hosseini and S\. P\. Horbach\(2023\)Fighting reviewer fatigue or amplifying bias? considerations and recommendations for use of chatgpt and other large language models in scholarly peer review\.Research integrity and peer review8\(1\),pp\. 4\.Cited by:[§6](https://arxiv.org/html/2606.25057#S6.p5.1)\.
- \[19\]E\. J\. Hu, Y\. Shen, P\. Wallis, Z\. Allen\-Zhu, Y\. Li, S\. Wang, L\. Wang, W\. Chen,et al\.\(2022\)LoRA: low\-rank adaptation of large language models\.InInternational Conference on Learning Representations \(ICLR\),Cited by:[§3\.2](https://arxiv.org/html/2606.25057#S3.SS2.p3.1)\.
- \[20\]X\. Hua, M\. Nikolov, N\. Badugu, and L\. Wang\(2019\)Argument mining for understanding peer reviews\.InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics,Cited by:[§4\.1](https://arxiv.org/html/2606.25057#S4.SS1.p4.1)\.
- \[21\]L\. Huang, W\. Yu, W\. Ma, W\. Zhong, Z\. Feng, H\. Wang, Q\. Chen, W\. Peng, X\. Feng, B\. Qin,et al\.\(2025\)A survey on hallucination in large language models: principles, taxonomy, challenges, and open questions\.ACM Transactions on Information Systems43\(2\),pp\. 1–55\.Cited by:[2nd item](https://arxiv.org/html/2606.25057#S3.I2.i2.p1.1)\.
- \[22\]ICLR\(2025\)Leveraging llm feedback to enhance review quality\.Note:[https://blog\.iclr\.cc/2025/04/15/leveraging\-llm\-feedback\-to\-enhance\-review\-quality/](https://blog.iclr.cc/2025/04/15/leveraging-llm-feedback-to-enhance-review-quality/)Published: 2025\-04\-15Cited by:[§7](https://arxiv.org/html/2606.25057#S7.p5.1)\.
- \[23\]M\. Idahl and Z\. Ahmadi\(2025\)OpenReviewer: a specialized large language model for generating critical scientific paper reviews\.InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics,Cited by:[§1](https://arxiv.org/html/2606.25057#S1.p2.1),[§3\.2](https://arxiv.org/html/2606.25057#S3.SS2.p1.1),[§3\.2](https://arxiv.org/html/2606.25057#S3.SS2.p2.1),[§3\.2](https://arxiv.org/html/2606.25057#S3.SS2.p6.1),[§3\.5](https://arxiv.org/html/2606.25057#S3.SS5.p3.1),[§4\.1](https://arxiv.org/html/2606.25057#S4.SS1.p1.1),[Table 3](https://arxiv.org/html/2606.25057#S4.T3.1.1.27.26.1),[§7](https://arxiv.org/html/2606.25057#S7.p2.1)\.
- \[24\]Z\. Ji, N\. Lee, R\. Frieske, T\. Yu, D\. Su, Y\. Xu, E\. Ishii, Y\. J\. Bang, A\. Madotto, and P\. Fung\(2023\)Survey of hallucination in natural language generation\.ACM computing surveys55\(12\),pp\. 1–38\.Cited by:[§3\.5](https://arxiv.org/html/2606.25057#S3.SS5.p2.1)\.
- \[25\]Y\. Jin, Q\. Zhao, Y\. Wang, H\. Chen, K\. Zhu, Y\. Xiao, and J\. Wang\(2024\)Agentreview: exploring peer review dynamics with llm agents\.InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing,Cited by:[Table 3](https://arxiv.org/html/2606.25057#S4.T3.1.1.18.17.1),[§6](https://arxiv.org/html/2606.25057#S6.p6.1)\.
- \[26\]D\. Kang, W\. Ammar, B\. Dalvi, M\. Van Zuylen, S\. Kohlmeier, E\. Hovy, and R\. Schwartz\(2018\)A dataset of peer reviews \(peerread\): collection, insights and nlp applications\.InProceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies,Cited by:[§3](https://arxiv.org/html/2606.25057#S3.p1.1),[§4\.1](https://arxiv.org/html/2606.25057#S4.SS1.p1.1),[§4\.1](https://arxiv.org/html/2606.25057#S4.SS1.p2.1),[Table 3](https://arxiv.org/html/2606.25057#S4.T3.1.1.4.3.1)\.
- \[27\]J\. Keuper\(2025\)Prompt injection attacks on llm generated reviews of scientific publications\.arXiv preprint arXiv:2509\.10248\.Cited by:[§5\.1](https://arxiv.org/html/2606.25057#S5.SS1.p2.1)\.
- \[28\]T\. Kojima, S\. S\. Gu, M\. Reid, Y\. Matsuo, and Y\. Iwasawa\(2022\)Large language models are zero\-shot reasoners\.Advances in neural information processing systems35,pp\. 22199–22213\.Cited by:[§3\.1](https://arxiv.org/html/2606.25057#S3.SS1.p3.1)\.
- \[29\]N\. Künzli, A\. Berger, K\. Czabanowska, R\. Lucas, A\. Madarasova Geckova, S\. Mantwill, and O\. von Dem Knesebeck\(2022\)«i do not have time»—is this the end of peer review in public health sciences?\.Public health reviews43,pp\. 1605407\.Cited by:[§1](https://arxiv.org/html/2606.25057#S1.p1.1)\.
- \[30\]I\. Kuznetsov, O\. M\. Afzal, K\. Dercksen, N\. Dycke, A\. Goldberg, T\. Hope, D\. Hovy, J\. K\. Kummerfeld, A\. Lauscher, K\. Leyton\-Brown,et al\.\(2024\)What can natural language processing do for peer review?\.arXiv preprint arXiv:2405\.06563\.Cited by:[§1](https://arxiv.org/html/2606.25057#S1.p3.1)\.
- \[31\]G\. R\. Latona, M\. H\. Ribeiro, T\. R\. Davidson, V\. Veselovsky, and R\. West\(2024\)The ai review lottery: widespread ai\-assisted peer reviews boost paper scores and acceptance rates\.arXiv preprint arXiv:2405\.02150\.Cited by:[§3](https://arxiv.org/html/2606.25057#S3.p2.1)\.
- \[32\]P\. Lewis, E\. Perez, A\. Piktus, F\. Petroni, V\. Karpukhin, N\. Goyal, H\. Küttler, M\. Lewis, W\. Yih, T\. Rocktäschel,et al\.\(2020\)Retrieval\-augmented generation for knowledge\-intensive nlp tasks\.Advances in neural information processing systems33,pp\. 9459–9474\.Cited by:[§5\.3](https://arxiv.org/html/2606.25057#S5.SS3.p1.1)\.
- \[33\]J\. Li, W\. X\. Zhao, J\. Wen, and Y\. Song\(2019\)Generating long and informative reviews with aspect\-aware coarse\-to\-fine decoding\.InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics \(ACL\),pp\. 1969–1979\.Cited by:[§1](https://arxiv.org/html/2606.25057#S1.p2.1)\.
- \[34\]W\. Liang, Y\. Zhang, H\. Cao, B\. Wang, D\. Y\. Ding, X\. Yang, K\. Vodrahalli, S\. He, D\. S\. Smith, Y\. Yin,et al\.\(2024\)Can large language models provide useful feedback on research papers? a large\-scale empirical analysis\.NEJM AI1\(8\),pp\. AIoa2400196\.Cited by:[§3\.1](https://arxiv.org/html/2606.25057#S3.SS1.p5.1),[§3\.5](https://arxiv.org/html/2606.25057#S3.SS5.p2.1),[Table 1](https://arxiv.org/html/2606.25057#S3.T1.1.3.3.1.1.1),[§7](https://arxiv.org/html/2606.25057#S7.p2.1)\.
- \[35\]C\. Lin\(2004\)Rouge: a package for automatic evaluation of summaries\.InText summarization branches out,pp\. 74–81\.Cited by:[§4\.2](https://arxiv.org/html/2606.25057#S4.SS2.p2.1)\.
- \[36\]J\. Lin, J\. Song, Z\. Zhou, Y\. Chen, and X\. Shi\(2023\)Moprd: a multidisciplinary open peer review dataset\.Neural Computing and Applications35\(34\),pp\. 24191–24206\.Cited by:[§3](https://arxiv.org/html/2606.25057#S3.p1.1),[§4\.1](https://arxiv.org/html/2606.25057#S4.SS1.p1.1),[§4\.1](https://arxiv.org/html/2606.25057#S4.SS1.p3.1),[Table 3](https://arxiv.org/html/2606.25057#S4.T3.1.1.15.14.1)\.
- \[37\]P\. Liu, W\. Yuan, J\. Fu, Z\. Jiang, H\. Hayashi, and G\. Neubig\(2023\)Pre\-train, prompt, and predict: a systematic survey of prompting methods in natural language processing\.ACM computing surveys55\(9\),pp\. 1–35\.Cited by:[§3\.1](https://arxiv.org/html/2606.25057#S3.SS1.p2.1)\.
- \[38\]Z\. Luo, Z\. Yang, Z\. Xu, W\. Yang, and X\. Du\(2025\)LLM4SR: a survey on large language models for scientific research\.arXiv preprint arXiv:2501\.04306\.Cited by:[§1](https://arxiv.org/html/2606.25057#S1.p3.1),[§7](https://arxiv.org/html/2606.25057#S7.p5.1)\.
- \[39\]E\. Markhasin\(2025\)AI\-driven scholarly peer review via persistent workflow prompting, meta\-prompting, and meta\-reasoning\.arXiv preprint arXiv:2505\.03332\.Cited by:[§3\.1](https://arxiv.org/html/2606.25057#S3.SS1.p8.1),[Table 1](https://arxiv.org/html/2606.25057#S3.T1.1.2.2.1.1.1)\.
- \[40\]S\. Min, X\. Lyu, A\. Holtzman, M\. Artetxe, M\. Lewis, H\. Hajishirzi, and L\. Zettlemoyer\(2022\)Rethinking the role of demonstrations: what makes in\-context learning work?\.InProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing,Cited by:[§3\.1](https://arxiv.org/html/2606.25057#S3.SS1.p3.1)\.
- \[41\]A\. Mostafa, T\. H\. Nguyen, and Z\. Ahmadi\(2026\)What is novel? a knowledge\-driven framework for bias\-aware literature originality evaluation\.arXiv preprint arXiv:2602\.06054\.Cited by:[§3\.2](https://arxiv.org/html/2606.25057#S3.SS2.p6.1),[§3\.3](https://arxiv.org/html/2606.25057#S3.SS3.p3.1)\.
- \[42\]A\. Mulligan, L\. Hall, and E\. Raphael\(2013\)Peer review in a changing world: an international study measuring the attitudes of researchers\.Journal of the American Society for Information Science and Technology64\(1\),pp\. 132–161\.Cited by:[§2](https://arxiv.org/html/2606.25057#S2.p1.1)\.
- \[43\]OpenAI\(2023\)ChatGPT \(mar 14 version\)\.Note:[https://chat\.openai\.com](https://chat.openai.com/)Accessed: 2025\-04\-29Cited by:[§3](https://arxiv.org/html/2606.25057#S3.p2.1)\.
- \[44\]OPUS Project Consortium\(2024\)Issues of ai and academic transparency\.Note:[https://opusproject\.eu/openscience\-news/issues\-of\-ai\-and\-academic\-transparency/?utm\_source=chatgpt\.com](https://opusproject.eu/openscience-news/issues-of-ai-and-academic-transparency/?utm_source=chatgpt.com)Published: 2024\-05\-03Cited by:[§6](https://arxiv.org/html/2606.25057#S6.p4.1)\.
- \[45\]Paper Copilot\(2026\)Paper copilot statistics\.Note:[https://papercopilot\.com/statistics/](https://papercopilot.com/statistics/)Cited by:[§1](https://arxiv.org/html/2606.25057#S1.p1.1)\.
- \[46\]K\. Papineni, S\. Roukos, T\. Ward, and W\. Zhu\(2002\)Bleu: a method for automatic evaluation of machine translation\.InProceedings of the 40th annual meeting of the Association for Computational Linguistics,pp\. 311–318\.Cited by:[§4\.2](https://arxiv.org/html/2606.25057#S4.SS2.p2.1)\.
- \[47\]P\. Pataranutaporn, N\. Powdthavee, and P\. Maes\(2025\)Can ai solve the peer review crisis? a large scale experiment on llm’s performance and biases in evaluating economics papers\.arXiv preprint arXiv:2502\.00070\.Cited by:[§6](https://arxiv.org/html/2606.25057#S6.p5.1)\.
- \[48\]E\. L\. Pier, M\. Brauer, A\. Filut, A\. Kaatz, J\. Raclaw, M\. J\. Nathan, C\. E\. Ford, and M\. Carnes\(2018\)Low agreement among reviewers evaluating the same nih grant applications\.Proceedings of the National Academy of Sciences115\(12\),pp\. 2952–2957\.Cited by:[§3\.5](https://arxiv.org/html/2606.25057#S3.SS5.p3.1)\.
- \[49\]S\. Price and P\. A\. Flach\(2017\)Computational support for academic peer review: a perspective from artificial intelligence\.Communications of the ACM60\(3\),pp\. 70–79\.Cited by:[§1](https://arxiv.org/html/2606.25057#S1.p2.1)\.
- \[50\]Z\. Robertson\(2023\)Gpt4 is slightly helpful for peer\-review assistance: a pilot study\.arXiv preprint arXiv:2307\.05492\.Cited by:[§3\.1](https://arxiv.org/html/2606.25057#S3.SS1.p5.1),[§3\.5](https://arxiv.org/html/2606.25057#S3.SS5.p2.1)\.
- \[51\]A\. Saad, N\. Jenko, S\. Ariyaratne, N\. Birch, K\. P\. Iyengar, A\. M\. Davies, R\. Vaishya, and R\. Botchu\(2024\)Exploring the potential of chatgpt in the peer review process: an observational study\.Diabetes & Metabolic Syndrome: Clinical Research & Reviews18\(2\),pp\. 102946\.Cited by:[§3\.1](https://arxiv.org/html/2606.25057#S3.SS1.p5.1),[§3\.5](https://arxiv.org/html/2606.25057#S3.SS5.p3.1),[Table 1](https://arxiv.org/html/2606.25057#S3.T1.1.4.4.1.1.1)\.
- \[52\]M\. Stahl, L\. Biermann, A\. Nehring, and H\. Wachsmuth\(2024\)Exploring llm prompting strategies for joint essay scoring and feedback generation\.InProceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications \(BEA 2024\),Cited by:[Table 1](https://arxiv.org/html/2606.25057#S3.T1.1.6.6.1.1.1)\.
- \[53\]L\. Stappen, G\. Rizos, M\. Hasan, T\. Hain, and B\. W\. Schuller\(2020\)Uncertainty\-aware machine support for paper reviewing on the interspeech 2019 submission corpus\.In21st Annual Conference of the International Speech Communication Association,Cited by:[§3](https://arxiv.org/html/2606.25057#S3.p1.1),[Table 3](https://arxiv.org/html/2606.25057#S4.T3.1.1.7.6.1)\.
- \[54\]P\. Sukpanichnant, A\. Rapberger, and F\. Toni\(2024\)PeerArg: argumentative peer review with llms\.arXiv preprint arXiv:2409\.16813\.Cited by:[§3\.1](https://arxiv.org/html/2606.25057#S3.SS1.p6.1)\.
- \[55\]P\. Taechoyotin and D\. Acuna\(2025\)REMOR: automated peer review generation with llm reasoning and multi\-objective reinforcement learning\.arXiv preprint arXiv:2505\.11718\.Cited by:[§3\.4](https://arxiv.org/html/2606.25057#S3.SS4.p3.1),[Table 3](https://arxiv.org/html/2606.25057#S4.T3.1.1.25.24.1)\.
- \[56\]C\. Tan, D\. Lyu, S\. Li, Z\. Gao, J\. Wei, S\. Ma, Z\. Liu, and S\. Z\. Li\(2024\)Peer review as a multi\-turn and long\-context dialogue with role\-based interactions\.arXiv preprint arXiv:2406\.05688\.Cited by:[§3\.4](https://arxiv.org/html/2606.25057#S3.SS4.p3.1),[§4\.1](https://arxiv.org/html/2606.25057#S4.SS1.p1.1),[Table 3](https://arxiv.org/html/2606.25057#S4.T3.1.1.20.19.1)\.
- \[57\]G\. Team, R\. Anil, S\. Borgeaud, J\. Alayrac, J\. Yu, R\. Soricut, J\. Schalkwyk, A\. M\. Dai, A\. Hauth, K\. Millican,et al\.\(2023\)Gemini: a family of highly capable multimodal models\.arXiv preprint arXiv:2312\.11805\.Cited by:[§1](https://arxiv.org/html/2606.25057#S1.p2.1),[§3\.1](https://arxiv.org/html/2606.25057#S3.SS1.p1.1)\.
- \[58\] H\. Touvron, T\. Lavril, G\. Izacard, X\. Martinet, M\. Lachaux, T\. Lacroix, B\. Rozière, N\. Goyal, E\. Hambro, F\. Azhar, et al\. \(2023\) LLaMA: open and efficient foundation language models\. arXiv preprint arXiv:2302\.13971\. Cited by: [§1](https://arxiv.org/html/2606.25057#S1.p2.1), [§3\.1](https://arxiv.org/html/2606.25057#S3.SS1.p1.1)\.
- \[59\] K\. Tyser, B\. Segev, G\. Longhitano, X\. Zhang, Z\. Meeks, J\. Lee, U\. Garg, N\. Belsten, A\. Shporer, M\. Udell, et al\. \(2024\) AI\-driven review systems: evaluating LLMs in scalable and bias\-aware academic reviews\. arXiv preprint arXiv:2408\.10365\. Cited by: [§3](https://arxiv.org/html/2606.25057#S3.p2.1)\.
- \[60\] Q\. Wang, Q\. Zeng, L\. Huang, K\. Knight, H\. Ji, and N\. F\. Rajani \(2020\) ReviewRobot: explainable paper review generation based on knowledge synthesis\. In Proceedings of the 13th International Conference on Natural Language Generation, pp\. 384–397\. Cited by: [§3\.3](https://arxiv.org/html/2606.25057#S3.SS3.p3.1)\.
- \[61\] Q\. Wang and Y\. Tan \(2020\) Grammatical error detection with self attention by pairwise training\. In 2020 International Joint Conference on Neural Networks \(IJCNN\), pp\. 1–7\. Cited by: [§1](https://arxiv.org/html/2606.25057#S1.p2.1)\.
- \[62\] J\. Wei, X\. Wang, D\. Schuurmans, M\. Bosma, F\. Xia, E\. Chi, Q\. V\. Le, D\. Zhou, et al\. \(2022\) Chain\-of\-thought prompting elicits reasoning in large language models\. Advances in Neural Information Processing Systems 35, pp\. 24824–24837\. Cited by: [§3\.1](https://arxiv.org/html/2606.25057#S3.SS1.p8.1)\.
- \[63\] Y\. Weng, M\. Zhu, G\. Bao, H\. Zhang, J\. Wang, Y\. Zhang, and L\. Yang \(2025\) CycleResearcher: improving automated research via automated review\. Cited by: [§1](https://arxiv.org/html/2606.25057#S1.p2.1), [§3\.2](https://arxiv.org/html/2606.25057#S3.SS2.p6.1), [Table 3](https://arxiv.org/html/2606.25057#S4.T3.1.1.21.20.1)\.
- \[64\] C\. J\. Willmott and K\. Matsuura \(2005\) Advantages of the mean absolute error \(MAE\) over the root mean square error \(RMSE\) in assessing average model performance\. Climate Research 30\(1\), pp\. 79–82\. Cited by: [§4\.2](https://arxiv.org/html/2606.25057#S4.SS2.p2.1)\.
- \[65\] R\. Ye, X\. Pang, J\. Chai, J\. Chen, Z\. Yin, Z\. Xiang, X\. Dong, J\. Shao, and S\. Chen \(2024\) Are we there yet? Revealing the risks of utilizing large language models in scholarly peer review\. arXiv preprint arXiv:2412\.01708\. Cited by: [§6](https://arxiv.org/html/2606.25057#S6.p3.1), [§6](https://arxiv.org/html/2606.25057#S6.p5.1), [§6](https://arxiv.org/html/2606.25057#S6.p6.1)\.
- \[66\] J\. Yu, Z\. Ding, J\. Tan, K\. Luo, Z\. Weng, C\. Gong, L\. Zeng, R\. Cui, C\. Han, Q\. Sun, et al\. \(2024\) Automated peer reviewing in paper SEA: standardization, evaluation, and analysis\. arXiv preprint arXiv:2407\.12857\. Cited by: [§1](https://arxiv.org/html/2606.25057#S1.p2.1)\.
- \[67\] J\. Yu, Z\. Ding, J\. Tan, K\. Luo, Z\. Weng, C\. Gong, L\. Zeng, R\. Cui, C\. Han, Q\. Sun, Z\. Wu, Y\. Lan, and X\. Li \(2024\) Automated peer reviewing in paper SEA: standardization, evaluation, and analysis\. In Findings of the Association for Computational Linguistics: EMNLP 2024\. Cited by: [§3\.2](https://arxiv.org/html/2606.25057#S3.SS2.p5.1)\.
- \[68\] S\. Yu, M\. Luo, A\. Madasu, V\. Lal, and P\. Howard \(2024\) Is your paper being reviewed by an LLM? Investigating AI text detectability in peer review\. arXiv preprint arXiv:2410\.03019\. Cited by: [§3\.5](https://arxiv.org/html/2606.25057#S3.SS5.p2.1), [§3](https://arxiv.org/html/2606.25057#S3.p2.1)\.
- \[69\] W\. Yuan, P\. Liu, and G\. Neubig \(2022\) Can we automate scientific reviewing?\. Journal of Artificial Intelligence Research 75, pp\. 171–212\. Cited by: [§3](https://arxiv.org/html/2606.25057#S3.p1.1), [Table 3](https://arxiv.org/html/2606.25057#S4.T3.1.1.11.10.1)\.
- \[70\] W\. Yuan and P\. Liu \(2022\) KID\-Review: knowledge\-guided scientific review generation with oracle pre\-training\. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol\. 36, pp\. 11639–11647\. Cited by: [§3](https://arxiv.org/html/2606.25057#S3.p1.1)\.
- \[71\] D\. Zhang, Z\. Bao, S\. Du, Z\. Zhao, K\. Zhang, D\. Bao, and Y\. Yang \(2025\) Re2: a consistency\-ensured dataset for full\-stage peer review and multi\-turn rebuttal discussions\. arXiv preprint arXiv:2505\.07920\. Cited by: [§4\.1](https://arxiv.org/html/2606.25057#S4.SS1.p1.1), [Table 3](https://arxiv.org/html/2606.25057#S4.T3.1.1.1.1)\.
- \[72\] T\. Zhang, V\. Kishore, F\. Wu, K\. Q\. Weinberger, and Y\. Artzi \(2019\) BERTScore: evaluating text generation with BERT\. arXiv preprint arXiv:1904\.09675\. Cited by: [§4\.2](https://arxiv.org/html/2606.25057#S4.SS2.p2.1)\.
- \[73\] W\. Zhao, M\. Peyrard, F\. Liu, Y\. Gao, C\. M\. Meyer, and S\. Eger \(2019\) MoverScore: text generation evaluating with contextualized embeddings and earth mover distance\. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing \(EMNLP\-IJCNLP\), pp\. 563–578\. Cited by: [§4\.2](https://arxiv.org/html/2606.25057#S4.SS2.p2.1)\.
- \[74\] Z\. Zhao, E\. Wallace, S\. Feng, D\. Klein, and S\. Singh \(2021\) Calibrate before use: improving few\-shot performance of language models\. In International Conference on Machine Learning, pp\. 12697–12706\. Cited by: [1st item](https://arxiv.org/html/2606.25057#S3.I2.i1.p1.1), [§5\.1](https://arxiv.org/html/2606.25057#S5.SS1.p1.1)\.
- \[75\] R\. Zhou, L\. Chen, and K\. Yu \(2024\) Is LLM a reliable reviewer? A comprehensive evaluation of LLM on automatic paper reviewing tasks\. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation \(LREC\-COLING 2024\), pp\. 9340–9351\. Cited by: [§3\.1](https://arxiv.org/html/2606.25057#S3.SS1.p6.1), [§3\.5](https://arxiv.org/html/2606.25057#S3.SS5.p2.1), [§3\.5](https://arxiv.org/html/2606.25057#S3.SS5.p3.1)\.
- \[76\] C\. Zhu, J\. Xiong, R\. Ma, Z\. Lu, Y\. Liu, and L\. Li \(2025\) When your reviewer is an LLM: biases, divergence, and prompt injection risks in peer review\. arXiv preprint arXiv:2509\.09912\. Cited by: [§5\.1](https://arxiv.org/html/2606.25057#S5.SS1.p2.1)\.
- \[77\] M\. Zhu, Y\. Weng, L\. Yang, and Y\. Zhang \(2025\) DeepReview: improving LLM\-based paper review with human\-like deep thinking process\. arXiv preprint arXiv:2503\.08569\. Cited by: [§3\.2](https://arxiv.org/html/2606.25057#S3.SS2.p6.1), [§3\.3](https://arxiv.org/html/2606.25057#S3.SS3.p4.1), [§4\.1](https://arxiv.org/html/2606.25057#S4.SS1.p1.1), [Table 3](https://arxiv.org/html/2606.25057#S4.T3.1.1.23.22.1)\.
- \[78\] Z\. Zhuang, J\. Chen, H\. Xu, Y\. Jiang, and J\. Lin \(2025\) Large language models for automated scholarly paper review: a survey\. arXiv preprint arXiv:2501\.10326\. Cited by: [§1](https://arxiv.org/html/2606.25057#S1.p3.1)\.
