From Passive Generation to Investigation: A Proactive Scientific Peer Review Agent
Summary
This paper proposes ProReviewer, an LLM-based scientific peer review agent formulated as a Markov Decision Process, which proactively investigates papers by maintaining a structured review log and outperforms existing methods across multiple quality dimensions.
# From Passive Generation to Investigation: A Proactive Scientific Peer Review Agent

Source: [https://arxiv.org/html/2606.13349](https://arxiv.org/html/2606.13349)

Haishuo Fang (1, 2), Yue Feng (3), Iryna Gurevych (1, 2)

(1) Ubiquitous Knowledge Processing Lab (UKP Lab), Technical University of Darmstadt
(2) National Research Center for Applied Cybersecurity ATHENE, Germany
(3) School of Computer Science, University of Birmingham

www.ukp.tu-darmstadt.de, y.feng.6@bham.ac.uk

###### Abstract

Large language models (LLMs) have shown promise in automating scientific peer review. However, existing approaches often struggle to generate in-depth reviews supported by concrete evidence. We argue that a key limitation is the lack of flexibility to proactively investigate suspicious parts of a paper based on accumulated evidence, as human reviewers do. In this paper, we explore how to enable an LLM-based review agent to perform such proactive investigation. We find that this can be naturally formulated as a Markov Decision Process (MDP), and propose ProReviewer, a scientific peer review agent that proactively reviews a paper guided by a maintained, structured *review log*. The structured review log serves as a workspace for the agent to track evidence and intermediate findings collected during review. Experiments show that ProReviewer with an 8B backbone, trained by supervised fine-tuning and optimized by reinforcement learning, achieves the highest average score across five quality dimensions, outperforming prompt-based methods with much larger frontier LLMs by up to 39% and the strongest fine-tuned baseline by 16% relatively. It also attains the highest win rates against baselines in human evaluation.¹

¹ https://github.com/UKPLab/arxiv2026-ProReviewer
## 1 Introduction

Peer review is the main mechanism by which the research community evaluates and improves scholarly work for publication. Recent advances in large language models (LLMs) have drawn growing attention to automated scientific paper reviewing (Biswas et al., [2026](https://arxiv.org/html/2606.13349#bib.bib2); Idahl and Ahmadi, [2025](https://arxiv.org/html/2606.13349#bib.bib4); Zhuang et al., [2025](https://arxiv.org/html/2606.13349#bib.bib3); Liang et al., [2023](https://arxiv.org/html/2606.13349#bib.bib5)).

Figure 1: An illustrative example of ProReviewer. The agent extracts the claim “robustness across domains” in the introduction, navigates to the experiments to verify it, finds it contradicted by the reported results, and records the inconsistency in its review log.

Prior work has explored several strategies for generating reviews from a manuscript, including direct prompting (Robertson, [2023](https://arxiv.org/html/2606.13349#bib.bib7); Liang et al., [2023](https://arxiv.org/html/2606.13349#bib.bib5); Liu and Shah, [2023](https://arxiv.org/html/2606.13349#bib.bib8)), multi-stage pipelines (Gao et al., [2024](https://arxiv.org/html/2606.13349#bib.bib12); Zhu et al., [2025](https://arxiv.org/html/2606.13349#bib.bib13)), and multi-agent collaboration (Jin et al., [2024](https://arxiv.org/html/2606.13349#bib.bib20); Yamada et al., [2025](https://arxiv.org/html/2606.13349#bib.bib21)).
However, recent studies find that existing methods produce shallow criticism (Li et al., [2025b](https://arxiv.org/html/2606.13349#bib.bib48)), give generic comments without concrete evidence (Ou et al., [2025](https://arxiv.org/html/2606.13349#bib.bib49)), accept authors’ claims as strengths without sufficient investigation (Du et al., [2024](https://arxiv.org/html/2606.13349#bib.bib10); Ye et al., [2024](https://arxiv.org/html/2606.13349#bib.bib47)), and fail to detect logical inconsistencies across sections (e.g., claims contradicted by experimental results) (Dycke and Gurevych, [2026](https://arxiv.org/html/2606.13349#bib.bib9); Li et al., [2025a](https://arxiv.org/html/2606.13349#bib.bib46)). We argue that these limitations arise from a lack of flexibility to proactively investigate suspicious parts of a paper, as human reviewers do. Human expert reviewers connect evidence across sections, revisit earlier claims when inconsistencies surface, and decide what to inspect next based on what they have already found (Willis, [2024](https://arxiv.org/html/2606.13349#bib.bib6)). Existing methods, by contrast, treat reviewing as a passive generation task in which the investigation path is fixed in advance rather than adapted to what has been found, limiting this flexibility. For example, when a claim in the introduction is contradicted by results in the experiments, a human reviewer would cross-check and flag the discrepancy (Figure [1](https://arxiv.org/html/2606.13349#S1.F1)), whereas a system with a fixed investigation path may accept the claim at face value and never revisit it.

To bridge this gap, we propose ProReviewer, a review agent that investigates the paper proactively by maintaining a structured *review log* (§[3.2](https://arxiv.org/html/2606.13349#S3.SS2)). The log records *claims* extracted from the manuscript, *questions* raised during reading, and *notes* capturing intermediate findings.
As the agent reads new content, it updates the log by verifying earlier claims against later evidence, resolving open questions, or noting new findings, so the log both accumulates evidence and guides what to inspect next. The final review is derived directly from the log, making each critique traceable to its supporting evidence.

Because this process involves sequential decisions about what to inspect and how to update the review log, we formalize it as a Markov Decision Process (MDP) (§[3.1](https://arxiv.org/html/2606.13349#S3.SS1)). Unlike prior systems that rely on hand-designed pipelines, the MDP formulation allows the review strategy to be *learned* via reinforcement learning, enabling the agent to adapt its investigation depth to each paper. We train ProReviewer with supervised fine-tuning on synthesized trajectories followed by Group Relative Policy Optimization (GRPO) (Guo et al., [2025](https://arxiv.org/html/2606.13349#bib.bib36)) with a multi-dimensional reward (§[3.3](https://arxiv.org/html/2606.13349#S3.SS3)). To limit data contamination, we construct a version-matched corpus of 5K ICLR 2025/2026 paper–review pairs, training on 4K ICLR 2025 papers and testing on 1K held-out ICLR 2026 papers, which postdate the base model’s knowledge cutoff (§[4.1](https://arxiv.org/html/2606.13349#S4.SS1)).

Experiments show that ProReviewer with an 8B backbone ranks first on average across five review quality dimensions, improving over frontier LLM-based systems (e.g., Gemini-3.1-flash-lite, Qwen3.5-397B-A17B) by up to 39% relatively and over the best fine-tuned baseline by 16%, with human evaluators also preferring its reviews across all pairwise comparisons (§[4.3](https://arxiv.org/html/2606.13349#S4.SS3)).
Further analyses confirm that ProReviewer more effectively detects subtle cross-section inconsistencies (§[5.2](https://arxiv.org/html/2606.13349#S5.SS2)) and maintains robust performance as paper length increases (§[5.3](https://arxiv.org/html/2606.13349#S5.SS3)).

Our contributions can be summarized as follows:

1. An MDP formulation of peer review as proactive investigation, instantiated in ProReviewer, a reinforcement-learning trained review agent.
2. A structured review log that supports traceable, evidence-grounded review generation by maintaining claims, questions, and notes throughout the review process.
3. A curated version-matched corpus of 5K ICLR 2025/2026 paper–review pairs where each review is aligned to the manuscript version it assessed, enabling contamination-controlled training and evaluation.
4. Empirical results showing that ProReviewer outperforms both prompt-based systems with frontier LLMs and fine-tuned baselines across automatic and human evaluation.

## 2 Related Work

##### LLM-based Review Generation.

Early work on automated scientific reviewing used direct prompting to produce a complete review in a single pass (Robertson, [2023](https://arxiv.org/html/2606.13349#bib.bib7); Liu and Shah, [2023](https://arxiv.org/html/2606.13349#bib.bib8); Liang et al., [2023](https://arxiv.org/html/2606.13349#bib.bib5); Weng et al., [2025](https://arxiv.org/html/2606.13349#bib.bib14); Zeng et al., [2025](https://arxiv.org/html/2606.13349#bib.bib24)), but such reviews often lack specificity, depth, and reliable grounding (Du et al., [2024](https://arxiv.org/html/2606.13349#bib.bib10); Shin et al., [2025](https://arxiv.org/html/2606.13349#bib.bib11)).
To introduce more structure, recent methods organize reviewing through staged subtasks (Gao et al., [2024](https://arxiv.org/html/2606.13349#bib.bib12); Zhu et al., [2025](https://arxiv.org/html/2606.13349#bib.bib13)), hierarchical question decomposition (Chang et al., [2025](https://arxiv.org/html/2606.13349#bib.bib25)), multi-agent role assignment (Jin et al., [2024](https://arxiv.org/html/2606.13349#bib.bib20); Goyal et al., [2026](https://arxiv.org/html/2606.13349#bib.bib19); Yamada et al., [2025](https://arxiv.org/html/2606.13349#bib.bib21)), or modular pipelines (Sahu et al., [2025](https://arxiv.org/html/2606.13349#bib.bib1)). All these methods follow a *fixed* review procedure that does not adapt to what has been found in the paper. ProReviewer differs in that (1) its review strategy is *learned* via RL rather than hand-designed, enabling the agent to proactively investigate the paper based on accumulated evidence; and (2) it maintains a structured review log that persists claims, questions, and notes during the review process, supporting cross-section evidence tracking and revision. Concurrent with our work, DeepReviewer 2.0 (Weng et al., [2026](https://arxiv.org/html/2606.13349#bib.bib44)) also tracks evidence during reviewing, but its representation, a traceable review package with anchored annotations, is designed to assist human reviewers in auditing the final output. In contrast, our review log serves as the working memory for the agent to decide what to investigate next based on its accumulated evidence.

##### Agentic Reasoning.

LLM-based agents that interleave reasoning with actions have achieved strong results across web navigation (Nakano et al., [2021](https://arxiv.org/html/2606.13349#bib.bib26)), software engineering (Jimenez et al., [2024](https://arxiv.org/html/2606.13349#bib.bib28)), and scientific discovery (Lu et al., [2024](https://arxiv.org/html/2606.13349#bib.bib29)).
Frameworks such as ReAct (Yao et al., [2023](https://arxiv.org/html/2606.13349#bib.bib27)) alternate between thought and action steps, while Reflexion (Shinn et al., [2023a](https://arxiv.org/html/2606.13349#bib.bib31)) and Self-Refine (Madaan et al., [2023](https://arxiv.org/html/2606.13349#bib.bib41)) add iterative self-correction loops. Other work augments agents with scratchpads (Nye et al., [2021](https://arxiv.org/html/2606.13349#bib.bib30)) or persistent memory to retain information across long horizons (Shinn et al., [2023b](https://arxiv.org/html/2606.13349#bib.bib32); Hu et al., [2025](https://arxiv.org/html/2606.13349#bib.bib40); Yan et al., [2025](https://arxiv.org/html/2606.13349#bib.bib37)). While persistent memory helps retain information, these methods typically accumulate unstructured reasoning traces, making it difficult to selectively revise specific earlier findings or trace critiques back to their supporting evidence. In contrast, ProReviewer maintains a structured review log with typed entries as part of a trainable MDP state, enabling selective revision and evidence tracing without requiring the full reasoning trajectory in context.

## 3 Method

Figure 2: The interaction loop of ProReviewer. At time step $t$, the agent $\pi_{\theta}$ observes state $s_{t}$ (paper index, review log, and context) and samples an action $a_{t}$, consisting of an environment action $a_{t}^{\mathrm{env}}$ and a log action $a_{t}^{\mathrm{log}}$. The policy uses $a_{t}^{\mathrm{env}}$ to fetch content from the paper, while $a_{t}^{\mathrm{log}}$ updates the review log to maintain an evolving understanding and evaluation of the paper. A multi-component reward produces $r_{t}$, and the system transitions to $s_{t+1}$ until termination.

In this section, we present ProReviewer for proactive reviewing of scientific papers (Figure [2](https://arxiv.org/html/2606.13349#S3.F2)).
We first define the MDP formulation (§[3.1](https://arxiv.org/html/2606.13349#S3.SS1)), describe the design of the review log (§[3.2](https://arxiv.org/html/2606.13349#S3.SS2)), then present the multi-dimensional reward function (§[3.3](https://arxiv.org/html/2606.13349#S3.SS3)), and finally detail the training procedure (§[3.4](https://arxiv.org/html/2606.13349#S3.SS4)). A concrete case study illustrating the full review process is provided in Appendix [J](https://arxiv.org/html/2606.13349#A10).

### 3.1 Review Generation as a Markov Decision Process

Rather than following a predetermined pipeline, our method enables the agent to decide at each step which section to read, what evidence to extract, and when to revisit earlier content through a learnable policy. We formalize this as $\mathcal{M}=(\mathcal{S},\mathcal{A},\mathcal{T},\mathcal{E},\mathcal{R})$.

##### State $\mathcal{S}$.

The state must capture both what the agent currently observes and what it has learned so far, enabling informed decisions about where to look next. Each state $s_{t}=(\mathcal{C}_{t},\mathcal{L}_{t},\mathcal{P})$ represents the agent’s understanding at step $t$: (1) the *current context* $\mathcal{C}_{t}$, containing the most recent action and observation (e.g., a section); (2) the *review log* $\mathcal{L}_{t}$ that records the agent’s accumulated evidence entries (§[3.2](https://arxiv.org/html/2606.13349#S3.SS2)); and (3) the *paper index* $\mathcal{P}$, comprising the title and table of contents, which guides navigation through the paper.

##### Action $\mathcal{A}$.

The action space reflects two complementary activities: acquiring information from the paper and maintaining the review log.
It divides into two categories. *Environment actions* acquire information: `read_section` retrieves the full text of a section, `look_up` searches the paper for specific keywords, and `finish` terminates the episode. *Log actions* maintain the review log $\mathcal{L}$ (§[3.2](https://arxiv.org/html/2606.13349#S3.SS2)): `log` records new evidence entries (claims, questions, or notes), `update` revises the status of existing entries as new information emerges, and `outline` constructs the final review by adding points that cite accumulated evidence. In this work, we scope the current action space to the manuscript itself, excluding external retrieval, to evaluate our core design in isolation. Notably, our proposed MDP formulation is modular: actions such as literature search for novelty assessment can be added without changing the core architecture. Full action schemas are provided in Appendix [A](https://arxiv.org/html/2606.13349#A1).

##### Transition $\mathcal{T}$ and Environment $\mathcal{E}$.

The transition $\mathcal{T}(s_{t},a_{t})\to s_{t+1}$ is deterministic. Given the agent’s action $a_{t}$ at step $t$: (1) the environment action $a_{t}^{\mathrm{env}}$ is executed, producing observation $\mathcal{C}_{t+1}$ (e.g., section content, keyword match results) from the paper $\mathcal{E}$; (2) log operations $a_{t}^{\mathrm{log}}$ are validated and executed, updating $\mathcal{L}_{t}\to\mathcal{L}_{t+1}$.

##### Reward $\mathcal{R}$.

We define a multi-dimensional reward to cover both action validity at each step and the overall quality of the final review, described in §[3.3](https://arxiv.org/html/2606.13349#S3.SS3).

##### Trajectory.

The complete review process forms a trajectory $\tau=(s_{0},a_{0},r_{0},s_{1},\ldots,s_{T})$ induced by the policy $\pi_{\theta}$.
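
The state, action, and transition definitions above can be made concrete with a small sketch. The Python below is an illustration only, not the authors' implementation: the names `Entry`, `State`, and `transition`, the tuple encoding of actions, the `pending` default status, and the ID scheme are all assumptions made for this example (the actual action schemas are in Appendix A). The `finish` and `outline` actions are left out for brevity.

```python
from dataclasses import dataclass, replace

# Hypothetical ID prefixes for the three entry types (claims, questions, notes).
PREFIX = {"claim": "C", "question": "Q", "note": "N"}


@dataclass(frozen=True)
class Entry:
    kind: str                # "claim", "question", or "note"
    text: str
    status: str = "pending"  # assumed default status, not taken from the paper


@dataclass(frozen=True)
class State:
    context: str  # C_t: most recent observation
    log: dict     # L_t: entry ID -> Entry
    index: tuple  # P: section titles (stand-in for title + table of contents)


def transition(state, env_action, log_actions, paper):
    """Deterministic T(s_t, a_t) -> s_{t+1} for a paper given as {section: text}."""
    # (1) Execute the environment action to obtain the next context C_{t+1}.
    name, arg = env_action
    if name == "read_section":
        context = paper[arg]
    elif name == "look_up":
        hits = [sec for sec, text in paper.items() if arg.lower() in text.lower()]
        context = "\n".join(hits)  # toy result: names of the matching sections
    else:
        raise ValueError(f"unsupported environment action: {name}")

    # (2) Apply the log operations to a copy of the log, giving L_{t+1}.
    log = dict(state.log)
    for op, *args in log_actions:
        if op == "log":
            kind, text = args
            count = sum(1 for entry in log.values() if entry.kind == kind)
            log[f"{PREFIX[kind]}{count + 1}"] = Entry(kind, text)
        elif op == "update":
            entry_id, status = args
            log[entry_id] = replace(log[entry_id], status=status)
        else:
            raise ValueError(f"unsupported log action: {op}")
    return State(context, log, state.index)


# Toy two-section paper mirroring the Figure 1 scenario.
paper = {
    "Introduction": "We claim robustness across domains.",
    "Experiments": "Accuracy drops sharply on out-of-domain data.",
}
s0 = State(context="", log={}, index=tuple(paper))
s1 = transition(s0, ("read_section", "Introduction"),
                [("log", "claim", "robust across domains")], paper)
s2 = transition(s1, ("look_up", "accuracy"), [("update", "C1", "invalid")], paper)
print(s2.context, s2.log["C1"].status)
```

Keeping the states immutable and copying the log on each step is a choice made for this sketch so that earlier states of a trajectory remain inspectable; it is not something the paper specifies.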
At $t=0$, the agent is initialized with the paper index $\mathcal{P}$, an empty review log $\mathcal{L}_{0}=\emptyset$, and no prior context $\varnothing$:

$$s_{0}=(\varnothing,\;\emptyset,\;\mathcal{P}) \tag{1}$$

At each subsequent step $t$, the agent samples an action $a_{t}\sim\pi_{\theta}(\cdot\mid s_{t})$, which updates its context and augments the review log to produce the next state $s_{t+1}$ and a per-step reward $r_{t}$:

$$s_{t+1},\;r_{t}=\mathcal{T}(s_{t},a_{t}),\quad\text{where}\quad s_{t+1}=(\mathcal{C}_{t+1},\;\mathcal{L}_{t+1},\;\mathcal{P}). \tag{2}$$

The episode terminates when the agent issues `finish` or reaches a maximum step limit $T_{\max}$. At termination, the outline entries in $\mathcal{L}_{T}$ are rendered into the final review.

### 3.2 State Design: Review Log

To enable proactive investigation, the agent needs a mechanism to accumulate evidence entries and use them to guide subsequent inspection. We introduce a *review log* $\mathcal{L}$, a structured workspace that (1) records these evidence entries with unique identifiers, allowing the agent to decide what to examine next based on what it has collected so far, and (2) requires each point in the final review to cite corresponding evidence IDs, creating a verifiable chain from critiques back to specific paper content. The log $\mathcal{L}$ maintains three types of evidence entries:

- Claims $\{C_{1},C_{2},\ldots\}$: assertions from the paper, annotated with source section and a verification status (e.g., `supported`, `weak`, `invalid`).
- Questions $\{Q_{1},Q_{2},\ldots\}$: questions raised during reading, each with a resolution status (e.g., `open`, `resolved`).
- Notes $\{N_{1},N_{2},\ldots\}$: free-form intermediate findings and thoughts.

The agent builds $\mathcal{L}$ incrementally via the `log` action and refines earlier entries via `update` as new evidence emerges (e.g., marking a claim as `supported` after finding corroboration in a later section). To produce the final review, the agent calls `outline` to write review points, each tagged with the IDs of the evidence entries that support it. For example, a weakness such as "Limited baseline comparison [C1, Q2, N5]" links the critique to the supporting evidence: claim C1, question Q2, and note N5. Any review point that lacks evidence tags or cites non-existent IDs is rejected to prevent hallucinations.

### 3.3 Multi-dimensional Reward

Training a review agent requires optimizing multiple complementary capabilities: issuing syntactically valid actions, producing structurally complete reviews, aligning quantitative judgments with human ratings, and demonstrating substantive engagement with technical content. We decompose the reward into four components at two granularities, step-level and trajectory-level, each targeting a distinct aspect of review quality.

#### 3.3.1 Step-level Reward

##### Syntactic Validity.
To teach the agent correct action invocation, the syntactic reward $r^{\text{syn}}_{t}$ provides immediate feedback on action validity at each step:

$$r^{\text{syn}}_{t}=-\mathds{1}\bigl[\lambda_{\text{form}}\lor\lambda_{\text{exec}}\lor\lambda_{\text{ground}}\bigr]\in\{-1,0\} \tag{3}$$

where *formatting errors* ($\lambda_{\text{form}}$) indicate schema violations (e.g., malformed JSON), *execution errors* ($\lambda_{\text{exec}}$) indicate invalid action names or arguments (e.g., querying a non-existent section), and *grounding errors* ($\lambda_{\text{ground}}$) penalize review points that cite evidence not present in the agent’s log.

#### 3.3.2 Trajectory-level Rewards

##### Review Completeness.

We define format compliance as $r^{\text{fmt}}=\frac{1}{4}\sum_{i=1}^{4}\mathds{1}[\text{check}_{i}\text{ satisfied}]$. The four checks verify: (1) a summary is present, (2) at least one strength, (3) at least one weakness, and (4) an overall score.

##### Review Content Quality.

Beyond structural completeness, a high-quality review must demonstrate substantive engagement with the paper’s technical content (Zhu et al., [2025](https://arxiv.org/html/2606.13349#bib.bib13); Garg et al., [2025](https://arxiv.org/html/2606.13349#bib.bib17); Goyal et al., [2026](https://arxiv.org/html/2606.13349#bib.bib19)). We measure this through two complementary dimensions: (1) *Technical depth* ($r^{\text{depth}}$) evaluates whether the review engages with methodological details and experimental design beyond surface-level observations. (2) *Grounding* ($r^{\text{grd}}$) measures whether critiques are grounded in the paper’s concrete content rather than hallucinated.
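
As a rough illustration of the two rule-based signals defined so far, the sketch below computes the syntactic reward of Eq. (3) and the format-compliance score $r^{\text{fmt}}$. The JSON action layout (`name`, `section`, `evidence`), the review dictionary keys, and both function names are hypothetical and chosen only for this example; the authors' schemas are in Appendix A. It also simplifies by checking one action per step, whereas a step in the paper combines an environment action with log actions, and it assumes that a review point with no evidence tags counts as a grounding error.

```python
import json

# Action names taken from Section 3.1; everything else here is an assumption.
KNOWN_ACTIONS = {"read_section", "look_up", "finish", "log", "update", "outline"}


def syntactic_reward(raw_action, section_names, log_ids):
    """Return -1 if the action has a formatting, execution, or grounding error, else 0."""
    # Formatting error: the action is not a well-formed JSON object.
    try:
        action = json.loads(raw_action)
    except json.JSONDecodeError:
        return -1
    if not isinstance(action, dict):
        return -1

    # Execution error: unknown action name or a non-existent section.
    name = action.get("name")
    exec_error = name not in KNOWN_ACTIONS or (
        name == "read_section" and action.get("section") not in section_names
    )

    # Grounding error: a review point with no evidence or with IDs missing from the log.
    cited = action.get("evidence", [])
    ground_error = name == "outline" and (
        not cited or any(entry_id not in log_ids for entry_id in cited)
    )
    return -1 if (exec_error or ground_error) else 0


def format_reward(review):
    """Fraction of the four completeness checks that the final review satisfies."""
    checks = [
        bool(review.get("summary")),            # (1) a summary is present
        len(review.get("strengths", [])) >= 1,  # (2) at least one strength
        len(review.get("weaknesses", [])) >= 1, # (3) at least one weakness
        review.get("score") is not None,        # (4) an overall score
    ]
    return sum(checks) / 4


sections, log_ids = {"Method", "Experiments"}, {"C1", "Q2"}
print(syntactic_reward('{"name": "outline", "evidence": ["C1", "N5"]}', sections, log_ids))
print(format_reward({"summary": "s", "strengths": ["a"], "weaknesses": [], "score": 6}))
```

In this toy run the first call is penalized because `N5` is not in the log, and the second review passes three of the four checks.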
Both technical depth and grounding are scored via rubric-based LLM-as-a-judge evaluation and combined as $r^{\text{qual}}=\alpha\cdot r^{\text{depth}}+(1-\alpha)\cdot r^{\text{grd}}$. In our experiments, we weight the two equally by setting $\alpha=0.5$.

##### Score Alignment.

Beyond textual feedback, peer reviews typically include a quantitative assessment. We encourage the agent to align its score $\hat{s}$ with the human reviewer average $\bar{s}$:

$$r^{\text{scr}}=\max\bigl(0,\,1-|\hat{s}-\bar{s}|/\kappa\bigr) \tag{4}$$

where $\kappa$ is the rating scale range (e.g., $\kappa=9$ for a 1–10 scale).

##### Total Reward.

At each step $t$, the total reward is $r_{t}=w_{\text{syn}}\cdot r^{\text{syn}}_{t}+\sum_{k\in K}w_{k}\cdot r^{k}$, where $K$ indexes the trajectory-level rewards (completeness, content quality, and score alignment), which are computed at episode termination and broadcast uniformly to all steps. The specific reward weights are provided in Appendix [C](https://arxiv.org/html/2606.13349#A3).

### 3.4 Training

We adopt a two-stage training pipeline to obtain an agent that produces valid actions and explores efficiently within a limited interaction budget. First, we perform supervised fine-tuning (SFT) on interaction trajectories distilled from a strong teacher model (Qwen3.5-397B-A17B; see Appendix [B](https://arxiv.org/html/2606.13349#A2) for dataset construction). Second, starting from the SFT checkpoint, we apply GRPO reinforcement learning with a two-phase curriculum. Phase 1 trains with only deterministic, rule-based rewards (syntactic validity, review completeness, and score alignment). Phase 2 additionally includes LLM-judge-based content quality rewards while retaining all Phase 1 rewards.

## 4 Experiments

### 4.1 Experiment Setup

##### Dataset.
To facilitate reproducible experiments, following prior work (Zhu et al., [2025](https://arxiv.org/html/2606.13349#bib.bib13); Weng et al., [2025](https://arxiv.org/html/2606.13349#bib.bib14); Goyal et al., [2026](https://arxiv.org/html/2606.13349#bib.bib19)), we use peer-review data from the International Conference on Learning Representations (ICLR), which is publicly available and covers a wide range of AI research topics in this fast-paced field. We collected submissions across two ICLR conference cycles (2025–2026), carefully matching each paper’s initial submission with its corresponding reviews and initial scores to ensure version alignment. After filtering for review completeness, the final corpus comprises 5,011 papers: 4,011 ICLR 2025 papers for training and validation, and 1,000 ICLR 2026 papers for evaluation. The temporal separation ensures that test papers were published after base model training, mitigating potential data contamination (see Appendix [B](https://arxiv.org/html/2606.13349#A2) for full details).

##### Baselines.

We compare against three categories of baselines that span different paradigms in automated review generation. (1) *Prompt-based methods.* We include two representative methods: AgentReview (Jin et al., [2024](https://arxiv.org/html/2606.13349#bib.bib20)) and AI-Scientist-v2 (Yamada et al., [2025](https://arxiv.org/html/2606.13349#bib.bib21)). We evaluate each method across multiple backbones ranging from 8B to 397B parameters, as well as an advanced commercial model, i.e., Gemini-3.1-flash-lite (see Table [1](https://arxiv.org/html/2606.13349#S4.T1)). (2) *Supervised fine-tuning.* We include CycleReviewer (Weng et al., [2025](https://arxiv.org/html/2606.13349#bib.bib14)) and DeepReview (Zhu et al., [2025](https://arxiv.org/html/2606.13349#bib.bib13)). Both methods fine-tune LLMs on human review data and represent the state of the art in SFT-based review generation.
(3) *Reinforcement learning.* We implement a *Vanilla RL* baseline trained with GRPO on the same reward signals and training stages as ProReviewer but with single-turn generation.

##### Implementation Details.

We use both Qwen3-8B (Yang et al., [2025](https://arxiv.org/html/2606.13349#bib.bib22)) and Llama3.1-8B (Grattafiori et al., [2024](https://arxiv.org/html/2606.13349#bib.bib23)) as base models for fine-tuned methods to assess generalization across model families. All training is conducted on 8×A100 (80 GiB) GPUs. For the LLM-judge-based reward during RL training, we use GPT-OSS-120B (OpenAI, [2025](https://arxiv.org/html/2606.13349#bib.bib39)) as the judge model. To eliminate confounding factors of different base models and datasets, we implement the above fine-tuned methods on the same base models and training data as ProReviewer (cf. Appendix [C](https://arxiv.org/html/2606.13349#A3)).

##### Complexity Analysis.

While ProReviewer issues multiple LLM calls per paper, each call operates on a compact state rather than the full paper, keeping total inference cost comparable to multi-stage pipelines. We provide a detailed theoretical and empirical complexity analysis in Appendix [D](https://arxiv.org/html/2606.13349#A4).

### 4.2 Evaluation Protocol

A useful review should help authors understand what to improve (actionability), where the issue arises (grounding), why the critique is justified (verifiability), and how deeply it engages with the technical substance (depth) (Sadallah et al., [2025](https://arxiv.org/html/2606.13349#bib.bib15); Garg et al., [2025](https://arxiv.org/html/2606.13349#bib.bib17); Zhu et al., [2025](https://arxiv.org/html/2606.13349#bib.bib13)). Following the review utility framework introduced by Sadallah et al. ([2025](https://arxiv.org/html/2606.13349#bib.bib15)), we score each dimension on a 1–5 rubric and normalize to a [0,1] scale.
Beyond review content quality, we report *Score Alignment* to measure the calibration of the numerical overall rating, computed as $\max(0,\,1-|\hat{s}-\bar{s}|/\kappa)$ based on mean absolute error (MAE), where $\hat{s}$ is the predicted overall rating, $\bar{s}$ the average human overall rating, and $\kappa$ the rating scale range.

##### Automatic Evaluation.

All content quality dimensions are evaluated via LLM-as-a-judge. A single judge risks systematic bias toward particular writing styles or model families (Zheng et al., [2023](https://arxiv.org/html/2606.13349#bib.bib42)). To mitigate this risk, we aggregate scores from three diverse judges that are not used as base models in any baseline: two general-purpose frontier LLMs, i.e., GPT-5.4 nano ([https://openai.com/index/introducing-gpt-5-4-mini-and-nano/](https://openai.com/index/introducing-gpt-5-4-mini-and-nano/)) and DeepSeek-V4 flash (DeepSeek-AI, [2026](https://arxiv.org/html/2606.13349#bib.bib38)), and one domain-specific judge, RevUtil (Sadallah et al., [2025](https://arxiv.org/html/2606.13349#bib.bib15)), fine-tuned on human-annotated review quality data. This diversity reduces the likelihood that results are driven by the idiosyncratic preferences of any single judge. Per-judge results in Appendix [I](https://arxiv.org/html/2606.13349#A9) further show that our main findings remain consistent across all three evaluators. For each paper, we generate four independent reviews per method and report both average and best-of-4 scores to capture consistency and peak performance. Evaluation rubrics are provided in Appendix [E](https://arxiv.org/html/2606.13349#A5).

##### Human Evaluation.

To validate automatic evaluation findings, we recruit five human evaluators who have served as reviewers for top-tier AI conferences. They evaluate reviews for 50 randomly sampled papers from our test dataset along all quality dimensions with pairwise comparisons.
Results are presented in Section [4.3.2](https://arxiv.org/html/2606.13349#S4.SS3.SSS2), with detailed evaluation instructions in Appendix [F](https://arxiv.org/html/2606.13349#A6).

### 4.3 Main Results

Table 1: Average evaluation results across three judges (DeepSeek-V4 flash, GPT-5.4 nano, RevUtil). All scores are mean±std (on a [0,1] scale). Green highlights the best and blue the second-best result in each column. Detailed per-judge results can be found in Appendix [I](https://arxiv.org/html/2606.13349#A9).

#### 4.3.1 Automatic Evaluation Results

Table [1](https://arxiv.org/html/2606.13349#S4.T1) presents results across four review quality dimensions and score alignment, evaluated by three independent LLM judges. ProReviewer (Qwen3-8B) achieves the highest overall score in both avg-of-4 (0.57) and best-of-4 (0.65), outperforming all baselines including those backed by much larger models such as Gemini-3.1-flash-lite and Qwen3.5-397B-A17B. In particular, ProReviewer excels in *Grounding* (0.64) and *Technical Depth* (0.48), surpassing AI-Scientist-v2 (Qwen3.5-397B-A17B) by 0.20 and 0.02 absolute points, respectively, which highlights the benefit of proactive investigation with supporting evidence for in-depth reviewing. Moreover, ProReviewer leads on *Actionability* (0.46) and *Verifiability* (0.40), which are not directly optimized during training, showing generalization beyond the reward signal. These gains generalize across model families: ProReviewer with Llama3.1-8B achieves a 0.52 average, also outperforming all baselines.

Comparing across training paradigms, RL-based methods substantially outperform SFT-trained baselines: CycleReviewer and DeepReview reach only 0.36 and 0.35 average on the same 8B backbone, trailing ProReviewer by over 0.2 points.
Scaling prompt-based methods from 8B to 397B narrows the gap, yet AI-Scientist-v2 with Qwen3.5-397B-A17B (0.46) still trails ProReviewer (0.57) by 0.11 points, suggesting that model scale alone is insufficient for the multi-faceted demands of peer review. These content quality gains do not come at the expense of calibration: ProReviewer maintains competitive score alignment, showing that the agent learns to justify its judgments with evidence while keeping its scores well-calibrated.

#### 4.3.2 Human Evaluation Results

Table 2: Human pairwise evaluation: ProReviewer (Qwen3-8B) vs. baselines judged by five human reviewers. 397B = Qwen3.5-397B-A17B; 8B = Qwen3-8B. Green indicates ProReviewer win rate.

We further conduct a human evaluation with five reviewers experienced in reviewing for AI conferences, who perform blind pairwise comparisons between ProReviewer and different baselines (protocol in Appendix [F](https://arxiv.org/html/2606.13349#A6)). As shown in Table [2](https://arxiv.org/html/2606.13349#S4.T2), ProReviewer is preferred in every matchup on each dimension, with win rates of 51%–95%. Consistent with the automatic results, the largest margins again appear on *Grounding* and *Technical Depth* (69.2%–94.9%), where proactive investigation through the review log is most visible to expert judges. The lead on *Actionability* and *Verifiability*, which are not directly optimized during training, is also confirmed by human evaluators. To aggregate these pairwise comparisons into a single consistent ranking, we fit a Bradley–Terry model over the full matchup data. The resulting ranking places ProReviewer first across all dimensions, with non-overlapping confidence intervals.

## 5 Discussion

### 5.1 Ablation Study

Table 3: Ablation study of ProReviewer (Qwen3-8B), averaged across three judges.
Act\.: Actionability, Grd\.: Grounding, TD: Technical Depth, Ver\.: Verifiability\. ↓ denotes the absolute drop from the full model\.

We conduct ablation studies on two critical design choices of ProReviewer: the structured review log and the MDP formulation\.

##### Review Log\.

We ablate the structured review log by replacing it with free\-form chain\-of\-thought while keeping the multi\-step agent loop unchanged\. As shown in Table [3](https://arxiv.org/html/2606.13349#S5.T3), performance drops across all quality dimensions: grounding by 38%, technical depth by 24%, actionability by 15%, and verifiability by 6%\. This suggests that structured tracking of claims, questions, and notes is essential for ProReviewer to produce well\-grounded and technically substantive reviews\.

##### MDP Formulation\.

Removing the MDP formulation and generating the review in a single pass further degrades performance: actionability and grounding both drop by 13%, technical depth by 12%, and verifiability by 2%\. Together, these results show that ProReviewer benefits from both the structured review log for tracking evidence and the sequential decision\-making ability for iterative manuscript analysis and investigation\.

### 5\.2 Counterfactual Error Detection

Beyond overall review quality, we evaluate the ability of different methods to perform in\-depth reviewing through a challenging task: detecting subtle logic errors deliberately embedded in manuscripts\. This task requires models to cross\-check information across sections and revisit suspicious content to identify logical inconsistencies, rather than relying on generic assessment heuristics\. We use the counterfactual dataset introduced by Dycke and Gurevych \([2026](https://arxiv.org/html/2606.13349#bib.bib9)\), which contains 138 papers from multiple AI conferences\.
Each paper is perturbed with one of three error types: \(1\) conclusion perturbation, which alters a conclusion to misalign with its supporting results; \(2\) finding perturbation, which exaggerates a finding beyond what the evidence supports; and \(3\) result perturbation, which modifies a result so that it contradicts the conclusion it originally supported\. A detection is considered successful if any weakness identified by a method correctly matches the injected error, as judged by GPT\-5\.4\-nano \(see Appendix [G](https://arxiv.org/html/2606.13349#A7) for dataset details and evaluation prompts\)\.

Table 4: Counterfactual error detection\. Green: best; Blue: second\-best\.

As shown in Table [4](https://arxiv.org/html/2606.13349#S5.T4), ProReviewer achieves the highest overall detection rate \(27%\), outperforming AI\-Scientist\-v2 by 6 percentage points\. Moreover, ProReviewer maintains relatively balanced performance across all three perturbation types \(24–29%\), whereas AI\-Scientist\-v2 drops sharply on finding perturbations \(9%\) despite performing well on conclusion perturbations \(27%\)\. CycleReviewer and DeepReview fall below 5% overall, indicating that SFT\-based single\-pass generation cannot reliably detect cross\-section inconsistencies\. These results suggest that ProReviewer’s proactive and traceable review process supports more effective cross\-section reasoning and targeted investigation of subtle inconsistencies\.

### 5\.3 Robustness to Paper Length

Figure 3: Average rubric score across five paper\-length bins\. ProReviewer \(Qwen3\-8B\) maintains a stable lead across all lengths compared to baselines\. 8B = Qwen3\-8B; 397B = Qwen3\.5\-397B\-A17B\.

We hypothesize that ProReviewer’s structured review log and iterative investigation make it robust to increasing paper length\.
To test this, we partition the test set into five bins by token count \(see Appendix [H](https://arxiv.org/html/2606.13349#A8) for the distribution\) and report the average rubric score \(mean of four quality dimensions\) per bin\. As shown in Figure [3](https://arxiv.org/html/2606.13349#S5.F3), ProReviewer \(Qwen3\-8B\) is essentially flat across bins \(0\.49 → 0\.48 from the shortest to the longest papers\), with no monotonic trend in between\. Every baseline, by contrast, trends downward as papers grow longer: Vanilla RL declines from 0\.44 to 0\.41, AI\-Scientist\-v2 from 0\.37 to 0\.35, and AgentReview from 0\.37 to 0\.34 \(relative declines of 6\.8%, 5\.4%, and 8\.1%, respectively\)\. ProReviewer maintains its lead across every length bin, outperforming AI\-Scientist\-v2 by \+0\.13 and Vanilla RL by \+0\.07 on the longest papers\. This demonstrates that ProReviewer is more robust to paper length than these baselines\.

## 6 Conclusion

We introduced ProReviewer, a review agent that shifts automated peer review from passive generation to proactive investigation by formulating the review process as an MDP guided by a structured review log\. The review log tracks claims, questions, and notes throughout the investigation, enabling the agent to verify earlier claims against later evidence, resolve open questions, and ground each critique in accumulated findings\. Experiments show that ProReviewer with an 8B backbone outperforms both prompt\-based systems with much larger frontier LLMs and fine\-tuned baselines across automatic and human evaluation, while further analyses confirm its ability to detect cross\-section inconsistencies and maintain robust performance on longer papers\. These results suggest that proactive investigation supported by evidence tracking is a promising direction for LLM\-assisted peer review and potentially for tasks requiring multi\-step analytical reasoning over complex documents\.
## Limitations

While ProReviewer achieves strong performance, there are several limitations that motivate future work\. First, the current implementation is text\-only: the agent cannot directly inspect figures, which could include complementary evidence that is not accurately described in the text by the authors\. Extending the agent with multimodal perception would allow it to verify visual claims \(e\.g\., whether a reported trend matches a plotted curve\)\. Second, ProReviewer is trained and evaluated on AI conference papers \(ICLR\), as other fields currently lack sufficient publicly available, clean manuscript–review pairs\. Adapting the approach to domains such as biomedicine or the social sciences is a promising direction once review data becomes available\. Third, the current implementation focuses on intra\-manuscript reasoning and does not perform external novelty search\. Novelty assessment is a different problem: an open\-corpus retrieval task whose reliability depends on index coverage and corpus freshness, rather than reasoning over evidence within the paper, which is our focus in this work\. Looking ahead, the MDP formulation naturally accommodates all three extensions \(multimodal perception, cross\-domain adaptation, and external retrieval\) by adding corresponding actions without modifying the core architecture, making them promising directions for future work\.

## Ethical Considerations

The development of ProReviewer carries several ethical considerations given its potential impact on the peer review process\. Its primary intended use is to help authors of scientific papers identify potential issues and improve their work before submission, and to provide a supplementary reference that helps reviewers identify possible issues; it is not intended to deliver a final judgment\. However, there are risks of misuse and unintended consequences that we discuss below\. An automated reviewing system could be misused to mass\-produce superficial reviews or to game review assignment systems\.
To mitigate this risk, we advocate for transparent disclosure whenever AI\-generated reviews are used and recommend that venues establish clear policies governing their use\. Overreliance on automated reviews could also lead to reduced human oversight and potential erosion of review quality\. To address this, we emphasize that ProReviewer is designed to complement, not replace, human judgment, and we encourage users to critically evaluate its outputs rather than accepting them uncritically\. Additionally, our training and evaluation data consist of publicly available ICLR submissions and reviews from OpenReview\. We use this data solely for research purposes and in accordance with its public availability\. No private or confidential review data is used\.

## Acknowledgments

This research work has been funded by the German Federal Ministry of Research, Technology, and Space and the Hessian Ministry of Higher Education, Research, Science, and the Arts within their joint support of the National Research Center for Applied Cybersecurity ATHENE\. This work has been co\-funded by the European Union \(ERC, InterText, 101054961\)\. Views and opinions expressed are, however, those of the author\(s\) only and do not necessarily reflect those of the European Union or the European Research Council\. Neither the European Union nor the granting authority can be held responsible for them\. We gratefully acknowledge support from the hessian\.AI Service Center \(funded by the Federal Ministry of Research, Technology and Space, BMFTR, grant no\. 16IS22091\) and the hessian\.AI Innovation Lab \(funded by the Hessian Ministry for Digital Strategy and Innovation, grant no\. S\-DIW04/0013/003\)\. We express our sincere gratitude to Md Imbesat Hassan Rizvi, Serwar Basch, Sheng Lu, Qian Ruan, Fengyu Cai, and Frank Niu for their constructive feedback\.

## References

- J\. Biswas, S\. Schoepp, G\. Vasan, A\. Opipari, A\. Zhang, Z\. Hu, S\. Joseph, M\. Lease, J\. J\. Li, P\. Stone, K\. L\. 
Wagstaff, M\. E\. Taylor, and O\. C\. Jenkins \(2026\)AI\-assisted peer review at scale: the AAAI\-26 AI review pilot\.arXiv preprint arXiv:2604\.13940\.External Links:[Link](https://arxiv.org/abs/2604.13940)Cited by:[§1](https://arxiv.org/html/2606.13349#S1.p1.1)\. - TreeReview: A dynamic tree of questions framework for deep and efficient llm\-based scientific peer review\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, EMNLP 2025, Suzhou, China, November 4\-9, 2025,C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\),pp\. 15651–15682\.External Links:[Link](https://doi.org/10.18653/v1/2025.emnlp-main.790),[Document](https://dx.doi.org/10.18653/V1/2025.EMNLP-MAIN.790)Cited by:[§2](https://arxiv.org/html/2606.13349#S2.SS0.SSS0.Px1.p1.1)\. - DeepSeek\-AI \(2026\)DeepSeek\-v4: towards highly efficient million\-token context intelligence\.Note:[https://huggingface\.co/deepseek\-ai/DeepSeek\-V4\-Pro/blob/main/DeepSeek\_V4\.pdf](https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf)Technical ReportCited by:[§4\.2](https://arxiv.org/html/2606.13349#S4.SS2.SSS0.Px1.p1.1)\. - J\. Du, Y\. Wang, W\. Zhao, Z\. Deng, S\. Liu, R\. Lou, H\. P\. Zou, P\. N\. Venkit, N\. Zhang, M\. Srinath, H\. Zhang, V\. Gupta, Y\. Li, T\. Li, F\. Wang, Q\. Liu, T\. Liu, P\. Gao, C\. Xia, C\. Xing, C\. Jiayang, Z\. Wang, Y\. Su, R\. S\. Shah, R\. Guo, J\. Gu, H\. Li, K\. Wei, Z\. Wang, L\. Cheng, S\. Ranathunga, M\. Fang, J\. Fu, F\. Liu, R\. Huang, E\. Blanco, Y\. Cao, R\. Zhang, P\. S\. Yu, and W\. Yin \(2024\)LLMs assist NLP researchers: critique paper \(meta\-\)reviewing\.InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, EMNLP 2024, Miami, FL, USA, November 12\-16, 2024,Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\),pp\. 
5081–5099\.External Links:[Link](https://doi.org/10.18653/v1/2024.emnlp-main.292),[Document](https://dx.doi.org/10.18653/V1/2024.EMNLP-MAIN.292)Cited by:[§1](https://arxiv.org/html/2606.13349#S1.p3.1),[§2](https://arxiv.org/html/2606.13349#S2.SS0.SSS0.Px1.p1.1)\. - N\. Dycke and I\. Gurevych \(2026\)Automatic reviewers fail to detect faulty reasoning in research papers: a new counterfactual evaluation framework\.Transactions of the Association for Computational Linguistics14,pp\. 465–488\.External Links:ISSN 2307\-387X,[Document](https://dx.doi.org/10.1162/TACl.a.642),[Link](https://doi.org/10.1162/TACl.a.642),https://direct\.mit\.edu/tacl/article\-pdf/doi/10\.1162/TACl\.a\.642/2597095/tacl\.a\.642\.pdfCited by:[Appendix G](https://arxiv.org/html/2606.13349#A7.SS0.SSS0.Px1.p1.1),[§1](https://arxiv.org/html/2606.13349#S1.p3.1),[§5\.2](https://arxiv.org/html/2606.13349#S5.SS2.p1.1)\. - Z\. Gao, K\. Brantley, and T\. Joachims \(2024\)Reviewer2: optimizing review generation through prompt generation\.CoRRabs/2402\.10886\.External Links:[Link](https://doi.org/10.48550/arXiv.2402.10886),[Document](https://dx.doi.org/10.48550/ARXIV.2402.10886),2402\.10886Cited by:[§1](https://arxiv.org/html/2606.13349#S1.p2.1),[§2](https://arxiv.org/html/2606.13349#S2.SS0.SSS0.Px1.p1.1)\. - M\. K\. Garg, T\. Prasad, T\. Singhal, C\. Kirtani, M\. Mandal, and D\. Kumar \(2025\)ReviewEval: an evaluation framework for ai\-generated reviews\.InFindings of the Association for Computational Linguistics: EMNLP 2025, Suzhou, China, November 4\-9, 2025,C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\),pp\. 20542–20564\.External Links:[Link](https://aclanthology.org/2025.findings-emnlp.1120/)Cited by:[§3\.3\.2](https://arxiv.org/html/2606.13349#S3.SS3.SSS2.Px2.p1.4),[§4\.2](https://arxiv.org/html/2606.13349#S4.SS2.p1.4)\. - P\. Goyal, M\. Parmar, Y\. Song, H\. Palangi, T\. Pfister, and J\. 
Yoon \(2026\)ScholarPeer: A context\-aware multi\-agent framework for automated peer review\.CoRRabs/2601\.22638\.External Links:[Link](https://doi.org/10.48550/arXiv.2601.22638),[Document](https://dx.doi.org/10.48550/ARXIV.2601.22638),2601\.22638Cited by:[Appendix E](https://arxiv.org/html/2606.13349#A5.p1.1),[§2](https://arxiv.org/html/2606.13349#S2.SS0.SSS0.Px1.p1.1),[§3\.3\.2](https://arxiv.org/html/2606.13349#S3.SS3.SSS2.Px2.p1.4),[§4\.1](https://arxiv.org/html/2606.13349#S4.SS1.SSS0.Px1.p1.1)\. - A\. Grattafiori, A\. Dubey, A\. Jauhri, A\. Pandey, A\. Kadian, A\. Al\-Dahle, A\. Letman, A\. Mathur, A\. Schelten, A\. Vaughan,et al\.\(2024\)The llama 3 herd of models\.CoRRabs/2407\.21783\.External Links:[Link](https://doi.org/10.48550/arXiv.2407.21783),[Document](https://dx.doi.org/10.48550/ARXIV.2407.21783),2407\.21783Cited by:[§4\.1](https://arxiv.org/html/2606.13349#S4.SS1.SSS0.Px3.p1.1)\. - D\. Guo, D\. Yang, H\. Zhang, J\. Song, P\. Wang, Q\. Zhu, R\. Xu, R\. Zhang, S\. Ma, X\. Bi, X\. Zhang, X\. Yu, Y\. Wu, Z\. F\. Wu, Z\. Gou, Z\. Shao, Z\. Li, Z\. Gao, A\. Liu, B\. Xue, B\. Wang, B\. Wu, B\. Feng, C\. Lu, C\. Zhao, C\. Deng, C\. Ruan, D\. Dai, D\. Chen, D\. Ji, E\. Li, F\. Lin, F\. Dai, F\. Luo, G\. Hao, G\. Chen, G\. Li, H\. Zhang, H\. Xu, H\. Ding, H\. Gao, H\. Qu, H\. Li, J\. Guo, J\. Li, J\. Chen, J\. Yuan, J\. Tu, J\. Qiu, J\. Li, J\. L\. Cai, J\. Ni, J\. Liang, J\. Chen, K\. Dong, K\. Hu, K\. You, K\. Gao, K\. Guan, K\. Huang, K\. Yu, L\. Wang, L\. Zhang, L\. Zhao, L\. Wang, L\. Zhang, L\. Xu, L\. Xia, M\. Zhang, M\. Zhang, M\. Tang, M\. Zhou, M\. Li, M\. Wang, M\. Li, N\. Tian, P\. Huang, P\. Zhang, Q\. Wang, Q\. Chen, Q\. Du, R\. Ge, R\. Zhang, R\. Pan, R\. Wang, R\. J\. Chen, R\. L\. Jin, R\. Chen, S\. Lu, S\. Zhou, S\. Chen, S\. Ye, S\. Wang, S\. Yu, S\. Zhou, S\. Pan, S\. S\. Li, S\. Zhou, S\. Wu, T\. Yun, T\. Pei, T\. Sun, T\. Wang, W\. Zeng, W\. Liu, W\. Liang, W\. Gao, W\. Yu, W\. Zhang, W\. L\. Xiao, W\. An, X\. Liu, X\. Wang, X\. 
Chen, X\. Nie, X\. Cheng, X\. Liu, X\. Xie, X\. Liu, X\. Yang, X\. Li, X\. Su, X\. Lin, X\. Q\. Li, X\. Jin, X\. Shen, X\. Chen, X\. Sun, X\. Wang, X\. Song, X\. Zhou, X\. Wang, X\. Shan, Y\. K\. Li, Y\. Q\. Wang, Y\. X\. Wei, Y\. Zhang, Y\. Xu, Y\. Li, Y\. Zhao, Y\. Sun, Y\. Wang, Y\. Yu, Y\. Zhang, Y\. Shi, Y\. Xiong, Y\. He, Y\. Piao, Y\. Wang, Y\. Tan, Y\. Ma, Y\. Liu, Y\. Guo, Y\. Ou, Y\. Wang, Y\. Gong, Y\. Zou, Y\. He, Y\. Xiong, Y\. Luo, Y\. You, Y\. Liu, Y\. Zhou, Y\. X\. Zhu, Y\. Huang, Y\. Li, Y\. Zheng, Y\. Zhu, Y\. Ma, Y\. Tang, Y\. Zha, Y\. Yan, Z\. Z\. Ren, Z\. Ren, Z\. Sha, Z\. Fu, Z\. Xu, Z\. Xie, Z\. Zhang, Z\. Hao, Z\. Ma, Z\. Yan, Z\. Wu, Z\. Gu, Z\. Zhu, Z\. Liu, Z\. Li, Z\. Xie, Z\. Song, Z\. Pan, Z\. Huang, Z\. Xu, Z\. Zhang, and Z\. Zhang \(2025\)DeepSeek\-r1 incentivizes reasoning in llms through reinforcement learning\.Nat\.645\(8081\),pp\. 633–638\.External Links:[Link](https://doi.org/10.1038/s41586-025-09422-z),[Document](https://dx.doi.org/10.1038/S41586-025-09422-Z)Cited by:[§1](https://arxiv.org/html/2606.13349#S1.p5.1)\. - Y\. Hu, S\. Liu, Y\. Yue, G\. Zhang, B\. Liu, F\. Zhu, J\. Lin, H\. Guo, S\. Dou, Z\. Xi, S\. Jin, J\. Tan, Y\. Yin, J\. Liu, Z\. Zhang, Z\. Sun, Y\. Zhu, H\. Sun, B\. Peng, Z\. Cheng, X\. Fan, J\. Guo, X\. Yu, Z\. Zhou, Z\. Hu, J\. Huo, J\. Wang, Y\. Niu, Y\. Wang, Z\. Yin, X\. Hu, Y\. Liao, Q\. Li, K\. Wang, W\. Zhou, Y\. Liu, D\. Cheng, Q\. Zhang, T\. Gui, S\. Pan, Y\. Zhang, P\. Torr, Z\. Dou, J\. Wen, X\. Huang, Y\. Jiang, and S\. Yan \(2025\)Memory in the age of AI agents\.CoRRabs/2512\.13564\.External Links:[Link](https://doi.org/10.48550/arXiv.2512.13564),[Document](https://dx.doi.org/10.48550/ARXIV.2512.13564),2512\.13564Cited by:[§2](https://arxiv.org/html/2606.13349#S2.SS0.SSS0.Px2.p1.1)\. - M\. Idahl and Z\. 
Ahmadi \(2025\)OpenReviewer: A specialized large language model for generating critical scientific paper reviews\.InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL 2025 \- System Demonstrations, Albuquerque, New Mexico, USA, April 29 \- May 4, 2025,N\. Dziri, S\. \(\. Ren, and S\. Diao \(Eds\.\),pp\. 550–562\.External Links:[Link](https://doi.org/10.18653/v1/2025.naacl-demo.44),[Document](https://dx.doi.org/10.18653/V1/2025.NAACL-DEMO.44)Cited by:[§1](https://arxiv.org/html/2606.13349#S1.p1.1)\. - C\. E\. Jimenez, J\. Yang, A\. Wettig, S\. Yao, K\. Pei, O\. Press, and K\. R\. Narasimhan \(2024\)SWE\-bench: can language models resolve real\-world github issues?\.InThe Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7\-11, 2024,External Links:[Link](https://openreview.net/forum?id=VTF8yNQM66)Cited by:[§2](https://arxiv.org/html/2606.13349#S2.SS0.SSS0.Px2.p1.1)\. - Y\. Jin, Q\. Zhao, Y\. Wang, H\. Chen, K\. Zhu, Y\. Xiao, and J\. Wang \(2024\)AgentReview: exploring peer review dynamics with LLM agents\.InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, EMNLP 2024, Miami, FL, USA, November 12\-16, 2024,Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\),pp\. 1208–1226\.External Links:[Link](https://doi.org/10.18653/v1/2024.emnlp-main.70),[Document](https://dx.doi.org/10.18653/V1/2024.EMNLP-MAIN.70)Cited by:[§1](https://arxiv.org/html/2606.13349#S1.p2.1),[§2](https://arxiv.org/html/2606.13349#S2.SS0.SSS0.Px1.p1.1),[§4\.1](https://arxiv.org/html/2606.13349#S4.SS1.SSS0.Px2.p1.1)\. - J\. Li, Y\. Li, X\. Hu, M\. Gao, and X\. 
Wan \(2025a\)Aspect\-guided multi\-level perturbation analysis of large language models in automated peer review\.CoRRabs/2502\.12510\.External Links:[Link](https://doi.org/10.48550/arXiv.2502.12510),[Document](https://dx.doi.org/10.48550/ARXIV.2502.12510),2502\.12510Cited by:[§1](https://arxiv.org/html/2606.13349#S1.p3.1)\. - R\. Li, H\. Zhang, E\. F\. Gehringer, T\. Xiao, J\. Ding, and H\. Chen \(2025b\)Unveiling the merits and defects of llms in automatic review generation for scientific papers\.InIEEE International Conference on Data Mining, ICDM 2025, Washington DC, USA, November 12\-15, 2025,pp\. 1370–1379\.External Links:[Link](https://doi.org/10.1109/ICDM65498.2025.00146),[Document](https://dx.doi.org/10.1109/ICDM65498.2025.00146)Cited by:[§1](https://arxiv.org/html/2606.13349#S1.p3.1)\. - W\. Liang, Y\. Zhang, H\. Cao, B\. Wang, D\. Ding, X\. Yang, K\. Vodrahalli, S\. He, D\. S\. Smith, Y\. Yin, D\. A\. McFarland, and J\. Zou \(2023\)Can large language models provide useful feedback on research papers? A large\-scale empirical analysis\.CoRRabs/2310\.01783\.External Links:[Link](https://doi.org/10.48550/arXiv.2310.01783),[Document](https://dx.doi.org/10.48550/ARXIV.2310.01783),2310\.01783Cited by:[§1](https://arxiv.org/html/2606.13349#S1.p1.1),[§1](https://arxiv.org/html/2606.13349#S1.p2.1),[§2](https://arxiv.org/html/2606.13349#S2.SS0.SSS0.Px1.p1.1)\. - R\. Liu and N\. B\. Shah \(2023\)ReviewerGPT? an exploratory study on using large language models for paper reviewing\.CoRRabs/2306\.00622\.External Links:[Link](https://doi.org/10.48550/arXiv.2306.00622),[Document](https://dx.doi.org/10.48550/ARXIV.2306.00622),2306\.00622Cited by:[§1](https://arxiv.org/html/2606.13349#S1.p2.1),[§2](https://arxiv.org/html/2606.13349#S2.SS0.SSS0.Px1.p1.1)\. - C\. Lu, C\. Lu, R\. T\. Lange, J\. N\. Foerster, J\. Clune, and D\. 
Ha \(2024\)The AI scientist: towards fully automated open\-ended scientific discovery\.CoRRabs/2408\.06292\.External Links:[Link](https://doi.org/10.48550/arXiv.2408.06292),[Document](https://dx.doi.org/10.48550/ARXIV.2408.06292),2408\.06292Cited by:[§2](https://arxiv.org/html/2606.13349#S2.SS0.SSS0.Px2.p1.1)\. - A\. Madaan, N\. Tandon, P\. Gupta, S\. Hallinan, L\. Gao, S\. Wiegreffe, U\. Alon, N\. Dziri, S\. Prabhumoye, Y\. Yang, S\. Gupta, B\. P\. Majumder, K\. Hermann, S\. Welleck, A\. Yazdanbakhsh, and P\. Clark \(2023\)Self\-refine: iterative refinement with self\-feedback\.InAdvances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 \- 16, 2023,A\. Oh, T\. Naumann, A\. Globerson, K\. Saenko, M\. Hardt, and S\. Levine \(Eds\.\),External Links:[Link](http://papers.nips.cc/paper%5C_files/paper/2023/hash/91edff07232fb1b55a505a9e9f6c0ff3-Abstract-Conference.html)Cited by:[§2](https://arxiv.org/html/2606.13349#S2.SS0.SSS0.Px2.p1.1)\. - R\. Nakano, J\. Hilton, S\. Balaji, J\. Wu, L\. Ouyang, C\. Kim, C\. Hesse, S\. Jain, V\. Kosaraju, W\. Saunders, X\. Jiang, K\. Cobbe, T\. Eloundou, G\. Krueger, K\. Button, M\. Knight, B\. Chess, and J\. Schulman \(2021\)WebGPT: browser\-assisted question\-answering with human feedback\.CoRRabs/2112\.09332\.External Links:[Link](https://arxiv.org/abs/2112.09332),2112\.09332Cited by:[§2](https://arxiv.org/html/2606.13349#S2.SS0.SSS0.Px2.p1.1)\. - M\. I\. Nye, A\. J\. Andreassen, G\. Gur\-Ari, H\. Michalewski, J\. Austin, D\. Bieber, D\. Dohan, A\. Lewkowycz, M\. Bosma, D\. Luan, C\. Sutton, and A\. Odena \(2021\)Show your work: scratchpads for intermediate computation with language models\.CoRRabs/2112\.00114\.External Links:[Link](https://arxiv.org/abs/2112.00114),2112\.00114Cited by:[§2](https://arxiv.org/html/2606.13349#S2.SS0.SSS0.Px2.p1.1)\. 
- OpenAI \(2025\)Gpt\-oss\-120b & gpt\-oss\-20b model card\.CoRRabs/2508\.10925\.External Links:[Link](https://doi.org/10.48550/arXiv.2508.10925),[Document](https://dx.doi.org/10.48550/ARXIV.2508.10925),2508\.10925Cited by:[§4\.1](https://arxiv.org/html/2606.13349#S4.SS1.SSS0.Px3.p1.1)\. - J\. Ou, W\. G\. Walden, K\. Sanders, Z\. Jiang, K\. Sun, J\. Cheng, W\. Jurayj, M\. Wanner, S\. Liang, C\. Morgan, S\. Han, W\. Wang, C\. May, H\. Recknor, D\. Khashabi, and B\. V\. Durme \(2025\)CLAIMCHECK: how grounded are LLM critiques of scientific papers?\.InFindings of the Association for Computational Linguistics: EMNLP 2025, Suzhou, China, November 4\-9, 2025,C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\),pp\. 21712–21735\.External Links:[Link](https://aclanthology.org/2025.findings-emnlp.1185/)Cited by:[§1](https://arxiv.org/html/2606.13349#S1.p3.1)\. - Z\. Robertson \(2023\)GPT4 is slightly helpful for peer\-review assistance: A pilot study\.CoRRabs/2307\.05492\.External Links:[Link](https://doi.org/10.48550/arXiv.2307.05492),[Document](https://dx.doi.org/10.48550/ARXIV.2307.05492),2307\.05492Cited by:[§1](https://arxiv.org/html/2606.13349#S1.p2.1),[§2](https://arxiv.org/html/2606.13349#S2.SS0.SSS0.Px1.p1.1)\. - A\. Sadallah, T\. Baumgärtner, I\. Gurevych, and T\. Briscoe \(2025\)The good, the bad and the constructive: automatically measuring peer review’s utility for authors\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, EMNLP 2025, Suzhou, China, November 4\-9, 2025,C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\),pp\. 28991–29021\.External Links:[Link](https://doi.org/10.18653/v1/2025.emnlp-main.1476),[Document](https://dx.doi.org/10.18653/V1/2025.EMNLP-MAIN.1476)Cited by:[Appendix E](https://arxiv.org/html/2606.13349#A5.p1.1),[§4\.2](https://arxiv.org/html/2606.13349#S4.SS2.SSS0.Px1.p1.1),[§4\.2](https://arxiv.org/html/2606.13349#S4.SS2.p1.4)\. - G\. Sahu, H\. 
Larochelle, L\. Charlin, and C\. Pal \(2025\)ReviewerToo: should AI join the program committee? A look at the future of peer review\.arXiv preprint arXiv:2510\.08867\.External Links:[Link](https://arxiv.org/abs/2510.08867)Cited by:[§2](https://arxiv.org/html/2606.13349#S2.SS0.SSS0.Px1.p1.1)\. - H\. Shin, J\. Tang, Y\. Lee, N\. Kim, H\. Lim, J\. Y\. Cho, H\. Hong, M\. Lee, and J\. Kim \(2025\)Mind the blind spots: A focus\-level evaluation framework for LLM reviews\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, EMNLP 2025, Suzhou, China, November 4\-9, 2025,C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\),pp\. 35630–35656\.External Links:[Link](https://doi.org/10.18653/v1/2025.emnlp-main.1805),[Document](https://dx.doi.org/10.18653/V1/2025.EMNLP-MAIN.1805)Cited by:[§2](https://arxiv.org/html/2606.13349#S2.SS0.SSS0.Px1.p1.1)\. - N\. Shinn, F\. Cassano, A\. Gopinath, K\. Narasimhan, and S\. Yao \(2023a\)Reflexion: language agents with verbal reinforcement learning\.InAdvances in Neural Information Processing Systems,External Links:[Link](https://dblp.org/rec/conf/nips/ShinnCGNY23)Cited by:[§2](https://arxiv.org/html/2606.13349#S2.SS0.SSS0.Px2.p1.1)\. - N\. Shinn, F\. Cassano, A\. Gopinath, K\. Narasimhan, and S\. Yao \(2023b\)Reflexion: language agents with verbal reinforcement learning\.InAdvances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 \- 16, 2023,A\. Oh, T\. Naumann, A\. Globerson, K\. Saenko, M\. Hardt, and S\. Levine \(Eds\.\),External Links:[Link](http://papers.nips.cc/paper%5C_files/paper/2023/hash/1b44b878bb782e6954cd888628510e90-Abstract-Conference.html)Cited by:[§2](https://arxiv.org/html/2606.13349#S2.SS0.SSS0.Px2.p1.1)\. - Y\. Weng, M\. Zhu, G\. Bao, H\. Zhang, J\. Wang, Y\. Zhang, and L\. 
Yang \(2025\)CycleResearcher: improving automated research via automated review\.InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24\-28, 2025,External Links:[Link](https://openreview.net/forum?id=bjcsVLoHYs)Cited by:[Appendix E](https://arxiv.org/html/2606.13349#A5.p1.1),[§2](https://arxiv.org/html/2606.13349#S2.SS0.SSS0.Px1.p1.1),[§4\.1](https://arxiv.org/html/2606.13349#S4.SS1.SSS0.Px1.p1.1),[§4\.1](https://arxiv.org/html/2606.13349#S4.SS1.SSS0.Px2.p1.1)\. - Y\. Weng, M\. Zhu, Q\. Xie, Z\. Ning, S\. Li, P\. Lu, Z\. Lin, E\. Gu, Q\. Sun, and Y\. Zhang \(2026\)DeepReviewer 2\.0: A traceable agentic system for auditable scientific peer review\.CoRRabs/2604\.09590\.External Links:[Link](https://doi.org/10.48550/arXiv.2604.09590),[Document](https://dx.doi.org/10.48550/ARXIV.2604.09590),2604\.09590Cited by:[§2](https://arxiv.org/html/2606.13349#S2.SS0.SSS0.Px1.p1.1)\. - L\. D\. Willis \(2024\)The peer review process\.Respiratory care69\(4\),pp\. 492–499\.External Links:[Link](https://pubmed.ncbi.nlm.nih.gov/38538018/)Cited by:[§1](https://arxiv.org/html/2606.13349#S1.p3.1)\. - Y\. Yamada, R\. T\. Lange, C\. Lu, S\. Hu, C\. Lu, J\. N\. Foerster, J\. Clune, and D\. Ha \(2025\)The AI scientist\-v2: workshop\-level automated scientific discovery via agentic tree search\.CoRRabs/2504\.08066\.External Links:[Link](https://doi.org/10.48550/arXiv.2504.08066),[Document](https://dx.doi.org/10.48550/ARXIV.2504.08066),2504\.08066Cited by:[§1](https://arxiv.org/html/2606.13349#S1.p2.1),[§2](https://arxiv.org/html/2606.13349#S2.SS0.SSS0.Px1.p1.1),[§4\.1](https://arxiv.org/html/2606.13349#S4.SS1.SSS0.Px2.p1.1)\. - S\. Yan, X\. Yang, Z\. Huang, E\. Nie, Z\. Ding, Z\. Li, X\. Ma, H\. Schütze, V\. Tresp, and Y\. 
Ma (2025) Memory-r1: enhancing large language model agents to manage and utilize memories via reinforcement learning. CoRR abs/2508.19828. External Links: [Link](https://doi.org/10.48550/arXiv.2508.19828), [Document](https://dx.doi.org/10.48550/ARXIV.2508.19828), 2508.19828. Cited by: [§2](https://arxiv.org/html/2606.13349#S2.SS0.SSS0.Px2.p1.1).
- A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al. (2025) Qwen3 technical report. CoRR abs/2505.09388. External Links: [Link](https://doi.org/10.48550/arXiv.2505.09388), [Document](https://dx.doi.org/10.48550/ARXIV.2505.09388), 2505.09388. Cited by: [§4.1](https://arxiv.org/html/2606.13349#S4.SS1.SSS0.Px3.p1.1).
- S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. R. Narasimhan, and Y. Cao (2023) ReAct: synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. External Links: [Link](https://openreview.net/forum?id=WE%5C_vluYUL-X). Cited by: [§2](https://arxiv.org/html/2606.13349#S2.SS0.SSS0.Px2.p1.1).
- R. Ye, X. Pang, J. Chai, J. Chen, Z. Yin, Z. Xiang, X. Dong, J. Shao, and S. Chen (2024) Are we there yet? revealing the risks of utilizing large language models in scholarly peer review. CoRR abs/2412.01708. External Links: [Link](https://doi.org/10.48550/arXiv.2412.01708), [Document](https://dx.doi.org/10.48550/ARXIV.2412.01708), 2412.01708. Cited by: [§1](https://arxiv.org/html/2606.13349#S1.p3.1).
- S. Zeng, K. Tian, K. Zhang, Y. Wang, J. Gao, R. Liu, S. Yang, J. Li, X. Long, J. Ma, B. Qi, and B. Zhou (2025) ReviewRL: towards automated scientific review with RL. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, EMNLP 2025, Suzhou, China, November 4-9, 2025, C. Christodoulopoulos, T. Chakraborty, C. Rose, and V. Peng (Eds.), pp. 16931–16943. External Links: [Link](https://doi.org/10.18653/v1/2025.emnlp-main.857), [Document](https://dx.doi.org/10.18653/V1/2025.EMNLP-MAIN.857). Cited by: [§2](https://arxiv.org/html/2606.13349#S2.SS0.SSS0.Px1.p1.1).
- L. Zheng, W. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. P. Xing, H. Zhang, J. E. Gonzalez, and I. Stoica (2023) Judging llm-as-a-judge with mt-bench and chatbot arena. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (Eds.). External Links: [Link](http://papers.nips.cc/paper%5C_files/paper/2023/hash/91f18a1287b398d378ef22505bf41832-Abstract-Datasets%5C_and%5C_Benchmarks.html). Cited by: [§4.2](https://arxiv.org/html/2606.13349#S4.SS2.SSS0.Px1.p1.1).
- M. Zhu, Y. Weng, L. Yang, and Y. Zhang (2025) DeepReview: improving llm-based paper review with human-like deep thinking process. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2025, Vienna, Austria, July 27 - August 1, 2025, W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar (Eds.), pp. 29330–29355. External Links: [Link](https://aclanthology.org/2025.acl-long.1420/). Cited by: [Appendix E](https://arxiv.org/html/2606.13349#A5.p1.1), [§1](https://arxiv.org/html/2606.13349#S1.p2.1), [§2](https://arxiv.org/html/2606.13349#S2.SS0.SSS0.Px1.p1.1), [§3.3.2](https://arxiv.org/html/2606.13349#S3.SS3.SSS2.Px2.p1.4), [§4.1](https://arxiv.org/html/2606.13349#S4.SS1.SSS0.Px1.p1.1), [§4.1](https://arxiv.org/html/2606.13349#S4.SS1.SSS0.Px2.p1.1), [§4.2](https://arxiv.org/html/2606.13349#S4.SS2.p1.4).
- Z. Zhuang, J. Chen, H. Xu, Y. Jiang, and J. Lin (2025) Large language models for automated scholarly paper review: A survey. Inf. Fusion 124, pp. 103332. External Links: [Link](https://doi.org/10.1016/j.inffus.2025.103332), [Document](https://dx.doi.org/10.1016/J.INFFUS.2025.103332). Cited by: [§1](https://arxiv.org/html/2606.13349#S1.p1.1).

## Appendix A Action Schema

Table [5](https://arxiv.org/html/2606.13349#A1.T5) lists the complete action space available to the ProReviewer agent. Each turn, the agent outputs a JSON object with two fields: action (exactly one environment action) and memory_operations (a list of zero or more log actions).

Table 5: Complete action schema for ProReviewer. Environment actions interact with the paper; log actions update the review log.

The review log maintained by the agent has four components:

- Claims (C1, C2, …): Assertions extracted from the paper, each with a verification status and source section.
- Questions (Q1, Q2, …): Points of uncertainty raised during reading, each with a resolution status.
- Notes (N1, N2, …): The agent's intermediate observations, plans, or thoughts.
- Review Outline: The final structured review (summary, strengths, weaknesses, questions, overall score), where each entry must reference evidence IDs.

Below is an example of a single-turn agent output illustrating the JSON format:

```
{
  "memory_operations": [
    {"op": "log", "args": {"type": "claim", "text": "Model achieves 92% accuracy on MMLU", "section": "4.2", "claim_type": "empirical"}},
    {"op": "update", "args": {"entry_id": "Q1", "status": "resolved", "answer": "Ablation in Table 3 confirms contribution"}}
  ],
  "action": {"name": "read_section", "args": {"section_name": "5. Analysis"}}
}
```

### A.1 System Prompt

Figure [4](https://arxiv.org/html/2606.13349#A1.F4) presents the complete system prompt used by the ProReviewer agent.

    # Task
    You are reviewing a scientific paper. Your objective is to produce an accurate, internally consistent, and evidence-based review with: summary, strengths, weaknesses, questions for authors, and an overall score (1-10). You maintain a review log which help track your analysis and reasoning process. Your final review output is based on this log, so keep it updated and organized.

    # Action space
    Each turn, output a JSON object with two fields:
    { "memory_operations": [...], "action": {...} }
    "memory_operations" is a list of log operations to update your review log (can be empty []). "action" is exactly one paper action.

    ## Paper Actions
    - read_section: {"name": "read_section", "args": {"section_name": "..."}}
    - look_up: {"name": "look_up", "args": {"query": "..."}}
    - finish: {"name": "finish", "args": {}}  (You MUST call finish before running out of turns or your review is discarded.)

    ## Log Operations
    - log: Record a new entry. Always use "op": "log".
      Claim:    {"op":"log","args":{"type":"claim","text":"...","section":"2.1","claim_type":"empirical"}}
      Question: {"op":"log","args":{"type":"question","text":"...","section":"3","question_type":"methodology"}}
      Note:     {"op":"log","args":{"type":"note","text":"...","section":"4"}}
      "section" is required. For claims, optionally add "issues": [...]. For questions, optionally add "related_claims": [...].
      All entries must cite concrete paper elements (e.g., Eq 3, Table 2, Fig 4(b)) and include specific details.
    - update: Update the status of an existing claim or question.
      {"op":"update","args":{"entry_id":"C1|Q1","status":"...","reasoning":"..."}}
      For claims (C*): status in [supported, weak, invalid, to_be_verified]. Optionally include "cross_references": [...]
      For questions (Q*): status in [resolved, partially_answered, open]. Include "answer": "...". Optionally "answer_sections": [...]
    - outline: Add one entry to your review outline.
      {"op":"outline","args":{"section":"summary|strengths|weaknesses|questions|overall_score","content":"...","tags":[...]}}
      For overall_score, content MUST be an integer 1-10. Each point in strengths/weaknesses MUST be grounded in the records
      (claims, questions, notes), reflected by the tags (C1, Q2, N3, etc). Every weakness and strength must cover a distinct issue.

    # Review Log
    You maintain a review log as persistent memory across turns. It has four components:
    - Claims: Authors' statements you extracted, each with a verification status.
    - Questions: Points of uncertainty or suspicion, each with a resolution status.
    - Notes: Your observations, plans or thoughts.
    - Review Outline: Your final verdict -- only add when confident based on evidence.
    Guidelines:
    - Use `log` to record claims, questions, and notes. Use `update` to change status after gathering evidence.
    - Use `outline` to build your review. When you call `finish`, the outline becomes your final output.
    - Verify all logged claims and answer all open questions. Never repeat a look_up with the same query.
    - If multiple look_ups return no matches, the topic may be absent from the paper -- this itself can be a weakness.
    Output valid JSON only.

Figure 4: System prompt for the ProReviewer agent (used during both RL training and inference).

## Appendix B Dataset Details

##### Training Data.

We curate 4,011 submissions from ICLR 2025 for training and validation. Paper manuscripts, including appendices, are fetched from the arXiv repository and converted from HTML to parseable Markdown format. Since ICLR allows authors to update their manuscripts during the rebuttal period, we carefully match each paper's initial submission with its corresponding reviews and initial scores, ensuring that the review text is aligned with the manuscript version it assessed rather than a revised version modified in response to reviewer feedback.
For each paper, we collect the full set of official reviews, including textual assessments (summary, strengths, weaknesses, questions), overall ratings on a 1–10 scale, and reviewer confidence scores. After filtering for version alignment and review completeness, we split the data 90%/10% into training and validation sets.

##### Evaluation Data.

For evaluation, we sample 1,000 papers from ICLR 2026 submissions, ensuring temporal separation from the training set to prevent data leakage.

##### Data Distribution.

Table [6](https://arxiv.org/html/2606.13349#A2.T6) summarizes the statistics of each data split.

Table 6: Dataset statistics across splits. Avg Tokens is the mean paper length. Avg Rating is the mean overall score from human reviewers. Accept % is the proportion of accepted papers.

##### SFT Trace Generation.

To produce supervised fine-tuning data, we use a teacher model (i.e., Qwen3.5-397B-A17B) to reconstruct the review process that would naturally produce a given human review. For each paper, we select multiple human reviews that are sufficiently detailed (long review text) and whose self-reported confidence is $\geq 4$, increasing the diversity and quality of the resulting traces. The teacher receives the paper and the human review (summary, strengths, weaknesses, questions, and overall score). It then generates a multi-turn interaction trace (reading sections, logging claims, raising questions, taking notes, verifying evidence, and incrementally building the review outline) that faithfully reflects how a thorough reviewer would engage with the paper. The human review serves as a minimum coverage floor: the reconstructed trace must cover at least all points from the reference review but may include additional findings. This procedure yields 31,312 step-level training instances from 1,485 unique papers, grounded in actual human judgments without requiring human annotators to produce step-by-step traces.
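The review selection step in SFT trace generation can be sketched as a simple filter. This is a minimal illustration, not the authors' pipeline: the `Review` record, the `select_reviews` helper, and the `min_words` threshold are hypothetical (the text says only "long review text"); the confidence cutoff of 4 is taken from the text.

```python
from dataclasses import dataclass


@dataclass
class Review:
    """Hypothetical container for one human review."""
    text: str
    confidence: int  # reviewer's self-reported confidence


def select_reviews(reviews, min_confidence=4, min_words=300):
    """Keep reviews that are confident and sufficiently detailed.

    min_confidence=4 follows the text; min_words is an assumed proxy
    for "long review text" and is not specified in the paper.
    """
    return [
        r for r in reviews
        if r.confidence >= min_confidence and len(r.text.split()) >= min_words
    ]


reviews = [
    Review(text="word " * 400, confidence=4),  # kept
    Review(text="word " * 400, confidence=3),  # dropped: confidence below 4
    Review(text="too short", confidence=5),    # dropped: fewer than min_words
]
selected = select_reviews(reviews)
```

Each selected review would then be passed to the teacher model together with the paper to reconstruct a trace.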
## Appendix C Hyperparameters

Table [7](https://arxiv.org/html/2606.13349#A3.T7) lists the key hyperparameters used in training.

Table 7: Training hyperparameters for SFT and GRPO RL stages for ProReviewer.

##### Reward Weights.

Training follows a two-phase curriculum. In Phase 1, only deterministic, rule-based rewards are active: syntactic validity (weight 1.0), review completeness (weight 1.0), and score alignment (weight 2.0). Phase 2 retains all Phase 1 rewards, with the score alignment weight reduced to 1.0, and additionally introduces the LLM-judge-based content quality reward (weight 2.0), which combines technical depth and grounding (§[3.3](https://arxiv.org/html/2606.13349#S3.SS3)).

## Appendix D Complexity Analysis

##### Theoretical Comparison.

Let $N$ denote the paper length in tokens, $T$ the number of agent steps, and $c$ the average context size per step. For a single-pass method, the computational cost of one forward pass scales as $\mathcal{O}(N^{2})$ under standard self-attention (or $\mathcal{O}(N)$ with linear-attention variants), since the model must attend over the full paper. For a multi-stage pipeline with $K$ stages, the cost is $\mathcal{O}(K \cdot N^{2})$ as each stage typically re-processes the full paper. For ProReviewer, the state at step $t$ has size $c_{t} = |\mathcal{P}| + |\mathcal{L}_{t}| + |\mathcal{C}_{t}|$, where the paper index $|\mathcal{P}|$ and current context $|\mathcal{C}_{t}|$ are bounded, but the review log $|\mathcal{L}_{t}|$ grows as the agent accumulates entries. In the worst case, $|\mathcal{L}_{t}| = \mathcal{O}(t)$, so the per-step cost at step $t$ is $\mathcal{O}(c_{t}^{2})$ and the total cost across $T$ steps is $\mathcal{O}(\sum_{t=1}^{T} c_{t}^{2})$.
In practice, the review log remains compact: with $T_{\max} = 30$ steps and short structured entries (each $\sim$50–100 tokens), the log reaches $\sim$1.5–2K tokens at termination. Combined with the paper index ($\sim$200 tokens) and current section ($\sim$2K tokens), the effective context at the final step is $\sim$4–5K tokens, still substantially smaller than the full paper ($\sim$12–20K tokens). Thus, while $c_{t}$ is not strictly constant, it grows slowly and remains bounded because the number of steps is capped at $T_{\max}$, making the total cost $\mathcal{O}(T \cdot c_{T}^{2}) \ll \mathcal{O}(N^{2})$ for typical papers.

##### Empirical Comparison.

Unlike single-pass methods that process the entire paper in one forward pass, ProReviewer performs multiple shorter forward passes (one per step), each conditioned on a compact state rather than the full paper. Table [8](https://arxiv.org/html/2606.13349#A4.T8) compares the token consumption across paradigms. Although ProReviewer issues more LLM calls per paper, the per-call context is substantially smaller (state $\approx$ 4K tokens vs. full paper $\approx$ 12–20K tokens), and the total token budget remains comparable.

Table 8: Inference complexity comparison for a typical 16K-token paper. Total tokens includes both input and output tokens. Values are approximate and may vary with paper length.

## Appendix E Evaluation Rubrics

Since no single standardized rubric exists for review-quality evaluation, we construct a four-dimensional rubric grounded in prior work: we build on the utility framework of Sadallah et al. ([2025](https://arxiv.org/html/2606.13349#bib.bib15)) and integrate rubric designs from DeepReview (Zhu et al., [2025](https://arxiv.org/html/2606.13349#bib.bib13)), ScholarPeer (Goyal et al., [2026](https://arxiv.org/html/2606.13349#bib.bib19)), and CycleReviewer (Weng et al., [2025](https://arxiv.org/html/2606.13349#bib.bib14)).

##### Grounding (1–5).
Measures whether the review model can identify the specific part of the paper being addressed. A comment is explicitly grounded (scores 4–5) only if it includes a structural reference (section number, table/figure number, equation number, or a direct quote). Referring to a concept or method name without a structural locator is weak grounding (scores 2–3).

- 5: Fully grounded and specific. Explicitly references which part of the paper is addressed and clearly specifies what needs to be addressed.
- 4: Fully grounded but under-specific. References the part but does not clearly specify the issue.
- 3: Weakly grounded but specific. The referenced part is ambiguous, but the issue is clearly specified.
- 2: Weakly grounded and not specific.
- 1: Not grounded at all.

##### Actionability (1–5).

Assesses actionability based on two criteria: (1) whether actions are explicitly stated or must be inferred, and (2) whether the suggested actions are concrete or vague.

- 5: Highly actionable. Explicit actions with concrete implementation details.
- 4: Mostly actionable. Implicit actions but concrete implementation guidance.
- 3: Somewhat actionable. Explicit actions but vague on execution.
- 2: Borderline actionable. Implicit and vague.
- 1: Unactionable. No meaningful improvement guidance.

##### Technical Depth (1–5).

Evaluates technical engagement and analytical reasoning.

- 5: Technical and reasoned. Engages with specific technical content (methodology, algorithms, proofs) and explains why the issue is problematic.
- 4: Technical but unreasoned. Engages with technical content without explaining consequences.
- 3: Non-technical but reasoned. Does not engage with specific technical content but provides reasoning about why the gap matters.
- 2: Non-technical and unreasoned.
- 1: No substance. Pure surface observation.

##### Verifiability (1–5 or X).
First determines whether the weakness contains a claim (opinion, judgment, or deduction beyond stating facts). If no claim is present, the score is X (mapped to 0). Otherwise:

- 5: Fully verifiable. Claim thoroughly supported by explicit evidence, precise reasoning, or external references.
- 4: Mostly verifiable. Well-supported with minor gaps.
- 3: Somewhat verifiable. Some justification but lacks key elements.
- 2: Borderline verifiable. Vague or insufficient support.
- 1: Unverifiable. No supporting evidence or reasoning.

We use three independent judges (GPT-5.4 nano, DeepSeek-V4 flash, and RevUtil) and average their scores. Per-judge results are reported in Appendix [I](https://arxiv.org/html/2606.13349#A9).

## Appendix F Human Evaluation

##### Evaluators.

We recruit 5 expert evaluators who have served as reviewers for top-tier AI conferences (e.g., NeurIPS, ICLR, ACL, EMNLP).

##### Evaluation Protocol.

Human evaluation uses pairwise comparison of reviews. For each paper, evaluators compare pairs of reviews from different systems. Each comparison presents the paper and two anonymized reviews (labeled "Review A" and "Review B"), and evaluators select which review provides higher-quality feedback.

##### Systems Compared.

Five systems are included: AgentReview, AI Scientist v2, CycleReviewer, DeepReview, and ProReviewer. Systems are assigned anonymous identifiers (system_P through system_T) with a seeded random permutation; evaluators never see real system names. For each paper, 5 pairwise comparisons are generated from a rotating cycle design that ensures all 10 possible system pairs are covered in aggregate, with each system appearing twice per paper. The A/B presentation order is randomized with a coin flip per comparison.

##### Paper Selection and Overlap.

50 papers are randomly sampled from the test set. Each evaluator reviews 30 papers (150 pairwise comparisons).
Approximately 20% of each evaluator's papers are shared across all evaluators to enable inter-annotator agreement measurement.

### F.1 Annotator Guidelines

Figure [5](https://arxiv.org/html/2606.13349#A6.F5) reproduces the complete guidelines provided to human evaluators.

    # Annotator Guidelines: Review Quality Pairwise Evaluation

    ## Overview
    You will compare pairs of peer reviews for the same academic paper across 4 quality dimensions. For each comparison, you see:
    - Paper Info: The paper title, abstract, and a link to the full paper on arXiv
    - Review A and Review B: Two anonymized reviews

    For each dimension, judge which review is better, or whether they are tied (A wins / B wins / Tie).

    ## Dimension 1: Actionability
    Which review provides more actionable feedback?
    Consider: Are suggestions explicit or implicit? Concrete or vague? Does the review tell authors *how* to improve?
      (5) Highly Actionable: Explicit actions with concrete implementation details
      (4) Mostly Actionable: Implicit actions but concrete execution details
      (3) Somewhat Actionable: Explicit actions but vague on execution
      (2) Borderline Actionable: Implicit and vague actions
      (1) Unactionable: Pure observations without suggestions

    ## Dimension 2: Grounding
    Which review better grounds its critiques in specific parts of the paper?
    Consider: Does it reference specific sections, tables, figures, equations? Or make generic claims?
      (5) Fully Grounded and Specific: References sections/tables/figures AND specifies what is wrong
      (4) Fully Grounded but Under-Specific: References location but vague about the issue
      (3) Weakly Grounded and Specific: No structural reference but clear about the issue
      (2) Weakly Grounded and Not Specific: Neither referenced nor specific
      (1) Not Grounded: No identifiable paper section addressed

    ## Dimension 3: Verifiability
    Which review better supports its claims with justification?
    Consider: Does the reviewer provide logical reasoning, common knowledge, or external references?
      (5) Fully Verifiable: Claims thoroughly supported by reasoning or references
      (4) Mostly Verifiable: Well-supported but minor gaps
      (3) Somewhat Verifiable: Some justification but lacks key elements
      (2) Borderline Verifiable: Vague or insufficient support
      (1) Unverifiable: No supporting evidence or justification

    ## Dimension 4: Technical Depth
    Which review demonstrates deeper technical analysis?
    Consider: Does it engage with methodology, identify assumption violations, explain *why* an issue matters?
      (5) Deep technical critique: Theoretical gaps, assumption violations, subtle correctness issues
      (4) Technical analysis: Identifies assumptions, edge cases, component interactions
      (3) Methodological engagement: Questions justification or applicability
      (2) Mixed surface + technical: Describes methods without questioning assumptions
      (1) Surface only: "Dataset is small", "writing could be clearer"

    ## Important Notes
    - Evaluate each dimension independently.
    - Ignore review length -- a shorter, focused review can be better than a longer, vague one.
    - Do not try to identify the systems; reviews are anonymized.
    - Be consistent. Apply the same standards across all comparisons.
    - When in doubt, re-read the paper via the provided arXiv link.

Figure 5: Complete annotator guidelines for human evaluation of review quality via pairwise comparison.

### F.2 Bradley-Terry Analysis

Based on the evaluated data, we fit a Bradley-Terry (BT) model to derive proper strength estimates. Ties are split as 0.5 wins for each side. Scores are reported on an Elo-like scale (400 points $\approx$ $10\times$ strength ratio), anchored at 1000.

Table 9: Bradley-Terry Elo scores with 95% bootstrap confidence intervals (2,000 resamples) for each dimension.

Table [9](https://arxiv.org/html/2606.13349#A6.T9) shows the full ranking. ProReviewer achieves the highest BT score on every dimension, with non-overlapping 95% confidence intervals against most baselines.

### F.3 Inter-Annotator Agreement

We measure inter-annotator agreement on the 20% overlap set using three chance-corrected metrics: Krippendorff's $\alpha$, Fleiss' $\kappa$, and average pairwise quadratic-weighted Cohen's $\kappa^{2}$.

Table 10: Inter-annotator agreement across dimensions on the overlap set.
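The Bradley-Terry fit described in F.2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' analysis code: it uses the standard minorization-maximization update for BT strengths, treats "anchored at 1000" as meaning that the mean Elo-like score equals 1000, and omits the bootstrap. The function names and the toy win matrix are hypothetical.

```python
import math


def fit_bradley_terry(wins, n_iter=500):
    """Fit Bradley-Terry strengths with the standard MM update.

    wins[i][j] is the (possibly fractional) number of times system i
    beat system j; a tie adds 0.5 to both wins[i][j] and wins[j][i].
    """
    k = len(wins)
    p = [1.0] * k
    for _ in range(n_iter):
        new_p = []
        for i in range(k):
            total_wins = sum(wins[i][j] for j in range(k) if j != i)
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(k) if j != i
            )
            # Clamp away from zero so the log below stays defined.
            new_p.append(max(total_wins / denom, 1e-12) if denom > 0 else p[i])
        # Rescale so the geometric mean of the strengths is 1.
        log_mean = sum(math.log(x) for x in new_p) / k
        p = [x / math.exp(log_mean) for x in new_p]
    return p


def to_elo(strengths, anchor=1000.0):
    """Map strengths to an Elo-like scale: 400 points per 10x strength ratio."""
    return [anchor + 400.0 * math.log10(s) for s in strengths]


# Toy data for three systems, 10 comparisons per pair, ties counted as 0.5.
wins = [
    [0.0, 7.5, 8.0],
    [2.5, 0.0, 6.0],
    [2.0, 4.0, 0.0],
]
elo = to_elo(fit_bradley_terry(wins))
```

Confidence intervals like those in Table 9 would be obtained by resampling the comparisons with replacement and refitting on each resample.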
## Appendix G Counterfactual Error Detection

##### Dataset.

We use the counterfactual dataset introduced by Dycke and Gurevych ([2026](https://arxiv.org/html/2606.13349#bib.bib9)), which contains 138 papers from six AI conferences: ACL 2023, ACL 2024, EMNLP 2023, EMNLP 2024, NeurIPS 2024, and ICLR 2025. Our evaluation excludes papers from ICLR 2025 to prevent data overlap and uses the remaining 115 papers. Each paper has an original version and a counterfactual version with one deliberately injected logical error.

##### Perturbation Types.

Three types of errors are injected:

1. Conclusion perturbation: Alters a conclusion to misalign with its underlying result.
2. Finding perturbation: Exaggerates the claim of a finding beyond what the evidence supports.
3. Result perturbation: Changes a result to contradict the conclusion it originally supported.

Each counterfactual paper includes metadata specifying the modification type, the modified claim, and the logical relationship explaining why the injected claim is incorrect.

##### Detection Judgment.

Each review system generates a review of the counterfactual paper. We then use an LLM judge (GPT-5.4 nano) to determine whether any weakness in the generated review identifies or implies the injected error. The judge receives:

- The injected error description (type, modified claim, and why it is wrong).
- The list of weaknesses from the generated review.

Paraphrases and conceptually equivalent observations count as detections. The judge outputs a JSON with four fields: detected (true/false), confidence (high/medium/low), matching_weakness_index, and reasoning. A detection is counted as successful if the judge returns detected: true with confidence: high.
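The success criterion above can be sketched as a small scoring helper. This is an illustrative sketch, not the authors' evaluation code: the function names are hypothetical, the field names follow the text, and treating unparseable judge output as a miss is an assumption.

```python
import json


def is_successful_detection(judge_output: str) -> bool:
    """Return True only for detected == true with confidence == "high"."""
    try:
        result = json.loads(judge_output)
    except json.JSONDecodeError:
        return False  # assumption: unparseable judge output counts as a miss
    if not isinstance(result, dict):
        return False
    return result.get("detected") is True and result.get("confidence") == "high"


def detection_rate(judge_outputs):
    """Fraction of counterfactual papers whose injected error was detected."""
    if not judge_outputs:
        return 0.0
    hits = sum(is_successful_detection(o) for o in judge_outputs)
    return hits / len(judge_outputs)


# Toy judge outputs (invented for illustration).
outputs = [
    '{"detected": true, "confidence": "high", "matching_weakness_index": 2, "reasoning": "..."}',
    '{"detected": true, "confidence": "medium", "matching_weakness_index": 0, "reasoning": "..."}',
    '{"detected": false, "confidence": "high", "matching_weakness_index": null, "reasoning": "..."}',
]
rate = detection_rate(outputs)
```

Under this criterion, a detection reported with medium confidence does not count, which makes the metric conservative.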
## Appendix H Paper Length Analysis

Table [11](https://arxiv.org/html/2606.13349#A8.T11) reports the distribution of papers across the five length bins used in the robustness analysis (Section [5.3](https://arxiv.org/html/2606.13349#S5.SS3)).

Table 11: Distribution of test papers across length bins.

## Appendix I Per-Judge Evaluation Results

Tables [12](https://arxiv.org/html/2606.13349#A9.T12)–[14](https://arxiv.org/html/2606.13349#A9.T14) present the full evaluation results for each of the three judges used in our evaluation: GPT-5.4 nano (Table [12](https://arxiv.org/html/2606.13349#A9.T12)), DeepSeek-V4 flash (Table [13](https://arxiv.org/html/2606.13349#A9.T13)), and the utility-based RevUtil judge (Table [14](https://arxiv.org/html/2606.13349#A9.T14)). The main paper reports averages across all three judges.

Table 12: Evaluation results using GPT-5.4 nano as rubric judge (normalized to [0,1]). All scores are mean±std. Green highlights the best and Blue the second-best result in each column.

Table 13: Evaluation results using DeepSeek-V4 flash as rubric judge (normalized to [0,1]). All scores are mean±std. Green highlights the best and Blue the second-best result in each column.

Table 14: Evaluation results using the RevUtil judge. All scores are mean±std (on [0,1] scale). Green highlights the best and Blue the second-best result in each column.

## Appendix J Case Study

To illustrate how proactive investigation and evidence tracking produce well-grounded critiques, we present ProReviewer's review of an ICLR submission titled "Surf3R: Rapid Surface Reconstruction from Sparse RGB Views in Seconds." The agent completed the review in 30 steps, accumulating 12 claims, 6 questions, and 32 notes in its review log.
Figure [6](https://arxiv.org/html/2606.13349#A10.F6) shows the abridged trajectory: the agent flags suspicious claims early, cross\-references them against experimental evidence, updates their status as new information emerges, and ultimately derives each review point from specific log entries, demonstrating both traceability and proactive investigation\.

Step 1

Env action: read\_section\(abstract\)

Observation: “…completes an entire scene in under 10 seconds … multi\-branch and multi\-view decoding architecture … D\-Normal regularizer …”

Log action:

- log\(claim, C1: “Surf3R completes an entire scene in under 10 seconds\.” unverified\)
- log\(note, N2: “10\-second claim doesn’t specify how many views or what hardware\. Need to verify\.”\)
- log\(question, Q1: “Does the multi\-branch architecture improve quality vs\. a single\-reference baseline?” open\)
- log\(question, Q2: “Does the paper ablate D\-Normal vs\. other regularization methods?” open\)
- …

Steps 2–5: read\_section\(introduction\), read\_section\(methodology\), read\_section\(experiments\), read\_section\(ablation\_study\); the agent logs claims C2–C12 about architecture, loss design, and ablation results\.

Step 6

Env action: read\_section\(ablation\_study\) \(re\-read\)

Observation: Table 5 Row A: removing multi\-branch drops F1 36\.66 → 23\.24\. Row D: removing D\-Normal drops F1 41\.92 → 30\.96\.

Log action:

- log\(note, N8: “Multi\-branch ablation is within the same model—no separately trained single\-branch baseline\.”\)
- update\(Q1, open → partially answered; ablation exists but within\-model only\)
- update\(Q2, open → partially answered; component ablation but no alternative methods\)
- …

Steps 7–22: look\_up\(“10 seconds”\), look\_up\(“GPU”\), look\_up\(“inference time”\), read\_section\(implementation\_details\), …; the agent finds 32 H800 GPUs for training but no inference hardware or timing reported\.

Step 23

Env action: read\_section\(NVS\_results\)

Observation: Table 3 compares Surf3R\-GD only against DUSt3R \(PSNR 15\.06 vs\. 11\.66 at 4 views\)\. No other NVS baselines\.

Log action:

- log\(note, N9: “NVS evaluation compares only against DUSt3R\. No other baselines provided\.”\)
- update\(C1, unverified → unsupported; no inference hardware or timing found\)
- update\(C4, unverified → weak; SOTA claimed but NVS lacks baselines beyond DUSt3R\)
- …

Steps 24–29: read\_section\(surface\_reconstruction\), read\_section\(conclusion\), …; the agent finds a 180× speedup claim that does not specify a baseline and logs N29\.

Step 30: Review generation

Env action: finish

Log action: outline\(summary\) → outline\(strengths\) → outline\(weaknesses\) → outline\(questions\) → outline\(score\)

- Summary: “Surf3R proposes a feed\-forward framework for pose\-free 3D surface reconstruction from sparse RGB views using a multi\-branch architecture and a D\-Normal regularizer based on 3D Gaussians …”
- Strength 1 \[C5, C6, N4\]: “Novel feed\-forward approach for pose\-free surface reconstruction …”
- Strength 2 \[C7, C10, C11\]: “D\-Normal Regularizer confirmed by ablation \(Table 5, Row D: F1 41\.92 → 30\.96\) …”
- Weakness 1 \[C1, N2, N26, N28\]: “The paper claims completion in under 10 seconds \(abstract, introduction\) but provides no hardware specifications \(number of GPUs, inference time\) or number of views\. Section 4\.1 mentions 32 H800 GPUs for training but does not specify inference hardware\.”
- Weakness 2 \[C2, Q1, N8\]: “Table 5 Row A shows that removing multi\-branch degrades F1 from 36\.66 to 23\.24, but this is within the same model\. A proper ablation would train a single\-branch model separately to confirm the architecture’s necessity\.”
- Weakness 3–5: … \(3 more weaknesses about D\-Normal ablation, NVS baselines, and speedup claim\)
- Overall score: 5/10

Figure 6: Abridged review trajectory of ProReviewer on Surf3R \(30 steps; 12 claims, 6 questions, 32 notes\)\. Orange boxes show key steps; gray boxes summarize omitted steps with “…”; the green box shows the final review\. Each review point traces back to evidence entries accumulated during investigation\.
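The log operations in the trajectory above (log, update, and review points that cite entry IDs) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not ProReviewer's implementation: the `LogEntry` and `ReviewLog` classes, their field names, and the `missing_citations` helper are hypothetical names introduced here. Only the entry IDs, texts, and status transitions are taken from Figure 6.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class LogEntry:
    """One review-log entry. Field names are illustrative assumptions."""
    entry_id: str   # e.g. "C1", "Q1", "N2", as in Figure 6
    kind: str       # "claim", "question", or "note"
    text: str
    status: str = ""  # notes in the trajectory carry no status
    history: List[Tuple[str, str]] = field(default_factory=list)


class ReviewLog:
    """Hypothetical in-memory review log keyed by entry ID."""

    def __init__(self) -> None:
        self.entries: Dict[str, LogEntry] = {}

    def log(self, kind: str, entry_id: str, text: str, status: str = "") -> None:
        # Reject duplicate IDs so each citation resolves to exactly one entry.
        if entry_id in self.entries:
            raise ValueError(f"duplicate entry id: {entry_id}")
        self.entries[entry_id] = LogEntry(entry_id, kind, text, status)

    def update(self, entry_id: str, new_status: str) -> None:
        # Raises KeyError if the entry was never logged.
        entry = self.entries[entry_id]
        entry.history.append((entry.status, new_status))
        entry.status = new_status

    def missing_citations(self, cited_ids: List[str]) -> List[str]:
        # Traceability check: cited IDs that have no backing log entry.
        return [i for i in cited_ids if i not in self.entries]


# Replay a few of the steps shown in Figure 6.
log = ReviewLog()
log.log("claim", "C1",
        "Surf3R completes an entire scene in under 10 seconds.", "unverified")
log.log("note", "N2",
        "10-second claim doesn't specify how many views or what hardware.")
log.log("question", "Q1",
        "Does the multi-branch architecture improve quality "
        "vs. a single-reference baseline?", "open")

log.update("Q1", "partially answered")  # Step 6
log.update("C1", "unsupported")         # Step 23

# Weakness 1 cites [C1, N2, N26, N28]; N26 and N28 are elided in the
# abridged figure, so this partial replay reports them as missing.
weakness_1_refs = ["C1", "N2", "N26", "N28"]
missing = log.missing_citations(weakness_1_refs)
print(log.entries["C1"].status, log.entries["C1"].history, missing)
```

Keeping the status history on each entry is one way to make a transition such as unverified → unsupported auditable after the review is written. The citation check mirrors the property the figure caption describes: every review point should resolve to entries that exist in the log.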