Review Arcade: On the Human Alignment and Gameability of LLM Reviews
Summary
This paper investigates how well LLM-generated reviews align with human judgment, using 1k real ACL 2025 submissions. It finds limited agreement and instability across models and prompts, and it shows that scores can be artificially inflated without meaningful changes. The authors advise against relying solely on LLM reviews and call for discussion on their use in handling increasing submission volumes.
Source: https://huggingface.co/papers/2605.28897
As submission numbers continue to rise (NeurIPS 26: 40k+; ARR May 26: 17k+), automated reviewing is becoming increasingly difficult to ignore. This year, NeurIPS, EMNLP, and AAAI are testing automated review pipelines, while new papers from Stanford and Mila claim to present safe and stable setups.
The question of whether we can trust these pipelines therefore becomes increasingly important. Do the generated reviews really align with human judgment, and are the reviews “safe”? Or can they be gamed to artificially inflate the scores without changing any meaningful content?
In our new paper, “Review Arcade: On the Human Alignment and Gameability of LLM Reviews,” we examine 1k real ACL 2025 submissions with real scores and reviews to test whether LLMs align with them.
Our experiments yield three key findings:
- Across five model families, we find only limited agreement with human evaluations, as well as differences between accepted and rejected submissions.
- Even when agreement is present, the results are not stable across models, prompts, or even repetitions of the same evaluation, making reliability highly problematic.
- We find a way to “game” the models with an iterative process over 10 iterations to increase LLM review scores (for up to ~35% of the submissions) without making meaningful changes (see Fig.).
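To make the first two findings concrete, here is a minimal sketch of how agreement and run-to-run stability could be quantified. It is not the paper's method: the choice of Spearman rank correlation, the per-paper standard deviation, the function names, and all scores are assumptions made purely for illustration.

```python
from statistics import mean, pstdev


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def repetition_spread(runs_per_paper):
    """Mean per-paper standard deviation across repeated LLM reviews."""
    return mean(pstdev(runs) for runs in runs_per_paper)


# Hypothetical data: one human score per paper, three LLM runs per paper.
human_scores = [2.0, 3.5, 4.0, 2.5, 3.0]
llm_runs = [
    [3.0, 3.5, 3.0],
    [3.5, 3.0, 4.0],
    [3.5, 3.5, 3.5],
    [3.0, 4.0, 3.5],
    [3.0, 3.0, 3.5],
]

llm_mean_scores = [mean(runs) for runs in llm_runs]
print(f"Spearman agreement: {spearman(human_scores, llm_mean_scores):.2f}")
print(f"Mean run-to-run std: {repetition_spread(llm_runs):.2f}")
```

A rank correlation is used here because review scores are ordinal; other agreement measures would be equally reasonable choices, and the source does not say which ones the authors report.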
Therefore, we strongly advise against relying on reviews generated by LLMs alone and encourage a discussion about whether this should be the solution to the enormous number of submissions.
Full Details: https://arxiv.org/pdf/2605.28897
Similar Articles
Review Arcade: On the Human Alignment and Gameability of LLM Reviews
This paper empirically evaluates the alignment between LLM-generated and human reviews for scientific papers, finding limited and variable alignment. It also shows that authors can 'game' LLM reviews by iteratively revising papers to improve scores, with up to 35% of papers seeing statistically significant score increases.
The Geometry of LLM-as-Judge: Why Inter-LLM Consensus Is Not Human Alignment
This paper geometrically analyzes why LLMs acting as judges agree strongly with each other but weakly with humans, finding that inter-LLM consensus reflects a collapsed subspace rather than true human alignment on subjective rubrics. Post-hoc calibration on human data improves alignment, but even calibrated LLMs fall short of human reliability.
Augmenting Human Evaluation with LLM Judges: How Many Human Reviews Do You Need?
This paper proposes a two-stage sampling design where LLM evaluations are used to augment, rather than replace, human ratings, and provides guidance on determining sample sizes for human and LLM reviews using a doubly robust estimator from missing data literature.
Evaluating LLMs as Human Surrogates in Controlled Experiments
This paper evaluates whether off-the-shelf LLMs can reliably simulate human responses in controlled behavioral experiments by comparing LLM-generated data with human survey responses on accuracy perception. The findings show that while LLMs capture directional effects and aggregate belief-updating patterns, they do not consistently match human-scale effect magnitudes, clarifying when synthetic LLM data can serve as behavioral proxies.
PRISM: A Multi-Dimensional Benchmark for Evaluating LLM Peer Reviewers
Introduces PRISM, a multi-dimensional benchmark for evaluating LLM-based peer reviewers across depth of analysis, novelty assessment, flaw identification, and constructiveness. Findings show LLMs match or beat humans on individual dimensions but lack balanced performance across all, suggesting they are best as supplements to human review.
