Review Arcade: On the Human Alignment and Gameability of LLM Reviews

Hugging Face Daily Papers 05/27/26, 12:00 AM Papers

Summary

This paper investigates the alignment of LLM-generated reviews with human judgment using 1k real ACL 2025 submissions, finding limited agreement, instability across models/prompts, and a method to artificially inflate scores without meaningful changes. The authors advise against relying solely on LLM reviews and call for discussion on their use in handling increasing submission volumes.

LLM-generated reviews for scientific papers are gaining considerable traction and are even being officially piloted by major conferences. We have to assume that not only reviewers are using LLM-assistance, but also that authors use LLMs to revise their papers before submitting. In this work, we perform empirical experiments on papers from the 2025 ACL Rolling Review (ARR) to evaluate LLM reviews from both the author and the reviewer perspective. First, we identify a limited alignment of LLM reviews with human ones. In the best-case scenario, the alignment is reasonable. However, we also find that LLM-human alignment varies substantially across prompts and models. Finally, we investigate the scenario in which the author uses an iterative draft-revise workflow to improve the submission according to the LLM review. We find that this "gaming" of LLM reviews can be effective in specific scenarios, leading to a statistically significant increase of overall scores for up to 35% of papers. We publish our code: https://github.com/uhh-hcds/reviewarcade.

Original Article


Cached at: 06/02/26, 03:35 PM

Paper page - Review Arcade: On the Human Alignment and Gameability of LLM Reviews

Source: https://huggingface.co/papers/2605.28897

As submission numbers continue to rise (NeurIPS 26: 40k+; ARR May 26: 17k+), automated reviewing is becoming increasingly difficult to ignore. This year, NeurIPS, EMNLP, and AAAI are testing automated review pipelines, while new papers from Stanford and Mila claim to present safe and stable setups.

The question of whether we can trust automated reviews therefore becomes increasingly important. Do the generated reviews really align with human judgment, and are the reviews “safe”? Or can they be gamed to artificially inflate the scores without changing any meaningful content?

In our new paper, “Review Arcade: On the Human Alignment and Gameability of LLM Reviews,” we examine 1k real ACL 2025 submissions with real scores and reviews to test whether LLMs align with them.

Our experiments yield three key findings:

Across five model families, we find only limited agreement with human evaluations, as well as differences in accepted and rejected submissions.
Even when agreement is present, the results are not stable across models, prompts, or even repetitions of the same evaluation, making reliability highly problematic.
We find a way to “game” the models with an iterative process over 10 iterations that increases LLM review scores for up to ~35% of the submissions without making meaningful changes (see Fig.).

Therefore, we strongly advise against relying on reviews generated by LLMs alone and encourage a discussion about whether this should be the solution to the enormous number of submissions.
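
To make the iterative "gaming" process from the third finding more concrete, here is a minimal Python sketch of a draft-revise loop. Everything in it is an assumption made for illustration: the names game_review, toy_review, and toy_revise are hypothetical, the 2-to-5 score scale is invented, and the toy reviewer is a deliberately naive stand-in for an LLM call. It is not the authors' implementation, which is available in the repository linked above.

```python
from typing import Callable, List, Tuple

# Hypothetical type aliases: a reviewer returns (score, feedback),
# a reviser returns a new draft given the old draft and the feedback.
ReviewFn = Callable[[str], Tuple[float, str]]
ReviseFn = Callable[[str, str], str]


def game_review(draft: str, review_fn: ReviewFn, revise_fn: ReviseFn,
                iterations: int = 10) -> List[float]:
    """Run a draft-revise loop and return the score before and after each round."""
    score, feedback = review_fn(draft)
    scores = [score]
    for _ in range(iterations):
        draft = revise_fn(draft, feedback)      # author revises according to the review
        score, feedback = review_fn(draft)      # revised draft is reviewed again
        scores.append(score)
    return scores


def toy_review(text: str) -> Tuple[float, str]:
    """Naive stand-in reviewer (assumption): rewards the word 'novel', capped at 5.0."""
    hits = text.count("novel")
    return min(5.0, 2.0 + 0.5 * hits), "Emphasize the novelty of the contribution."


def toy_revise(text: str, feedback: str) -> str:
    """Naive stand-in reviser (assumption): appends boilerplate, adds no real content."""
    return text + " This approach is novel."


scores = game_review("We study X.", toy_review, toy_revise)
print(scores)
```

With this toy reviewer, the score rises from 2.0 to the 5.0 cap even though each revision only appends a boilerplate sentence. In a real setup, review_fn and revise_fn would wrap LLM calls, and whether scores inflate at all would depend on the model and prompt.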

Full Details: https://arxiv.org/pdf/2605.28897


Similar Articles

Review Arcade: On the Human Alignment and Gameability of LLM Reviews

The Geometry of LLM-as-Judge: Why Inter-LLM Consensus Is Not Human Alignment

Augmenting Human Evaluation with LLM Judges: How Many Human Reviews Do You Need?

Re-Centering Humans in LLM Personalization

Articulate Intuition or Genuine Analysis? Benchmarking Epistemic Reliability in LLM-as-a-Judge Peer Reviews
