Tag
This paper empirically evaluates how well LLM-generated reviews of scientific papers align with human reviews, finding that alignment is limited and variable. It also shows that authors can 'game' LLM reviews by iteratively revising their papers to raise scores: up to 35% of papers see statistically significant score increases.