Reward Modeling for Scientific Writing Evaluation

arXiv cs.CL Papers

Summary

This paper proposes SciRM, cost-efficient open-source reward models tailored for evaluating scientific writing through a two-stage training framework that optimizes evaluation preferences and reasoning capabilities. The models generalize across diverse scientific writing tasks without requiring task-specific retraining, addressing limitations of existing LLM-based judges on domain-specific evaluation criteria.


# Reward Modeling for Scientific Writing Evaluation

Source: https://arxiv.org/html/2601.11374

Furkan Şahinuç (1,2), Subhabrata Dutta (1), Iryna Gurevych (1,2)

(1) Ubiquitous Knowledge Processing Lab (UKP Lab), Department of Computer Science and Hessian Center for AI (hessian.AI), Technical University of Darmstadt
(2) Konrad Zuse School of Excellence in Learning and Intelligent Systems (ELIZA)

www.ukp.tu-darmstadt.de

###### Abstract

Scientific writing is an expert-domain task that demands deep domain knowledge, task-specific requirements, and reasoning capabilities that leverage the domain knowledge to satisfy the task specifications. While scientific text generation has been widely studied, its evaluation remains a challenging and open problem. It is critical to develop models that can be reliably deployed for evaluating diverse open-ended scientific writing tasks while adhering to their distinct requirements. However, existing LLM-based judges and reward models are primarily optimized for general-purpose benchmarks with fixed scoring rubrics and evaluation criteria. Consequently, they often fail to reason over sparse knowledge of scientific domains when interpreting task-dependent and multi-faceted criteria. Moreover, fine-tuning for each individual task is costly and impractical for low-resource settings. To bridge these gaps, we propose cost-efficient, open-source reward models tailored for scientific writing evaluation. We introduce a two-stage training framework that initially optimizes scientific evaluation preferences and then refines reasoning capabilities. Our multi-aspect evaluation design and joint training across diverse tasks enable fine-grained assessment and robustness to dynamic criteria and scoring rubrics. Experimental analysis shows that our training regime strongly improves LLM-based scientific writing evaluation. Our models generalize effectively across tasks and to previously unseen scientific writing evaluation settings, allowing a single trained evaluator to be reused without task-specific retraining. We make our code (https://github.com/UKPLab/acl2026-expert-rm) and data (https://tudatalib.ulb.tu-darmstadt.de/handle/tudatalib/4980) publicly available.


Figure 1: Example demonstration of how we formalize scientific writing evaluation and the outputs of the review utility evaluation task. Vanilla LLM-based judges fail to properly reason over the task-specific evaluation criteria and provided examples. Contradictory statements are highlighted in different colors. In contrast, our SciRM model successfully incorporates the given criteria and examples into its reasoning process and correctly evaluates the scientific artifact.

## 1 Introduction

Figure 2: Overview of the SciRM and SciRM-Ref training and testing pipeline. Diverse scientific artifacts are used to construct training data with multiple evaluation aspects and scoring rubrics (see Section 3.1 for details). Models are trained via GRPO in two stages to optimize task specifications and reasoning capabilities, and are evaluated on both seen and unseen scientific writing evaluation tasks.

Due to the strong text generation capabilities of large language models (LLMs), their application to scientific text generation, such as related work generation, review generation, and paper revision, has recently gained increasing attention (Li and Ouyang, 2024; Liang et al., 2024; Afzal et al., 2026). However, without proper evaluation, it is difficult to assess the accuracy and reliability of the generated texts; limitations in evaluation can therefore bottleneck the entire development pipeline. Since scientific writing tasks have diverse, task-specific requirements, developing appropriate evaluation frameworks is challenging, and training evaluators for each individual task is costly and, in some cases, infeasible due to limited data availability.

LLM-as-a-judge approaches (Liu et al., 2023; Zheng et al., 2023) are the most widely adopted evaluation paradigm for scientific writing tasks. However, they often fail to reason over the given domain knowledge and task-specific preferences (see Figure 1). This motivates mechanisms that allow models to reason over, and remain grounded in, explicit evaluation guidelines (i.e., a constitution) at inference time. Inference-time adaptability is a major challenge for existing approaches such as Constitutional AI (Bai et al., 2022), which internalize a fixed constitution during training and therefore cannot be readily applied to a diverse set of evaluation tasks. This rigidity is problematic when evaluating scientific text generation, as evaluation guidelines can differ, and even contradict one another, across aspects, tasks, or domains.

To improve the reasoning capabilities of LLM-based judges, training reward reasoning models has recently gained popularity (Ankner et al., 2024; Guo et al., 2025b; Chen et al., 2025, 2026; Wang et al., 2024). However, existing reward models are primarily optimized for community-standard benchmarks, such as mathematical reasoning, instruction following, and human-preference modeling for coding, helpfulness, and safety (Lambert et al., 2025; Malik et al., 2026; Frick et al., 2025). They therefore fall short of capturing the nuanced requirements of scientific writing evaluation. In addition, the vast majority of reward models encode task preferences in a pairwise manner, which prevents independent assessment of text quality against explicit task-specific criteria. A further drawback is that such models are optimized for fixed scoring rubrics and criteria. Scientific tasks have unique characteristics; each requires domain-specific expertise and evaluation dynamics that differ from those of open-ended creative writing tasks (Chakrabarty et al., 2025).
Even the same scientific artifact can be evaluated from multiple aspects requiring different criteria and rubrics. However, current reward models degrade when applied to tasks whose evaluation rubrics differ from those seen during training (Yang et al., 2024). Şahinuç et al. (2025) offer insights on multi-aspect evaluation for expert-domain tasks, but their approach is limited to a single task, and their most accurate pipeline relies on proprietary LLMs, which restricts scalability and hinders generalization across diverse scientific writing tasks.

In this work, we adapt reward model training strategies to enhance LLMs' capabilities regarding *what to evaluate* and *how to evaluate* in scientific writing tasks. Concretely, we design reward models that are conditioned on an explicit evaluation constitution (a structured description of criteria and label space) present during both training and inference. Furthermore, we introduce a two-stage optimization process in which models not only learn to follow the constitution but also reflexively reinterpret it to correct and stabilize their own reasoning. This amounts to joint optimization of in-context preference following and reasoning abilities, which Lai et al. (2024) identify as a missing piece in modern LMs trained with RL.
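To make the constitution-conditioning idea concrete, the sketch below shows one plausible way to represent such a constitution and serialize it into the evaluator's context at both training and inference time. The `Aspect` and `Constitution` names and the prompt layout are our illustrative assumptions, not the paper's released format.

```python
# Illustrative sketch (not the authors' code): an explicit evaluation
# "constitution" -- per-aspect criteria plus a label space -- carried
# in the prompt so it can be swapped at inference time.
from dataclasses import dataclass


@dataclass
class Aspect:
    name: str                # e.g. "actionability"
    definition: str          # task-specific description of the criterion
    rubric: dict[int, str]   # score -> meaning, e.g. {1: "no actionable advice"}


@dataclass
class Constitution:
    task: str
    aspects: list[Aspect]


def render_judge_prompt(constitution: Constitution, artifact: str) -> str:
    """Serialize the constitution into the evaluator's context so the
    criteria must be reasoned over in-context rather than memorized."""
    lines = [f"Task: {constitution.task}",
             "Evaluate the artifact on each aspect below."]
    for a in constitution.aspects:
        lines.append(f"- {a.name}: {a.definition}")
        for score, meaning in sorted(a.rubric.items()):
            lines.append(f"    score {score}: {meaning}")
    lines += ["", "Artifact:", artifact, "",
              "Return one score per aspect with a short justification."]
    return "\n".join(lines)
```

Because the criteria live in the prompt rather than in the weights, the same trained evaluator can be handed a different, even contradictory, constitution for another task without retraining.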

### Contributions and Findings:

We introduce cost-efficient reward models, *SciRM* and *SciRM-Ref*, specifically designed for scientific writing evaluation (C1). We employ two-stage reinforcement learning to optimize the models for (1) scientific writing evaluation preferences and (2) reasoning abilities that better comprehend the given evaluation criteria, enabling models to explicitly reason over and faithfully adhere to dynamically specified evaluation rules (C2). Rather than producing a single aggregated score, our models evaluate scientific artifacts across multiple aspects, which enhances both the reliability and the interpretability of our evaluation (C3). We curate and process datasets from diverse sources and jointly train our models across multiple tasks to (1) improve robustness against varying scoring rubrics and (2) enhance our models' generalization capabilities (C4). We illustrate the overview of our pipeline in Figure 2.
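As an illustration of the multi-aspect design in (C3), the following sketch parses one score per aspect from an evaluator's generation instead of a single aggregate. The JSON output convention and all function names are assumptions made for illustration, not the authors' implementation.

```python
# Minimal sketch of multi-aspect score extraction, assuming the evaluator
# ends its generation with a JSON object mapping each aspect to a score.
import json
import re


def parse_aspect_scores(model_output: str, aspects: list[str]) -> dict[str, int] | None:
    """Extract a {aspect: score} dict from the evaluator's final JSON block.
    Returns None if any expected aspect is missing, so malformed generations
    can be rejected (or given zero reward) during RL training."""
    match = re.search(r"\{.*\}", model_output, flags=re.DOTALL)
    if match is None:
        return None
    try:
        scores = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    if not all(a in scores for a in aspects):
        return None
    return {a: int(scores[a]) for a in aspects}


# Example: a well-formed generation for a review-utility-style task.
out = 'Reasoning... {"actionability": 3, "grounding": 4, "helpfulness": 2}'
print(parse_aspect_scores(out, ["actionability", "grounding", "helpfulness"]))
```

Keeping the per-aspect scores separate, rather than summing them into one scalar, is what makes each individual judgment inspectable.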

Finally, we test our models on four different scientific writing tasks: related work sections, paper reviews, novelty summary assessments, and instruction-based paper revisions, each with distinct evaluation aspects and scoring rubrics. Experimental results show that our two-stage training scheme substantially boosts the LLMs' scientific writing evaluation performance (F1). In particular, our second training stage yields improvements on tasks requiring strong reasoning capabilities (F2). Furthermore, our models outperform baseline models on tasks not included in training, indicating our models' strong generalization and scaling capabilities (F3).
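Figure 2 states that both training stages use GRPO. As background, here is a minimal sketch of the group-relative advantage computation at the heart of GRPO: rewards for a group of rollouts sampled from the same prompt are standardized within that group. This is the generic formulation, not the paper's exact training code.

```python
# Background sketch of GRPO's group-relative advantage (generic, assumed):
# each prompt gets several sampled evaluations, and each rollout's reward is
# standardized against its own group's statistics.
import torch


def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (num_prompts, group_size) scalar rewards, one per rollout.
    Returns per-rollout advantages standardized within each prompt's group,
    so no learned value network is needed."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)
```

The within-group baseline is what lets a policy improve from relative comparisons among its own samples, which suits evaluation tasks where absolute reward scales vary across rubrics.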

## 2 Related Work

### 2.1 Scientific Writing Evaluation

Utilizing LLMs as direct evaluators is one of the most intuitive approaches to scientific writing evaluation due to their flexibility in being prompted with various task settings without additional training overhead (Liu et al., 2023; Zheng et al., 2023). However, prior work has shown that vanilla LLM-as-a-judge setups are prone to systematic biases and failures in domain-grounded reasoning (Li et al., 2024; Szymanski et al., 2025; Gao et al., 2025). Moreover, the relative scarcity of scientific writing evaluation datasets prevents LLMs from becoming familiar with these tasks during training.

To address these limitations, Jourdan et al. (2025) focus on evaluating scientific text revision, highlighting that LLM-as-a-judge methods struggle to grasp task-specific evaluation aspects in the absence of a gold reference. In a complementary direction, Purkayastha et al. (2025) classify peer reviews according to common reviewer mistakes referred to as *lazy-thinking* patterns. A key limitation of their classification scheme is that, although some review sentences fit multiple error categories, the dataset imposes a one-to-one label assignment. Similarly, Sadallah et al. (2025) introduce a direct-evaluation benchmark that measures the utility of scientific reviews along the aspects of actionability, grounding, verifiability, and helpfulness. Addressing a different task, Şahinuç et al. (2025) propose a fine-grained evaluation framework for related work generation: instead of producing a single overall score, they perform aspect-based evaluation. Although their evaluation achieves strong alignment with human experts, their implementation is limited to the related work generation task.

### 2.2 Evaluation-Tuned Models

Reward modeling has a significant impact on the success of post-training with reinforcement learning. Although verifiable rewards are computationally efficient and perform strongly in math and coding tasks (Wei et al., 2026), many complex tasks, such as judging writing quality or helpfulness, do not admit directly verifiable reward signals. These limitations, together with the recent success of reasoning-centric models on complex tasks (Guo et al., 2025a), have motivated the development of reward reasoning models (Ankner et al., 2024; Guo et al., 2025b; Chen et al., 2025, 2026; Wang et al., 2024). The primary objective of such models is to generate more reliable and accurate rewards on non-verifiable tasks by leveraging intermediate thinking steps. Alongside reasoning, prior work also explores improving long-context comprehension (Tan et al., 2025), integrating external documents (Ma et al., 2025), and bridging pointwise and pairwise scoring paradigms (Whitehouse et al., 2026; Jian et al., 2025).

The main drawback of these models is that they are optimized for standard reward benchmarks, which do not involve scientific writing tasks; they therefore struggle to adapt to varying scientific writing criteria and scoring. Although there are attempts to automatically generate task-specific evaluation criteria alongside the evaluation itself (Liu et al., 2025a, b; Liang et al., 2025), the generated criteria are mostly surface-level and do not match scientific writing evaluation requirements, which are highly specific, diverse, and demand domain expertise. Besides reward models, evaluation-tuned LLM-as-a-judge models also exist. In general, these models are not specialized for a particular task but are designed as general-purpose judges (Liu et al., 2023; Kim et al., 2024; Alexandru et al., 2025; Flow AI, 2024; Shiwen et al., 2024). This provides the flexibility to employ them across a diverse set of tasks, but since they are not specialized for expert-domain tasks like scientific writing, they suffer performance drops. In contrast, our work directly targets the evaluation of scientific writing and generalizes across different tasks with unique criteria and evaluation schemas.
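The reward-reasoning pattern described above can be made concrete with a small sketch: the judge emits an intermediate thinking trace followed by a verdict, and only the parsed verdict is rewarded. The `<think>` tag convention and the reward shaping below are common practice that we assume for illustration; they are not taken from the surveyed papers.

```python
# Sketch of the reward-reasoning pattern, assuming the common
# <think>...</think> convention: score only the final verdict, not the trace.
import re


def split_reasoning(generation: str) -> tuple[str, str]:
    """Separate the thinking trace from the final verdict."""
    m = re.search(r"<think>(.*?)</think>(.*)", generation, flags=re.DOTALL)
    if m is None:                      # no trace emitted: treat it all as verdict
        return "", generation.strip()
    return m.group(1).strip(), m.group(2).strip()


def verdict_reward(generation: str, gold_score: int) -> float:
    """Exact-match reward on the verdict only; a small format bonus keeps
    early rollouts parseable instead of collapsing to zero reward."""
    _, verdict = split_reasoning(generation)
    m = re.search(r"-?\d+", verdict)
    if m is None:
        return 0.0
    return 0.1 + 0.9 * float(int(m.group(0)) == gold_score)
```

Rewarding only the verdict leaves the model free to use the trace for interpreting the rubric, which is precisely where vanilla judges tend to fail on scientific criteria.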

## 3 Methodology

### 3.1 Dataset

To improve the generalization capabilities of our reward models, we curate training data from diverse scientific writing evaluation tasks.
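A hypothetical record format and batching scheme consistent with this curation step is sketched below: each example carries its own rendered constitution and per-aspect labels, and batches interleave tasks so the model sees heterogeneous rubrics. All field and function names are illustrative, not the released data schema.

```python
# Hypothetical unified record for joint training across scientific writing
# evaluation tasks; every example brings its own criteria and label space.
from dataclasses import dataclass
import random


@dataclass
class EvalExample:
    task: str                    # e.g. "review_utility", "related_work"
    constitution: str            # rendered criteria and label space for this task
    artifact: str                # the scientific text to be judged
    gold_scores: dict[str, int]  # per-aspect reference labels


def mixed_task_batches(datasets: dict[str, list[EvalExample]], batch_size: int):
    """Interleave examples from all tasks so every batch exposes the model
    to heterogeneous rubrics, discouraging overfitting to one scoring scheme."""
    pool = [ex for examples in datasets.values() for ex in examples]
    random.shuffle(pool)
    for i in range(0, len(pool), batch_size):
        yield pool[i:i + batch_size]
```

Mixing tasks at the batch level, rather than training per task, is one straightforward way to realize the rubric robustness and generalization goals stated in (C4).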

Similar Articles

C2: Scalable Rubric-Augmented Reward Modeling from Binary Preferences

Hugging Face Daily Papers

C2 proposes a scalable rubric-augmented reward modeling framework that trains a cooperative rubric generator and critical verifier exclusively from binary preferences, eliminating the need for costly rubric annotations while achieving up to 6.5 point gains on RM-Bench.

Author-in-the-Loop Response Generation and Evaluation: Integrating Author Expertise and Intent in Responses to Peer Review

arXiv cs.CL

This paper introduces the Re3Align dataset, REspGen framework, and REspEval evaluation suite for author-in-the-loop response generation in peer review, integrating author expertise and intent signals. The work addresses gaps in NLP formulation of scientific rebuttal writing with comprehensive datasets, controllable generation frameworks, and multi-dimensional evaluation metrics.

Curing Miracle Steps in LLM Mathematical Reasoning with Rubric Rewards

arXiv cs.CL

This paper identifies and addresses the problem of 'Miracle Steps' in LLM mathematical reasoning—unjustified jumps to correct answers that indicate reward hacking—by proposing Rubric Reward Model (RRM), a process-oriented reward function that evaluates entire reasoning trajectories. RRM achieves significant improvements on AIME2024 (26.7% to 62.6% Verified Pass@1024) and reduces Miracle Steps by 71%.