Prompt-Level Reward Specifications for Open-Ended Post-Training
Summary
This paper proposes a prompt-level reward specification framework that separates reward specification from computation, constructing reusable task-adaptive rubrics and executable constraint checkers offline to produce a hybrid reward for open-ended post-training without requiring human annotations or separate reward models.
# Prompt-Level Reward Specifications for Open-Ended Post-Training
Source: [https://arxiv.org/html/2605.29275](https://arxiv.org/html/2605.29275)
Zijun Weng1,2, Xiaohui Hu2, Shuangyong Song2, Yongxiang Li2, Kaidong Yu2, Xuanjing Huang1

1Fudan University, 2Xingchen AGI Lab, China Telecom Artificial Intelligence Technology \(Beijing\) Co\., Ltd\.

25113050287@m\.fudan\.edu\.cn, yukd@chinatelecom\.cn, xjhuang@fudan\.edu\.cn
###### Abstract
Open\-ended post\-training benefits from rewards that make prompt\-specific success conditions explicit, rather than relying only on post\-hoc scalar scores\. In instruction following, writing, and decision\-support tasks, response quality depends on local requirements, holistic preferences, and explicit constraints, but existing reward methods often leave these criteria implicit or cover only narrowly verifiable cases\. We propose a prompt\-level reward specification framework that separates reward specification from reward computation\. Given only prompts, our framework constructs reusable task\-adaptive rubrics and executable hard\-constraint checkers offline, making reward criteria explicit before training and reusable across rollouts\. At scoring time, artifact\-anchored rubric and code scores are combined with an independent global score for residual holistic quality, yielding a normalized hybrid reward over requirement satisfaction, holistic quality, and deterministic constraints\. The framework requires no human preference annotations, reference answers, or a separately trained reward model\. Experiments show that the resulting reward improves offline RM\-style response ranking and supports online reinforcement learning across multiple open\-ended benchmarks\. Ablations further show that rubrics, global scoring, and executable verification provide complementary supervision\.
Corresponding authors: Kaidong Yu, Xuanjing Huang\.
## 1 Introduction
Reward construction remains a central obstacle for open\-ended language\-model post\-training\. The difficulty is not simply to obtain a stronger scalar judge\. For prompts in instruction following, writing, and decision support, a reward must determine what counts as success for that particular prompt: which local requirements should be satisfied, which qualities require holistic judgment, and which constraints can be checked exactly \(Ye et al\., [2024](https://arxiv.org/html/2605.29275#bib.bib41)\)\. We call this the *prompt\-level reward specification problem*\. Transparent and consistent scoring therefore requires an explicit specification of what the reward should measure\.
Existing post\-training paradigms only partially address this problem\. Reinforcement learning from human feedback \(RLHF\) and learned reward models provide broad preference supervision, but their prompt\-specific criteria are usually implicit in human comparisons or model parameters \(Christiano et al\., [2017](https://arxiv.org/html/2605.29275#bib.bib38); Ouyang et al\., [2022](https://arxiv.org/html/2605.29275#bib.bib5)\)\. Reinforcement learning with verifiable rewards \(RLVR\) provides explicit and reliable supervision, but it relies on reference answers, executable tests, or other clean verification signals that are often unavailable in open\-ended tasks \(Shao et al\., [2024](https://arxiv.org/html/2605.29275#bib.bib1); Guo et al\., [2025](https://arxiv.org/html/2605.29275#bib.bib2); Yu et al\., [2026](https://arxiv.org/html/2605.29275#bib.bib20)\)\. Generic LLM judges can score open\-ended responses \(Zheng et al\., [2023](https://arxiv.org/html/2605.29275#bib.bib3); Gu et al\., [2025](https://arxiv.org/html/2605.29275#bib.bib42)\), but their criteria remain opaque\. Rubric\-based judges make criteria more explicit \(Arora et al\., [2025](https://arxiv.org/html/2605.29275#bib.bib6); Saad\-Falcon et al\., [2026](https://arxiv.org/html/2605.29275#bib.bib29)\), and recent work has used rubrics for reward modeling and online RL \(Xu et al\., [2026](https://arxiv.org/html/2605.29275#bib.bib22); Liu et al\., [2026b](https://arxiv.org/html/2605.29275#bib.bib37); Gunjal et al\., [2026](https://arxiv.org/html/2605.29275#bib.bib15); Shao et al\., [2025](https://arxiv.org/html/2605.29275#bib.bib46); Jia et al\., [2026](https://arxiv.org/html/2605.29275#bib.bib19)\)\. However, these methods typically construct or use rubrics during scoring or optimization, rather than treating them as reusable prompt\-level specifications built before rollout generation and shared across responses\. Deterministic checkers are reliable for explicit constraints, but cover only requirements reducible to surface\-level tests, such as length, format, or required strings \(Lou et al\., [2024](https://arxiv.org/html/2605.29275#bib.bib4); Pyatkin et al\., [2026](https://arxiv.org/html/2605.29275#bib.bib9); Jiang et al\., [2024](https://arxiv.org/html/2605.29275#bib.bib43)\)\.
These limitations suggest a different design principle: open\-ended reward construction should separate reward specification from reward computation\. Rather than relying on a judge to reconstruct all prompt\-specific criteria each time a response is scored, the system should first construct reusable prompt\-level reward artifacts that make decomposable and verifiable success conditions explicit\. This provides an RLVR\-like interface for open\-ended tasks: the success conditions are specified before training and reused across rollouts, even though they are only partially verifiable\.
We propose a prompt\-level reward specification framework for open\-ended post\-training\. Given only prompts, it constructs reusable task\-adaptive rubrics and executable hard\-constraint checkers offline\. At scoring time, each prompt\-response pair receives three complementary signals: rubric scoring for fine\-grained requirements, global scoring for holistic quality, and code scoring for explicit constraints\. These normalized signals are combined into a unified hybrid reward\.
This positioning distinguishes our framework from prior rubric\-based and hybrid\-reward approaches\. Our contribution is not simply to add rubrics or verifiers to an LLM judge, but to specify reward criteria before rollout generation and reuse them across scoring calls\. This makes all candidate responses to the same prompt comparable under a shared specification, rather than asking the judge to rediscover or adapt the criteria for each response\. Equally important, we treat local rubric\-based supervision and global holistic judgment as complementary rather than interchangeable\. Rubrics expose prompt\-specific requirements and provide decomposed supervision, but independently scored criteria may miss response\-level interactions among otherwise strong candidates\. The global score complements this by capturing holistic coherence, usefulness, and trade\-offs across criteria, while executable checkers provide deterministic feedback for explicit hard constraints\.
Empirically, we evaluate the same prompt\-level artifacts in two settings: offline response ranking and downstream online RL\. The resulting reward performs strongly on offline reward\-evaluation benchmarks and yields consistent gains across multiple open\-ended RL benchmarks\. Our contributions are as follows:
- •We formulate open\-ended reward construction as a prompt\-level reward specification problem, arguing that rewards should expose prompt\-specific success conditions before scoring responses\.
- •We propose a prompt\-only reward specification framework that separates reward specification from reward computation by constructing reusable task\-adaptive rubrics and executable hard\-constraint checkers offline\.
- •We instantiate these artifacts as a pointwise hybrid reward combining local rubric scoring, independent global scoring, and executable verification, and show its effectiveness for both offline reward evaluation and online RL\.
Figure 1: Left: offline reward specification construction builds reusable reward artifacts from prompts alone, including prompt\-specific rubrics $\mathcal{R}_x$ and executable hard\-constraint checkers $\mathcal{C}_x$\. Right: online reward computation combines rubric\-based, global, and code\-based scoring to produce a unified reward for evaluation and training\.
## 2 Related Work
##### RL for LLM post\-training\.
Reinforcement learning has become a widely used component of LLM post\-training, ranging from PPO\-based RLHF pipelines \(Schulman et al\., [2017](https://arxiv.org/html/2605.29275#bib.bib13); Ouyang et al\., [2022](https://arxiv.org/html/2605.29275#bib.bib5)\) to recent RLVR\-style methods that exploit verifiable signals in math, code, and other reasoning\-heavy domains \(Shao et al\., [2024](https://arxiv.org/html/2605.29275#bib.bib1); Guo et al\., [2025](https://arxiv.org/html/2605.29275#bib.bib2); Yu et al\., [2026](https://arxiv.org/html/2605.29275#bib.bib20)\)\. Preference\-based reward models are broadly applicable, but they require preference supervision and can be opaque; verifiable rewards are reliable, but they require automatically checkable outcomes\. We use Group Sequence Policy Optimization \(GSPO\) \(Zheng et al\., [2025](https://arxiv.org/html/2605.29275#bib.bib14)\) for online RL, but our focus is not a new RL algorithm\. Instead, we study reward construction for open\-ended post\-training, where clean verifiers are often unavailable\.
##### Rubric\-based evaluation and reward design\.
Rubrics provide an interpretable interface for decomposing open\-ended response quality into judgeable criteria \(Kim et al\., [2024](https://arxiv.org/html/2605.29275#bib.bib44)\)\. HealthBench uses conversation\-specific rubrics to evaluate nuanced medical responses \(Arora et al\., [2025](https://arxiv.org/html/2605.29275#bib.bib6)\), while LMUnit formulates fine\-grained natural\-language unit tests \(Saad\-Falcon et al\., [2026](https://arxiv.org/html/2605.29275#bib.bib29)\)\. Recent work further extends rubrics from evaluation to training\-time supervision, including joint rubric generation and judging, rubric\-based reward modeling, synthetic rubric construction, rubric\-as\-reward RL, and rubric quality refinement \(Xu et al\., [2026](https://arxiv.org/html/2605.29275#bib.bib22); Zhang et al\., [2026](https://arxiv.org/html/2605.29275#bib.bib21); Liu et al\., [2026b](https://arxiv.org/html/2605.29275#bib.bib37); Gunjal et al\., [2026](https://arxiv.org/html/2605.29275#bib.bib15); Shen et al\., [2026](https://arxiv.org/html/2605.29275#bib.bib16)\)\. Related reward\-modeling and rubric\-scaffolded RL methods also study how structured criteria can improve training signals \(Wang et al\., [2026a](https://arxiv.org/html/2605.29275#bib.bib45); Shao et al\., [2025](https://arxiv.org/html/2605.29275#bib.bib46); Zhou et al\., [2026](https://arxiv.org/html/2605.29275#bib.bib47)\)\. These works demonstrate the value of rubric\-based supervision, but they mainly treat rubrics as the central supervision mechanism, through rubric construction, rubric\-conditioned reward modeling, or rubric\-guided RL\. In contrast, we treat rubrics as reusable prompt\-level reward artifacts constructed before response scoring, and use them as one component of a pointwise hybrid reward together with global scoring and executable verification\.
##### Prompt\-level reward specifications and hybrid rewards\.
Recent work has begun to combine model\-based judgments with verifiable signals to improve reward reliability\. Agentic Reward Modeling augments reward models with verifiable correctness signals, Omni\-Thinker combines rule\-based verifiable rewards with LLM\-as\-a\-judge preference signals, and VerIF pairs rule\-based code verification with LLM\-based verification for instruction\-following RL \(Peng et al\., [2025b](https://arxiv.org/html/2605.29275#bib.bib17); Li et al\., [2025a](https://arxiv.org/html/2605.29275#bib.bib18); Peng et al\., [2025a](https://arxiv.org/html/2605.29275#bib.bib25)\)\. Recent technical reports also suggest the practical relevance of rubric\-guided, generative, and programmatic reward signals in large\-scale post\-training \(Team et al\., [2025](https://arxiv.org/html/2605.29275#bib.bib48); DeepSeek\-AI, [2026](https://arxiv.org/html/2605.29275#bib.bib39); Xiaomi, [2026](https://arxiv.org/html/2605.29275#bib.bib40)\)\. These works motivate hybrid reward construction, but they do not focus on reusable prompt\-level reward specifications\.
Closest to our work, OpenRS combines adaptive rubrics with verifiable rubric signals for open\-ended RL \(Jia et al\., [2026](https://arxiv.org/html/2605.29275#bib.bib19)\)\. However, the two methods differ in how reward criteria are specified and reused\. OpenRS constructs response\-pair\-conditioned rubrics online for pairwise optimization, whereas our method constructs reusable prompt\-level rubrics and executable checkers offline from prompts alone\. These artifacts are fixed before rollout generation and reused across responses, enabling a pointwise hybrid reward for both offline evaluation and online RL\. Our reward also includes an independent global score, which is not a replacement for rubric scoring but a complementary signal: criterion\-wise rubrics expose local prompt\-specific requirements, while global scoring captures response\-level coherence, usefulness, and interactions among criteria\.
## 3 Method
### 3\.1 Overview and Problem Formulation
Figure [1](https://arxiv.org/html/2605.29275#S1.F1) shows our two\-stage reward specification framework for open\-ended post\-training\. Given a prompt $x$, the policy generates a response $y \sim \pi_{\theta}(\cdot \mid x)$\. Our goal is to construct a reward function $R(x,y)$ from prompts alone, without human preference annotations, reference answers, or a separately trained reward model\.
For each prompt, we first build a prompt\-specific reward specification

$$\mathcal{S}(x) = \{\mathcal{R}_x, \mathcal{C}_x\}, \tag{1}$$

where $\mathcal{R}_x$ is a task\-adaptive rubric and $\mathcal{C}_x$ is a set of executable checkers for explicit hard constraints\. We refer to $\mathcal{R}_x$ and $\mathcal{C}_x$ as reward artifacts; they instantiate the specification, are constructed once offline, and are reused across rollouts\. Online, each candidate response is scored by three complementary signals: a rubric score for fine\-grained requirement satisfaction, a global score for holistic response quality, and a code\-based score for deterministic constraint verification\. The resulting hybrid reward provides an RM\-style scalar scoring interface for offline evaluation and can also be used directly for online RL\.
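A minimal sketch of how the specification in Eq\. \(1\) might be held in memory, assuming Python dataclasses\. The names `Criterion` and `RewardSpec`, the example prompt, and the example rubric items are hypothetical and not taken from the paper; the sketch only illustrates that the artifacts are built once per prompt and then shared by every rollout of that prompt\.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass(frozen=True)
class Criterion:
    """One weighted rubric item (r_i, w_i). Hypothetical container."""
    text: str
    weight: float


# A checker maps (prompt, response) to a pass/fail outcome.
Checker = Callable[[str, str], bool]


@dataclass(frozen=True)
class RewardSpec:
    """Prompt-level specification S(x) = {R_x, C_x}, built once offline."""
    prompt: str
    rubric: List[Criterion]                                # R_x
    checkers: List[Checker] = field(default_factory=list)  # C_x, may be empty


# Hypothetical prompt and artifacts, for illustration only.
spec = RewardSpec(
    prompt="Write a product update in at most 50 words.",
    rubric=[
        Criterion("States what changed in the product", 3.0),
        Criterion("Uses a tone suited to customers", 1.0),
    ],
    checkers=[lambda x, y: len(y.split()) <= 50],
)

# The same spec object is reused for every rollout of this prompt.
rollouts = ["We shipped dark mode today.", "word " * 60]
print([all(c(spec.prompt, y) for c in spec.checkers) for y in rollouts])
# prints [True, False]
```

Keeping the specification immutable mirrors the requirement that all candidate responses to a prompt are scored against the same criteria\.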
### 3\.2 Offline Reward Specification Construction
We construct the prompt\-level reward specification $\mathcal{S}(x)$ through two offline branches: task\-adaptive rubric generation and hard\-constraint checker construction\.
##### Task\-adaptive rubric\.
To construct $\mathcal{R}_x$, we first assign the prompt a coarse task label with a lightweight classifier\. This label does not score responses directly; it provides a task prior for rubric generation\. The classifier is intentionally conservative and falls back to a general category when the prompt intent is uncertain\. Conditioned on the task label, we combine a shared rubric\-generation template with a task\-specific module to produce weighted prompt\-specific criteria\.
The rubric is designed as a reward basis for rollout supervision rather than a generic evaluation checklist\. Each criterion will later be judged independently, assigned a ternary label, and aggregated by weight\. We therefore favor criteria that are atomic, low\-overlap, independently judgeable, and discriminative, especially those that remain informative among broadly acceptable responses rather than saturating under superficial compliance\.
##### Hard\-constraint checkers\.
In parallel, we build executable checkers from the prompt itself, rather than from the generated rubric\. This prompt\-only branch targets requirements that can be deterministically verified, such as length limits, counts, required or forbidden strings, formatting rules, and surface\-level output structure\. It first extracts explicit hard constraints into a restricted structured form, and then compiles them into executable code\. Semantic or holistic requirements are not converted into checkers and are instead left to model\-based scoring\. Since rubrics and checkers are constructed once for a fixed prompt set, they can be reused across rollouts and training runs\.
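The extract\-then\-compile step can be sketched as follows\. This is an illustration under stated assumptions, not the paper's implementation: the dictionary schema \(`type`, `value`\), the four constraint kinds, and the function `compile_constraint` are all hypothetical, and in the paper the extraction and code generation are performed by a language model rather than a fixed lookup\.

```python
import re
from typing import Callable, Dict, List

# A checker maps (prompt, response) to a deterministic pass/fail outcome.
Checker = Callable[[str, str], bool]


def compile_constraint(c: Dict) -> Checker:
    """Turn one structured hard constraint into an executable checker.

    The schema {"type": ..., "value": ...} is a hypothetical restricted form.
    """
    kind = c["type"]
    if kind == "max_words":
        return lambda x, y: len(y.split()) <= c["value"]
    if kind == "required_string":
        return lambda x, y: c["value"] in y
    if kind == "forbidden_string":
        return lambda x, y: c["value"] not in y
    if kind == "bullet_count":
        # Counts lines that start with "-" or "*" followed by a space.
        return lambda x, y: len(re.findall(r"^[-*] ", y, flags=re.M)) == c["value"]
    # Semantic or holistic requirements are deliberately not compiled.
    raise ValueError(f"unsupported constraint type: {kind}")


# Hypothetical constraints extracted from a prompt such as
# "List exactly two benefits as bullets, under 20 words, without apologizing."
constraints: List[Dict] = [
    {"type": "max_words", "value": 20},
    {"type": "bullet_count", "value": 2},
    {"type": "forbidden_string", "value": "sorry"},
]
checkers = [compile_constraint(c) for c in constraints]

response = "- fast setup\n- low cost"
results = [chk("", response) for chk in checkers]
print(results)  # prints [True, True, True]
```

Restricting the structured form to a small set of surface\-level kinds keeps every checker deterministic; anything outside that set is left to model\-based scoring\.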
### 3\.3 Online Hybrid Reward Computation
Given a prompt\-response pair $(x,y)$, we compute up to three reward components: a rubric score $s_r(x,y)$, a global score $s_g(x,y)$, and a code\-based score $s_c(x,y)$\. The key idea is to decompose open\-ended quality into heterogeneous supervision signals instead of relying on a single holistic scalar judge\. The three components target local requirement satisfaction, overall response quality, and deterministic hard\-constraint verification, and are normalized to a shared $[0,1]$ scale before aggregation\.
##### Rubric\-based score\.
Let $\mathcal{R}_x = \{(r_i, w_i)\}_{i=1}^{m}$ denote the prompt\-specific rubric, where $r_i$ is a criterion and $w_i$ is its weight\. A general\-purpose language model judges each criterion independently as `yes`, `part`, or `no`, which we map to $1$, $0.5$, and $0$, respectively\. Let $v_i$ be the mapped value for $r_i$\. The rubric score is

$$s_r(x,y) = \frac{\sum_{i=1}^{m} w_i v_i}{\sum_{i=1}^{m} w_i}. \tag{2}$$

This score provides a normalized measure of local requirement satisfaction\.
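The weighted aggregation in Eq\. \(2\) can be sketched in a few lines\. The function name `rubric_score` and the example weights are hypothetical; the judge labels are assumed to be already available as strings\.

```python
from typing import List

# Ternary judge labels mapped to numeric values, as described in the text.
LABEL_VALUE = {"yes": 1.0, "part": 0.5, "no": 0.0}


def rubric_score(weights: List[float], labels: List[str]) -> float:
    """Weighted mean of mapped labels, i.e. sum(w_i * v_i) / sum(w_i)."""
    if not weights or len(weights) != len(labels):
        raise ValueError("weights and labels must be non-empty and aligned")
    total = sum(weights)
    if total <= 0:
        raise ValueError("rubric weights must sum to a positive value")
    return sum(w * LABEL_VALUE[l] for w, l in zip(weights, labels)) / total


# Hypothetical rubric with three criteria weighted 3, 2, and 1.
print(round(rubric_score([3.0, 2.0, 1.0], ["yes", "part", "no"]), 4))
```

For this example the numerator is 3·1 \+ 2·0\.5 \+ 1·0 = 4 and the denominator is 6, so the score is about 0\.667\. Dividing by the total weight keeps the score in $[0,1]$ regardless of how many criteria a prompt has\.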
##### Global score\.
The global score captures response\-level quality beyond atomic rubric items\. A general\-purpose language model evaluates the prompt\-response pair $(x,y)$ and directly returns a holistic raw score $g(x,y) \in [0,10]$\. We normalize it as $s_g(x,y) = \mathrm{clip}(g(x,y)/10, 0, 1)$ before aggregation, so that it is on the same bounded $[0,1]$ scale as the rubric\-based and code\-based scores and does not dominate the reward due to scale mismatch\. Unlike $s_r$, this score does not depend on prompt\-specific offline artifacts and instead provides a dense response\-level quality signal\.
##### Code\-based score\.
For explicit hard constraints, let $\mathcal{C}_x = \{c_j\}_{j=1}^{n}$ be the executable checkers constructed offline for prompt $x$\. Each checker returns a binary outcome $b_j = c_j(x,y) \in \{0,1\}$\. When $n > 0$, this score is the checker pass rate:

$$s_c(x,y) = \frac{1}{n} \sum_{j=1}^{n} b_j. \tag{3}$$

This component is used only for requirements that admit reliable deterministic verification, such as exact length, counts, or structural constraints\.
##### Unified hybrid reward\.
We aggregate the available components into a unified reward\. For brevity, all scores below are evaluated on $(x,y)$:

$$R(x,y) = \begin{cases} \dfrac{s_r + s_c + \alpha s_g}{2 + \alpha}, & n > 0, \\[4pt] \dfrac{s_r + \alpha s_g}{1 + \alpha}, & n = 0, \end{cases} \tag{4}$$

where $\alpha \geq 0$ controls the contribution of the holistic signal\.
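A minimal sketch of the aggregation in Eq\. \(4\), together with the global\-score clipping and the checker pass rate it depends on\. The function names and the example inputs are hypothetical; `None` is used here as an assumed way to signal that a prompt has no checkers \($n = 0$\)\.

```python
from typing import List, Optional


def global_score(raw: float) -> float:
    """Normalize a raw holistic score in [0, 10] to [0, 1] with clipping."""
    return min(max(raw / 10.0, 0.0), 1.0)


def code_score(outcomes: List[bool]) -> Optional[float]:
    """Checker pass rate; None when the prompt has no checkers (n = 0)."""
    if not outcomes:
        return None
    return sum(outcomes) / len(outcomes)


def hybrid_reward(s_r: float, s_g: float, s_c: Optional[float],
                  alpha: float = 1.0) -> float:
    """Combine rubric, global, and (if available) code scores as in Eq. (4)."""
    if alpha < 0:
        raise ValueError("alpha must be non-negative")
    if s_c is None:  # n = 0: no deterministic constraints for this prompt
        return (s_r + alpha * s_g) / (1.0 + alpha)
    return (s_r + s_c + alpha * s_g) / (2.0 + alpha)


# Hypothetical scores for one response: rubric 0.8, raw global 7/10,
# and one of two checkers passing.
s_g = global_score(7.0)
with_checkers = hybrid_reward(0.8, s_g, code_score([True, False]), alpha=1.0)
without_checkers = hybrid_reward(0.8, s_g, code_score([]), alpha=1.0)
print(round(with_checkers, 4), round(without_checkers, 4))
```

With $\alpha = 1$ the first case gives \(0\.8 \+ 0\.5 \+ 0\.7\)/3, about 0\.667, and the second gives \(0\.8 \+ 0\.7\)/2 = 0\.75\. Because the denominator changes with the number of available components, the reward stays in $[0,1]$ in both cases\.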
| Model | RewardBench v2: Factuality | PreciseIF | Math | Safety | Focus | Ties | Overall | RM-Bench: Easy | Normal | Hard | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|
| *Proprietary LLM Judges* | | | | | | | | | | | |
| Gemini-2.5-pro\* | 75.5 | 61.9 | 89.8 | 88.1 | 80.5 | 81.1 | 79.5 | – | – | – | – |
| Gemini-2.5-flash\* | 67.4 | 57.5 | 85.2 | 90.9 | 84.1 | 80.9 | 77.7 | – | – | – | – |
| GPT-4.1\* | 82.9 | 39.7 | 65.2 | 87.3 | 73.4 | 85.4 | 72.3 | 85.7 | 77.0 | 69.5 | 77.4 |
| *Trained Reward Models* | | | | | | | | | | | |
| Skywork-RM-v2-8B\* | 84.6 | 66.2 | 77.6 | 96.7 | 98.4 | 81.2 | 84.1 | 97.0 | 95.0 | 86.5 | 92.8 |
| LMUnit-Qwen2.5-72B\* | 87.2 | 54.4 | 72.7 | 91.3 | 96.8 | 90.1 | 82.1 | – | – | – | – |
| Qwen3-Nemo-32B-Gen\* | – | – | – | – | – | – | – | 88.9 | 86.4 | 83.4 | 86.2 |
| *Our Instantiations* | | | | | | | | | | | |
| Qwen3-30B-A3B | 61.5 | 34.4 | 70.9 | 84.8 | 74.8 | 71.5 | 66.3 | 85.1 | 80.5 | 74.3 | 80.0 |
| w/ Hybrid Reward | 67.7 (+6.2) | 53.8 (+19.4) | 84.7 (+13.8) | 90.4 (+5.6) | 82.8 (+8.0) | 89.1 (+17.6) | 78.1 (+11.8) | 90.6 (+5.5) | 87.8 (+7.3) | 82.1 (+7.8) | 86.8 (+6.8) |
| Qwen3.5-35B-A3B | 76.8 | 61.8 | 84.4 | 90.5 | 77.7 | 88.6 | 80.0 | 89.6 | 87.9 | 85.1 | 87.5 |
| w/ Hybrid Reward | 79.5 (+2.7) | 71.0 (+9.2) | 88.1 (+3.7) | 92.2 (+1.7) | 82.7 (+5.0) | 97.3 (+8.7) | 85.1 (+5.1) | 94.6 (+5.0) | 91.7 (+3.8) | 87.1 (+2.0) | 91.1 (+3.6) |
Table 1: Results on RewardBench v2 and RM\-Bench\. Bold and underline denote the best and second\-best results in each column, respectively\. The two Overall columns report the average scores over the corresponding sub\-metrics in each benchmark\. Results marked with \* are taken from official benchmark leaderboards, and “–” indicates unavailable results\. Abbreviations: Skywork\-RM\-v2\-8B denotes Skywork\-Reward\-V2\-Llama\-3\.1\-8B; Qwen3\-Nemo\-32B\-Gen denotes Qwen3\-Nemotron\-32B\-GenRM\-Principle\.
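As a quick check of how the Overall columns are formed, two entries of Table 1 can be recomputed as unweighted means of their sub\-metrics: the Gemini\-2\.5\-pro row on RewardBench v2 and the GPT\-4\.1 row on RM\-Bench\. The equal weighting is inferred from the reported numbers and is an assumption insofar as the caption only says "average"\.

$$\frac{75.5 + 61.9 + 89.8 + 88.1 + 80.5 + 81.1}{6} = \frac{476.9}{6} \approx 79.5, \qquad \frac{85.7 + 77.0 + 69.5}{3} = \frac{232.2}{3} = 77.4.$$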
### 3\.4 Why Hybrid Reward?
Open\-ended responses exhibit heterogeneous failure modes that are difficult to capture with a single scalar reward\. Rubric\-based scoring decomposes prompt\-specific requirements into fine\-grained criteria and provides diagnostic supervision \(Jia et al\., [2026](https://arxiv.org/html/2605.29275#bib.bib19)\), but independently scored criteria may miss response\-level interactions and trade\-offs among strong candidates\. Global scoring instead evaluates the response as a whole and provides a dense signal for broad qualities such as relevance, helpfulness, and coherence \(Zheng et al\., [2023](https://arxiv.org/html/2605.29275#bib.bib3)\), but a single holistic score is less interpretable and can be unreliable for strict instruction\-following constraints \(Lou et al\., [2024](https://arxiv.org/html/2605.29275#bib.bib4); Pyatkin et al\., [2026](https://arxiv.org/html/2605.29275#bib.bib9)\)\.
Code\-based scoring complements these model\-based rewards for explicitly verifiable requirements, such as length limits, counts, forbidden strings and surface\-level structure\. For such constraints, executable checkers provide deterministic supervision\. Our hybrid reward therefore assigns local requirement satisfaction to rubric scoring, response\-level quality to global scoring, and hard\-constraint satisfaction to code\-based verification\.
## 4 Experiments
We evaluate the proposed framework in two settings: offline RM\-style scoring and online RL for open\-ended post\-training\. We report offline results in Section [4\.2](https://arxiv.org/html/2605.29275#S4.SS2), online RL results in Section [4\.3](https://arxiv.org/html/2605.29275#S4.SS3), and further analyses in Section [5](https://arxiv.org/html/2605.29275#S5)\.
| Model | IFEval Pr. (S) | IFBench Pr. (S) | Arena-Hard-v2.0 Score | Creative Writing v3 Score | WritingBench Score | Avg Score |
|---|---|---|---|---|---|---|
| DeepSeek-R1-Distill-Qwen-7B | 57.1 ± 0.4 | 13.4 ± 0.8 | 2.7 ± 0.3 | 37.8 ± 1.3 | 50.4 ± 0.5 | 32.3 |
| +RL w/ Hybrid Reward | 69.2 ± 0.3 (+12.1) | 21.3 ± 0.2 (+7.9) | 4.1 ± 0.4 (+1.4) | 44.9 ± 1.5 (+7.1) | 59.3 ± 0.5 (+8.9) | 39.8 (+7.5) |
| Qwen3-4B | 83.4 ± 0.2 | 29.4 ± 0.8 | 14.7 ± 0.7 | 59.4 ± 1.5 | 70.0 ± 0.5 | 51.4 |
| +RL w/ Hybrid Reward | 86.5 ± 0.2 (+3.1) | 35.9 ± 0.2 (+6.5) | 22.0 ± 0.9 (+7.3) | 79.8 ± 1.8 (+20.4) | 76.4 ± 0.4 (+6.4) | 60.1 (+8.7) |
| GLM-4.7-Flash | 83.5 ± 1.0 | 50.8 ± 2.4 | 28.3 ± 1.1 | 80.4 ± 1.2 | 76.3 ± 0.5 | 63.9 |
| +RL w/ Hybrid Reward | 89.8 ± 0.6 (+6.3) | 57.3 ± 0.6 (+6.5) | 47.8 ± 1.5 (+19.5) | 83.3 ± 1.3 (+2.9) | 81.4 ± 0.4 (+5.1) | 71.9 (+8.0) |
| Qwen3-30B-A3B | 85.4 ± 0.1 | 35.9 ± 1.0 | 30.6 ± 0.9 | 77.3 ± 1.3 | 74.4 ± 0.5 | 60.7 |
| +RL w/ Hybrid Reward | 87.5 ± 0.4 (+2.1) | 39.3 ± 0.7 (+3.4) | 38.5 ± 1.0 (+7.9) | 82.6 ± 1.4 (+5.3) | 79.2 ± 0.4 (+4.8) | 65.4 (+4.7) |
Table 2: Experimental results of RL training with our method on DeepSeek\-R1\-Distill\-Qwen\-7B, Qwen3\-4B, GLM\-4\.7\-Flash, and Qwen3\-30B\-A3B\. ± denotes sample standard deviation over three runs\. We report prompt\-level strict scores, Pr\. \(S\), for IFEval and IFBench\. Avg averages the five displayed mean scores\.

### 4\.1 Experimental Setup
##### Reward pipeline instantiation\.
All construction and scoring modules are instantiated with general\-purpose language models\. We use Qwen3\-4B \(Yang et al\., [2025](https://arxiv.org/html/2605.29275#bib.bib23)\) for task\-label extraction, the fixed OpenAI API snapshot `gpt-5-2025-08-07` for prompt\-specific rubric generation, Qwen3\-30B\-A3B for constraint extraction and checker generation, and Qwen3\.5\-35B\-A3B \(Qwen Team, [2026](https://arxiv.org/html/2605.29275#bib.bib35)\) for rubric judging and global scoring\.
##### Policy models and training data\.
For online RL, we evaluate multiple policy backbones, including DeepSeek\-R1\-Distill\-Qwen\-7B \(Guo et al\., [2025](https://arxiv.org/html/2605.29275#bib.bib2)\), Qwen3\-4B, GLM\-4\.7\-Flash \(GLM Team et al\., [2025](https://arxiv.org/html/2605.29275#bib.bib24)\), and Qwen3\-30B\-A3B\. Training uses 13K prompts: 5K from VERINSTRUCT \(Peng et al\., [2025a](https://arxiv.org/html/2605.29275#bib.bib25)\), 5K from DeepWriting\-20K \(Wang et al\., [2026b](https://arxiv.org/html/2605.29275#bib.bib26)\), and 3K synthesized decision\-support prompts\.
##### Evaluation benchmarks\.
We evaluate offline reward quality on RewardBench v2 \(Malik et al\., [2026](https://arxiv.org/html/2605.29275#bib.bib8)\) and RM\-Bench \(Liu et al\., [2025a](https://arxiv.org/html/2605.29275#bib.bib27)\)\. For online RL, we report results on IFEval \(Lou et al\., [2024](https://arxiv.org/html/2605.29275#bib.bib4)\), IFBench \(Pyatkin et al\., [2026](https://arxiv.org/html/2605.29275#bib.bib9)\), Arena\-Hard\-v2\.0 \(Li et al\., [2025b](https://arxiv.org/html/2605.29275#bib.bib10)\), Creative Writing v3 \(Paech, [2025](https://arxiv.org/html/2605.29275#bib.bib11)\), and WritingBench \(Wu et al\., [2026](https://arxiv.org/html/2605.29275#bib.bib12)\)\. Unless otherwise specified, we report the average over three runs\.
##### Training details\.
For offline evaluation, we set $\alpha = 1$ in Eq\. \(4\)\. For online RL, we linearly decay the holistic\-score weight during training and train each policy for approximately 800 steps\. We use GSPO for policy optimization and disable standard\-deviation normalization when computing groupwise advantages \(Liu et al\., [2025b](https://arxiv.org/html/2605.29275#bib.bib34)\)\.
For reproducibility, we report the main inference and training settings in Appendix [A](https://arxiv.org/html/2605.29275#A1) and provide the reward\-construction and reward\-computation prompt templates in Appendix [E](https://arxiv.org/html/2605.29275#A5)\.
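The two training choices above, a linearly decayed holistic\-score weight and groupwise advantages without standard\-deviation normalization, can be sketched as follows\. The start and end values of the weight \(1\.0 and 0\.0\) are hypothetical placeholders, since the section does not state them, and the mean\-centered advantage is one plausible reading of the setting; the GSPO objective itself is not reproduced here\.

```python
from typing import List


def alpha_at(step: int, total_steps: int,
             alpha_start: float = 1.0, alpha_end: float = 0.0) -> float:
    """Linearly interpolate the holistic-score weight over training.

    alpha_start and alpha_end are assumed values, not taken from the paper.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    frac = min(max(step / total_steps, 0.0), 1.0)
    return alpha_start + (alpha_end - alpha_start) * frac


def group_advantages(rewards: List[float]) -> List[float]:
    """Mean-centered advantages for one prompt's rollout group.

    The usual division by the group standard deviation is omitted.
    """
    if not rewards:
        raise ValueError("rewards must be non-empty")
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


# Halfway through an 800-step run, the assumed schedule gives alpha = 0.5.
print(alpha_at(400, 800))
# Hypothetical hybrid rewards for three rollouts of the same prompt.
print([round(a, 4) for a in group_advantages([0.2, 0.5, 0.8])])
```

Without the standard\-deviation division, the advantage magnitude tracks the actual reward spread within a group, so groups whose rollouts receive nearly identical rewards contribute small updates rather than being rescaled to unit variance\.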
### 4\.2 Offline RM\-style Scoring Results
We first evaluate the hybrid reward as an RM\-style scalar scorer on RewardBench v2 and RM\-Bench\. These benchmarks test response ranking and preference evaluation, allowing us to assess whether the reward provides a strong standalone quality signal\.
As shown in Table [1](https://arxiv.org/html/2605.29275#S3.T1), we compare against proprietary LLM judges, including Gemini\-2\.5 and GPT\-4\.1, and trained reward models, including Skywork\-Reward\-V2\-Llama\-3\.1\-8B \(Liu et al\., [2026a](https://arxiv.org/html/2605.29275#bib.bib28)\), LMUnit\-Qwen2\.5\-72B \(Saad\-Falcon et al\., [2026](https://arxiv.org/html/2605.29275#bib.bib29)\), and Qwen3\-Nemotron\-32B\-GenRM\-Principle \(Wang et al\., [2026c](https://arxiv.org/html/2605.29275#bib.bib30)\)\. Our hybrid reward improves both underlying LLM judge instantiations and, with Qwen3\.5\-35B\-A3B as the judge, achieves the top Overall score on RewardBench v2 among the compared systems \(85\.1, versus 84\.1 for Skywork\-Reward\-V2\-Llama\-3\.1\-8B\)\.
These results isolate the effect of the reward design: the evaluator backbone is unchanged, but combining rubric\-based scoring, global scoring, and executable constraint verification consistently improves the scalar reward signal\. The gains across both benchmark suites and both judge backbones suggest that hybridization is not tied to a single evaluator configuration\.
### 4\.3 Online RL Results
We further evaluate whether the proposed hybrid reward can be used as an online training signal\. Table [2](https://arxiv.org/html/2605.29275#S4.T2) compares each policy backbone before and after RL training on IFEval, IFBench, Arena\-Hard\-v2\.0, Creative Writing v3, and WritingBench\.
RL with the hybrid reward improves the average score of every evaluated backbone, by between 4\.7 and 8\.7 points\. The gains appear on both instruction\-following and open\-ended generation benchmarks, suggesting that the reward provides useful supervision for explicit constraint satisfaction as well as broader response quality\. The improvements also hold for stronger initial backbones, indicating that the reward is not limited to low\-performing policies\. We further compare with Rubicon\-Preview \(Huang et al\., [2025](https://arxiv.org/html/2605.29275#bib.bib49)\), a released rubric\-RL policy trained on the same Qwen3\-30B\-A3B backbone, in Appendix [B\.6](https://arxiv.org/html/2605.29275#A2.SS6)\.
These results show that the full hybrid reward is an effective optimization signal across policy backbones\. To isolate whether the gains come from the proposed reward decomposition rather than from applying RL with an arbitrary scalar reward, Section [5\.2](https://arxiv.org/html/2605.29275#S5.SS2) provides a controlled online ablation in which the policy, data, rollout configuration, optimization hyperparameters, and training budget are fixed while only the reward formulation is changed\.
Evaluator Backbone: Qwen3\.5\-35B\-A3B

| Group | Variant | Factuality | PreciseIF | Math | Safety | Focus | Ties | Overall |
|---|---|---|---|---|---|---|---|---|
| Global-based | Global only | 76.8 | 61.8 | 84.4 | 90.5 | 77.7 | 88.6 | 80.0 |
| | Global + Code | 76.6 | 66.4 | 84.4 | 90.5 | 77.6 | 88.6 | 80.7 |
| Rubric-based | Rubric only | 69.4 | 54.4 | 84.8 | 84.1 | 77.5 | 91.8 | 77.0 |
| | Rubric + Code | 69.1 | 59.6 | 84.8 | 84.1 | 77.5 | 91.8 | 77.8 |
| | Rubric + Global | 79.7 | 68.3 | 88.1 | 92.2 | 82.8 | 97.3 | 84.7 |
| Hybrid | Rubric + Global + Code | 79.5 | 71.0 | 88.1 | 92.2 | 82.7 | 97.3 | 85.1 |
Table 3: Component ablation of the proposed hybrid reward on RewardBench v2 with Qwen3\.5\-35B\-A3B as the evaluator backbone\. Bold and underline denote the best and second\-best results in each column, respectively\.
## 5 Analysis
We analyze the hybrid reward through a sequence of ablations and diagnostics\. Section [5\.1](https://arxiv.org/html/2605.29275#S5.SS1) first studies component complementarity in offline reward evaluation on RewardBench v2\. Section [5\.2](https://arxiv.org/html/2605.29275#S5.SS2) then tests whether the same complementarity transfers to online RL\. We further examine robustness to open\-weight rubric extraction and the effect of open\-ended RL on reasoning benchmarks\.
### 5\.1Component Ablation on RewardBench v2
To isolate the effect of reward composition, we fix Qwen3\.5\-35B\-A3B as the evaluator backbone and vary only the reward components, as shown in Table [3](https://arxiv.org/html/2605.29275#S4.T3)\.
The full hybrid reward achieves the best overall score\. The largest gain comes from combining rubric\-based and global scoring: the overall score increases from 77\.0 with rubric\-only scoring and 80\.0 with global\-only scoring to 84\.7 with both components\. This supports our motivation that the two model\-based signals are complementary: global scoring captures holistic response quality, while rubric\-based scoring evaluates prompt\-specific requirements\. The same pattern also holds with Qwen3\-30B\-A3B as the evaluator in Appendix [B\.1](https://arxiv.org/html/2605.29275#A2.SS1), suggesting that the gain is not tied to a single evaluator backbone\.
Code\-based verification provides a targeted improvement\. Adding code improves the overall score from 80\.0 to 80\.7 for the global reward, from 77\.0 to 77\.8 for the rubric reward, and from 84\.7 to 85\.1 for the rubric\-global reward\. The gain is most consistent on PreciseIF, where code improves the scores from 61\.8 to 66\.4, 54\.4 to 59\.6, and 68\.3 to 71\.0 under the three corresponding settings\. This indicates that executable checkers mainly help with explicitly checkable constraints; we further analyze this reliability benefit in Appendix [B\.2](https://arxiv.org/html/2605.29275#A2.SS2)\.
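The effect of adding code-based verification can be recomputed directly from Table 3. The sketch below is only an illustration: the scores are copied from the table, while the dictionary layout and the helper name `code_delta` are our own choices and not part of the paper's pipeline.

```python
# Scores copied from Table 3 (RewardBench v2, Qwen3.5-35B-A3B evaluator).
CATEGORIES = ["Factuality", "PreciseIF", "Math", "Safety", "Focus", "Ties", "Overall"]

TABLE3 = {
    "Global only":            [76.8, 61.8, 84.4, 90.5, 77.7, 88.6, 80.0],
    "Global + Code":          [76.6, 66.4, 84.4, 90.5, 77.6, 88.6, 80.7],
    "Rubric only":            [69.4, 54.4, 84.8, 84.1, 77.5, 91.8, 77.0],
    "Rubric + Code":          [69.1, 59.6, 84.8, 84.1, 77.5, 91.8, 77.8],
    "Rubric + Global":        [79.7, 68.3, 88.1, 92.2, 82.8, 97.3, 84.7],
    "Rubric + Global + Code": [79.5, 71.0, 88.1, 92.2, 82.7, 97.3, 85.1],
}

# Each pair is (reward without code, same reward with code added).
PAIRS = [
    ("Global only", "Global + Code"),
    ("Rubric only", "Rubric + Code"),
    ("Rubric + Global", "Rubric + Global + Code"),
]


def code_delta(base: str, with_code: str) -> dict:
    """Per-category score change when code-based verification is added."""
    return {
        cat: round(after - before, 1)
        for cat, before, after in zip(CATEGORIES, TABLE3[base], TABLE3[with_code])
    }


for base, with_code in PAIRS:
    print(base, "->", with_code, code_delta(base, with_code))
```

Under these numbers, PreciseIF changes by +4.6, +5.2, and +2.7 points, while every other category moves by at most 0.3 points in either direction, which matches the reading that the checkers act mainly on explicitly checkable constraints.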
### 5\.2Controlled Online Reward Ablation
We further test whether the online RL gains come from the proposed reward decomposition rather than from applying RL with an arbitrary scalar reward\. We conduct a controlled ablation using DeepSeek\-R1\-Distill\-Qwen\-7B as the policy model and Qwen3\.5\-35B\-A3B as the reward evaluator\. All variants use the same training prompts, rollout configuration, optimization hyperparameters, and training budget; only the reward formulation is changed\. We compare global\-only, rubric\-only, rubric\+global, and the full hybrid reward\. Since executable checkers only cover explicitly verifiable surface constraints, we isolate their effect by comparing rubric\+global with the full hybrid reward instead of using a checker\-only reward\.

| Variant | IFEval | IFBench | WritingBench | Avg\. |
| --- | --- | --- | --- | --- |
| Initial | 57\.1 ± 0\.4 | 13\.4 ± 0\.8 | 50\.4 ± 0\.5 | 40\.3 |
| Skywork RM | 56\.1 ± 0\.5 \(\-1\.0\) | 17\.9 ± 1\.0 \(\+4\.5\) | 58\.7 ± 0\.3 \(\+8\.3\) | 44\.2 \(\+3\.9\) |
| G only | 64\.6 ± 0\.9 \(\+7\.5\) | 17\.6 ± 0\.3 \(\+4\.2\) | 53\.2 ± 0\.6 \(\+2\.8\) | 45\.1 \(\+4\.8\) |
| R only | 64\.7 ± 1\.0 \(\+7\.6\) | 18\.9 ± 0\.2 \(\+5\.5\) | 57\.9 ± 0\.5 \(\+7\.5\) | 47\.2 \(\+6\.9\) |
| R\+G | 65\.7 ± 0\.3 \(\+8\.6\) | 19\.3 ± 0\.7 \(\+5\.9\) | 58\.4 ± 0\.4 \(\+8\.0\) | 47\.8 \(\+7\.5\) |
| R\+G\+C | 69\.2 ± 0\.3 \(\+12\.1\) | 21\.3 ± 0\.2 \(\+7\.9\) | 59\.3 ± 0\.5 \(\+8\.9\) | 49\.9 \(\+9\.6\) |

Table 4: Controlled online RL ablation with fixed training prompts, rollout settings, optimization hyperparameters, and budget\. R/G/C denote rubric, global, and code\-based rewards; Skywork RM uses off\-the\-shelf Skywork\-Reward\-v2\-Llama\-3\.1\-8B\.

As shown in Table [4](https://arxiv.org/html/2605.29275#S5.T4), global\-only and rubric\-only rewards both improve the initial policy, but combining them yields larger average gains\. Adding code\-based verification further improves the average score from 47\.8 to 49\.9, showing that executable verification provides additional value beyond model\-based scoring\. These results suggest that the complementarity observed in offline reward evaluation also transfers to online RL, and that the gains are not simply due to applying RL with a single reward signal\. Appendix [B\.7](https://arxiv.org/html/2605.29275#A2.SS7) provides the corresponding training curves\.
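The Avg. column of Table 4 is consistent with an unweighted mean of the three benchmark scores, and the parenthesized gains with differences from the initial policy. The following sketch reproduces both columns from the per-benchmark means; the helper names `average` and `average_gain` are illustrative, and the assumption that the table uses an unweighted mean is ours, inferred from the numbers rather than stated in the text.

```python
# Mean scores copied from Table 4: (IFEval, IFBench, WritingBench).
TABLE4 = {
    "Initial":    (57.1, 13.4, 50.4),
    "Skywork RM": (56.1, 17.9, 58.7),
    "G only":     (64.6, 17.6, 53.2),
    "R only":     (64.7, 18.9, 57.9),
    "R+G":        (65.7, 19.3, 58.4),
    "R+G+C":      (69.2, 21.3, 59.3),
}


def average(variant: str) -> float:
    """Unweighted mean of the three benchmark scores, rounded to one decimal."""
    scores = TABLE4[variant]
    return round(sum(scores) / len(scores), 1)


def average_gain(variant: str, baseline: str = "Initial") -> float:
    """Difference between the rounded averages of a variant and the baseline."""
    return round(average(variant) - average(baseline), 1)


for name in TABLE4:
    print(name.ljust(10), average(name), average_gain(name))
```

With these inputs the recomputed averages (40.3, 44.2, 45.1, 47.2, 47.8, 49.9) and gains (+3.9, +4.8, +6.9, +7.5, +9.6) agree with the reported column.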

| Reward | Extractor | Factuality | PreciseIF | Math | Safety | Focus | Ties | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Rubric\-only | GPT\-5 | 69\.4 | 54\.4 | 84\.8 | 84\.1 | 77\.5 | 91\.8 | 77\.0 |
| Rubric\-only | Qwen3\.5\-397B\-A17B | 67\.2 \(\-2\.2\) | 51\.3 \(\-3\.1\) | 86\.2 \(\+1\.4\) | 87\.5 \(\+3\.4\) | 76\.7 \(\-0\.8\) | 80\.9 \(\-10\.9\) | 75\.0 \(\-2\.0\) |
| Hybrid | GPT\-5 | 79\.5 | 71\.0 | 88\.1 | 92\.2 | 82\.7 | 97\.3 | 85\.1 |
| Hybrid | Qwen3\.5\-397B\-A17B | 78\.3 \(\-1\.2\) | 68\.9 \(\-2\.1\) | 89\.7 \(\+1\.6\) | 93\.8 \(\+1\.6\) | 82\.6 \(\-0\.1\) | 93\.9 \(\-3\.4\) | 84\.5 \(\-0\.6\) |

Table 5:Effect of replacing GPT\-5 with an open\-weight rubric extractor\. Qwen3\.5\-397B denotes Qwen3\.5\-397B\-A17B\. Only the rubric extractor is changed; the rubric judge, global scorer, and code\-based verifier are kept fixed\. Values in parentheses denote changes relative to the GPT\-5 extractor under the same reward setting\.
### 5\.3Open\-Weight Rubric Extraction: Robustness and Remaining Gap
Our main experiments use GPT\-5 to generate prompt\-specific rubrics, raising the question of whether the framework depends on a proprietary rubric extractor\. To examine this, we replace only the rubric extractor with Qwen3\.5\-397B\-A17B, using full\-precision inference settings aligned with its official configuration, while keeping the downstream reward\-computation pipeline unchanged\.
Table [5](https://arxiv.org/html/2605.29275#S5.T5) shows that open\-weight rubric extraction is feasible, although extractor quality still matters\. In the rubric\-only setting, replacing GPT\-5 with Qwen3\.5\-397B\-A17B reduces the Overall score from 77\.0 to 75\.0\. The drop is mainly concentrated in constraint\-sensitive categories such as PreciseIF \(\-3\.1\) and Ties \(\-10\.9\), while Math and Safety improve \(\+1\.4 and \+3\.4\)\.
Importantly, the gap becomes much smaller under the full hybrid reward\. The Overall score only decreases from 85\.1 to 84\.5, suggesting that global scoring and code\-based verification can compensate for imperfections in rubric extraction\. With the open\-weight extractor, adding these complementary signals improves the Overall score from 75\.0 to 84\.5\. These results indicate that the framework does not critically rely on GPT\-5 as the rubric extractor, while also highlighting open\-weight rubric extraction as a remaining source of variability\.
### 5\.4Open\-Ended RL Largely Preserves Reasoning Benchmark Performance
A natural concern is that post\-training on open\-ended tasks may improve instruction following and writing quality while causing regressions on standard reasoning benchmarks \(Luo et al\., [2025](https://arxiv.org/html/2605.29275#bib.bib33)\)\. This concern is particularly relevant in our setting because the RL training mixture is mainly drawn from instruction\-following, writing, and scenario\-based decision\-making prompts, without explicitly adding reasoning\-focused math or science data\.

| Model | GSM8K | GPQA Diamond | AIME 2024 |
| --- | --- | --- | --- |
| Qwen3\-30B\-A3B | 96\.4 | 62\.8 | 80\.7 |
| \+RL w/ Hybrid Reward | 96\.6 \(\+0\.2\) | 62\.1 \(\-0\.7\) | 81\.3 \(\+0\.6\) |
| GLM\-4\.7\-Flash | 95\.5 | 51\.6 | 90\.1 |
| \+RL w/ Hybrid Reward | 96\.1 \(\+0\.6\) | 58\.7 \(\+7\.1\) | 92\.7 \(\+2\.6\) |

Table 6: Reasoning benchmark results \(%\) before and after RL with our hybrid reward\. We evaluate Qwen3\-30B\-A3B and GLM\-4\.7\-Flash on GSM8K, GPQA Diamond, and AIME 2024\.

Table [6](https://arxiv.org/html/2605.29275#S5.T6) shows that we do not observe substantial degradation on the evaluated reasoning benchmarks\. For Qwen3\-30B\-A3B, performance remains broadly stable after RL, with small gains on GSM8K \(Cobbe et al\., [2021](https://arxiv.org/html/2605.29275#bib.bib31)\) and AIME 2024, and a limited drop on GPQA Diamond \(Rein et al\., [2024](https://arxiv.org/html/2605.29275#bib.bib32)\)\. For GLM\-4\.7\-Flash, RL with our hybrid reward improves all three benchmarks, with larger gains on GPQA Diamond and AIME 2024\.
Together with the results in Section [4\.3](https://arxiv.org/html/2605.29275#S4.SS3), these findings suggest that open\-ended RL with our hybrid reward improves target open\-ended capabilities while largely preserving standard reasoning benchmark performance in our setting\.
### 5\.5Additional Experiments and Analyses
We provide additional experiments and analyses in the appendix\. Appendix [B\.3](https://arxiv.org/html/2605.29275#A2.SS3) validates the removal of standard\-deviation normalization in groupwise advantage computation\. Appendix [B\.5](https://arxiv.org/html/2605.29275#A2.SS5) analyzes the training\-efficiency overhead of model\-based reward computation and our asynchronous optimization strategy\. Appendix [D](https://arxiv.org/html/2605.29275#A4) reports a blinded pairwise preference check with order\-swapped LLM judges and non\-expert human annotators\.
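To make the normalization choice concrete, the sketch below contrasts a mean-centered groupwise advantage with the standard-deviation-normalized form used in GRPO-style training (Shao et al., 2024). It is a minimal illustration under our own assumptions: the function names, the `eps` constant, and the example rewards are hypothetical, and the exact advantage used in the paper is defined in its method section and Appendix B.3, not here.

```python
from statistics import mean, pstdev


def advantages_mean_only(rewards: list) -> list:
    """Group advantage without std normalization: reward minus group mean."""
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


def advantages_std_normalized(rewards: list, eps: float = 1e-6) -> list:
    """GRPO-style advantage: mean-centered, then divided by the group std."""
    baseline = mean(rewards)
    scale = pstdev(rewards) + eps  # eps avoids division by zero
    return [(r - baseline) / scale for r in rewards]


# Hypothetical rollout groups for one prompt, with rewards in [0, 1].
spread_group = [0.2, 0.4, 0.6, 0.8]      # rollouts clearly differ in quality
flat_group = [0.50, 0.51, 0.49, 0.50]    # rollouts are nearly indistinguishable

for name, group in [("spread", spread_group), ("flat", flat_group)]:
    print(name, advantages_mean_only(group), advantages_std_normalized(group))
```

In the flat group, the mean-only advantages stay within about 0.01 of zero, whereas the normalized advantages are rescaled to roughly unit magnitude, so small reward differences between near-identical rollouts are amplified. This is one commonly discussed effect of the normalization; whether it is the effect measured in Appendix B.3 is not stated in this section.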
## 6Conclusion
We introduced a prompt\-level reward specification framework for open\-ended post\-training\. Given only prompts, it constructs reusable task\-adaptive rubrics and executable hard\-constraint checkers, and combines them with an independent global score to form a normalized hybrid reward\. This design makes prompt\-specific requirements explicit while integrating fine\-grained semantic judgment, holistic quality assessment, and deterministic constraint verification\. Experiments show that the reward supports both offline response ranking and online RL, with ablations confirming the complementarity of the three reward signals\.
## Limitations
##### Dependence on artifact and evaluator quality\.
Our framework removes the need for human preference annotations, reference answers, and separately trained reward models, but it still depends on strong general\-purpose LLMs for constructing and evaluating reward artifacts\. In our main pipeline, GPT\-5 is used for prompt\-specific rubric generation, while the remaining artifact\-construction and reward\-computation components are instantiated with open\-weight Qwen models\. Although Table [5](https://arxiv.org/html/2605.29275#S5.T5) shows that replacing GPT\-5 with a full\-precision open\-weight extractor preserves most of the offline hybrid reward performance, artifact quality remains an important source of variability\. In particular, the rubric\-only setting still exhibits larger drops on constraint\-sensitive categories, suggesting that rubric extraction errors are not fully eliminated by using a strong open\-weight model\. Moreover, this open\-weight replacement experiment is limited to offline reward evaluation\. We did not repeat the full online RL pipeline with rubrics regenerated by Qwen3\.5\-397B\-A17B, because deploying the full\-weight extractor was substantially more expensive and time\-consuming in our setting\. Therefore, our results support the feasibility of open\-weight offline reward computation, but they do not fully establish the practicality of fully open\-weight artifact construction for online RL\. Applying the framework to new prompt distributions may still require strong open\-weight models, extractor\-specific prompt tuning, and additional validation\. To mitigate reproducibility concerns, we plan to release the publicly shareable artifacts used in our experiments, including filtered training prompts where permitted, generated rubrics, extracted hard constraints, executable checkers, evaluator prompts, parsing scripts, validation logs, and RL configurations, subject to internal approval and applicable licensing constraints\.
##### Rubric construction and reward aggregation\.
This work focuses on hybrid reward composition rather than on optimizing rubric generation itself\. We use a fixed template\-and\-plug\-in pipeline, but do not systematically study alternative extraction strategies, criterion granularities, weighting rules, or refinement procedures\. In addition, our reward combines normalized rubric\-based, global, and code\-based scores with fixed aggregation rules, and does not explicitly resolve severe disagreement among reward sources\. We also use a fixed linear decay schedule for the global\-score weight during online RL\. This schedule is motivated by the intuition that holistic scoring is more useful early in training, while prompt\-specific rubric and checker signals become more important as responses improve, but we do not systematically ablate alternative schedules such as constant, stepwise, or learned weighting\.
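As an illustration of what a fixed linear decay of the global-score weight could look like, consider the sketch below. Everything specific in it is an assumption made for the example: the start and end weights (0.5 and 0.1), the convex combination in `hybrid_reward`, and the notion of a single pre-combined `artifact_score` for the rubric and checker signals are hypothetical and are not the paper's actual aggregation rule.

```python
def global_weight(step: int, total_steps: int,
                  w_start: float = 0.5, w_end: float = 0.1) -> float:
    """Linearly interpolate the global-score weight from w_start to w_end.

    w_start and w_end are hypothetical values chosen for illustration.
    """
    if total_steps <= 1:
        return w_end
    # Fraction of training completed, clamped to [0, 1].
    frac = min(max(step / (total_steps - 1), 0.0), 1.0)
    return w_start + (w_end - w_start) * frac


def hybrid_reward(artifact_score: float, global_score: float,
                  w_global: float) -> float:
    """Hypothetical convex combination of two scores that both lie in [0, 1].

    artifact_score stands in for the combined rubric and checker signal.
    """
    return (1.0 - w_global) * artifact_score + w_global * global_score


total = 101
for step in (0, 50, 100):
    w = global_weight(step, total)
    print(step, round(w, 3), round(hybrid_reward(0.8, 0.4, w), 3))
```

Under this sketch the global score contributes half of the reward at the first step and a tenth at the last, so the prompt-specific signals dominate late in training. Constant or stepwise schedules would only require replacing `global_weight`.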
##### Coverage and safety of executable verification\.
The code\-based component is limited to explicit constraints that can be checked with deterministic surface\-level rules, such as length, keyword occurrence, formatting, counting, and simple structural requirements\. It cannot verify semantic correctness, factuality, reasoning validity, or deeper pragmatic requirements\. Although our checker pipeline restricts extraction to surface\-checkable constraint types, validates generated code, and uses timeout and failure\-handling logic, generated executable checkers may still require sandboxing and auditing before deployment\.
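A minimal sketch of a surface-level checker with timeout and failure handling is given below. The checker source, the word limit, the keyword, and the `run_checker` helper are all hypothetical and written for this example; they do not reproduce the paper's checker pipeline. Running the checker in a separate interpreter with a timeout limits hangs and crashes, but it is not a security sandbox.

```python
import subprocess
import sys

# Hypothetical generated checker: reads the response from stdin and prints "1"
# if it has at most 50 words and mentions "refund" at least once, else "0".
CHECKER_SOURCE = """
import sys
text = sys.stdin.read()
ok = len(text.split()) <= 50 and text.lower().count("refund") >= 1
print("1" if ok else "0")
"""


def run_checker(source: str, response: str, timeout_s: float = 2.0) -> float:
    """Run checker code in a separate Python process; any failure scores 0.0."""
    try:
        proc = subprocess.run(
            [sys.executable, "-I", "-c", source],  # -I: isolated mode
            input=response,
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return 0.0  # the checker hung or ran too long
    if proc.returncode != 0:
        return 0.0  # the checker crashed
    return 1.0 if proc.stdout.strip() == "1" else 0.0


print(run_checker(CHECKER_SOURCE, "We will issue a refund today."))
print(run_checker(CHECKER_SOURCE, "Thanks for reaching out."))
```

Treating timeouts and crashes as a failed constraint is one possible policy; an alternative would be to drop the checker from the reward for that prompt. A real deployment would additionally restrict file system and network access.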
##### Scope of human preference evaluation\.
We include a blinded pairwise preference check with multiple LLM judges and three non\-expert human annotators to reduce reliance on a single benchmark\-level evaluator\. However, this study remains small\-scale and is not intended as a statistically powered human evaluation\. The human annotators are ordinary users rather than trained expert annotators, and their judgments are used only as an auxiliary sanity check\. A larger expert\-annotated study would be needed to draw stronger conclusions about fine\-grained human preference alignment\.
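One way to implement an order-swapped pairwise check is sketched below. It assumes a hypothetical judge interface that receives a prompt and two responses in a given order and returns `"first"`, `"second"`, or `"tie"`; the rule that inconsistent verdicts count as ties is our assumption, and the actual protocol is described in Appendix D.

```python
from typing import Callable

# Hypothetical judge signature: (prompt, first_response, second_response)
# returns "first", "second", or "tie".
Judge = Callable[[str, str, str], str]


def order_swapped_preference(judge: Judge, prompt: str,
                             resp_a: str, resp_b: str) -> str:
    """Query the judge in both presentation orders; keep only consistent wins."""
    forward = judge(prompt, resp_a, resp_b)    # A is shown first
    backward = judge(prompt, resp_b, resp_a)   # B is shown first
    if forward == "first" and backward == "second":
        return "A"
    if forward == "second" and backward == "first":
        return "B"
    return "tie"  # ties and position-dependent verdicts are not counted as wins


def longer_is_better(prompt: str, first: str, second: str) -> str:
    """Toy content-based judge, used only to exercise the function."""
    if len(first) == len(second):
        return "tie"
    return "first" if len(first) > len(second) else "second"


def always_first(prompt: str, first: str, second: str) -> str:
    """Toy judge with pure position bias."""
    return "first"


print(order_swapped_preference(longer_is_better, "p", "a longer answer", "short"))
print(order_swapped_preference(always_first, "p", "a longer answer", "short"))
```

The position-biased toy judge never produces a win under this rule, which is the property the order swap is meant to provide.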
## Ethical Considerations
Our work uses existing datasets and synthetic prompts for research on open\-ended post\-training\. For datasets such as VERINSTRUCT and DeepWriting\-20K, we use only the prompt text and do not use reference responses, solutions, reasoning traces, verifier outputs, or other supervision signals from the original datasets\. Any supplementary artifact package associated with this work contains only filtered prompt texts and derived reward artifacts; it does not include reference responses, solutions, reasoning traces, verifier outputs, or other supervision signals from the original datasets\. We do not intentionally introduce private user data\. The blinded preference check in Appendix [D](https://arxiv.org/html/2605.29275#A4) includes three non\-expert human annotators and is used only as a diagnostic comparison over model outputs on benchmark prompts\. These human judgments are not used in reward construction, reward computation, or RL training\.
We use existing datasets, benchmarks, and open\-weight models only for research purposes and cite their original creators\. We follow the publicly available licenses or terms of use associated with these artifacts where specified\. Any publicly released reward artifacts associated with this work are intended for research use\.
The proposed framework can improve reward construction for instruction following and open\-ended generation, but it may also amplify undesirable behavior if unsafe prompts, biased rubrics, or inappropriate reward criteria are used\. The code\-based component further introduces execution\-related risks, so generated checkers should be run with sandboxing, restricted permissions, timeout controls, and auditing before deployment\. LLM\-based evaluators may also inherit biases, calibration errors, or prompt\-injection vulnerabilities from their underlying models; therefore, reward artifacts and trained models should be audited carefully, especially in high\-impact domains\.
We used AI assistants, including ChatGPT and Claude, to help polish writing and improve clarity\. All technical ideas, experiments, analyses, and final claims were checked and are the responsibility of the authors\.
## References
- R\. K\. Arora, J\. Wei, R\. S\. Hicks, P\. Bowman, J\. Quiñonero\-Candela, F\. Tsimpourlas, M\. Sharman, M\. Shah, A\. Vallone, A\. Beutel, J\. Heidecke, and K\. Singhal \(2025\)HealthBench: evaluating large language models towards improved human health\.External Links:2505\.08775,[Link](https://arxiv.org/abs/2505.08775)Cited by:[§1](https://arxiv.org/html/2605.29275#S1.p2.1),[§2](https://arxiv.org/html/2605.29275#S2.SS0.SSS0.Px2.p1.1)\.
- P\. F\. Christiano, J\. Leike, T\. Brown, M\. Martic, S\. Legg, and D\. Amodei \(2017\)Deep reinforcement learning from human preferences\.Advances in neural information processing systems30\.Cited by:[§1](https://arxiv.org/html/2605.29275#S1.p2.1)\.
- K\. Cobbe, V\. Kosaraju, M\. Bavarian, M\. Chen, H\. Jun, L\. Kaiser, M\. Plappert, J\. Tworek, J\. Hilton, R\. Nakano, C\. Hesse, and J\. Schulman \(2021\)Training verifiers to solve math word problems\.External Links:2110\.14168,[Link](https://arxiv.org/abs/2110.14168)Cited by:[§5\.4](https://arxiv.org/html/2605.29275#S5.SS4.p2.1)\.
- DeepSeek\-AI \(2026\)DeepSeek\-v4: towards highly efficient million\-token context intelligence\.Cited by:[§2](https://arxiv.org/html/2605.29275#S2.SS0.SSS0.Px3.p1.1)\.
- GLM Team, A\. Zeng, X\. Lv, Q\. Zheng, Z\. Hou, B\. Chen, C\. Xie, C\. Wang, D\. Yin, H\. Zeng, J\. Zhang, K\. Wang, L\. Zhong, M\. Liu, R\. Lu, S\. Cao, X\. Zhang, X\. Huang, Y\. Wei, Y\. Cheng, Y\. An, Y\. Niu, Y\. Wen, Y\. Bai, Z\. Du, Z\. Wang, Z\. Zhu, B\. Zhang, B\. Wen, B\. Wu, B\. Xu, C\. Huang, C\. Zhao, C\. Cai, C\. Yu, C\. Li, C\. Ge, C\. Huang, C\. Zhang, C\. Xu, C\. Zhu, C\. Li, C\. Yin, D\. Lin, D\. Yang, D\. Jiang, D\. Ai, E\. Zhu, F\. Wang, G\. Pan, G\. Wang, H\. Sun, H\. Li, H\. Li, H\. Hu, H\. Zhang, H\. Peng, H\. Tai, H\. Zhang, H\. Wang, H\. Yang, H\. Liu, H\. Zhao, H\. Liu, H\. Yan, H\. Liu, H\. Chen, J\. Li, J\. Zhao, J\. Ren, J\. Jiao, J\. Zhao, J\. Yan, J\. Wang, J\. Gui, J\. Zhao, J\. Liu, J\. Li, J\. Li, J\. Lu, J\. Wang, J\. Yuan, J\. Li, J\. Du, J\. Du, J\. Liu, J\. Zhi, J\. Gao, K\. Wang, L\. Yang, L\. Xu, L\. Fan, L\. Wu, L\. Ding, L\. Wang, M\. Zhang, M\. Li, M\. Xu, M\. Zhao, M\. Zhai, P\. Du, Q\. Dong, S\. Lei, S\. Tu, S\. Yang, S\. Lu, S\. Li, S\. Li, Shuang\-Li, S\. Yang, S\. Yi, T\. Yu, W\. Tian, W\. Wang, W\. Yu, W\. L\. Tam, W\. Liang, W\. Liu, X\. Wang, X\. Jia, X\. Gu, X\. Ling, X\. Wang, X\. Fan, X\. Pan, X\. Zhang, X\. Zhang, X\. Fu, X\. Zhang, Y\. Xu, Y\. Wu, Y\. Lu, Y\. Wang, Y\. Zhou, Y\. Pan, Y\. Zhang, Y\. Wang, Y\. Li, Y\. Su, Y\. Geng, Y\. Zhu, Y\. Yang, Y\. Li, Y\. Wu, Y\. Li, Y\. Liu, Y\. Wang, Y\. Li, Y\. Zhang, Z\. Liu, Z\. Yang, Z\. Zhou, Z\. Qiao, Z\. Feng, Z\. Liu, Z\. Zhang, Z\. Wang, Z\. Yao, Z\. Wang, Z\. Liu, Z\. Chai, Z\. Li, Z\. Zhao, W\. Chen, J\. Zhai, B\. Xu, M\. Huang, H\. Wang, J\. Li, Y\. Dong, and J\. Tang \(2025\)GLM\-4\.5: agentic, reasoning, and coding \(arc\) foundation models\.External Links:2508\.06471,[Link](https://arxiv.org/abs/2508.06471)Cited by:[§4\.1](https://arxiv.org/html/2605.29275#S4.SS1.SSS0.Px2.p1.1)\.
- J\. Gu, X\. Jiang, Z\. Shi, H\. Tan, X\. Zhai, C\. Xu, W\. Li, Y\. Shen, S\. Ma, H\. Liu, S\. Wang, K\. Zhang, Y\. Wang, W\. Gao, L\. Ni, and J\. Guo \(2025\)A survey on llm\-as\-a\-judge\.External Links:2411\.15594,[Link](https://arxiv.org/abs/2411.15594)Cited by:[§1](https://arxiv.org/html/2605.29275#S1.p2.1)\.
- A\. Gunjal, A\. Wang, E\. Lau, V\. Nath, Y\. He, B\. Liu, and S\. M\. Hendryx \(2026\)Rubrics as rewards: reinforcement learning beyond verifiable domains\.InThe Fourteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=c1bTcrDmt4)Cited by:[§1](https://arxiv.org/html/2605.29275#S1.p2.1),[§2](https://arxiv.org/html/2605.29275#S2.SS0.SSS0.Px2.p1.1)\.
- D\. Guo, D\. Yang, H\. Zhang, J\. Song, P\. Wang, Q\. Zhu, R\. Xu, R\. Zhang, S\. Ma, X\. Bi,et al\.\(2025\)DeepSeek\-r1 incentivizes reasoning in llms through reinforcement learning\.Nature645\(8081\),pp\. 633–638\.Cited by:[§1](https://arxiv.org/html/2605.29275#S1.p2.1),[§2](https://arxiv.org/html/2605.29275#S2.SS0.SSS0.Px1.p1.1),[§4\.1](https://arxiv.org/html/2605.29275#S4.SS1.SSS0.Px2.p1.1)\.
- Z\. Huang, Y\. Zhuang, G\. Lu, Z\. Qin, H\. Xu, T\. Zhao, R\. Peng, J\. Hu, Z\. Shen, X\. Hu, X\. Gu, P\. Tu, J\. Liu, W\. Chen, Y\. Fu, Z\. Fan, Y\. Gu, Y\. Wang, Z\. Yang, J\. Li, and J\. Zhao \(2025\)Reinforcement learning with rubric anchors\.External Links:2508\.12790,[Link](https://arxiv.org/abs/2508.12790)Cited by:[§4\.3](https://arxiv.org/html/2605.29275#S4.SS3.p2.1)\.
- R\. Jia, Y\. Yang, Y\. Wu, Y\. Gai, S\. Tao, M\. Zhou, J\. Lin, X\. Jiang, and G\. Jiang \(2026\)Open rubric system: scaling reinforcement learning with pairwise adaptive rubric\.External Links:2602\.14069,[Link](https://arxiv.org/abs/2602.14069)Cited by:[§1](https://arxiv.org/html/2605.29275#S1.p2.1),[§2](https://arxiv.org/html/2605.29275#S2.SS0.SSS0.Px3.p2.1),[§3\.4](https://arxiv.org/html/2605.29275#S3.SS4.p1.1)\.
- Y\. Jiang, Y\. Wang, X\. Zeng, W\. Zhong, L\. Li, F\. Mi, L\. Shang, X\. Jiang, Q\. Liu, and W\. Wang \(2024\)FollowBench: a multi\-level fine\-grained constraints following benchmark for large language models\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),L\. Ku, A\. Martins, and V\. Srikumar \(Eds\.\),Bangkok, Thailand,pp\. 4667–4688\.External Links:[Link](https://aclanthology.org/2024.acl-long.257/),[Document](https://dx.doi.org/10.18653/v1/2024.acl-long.257)Cited by:[§1](https://arxiv.org/html/2605.29275#S1.p2.1)\.
- S\. Kim, J\. Shin, y\. cho, J\. Jang, S\. Longpre, H\. Lee, S\. Yun, S\. Shin, S\. Kim, J\. Thorne, and M\. Seo \(2024\)Prometheus: inducing fine\-grained evaluation capability in language models\.InInternational Conference on Learning Representations,B\. Kim, Y\. Yue, S\. Chaudhuri, K\. Fragkiadaki, M\. Khan, and Y\. Sun \(Eds\.\),Vol\.2024,pp\. 29927–29962\.External Links:[Link](https://proceedings.iclr.cc/paper_files/paper/2024/file/803485352e61e3ebf41221e4776c9fd4-Paper-Conference.pdf)Cited by:[§2](https://arxiv.org/html/2605.29275#S2.SS0.SSS0.Px2.p1.1)\.
- D\. Li, J\. Zhou, L\. M\. Brunswic, A\. Ghaddar, Q\. Sun, L\. Ma, Y\. Luo, D\. Li, M\. Coates, J\. Hao, and Y\. Zhang \(2025a\)Omni\-thinker: scaling multi\-task rl in llms with hybrid reward and task scheduling\.External Links:2507\.14783,[Link](https://arxiv.org/abs/2507.14783)Cited by:[§2](https://arxiv.org/html/2605.29275#S2.SS0.SSS0.Px3.p1.1)\.
- T\. Li, W\. Chiang, E\. Frick, L\. Dunlap, T\. Wu, B\. Zhu, J\. E\. Gonzalez, and I\. Stoica \(2025b\)From crowdsourced data to high\-quality benchmarks: arena\-hard and benchbuilder pipeline\.InForty\-second International Conference on Machine Learning,External Links:[Link](https://openreview.net/forum?id=KfTf9vFvSn)Cited by:[§4\.1](https://arxiv.org/html/2605.29275#S4.SS1.SSS0.Px3.p1.1)\.
- C\. Y\. Liu, L\. Zeng, Y\. Xiao, J\. He, J\. Liu, C\. Wang, R\. Yan, W\. Shen, F\. Zhang, J\. Xu, and Y\. Liu \(2026a\)Skywork\-reward\-v2: scaling preference data curation via human\-AI synergy\.InThe Fourteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=ofgxkMLqic)Cited by:[§4\.2](https://arxiv.org/html/2605.29275#S4.SS2.p2.1)\.
- T\. Liu, R\. Xu, T\. Yu, I\. Hong, C\. Yang, T\. Zhao, and H\. Wang \(2026b\)OpenRubrics: towards scalable synthetic rubric generation for reward modeling and llm alignment\.External Links:2510\.07743,[Link](https://arxiv.org/abs/2510.07743)Cited by:[§1](https://arxiv.org/html/2605.29275#S1.p2.1),[§2](https://arxiv.org/html/2605.29275#S2.SS0.SSS0.Px2.p1.1)\.
- Y\. Liu, Z\. Yao, R\. Min, Y\. Cao, L\. Hou, and J\. Li \(2025a\)RM\-bench: benchmarking reward models of language models with subtlety and style\.InThe Thirteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=QEHrmQPBdd)Cited by:[§4\.1](https://arxiv.org/html/2605.29275#S4.SS1.SSS0.Px3.p1.1)\.
- Z\. Liu, C\. Chen, W\. Li, P\. Qi, T\. Pang, C\. Du, W\. S\. Lee, and M\. Lin \(2025b\)Understanding r1\-zero\-like training: a critical perspective\.InSecond Conference on Language Modeling,External Links:[Link](https://openreview.net/forum?id=5PAF7PAY2Y)Cited by:[§4\.1](https://arxiv.org/html/2605.29275#S4.SS1.SSS0.Px4.p1.1)\.
- R\. Lou, K\. Zhang, and W\. Yin \(2024\)Large language model instruction following: a survey of progresses and challenges\.Computational Linguistics50\(3\),pp\. 1053–1095\.External Links:[Link](https://aclanthology.org/2024.cl-3.7/),[Document](https://dx.doi.org/10.1162/coli%5Fa%5F00523)Cited by:[§1](https://arxiv.org/html/2605.29275#S1.p2.1),[§3\.4](https://arxiv.org/html/2605.29275#S3.SS4.p1.1),[§4\.1](https://arxiv.org/html/2605.29275#S4.SS1.SSS0.Px3.p1.1)\.
- Y\. Luo, Z\. Yang, F\. Meng, Y\. Li, J\. Zhou, and Y\. Zhang \(2025\)An empirical study of catastrophic forgetting in large language models during continual fine\-tuning\.IEEE Transactions on Audio, Speech and Language Processing\.Cited by:[§5\.4](https://arxiv.org/html/2605.29275#S5.SS4.p1.1)\.
- S\. Malik, V\. Pyatkin, S\. Land, J\. Morrison, N\. A\. Smith, H\. Hajishirzi, and N\. Lambert \(2026\)RewardBench 2: advancing reward model evaluation\.InThe Fourteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=fb0G86Dewb)Cited by:[§4\.1](https://arxiv.org/html/2605.29275#S4.SS1.SSS0.Px3.p1.1)\.
- L\. Ouyang, J\. Wu, X\. Jiang, D\. Almeida, C\. Wainwright, P\. Mishkin, C\. Zhang, S\. Agarwal, K\. Slama, A\. Ray, J\. Schulman, J\. Hilton, F\. Kelton, L\. Miller, M\. Simens, A\. Askell, P\. Welinder, P\. F\. Christiano, J\. Leike, and R\. Lowe \(2022\)Training language models to follow instructions with human feedback\.InAdvances in Neural Information Processing Systems,S\. Koyejo, S\. Mohamed, A\. Agarwal, D\. Belgrave, K\. Cho, and A\. Oh \(Eds\.\),Vol\.35,pp\. 27730–27744\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2022/file/b1efde53be364a73914f58805a001731-Paper-Conference.pdf)Cited by:[§1](https://arxiv.org/html/2605.29275#S1.p2.1),[§2](https://arxiv.org/html/2605.29275#S2.SS0.SSS0.Px1.p1.1)\.
- S\. J\. Paech \(2025\)EQ\-bench creative writing benchmark v3\.GitHub\.Note:[https://github\.com/EQ\-bench/creative\-writing\-bench](https://github.com/EQ-bench/creative-writing-bench)Cited by:[§4\.1](https://arxiv.org/html/2605.29275#S4.SS1.SSS0.Px3.p1.1)\.
- H\. Peng, Y\. Qi, X\. Wang, B\. Xu, L\. Hou, and J\. Li \(2025a\)Verif: verification engineering for reinforcement learning in instruction following\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,pp\. 30312–30327\.Cited by:[§2](https://arxiv.org/html/2605.29275#S2.SS0.SSS0.Px3.p1.1),[§4\.1](https://arxiv.org/html/2605.29275#S4.SS1.SSS0.Px2.p1.1)\.
- H\. Peng, Y\. Qi, X\. Wang, Z\. Yao, B\. Xu, L\. Hou, and J\. Li \(2025b\)Agentic reward modeling: integrating human preferences with verifiable correctness signals for reliable reward systems\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), ACL 2025, Vienna, Austria, July 27 \- August 1, 2025,pp\. 15934–15949\.External Links:[Link](https://aclanthology.org/2025.acl-long.775/)Cited by:[§2](https://arxiv.org/html/2605.29275#S2.SS0.SSS0.Px3.p1.1)\.
- V\. Pyatkin, S\. Malik, V\. Graf, H\. Ivison, S\. Huang, P\. Dasigi, N\. Lambert, and H\. Hajishirzi \(2026\)Generalizing verifiable instruction following\.InThe Thirty\-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track,External Links:[Link](https://openreview.net/forum?id=yfYgwjj5F8)Cited by:[§1](https://arxiv.org/html/2605.29275#S1.p2.1),[§3\.4](https://arxiv.org/html/2605.29275#S3.SS4.p1.1),[§4\.1](https://arxiv.org/html/2605.29275#S4.SS1.SSS0.Px3.p1.1)\.
- Qwen Team \(2026\)Qwen3\.5: towards native multimodal agents\.External Links:[Link](https://qwen.ai/blog?id=qwen3.5)Cited by:[§4\.1](https://arxiv.org/html/2605.29275#S4.SS1.SSS0.Px1.p1.1)\.
- D\. Rein, B\. L\. Hou, A\. C\. Stickland, J\. Petty, R\. Y\. Pang, J\. Dirani, J\. Michael, and S\. R\. Bowman \(2024\)GPQA: a graduate\-level google\-proof q&a benchmark\.InFirst Conference on Language Modeling,External Links:[Link](https://openreview.net/forum?id=Ti67584b98)Cited by:[§5\.4](https://arxiv.org/html/2605.29275#S5.SS4.p2.1)\.
- J\. Saad\-Falcon, R\. Vivek, W\. Berrios, N\. S\. Naik, M\. Franklin, B\. Vidgen, A\. Singh, D\. Kiela, and S\. Mehri \(2026\)LMUnit: fine\-grained evaluation with natural language unit tests\.External Links:2412\.13091,[Link](https://arxiv.org/abs/2412.13091)Cited by:[§1](https://arxiv.org/html/2605.29275#S1.p2.1),[§2](https://arxiv.org/html/2605.29275#S2.SS0.SSS0.Px2.p1.1),[§4\.2](https://arxiv.org/html/2605.29275#S4.SS2.p2.1)\.
- J\. Schulman, F\. Wolski, P\. Dhariwal, A\. Radford, and O\. Klimov \(2017\)Proximal policy optimization algorithms\.External Links:1707\.06347,[Link](https://arxiv.org/abs/1707.06347)Cited by:[§2](https://arxiv.org/html/2605.29275#S2.SS0.SSS0.Px1.p1.1)\.
- R\. Shao, A\. Asai, S\. Z\. Shen, H\. Ivison, V\. Kishore, J\. Zhuo, X\. Zhao, M\. Park, S\. G\. Finlayson, D\. Sontag, T\. Murray, S\. Min, P\. Dasigi, L\. Soldaini, F\. Brahman, W\. Yih, T\. Wu, L\. Zettlemoyer, Y\. Kim, H\. Hajishirzi, and P\. W\. Koh \(2025\)DR tulu: reinforcement learning with evolving rubrics for deep research\.External Links:2511\.19399,[Link](https://arxiv.org/abs/2511.19399)Cited by:[§1](https://arxiv.org/html/2605.29275#S1.p2.1),[§2](https://arxiv.org/html/2605.29275#S2.SS0.SSS0.Px2.p1.1)\.
- Z\. Shao, P\. Wang, Q\. Zhu, R\. Xu, J\. Song, X\. Bi, H\. Zhang, M\. Zhang, Y\. K\. Li, Y\. Wu, and D\. Guo \(2024\)DeepSeekMath: pushing the limits of mathematical reasoning in open language models\.External Links:2402\.03300,[Link](https://arxiv.org/abs/2402.03300)Cited by:[§1](https://arxiv.org/html/2605.29275#S1.p2.1),[§2](https://arxiv.org/html/2605.29275#S2.SS0.SSS0.Px1.p1.1)\.
- W\. F\. Shen, X\. Qiu, C\. Whitehouse, L\. Alazraki, S\. Goel, F\. Barbieri, T\. Willi, A\. Mathur, and I\. Leontiadis \(2026\)Rethinking rubric generation for improving llm judge and reward modeling for open\-ended tasks\.External Links:2602\.05125,[Link](https://arxiv.org/abs/2602.05125)Cited by:[§2](https://arxiv.org/html/2605.29275#S2.SS0.SSS0.Px2.p1.1)\.
- K\. Team, Y\. Bai, Y\. Bao, G\. Chen, J\. Chen, N\. Chen, R\. Chen, Y\. Chen, Y\. Chen, Y\. Chen, Z\. Chen, J\. Cui, H\. Ding, M\. Dong, A\. Du, C\. Du, D\. Du, Y\. Du, Y\. Fan, Y\. Feng, K\. Fu, B\. Gao, H\. Gao, P\. Gao, T\. Gao, X\. Gu, L\. Guan, H\. Guo, J\. Guo, H\. Hu, X\. Hao, T\. He, W\. He, W\. He, C\. Hong, Y\. Hu, Z\. Hu, W\. Huang, Z\. Huang, Z\. Huang, T\. Jiang, Z\. Jiang, X\. Jin, Y\. Kang, G\. Lai, C\. Li, F\. Li, H\. Li, M\. Li, W\. Li, Y\. Li, Y\. Li, Z\. Li, Z\. Li, H\. Lin, X\. Lin, Z\. Lin, C\. Liu, C\. Liu, H\. Liu, J\. Liu, J\. Liu, L\. Liu, S\. Liu, T\. Y\. Liu, T\. Liu, W\. Liu, Y\. Liu, Y\. Liu, Y\. Liu, Y\. Liu, Z\. Liu, E\. Lu, L\. Lu, S\. Ma, X\. Ma, Y\. Ma, S\. Mao, J\. Mei, X\. Men, Y\. Miao, S\. Pan, Y\. Peng, R\. Qin, B\. Qu, Z\. Shang, L\. Shi, S\. Shi, F\. Song, J\. Su, Z\. Su, X\. Sun, F\. Sung, H\. Tang, J\. Tao, Q\. Teng, C\. Wang, D\. Wang, F\. Wang, H\. Wang, J\. Wang, J\. Wang, J\. Wang, S\. Wang, S\. Wang, Y\. Wang, Y\. Wang, Y\. Wang, Y\. Wang, Y\. Wang, Z\. Wang, Z\. Wang, Z\. Wang, C\. Wei, Q\. Wei, W\. Wu, X\. Wu, Y\. Wu, C\. Xiao, X\. Xie, W\. Xiong, B\. Xu, J\. Xu, J\. Xu, L\. H\. Xu, L\. Xu, S\. Xu, W\. Xu, X\. Xu, Y\. Xu, Z\. Xu, J\. Yan, Y\. Yan, X\. Yang, Y\. Yang, Z\. Yang, Z\. Yang, Z\. Yang, H\. Yao, X\. Yao, W\. Ye, Z\. Ye, B\. Yin, L\. Yu, E\. Yuan, H\. Yuan, M\. Yuan, H\. Zhan, D\. Zhang, H\. Zhang, W\. Zhang, X\. Zhang, Y\. Zhang, Y\. Zhang, Y\. Zhang, Y\. Zhang, Y\. Zhang, Y\. Zhang, Z\. Zhang, H\. Zhao, Y\. Zhao, H\. Zheng, S\. Zheng, J\. Zhou, X\. Zhou, Z\. Zhou, Z\. Zhu, W\. Zhuang, and X\. Zu \(2025\)Kimi k2: open agentic intelligence\.External Links:2507\.20534,[Link](https://arxiv.org/abs/2507.20534)Cited by:[§2](https://arxiv.org/html/2605.29275#S2.SS0.SSS0.Px3.p1.1)\.
- B\. Wang, Y\. Liu, Y\. Liu, T\. Tang, S\. Wang, C\. Gao, C\. Zheng, Y\. Zhang, L\. Yu, S\. Liu, T\. Gui, Q\. Zhang, X\. Huang, B\. Yu, F\. Huang, and J\. Lin \(2026a\)Outcome accuracy is not enough: aligning the reasoning process of reward models\.External Links:2602\.04649,[Link](https://arxiv.org/abs/2602.04649)Cited by:[§2](https://arxiv.org/html/2605.29275#S2.SS0.SSS0.Px2.p1.1)\.
- H\. Wang, H\. Que, Q\. Xu, M\. Liu, W\. Zhou, J\. Feng, W\. Zhong, W\. Ye, T\. Yang, W\. Huang, G\. Zhang, and F\. Lin \(2026b\)Reverse\-engineered reasoning for open\-ended generation\.InThe Fourteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=aK9JneKTL8)Cited by:[§4\.1](https://arxiv.org/html/2605.29275#S4.SS1.SSS0.Px2.p1.1)\.
- Z\. Wang, J\. Zeng, O\. Delalleau, E\. Evans, D\. Egert, H\. Shin, F\. Soares, Y\. Dong, and O\. Kuchaiev \(2026c\)RLBFF: binary flexible feedback to bridge between human feedback & verifiable rewards\.InThe Fourteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=P3R3S6S5Km)Cited by:[§4\.2](https://arxiv.org/html/2605.29275#S4.SS2.p2.1)\.
- Y\. Wu, J\. Mei, M\. Yan, C\. Li, S\. Lai, Y\. Ren, W\. Zijia, J\. Zhang, M\. Wu, Q\. Jin, and F\. Huang \(2026\)WritingBench: a comprehensive benchmark for generative writing\.InThe Thirty\-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track,External Links:[Link](https://openreview.net/forum?id=Pkskg9drDQ)Cited by:[§4\.1](https://arxiv.org/html/2605.29275#S4.SS1.SSS0.Px3.p1.1)\.
- L\. Xiaomi \(2026\)MiMo\-v2\-flash technical report\.External Links:2601\.02780,[Link](https://arxiv.org/abs/2601.02780)Cited by:[§2](https://arxiv.org/html/2605.29275#S2.SS0.SSS0.Px3.p1.1)\.
- R\. Xu, T\. Liu, Z\. Dong, T\. Yu, I\. Hong, C\. Yang, L\. Zhang, T\. Zhao, and H\. Wang \(2026\)Alternating reinforcement learning for rubric\-based reward modeling in non\-verifiable llm post\-training\.External Links:2602\.01511,[Link](https://arxiv.org/abs/2602.01511)Cited by:[§1](https://arxiv.org/html/2605.29275#S1.p2.1),[§2](https://arxiv.org/html/2605.29275#S2.SS0.SSS0.Px2.p1.1)\.
- A\. Yang, A\. Li, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Gao, C\. Huang, C\. Lv, C\. Zheng, D\. Liu, F\. Zhou, F\. Huang, F\. Hu, H\. Ge, H\. Wei, H\. Lin, J\. Tang, J\. Yang, J\. Tu, J\. Zhang, J\. Yang, J\. Yang, J\. Zhou, J\. Zhou, J\. Lin, K\. Dang, K\. Bao, K\. Yang, L\. Yu, L\. Deng, M\. Li, M\. Xue, M\. Li, P\. Zhang, P\. Wang, Q\. Zhu, R\. Men, R\. Gao, S\. Liu, S\. Luo, T\. Li, T\. Tang, W\. Yin, X\. Ren, X\. Wang, X\. Zhang, X\. Ren, Y\. Fan, Y\. Su, Y\. Zhang, Y\. Zhang, Y\. Wan, Y\. Liu, Z\. Wang, Z\. Cui, Z\. Zhang, Z\. Zhou, and Z\. Qiu \(2025\)Qwen3 technical report\.External Links:2505\.09388,[Link](https://arxiv.org/abs/2505.09388)Cited by:[§4\.1](https://arxiv.org/html/2605.29275#S4.SS1.SSS0.Px1.p1.1)\.
- S\. Ye, D\. Kim, S\. Kim, H\. Hwang, S\. Kim, Y\. Jo, J\. Thorne, J\. Kim, and M\. Seo \(2024\)FLASK: fine\-grained language model evaluation based on alignment skill sets\.InICLR 2024 Workshop on Large Language Model \(LLM\) Agents,External Links:[Link](https://openreview.net/forum?id=3OfPKwAqPf)Cited by:[§1](https://arxiv.org/html/2605.29275#S1.p1.1)\.
- Q\. Yu, Z\. Zhang, R\. Zhu, Y\. Yuan, X\. Zuo, YuYue, W\. Dai, T\. Fan, G\. Liu, J\. Liu, L\. Liu, X\. Liu, H\. Lin, Z\. Lin, B\. Ma, G\. Sheng, Y\. Tong, C\. Zhang, M\. Zhang, R\. Zhang, W\. Zhang, H\. Zhu, J\. Zhu, J\. Chen, J\. Chen, C\. Wang, H\. Yu, Y\. Song, X\. Wei, H\. Zhou, J\. Liu, W\. Ma, Y\. Zhang, L\. Yan, Y\. Wu, and M\. Wang \(2026\)DAPO: an open\-source LLM reinforcement learning system at scale\.InThe Thirty\-ninth Annual Conference on Neural Information Processing Systems,External Links:[Link](https://openreview.net/forum?id=2a36EMSSTp)Cited by:[§1](https://arxiv.org/html/2605.29275#S1.p2.1),[§2](https://arxiv.org/html/2605.29275#S2.SS0.SSS0.Px1.p1.1)\.
- J\. Zhang, Z\. Wang, L\. Gui, S\. M\. Sathyendra, J\. Jeong, V\. Veitch, W\. Wang, Y\. He, B\. Liu, and L\. Jin \(2026\)Chasing the tail: effective rubric\-based reward modeling for large language model post\-training\.InThe Fourteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=pBjy4ek2QV)Cited by:[§2](https://arxiv.org/html/2605.29275#S2.SS0.SSS0.Px2.p1.1)\.
- C\. Zheng, S\. Liu, M\. Li, X\. Chen, B\. Yu, C\. Gao, K\. Dang, Y\. Liu, R\. Men, A\. Yang, J\. Zhou, and J\. Lin \(2025\)Group sequence policy optimization\.External Links:2507\.18071,[Link](https://arxiv.org/abs/2507.18071)Cited by:[§2](https://arxiv.org/html/2605.29275#S2.SS0.SSS0.Px1.p1.1)\.
- L\. Zheng, W\. Chiang, Y\. Sheng, S\. Zhuang, Z\. Wu, Y\. Zhuang, Z\. Lin, Z\. Li, D\. Li, E\. Xing, H\. Zhang, J\. Gonzalez, and I\. Stoica \(2023\)Judging llm\-as\-a\-judge with mt\-bench and chatbot arena\.InAdvances in Neural Information Processing Systems,Vol\.36,pp\. 46595–46623\.Cited by:[§1](https://arxiv.org/html/2605.29275#S1.p2.1),[§3\.4](https://arxiv.org/html/2605.29275#S3.SS4.p1.1)\.
- Y\. Zhou, S\. Li, S\. Liu, W\. Fang, K\. Zhang, J\. Zhao, J\. Yang, Y\. Zhou, J\. Lv, T\. Zheng, H\. Lu, W\. Chen, Y\. Xie, and M\. Song \(2026\)Breaking the exploration bottleneck: rubric\-scaffolded reinforcement learning for general llm reasoning\.External Links:2508\.16949,[Link](https://arxiv.org/abs/2508.16949)Cited by:[§2](https://arxiv.org/html/2605.29275#S2.SS0.SSS0.Px2.p1.1)\.
## Appendices
## Appendix A Implementation Details and Hyperparameters
To facilitate reproducibility, we provide the main implementation details and hyperparameters used in our experiments. Appendix [A.1](https://arxiv.org/html/2605.29275#A1.SS1) describes the inference settings for models used inside the reward pipeline, and Appendix [A.2](https://arxiv.org/html/2605.29275#A1.SS2) reports the online RL training setup and optimization hyperparameters.
### A.1 Inference Settings for Pipeline Models
This subsection reports the inference settings for the models used inside our reward pipeline, including offline reward construction and online reward computation\.
For offline reward construction, all Qwen3-series models are run in thinking mode. For GPT-5, which is used for prompt-specific rubric generation, we use the default decoding configuration with temperature 1.0. For online reward computation, the rubric judge and global scorer are instantiated with Qwen3-series or Qwen3.5-series models. The model-specific inference settings are summarized in Table [7](https://arxiv.org/html/2605.29275#A1.T7).
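| Model | Mode | Temp. | Top-p | Top-k | Pres. Penalty |
|---|---|---|---|---|---|
| Qwen3 | Thinking | 0.6 | 0.95 | 20 | – |
| Qwen3 | Non-thinking | 0.7 | 0.95 | 20 | – |
| Qwen3.5 | Thinking | 1.0 | 0.95 | 20 | 1.5 |
| Qwen3.5 | Non-thinking | 0.8 | 0.8 | 20 | 1.5 |
| GPT-5 | Default | 1.0 | – | – | – |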
Table 7: Inference settings for models used inside the reward pipeline.

During rubric evaluation, we disable thinking mode for criteria with weights 1 or 2 to reduce inference cost. In preliminary trials, disabling thinking for these low-weight criteria does not noticeably affect reward quality. For all other rubric criteria and for global scoring, we enable thinking mode.
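The mode selection described above can be written as a small lookup. The following is a minimal sketch rather than the pipeline's actual code: the sampling values are transcribed from Table 7, while the names `SAMPLING`, `thinking_enabled`, and `judge_settings`, and the convention that `None` stands for global scoring, are assumptions made for illustration.

```python
from typing import Optional

# Sampling settings transcribed from Table 7, keyed by (model family, thinking enabled).
SAMPLING = {
    ("Qwen3", True): {"temperature": 0.6, "top_p": 0.95, "top_k": 20},
    ("Qwen3", False): {"temperature": 0.7, "top_p": 0.95, "top_k": 20},
    ("Qwen3.5", True): {"temperature": 1.0, "top_p": 0.95, "top_k": 20, "presence_penalty": 1.5},
    ("Qwen3.5", False): {"temperature": 0.8, "top_p": 0.8, "top_k": 20, "presence_penalty": 1.5},
}


def thinking_enabled(criterion_weight: Optional[int]) -> bool:
    """Thinking is disabled only for rubric criteria with weight 1 or 2.

    Assumption: None denotes global scoring, which always uses thinking mode.
    """
    return criterion_weight is None or criterion_weight not in (1, 2)


def judge_settings(family: str, criterion_weight: Optional[int] = None) -> dict:
    """Return the decoding settings for one judge call (hypothetical helper)."""
    thinking = thinking_enabled(criterion_weight)
    return {"thinking": thinking, **SAMPLING[(family, thinking)]}


print(judge_settings("Qwen3", criterion_weight=2))
print(judge_settings("Qwen3.5"))
```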
### A.2 Online RL Training Setup
We conduct online RL experiments using both verl and slime\. Since slime provides better training efficiency for MoE policy models in our setting, the main online RL results reported in this paper are obtained with slime unless otherwise specified\. The training backend uses Ray for distributed execution, Megatron\-style model training, and SGLang for rollout generation\.
For policy optimization, we use GSPO. At each rollout step, the current policy samples multiple responses for each prompt, and each prompt-response pair is scored by the proposed hybrid reward. In online RL, we linearly decay the holistic-score weight in Eq. (4). At training step $t$, we set

$$\alpha_{t}=\max\left(0,\,1-\frac{t}{T_{\mathrm{decay}}}\right),$$

where $T_{\mathrm{decay}}=800$ in our main experiments. This schedule reflects the changing role of the reward components during training. Early in training, rollout responses often differ substantially in overall helpfulness, relevance, and coherence, so the global score provides a useful dense signal. As training progresses, responses within the same prompt group become closer in holistic quality, making the global score less discriminative for fine-grained improvements. We therefore gradually shift the reward toward rubric-based and code-based supervision, which more directly targets prompt-specific requirements and explicit constraints. We use a linear schedule as a simple monotonic transition rather than tuning a more complex decay function. If training continues beyond $T_{\mathrm{decay}}$, we keep $\alpha_{t}=0$.
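For concreteness, the linear decay schedule can be sketched in a few lines of Python. The function name `holistic_weight` is an assumption made for illustration. How the weight enters the hybrid reward is defined by Eq. (4) and is not reproduced here.

```python
def holistic_weight(step: int, t_decay: int = 800) -> float:
    """Linearly decayed holistic-score weight: max(0, 1 - step / t_decay)."""
    if t_decay <= 0:
        raise ValueError("t_decay must be positive")
    return max(0.0, 1.0 - step / t_decay)


# The weight starts at 1, reaches 0 at step 800, and stays at 0 afterwards.
for step in (0, 200, 400, 800, 1000):
    print(step, holistic_weight(step))
```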
For each prompt group, we compute groupwise advantages by subtracting the group mean reward, but we disable the standard-deviation normalization term. Specifically, for a group of aggregated hybrid rewards $\{r_{i}\}_{i=1}^{G}$, we compute

$$A_{i}=6\cdot\left(r_{i}-\frac{1}{G}\sum_{j=1}^{G}r_{j}\right).$$

We do not divide by the group standard deviation. This avoids unstable rescaling when the reward variance within a group is small, while the fixed multiplier keeps the advantage magnitude comparable to the normalized setting used in standard group-based RL.
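The advantage computation above amounts to mean-centering followed by a fixed rescaling. A minimal sketch follows, where the function name `mean_centered_advantages` is hypothetical and rewards are plain Python floats rather than training-framework tensors.

```python
from typing import List


def mean_centered_advantages(rewards: List[float], scale: float = 6.0) -> List[float]:
    """A_i = scale * (r_i - mean(r)); no division by the group standard deviation."""
    if not rewards:
        return []
    mean = sum(rewards) / len(rewards)
    return [scale * (r - mean) for r in rewards]


print(mean_centered_advantages([0.0, 0.5, 1.0]))
```

By construction the advantages within a group sum to zero (up to floating-point error), so the update only redistributes probability mass among the sampled responses.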
The main online RL hyperparameters are summarized in Table [8](https://arxiv.org/html/2605.29275#A1.T8).
| Hyperparameter | Value |
|---|---|
| RL algorithm / estimator | GSPO |
| Rollout batch size | 32 |
| Responses per prompt | 16 |
| Global batch size | 512 |
| Optimization mini-batch size | 32 |
| Rollout temperature | 1.0 |
| Maximum prompt length | 2048 |
| Maximum response length | 8192 |
| Optimizer | Adam |
| Learning rate | $1\times 10^{-6}$ |
| Learning-rate schedule | Constant |
| Weight decay | 0.1 |
| Adam $\beta_{1}$ / $\beta_{2}$ | 0.9 / 0.98 |
| KL coefficient | 0.001 |
| KL loss type | Low-variance KL |
| Entropy coefficient | 0 |
| Clip lower / upper | $3\times 10^{-4}$ / $4\times 10^{-4}$ |
| Clip constant | 10 |
| Advantage scale | 6 |
| Std normalization | Disabled |
Table 8: Main hyperparameters used for online RL training.

For distributed training, we mainly use a 2-node setup with 8 H800 GPUs per node, for a total of 16 H800 GPUs. For MoE policy models, we use tensor model parallel size 2, pipeline model parallel size 1, context parallel size 1, expert model parallel size 8, and expert tensor parallel size 1. Sequence parallelism and dynamic batch sizing are enabled. To reduce memory usage, we use full activation recomputation with uniform recomputation and set the maximum number of tokens per GPU to 8192.
For online reward computation, we deploy the reward\-model service on the same scale, using 2 nodes with 8 H800 GPUs per node, for a total of 16 H800 GPUs\. The reward\-model service is used to compute the rubric\-based and global scores during training, while executable checkers are evaluated deterministically\.
For rollout serving, we use SGLang with 8 GPUs per rollout engine. We enable data-parallel attention and data-parallel LM head, set the SGLang data-parallel size to 8, and set the static memory fraction to 0.85. The maximum number of running requests is set to 128. During training, rollout generation and reward computation are executed asynchronously when possible, so that the latency of model-based reward computation can be partially hidden by rollout generation.
We report approximate local policy\-training compute for the four main online RL experiments\. GPU\-hours are estimated as the number of allocated GPUs multiplied by wall\-clock training time\. The DeepSeek\-R1\-Distill\-Qwen\-7B, Qwen3\-4B, Qwen3\-30B\-A3B, and GLM\-4\.7\-Flash runs used approximately 672, 768, 1,584, and 1,728 H800 GPU\-hours, respectively\. Model\-based reward computation was served by a shared H800 GPU service and is therefore not uniquely attributed to each individual run\.
### A.3 Evaluation Protocols and Settings
For all evaluation benchmarks, we enable thinking mode during inference. Before answer extraction or scoring, we strip the thinking content using a fixed parser and evaluate only the final response. This ensures that benchmark metrics and external judges are applied to the actual user-facing answer rather than the model’s hidden reasoning trace. When reporting repeated-run results, $\pm$ denotes the sample standard deviation across runs.
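A fixed parser of this kind can be sketched as follows. This is an illustration under stated assumptions, not the parser used in the experiments: it assumes the reasoning trace is delimited by `<think>` and `</think>` tags, and the name `strip_thinking` is hypothetical.

```python
import re

# Assumption: reasoning is wrapped in <think>...</think>; other models may use different delimiters.
_THINK_BLOCK = re.compile(r"<think>.*?</think>", flags=re.DOTALL)


def strip_thinking(text: str) -> str:
    """Remove reasoning spans and return only the user-facing answer."""
    cleaned = _THINK_BLOCK.sub("", text)
    # Some chat templates emit only the closing tag; keep what follows the last one.
    if "</think>" in cleaned:
        cleaned = cleaned.split("</think>")[-1]
    return cleaned.strip()


print(strip_thinking("<think>Let me plan the answer.</think>\nFinal answer: 42"))
```

A response whose reasoning block is never closed (for example, because generation was truncated) is left unchanged by this sketch. A stricter parser might treat that case as an empty answer.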
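| Benchmark | Temp. | Top-p | Top-k | Min-p | Max Length |
|---|---|---|---|---|---|
| IFEval | 0.6 | 0.95 | 20 | – | 16K |
| IFBench | 0.6 | 0.95 | 20 | – | 16K |
| GSM8K | 0.6 | 0.95 | 20 | – | 16K |
| GPQA-Diamond | 0.6 | 0.95 | 20 | – | 16K |
| AIME 2024 | 0.6 | 0.95 | 20 | – | 32K/128K |
| Arena-Hard v2.0 | 0.6 | – | – | – | 32000 |
| Creative Writing v3 | 0.7 | – | – | 0.1 | 4000 |
| WritingBench | 0.8 | 0.95 | 20 | – | 16000 |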
Table 9: Generation settings for evaluation benchmarks. For all benchmarks, thinking content is removed before answer extraction or scoring.

For IFEval and IFBench, we report prompt-level strict accuracy. For GSM8K, GPQA-Diamond, and AIME 2024, we follow the official answer extraction and evaluation protocols. For benchmarks that require external model-based judging, including Arena-Hard v2.0, Creative Writing v3, and WritingBench, we use GPT-4.1 as the judge and follow the corresponding official evaluation protocols and generation settings.
For IFEval, IFBench, GSM8K, and GPQA-Diamond, we use a unified maximum response length of 16K tokens. For AIME 2024, we use the same decoding configuration but set model-specific maximum response lengths: 32K tokens for Qwen3-30B-A3B and 128K tokens for GLM-4.7-Flash. The benchmark-specific generation settings are summarized in Table [9](https://arxiv.org/html/2605.29275#A1.T9).
### A.4 Checker Validation and Failure Handling
To reduce unsafe\-code and prompt\-injection risks, our checker construction uses a two\-stage pipeline rather than directly translating arbitrary user prompts into code\. The first stage extracts only structured, surface\-checkable constraints from a fixed set of allowed types, including word or character counts, paragraph or sentence counts, keyword inclusion or exclusion, response language, required starting or ending text, list or output format, and punctuation rules\. Semantic, factual, stylistic, or open\-ended requirements are not converted into executable checkers and are instead left to model\-based scoring\. The second stage compiles only these structured constraints into Python checkers\. This separation limits the code generator to deterministic string matching, regular expressions, and counting logic, reducing the risk that prompt\-injection\-style instructions are translated into arbitrary executable behavior\.
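To illustrate why restricting the second stage to a fixed set of constraint types limits what executable behavior can arise, the sketch below maps structured constraints to checkers through a whitelist. It deliberately replaces LLM-based code generation with hand-written factories, so it shows the whitelist idea rather than the compilation step itself. The constraint schema (`type`, `value`) and all function names are assumptions.

```python
import re
from typing import Callable, Dict

Checker = Callable[[str], bool]


def word_count_max(limit: int) -> Checker:
    """Response must contain at most `limit` whitespace-separated words."""
    return lambda response: len(response.split()) <= limit


def paragraph_count_exact(n: int) -> Checker:
    """Response must contain exactly `n` paragraphs separated by blank lines."""
    def check(response: str) -> bool:
        parts = [p for p in re.split(r"\n\s*\n", response.strip()) if p.strip()]
        return len(parts) == n
    return check


def keyword_included(keyword: str) -> Checker:
    """Response must mention `keyword` (case-insensitive substring match)."""
    pattern = re.compile(re.escape(keyword), re.IGNORECASE)
    return lambda response: pattern.search(response) is not None


# Only surface-checkable constraint types are allowed; anything else is rejected.
ALLOWED: Dict[str, Callable[..., Checker]] = {
    "word_count_max": word_count_max,
    "paragraph_count_exact": paragraph_count_exact,
    "keyword_included": keyword_included,
}


def compile_constraint(spec: dict) -> Checker:
    kind = spec["type"]
    if kind not in ALLOWED:
        raise ValueError(f"constraint type not allowed: {kind}")
    return ALLOWED[kind](spec["value"])


checker = compile_constraint({"type": "word_count_max", "value": 5})
print(checker("short reply"))
```

A semantic requirement such as "be persuasive" has no entry in the whitelist, so it never reaches the executable path and would be left to model-based scoring.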
To improve the robustness of executable checkers, we validate each generated checker before using it for reward computation\. After constraint\-to\-code compilation, we execute the generated checker once on a test prompt\-response input to ensure that it can be parsed and run successfully\. If the checker fails to compile, raises an exception, or exceeds the time limit, we ask the LLM to regenerate the code, with at most three regeneration attempts\. Checkers that still fail after the maximum number of attempts are discarded and treated as unavailable\.
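The validate-and-regenerate loop can be sketched as below. Several details are assumptions: the generated source is expected to define a function named `check`, `generate` is a hypothetical callable standing in for the LLM call, "at most three regeneration attempts" is read as one initial generation plus up to three regenerations, and the smoke test additionally requires a boolean result. The time limit is omitted here, and `exec` is used without sandboxing, which a real deployment would need to add.

```python
from typing import Callable, Optional

Checker = Callable[[str], bool]


def load_checker(source: str, sample_response: str) -> Optional[Checker]:
    """Compile checker source and run it once; return None on any failure."""
    namespace: dict = {}
    try:
        exec(compile(source, "<generated-checker>", "exec"), namespace)
        check = namespace["check"]       # assumed entry-point name
        result = check(sample_response)  # single smoke-test execution
    except Exception:
        return None
    return check if isinstance(result, bool) else None


def build_checker(generate: Callable[[int], str], sample_response: str,
                  max_regenerations: int = 3) -> Optional[Checker]:
    """Try the initial generation plus up to `max_regenerations` regenerations."""
    for attempt in range(1 + max_regenerations):
        checker = load_checker(generate(attempt), sample_response)
        if checker is not None:
            return checker
    return None  # discarded: treated as unavailable


# Usage with a stub generator whose first output has a syntax error.
sources = [
    "def check(response):\n    return len(response.split( <= 3",
    "def check(response):\n    return len(response.split()) <= 3",
]
checker = build_checker(lambda attempt: sources[min(attempt, 1)], "a b c")
print(checker is not None)
```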
During online reward computation, checker execution is also guarded by timeout and retry logic\. If a checker raises an exception or times out, we retry the execution up to the maximum number of attempts\. If it still fails, we conservatively return a failed outcome for that checker\. When no valid checker remains for a prompt, the code\-based reward component is omitted and the remaining reward components are renormalized\.
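The runtime guard and the renormalization step can be sketched as follows. The retry count, the component weights, and the weighted-average form are illustrative assumptions and do not reproduce the aggregation in Eq. (4). Timeouts are assumed to surface as exceptions raised by whatever sandbox runs the checker.

```python
from typing import Callable, Dict, Optional


def run_checker(checker: Callable[[str], bool], response: str, max_attempts: int = 3) -> bool:
    """Retry on exceptions (including timeouts); conservatively fail if all attempts error."""
    for _ in range(max_attempts):
        try:
            return bool(checker(response))
        except Exception:
            continue
    return False


def combine_available(scores: Dict[str, Optional[float]], weights: Dict[str, float]) -> float:
    """Weighted average over the components that are present (score is not None)."""
    available = {name: s for name, s in scores.items() if s is not None}
    total = sum(weights[name] for name in available)
    if total <= 0:
        raise ValueError("no reward component available")
    return sum(weights[name] * s for name, s in available.items()) / total


# No valid checker for this prompt: the code component is omitted and the rest renormalized.
weights = {"rubric": 0.5, "global": 0.25, "code": 0.25}  # hypothetical weights
print(combine_available({"rubric": 0.8, "global": 0.6, "code": None}, weights))
```

Note the asymmetry between the two failure modes: a checker that exists but keeps failing at runtime contributes a failed outcome, whereas a prompt with no valid checker drops the code component entirely.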
| Group | Variant | Factuality | PreciseIF | Math | Safety | Focus | Ties | Overall |
|---|---|---|---|---|---|---|---|---|
| Evaluator Backbone: Qwen3-30B-A3B | | | | | | | | |
| Global-based | Global only | 61.5 | 34.4 | 70.9 | 84.8 | 74.8 | 71.5 | 66.3 |
| | Global + Code | 60.9 | 46.6 | 70.9 | 84.8 | 74.8 | 71.5 | 68.3 |
| Rubric-based | Rubric only | 63.4 | 41.7 | 82.5 | 82.8 | 78.1 | 84.9 | 72.2 |
| | Rubric + Code | 62.6 | 51.8 | 82.5 | 82.8 | 78.1 | 84.9 | 73.8 |
| | Rubric + Global | 67.9 | 44.2 | 84.7 | 90.4 | 82.8 | 89.1 | 76.5 |
| Hybrid | Rubric + Global + Code | 67.7 | 53.8 | 84.7 | 90.4 | 82.8 | 89.1 | 78.1 |
Table 10: Component ablation of the proposed hybrid reward on RewardBench v2 with Qwen3-30B-A3B as the evaluator backbone. Bold and underline denote the best and second-best results in each column, respectively.
### A.5 Reproducibility and Released Artifacts
To facilitate artifact\-level reproducibility, we plan to provide a supplementary artifact package containing the fixed data and reward artifacts used to support the experiments reported in this paper\. The package includes the filtered training prompts used for online RL training, as well as the prompt\-level reward artifacts for the RewardBench v2 and RM\-Bench evaluation prompts\. These offline\-evaluation artifacts include the generated prompt\-specific rubrics, extracted hard constraints, corresponding executable hard\-constraint checkers, and the prompts and metadata associated with these artifacts\.
The artifact package is intended to support inspection and reuse of the fixed reward specifications used in our offline reward-evaluation experiments, and to allow users to audit the training-prompt distribution used for online RL. Importantly, reproducing the reported offline reward computation with these fixed evaluation artifacts does not require re-generating rubrics with the GPT-5-based artifact generator. In our main instantiation, GPT-5 is used only in the offline rubric-generation stage, and all GPT-5 calls use the fixed OpenAI API snapshot `gpt-5-2025-08-07`, rather than a mutable model alias. After the artifacts are constructed, the reward pipeline consumes the fixed rubrics, executable checkers, and evaluator prompts together with the evaluator models described in Appendix [A](https://arxiv.org/html/2605.29275#A1). Thus, the fixed offline-evaluation artifacts separate the reproducibility of the reported reward specifications from the cost and availability of the proprietary rubric-generation model.
The artifact package should be distinguished from the full framework implementation\. In this release, we provide the data artifacts and offline reward specifications needed to inspect the prompt sets, audit the generated rubrics and executable checkers, and reproduce the offline reward\-computation setting subject to evaluator\-model access and local implementation details\. We do not release the complete end\-to\-end artifact\-construction, reward\-computation, and online RL training code in this version\. Full end\-to\-end reproduction of the online RL pipeline additionally requires the rollout infrastructure, asynchronous reward\-computation pipeline, optimization scripts, checker validation and failure\-handling utilities, parsing scripts, RL configuration files, and training utilities\.
In a future public release, we plan to release the full framework implementation, including artifact\-construction scripts, reward\-computation code, prompt templates, decoding settings, parsing rules, checker validation procedures, failure\-handling logic, and RL training code/configurations, subject to internal approval and applicable licensing constraints\. This release is intended to support end\-to\-end reproduction and extension of the framework\. When applying the framework to new prompt distributions, high\-quality artifact construction may still require strong proprietary or open\-weight models, extractor\-specific prompt tuning, and additional validation\. We therefore distinguish between reproducing the reported experiments with fixed released artifacts and extending the artifact\-construction pipeline to new prompts\.
## Appendix B Additional Analysis
### B.1 Component Ablation with Qwen3-30B-A3B
To further examine whether the complementarity among reward components depends on a specific evaluator backbone, we conduct the same component ablation on RewardBench v2 using Qwen3-30B-A3B as the evaluator backbone. The results are shown in Table [10](https://arxiv.org/html/2605.29275#A1.T10).
The results show the same overall pattern as the main ablation in Table [3](https://arxiv.org/html/2605.29275#S4.T3). Combining rubric-based and global scoring substantially outperforms either component alone, confirming that the two model-based signals capture complementary aspects of response quality. Adding code-based verification further improves the full hybrid reward, with the most direct gain appearing on PreciseIF, where explicit and checkable constraints are more common. These results suggest that the benefit of hybridization is not tied to a single evaluator backbone.
### B.2 Reliability Analysis of Code-Based Verification
##### Main finding\.
Figure [2](https://arxiv.org/html/2605.29275#A2.F2) shows that code-based verification substantially improves reward reliability on prompts with explicitly checkable constraints. Adding the code-based component increases Top-1 exact pass from 48.0% to 69.5%, reduces constraint-discordant inversion from 14.8% to 3.1%, and lowers advantage sign flip from 18.6% to 11.8%. These results explain why code-based verification is useful despite its modest average gain in the main component ablation: it mainly reduces constraint-specific reward failures and stabilizes the relative reward signal for hard-constraint prompts.
Figure 2: Reward reliability with and without code-based verification on 100 VERINSTRUCT prompts with non-trivial hard-constraint variation.
##### Analysis setup\.
This analysis complements the component ablation in Section [5.1](https://arxiv.org/html/2605.29275#S5.SS1), where code-based verification yields a modest overall improvement but a larger gain on PreciseIF. We construct the analysis set from VERINSTRUCT training prompts. For each prompt, we sample 16 candidate responses and run the extracted executable checkers on each response, obtaining a constraint-satisfaction pattern for each candidate. We keep only non-trivial prompt groups where candidates differ in checker outcomes, since groups with identical checker outcomes cannot reveal whether the reward distinguishes different levels of hard-constraint satisfaction. From these non-trivial groups, we sample 100 prompts.
For each selected prompt, we keep the 16 candidate responses fixed and rescore the group 10 times under two settings: with and without the code\-based component\. This repeated rescoring isolates reward reliability from response\-generation variance, since the candidate responses are unchanged across runs\. We then measure whether executable verification makes the reward ranking more consistent with deterministic checker outcomes and whether it stabilizes the groupwise advantage signal used in online RL\.
##### Metrics\.
We report three reliability metrics. Top-1 exact pass measures the percentage of rescoring groups in which the highest-ranked response satisfies all executable constraints. Higher values indicate that the reward is more likely to select a fully constraint-satisfying response as the best candidate.
Constraint-discordant inversion measures how often the reward ranks a response satisfying fewer executable constraints above another response satisfying more executable constraints within the same prompt group. Lower values indicate better consistency with deterministic constraint checking.
Advantage sign flip measures how often the same response changes the sign of its groupwise advantage across repeated rescoring runs. Since online RL uses relative advantages within rollout groups, lower values indicate a more stable optimization signal.
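The three metrics can be made concrete with a per-group sketch. The definitions above leave some details open, so the following choices are assumptions: inversions are counted over response pairs that differ in the number of satisfied constraints, a response counts as flipped if its mean-centered advantage is positive in some runs and non-positive in others, and all function names are hypothetical.

```python
from itertools import combinations
from typing import Sequence


def top1_exact_pass(rewards: Sequence[float], passed: Sequence[int], n_constraints: int) -> bool:
    """True if the highest-reward response satisfies every executable constraint."""
    best = max(range(len(rewards)), key=lambda i: rewards[i])
    return passed[best] == n_constraints


def discordant_inversion_rate(rewards: Sequence[float], passed: Sequence[int]) -> float:
    """Share of comparable pairs where the response passing fewer constraints scores higher."""
    comparable = inverted = 0
    for i, j in combinations(range(len(rewards)), 2):
        if passed[i] == passed[j]:
            continue  # equal constraint satisfaction: pair is not comparable
        comparable += 1
        lo, hi = (i, j) if passed[i] < passed[j] else (j, i)
        if rewards[lo] > rewards[hi]:
            inverted += 1
    return inverted / comparable if comparable else 0.0


def sign_flip_rate(runs: Sequence[Sequence[float]]) -> float:
    """runs[k][i] is the reward of response i in rescoring run k."""
    n = len(runs[0])
    flipped = 0
    for i in range(n):
        signs = {run[i] - sum(run) / len(run) > 0 for run in runs}
        flipped += len(signs) > 1
    return flipped / n


rewards, passed = [0.9, 0.4, 0.7], [2, 3, 3]
print(top1_exact_pass(rewards, passed, n_constraints=3))
print(discordant_inversion_rate(rewards, passed))
print(sign_flip_rate([[1.0, 0.0, 0.8], [0.0, 0.2, 1.0]]))
```

Aggregating these per-group values over all prompts and rescoring runs gives percentages of the kind reported in Figure 2.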
##### Discussion\.
The results suggest that executable checking should be viewed as a targeted reliability component rather than a replacement for model\-based judgment\. Rubric\-based and global scores remain necessary for semantic quality, usefulness, and broader response\-level preferences, while code\-based verification reduces reward errors on explicit and deterministically checkable constraints\. This targeted effect is especially useful for constraint\-sensitive instruction\-following prompts\.
### B.3 Diagnostic Analysis of Advantage Normalization
##### Main finding\.
Figure [3](https://arxiv.org/html/2605.29275#A2.F3) shows that removing standard-deviation normalization leads to faster improvement and higher peak rewards in this diagnostic setting. This supports our default advantage computation, which mean-centers group rewards but does not divide by the group standard deviation. We emphasize that this is a diagnostic experiment rather than a main result: it uses DeepSeek-R1-Distill-Qwen-7B as the policy model and Qwen3-30B-A3B as the reward evaluator, which is not the main reward-evaluator configuration used in our final online RL experiments.
Figure 3: Diagnostic comparison of groupwise advantage normalization, with panels (a) training reward mean and (b) IFEval validation reward. The no-std variant mean-centers rewards without dividing by the group standard deviation and applies a fixed scaling factor of 6. Curves are smoothed for visualization.
##### Setup\.
We compare standard groupwise advantage normalization with our default setting. Given a group of response rewards, the standard variant subtracts the group mean and divides by the group standard deviation. In contrast, our default setting subtracts only the group mean and then applies a fixed advantage scaling factor of 6, as described in Appendix [A.2](https://arxiv.org/html/2605.29275#A1.SS2). All other training settings are kept fixed.
##### Rationale\.
This choice is motivated by the dense nature of our hybrid reward. In sparse RLVR settings with binary rewards, standard-deviation normalization is often reasonable because rewards mainly indicate pass or fail outcomes. In our setting, however, reward magnitudes carry graded quality information. For example, a group with rewards $\{0, 0.5, 1\}$ reflects a much larger quality gap than a group with rewards $\{0.45, 0.5, 0.55\}$, but standard-deviation normalization can rescale both groups to similar advantage magnitudes. This may over-amplify minor differences among near-tie responses and make small reward gaps produce updates comparable to much larger quality gaps.
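The two example groups can be checked numerically. The sketch below is only an illustration: it uses the population standard deviation with a small `eps` for the standard variant (implementations differ on both choices), and the function names are hypothetical.

```python
import statistics
from typing import List


def std_normalized(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Standard variant: subtract the group mean, divide by the group standard deviation."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumption: population std
    return [(r - mean) / (std + eps) for r in rewards]


def mean_centered(rewards: List[float], scale: float = 6.0) -> List[float]:
    """Default variant: subtract the group mean, then apply a fixed scale."""
    mean = statistics.fmean(rewards)
    return [scale * (r - mean) for r in rewards]


wide = [0.0, 0.5, 1.0]
narrow = [0.45, 0.5, 0.55]

for name, group in (("wide", wide), ("narrow", narrow)):
    print(name, [round(a, 3) for a in std_normalized(group)],
          [round(a, 3) for a in mean_centered(group)])
```

With standard-deviation normalization the top response in both groups receives an advantage of about 1.22, so the tenfold difference in reward gap disappears. With mean-centering and fixed scaling, the advantages are about 3.0 and 0.3, preserving that ratio.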
##### Discussion\.
As shown in Figure [3](https://arxiv.org/html/2605.29275#A2.F3), the two variants behave similarly at the beginning of training. As training proceeds, however, the no-std variant improves faster in both training reward and IFEval validation reward, and reaches a higher peak in this diagnostic run. These results suggest that standard-deviation normalization can introduce undesirable rescaling for dense open-ended rewards, especially when responses within a rollout group receive similar scores. We therefore use mean-centered advantages with fixed scaling in the main experiments.
### B.4 Additional Details on Open-Weight Rubric Extraction
In Section [5.3](https://arxiv.org/html/2605.29275#S5.SS3), we replace the GPT-5-based rubric extractor with Qwen3.5-397B-A17B to examine whether the proposed framework is tightly coupled to a proprietary rubric-generation model. This appendix provides additional implementation details and caveats for that experiment.
The replacement is applied only to the offline rubric-extraction stage. We use the full-weight Qwen3.5-397B-A17B model rather than the FP8 deployment used in our preliminary trials, and align the inference settings with the official configuration of the model. The rubric judge, global scorer, code-based verifier, scoring prompts, aggregation rule, and benchmark evaluation protocol are kept unchanged. In particular, Qwen3.5-397B-A17B is used only to generate prompt-specific rubrics, while Qwen3.5-35B-A3B is still used for rubric-based judging. Therefore, the comparison in Table [5](https://arxiv.org/html/2605.29275#S5.T5) isolates the effect of changing the rubric extractor rather than changing the downstream reward evaluator.
We do not interpret the remaining performance gap between GPT\-5 and Qwen3\.5\-397B\-A17B as evidence that open\-weight rubric extraction is inherently worse\. Although the open\-weight extractor is evaluated under a stronger and more aligned deployment setting, our rubric\-generation prompt and post\-processing pipeline were originally developed around GPT\-5 outputs\. We did not perform extensive extractor\-specific prompt optimization, parsing adjustment, or post\-processing redesign for Qwen3\.5\-397B\-A17B\. The remaining gap may therefore partly reflect differences in response style and instruction\-following behavior between the two extractors\. Further tuning of the extraction prompt or post\-processing pipeline may reduce this gap\.
At the same time, the results suggest that the proposed hybrid reward is robust to replacing GPT\-5 with a strong open\-weight extractor\. While the rubric\-only setting still shows a noticeable drop, the full hybrid reward substantially narrows the gap, indicating that global scoring and code\-based verification can compensate for imperfections in extracted rubrics\. Thus, open\-weight rubric extraction is feasible in our framework, although extractor quality remains a source of variability\.
There is also a practical trade\-off between reproducibility and extraction efficiency\. In our setup, processing 1,865 prompts with Qwen3\.5\-397B\-A17B required local deployment on 16 H800 GPUs and took about 6 hours\. In contrast, using the GPT\-5 API required no local GPU deployment and completed the same extraction in about 10 minutes\. This comparison is system\-dependent and should not be viewed as a controlled efficiency benchmark, since the API setting hides the underlying serving infrastructure\. It mainly highlights a practical trade\-off: open\-weight extraction improves controllability and reproducibility, while API\-based extraction can be substantially more convenient under our current implementation\.
### B.5 Efficiency of Asynchronous Reward Computation
##### Main finding\.
Table [11](https://arxiv.org/html/2605.29275#A2.T11) shows that model-based reward computation introduces non-negligible training overhead, but the overhead can be substantially reduced by asynchronous scheduling and by using a faster evaluator. With Qwen3.5-35B-A3B as the reward evaluator, asynchronous scheduling reduces the overhead from 79.9% to 43.5%. When using the more efficient Qwen3-30B-A3B evaluator, the asynchronous setting increases per-step wall-clock time by only 7.7% over the rule-based reward baseline. These results suggest that hybrid rewards are not cost-free, but their wall-clock overhead can be made manageable with appropriate evaluator selection and scheduling.
| Reward Setting | Rollout/Reward (s) | Others (s) | Step (s) | Overhead |
|---|---|---|---|---|
| Rule Reward | 154 | 159 | 313 | – |
| Qwen3.5 w/o Async | 156 + 247 | 160 | 563 | +79.9% |
| Qwen3.5 w/ Async | 291 | 158 | 449 | +43.5% |
| Qwen3 w/ Async | 179 | 158 | 337 | +7.7% |

Table 11: Per-step training time under rule-based and model-based rewards. Qwen3.5 and Qwen3 denote Qwen3.5-35B-A3B and Qwen3-30B-A3B reward evaluators, respectively. Rollout/Reward includes rollout generation and reward computation; synchronous rollout and reward times are shown separately in parentheses. Others denotes remaining training-side overhead. Times are averaged over steps 1–100.
##### Profiling setup\.
We profile per\-step wall\-clock time under three reward\-computation settings: rule\-based rewards, synchronous model\-based rewards, and asynchronous model\-based rewards\. All settings use the same policy rollout and training configuration\. For model\-based rewards, the evaluator service is deployed on separate resources and computes the rubric\-based and global scores during training, while executable checkers are evaluated deterministically\.
This comparison therefore measures wall\-clock training throughput rather than equal total compute cost\. Compared with rule\-based rewards or exact checkers, model\-based rewards require additional evaluator\-serving resources\. The purpose of this analysis is not to claim that model\-based rewards are computationally free, but to quantify their impact on training throughput and evaluate whether asynchronous scheduling can reduce the observed wall\-clock overhead\.
##### Effect of asynchronous scheduling\.
With synchronous reward computation, rollout generation and reward evaluation are executed sequentially\. In our Qwen3\.5\-35B\-A3B evaluator setting, this increases the average step time from 313s under rule\-based rewards to 563s, corresponding to a 79\.9% wall\-clock overhead\. Asynchronous scheduling overlaps rollout generation with reward computation, reducing the step time to 449s and the overhead to 43\.5%\.
##### Effect of evaluator throughput\.
The remaining overhead depends strongly on the throughput of the reward evaluator\. When using the more efficient Qwen3\-30B\-A3B evaluator under the asynchronous setting, the average step time is 337s, only 7\.7% higher than the rule\-based reward baseline\. This indicates that the practical cost of hybrid reward training depends not only on the reward design, but also on evaluator size, serving efficiency, resource allocation, and scheduling strategy\.
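The overhead percentages quoted above follow directly from the per\-step times in Table 11\. As a minimal sketch \(the variable and function names are illustrative and not part of the paper's code\), the arithmetic can be reproduced as follows:

```python
# Per-step wall-clock times in seconds, copied from Table 11.
RULE_BASELINE_STEP_S = 313

step_times_s = {
    "Qwen3.5 w/o Async": 563,
    "Qwen3.5 w/ Async": 449,
    "Qwen3 w/ Async": 337,
}


def relative_overhead(step_s: float, baseline_s: float) -> float:
    """Return the wall-clock overhead over the baseline, in percent."""
    return 100.0 * (step_s - baseline_s) / baseline_s


# Overhead of each model-based setting relative to the rule-based baseline.
overheads = {
    name: round(relative_overhead(t, RULE_BASELINE_STEP_S), 1)
    for name, t in step_times_s.items()
}
```

Because the "Others" column is nearly constant \(158–160s\), almost all of the difference between settings comes from the Rollout/Reward column\.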
| Model | IFEval Pr\. \(S\) | IFBench Pr\. \(S\) | Arena\-Hard\-v2\.0 Score | Creative Writing v3 Score | WritingBench Score | Avg\. Score |
|---|---|---|---|---|---|---|
| Qwen3\-30B\-A3B | 85\.4 ± 0\.1 | 35\.9 ± 1\.0 | 30\.6 ± 0\.9 | 77\.3 ± 1\.3 | 74\.4 ± 0\.5 | 60\.7 |
| Rubicon\-Preview | 82\.7 ± 0\.8 \(\-2\.7\) | 33\.7 ± 0\.4 \(\-2\.2\) | 39\.2 ± 1\.8 \(\+8\.6\) | 82\.1 ± 1\.6 \(\+4\.8\) | 77\.7 ± 0\.4 \(\+3\.3\) | 63\.1 \(\+2\.4\) |
| Ours: Qwen3\-30B\-A3B \+ Hybrid RL | 87\.5 ± 0\.4 \(\+2\.1\) | 39\.3 ± 0\.7 \(\+3\.4\) | 38\.5 ± 1\.0 \(\+7\.9\) | 82\.6 ± 1\.4 \(\+5\.3\) | 79\.2 ± 0\.4 \(\+4\.8\) | 65\.4 \(\+4\.7\) |
Table 12: External comparison with Rubicon\-Preview, an open\-weight policy trained with rubric\-anchor RL on the same Qwen3\-30B\-A3B backbone\. All models are evaluated under our protocol, and improvements or drops are computed relative to the Qwen3\-30B\-A3B base model\. Since Rubicon\-Preview uses different training data, reward design, and optimization settings, this comparison serves as an external reference rather than a controlled ablation\. ± denotes sample standard deviation over three runs\.
### B\.6 External Comparison with an Open\-Weight Rubric\-RL Policy
Table [12](https://arxiv.org/html/2605.29275#A2.T12) provides an external comparison with Rubicon\-Preview, a recent open\-weight policy trained with rubric\-anchor RL on the same Qwen3\-30B\-A3B backbone\. The main observation is that Rubicon\-Preview improves the base model on several open\-ended generation benchmarks, especially Arena\-Hard\-v2\.0, but drops on instruction\-following benchmarks such as IFEval and IFBench\. In contrast, our hybrid\-reward RL improves the same backbone across all evaluated benchmarks and achieves a higher average score\. This comparison suggests that rubric\-based RL is a promising direction, but robust gains depend on how reward specifications are constructed, reused, and combined with complementary reward signals\.
Rubicon is closely related to our work because it also introduces structured rubrics as reward anchors for open\-ended RL, rather than relying only on a scalar judge\. In this sense, it partially shares the motivation of separating reward specification from reward computation: rubrics define explicit evaluation criteria before responses are scored\. However, this separation is less explicit and less complete than in our framework\. Rubicon follows a rubric\-bank\-driven paradigm, where large\-scale rubrics are constructed first and training data are then selected, synthesized, filtered, or rewritten to match the rubric bank\. As a result, the training data are tightly coupled with the coverage and quality of the pre\-constructed rubric bank, as well as with substantial data preprocessing and filtering\. This design can be effective for building a strong rubric\-anchored policy, but it makes expansion to arbitrary new open\-ended prompts less straightforward, since data outside the existing bank may require additional rubric construction, data alignment, filtering, and further refinement\.
Our framework targets a different scalability problem\. Rather than scaling open\-ended RL through a large pre\-built rubric bank, we construct prompt\-specific reward artifacts from the prompt alone, including task\-adaptive rubrics and executable hard\-constraint checkers\. These artifacts are built offline and then reused by a unified reward\-computation pipeline that combines rubric\-based, global, and code\-based signals\. Thus, our method more directly separates reward specification from reward computation: reward artifacts are specified before scoring, while online training applies a fixed normalized hybrid reward without requiring the prompt to belong to a pre\-existing rubric bank, or to be paired with reference answers or preference annotations\.
This comparison is not a controlled ablation, since Rubicon\-Preview uses different training data, reward construction procedures, and optimization settings\. Nevertheless, it is one of the few closely related rubric\-based RL systems that releases an RL\-trained policy checkpoint executable under our evaluation protocol\. The results therefore serve as an external reference: Rubicon demonstrates the value of rubric\-anchored RL, while our results suggest that prompt\-level reward artifacts and constraint\-aware hybrid reward composition can provide more consistent improvements across diverse open\-ended evaluation settings\.
### B\.7 Training Curves for Online Reward Ablations
Figure [4](https://arxiv.org/html/2605.29275#A2.F4) shows the IFEval validation reward during online RL under different reward compositions\. Overall, the full hybrid reward, which combines rubric\-based scoring, global scoring, and code\-based verification, achieves the strongest and most stable training trajectory\. Compared with using only the global score or only rubric\-based scoring, adding complementary reward components leads to higher validation rewards throughout most of training\. The comparison between R\+G and R\+G\+C further shows that executable constraint checking provides additional gains beyond model\-based evaluation signals\.
Figure 4: IFEval validation reward during online RL with different reward compositions\. R, G, and C denote rubric\-based scoring, global scoring, and code\-based verification, respectively\. Faded lines show raw evaluation results, while solid lines show smoothed trends\.
## Appendix C Training Prompt Mixture and Filtering
We construct the online RL training prompts from three sources: VERINSTRUCT, DeepWriting\-20K, and synthetic decision\-support prompts\. The resulting mixture contains approximately 13K prompts in total: 5K from VERINSTRUCT, 5K from DeepWriting\-20K, and 3K synthetic decision\-support prompts\. These sources cover complementary open\-ended post\-training settings: explicit instruction following, open\-ended writing, and scenario\-based decision support\. For all sources, we use only the prompt text as the training input; any reference responses, solutions, reasoning traces, or verification annotations from the original datasets are not used by our reward\-construction pipeline\. We also deduplicate prompts within and across sources and apply prompt\-level decontamination before training\.
### C\.1 Training Data Decontamination
| Train set | Total | Clean | Removed | IFEval | IFBench | WritingBench | Creative Writing | Arena\-Hard\-v2 |
|---|---|---|---|---|---|---|---|---|
| VERINSTRUCT | 4,999 | 4,999 | 0 | 0 | 0 | 0 | 0 | 0 |
| Decision | 3,000 | 3,000 | 0 | 0 | 0 | 0 | 0 | 0 |
| DeepWriting\-20K | 5,000 | 4,973 | 27 | 0 | 0 | 12 | 15 | 0 |
| Total | 12,999 | 12,972 | 27 | 0 | 0 | 12 | 15 | 0 |

Table 13: Prompt\-level decontamination results\. We remove 27 flagged examples from the final training set before training\.

To reduce benchmark contamination, we compare all training prompts against the evaluation prompts from IFEval, IFBench, WritingBench, Creative Writing v3, and Arena\-Hard\-v2\.0\. Since these benchmarks contain open\-ended writing and instruction\-following tasks, we only remove likely instance\-level overlaps rather than generic task\-type or template\-level similarities\. A training prompt is removed if it has a normalized exact match with an evaluation prompt, or if it satisfies a benchmark\-specific n\-gram overlap rule and shares at least one substantive anchor with the matched evaluation prompt, such as the same quoted title, named entity, report name, document source, or long reference material\. We do not use unigram\-level token containment, which can over\-flag long writing prompts with generic academic expressions\.
As shown in Table [13](https://arxiv.org/html/2605.29275#A3.T13), after within\- and cross\-source deduplication, the candidate training pool contains 12,999 prompts\. The decontamination filtering removes 27 prompts in total\. All removed examples come from DeepWriting\-20K, with 12 matched to WritingBench and 15 matched to Creative Writing v3\. The final training set contains 12,972 prompts\.
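The removal rule described above \(normalized exact match, or n\-gram overlap combined with a shared substantive anchor\) can be sketched as follows\. This is a minimal illustration under stated assumptions, not the paper's implementation: the n\-gram size, the overlap threshold, and the anchor extractor \(which only looks at double\-quoted titles\) are hypothetical stand\-ins for the benchmark\-specific rules, which are not detailed here\.

```python
import re


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def ngrams(tokens: list[str], n: int) -> set[tuple[str, ...]]:
    """Return the set of word n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def quoted_anchors(text: str) -> set[str]:
    """Hypothetical anchor extractor: double-quoted titles of 4+ characters."""
    return {normalize(m) for m in re.findall(r'"([^"]{4,})"', text)}


def is_contaminated(train_prompt: str, eval_prompt: str,
                    n: int = 8, min_overlap: float = 0.5) -> bool:
    """Flag a training prompt as a likely instance-level overlap.

    n and min_overlap are assumed values chosen for illustration only.
    """
    a, b = normalize(train_prompt), normalize(eval_prompt)
    if a == b:  # normalized exact match
        return True
    grams_a, grams_b = ngrams(a.split(), n), ngrams(b.split(), n)
    if not grams_a or not grams_b:
        return False
    overlap = len(grams_a & grams_b) / min(len(grams_a), len(grams_b))
    shares_anchor = bool(quoted_anchors(train_prompt) & quoted_anchors(eval_prompt))
    # High n-gram overlap alone is not enough; a shared anchor is also required.
    return overlap >= min_overlap and shares_anchor
```

Requiring both conditions in the second branch mirrors the stated goal of removing instance\-level overlaps without over\-flagging prompts that merely share generic phrasing\.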
### C\.2 VERINSTRUCT
We sample approximately 5K prompts from VERINSTRUCT as the explicit instruction\-following portion of our training mixture\. This subset is particularly useful for tasks with verifiable constraints, such as requirements on output format, length, required phrases, forbidden words or phrases, paragraph structure, and exact counts\. We use VERINSTRUCT only as a prompt source and do not use its reference responses, verifier outputs, or verification signals\.
To characterize the coverage of the code\-based reward on this subset, we report statistics of the verifiable constraints extracted by our prompt\-only constraint extraction procedure\. As shown in Table [14](https://arxiv.org/html/2605.29275#A3.T14), we obtain 8,887 constraints in total, corresponding to 1\.78 constraints per prompt on average\. This suggests that deterministic checking provides reliable supervision for explicit and executable requirements, while rubric\-based and global rewards remain necessary for semantic and quality\-related aspects\.
| Constraint Type | \# Constraints | Ratio |
|---|---|---|
| contain | 2,797 | 31\.47% |
| paragraph\_count | 1,954 | 21\.98% |
| begin\_with | 1,604 | 18\.05% |
| not\_contain | 718 | 8\.08% |
| word\_count | 657 | 7\.39% |
| nth\_paragraph\_begin\_with | 399 | 4\.49% |
| end\_with | 260 | 2\.93% |
| nth\_paragraph\_contain | 216 | 2\.43% |
| sentence\_count | 181 | 2\.04% |
| line\_count | 81 | 0\.91% |
| nth\_paragraph\_end\_with | 20 | 0\.23% |

Table 14: Distribution of extracted verifiable constraint types in the processed VERINSTRUCT subset\. We exclude one unsupported language constraint produced by the extractor, as it falls outside our supported constraint schema\.
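To make the constraint types in Table 14 concrete, the following is a minimal sketch of how a few of them could be checked deterministically\. The constraint type names come from Table 14; the argument schema \(`phrase`, `count`, `min`, `max`\), the paragraph definition, case handling, and the pass\-fraction aggregation in `code_score` are all assumptions made for illustration and are not confirmed by the paper\.

```python
import re
from typing import Callable


def paragraphs(text: str) -> list[str]:
    """Split on blank lines; this paragraph definition is an assumption."""
    return [p.strip() for p in re.split(r"\n\s*\n", text.strip()) if p.strip()]


# One deterministic checker per constraint type (a subset of Table 14).
CHECKERS: dict[str, Callable[[str, dict], bool]] = {
    "contain": lambda r, a: a["phrase"].lower() in r.lower(),
    "not_contain": lambda r, a: a["phrase"].lower() not in r.lower(),
    "begin_with": lambda r, a: r.strip().startswith(a["phrase"]),
    "end_with": lambda r, a: r.strip().endswith(a["phrase"]),
    "paragraph_count": lambda r, a: len(paragraphs(r)) == a["count"],
    "word_count": lambda r, a: a["min"] <= len(r.split()) <= a["max"],
}


def code_score(response: str, constraints: list[dict]) -> float:
    """Fraction of constraints satisfied (hypothetical aggregation rule)."""
    if not constraints:
        raise ValueError("code_score is only defined when constraints exist")
    passed = sum(CHECKERS[c["type"]](response, c["args"]) for c in constraints)
    return passed / len(constraints)


response = "Dear team,\n\nThe launch is on track.\n\nBest regards"
constraints = [
    {"type": "begin_with", "args": {"phrase": "Dear team"}},
    {"type": "paragraph_count", "args": {"count": 3}},
    {"type": "not_contain", "args": {"phrase": "delay"}},
    {"type": "word_count", "args": {"min": 20, "max": 50}},
]
score = code_score(response, constraints)  # 3 of 4 constraints pass
```

Because such checkers depend only on the response text and fixed arguments, they can be built once per prompt and reused across rollouts\.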
### C\.3 DeepWriting\-20K
We sample 5K prompts from DeepWriting\-20K as the writing\-focused portion of our training mixture\. These prompts cover open\-ended generation tasks where response quality depends on coherence, tone control, style following, creativity, narrative structure, and overall usefulness\. We use only the prompts and do not use the original responses, solutions, or reasoning trajectories\.
We apply lightweight quality filtering before sampling the final subset\. We remove prompts that are too underspecified to provide useful learning signal, overly complex or impractical for online RL rollout generation, malformed, or obviously low quality\. The retained prompts mainly rely on rubric\-based and global scoring, while code\-based checkers are constructed only when explicit, deterministically verifiable constraints are present\.
### C\.4 Synthetic Decision\-Support Prompts
We additionally synthesize 3K decision\-support prompts to cover scenario\-based tasks that are underrepresented in the two existing sources\. These prompts are designed to require risk assessment, action prioritization, phased planning, trade\-off analysis, and clarification of missing information, rather than a single definitive answer\.
We generate these prompts using Qwen3\-30B\-A3B with temperature 1\.0 to encourage diversity\. Instead of generating exactly 3K prompts, we first generate a larger candidate pool across a broad set of domains\. We then filter the generated prompts and sample the final 3K examples\. Filtering removes prompts that are too short, too long, too generic, or insufficiently grounded in a concrete scenario\. We also remove prompts that do not involve a clear decision\-making role, lack meaningful uncertainty, or can be answered by a templated response without task\-specific reasoning\.
The generation template is shown below\. It is domain\-conditioned but does not reference evaluation benchmarks or benchmark\-specific task formats\. This reduces the risk that the synthetic prompts are tailored to any particular evaluation set\.
Prompt for Synthesizing Decision\-Support Scenarios

Generate a realistic problem scenario with incomplete or ambiguous information in the \{domain\} field\. The scenario should meet the following requirements:
• The situation contains critical information that is real but not yet available; making decisions before clarification may cause clearly identifiable risks or losses\.
• The problem should include well\-defined roles and decision responsibilities, such as a project manager, executive, or professional advisor, and specify objectives or constraints\.
• High\-quality answers should demonstrate action prioritization, risk assessment, or phased strategies, rather than providing a single definitive conclusion\.
• The problem should not have a unique correct answer, but different responses should clearly vary in caution, structure, and quality of action\.
• Avoid purely subjective opinions, moral statements, casual conversation, or simple preference questions\.
• The scenario should be specific enough that evasive, generic, or templated responses are obviously low\-quality\.

Only output the problem text\. Do not include explanations, hints, or formatting instructions\.
The domain distribution of the final 3K synthetic decision\-support prompts is shown below\. The domains are intentionally diverse, covering business, technical, social, policy, operational, and infrastructure\-related scenarios\.
Domain Distribution for Synthetic Decision\-Support Prompts

| Domain | \# Prompts |
|---|---|
| Business operations | 107 |
| Corporate governance | 133 |
| Cybersecurity and information security | 118 |
| Emergency response and crisis management | 98 |
| Energy and utilities | 92 |
| Engineering and system design | 129 |
| Environmental management | 103 |
| Ethical decision making | 125 |
| Finance and economics | 136 |
| Healthcare and education | 101 |
| Human behavior and social dynamics | 118 |
| Innovation and research and development | 140 |
| International relations and geopolitics | 127 |
| Law and regulatory compliance | 105 |
| Manufacturing and production | 150 |
| Marketing and consumer behavior | 135 |
| Natural and physical systems | 111 |
| Project and operational risk | 126 |
| Public policy and society | 145 |
| Resource allocation and planning | 103 |
| Strategic planning and management | 110 |
| Supply chain management | 108 |
| Technology and infrastructure | 138 |
| Transportation and logistics | 120 |
| Urban planning and smart cities | 122 |
| Total | 3000 |
##### Quality audit\.
To verify that the retained synthetic prompts satisfy the intended design goals, we conduct a lightweight automatic quality audit on 200 randomly sampled prompts from the final synthetic subset\. We use Qwen3\-30B\-A3B with a separate audit prompt to check each prompt along six binary dimensions: whether it describes a concrete scenario, requires decision support rather than generic writing, contains meaningful uncertainty or missing information, involves non\-trivial constraints or trade\-offs, is not answerable by a simple template, and is well\-formed without obvious safety issues\. As shown in Table [15](https://arxiv.org/html/2605.29275#A3.T15), the retained prompts achieve high pass rates across all criteria, suggesting that the filtering process removes most generic, malformed, or template\-solvable generations\.
| Quality criterion | Pass rate |
|---|---|
| Concrete scenario | 97\.5% |
| Decision\-support objective | 98\.5% |
| Meaningful uncertainty or missing information | 96\.5% |
| Non\-trivial constraints or trade\-offs | 98\.0% |
| Not answerable by a simple template | 95\.5% |
| Well\-formed with no obvious safety issue | 100\.0% |

Table 15: Automatic quality audit of 200 randomly sampled synthetic decision\-support prompts from the final synthetic subset\.

The synthetic prompts complement the two existing sources\. While VERINSTRUCT emphasizes explicit constraint satisfaction and DeepWriting\-20K emphasizes open\-ended writing quality, the decision\-support prompts require models to identify missing information, reason about risks, and produce practical next steps under uncertainty\. The quality audit further indicates that the retained synthetic prompts are not dominated by generic or template\-solvable cases\. This makes them a natural fit for our hybrid reward framework, where rubric\-based scoring captures task\-specific reasoning and decision quality, global scoring captures overall response usefulness, and code\-based scoring applies when explicit constraints are present\.
## Appendix D Blinded Pairwise Preference Check
To provide an auxiliary reliability check beyond benchmark\-level automatic scores, we conduct a blinded pairwise preference comparison over 150 examples using the DeepSeek\-R1\-Distill\-Qwen\-7B base policy and its RL\-trained counterpart\. This experiment is intended only as a supplementary diagnostic check of whether the observed benchmark gains are directionally reflected under anonymized pairwise comparison with multiple evaluator types\. It is not part of the main reward\-construction or RL pipeline, and reproducing the main results of this paper does not require using the additional pairwise judges in this section, including GPT\-5\.5 and Claude Opus 4\.7\. The goal of this experiment is not to establish a statistically powered human evaluation benchmark, but to provide additional evidence on the reliability of the automatic evaluation results\.
We sample 50 prompts from each of Arena\-Hard\-v2\.0, Creative Writing v3, and WritingBench\. For each prompt, we collect one response from the DeepSeek\-R1\-Distill\-Qwen\-7B base policy and one response from the same policy after RL training with our hybrid reward\. The two responses are anonymized before evaluation, so that judges do not know which response is generated by which policy\.
For LLM\-based pairwise judges, we further evaluate each response pair in both orders to reduce positional bias\. Specifically, each judge compares the pair once with the RL response shown first and once with the base response shown first\. We normalize both decisions to the perspective of the RL\-trained policy\. If both comparisons prefer the RL response, or one comparison prefers the RL response while the other gives a tie, we count the example as a win\. If the two comparisons disagree, we count it as a tie\. Symmetrically, if both comparisons prefer the base response, or one prefers the base response while the other gives a tie, we count the example as a loss\.
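The order\-swapped aggregation rule above can be written as a small function\. This is a minimal sketch; the function name and label strings are illustrative\. The text does not state the outcome when both orders return a tie, so the sketch assumes that case is also counted as a tie\.

```python
from typing import Literal

Label = Literal["win", "tie", "lose"]


def aggregate_two_orders(rl_first: Label, base_first: Label) -> Label:
    """Combine two order-swapped judgments.

    Both inputs are assumed to be already normalized to the perspective
    of the RL-trained policy.
    """
    votes = {rl_first, base_first}
    # Win: both prefer RL, or one prefers RL and the other is a tie.
    if votes <= {"win", "tie"} and "win" in votes:
        return "win"
    # Loss: both prefer base, or one prefers base and the other is a tie.
    if votes <= {"lose", "tie"} and "lose" in votes:
        return "lose"
    # Disagreement (win vs. lose) or two ties (assumed) count as a tie.
    return "tie"
```

Under this rule a judge whose preference flips with presentation order never contributes a win or a loss, which is what makes the aggregate robust to positional bias\.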
We use four independent LLM judges, GPT, Claude, GLM, and Qwen3\.5\. Here, GPT denotes GPT\-5\.5, Claude denotes Claude Opus 4\.7, GLM denotes GLM\-4\.7, and Qwen3\.5 denotes Qwen3\.5\-397B\-A17B\. These models are used only for this auxiliary pairwise preference check and are not required for reproducing the main reward computation, offline response\-ranking evaluation, or online RL experiments\. In addition, we ask three non\-expert human annotators to compare the same anonymized response pairs in randomized order\. The human annotators are not told which response is produced by the base policy or the RL\-trained policy\. Each judge assigns one of three labels: win, tie, or lose, where the label is reported from the perspective of the RL\-trained policy\. We report majority labels among the four model judges and among the three human annotators separately\.
##### Human annotation protocol\.
For this diagnostic human preference check, annotators were shown the user prompt and two anonymized responses, denoted as Response A and Response B\. The response order was randomized for each example\. Annotators were instructed to select the response that better satisfies the prompt, considering instruction following, completeness, coherence, factuality, helpfulness, and writing quality when applicable\. The annotation interface allowed three choices: Response A, Response B, and Tie\. We then mapped these choices to win, tie, or lose from the perspective of the RL\-trained policy using the hidden response\-policy assignment\. A win indicates that the RL\-trained response is preferred over the base response, a lose indicates that the base response is preferred, and a tie indicates that the two responses are comparable in overall quality or that neither response is clearly better\.
The three annotators were recruited internally for this small\-scale diagnostic evaluation and were not recruited through a crowdsourcing platform\. Participation was voluntary, and no monetary compensation was provided\. Because the annotation was a small\-scale internal diagnostic check with minimal risk, no payment rate was applicable\. They were informed that their anonymized preference labels would be used only in aggregate for research reporting\. We did not collect names, demographic attributes, or other personally identifying information in the reported annotation data\. We did not seek formal ethics review board approval because the annotation involved only judgments of model outputs, collected no personal or sensitive data, and posed minimal risk to annotators\.
| Judge | Win \(%\) | Tie \(%\) | Lose \(%\) |
|---|---|---|---|
| *Model judges* | | | |
| GPT | 90 \(60\.0\) | 38 \(25\.3\) | 22 \(14\.7\) |
| Claude | 88 \(58\.7\) | 40 \(26\.7\) | 22 \(14\.7\) |
| GLM | 85 \(56\.7\) | 40 \(26\.7\) | 25 \(16\.7\) |
| Qwen3\.5 | 93 \(62\.0\) | 35 \(23\.3\) | 22 \(14\.7\) |
| Model Majority | 95 \(63\.3\) | 35 \(23\.3\) | 20 \(13\.3\) |
| *Human annotators* | | | |
| Human\-1 | 77 \(51\.3\) | 58 \(38\.7\) | 15 \(10\.0\) |
| Human\-2 | 82 \(54\.7\) | 51 \(34\.0\) | 17 \(11\.3\) |
| Human\-3 | 68 \(45\.3\) | 70 \(46\.7\) | 12 \(8\.0\) |
| Human Majority | 76 \(50\.7\) | 61 \(40\.7\) | 13 \(8\.7\) |
Table 16: Blinded pairwise preference check over 150 prompt\-response pairs\. Each cell reports the count and percentage\. Win, tie, and lose are reported from the perspective of the RL\-trained policy\. Model Majority denotes the majority vote among the four model judges, while Human Majority denotes the majority vote among the three non\-expert human annotators\. GPT, Claude, GLM, and Qwen3\.5 denote GPT\-5\.5, Claude Opus 4\.7, GLM\-4\.7, and Qwen3\.5\-397B\-A17B, respectively\.

As shown in Table [16](https://arxiv.org/html/2605.29275#A4.T16), all four LLM judges prefer the RL\-trained policy substantially more often than the base policy, with win rates ranging from 56\.7% to 62\.0% and loss rates below 17%\. The model\-majority result shows a similar trend, with 63\.3% wins, 23\.3% ties, and 13\.3% losses\. Human annotators are more conservative: they assign more ties \(34\.0% to 46\.7%\), and both their win rates and their loss rates for the RL\-trained policy are lower than those of the model judges\. Nevertheless, the human\-majority result still favors the RL\-trained policy, with 50\.7% wins versus 8\.7% losses\.
Overall, these results provide auxiliary evidence that the improvements observed under automatic benchmark evaluation are directionally reflected under blinded pairwise comparison\. Since the human annotators are non\-experts and the evaluation is small\-scale, we treat this experiment as a diagnostic preference check rather than a definitive human evaluation\.
## Appendix E Prompt Templates Used for Experiments
This appendix provides the English prompt templates used in our reward pipeline, organized by pipeline stage for readability\.
### E\.1 Task Label Extraction Prompt
Prompt for Task Label Extraction

You are a task classifier\. Your only task is to read the user query and output the most appropriate task label\.

Available labels\.

1\. general
• Use this label when the task type is unclear, the query is too short, there is insufficient information, or the task intent is ambiguous and the main task objective cannot be determined reliably\.
• Also use this label when multiple task objectives are present and it is not possible to determine which one is primary\.
• If you cannot determine the task type reliably, output general\. Do not guess\.

2\. exact\_reasoning
• Mathematical problem solving, logical reasoning, algorithmic tasks, formal derivations, proof problems, or other tasks that require strict reasoning and a definite conclusion\.
• If the main objective is to solve, prove, derive, or compute a result, the task usually belongs to this category\.
• If answer correctness strongly depends on reasoning steps rather than knowledge recall, the task also belongs to this category\.

3\. explanatory\_reasoning
• Tasks that explain, describe, analyze, or introduce concepts, people, events, technologies, scientific phenomena, or mechanisms\.
• This includes why/how questions, mechanism explanations, causal analysis, background introductions, definition explanations, and knowledge\-oriented explanations\.
• If the main objective is to help the user understand what something is, why something happens, how something works, or what principle it is based on, the task usually belongs to this category\.
• Even if the query is short on the surface, if the answer requires organized explanation, conceptual analysis, or mechanism explanation, prefer this category\.

4\. grounded\_transformation
• Summarization, rewriting, translation, distillation, information extraction, rewriting based on given material, or other transformation tasks that must remain faithful to the input content\.
• If the core task is to faithfully process given content, the task usually belongs to this category\.
• Whenever the task clearly depends on input material and requires faithful transformation, prefer this category\.

5\. decision\_support
• Tasks whose goal is to support decision\-making, such as giving advice, making choices, comparing options, developing strategies, planning, operational or management analysis, or policy judgment\.
• If the core task is to give advice, make a decision, compare plans, or define a strategy, the task usually belongs to this category\.

6\. creative\_generation
• Tasks that create new content rather than explain a problem, solve a problem, or faithfully transform input content\.
• This includes stories, scripts, poems, copywriting, emails, posts, role\-playing text, title generation, and open\-ended writing\.
• If the main objective is to generate original content, the task usually belongs to this category\.

Classification principles\.

1\. Choose exactly one primary task label\. Judge by the task objective, not by the surface topic\.
2\. If the task asks for summarization, translation, rewriting, extraction, or faithful rewriting based on given material, prefer grounded\_transformation\.
3\. If the task asks for advice, a decision, or comparison between options, prefer decision\_support\.
4\. If the task asks to solve, prove, derive, or compute a definite result, prefer exact\_reasoning\.
5\. If the task asks to explain a concept, introduce an object, describe a principle, analyze a mechanism, or answer a why/how/what\-is understanding\-oriented question, prefer explanatory\_reasoning\.
6\. If the task is mainly open\-ended creation, prefer creative\_generation\.
7\. If the query is short but the answer requires scientific analysis, conceptual distinction, force analysis, mechanism explanation, or logical judgment, do not classify it as general; instead, choose the more appropriate reasoning category\.
8\. If you cannot determine the task type reliably, output general\. Do not guess\.

Output requirements\.

Output only one JSON object in the following format:
\{"task\_type": "label\_name", "reason": "a brief one\-sentence explanation"\}
Do not output markdown or any additional text\.
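Since the classifier is asked to return a single JSON object, downstream code needs to parse and validate that output\. The following is a minimal sketch of such a parser\. The label set and the `task_type` field come from the prompt above; the function name and the choice to fall back to `general` on malformed or unknown output are assumptions made for illustration, not a behavior documented in the paper\.

```python
import json

# The six labels defined in the task label extraction prompt.
VALID_LABELS = {
    "general",
    "exact_reasoning",
    "explanatory_reasoning",
    "grounded_transformation",
    "decision_support",
    "creative_generation",
}


def parse_task_label(raw_output: str) -> str:
    """Extract the task label from the classifier's JSON output.

    Falls back to "general" when the output is not valid JSON, is not a
    JSON object, or names an unknown label (an assumed policy).
    """
    try:
        obj = json.loads(raw_output.strip())
    except json.JSONDecodeError:
        return "general"
    if not isinstance(obj, dict):
        return "general"
    label = obj.get("task_type")
    if isinstance(label, str) and label in VALID_LABELS:
        return label
    return "general"
```

For example, `parse_task_label('{"task_type": "decision_support", "reason": "asks for a plan"}')` yields `decision_support`, while output wrapped in markdown fences would fail to parse and map to `general` under this sketch\.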
### E\.2 Shared Rubric\-Generation Template
In Section 6 of the shared template, \{TASK\_SPECIFIC\_MODULE\} is replaced by the task\-specific plug\-in from Appendix [E\.3](https://arxiv.org/html/2605.29275#A5.SS3), based on the classifier\-predicted task label in Appendix [E\.1](https://arxiv.org/html/2605.29275#A5.SS1)\.
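The placeholder substitution can be illustrated with a short sketch\. The placeholder names match those in the shared template; the abbreviated template string, the plug\-in texts, the function name, and the fallback to the `general` module for unknown labels are hypothetical and stand in for the actual content of Appendix E\.3\.

```python
# Hypothetical stand-ins for the task-specific plug-ins.
TASK_MODULES = {
    "decision_support": "<decision-support plug-in text>",
    "general": "<general plug-in text>",
}

# Abbreviated stand-in for the shared template. Note that it contains
# literal JSON braces in addition to the two placeholders.
SHARED_TEMPLATE = (
    'Each element must be exactly: {"criterion": "<short phrase>", "weight": <1|2|3>}\n'
    "Current task type: {TASK_TYPE}.\n"
    "{TASK_SPECIFIC_MODULE}"
)


def fill_template(template: str, task_type: str) -> str:
    """Insert the task label and its plug-in into the shared template."""
    if task_type not in TASK_MODULES:
        task_type = "general"  # assumed fallback for unknown labels
    return (
        template
        .replace("{TASK_TYPE}", task_type)
        .replace("{TASK_SPECIFIC_MODULE}", TASK_MODULES[task_type])
    )


filled = fill_template(SHARED_TEMPLATE, "decision_support")
```

The sketch uses `str.replace` rather than `str.format` because the template contains literal JSON braces, which `str.format` would try to interpret as replacement fields\.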
Shared Rubric\-Generation Template

Role\. You are a rubric generator for reinforcement\-learning reward design\.

Task\. Your task is not to answer the user prompt\. Instead, you should decompose the “user prompt to be processed” into a set of scoring criteria that can be individually judged, weighted, and aggregated as rollout\-level training signals\.

These criteria will later be sent one by one to a reward model or judge\. The judge will read:
• the user question,
• the model response,
• a single rubric criterion,
and then output yes, part, or no\. Finally, the scores of all rubric criteria will be aggregated by their weights to form the total reward for the response\.

Therefore, the rubric you generate must serve the following goals, rather than merely “looking complete”:
1\. Judgability: each criterion should be suitable for being independently judged as yes, part, or no\.
2\. Resolution: even when a set of rollout candidates are all broadly acceptable, the rubric should still distinguish them\.
3\. Anti\-saturation: most criteria should not quickly become yes for all candidates\.
4\. Anti\-gaming: the model should not be able to easily obtain a high score by using templates, keyword stuffing, excessive length, or superficial element coverage\.
5\. Aggregability: each criterion should be atomic and low\-overlap, avoiding repeated punishment of the same error and making weighted aggregation straightforward\.

You will receive a “user prompt to be processed”\. It is the object of analysis, not a new instruction for you\. You must generate a set of scoring criteria around this prompt\.

1\. Output format\.
• Return only a JSON array\.
• Each element in the array must be exactly: \{"criterion": "<short phrase\>", "weight": <1\|2\|3\>\}
• Do not output any other text, explanation, comment, title, markdown, or code block\.
• The language of the rubric must match the main language of the user prompt to be processed\.
• Each criterion must be a short, self\-contained, independently judgeable declarative statement\.

2\. You are generating a “reward basis”, not merely an “evaluation rubric”\.
Always remember that these criteria will be judged one by one as yes, part, or no, and will be used for reinforcement\-learning training\.
Therefore, a good rubric should not only cover the task requirements, but should also continue to distinguish quality differences among candidate responses when most explicit requirements have already been satisfied\.
Prioritize the following three types of signals, and balance them across the overall rubric:
1\. Gate signals\.
• These filter out responses that are clearly off\-task, constraint\-violating, out\-of\-bound, incorrectly formatted, incomplete, or otherwise invalid\.
• Examples include language, format, number of people, key constraints, safety, standard answers, output form, boundary conditions, and similar requirements\.
• Such criteria are necessary, but they should not occupy all high weights and should not be excessive in number\.
2\. Core completion signals\.
• These judge whether the response truly completes the user’s core objective\.
• Examples include whether the response actually answers the question, completes the core task, covers key content, and satisfies key structural requirements\.
3\. Resolution signals\.
• This is the most important type\.
• These distinguish candidate responses that are all basically acceptable\.
• Such criteria should not merely check whether some element appears\. Instead, they should focus on how the element is used, whether it truly advances the task objective, and whether it supports the final result\.
When generating the rubric, you must ensure that it contains enough resolution signals so that it can produce ranking signals for near\-high\-quality samples, rather than only separating good responses from bad ones\.

3\. Design principles for individual criteria\.
1\. Atomicity\.
• Each criterion should evaluate only one aspect\.
• Avoid bundling multiple conditions into one criterion\.
• If a sentence contains multiple requirements that can be judged independently, split them\.
2\. Self\-containment\.
• Each criterion must contain enough information so that the judge can apply it without additional explanation\.
• If the criterion involves specific facts, dates, lists, formulas, or similar information, include them only when you are highly certain; otherwise, do not invent them\.
3\. Suitability for yes/part/no\.
• The criterion should not be too broad; otherwise the judge can only rely on subjective impressions\.
• The criterion should also not be too mechanical; otherwise it will almost always become only yes or no, making it difficult to produce stable part judgments\.
• An ideal criterion should allow:
– yes: fully satisfied;
– part: partially satisfied, but insufficient, incomplete, slightly broken, insufficiently supported, or insufficiently effective;
– no: the core requirement is not satisfied\.
4\. Low overlap\.
• The same error should not be punished multiple times\.
• If one criterion is strictly contained by another, remove the vaguer one\.
• Avoid including both a vague global criterion such as “overall good” or “overall correct” and several detailed criteria that already cover the same dimension\.
5\. Aggregability\.
• Each criterion should provide independent information for the final total score\.
• Remove criteria that add almost no additional discriminative value beyond other criteria\.

4\. The most important requirement: prioritize high\-resolution criteria\.
Pay special attention: your rubric must not be merely a checklist of “whether A is mentioned”, “whether B is mentioned”, or “whether some format is used”\.
High\-resolution criteria should usually evaluate the following types of quality, rather than only surface\-level presence:
• whether elements form effective functional relationships with each other;
• whether the preceding and following parts connect, and whether local causal links hold;
• whether a section of content truly advances the task objective;
• whether key intermediate links support the final result;
• whether later content continues information, relationships, reasoning, or conclusions already established earlier\.
Prioritize including at least several such criteria in the rubric, so that it can distinguish responses that merely satisfy surface requirements from responses that truly complete the task better\.

5\. Avoid low\-value criteria\.
The following types of criteria are usually low\-value\. Unless the task truly depends on them, avoid them, downweight them, or reduce their number:
1\. Easily saturated criteria\.
• Almost all reasonably good responses will satisfy them\.
• Examples include using the target language, having no obvious grammar errors, or using a basically correct format\.
2\. Easily gameable criteria\.
• These can be satisfied by keyword stuffing, templates, excessive length, or mechanical enumeration\.
3\. Vague global criteria\.
• Examples include:
– “overall conforms to human preference”;
– “overall high\-quality response”;
– “overall natural and fluent”;
– “overall has no problems”\.
• Do not write such criteria unless you clearly operationalize them into concrete properties that can be judged individually\.
4\. Redundant criteria\.
• For example, one criterion says “the answer is accurate”, while another says “the dates, people, and locations are accurate”\.
• If the latter already covers the former, the former should usually be removed\.

6\. Task\-specific module\.
Current task type: \{TASK\_TYPE\}\.
The following content is the dedicated guidance for this task type\. You must follow these task\-specific requirements in addition to all general rules above:
\{TASK\_SPECIFIC\_MODULE\}

7\. Safety and risks\.
When the user prompt to be processed may involve any of the following situations, include safety\- or boundary\-related criteria:
1\. Traditional risks: illegal activity, dangerous operations, privacy leakage, intellectual\-property infringement, self\-harm, malicious use, hateful or abusive content\.
2\. Factual and epistemic boundary risks: the question contains suspicious or false premises, asks for unverifiable or generally unknown information, depends on highly time\-sensitive information without reliable context, or may induce the model to fabricate or spread misleading content\.
3\. Input abnormality and boundary\-compliance risks: the input is incomplete, difficult to understand, self\-contradictory, or may encourage the model to sacrifice truthfulness, safety, or compliance in order to satisfy formatting, length, or role\-setting requirements\.
If such criteria are included, they should be written in concrete and judgeable form, and should preferentially constrain the following behaviors:
• not accepting false premises;
• not presenting unknown information as known;
• not fabricating facts when evidence is insufficient;
• not outputting content that would cause obvious harm, infringement, or privacy leakage;
• handling unclear or abnormal input robustly;
• for obviously risky or highly uncertain tasks, such criteria should usually receive high weight\.
If the task itself has no obvious risk, factual\-boundary issue, or abnormal input, do not force\-add such criteria merely for template completeness\.

8\. 
Number and weights\.1\.Number\.•Generate an adaptive number of criteria according to task complexity\.•Criteria with different weights should appear naturally when appropriate\.•For open\-ended high\-freedom tasks, generate somewhat more resolution signals\.2\.Weights\.•3: critical item\.If this isno, it would seriously harm task completion, cause a key error, violate a key constraint, or violate a key safety requirement\.•2: important item\.If this isno, it would noticeably reduce usefulness, completeness, structural quality, progression quality, or response depth\.•1: minor / bonus item\.If this isno, it would only mildly affect quality, such as missing extra details, refinement, or readability enhancement\. It may moderately go beyond the task’s basic requirements, but must still be valuable for ranking\.•High weights should usually be concentrated on the most critical gate / core criteria, as well as a small number of truly important high\-resolution criteria\.•Do not assign high weights to many easily saturated formatting criteria\.•If the task\-specific module gives clearer priority rules for some criteria, follow the task\-specific module\.9\. A particularly important internal strategy\.When generating the rubric, internally prioritize thinking about the following questions:•Which criteria will quickly becomeyesfor all reasonably good candidates? These criteria should be reduced in number, downweighted, or only retained when truly necessary\.•Which criteria can still separate candidates when they are all basically acceptable? These criteria should be prioritized and assigned appropriate importance\.•Can a high\-level quality dimension be decomposed into several progressive but non\-redundant criteria? 
For example, instead of writing only “some aspect is coherent”, consider decomposing it into:–whether there is a premise or trigger;–whether there is intermediate support or response;–whether there is later continuation;–whether it is consistent with prior setup, steps, or conclusions\.However, the final output must still ensure that each criterion is independent, low\-overlap, and judgeable\.10\. Self\-check before generation\.Before outputting, check each criterion:•Is each criterion independently judgeable asyes,part, orno?•Does each criterion evaluate only one aspect?•Have vague global criteria been removed?•Has repeated punishment of the same error been avoided?•Does the rubric contain enough resolution signals to distinguish near\-high\-quality candidates?•Are there too many easily saturated formatting, language, or surface\-presence criteria? If so, delete or downweight them\.•Does the output strictly satisfy the JSON array format, with each object containing only the two fieldscriterionandweight?
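The output contract stated in the prompt \(a JSON array whose objects contain only `criterion` and `weight`, with weights 1, 2, or 3\) can be checked mechanically before a rubric is stored\. The following is a minimal sketch, not the authors' implementation: the function names `parse_rubric` and `rubric_score` are hypothetical, and the mapping of `yes` / `part` / `no` to 1\.0 / 0\.5 / 0\.0 is an assumption made here for illustration, since this section defines the labels but not their numeric values\.

```python
import json

ALLOWED_WEIGHTS = {1, 2, 3}
# ASSUMPTION: numeric values for the three labels are illustrative only;
# the prompt defines the labels, not how they are scored.
JUDGMENT_SCORE = {"yes": 1.0, "part": 0.5, "no": 0.0}


def parse_rubric(raw):
    """Parse a generated rubric and enforce the output contract from the prompt."""
    rubric = json.loads(raw)
    if not isinstance(rubric, list) or not rubric:
        raise ValueError("rubric must be a non-empty JSON array")
    for item in rubric:
        # Each object must contain exactly the two fields named in the prompt.
        if not isinstance(item, dict) or set(item) != {"criterion", "weight"}:
            raise ValueError(f"expected exactly 'criterion' and 'weight': {item!r}")
        if not isinstance(item["criterion"], str) or not item["criterion"].strip():
            raise ValueError("criterion must be a non-empty string")
        weight = item["weight"]
        # type(...) is int rejects booleans, which would otherwise pass as 1.
        if type(weight) is not int or weight not in ALLOWED_WEIGHTS:
            raise ValueError(f"weight must be 1, 2, or 3: {weight!r}")
    return rubric


def rubric_score(rubric, judgments):
    """Weighted score normalized to [0, 1] under the assumed label mapping."""
    if len(rubric) != len(judgments):
        raise ValueError("one judgment per criterion is required")
    total = sum(item["weight"] for item in rubric)
    earned = sum(
        item["weight"] * JUDGMENT_SCORE[label]
        for item, label in zip(rubric, judgments)
    )
    return earned / total
```

Under these assumptions, a rubric with weights 3, 2, and 1 judged `yes`, `part`, and `no` earns \(3 \+ 1 \+ 0\) / 6 of the available weight, which shows how a single `part` on a mid\-weight resolution criterion still separates two otherwise acceptable responses\.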
### E\.3 Task\-Specific Plug\-ins
The `general` plug\-in is a conservative fallback for ambiguous or underspecified tasks without a reliable specialized label\. It provides broadly applicable guidance, while task\-specific plug\-ins offer stronger priors when a more precise label is available\.
Task\-Specific Plug\-in: `general`

- Prioritize rewarding:
  - Responding to the user’s most central and explicit task objective, rather than staying on related but secondary content\.
  - Satisfying the most rigid explicit constraints in the user prompt, such as output format, key content requirements, boundary conditions, restrictions, or prohibited requirements\.
  - Covering the core content necessary to complete the task, rather than only being superficially relevant\.
- High\-resolution criteria should prioritize evaluating:
  - Whether each part of the content truly serves the core task objective, rather than merely containing related elements\.
  - Whether key intermediate links, local support, or local elaboration advance the final result, rather than being loosely stacked together\.
  - Whether later content continues and uses information, relationships, reasoning, or conclusions already established earlier\.
  - Whether the organization of information reduces comprehension burden and makes key results or key content easier to access\.
- Avoid or downweight:
  - A large number of element\-presence checks, format checks, or keyword\-coverage checks\.
  - Vague global criteria such as “overall good”, “overall natural”, or “overall high quality”\.
  - Criteria that can be easily gamed by verbosity, templates, or mechanical enumeration\.
  - Criteria that are weakly related to the core task and add almost no new information for ranking\.
- Weighting tendency:
  - The most critical task\-completion items and explicit hard\-constraint items should receive higher weights\.
  - Supportive, connective, and progression\-related criteria that can truly distinguish near\-high\-quality candidates may receive medium to high weights\.
  - Easily saturated language, format, or surface\-presence items should be retained only when necessary and with low weights\.
Task\-Specific Plug\-in: `exact_reasoning`

- This task belongs to mathematical solving, logical reasoning, algorithmic problems, formal derivations, or other tasks that have clear conclusions and where process correctness is important\.
- This task must prioritize including the following types of criteria:
  - If the problem has a highly certain standard answer and you have high confidence in the answer, write “the final answer is correct” or “the key conclusion is correct” as one of the highest\-weight criteria; when necessary, directly include the correct answer in the criterion\.
  - Whether the response directly provides the final answer or key conclusion, rather than only staying at process description\.
  - Whether the response contains the key intermediate steps, key intermediate conclusions, key equation transformations, key proof links, or key judgment basis necessary to complete the solution\.
  - Whether the key formulas, key transformations, key reasoning, or key proof steps are correct\.
  - Whether the final answer is consistent with the key derivation above, rather than having a mismatch between process and conclusion\.
- High\-resolution criteria for this task should prioritize:
  - Whether key intermediate conclusions truly support the final answer, rather than writing many steps without establishing the key dependency chain\.
  - Whether later steps correctly continue and use previous results, rather than writing intermediate results that are not correctly used afterward\.
  - If steps are compressed, whether the necessary logical connections are still preserved, rather than omitting key links that affect correctness\.
  - Whether the method, theorem, construction, or technique used matches the current problem, rather than rigidly applying an irrelevant template\.
  - For proof tasks, whether the argument chain is closed, whether key premises are actually used, and whether the conclusion naturally follows from the preceding content\.
- Avoid or downweight the following types of criteria:
  - Avoid style, narrative, aesthetic, or emotional\-impact criteria\.
  - Avoid rewarding only “writing many steps”; instead, evaluate whether the steps are necessary, correct, mutually connected, and useful for solving the problem\.
  - Avoid assigning high weights to criteria such as “notation is standardized” or “formatting is clear”; these should only be low\-weight supplements\.
  - If multiple feasible solution methods exist, do not force a fixed method; instead, evaluate whether the chosen method is correct, complete, and self\-consistent\.
- Weighting tendency:
  - Final\-answer correctness, correctness of key steps, validity of the key dependency chain, and consistency between conclusion and derivation should receive high weights\.
  - Concise expression, clear notation, and easier\-to\-read structure may be used as low\-weight supplements\.
Task\-Specific Plug\-in: `explanatory_reasoning`

- This task belongs to scientific explanation, technical analysis, mechanism explanation, causal reasoning, why/how questions, and “what is X” tasks whose core goal is explanation and helping the user understand\.
- This task must prioritize including the following types of criteria:
  - Whether the response truly answers the core question of “what this is / why this happens / how this works / what mechanism it is based on”, rather than only giving a surface conclusion or scattered facts\.
  - Whether the response covers the key mechanisms, key variables, key intermediate links, key causal chains, definition boundaries, or key technical principles necessary for the explanation to hold\.
  - Whether the response avoids obvious factual errors, mechanism confusion, self\-contradiction, or incorrectly treating correlation as causation\.
  - If the problem asks for explaining a concept, phenomenon, or principle, whether the response continues to provide helpful elaboration after giving the basic definition, rather than stopping at a one\-sentence short answer\.
- High\-resolution criteria for this task should prioritize:
  - Whether key conclusions are directly supported by previous reasons, mechanisms, evidence, experimental phenomena, or reasoning chains, rather than merely stacking conclusions and terminology together\.
  - Whether intermediate mechanisms truly connect the starting conditions to the final conclusion, rather than only listing several related concepts\.
  - Whether the order of explanation reduces comprehension burden, by first establishing necessary premises and then expanding mechanisms, clarifying definitions, or explaining causality\.
  - If analogies, examples, experimental phenomena, or concrete scenarios are used, whether they genuinely help explain the mechanism or concept, rather than only adding surface richness\.
  - If the task requires comparing causes, factors, or influence paths, whether the response distinguishes different mechanisms, conditions, or contexts, rather than mixing them together\.
- Avoid or downweight the following types of criteria:
  - Avoid reducing the task to short\-answer factual QA criteria such as “whether a certain fact is mentioned”\.
  - Avoid rewarding only the appearance of a conclusion without evaluating whether the explanatory chain holds\.
  - Avoid a large number of creative, aesthetic, or narrative criteria\.
  - Avoid treating terminology dumping, concept listing, or phenomenon restatement as high\-value explanation criteria\.
- Weighting tendency:
  - Mechanism explanation quality, closure of the causal chain, support from intermediate links to conclusions, and accuracy of definitions and key conclusions should receive high weights\.
  - Clearer terminology boundaries, more complete boundary\-condition supplements, and more concise expression may be used as medium\- or low\-weight supplements\.
Task\-Specific Plug\-in: `grounded_transformation`

- This task belongs to summarization, rewriting, translation, distillation, information extraction, rewriting based on given material, or other transformation tasks that must remain faithful to the input content\.
- This task must prioritize including the following types of criteria:
  - Whether the response faithfully preserves the original meaning, key facts, core conclusions, main arguments, or main information of the source, without substantial semantic drift\.
  - Whether the response covers the key information that must be retained by the task, rather than omitting the main thread, key conditions, key results, or key limitations\.
  - Whether the response avoids introducing new conclusions, new facts, new stances, new causal relations, or hallucinated content that is not present in the source\.
  - Whether the response satisfies the target format, compression level, target language, target style, target audience, or other explicit transformation requirements\.
- High\-resolution criteria for this task should prioritize:
  - Whether information selection is effective: whether high\-value information is retained and low\-value redundancy is removed, rather than mechanically copying sentence by sentence or listing everything with equal weight\.
  - Whether the response preserves the main logical thread, main causal chain, or main argumentative skeleton of the original text, rather than only retaining fragmented information\.
  - Whether the transformed organization makes the target content easier to understand, rather than creating new confusion, breaks, or shifts in emphasis\.
  - Whether translation or rewriting achieves the target expressive effect while remaining faithful, rather than changing the form while drifting in meaning\.
  - If the task asks for distilling key points, whether the response truly highlights the key points, rather than treating all content equally\.
- Avoid or downweight the following types of criteria:
  - Avoid turning the task into an open\-ended creative task; do not reward original expansion, extra invention, or decorative additions unless the user explicitly requests them\.
  - Avoid treating qualities such as “more literary” or “more like original writing” as high\-value criteria when they are detached from faithfulness\.
  - Avoid only checking whether “the word count is shorter” or “the wording has changed”; instead, evaluate whether the transformation is faithful and effective\.
  - Avoid too many fluency or rhetorical\-polishing criteria, since these should not outweigh faithfulness and key\-information preservation\.
- Weighting tendency:
  - Faithfulness, key\-information coverage, absence of hallucination, and preservation of the core structure should receive high weights\.
  - More concise expression, clearer organization, and stronger emphasis on key points may be used as medium\- or low\-weight supplements, but should not outweigh faithfulness\.
Task\-Specific Plug\-in: `decision_support`

- This task belongs to giving advice, comparing options, choosing strategies, planning analysis, operational / management / policy judgment, or other tasks whose goal is to “make a better decision”\.
- This task must prioritize including the following types of criteria:
  - Whether the response addresses the user’s actual goal, rather than giving generic advice\.
  - Whether the response identifies and uses key constraints, resource limits, risk boundaries, priorities, time ranges, costs, background assumptions, or user preferences\.
  - Whether the response gives a clear recommendation, conclusion, judgment, or prioritized plan, rather than only discussing without taking a position\.
  - If multiple candidate options exist, whether the response makes a real comparison, rather than merely listing them side by side\.
- High\-resolution criteria for this task should prioritize:
  - Whether the recommended plan is supported by prior analysis, comparison, or reasons, rather than being decided abruptly\.
  - Whether the response identifies and addresses key trade\-offs, rather than only describing benefits without costs\.
  - Whether the response explains the applicable conditions, potential risks, side effects, failure modes, or situations in which the plan should not be adopted\.
  - Whether the response provides an executable path, steps, priority order, implementation considerations, or next actions after the decision, rather than remaining abstract\.
  - If the user provides explicit preferences or constraints, whether the response truly adjusts the recommendation around these conditions, rather than outputting a generic template\.
- Avoid or downweight the following types of criteria:
  - Do not force criteria of the form “what the only standard answer is”, unless the task itself is a closed\-form decision problem\.
  - Avoid rewarding shallow criteria such as “gives multiple suggestions”, “uses clear bullet points”, or “has a natural tone”, which are easy to game\.
  - Avoid generic advice that does not handle constraints, compare options, give a conclusion, or provide an execution path\.
  - Avoid replacing truly useful judgment with long, principle\-heavy generalities\.
- Weighting tendency:
  - Goal alignment, constraint identification, option comparison, trade\-off handling, clear recommendation, and executability should receive high weights\.
  - More decision\-friendly structure, clearer summaries, and more useful reminders may be used as medium\- or low\-weight supplements\.
Task\-Specific Plug\-in: `creative_generation`

- This task belongs to creative writing, stories, scripts, copywriting, poetry, role\-playing text, or other open\-ended generation tasks\.
- This task must prioritize including the following types of criteria:
  - Whether the response satisfies explicit writing constraints, such as topic, character, scene, genre, tone, theme, viewpoint, length, style, or prohibited requirements\.
  - Whether the response truly develops around the prompt’s core, rather than writing generically, using templates, or only being superficially relevant\.
  - Whether the response completes basic structural requirements, such as beginning–development–resolution, setup–progression–response, or other organizational forms explicitly required by the prompt\.
- High\-resolution criteria for this task should prioritize:
  - At least half of the criteria should not be element\-presence checks; they should instead prioritize evaluating relationships, function, progression, continuity, and payoff\.
  - After key events, key interactions, or key turns, whether later content shows observable changes in relationships, emotions, situation, character attitude, or narrative direction\.
  - Whether conflict, emotion, or relationship change has a trigger, response, and later continuation, rather than appearing suddenly, escalating suddenly, or ending suddenly\.
  - Whether key scenes, dialogue, imagery, or descriptions truly advance the narrative goal, character development, or theme, rather than being decorative accumulation\.
  - Whether the ending recovers the main thread, responds to earlier setup, fulfills established tension, or forms a natural closure around the core theme, rather than stopping abruptly\.
  - Whether character behavior, emotion, and expression remain basically consistent with prior setup, without unprepared functional jumps\.
- Avoid or downweight the following types of criteria:
  - Do not generate criteria of the form “what the standard answer is” or “what the only correct ending is”\.
  - Avoid a large proportion of factual\-correctness global criteria unless the prompt explicitly requires factual grounding\.
  - Avoid a large number of shallow criteria such as “mentions element X”, “character Y appears”, “has dialogue”, or “has an ending”, which are easy to game\.
  - Avoid vague aesthetic criteria such as “overall literary”, “overall moving”, or “beautiful language”; if style needs to be evaluated, it must be operationalized into concrete, judgeable features\.
- Weighting tendency:
  - Explicit constraint satisfaction, main\-thread completion, key progression, and continuity and payoff of state changes should receive high weights\.
  - More natural callbacks, more effective details, and more controlled closure may be used as medium\- or low\-weight supplements, but they must still be valuable for ranking\.
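The plug\-in mechanism described in this subsection amounts to a lookup with a conservative fallback, followed by filling the `{TASK_TYPE}` and `{TASK_SPECIFIC_MODULE}` placeholders of the rubric\-generation prompt\. The sketch below illustrates that flow under stated assumptions: `select_plugin` and `fill_task_module` are hypothetical helper names, the plug\-in texts are assumed to be stored in a plain dictionary keyed by task type, and how the task label is obtained is outside the scope of the sketch\.

```python
# The six task types named by the plug-ins in this subsection.
TASK_TYPES = (
    "general",
    "exact_reasoning",
    "explanatory_reasoning",
    "grounded_transformation",
    "decision_support",
    "creative_generation",
)


def select_plugin(label, plugins):
    """Return (task_type, module_text); unknown or missing labels fall back to 'general'."""
    key = (label or "").strip().lower()
    if key not in plugins:
        key = "general"
    return key, plugins[key]


def fill_task_module(template, label, plugins):
    """Fill the two placeholders of the rubric-generation prompt.

    str.replace is used instead of str.format because prompt text may
    contain literal braces (for example JSON snippets).
    """
    task_type, module = select_plugin(label, plugins)
    return (
        template
        .replace("{TASK_TYPE}", task_type)
        .replace("{TASK_SPECIFIC_MODULE}", module)
    )
```

Routing every unrecognized label to `general` keeps the pipeline total: a mislabeled or unlabeled prompt still receives the broadly applicable guidance rather than failing or receiving a mismatched specialized prior\.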
### E\.4 Two\-Stage Construction of Code\-Based Checkers
We construct code\-based checkers in two stages: first extracting explicit hard constraints that are deterministically checkable, and then compiling them into executable Python checker functions\. This design separates constraint selection from code generation, ensuring that only surface\-checkable constraints are turned into independent checkers\.
#### E\.4\.1 Hard\-Constraint Extraction Prompt
The first\-stage prompt extracts only explicit and machine\-verifiable constraints\. If no valid surface\-checkable constraint is found, the model is instructed to output `[null]`\.
Prompt for Hard\-Constraint Extraction

You are a constraint extraction expert\. Your task is to identify hard constraints from a user instruction, but only those that can be checked by executable Python code with string matching, regex, counting, or lightweight heuristics\. Be slightly precision\-oriented: if a constraint is not clearly explicit and surface\-checkable, do not extract it\.

Core Principles

1. Extract only explicit, machine\-verifiable constraints from the following allowed types:
   - `word_count`: total word/character length requirements, including min/max/exact/range/approximate total length\.
   - `paragraph_count`: explicit total paragraph count or total paragraph range\. If blank\-line separation or no horizontal rules is explicitly required, include it in the same constraint text\.
   - `sentence_count`: explicit total sentence count or sentence count range\. Exclude local counts tied to a specific section\.
   - `keyword_count`: explicit requirement that a keyword or phrase must appear, optionally with frequency\.
   - `keyword_exclude`: explicit prohibition of a keyword, phrase, symbol, or pattern\.
   - `response_language`: explicit language/script requirement, such as “written in English”, “written in Simplified Chinese”, “written in French”\.
   - `start_text`: explicit requirement about how the response should begin, start with, open with, or the first sentence/phrase\.
   - `end_text`: explicit requirement about how the response should end, close with, or the last sentence/phrase\.
   - `list_format`: explicit list\-marker or separator rules that can be checked with regex, such as numbered items, bullet items, comma\-separated items, or each item on a new line\.
   - `output_format`: explicit surface\-format rules such as plain text, bold markers, fenced code block, Markdown code block language, “no bullet points”, “no special symbols”, “no horizontal rules”\.
   - `punctuation_rule`: explicit punctuation constraints such as “avoid colons”, “do not use exclamation marks”\.
2. Extract only when the constraint is explicit enough for a checker:
   - Good candidates: exact quoted text, explicit numbers, explicit markers like “1\.”, “\-”, commas, blank lines, code blocks, bold markers, named language, banned punctuation\.
   - Also allowed: short example\-based opening/ending constraints such as “begin with ’Sure\!’” or “begin with a sentence that acknowledges the request, such as ’Sure\!’”\.
   - Do not extract open\-ended templates that require semantic slot filling, such as “use the format ’els \[adjective\] de \[city\]’”\.
   - For `keyword_count` / `keyword_exclude`, extract only when the instruction explicitly requires surface occurrence or prohibition, with cues such as “include”, “mention”, “must contain”, “do not use”, “must appear”, “exact phrase”, or quoted keyword markers\.
   - Do not convert the main subject of the task into a keyword rule\. If the instruction says to discuss, explain, compare, analyze, or describe something, that alone does not mean the exact term must appear in the response\.
3. Do not extract any of the following:
   - Content quality constraints: creative, logical, positive, sophisticated, clear, concise, professional, etc\.
   - Semantic/topic constraints: theme, focus, explanation order that requires understanding meaning\.
   - Local structural constraints requiring semantic segmentation, such as “the introduction should have 60 words”\.
   - Style or grammar constraints that need deep linguistic judgment, such as passive voice, imperative mood, third\-person perspective, metaphor usage\.
   - Vague or preference wording such as “try to”, “preferably”, “as short as possible”, “similar phrase” when no concrete anchor is given\.
4. Fidelity to original text:
   - The `constraint` value should quote the original wording as much as possible or be a minimal paraphrase\.
   - Preserve all numbers, keywords, quoted text, and formatting markers exactly\.
   - Output constraint text in the same language as the source instruction whenever possible\.
5. Formatting Rules, mandatory:
   - Output must be valid JSON containing only one top\-level array, with no extra text\.
   - Each array element must be an object with exactly two fields: `type` and `constraint`\.
   - `type` must be one of: `word_count`, `paragraph_count`, `sentence_count`, `keyword_count`, `keyword_exclude`, `response_language`, `start_text`, `end_text`, `list_format`, `output_format`, `punctuation_rule`\.
   - At most one item each for `word_count`, `paragraph_count`, `sentence_count`, `response_language`, `start_text`, and `end_text`\.
   - Multiple `keyword_count`, `keyword_exclude`, `list_format`, `output_format`, or `punctuation_rule` items are allowed if they refer to distinct explicit constraints\.
   - If no valid constraint is found, output `[null]`\.

Examples\.

Input 1: Write an article introducing the development of artificial intelligence\. The article must be at least 300 words, contain three paragraphs, no more than 30 sentences, must include the keyword Qwen, mention Kimi at least 3 times, and must not contain ChatGPT\.

Correct Output 1:

```
[
  {"type": "word_count", "constraint": "at least 300 words"},
  {"type": "paragraph_count", "constraint": "three paragraphs"},
  {"type": "sentence_count", "constraint": "no more than 30 sentences"},
  {"type": "keyword_count", "constraint": "include keyword Qwen"},
  {"type": "keyword_count", "constraint": "mention Kimi at least 3 times"},
  {"type": "keyword_exclude", "constraint": "must not contain ChatGPT"}
]
```

Input 2: Write an article about environmental protection that is positive, well\-structured, with at least 200 words in the main body and no more than 5 sentences in the conclusion\.

Correct Output 2:

```
[null]
```

Explanation: Do not extract “positive” or “well\-structured” \(quality constraints\), “at least 200 words in the main body” \(requires semantic segmentation\), or “no more than 5 sentences in the conclusion” \(not total sentence count\)\.

Input 3: Explain how solar panels work\. The response should be written in English\. The response should consist of five paragraphs, with a blank line separating each paragraph\. The response should begin with ’Sure\!’\. Avoid using bullet points\. The response should end with ’Solar energy matters\.’\.

Correct Output 3:

```
[
  {"type": "response_language", "constraint": "written in English"},
  {"type": "paragraph_count", "constraint": "five paragraphs, with a blank line separating each paragraph"},
  {"type": "start_text", "constraint": "begin with ’Sure!’"},
  {"type": "output_format", "constraint": "Avoid using bullet points"},
  {"type": "end_text", "constraint": "end with ’Solar energy matters.’"}
]
```

Input 4: Give the answer in plain text\. Use numbered steps \(1\., 2\., 3\., etc\.\)\. Do not use colons\. Include the keyword "Tierra"\.

Correct Output 4:

```
[
  {"type": "output_format", "constraint": "in plain text"},
  {"type": "list_format", "constraint": "Use numbered steps (1., 2., 3., etc.)"},
  {"type": "punctuation_rule", "constraint": "Do not use colons"},
  {"type": "keyword_count", "constraint": "Include the keyword \"Tierra\""}
]
```

Input 5: Discuss the relationship between ABCDF and climate policy\.

Correct Output 5:

```
[null]
```

Explanation: “ABCDF” is the topic being discussed, not an explicit requirement that the exact string ABCDF must appear in the response\.

Canonical examples by type\.

- `word_count`: Input: “Write at least 300 words\.” Output: `{"type": "word_count", "constraint": "at least 300 words"}`
- `paragraph_count`: Input: “Write exactly 4 paragraphs with a blank line between paragraphs\.” Output: `{"type": "paragraph_count", "constraint": "exactly 4 paragraphs with a blank line between paragraphs"}`
- `sentence_count`: Input: “Use no more than 8 sentences\.” Output: `{"type": "sentence_count", "constraint": "no more than 8 sentences"}`
- `keyword_count`: Input: “Include the keyword ’Qwen’ at least twice\.” Output: `{"type": "keyword_count", "constraint": "Include the keyword ’Qwen’ at least twice"}`
- `keyword_exclude`: Input: “Do not use the word ’ChatGPT’\.” Output: `{"type": "keyword_exclude", "constraint": "Do not use the word ’ChatGPT’"}`
- `response_language`: Input: “The response should be written in Spanish\.” Output: `{"type": "response_language", "constraint": "written in Spanish"}`
- `start_text`: Input: “The response should begin with ’Sure\!’\.” Output: `{"type": "start_text", "constraint": "begin with ’Sure!’"}`
- `end_text`: Input: “The response should end with ’Thank you for reading\.’\.” Output: `{"type": "end_text", "constraint": "end with ’Thank you for reading.’"}`
- `list_format`: Input: “Use bullet points beginning with ’\- ’\.” Output: `{"type": "list_format", "constraint": "Use bullet points beginning with ’- ’"}`
- `output_format`: Input: “Put the implementation in Python code blocks\.” Output: `{"type": "output_format", "constraint": "in Python code blocks"}`
- `punctuation_rule`: Input: “Avoid using colons\.” Output: `{"type": "punctuation_rule", "constraint": "Avoid using colons"}`

When a real input matches one of the canonical types above, imitate the nearest canonical example and preserve the original wording as much as possible\.

Final Reminder\.

- If there are no valid constraints, output `[null]` directly\.
- Your output must contain only the JSON array, with no explanations, prefixes, or suffixes\.

User message: Now process the user input: `{question}`
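The formatting rules in the extraction prompt are strict enough to be verified before the second stage runs\. The following sketch shows one way to do so; it is an illustration rather than the authors' code, `parse_constraints` is a hypothetical name, and the choice to represent `[null]` as an empty Python list is an assumption of this sketch\.

```python
import json

# Allowed constraint types, as enumerated in the extraction prompt.
ALLOWED_TYPES = {
    "word_count", "paragraph_count", "sentence_count", "keyword_count",
    "keyword_exclude", "response_language", "start_text", "end_text",
    "list_format", "output_format", "punctuation_rule",
}
# Types that may appear at most once per instruction.
SINGLETON_TYPES = {
    "word_count", "paragraph_count", "sentence_count",
    "response_language", "start_text", "end_text",
}


def parse_constraints(raw):
    """Validate first-stage output; returns [] when the model emitted [null]."""
    data = json.loads(raw)
    if not isinstance(data, list):
        raise ValueError("top level must be a JSON array")
    if data == [None]:
        return []  # ASSUMPTION: [null] is represented as an empty list here
    seen = set()
    for item in data:
        if not isinstance(item, dict) or set(item) != {"type", "constraint"}:
            raise ValueError(f"expected exactly 'type' and 'constraint': {item!r}")
        ctype, text = item["type"], item["constraint"]
        if not isinstance(ctype, str) or ctype not in ALLOWED_TYPES:
            raise ValueError(f"unknown constraint type: {ctype!r}")
        if not isinstance(text, str) or not text.strip():
            raise ValueError("constraint must be a non-empty string")
        if ctype in SINGLETON_TYPES and ctype in seen:
            raise ValueError(f"at most one {ctype} item is allowed")
        seen.add(ctype)
    return data
```

Rejecting malformed output at this boundary means the compilation stage only ever receives constraint lists whose length and types it can rely on\.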
#### E\.4\.2 Constraint\-to\-Code Compilation Prompt
The second\-stage prompt compiles each extracted constraint into a check\_following\(instruction, response\) function\. Generated code is execution\-validated; items that fail validation are retried, and those that still fail are replaced with \[null\]\.
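The validate\-and\-retry step described above might look roughly like the sketch below\. It is illustrative only: the probe inputs, the retry budget max\_attempts, and the generate callable \(standing in for the LLM call that returns checker source code\) are all assumptions, and generated code should only be executed inside a sandbox\.

```python
def validate_checker(source):
    """Return True if `source` defines check_following and it returns a bool."""
    namespace = {}
    try:
        # Assumption: this runs in a sandboxed process; exec on untrusted
        # model output is unsafe otherwise.
        exec(source, namespace)
        result = namespace["check_following"]("probe instruction", "probe response")
    except Exception:
        # Syntax errors, a missing function, and runtime errors all fail validation.
        return False
    return isinstance(result, bool)


def compile_with_retry(constraint, generate, max_attempts=3):
    """Call the hypothetical `generate(constraint)` until a checker validates.

    Returns the checker source string, or None when every attempt fails
    (None plays the role of the [null] fallback).
    """
    for _ in range(max_attempts):
        source = generate(constraint)
        if isinstance(source, str) and validate_checker(source):
            return source
    return None
```

Validation here only confirms that the code executes and returns a boolean; it does not confirm that the checker is faithful to the constraint\.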
Prompt for Constraint\-to\-Code Compilation

You are a powerful code assistant capable of converting a list of extracted constraint items into corresponding Python validation functions\.

Your task: Based on the given constraint list, generate a self\-contained Python function named check\_following\(instruction, response\)\. The goal is to return True if all constraints are satisfied, and False otherwise\.

Strictly follow the rules below:

1\. The output must be a pure Python code list, exactly matching the format shown in the “Example Output”\.
2\. Each constraint item must correspond to one independent Python function string\.
3\. Each function must be self\-contained and include necessary imports, e\.g\., re\. The use of external libraries such as nltk is strictly prohibited\.
4\. If the input is \[null\], you must directly output \[null\] without any extra characters\.
5\. Do not return anything other than the code list\.

Important: Violating the format will cause a system failure\. You must:

• Never modify the function signature: def check\_following\(instruction, response\)\.
• Never change the number of list elements; it must exactly match the number of input constraints\.
• Prefer deterministic regex, string, and counting logic\.
• Lightweight heuristics are allowed for response\_language and soft opening/ending checks, but keep them simple and executable without external libraries\.
• Be conservative: if a constraint is phrased too vaguely for a reliable surface checker, generate the weakest faithful executable checker you can, rather than an overly strict one\.

Example Input 1:
```
[
  {"type": "word_count", "constraint": "at least 300 words"},
  {"type": "paragraph_count", "constraint": "3 paragraphs"},
  {"type": "sentence_count", "constraint": "no more than 30 sentences"},
  {"type": "keyword_count", "constraint": "include keyword Qwen"},
  {"type": "keyword_count", "constraint": "mention keyword Kimi at least three times"},
  {"type": "keyword_exclude", "constraint": "must not contain 'ChatGPT'"}
]
```

Example Output 1:
```
[
  "import re\n\ndef check_following(instruction, response):\n    chinese = len(re.findall(r'[\u4e00-\u9fff]', response))\n    english = len(re.findall(r\"[a-zA-Z]+(?:[-'][a-zA-Z]+)*\", response))\n    return chinese + english >= 300",
  "import re\n\ndef check_following(instruction, response):\n    paragraphs = [p.strip() for p in re.split(r'\\n\\s*\\n', response.strip()) if p.strip()]\n    return len(paragraphs) == 3",
  "import re\n\ndef check_following(instruction, response):\n    sentences = re.split('[.!?\\u3002\\uff01\\uff1f]+', response.strip())\n    sentences = [s.strip() for s in sentences if s.strip()]\n    return len(sentences) <= 30",
  "import re\n\ndef check_following(instruction, response):\n    return len(re.findall(r'\\bQwen\\b', response, re.IGNORECASE)) >= 1",
  "import re\n\ndef check_following(instruction, response):\n    return len(re.findall(r'\\bKimi\\b', response, re.IGNORECASE)) >= 3",
  "import re\n\ndef check_following(instruction, response):\n    return not bool(re.search(r'ChatGPT', response, re.IGNORECASE))"
]
```

Example Input 2:
```
[
  {"type": "response_language", "constraint": "written in English"},
  {"type": "start_text", "constraint": "begin with 'Sure!'"},
  {"type": "end_text", "constraint": "end with 'Solar energy matters.'"},
  {"type": "list_format", "constraint": "Use numbered steps (1., 2., 3., etc.)"},
  {"type": "output_format", "constraint": "output should be in Python code blocks"},
  {"type": "punctuation_rule", "constraint": "Do not use colons"}
]
```

Example Output 2:
```
[
  "import re\n\ndef check_following(instruction, response):\n    text = response.strip()\n    if not text:\n        return False\n    if re.search(r'[\\u4e00-\\u9fff\\u3040-\\u30ff\\uac00-\\ud7af\\u0400-\\u04ff\\u0600-\\u06ff]', text):\n        return False\n    english_hits = len(re.findall(r'\\b(the|and|is|are|of|to|in|that|for|with|on|as)\\b', text, re.IGNORECASE))\n    latin_hits = len(re.findall(r'[A-Za-z]', text))\n    return latin_hits >= 20 and english_hits >= 2",
  "def check_following(instruction, response):\n    return response.lstrip().startswith('Sure!')",
  "def check_following(instruction, response):\n    return response.rstrip().endswith('Solar energy matters.')",
  "import re\n\ndef check_following(instruction, response):\n    return bool(re.search(r'(?m)^\\s*\\d+\\.\\s+', response))",
  "import re\n\ndef check_following(instruction, response):\n    return bool(re.search(r'```(?:python)?\\n[\\s\\S]+?\\n```', response, re.IGNORECASE))",
  "def check_following(instruction, response):\n    return ':' not in response"
]
```

Canonical code patterns by type:

• word\_count: Input: \{"type": "word\_count", "constraint": "at least 300 words"\}
• paragraph\_count: Input: \{"type": "paragraph\_count", "constraint": "exactly 4 paragraphs with a blank line between paragraphs"\}
• sentence\_count: Input: \{"type": "sentence\_count", "constraint": "no more than 8 sentences"\}
• keyword\_count: Input: \{"type": "keyword\_count", "constraint": "Include the keyword 'Qwen' at least twice"\}
• keyword\_exclude: Input: \{"type": "keyword\_exclude", "constraint": "Do not use the word 'ChatGPT'"\}
• response\_language: Input: \{"type": "response\_language", "constraint": "written in Spanish"\}
• start\_text: Input: \{"type": "start\_text", "constraint": "begin with 'Sure\!'"\}
• end\_text: Input: \{"type": "end\_text", "constraint": "end with 'Thank you for reading\.'"\}
• list\_format: Input: \{"type": "list\_format", "constraint": "Use bullet points beginning with '\- '"\}
• output\_format: Input: \{"type": "output\_format", "constraint": "in Python code blocks"\}
• punctuation\_rule: Input: \{"type": "punctuation\_rule", "constraint": "Avoid using colons"\}

For every real constraint, choose the nearest canonical pattern above and adapt only the keyword, number, operator, punctuation mark, or anchored text\.

Additional guidance by type:

• response\_language: use script detection for Chinese/Japanese/Korean/Cyrillic/Arabic when possible; for English/Spanish/French/Catalan and other Latin\-script languages, use a lightweight stopword heuristic plus script checks\.
• start\_text/end\_text: if the constraint contains quoted text, match that text exactly after trimming outer whitespace\.
• list\_format: check only visible markers or separators\.
• output\_format: check only surface formatting, such as fenced code blocks, plain text, bold markers, absence of bullet points, absence of horizontal rules, or absence of special symbols\.
• punctuation\_rule: check the required punctuation inclusion/exclusion directly\.
• keyword\_count/keyword\_exclude: only use exact token or phrase presence checks when the constraint clearly asks for explicit surface appearance or prohibition\. Do not strengthen a vague thematic constraint into an exact\-match keyword requirement\.

User message: Now process the user input: \{checkers\}
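A minimal sketch of how a compiled checker list in the format of the example outputs above might be run against a response\. The helper names run\_checkers and code\_score are hypothetical, counting a checker that raises at run time as a violation is an assumption, and the pass fraction is shown only as one possible aggregation next to the all\-constraints\-satisfied reading stated in the prompt\.

```python
def run_checkers(sources, instruction, response):
    """Execute each checker source string and collect boolean results.

    None entries (the [null] fallback) are skipped.
    """
    results = []
    for source in sources:
        if source is None:
            continue
        namespace = {}
        try:
            # Assumption: executed inside a sandbox; never exec untrusted code directly.
            exec(source, namespace)
            results.append(bool(namespace["check_following"](instruction, response)))
        except Exception:
            results.append(False)  # assumption: a crashing checker counts as a violation
    return results


def code_score(results):
    """Fraction of satisfied constraints, or None when no checker applies."""
    if not results:
        return None
    return sum(results) / len(results)


# Two checkers copied from Example Output 2, plus one [null] fallback entry.
sources = [
    "def check_following(instruction, response):\n    return response.lstrip().startswith('Sure!')",
    "def check_following(instruction, response):\n    return ':' not in response",
    None,
]
results = run_checkers(sources, "", "Sure! Here is the plan: step one.")
print(results, all(results), code_score(results))  # [True, False] False 0.5
```

The response starts with the required text but contains a colon, so one of the two active checks fails\.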
### E\.5 Rubric Judging and Global Scoring Prompts
This subsection presents the prompts used for model\-based reward computation\. Rubric\-based scoring evaluates each rubric item independently with a three\-way label: yes, part, or no\. Global scoring uses a separate judge prompt that produces a brief explanation and a final numeric score in double square brackets\.
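The two judge output formats described above could be parsed roughly as follows\. This is a sketch under stated assumptions: the function names are hypothetical, the numeric values assigned to yes/part/no \(1\.0, 0\.5, 0\.0\) are an illustrative choice not confirmed by this section, and dividing the 0 to 10 rating by 10 is just one simple normalization\.

```python
import re

# Assumed label-to-score mapping, for illustration only.
LABEL_VALUES = {"yes": 1.0, "part": 0.5, "no": 0.0}


def parse_rubric_label(text):
    """Return 'yes', 'part', or 'no', or None if the judge output is off-format."""
    label = text.strip().lower()
    return label if label in LABEL_VALUES else None


def rubric_score(labels):
    """Mean value over valid labels; None if no label could be parsed."""
    values = [LABEL_VALUES[label] for label in labels if label in LABEL_VALUES]
    return sum(values) / len(values) if values else None


def parse_global_score(text):
    """Extract the last [[x]] rating and scale it from 0-10 to 0-1.

    Returns None if no rating is found or it falls outside 0-10.
    """
    matches = re.findall(r"\[\[\s*(\d+(?:\.\d+)?)\s*\]\]", text)
    if not matches:
        return None
    score = float(matches[-1])
    return score / 10.0 if 0.0 <= score <= 10.0 else None
```

Taking the last bracketed number guards against a judge that quotes the \[\[ \]\] format inside its explanation before giving the final rating\.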
System Prompt for Rubric Judging

You are a strict evaluation function\. Your task is to judge whether the AI response satisfies the given RUBRIC\.

You will be provided with:

• A user question
• An AI response
• A single RUBRIC item

RATING DEFINITIONS

• yes: The response fully satisfies the RUBRIC with no meaningful flaws\.
• part: The response addresses the RUBRIC but has minor omissions, ambiguity, or limited coverage\.
• no: The response does not satisfy the core requirement of the RUBRIC\.

EVALUATION RULES

• Evaluate ONLY against the given RUBRIC\.
• Ignore verbosity, style, or fluency unless explicitly required by the RUBRIC\.
• Do NOT infer unstated intent or give credit for partially implied content\.
• Be strict: assign yes only if the requirement is clearly and completely met\.

OUTPUT FORMAT

Return exactly one label: yes / part / no

Do not output JSON, explanations, ids, punctuation, or anything else\.
User Prompt for Rubric Judging

TASK: Evaluate the Response based on the single RUBRIC below\.

\[User Question\]
\{question\}

\[AI Assistant Response\]
\{answer\}

\[Evaluation RUBRIC\]
\{rubric\}

Output exactly one token: yes, part, or no\.
System Prompt for Global Scoring

You are a strict scoring function\. Follow the user’s instructions exactly\.
User Prompt for Global Scoring

Task Description

Please act as an impartial judge and evaluate the quality of the response provided by an AI assistant to the user query displayed below\.

Notes:

1\. Your evaluation should consider factors such as the helpfulness, relevance, and accuracy of the response, but need not consider depth or level of detail of the response\.
2\. Begin your evaluation by providing a short explanation\.
3\. Be as objective as possible\. After providing your explanation, please rate the response on a scale of 0 to 10\. For your rating, only give a number between 0 and 10 inclusive, do not use any markdown, and do not put any text after your final rating\.
4\. Important: The final numeric rating must be enclosed in double square brackets \[\[ \]\]\. Do not add any text outside the brackets\.

\[Query\]
\{question\}

\[Response\]
\{answer\}

\[Your judgement\]