# Bootstrapping Post-training Signals for Open-ended Tasks via Rubric-based Self-play on Pre-training Text
Source: [https://arxiv.org/html/2604.20051](https://arxiv.org/html/2604.20051)
Chengyu Huang, Sheng-Yen Chou, Zhengxin Zhang, Claire Cardie
Department of Computer Science, Cornell University
{ch2263, sc3379, zz865, ctc9}@cornell.edu
###### Abstract
Self-play has recently emerged as a promising paradigm to train Large Language Models (LLMs). In self-play, the target LLM creates the task input (e.g., asks a question), which it then addresses itself by producing a task output (e.g., gives an answer). A reward model evaluates the output, and the rewards are then used to train the LLM, typically via Reinforcement Learning (RL). Self-play incurs minimal supervision costs, which is especially helpful for post-training LLMs, since post-training requires high-quality input-output pairs that traditionally have to be written by humans or expensive proprietary models. However, existing work explores self-play only for verifiable tasks such as math and coding. Instead, we seek to extend it to more realistic open-ended tasks. In particular, we propose POP, a self-play framework that uses the same LLM to synthesize evaluation rubrics, along with input-output pairs, for each example. The rubric is then used to evaluate outputs and train the model. We further ground the framework on a content-rich pretraining corpus to (1) ensure a generation-verification gap and reduce reward hacking, and (2) prevent mode collapse. On Qwen-2.5-7B, POP increases the performance of both pretrained and instruction-tuned models, across tasks ranging from long-form Healthcare QA to creative writing and instruction following. GitHub: [https://github.com/HCY123902/POP](https://github.com/HCY123902/POP).
## 1Introduction
Recent Large Language Models (LLMs) have become increasingly capable of handling complex tasks, from math problem solving [[22](https://arxiv.org/html/2604.20051#bib.bib3)] to agentic workflows [[25](https://arxiv.org/html/2604.20051#bib.bib4)]. However, continual improvement of LLMs still requires high-quality training data from humans or stronger LLMs, which is often expensive or unavailable. In fact, the cost and scarcity of high-quality training data is becoming a bottleneck [[40](https://arxiv.org/html/2604.20051#bib.bib5)].
To this end, self-play has emerged as a promising direction for improving LLMs with minimal external supervision. In essence, self-play is a Reinforcement Learning (RL) process. In each iteration, (i) the LLM synthesizes the task input (e.g., asks a question); (ii) the same LLM then responds to itself with a task output (e.g., gives an answer); (iii) the output (and optionally the input) is then scored by a reward model, which can take various forms (a rule-based verifier, a neural network, etc.) and is task-dependent; (iv) the rewards are then used to train the LLM, so that it produces better task outputs (i.e., answers better) and optionally better task inputs as well (i.e., asks better questions). In contrast to standard RL and alternative self-improvement methods [[14](https://arxiv.org/html/2604.20051#bib.bib13), [43](https://arxiv.org/html/2604.20051#bib.bib14)], self-play avoids the need for supervision not only of the task output, but also of the task input.
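For concreteness, one iteration of this loop can be sketched as follows. This is a minimal sketch of steps (i)-(iv); `llm`, `reward_model`, and `rl_update` are hypothetical placeholders for the target LLM, the reward model, and the RL step, not components of any released code.

```python
# Minimal sketch of one self-play iteration (steps i-iv above).
def self_play_iteration(llm, reward_model, rl_update, task_prompt):
    # (i) the LLM synthesizes the task input (e.g., asks a question)
    task_input = llm.generate(task_prompt)
    # (ii) the same LLM produces a task output (e.g., an answer)
    task_output = llm.generate(task_input)
    # (iii) a reward model (rule-based verifier, neural net, ...) scores the output
    reward = reward_model.score(task_input, task_output)
    # (iv) the reward is used to update the LLM, typically via RL
    rl_update(llm, task_input, task_output, reward)
    return task_input, task_output, reward
```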
However, self-play has so far been explored only on verifiable domains such as math [[12](https://arxiv.org/html/2604.20051#bib.bib9), [20](https://arxiv.org/html/2604.20051#bib.bib10), [44](https://arxiv.org/html/2604.20051#bib.bib12), [3](https://arxiv.org/html/2604.20051#bib.bib8)] and coding [[45](https://arxiv.org/html/2604.20051#bib.bib6), [49](https://arxiv.org/html/2604.20051#bib.bib7)], due to their ease of evaluation. Extending these methods to more realistic open-ended tasks is important yet challenging. Previous work [[9](https://arxiv.org/html/2604.20051#bib.bib24), [40](https://arxiv.org/html/2604.20051#bib.bib5), [2](https://arxiv.org/html/2604.20051#bib.bib30), [48](https://arxiv.org/html/2604.20051#bib.bib28), [10](https://arxiv.org/html/2604.20051#bib.bib29), [30](https://arxiv.org/html/2604.20051#bib.bib27), [15](https://arxiv.org/html/2604.20051#bib.bib26)] shows that rubric-based rewards provide reliable signals on open-ended tasks. Inspired by this, we propose a rubric-based self-play framework that Post-trains LLMs on Open domains with Pretraining text (POP). In particular, after the model generates answers to a query, we use the same model to generate an evaluation rubric specific to that query, and then grade each answer based on the rubric.
The use of the same model to synthesize the queries, answers, and rewards carries several risks: (1) on the query side, a lack of query diversity and thus mode collapse; (2) on the answer side, a lack of high-quality answers and hence no positive training signals; (3) on the reward side, a lack of generation-verification gap [[35](https://arxiv.org/html/2604.20051#bib.bib21)] and consequently reward hacking. We address (1) by grounding the entire framework on diverse text sampled from a content-rich pretraining corpus. We ask the model to synthesize questions that build upon the text and, in tasks where ground truth exists, questions whose answers are derivable from the text. We use two methods to mitigate (3). First, we give the model privileged access to the pretraining text when it evaluates the answers. Second, we select only the highest- and lowest-scored answers for training. We address (2) by using Direct Preference Optimization [[26](https://arxiv.org/html/2604.20051#bib.bib31)], which learns contrastive signals from the response pairs [[13](https://arxiv.org/html/2604.20051#bib.bib32)].
Figure 1: Pipeline of POP. We sample a set of examples from the base model and create response pairs from them, which are then used to train the model via DPO. During the sampling stage, for each example, (i) we get a piece of text from a pretraining corpus and, grounded on that, the LLM synthesizes a question; (ii) the LLM produces a set of candidate answers to the question; (iii) the LLM generates an evaluation rubric, conditioned on the pretraining text; (iv) the LLM scores each candidate answer according to the rubric. After sampling, we filter out invalid examples and take the answers with the highest and lowest scores as the positive and negative responses, respectively. The resulting dataset is used for DPO.
Put together, POP uses the same model in three roles: the Proposer, which synthesizes questions grounded on pretraining text; the Solver, which answers the questions; and the Verifier, which generates the rubric and grades each answer, again grounded on pretraining text. We evaluate POP on three open-ended tasks: long-form Healthcare QA, creative writing, and general instruction following. For each task, we only change the pretraining corpus and the prompts for the Proposer and Solver, while keeping the rest of the framework fixed.
Our main contributions are as follows. (i) We propose POP, a general self-play framework for post-training LLMs on open-ended tasks; (ii) we ground POP on pretraining text to enable a generation-verification gap without introducing strong supervision; (iii) experimental results show that POP improves both pretrained and instruction-tuned models on tasks ranging from long-form Healthcare QA (up to 4% on HealthBench500 [[1](https://arxiv.org/html/2604.20051#bib.bib33)]) to creative writing (up to 5% on Creative Writing V3 [[24](https://arxiv.org/html/2604.20051#bib.bib34)]) and instruction following (up to 9% on IFEval [[47](https://arxiv.org/html/2604.20051#bib.bib35)] and 4% on ArenaHard [[46](https://arxiv.org/html/2604.20051#bib.bib17)]).
## 2Related Work
##### Self\-Play\.
Prior work applies self-play almost exclusively to verifiable reasoning tasks: [[45](https://arxiv.org/html/2604.20051#bib.bib6)] and [[48](https://arxiv.org/html/2604.20051#bib.bib28)] apply self-play to coding tasks; [[3](https://arxiv.org/html/2604.20051#bib.bib8), [12](https://arxiv.org/html/2604.20051#bib.bib9), [20](https://arxiv.org/html/2604.20051#bib.bib10), [44](https://arxiv.org/html/2604.20051#bib.bib12)] focus on math reasoning; [[4](https://arxiv.org/html/2604.20051#bib.bib11)] works with commonsense reasoning. [[7](https://arxiv.org/html/2604.20051#bib.bib56)] and [[29](https://arxiv.org/html/2604.20051#bib.bib57)] synthesize question-answer pairs on various verifiable agentic tasks, with access to tools that interact with environments. However, these verifiable tasks are only a small part of what is needed to train highly capable LLMs [[5](https://arxiv.org/html/2604.20051#bib.bib16)]. Real-world use cases often require long-form open-ended output, such as document-assisted writing (e.g., summarization), creative writing (e.g., storytelling), information extraction, open-ended QA, etc.
##### Rubric as Rewards\.
There is a growing body of literature on extending existing Reinforcement Learning (RL) methods to post-train LLMs in non-verifiable domains. One major branch uses rubrics to score the open-ended text from the model, and uses that score as the reward for RL.
[[9](https://arxiv.org/html/2604.20051#bib.bib24), [41](https://arxiv.org/html/2604.20051#bib.bib25)] are among the first in this direction. They take questions from existing datasets and generate question-specific rubrics using strong LLMs; their rubric criteria adopt a binary rating scale. [[15](https://arxiv.org/html/2604.20051#bib.bib26)] curate a set of rubrics in advance, from human experts or strong LLMs, and use these rubrics to synthesize questions, which are then fed into their rubric-based RL pipeline. [[30](https://arxiv.org/html/2604.20051#bib.bib27)] ask a strong LLM to compare responses from the current model against those from a reference model to curate new rubric criteria during training, and then use another strong LLM to grade the responses. [[48](https://arxiv.org/html/2604.20051#bib.bib28)] use rubrics to guide the solver toward high-quality responses, and gradually limit access to the rubrics during training for curriculum learning. [[10](https://arxiv.org/html/2604.20051#bib.bib29)] hire humans to generate the rubrics and score responses, which are used to train their LLM-based rubric generators and verifiers. [[2](https://arxiv.org/html/2604.20051#bib.bib30)] ask humans or strong LLMs to synthesize rubrics from reference answers and use them for RL training. Similar to [[48](https://arxiv.org/html/2604.20051#bib.bib28)], they selectively refine the sampled responses with rubrics to improve RL training efficiency.
However, these existing methods require strong teacher models to create the questions and rubrics, which is expensive and means they cannot function without a stronger teacher.
## 3Methodology
The full pipeline of POP is shown in Figure [1](https://arxiv.org/html/2604.20051#S1.F1), and the pseudocode is given in Algorithm [1](https://arxiv.org/html/2604.20051#alg1).
### 3\.1Sampling
First, we find a pretraining corpus $D$ whose content is relevant to our target task $t$ (e.g., a corpus of medical articles or clinical records for health QA; fantasy books for creative writing; general text for instruction following). With that, we synthesize $I$ examples from a base LLM $\pi_{ref}$. Each example is synthesized according to the following four steps.
##### Question Synthesis\.
First, we sample a text document $d$ from $D$. To ensure that $d$ is informative, we require it to be at least 50 words long. For efficiency, we also truncate $d$ to its first 1024 words. Then, we ask the LLM to synthesize a new question $x$ and a reference answer $y_{ref}$ conditioned on $d$: $(x, y_{ref}) \sim \pi_{ref}(\cdot \mid P_t^{qus}(d))$, where $P_t^{qus}$ is a task-specific question synthesis prompt (see Appendix [59](https://arxiv.org/html/2604.20051#A12.F59)).
For tasks where ground truth answers exist (e.g., long-form health/medical QA), the prompt $P_t^{qus}$ requires that the answer to question $x$ be derivable from $d$. For tasks where no ground truth exists (e.g., creative writing, instruction following), we relax this condition and only require that the answer build upon $d$. Moreover, we ask the model to provide a reference answer $y_{ref}$ to $x$, which will be used later for rubric generation.
##### Question Answering\.
We sample a set of candidate answers to $x$ from $\pi_{ref}$: $y_j \sim \pi_{ref}(\cdot \mid P_t^{ans}(x))$ for $j \in \{1, \cdots, J\}$, where $P_t^{ans}$ is the question answering prompt.
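A minimal sketch of these Proposer and Solver steps is given below, assuming a generic `sample(prompt, n)` decoding helper and a prompt format that separates the question from the reference answer with a marker line; both are illustrative assumptions, not the exact prompts listed in the appendix.

```python
def synthesize_example(sample, P_qus, P_ans, document, J=32):
    """Proposer + Solver steps of POP, given `sample(prompt, n)` drawing n
    completions from the reference model pi_ref."""
    # Proposer: synthesize a question x and a reference answer y_ref grounded on d.
    # We assume the prompt asks the model to separate them with a marker line.
    proposal = sample(P_qus.format(document=document), n=1)[0]
    question, _, y_ref = proposal.partition("\n### Reference Answer\n")
    # Solver: sample J candidate answers to x.
    candidates = sample(P_ans.format(question=question), n=J)
    return question.strip(), y_ref.strip(), candidates
```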
##### Rubric Generation\.
We generate a question-specific rubric $r$, conditioned on the pretraining document $d$, question $x$, reference answer $y_{ref}$, and candidate answers $\{y_j\}_{j=1}^{J}$: $r \sim \pi_{ref}(\cdot \mid P^{rub}(d, x, y_{ref}, \{y_j\}_{j=1}^{J}))$, where $P^{rub}$ is the rubric generation prompt.
$r$ consists of $K_i$ criteria $r_1, \cdots, r_{K_i}$, where $K_i$ is upper bounded by a constant $K$. Each criterion $r_k$ has a name, a description of the characteristics of good and bad responses, an optional gold label extracted from the document $d$, and a weight $w_k$ that indicates how important the criterion is.
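As an illustration, the rubric described above could be represented with a structure like the following; the field names are ours, not the exact schema used in the released code.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RubricCriterion:
    name: str                   # e.g., "Accuracy of recommended dosage"
    description: str            # what good vs. bad responses look like
    gold_label: Optional[str]   # extracted from d when ground truth exists
    weight: float               # importance w_k used when aggregating scores

@dataclass
class Rubric:
    criteria: list[RubricCriterion]   # K_i criteria, with K_i <= K (= 5 here)
```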
Importantly, we enforce several principles in the rubric generation prompt (see the full prompt in Appendix [59](https://arxiv.org/html/2604.20051#A12.F59)). First, we require the rubric to be grounded on the pretraining document. For criteria that have ground truth or gold standard answers, we ask the model to extract the gold label from the document $d$; this provides explicit grounding on the pretraining corpus. For other criteria where there is no ground truth, we ask the model to utilize $d$ when generating their descriptions, which serves as implicit grounding on the pretraining corpus.
In addition, we require the rubric to be discriminative. We ask that the model only generate criteria that meaningfully distinguish high-quality answers from low-quality ones. Different from [[32](https://arxiv.org/html/2604.20051#bib.bib15)], we additionally include the privileged reference answer $y_{ref}$ in the rubric prompt. When all the candidate answers are highly similar, $y_{ref}$ serves as an extra reference point that helps avoid generating meaningless criteria that do not correlate with overall quality; such cases are then filtered out after sampling.
##### Answer Grading\.
We grade each candidate response $y_j$ using the rubric $r$ to get an evaluation report $e_j$: $e_j \sim \pi_{ref}(\cdot \mid P^{grade}(x, r, y_j))$, where $P^{grade}$ is the grading prompt. $e_j$ consists of individual evaluations $\{e_j^{k}\}_{k=1}^{K_i}$ of $y_j$, one for each criterion $r_k$. For each $e_j^{k}$, we ask the model to first give its thoughts on how well the response does according to $r_k$'s description and gold label. Then, we ask it to produce a rating $s_j^{k}$ of either 0 (bad / does not match gold), 1 (medium / partially matches gold), or 2 (good / fully matches gold), depending on the extent to which $y_j$ satisfies $r_k$. If no valid score can be extracted for a criterion, we set its score to 0. Finally, we aggregate the scores as follows.
$$s_j = \frac{\sum_{k=1}^{K} w_k \cdot s_j^{k}}{\sum_{k=1}^{K} w_k} \qquad (1)$$
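A small sketch of this aggregation step (Equation (1)), assuming per-criterion ratings and weights keyed by criterion name:

```python
# Weighted aggregation of per-criterion ratings into a single answer score s_j,
# following Equation (1). Ratings are 0, 1, or 2; missing ratings default to 0.
def aggregate_score(ratings: dict[str, int], weights: dict[str, float]) -> float:
    total_weight = sum(weights.values())
    weighted = sum(w * ratings.get(name, 0) for name, w in weights.items())
    return weighted / total_weight if total_weight > 0 else 0.0

# Example: criteria weighted 2.0 and 1.0, rated 2 and 1 respectively:
# aggregate_score({"accuracy": 2, "clarity": 1}, {"accuracy": 2.0, "clarity": 1.0})
# -> (2*2.0 + 1*1.0) / 3.0 = 1.67
```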
### 3\.2Filtering, Pairing, and Training
##### Preliminaries\.
Direct Preference Optimization [[26](https://arxiv.org/html/2604.20051#bib.bib31)] is an offline version of traditional policy gradient methods such as PPO [[31](https://arxiv.org/html/2604.20051#bib.bib38)]. It avoids the need for an extra reward model and trains the main policy to learn the reward landscape directly. Given a dataset $\{(x^{i}, y_w^{i}, y_l^{i})\}_{i=1}^{N}$, where each example has a prompt $x$, a preferred response $y_w$, and a dispreferred response $y_l$, DPO initializes the main policy $\pi_{\theta}$ from the reference policy $\pi_{ref}$ and trains it using the following objective.
$$L_{DPO}(\pi_{\theta}, \pi_{ref}) = -\mathbb{E}_{(x, y_w, y_l) \sim S}\left[\log \sigma\left(\beta \log \frac{\pi_{\theta}(y_w \mid x)}{\pi_{ref}(y_w \mid x)} - \beta \log \frac{\pi_{\theta}(y_l \mid x)}{\pi_{ref}(y_l \mid x)}\right)\right] \qquad (2)$$
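For reference, Equation (2) can be computed from summed per-response log-probabilities as in the following sketch (PyTorch; the variable names are ours):

```python
import torch.nn.functional as F

# Minimal sketch of the DPO objective in Equation (2), operating on summed
# log-probabilities of the chosen (y_w) and rejected (y_l) responses under the
# trained policy pi_theta and the frozen reference policy pi_ref.
def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.01):
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # -log(sigmoid(margin)) == softplus(-margin), averaged over the batch
    return F.softplus(-(chosen_rewards - rejected_rewards)).mean()
```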
##### Filtering\.
After synthesizing $I$ examples, for each example, we filter out candidate answers that are malformed or have a malformed evaluation report. Then, we filter out examples where the question or rubric cannot be extracted, does not follow the correct format, or where there are zero valid candidate answers.
##### Pairing\.
For each example that passes filtering, we pick the candidate answer with the highest score as the positive response, $y_w \leftarrow \arg\max_{y_j}(s_j)$, and the answer with the lowest score as the negative response, $y_l \leftarrow \arg\min_{y_j}(s_j)$. Following [[13](https://arxiv.org/html/2604.20051#bib.bib32)], we want $y_w$ and $y_l$ to be maximally different in quality but minimally different in irrelevant features such as length. Therefore, we filter out response pairs where $y_w$ and $y_l$ differ by more than 100 words in length. We also filter out pairs where $y_w$ and $y_l$ have the same score. This gives the DPO training set $\{(x^{i}, y_w^{i}, y_l^{i})\}_{i=1}^{N}$.
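A minimal sketch of this pairing-and-filtering step, assuming answers are plain strings and length is measured in words as described above:

```python
# Keep the highest- and lowest-scored answers as (y_w, y_l), and drop pairs
# that tie in score or differ too much in length.
def make_preference_pair(question, answers, scores, max_len_diff=100):
    y_w = max(zip(answers, scores), key=lambda p: p[1])
    y_l = min(zip(answers, scores), key=lambda p: p[1])
    if y_w[1] == y_l[1]:                        # no contrast in quality
        return None
    if abs(len(y_w[0].split()) - len(y_l[0].split())) > max_len_diff:
        return None                             # confounded by length
    return {"prompt": question, "chosen": y_w[0], "rejected": y_l[0]}
```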
##### Training\.
We initialize the main policy from $\pi_{ref}$ and then train it using the DPO objective in Equation (2) on the synthesized dataset. We denote the trained model as $\pi_{trained}$.
## 4Experiment Setup
##### Tasks and Evaluations\.
We experiment with three tasks: Long-form Healthcare QA, Creative Writing, and Instruction Following. For Long-form Healthcare QA, we use HC4 [[23](https://arxiv.org/html/2604.20051#bib.bib40)] as our pretraining corpus, which consists of more than 9.7 million medical articles. We evaluate on HealthBench [[1](https://arxiv.org/html/2604.20051#bib.bib33)], an LLM-as-a-judge benchmark with example-specific rubrics created by human experts. To reduce evaluation cost, we randomly sample 500 examples from it and denote the subset HealthBench500. For Creative Writing, we use 2 million book abstracts scraped from the Internet [[34](https://arxiv.org/html/2604.20051#bib.bib41)] as the pretraining corpus. We evaluate on Creative Writing V3 [[24](https://arxiv.org/html/2604.20051#bib.bib34)]. For Instruction Following, we use OpenWebText [[8](https://arxiv.org/html/2604.20051#bib.bib20)] as the pretraining corpus, which consists of 8 million general-domain documents. For evaluation, we use IFEval [[47](https://arxiv.org/html/2604.20051#bib.bib35)], a verifiable benchmark that measures strict adherence to writing constraints, and ArenaHard [[46](https://arxiv.org/html/2604.20051#bib.bib17)], an LLM-as-a-judge benchmark measuring general instruction following ability. We use GPT-4.1-mini [[38](https://arxiv.org/html/2604.20051#bib.bib43)] as the judge model across the LLM-as-a-judge benchmarks.
##### OOD Evaluations\.
To ensure that our trained model's performance does not degrade on other tasks, especially verifiable ones, we evaluate the models on nine out-of-distribution benchmarks covering scientific reasoning (GPQA-Diamond [[27](https://arxiv.org/html/2604.20051#bib.bib44)]), math reasoning (GSM8K [[6](https://arxiv.org/html/2604.20051#bib.bib45)], Math500 [[11](https://arxiv.org/html/2604.20051#bib.bib46)], AIME2024, AIME2025), and factoid QA (NaturalQuestions [[18](https://arxiv.org/html/2604.20051#bib.bib47)], TriviaQA [[17](https://arxiv.org/html/2604.20051#bib.bib48)], TruthfulQA [[19](https://arxiv.org/html/2604.20051#bib.bib49)], MMLU-Pro [[42](https://arxiv.org/html/2604.20051#bib.bib50)]), as well as MedQA [[16](https://arxiv.org/html/2604.20051#bib.bib55)]. We use 0-shot prompting for all benchmarks.
Table 1: Statistics of the synthesized datasets, after filtering. Base: Qwen-2.5-7B; Inst: Qwen-2.5-7B-Instruct; #Ques: number of questions; #Ans: number of valid candidate answers per question; #Criteria: number of rubric criteria per question; $s_{mean}$: average answer score per question; $s_{std}$: answer score standard deviation per question; $s_{y_w}$: average score of the preferred response; $s_{y_l}$: average score of the dispreferred response.
##### Models.
We experiment with both a pretrained model (Qwen-2.5-7B [[39](https://arxiv.org/html/2604.20051#bib.bib42)]) and an instruction-finetuned model (Qwen-2.5-7B-Instruct [[39](https://arxiv.org/html/2604.20051#bib.bib42)]) as our reference model.
##### Sampling.
We sample $I = 4096$ questions (except for Creative Writing, where we set $I = 8192$), each with $J = 32$ initial candidate answers, and ask the model to create at most $K = 5$ rubric criteria per question. The statistics of the resulting synthesized datasets are shown in Table [1](https://arxiv.org/html/2604.20051#S4.T1).
##### Training.
We train the model for 1 epoch, using a learning rate of $1\mathrm{e}{-6}$ with a cosine schedule and a warmup ratio of 0.1, $\beta = 0.01$, and a batch size of 16. We use the AdamW optimizer [[21](https://arxiv.org/html/2604.20051#bib.bib39)].
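A sketch of this optimization setup, assuming a PyTorch model and the Hugging Face `get_cosine_schedule_with_warmup` helper; the function name and arguments are ours, not the released training script.

```python
import torch
from transformers import get_cosine_schedule_with_warmup

# AdamW, lr 1e-6, cosine schedule with 10% warmup, as reported above.
def build_optimizer_and_scheduler(model, num_training_steps,
                                  lr=1e-6, warmup_ratio=0.1):
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    scheduler = get_cosine_schedule_with_warmup(
        optimizer,
        num_warmup_steps=int(warmup_ratio * num_training_steps),
        num_training_steps=num_training_steps,
    )
    return optimizer, scheduler
```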
##### Baselines.
We compare $\pi_{trained}$ against (1) the reference model $\pi_{ref}$ and (2) a model that is continuously pretrained on the same pretraining text used to ground our pipeline (Train on $D$). Prior methods on rubric-based RL [[9](https://arxiv.org/html/2604.20051#bib.bib24), [40](https://arxiv.org/html/2604.20051#bib.bib5), [2](https://arxiv.org/html/2604.20051#bib.bib30), [48](https://arxiv.org/html/2604.20051#bib.bib28), [10](https://arxiv.org/html/2604.20051#bib.bib29), [30](https://arxiv.org/html/2604.20051#bib.bib27), [15](https://arxiv.org/html/2604.20051#bib.bib26)] use strong supervision to obtain the input questions, rubrics, and scores, so we do not compare against them.
## 5Results and Analysis
We first discuss the main results in §[5.1](https://arxiv.org/html/2604.20051#S5.SS1), then evaluate the trained models on OOD benchmarks in §[5.2](https://arxiv.org/html/2604.20051#S5.SS2). We also analyze the synthesized training datasets in §[5.3](https://arxiv.org/html/2604.20051#S5.SS3) and conduct various ablation studies in §[5.4](https://arxiv.org/html/2604.20051#S5.SS4). Finally, we analyze the correctness of the answer ranking produced by our rubric evaluation in §[5.5](https://arxiv.org/html/2604.20051#S5.SS5).
### 5\.1Main Results
We show the main results in Table [2](https://arxiv.org/html/2604.20051#S5.T2). Across domains, we observe a consistent performance increase over the reference model. In addition, continuously pretraining on the grounding documents $D$ significantly underperforms both the reference model and ours. This shows the necessity of transforming pretraining data into a post-training format, which is what POP achieves.
Table 2: Main results. Rubric: rubric score; Elo: Elo score; IFEval: we report the average of the strict and loose categories for both prompt-level and instruction-level metrics.
##### Healthcare QA.
On the pretrained model (Qwen-2.5-7B), we observe a 4% increase on HealthBench500. The gains on the instruction-tuned model (Qwen-2.5-7B-Inst) are marginal; we suspect this is due to a misalignment between our synthesized training data and the evaluation data from HealthBench. In Figure [2](https://arxiv.org/html/2604.20051#S5.F2), we show the results breakdown on HealthBench. For Qwen-2.5-7B, performance increases notably on all axes except context awareness. For Qwen-2.5-7B-Inst, performance slightly increases on accuracy, context awareness, and instruction following, but slightly drops on completeness and communication quality.


Figure 2: Per-axis results on HealthBench500. Left: Qwen-2.5-7B; Right: Qwen-2.5-7B-Inst.
##### Creative Writing\.
On Creative Writing V3, POP yields a 1.6% increase for Qwen-2.5-7B and a 5% increase for Qwen-2.5-7B-Inst. We show the detailed per-criterion breakdown in Appendix [C](https://arxiv.org/html/2604.20051#A3).


Figure 3: Per-axis results on IFEval. Left: Qwen-2.5-7B; Right: Qwen-2.5-7B-Inst.
##### Instruction Following\.
We first discuss results on IFEval\. For Qwen\-2\.5\-7B, POP gives a 9% increase on prompt\-level metrics and a 7\.5% increase on instruction\-level metrics\. For Qwen\-2\.5\-7B\-Inst, increases are milder\. We observe a 3\.5% increase on the prompt\-level and a 3% increase on the instruction\-level\. We further show the detailed breakdown in Figure[3](https://arxiv.org/html/2604.20051#S5.F3), which indicates a consistent performance increase under strict and loose evaluations\. On ArenaHard, POP improves the win rate against GPT\-4 by 3\.5% on Qwen\-2\.5\-7B and 4% on Qwen\-2\.5\-7B\-Inst\.
### 5\.2Results on OOD Benchmarks
We evaluate on various out\-of\-distribution benchmarks\. We present the results for models trained on Healthcare QA datasets in Table[3](https://arxiv.org/html/2604.20051#S5.T3)\. We show OOD results for other models in Appendix[D](https://arxiv.org/html/2604.20051#A4)\.
Table 3: OOD results for models trained on the Healthcare QA datasets. NQ: NaturalQuestions; TvQA: TriviaQA; TfQA: TruthfulQA; MMLU-P: MMLU-Pro; GPQA-D: GPQA-Diamond; GSM: GSM8K; Math: MATH500. We use 0-shot evaluation.
We observe a slight performance increase across knowledge and reasoning benchmarks for Qwen-2.5-7B. Notably, results on MMLU-Pro increase from 30.10% to 44.75%, and average results move from 34.71% to 37.34%. On Qwen-2.5-7B-Inst, performance slightly drops on NaturalQuestions and GPQA-Diamond but remains largely the same on the other benchmarks; average results change from 43.17% to 42.88%. We also highlight that results on MedQA increase by around 1% for both models, suggesting that training on our long-form datasets slightly helps performance on short-form tasks in the same domain.
In contrast, naively training on the pretraining corpus $D$ causes a performance drop on NaturalQuestions, TriviaQA, and GPQA-Diamond for Qwen-2.5-7B; average results remain at 34.68% due to an increase on MMLU-Pro. On Qwen-2.5-7B-Inst, naive pretraining causes degradation across benchmarks, with average results dropping from 43.17% to 36.01%.
### 5\.3Statistics of the Synthesized Dataset
Figure 4: Question topics.
We analyze the types of questions, rubrics, and answers synthesized by POP, using the $\text{Healthcare QA}_{\text{Qwen-2.5-7B}}$ dataset as an example. Detailed methodology and results for other settings are in Appendix [E](https://arxiv.org/html/2604.20051#A5).
#### 5\.3\.1Questions
In Figure [4](https://arxiv.org/html/2604.20051#S5.F4), we show the common topics of the synthesized questions. Healthcare Management (21%) and Public Health (19%) are the two most common topics, and they also have the strongest overlap with questions from HealthBench. Since a considerable portion of our pretraining corpus [[23](https://arxiv.org/html/2604.20051#bib.bib40)] consists of medical articles from PubMed, the rest of the synthesized questions tend to focus on professional medical knowledge and research.
#### 5\.3\.2Rubrics
Figure 5: Meta rubric criteria.
We group the rubric criteria into meta-criteria and show the composition in Figure [5](https://arxiv.org/html/2604.20051#S5.F5). Relevance (25%), Accuracy (22%), and Clarity (18%) make up two-thirds of the rubric criteria. Other major categories include Depth (11%), Biological Understanding (8%), and Originality (6%). This suggests that our rubrics value both objective and subjective criteria.
#### 5\.3\.3Answers and Scores


Figure 6: Answer score distribution for the $\text{Healthcare QA}_{\text{Qwen-2.5-7B}}$ dataset. Left: scores given by the model itself; Right: scores given by a stronger model. We group the candidate answers by question, compute the average and standard deviation of answer scores for each question, and then partition the questions into 10 bins. The x-axis is the average answer score for the questions in a bin; the error bar indicates the standard deviation of answer scores for the questions in that bin, averaged over questions; the bubble size indicates the percentage of questions that fall into that bin.
In Figure [6](https://arxiv.org/html/2604.20051#S5.F6), we show the distribution of answer scores, grouped by question. According to the rubric and scores generated by the model itself, more than 50% of the questions have an average score above 1.0. However, since these scores are not perfect, we additionally ask a stronger teacher model (GPT-4o-mini) to generate its own scores on the same questions, answers, and rubrics. The teacher's distribution shifts toward the lower end, with more than 50% of the questions having an average score below 1.0. As we will show in §[5.5](https://arxiv.org/html/2604.20051#S5.SS5), the scores generated by weaker models are not accurate without the pairing step.
Nonetheless, we observe a large standard deviation of over 0.3 across question groups under both models. This suggests that POP synthesizes questions of appropriate difficulty, so that the model generates answers of varying quality that can provide meaningful contrastive signals.
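A sketch of the per-question binning behind Figure 6, assuming raw scores in [0, 2] keyed by question (the helper and its outputs are illustrative, not the released analysis script):

```python
import statistics

# Group answer scores by question, compute each question's mean and standard
# deviation, then bucket the questions into 10 bins by mean score.
def bin_question_scores(scores_by_question, n_bins=10, max_score=2.0):
    bins = [[] for _ in range(n_bins)]
    for scores in scores_by_question.values():
        mean = statistics.mean(scores)
        std = statistics.pstdev(scores)
        idx = min(int(mean / max_score * n_bins), n_bins - 1)
        bins[idx].append((mean, std))
    total = sum(len(b) for b in bins)
    return [
        {"questions_pct": 100 * len(b) / total if total else 0.0,
         "avg_std": statistics.mean(s for _, s in b) if b else 0.0}
        for b in bins
    ]
```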
### 5\.4Ablation Study
Figure 7: Ablation results on HealthBench500.
We remove or replace various components of POP: (1) revoking access to the pretraining text $d$ when generating the rubric (Eval w/o $D$); (2) revoking our entire pipeline's access to $d$ altogether (w/o $D$); (3) as prior work suggests [[33](https://arxiv.org/html/2604.20051#bib.bib51), [36](https://arxiv.org/html/2604.20051#bib.bib52)], replacing our pointwise rubric grader with a pairwise judge that compares two answers according to our rubric (w/ pairwise judge); (4) removing access to rubrics for our judge, while still grounding it on $d$ (w/o rubric). See Appendix [F.1](https://arxiv.org/html/2604.20051#A6.SS1) for details.
We conduct the studies on the $\text{Healthcare QA}_{\text{Qwen-2.5-7B}}$ dataset and show the results in Figure [7](https://arxiv.org/html/2604.20051#S5.F7). Removing the grounding of our rubric evaluation on the pretraining corpus reduces overall performance to 28%. We suspect this results from a lack of generation-verification gap and an increase in reward hacking, which we analyze in §[5.5](https://arxiv.org/html/2604.20051#S5.SS5). Removing our pipeline's access to $D$ entirely (w/o $D$) further hurts performance, since the questions are then synthesized without grounding and likely degrade in quality and diversity. Grading answers without rubrics slightly hurts performance. Surprisingly, using a pairwise judge is not better than our pointwise grader, which we analyze in Appendix [F.2](https://arxiv.org/html/2604.20051#A6.SS2).
### 5\.5Answer Ranking Analysis
Without external input, the use of the same model to verify its own answers implies a lack of generation-verification gap [[35](https://arxiv.org/html/2604.20051#bib.bib21)]. In our case, the rubric generated by our model may not be able to distinguish high-quality answers from low-quality ones. As a result, the ranking of answers according to the rubric will be inconsistent with the "true quality" ranking, which encourages undesired behavior, a phenomenon known as reward hacking. Since we only choose the highest-ranked answer $y_w$ and lowest-ranked answer $y_l$ for DPO training, reward hacking happens when $y_w$ is worse than $y_l$ in terms of "true quality".
To measure "true quality", we use GPT-4o-mini as a proxy. In particular, we use our prompts and ask it to generate a new set of rubrics and scores for the same questions and answers from $\pi_{ref}$. We define the rankings from this stronger model as the gold rankings. We want to know (1) the correlation between the rankings from POP and the gold rankings, and (2) whether grounding our rubric on the pretraining text $D$ helps. For (2), we revoke access to $D$ and regenerate the rubrics and scores. We denote the resulting answer rankings as Eval w/o $D$ and compare them against our original rankings, Eval w/ $D$, using the gold rankings as a reference.
Table 4: Ranking accuracy.
##### Global correlation is imperfect.
Figure [8](https://arxiv.org/html/2604.20051#S5.F8) shows the correlation between our rankings and the gold rankings. The setting without grounding (Eval w/o $D$) achieves a moderate positive correlation of 0.3301. Grounding the rubric evaluation on pretraining text (Eval w/ $D$) gives a slightly stronger correlation of 0.3424, but still far from perfect.


Figure 8: Correlation between our rankings and the gold rankings on the $\text{Healthcare QA}_{\text{Qwen-2.5-7B}}$ dataset. x-axis: percentile ranking of answers from our model, with (right) or without (left) access to $D$. y-axis: given that an answer is ranked in the top x% according to our model, the distribution of its gold ranking percentiles. A smaller percentile means a higher rank. Spearman's r: Spearman's ranking correlation coefficient.
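The correlation reported here is a rank correlation; a minimal sketch of how it can be computed per question with SciPy (the helper name is ours):

```python
from scipy.stats import spearmanr

# Compare the answer ranking induced by POP's own scores against the "gold"
# ranking from the stronger model via Spearman's rank correlation.
def ranking_correlation(own_scores, gold_scores):
    """own_scores, gold_scores: parallel lists of per-answer scores."""
    rho, _ = spearmanr(own_scores, gold_scores)
    return rho

# Example: ranking_correlation([1.8, 1.2, 0.4], [1.5, 1.6, 0.2]) returns 0.5.
```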
##### $y_w$-$y_l$ correlation is strong, leading to correct training signals.
Since our DPO algorithm only uses the highest- and lowest-ranked answers $y_w$ and $y_l$ for training, a more relevant metric for POP is the pairwise ranking accuracy: the percentage of cases where $y_w$ is truly better than $y_l$ according to the gold rankings. We show this in Table [4](https://arxiv.org/html/2604.20051#S5.T4).
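A small sketch of this pairwise accuracy, assuming a mapping from answers to their gold scores:

```python
# Fraction of training pairs whose preferred answer y_w is also scored higher
# than y_l by the gold (stronger) grader.
def pairwise_ranking_accuracy(pairs, gold_score):
    """pairs: list of (y_w, y_l); gold_score: maps an answer to its gold score."""
    correct = sum(1 for y_w, y_l in pairs if gold_score[y_w] > gold_score[y_l])
    return correct / len(pairs) if pairs else 0.0
```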
Table 5: Rankings and scores of $y_w$ and $y_l$ from Eval w/o $D$ and Eval w/ $D$, according to either $\pi_{ref}$ or the stronger model.
Both w/o $D$ and w/ $D$ achieve accuracies higher than 80%, and w/ $D$ reaches 85%. This suggests that (1) even though the global correlation is imperfect, taking the extremes according to our own rankings still gives mostly correct training signals; and (2) grounding rubric evaluation on the pretraining corpus gives a stronger correlation with the gold rankings and helps reduce reward hacking.
In Table [5](https://arxiv.org/html/2604.20051#S5.T5), we further list detailed statistics of $y_w$ and $y_l$ under both the w/o $D$ and w/ $D$ settings. For both settings, we compute the average rankings and scores of $y_w$ and $y_l$ under our model and the stronger model. In both settings, $y_w$ is ranked and scored significantly higher than $y_l$ even by the gold model. In addition, the rank difference between $y_w$ and $y_l$ is larger in the w/ $D$ setting, showing the benefit of grounding on pretraining text. We show further analysis in Appendix [H](https://arxiv.org/html/2604.20051#A8).
## 6Conclusion
We introduce POP, a rubric\-based self\-play framework on open\-ended tasks\. POP synthesizes question\-answer\-rubric pairs and grades each answer using the rubrics\. To mitigate reward hacking, we give privileged access to pretraining text during rubric generation and only take the answers with the highest and lowest scores for DPO training\. Experiment results suggest that POP bootstraps effective training signals across different tasks\. The design of POP is meant to be generalizable, and we hope it highlights a direction toward cheap and effective post\-training for LLMs\.
## References
- \[1\]R\. K\. Arora, J\. Wei, R\. S\. Hicks, P\. Bowman, J\. Quiñonero\-Candela, F\. Tsimpourlas, M\. Sharman, M\. Shah, A\. Vallone, A\. Beutel, J\. Heidecke, and K\. Singhal\(2025\)HealthBench: evaluating large language models towards improved human health\.arXiv preprint arXiv:2505\.08775\.Cited by:[Appendix G](https://arxiv.org/html/2604.20051#A7.p2.6),[§1](https://arxiv.org/html/2604.20051#S1.p6.1),[§4](https://arxiv.org/html/2604.20051#S4.SS0.SSS0.Px1.p1.1)\.
- \[2\]\(2025\)Reward and guidance through rubrics: promoting exploration to improve multi\-domain reasoning\.arXiv preprint arXiv:2508\.16949\.Cited by:[§1](https://arxiv.org/html/2604.20051#S1.p3.1),[§2](https://arxiv.org/html/2604.20051#S2.SS0.SSS0.Px2.p2.1),[§4](https://arxiv.org/html/2604.20051#S4.SS0.SSS0.Px2.p5.3)\.
- \[3\]J\. Chen, B\. Zhang, R\. Ma, P\. Wang, X\. Liang, Z\. Tu, X\. Li, and K\. K\. Wong\(2025\)SPC: evolving self\-play critic via adversarial games for llm reasoning\.arXiv preprint arXiv:2504\.19162\.Cited by:[§1](https://arxiv.org/html/2604.20051#S1.p3.1),[§2](https://arxiv.org/html/2604.20051#S2.SS0.SSS0.Px1.p1.1)\.
- \[4\]P\. Cheng, T\. Hu, H\. Xu, Z\. Zhang, Z\. Yuan, Y\. Dai, L\. Han, N\. Du, and X\. Li\(2024\)Self\-playing adversarial language game enhances llm reasoning\.arXiv preprint arXiv:2404\.10642\.Cited by:[§2](https://arxiv.org/html/2604.20051#S2.SS0.SSS0.Px1.p1.1)\.
- \[5\]H\. W\. Chung, L\. Hou, S\. Longpre, B\. Zoph, Y\. Tay, W\. Fedus, Y\. Li, X\. Wang, M\. Dehghani, S\. Brahma, A\. Webson, S\. S\. Gu, Z\. Dai, M\. Suzgun, X\. Chen, A\. Chowdhery, A\. Castro\-Ros, M\. Pellat, K\. Robinson, D\. Valter, S\. Narang, G\. Mishra, A\. Yu, V\. Zhao, Y\. Huang, A\. Dai, H\. Yu, S\. Petrov, E\. H\. Chi, J\. Dean, J\. Devlin, A\. Roberts, D\. Zhou, Q\. V\. Le, and J\. Wei\(2022\)Scaling instruction\-finetuned language models\.arXiv preprint arXiv:2210\.11416\.Cited by:[§2](https://arxiv.org/html/2604.20051#S2.SS0.SSS0.Px1.p1.1)\.
- \[6\]K\. Cobbe, V\. Kosaraju, M\. Bavarian, M\. Chen, H\. Jun, L\. Kaiser, M\. Plappert, J\. Tworek, J\. Hilton, R\. Nakano, C\. Hesse, and J\. Schulman\(2021\)Training verifiers to solve math word problems\.arXiv preprint arXiv:2110\.14168\.Cited by:[§4](https://arxiv.org/html/2604.20051#S4.SS0.SSS0.Px2.p1.1)\.
- \[7\]DeepSeek\-AI\(2025\)DeepSeek\-v3\.2: pushing the frontier of open large language models\.arXiv preprint:arXiv:2512\.02556\.Cited by:[§2](https://arxiv.org/html/2604.20051#S2.SS0.SSS0.Px1.p1.1)\.
- \[8\]A\. Gokaslan, V\. Cohen, E\. Pavlick, and S\. Tellex\(2019\)OpenWebText corpus\.Note:[http://Skylion007\.github\.io/OpenWebTextCorpus](http://skylion007.github.io/OpenWebTextCorpus)Cited by:[§4](https://arxiv.org/html/2604.20051#S4.SS0.SSS0.Px1.p1.1)\.
- \[9\]A\. Gunjal, A\. Wang, E\. Lau, V\. Nath, Y\. He, B\. Liu, and S\. Hendryx\(2025\)Rubrics as rewards: reinforcement learning beyond verifiable domains\.arXiv preprint arXiv:2507\.17746\.Cited by:[§1](https://arxiv.org/html/2604.20051#S1.p3.1),[§2](https://arxiv.org/html/2604.20051#S2.SS0.SSS0.Px2.p2.1),[§4](https://arxiv.org/html/2604.20051#S4.SS0.SSS0.Px2.p5.3)\.
- \[10\]Y\. He, W\. Li, H\. Zhang, S\. Li, K\. Mandyam, S\. Khosla, Y\. Xiong, N\. Wang, X\. Peng, B\. Li, S\. Bi, S\. G\. Patil, Q\. Qi, S\. Feng, J\. Katz\-Samuels, R\. Y\. Pang, S\. Gonugondla, H\. Lang, Y\. Yu, Y\. Qian, M\. Fazel\-Zarandi, L\. Yu, A\. Benhalloum, H\. Awadalla, and M\. Faruqui\(2025\)AdvancedIF: rubric\-based benchmarking and reinforcement learning for advancing llm instruction following\.arXiv preprint arXiv:2511\.10507\.Cited by:[§1](https://arxiv.org/html/2604.20051#S1.p3.1),[§2](https://arxiv.org/html/2604.20051#S2.SS0.SSS0.Px2.p2.1),[§4](https://arxiv.org/html/2604.20051#S4.SS0.SSS0.Px2.p5.3)\.
- \[11\]D\. Hendrycks, C\. Burns, S\. Kadavath, A\. Arora, S\. Basart, E\. Tang, D\. Song, and J\. Steinhardt\(2021\)Measuring mathematical problem solving with the math dataset\.InAdvances in Neural Information Processing Systems,Cited by:[§4](https://arxiv.org/html/2604.20051#S4.SS0.SSS0.Px2.p1.1)\.
- \[12\]C\. Huang, W\. Yu, X\. Wang, H\. Zhang, Z\. Li, R\. Li, J\. Huang, H\. Mi, and D\. Yu\(2025\)R\-zero: self\-evolving reasoning llm from zero data\.arXiv preprint arXiv:2508\.05004\.Cited by:[§1](https://arxiv.org/html/2604.20051#S1.p3.1),[§2](https://arxiv.org/html/2604.20051#S2.SS0.SSS0.Px1.p1.1)\.
- \[13\]C\. Huang and T\. Goyal\(2025\)DCRM: a heuristic to measure response pair quality in preference optimization\.InFindings of the Association for Computational Linguistics: EMNLP,Cited by:[§1](https://arxiv.org/html/2604.20051#S1.p4.1),[§3\.2](https://arxiv.org/html/2604.20051#S3.SS2.SSS0.Px3.p1.9)\.
- \[14\]J\. Huang, S\. Gu, L\. Hou, Y\. Wu, X\. Wang, H\. Yu, and J\. Han\(2023\)Large language models can self\-improve\.InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing,Cited by:[§1](https://arxiv.org/html/2604.20051#S1.p2.1)\.
- \[15\]Z\. Huang, Y\. Zhuang, G\. Lu, Z\. Qin, H\. Xu, T\. Zhao, R\. Peng, J\. Hu, Z\. Shen, X\. Hu, X\. Gu, P\. Tu, J\. Liu, W\. Chen, Y\. Fu, Z\. Fan, Y\. Gu, Y\. Wang, Z\. Yang, J\. Li, and J\. Zhao\(2025\)Reinforcement learning with rubric anchors\.arXiv preprint arXiv:2508\.12790\.Cited by:[§1](https://arxiv.org/html/2604.20051#S1.p3.1),[§2](https://arxiv.org/html/2604.20051#S2.SS0.SSS0.Px2.p2.1),[§4](https://arxiv.org/html/2604.20051#S4.SS0.SSS0.Px2.p5.3)\.
- \[16\]D\. Jin, E\. Pan, N\. Oufattole, W\. Weng, H\. Fang, and P\. Szolovits\(2009\)What disease does this patient have? a large\-scale open domain question answering dataset from medical exams\.arXiv preprint arXiv:2009\.13081\.Cited by:[§4](https://arxiv.org/html/2604.20051#S4.SS0.SSS0.Px2.p1.1)\.
- [17] M. Joshi, E. Choi, D. Weld, and L. Zettlemoyer (2017) TriviaQA: a large scale distantly supervised challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551. Cited by: [§4](https://arxiv.org/html/2604.20051#S4.SS0.SSS0.Px2.p1.1).
- \[18\]T\. Kwiatkowski, J\. Palomaki, O\. Redfield, M\. Collins, A\. Parikh, C\. Alberti, D\. Epstein, I\. Polosukhin, J\. Devlin, K\. Lee, K\. Toutanova, L\. Jones, M\. Kelcey, M\. Chang, A\. M\. Dai, J\. Uszkoreit, Q\. Le, and S\. Petrov\(2019\)Natural questions: a benchmark for question answering research\.Transactions of the Association for Computational Linguistics, Volume 7\.Cited by:[§4](https://arxiv.org/html/2604.20051#S4.SS0.SSS0.Px2.p1.1)\.
- \[19\]S\. Lin, J\. Hilton, and O\. Evans\(2022\)TruthfulQA: measuring how models mimic human falsehoods\.InProceedings of the 60th Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),Cited by:[§4](https://arxiv.org/html/2604.20051#S4.SS0.SSS0.Px2.p1.1)\.
- \[20\]B\. Liu, C\. Jin, S\. Kim, W\. Yuan, W\. Zhao, I\. Kulikov, X\. Li, S\. Sukhbaatar, J\. Lanchantin, and J\. Weston\(2025\)SPICE: self\-play in corpus environments improves reasoning\.arXiv preprint arXiv:2510\.24684\.Cited by:[§1](https://arxiv.org/html/2604.20051#S1.p3.1),[§2](https://arxiv.org/html/2604.20051#S2.SS0.SSS0.Px1.p1.1)\.
- \[21\]I\. Loshchilov and F\. Hutter\(2019\)Decoupled weight decay regularization\.InProceedings of International Conference on Learning Representations,Cited by:[§4](https://arxiv.org/html/2604.20051#S4.SS0.SSS0.Px2.p4.2)\.
- \[22\]T\. Luong and E\. Lockhart\(2025\)Advanced version of gemini with deep think officially achieves gold\-medal standard at the international mathematical olympiad\.Cited by:[§1](https://arxiv.org/html/2604.20051#S1.p1.1)\.
- \[23\]S\. Maslenkova, C\. Christophe, M\. A\. Pimentel, T\. Raha, M\. U\. Salman, A\. A\. Mahrooqi, A\. Gupta, S\. Khan, R\. Rajan, and P\. Kanithi\(2025\)Building trust in clinical llms: bias analysis and dataset transparency\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,Cited by:[§4](https://arxiv.org/html/2604.20051#S4.SS0.SSS0.Px1.p1.1),[§5\.3\.1](https://arxiv.org/html/2604.20051#S5.SS3.SSS1.p1.1)\.
- \[24\]S\. J\. Paech\(2025\)EQ\-bench creative writing benchmark v3\.GitHub\.Note:[https://github\.com/EQ\-bench/creative\-writing\-bench](https://github.com/EQ-bench/creative-writing-bench)Cited by:[§1](https://arxiv.org/html/2604.20051#S1.p6.1),[§4](https://arxiv.org/html/2604.20051#S4.SS0.SSS0.Px1.p1.1)\.
- \[25\]A\. Plaat, M\. van Duijn, N\. van Stein, M\. Preuss, P\. van der Putten, and K\. J\. Batenburg\(2025\)Agentic large language models, a survey\.arXiv preprint arXiv:2503\.23037\.Cited by:[§1](https://arxiv.org/html/2604.20051#S1.p1.1)\.
- \[26\]R\. Rafailov, A\. Sharma, E\. Mitchell, S\. Ermon, C\. D\. Manning, and C\. Finn\(2023\)Direct preference optimization: your language model is secretly a reward model\.InarXiv preprint arXiv:2305\.18290,Cited by:[§1](https://arxiv.org/html/2604.20051#S1.p4.1),[§3\.2](https://arxiv.org/html/2604.20051#S3.SS2.SSS0.Px1.p1.6)\.
- \[27\]D\. Rein, B\. L\. Hou, A\. C\. Stickland, J\. Petty, R\. Y\. Pang, J\. Dirani, J\. Michael, and S\. R\. Bowman\(2024\)GPQA: a graduate\-level google\-proof q&a benchmark\.InFirst Conference on Language Modeling,External Links:[Link](https://openreview.net/forum?id=Ti67584b98)Cited by:[§4](https://arxiv.org/html/2604.20051#S4.SS0.SSS0.Px2.p1.1)\.
- \[28\]Y\. Ren and D\. J\. Sutherland\(2025\)Learning dynamics of llm finetuning\.InProceedings of International Conference on Learning Representations,Cited by:[Appendix G](https://arxiv.org/html/2604.20051#A7.p3.1)\.
- [29] D. A. Research (2026) KARL: knowledge agents via reinforcement learning. arXiv preprint arXiv:2603.05218. Cited by: [§2](https://arxiv.org/html/2604.20051#S2.SS0.SSS0.Px1.p1.1).
- \[30\]M\. Rezaei, R\. Vacareanu, Z\. Wang, C\. Wang, B\. Liu, Y\. He, and A\. F\. Akyürek\(2025\)Online rubrics elicitation from pairwise comparisons\.arXiv preprint arXiv:2510\.07284\.Cited by:[§1](https://arxiv.org/html/2604.20051#S1.p3.1),[§2](https://arxiv.org/html/2604.20051#S2.SS0.SSS0.Px2.p2.1),[§4](https://arxiv.org/html/2604.20051#S4.SS0.SSS0.Px2.p5.3)\.
- \[31\]J\. Schulman, F\. Wolski, P\. Dhariwal, A\. Radford, and O\. Klimov\(2017\)Proximal policy optimization algorithms\.arXiv preprint arXiv:1707\.06347\.Cited by:[§3\.2](https://arxiv.org/html/2604.20051#S3.SS2.SSS0.Px1.p1.6)\.
- \[32\]R\. Shao, A\. Asai, S\. Z\. Shen, H\. Ivison, V\. Kishore, J\. Zhuo, X\. Zhao, M\. Park, S\. G\. Finlayson, D\. Sontag, T\. Murray, S\. Min, P\. Dasigi, L\. Soldaini, F\. Brahman, W\. Yih, T\. Wu, L\. Zettlemoyer, Y\. Kim, H\. Hajishirzi, and P\. W\. Koh\(2025\)DR tulu: reinforcement learning with evolving rubrics for deep research\.arXiv preprint arXiv:2511\.19399\.Cited by:[§3\.1](https://arxiv.org/html/2604.20051#S3.SS1.SSS0.Px3.p4.3)\.
- \[33\]H\. Singh, X\. Li, K\. Sareen, M\. Maheswaran, S\. Tan, X\. Wu, J\. Wang, A\. Ariyak, Q\. Wu, S\. Khaki, R\. Tiwari, L\. Lian, Y\. Lu, B\. Li, A\. Suhr, B\. Athiwaratkun, and K\. Keutzer\(2026\)V1: unifying generation and self\-verification for parallel reasoners\.arXiv preprint arXiv:2603\.04304\.Cited by:[Appendix F](https://arxiv.org/html/2604.20051#A6.p1.1),[§5\.4](https://arxiv.org/html/2604.20051#S5.SS4.p1.5)\.
- \[34\]Skelebor\(2022\)Book titles and abstracts\.Note:[https://huggingface\.co/datasets/Skelebor/book\_titles\_and\_descriptions](https://huggingface.co/datasets/Skelebor/book_titles_and_descriptions)Cited by:[§4](https://arxiv.org/html/2604.20051#S4.SS0.SSS0.Px1.p1.1)\.
- \[35\]Y\. Song, H\. Zhang, C\. Eisenach, S\. Kakade, D\. Foster, and U\. Ghai\(2025\)Mind the gap: examining the self\-improvement capabilities of large language models\.InProceedings of International Conference on Learning Representations,Cited by:[§1](https://arxiv.org/html/2604.20051#S1.p4.1),[§5\.5](https://arxiv.org/html/2604.20051#S5.SS5.p1.4)\.
- \[36\]G\. Swamy, C\. Dann, R\. Kidambi, Z\. S\. Wu, and A\. Agarwal\(2024\)A minimaximalist approach to reinforcement learning from human feedback\.InProceedings of the 41st International Conference on Machine Learning,Cited by:[§5\.4](https://arxiv.org/html/2604.20051#S5.SS4.p1.5)\.
- [37] Y. Tang, D. Z. Guo, Z. Zheng, D. Calandriello, Y. Cao, E. Tarassov, R. Munos, B. Á. Pires, M. Valko, Y. Cheng, and W. Dabney (2024) Understanding the performance gap between online and offline alignment algorithms. arXiv preprint arXiv:2405.08448. Cited by: [Appendix G](https://arxiv.org/html/2604.20051#A7.p3.1).
- [38] OpenAI (2023) GPT-4 technical report. arXiv preprint arXiv:2303.08774. Cited by: [§4](https://arxiv.org/html/2604.20051#S4.SS0.SSS0.Px1.p1.1).
- [39] Qwen Team (2024) Qwen2.5 technical report. arXiv preprint arXiv:2412.15115. Cited by: [§4](https://arxiv.org/html/2604.20051#S4.SS0.SSS0.Px2.p2.1).
- \[40\]P\. Villalobos, A\. Ho, J\. Sevilla, T\. Besiroglu, L\. Heim, and M\. Hobbhahn\(2024\)Will we run out of data? limits of llm scaling based on human\-generated data\.InProceedings of the 41st International Conference on Machine Learning,Cited by:[§1](https://arxiv.org/html/2604.20051#S1.p1.1),[§1](https://arxiv.org/html/2604.20051#S1.p3.1),[§4](https://arxiv.org/html/2604.20051#S4.SS0.SSS0.Px2.p5.3)\.
- \[41\]V\. Viswanathan, Y\. Sun, S\. Ma, X\. Kong, M\. Cao, G\. Neubig, and T\. Wu\(2025\)Checklists are better than reward models for aligning language models\.InAdvances in Neural Information Processing Systems,Cited by:[§2](https://arxiv.org/html/2604.20051#S2.SS0.SSS0.Px2.p2.1)\.
- \[42\]Y\. Wang, X\. Ma, G\. Zhang, Y\. Ni, A\. Chandra, S\. Guo, W\. Ren, A\. Arulraj, X\. He, Z\. Jiang, T\. Li, M\. Ku, K\. Wang, A\. Zhuang, R\. Fan, X\. Yue, and W\. Chen\(2024\)MMLU\-pro: a more robust and challenging multi\-task language understanding benchmark\.InAdvances in Neural Information Processing Systems, Datasets and Benchmarks Track,Cited by:[§4](https://arxiv.org/html/2604.20051#S4.SS0.SSS0.Px2.p1.1)\.
- \[43\]W\. Yuan, R\. Y\. Pang, K\. Cho, X\. Li, S\. Sukhbaatar, J\. Xu, and J\. Weston\(2024\)Self\-rewarding language models\.InProceedings of the 41st International Conference on Machine Learning,Cited by:[§1](https://arxiv.org/html/2604.20051#S1.p2.1)\.
- \[44\]Z\. Zhang, C\. Huang, A\. O\. Li, and C\. Cardie\(2025\)Better llm reasoning via dual\-play\.arXiv preprint arXiv:2511\.11881\.Cited by:[§1](https://arxiv.org/html/2604.20051#S1.p3.1),[§2](https://arxiv.org/html/2604.20051#S2.SS0.SSS0.Px1.p1.1)\.
- \[45\]A\. Zhao, Y\. Wu, Y\. Yue, T\. Wu, Q\. Xu, Y\. Yue, M\. Lin, S\. Wang, Q\. Wu, Z\. Zheng, and G\. Huang\(2025\)Absolute zero: reinforced self\-play reasoning with zero data\.arXiv preprint arXiv:2505\.03335\.Cited by:[§1](https://arxiv.org/html/2604.20051#S1.p3.1),[§2](https://arxiv.org/html/2604.20051#S2.SS0.SSS0.Px1.p1.1)\.
- \[46\]L\. Zheng, W\. Chiang, Y\. Sheng, S\. Zhuang, Z\. Wu, Y\. Zhuang, Z\. Lin, Z\. Li, D\. Li, E\. P\. Xing, H\. Zhang, J\. E\. Gonzalez, and I\. Stoica\(2023\)Judging llm\-as\-a\-judge with mt\-bench and chatbot arena\.InAdvances in Neural Information Processing Systems, Datasets and Benchmarks Track,Cited by:[§1](https://arxiv.org/html/2604.20051#S1.p6.1),[§4](https://arxiv.org/html/2604.20051#S4.SS0.SSS0.Px1.p1.1)\.
- \[47\]J\. Zhou, T\. Lu, S\. Mishra, S\. Brahma, S\. Basu, Y\. Luan, D\. Zhou, and L\. Hou\(2023\)Instruction\-following evaluation for large language models\.arXiv preprint arXiv:2311\.07911\.Cited by:[§1](https://arxiv.org/html/2604.20051#S1.p6.1),[§4](https://arxiv.org/html/2604.20051#S4.SS0.SSS0.Px1.p1.1)\.
- \[48\]Y\. Zhou, S\. Li, S\. Liu, W\. Fang, K\. Zhang, J\. Zhao, J\. Yang, Y\. Zhou, J\. Lv, T\. Zheng, H\. Lu, W\. Chen, Y\. Xie, and M\. Song\(2025\)Breaking the exploration bottleneck: rubric\-scaffolded reinforcement learning for general llm reasoning\.arXiv preprint arXiv:2508\.16949\.Cited by:[Appendix G](https://arxiv.org/html/2604.20051#A7.p2.6),[§1](https://arxiv.org/html/2604.20051#S1.p3.1),[§2](https://arxiv.org/html/2604.20051#S2.SS0.SSS0.Px1.p1.1),[§2](https://arxiv.org/html/2604.20051#S2.SS0.SSS0.Px2.p2.1),[§4](https://arxiv.org/html/2604.20051#S4.SS0.SSS0.Px2.p5.3)\.
- \[49\]Y\. Zhou, S\. Levine, J\. Weston, X\. Li, and S\. Sukhbaatar\(2025\)Self\-challenging language model agents\.arXiv preprint arXiv:2506\.01716\.Cited by:[§1](https://arxiv.org/html/2604.20051#S1.p3.1)\.
## Appendix AHyperparameter
We show the list of hyperparameters in Table[6](https://arxiv.org/html/2604.20051#A1.T6)\.
| Stage | Parameter | Value |
| --- | --- | --- |
| Sampling | Max context length | 32768 |
| Sampling | \|d\| (words) | [50, 1024] |
| Sampling | Proposer max new tokens | 6144 |
| Sampling | Proposer temperature | 1.0 |
| Sampling | Proposer top p | 1.0 |
| Sampling | Solver max new tokens | 6144 |
| Sampling | Solver temperature | 1.0 |
| Sampling | Solver top p | 1.0 |
| Sampling | Rubric gen max new tokens | 8192 |
| Sampling | Rubric gen temperature | 0.0 |
| Sampling | Rubric gen top p | 1.0 |
| Sampling | Ans grading max new tokens | 4096 |
| Sampling | Ans grading temperature | 0.0 |
| Sampling | Ans grading top p | 1.0 |
| Sampling | # Ques ($I$) | 4096 |
| Sampling | # Ques ($I$), Creative Writing | 8192 |
| Sampling | # Ques per $d$ | 1 |
| Sampling | # Ans per Ques ($J$) | 32 |
| Sampling | Max # rubric criteria per Ques ($K$) | 5 |
| Continuous pretraining (Train on $D$) | Scheduler | Cosine |
| Continuous pretraining (Train on $D$) | Warmup ratio | 0.1 |
| Continuous pretraining (Train on $D$) | Optimizer | AdamW |
| Continuous pretraining (Train on $D$) | Learning rate | 2e-5 |
| Continuous pretraining (Train on $D$) | Number of epochs | 1 |
| Continuous pretraining (Train on $D$) | Batch size | 64 |
| DPO (POP) | Scheduler | Cosine |
| DPO (POP) | Warmup ratio | 0.1 |
| DPO (POP) | Optimizer | AdamW |
| DPO (POP) | Learning rate | 1e-6 |
| DPO (POP) | Beta | 0.01 |
| DPO (POP) | Number of epochs | 1 |
| DPO (POP) | Batch size | 16 |

Table 6: Hyperparameters.

For rubric generation, to prevent context length overflow, we select ten candidate answers to put into the rubric generation prompt. In particular, we sort the answers by length and then pick one from every #Ans/10, so that the rubric generator sees answers of varying length.
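A minimal sketch of this length-stratified subsampling, assuming candidate answers are plain strings; the helper name is ours.

```python
# Sort the J candidate answers by length and keep every (J // 10)-th one, so
# the rubric generator sees ten answers spanning short to long without
# overflowing the context window.
def subsample_for_rubric(answers, n_keep=10):
    ranked = sorted(answers, key=len)
    stride = max(len(ranked) // n_keep, 1)
    return ranked[::stride][:n_keep]
```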
## Appendix BPOP Algorithm
We show the pseudo code of POP in Algorithm[1](https://arxiv.org/html/2604.20051#alg1)\.
Algorithm 1: POP Pipeline

Require: reference LLM $\pi_{ref}$; pretraining corpus $D$; number of questions $I$; number of answers per question $J$; maximum number of rubric criteria per question $K$; task $t$; question synthesis prompt $P_t^{qus}$; question answering prompt $P_t^{ans}$; rubric generation prompt $P^{rub}$; answer grading prompt $P^{grade}$.

// Sampling
1. Initialize $\text{RawDataset} \leftarrow \{\}$
2. for $i = 1$ to $I$ do
3.   // Question Synthesis
4.   Sample $d \sim D$
5.   Sample $(x, y_{ref}) \sim \pi_{ref}(\cdot \mid P_t^{qus}(d))$
6.   // Question Answering
7.   Sample $\{y_j\}_{j=1}^{J} \sim \pi_{ref}(\cdot \mid P_t^{ans}(x))$
8.   // Rubric Generation
9.   Draw sample answers $Y_{rub} \subset \{y_j\}_{j=1}^{J}$   ▷ prevent context overflow
10.  Sample $r \sim \pi_{ref}(\cdot \mid P^{rub}(d, x, y_{ref}, Y_{rub}))$ with $|r| \leq K$
11.  // Answer Grading
12.  for $j = 1$ to $J$ do
13.    Sample $e_j \sim \pi_{ref}(\cdot \mid P^{grade}(x, r, y_j))$
14.    $\{w_k\}_{k=1}^{K_i} \leftarrow \text{GetWeights}(r)$
15.    $\{s_j^k\}_{k=1}^{K_i} \leftarrow \text{GetScores}(e_j)$
16.    $s_j \leftarrow \frac{\sum_{k=1}^{K} w_k \cdot s_j^k}{\sum_{k=1}^{K} w_k}$
17.  end for
18.  $\text{RawDataset} \leftarrow \text{RawDataset} \cup \{(d, x, y_{ref}, \{y_j\}_{j=1}^{J}, r, \{(e_j, s_j)\}_{j=1}^{J})\}$
19. end for
// Filtering
20. $\text{FilteredDataset} \leftarrow \{\}$
21. for example in RawDataset do
22.   isvalid $\leftarrow$ True
23.   for component in example do
24.     if not ValidCheck(component) then   ▷ not extractable, etc.
25.       isvalid $\leftarrow$ False; break
26.     end if
27.   end for
28.   if isvalid then $\text{FilteredDataset} \leftarrow \text{FilteredDataset} \cup \{\text{example}\}$ end if
29. end for
// Pairing
30. $\text{DPODataset} \leftarrow \{\}$
31. for example in FilteredDataset do
32.   $x \leftarrow \text{GetQuestion}(\text{example})$
33.   $\{(y_j, s_j)\}_{j=1}^{J'} \leftarrow \text{GetAnswersAndScores}(\text{example})$
34.   $y_w \leftarrow \arg\max_{y_j}(s_j)$;  $y_l \leftarrow \arg\min_{y_j}(s_j)$
35.   if $s_{y_w} > s_{y_l}$ and $\big|\,|y_w| - |y_l|\,\big| \leq 100$ then
36.     $\text{DPODataset} \leftarrow \text{DPODataset} \cup \{(x, y_w, y_l)\}$
37.   end if
38. end for
// Training
39. $\pi_{trained} \leftarrow \text{DPO}(\pi_{ref}, \text{DPODataset})$
40. return $\pi_{trained}$
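As a concrete, self-contained illustration of the grading-aggregation and pairing steps of Algorithm 1, the sketch below reproduces the weighted rubric score $s_j$ and the best-vs-worst DPO pairing with the length-gap filter. The helper names and the dictionary layout are ours, not taken from the released code, and the length unit in the gap filter (characters) is an assumption made purely for illustration.

```python
def aggregate_rubric_score(weights, scores):
    """Weighted rubric score: s_j = sum_k(w_k * s_j^k) / sum_k(w_k)."""
    assert len(weights) == len(scores) and sum(weights) > 0
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


def build_dpo_pairs(examples, max_len_gap=100):
    """Turn graded examples into (prompt, chosen, rejected) DPO triples.

    Each example is a dict {"question": str, "graded": [(answer, score), ...]}
    (an illustrative layout, not the released format). The best- and
    worst-scoring answers form a pair, kept only if the winner strictly
    outscores the loser and their lengths differ by at most `max_len_gap`;
    character length is used here for illustration.
    """
    pairs = []
    for ex in examples:
        y_w, s_w = max(ex["graded"], key=lambda t: t[1])
        y_l, s_l = min(ex["graded"], key=lambda t: t[1])
        if s_w > s_l and abs(len(y_w) - len(y_l)) <= max_len_gap:
            pairs.append({"prompt": ex["question"], "chosen": y_w, "rejected": y_l})
    return pairs
```

The resulting (prompt, chosen, rejected) triples can then be handed to any standard DPO trainer.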
## Appendix C Results on Creative Writing V3
We show the per-criterion results for Qwen-2.5-7B in Figure [9](https://arxiv.org/html/2604.20051#A3.F9) and for Qwen-2.5-7B-Inst in Figure [10](https://arxiv.org/html/2604.20051#A3.F10). POP achieves consistent improvements over the reference models in both cases. Exceptions include "Incongruent Ending Positivity", "Amateurish", "Unsurprising or Uncreative", and "Tell-Don't-Show" for both models, and "Unearned Transformations", "Overwrought", "Purple Prose", "Weak Dialogue", and "Meandering" for Qwen-2.5-7B-Inst. Naively pretraining on $D$ significantly degrades performance on most criteria.
Figure 9: Qwen-2.5-7B results on Creative Writing V3.
Figure 10: Qwen-2.5-7B-Inst results on Creative Writing V3.
## Appendix D Results on OOD Benchmarks
We show the OOD results of models trained on the Creative Writing datasets in Table [7](https://arxiv.org/html/2604.20051#A4.T7) and of models trained on the Instruction Following datasets in Table [8](https://arxiv.org/html/2604.20051#A4.T8). POP maintains performance across the different benchmarks relative to the reference model, while naive pretraining on $D$ causes degradation.
Table 7: OOD results for models trained on the Creative Writing datasets. NQ: NaturalQuestions; TvQA: TriviaQA; TfQA: TruthfulQA; MMLU-P: MMLU-Pro; GPQA-D: GPQA-Diamond; GSM: GSM8K; Math: MATH500. We use 0-shot evaluation.
Table 8: OOD results for models trained on the Instruction Following datasets.
## Appendix E Statistics of the Synthesized Datasets
### E.1 Questions and Rubrics Classification
##### Questions\.
For each task, we sample 100 examples from both Qwen-2.5-7B and Qwen-2.5-7B-Instruct. The questions from the resulting 200 examples are fed into GPT-4.1-mini to identify common topics. We then ask GPT-4.1-mini again to categorize each question into exactly one of the topics.
##### Rubrics\.
We take the rubric criteria from the 200 examples used in the question topic analysis above and feed them into GPT-4.1-mini to identify the meta-criteria. We then ask GPT-4.1-mini again to categorize each rubric criterion into exactly one of the meta-criteria.
### E.2 Healthcare QA
We show the statistics for our synthesized datasets $\text{HealthcareQA}_{\text{Qwen-2.5-7B}}$ in Figure [11](https://arxiv.org/html/2604.20051#A5.F11) and $\text{HealthcareQA}_{\text{Qwen-2.5-7B-Inst}}$ in Figure [12](https://arxiv.org/html/2604.20051#A5.F12).
Figure 11: Statistics for the $\text{Healthcare QA}_{\text{Qwen-2.5-7B}}$ dataset. (a) Question topics; (b) meta rubric criteria; (c) answer score distribution, grouped by question; (d) individual answer score distribution.
Figure 12: Statistics for the $\text{Healthcare QA}_{\text{Qwen-2.5-7B-Inst}}$ dataset. (a) Question topics; (b) meta rubric criteria; (c) answer score distribution, grouped by question; (d) individual answer score distribution.
### E.3 Creative Writing
We show the statistics for our synthesized datasets $\text{CreativeWriting}_{\text{Qwen-2.5-7B}}$ in Figure [13](https://arxiv.org/html/2604.20051#A5.F13) and $\text{CreativeWriting}_{\text{Qwen-2.5-7B-Inst}}$ in Figure [14](https://arxiv.org/html/2604.20051#A5.F14).
Figure 13: Statistics for the $\text{Creative Writing}_{\text{Qwen-2.5-7B}}$ dataset. (a) Question topics; (b) meta rubric criteria; (c) answer score distribution, grouped by question; (d) individual answer score distribution.
Figure 14: Statistics for the $\text{Creative Writing}_{\text{Qwen-2.5-7B-Inst}}$ dataset. (a) Question topics; (b) meta rubric criteria; (c) answer score distribution, grouped by question; (d) individual answer score distribution.
### E.4 Instruction Following
We show the statistics for our synthesized datasets $\text{InstructionFollowing}_{\text{Qwen-2.5-7B}}$ in Figure [15](https://arxiv.org/html/2604.20051#A5.F15) and $\text{InstructionFollowing}_{\text{Qwen-2.5-7B-Inst}}$ in Figure [16](https://arxiv.org/html/2604.20051#A5.F16). We highlight that we use the most general prompt (Figure [49](https://arxiv.org/html/2604.20051#A12.F49)) and pretraining corpus (OpenWebText) for this task, so the topic coverage is much broader than for Healthcare QA and Creative Writing, and we do not explicitly ask the models to generate questions similar to those in our evaluation benchmarks (IFEval, Arena-Hard).
Figure 15: Statistics for the $\text{Instruction Following}_{\text{Qwen-2.5-7B}}$ dataset. (a) Question topics; (b) meta rubric criteria; (c) answer score distribution, grouped by question; (d) individual answer score distribution.
Figure 16: Statistics for the $\text{Instruction Following}_{\text{Qwen-2.5-7B-Inst}}$ dataset. (a) Question topics; (b) meta rubric criteria; (c) answer score distribution, grouped by question; (d) individual answer score distribution.
## Appendix F Ablation Study
We detail the methodology for each ablation setting in Appendix [F.1](https://arxiv.org/html/2604.20051#A6.SS1). Since prior work [33](https://arxiv.org/html/2604.20051#bib.bib51) reports strong performance of pairwise judges over pointwise judges, we also investigate why the pairwise judge does not outperform ours, using the same ranking analysis as in §[5.5](https://arxiv.org/html/2604.20051#S5.SS5) on the dataset produced by each setting.
### F.1 Methodology
##### Eval w/o $D$.
We use the same prompt for rubric generation (see Figure [59](https://arxiv.org/html/2604.20051#A12.F59)), while setting $d$ (i.e., the "knowledge") to "None". This ensures that the prompt is fixed and only the pretraining text is removed.
##### w/o $D$.
Similarly, we set $d$ (i.e., the "knowledge") to "None" for the question synthesis prompt and the rubric generation prompt.
Figure 17: Judge without Rubric System Prompt.
Figure 18: Judge without Rubric User Prompt.
##### w/o rubric\.
We ask the model to directly give a single rating to each candidate answer, without generating the intermediate rubric. To distinguish different answers, we use a finer-grained rating scale of 0 to 10. Importantly, to ensure a fair comparison, we still give the answer grading prompt access to the pretraining text $d$. See the prompts in Figure [17](https://arxiv.org/html/2604.20051#A6.F17) and Figure [18](https://arxiv.org/html/2604.20051#A6.F18).
##### w/ pairwise judge\.
Figure 19: Pairwise Judge System Prompt.
Figure 20: Pairwise Judge User Prompt.
Table 9: Ranking accuracy.
Naive pairwise judging requires $O(N^2)$ calls to the judge model, which is prohibitively expensive since we often have more than 20 answers per question. We therefore use an anchor-based approach that reduces the number of pairwise comparisons to $O(N)$. In particular, we select an anchor answer and compare every answer against that anchor. For each comparison, we ask the model to give both the current answer and the anchor a score from 0 to 10 (see prompts in Figure [19](https://arxiv.org/html/2604.20051#A6.F19) and Figure [20](https://arxiv.org/html/2604.20051#A6.F20)). To avoid position bias, for each candidate answer we prompt the judge twice, with the order of the anchor and the candidate switched. When comparing the anchor with answer $i$, we denote the score of answer $i$ as $s^i$ and the score of the anchor as $s_{anchor}^i$. We then compute the relative score of answer $i$ as $s^i - s_{anchor}^i$, which is used to rank the answers and select $y_w$ and $y_l$. To avoid length bias, we sort the answers by length and pick the medium-length answer as the anchor.
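The anchor-based comparison described above can be sketched in a few lines. In this illustration we average the two order-swapped queries into a single relative score, which is one reasonable way to combine them but is our assumption; `judge_pair` is a placeholder for the actual LLM judge call.

```python
def rank_with_anchor(answers, judge_pair):
    """Anchor-based pairwise judging with O(N) judge calls (two per answer).

    `answers` is a list of strings; `judge_pair(a, b)` is a placeholder for
    the LLM judge and is assumed to return (score_a, score_b) in [0, 10].
    Returns the answers sorted from best to worst by relative score.
    """
    # Use the medium-length answer as the anchor to limit length bias.
    by_len = sorted(answers, key=len)
    anchor = by_len[len(by_len) // 2]

    scored = []
    for ans in answers:
        # Query twice with positions swapped to reduce position bias, then
        # average the two relative scores (our choice, not specified above).
        s_ans_a, s_anchor_a = judge_pair(ans, anchor)
        s_anchor_b, s_ans_b = judge_pair(anchor, ans)
        relative = ((s_ans_a - s_anchor_a) + (s_ans_b - s_anchor_b)) / 2
        scored.append((ans, relative))

    return [a for a, _ in sorted(scored, key=lambda t: t[1], reverse=True)]
```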
### F.2 Answer Ranking Analysis for Pairwise Judge


Figure 21: Correlation between our rankings and gold rankings on the $\text{healthcare}_{\text{Qwen-2.5-7B}}$ dataset. x-axis: ranking of answers from our model, with pointwise (right) or pairwise (left) judges. y-axis: given that an answer is ranked in the top x% according to our model, the distribution of its gold ranking percentiles. A smaller rank percentile means a higher rank. Spearman's r: Spearman's ranking correlation coefficient.
We show the correlation between the rankings of answers from our model and the gold rankings from the stronger model in Figure [21](https://arxiv.org/html/2604.20051#A6.F21), and the pairwise ranking accuracy in Table [9](https://arxiv.org/html/2604.20051#A6.T9).
The pairwise judge gives a much higher ranking correlation of 0.44, but its pairwise ranking accuracy remains similar to ours, which partly explains why its performance on HealthBench500 is also similar to ours.
Table 10: Rankings and scores of $y_w$ and $y_l$ from Eval w/o $D$ and Eval w/ $D$, according to either $\pi_{ref}$ or the stronger model. Scores from the pairwise judge are on a different scale and are not normalized.
We further show the detailed ranking and score statistics in Table [10](https://arxiv.org/html/2604.20051#A6.T10).
## Appendix G Using a Stronger Model in Our Pipeline
Figure 22: Results with a strong teacher model.
We want to investigate the effect of introducing strong external supervision into our pipeline. In particular, we replace our model with a stronger teacher model, GPT-4o-mini, for each component of the pipeline. This includes using the teacher model to (1) synthesize questions (w/ teacher qus); (2) generate answers (w/ teacher ans); and (3) generate rubrics and grade answers (w/ teacher eva). Each of these settings produces a new training dataset. We conduct the experiments on Healthcare QA with Qwen-2.5-7B and show the results on HealthBench500 in Figure [22](https://arxiv.org/html/2604.20051#A7.F22).
Surprisingly, none of the variants significantly surpasses our unsupervised version; w/ teacher qus and w/ teacher eva only give performance similar to ours. For w/ teacher qus, we suspect that the relatively small training set size we use prevents the benefits of better questions from a stronger model from manifesting. For w/ teacher eva, as we discuss in §[5.5](https://arxiv.org/html/2604.20051#S5.SS5), our choice of $y_w$ and $y_l$ already gives mostly correct training signals to DPO, and our $y_w$ already ranks significantly higher than $y_l$, so choosing better ($y_w$, $y_l$) pairs yields marginal benefits. However, we suspect that for RL algorithms that train the model on the full set of answers rather than just the extremes, such as PPO or GRPO, rubric evaluation with a stronger teacher should give much better results, because while our rankings at the extremes are mostly accurate, the full rankings are not. More work is needed to align our observations with prior work [1](https://arxiv.org/html/2604.20051#bib.bib33), [48](https://arxiv.org/html/2604.20051#bib.bib28).
Even more surprisingly, training on the teacher’s answers degrades performance \(w/ teacher ans\)\. However, we argue that this is likely due to the sensitivity of DPO to off\-policy responses\[[37](https://arxiv.org/html/2604.20051#bib.bib54),[28](https://arxiv.org/html/2604.20051#bib.bib53)\]\.
## Appendix H Additional Answer Ranking Analysis
Table 11: Ranking correlation under the same rubric.
We additionally experiment with a setting where we ask the stronger model to grade the same answers using the rubrics synthesized by our model.
##### Correlation is higher if conditioned on the same rubrics\.
As shown in Table [11](https://arxiv.org/html/2604.20051#A8.T11), Spearman's ranking correlation increases from 0.34 to 0.43, and the $y_w$-$y_l$ ranking accuracy increases from 85% to 89%. This is expected, since using the same rubric means both models share common ground on the evaluation criteria.
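As an illustration of the two quantities compared here, the per-question computation could look like the sketch below; how these numbers are aggregated across questions is not restated here, so treat this purely as a minimal, assumed implementation.

```python
from scipy.stats import spearmanr

def ranking_agreement(our_scores, gold_scores):
    """Per-question agreement between our judge and a stronger gold judge.

    `our_scores[i]` and `gold_scores[i]` are the scores the two judges assign
    to the same i-th answer. Returns Spearman's rank correlation over all
    answers, plus whether the answer our judge ranks best is also scored
    above the answer our judge ranks worst by the gold judge (the y_w-y_l
    check for this question).
    """
    rho, _ = spearmanr(our_scores, gold_scores)
    w = max(range(len(our_scores)), key=our_scores.__getitem__)  # index of our y_w
    l = min(range(len(our_scores)), key=our_scores.__getitem__)  # index of our y_l
    pair_correct = gold_scores[w] > gold_scores[l]
    return rho, pair_correct
```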
## Appendix I Compute
Each run of POP, including sampling, filtering, pairing, training, and evaluation, uses a single node with 32 CPU cores, 192 GB of memory, and one Nvidia A100 80GB GPU.
## Appendix J Limitations and Future Directions
Due to cost and compute constraints, we were not able to scale up our synthesized datasets. Since the design of POP is general and not specific to any task, we believe that synthesizing substantially larger datasets on a general pretraining corpus (e.g., OpenWebText) is a promising direction toward cheap, effective, and general post-training.
Our pipeline also requires a sufficiently strong reference model; otherwise, the model may not be able to follow our instructions to synthesize questions, answers, and rubrics, or to conduct evaluations. Investigating the minimum level of model competency required for POP to function is a promising future direction.
Another direction is to raise the level of automation one step further: ask $\pi_{ref}$ to automatically select tasks suitable for the pretraining text and to generate the corresponding question synthesis prompt.
## Appendix K Examples
We show a complete example for each task in this section\.
### K.1 Healthcare QA
See Figures [23](https://arxiv.org/html/2604.20051#A11.F23) through [30](https://arxiv.org/html/2604.20051#A11.F30) for the sampled pretraining text, synthesized question, reference answer, rubric, and the answers and gradings for $y_w$ and $y_l$.
Figure 23: Pretraining Text (Healthcare QA).
Figure 24: Question (Healthcare QA).
Figure 25: Reference Answer (Healthcare QA).
Figure 26: Rubric (Healthcare QA).
Figure 27: Answer $y_w$ (Healthcare QA).
Figure 28: Answer $y_w$ Grading (Healthcare QA).
Figure 29: Answer $y_l$ (Healthcare QA).
Figure 30: Answer $y_l$ Grading (Healthcare QA).
### K.2 Creative Writing
See Figures [31](https://arxiv.org/html/2604.20051#A11.F31) through [38](https://arxiv.org/html/2604.20051#A11.F38) for the sampled pretraining text, synthesized question, reference answer, rubric, and the answers and gradings for $y_w$ and $y_l$.
Figure 31: Pretraining Text (Creative Writing).
Figure 32: Question (Creative Writing).
Figure 33: Reference Answer (Creative Writing).
Figure 34: Rubric (Creative Writing).
Figure 35: Answer $y_w$ (Creative Writing).
Figure 36: Answer $y_w$ Grading (Creative Writing).
Figure 37: Answer $y_l$ (Creative Writing).
Figure 38: Answer $y_l$ Grading (Creative Writing).
### K.3 Instruction Following
See Figures [39](https://arxiv.org/html/2604.20051#A11.F39) through [46](https://arxiv.org/html/2604.20051#A11.F46) for the sampled pretraining text, synthesized question, reference answer, rubric, and the answers and gradings for $y_w$ and $y_l$.
Figure 39: Pretraining Text (Instruction Following).
Figure 40: Question (Instruction Following).
Figure 41: Reference Answer (Instruction Following).
Figure 42: Rubric (Instruction Following).
Figure 43: Answer $y_w$ (Instruction Following).
Figure 44: Answer $y_w$ Grading (Instruction Following).
Figure 45: Answer $y_l$ (Instruction Following).
Figure 46: Answer $y_l$ Grading (Instruction Following).
## Appendix L Prompts
Figure 47: Question Synthesis Prompt (Healthcare QA).
Figure 48: Question Synthesis Prompt (Creative Writing).
Figure 49: Question Synthesis Prompt (Instruction Following).
Figure 50: Question Synthesis User Prompt.
Figure 51: Question Synthesis User Prompt (Creative Writing).
Figure 52: Question Answering System Prompt (Healthcare QA).
Figure 53: Question Answering System Prompt (Creative Writing).
Figure 54: Question Answering System Prompt (Instruction Following).
Figure 55: Question Answering User Prompt.
Figure 56: Rubric Generation System Prompt.
Figure 57: Rubric Generation User Prompt.
Figure 58: Answer Grading System Prompt.
Figure 59: Answer Grading User Prompt.