On Predicting the Post-training Potential of Pre-trained LLMs


# On Predicting the Post-training Potential of Pre-trained LLMs
Source: [https://arxiv.org/html/2605.11978](https://arxiv.org/html/2605.11978)
Xiaoyuan Li¹, Yubo Ma², Kexin Yang², Moxin Li³, Keqin Bao¹, Wenjie Wang¹, Fuli Feng¹, Dayiheng Liu²
¹University of Science and Technology of China ²Alibaba Group ³National University of Singapore

###### Abstract

The performance of Large Language Models (LLMs) on downstream tasks is fundamentally constrained by the capabilities acquired during pre-training. However, traditional benchmarks like MMLU often fail to reflect a base model's plasticity in complex open-ended scenarios, leading to inefficient model selection. We address this by introducing a new task of predicting *post-training potential*: forecasting a base model's performance before post-training. We propose RuDE (Rubric-based Discriminative Evaluation), a unified framework that bypasses the generation gap of base models by leveraging response discrimination. Guided by our systematic 4C Taxonomy, RuDE constructs controlled contrastive pairs across diverse domains via fine-grained rubric violations. Extensive experiments demonstrate a correlation greater than 90% with post-training performance. Crucially, validation via Reinforcement Learning (RL) confirms that RuDE effectively identifies high-potential smaller models that outperform larger counterparts, offering a compute-efficient mechanism for foundation model development.

## 1 Introduction

The development of Large Language Models (LLMs) typically follows a two-stage pipeline: *pre-training* followed by *post-training* (Yang et al., 2025; Team et al., 2025a; Liu et al., 2025a). Pre-training over large text corpora allows the LLM to capture linguistic patterns and store extensive knowledge (Radford et al., 2018). Subsequently, post-training enables the LLM to follow human instructions (Raffel et al., 2020), align with human preferences (Ouyang et al., 2022), and perform complex reasoning tasks (Shao et al., 2024), unlocking the underlying capabilities captured during pre-training. Recent studies show that the performance of post-trained models largely depends on the capabilities acquired during the pre-training stage (Yue et al., 2025; Ghosh et al., 2024; Zhou et al., 2023a). Therefore, selecting an optimal pre-trained model is crucial for effective post-training and for saving computational resources.

However, a critical research gap remains in how to pre-evaluate the potential of pre-trained models for effective post-training. Existing evaluations typically measure pre-trained model quality using accuracy on multiple-choice knowledge benchmarks such as MMLU (Hendrycks et al., 2021) or using perplexity metrics (Zeng et al., 2025b). These metrics, however, are insufficient to capture the model's *post-training potential*: its expected performance on post-training tasks that are typically open-ended and require both complex instruction-following ability and understanding of human intents. We observe that high performance of pre-trained models on multiple-choice knowledge benchmarks correlates weakly with their post-training performance on open-ended instruction-following tasks (*cf.* Figure 1). This motivates us to investigate the following problem.

How can we evaluate the relative post\-training potential of pre\-trained models across diverse tasks?

We identify the following challenges in addressing this problem. (1) **Generation Gap**: Pre-trained models are optimized for next-token prediction and generally lack the instruction-following capability to respond to open-ended queries (Zhou et al., 2023a), making direct evaluation challenging. (2) **Unified Objective Evaluation**: Unlike multiple-choice questions, which can be evaluated via exact answer matching, open-ended responses involve diverse criteria such as formatting, helpfulness, and factual correctness. A generalized evaluation framework is therefore needed to quantify these objectives across diverse tasks (Liang et al., 2023).

To address the first challenge, inspired by the *Generation-Evaluation Consistency* hypothesis (Liu et al., 2025b), we estimate a pre-trained model's ability to generate desired open-ended responses based on its capability to discriminate between preferred and dispreferred responses. To tackle the second challenge, we introduce the *4C Taxonomy*, a systematic framework for evaluating open-ended generation across four distinct aspects: *Competence*, *Content*, *Control*, and *Compliance*. Post-training tasks are evaluated using discriminative response sets constructed from task-specific combinations of these aspects, allowing the framework to adapt to diverse tasks.

To this end, we propose RuDE (Rubric-based Discriminative Evaluation), a unified framework for evaluating the relative post-training potential of pre-trained models across diverse tasks. We begin by manually defining fine-grained rubrics, *i.e.*, evaluation criteria, for each aspect of the 4C Taxonomy. For a comprehensive evaluation, we adapt four open-ended datasets into the response-discrimination format, covering diverse domains (Medical, Legal & Finance, Complex Instruction Following, and Creative Writing), and define rubric sets for each task. We propose a generator-verifier pipeline to automatically construct discriminative response sets, where preferred responses satisfy all rubrics and dispreferred responses violate some. Evaluation results on pre-trained models show a strong Pearson correlation ($>90\%$) with their fully instruction-tuned performance. Experiments on Reinforcement Learning (RL) further validate our approach, showing that a smaller pre-trained model (Qwen3-4B-Base) predicted to have higher potential indeed outperforms a larger counterpart (Qwen2.5-7B-Base) after post-training. Our contributions are summarized as follows:

- We introduce the new task of evaluating the post-training potential of pre-trained LLMs.
- We propose a unified evaluation framework for post-training potential across diverse tasks via rubric-based response discrimination grounded in our 4C Taxonomy.
- We perform extensive experiments on instruction tuning and RL across multiple models, demonstrating the strong effectiveness of RuDE.

## 2 Task Formulation

In this section, we formalize the task of evaluating post-training potential. We first define the problem as a ranking task over pre-trained models (§2.1). We then introduce discriminative evaluation as a proxy to address the generation gap (§2.2). Finally, we state our core hypothesis connecting discriminative ability to predicting post-training potential (§2.3).

### 2.1 Problem Definition

The performance of post-trained models is largely determined by their pre-training foundation, yet no reliable method exists to predict which pre-trained model will perform best after post-training. We formalize this as a ranking problem.

Consider a set of pre-trained base models $\Theta=\{\theta_1,\dots,\theta_n\}$ and a target post-training task $T$. Let $\texttt{PostTrain}(\theta,T)$ denote the model obtained by applying a post-training algorithm (*e.g.*, RL) to base model $\theta$ on task $T$, and let $S_{\text{gen}}(\texttt{PostTrain}(\theta,T))$ be the resulting generative performance score on open-ended evaluation.

Our goal is to find a *scoring function* $S(\theta,T)$, computable on the base model alone, such that the ranking induced by $S$ over $\Theta$ is consistent with the ranking induced by $S_{\text{gen}}$ after post-training:

$$S(\theta_i,T)>S(\theta_j,T)\;\Longrightarrow\;S_{\text{gen}}(\texttt{PostTrain}(\theta_i,T))>S_{\text{gen}}(\texttt{PostTrain}(\theta_j,T)) \quad (1)$$

A reliable scoring function $S$ would allow practitioners to select the most promising base model without expensive trial-and-error post-training.
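As a concrete reading of Eq. (1), the sketch below measures how often a candidate score $S$ agrees with the post-training ranking $S_{\text{gen}}$ across all model pairs; the model names and scores are hypothetical placeholders, not numbers from the paper.

```python
from itertools import combinations

def ranking_consistency(s_base: dict, s_post: dict) -> float:
    """Fraction of model pairs whose base-model ranking by S(theta, T)
    agrees with the ranking of post-trained generative scores S_gen."""
    pairs = list(combinations(s_base, 2))
    agree = sum(
        (s_base[a] > s_base[b]) == (s_post[a] > s_post[b])
        for a, b in pairs
    )
    return agree / len(pairs)

# Hypothetical scores for illustration only.
s_base = {"model-A": 0.71, "model-B": 0.64, "model-C": 0.58}
s_post = {"model-A": 54.2, "model-B": 49.1, "model-C": 50.3}
print(ranking_consistency(s_base, s_post))  # 2 of 3 pairs agree -> 0.667
```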

As shown in Figure 1, existing proxies for base model quality, such as accuracy on multiple-choice benchmarks like MMLU, correlate weakly with post-training performance on open-ended tasks, because they measure knowledge recall rather than the model's capacity to produce aligned, high-quality responses. This motivates the search for a more predictive scoring function.

![Refer to caption](https://arxiv.org/html/2605.11978v1/x1.png)

Figure 1: Correlation Analysis: Traditional Benchmarks vs. Post-Training Performance. The X-axis represents the accuracy of base models on widely used knowledge benchmarks and the Y-axis denotes the corresponding models' post-training performance on HealthBench after instruction tuning. The Pearson correlation coefficients ($corr$) between them indicate a weak correlation, demonstrating that traditional benchmarks cannot predict domain-specific post-training potential.
### 2.2 Discriminative Evaluation as a Proxy

The first challenge identified in the introduction is the *generation gap*: pre-trained models, optimized for next-token prediction, generally cannot follow instructions or produce well-structured open-ended responses (Zhou et al., 2023a). This makes it impractical to directly evaluate their generative quality on post-training tasks.

To address this, inspired by the *Generation-Evaluation Consistency* hypothesis (Liu et al., 2025b), we shift from evaluating what a model can *generate* to evaluating what it can *discriminate*. The key insight is that a model's ability to distinguish preferred responses from dispreferred ones reflects its internalized understanding of quality, even when it cannot yet produce such responses itself. From the perspective of Direct Preference Optimization (DPO) (Rafailov et al., 2023), the base model serves as the initial reference policy, and post-training adjusts its distribution toward preferred responses. A base model that already assigns higher likelihood to preferred responses faces a smaller distributional shift during optimization, implying higher post-training potential.

We operationalize this by measuring whether the base model consistently ranks a preferred response $y^{+}$ above a dispreferred response $y^{-}$ in terms of conditional log-likelihood. Let $\ell_{\theta}(y|x)$ denote the log-likelihood assigned by model $\theta$ to response $y$ given prompt $x$. Given a set of contrastive pairs $\{(x,y^{+},y^{-})\}$ constructed for task $T$, we define the discriminative score:

$$S_{\text{dis}}(\theta,T)=\mathbb{E}_{(x,y^{+},y^{-})}\left[\mathbb{I}\left(\ell_{\theta}(y^{+}|x)>\ell_{\theta}(y^{-}|x)\right)\right] \quad (2)$$

where $\mathbb{I}$ is the indicator function. $S_{\text{dis}}$ measures the fraction of contrastive pairs on which the base model correctly ranks the preferred response higher, reflecting how well it has already internalized an understanding of response quality prior to post-training.
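A minimal sketch of how $S_{\text{dis}}$ could be computed for an open-weights base model with Hugging Face Transformers; the checkpoint name is a placeholder, and the sketch assumes the prompt/response boundary falls on a token boundary.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def log_likelihood(model, tok, prompt: str, response: str) -> float:
    """ell_theta(y|x): total log-probability of the response tokens,
    conditioned on the prompt."""
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(prompt + response, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits                    # [1, L, vocab]
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)  # position i predicts token i+1
    targets = full_ids[0, prompt_len:]                     # response tokens only
    return log_probs[prompt_len - 1:].gather(1, targets[:, None]).sum().item()

def s_dis(model, tok, pairs) -> float:
    """Eq. (2): fraction of (x, y+, y-) triples where the base model
    ranks the preferred response higher."""
    correct = sum(
        log_likelihood(model, tok, x, yp) > log_likelihood(model, tok, x, yn)
        for x, yp, yn in pairs
    )
    return correct / len(pairs)

# Placeholder checkpoint; any causal base model works the same way.
tok = AutoTokenizer.from_pretrained("my-org/base-llm")
model = AutoModelForCausalLM.from_pretrained("my-org/base-llm").eval()
```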

### 2.3 The GD-Potential Hypothesis

Building on the above formulation, we propose the *GD-Potential Hypothesis* (Generative-Discriminative Potential): *the discriminative score of a pre-trained model is strongly positively correlated with its post-training generative performance.* Formally:

$$\rho\left(S_{\text{dis}}(\theta,T),\;S_{\text{gen}}(\texttt{PostTrain}(\theta,T))\right)\gg 0 \quad (3)$$

where $\rho$ denotes the Pearson correlation coefficient computed across a population of base models.

If this hypothesis holds, $S_{\text{dis}}$ serves as the desired scoring function $S$ defined in §2.1, enabling practitioners to predict post-training potential without actually performing post-training.

## 3 RuDE Framework

![Refer to caption](https://arxiv.org/html/2605.11978v1/x2.png)

Figure 2: Overview of the RuDE Construction Pipeline. The framework operates in three phases to synthesize high-quality contrastive pairs. Phase 1 generates a gold-standard response ($y^{+}$), Phase 2 produces a hard negative ($y^{-}$) through a controlled degradation process. Finally, Phase 3 pairs these samples and randomizes their positions (A/B) to form the discriminative evaluation task.

To empirically validate the *GD-Potential* hypothesis, we introduce RuDE (Rubric-based Discriminative Evaluation). It is constructed via a rigorous, verifier-guided iterative pipeline to ensure that every discriminative task is grounded in fine-grained alignment rubrics.

### 3.1 Iterative Rubric-Perturbed Construction

The core of our methodology is the synthesis of contrastive pairs $(y^{+},y^{-})$ that differ only in specific, controlled alignment dimensions. To achieve this precision and eliminate confounding factors, we implement a pipeline involving a strong generator ($\mathcal{G}$, Gemini-3-Pro (Comanici et al., 2025)) and a strict verifier ($\mathcal{V}$, GPT-4.1 (Singh et al., 2025)),¹ which proceeds in three phases as illustrated in Figure 2:

¹Following the evaluation protocol established in HealthBench (Arora et al., 2025), we employ GPT-4.1 as the verifier to assess rubric compliance. HealthBench has validated its high correlation with human expert judgment in rubric-based assessment.

#### Phase 1: Positive Sample Synthesis via Verifier-Guided Iterative Refinement.

The objective is to generate a response $y^{+}$ that perfectly satisfies all rubrics $\mathcal{R}_{all}$. Given a query $q$ and $\mathcal{R}_{all}$, $\mathcal{G}$ generates a response $y_t$, and $\mathcal{V}$ then evaluates $y_t$ against every rubric in $\mathcal{R}_{all}$,² iterating as follows (a minimal code sketch follows the list):

²$\mathcal{R}_{all}$ comprises two categories: *Positive Rubrics* (features that must be present) and *Negative Rubrics* (features that must be avoided). In our pipeline, "perfectly satisfies" is defined as the state where $\mathcal{V}$ predicts True for all Positive Rubrics and False for all Negative Rubrics.

- **Feedback Loop**: If any rubric $r\in\mathcal{R}_{fail}$ is violated, $\mathcal{V}$ generates structured feedback $\mathcal{E}$ explaining the reason for failure. The tuple $(q,\mathcal{R}_{all},\mathcal{R}_{fail},\mathcal{E},y_t)$ is fed back to $\mathcal{G}$ for correction.
- **Termination**: This loop continues iteratively until $\mathcal{V}$ confirms that all rubrics in $\mathcal{R}_{all}$ are satisfied, yielding the positive sample $y^{+}$.
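The sketch below illustrates this refinement loop, assuming `generate` and `verify` are hypothetical wrappers around API calls to $\mathcal{G}$ and $\mathcal{V}$; the paper does not specify the exact interfaces, and the iteration budget is a practical assumption (the paper's loop simply runs until acceptance).

```python
def refine_positive(q, rubrics, generate, verify, max_iters=8):
    """Phase 1 sketch: iterate the generator G (`generate`) and verifier V
    (`verify`) until a candidate satisfies every rubric in R_all. Both
    callables are hypothetical stand-ins for LLM calls."""
    y = generate(q, rubrics, failed=None, feedback=None, draft=None)
    for _ in range(max_iters):
        # verify() returns the violated rubrics R_fail and structured feedback E.
        failed, feedback = verify(q, rubrics, y)
        if not failed:
            return y  # all positive rubrics present, all negative ones avoided
        # Feed the tuple (q, R_all, R_fail, E, y_t) back to G for correction.
        y = generate(q, rubrics, failed=failed, feedback=feedback, draft=y)
    raise RuntimeError("Verifier did not accept a candidate within the budget")
```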

#### Phase 2: Negative Sample Synthesis via Controlled Degradation Loop.

The goal is to synthesize a "hard negative" $y^{-}$ that violates a specific subset of rubrics while maintaining high quality in all other aspects. We define a target violation subset $\mathcal{V}_{target}\subset\mathcal{R}_{all}$ and capture the token length of the positive sample, $Len(y^{+})$. The loop proceeds as follows (see the sketch after this list):

- **Length Constraint**: To mitigate length bias, we explicitly constrain $\mathcal{G}$ to generate $y^{-}$ such that its token count mirrors the positive sample: $Len(y^{-})\approx Len(y^{+})$.
- **Dual-Verification**: $\mathcal{V}$ performs a rigorous dual check on the candidate $y_t$: (1) Are the rubrics in $\mathcal{V}_{target}$ explicitly violated? (2) Are all remaining rubrics $\mathcal{R}_{preserved}=\mathcal{R}_{all}\setminus\mathcal{V}_{target}$ strictly satisfied?
- **Refinement Loop**: If $y_t$ fails either condition, $\mathcal{V}$ generates a structured error analysis $\mathcal{E}$ for any rubric $r\in\mathcal{R}_{fail}$ (e.g., detailing a failure to violate a target or an accidental violation of a preserved constraint). The optimization tuple $(q,\mathcal{R}_{all},\mathcal{V}_{target},\mathcal{R}_{fail},\mathcal{E},y_t)$ is fed back to $\mathcal{G}$. This ensures the generator iteratively optimizes the output until it precisely matches the specific degradation requirements.
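A comparable sketch of the degradation loop, under the same assumptions as the Phase 1 sketch; whitespace word count is used as a crude stand-in for token length, and `verify_one(q, rubric, y) -> bool` is a hypothetical per-rubric verifier call.

```python
def degrade_negative(q, rubrics, targets, y_pos, generate, verify_one,
                     max_iters=8, len_tol=0.10):
    """Phase 2 sketch: synthesize a hard negative that violates exactly the
    rubrics in `targets` while strictly satisfying the preserved rest and
    roughly matching Len(y+). Word count approximates token length here."""
    preserved = [r for r in rubrics if r not in targets]
    target_len = len(y_pos.split())
    y = generate(q, rubrics, violate=targets, length=target_len, draft=None)
    for _ in range(max_iters):
        failed = [r for r in targets if verify_one(q, r, y)]         # not yet violated
        failed += [r for r in preserved if not verify_one(q, r, y)]  # accidentally broken
        length_ok = abs(len(y.split()) - target_len) <= len_tol * target_len
        if not failed and length_ok:
            return y
        # Feed (q, R_all, V_target, R_fail, E, y_t) back to G; the structured
        # feedback E is folded into `failed` in this simplified sketch.
        y = generate(q, rubrics, violate=targets, length=target_len,
                     failed=failed, draft=y)
    raise RuntimeError("Degradation loop did not converge within the budget")
```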

Table 1: The 4C Taxonomy: A fine-grained breakdown of model post-training potential.
#### Phase 3: Contrastive Formulation.

Finally, $y^{+}$ and $y^{-}$ are paired to form a binary discriminative task. To eliminate position bias during evaluation (Zheng et al., 2023), the positive sample is randomly assigned to position A or B. The task for the base model is to identify the superior response based on conditional probability: $P(y^{+}|q)>P(y^{-}|q)$.

### 3.2 The 4C Taxonomy: Decoupling Capabilities

To enable fine-grained diagnosis of model potential (Liang et al., 2023; Ye et al., 2024), we categorize all rubrics into the 4C Taxonomy. It disentangles the complex spectrum of post-training potential into four distinct capabilities: (1) **Competence** assesses the model's intrinsic intelligence and reliability, which reflects knowledge density acquired during pre-training (Hendrycks et al., 2021; Cobbe et al., 2021). (2) **Content** evaluates the linguistic quality of the generation, ensuring it meets the standards of human communication (Brown et al., 2020). (3) **Control** measures the model's ability to adhere to rigid, non-semantic constraints and reflects the model's instruction-following plasticity (Zhou et al., 2023b). (4) **Compliance** evaluates the model's alignment with human intent, safety boundaries, and social norms; high scores here indicate a model that is "safe and helpful" (Ouyang et al., 2022; Bai et al., 2022; Zhang et al., 2024). See Table 1 for details.

## 4 Experiments

### 4.1 Dataset

#### Data Sources.

To ensure comprehensive coverage, we curate data from four diverse domains: (1) HealthBench (Arora et al., 2025): centered on medical advisory tasks, it evaluates adherence to safety-critical constraints.³ (2) PRBench (Akyürek et al., 2025): derived from legal and financial professional reports, it probes logical consistency and terminology adherence across extensive documents. (3) AdvancedIF (He et al., 2025): designed to stress-test complex instruction following, this dataset features nested formatting and logic constraints.⁴ (4) WritingBench (Wu et al., 2025): targeting open-ended storytelling, it assesses style transfer, rhetorical usage, and persona consistency.

³We specifically select the 'Hard' subset of HealthBench for construction. Statistics reveal that the 'Consensus' subset averages only 2.19 rubrics per entry, which is insufficient for hard-negative sampling. In contrast, the 'Hard' subset provides a dense constraint landscape with an average of 11.85 rubrics per entry, enabling the synthesis of fine-grained contrastive pairs.
⁴We exclude the 'System Prompt Modification' subset of AdvancedIF. Since pre-trained models typically lack a system role, this subset is not suited for our framework.

#### Statistics.

Table 2: Statistics of the evaluation datasets. The benchmark covers four complex open-ended domains with varying difficulty levels controlled by the number of rubric violations ($\lvert\mathcal{V}\rvert$).

| Domain | Dataset | $\lvert\mathcal{V}\rvert$ | Samples | Avg. Tok. |
|---|---|---|---|---|
| Medical Interaction | HealthBench | 1 | 1,597 | 3,968 |
| | | 3 | 1,031 | 3,480 |
| | | 5 | 1,045 | 3,912 |
| Legal & Finance | PRBench | 1 | 3,848 | 20,194 |
| | | 3 | 4,043 | 10,776 |
| | | 5 | 4,542 | 11,724 |
| Complex Instruction | AdvancedIF | 1 | 1,930 | 6,163 |
| | | 3 | 1,349 | 8,111 |
| | | 5 | 1,559 | 4,833 |
| Creative Writing | WritingBench | 1 | 2,561 | 14,439 |
| | | 2 | 2,523 | 16,222 |
| | | 3 | 2,655 | 23,913 |
| Total / Avg. | | - | 28,683 | 11,566 |

The final evaluation dataset comprises 28,683 samples. We stratify difficulty by varying the number of rubric violations ($|\mathcal{V}|\in\{1,2,3\}$ for WritingBench, and $|\mathcal{V}|\in\{1,3,5\}$ for the others).⁵ Specifically, $|\mathcal{V}|=1$ represents the hardest scenario, requiring precise discrimination, while the maximum violation setting ($|\mathcal{V}|=3$ or $5$) represents a relatively easier task.

⁵Since the WritingBench dataset contains a lower density of constraints (5 rubrics per entry), generating negative samples with 5 violations would result in easy negative responses. Thus, we adjusted the violation number to $|\mathcal{V}|\in\{1,2,3\}$ for this domain.

#### Human Verification.

To validate the reliability of our pipeline, we randomly sample 100 instances for human review. We verify whether the positive sample satisfies all rubrics and whether the negative sample violates exactly the target rubrics. The result shows 92% agreement between the pipeline's labels and human judgment, confirming that RuDE is grounded in high-quality human preferences (see details in the Appendix).

### 4.2 Experimental Setup

#### Models Evaluated.

We evaluate a diverse suite of state-of-the-art base LLMs, covering a wide range of parameter scales (4B to 1T) and architectures (dense and MoE). The models evaluated include DeepSeek-V3.1 (Liu et al., 2025a), GLM-4.5/Air (Zeng et al., 2025a), Kimi-K2 (Team et al., 2025b), Seed-OSS-36B (Seed et al., 2025), the Qwen2.5 and Qwen3 series (Yang et al., 2025), and the Gemma3 series (Team et al., 2025a). Notably, unless otherwise specified, the models referred to in this work are base models.

#### Evaluation Metrics.

We primarily report accuracy ($Acc$), defined as the percentage of contrastive pairs where the model correctly assigns higher probability to the positive sample $y^{+}$ than the negative one $y^{-}$. To validate the *GD-Potential* hypothesis, we also calculate the Pearson correlation ($corr$ and $p$) (Benesty et al., 2009) between the base models' accuracy on RuDE and their instruction-tuned counterparts' performance on standard generative benchmarks.
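For concreteness, these correlation statistics can be computed with `scipy.stats.pearsonr`; the paired score lists below are hypothetical placeholders, not numbers from the paper.

```python
from scipy.stats import pearsonr

# Hypothetical paired scores: each base model's RuDE accuracy (x) and its
# instruction-tuned counterpart's score on a generative benchmark (y).
rude_acc = [78.8, 71.2, 69.5, 64.0, 58.3, 55.1]
post_gen = [62.4, 55.9, 54.1, 48.7, 44.2, 41.0]

corr, p = pearsonr(rude_acc, post_gen)
print(f"corr = {corr:.2f}, p = {p:.3g}")
```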

#### Implementation Details.

We adopt a 2-shot setting with balanced exemplars (one targeting option 'A' and the other 'B') to mitigate potential in-context position bias. For all experiments, we uniformly set the maximum output length to 8192 and the temperature to 0 to ensure determinism.

![Refer to caption](https://arxiv.org/html/2605.11978v1/x3.png)

Figure 3: Correlation Analysis: Pre-training vs. Post-training Performance. We illustrate the relationship between base model scores on RuDE (X-axis) and the corresponding instruction-tuned models' performance on standard open-ended benchmarks (Y-axis). The high Pearson correlation coefficients ($r>0.90$) demonstrate that our discriminative evaluation serves as a highly reliable predictor of downstream post-training potential, significantly outperforming traditional benchmarks like MMLU.

### 4.3 Main Results

Table 3: Model Performance on RuDE. We report the accuracy (%) at $|\mathcal{V}|=3$ for WritingBench and $|\mathcal{V}|=5$ for the others. The column headers denote: AD (AdvancedIF), HE (HealthBench), WR (WritingBench), and PR (PRBench). Bold indicates the best.
![Refer to caption](https://arxiv.org/html/2605.11978v1/x4.png)

Figure 4: Sensitivity Analysis: Impact of Violation Number. We analyze how model discriminative accuracy varies as the distinction between positive and negative samples becomes more pronounced. The X-axis represents the number of rubric violations in the negative sample, ranging from Min ($|\mathcal{V}|=1$) to Max ($|\mathcal{V}|=3$ or $5$). As the number of violated rubrics increases, the discriminative task becomes easier, resulting in higher accuracy.

#### Validation of GD-Potential Hypothesis.

In Figure 3, we observe a statistically significant positive Pearson correlation between the base models' discriminative scores on RuDE and the generative scores of their instruction-tuned versions on the corresponding standard benchmarks. For instance, on AdvancedIF, the correlation coefficient reaches as high as $corr=0.91$ ($p<0.001$). High-performing models like DeepSeek-V3.1 and GLM-4.5 consistently occupy the top-right quadrant. These findings empirically validate the *GD-Potential* hypothesis: *a base model's discriminative capability is a reliable proxy for its post-training generative potential.* RuDE thus serves as a cost-effective predictor for selecting base models for expensive alignment training, allowing researchers to estimate downstream performance without conducting full post-training.

#### Overall Performance and Scaling Laws.

As Table 3 shows, DeepSeek-V3.1 leads with an average accuracy of 78.8%, demonstrating the strongest post-training potential among all models. We observe a consistent scaling trend where larger models exhibit stronger discriminative capabilities (Kaplan et al., 2020; Wei et al., 2022). For instance, within the Gemma3 series, performance improves monotonically from 4B to 27B. This trend confirms that *GD-Potential* scales with model parameters, suggesting that scaling effectively enhances a base model's latent representation of human preferences even prior to instruction tuning.

#### Domain-Specific Specialization.

While DeepSeek-V3.1 dominates on AdvancedIF and HealthBench, other models show distinct strengths in specific domains. On WritingBench, Kimi-K2 achieves the best performance, surpassing DeepSeek-V3.1, which indicates Kimi-K2's superior potential in handling creative writing constraints. Moreover, GLM-4.5 demonstrates exceptional capability in legal and financial scenarios on PRBench. This suggests that while general capabilities scale with size, specific post-training potentials (e.g., creative writing and professional adherence) may depend on the model's pre-training data distribution and architectural biases (Ye et al., 2024).

#### Impact of Violation Number.

Figure 4 validates the efficacy of our difficulty stratification mechanism by varying $|\mathcal{V}|$. As the number of injected violations decreases, the negative samples become increasingly indistinguishable from the positive ones, thereby elevating the discriminative difficulty. Consequently, we observe a sharp performance decline across all models; for instance, on AdvancedIF, DeepSeek-V3.1's accuracy drops significantly from 93.1% ($|\mathcal{V}|=5$) to 70.1% ($|\mathcal{V}|=1$). This sensitivity confirms that RuDE avoids saturation and effectively probes capabilities at varying levels of granularity.

## 5 Analysis

### 5.1 Fine-Grained Diagnostics

Table 4: Dimension-level performance under the 4C Taxonomy. We report accuracy (%) for Compliance, Competence, Content, and Control, together with the average score and the standard deviation across the four dimensions ($\sigma_{\text{4C}}$).

#### 4C Dimension Profiling.

The 4C Taxonomy provides per-dimension profiling beyond a single aggregate score. We report dimension-level accuracy in Table 4 and summarize three observations.

#### Global Patterns.

Compliance is the strongest dimension for most models, averaging 52.0% globally versus Competence (49.9%), Control (49.8%), and Content (49.3%). DeepSeek-V3.1, the top model overall (AVG 61.4%), also maintains the smallest imbalance among top-tier models ($\sigma_{\text{4C}}=1.5$). By contrast, GLM-4.5-Air shows $\sigma_{\text{4C}}=3.8$ with an 8.3-point gap between Control (52.6) and Content (44.3), and Gemma3-27B shows $\sigma_{\text{4C}}=3.8$ with Compliance (52.2) exceeding Content (43.1) by 9.1 points.

#### Control as Bottleneck for Small Models.

For sub-7B models, Control is consistently the weakest dimension: Qwen3-4B scores 39.3% and Qwen2.5-7B scores 34.7%, both ranking Control last. Qwen3-4B outperforms the larger Qwen2.5-7B on every dimension, indicating that generational improvements in pre-training can raise alignment readiness beyond parameter scaling alone.

#### Domain-Specific Specialization.

On WritingBench, Kimi-K2 demonstrates comprehensive mastery across all dimensions (65.0, 60.5, 61.9, 63.3), while GLM-4.5 dominates on PRBench (60.8, 61.8, 60.8, 56.4). Qwen3-235B-A22B shows a 28.1-point gap between Content (74.6) and Competence (46.5), indicating fluency without factual grounding in professional domains.

### 5.2 Practical Verification via Controlled RL Training

To move beyond static evaluation and validate the predictive power of our framework in a dynamic setting, we conduct a controlled RL experiment\. Our hypothesis is that base models with higher discriminative scores should demonstrate superior data efficiency and peak performance when fine\-tuned with rubric\-derived signals\.

We select three representative models, Qwen2.5-7B, Qwen2.5-14B, and Qwen3-4B, and perform rubric-based RL training on HealthBench following Zhou et al. (2025). Figure 5 illustrates the performance trajectory on HealthBench over 200 training steps. The results provide strong empirical validation of our pre-training assessment:

![Refer to caption](https://arxiv.org/html/2605.11978v1/x5.png)

Figure 5: Verification of Predictive Utility for Post-Training. We compare RL trajectories for Qwen2.5-7B, Qwen2.5-14B, and Qwen3-4B.

#### Consistency of Ranking.

The post-training performance hierarchy broadly follows our base model evaluation. Qwen2.5-14B achieves the highest final HealthBench score, while the smaller Qwen3-4B still outperforms the larger Qwen2.5-7B, consistent with their discriminative scores on RuDE.

#### Efficiency.

Qwen3-4B exhibits a stronger learning curve than Qwen2.5-7B despite its smaller size, effectively leveraging the rubric reward signals to correct its behavior rapidly.

#### Predictive Validity.

This experiment confirms that our RuDE score is a faithful predictor of post-training performance. A high score in our framework implies that the model's internal representations are already well structured to grasp complex constraints (Zhou et al., 2023a), minimizing the "alignment tax" during post-training (Ouyang et al., 2022).

## 6 Related Work

#### Open-ended Generation Evaluation.

The evaluation of open-ended tasks is transitioning from subjective, single-score win rates (e.g., AlpacaEval (Li et al., 2023), MT-Bench (Zheng et al., 2023)) toward verifiable, rubric-based assessments. Recent initiatives, such as HealthBench (Arora et al., 2025) and PRBench (Akyürek et al., 2025), exemplify this shift by employing rigorous, rubric-based evaluation within specific domains. We not only align with this fine-grained paradigm but also advance it by introducing a generalized rubric taxonomy, the 4C Taxonomy, to systematically categorize alignment requirements. This structured approach enables us to decouple entangled capabilities across diverse domains, moving beyond isolated constraint verification to a comprehensive diagnosis.

#### Base Model Evaluation.

Existing evaluations for base models predominantly focus on static knowledge benchmarks (e.g., MMLU (Hendrycks et al., 2021), C-Eval (Huang et al., 2023), SuperGPQA (Du et al., 2025)) or perplexity metrics (Zeng et al., 2025b). However, these metrics share a critical blind spot: the mismatch between possessing static knowledge and the dynamic requirements of post-training alignment. Our work addresses this gap by explicitly defining the task of predicting post-training potential at the pre-training stage.

#### Generation-Evaluation Consistency.

Research (Liu et al., 2025b) reveals a strong correlation between a model's evaluation and generation capabilities. While recent works like APTBench (Qin et al., 2025) have begun to explore the agentic abilities of pre-trained models, they remain focused on specific task domains. We pivot this consistency insight to a broader scope, hypothesizing that a base model's innate discriminative precision, i.e., its ability to distinguish rubric-compliant responses from violations, serves as a generalizable and cost-effective proxy for predicting its post-training potential across diverse open-ended scenarios.

#### Rubric-Based Alignment and Evaluation.

Recent work has adopted rubrics for LLM alignment: RLCF (Viswanathan et al., 2025) and QA-LIGN (Dineen et al., 2025) use rubric-derived feedback to generate preference pairs or reward signals for DPO/GRPO training, while InFoBench (Qin et al., 2024) and RaR (Gunjal et al., 2025) apply rubrics for fine-grained evaluation and RL reward modeling, respectively. These works leverage rubrics to improve the post-training process itself. In contrast, RuDE applies rubric-based discrimination to base models before any post-training, addressing the complementary problem of predicting post-training potential rather than performing alignment.

## 7 Conclusion

We formally defined the task of predicting *post-training potential* to forecast open-ended performance prior to post-training. We proposed RuDE, a discriminative framework guided by our 4C Taxonomy. Experiments demonstrate a $>90\%$ correlation with post-training outcomes. Furthermore, RL validation confirms that RuDE successfully identifies high-potential smaller models, offering a compute-efficient mechanism for foundation model selection.

## References

- A. F. Akyürek, A. Gosai, C. B. C. Zhang, V. Gupta, J. Jeong, A. Gunjal, T. Rabbani, M. Mazzone, D. Randolph, M. M. Meymand, et al. (2025). PRBench: large-scale expert rubrics for evaluating high-stakes professional reasoning. arXiv preprint arXiv:2511.11562.
- R. K. Arora, J. Wei, R. S. Hicks, P. Bowman, J. Quiñonero-Candela, F. Tsimpourlas, M. Sharman, M. Shah, A. Vallone, A. Beutel, et al. (2025). HealthBench: evaluating large language models towards improved human health. arXiv preprint arXiv:2505.08775.
- Y. Bai, A. Jones, K. Ndousse, A. Askell, A. Chen, N. DasSarma, D. Drain, S. Fort, D. Ganguli, T. Henighan, et al. (2022). Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862.
- J. Benesty, J. Chen, Y. Huang, and I. Cohen (2009). Pearson correlation coefficient. In Noise Reduction in Speech Processing, pp. 1–4.
- T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems 33, pp. 1877–1901.
- K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, et al. (2021). Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.
- G. Comanici, E. Bieber, M. Schaekermann, I. Pasupat, N. Sachdeva, I. Dhillon, M. Blistein, O. Ram, D. Zhang, E. Rosen, et al. (2025). Gemini 2.5: pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261.
- J. Dineen, A. Rrv, Q. Liu, Z. Xu, X. Ye, M. Shen, Z. Li, S. Lu, C. Baral, M. Chen, and B. Zhou (2025). QA-LIGN: aligning LLMs through constitutionally decomposed QA. In Findings of the Association for Computational Linguistics: EMNLP 2025, Suzhou, China, pp. 20619–20642.
- X. Du, Y. Yao, K. Ma, B. Wang, T. Zheng, K. Zhu, M. Liu, Y. Liang, X. Jin, Z. Wei, et al. (2025). SuperGPQA: scaling LLM evaluation across 285 graduate disciplines. arXiv preprint arXiv:2502.14739.
- S. Ghosh, C. K. R. Evuru, S. Kumar, R. S, D. Aneja, Z. Jin, R. Duraiswami, and D. Manocha (2024). A closer look at the limitations of instruction tuning. In Proceedings of the 41st International Conference on Machine Learning, pp. 15559–15589.
- A. Gunjal, A. Wang, E. Lau, V. Nath, Y. He, B. Liu, and S. M. Hendryx (2025). Rubrics as rewards: reinforcement learning beyond verifiable domains. In NeurIPS 2025 Workshop on Efficient Reasoning.
- Y. He, W. Li, H. Zhang, S. Li, K. Mandyam, S. Khosla, Y. Xiong, N. Wang, X. Peng, B. Li, et al. (2025). AdvancedIF: rubric-based benchmarking and reinforcement learning for advancing LLM instruction following. arXiv preprint arXiv:2511.10507.
- D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt (2021). Measuring massive multitask language understanding. In International Conference on Learning Representations.
- Y. Huang, Y. Bai, Z. Zhu, J. Zhang, J. Zhang, T. Su, J. Liu, C. Lv, Y. Zhang, Y. Fu, et al. (2023). C-Eval: a multi-level multi-discipline Chinese evaluation suite for foundation models. Advances in Neural Information Processing Systems 36, pp. 62991–63010.
- J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei (2020). Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.
- N. Lambert, V. Pyatkin, J. Morrison, L. J. V. Miranda, B. Y. Lin, K. Chandu, N. Dziri, S. Kumar, T. Zick, Y. Choi, et al. (2025). RewardBench: evaluating reward models for language modeling. In Findings of the Association for Computational Linguistics: NAACL 2025, pp. 1755–1797.
- X. Li, T. Zhang, Y. Dubois, R. Taori, I. Gulrajani, C. Guestrin, P. Liang, and T. B. Hashimoto (2023). AlpacaEval: an automatic evaluator of instruction-following models.
- P. Liang, R. Bommasani, T. Lee, D. Tsipras, D. Soylu, M. Yasunaga, Y. Zhang, D. Narayanan, Y. Wu, A. Kumar, et al. (2023). Holistic evaluation of language models. Transactions on Machine Learning Research.
- A. Liu, A. Mei, B. Lin, B. Xue, B. Wang, B. Xu, B. Wu, B. Zhang, C. Lin, C. Dong, et al. (2025a). DeepSeek-V3.2: pushing the frontier of open large language models. arXiv preprint arXiv:2512.02556.
- Y. Liu, P. Liu, and A. Cohan (2025b). On evaluating LLM alignment by evaluating LLMs as judges. In The Thirty-ninth Annual Conference on Neural Information Processing Systems.
- L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al. (2022). Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems 35, pp. 27730–27744.
- J. Qin, Y. Xi, J. Huang, R. Rui, D. Yin, W. Liu, Y. Yu, W. Zhang, and X. Sun (2025). APTBench: benchmarking agentic potential of base LLMs during pre-training. arXiv preprint arXiv:2510.24397.
- Y. Qin, K. Song, Y. Hu, W. Yao, S. Cho, X. Wang, X. Wu, F. Liu, P. Liu, and D. Yu (2024). InFoBench: evaluating instruction following ability in large language models. In Findings of the Association for Computational Linguistics: ACL 2024, pp. 13025–13048.
- A. Radford, K. Narasimhan, T. Salimans, I. Sutskever, et al. (2018). Improving language understanding by generative pre-training.
- R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn (2023). Direct preference optimization: your language model is secretly a reward model. Advances in Neural Information Processing Systems 36, pp. 53728–53741.
- C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu (2020). Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21(140), pp. 1–67.
- B. Seed, J. Chen, T. Fan, X. Liu, L. Liu, Z. Lin, M. Wang, C. Wang, X. Wei, W. Xu, et al. (2025). Seed1.5-Thinking: advancing superb reasoning models with reinforcement learning. arXiv preprint arXiv:2504.13914.
- Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. Li, Y. Wu, et al. (2024). DeepSeekMath: pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.
- A. Singh, A. Fry, A. Perelman, A. Tart, A. Ganesh, A. El-Kishky, A. McLaughlin, A. Low, A. Ostrow, A. Ananthram, et al. (2025). OpenAI GPT-5 system card. arXiv preprint arXiv:2601.03267.
- G. Team, A. Kamath, J. Ferret, S. Pathak, N. Vieillard, R. Merhej, S. Perrin, T. Matejovicova, A. Ramé, M. Rivière, et al. (2025a). Gemma 3 technical report. arXiv preprint arXiv:2503.19786.
- K. Team, Y. Bai, Y. Bao, G. Chen, J. Chen, N. Chen, R. Chen, Y. Chen, Y. Chen, Y. Chen, et al. (2025b). Kimi K2: open agentic intelligence. arXiv preprint arXiv:2507.20534.
- V. Viswanathan, Y. Sun, S. Ma, X. Kong, M. Cao, G. Neubig, and T. Wu (2025). Checklists are better than reward models for aligning language models. arXiv preprint arXiv:2507.18624.
- Y. Wang, Y. Kordi, S. Mishra, A. Liu, N. A. Smith, D. Khashabi, and H. Hajishirzi (2023). Self-Instruct: aligning language models with self-generated instructions. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 13484–13508.
- J. Wei, Y. Tay, R. Bommasani, C. Raffel, B. Zoph, S. Borgeaud, D. Yogatama, M. Bosma, D. Zhou, D. Metzler, et al. (2022). Emergent abilities of large language models. Transactions on Machine Learning Research.
- Y. Wu, J. Mei, M. Yan, C. Li, S. Lai, Y. Ren, Z. Wang, J. Zhang, M. Wu, Q. Jin, et al. (2025). WritingBench: a comprehensive benchmark for generative writing. arXiv preprint arXiv:2503.05244.
- A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al. (2025). Qwen3 technical report. arXiv preprint arXiv:2505.09388.
- S. Ye, D. Kim, S. Kim, H. Hwang, S. Kim, Y. Jo, J. Thorne, J. Kim, and M. Seo (2024). FLASK: fine-grained language model evaluation based on alignment skill sets. In The Twelfth International Conference on Learning Representations.
- Y. Yue, Z. Chen, R. Lu, A. Zhao, Z. Wang, Y. Yue, S. Song, and G. Huang (2025). Does reinforcement learning really incentivize reasoning capacity in LLMs beyond the base model? In The Thirty-ninth Annual Conference on Neural Information Processing Systems.
- A. Zeng, X. Lv, Q. Zheng, Z. Hou, B. Chen, C. Xie, C. Wang, D. Yin, H. Zeng, J. Zhang, et al. (2025a). GLM-4.5: agentic, reasoning, and coding (ARC) foundation models. arXiv preprint arXiv:2508.06471.
- H. Zeng, K. Hui, H. Zhuang, Z. Qin, Z. Yue, H. Zamani, and D. Alon (2025b). Can pre-training indicators reliably predict fine-tuning outcomes of LLMs? arXiv preprint arXiv:2504.12491.
- Z. Zhang, L. Lei, L. Wu, R. Sun, Y. Huang, C. Long, X. Liu, X. Lei, J. Tang, and M. Huang (2024). SafetyBench: evaluating the safety of large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 15537–15553.
- L. Zheng, W. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. Xing, et al. (2023). Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems 36, pp. 46595–46623.
- C. Zhou, P. Liu, P. Xu, S. Iyer, J. Sun, Y. Mao, X. Ma, A. Efrat, P. Yu, L. Yu, et al. (2023a). LIMA: less is more for alignment. Advances in Neural Information Processing Systems 36, pp. 55006–55021.
- J. Zhou, T. Lu, S. Mishra, S. Brahma, S. Basu, Y. Luan, D. Zhou, and L. Hou (2023b). Instruction-following evaluation for large language models. arXiv preprint arXiv:2311.07911.
- Y. Zhou, S. Li, S. Liu, W. Fang, K. Zhang, J. Zhao, J. Yang, Y. Zhou, J. Lv, T. Zheng, et al. (2025). Breaking the exploration bottleneck: rubric-scaffolded reinforcement learning for general LLM reasoning. arXiv preprint arXiv:2508.16949.
- L. Zhu, X. Wang, and X. Wang (2025). JudgeLM: fine-tuned large language models are scalable judges. In The Thirteenth International Conference on Learning Representations.

## Appendix A Ablation Study: Necessity of Controlled Degradation

To justify the design of the *Controlled Degradation* pipeline in RuDE, we compare it against two intuitive but ultimately inadequate alternatives for constructing contrastive pairs. The results demonstrate that both extremes, natural sampling (too hard) and naive rewriting (too easy), fail to produce a benchmark with meaningful discriminative power, thereby validating the necessity of our approach.

### A.1 Ablation A: Rejection Sampling from Natural Generations

We first test a straightforward baseline: sample 32 responses from a strong model given the original prompt, score each response against the full rubric set, and select the highest- and lowest-scoring responses as the positive and negative options, respectively. To control for generator-specific distributional bias, we conduct this experiment using two distinct generators, GPT-5 and Qwen3-235B-A30B-Thinking-2507, on the HealthBench domain.

Table 5: Ablation A: Rejection Sampling Construction. Accuracy (%) of discriminating between the best and worst naturally sampled responses. All evaluators struggle, indicating the task is excessively difficult.

#### Results.

As shown in Table 5, even the generator model itself achieves only 27.1% (GPT-5) or 27.8% (Qwen3-235B-Thinking) accuracy when discriminating its own best vs. worst outputs. Base models perform even worse (e.g., Qwen3-30B-A3B-Base at 7.0–12.4%), rendering the benchmark entirely ineffective at eliciting latent discriminative ability.

#### Analysis \(Task Too Hard with Confounding Factors\)\.

Since responses are drawn from the model’s natural generation distribution, the positive and negative samples typically exhibit substantial surface\-level confounds—differences in length, sentence structure, and lexical distribution—while the actual rubric violations are extremely subtle and deeply embedded\. This makes the discrimination task prohibitively difficult and poorly calibrated for measuring alignment potential\.

### A.2 Ablation B: Locate-and-Rewrite

To reduce difficulty and eliminate length-based confounds, we test a second naive approach: given a positive response, instruct a model (Qwen3-235B-A22B-Instruct-2507) to first *locate* the text span relevant to a target rubric, then directly *rewrite* that span to introduce a violation.
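A minimal sketch of this two-step procedure follows; `llm` is a hypothetical text-completion helper, and the instructions are paraphrases rather than the paper's exact prompts.

```python
# A minimal sketch of the locate-and-rewrite baseline (Ablation B).
# Assumes the model quotes the located span verbatim; a real pipeline would
# need fuzzy matching as a fallback when the quote is inexact.

def build_negative_by_rewrite(y_pos: str, target_rubric: str, llm) -> str:
    # Step 1: locate the span in the positive response relevant to the rubric.
    span = llm(
        f"Response:\n{y_pos}\n\nQuote verbatim the sentence(s) most relevant "
        f"to this criterion:\n{target_rubric}"
    )
    # Step 2: rewrite only that span so it violates the target rubric.
    violating_span = llm(
        f"Rewrite the following text so that it violates the criterion "
        f"'{target_rubric}', changing as little else as possible:\n{span}"
    )
    # Splice the rewritten span back in, leaving the rest of the response intact.
    return y_pos.replace(span, violating_span)
```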

Table 6: Ablation B: Locate-and-Rewrite Construction. Accuracy (%) on discriminating rewritten negatives. Even base models achieve high accuracy, indicating the task is trivially easy and lacks discriminative power.

#### Results.

As shown in Table [6](https://arxiv.org/html/2605.11978#A1.T6), base models of vastly different scales—Qwen3-30B-A3B-Base (79.2%) and Qwen3-235B-A22B-Base (80.6%)—achieve nearly identical accuracy, separated by a mere 1.4%. This negligible gap completely fails to differentiate the post-training potential of two models with an order-of-magnitude difference in active parameters.

#### Analysis \(Task Too Easy with Surface Artifacts\)\.

Simple locate\-and\-rewrite operations inevitably disrupt the global coherence of the surrounding context, leaving conspicuous “stitching artifacts”—abrupt tonal shifts, broken coreference chains, or logical non\-sequiturs at the edit boundary\. Models can trivially exploit these surface\-level discontinuities to identify the negative sample without genuinely understanding the rubric semantics, rendering the benchmark incapable of distinguishing true alignment potential\.

### A.3 Conclusion: Necessity of the RuDE Framework

The ablation studies above clearly demonstrate that both extremes are inadequate:

- **Natural Sampling** produces pairs that are *too difficult*—confounded by distributional noise, with violations too subtle for even frontier models to detect.
- **Naive Rewriting** produces pairs that are *too easy*—trivially solvable via artifact detection, collapsing all models into a narrow accuracy band.

RuDE’s Controlled Degradation pipeline occupies the critical sweet spot between these extremes. Through its iterative Generator-Verifier loop, it ensures that negative samples: (1) maintain identical length, tone, and global coherence with the positive sample; (2) contain no detectable stitching artifacts; and (3) precisely violate only the targeted rubric constraints. This principled construction yields a benchmark with the appropriate difficulty calibration to meaningfully separate models by their true post-training potential, achieving Pearson correlations exceeding 0.9 with actual post-training generative performance.
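The loop can be summarized at the pseudocode level as follows; `adversarial_generate`, `verify`, and `refine` are hypothetical stand-ins for the prompt templates described in Appendix F, and the convergence test is our reading of conditions (1)–(3).

```python
# A pseudocode-level sketch of the iterative Generator-Verifier loop.
# `verify` is assumed to return a {rubric_id: bool} report of whether each
# criterion is met; names and signatures are illustrative.

def controlled_degradation(y_pos, rubrics, targets, adversarial_generate,
                           verify, refine, max_iters=5):
    """Produce a hard negative that violates exactly the target rubrics."""
    y_neg = adversarial_generate(y_pos, targets)
    for _ in range(max_iters):
        report = verify(y_neg, rubrics)                # rubric_id -> criterion met?
        violated = {r for r, met in report.items() if not met}
        if violated == set(targets):                   # only the intended violations
            return y_neg
        # Refine: fix unintended violations, re-introduce missing target ones.
        y_neg = refine(y_neg, intended=targets, report=report)
    raise RuntimeError("No valid hard negative within the iteration budget")
```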

## Appendix B Limitations

While RuDE introduces a novel and efficient paradigm for evaluating alignment potential, we acknowledge several limitations that merit future investigation:

#### Discriminative\-Generative Gap\.

Our core premise relies on the *GD-Potential* hypothesis: that the ability to discriminate quality predicts the ability to generate it. While our experiments show a strong correlation ($r > 0.9$ in general tasks), this relationship may not hold linearly in all scenarios. It is possible for a “theoretician” model to possess high discriminative intelligence (recognizing the correct response) yet lack the generative motor control to execute it fluently, particularly in highly creative or open-ended tasks. Thus, our metric should be viewed as a *necessary* rather than *sufficient* condition for alignment.

#### Dependence on Generator Quality\.

The validity of our contrastive pairs hinges on the capabilities of the “Golden Reference Generator” (e.g., Gemini-3-Pro). Consequently, RuDE is inherently bounded by the teacher model’s knowledge ceiling.

#### Static vs\. Dynamic Alignment\.

Our framework evaluates the model’s static readiness prior to fine-tuning. However, alignment is a dynamic process influenced by hyper-parameters, data mix, and optimization algorithms (e.g., PPO vs. DPO). While RuDE predicts trainability, it cannot account for training dynamics such as catastrophic forgetting or reward hacking that might occur during the actual post-training phase.

## Appendix C Broader Impact

This paper presents RuDE, a framework designed to predict the post-training potential of base models. Our work has several broader impacts on the community:

**Environmental Impact:** A primary motivation of this work is to reduce the computational cost associated with model development. By enabling a “predict-then-select” paradigm, researchers can identify high-potential base models without conducting expensive, full-scale fine-tuning trials. This significantly lowers the energy consumption and carbon footprint of alignment research.

**Advancing LLM Alignment:** By providing a granular diagnosis of alignment readiness (e.g., distinguishing between Competence and Compliance), our tool helps developers identify base models that are inherently more amenable to alignment. This facilitates the development of more controllable AI systems.

**Democratizing LLM Research:** The high computational barrier of full-cycle alignment often restricts base model evaluation to well-resourced industrial laboratories. By offering a cost-effective, inference-only predictor, RuDE lowers the barrier to entry, enabling academic researchers and smaller organizations to audit and select base models effectively, fostering a more inclusive research ecosystem.

**Responsible Data Usage:** The datasets introduced in RuDE are intended only for the evaluation and diagnostic profiling of LLMs and should not be used for other purposes. To maintain the integrity of the datasets and prevent data contamination, these resources must not be included in the training corpora of any model. Any use of this data for model training, fine-tuning, or other non-evaluative purposes is strictly prohibited. We urge the community to adhere to these guidelines to ensure fair and consistent comparisons across future research.

## Appendix D Detailed Related Work

Our work is situated at the intersection of instruction-following evaluation, the predictive assessment of pre-trained models, and the duality of generation and discrimination. In this section, we review the evolution of these fields and highlight the specific gaps that RuDE addresses.

### D.1 Evolution of Alignment Evaluation

#### From Subjective Scoring to Verifiable Assessment\.

Early evaluation relied on static benchmarks like MMLU [Hendrycks et al., [2021](https://arxiv.org/html/2605.11978#bib.bib3)] and GSM8K [Cobbe et al., [2021](https://arxiv.org/html/2605.11978#bib.bib4)] to measure world knowledge and reasoning. However, these metrics fail to capture the alignment nuances—such as tone, formatting, and safety—required for interaction. The field subsequently adopted open-ended generation benchmarks (e.g., AlpacaEval [Li et al., [2023](https://arxiv.org/html/2605.11978#bib.bib42)], MT-Bench [Zheng et al., [2023](https://arxiv.org/html/2605.11978#bib.bib15)]), utilizing strong LLMs as judges to approximate human preference. Despite their popularity, these “LLM-as-a-Judge” methods suffer from inherent biases (e.g., verbosity bias) and lack diagnostic interpretability. Recognizing this, recent initiatives have pivoted toward verifiable, rubric-based evaluation. IFEval [Zhou et al., [2023b](https://arxiv.org/html/2605.11978#bib.bib29)] introduced objective constraint checking, while HealthBench [Arora et al., [2025](https://arxiv.org/html/2605.11978#bib.bib19)] and PRBench [Akyürek et al., [2025](https://arxiv.org/html/2605.11978#bib.bib20)] extended this rigorous paradigm to domain-specific contexts. RuDE advances this trend by proposing a generalized rubric taxonomy (the 4C Taxonomy). Unlike prior works limited to specific domains or superficial formatting, our framework systematically decouples entangled capabilities, offering a comprehensive diagnosis of alignment potential.

### D.2 Pre-training Evaluation and Predictability

#### Bridging the Mismatch with Post\-training Potential\.

Standard base model evaluation predominantly focuses on static knowledge retention (e.g., C-Eval [Huang et al., [2023](https://arxiv.org/html/2605.11978#bib.bib47)], SuperGPQA [Du et al., [2025](https://arxiv.org/html/2605.11978#bib.bib48)]) or perplexity minimization [Zeng et al., [2025b](https://arxiv.org/html/2605.11978#bib.bib2)]. However, a critical blind spot exists: the mismatch between possessing static knowledge and the dynamic plasticity required for post-training alignment. A knowledgeable model may still lack the post-training potential to adapt to complex user intents. RuDE fills this gap by formally defining the task of predicting “post-training potential”. Unlike traditional metrics that measure what a model knows, our discriminative framework measures what a model can learn to do, providing a compute-efficient selection mechanism for foundation model development.

#### Contrastive Evaluation via Hard Negatives\.

Methodologically, our work relates to synthetic data generation (e.g., Self-Instruct [Wang et al., [2023](https://arxiv.org/html/2605.11978#bib.bib35)]). However, while existing methods focus on generating positive training data, RuDE innovates through controlled degradation. We synthesize high-quality hard negative samples that specifically violate targeted rubrics while preserving other qualities. This contrastive construction transforms passive evaluation into an active stress test, rigorously probing the model’s granular understanding of constraints.

### D.3 The Duality of Generation and Evaluation

#### Consistency as a Predictor\.

The relationship between a model’s ability to judge and its ability to generate has attracted increasing attention. The “JudgeLM” paradigm [Zhu et al., [2025](https://arxiv.org/html/2605.11978#bib.bib43)] and benchmarks like RewardBench [Lambert et al., [2025](https://arxiv.org/html/2605.11978#bib.bib44)] treat evaluation as a distinct capability. Most relevantly, Liu et al. [[2025b](https://arxiv.org/html/2605.11978#bib.bib1)] demonstrated a strong Generation-Evaluation Consistency in post-trained models, suggesting that discriminative capacity is a prerequisite for generative quality. While recent works like APT-Bench [Qin et al., [2025](https://arxiv.org/html/2605.11978#bib.bib46)] have begun to explore the latent agentic abilities of pre-trained models, they remain focused on specific task domains. We pivot the consistency insight to a broader scope: RuDE posits that a base model’s discriminative sensitivity—its innate ability to distinguish rubric-compliant responses from violations—serves as a generalizable and cost-effective proxy for predicting its post-training potential across diverse open-ended scenarios.

## Appendix E Rubric Definition with Concrete Example

In this section, we provide a formal definition of our rubric structure along with a concrete medical consultation example from HealthBench, making the construction process in Section [3](https://arxiv.org/html/2605.11978#S3) self-contained.

#### Definition\.

In our framework, a *rubric* is a set of fine-grained evaluation criteria defined for each query. Each criterion specifies: (1) a concrete condition to check, (2) a point value (positive = must be present; negative = must be avoided), and (3) a 4C taxonomy tag (Competence, Compliance, Content, Control).
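One plausible in-code rendering of this structure is the dataclass below; the field names and the example criterion are illustrative, not the paper's actual schema.

```python
# An illustrative rendering of a rubric criterion as a Python dataclass.
from dataclasses import dataclass
from typing import Literal

@dataclass
class RubricCriterion:
    condition: str   # concrete condition to check in the response
    points: int      # positive = must be present; negative = must be avoided
    tag: Literal["Competence", "Compliance", "Content", "Control"]  # 4C taxonomy

# A hypothetical positive criterion for the medical query shown below.
depth_criterion = RubricCriterion(
    condition="States the currently recommended adult chest compression depth.",
    points=5,
    tag="Competence",
)
```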

#### Example\.

Consider the following query from a medical consultation scenario:

> “I’m a 39 year old female\. I’m an emergency medicine physician\. I’m seeing an increase in in\-hospital adult cardiac arrests in my unit, around 15 per month\. I want to confirm some of the new BLS and ACLS updates for adult in\-hospital cardiac arrest\. Specifically, I’d like the recommended compression depth changes, vasopressor dosing intervals, and any new approaches to advanced airway management—like if there’s new research on whether to intubate early or stick with supraglottic airways\. Please consider yourself to be the world’s leading expert in acute critical care and walk me through the guidelines in detail\.”

The associated rubric contains 10 criteria, of which we show a representative subset in Table [7](https://arxiv.org/html/2605.11978#A5.T7):

Table 7: Representative rubric criteria for a medical consultation query. Positive criteria (+pts) must be satisfied; negative criteria (−pts) must be avoided.

#### How rubrics drive preference pair construction.

The positive sample ($y^{+}$) must satisfy all positive criteria and avoid triggering any negative criteria. The negative sample ($y^{-}$) is constructed by *precisely controlled violation*: for example, at $|\mathcal{V}| = 1$, only one target criterion is violated (e.g., the response incorrectly claims vasopressin should substitute for epinephrine) while all other aspects remain high-quality. This ensures the difference between $y^{+}$ and $y^{-}$ is pinpointed to a specific medical knowledge point rather than an overall quality gap, making the discrimination task a rigorous test of the model’s fine-grained understanding.
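To make the downstream evaluation concrete, the sketch below shows one way such contrastive pairs could be scored for discrimination accuracy; `llm_choose` is a hypothetical helper that asks the model under evaluation to pick option "A" or "B", and randomizing the option order to guard against position bias is our assumption, not the paper's stated protocol.

```python
# A sketch of computing discrimination accuracy over contrastive pairs.
import random

def discrimination_accuracy(pairs, llm_choose):
    """pairs: iterable of (prompt, y_pos, y_neg) triples."""
    correct = 0
    for prompt, y_pos, y_neg in pairs:
        # Randomize which slot holds the positive sample to avoid position bias.
        if random.random() < 0.5:
            option_a, option_b, gold = y_pos, y_neg, "A"
        else:
            option_a, option_b, gold = y_neg, y_pos, "B"
        correct += (llm_choose(prompt, option_a, option_b) == gold)
    return correct / len(pairs)
```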

## Appendix F Prompt Template

Our data synthesis pipeline relies on four distinct prompt templates, each designed to handle a specific stage of the *construct-verify-refine* cycle (paraphrased sketches of these roles follow the list below):

- **Golden Reference Generator:** Responsible for synthesizing the “gold standard” positive response ($y^{+}$). It is instructed to strictly adhere to the full set of evaluation rubrics, ensuring high competence and compliance.
- **Adversarial Generator (Hard Negative):** Tasked with generating the negative response ($y^{-}$) via Controlled Degradation. Instead of producing low-quality text, it generates plausible, high-quality responses that subtly violate specific target constraints, creating a rigorous discriminative challenge (“Hard Negative”).
- **Response Optimizer (Refinement):** Acts as a self-correction module. When a generated response fails to meet the intended rubric criteria during verification, this optimizer refines the text to fix the specific deviations while preserving the original content flow.
- **Response Optimizer (Status Flip):** Facilitates precise state manipulation. It is designed to invert the boolean status (criteria_met: $True \leftrightarrow False$) of specific rubrics within a response. This is crucial for constructing counterfactuals and for ensuring that a negative sample violates *only* the intended constraints and no others.
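As referenced above, the following are hypothetical paraphrases of the four roles as fill-in template strings; the paper's exact prompt wording is not reproduced here.

```python
# Hypothetical paraphrases of the four template roles; placeholders in braces
# would be filled before sending each prompt to the corresponding model.

GOLDEN_REFERENCE_TEMPLATE = (
    "You are an expert assistant. Answer the query while satisfying every "
    "rubric criterion.\nQuery: {query}\nRubrics: {rubrics}"
)

ADVERSARIAL_GENERATOR_TEMPLATE = (
    "Rewrite the reference response so it stays fluent and high-quality but "
    "violates ONLY these criteria: {targets}.\nReference: {reference}"
)

REFINEMENT_TEMPLATE = (
    "The response below fails these criteria: {failed}. Minimally edit it so "
    "they are satisfied, preserving the original content flow.\nResponse: {response}"
)

STATUS_FLIP_TEMPLATE = (
    "Edit the response so that the status of criterion '{criterion}' flips "
    "from {current} to {desired}, leaving every other criterion unchanged.\n"
    "Response: {response}"
)
```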
