Beyond Rubrics: Exploration-Guided Evaluation Skills for Reward Modeling

arXiv cs.CL 06/08/26, 04:00 AM Papers

Summary

Eval-Skill is an exploration-guided method that synthesizes reusable evaluation skills for reward modeling, achieving significant gains on RewardBench 2 over existing backbones.

arXiv:2606.07040v1 Announce Type: new Abstract: Open-ended reward modeling requires judges that can follow subtle, domain-specific preferences when verifiable answers are unavailable. Existing rubric-based methods often address this by generating criteria online for each query, but the extra generation step can add inference overhead and produce rigid or misaligned guidance. We introduce Eval-Skill, an exploration-guided method that synthesizes reusable evaluation skills for reward modeling and reframes reward guidance as context evolution rather than parameter training or per-query rubric generation. Using only 100 cases per domain for skill evolution, Eval-Skill synthesizes reusable domain-level evaluation skills through two progressive stages, workflow generation followed by principle generation, with exploration and selection interleaved across both stages. Once generated, a skill is directly injected into the judge context. Across multiple RM benchmarks, Eval-Skill consistently improves diverse judge backbones; on RewardBench 2, it yields significant gains over vanilla judging for each main backbone (+13.44% for Qwen3-8B and +18.51% for DeepSeek-V4-Flash). Further analyses of evolution-time scaling, generalizability, and transferability show that compact evaluation skills offer an efficient new paradigm for LLM-based evaluation. Code is available at https://github.com/xing-stellus-yue/Eval-Skill.

Cached at: 06/08/26, 09:21 AM

# Beyond Rubrics: Exploration-Guided Evaluation Skills for Reward Modeling
Source: [https://arxiv.org/abs/2606.07040](https://arxiv.org/abs/2606.07040)
[View PDF](https://arxiv.org/pdf/2606.07040)

> Abstract: Open-ended reward modeling requires judges that can follow subtle, domain-specific preferences when verifiable answers are unavailable. Existing rubric-based methods often address this by generating criteria online for each query, but the extra generation step can add inference overhead and produce rigid or misaligned guidance. We introduce Eval-Skill, an exploration-guided method that synthesizes reusable evaluation skills for reward modeling and reframes reward guidance as context evolution rather than parameter training or per-query rubric generation. Using only 100 cases per domain for skill evolution, Eval-Skill synthesizes reusable domain-level evaluation skills through two progressive stages, workflow generation followed by principle generation, with exploration and selection interleaved across both stages. Once generated, a skill is directly injected into the judge context. Across multiple RM benchmarks, Eval-Skill consistently improves diverse judge backbones; on RewardBench 2, it yields significant gains over vanilla judging for each main backbone (+13.44% for Qwen3-8B and +18.51% for DeepSeek-V4-Flash). Further analyses of evolution-time scaling, generalizability, and transferability show that compact evaluation skills offer an efficient new paradigm for LLM-based evaluation. Code is available at [this https URL](https://github.com/xing-stellus-yue/Eval-Skill).
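
The two-stage loop described in the abstract (workflow generation, then principle generation, each with exploration and selection) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the Eval-Skill code: the names `evolve_skill`, `explore_and_select`, and `toy_judge` are hypothetical, candidate proposals are passed in as fixed lists rather than generated by an LLM, and selection by accuracy on a small set of labeled preference cases is a guessed criterion.

```python
from typing import Callable, Sequence

# A preference case: (prompt, response_a, response_b, index of the preferred response).
Case = tuple[str, str, str, int]
# Hypothetical judge interface: given a skill text and a case, return 0 or 1.
Judge = Callable[[str, Case], int]


def accuracy(skill: str, cases: Sequence[Case], judge: Judge) -> float:
    """Fraction of cases where the judge, given `skill`, picks the preferred response."""
    return sum(judge(skill, c) == c[3] for c in cases) / len(cases)


def explore_and_select(base: str, proposals: Sequence[str],
                       cases: Sequence[Case], judge: Judge) -> str:
    """Try each proposal appended to `base`; keep the best, or `base` if none helps."""
    best, best_acc = base, accuracy(base, cases, judge)
    for proposal in proposals:
        candidate = f"{base}\n{proposal}".strip()
        acc = accuracy(candidate, cases, judge)
        if acc > best_acc:  # strict improvement only (an assumed selection rule)
            best, best_acc = candidate, acc
    return best


def evolve_skill(workflows: Sequence[str], principles: Sequence[str],
                 cases: Sequence[Case], judge: Judge) -> str:
    """Stage 1 selects a workflow; stage 2 extends it with principles."""
    skill = explore_and_select("", workflows, cases, judge)
    return explore_and_select(skill, principles, cases, judge)


def toy_judge(skill: str, case: Case) -> int:
    """Stand-in for an LLM judge: prefers longer answers unless told to be concise."""
    _, a, b, _ = case
    if "prefer concise" in skill:
        return 0 if len(a) <= len(b) else 1
    return 0 if len(a) >= len(b) else 1


CASES: list[Case] = [
    ("q1", "short", "a much longer answer", 0),
    ("q2", "a long rambling reply", "brief", 1),
]

skill = evolve_skill(
    workflows=["Step 1: compare both responses.", "Step 1: prefer concise answers."],
    principles=["Penalize unsupported claims."],
    cases=CASES,
    judge=toy_judge,
)
print(skill)
```

In this toy setting the selected skill is a plain string that would be prepended to the judge prompt, which mirrors the abstract's point that a skill is injected into the judge context rather than trained into parameters. A real setup would replace `toy_judge` with a call to the judge model and replace the fixed proposal lists with model-generated candidates.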

## Submission history

From: Xing Yue
**[v1]** Fri, 5 Jun 2026 08:34:06 UTC (8,904 KB)

Similar Articles

Skill-RM: Unifying Heterogeneous Evaluation Criteria via Agent Skill

RubricEM: Meta-RL with Rubric-guided Policy Decomposition beyond Verifiable Rewards

C2: Scalable Rubric-Augmented Reward Modeling from Binary Preferences

Agent Skill Evaluation and Evolution: Frameworks and Benchmarks

SkillEvolBench: Benchmarking the Evolution from Episodic Experience to Procedural Skills
