Beyond Rubrics: Exploration-Guided Evaluation Skills for Reward Modeling

arXiv cs.CL Papers

Summary

Eval-Skill is an exploration-guided method that synthesizes reusable, domain-level evaluation skills for reward modeling and injects them directly into the judge context. On RewardBench 2, it yields significant gains over vanilla judging for each main judge backbone.

arXiv:2606.07040v1. Abstract: Open-ended reward modeling requires judges that can follow subtle, domain-specific preferences when verifiable answers are unavailable. Existing rubric-based methods often address this by generating criteria online for each query, but the extra generation step can add inference overhead and produce rigid or misaligned guidance. We introduce Eval-Skill, an exploration-guided method that synthesizes reusable evaluation skills for reward modeling and reframes reward guidance as context evolution rather than parameter training or per-query rubric generation. Using only 100 cases per domain for skill evolution, Eval-Skill synthesizes reusable domain-level evaluation skills through two progressive stages, workflow generation followed by principle generation, with exploration and selection interleaved across both stages. Once generated, a skill is directly injected into the judge context. Across multiple RM benchmarks, Eval-Skill consistently improves diverse judge backbones; on RewardBench 2, it yields significant gains over vanilla judging for each main backbone (+13.44% for Qwen3-8B, and 18.51% for DeepSeek-V4-Flash). Further analyses of evolution-time scaling, generalizability, and transferability show that compact evaluation skills offer an efficient new paradigm for LLM-based evaluation. Code is available at https://github.com/xing-stellus-yue/Eval-Skill.

# Beyond Rubrics: Exploration-Guided Evaluation Skills for Reward Modeling
Source: [https://arxiv.org/abs/2606.07040](https://arxiv.org/abs/2606.07040)
[View PDF](https://arxiv.org/pdf/2606.07040)

> Abstract: Open-ended reward modeling requires judges that can follow subtle, domain-specific preferences when verifiable answers are unavailable. Existing rubric-based methods often address this by generating criteria online for each query, but the extra generation step can add inference overhead and produce rigid or misaligned guidance. We introduce Eval-Skill, an exploration-guided method that synthesizes reusable evaluation skills for reward modeling and reframes reward guidance as context evolution rather than parameter training or per-query rubric generation. Using only 100 cases per domain for skill evolution, Eval-Skill synthesizes reusable domain-level evaluation skills through two progressive stages, workflow generation followed by principle generation, with exploration and selection interleaved across both stages. Once generated, a skill is directly injected into the judge context. Across multiple RM benchmarks, Eval-Skill consistently improves diverse judge backbones; on RewardBench 2, it yields significant gains over vanilla judging for each main backbone (+13.44% for Qwen3-8B, and 18.51% for DeepSeek-V4-Flash). Further analyses of evolution-time scaling, generalizability, and transferability show that compact evaluation skills offer an efficient new paradigm for LLM-based evaluation. Code is available at [this https URL](https://github.com/xing-stellus-yue/Eval-Skill).
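
The two-stage procedure described in the abstract (workflow generation, then principle generation, each with exploration and selection) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the authors' implementation: the `Case` type, the `toy_judge` stand-in for an LLM judge, the hand-written `propose_*` functions, and the greedy accept-if-better selection rule are all hypothetical. The abstract reports using 100 cases per domain; the sketch uses four toy cases.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class Case:
    """One preference pair with a gold label (hypothetical schema)."""
    prompt: str
    response_a: str
    response_b: str
    preferred: str  # "A" or "B"


# Hypothetical judge interface: (skill text placed in context, case) -> "A" or "B".
Judge = Callable[[str, Case], str]
Proposer = Callable[[str], List[str]]


def accuracy(judge: Judge, skill: str, cases: Sequence[Case]) -> float:
    """Fraction of cases where the judge, given the skill, picks the gold label."""
    return sum(judge(skill, c) == c.preferred for c in cases) / len(cases)


def evolve_stage(judge: Judge, skill: str, propose: Proposer,
                 cases: Sequence[Case], rounds: int) -> Tuple[str, float]:
    """Greedy loop: explore candidate skills, keep one only if accuracy improves."""
    best, best_acc = skill, accuracy(judge, skill, cases)
    for _ in range(rounds):
        for candidate in propose(best):          # exploration
            acc = accuracy(judge, candidate, cases)
            if acc > best_acc:                   # selection
                best, best_acc = candidate, acc
    return best, best_acc


def evolve_skill(judge: Judge, cases: Sequence[Case], propose_workflow: Proposer,
                 propose_principles: Proposer, rounds: int = 2) -> Tuple[str, float]:
    """Stage 1 evolves a workflow; stage 2 adds principles on top of it."""
    workflow, _ = evolve_stage(judge, "", propose_workflow, cases, rounds)
    return evolve_stage(judge, workflow, propose_principles, cases, rounds)


# --- Toy stand-ins (assumptions; a real setup would call an LLM here) ---
def toy_judge(skill: str, case: Case) -> str:
    a, b = case.response_a, case.response_b
    if "penalize errors" in skill and ("[error]" in a) != ("[error]" in b):
        return "B" if "[error]" in a else "A"
    if "compare length" in skill:
        return "A" if len(a) >= len(b) else "B"
    return "A"  # vanilla behavior: no guidance, always picks A


def propose_workflow(skill: str) -> List[str]:
    return [skill + "Workflow: compare length.\n", skill + "Workflow: read once.\n"]


def propose_principles(skill: str) -> List[str]:
    return [skill + "Principle: penalize errors.\n", skill + "Principle: be polite.\n"]


CASES = [
    Case("q1", "short", "a longer, correct answer", "B"),
    Case("q2", "brief", "a much more complete answer", "B"),
    Case("q3", "a detailed correct answer", "brief", "A"),
    Case("q4", "ok", "a long reply [error] padded out", "A"),
]

skill, acc = evolve_skill(toy_judge, CASES, propose_workflow, propose_principles)
print(skill, acc)
```

In this toy setting the evolved skill is a short text that is reused unchanged for every query in the domain, which is the property the abstract contrasts with per-query rubric generation.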

## Submission history

From: Xing Yue **[v1]** Fri, 5 Jun 2026 08:34:06 UTC (8,904 KB)

Similar Articles

Skill-RM: Unifying Heterogeneous Evaluation Criteria via Agent Skill

Hugging Face Daily Papers

Skill-RM proposes a unified reward modeling framework that treats reward computation as a structured agentic task, enabling dynamic evidence aggregation and consistent evaluation across diverse applications, outperforming traditional judge baselines.

RubricEM: Meta-RL with Rubric-guided Policy Decomposition beyond Verifiable Rewards

Hugging Face Daily Papers

This paper introduces RubricEM, a reinforcement learning framework that uses rubric-guided policy decomposition and reflection-based meta-policy evolution to train deep research agents for long-form tasks. The resulting RubricEM-8B model demonstrates strong performance on long-form research benchmarks by leveraging stage-aware planning and denser semantic feedback.

C2: Scalable Rubric-Augmented Reward Modeling from Binary Preferences

Hugging Face Daily Papers

C2 proposes a scalable rubric-augmented reward modeling framework that trains a cooperative rubric generator and a critical verifier exclusively from binary preferences, eliminating the need for costly rubric annotations while achieving gains of up to 6.5 points on RM-Bench.

SkillEvolBench: Benchmarking the Evolution from Episodic Experience to Procedural Skills

Hugging Face Daily Papers

SkillEvolBench is a diagnostic benchmark for evaluating whether large language model agents can distill episodic experience into reusable procedural skills. It includes 180 tasks across six environments and finds that current agents often struggle to form robust reusable skills, with raw trajectory reuse often outperforming distilled skills.