SCOPE: Self-Play via Co-Evolving Policies for Open-Ended Tasks
Summary
SCOPE is a self-play framework for open-ended tasks that co-evolves a Challenger and Solver policy, achieving up to +10.4 points on benchmarks without external supervision.
Cached at: 06/01/26, 07:18 AM
Source: https://huggingface.co/papers/2605.31433
Abstract
SCOPE is a self-play framework that trains language models on open-ended tasks through policy co-evolution, achieving superior performance on both targeted and held-out benchmarks without external supervision.
Self-play can train language models without external supervision. However, existing methods require rule-checkable answers, leaving open-ended tasks dependent on curated prompts or frontier-model judges. We introduce SCOPE, a data-free self-play framework for open-ended tasks that co-evolves two policies: a Challenger that generates document-grounded tasks, and a Solver that answers them through multi-turn retrieval. A frozen copy of the initial model serves as the self-judge, which writes task-specific rubrics from the source document and grades Solver responses against them. Across three 7-8B instruction-tuned models (Qwen2.5, Qwen3, OLMo-3), SCOPE improves open-ended performance by up to +10.4 points on eight benchmarks and matches or exceeds GRPO_data trained on ~9K curated prompts. Although trained only on open-ended tasks, SCOPE also improves held-out short-form QA by up to +13.8 points on seven held-out benchmarks, surpassing GRPO_data on all three models. Ablations show that co-evolving the Challenger is necessary to keep tasks near the Solver’s frontier, that gains arise from improvements in both retrieval and synthesis with the relative contribution varying by task, and that rubric generation quality is the bottleneck for self-judging.
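The loop described in the abstract can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the paper's implementation: every name here (`Task`, `rubric_score`, `frontier_reward`, `self_play_step`) is hypothetical, the judge is replaced by simple substring matching rather than a frozen language model, multi-turn retrieval is omitted, and the Challenger reward that peaks at a mean Solver score of 0.5 is only one assumed way to keep tasks near the Solver's frontier.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Task:
    document: str  # source document the task is grounded in
    question: str  # open-ended question written by the Challenger


def rubric_score(response: str, rubric: List[str]) -> float:
    """Toy stand-in for the frozen self-judge (assumption): the fraction of
    rubric items whose text appears in the response. A real judge would be
    a language model grading against the rubric."""
    if not rubric:
        return 0.0
    text = response.lower()
    hits = sum(1 for item in rubric if item.lower() in text)
    return hits / len(rubric)


def frontier_reward(solver_scores: List[float]) -> float:
    """Assumed Challenger reward (not taken from the paper): 1.0 when the
    mean Solver score is 0.5, falling linearly to 0.0 at means of 0 or 1."""
    mean = sum(solver_scores) / len(solver_scores)
    return 1.0 - 2.0 * abs(mean - 0.5)


def self_play_step(
    challenger: Callable[[str], Task],
    solver: Callable[[Task], str],
    write_rubric: Callable[[Task], List[str]],
    document: str,
    n_samples: int = 4,
) -> Tuple[Task, List[float], float]:
    """One hypothetical round: generate a task, write its rubric, sample
    several Solver answers, and score both policies."""
    task = challenger(document)      # Challenger proposes a grounded task
    rubric = write_rubric(task)      # judge writes a task-specific rubric
    scores = [rubric_score(solver(task), rubric) for _ in range(n_samples)]
    return task, scores, frontier_reward(scores)
```

In a sketch like this, the per-sample scores would serve as the Solver's reward and the frontier reward as the Challenger's, with both policies updated by whatever policy-gradient method is in use; the abstract does not specify those details.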
Similar Articles
OpenSkill: Open-World Self-Evolution for LLM Agents
OpenSkill is a framework for LLM agents to self-evolve skills and verification signals from open-world resources without target-task supervision, achieving high performance across benchmarks.
SocraticPO: Policy Optimization via Interactive Guidance
SocraticPO augments RL rollouts with Socratic-style natural language guidance and reward decay to improve scientific reasoning in LLMs, outperforming strong baselines on SciKnowEval benchmarks.
SCOPE: Structured Decomposition and Conditional Skill Orchestration for Complex Image Generation
SCOPE is a specification-guided framework for text-to-image generation that tracks semantic commitments to better fulfill complex visual intents. It introduces the Gen-Arena benchmark and demonstrates strong performance on complex generation tasks.
CoSPlay: Cooperative Self-Play at Test-Time with Self-Generated Code and Unit Test
CoSPlay is a training-free framework that jointly improves code generation and unit test quality through cooperative self-play, achieving competitive performance without ground-truth unit tests.
PACEvolve++: Improving Test-time Learning for Evolutionary Search Agents
The paper introduces PACEvolve++, a reinforcement learning framework that improves test-time policy adaptation for evolutionary search agents by decoupling hypothesis generation from execution.