SCOPE: Self-Play via Co-Evolving Policies for Open-Ended Tasks

Hugging Face Daily Papers

Summary

SCOPE is a self-play framework for open-ended tasks that co-evolves a Challenger policy and a Solver policy, improving open-ended performance by up to +10.4 points across eight benchmarks without external supervision.

Self-play can train language models without external supervision. However, existing methods require rule-checkable answers, leaving open-ended tasks dependent on curated prompts or frontier-model judges. We introduce SCOPE, a data-free self-play framework for open-ended tasks that co-evolves two policies: a Challenger that generates document-grounded tasks, and a Solver that answers them through multi-turn retrieval. A frozen copy of the initial model serves as the self-judge, which writes task-specific rubrics from the source document and grades Solver responses against them. Across three 7-8B instruction-tuned models (Qwen2.5, Qwen3, OLMo-3), SCOPE improves open-ended performance by up to +10.4 points on eight benchmarks and matches or exceeds GRPO_data trained on ~9K curated prompts. Although trained only on open-ended tasks, SCOPE also improves held-out short-form QA by up to +13.8 points on seven held-out benchmarks, surpassing GRPO_data on all three models. Ablations show that co-evolving the Challenger is necessary to keep tasks near the Solver's frontier, that gains arise from improvements in both retrieval and synthesis with the relative contribution varying by task, and that rubric generation quality is the bottleneck for self-judging.
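
The roles described above can be sketched as a single self-play round. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Task`, `rubric_score`, `frontier_reward`, and `self_play_round` are hypothetical, the keyword-matching grader stands in for the frozen self-judge, and the Challenger reward (highest when the Solver's mean score sits near a target of 0.5) is only one assumed way to keep tasks "near the Solver's frontier". Multi-turn retrieval and the policy update step are omitted.

```python
from dataclasses import dataclass
from itertools import cycle
from statistics import mean
from typing import Callable, Dict, List


@dataclass
class Task:
    document: str   # source document the task is grounded in
    question: str   # open-ended question written by the Challenger


def rubric_score(response: str, rubric: List[str],
                 grade: Callable[[str, str], bool]) -> float:
    """Fraction of rubric criteria the judge marks as satisfied."""
    if not rubric:
        return 0.0
    return sum(grade(response, criterion) for criterion in rubric) / len(rubric)


def frontier_reward(solver_scores: List[float], target: float = 0.5) -> float:
    """Assumed Challenger reward: 1.0 when the mean Solver score equals
    `target`, falling linearly to 0.0 for tasks that are too easy or too hard."""
    return 1.0 - abs(mean(solver_scores) - target) / max(target, 1.0 - target)


def self_play_round(document: str,
                    challenger: Callable[[str], Task],
                    solver: Callable[[Task], str],
                    write_rubric: Callable[[Task], List[str]],
                    grade: Callable[[str, str], bool],
                    n_samples: int = 4) -> Dict[str, object]:
    """One round: generate a task, write a rubric, sample and grade answers."""
    task = challenger(document)
    rubric = write_rubric(task)          # frozen judge, sees the source document
    responses = [solver(task) for _ in range(n_samples)]
    solver_rewards = [rubric_score(r, rubric, grade) for r in responses]
    return {
        "task": task,
        "rubric": rubric,
        "solver_rewards": solver_rewards,
        "challenger_reward": frontier_reward(solver_rewards),
    }


# --- Toy stand-ins for the three models (all hypothetical) ---

def keyword_grade(response: str, criterion: str) -> bool:
    """Toy judge: a criterion is a keyword that must appear in the response."""
    return criterion.lower() in response.lower()


def make_canned_solver(answers: List[str]) -> Callable[[Task], str]:
    """Toy Solver that cycles through fixed answers instead of sampling an LLM."""
    stream = cycle(answers)
    return lambda task: next(stream)


doc = "The Nile flows north through Egypt into the Mediterranean Sea."
result = self_play_round(
    document=doc,
    challenger=lambda d: Task(d, "Describe the course of the Nile."),
    solver=make_canned_solver([
        "The Nile flows north.",
        "The Nile flows north into the Mediterranean Sea.",
        "The Nile is a river.",
        "It reaches the Mediterranean.",
    ]),
    write_rubric=lambda task: ["north", "mediterranean"],
    grade=keyword_grade,
)
print(result["solver_rewards"], result["challenger_reward"])
```

In a real training loop, the per-response rubric scores would serve as the Solver's reward signal and the frontier-style score as the Challenger's, with both policies updated while the judge stays frozen. The exact reward definitions used by SCOPE are not given in this summary.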

Cached at: 06/01/26, 07:18 AM

Source: https://huggingface.co/papers/2605.31433

Abstract

SCOPE is a self-play framework that trains language models on open-ended tasks through policy co-evolution, improving open-ended benchmarks by up to +10.4 points and held-out short-form QA by up to +13.8 points without external supervision.

Self-play can train language models without external supervision. However, existing methods require rule-checkable answers, leaving open-ended tasks dependent on curated prompts or frontier-model judges. We introduce SCOPE, a data-free self-play framework for open-ended tasks that co-evolves two policies: a Challenger that generates document-grounded tasks, and a Solver that answers them through multi-turn retrieval. A frozen copy of the initial model serves as the self-judge, which writes task-specific rubrics from the source document and grades Solver responses against them. Across three 7-8B instruction-tuned models (Qwen2.5, Qwen3, OLMo-3), SCOPE improves open-ended performance by up to +10.4 points on eight benchmarks and matches or exceeds GRPO_data trained on ~9K curated prompts. Although trained only on open-ended tasks, SCOPE also improves held-out short-form QA by up to +13.8 points on seven held-out benchmarks, surpassing GRPO_data on all three models. Ablations show that co-evolving the Challenger is necessary to keep tasks near the Solver’s frontier, that gains arise from improvements in both retrieval and synthesis with the relative contribution varying by task, and that rubric generation quality is the bottleneck for self-judging.

Get this paper in your agent:

hf papers read 2605.31433

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash
