G-Zero: Self-Play for Open-Ended Generation from Zero Data
Summary
This paper introduces G-Zero, a verifier-free framework that enables autonomous large language model self-improvement through co-evolutionary training using intrinsic rewards and hint-based guidance. It aims to overcome the limitations of proxy LLM judges in open-ended tasks by deriving supervision from internal distributional dynamics.
Cached at: 05/12/26, 07:32 AM
Source: https://huggingface.co/papers/2605.09959
Abstract
A novel verifier-free framework enables autonomous large language model self-improvement through co-evolutionary training with intrinsic rewards and hint-based guidance.
Self-evolving LLMs excel in verifiable domains but struggle in open-ended tasks, where reliance on proxy LLM judges introduces capability bottlenecks and reward hacking. To overcome this, we introduce G-Zero, a verifier-free, co-evolutionary framework for autonomous self-improvement. Our core innovation is Hint-δ, an intrinsic reward that quantifies the predictive shift between a Generator model’s unassisted response and its response conditioned on a self-generated hint. Using this signal, a Proposer model is trained via GRPO to continuously target the Generator’s blind spots by synthesizing challenging queries and informative hints. The Generator is concurrently optimized via DPO to internalize these hint-guided improvements. Theoretically, we prove a best-iterate suboptimality guarantee for an idealized standard-DPO version of G-Zero, provided that the Proposer induces sufficient exploration coverage and the data filtration keeps pseudo-label score noise low. By deriving supervision entirely from internal distributional dynamics, G-Zero bypasses the capability ceilings of external judges, providing a scalable, robust pathway for continuous LLM self-evolution across unverifiable domains.
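The abstract names three ingredients (the Hint-δ reward, GRPO for the Proposer, DPO for the Generator) without giving formulas. The sketch below shows how they could fit together, under assumptions that are not confirmed by the abstract: `hint_delta` treats the "predictive shift" as the mean per-token log-probability gain of a response once the hint is in context, which is only one plausible reading; `group_relative_advantages` is the usual GRPO-style normalization of rewards within a sampled group; `dpo_loss` is the standard DPO objective, with the hint-guided response assumed to be the preferred one. All function and parameter names are hypothetical.

```python
import math
from statistics import mean, pstdev


def hint_delta(logp_with_hint, logp_without_hint):
    """Assumed form of Hint-delta (not taken from the paper).

    Both arguments are per-token log-probabilities of the SAME response,
    scored once with the hint in context and once without it.
    Returns the mean per-token log-probability gain from the hint.
    """
    if len(logp_with_hint) != len(logp_without_hint):
        raise ValueError("both sequences must score the same tokens")
    return mean(a - b for a, b in zip(logp_with_hint, logp_without_hint))


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    Inputs are sequence log-probabilities under the policy and the frozen
    reference model. Assumption: 'chosen' is the hint-guided response and
    'rejected' is the unassisted one.
    """
    margin = beta * ((pol_chosen - ref_chosen) - (pol_rejected - ref_rejected))
    # -log(sigmoid(margin)) written as a numerically stable softplus(-margin)
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


if __name__ == "__main__":
    # Toy numbers, purely illustrative.
    rewards = [
        hint_delta([-1.0, -2.0], [-2.0, -4.0]),
        hint_delta([-1.0, -1.0], [-1.0, -1.0]),
    ]
    print(rewards)
    print(group_relative_advantages(rewards))
    print(dpo_loss(-5.0, -9.0, -6.0, -8.0))
```

In this reading, a hint that makes no difference to the Generator yields a reward of zero, so the Proposer gains nothing from queries the Generator already handles; whether the paper's reward behaves the same way depends on its actual definition.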
Similar Articles
MindZero: Learning Online Mental Reasoning With Zero Annotations
MindZero introduces a self-supervised reinforcement learning framework that trains multimodal large language models for efficient and robust online mental reasoning without requiring mental state annotations, outperforming model-based methods in accuracy and efficiency.
Self-Distillation Zero: Self-Revision Turns Binary Rewards into Dense Supervision
Self-Distillation Zero (SD-Zero) is a novel training method that converts sparse binary rewards into dense token-level supervision through dual-role training where a model acts as both generator and reviser, achieving 10%+ improvements on math and code reasoning benchmarks with higher sample efficiency than RL approaches.
Self-play helped AI achieve superhuman performance in Go, so why hasn’t it done the same for LLMs? Researchers have found a solution.
Researchers introduce Self-Guided Self-Play (SGS), a self-play algorithm for LLMs that prevents reward hacking by using a Guide role to score synthetic problems. Applied to theorem proving in Lean4, SGS surpasses RL baselines and allows a 7B model to outperform a 671B model.
GRLO: Towards Generalizable Reinforcement Learning in Open-Ended Environments from Zero
GRLO introduces a novel reinforcement learning post-training method that achieves strong generalization across multiple domains (math, code, etc.) from only 5K prompts and 22.7 GPU hours, significantly outperforming in-domain RLVR baselines in efficiency and data requirements.
Self-Evolving Deep Research via Joint Generation and Evaluation
Researchers from HKUST, ByteDance, and UCL propose SCORE, a co-evolutionary training framework that jointly trains an LLM as both a deep research report generator and an evaluator, using a meta-harness to dynamically adjust evaluation difficulty and prevent reward saturation. Experiments show consistent improvement in open-ended research report quality.