G-Zero: Self-Play for Open-Ended Generation from Zero Data

Hugging Face Daily Papers

Summary

This paper introduces G-Zero, a verifier-free framework that enables autonomous large language model self-improvement through co-evolutionary training using intrinsic rewards and hint-based guidance. It aims to overcome the limitations of proxy LLM judges in open-ended tasks by deriving supervision from internal distributional dynamics.

Self-evolving LLMs excel in verifiable domains but struggle in open-ended tasks, where reliance on proxy LLM judges introduces capability bottlenecks and reward hacking. To overcome this, we introduce G-Zero, a verifier-free, co-evolutionary framework for autonomous self-improvement. Our core innovation is Hint-δ, an intrinsic reward that quantifies the predictive shift between a Generator model's unassisted response and its response conditioned on a self-generated hint. Using this signal, a Proposer model is trained via GRPO to continuously target the Generator's blind spots by synthesizing challenging queries and informative hints. The Generator is concurrently optimized via DPO to internalize these hint-guided improvements. Theoretically, we prove a best-iterate suboptimality guarantee for an idealized standard-DPO version of G-Zero, provided that the Proposer induces sufficient exploration coverage and the data filtering keeps pseudo-label score noise low. By deriving supervision entirely from internal distributional dynamics, G-Zero bypasses the capability ceilings of external judges, providing a scalable, robust pathway for continuous LLM self-evolution across unverifiable domains.
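
As a rough illustration of the three training signals named above, the Python sketch below shows one possible shape for each. The Hint-δ form used here (the gain in mean per-token log-probability of a fixed response once the hint is added to the context) is an assumption made for illustration, since the abstract does not state the exact definition. All function names are hypothetical. The group-normalized advantage and the DPO loss follow their commonly published forms, which may differ from the variants used in the paper.

```python
import math
from statistics import mean, pstdev


def hint_delta(logp_plain, logp_hinted):
    """ASSUMED Hint-delta-style score (not taken from the paper).

    Both inputs are per-token log-probabilities of the same response tokens,
    scored by the Generator without and with the hint in its context.
    A larger value means the hint shifted the prediction more.
    """
    if not logp_plain or len(logp_plain) != len(logp_hinted):
        raise ValueError("expected two non-empty sequences of equal length")
    return mean(logp_hinted) - mean(logp_plain)


def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: standardize rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def _softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one preference pair, from sequence log-probs.

    In a G-Zero-like setup the hint-guided response would plausibly play the
    'chosen' role and the unassisted one the 'rejected' role (an assumption).
    """
    margin = (pol_chosen - ref_chosen) - (pol_rejected - ref_rejected)
    # -log(sigmoid(beta * margin)) == softplus(-beta * margin)
    return _softplus(-beta * margin)
```

For example, `hint_delta([-2.0, -1.0], [-1.0, -0.5])` evaluates to 0.75 under this assumed definition, and `dpo_loss` returns log 2 when the policy and reference assign the same margin to both responses.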

Cached at: 05/12/26, 07:32 AM

Source: https://huggingface.co/papers/2605.09959

Similar Articles

MindZero: Learning Online Mental Reasoning With Zero Annotations

arXiv cs.AI

MindZero introduces a self-supervised reinforcement learning framework that trains multimodal large language models for efficient and robust online mental reasoning without requiring mental state annotations, outperforming model-based methods in accuracy and efficiency.

Self-Distillation Zero: Self-Revision Turns Binary Rewards into Dense Supervision

Hugging Face Daily Papers

Self-Distillation Zero (SD-Zero) is a novel training method that converts sparse binary rewards into dense token-level supervision through dual-role training where a model acts as both generator and reviser, achieving 10%+ improvements on math and code reasoning benchmarks with higher sample efficiency than RL approaches.

Self-Evolving Deep Research via Joint Generation and Evaluation

arXiv cs.CL

Researchers from HKUST, ByteDance, and UCL propose SCORE, a co-evolutionary training framework that jointly trains an LLM as both a deep research report generator and an evaluator, using a meta-harness to dynamically adjust evaluation difficulty and prevent reward saturation. Experiments show consistent improvement in open-ended research report quality.