G-Zero: Self-Play for Open-Ended Generation from Zero Data

Hugging Face Daily Papers 05/11/26, 12:00 AM Papers

Summary

This paper introduces G-Zero, a verifier-free framework that enables autonomous large language model self-improvement through co-evolutionary training using intrinsic rewards and hint-based guidance. It aims to overcome the limitations of proxy LLM judges in open-ended tasks by deriving supervision from internal distributional dynamics.

Self-evolving LLMs excel in verifiable domains but struggle in open-ended tasks, where reliance on proxy LLM judges introduces capability bottlenecks and reward hacking. To overcome this, we introduce G-Zero, a verifier-free, co-evolutionary framework for autonomous self-improvement. Our core innovation is Hint-δ, an intrinsic reward that quantifies the predictive shift between a Generator model's unassisted response and its response conditioned on a self-generated hint. Using this signal, a Proposer model is trained via GRPO to continuously target the Generator's blind spots by synthesizing challenging queries and informative hints. The Generator is concurrently optimized via DPO to internalize these hint-guided improvements. Theoretically, we prove a best-iterate suboptimality guarantee for an idealized standard-DPO version of G-Zero, provided that the Proposer induces sufficient exploration coverage and the data filtering keeps pseudo-label score noise low. By deriving supervision entirely from internal distributional dynamics, G-Zero bypasses the capability ceilings of external judges, providing a scalable, robust pathway for continuous LLM self-evolution across unverifiable domains.
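The abstract does not give a formula for Hint-δ, so the following is only a rough sketch of how such a signal could feed a GRPO-style update for the Proposer. It assumes, purely for illustration, that the "predictive shift" is measured as the mean per-token log-probability gain the Generator assigns to a response once the hint is in context. The function names `hint_delta` and `group_relative_advantages` and the toy numbers are hypothetical and not taken from the paper; only the group-wise standardization of rewards follows the usual GRPO recipe.

```python
from statistics import mean, pstdev


def hint_delta(logp_with_hint, logp_without_hint):
    """Hypothetical Hint-δ proxy (an assumption, not the paper's definition).

    Both arguments are per-token log-probabilities of the same response,
    scored by the Generator with and without the hint in its context.
    Returns the mean per-token log-probability gain due to the hint.
    """
    if not logp_with_hint or len(logp_with_hint) != len(logp_without_hint):
        raise ValueError("expected two non-empty, equally long sequences")
    return mean(a - b for a, b in zip(logp_with_hint, logp_without_hint))


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize each reward within its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy group of three (query, hint) proposals from the Proposer.
# Each entry holds made-up token log-probs: (with hint, without hint).
group = [
    ([-0.2, -0.3, -0.1], [-1.0, -1.2, -0.8]),  # hint shifts predictions a lot
    ([-0.5, -0.6, -0.4], [-0.6, -0.7, -0.5]),  # hint shifts predictions a little
    ([-0.9, -0.9, -0.9], [-0.9, -0.9, -0.9]),  # hint changes nothing
]

rewards = [hint_delta(with_hint, without) for with_hint, without in group]
advantages = group_relative_advantages(rewards)

for reward, advantage in zip(rewards, advantages):
    print(f"reward={reward:+.3f}  advantage={advantage:+.3f}")
```

Under this reading, proposals whose hints move the Generator's predictions the most receive positive advantages, which is one way the Proposer could be pushed toward the Generator's blind spots. A real system would obtain the log-probabilities from the model itself and would likely need the filtering step the abstract mentions.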

Original Article

Cached at: 05/12/26, 07:32 AM

Paper page - G-Zero: Self-Play for Open-Ended Generation from Zero Data

Source: https://huggingface.co/papers/2605.09959

Abstract

A novel verifier-free framework enables autonomous large language model self-improvement through co-evolutionary training with intrinsic rewards and hint-based guidance.

Self-evolving LLMs excel in verifiable domains but struggle in open-ended tasks, where reliance on proxy LLM judges introduces capability bottlenecks and reward hacking. To overcome this, we introduce G-Zero, a verifier-free, co-evolutionary framework for autonomous self-improvement. Our core innovation is Hint-δ, an intrinsic reward that quantifies the predictive shift between a Generator model’s unassisted response and its response conditioned on a self-generated hint. Using this signal, a Proposer model is trained via GRPO to continuously target the Generator’s blind spots by synthesizing challenging queries and informative hints. The Generator is concurrently optimized via DPO to internalize these hint-guided improvements. Theoretically, we prove a best-iterate suboptimality guarantee for an idealized standard-DPO version of G-Zero, provided that the Proposer induces sufficient exploration coverage and the data filtering keeps pseudo-label score noise low. By deriving supervision entirely from internal distributional dynamics, G-Zero bypasses the capability ceilings of external judges, providing a scalable, robust pathway for continuous LLM self-evolution across unverifiable domains.
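For reference, the Generator-side update can be read against the standard DPO objective, written out below. The formula itself is the usual DPO loss. The pairing is an assumption made here for illustration and is not stated in the abstract: $y^{+}$ is taken to be the hint-conditioned response, $y^{-}$ the unassisted response, and both are scored against the query $x$ without the hint, so that the policy $\pi_\theta$ learns to prefer the hint-guided answer on its own. $\pi_{\mathrm{ref}}$ is the frozen reference model and $\beta$ the usual DPO temperature.

```latex
\mathcal{L}_{\mathrm{DPO}}(\theta)
  = -\,\mathbb{E}_{(x,\,y^{+},\,y^{-})}
    \left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y^{+} \mid x)}{\pi_{\mathrm{ref}}(y^{+} \mid x)}
        - \beta \log \frac{\pi_\theta(y^{-} \mid x)}{\pi_{\mathrm{ref}}(y^{-} \mid x)}
      \right)
    \right]
```

The abstract's theoretical guarantee is stated for an idealized standard-DPO version of G-Zero, so the objective actually used in practice may differ from this form.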

Get this paper in your agent:

hf papers read 2605.09959

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Similar Articles

MindZero: Learning Online Mental Reasoning With Zero Annotations

Self-Distillation Zero: Self-Revision Turns Binary Rewards into Dense Supervision

Self-play helped AI achieve superhuman performance in Go, so why hasn’t it done the same for LLMs? Researchers have found a solution.

GRLO: Towards Generalizable Reinforcement Learning in Open-Ended Environments from Zero

From Search to Synthesis: Training LLMs as Zero-Shot Workflow Generators
