Trust Functions: Near-Lossless Weak-to-Strong Generalization by Learning When to Trust the Weak Teacher

Hugging Face Daily Papers

Summary

Trust functions enable near-lossless weak-to-strong generalization by identifying reliable weak labels for training, achieving performance comparable to ground-truth supervision across multiple domains.

Weak-to-strong generalization studies how to improve a strong student using supervision from a weaker teacher when reliable labels are scarce. We view this primarily as a data selection problem, where the key challenge is to identify which weak labels are reliable enough to serve as a training signal. To address this, we introduce trust functions that assign each weak label a scalar trust score and use these scores to filter weak supervision. Across several domains, including world knowledge, quantitative reasoning, and strategy games, trust filtering yields students that match and sometimes surpass ground-truth supervision, achieving near-lossless weak-to-strong generalization. Moreover, trust functions enable an iterative weak-to-strong chain that compounds gains by training a student and reusing it as the next teacher. The advantage of trust functions can be attributed to several mechanisms.

Cached at: 06/10/26, 12:08 AM

Source: https://huggingface.co/papers/2606.01000

Abstract

Trust functions enable effective weak-to-strong generalization by identifying reliable weak labels for training, achieving performance comparable to ground-truth supervision across multiple domains.

Weak-to-strong generalization studies how to improve a strong student using supervision from a weaker teacher when reliable labels are scarce. We view this primarily as a data selection problem, where the key challenge is to identify which weak labels are reliable enough to serve as a training signal. To address this, we introduce trust functions that assign each weak label a scalar trust score and use these scores to filter weak supervision. Across several domains, including world knowledge, quantitative reasoning, and strategy games, trust filtering yields students that match and sometimes surpass ground-truth supervision, achieving near-lossless weak-to-strong generalization. Moreover, trust functions enable an iterative weak-to-strong chain that compounds gains by training a student and reusing it as the next teacher. The advantage of trust functions can be attributed to several mechanisms.
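
The filtering step and the iterative chain described in the abstract can be illustrated with a minimal Python sketch. This is not the paper's implementation: the abstract does not say how trust scores are computed, so `confidence_trust` (which simply reuses the teacher's own confidence), the threshold value, the toy data, and the `label_with` and `train_student` callables are all hypothetical stand-ins.

```python
from typing import Any, Callable, List, Tuple

# (input text, weak label, teacher confidence in [0, 1]) -- illustrative format only
WeakExample = Tuple[str, str, float]
TrustFn = Callable[[WeakExample], float]


def confidence_trust(example: WeakExample) -> float:
    """Hypothetical trust function: reuse the teacher's own confidence as the score."""
    return example[2]


def filter_by_trust(
    examples: List[WeakExample], trust_fn: TrustFn, threshold: float
) -> List[WeakExample]:
    """Keep only the weak labels whose scalar trust score clears the threshold."""
    return [ex for ex in examples if trust_fn(ex) >= threshold]


def weak_to_strong_chain(
    teacher: Any,
    inputs: List[str],
    label_with: Callable[[Any, List[str]], List[WeakExample]],  # assumed labeling routine
    train_student: Callable[[List[WeakExample]], Any],          # assumed training routine
    trust_fn: TrustFn,
    threshold: float,
    rounds: int,
) -> Any:
    """Each round: label with the current teacher, filter by trust,
    train a student, then reuse that student as the next teacher."""
    for _ in range(rounds):
        weak_labeled = label_with(teacher, inputs)
        trusted = filter_by_trust(weak_labeled, trust_fn, threshold)
        teacher = train_student(trusted)
    return teacher


# Toy data: four weak labels with made-up teacher confidences.
weak_labeled = [
    ("2 + 2", "4", 0.97),
    ("capital of France", "Paris", 0.91),
    ("17 * 3", "41", 0.38),  # low-confidence (and wrong) label gets filtered out
    ("best opening move", "e4", 0.62),
]
trusted = filter_by_trust(weak_labeled, confidence_trust, threshold=0.6)
print([text for text, _, _ in trusted])
```

In this sketch the student only ever trains on the retained subset, which is the data selection framing the abstract describes; any other scoring rule can be swapped in for `confidence_trust` without changing the rest of the loop.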

Similar Articles

Trust-Region Behavior Blending for On-Policy Distillation

Hugging Face Daily Papers

Trust-Region behavior Blending (TRB) improves on-policy distillation by replacing poor early student rollouts with teacher-like behavior within a KL trust region during warmup, achieving stronger results on math-reasoning tasks.

When Can LLMs Learn to Reason with Weak Supervision?

Hugging Face Daily Papers

This paper systematically studies when LLMs can generalize in reasoning tasks under weak supervision (scarce data, noisy rewards, self-supervised proxy rewards), finding that reward saturation dynamics and reasoning faithfulness are key predictors, and that SFT on explicit reasoning traces is necessary for successful generalization under weak supervision.

Weak-to-strong generalization

OpenAI Blog

OpenAI's Superalignment team introduces weak-to-strong generalization, a new research direction for empirically aligning superhuman AI models by addressing the fundamental challenge of how weak human supervisors can reliably control and steer AI systems vastly smarter than themselves.

Trust Region On-Policy Distillation

Hugging Face Daily Papers

The paper proposes Trust Region On-Policy Distillation (TrOPD) to stabilize on-policy distillation of large language models by using trust regions, outlier estimation, and off-policy guidance, outperforming existing methods on reasoning and code generation benchmarks.