Trust Functions: Near-Lossless Weak-to-Strong Generalization by Learning When to Trust the Weak Teacher
Summary
Trust functions enable near-lossless weak-to-strong generalization by identifying reliable weak labels for training, achieving performance comparable to ground-truth supervision across multiple domains.
Cached at: 06/10/26, 12:08 AM
Source: https://huggingface.co/papers/2606.01000
Abstract
Trust functions enable effective weak-to-strong generalization by identifying reliable weak labels for training, achieving performance comparable to ground-truth supervision across multiple domains.
Weak-to-strong generalization studies how to improve a strong student using supervision from a weaker teacher when reliable labels are scarce. We view this primarily as a data selection problem, where the key challenge is to identify which weak labels are reliable enough to serve as a training signal. To address this, we introduce trust functions that assign each weak label a scalar trust score and use these scores to filter weak supervision. Across several domains, including world knowledge, quantitative reasoning, and strategy games, trust filtering yields students that match and sometimes surpass those trained with ground-truth supervision, achieving near-lossless weak-to-strong generalization. Moreover, trust functions enable an iterative weak-to-strong chain that compounds gains by training a student and reusing it as the next teacher. The advantage of trust functions can be attributed to several mechanisms.
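The filtering step and the iterative chain described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the abstract does not say how trust scores are computed or how they are turned into a keep/drop decision, so the `trust_fn` callable, the fixed `threshold`, the `train_student` callable, and the names `trust_filter` and `weak_to_strong_chain` are all assumptions made for this example.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

X = TypeVar("X")  # input example type
Y = TypeVar("Y")  # label type

# Hypothetical interfaces, assumed for illustration only.
TrustFn = Callable[[X, Y], float]                   # scalar trust score for (input, weak label)
Model = Callable[[X], Y]                            # a model maps an input to a label
Trainer = Callable[[List[Tuple[X, Y]]], Model]      # trains a student on kept pairs


def trust_filter(
    examples: Sequence[X],
    weak_labels: Sequence[Y],
    trust_fn: TrustFn,
    threshold: float = 0.8,
) -> List[Tuple[X, Y]]:
    """Keep only (example, weak label) pairs whose trust score meets the threshold.

    Thresholding is an assumed filtering rule; the source only states that
    trust scores are used to filter weak supervision.
    """
    if len(examples) != len(weak_labels):
        raise ValueError("examples and weak_labels must have the same length")
    return [
        (x, y)
        for x, y in zip(examples, weak_labels)
        if trust_fn(x, y) >= threshold
    ]


def weak_to_strong_chain(
    examples: Sequence[X],
    teacher: Model,
    train_student: Trainer,
    trust_fn: TrustFn,
    rounds: int = 2,
    threshold: float = 0.8,
) -> Model:
    """Repeat: label with the current teacher, filter by trust, train a student,
    then reuse that student as the next round's teacher."""
    for _ in range(rounds):
        weak_labels = [teacher(x) for x in examples]
        kept = trust_filter(examples, weak_labels, trust_fn, threshold)
        teacher = train_student(kept)
    return teacher
```

In this sketch the trust function is held fixed across rounds for simplicity; a real system might recompute or retrain it for each new teacher.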
Similar Articles
Trust-Region Behavior Blending for On-Policy Distillation
Trust-Region Behavior Blending (TRB) improves on-policy distillation by replacing poor early student rollouts with teacher-like behavior within a KL trust region during warmup, achieving stronger results on math-reasoning tasks.
A Systematic Study of Training-Free Methods for Trustworthy Large Language Models
A systematic study evaluating training-free methods for improving trustworthiness in large language models, categorizing approaches into input, internal, and output-level interventions while analyzing trade-offs between trustworthiness, utility, and robustness.
When Can LLMs Learn to Reason with Weak Supervision?
This paper systematically studies when LLMs can generalize in reasoning tasks under weak supervision (scarce data, noisy rewards, self-supervised proxy rewards), finding that reward saturation dynamics and reasoning faithfulness are key predictors, and that SFT on explicit reasoning traces is necessary for successful generalization under weak supervision.
Weak-to-strong generalization
OpenAI's Superalignment team introduces weak-to-strong generalization, a new research direction for empirically aligning superhuman AI models by addressing the fundamental challenge of how weak human supervisors can reliably control and steer AI systems vastly smarter than themselves.
Trust Region On-Policy Distillation
The paper proposes Trust Region On-Policy Distillation (TrOPD) to stabilize on-policy distillation of large language models by using trust regions, outlier estimation, and off-policy guidance, outperforming existing methods on reasoning and code generation benchmarks.