@ChenHenryWu: Self-improvement depends on whether a model can judge its own work. We usually train models to generate better - why no…
Summary
This tweet thread introduces self-trained verification (STV), a way to train a model to pinpoint the errors in its own work. The authors report that the same model nearly doubles its accuracy on hard math problems and improves 14x on scientific reasoning.
Cached at: 06/06/26, 01:22 AM
Self-improvement depends on whether a model can judge its own work. We usually train models to generate better. Why not train them to verify just as well?
We show how to train models to pinpoint their errors, and the same model nearly doubles its accuracy on hard math and jumps 14x on scientific reasoning. 1/5
We want verification to tell not just whether a solution is wrong, but where and why, so self-improvement has a direction. But how should we train for that?
Our key idea: show the model the reference solution so it has more context to teach itself to reason about where the errors are and why. We call this self-trained verification (STV). We then put it in the loop to improve at both test and training time. 2/5
At test time, the trained verifier makes refinement actually scale. STV roughly doubles pass@1 on hard math, and with enough verification compute, STV-guided 8B even beats a 4× larger model. 3/5
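The test-time loop described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `Verdict`, `refine`, `toy_generate`, and `toy_verify` are hypothetical names, and the toy arithmetic checker merely stands in for a trained verifier that reports where a solution goes wrong and why.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Verdict:
    """Hypothetical verifier output: whether, where, and why a solution fails."""
    correct: bool
    error_step: Optional[int] = None  # index of the first faulty step
    reason: str = ""                  # short explanation of the error


Solution = List[str]
Generator = Callable[[str, Optional[Verdict]], Solution]
Verifier = Callable[[str, Solution], Verdict]


def refine(problem: str, generate: Generator, verify: Verifier,
           max_rounds: int = 4) -> Tuple[Solution, int]:
    """Alternate verification and regeneration until the verifier accepts.

    Returns the final solution and the number of verifier calls made.
    If the budget runs out, the last regenerated solution is returned unverified.
    """
    solution = generate(problem, None)
    for verifier_calls in range(1, max_rounds + 1):
        verdict = verify(problem, solution)
        if verdict.correct:
            return solution, verifier_calls
        # Feed the located error back so the next attempt has a direction.
        solution = generate(problem, verdict)
    return solution, max_rounds


def toy_generate(problem: str, feedback: Optional[Verdict]) -> Solution:
    # Assumption for illustration: the first attempt is wrong, the retry is right.
    return ["2 + 3 = 6"] if feedback is None else ["2 + 3 = 5"]


def toy_verify(problem: str, solution: Solution) -> Verdict:
    # Checks each "a + b = c" step and reports the first one that fails.
    for i, step in enumerate(solution):
        lhs, rhs = step.split("=")
        a, b = lhs.split("+")
        if int(a) + int(b) != int(rhs):
            return Verdict(False, i, f"{lhs.strip()} does not equal {rhs.strip()}")
    return Verdict(True)


if __name__ == "__main__":
    print(refine("2 + 3", toy_generate, toy_verify))
```

Raising `max_rounds` is the knob that corresponds to spending more verification compute. In a real setup both callables would wrap model calls.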
Next, we ask whether verifier-in-the-loop (ViL) training can improve the generator itself, especially after standard RLVR has saturated.
With test-time verification, we see a 33% pass@1 gain, as expected, because the generator learns to use verifier output.
The surprise: even without a verifier at inference, standalone pass@1 improves by 30%. 4/5