@ChenHenryWu: Self-improvement depends on whether a model can judge its own work. We usually train models to generate better - why no…

X AI KOLs Timeline 06/05/26, 07:52 PM Papers

self-improvement verification error-pinpointing training math-reasoning scientific-reasoning

Summary

This tweet thread introduces self-trained verification (STV), which trains a model to pinpoint where and why its own solutions go wrong. The authors report that the same model nearly doubles its accuracy on hard math and improves 14x on scientific reasoning.

Original Article

Cached at: 06/06/26, 01:22 AM

Self-improvement depends on whether a model can judge its own work. We usually train models to generate better - why not train them to verify just as well?

We show how to train models to pinpoint their errors, and the same model nearly doubles its accuracy on hard math and jumps 14x on scientific reasoning. 1/5

We want verification to tell not just whether a solution is wrong, but where and why, so self-improvement has a direction. But how should we train for that?

Our key idea: show the model the reference solution so it has more context to teach itself to reason about where the errors are and why. We call this self-trained verification (STV). We then put it in the loop to improve at both test and training time. 2/5
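One way to picture "show the model the reference solution" is as an optional extra field in the verifier's prompt. The sketch below is only an illustration: the thread does not give a prompt format or training recipe, so the function name `build_verification_prompt`, the field labels, and the instruction text are all hypothetical.

```python
from typing import Optional


def build_verification_prompt(problem: str, candidate: str,
                              reference: Optional[str] = None) -> str:
    """Assemble a verifier prompt. The layout is hypothetical, not from the thread."""
    parts = [f"Problem:\n{problem}", f"Candidate solution:\n{candidate}"]
    if reference is not None:
        # Extra context: the reference solution is shown alongside the candidate.
        parts.append(f"Reference solution:\n{reference}")
    parts.append(
        "Identify the first incorrect step in the candidate solution and "
        "explain why it is wrong, or state that the solution is correct."
    )
    return "\n\n".join(parts)


# Toy example with an arithmetic slip in the candidate.
prompt = build_verification_prompt(
    problem="Compute 3 + 4 + 5.",
    candidate="3 + 4 = 8, then 8 + 5 = 13.",
    reference="3 + 4 = 7, then 7 + 5 = 12.",
)
print(prompt)
```

Calling the function without `reference` yields the same prompt minus the reference field, which is presumably closer to what a verifier sees when no reference is available.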

At test time, the trained verifier makes refinement actually scale. STV roughly doubles pass@1 on hard math, and with enough verification compute, an STV-guided 8B model even beats a model 4× its size. 3/5
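A verifier that says where and why a solution fails can drive a simple generate, verify, refine loop. The following is a minimal sketch of that general pattern, not the authors' implementation: `Verdict`, `refine`, `max_rounds`, and the toy generator and verifier are all assumptions made for illustration, with plain functions standing in for model calls.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Verdict:
    correct: bool
    error_step: Optional[int] = None  # where the first error is
    reason: str = ""                  # why it is an error


def refine(problem: List[int],
           generate: Callable[[List[int], Optional[Verdict]], List[int]],
           verify: Callable[[List[int], List[int]], Verdict],
           max_rounds: int = 4) -> Tuple[List[int], int]:
    """Return the final solution and the number of verifier calls used."""
    solution = generate(problem, None)
    calls = 0
    for _ in range(max_rounds):
        verdict = verify(problem, solution)
        calls += 1
        if verdict.correct:
            break
        # Feed the pinpointed error back to the generator.
        solution = generate(problem, verdict)
    return solution, calls


# Toy task: write the running sums of a list of integers.
def toy_generate(problem: List[int], feedback: Optional[Verdict]) -> List[int]:
    sums, total = [], 0
    for i, x in enumerate(problem):
        # The first draft has a deliberate slip at step 1; any feedback removes it.
        slip = 1 if (feedback is None and i == 1) else 0
        total += x + slip
        sums.append(total)
    return sums


def toy_verify(problem: List[int], solution: List[int]) -> Verdict:
    total = 0
    for i, x in enumerate(problem):
        total += x
        if solution[i] != total:
            return Verdict(False, i, f"expected {total}, got {solution[i]}")
    return Verdict(True)


final_solution, verifier_calls = refine([3, 4, 5], toy_generate, toy_verify)
```

Here `max_rounds` is the knob that spends more verification compute per problem; in a real system both callables would be model calls rather than deterministic functions.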

Next, we ask whether verifier-in-the-loop (ViL) training can improve the generator itself, especially after standard RLVR has saturated.

With test-time verification, we see a 33% pass@1 gain, as expected because the generator learns to use verifier output.

The surprise: even without a verifier at inference, standalone pass@1 improves by 30%. 4/5

Similar Articles

I Let a Small Model Train on Its Own Mistakes. It Reached 80% on HumanEval and Beat GPT-3.5 on Math

@bcherny: We talk a lot about how important it is to set up self-verification loops. Especially in the age of powerful models tha…

@rohanpaul_ai: A Primer paper about how reasoning models improve after training Shows that better reasoning models depend less on raw …

@omarsar0: Very good advice on self-improving agents. (bookmark it) This is something I am seeing in my own experiments with codin…

@bradenjhancock: In other words: Humans are teaching teacher models how to teach other models the way good human teachers teach other hu…
