@ChenHenryWu: Self-improvement depends on whether a model can judge its own work. We usually train models to generate better - why no…


Summary

This tweet thread introduces research showing that training models to verify their own work can nearly double accuracy on hard math problems and improve scientific reasoning by 14x.


Cached at: 06/06/26, 01:22 AM

Self-improvement depends on whether a model can judge its own work. We usually train models to generate better - why not train them to verify just as well?

We show how to train models to pinpoint their errors, and the same model nearly doubles its accuracy on hard math and jumps 14x on scientific reasoning. 1/5

We want verification to tell not just whether a solution is wrong, but where and why, so self-improvement has a direction. But how should we train for that?

Our key idea: show the model the reference solution so it has more context to teach itself to reason about where the errors are and why. We call this self-trained verification (STV). We then put it in the loop to improve at both test and training time. 2/5

At test time, the trained verifier makes refinement actually scale. STV roughly doubles pass@1 on hard math, and with enough verification compute, an STV-guided 8B model even beats a model 4× its size. 3/5
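The thread does not spell out the refinement loop, so the following is only a minimal sketch of what verifier-guided refinement at test time could look like. The names `Verdict`, `refine_with_verifier`, `generate`, `verify`, `refine`, and `max_rounds` are all hypothetical, and the toy callables at the bottom stand in for real model calls.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Verdict:
    """Hypothetical verifier output: whether, where, and why a solution fails."""
    correct: bool
    error_location: Optional[str] = None  # where the verifier thinks the error is
    explanation: Optional[str] = None     # why it is an error


def refine_with_verifier(
    problem: str,
    generate: Callable[[str], str],
    verify: Callable[[str, str], Verdict],
    refine: Callable[[str, str, Verdict], str],
    max_rounds: int = 4,
) -> str:
    """Generate once, then alternate verify and refine.

    Makes at most `max_rounds` verifier calls. If the budget runs out,
    the last refinement is returned without a final check.
    """
    solution = generate(problem)
    for _ in range(max_rounds):
        verdict = verify(problem, solution)
        if verdict.correct:
            break
        # The verdict carries the error location and reason, so the
        # refinement step has a direction rather than a bare "wrong".
        solution = refine(problem, solution, verdict)
    return solution


# Toy stand-ins for model calls (assumptions, for illustration only).
def toy_generate(problem: str) -> str:
    return "5"  # deliberately wrong first attempt


def toy_verify(problem: str, solution: str) -> Verdict:
    if solution == "4":
        return Verdict(correct=True)
    return Verdict(False, error_location="final step", explanation="2 + 2 is 4")


def toy_refine(problem: str, solution: str, verdict: Verdict) -> str:
    return "4"


answer = refine_with_verifier("What is 2 + 2?", toy_generate, toy_verify, toy_refine)
```

In this sketch, "more verification compute" would correspond to a larger `max_rounds`; the thread may well allocate that compute differently.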

Next, we ask whether verifier-in-the-loop (ViL) training can improve the generator itself, especially after standard RLVR has saturated.

With test-time verification, we see a 33% pass@1 gain, as expected because the generator learns to use verifier output.

The surprise: even without a verifier at inference, standalone pass@1 improves by 30%. 4/5
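For readers unfamiliar with the metric: pass@1 is the fraction of problems solved by a single sampled attempt. The thread does not say whether its 33% and 30% gains are relative or absolute, so the sketch below only shows how a relative gain would be computed. The helper names are hypothetical and the numbers are placeholders, not results from the thread.

```python
from typing import Sequence


def pass_at_1(outcomes: Sequence[bool]) -> float:
    """Fraction of problems whose single sampled attempt was correct."""
    if not outcomes:
        raise ValueError("need at least one outcome")
    return sum(outcomes) / len(outcomes)


def relative_gain(before: float, after: float) -> float:
    """Relative improvement of `after` over `before` (0.3 means +30%)."""
    if before <= 0:
        raise ValueError("baseline must be positive")
    return (after - before) / before


# Placeholder outcomes, one boolean per problem (not data from the thread).
baseline = pass_at_1([True, False, False, False, True])
gain = relative_gain(0.20, 0.26)  # a hypothetical 0.20 -> 0.26 move is about +30% relative
```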
