A new paper shows that a weak model generating k=8 candidate patches per task, combined with a critic-comparator selection loop, can match frontier-model performance on SWE-bench Verified, reaching 76.4% accuracy. The key insight is that a correct patch is often already present among the weak model's top-k candidates, so the hard part is selection: picking that patch out of the pool with the help of execution verification.
This paper studies verifier-backed committee search as inference-time boosting for reasoning language models, showing that a committee of weak reasoning models can match the performance of much stronger models on code repair tasks like SWE-bench Verified.
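The propose-then-select idea can be illustrated with a minimal sketch. This is not the paper's implementation: the function names (select_patch, passes_tests, prefer), the fallback when no candidate verifies, and the single-pass comparison order are all assumptions made for illustration, and the toy verifier and comparator stand in for real test execution and a real critic model.

```python
from typing import Callable, List, Optional


def select_patch(
    candidates: List[str],
    passes_tests: Callable[[str], bool],
    prefer: Callable[[str, str], str],
) -> Optional[str]:
    """Pick one patch from k candidates (hypothetical selection loop)."""
    if not candidates:
        return None
    # Execution verification: keep only candidates the verifier accepts.
    verified = [c for c in candidates if passes_tests(c)]
    # Assumption: if nothing verifies, fall back to the full pool.
    pool = verified or candidates
    # Critic-comparator loop: the current best meets each challenger in turn.
    best = pool[0]
    for challenger in pool[1:]:
        best = prefer(best, challenger)
    return best


# Toy stand-ins: a "verifier" that accepts patches containing "fix",
# and a "comparator" that prefers the shorter of two patches.
toy_candidates = ["noop", "fix: guard None and log", "fix: guard None", "refactor"]
toy_verifier = lambda patch: "fix" in patch
toy_comparator = lambda a, b: a if len(a) <= len(b) else b

chosen = select_patch(toy_candidates, toy_verifier, toy_comparator)
print(chosen)
```

In a real setting, passes_tests would run the repository's test suite against the patched code and prefer would query a critic model; the sketch only shows how verification narrows the pool before pairwise comparison picks a winner.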