more models more better. one expensive model is losing to three cheap ones, and there's a paper on it

Reddit r/artificial Papers

Summary

A mixture-of-agents paper (arxiv 2406.04692) shows that a committee of cheap open models can outperform GPT-4o on AlpacaEval 2.0 (65.1 vs. 57.5). The author attributes the effect to decorrelated errors and shares a similar real-world finding: several cheap models reviewing each change caught more bugs than a single expensive model.

ok this one bugged me. there's a mixture-of-agents paper (arxiv 2406.04692) where a stack of open models, none of them frontier, gets layered into a committee and beats gpt-4o on alpacaeval 2.0, 65.1 to 57.5. cheaper parts, better result.

and it lines up uncomfortably well with something we tripped into ourselves. our setup is a few models reviewing each change and only merging if they agree. we didn't pick it because it's elegant, we picked it because one model reviewing its own work just rubber-stamps it. the surprise was that cheap models disagreeing with each other caught more than one expensive model being confident on its own.

the part nobody really tells you is why it works. it isn't that three models are individually smarter. it's that their mistakes don't line up. one big model has one set of blind spots and it hits them every time. three different ones miss in different places, so the disagreement is exactly where the bugs surface.

obviously there's a cost. it's slower, you're paying for a few calls instead of one, and a committee of cheap models can still share a wrong prior and all be confidently wrong together. but for anything where being wrong is expensive, the trade has been worth it for us.

anyone else gone multi-model instead of one big one? wondering if you hit the same decorrelated-errors effect or if it stops paying off once you scale up.
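
for anyone who wants the shape of it, here's a rough python sketch of the agree-to-merge gate. everything in it is made up for illustration: the three reviewers are toy string checks standing in for calls to different models, and names like `committee_review` and `miss_probability` are hypothetical, not our actual setup or the paper's method. the last function is just the back-of-envelope math for why non-overlapping blind spots help, under the (optimistic) assumption that misses are fully independent.

```python
from typing import Callable, NamedTuple


class Verdict(NamedTuple):
    reviewer: str
    approve: bool
    reason: str


# A reviewer maps a diff to a Verdict. In a real setup each one would wrap a
# call to a different model; these stubs are hypothetical stand-ins that each
# have a different blind spot.
Reviewer = Callable[[str], Verdict]


def checks_secrets(diff: str) -> Verdict:
    bad = "password=" in diff.replace(" ", "")
    return Verdict("secrets", not bad, "hardcoded credential" if bad else "ok")


def checks_error_handling(diff: str) -> Verdict:
    bad = "except:" in diff
    return Verdict("error-handling", not bad, "bare except" if bad else "ok")


def checks_debug_output(diff: str) -> Verdict:
    bad = "print(" in diff
    return Verdict("debug-output", not bad, "leftover print" if bad else "ok")


REVIEWERS = [checks_secrets, checks_error_handling, checks_debug_output]


def committee_review(diff: str, reviewers: list[Reviewer]) -> tuple[bool, list[Verdict]]:
    """Merge only if every reviewer approves; return all verdicts for inspection."""
    verdicts = [review(diff) for review in reviewers]
    return all(v.approve for v in verdicts), verdicts


def miss_probability(p: float, n: int) -> float:
    """Chance that all n reviewers miss the same bug, ASSUMING independent misses.

    This is a best case: reviewers that share a wrong prior miss together,
    and the real number moves back toward p.
    """
    return p ** n


if __name__ == "__main__":
    risky = "try:\n    run()\nexcept:\n    pass\n"
    merge, verdicts = committee_review(risky, REVIEWERS)
    print("merge:", merge)
    for v in verdicts:
        if not v.approve:
            print("blocked by", v.reviewer, "-", v.reason)
    print("all three miss (p=0.3 each):", miss_probability(0.3, 3))
```

the point of returning every verdict instead of just the boolean is that the dissent is the useful part: a single "no" tells you where to look. and the independence assumption is doing a lot of work in that last function. the shared-wrong-prior case is exactly where it breaks.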

Similar Articles

@dair_ai: NEW paper worth reading. GPT-5.4 nano plus a critic-comparator orchestration loop hits 76.4% on SWE-bench Verified, mat…

X AI KOLs Following

A new paper shows that using a weak model with k=8 proposals and a critic-comparator selection loop can match frontier model performance on SWE-bench Verified, reaching 76.4% accuracy. The key insight is that correct patches are often already present in a weak model's top-k candidates, and the challenge is effective selection using execution verification.