more models more better. one expensive model is losing to three cheap ones, and there's a paper on it

Reddit r/artificial 06/26/26, 04:06 AM Papers

mixture-of-agents open-models committee alpacaeval model-comparison cost-efficiency decorrelated-errors

Summary

A mixture-of-agents paper (arxiv 2406.04692) shows that a committee of cheap open models can outperform GPT-4o on AlpacaEval 2.0 by leveraging decorrelated errors, and the author shares similar real-world findings where multiple cheap models catch more bugs than a single expensive model.

ok this one bugged me. there's a mixture-of-agents paper (arxiv 2406.04692) where a stack of open models, none of them frontier, get layered into a committee and beat gpt-4o on alpacaeval 2.0, 65.1 to 57.5. cheaper parts, better result.

and it lines up uncomfortably well with something we tripped into ourselves. our setup is a few models reviewing each change and only merging if they agree. we did not pick it because it's elegant, we picked it because one model reviewing its own work just rubber-stamps it. the surprise was that cheap models disagreeing with each other caught more than one expensive model being confident on its own.

the part nobody really tells you is why it works. it isn't that three models are individually smarter. it's that their mistakes don't line up. one big model has one set of blind spots and it hits them every time. three different ones miss in different places, so the disagreement is exactly where the bugs surface.

obviously there's a cost. it's slower, you're paying for a few calls instead of one, and a committee of cheap models can still share a wrong prior and all be confidently wrong together. but for anything where being wrong is expensive, the trade has been worth it for us.

anyone else gone multi-model instead of one big one? wondering if you hit the same decorrelated-errors effect or if it stops paying off once you scale up.
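
for anyone who wants the decorrelated-errors argument in numbers, here's a rough python sketch. to be clear, this is a toy model and not our actual pipeline or anything from the paper: the function names, the miss rates, and the "shared blind spot" probability are all made up for illustration. it assumes a bug only gets merged if every reviewer misses it, which is what a merge-only-if-all-agree gate gives you.

```python
from math import prod


def should_merge(verdicts):
    """Unanimous gate: merge only if every reviewer approved.

    `verdicts` is a list of booleans, one per reviewer (hypothetical shape).
    An empty list never merges, since nobody reviewed the change.
    """
    return len(verdicts) > 0 and all(verdicts)


def miss_rate_independent(miss_rates):
    """Chance a bug slips past every reviewer.

    ASSUMPTION: each reviewer misses independently of the others,
    so the individual miss probabilities simply multiply.
    """
    return prod(miss_rates)


def miss_rate_shared(miss_rates, shared):
    """Same as above, plus a common failure mode.

    ASSUMPTION: with probability `shared` the bug sits in a blind spot
    that all reviewers have (the "shared wrong prior" case), so all of
    them miss together; otherwise they miss independently.
    """
    return shared + (1 - shared) * prod(miss_rates)


if __name__ == "__main__":
    # Illustrative numbers only, not measurements.
    one_expensive = miss_rate_independent([0.10])
    three_cheap = miss_rate_independent([0.30, 0.30, 0.30])
    three_cheap_correlated = miss_rate_shared([0.30, 0.30, 0.30], shared=0.05)

    print(f"one expensive reviewer:        {one_expensive:.4f}")
    print(f"three cheap, independent:      {three_cheap:.4f}")
    print(f"three cheap, 5% shared misses: {three_cheap_correlated:.4f}")
```

with those invented numbers, three reviewers that each miss 30% of bugs let through about 2.7% if their misses are independent, versus 10% for the single stronger reviewer. add a 5% shared blind spot and the committee is at roughly 7.6%, so most of the advantage is gone. the independence assumption is doing all the work, which is why a committee of near-identical models buys you much less. the sketch also ignores the other side of the trade: a unanimous gate blocks more good changes, since any one reviewer's false alarm stops the merge.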


Similar Articles

AI agents feel much more reliable once multiple models are involved

@dair_ai: NEW paper worth reading. GPT-5.4 nano plus a critic-comparator orchestration loop hits 76.4% on SWE-bench Verified, mat…

@ChrisGPotts: We take for granted that larger models are better than smaller ones, but why is this so? Our new paper, led by Jing Hua…

A 4b model is now beating 30b ones at web research and the reason is not size

Five labs, five minds: building a multi-model finance drama on small models (6 minute read)
