@zhengyaojiang: We benchmarked 7 frontier models on 3 categories of autoresearch tasks: ML engineering, harness/prompt engineering, and…

X AI KOLs Following 06/14/26, 05:36 PM News

benchmarking autoresearch ml-engineering frontier-models ai-research open-source cost-constraint

Summary

Researchers benchmarked 7 frontier models on autoresearch tasks. Fable-5 won overall, but the open model Kimi-K2.7-Code surpassed others on ML engineering tasks.

We benchmarked 7 frontier models on 3 categories of autoresearch tasks: ML engineering, harness/prompt engineering, and algorithmic discovery. Fable-5 won overall even under cost constraint, but on ML engineering, the open model Kimi-K2.7-Code surpassed frontier models.🧵(1/5) https://t.co/KzePspXd0Z


Cached at: 06/15/26, 09:01 AM

Benchmark protocol:

Runs are cost-constrained (LLM + eval cost), not step-constrained. This means an agent can run more steps if its model or solutions are cheaper to run.
All models use the autonomous research harness behind the @WecoAI service.
Scores should be interpreted as how good the final solution is compared to a naive ReAct agent. (2/5)
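
To make the cost-constrained protocol concrete, here is a minimal sketch of a budgeted search loop. The thread does not describe the harness internals, so everything here is an assumption for illustration: the names `StepResult`, `make_step`, and `run_cost_bounded`, the dollar amounts, and the toy score curve are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class StepResult:
    score: float      # quality of the candidate solution from this step
    llm_cost: float   # hypothetical cost of the LLM calls for this step
    eval_cost: float  # hypothetical cost of evaluating the candidate


def make_step(llm_cost, eval_cost):
    """Build a toy step function with a fixed per-step cost (illustrative only)."""
    def step(i):
        # Toy score curve that improves with more steps; not real benchmark data.
        return StepResult(score=1.0 - 1.0 / (i + 2), llm_cost=llm_cost, eval_cost=eval_cost)
    return step


def run_cost_bounded(step_fn, budget):
    """Run steps until the total (LLM + eval) spend reaches the budget."""
    spent, steps, best = 0.0, 0, float("-inf")
    while spent < budget:
        result = step_fn(steps)
        spent += result.llm_cost + result.eval_cost
        steps += 1
        best = max(best, result.score)
    return best, steps, spent


# Same budget, different per-step cost: the cheaper configuration gets more steps.
cheap = run_cost_bounded(make_step(llm_cost=0.25, eval_cost=0.25), budget=10.0)
pricey = run_cost_bounded(make_step(llm_cost=1.75, eval_cost=0.25), budget=10.0)
print("cheap model steps:", cheap[1], "| expensive model steps:", pricey[1])
```

Under a fixed budget, the step count depends on both the model price and the evaluation cost, which is why low eval cost favors cheaper models in terms of iterations.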

Overall, Fable is a really strong model for autoresearch. It dominates harness/prompt engineering and algorithmic discovery tasks. We were especially surprised by the algorithmic discovery results, because the eval cost is low and cheaper models can run many more steps. (3/5)

Surprisingly, though, we found that a recent open model, Kimi-K2.7, performed very well on ML engineering, while Fable performed even worse than Opus. This could be because of the inflated cost or because of the guardrails put on ML tasks. (4/5)

Overall, it seems like the model supply chain will be less stable in the autoresearch space.

On the Weco side, we’ll stay model-neutral and provide more options for our users. Today, we just added support for Kimi-K2.7. (5/5)

If you’re interested:

Yes, it’s possible. I believe @SakanaAILabs had some research on this.

On MLE, the Opus vs. GPT-5.5 gap is very small, so I wouldn’t read too much into it (can be noise).

On harness tuning, that’s a fair concern. It has been tuned for models from different providers throughout the development process. It’s possible there’s some bias here but it’s hard to tell which specific model gets an advantage.

Yes, similar to: https://arxiv.org/html/2605.21384v1…

Thanks Davide! It is quite noisy, but we ran a lot of seeds, so the aggregated number should be rather robust.
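
As a rough illustration of why more seeds make an aggregate more robust, here is a small sketch that computes the mean and the standard error across seeds. The thread does not say how scores were aggregated, so the use of a mean with standard error is an assumption, and the numbers below are made up for the example, not benchmark results.

```python
import statistics


def aggregate_seeds(scores):
    """Return (mean, standard error of the mean) for per-seed scores."""
    mean = statistics.mean(scores)
    sem = statistics.stdev(scores) / len(scores) ** 0.5
    return mean, sem


# Hypothetical per-seed scores, not taken from the benchmark.
few_seeds = [1.0, 2.0, 3.0, 4.0]
many_seeds = few_seeds * 4  # same spread, four times as many seeds

mean_few, sem_few = aggregate_seeds(few_seeds)
mean_many, sem_many = aggregate_seeds(many_seeds)
print(f"4 seeds:  mean={mean_few:.2f}, sem={sem_few:.3f}")
print(f"16 seeds: mean={mean_many:.2f}, sem={sem_many:.3f}")
```

With the same per-seed spread, the standard error shrinks roughly with the square root of the number of seeds, so individual runs can be noisy while the average stays fairly stable.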

I think a key difference between our benchmark and others is that we’re cost-bound. Claude models are generally quite expensive, which leads to fewer iteration steps. Also, Claude is somewhat weaker at some niche tasks that differ from conventional software engineering. For example, it was quite bad at MLE until Opus 4.6, and is still bad at algorithmic/heuristic engineering.

yeah it’s not quite good at heuristic engineering & more conventional algorithm design

Thanks! I can’t share the exact tasks, but at a high level they are traditional ML.

Thanks!
