@zhengyaojiang: We benchmarked 7 frontier models on 3 categories of autoresearch tasks: ML engineering, harness/prompt engineering, and…

Summary

Researchers benchmarked 7 frontier models on autoresearch tasks. Fable-5 won overall, but the open model Kimi-K2.7-Code surpassed the frontier models on ML engineering tasks.

We benchmarked 7 frontier models on 3 categories of autoresearch tasks: ML engineering, harness/prompt engineering, and algorithmic discovery. Fable-5 won overall even under cost constraint, but on ML engineering, the open model Kimi-K2.7-Code surpassed frontier models.🧵(1/5) https://t.co/KzePspXd0Z

Cached at: 06/15/26, 09:01 AM

Benchmark protocol:

  • Runs are cost-constrained (LLM + eval cost), not step-constrained. This means an agent can run more steps if its model or its solutions are cheaper to run
  • All of the models use the autonomous research harness behind the @WecoAI service
  • The scores should be interpreted as how good the final solution is compared to that of a naive ReAct agent. (2/5)
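
To make the cost-constrained protocol concrete, here is a minimal sketch of how a fixed budget turns into a different number of steps for different models. It is not the Weco harness: the `AgentCosts` class, the `steps_within_budget` function, and all dollar figures are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class AgentCosts:
    """Hypothetical per-step costs in dollars (not figures from the benchmark)."""
    llm_cost_per_step: float   # cost of the model call that proposes a solution
    eval_cost_per_step: float  # cost of evaluating that proposed solution


def steps_within_budget(costs: AgentCosts, budget: float) -> int:
    """Count how many propose-and-evaluate steps fit in a fixed cost budget."""
    step_cost = costs.llm_cost_per_step + costs.eval_cost_per_step
    if step_cost <= 0:
        raise ValueError("step cost must be positive")
    spent = 0.0
    steps = 0
    # Stop as soon as the next step would exceed the budget.
    while spent + step_cost <= budget:
        spent += step_cost
        steps += 1
    return steps


# Same budget and eval cost, different (assumed) model prices.
expensive_model = AgentCosts(llm_cost_per_step=0.75, eval_cost_per_step=0.25)
cheap_model = AgentCosts(llm_cost_per_step=0.25, eval_cost_per_step=0.25)

print(steps_within_budget(expensive_model, budget=10.0))  # 10
print(steps_within_budget(cheap_model, budget=10.0))      # 20
```

With these assumed prices the cheaper model gets twice as many iterations for the same spend, which is the trade-off a cost-bound benchmark exposes and a step-bound one hides.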

Overall, Fable is a really strong model for autoresearch. It dominates harness/prompt engineering and algorithmic discovery tasks. We were especially surprised by the algorithmic discovery results, because the eval cost there is low, so cheaper models can run many more steps. (3/5)

Surprisingly, though, we found that a recent open model, Kimi-K2.7, performed very well on ML engineering, and Fable performed even worse than Opus. This could be because of either the inflated cost or the guardrails put on ML tasks. (4/5)

Overall, it seems like the model supply chain will be less stable in the autoresearch space.

On the Weco side, we’ll stay model-neutral and provide more options for our users. Today, we just added support for Kimi-K2.7. (5/5)

If you’re interested:

Yes, it’s possible. I believe @SakanaAILabs had some research on this.

On MLE, the Opus vs. GPT-5.5 gap is very small, so I wouldn’t read too much into it (can be noise).

On harness tuning, that’s a fair concern. It has been tuned for models from different providers throughout the development process. It’s possible there’s some bias here but it’s hard to tell which specific model gets an advantage.

Yes, similar to: https://arxiv.org/html/2605.21384v1…

Thanks Davide! It is quite noisy, but we ran a lot of seeds, so the aggregated number should be rather robust.
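
The thread does not say how scores were aggregated across seeds. As a rough sketch of why many seeds make a noisy number more trustworthy, one common approach is to report the mean together with its standard error; the `aggregate_seeds` function and the scores below are made up for illustration.

```python
import math
import statistics


def aggregate_seeds(scores: list[float]) -> tuple[float, float]:
    """Return (mean, standard error of the mean) for scores from repeated seeds."""
    if len(scores) < 2:
        raise ValueError("need at least two seeds to estimate spread")
    mean = statistics.mean(scores)
    # Sample standard deviation divided by sqrt(n): shrinks as seeds are added.
    sem = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, sem


# Hypothetical per-seed scores for one model on one task.
mean, sem = aggregate_seeds([1.0, 2.0, 3.0, 4.0])
print(f"mean={mean:.2f}, standard error={sem:.2f}")
```

When the gap between two models is comparable to this standard error, it is reasonable to treat the difference as noise.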

I think a key difference between our benchmark and others is that we’re cost-bound. Claude models are generally quite expensive, which leads to fewer iteration steps. Also, Claude is somewhat weaker at some niche tasks that differ from conventional software engineering. For example, it was quite bad at MLE until Opus 4.6, and is still bad at algorithmic/heuristic engineering.

Yeah, it’s not very good at heuristic engineering and more conventional algorithm design.

Thanks! I can’t share the exact tasks, but at a high level they’re tradML (traditional ML).

Thanks!
