@zhengyaojiang: We benchmarked 7 frontier models on 3 categories of autoresearch tasks: ML engineering, harness/prompt engineering, and…
Summary
Researchers benchmarked 7 frontier models on autoresearch tasks. Fable-5 won overall, but the open model Kimi-K2.7-Code surpassed others on ML engineering tasks.
Cached at: 06/15/26, 09:01 AM
We benchmarked 7 frontier models on 3 categories of autoresearch tasks: ML engineering, harness/prompt engineering, and algorithmic discovery.
Fable-5 won overall even under a cost constraint, but on ML engineering, the open model Kimi-K2.7-Code surpassed frontier models. (1/5)
Benchmark protocol:
- Runs are cost-constrained (LLM + eval cost), not step-constrained. This means an agent can run more steps if its model or solutions are cheaper to run
- All of the models use the autonomous research harness behind the @WecoAI service
- The scores should be interpreted as how good the final solution is compared to a naive ReAct agent. (2/5)
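To make the cost-bound protocol concrete, here is a minimal sketch of how a fixed budget turns into a different number of steps per model. It is not the actual Weco harness: the function name, the budget, and the per-step costs are all hypothetical, and a real run would have variable costs per step.

```python
def steps_within_budget(budget_cents: int, llm_cents: int, eval_cents: int) -> int:
    """Count how many full steps fit in a fixed cost budget.

    All values are hypothetical and expressed in integer cents so the
    arithmetic is exact. Each step pays for one LLM call plus one eval.
    """
    if llm_cents < 0 or eval_cents < 0:
        raise ValueError("costs must be non-negative")
    step_cost = llm_cents + eval_cents
    if step_cost <= 0:
        raise ValueError("a step must cost something")

    spent = 0
    steps = 0
    # Keep iterating only while the next step still fits in the budget.
    while spent + step_cost <= budget_cents:
        spent += step_cost
        steps += 1
    return steps


# Same budget and eval cost, different (made-up) model prices.
cheap_model_steps = steps_within_budget(1000, llm_cents=5, eval_cents=20)
pricey_model_steps = steps_within_budget(1000, llm_cents=80, eval_cents=20)
print(cheap_model_steps, pricey_model_steps)  # 40 10
```

With these made-up prices the cheaper model gets four times as many iterations, which is why a cost-bound ranking can differ from a step-bound one.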
Overall, Fable is a really strong model for autoresearch. It dominates harness/prompt engineering and algorithmic discovery tasks. We were especially surprised by the algorithmic discovery results, because the eval cost is low and cheaper models can run many more steps. (3/5)
Surprisingly, though, we found that a recent open model, Kimi-K2.7, performed very well on ML engineering, and Fable performed even worse than Opus. This could be because of either the inflated cost or the guardrails put on ML tasks. (4/5)
Overall, it seems like the model supply chain will be less stable in the autoresearch space.
On the Weco side, we’ll stay model-neutral and provide more options for our users. Today, we just added support for Kimi-K2.7. (5/5)
If you’re interested:
Yes, it’s possible. I believe @SakanaAILabs had some research on this.
On MLE, the Opus vs. GPT-5.5 gap is very small, so I wouldn’t read too much into it (can be noise).
On harness tuning, that’s a fair concern. It has been tuned for models from different providers throughout the development process. It’s possible there’s some bias here but it’s hard to tell which specific model gets an advantage.
Yes, similar to: https://arxiv.org/html/2605.21384v1…
Thanks Davide! It is quite noisy, but we ran a lot of seeds, so the aggregated number should be rather robust.
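As a rough illustration of why averaging over many seeds helps, the sketch below computes a mean score and its standard error, which shrinks as the number of seeds grows. The helper name and the scores are hypothetical; the thread does not say how the benchmark actually aggregates its seeds.

```python
import math
import statistics


def mean_and_stderr(scores: list[float]) -> tuple[float, float]:
    """Return the mean of per-seed scores and the standard error of that mean."""
    if len(scores) < 2:
        raise ValueError("need at least two seeds to estimate spread")
    mean = statistics.fmean(scores)
    # Sample standard deviation divided by sqrt(n) gives the standard error.
    stderr = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, stderr


# Made-up per-seed scores for one model on one task.
mean, stderr = mean_and_stderr([0.5, 0.7, 0.6, 0.8])
print(f"{mean:.3f} +/- {stderr:.3f}")
```

A gap between two models that is smaller than a couple of standard errors is plausibly noise, which is the kind of caution voiced above about the Opus vs. GPT-5.5 gap.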
I think a key difference between our benchmark and others is that we’re cost-bound. Claude models are generally quite expensive, which leads to fewer iteration steps. Also, Claude is somewhat weaker at some niche tasks that differ from conventional software engineering. For example, it was quite bad at MLE until Opus 4.6, and is still bad at algorithmic/heuristic engineering.
Yeah, it’s not very good at heuristic engineering and more conventional algorithm design.
Thanks! I can’t share the exact tasks, but at a high level they’re traditional ML (tradML).
Thanks!