Ornith-1.0-9B is a new 9B-parameter AI model optimized for 8-12 GB GPUs. It performs strongly on agentic coding benchmarks, matching or surpassing models 2-3x its size.
Devin Desktop now supports Kimi K2.7 and GLM 5.2 models, offering free trials until July 5 for Pro/Max/Teams users.
GLM 5.2, a new open-weights model from Z.ai, is compared against Claude Opus on a 3D game coding task. Opus was faster and produced cleaner results, but GLM 5.2 offers compelling cost and accessibility advantages.
Step 3.7 Flash, an open-weights model with a 256k context window, is available free in Cline for a month. It is claimed to outperform the Gemini and DeepSeek flash models and to approach frontier performance on SWE-Bench.
The writer shares their experience with Nex-N2 Pro, which they originally mistook for Rio-3.5. On their Mac setup, they find it performs exceptionally well on coding benchmarks without hallucinating, rivaling GPT-5.x.
Ramp released its own private SWE-Bench benchmark built from real engineering problems, enabling evaluation of coding models within its financial software ecosystem.
FrontierCode is a new coding benchmark from METR and Cognition that evaluates AI models on code maintainability and quality, revealing that many models produce unmergeable code. The benchmark represents over 1,000 hours of work and shows that even top models struggle, with Opus 4.8 achieving only 13.8% on the hardest tier.
Cognition introduces FrontierCode, a high-quality coding benchmark that goes beyond unit tests to measure code maintainability, regression safety, and quality, with 150 tasks handcrafted by open-source developers.
Qwen 3.6 27B scored 2% on the DeepSWE benchmark, placing 18th of 20, ahead of Haiku 4.5 and MiniMax M2.7, which highlights the gap between local and leading-edge models.
A discussion of how AI models perform best in harnesses built by their own developers: in third-party harnesses, a model may underperform despite strong benchmark scores. Examples cited include Claude Code for Claude and Codex for GPT.
Jiayuan Zhang shared his initial experience with the M3 model's coding ability, calling it a qualitative improvement over M2.7, though its one-shot results are not as comprehensive as those of Opus 4.6/4.7 and GPT 5.5.
Apex-Testing, a benchmark for evaluating agentic coding models using real private GitHub repositories, has been updated with recent models and detailed metrics, including cost, time, and an Elo-based leaderboard.
According to the arena leaderboard, the open-weights models GLM and Mimo outperform Gemini 3.5 Flash on coding benchmarks.
AntLingAGI released Ring-2.6-1T, a trillion-parameter open-source AI model designed for long-horizon workflows and real-world coding tasks, with strong reported results on Tau2-Bench, GPQA Diamond, and ClawEval.
Poetiq's Meta-System, using recursive self-improvement via standard API access without fine-tuning, achieves new state-of-the-art results on the LiveCodeBench Pro coding benchmark, outperforming leading models like GPT 5.5.
This article tests four open-source Chinese AI models on programming tasks: Zhipu GLM 5.1, Moonshot Kimi K2.6, Stepfun MIMO 2.5 Pro, and DeepSeek V4 Pro. It finds that GLM leads on most tasks but not all; each model has its own strengths and weaknesses.
Kimi K2, trained for $4.6 million, outperforms GPT-5 and Claude Opus 4.7 on coding benchmarks, with a detailed breakdown from its founder.
Paired with the little-coder agent scaffold, Qwen3.6-35B hits 78.7% on the Polyglot coding benchmark, placing in the public top 10 and rivaling cloud models.