Tag: #coding-benchmark

@SlimTradeyBaby: Attention all 8-12GB GPU users! This new Ornith-1.0-9B is looking like it will be a serz player for smaller VRAM setups…

X AI KOLs Timeline · 4d ago

Ornith-1.0-9B is a new 9B-parameter model optimized for 8-12GB GPUs. It reportedly performs strongly on agentic coding benchmarks, matching or surpassing models 2-3x its size.

@cognition: Try Kimi K2.7 and GLM 5.2 for free in Devin Desktop and CLI

X AI KOLs Following · 6d ago

Devin Desktop and CLI now support the Kimi K2.7 and GLM 5.2 models, which are free to try until July 5 for Pro/Max/Teams users.

GLM 5.2 vs. Opus

Hacker News Top · 2026-06-22

This post compares GLM 5.2, a new open-weights model from Z.ai, against Claude Opus on a 3D game coding task. Opus was faster and produced cleaner results, but GLM 5.2 offers compelling cost and accessibility advantages.

@cline: Step 3.7 Flash is free in Cline for the next month. It beats Gemini and DeepSeek flash models, and comes surprisingly c…

X AI KOLs Following · 2026-06-17

Step 3.7 Flash, an open-weights model with a 256k context window, is free in Cline for a month. Cline claims it outperforms the Gemini and DeepSeek flash models and approaches frontier performance on SWE-Bench.

Nex-N2 Pro is the real deal

Reddit r/LocalLLaMA · 2026-06-16

The writer shares their experience with Nex-N2 Pro, originally mistaken for Rio-3.5, and finds that it performs exceptionally well on coding benchmarks without hallucinating, rivaling GPT-5.x on their Mac setup.

Ramp SWE-Bench: a private, production-grounded coding benchmark (3 minute read)

TLDR AI · 2026-06-15

Ramp released Ramp SWE-Bench, a private coding benchmark built from real engineering problems, to evaluate coding models within its financial software ecosystem.

@swyx: It's finally out!!! @METR_Evals found that more than half of SWEBench results is unmergeable slop. FrontierCode represe…

X AI KOLs Following · 2026-06-08

FrontierCode is a new coding benchmark from METR and Cognition that evaluates AI models on code maintainability and quality, revealing that many models produce unmergeable code. It includes over 1000 hours of work and shows that even top models struggle, with Opus 4.8 achieving only 13.8% on the hardest tier.

@scaling01: Opus 4.8 is the best coding model out there. FrontierCode by Cognition is probably the highest quality coding benchmark …

X AI KOLs Timeline · 2026-06-08

Cognition introduces FrontierCode, a high-quality coding benchmark that goes beyond unit tests to measure code maintainability, regression safety, and quality, with 150 handcrafted tasks by open-source developers.

Qwen 3.6 27B on DeepSWE

Reddit r/LocalLLaMA · 2026-06-07

Qwen 3.6 27B scored 2% on the DeepSWE benchmark, placing 18th out of 20, ahead of only Haiku 4.5 and MiniMax M2.7. The result highlights the gap between local and leading-edge models.

Observation: the best agent harness for each model will be from the model developer themselves

Reddit r/AI_Agents · 2026-06-01

A discussion arguing that AI models perform best with agent harnesses built by their own developers, since third-party harnesses can make a model underperform despite strong benchmark scores. It cites Claude Code for Claude and Codex for GPT as examples.

@jiayuan_jy: A few objective clarifications: 1) This post has nothing to do with MiniMax (I never take sponsored posts). 2) 'Subjective feel' is not the same as actual performance; it's not quantitative data. After more extensive experience, overall coding ability is a qualitative improvement compared to m2.7. A current shortcoming is that 1-shot results compared with...

X AI KOLs Following · 2026-06-01

Jiayuan Zhang shared his experience with the M3 model's coding ability, calling it a qualitative improvement over M2.7, though its 1-shot results are not as comprehensive as those of Opus 4.6/4.7 and GPT 5.5.

Apex-Testing: real-world, real repos, agentic coding benchmark (Update)

Reddit r/LocalLLaMA · 2026-05-23

Apex-Testing, a benchmark that evaluates agentic coding models on real private GitHub repositories, has been updated with recent models and detailed metrics, including cost and time, plus an Elo-based leaderboard.

Open weights GLM and Mimo are better than Gemini 3.5 flash according to arena

Reddit r/LocalLLaMA · 2026-05-19

According to the arena leaderboard, the open-weights models GLM and MiMo outperform Gemini 3.5 Flash on coding benchmarks.

@DivyanshT91162: Open-source AI is getting dangerously good. AntLingAGI just dropped Ring-2.6-1T… a TRILLION-parameter OSS model built fo…

X AI KOLs Timeline · 2026-05-16

AntLingAGI released Ring-2.6-1T, a trillion-parameter open-source model designed for long-horizon workflows and real-world coding tasks, with impressive results on the Tau2-Bench, GPQA Diamond, and ClawEval benchmarks.

Poetiq: Recursive Self-Improvement Delivers New SOTA Coding Performance

Reddit r/singularity · 2026-05-15

Poetiq's Meta-System, using recursive self-improvement via standard API access without fine-tuning, achieves new state-of-the-art results on the LiveCodeBench Pro coding benchmark, outperforming leading models like GPT 5.5.

Open source battle: GLM vs Kimi vs MiMo vs DeepSeek

Reddit r/LocalLLaMA · 2026-05-13

This article tests four open-source Chinese AI models (Zhipu GLM 5.1, Moonshot Kimi K2.6, Xiaomi MiMo 2.5 Pro, and DeepSeek V4 Pro) on programming tasks. It finds that GLM leads on most tasks, though not on every one; each model has its own strengths and weaknesses.

@EvanLuthra: Kimi K2 was trained for $4.6 MILLION. GPT-5 reportedly cost hundreds of millions. Kimi still beats it on coding. Last w…

X AI KOLs Timeline · 2026-05-13

The post claims that Kimi K2, trained for $4.6 million, outperforms GPT-5 and Claude Opus 4.7 on coding benchmarks, and points to a detailed breakdown from its founder.

Qwen3.6-35B becomes competitive with cloud models when paired with the right agent

Reddit r/LocalLLaMA · 2026-04-22

Paired with the little-coder agent scaffold, Qwen3.6-35B hits 78.7% on the Polyglot coding benchmark, placing in the public top 10 and rivaling cloud models.
