after a month with 5 Chinese coding LLMs, is M3 actually going to take the top spot?

Reddit r/ArtificialInteligence News

Summary

A user shares a month-long comparison of five Chinese coding LLMs (Kimi K2.6, GLM-5.1, MiMo V2.5 Pro, MiniMax 2.7, DeepSeek V4 Pro) on a TypeScript/Next.js codebase, picking a favorite in each of five categories: frontend, backend, code review, all-rounder, and reasoning. They cite a Kilo Code benchmark in which MiniMax 2.7 reached ~90% of Opus 4.6 quality at ~7% of the cost, and ask whether the upcoming MiniMax 3.0 will close the gaps in planning and test coverage and take the top spot.

been rotating through 5 chinese coding models on a TS/Next codebase for the last 4-5 weeks. Kimi K2.6, GLM-5.1, MiMo V2.5 Pro, MiniMax 2.7, DeepSeek V4 Pro. wanted to share where i landed and ask about M3.

quick per-category from my runs:

* Frontend / design → K2.6
* Backend → K2.6 and GLM-5.1
* Code review → MiMo
* All-rounder → M2.7
* Reasoning-heavy → DeepSeek

afterwards i found llmdevguy posted a near-identical ranking on X a couple weeks back (162k views, 2.3k likes) and ended it with "now i'm waiting for MiniMax 3.0 to take the number 1 spot." weird to land in the exact same place.

https://preview.redd.it/01k9njcpmo2h1.png?width=1190&format=png&auto=webp&s=ef920c65d32a34f1dc054718813d3bb57b54037e

M2.7 didn't win any single category for me. what surprised me is cost. Kilo Code posted a benchmark on ClaudeAI: M2.7 hit ~90% of Opus 4.6 quality at ~7% of the cost ($0.27 vs $3.67 across three coding tasks). my own runs aren't scientific but the ratio tracks.

short version of the shortcomings: thinner tests and it jumps straight to code instead of walking through reasoning. so i reach for it as an executor once a stronger model has planned, not as the planner.

real question is whether M3 closes the planning and test-coverage gap. if it does, all-rounder becomes top of every category pretty fast.

anyone else doing side-by-side runs? does this hold on python / go / rust or is it a TS thing?
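
The cost claim in the post can be checked with a couple of lines of arithmetic. The sketch below uses only the two dollar figures and the "~90%" quality figure quoted from the Kilo Code benchmark. The function names are made up for illustration, and treating "quality" as a single linear score is an assumption that neither the post nor the benchmark makes.

```python
# Figures quoted in the post (Kilo Code benchmark, totals across three coding tasks).
M27_COST_USD = 0.27
OPUS_COST_USD = 3.67
M27_RELATIVE_QUALITY = 0.90  # "~90% of Opus 4.6 quality", taken at face value


def cost_ratio(cheap: float, expensive: float) -> float:
    """Return the cheaper model's cost as a fraction of the expensive one's."""
    return cheap / expensive


def quality_per_dollar_multiple(relative_quality: float, ratio: float) -> float:
    """How many times more quality per dollar the cheaper model delivers.

    Assumption: quality is a single linear score, which is a simplification.
    """
    return relative_quality / ratio


ratio = cost_ratio(M27_COST_USD, OPUS_COST_USD)
multiple = quality_per_dollar_multiple(M27_RELATIVE_QUALITY, ratio)

print(f"cost ratio: {ratio:.1%}")                    # about 7.4%
print(f"quality per dollar vs Opus: {multiple:.1f}x")  # about 12.2x
```

So the "~7%" in the post is a slight rounding down of roughly 7.4%, and under the linear-quality assumption that is roughly 12x the quality per dollar. It says nothing about the planning and test-coverage gaps described above, which a single quality percentage does not capture.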

Similar Articles

I benchmarked 21 local LLMs on a MacBook Air M5 for code quality AND speed

Reddit r/LocalLLaMA

A developer benchmarked 21 local LLMs on a MacBook Air M5 using HumanEval+ and found that Qwen 3.6 35B-A3B (MoE) leads at 89.6% with 16.9 tok/s, while Qwen 2.5 Coder 7B offers the best RAM-to-performance ratio at 84.2% in 4.5 GB. Notably, Gemma 4 models fell well short of expectations (31.1% for the 31B model), possibly due to Q4_K_M quantization effects.

@jiayuan_jy: A few objective clarifications: 1) This post has nothing to do with MiniMax (I never take sponsored posts). 2) 'Subjective feel' is not the same as actual performance; it's not quantitative data. After more extensive experience, overall coding ability is a qualitative improvement compared to m2.7. A current shortcoming is that 1-shot results compared with...

X AI KOLs Following

Jiayuan Zhang shared his initial experience with the M3 model's coding ability, stating that it is a qualitative improvement compared to m2.7, but that its 1-shot results are not as comprehensive as those of Opus 4.6/4.7 and GPT5.5.