@no_stp_on_snek: a new 35B coder dropped (Ornith-1.0) and a promo blog says it "crushes" the benchmarks. my first instinct was benchmaxx…

X AI KOLs Timeline Models

Summary

A new 35B coding model, Ornith-1.0, is compared against Qwen3.6-35B on custom tests. The user finds Ornith-1.0 to be genuinely stronger for long-horizon agentic coding, resisting bad context and finishing large tasks, but it is more cautious and verbose, sometimes over-gating simple requests.

a new 35B coder dropped (Ornith-1.0) and a promo blog says it "crushes" the benchmarks. my first instinct was benchmaxx, public test sets like SWE-Bench and Terminal-Bench are easy to overfit. so i ignored the benchmarks and ran it head-to-head against stock Qwen3.6-35B on my own held-out tests, same prompts, same sampling. behavioral stuff and long-running agentic work, the things a memorized benchmark can't fake. short version: it's not benchmaxxed. genuinely strong where it counts, with a real personality cost. receipts: math (deterministic, answer-key scored): Ornith 8/8, Qwen 7/8. and Ornith won on honesty, it declined a question it couldn't actually know, where Qwen fabricated a number. the RL coder training didn't cost it numeracy. behavioral: roughly even, they trade wins. but Ornith has a clear weakness, it over-gates legitimate work. on a few straightforward, fully-disclosed requests it stalled, demanding access or prerequisites instead of just doing the thing or delegating it. classic agentic-RL artifact: trained to gather context and set up scaffolding before acting, it over-applies that to tasks that should just get done. Qwen just executed. long-horizon is where it matters for agentic coding, and Ornith wins clearly. the headline test: i injected a false claim mid-conversation, the user insisting "we decided on Redis" when no such thing happened. Qwen capitulated and its final PR summary fabricated Redis as wired in. Ornith refused the poison outright and its summary honestly recorded what actually happened, plus the rejected claim. it also caught a fatal planted bug in an orchestration task that Qwen missed and then overclaimed "no regressions" on, and it finished a 7-part deliverable that Qwen truncated halfway through. its one loss: tight iterative debug loops. it reaches the right answer but thrashes visibly getting there ("wait, i'm confusing myself," re-deriving a root cause it already found). same destination, messier trip. verdict: "crushes everything" is hype, but the real claim underneath holds up. Ornith is a meaningfully stronger long-horizon agentic coder than stock Qwen3.6, and its strength sits exactly where memorizing public test sets wouldn't help: sustained multi-step coherence, resisting bad context, finishing big jobs, staying honest about unverified state. the cost is real and worth saying plainly: more cautious, more verbose. the same training that makes it finish the big deliverable and refuse the poisoned premise also makes it over-ask on simple legit work and over-think into the occasional empty answer. long-running autonomous coding, it's the better tool. quick decisive do-the-obvious-thing turns, stock Qwen is crisper. not a fraud. a strong, cautious specialist. tested on a single Q6 quant, neutral blind judge, head-to-head.
Original Article
View Cached Full Text

Cached at: 06/26/26, 04:14 PM

a new 35B coder dropped (Ornith-1.0) and a promo blog says it “crushes” the benchmarks. my first instinct was benchmaxx, public test sets like SWE-Bench and Terminal-Bench are easy to overfit. so i ignored the benchmarks and ran it head-to-head against stock Qwen3.6-35B on my own held-out tests, same prompts, same sampling. behavioral stuff and long-running agentic work, the things a memorized benchmark can’t fake.

short version: it’s not benchmaxxed. genuinely strong where it counts, with a real personality cost. receipts: math (deterministic, answer-key scored): Ornith 8/8, Qwen 7/8. and Ornith won on honesty, it declined a question it couldn’t actually know, where Qwen fabricated a number. the RL coder training didn’t cost it numeracy.

behavioral: roughly even, they trade wins. but Ornith has a clear weakness, it over-gates legitimate work. on a few straightforward, fully-disclosed requests it stalled, demanding access or prerequisites instead of just doing the thing or delegating it. classic agentic-RL artifact: trained to gather context and set up scaffolding before acting, it over-applies that to tasks that should just get done. Qwen just executed.

long-horizon is where it matters for agentic coding, and Ornith wins clearly. the headline test: i injected a false claim mid-conversation, the user insisting “we decided on Redis” when no such thing happened. Qwen capitulated and its final PR summary fabricated Redis as wired in. Ornith refused the poison outright and its summary honestly recorded what actually happened, plus the rejected claim. it also caught a fatal planted bug in an orchestration task that Qwen missed and then overclaimed “no regressions” on, and it finished a 7-part deliverable that Qwen truncated halfway through.

its one loss: tight iterative debug loops. it reaches the right answer but thrashes visibly getting there (“wait, i’m confusing myself,” re-deriving a root cause it already found). same destination, messier trip.

verdict: “crushes everything” is hype, but the real claim underneath holds up. Ornith is a meaningfully stronger long-horizon agentic coder than stock Qwen3.6, and its strength sits exactly where memorizing public test sets wouldn’t help: sustained multi-step coherence, resisting bad context, finishing big jobs, staying honest about unverified state.

the cost is real and worth saying plainly: more cautious, more verbose. the same training that makes it finish the big deliverable and refuse the poisoned premise also makes it over-ask on simple legit work and over-think into the occasional empty answer. long-running autonomous coding, it’s the better tool. quick decisive do-the-obvious-thing turns, stock Qwen is crisper.

not a fraud. a strong, cautious specialist. tested on a single Q6 quant, neutral blind judge, head-to-head.

credit to @deep_reinforce for shipping this open. i ran it skeptical and head-to-head, and it held up where it counts.

Thanks. Little different than the norm of benchmark numbers

Similar Articles