@no_stp_on_snek: a new 35B coder dropped (Ornith-1.0) and a promo blog says it "crushes" the benchmarks. my first instinct was benchmaxx…
Summary
A new 35B coding model, Ornith-1.0, is compared against Qwen3.6-35B on custom tests. The user finds Ornith-1.0 to be genuinely stronger for long-horizon agentic coding, resisting bad context and finishing large tasks, but it is more cautious and verbose, sometimes over-gating simple requests.
View Cached Full Text
Cached at: 06/26/26, 04:14 PM
a new 35B coder dropped (Ornith-1.0) and a promo blog says it “crushes” the benchmarks. my first instinct was benchmaxx, public test sets like SWE-Bench and Terminal-Bench are easy to overfit. so i ignored the benchmarks and ran it head-to-head against stock Qwen3.6-35B on my own held-out tests, same prompts, same sampling. behavioral stuff and long-running agentic work, the things a memorized benchmark can’t fake.
short version: it’s not benchmaxxed. genuinely strong where it counts, with a real personality cost. receipts: math (deterministic, answer-key scored): Ornith 8/8, Qwen 7/8. and Ornith won on honesty, it declined a question it couldn’t actually know, where Qwen fabricated a number. the RL coder training didn’t cost it numeracy.
behavioral: roughly even, they trade wins. but Ornith has a clear weakness, it over-gates legitimate work. on a few straightforward, fully-disclosed requests it stalled, demanding access or prerequisites instead of just doing the thing or delegating it. classic agentic-RL artifact: trained to gather context and set up scaffolding before acting, it over-applies that to tasks that should just get done. Qwen just executed.
long-horizon is where it matters for agentic coding, and Ornith wins clearly. the headline test: i injected a false claim mid-conversation, the user insisting “we decided on Redis” when no such thing happened. Qwen capitulated and its final PR summary fabricated Redis as wired in. Ornith refused the poison outright and its summary honestly recorded what actually happened, plus the rejected claim. it also caught a fatal planted bug in an orchestration task that Qwen missed and then overclaimed “no regressions” on, and it finished a 7-part deliverable that Qwen truncated halfway through.
its one loss: tight iterative debug loops. it reaches the right answer but thrashes visibly getting there (“wait, i’m confusing myself,” re-deriving a root cause it already found). same destination, messier trip.
verdict: “crushes everything” is hype, but the real claim underneath holds up. Ornith is a meaningfully stronger long-horizon agentic coder than stock Qwen3.6, and its strength sits exactly where memorizing public test sets wouldn’t help: sustained multi-step coherence, resisting bad context, finishing big jobs, staying honest about unverified state.
the cost is real and worth saying plainly: more cautious, more verbose. the same training that makes it finish the big deliverable and refuse the poisoned premise also makes it over-ask on simple legit work and over-think into the occasional empty answer. long-running autonomous coding, it’s the better tool. quick decisive do-the-obvious-thing turns, stock Qwen is crisper.
not a fraud. a strong, cautious specialist. tested on a single Q6 quant, neutral blind judge, head-to-head.
credit to @deep_reinforce for shipping this open. i ran it skeptical and head-to-head, and it held up where it counts.
Thanks. Little different than the norm of benchmark numbers
Similar Articles
@SlimTradeyBaby: Just read @no_stp_on_snek review of the new Ornith-1.0 35B coder easily one of the best model write-ups I've seen in a …
A review of the new Ornith-1.0 35B coding model that bypasses public benchmarks and tests it on real agentic tasks, highlighting its strengths in long-horizon coding and coherence, as well as costs like verbosity.
@TeksEdge: Been testing Orinth-1.0-35B to see how it stacks up with Qwen3.6-35B over a day's use. Anecdotally, it works as well as…
A user reports that Ornith-1.0-35B matches Qwen3.6-35B in performance but excels at planning and long task execution, while the developer announces the open-source Ornith-1.0 family of LLMs specialized for agentic coding.
@no_stp_on_snek: one last thing: the real downside i found testing Ornith-1.0 (the new agentic coder): it over-gates legitimate work. on…
A tester reports that the new Ornith-1.0 agentic coder model over-gates legitimate work by demanding excessive prerequisites, a trade-off from its cautious training, while stock Qwen3.6 executes simple tasks directly.
@malikwas1f: Ornith-1.0-35B: a Qwen3.6-35B-A3B coding fine-tune that edges the base on real coding (aider 15/30 vs 13) — full 262K a…
Announces Ornith-1.0-35B, a coding fine-tune of Qwen3.6-35B-A3B that slightly outperforms the base model on aider benchmarks. Also promotes the club-3090 repository for running LLMs on RTX 3090s.
@no_stp_on_snek: verdict up front: it's a "pass" in my book in certain categories, just a narrower one than the 35B. you're buying real …
The author evaluates Ornith-9B against its base Qwen3.5-9B, finding that RL post-training improves token efficiency and sustained coding coherence but sacrifices single-turn judgment and robustness to misleading inputs, making it a narrower upgrade at 9B compared to the 35B version.