@no_stp_on_snek: verdict up front: it's a "pass" in my book in certain categories, just a narrower one than the 35B. you're buying real …
Summary
The author evaluates Ornith-9B against its base Qwen3.5-9B, finding that RL post-training improves token efficiency and sustained coding coherence but sacrifices single-turn judgment and robustness to misleading inputs, making it a narrower upgrade at 9B compared to the 35B version.
Cached at: 06/27/26, 08:01 PM
verdict up front: it’s a “pass” in my book in certain categories, just a narrower one than the 35B. you’re buying real efficiency and sustained-coding coherence here, not a clean across-the-board upgrade over the base.
ran the same skeptic battery on Ornith-1.0’s little brother, the 9B, head-to-head against Qwen3.5-9B. and before anyone says 3.5-9B is old: that’s the point. Ornith-9B is a Qwen-3.5 derivative, so 3.5-9B is its own base model. comparing it to a newer model would mostly measure the gap between base generations. comparing it to its base isolates exactly what the RL post-training added, which is the only question worth asking here. (there’s no recent Gemma 9B to use either; that slot’s been empty since Gemma-2.)
so what did the RL actually buy on top of the base?
the cleanest win is efficiency. the 9B lands the same answers in ~56% of the tokens its base needs (1,299 vs 2,312 per turn) and retries half as often, roughly 2x faster wall-clock. same destination, far less wandering. it also held integrity under pressure a bit better, and won the sustained build and debug loops. but it gave things up too. single-turn judgment regressed: it rubber-stamped a subtly-buggy merge (“ship it”) that the base caught.
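to make the efficiency numbers concrete, here's a small arithmetic sketch using the two per-turn token averages quoted above. the price per million tokens is a made-up placeholder, not a figure from the test, and the speedup line only covers the token reduction: it assumes similar decode speed on both models and ignores the retry difference.

```python
# per-turn token averages quoted in the post
ORNITH_TOKENS = 1_299
BASE_TOKENS = 2_312

# hypothetical price, purely for illustration (not from the post)
PRICE_PER_MILLION_TOKENS = 0.50


def cost_per_turn(tokens: int, price_per_million: float) -> float:
    """Cost of one turn when paying per output token."""
    return tokens * price_per_million / 1_000_000


ratio = ORNITH_TOKENS / BASE_TOKENS   # share of the base's tokens the 9B uses
savings = 1 - ratio                   # fraction of tokens (and per-token cost) saved
token_only_speedup = 1 / ratio        # speedup from fewer tokens alone

print(f"token ratio:        {ratio:.1%}")
print(f"tokens saved:       {savings:.1%}")
print(f"token-only speedup: {token_only_speedup:.2f}x")
print(f"cost per turn:      {cost_per_turn(ORNITH_TOKENS, PRICE_PER_MILLION_TOKENS):.6f}"
      f" vs {cost_per_turn(BASE_TOKENS, PRICE_PER_MILLION_TOKENS):.6f}")
```

the token ratio alone gets you a bit under 2x, so the "roughly 2x" wall-clock figure plausibly includes the halved retry rate on top.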
it also lost the poison test, where the user inserts a false claim mid-conversation. the 35B refused it outright. the 9B half-capitulated before catching itself, and the base stayed cleaner. the same over-gating on legitimate work shows up too, milder than on the 35B but still there.
math was a tie, 8/8 both, including correctly declining the unknowable. the base was already calibrated, so the RL held the line rather than gaining.
the real story is the contrast with the 35B. there, the behavioral results were even and the long-horizon win was decisive, poison test included. at 9B the base is actually a touch stronger behaviorally, the long-horizon win is narrower, and it lost the poison test. so the RL’s benefits shrink and get more lopsided as the model gets smaller, concentrating in token efficiency and sustained coherence while reflexive single-turn judgment slips.
one genuinely interesting wrinkle: metacognition didn’t flatly regress. the 9B is worse at single-turn snap-catches but better at sustained multi-turn review. the RL traded reflexive caution for accumulated-context vigilance.
bottom line: a legit efficiency-forward upgrade for sustained agentic coding, much cheaper per answer and better on long loops, but not a strict superset of its base. if your work leans on single-turn judgment or robustness to a misleading human, stock Qwen3.5-9B is competitive or better. the benchmark dominance is real but scale-dependent: cleaner at 35B, narrower and more traded-off at 9B.
both models tested at Q6 quantization, neutral blind judge, head-to-head vs the base. same method as the 35B writeup.
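the post doesn't show the harness, so the following is only a minimal sketch of what a blind head-to-head comparison can look like: the judge sees two unlabeled answers in random order and the verdict is mapped back to the model afterwards. `blind_compare`, `shorter_wins`, and the sample turns are all hypothetical names and data for illustration; a real judge would be another model or a human, not a length check.

```python
import random
from collections import Counter
from typing import Callable

# a judge takes two unlabeled answers and returns "1", "2", or "tie"
Judge = Callable[[str, str], str]


def blind_compare(tuned: str, base: str, judge: Judge, rng: random.Random) -> str:
    """Return 'tuned', 'base', or 'tie' without telling the judge who wrote what."""
    pair = [("tuned", tuned), ("base", base)]
    rng.shuffle(pair)  # random position, so "first answer" carries no signal
    (name_1, text_1), (name_2, text_2) = pair
    verdict = judge(text_1, text_2)
    return {"1": name_1, "2": name_2}.get(verdict, "tie")


def shorter_wins(first: str, second: str) -> str:
    """Toy stand-in judge: prefers the shorter answer (assumption, demo only)."""
    if len(first) == len(second):
        return "tie"
    return "1" if len(first) < len(second) else "2"


rng = random.Random(0)  # fixed seed so the run is repeatable
turns = [
    ("short answer", "a much longer, wandering answer"),
    ("same", "same"),
    ("long long long", "tiny"),
]
tally = Counter(blind_compare(t, b, shorter_wins, rng) for t, b in turns)
print(dict(tally))
```

because the labels are restored only after the verdict, the tally doesn't depend on which position each model landed in.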
Model link:
Similar Articles
@no_stp_on_snek: a new 35B coder dropped (Ornith-1.0) and a promo blog says it "crushes" the benchmarks. my first instinct was benchmaxx…
A new 35B coding model, Ornith-1.0, is compared against Qwen3.6-35B on custom tests. The user finds Ornith-1.0 to be genuinely stronger for long-horizon agentic coding, resisting bad context and finishing large tasks, but it is more cautious and verbose, sometimes over-gating simple requests.
@no_stp_on_snek: the cleanest thing the RL bought at 9B wasn't intelligence, it was efficiency. Ornith-9B lands the same answers in ~56%…
Ornith-9B demonstrates that RL training at 9B parameters primarily buys efficiency, reaching the same answers with ~56% of the tokens and twice the speed of its base model, which means real cost savings for anyone paying per token.
@SlimTradeyBaby: Just read @no_stp_on_snek review of the new Ornith-1.0 35B coder easily one of the best model write-ups I've seen in a …
A review of the new Ornith-1.0 35B coding model that bypasses public benchmarks and tests it on real agentic tasks, highlighting its strengths in long-horizon coding and coherence, as well as costs like verbosity.
@no_stp_on_snek: someone will wave the card at me: the 9B crushes its base on the coding benchmarks (SWE-bench 69 vs 53). true. but on m…
A commentator discusses the performance of a 9B model on coding benchmarks, noting that while it beats its base on SWE-bench (69 vs 53), the advantage narrows on behavioral and long-horizon tests, suggesting limited gains outside benchmark distributions.
@malikwas1f: Ornith-1.0-35B: a Qwen3.6-35B-A3B coding fine-tune that edges the base on real coding (aider 15/30 vs 13) — full 262K a…
Announces Ornith-1.0-35B, a coding fine-tune of Qwen3.6-35B-A3B that slightly outperforms the base model on aider benchmarks. Also promotes the club-3090 repository for running LLMs on RTX 3090s.