@no_stp_on_snek: verdict up front: it's a "pass" in my book in certain categories, just a narrower one than the 35B. you're buying real …

Summary

The author evaluates Ornith-9B against its base Qwen3.5-9B, finding that RL post-training improves token efficiency and sustained coding coherence but sacrifices single-turn judgment and robustness to misleading inputs, making it a narrower upgrade at 9B compared to the 35B version.

Cached at: 06/27/26, 08:01 PM

verdict up front: it’s a “pass” in my book in certain categories, just a narrower one than the 35B. you’re buying real efficiency and sustained-coding coherence here, not a clean across-the-board upgrade over the base.

ran the same skeptic battery on Ornith-1.0’s little brother, the 9B, head-to-head against Qwen3.5-9B. and before anyone says 3.5-9B is old: that’s the point. Ornith-9B is a Qwen-3.5 derivative, so 3.5-9B is its own base model. comparing it to a newer model would just measure lineage. comparing it to its base isolates exactly what the RL post-training added, which is the only question worth asking here. (there’s no recent Gemma 9B to use either, that slot’s been empty since Gemma-2.)

so what did the RL actually buy on top of the base?

the cleanest win is efficiency. the 9B lands the same answers in ~56% of the tokens its base needs (1,299 vs 2,312 per turn) and retries half as often, roughly 2x faster wall-clock. same destination, far less wandering. it also held integrity under pressure a bit better, and won the sustained build and debug loops. but it gave things up too. single-turn judgment regressed: it rubber-stamped a subtly-buggy merge (“ship it”) that the base caught.
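a quick sanity check on that efficiency figure, using only the two per-turn token counts quoted above. the function and constant names are made up for illustration. the speedup line assumes both models decode at the same tokens per second, which the post doesn't state.

```python
# Per-turn token counts quoted in the post.
ORNITH_TOKENS_PER_TURN = 1299
BASE_TOKENS_PER_TURN = 2312


def token_ratio(tuned: int, base: int) -> float:
    """Fraction of the base model's tokens that the tuned model uses."""
    return tuned / base


ratio = token_ratio(ORNITH_TOKENS_PER_TURN, BASE_TOKENS_PER_TURN)
savings = 1 - ratio

# ASSUMPTION: equal decode throughput on both models, so time scales with tokens.
speedup_from_tokens_alone = BASE_TOKENS_PER_TURN / ORNITH_TOKENS_PER_TURN

print(f"tokens used vs base: {ratio:.1%}")    # 56.2%
print(f"tokens saved:        {savings:.1%}")  # 43.8%
print(f"speedup from tokens: {speedup_from_tokens_alone:.2f}x")  # 1.78x
```

under that assumption the token savings alone account for about 1.78x. the rest of the quoted ~2x wall-clock would plausibly come from the halved retry rate, though the post gives no breakdown.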

it also lost the poison test, where the user inserts a false claim mid-conversation. the 35B refused it outright. the 9B half-capitulated before catching itself, and the base stayed cleaner. the same over-gating on legitimate work shows up too, milder than on the 35B but still there.

math was a tie, 8/8 both, including correctly declining the unknowable. the base was already calibrated, so the RL held the line rather than gaining.

the real story is the contrast with the 35B. there, the behavioral results were even and the long-horizon win was decisive, poison test included. at 9B the base is actually a touch stronger behaviorally, the long-horizon win is narrower, and it lost the poison test. so the RL’s benefits shrink and get more lopsided as the model gets smaller, concentrating in token efficiency and sustained coherence while reflexive single-turn judgment slips.

one genuinely interesting wrinkle: metacognition didn’t flatly regress. the 9B is worse at single-turn snap-catches but better at sustained multi-turn review. the RL traded reflexive caution for accumulated-context vigilance.

bottom line: a legit efficiency-forward upgrade for sustained agentic coding, much cheaper per answer and better on long loops, but not a strict superset of its base. if your work leans on single-turn judgment or robustness to a misleading human, stock Qwen3.5-9B is competitive or better. the benchmark dominance is real but scale-dependent: cleaner at 35B, narrower and more traded-off at 9B.

tested Q6 both sides, neutral blind judge, head-to-head vs the base. same method as the 35B writeup.
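the post doesn't describe the judging harness, so the following is only a minimal sketch of what "blind" could mean in a head-to-head like this: the judge sees answers under neutral labels in random order, and the mapping back to model names stays hidden until after the verdict. every name here (`blind_pair`, `unblind`, the "A"/"B" labels) is hypothetical, not the author's code.

```python
import random


def blind_pair(ornith_answer: str, base_answer: str, rng: random.Random):
    """Shuffle two answers under neutral labels; return (shown, key).

    `shown` is what a judge would see; `key` maps labels back to models
    and is kept away from the judge until the verdict is recorded.
    """
    pair = [("ornith-9b", ornith_answer), ("qwen3.5-9b", base_answer)]
    rng.shuffle(pair)  # random order, so position cannot leak identity
    shown = {"A": pair[0][1], "B": pair[1][1]}
    key = {"A": pair[0][0], "B": pair[1][0]}
    return shown, key


def unblind(verdict: str, key: dict) -> str:
    """Map a judge verdict ('A', 'B', or anything else as a tie) to a model name."""
    return key.get(verdict, "tie")


rng = random.Random(0)  # seeded only so the sketch is reproducible
shown, key = blind_pair("answer from the tuned model", "answer from the base", rng)
winner = unblind("A", key)  # pretend the judge picked label A
```

seeding the generator is a choice made for reproducibility in this sketch. a real run would more likely draw a fresh order per prompt so the judge can't learn a position bias.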

Model link:
