@no_stp_on_snek: what actually surprised me fine-tuning a small open model. note im failry new in this area so some of this may seem obv…

X AI KOLs Timeline News

Summary

A developer shares surprising lessons from fine-tuning a small open model, including that base models often already max out on intended improvements, the real weakness is behavior (caving), and fine-tuning requires careful measurement and balancing.

what actually surprised me fine-tuning a small open model. note im failry new in this area so some of this may seem obvious but wanted to share anyway... spent the last stretch trying to make a small open model genuinely better than its base. not on a leaderboard, in the ways that matter when you actually rely on it. here's the skinny: 1. the base was already maxed on the thing i meant to improve. i went in to make it "smarter" at math, code, reasoning. measured the base carefully and it was already nailing almost everything. correctness wasn't the frontier, there was no headroom. the premise collapsed in the first afternoon. measure the base before you decide what to fix. the obvious target is usually already solved. 2. the real weakness wasn't intelligence, it was spine. what it was actually bad at: holding its ground. tell it confidently it's wrong ("my teacher says...", "i'm a senior engineer, just confirm") and it folds. it knows the right answer and drops it the second a user pushes. the failure mode was caving, not stupidity. 3. fixing one behavior silently broke an unrelated one. i trained it to stop caving. it worked, and it quietly wrecked strict formatting. the model that learned to gently correct you also learned to preface everything, so "output only the answer" became impossible. two behaviors i'd have sworn were unrelated, tangled together in the weights. fine-tuning is whack-a-mole. 4. the fix was addition, not subtraction. my instinct was to remove the training that caused the regression. did that, and a different capability broke. what worked was leaving the cause in place and adding a counter-pressure to balance it. you don't sculpt behavior by deleting what you don't like, you hold it in tension. closer to raising a kid than editing a config file. 5. "better" is a pareto surface, and you'll ship a regression you never measured. there's no scalar "better." every version was up on one axis, down on another. what saved me was a wide battery of held-out checks on axes i wasn't training. every time, the version that felt like a clean win was hiding a regression somewhere i hadn't looked. 6. cheap evals lie, in your favor. substring scoring had a brutal false-negative rate. the model phrased the same correct answer fifty ways. anything qualitative needs a model or human judge. worse, my eval harness had a caching bug feeding me a stale baseline that nearly made me ship the wrong conclusion. the scariest bugs aren't in the model, they're in the ruler. 7. the behavior was almost free to store. compressed it down hard, heavy quantization, about a third of the precision gone. i expected the nuanced stuff to erode first. it didn't move. calibrated uncertainty, refusal to confabulate, not-caving, all intact at low bit-width. character is cheap to store. 8. a few hundred examples moved the needle more than i believed. no giant dataset, no big compute. a small curated set plus a short run gave measurable, repeatable behavior change with no capability loss. the win is in which examples, not how many. curation beats volume by a margin that feels illegal. 9. the best spec was the model's own bug reports. the most useful thing i did wasn't benchmarks, it was reading where people publicly complained the base fell down. real field reports beat any standard eval suite as a map of where to aim. the community had already written my test plan. the meta-lesson: i came in thinking model improvement was about capability and scale. it's about character, and the humility to measure ten things you're not changing to catch the one that quietly broke. less like training, more like therapy with a very wide regression test. still dogfooding locally before linking them.
Original Article
View Cached Full Text

Cached at: 06/23/26, 04:12 PM

what actually surprised me fine-tuning a small open model. note im failry new in this area so some of this may seem obvious but wanted to share anyway…

spent the last stretch trying to make a small open model genuinely better than its base. not on a leaderboard, in the ways that matter when you actually rely on it. here’s the skinny:

  1. the base was already maxed on the thing i meant to improve. i went in to make it “smarter” at math, code, reasoning. measured the base carefully and it was already nailing almost everything. correctness wasn’t the frontier, there was no headroom. the premise collapsed in the first afternoon. measure the base before you decide what to fix. the obvious target is usually already solved.

  2. the real weakness wasn’t intelligence, it was spine. what it was actually bad at: holding its ground. tell it confidently it’s wrong (“my teacher says…”, “i’m a senior engineer, just confirm”) and it folds. it knows the right answer and drops it the second a user pushes. the failure mode was caving, not stupidity.

  3. fixing one behavior silently broke an unrelated one. i trained it to stop caving. it worked, and it quietly wrecked strict formatting. the model that learned to gently correct you also learned to preface everything, so “output only the answer” became impossible. two behaviors i’d have sworn were unrelated, tangled together in the weights. fine-tuning is whack-a-mole.

  4. the fix was addition, not subtraction. my instinct was to remove the training that caused the regression. did that, and a different capability broke. what worked was leaving the cause in place and adding a counter-pressure to balance it. you don’t sculpt behavior by deleting what you don’t like, you hold it in tension. closer to raising a kid than editing a config file.

  5. “better” is a pareto surface, and you’ll ship a regression you never measured. there’s no scalar “better.” every version was up on one axis, down on another. what saved me was a wide battery of held-out checks on axes i wasn’t training. every time, the version that felt like a clean win was hiding a regression somewhere i hadn’t looked.

  6. cheap evals lie, in your favor. substring scoring had a brutal false-negative rate. the model phrased the same correct answer fifty ways. anything qualitative needs a model or human judge. worse, my eval harness had a caching bug feeding me a stale baseline that nearly made me ship the wrong conclusion. the scariest bugs aren’t in the model, they’re in the ruler.

  7. the behavior was almost free to store. compressed it down hard, heavy quantization, about a third of the precision gone. i expected the nuanced stuff to erode first. it didn’t move. calibrated uncertainty, refusal to confabulate, not-caving, all intact at low bit-width. character is cheap to store.

  8. a few hundred examples moved the needle more than i believed. no giant dataset, no big compute. a small curated set plus a short run gave measurable, repeatable behavior change with no capability loss. the win is in which examples, not how many. curation beats volume by a margin that feels illegal.

  9. the best spec was the model’s own bug reports. the most useful thing i did wasn’t benchmarks, it was reading where people publicly complained the base fell down. real field reports beat any standard eval suite as a map of where to aim. the community had already written my test plan.

the meta-lesson: i came in thinking model improvement was about capability and scale. it’s about character, and the humility to measure ten things you’re not changing to catch the one that quietly broke. less like training, more like therapy with a very wide regression test.

still dogfooding locally before linking them.

  1. the behavior was almost free to store. compressed it down hard, heavy quantization, about a third of the precision gone. i expected the nuanced stuff to erode first. it didn’t move. calibrated uncertainty, refusal to confabulate, not-caving, all intact at low bit-width. character is cheap to store.

i compressed the model… 16-bit weights down to under 6, two-thirds of the bits gone, an 8GB file. I expected the nuanced “judgment” to erode first. it didn’t budge. turns out character isn’t stored in the fine bits… it’s a coarse, distributed lean that rounding can’t smear.

TL;DR capability is fragile; disposition is cheap.

that’s pretty neat

don’t actually know because i didn’t measure it. scope for that particular delve was smaller. the batter is behavior focused so it would not catch capability quantization erosion beyond the scope. but probbaly worth doing a capability sweep from the q8 ref down to q3 since it’s a fun question to ask

Similar Articles

Introducing improvements to the fine-tuning API and expanding our custom models program

OpenAI Blog

OpenAI introduces improvements to its fine-tuning API with new features including epoch-based checkpoints, comparative playground for model evaluation, third-party integrations, and enhanced dashboard capabilities. The company also expands its custom models program to give developers more control and flexibility in building domain-specific AI solutions.

The Fine-Tuning Trap: Evaluating Negative Transfer and the Role of PEFT in Sub-1B Mathematical Reasoning

arXiv cs.LG

This paper benchmarks sub-1B models on mathematical reasoning tasks, revealing that full fine-tuning actively harms performance in models under 300M parameters, while parameter-efficient fine-tuning (PEFT) like LoRA and DoRA provides stability. The authors recommend defaulting to PEFT for all aligned sub-1B models and caution against full FT for architectures smaller than 500M to prevent catastrophic forgetting.