@oneill_c: 1/ We fine-tune a lot of customer models, so we decided to systematically try and figure out some best practices for fi…
Summary
The thread shares systematic experimental findings on fine-tuning best practices, varying one SFT lever at a time across dense and MoE models up to 235B on four real-world customer datasets with custom evals to eliminate confounders.
Cached at: 06/22/26, 01:36 PM
1/ We fine-tune a lot of customer models, so we decided to systematically figure out some best practices for fine-tuning. SFT isn’t sexy, but it’s still important. We vary one SFT lever at a time across 2 model families, dense and MoE up to 235B, on 4 real-world customer datasets.
What makes this clean is that each dataset is paired with an eval that took weeks to build with the customer, and the training outputs were generated to pass that eval. So the supervised target and the thing we measure downstream are the same criterion, which strips out the usual confounders.
2/ The biggest surprise was that the optimal LoRA learning rate doesn’t move with model size: it’s flat from 0.6B to 32B. We’d previously heard that the optimum should scale roughly as inverse width. For LoRA, the fitted exponent is statistically indistinguishable from zero in both Qwen and Llama.
Pick one LR and never sweep it again: a fixed choice costs <0.01 nats vs tuning.
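One way to check the "flat optimum" claim on your own sweeps is to fit a power law, best LR ∝ width^k, and look at k. This is a minimal sketch, assuming you already have the best LR per model width. The function name, widths and LRs below are hypothetical placeholders, not numbers from the report.

```python
import math

def fit_power_law_exponent(widths, best_lrs):
    """Least-squares slope of log(best_lr) against log(width).

    A slope near -1 would match inverse-width scaling;
    a slope near 0 means the optimum does not move with size.
    """
    xs = [math.log(w) for w in widths]
    ys = [math.log(lr) for lr in best_lrs]
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    cov = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    var = sum((x - x_mean) ** 2 for x in xs)
    return cov / var

# Hypothetical placeholder sweep results (NOT from the report).
widths = [1024, 2048, 4096, 5120]
flat_lrs = [4e-4, 4e-4, 4e-4, 4e-4]       # same optimum at every width
inverse_lrs = [0.4 / w for w in widths]   # optimum proportional to 1/width

flat_k = fit_power_law_exponent(widths, flat_lrs)        # close to 0
inverse_k = fit_power_law_exponent(widths, inverse_lrs)  # close to -1
```

With real sweeps you would also want a confidence interval on the slope before calling it indistinguishable from zero.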
3/ Full fine-tuning’s optimum sits ~10–33x lower (≈3e-5), and that ratio is stable across families. Our selection rule also transfers, i.e. fit it on Qwen and predict Llama to within 0.004 nats.
4/ Full FT wins all 72 matched comparisons on loss vs LoRA, but the margins are tiny. LoRA recovers a median 98% of the gain at 3–13% of the params, and, importantly, the gap closes as models get bigger.
5/ The most important contributor to final loss (loss improvement) isn’t LR or batch size, it’s the data. Specifically, token composition explains 56–88% of the variance. LR and batch size together explain ≤0.07.
6/ Does val loss even predict downstream quality? Within a fixed (model, dataset, recipe) experiment, yes (Spearman −0.38 to −0.88). So you can select on loss inside a recipe. But it doesn’t survive crossing model families, i.e. a lower-NLL model can score worse with the judge.
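To run the same check on your own checkpoints, compute the Spearman rank correlation between val loss and judge score inside a single recipe. This is a minimal pure-Python sketch. The helper names and the checkpoint numbers are hypothetical, for illustration only, and are not taken from the report.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical checkpoints from ONE (model, dataset, recipe) run.
val_loss = [0.92, 0.85, 0.81, 0.78, 0.80]
judge_score = [0.55, 0.61, 0.70, 0.74, 0.66]

rho = spearman(val_loss, judge_score)  # negative: lower loss, higher score
best_checkpoint = min(range(len(val_loss)), key=lambda i: val_loss[i])
```

Selecting `best_checkpoint` by loss is only justified within one recipe. Across model families the ranking can flip, so compare those with the judge directly.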
7/ Also, MoEs scale roughly like a dense model whose size is the geometric mean of their active and total parameter counts.
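As a quick worked example of that rule of thumb, the dense-equivalent size is sqrt(active × total). This is a sketch: the function name is made up, and the 22B-active figure is an assumption for illustration, since the thread only mentions 235B total.

```python
import math

def moe_dense_equivalent(active_params, total_params):
    """Geometric mean of active and total parameter counts."""
    return math.sqrt(active_params * total_params)

# Assumed example: an MoE with 22B active and 235B total parameters.
equiv = moe_dense_equivalent(22e9, 235e9)
equiv_billions = equiv / 1e9  # roughly 72
```

So under this heuristic such a model would be expected to behave like a dense model of around 72B parameters, well above its active count and well below its total.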
8/ There’s a live “is Muon just Shampoo” debate happening right now. Our SFT data point is that Muon’s pretraining edge mostly doesn’t survive small-update SFT (it ends up with only slightly lower loss). HOWEVER, it does retain more general instruction-following. This is a super interesting finding we want to dig into more, and it may be related to the flatness of the region Muon lands in (we did some Hessian trace analysis you can read about in the report).
9/ LoRA rank helps up to ~64, then plateaus. r=32 gets within ~0.001–0.003 nats at half the params; r=128 buys nothing. α=32 is best in every experiment. The r=64/α=32 default holds.
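The "half the params" point follows from how LoRA is parameterized: adapter size is linear in the rank. This is a minimal sketch, assuming the original LoRA formulation (update B·A scaled by α/r). The thread doesn’t say which scaling convention was used, and the hidden size is a hypothetical value.

```python
def lora_param_count(d_in, d_out, rank):
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix.

    LoRA learns A (rank x d_in) and B (d_out x rank).
    """
    return rank * (d_in + d_out)

def lora_scale(alpha, rank):
    """Multiplier on B @ A in the original LoRA formulation (an assumption here)."""
    return alpha / rank

d = 4096  # hypothetical hidden size, for illustration only

params_r32 = lora_param_count(d, d, 32)
params_r64 = lora_param_count(d, d, 64)

ratio = params_r32 / params_r64          # 0.5: r=32 is half the size of r=64
scale_default = lora_scale(32, 64)       # 0.5 for the r=64 / alpha=32 default
per_matrix_fraction = params_r64 / (d * d)  # adapter size relative to one full matrix
```

The model-level fraction depends on which matrices get adapters, so `per_matrix_fraction` shouldn’t be read as the 3–13% figure from the thread.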
10/ On epochs: past ~2, the loss overfits, judged quality doesn’t improve, and instruction-following erodes. So if you have more data, spend it on fresh examples. Quite obviously, fresh 10k beats repeated 5k in all our comparisons.
11/ We’d like to turn post-training into a measured science instead of inherited defaults, and we’re doing more of it. More interesting, ofc, are questions related to SFT vs RL vs OPD/OPSD, which we’ll release more on soon. And if you want to work on this, DM me!
Full report: https://datocms-assets.com/104802/1781805778-baseten-research-sft.pdf…