@oneill_c: 1/ We fine-tune a lot of customer models, so we decided to systematically try and figure out some best practices for fi…


Summary

The thread shares systematic experimental findings on fine-tuning best practices, varying one SFT lever at a time across dense and MoE models up to 235B on four real-world customer datasets with custom evals to eliminate confounders.


Cached at: 06/22/26, 01:36 PM

1/ We fine-tune a lot of customer models, so we decided to systematically try and figure out some best practices for finetuning. SFT isn’t sexy, but it’s still important. We vary one SFT lever at a time across 2 model families, dense + MoE to 235B, on 4 real-world customer datasets.

What makes this clean is that each dataset is paired with an eval that took weeks to build with the customer, and the training outputs were generated to pass that eval. So the supervised target and the thing we measure downstream are the same criterion, which strips out the usual confounders

2/ The biggest surprise was that the optimal LoRA learning rate doesn’t move with model size: it’s flat from 0.6B to 32B. We’d previously heard that the optimum should scale roughly as inverse width, but for LoRA the fitted exponent is statistically indistinguishable from zero in both Qwen and Llama.

Pick an lr and never sweep it again: a fixed value costs <0.01 nats vs tuning
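One way to check the "exponent is zero" claim on your own sweeps is a log-log least-squares fit of best lr against model scale. This is a minimal sketch, not the report's method: the function name `fit_lr_exponent` is hypothetical, and the numbers below are illustrative placeholders, not measurements from the report.

```python
import math

def fit_lr_exponent(scales, best_lrs):
    """Least-squares slope of log(best_lr) against log(scale).

    A slope near 0 means the optimum is flat in scale;
    a slope near -1 means it shrinks inversely with scale.
    """
    xs = [math.log(s) for s in scales]
    ys = [math.log(lr) for lr in best_lrs]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    cov = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    var = sum((x - x_mean) ** 2 for x in xs)
    return cov / var

# Illustrative placeholder values only (NOT data from the report).
scales = [0.6e9, 4e9, 32e9]                       # scale measure, e.g. parameter count
flat = [5e-4] * len(scales)                       # same best lr at every scale
inverse = [5e-4 * scales[0] / s for s in scales]  # best lr proportional to 1/scale

print(fit_lr_exponent(scales, flat))     # close to 0
print(fit_lr_exponent(scales, inverse))  # close to -1
```

With real sweep results, you would also want a confidence interval on the slope before calling it indistinguishable from zero.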

3/ Full fine-tuning’s optimum sits ~10–33x lower (≈3e-5), and that ratio is stable across families. Our selection rule also transfers, i.e. fit it on Qwen and it predicts Llama to within 0.004 nats
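Reading "~10–33x lower" as relative to the LoRA optimum, the quoted numbers imply a rough LoRA range. The arithmetic below is a sketch under that reading; the constant and function names are hypothetical, and the thread itself does not state the LoRA optimum.

```python
FULL_FT_LR = 3e-5       # approximate full-FT optimum quoted in the thread
RATIO_RANGE = (10, 33)  # full FT sits ~10-33x below the LoRA optimum

def implied_lora_lr_range(full_ft_lr=FULL_FT_LR, ratios=RATIO_RANGE):
    """Scale the full-FT optimum up by the quoted ratio range."""
    low_ratio, high_ratio = ratios
    return full_ft_lr * low_ratio, full_ft_lr * high_ratio

lo, hi = implied_lora_lr_range()
print(f"implied LoRA lr range: {lo:.1e} to {hi:.1e}")
```

That works out to roughly 3e-4 to 1e-3, which is only an inference from the two quoted figures.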

4/ Full FT wins all 72 matched comparisons on loss vs LoRA, but the margins are tiny. LoRA recovers a median 98% of the gain at 3–13% of the params, and importantly the gap closes as models get bigger
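A sketch of how a "fraction of the gain recovered" figure could be computed, assuming it means LoRA's loss improvement over the base model divided by full FT's improvement. That definition, the function names, and the example losses are assumptions for illustration; the thread does not spell out the formula.

```python
from statistics import median

def gain_recovered(base_loss, lora_loss, full_ft_loss):
    """Share of full-FT loss improvement that LoRA also achieves."""
    full_gain = base_loss - full_ft_loss
    if full_gain <= 0:
        raise ValueError("full fine-tuning did not improve on the base model")
    return (base_loss - lora_loss) / full_gain

def median_gain_recovered(runs):
    """Median over matched (base, LoRA, full-FT) loss triples."""
    return median(gain_recovered(*run) for run in runs)

# Hypothetical matched comparisons, not numbers from the report.
runs = [(2.0, 1.02, 1.0), (1.5, 1.03, 1.0), (1.8, 1.01, 1.0)]
print(median_gain_recovered(runs))
```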

5/ The biggest contributor to final loss (i.e. loss improvement) isn’t lr or batch size; it’s the data. Specifically, token composition explains 56–88% of the variance. LR and batch together explain ≤0.07

6/ Does val loss even predict downstream quality? Within a fixed (model, dataset, recipe) experiment, yes (Spearman −0.38 to −0.88). So you can select on loss inside a recipe. But that doesn’t survive crossing model families, i.e. a lower-NLL model can score worse with the judge
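For anyone wanting to run the same check on their own checkpoints, here is a dependency-free sketch of Spearman rank correlation between val loss and judge score. The helper names and the example values are hypothetical; a negative value means lower loss goes with higher judged quality.

```python
import math

def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(xs, ys):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def spearman(xs, ys):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))

# Hypothetical checkpoints from ONE (model, dataset, recipe) experiment.
val_loss = [0.92, 0.85, 0.81, 0.78]
judge_score = [3.1, 3.0, 3.6, 3.9]
print(spearman(val_loss, judge_score))
```

Per the thread, this comparison is only meaningful inside one recipe, so pooling checkpoints from different model families into the same call would not be a valid use.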

7/ Also, MoEs scale roughly like a dense model sized at the geometric mean of their active and total parameter counts
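As a worked example of that rule of thumb, the sketch below computes a dense-equivalent size. The function name is hypothetical. The 235B total comes from the thread, while the 22B active count is an assumption (it matches Qwen3-235B-A22B, which the thread does not name explicitly).

```python
import math

def dense_equivalent_params(active_params, total_params):
    """Geometric mean of active and total parameter counts."""
    return math.sqrt(active_params * total_params)

# 235B total is from the thread; 22B active is an assumed value.
equiv = dense_equivalent_params(22e9, 235e9)
print(f"{equiv / 1e9:.1f}B dense-equivalent")
```

Under those assumed counts, the MoE would behave roughly like a ~72B dense model.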

8/ There’s a live “is Muon just Shampoo” debate happening right now. Our SFT data point is that Muon’s pretraining edge mostly doesn’t survive small-update SFT (it ends up with only slightly lower loss). HOWEVER, it does retain more general instruction-following, which is a super interesting finding we want to dig into more. It may be related to the flatness of the region we get into with Muon (we did some Hessian trace analysis you can read about)

9/ LoRA rank helps up to ~64, then plateaus. r=32 gets within ~0.001–0.003 nats at half the params; r=128 buys nothing. α=32 is best in every experiment. The r=64/α=32 default holds
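The "half the params" point follows from how LoRA adapters are sized: a rank-r adapter on a d_in × d_out linear layer adds r × (d_in + d_out) weights, and standard LoRA scales its update by α/r. A small sketch, where the 4096 hidden size is an arbitrary illustrative value and the function names are hypothetical:

```python
def lora_params(d_in, d_out, rank):
    """Trainable weights added by one LoRA adapter (A: rank x d_in, B: d_out x rank)."""
    return rank * (d_in + d_out)

def lora_scale(alpha, rank):
    """Multiplier applied to the low-rank update in standard LoRA."""
    return alpha / rank

HIDDEN = 4096  # illustrative layer width, not taken from the thread

for r in (32, 64, 128):
    print(r, lora_params(HIDDEN, HIDDEN, r), lora_scale(32, r))
```

Adapter size is linear in rank, so r=32 is exactly half of r=64 for the same target layers, and the r=64/α=32 default corresponds to a scale of 0.5.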

10/ In terms of epochs: past ~2, loss overfits, judged quality doesn’t improve, and instruction-following erodes. So if you have more training budget, spend it on fresh examples rather than extra epochs. Unsurprisingly, fresh 10k beats repeated 5k in all our comparisons

11/ We’d like to turn post-training into a measured science instead of inherited defaults, and we’re doing more of it. More interesting, ofc, are questions related to SFT vs RL vs OPD/OPSD, which we’ll release more on soon. And if you want to work on this, DM me!

Full report: https://datocms-assets.com/104802/1781805778-baseten-research-sft.pdf…

