Tag
The thread shares systematic experimental findings on fine-tuning best practices. To eliminate confounders, the experiments vary one SFT lever at a time across dense and MoE models of up to 235B parameters, using four real-world customer datasets with custom evals.