@ben_burtenshaw: https://x.com/ben_burtenshaw/status/2067615361428545566

X AI KOLs Timeline 06/18/26, 02:28 PM News

supervised-fine-tuning training-agents pytorch sft next-token-prediction qwen agent-training

Summary

A detailed tutorial on supervised fine-tuning (SFT) for training AI agents, built from scratch in pure PyTorch using Qwen3-0.6B, explaining the mechanics of next-token prediction and label masking.

https://t.co/CcFD3hNmYs

Original Article

View Cached Full Text

Cached at: 06/18/26, 04:17 PM

Training Agents: Class 1; SFT from scratch

This is the first article in the Training Agent Series, where I release a series of articles and live streams on how to train your own agent. I’ll start off pretty simple, and work up. The first session is on ‘Fine-tuning an agent from your traces’. Check out that live stream on youtube.

To prepare for that session, I’ll use that article to explain Supervised fine-tuning (SFT) from basics, without any abstractions. I’ve covered much of this before on hf.co/learn, but here I just want to drop a bitesize and minimal version. SFT trains a model to reproduce demonstrated outputs. You give it pairs of prompts and completions, and the model learns to produce the completion when it sees the prompt. That is the whole idea. This post builds a minimal SFT loop in pure PyTorch so you can see exactly what the training signal is and how it changes the model.

The code is short. The full loop is under 100 lines, with no training framework in the way. We use transformers to load a model and tokenizer, then write the loss and the optimizer step by hand.

The setup

The examples run on one GPU with 16 GB of memory, the kind a free or low-cost cloud notebook provides. The model is Qwen3-0.6B, small enough to iterate on quickly. The goal is to see the mechanics, not actually to produce a strong model.

Only a few dependencies:

Load the model in bfloat16 and put it on the GPU:

Next-token prediction is the supervision

A token is a chunk of text, often a word or word-piece, that the tokenizer maps to an integer. A language model reads a sequence of tokens and, at each position, emits a probability distribution over the next token. The image below shows this for a four-token sequence. Each position reads the tokens up to it and predicts the one that follows.

Next-token prediction.

Training uses these predictions as the supervision signal. At each position the model has a predicted distribution and a target, the token that actually comes next. The loss is the cross-entropy between them. When the model is confident and correct, the loss is near zero. When it is confident and wrong, the loss is greater.

This is the entire learning signal in SFT. There is no separate notion of a good or bad answer, only the next token the demonstration says should come. Because the signal is per token, the model gets one gradient for every token in the completion.

The data format: prompts, completions, masks

An SFT example has two parts. The prompt is the input, such as a user’s question. The completion is the output you want the model to produce. In the context of agents, these are wrapped up in ‘traces’. You want the model to learn to generate the completion, not to regenerate the prompt, so the loss should fall only on completion tokens.

This is where masking comes in. You concatenate the prompt and the completion into one sequence and feed it to the model, but you tell the loss to ignore the prompt positions. PyTorch’s cross-entropy does this with a sentinel label of -100, which it skips. The masking diagram shows the alignment: the input row holds prompt tokens followed by completion tokens, and the label row holds -100 for every prompt token and the real token id for every completion token.

The SFT label mask. The prompt is replaced with -100 so the loss falls only on completion tokens.

The code applies the chat template to one short conversation and builds the labels from the assistant mask:

The messages list holds the conversation, one dict per turn. apply_chat_template renders it into a single token sequence, and return_assistant_tokens_mask marks which tokens belong to the assistant. Everything else becomes -100.

The loop in pure PyTorch

The loop has five steps, shown in the diagram: run the sequence through the model to get logits, shift logits and labels so each position lines up with the token it predicts, compute the cross-entropy, backpropagate, and step the optimizer.

SFT Training Loop

The logits come out of the model with one distribution per input position. Position t predicts the token at position t+1, so you drop the last logit and the first label to align them:

That is a complete SFT step. loss.backward() fills in the gradient of the loss for every weight, and optimizer.step() takes a small step to lower it. The cross-entropy is written out so the shift and the mask are not hidden.

What to take away

SFT is imitation. It shows the model what to do and trains it to copy. The mechanism is next-token prediction with a mask: every completion token contributes one cross-entropy term and one gradient, and the prompt contributes none. The ceiling is the data, since a model trained by SFT cannot reliably produce behavior its demonstrations never show.

That makes SFT stable and sample-efficient, and it makes the cost of SFT the cost of writing and validating good completions. When you can demonstrate the output you want, SFT is the simplest tool that works. When you can score behavior but cannot easily demonstrate the best version of it, reinforcement learning takes over, but that is a topic for another post.

@ben_burtenshaw: https://x.com/ben_burtenshaw/status/2067615361428545566

Training Agents: Class 1; SFT from scratch

The setup

Next-token prediction is the supervision

The data format: prompts, completions, masks

The loop in pure PyTorch

What to take away

Similar Articles

@SergioPaniego: https://x.com/SergioPaniego/status/2066498136273531363

@qinzytech: https://x.com/qinzytech/status/2066585405479371092

@omarsar0: Self-improving AI is a big deal! As a first step, I've been exploring how much of the post-training can be automated. H…

Fine-tuned Qwen2.5-7B to 96% of Claude Haiku on a domain-specific task using ~$3 of API calls and zero human labelers

@harshbhatt7585: https://x.com/harshbhatt7585/status/2063593933314113587

Submit Feedback

Similar Articles

@SergioPaniego: https://x.com/SergioPaniego/status/2066498136273531363

@qinzytech: https://x.com/qinzytech/status/2066585405479371092

@omarsar0: Self-improving AI is a big deal! As a first step, I've been exploring how much of the post-training can be automated. H…

Fine-tuned Qwen2.5-7B to 96% of Claude Haiku on a domain-specific task using ~$3 of API calls and zero human labelers

@harshbhatt7585: https://x.com/harshbhatt7585/status/2063593933314113587