A recap of a live stream where an AI agent (Codex) autonomously runs the entire SFT workflow to train a small Gemma 2B model to imitate a coding agent (pi). All artifacts and code are open-sourced.

X AI KOLs Tools

Summary

A recap of a live stream where an AI agent (Codex) autonomously runs the entire SFT workflow to train a small Gemma 2B model to imitate a coding agent (pi). All artifacts and code are open-sourced.

https://t.co/TqIHNfRkfC
Original Article
View Cached Full Text

Cached at: 06/23/26, 02:35 PM

Training Agents Class 1: SFT, run by an agent

We let an agent run the whole SFT workflow, live, from one prompt.

NOTE: You can (and should) watch the full live stream, where everything is covered in much more detail: https://www.youtube.com/watch?v=rNgUoH7Wbv8

We pasted a single prompt at the start of the stream and that was all we wrote, no training code by hand. From there, an AI agent did the engineering. The word agent means a few different things here. So before anything else, let’s be clear which one we mean.

Which agent is which?

There are two distinct agents in this story:

  • The builder: the agent that does the ML work. We used Codex, but any capable agent works. We gave it one prompt and it resolved the model, prepared the data, ran the training, tracked it, ran the evals, and wrote the model card.

  • The student (the one we train!): a small open Gemma model (gemma 4 2b) that learns to act like a coding agent.

The data comes from a third one: pi, a real coding agent. Its actual work sessions, the traces, are what the student imitates.

So one agent builds, one is being built, and a third gave the lessons.

Why does any of this make sense?

Why train a model to become a coding agent? Capable coding agents today run on large, expensive, often closed models. Teach a small open model to act like one and you get something cheap, private, and yours. And because you train it on your own traces, you can specialize it for a specific use case: your codebase, your tools, your workflow. It is also where you start. You cannot improve an agent that cannot yet act like one, so you teach it the format first.

Why Gemma 4, and why 2B? Gemma 4 is a recent, open, well-supported instruct model. The 2B size is small enough to train fast, run on a modest GPU, and iterate on live. We want you to see the mechanics and the workflow, not to top a benchmark. A small model also keeps the limits visible.

What should you expect as output? A LoRA adapter and a final model repo that imitate the agent’s format: its tool calls and its multi-turn loop. It learns the shape and language of the agent, not strong problem-solving. A 2B after one SFT pass is not going to be a great coder. You get a model that acts like the agent, and a pipeline that is reproducible and auditable.

Where can you find everything?

Every artifact is open (of course!):

  • Full stream: https://www.youtube.com/watch?v=rNgUoH7Wbv8

  • Slides: https://docs.google.com/presentation/d/1hcGZ4U9TjZZzcGNbH2K6wYD45qwZTyo_gosCQsnHlnc

  • SFT from scratch, by Ben: https://x.com/ben_burtenshaw/status/2067615361428545566

  • Full agent session trace: https://huggingface.co/buckets/burtenshaw/sft-on-traces/tree/example.jsonl

  • Trackio dashboard: https://huggingface.co/spaces/burtenshaw/youtube-livestream-1-trackio

  • Final model: https://huggingface.co/burtenshaw/gemma-4-E2B-it-pi-mono-lora-youtube-livestream-1

  • Winner adapter: https://huggingface.co/burtenshaw/gemma-4-E2B-it-pi-mono-lora-youtube-livestream-1-lr2e4-r16-len4k

  • Dataset (pi-mono traces): https://huggingface.co/datasets/badlogicgames/pi-mono

  • Code and context repo: https://github.com/burtenshaw/training-agents

Tools, mostly Hugging Face 🤗: TRL, Hugging Face Jobs, Trackio, and the Hub, plus Inspect AI and vLLM for the evals.

The agent did the build, but the judgment stayed ours: the goal, the constraints, the selection rule we fixed before any scores, and checking every artifact is real.

This was the recap. The class itself is the stream, that is where it happens. The series goes deeper from here.

Similar Articles