Training Agents: Live tutorial on how to fine-tune a coding agent for continual learning


training fine-tuning coding-agent sft gemma hf-jobs track-io

Summary

This live tutorial demonstrates how to fine-tune a small code agent (Gemma 4 2B) on an agent trace dataset using supervised fine-tuning (SFT), and how to automate hyperparameter sweeps and evaluation with HF Jobs and Trackio, embodying the concept of "using agents to train agents."


Cached at: 06/28/26, 08:59 AM

# TL;DR

This first lecture demonstrates how to use supervised fine-tuning (SFT) on an agent trajectory dataset to teach a small code agent (Gemma 4 2B) tool calling and multi-turn dialogue, and how to use HF Jobs and Trackio to run hyperparameter sweeps and evaluation automatically.

## Course Overview

The course is titled "Training Agents" and focuses on the post-training phase, especially the challenges of long-running tasks. The series plans at least three sessions: the first (current) covers SFT, the second covers basic RL, and the third covers advanced RL with environments.

The course uses agents to train agents: given a high-level prompt, an agent such as Codex automatically plans, executes training, and evaluates results, forming a "meta-training" loop.

## Agent Trajectories & Dataset

### Why Trajectories

Agent trajectories are the logs an agent leaves behind when it performs a task, including tool calls, multi-turn dialogues, and other complete interactions. Ben showed a rendered trajectory on Hugging Face, which follows a standard format for easy analysis.

The training goal is for a small model (Gemma 4 2B) to learn agentic language from scratch, i.e., to generate correctly formatted tool calls and follow multi-turn dialogues. SFT trains the model to directly mimic these behaviors from the trajectories.

### Dataset Source

- Preferred approach: collect real-world data from your own agent's trajectories, although collection is time-consuming.
- This session uses a public trajectory dataset from Mario Zechner (author of the pi framework). These trajectories come from Claude Opus 4.5, but the format is framework-agnostic and can be converted for any model.

## SFT Training Process

### Training Contract (Constraints)

The agent is given a high-level instruction (using the Codex prompt as an example):

- Run SFT on the specified model (Gemma 4 2B) and dataset.
- Use multiple HF Jobs for hyperparameter sweeps, and track each run's metrics with Trackio.
- Push all adapters to a HF repo and run the final evaluation (loss on a held-out set, HumanEval, MBPP).
- Write evaluation scores, job IDs, Trackio links, etc. into the README.

The agent must autonomously parse model IDs, configure training scripts, run smoke tests (permission checks, memory validation), and document lessons learned.

### Task Execution

Ben feeds the prompt into the Codex agent, which starts running in the background. Since the task takes about 2.5 hours, the livestream does not wait for completion but shows the expected flow.

## Training Workflow Details

### Step 1: Verification & Planning

The agent first checks:

- Whether the model HF ID exists and is accessible.
- Whether the dataset shape is suitable for training.
- Whether push permissions and HF Jobs resources are available.

It also reads documentation to ensure the correct dependencies and format are used.

### Step 2: Hyperparameter Sweep & Training

- The agent generates multiple HF Jobs for different hyperparameters (learning rate, batch size, etc.).
- Each job corresponds to one training run, and Trackio records loss, gradients, and evaluation metrics.
- The agent picks the best run from the Trackio dashboard by its metrics (e.g., lowest validation loss).

### Step 3: Evaluation & Documentation

- The best model's final weights are pushed to the HF repo.
- Two benchmark evaluations (HumanEval and MBPP) are run to measure code generation accuracy.
- The agent compiles evaluation results, job IDs, and Trackio links into the final model's README, forming a complete report.

## Goals of This Session & Next Steps

The SFT in this session only makes the model mimic the behaviors in the trajectories; it involves no reward function or RL environment. Subsequent sessions will introduce GRPO and RL environments, enabling the agent to explore and optimize its own policy.

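
The sweep-and-select logic from Steps 2 and 3 can be sketched in plain Python. This is a minimal sketch, not the code the agent produced in the stream: the search-space values, the `build_sweep` and `pick_best` helpers, and the loss numbers are all hypothetical. The real workflow would submit each config as an HF Job and read metrics from Trackio rather than from a hard-coded list.

```python
from itertools import product

# Hypothetical search space. The tutorial names learning rate and batch size
# only as examples and does not give concrete values.
SEARCH_SPACE = {
    "learning_rate": [1e-5, 5e-5, 2e-4],
    "batch_size": [4, 8],
}


def build_sweep(space):
    """Expand a dict of value lists into one config dict per training run."""
    keys = sorted(space)
    return [dict(zip(keys, values)) for values in product(*(space[k] for k in keys))]


def pick_best(results):
    """Return the run with the lowest held-out loss, skipping failed runs."""
    finished = [r for r in results if r.get("eval_loss") is not None]
    if not finished:
        raise ValueError("no finished runs to compare")
    return min(finished, key=lambda r: r["eval_loss"])


configs = build_sweep(SEARCH_SPACE)

# Made-up held-out losses, one per config; None stands in for a failed job.
fake_losses = [1.42, 1.38, 1.21, None, 1.19, 1.27]
results = [
    {"run": f"run-{i}", "config": cfg, "eval_loss": loss}
    for i, (cfg, loss) in enumerate(zip(configs, fake_losses))
]

best = pick_best(results)
print(best["run"], best["config"], best["eval_loss"])
```

Selecting on held-out loss alone mirrors the "lowest validation loss" example in the text; a real selection might also weigh the HumanEval and MBPP scores.
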
## Summary

By fine-tuning a small base model with SFT on high-quality trajectories, you can quickly obtain a code agent capable of tool calling and multi-turn dialogues. The entire training workflow itself is automated by another agent, embodying the core philosophy of "using agents to train agents."

**Source**: YouTube - Training Agents: Live tutorial on how to fine-tune a coding agent for continual learning (https://www.youtube.com/watch?v=rNgUoH7Wbv8)
