Training Agents: Live tutorial on how to fine-tune a coding agent for continual learning


Summary

This live tutorial demonstrates how to fine-tune a small code agent (Gemma 4 2B) on an agent trace dataset using supervised fine-tuning (SFT), and automate hyperparameter sweeps and evaluation using HF Jobs and Track IO, embodying the concept of "using agents to train agents."



# TL;DR

This first lecture demonstrates how to use supervised fine-tuning (SFT) on an agent trajectory dataset to train a small code agent (Gemma 4 2B) to learn tool calling and multi-turn dialogue, and leverages HF Jobs and Track IO to automatically run hyperparameter sweeps and evaluation.

## Course Overview

The course is titled "Training Agents" and focuses on the post-training phase, especially the challenges of long-running tasks. The series plans at least three sessions: the first (current) covers SFT, the second covers basic RL, and the third covers advanced RL with environments.

The course uses agents to train agents: given a high-level prompt, an agent like CodeX automatically plans, executes training, and evaluates results, forming a "meta-training" loop.

## Agent Trajectories & Dataset

### Why Trajectories

Agent trajectories are the logs left behind when an agent performs a task, including tool calls, multi-turn dialogues, and other complete interactions. Ben showed a rendered trajectory on Hugging Face, which follows a standard format for easy analysis.

The training goal is for a small model (Gemma 4 2B) to learn agentic language from scratch, i.e., generating correct tool call formats and following multi-turn dialogues. SFT trains the model to directly mimic these behaviors as they appear in the trajectories.

### Dataset Source

- Preferred approach: collect real-world data from your own agent's trajectories, but collection is time-consuming.
- This session uses a public trajectory dataset from Mario Zechner (author of the Pie framework). These trajectories come from Claude Opus 4.5, but the format is framework-agnostic and can be converted for any model.

## SFT Training Process

### Training Contract (Constraints)

The agent is given a high-level instruction (using the CodeX prompt as an example):

- Run SFT on the specified model (Gemma 4 2B) and dataset.
- Use multiple HF Jobs for hyperparameter sweeps, and track each run's metrics with Track IO.
- Push all adapters to an HF repo and run the final evaluation (loss on a held-out set, HumanEval, MBPP).
- Write evaluation scores, job IDs, Track IO links, etc. into the README.

The agent must autonomously parse model IDs, configure training scripts, run smoke tests (permission checks, memory validation), and document lessons learned.

### Task Execution

Ben feeds the prompt into the CodeX agent, which starts running in the background. Since the task takes about 2.5 hours, the livestream does not wait for completion but shows the expected flow.

## Training Workflow Details

### Step 1: Verification & Planning

The agent first checks:

- Whether the model HF ID exists and is accessible.
- Whether the dataset shape is suitable for training.
- Whether push permissions and HF Jobs resources are available.

It also reads documentation to ensure the correct dependencies and format are used.

### Step 2: Hyperparameter Sweep & Training

- The agent generates multiple HF Jobs for different hyperparameters (learning rate, batch size, etc.).
- Each job corresponds to one training run, and Track IO records loss, gradients, and evaluation metrics.
- The agent automatically picks the best run based on its metrics (e.g., lowest validation loss) from the Track IO dashboard.

### Step 3: Evaluation & Documentation

- The best model's final weights are pushed to the HF repo.
- Two benchmark evaluations (HumanEval and MBPP) are run to measure code generation accuracy.
- The agent compiles evaluation results, job IDs, and Track IO links into the final model's README, forming a complete report.

## Goals of This Session & Next Steps

The current SFT stage only makes the model mimic behaviors in the trajectories; it involves no reward function or RL environment. Subsequent sessions will introduce GRPO and RL environments, enabling the agent to explore and optimize its own policy.
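
The sweep and best-run selection described in Step 2 can be sketched in plain Python. This is only an illustration of the logic, not the script the agent produced: the `RunConfig` class, the helper names, the grid values, and the loss numbers are all hypothetical, and the sketch deliberately makes no HF Jobs or Track IO calls. In practice each config would be submitted as one job, and the validation losses would be read back from the tracking dashboard.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class RunConfig:
    """One training run in the sweep (one job per config). Hypothetical helper."""
    learning_rate: float
    batch_size: int

    @property
    def name(self) -> str:
        # A readable run name, e.g. "sft-lr0.0001-bs4".
        return f"sft-lr{self.learning_rate:g}-bs{self.batch_size}"


def build_sweep(learning_rates, batch_sizes):
    """Return one RunConfig per combination of the swept hyperparameters."""
    return [RunConfig(lr, bs) for lr, bs in product(learning_rates, batch_sizes)]


def pick_best_run(val_loss_by_run):
    """Return the name of the run with the lowest validation loss."""
    if not val_loss_by_run:
        raise ValueError("no finished runs to compare")
    return min(val_loss_by_run, key=val_loss_by_run.get)


# Assumed grid: the tutorial does not state which values were swept.
sweep = build_sweep(learning_rates=[1e-4, 2e-4], batch_sizes=[4, 8])

# Placeholder losses for illustration only; these are NOT results from the tutorial.
example_losses = {cfg.name: 1.0 + 0.1 * i for i, cfg in enumerate(sweep)}
best = pick_best_run(example_losses)
```

Keeping the grid and the selection rule as pure functions makes them easy to check in a smoke test before any compute is spent, which matches the verification-first flow in Step 1.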

## Summary

By fine-tuning a small base model with SFT on high-quality trajectories, you can quickly obtain a code agent capable of tool calling and multi-turn dialogues. The entire training workflow itself is automated by another agent, embodying the core philosophy of "using agents to train agents."

**Source**: YouTube - Training Agents: Live tutorial on how to fine-tune a coding agent for continual learning (https://www.youtube.com/watch?v=rNgUoH7Wbv8)
