[Full Workshop] Reinforcement Learning, Kernels, Reasoning, Quantization & Agents — Daniel Han


Summary

At the AI Engineer World's Fair, Daniel Han delivered an in-depth talk on practical experience with reinforcement learning, model fine-tuning, quantization, and agents. He reviewed the evolution of open-source models from Llama to DeepSeek R1 and analyzed the five key stages of modern model training.


Cached at: 06/25/26, 01:32 PM

### TL;DR

Daniel Han shared practical experience in reinforcement learning, fine-tuning, quantization, and agents at the AI Engineer World's Fair, reviewed the evolution of open-source models from Llama to DeepSeek R1, and broke down the key stages of model training (pre-training, intermediate training, supervised fine-tuning, post-training, and RLVR).

## Speaker Team & Background

Daniel first introduced his team's work: sharing AI frontier updates on Twitter, fixing a gradient accumulation bug last year, and introducing asynchronously offloaded gradient checkpointing. They also collaborate with teams at Hugging Face, Google, Meta, Mistral, and others, fixing bugs in open-source models such as Gemma, Llama, and Mixtral, and contributing code to projects such as llama.cpp, Qwen, and Mistral.

The team's Hugging Face downloads exceed 10 million per month, and their GitHub packages have 40,000 stars. The core goal is to make fine-tuning faster and more memory-efficient. They also provide free Colab and Kaggle notebooks (using Google's free GPUs and Kaggle's 30 hours of GPU time per week) and upload fixed, quantized models (e.g., a 1.58-bit DeepSeek R1-0528) that can run on low-VRAM devices.

## Evolution of Open-Source Models

### History Starting from Llama

- **Llama 1**: Meta initially released it only as a research paper; the weights leaked, sparking the open-source movement. Llama 1 was trained on only 1.4 trillion tokens, a tenth or less of what today's models use. In its training-loss plot, larger models reach lower loss (7B blue line, 65B red line). Normal fine-tuning loss should be around 2–3.
- **Current Scale**: Google's Gemma 3 was trained on 14 trillion tokens and Llama 4 on 30 trillion tokens, about 10 and 21 times Llama 1's total, respectively.

### Open-Source vs. Closed-Source: Two "Droughts"

Daniel cited Maxime's chart showing that open-source models improved on MMLU (5-shot) along a steeper slope than closed-source models, eventually reaching GPT-4 level with Llama 3.1 405B. But two "droughts" occurred:

1. **First Drought (Dec 2022)**: ChatGPT was released; open-source models lagged in instruction following and RLHF until Llama 1 gradually caught up.
2. **Second Drought (Sep 2024)**: o1-preview introduced reasoning chains, a leap in capability; the open-source community could not replicate it for four months, until DeepSeek R1 (Jan 2025) broke the deadlock.

Daniel believes each closed-source release produces a step-function jump (first the SFT/RLHF jump, then the RL jump), but what the next jump will be is unknown. Yann LeCun's cake diagram (unsupervised learning → supervised fine-tuning → reinforcement learning), though from 2016, remains the core framework.

## Base Models & Fine-Tuning Stages

All large models start from a base model (ChatGPT's base model, for example, was never released), then become chat models via fine-tuning. Naming conventions for open-source models are inconsistent (e.g., Gemma 3 PT/IT, Llama 4 Instruct, Qwen 3 Base/no suffix). Daniel calls for standardization.

### Modern Training Stage Breakdown

Daniel proposed an updated stage model:

- **Pre-training**: Predicts the next token over broad data such as the whole web and Wikipedia.
- **Intermediate Training (mid-training)**: Weights high-quality data (e.g., Wikipedia) more heavily; can also extend the context length.
- **Supervised Fine-Tuning (SFT / Instruction Tuning)**: Converts the base model into a chat model.
- **Post-training**: Preference fine-tuning (DPO), RLHF, etc.
- **Reinforcement Fine-Tuning (RLVR)**: Reinforcement Learning with Verifiable Rewards. Unlike traditional preference fine-tuning, it uses reward functions directly to improve the model.

He emphasized that model training starts from randomly initialized weights and progresses through the above stages.

## Conclusion & Interaction

Daniel encouraged the audience to ask questions and promised to answer all of them. He reminded attendees to use free GPU resources and keep an eye on the team's released fixed models, as model bugs can affect accuracy by 10%.
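To make the RLVR stage from the training-stage breakdown more concrete, here is a minimal sketch of what a "verifiable" reward function could look like. It is not taken from the talk: the `<answer>` tag format, the function names, and the reward values (0.5 for format, 1.0 for correctness) are all hypothetical choices made for illustration. The point is only that the reward is computed by a program checking the output, rather than by a learned preference model.

```python
import re


def extract_answer(completion: str):
    """Return the text inside the last <answer>...</answer> pair, or None.

    The <answer> tag convention is an assumption for this sketch.
    """
    matches = re.findall(r"<answer>(.*?)</answer>", completion, flags=re.DOTALL)
    return matches[-1].strip() if matches else None


def verifiable_reward(completion: str, reference: str) -> float:
    """Score one completion against a known-correct reference answer."""
    answer = extract_answer(completion)
    if answer is None:
        return 0.0  # nothing parseable, so no reward at all

    reward = 0.5  # hypothetical partial credit for following the format
    try:
        # Compare numerically when both sides parse as numbers.
        if abs(float(answer) - float(reference)) < 1e-9:
            reward += 1.0
    except ValueError:
        # Otherwise fall back to an exact string match.
        if answer == reference.strip():
            reward += 1.0
    return reward


if __name__ == "__main__":
    samples = [
        "Let me think. 6 * 7 is 42. <answer>42</answer>",
        "<answer>41</answer>",
        "The answer is 42",
    ]
    for text in samples:
        print(verifiable_reward(text, "42"))
```

In an RL loop, scores like these would be computed for several sampled completions per prompt and fed to the policy update; that part is omitted here because the summary does not specify which algorithm is used.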
---

*Source: [Full Workshop] Reinforcement Learning, Kernels, Reasoning, Quantization & Agents — Daniel Han* (https://www.youtube.com/watch?v=OkEGJ5G3foU)
