@tan_maty: I'm blown away by this course, a must-see for CS majors: CS336, a course that's recently become legendary in the AI community. Building large language models from scratch. This course is offered by Stanford, taught by top NLP experts Percy Liang and Tatsunori Hashim…
Summary
A thread promoting Stanford's CS336 course on building language models from scratch, taught by NLP experts Percy Liang and Tatsunori Hashimoto, emphasizing hands-on understanding.
Cached at: 06/27/26, 05:59 PM
I highly recommend this course — a must-watch for computer science majors: CS336. It has recently become legendary in the AI community 📚.
Language Modeling from Scratch
This course is offered by Stanford, taught by NLP heavyweights Percy Liang and Tatsunori Hashimoto.
It positions itself as rigorous and demanding: the “operating systems course” of the large-model era. https://t.co/2kkTVoxW11
TL;DR: Stanford’s CS 336 teaches how to build a language model from scratch, emphasizing deep understanding through hands-on construction, while acknowledging the gap between small-scale experiments and frontier models.
Course Overview
CS 336 – “Language Modeling from Scratch” – is taught by Stanford NLP professors Percy Liang and Tatsunori Hashimoto, with TAs Rohith, Neil, and Marcel. Enrollment has grown by ~50% since the first offering, and the course now has three TAs. All lectures are publicly available on YouTube.
Why This Course Exists
The instructors see a crisis: researchers are increasingly disconnected from the underlying technology. Eight years ago, researchers implemented and trained their own models. Six years ago, you could still download BERT and fine-tune it. Today, many just prompt proprietary models. Abstraction enables progress (prompting is fine for many studies), but these abstractions are leaky. Unlike the abstractions in programming languages or operating systems, this one is poorly understood: it is roughly “string in, string out.” Foundational research requires breaking existing architectures and co-designing data, systems, and models. The course’s philosophy: “To understand, you must build.”
The Industrialization of Language Models
GPT-4 reportedly has 1.8 trillion parameters and cost $100M to train. xAI is building a cluster with 200,000 H100 GPUs, and the Stargate project is projected to invest more than $500 billion over four years. These models are built without public details. Even the GPT-4 technical report stated: “Due to the competitive landscape and safety implications… we are not disclosing any details.”
This means frontier models are out of reach for most. The course builds small language models, but small models may not be representative. Examples:
- Computation distribution: In small transformers, attention and MLP layers have roughly equal compute. At 175B parameters, MLP dominates. Optimizing attention at small scale may be optimizing the wrong thing.
- Emergent behaviors: Jason Wei’s 2022 paper showed that accuracy on many tasks stays near random until training compute crosses a threshold, then rises sharply (e.g., in-context learning). Judging only from small models leads to the false conclusion that language models are useless.
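The computation-distribution point can be checked with back-of-the-envelope arithmetic. The sketch below is not from the lecture: it assumes a standard transformer layer with four `d_model × d_model` attention projections, a two-matrix MLP with hidden width `4 × d_model`, 2 FLOPs per weight per token, and a context of 2048 tokens. The function names and the two example widths (768 and 12288) are illustrative choices, and the count ignores softmax, normalization, and causal masking.

```python
def layer_flops_per_token(d_model: int, seq_len: int, ff_mult: int = 4) -> dict:
    """Approximate forward-pass FLOPs per token for one transformer layer."""
    # Q, K, V and output projections: four d_model x d_model matrices,
    # 2 FLOPs (multiply + add) per weight.
    attn_proj = 2 * 4 * d_model * d_model
    # Attention scores (QK^T) and the weighted sum over values:
    # each costs about 2 * seq_len * d_model per token.
    attn_mix = 2 * 2 * seq_len * d_model
    # MLP: two matrices of shape d_model x (ff_mult * d_model).
    mlp = 2 * 2 * ff_mult * d_model * d_model
    return {"attention": attn_proj + attn_mix, "mlp": mlp}


def mlp_share(d_model: int, seq_len: int) -> float:
    """Fraction of the layer's FLOPs spent in the MLP."""
    flops = layer_flops_per_token(d_model, seq_len)
    return flops["mlp"] / (flops["attention"] + flops["mlp"])


if __name__ == "__main__":
    # Hypothetical small and large widths, same context length.
    for label, d_model in [("small, d_model=768", 768), ("large, d_model=12288", 12288)]:
        print(f"{label}: MLP share of layer FLOPs = {mlp_share(d_model, 2048):.2f}")
```

Under these assumptions the MLP share is a little under half at the small width and rises to roughly two-thirds at the large one, because the sequence-length term in attention grows only linearly in `d_model` while the matrix multiplies grow quadratically. The exact split depends on how the projections and the context length are counted.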
Three Kinds of Knowledge
- Mechanisms: What a transformer is, how to implement it, how model parallelism works. These can be taught directly.
- Mindset: Squeezing every ounce of performance from hardware, taking scaling seriously. This is more subtle but critical – it’s the scaling mindset that OpenAI pioneered.
- Intuition: Which data and modeling decisions yield good models. This can only be partially taught at small scale because architectures and datasets that work small may not transfer to large scale.
The instructors hope students come away with two of the three, mechanisms and mindset, and call that “a good deal.”
The Bitter Lesson Revisited
There is a common misinterpretation that “the bitter lesson” means scale alone matters and algorithms don’t. The correct reading is: algorithms for scale are what matters. Model accuracy = efficiency × resources. Efficiency is far more important at large scales because waste multiplies with huge budgets. OpenAI is likely far more efficient than anyone else.
Algorithmic efficiency gains are massive: a 2020 OpenAI paper showed that from 2012 to 2019, the compute needed to reach a given accuracy on ImageNet fell by 44× (faster than Moore’s Law). Without those gains, you’d pay 44× more. Similar results hold for language models.
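To put the 44× figure next to Moore’s Law, the sketch below converts both into doubling times. It assumes the gain is spread evenly over the seven years from 2012 to 2019 and takes Moore’s Law as a doubling every 24 months; both are simplifying assumptions, and the function names are illustrative.

```python
import math


def doubling_time_months(gain: float, years: float) -> float:
    """Months per doubling if `gain` accrues at a steady exponential rate."""
    return years * 12 / math.log2(gain)


def gain_from_doubling(doubling_months: float, years: float) -> float:
    """Total gain after `years` given a fixed doubling time."""
    return 2 ** (years * 12 / doubling_months)


if __name__ == "__main__":
    years = 7  # 2012 to 2019
    algo = doubling_time_months(44, years)
    moore = gain_from_doubling(24, years)  # assumed 24-month doubling
    print(f"Algorithmic efficiency doubled about every {algo:.1f} months")
    print(f"A 24-month doubling over the same period gives about {moore:.1f}x")
```

Under these assumptions, algorithmic efficiency doubled roughly every 15 to 16 months, while a 24-month doubling yields only about an 11× gain over the same seven years.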
The correct framework: Given a compute and data budget, build the best possible model. This question is meaningful at any scale. As researchers, the goal is to maximize algorithmic efficiency.
A Brief History of Language Models
- Shannon: Used language models to estimate entropy of English.
- 2007: Google trained a 5-gram model on 2 trillion tokens – more tokens than GPT-3. But these were n-gram models, exhibiting none of today’s interesting behaviors.
- Deep learning revolution (groundwork in 2003, most of the work in the 2010s):
  - 2003: Bengio’s first neural language model.
  - seq2seq models (Ilya Sutskever and colleagues at Google).
  - Adam optimizer (over a decade old, still widely used).
  - Attention mechanisms, leading to the 2017 “Attention Is All You Need” (Transformer).
  - Mixture-of-experts scaling exploration.
  - Late 2010s: model parallelism work, laying groundwork for training 100B+ models.
- Foundation models: ELMo, BERT, T5 – trained on massive text and adapted to tasks.
- Simplified history: OpenAI combined these elements with excellent engineering, pushed scaling laws, and produced GPT-2 and GPT-3. Google competed. This led to closed models (API only) and open models (Eleuther, Meta’s early attempts, Bloom, and later releases from Meta, Alibaba, DeepSeek, AI2). Openness exists on a spectrum: fully closed, open-weight (architecture details but no data), and open-source (weights plus data, honest papers).
Today’s frontier models come from OpenAI, Anthropic, xAI, Google, Meta, DeepSeek, Alibaba, and Tencent. The course reviews prior techniques and approximates frontier best practices using open-community information and inferences about closed models.
Course Format
Lectures are executable programs. The instructor walks through code step by step, embedding executable code in the slides. Students can run code as they follow along.
Source: YouTube: @tan_maty - CS336 Lecture 1 (https://youtu.be/SQ3fZ1sAqXI?si=uSBL1YbPMsBZb-V7)