@tan_maty: I'm blown away by this course, a must-see for CS majors: CS336, a course that's recently become legendary in the AI community. Building large language models from scratch. This course is offered by Stanford, taught by top NLP experts Percy Liang and Tatsunori Hashim…

X AI KOLs Timeline 06/27/26, 09:48 AM Events

Summary

A thread promoting Stanford's CS336 course on building language models from scratch, taught by NLP experts Percy Liang and Tatsunori Hashimoto, emphasizing hands-on understanding.

I'm blown away by this course, a must-see for CS majors: CS336, a course that's recently become legendary in the AI community 📚. Building large language models from scratch. This course is offered by Stanford, taught by top NLP experts Percy Liang and Tatsunori Hashimoto. Its core positioning is extremely hardcore: it's the 'operating system course' of the era of large models. https://t.co/2kkTVoxW11

Cached at: 06/27/26, 05:59 PM

I highly recommend this course — a must-watch for computer science majors: CS336. It has recently become legendary in the AI community 📚.

Language Modeling from Scratch

This course is offered by Stanford, taught by NLP heavyweights Percy Liang and Tatsunori Hashimoto.

Its core positioning is extremely hardcore: it’s the “operating systems course” of the large-model era. https://t.co/2kkTVoxW11

TL;DR: Stanford’s CS 336 teaches how to build a language model from scratch, emphasizing deep understanding through hands-on construction, while acknowledging the gap between small-scale experiments and frontier models.

Course Overview

CS 336, “Language Modeling from Scratch”, is taught by Stanford NLP professors Percy Liang and Tatsunori Hashimoto, with TAs Rohith, Neil, and Marcel. Enrollment has grown by about 50% since the first offering, and the course now has three TAs. All lectures are publicly available on YouTube.

Why This Course Exists

The instructors see a crisis: researchers are increasingly disconnected from the underlying technology. Eight years ago, researchers implemented and trained their own models. Six years ago, you could still download BERT and fine-tune it. Today, many researchers only prompt proprietary models. Abstraction enables progress (prompting is fine for many studies), but these abstractions are leaky. Unlike the abstractions in programming languages or operating systems, this one is poorly understood: it is roughly “string in, string out.” Foundational research requires tearing open the stack and co-designing data, systems, and models. The course’s philosophy: “To understand, you must build.”

The Industrialization of Language Models

GPT-4 reportedly has 1.8 trillion parameters and cost about $100M to train. xAI is building a cluster of 200,000 H100 GPUs, and the Stargate project is projected to invest more than $500 billion over four years. Few details about how these models are built are public. The GPT-4 technical report itself says that, because of the competitive landscape and safety implications, it discloses no further details about architecture, hardware, training compute, or data.

This means frontier models are out of reach for most people. The course builds small language models, but small models may not be representative of large ones. Two examples:

Computation distribution: In small transformers, attention and MLP layers take roughly equal compute. At 175B parameters, the MLP layers dominate. Optimizing attention at small scale may be optimizing the wrong thing.
Emergent behaviors: Jason Wei’s 2022 paper showed that performance on many tasks stays near random until training compute crosses a threshold, then improves sharply (in-context learning is one example). Someone who only studies small models could wrongly conclude that language models are useless.
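The computation-distribution point can be checked with a back-of-the-envelope FLOP count. The sketch below is not from the course; it assumes a standard transformer block with a 4x MLP expansion and no gating, counts 2 FLOPs per multiply-add, and ignores softmax, normalization, and embeddings. The function names, the 1024-token context, and the two widths (512 as a hypothetical small model, 12288 as the published GPT-3 175B width) are illustrative choices.

```python
def per_token_flops(d_model: int, seq_len: int, mlp_ratio: int = 4) -> dict:
    """Approximate forward-pass FLOPs per token for one transformer block.

    Assumptions (illustrative, not from the course): standard multi-head
    attention, an ungated MLP with hidden size mlp_ratio * d_model, and
    2 FLOPs per multiply-add.
    """
    # Q, K, V and output projections: four (d_model x d_model) matrices.
    attn_proj = 2 * 4 * d_model * d_model
    # QK^T scores plus the weighted sum over values: each costs about
    # 2 * seq_len * d_model FLOPs per token (no causal-mask savings assumed).
    attn_scores = 2 * 2 * seq_len * d_model
    # Up- and down-projection of the MLP.
    mlp = 2 * 2 * mlp_ratio * d_model * d_model
    return {"attention": attn_proj + attn_scores, "mlp": mlp}


def mlp_share(d_model: int, seq_len: int) -> float:
    """Fraction of a block's per-token FLOPs spent in the MLP."""
    flops = per_token_flops(d_model, seq_len)
    return flops["mlp"] / (flops["attention"] + flops["mlp"])


if __name__ == "__main__":
    seq_len = 1024  # hypothetical context length used for both models
    for label, width in [("small (d_model=512)", 512),
                         ("GPT-3 175B width (d_model=12288)", 12288)]:
        print(f"{label}: MLP share = {mlp_share(width, seq_len):.2f}")
```

Under these assumptions the MLP takes half of the block's FLOPs at width 512 and about two-thirds at width 12288, because the attention-score term grows linearly in width while the projection and MLP terms grow quadratically.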

Three Kinds of Knowledge

Mechanisms: What a transformer is, how to implement it, how model parallelism works. These can be taught directly.
Mindset: Squeezing every ounce of performance from hardware, taking scaling seriously. This is more subtle but critical – it’s the scaling mindset that OpenAI pioneered.
Intuition: Which data and modeling decisions yield good models. This can only be partially taught at small scale because architectures and datasets that work small may not transfer to large scale.

The instructors hope students come away with two of the three, mechanisms and mindset, plus a partial grasp of intuition, and call that “a good deal.”

The Bitter Lesson Revisited

A common misreading of “the bitter lesson” is that only scale matters and algorithms do not. The correct reading is that algorithms that scale are what matter. Model accuracy = efficiency × resources. Efficiency matters even more at large scale, because any waste is multiplied by a huge budget. OpenAI is likely far more efficient than anyone else.

Algorithmic efficiency gains are massive: a 2020 OpenAI paper showed that from 2012 to 2019, the compute needed to reach a fixed accuracy on ImageNet fell by 44× (faster than Moore’s Law). Without those gains, reaching the same accuracy in 2019 would have cost 44× more compute. Similar results hold for language models.
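To put the 44× figure next to Moore’s Law, the short sketch below converts it into a doubling time. It assumes the gain was spread evenly over the seven years from 2012 to 2019 and takes Moore’s Law to mean a doubling every two years; both are simplifying assumptions, and the function names are made up for illustration.

```python
import math


def doubling_time_months(total_gain: float, years: float) -> float:
    """Months per doubling if total_gain accrued evenly over the period."""
    return 12 * years / math.log2(total_gain)


def hardware_gain(years: float, doubling_years: float = 2.0) -> float:
    """Gain over the period from a fixed doubling time (assumed Moore's Law)."""
    return 2 ** (years / doubling_years)


if __name__ == "__main__":
    years = 2019 - 2012   # period covered by the figure in the text
    algo_gain = 44.0      # algorithmic efficiency gain quoted in the text

    print(f"Algorithmic doubling time: {doubling_time_months(algo_gain, years):.1f} months")
    print(f"Two-year-doubling gain over {years} years: {hardware_gain(years):.1f}x")
```

Under these assumptions algorithmic efficiency doubled roughly every 15 to 16 months, while a two-year doubling would give only about 11× over the same period.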

The correct framework: Given a compute and data budget, build the best possible model. This question is meaningful at any scale. As researchers, the goal is to maximize algorithmic efficiency.

A Brief History of Language Models

Shannon (1950): Used language models to estimate the entropy of English.
2007: Google trained a 5-gram model on 2 trillion tokens – more tokens than GPT-3. But these were n-gram models, exhibiting none of today’s interesting behaviors.
Neural building blocks, most of them from the 2010s deep learning revolution:
- 2003: Bengio’s first neural language model.
- seq2seq models (Ilya Sutskever et al., Google).
- Adam optimizer (over a decade old, still widely used).
- Attention mechanisms, leading to the 2017 “Attention Is All You Need” (Transformer).
- Mixture-of-experts scaling exploration.
- Late 2010s: model parallelism work, laying groundwork for training 100B+ models.
Foundation models: ELMo, BERT, T5 – trained on massive text and adapted to tasks.
Simplified history: OpenAI combined these elements with excellent engineering, pushed scaling laws, and produced GPT-2 and GPT-3. Google competed. This led to closed models (API access only) and open models (EleutherAI, Meta’s early attempts, BLOOM, and later releases from Meta, Alibaba, DeepSeek, AI2). Openness exists on a spectrum: closed (API access only), open-weight (weights and architecture details released, but no data), and open-source (weights and data released, with papers that explain what was done).

Today’s frontier models come from OpenAI, Anthropic, xAI, Google, Meta, DeepSeek, Alibaba, and Tencent. The course reviews prior techniques and approximates frontier best practices using information from the open community and inferences about closed models.

Course Format

Lectures are executable programs. Instead of static slides, the instructor steps through code that is embedded in the lecture material, and students can run the same code as they follow along.

Source: YouTube: @tan_maty - CS336 Lecture 1 (https://youtu.be/SQ3fZ1sAqXI?si=uSBL1YbPMsBZb-V7)
