@_rohit_tiwari_: Builds GPT-like LLMs from scratch in PyTorch > Breaks the LLM architecture into simple parts. > Beginner friendly. > Fu…
Summary
A beginner-friendly, hands-on GitHub repository that breaks down GPT-like LLM architecture into simple parts, with 10 Jupyter notebooks covering tokenization, attention, transformer blocks, and a mini GPT implementation in PyTorch.
Cached at: 06/05/26, 05:17 PM
Builds GPT-like LLMs from scratch in PyTorch
Breaks the LLM architecture into simple parts. Beginner friendly. Fully hands on.
https://github.com/analyticalrohit/llms-from-scratch…
Everything explained step by step.
Just 10 notebooks.
- 01_tokenization.ipynb
- 02_token_embeddings.ipynb
- 03_positional_embeddings.ipynb
- 04_self_attention_mechanism.ipynb
- 05_multi_head_self_attention.ipynb
- 06_feedforward_neural_networks.ipynb
- 07_residual_connections.ipynb
- 08_layer_normalization.ipynb
- 09_transformer_block.ipynb
- 10_mini_gpt.ipynb
analyticalrohit/llms-from-scratch
Source: https://github.com/analyticalrohit/llms-from-scratch
LLMs from Scratch
Overview
This repository is a hands-on guide to building a ChatGPT-like LLM in PyTorch. It breaks the architecture into simple parts and explains each one step by step.
LLM Architecture
Let us take a bird's-eye view of a GPT-like (Generative Pretrained Transformer) LLM architecture.
Example: Every moment is a beginning
LLMs generate text iteratively, predicting one token (a word or a piece of a word) at a time. Each predicted token is appended to the input, and the extended sequence becomes the context for the next prediction.
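The predict-and-append loop can be sketched in a few lines of plain Python. This is only an illustration: `generate`, `toy_next_token`, and the `TOY_MODEL` lookup table are hypothetical stand-ins for a trained model, which would score every token in its vocabulary instead of reading from a dictionary.

```python
def generate(next_token, prompt, max_new_tokens):
    """Autoregressive loop: predict one token, append it, repeat."""
    context = list(prompt)
    for _ in range(max_new_tokens):
        token = next_token(context)  # the model sees the whole context so far
        context.append(token)        # the prediction becomes part of the next input
    return context


# Hypothetical stand-in for a trained model: a fixed lookup on the last token.
TOY_MODEL = {"Every": "moment", "moment": "is", "is": "a", "a": "beginning"}


def toy_next_token(context):
    return TOY_MODEL.get(context[-1], "<eos>")


print(generate(toy_next_token, ["Every"], 4))
# ['Every', 'moment', 'is', 'a', 'beginning']
```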
Contents
- Tokenization
- Token Embeddings
- Positional Embeddings
- Self Attention Mechanism
- Multi-Head Self Attention
- FeedForward Neural Networks
- Residual Connections
- Layer Normalization
- Transformer Block
- MiniGPT
Code Notebook
Dive into the hands-on examples for each LLM component using interactive Jupyter notebooks.
| Topic | Code |
|---|---|
| Tokenization | 01_tokenization.ipynb |
| Token Embeddings | 02_token_embeddings.ipynb |
| Positional Embeddings | 03_positional_embeddings.ipynb |
| Self Attention Mechanism | 04_self_attention_mechanism.ipynb |
| Multi-Head Self Attention | 05_multi_head_self_attention.ipynb |
| FeedForward Neural Networks | 06_feedforward_neural_networks.ipynb |
| Residual Connections | 07_residual_connections.ipynb |
| Layer Normalization | 08_layer_normalization.ipynb |
| Transformer Block | 09_transformer_block.ipynb |
| MiniGPT | 10_mini_gpt.ipynb |
Install Dependencies
pip install -r requirements.txt
If you’re installing torch with CUDA support, make sure to use the correct installation command from PyTorch’s official website, as some versions require a specific installation method.
Tokenization
Tokenization is the process of splitting a text into smaller units called tokens. These tokens are the fundamental building blocks an LLM works with.
Input Sentence: “Every moment is a beginning”
Tokens: [“Every”, “moment”, “is”, “a”, “beginning”]
This shows how a tokenizer can split a sentence into tokens. After tokenization, each unique token is assigned a unique numerical ID.
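A minimal sketch of both steps, assuming the simplest possible tokenizer: split on whitespace and number tokens in order of first appearance. The helper names are hypothetical, and the IDs it produces (0 to 4) are not the IDs a real tokenizer would assign; production tokenizers usually work on subword units with a much larger vocabulary.

```python
def tokenize(text):
    """Whitespace tokenizer (an assumption; real tokenizers use subwords)."""
    return text.split()


def build_vocab(tokens):
    """Assign each unique token the next free integer ID."""
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab


def encode(text, vocab):
    return [vocab[tok] for tok in tokenize(text)]


def decode(ids, vocab):
    id_to_token = {i: tok for tok, i in vocab.items()}
    return " ".join(id_to_token[i] for i in ids)


sentence = "Every moment is a beginning"
tokens = tokenize(sentence)
vocab = build_vocab(tokens)
ids = encode(sentence, vocab)
print(tokens)  # ['Every', 'moment', 'is', 'a', 'beginning']
print(ids)     # [0, 1, 2, 3, 4]
```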
Token Embeddings
Now we have a list of numbers, but these numbers alone don’t carry any meaning. The ID “15745” for “Every” doesn’t tell the machine that “Every” is a determiner used to describe a noun. This is where embeddings help.
Token embeddings are numerical representations of tokens: each token ID is mapped to a vector, a long list of numbers that describes the token’s characteristics.
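Conceptually, an embedding layer is a lookup table with one row per token ID. The sketch below uses plain Python lists and random values as a stand-in; the function names and sizes are assumptions for illustration. In PyTorch the same role is played by `torch.nn.Embedding`, whose rows are learned during training.

```python
import random


def make_embedding_table(vocab_size, embed_dim, seed=0):
    """One random vector per token ID (a trained model would learn these)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(embed_dim)]
            for _ in range(vocab_size)]


def embed(token_ids, table):
    """Look up the vector for each token ID."""
    return [table[i] for i in token_ids]


table = make_embedding_table(vocab_size=5, embed_dim=4)
vectors = embed([0, 1, 2, 3, 4], table)
print(len(vectors), len(vectors[0]))  # 5 4
```

The same ID always maps to the same vector, wherever it appears in the sequence.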
Positional Embeddings
Imagine the sentences:
- The dog jumps on the cat.
- The cat jumps on the dog.
The words are the same, but the meaning is entirely different because their positions are different. Our numerical token IDs and token embeddings, by themselves, don’t tell the LLM anything about the order of words.
This is solved with Positional Embeddings.
Positional embeddings are another list of numbers (a vector) added to the token embeddings. These vectors help the model understand the absolute or relative position of each token in the sequence.
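A small sketch of the addition, using the two example sentences. It assumes a table of position vectors indexed by absolute position, filled with random values in place of trained ones; the vocabulary IDs and helper names are hypothetical.

```python
import random


def make_table(rows, dim, seed):
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(rows)]


def add_positions(token_vectors, pos_table):
    """Add the vector for position p to the token vector at position p."""
    return [
        [t + p for t, p in zip(tok_vec, pos_table[pos])]
        for pos, tok_vec in enumerate(token_vectors)
    ]


embed_dim, block_size = 4, 8
tok_table = make_table(5, embed_dim, seed=0)           # the=0 dog=1 jumps=2 on=3 cat=4
pos_table = make_table(block_size, embed_dim, seed=1)  # one vector per position

dog_first = [0, 1, 2, 3, 0, 4]  # the dog jumps on the cat
cat_first = [0, 4, 2, 3, 0, 1]  # the cat jumps on the dog

x1 = add_positions([tok_table[i] for i in dog_first], pos_table)
x2 = add_positions([tok_table[i] for i in cat_first], pos_table)
print(x1 == x2)  # False: same tokens, different order, different inputs
```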
Self Attention Mechanism
Self attention helps a model understand how words relate to each other in a sentence. Instead of reading each word alone, every token can look at the other tokens and decide which ones matter most.
Take this sentence:
“Every moment is a beginning.”
To understand the word “beginning”, the model pays attention to words like “moment” and “Every”. This gives context and helps the model capture the idea that each moment can represent a fresh start.
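Each token embedding is projected into three vectors: a Query, a Key, and a Value.
We compute the dot product between all Queries and Keys to measure how well they match.
The result is scaled by the square root of the key dimension to keep values stable during training. A softmax then turns each row of scores into weights that sum to 1, and these weights are used to take a weighted average of the Values.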
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
Each token now contains information gathered from other tokens in the sequence. This is the core idea behind transformers.
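The formula can be traced by hand with a plain-Python sketch. It assumes Q, K, and V are already given as lists of vectors (the learned projections that produce them are left out), and the input values are made up for illustration.

```python
import math


def softmax(row):
    top = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - top) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention over lists of vectors."""
    d_k = len(K[0])
    weights = []
    for q in Q:
        # dot product of this query with every key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights.append(softmax(scores))
    # each output vector is a weighted average of the value vectors
    output = [
        [sum(w * v[j] for w, v in zip(row, V)) for j in range(len(V[0]))]
        for row in weights
    ]
    return output, weights


Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
output, weights = attention(Q, K, V)
```

Here the first query matches the first key best, so the first output vector lies closer to the first value vector than to the second.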
In standard self attention, each token can attend to all other tokens in the sequence. But in language models, future tokens should not be visible during prediction.
For example, when predicting:
Every moment is -> a
The model should not look ahead at the future word “beginning”.
Causal self attention solves this using a mask that blocks access to future tokens.
The mask looks like this:
\begin{bmatrix} 1 & 0 & 0 \\ 1 & 1 & 0 \\ 1 & 1 & 1 \end{bmatrix}
A value of:
- 1 means attention is allowed
- 0 means attention is blocked
This ensures:
- token 1 sees only itself
- token 2 sees tokens 1 and 2
- token 3 sees tokens 1, 2, and 3
Now each token can only attend to itself and previous tokens. This is the mechanism used in decoder-only transformer models like GPT.
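One common way to apply such a mask, sketched here in plain Python, is to replace blocked scores with negative infinity before the softmax so that they receive exactly zero weight. The function names are hypothetical and the scores are made-up values.

```python
import math


def causal_mask(n):
    """Lower-triangular mask: row i allows columns 0..i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def masked_softmax(scores, mask_row):
    # Blocked positions become -inf, so exp() maps them to exactly 0.
    masked = [s if m == 1 else float("-inf") for s, m in zip(scores, mask_row)]
    top = max(masked)  # finite, because a token can always see itself
    exps = [math.exp(s - top) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


mask = causal_mask(3)
print(mask)  # [[1, 0, 0], [1, 1, 0], [1, 1, 1]]
print(masked_softmax([0.5, 2.0, 1.0], mask[0]))  # [1.0, 0.0, 0.0]
```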
Multi-Head Self Attention
Multi-head attention allows a transformer to learn different types of relationships at the same time. Instead of using one single attention mechanism, the model uses multiple attention heads in parallel.
Each head looks at the same sentence differently and learns its own pattern of relationships.
Take the sentence:
“Every moment is a beginning.”
Different attention heads may focus on different meanings:
One head may connect: “moment” ↔ “beginning” to understand the idea of renewal or change.
Another head may focus on grammar: “Every” ↔ “moment” to understand that “Every” describes “moment”.
Another head may focus on sentence meaning: “is” ↔ “beginning” to understand the main statement of the sentence.
A single attention head produces only one set of attention weights per token, so it tends to capture one kind of relationship at a time. Multiple heads allow the model to capture grammar, meaning, long-range dependencies, subject-object relationships, and contextual patterns.
How it works:
- The input embeddings are projected and split along the embedding dimension into smaller parts, one per head.
- Each head performs attention independently.
- The outputs from all heads are concatenated together.
- A final linear layer combines the information into one representation.
Each head works on its own slice of the embedding dimensions, independently of the other heads. This allows the model to learn richer and more diverse relationships between words.
Combining the head outputs into one representation improves the model’s ability to understand language.
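The split, attend, and concatenate steps can be sketched as follows. To stay short, this version makes two simplifying assumptions that a real implementation would not: each head uses its slice of the input directly as Query, Key, and Value (no learned projections), and the final linear layer is omitted.

```python
import math


def softmax(row):
    top = max(row)
    exps = [math.exp(v - top) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attend(Q, K, V):
    """Scaled dot-product attention for one head."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, V)) for j in range(len(V[0]))])
    return out


def multi_head_attention(X, num_heads):
    embed_dim = len(X[0])
    assert embed_dim % num_heads == 0, "embed_dim must divide evenly into heads"
    head_dim = embed_dim // num_heads
    head_outputs = []
    for h in range(num_heads):
        lo, hi = h * head_dim, (h + 1) * head_dim
        Xh = [row[lo:hi] for row in X]           # this head's slice of every token
        head_outputs.append(attend(Xh, Xh, Xh))  # heads run independently
    # concatenate the head outputs back into one vector per token
    return [[v for head in head_outputs for v in head[t]] for t in range(len(X))]


X = [[1.0, 0.0, 0.5, 0.5], [0.0, 1.0, 0.2, 0.8], [1.0, 1.0, 0.9, 0.1]]
Y = multi_head_attention(X, num_heads=2)
print(len(Y), len(Y[0]))  # 3 4
```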
FeedForward Neural Networks
Attention allows tokens to communicate with each other and exchange information across the sequence.
For example, in the sentence:
“Every moment is a beginning.”
Attention helps the token “beginning” gather context from words like “moment” and “Every”.
But after this information is mixed together, each token still needs additional processing to learn more complex patterns. This is the role of the Feed Forward Network, often called the FFN or MLP block.
A feedforward network typically consists of two linear layers with an activation function (such as GELU) in between. The first layer temporarily expands the hidden dimension (often by 4x), which helps the model learn more complex patterns. The layers are applied in this order:
- Linear layer
- Activation function
- Second linear layer
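The three steps above can be sketched for a single token vector in plain Python. The weights are random placeholders, the sizes are arbitrary, and the helper names are hypothetical; the GELU here is the exact form based on the error function.

```python
import math
import random


def gelu(x):
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def init_linear(in_dim, out_dim, rng):
    """Weight matrix of shape (out_dim, in_dim) plus a zero bias."""
    W = [[rng.gauss(0.0, 0.02) for _ in range(in_dim)] for _ in range(out_dim)]
    b = [0.0] * out_dim
    return W, b


def linear(x, W, b):
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def feedforward(x, params):
    (W1, b1), (W2, b2) = params
    hidden = [gelu(h) for h in linear(x, W1, b1)]  # expand, then activate
    return linear(hidden, W2, b2)                  # project back down


rng = random.Random(0)
embed_dim = 4
hidden_dim = 4 * embed_dim  # the 4x expansion mentioned above
params = (init_linear(embed_dim, hidden_dim, rng),
          init_linear(hidden_dim, embed_dim, rng))
y = feedforward([0.1, -0.2, 0.3, 0.4], params)
print(len(y))  # 4
```

The same network is applied to every token position separately, so the output has the same shape as the input.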
Residual Connections
Deep neural networks are difficult to train.
As networks become deeper, gradients can become extremely small or extremely large during backpropagation. This is known as the vanishing gradient or exploding gradient problem.
When gradients vanish, earlier layers learn very slowly because the training signal fades as it moves backward through many layers.
Residual connections, also called skip connections, help solve this problem. Instead of learning a completely new transformation, the model learns how to modify the input relative to its original value.
The original input is added back to the output of a layer:
\text{Output} = x + \text{Sublayer}(x)
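In code, the skip connection is a single addition. The sketch below uses two toy sublayers (hypothetical, chosen so the arithmetic is easy to check) to show that when a sublayer contributes nothing, the input passes through unchanged.

```python
def residual(x, sublayer):
    """Output = x + Sublayer(x), element by element."""
    return [xi + si for xi, si in zip(x, sublayer(x))]


def zero_sublayer(x):
    return [0.0 for _ in x]


def halve(x):
    return [0.5 * v for v in x]


print(residual([1.0, 2.0, 3.0], zero_sublayer))  # [1.0, 2.0, 3.0]
print(residual([1.0, 2.0], halve))               # [1.5, 3.0]
```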
Transformers use residual connections around both:
- Multi-Head Attention
- Feed Forward Networks
Residual connections help transformers:
- train deeper networks
- stabilize gradients
- preserve information
They are one of the core building blocks of modern deep learning architectures.
Layer Normalization
Neural network activations can become unstable during training. As data passes through many layers, the values can grow too large or become too small. This makes optimization difficult and can slow down learning.
Layer Normalization helps stabilize these activations. It normalizes the features of each token independently so that the values maintain:
- mean ≈ 0
- standard deviation ≈ 1
This makes training faster, more stable, and more reliable.
Suppose a token embedding is:
x = [x_1, x_2, x_3]
LayerNorm computes:
The mean
\mu = \frac{1}{n}\sum x_i
The variance
\sigma^2 = \frac{1}{n}\sum (x_i - \mu)^2
The normalized output
\hat{x}_i =\frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}
This transforms the features so they have approximately zero mean and unit variance.
Layer normalization is applied twice inside each transformer block in the pre-norm layout: once before attention and once before the feedforward network.
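The three formulas translate directly into plain Python. This sketch covers only the normalization step; library implementations such as `torch.nn.LayerNorm` also apply a learnable scale and shift afterwards by default, which is left out here. The input values and the `eps` default are arbitrary choices for illustration.

```python
import math


def layer_norm(x, eps=1e-5):
    n = len(x)
    mu = sum(x) / n                            # mean
    var = sum((xi - mu) ** 2 for xi in x) / n  # variance
    return [(xi - mu) / math.sqrt(var + eps) for xi in x]


x_hat = layer_norm([2.0, 4.0, 6.0])
print([round(v, 3) for v in x_hat])  # [-1.225, 0.0, 1.225]
```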
Transformer Block
A transformer block combines:
- Multi-head attention
- Feedforward neural network
- Residual connections
- Layer normalization
This is the core building block of GPT models.
A transformer block chains these pieces together in a specific order. In GPT-2-style models, which use the “pre-norm” architecture (normalization applied before each sublayer), the sequence is:
- LayerNorm
- Multi Head Attention
- Residual Add
- LayerNorm
- FeedForward
- Residual Add
Dropout is also commonly used after attention and feedforward layers to reduce overfitting and improve generalization.
Modern GPT models stack many transformer blocks on top of each other. Each block refines the token representations.
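The pre-norm ordering can be sketched as a function that takes the three components as arguments. The stand-ins used to exercise it (`uniform_attention`, `double`, `identity`) are hypothetical toys chosen so the result can be checked by hand, and dropout is omitted.

```python
def add(a, b):
    """Element-wise sum of two lists of token vectors."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def pre_norm_block(x, attention, feedforward, norm):
    # LayerNorm -> attention -> residual add (attention mixes across tokens)
    x = add(x, attention([norm(t) for t in x]))
    # LayerNorm -> feedforward -> residual add (applied to each token separately)
    x = add(x, [feedforward(norm(t)) for t in x])
    return x


# Toy stand-ins so the arithmetic is easy to follow.
def uniform_attention(tokens):
    n = len(tokens)
    mean = [sum(col) / n for col in zip(*tokens)]
    return [list(mean) for _ in tokens]


def double(t):
    return [2.0 * v for v in t]


def identity(t):
    return list(t)


x = [[1.0, 2.0], [3.0, 4.0]]
out = pre_norm_block(x, uniform_attention, double, identity)
print(out)  # [[9.0, 15.0], [15.0, 21.0]]
```

Stacking blocks then amounts to calling `pre_norm_block` in a loop, feeding each block's output into the next.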
MiniGPT
MiniGPT is a small GPT style language model built using transformer blocks.
It combines:
- Token embeddings
- Positional embeddings
- Transformer blocks
- Layer normalization
- Output projection layer
The model processes input tokens and predicts the next token in the sequence.
Parameters
| Parameter | Description |
|---|---|
| vocab_size | Total number of tokens in the vocabulary |
| block_size | Maximum sequence length |
| embed_dim | Size of token embeddings |
| num_heads | Number of attention heads |
| hidden_dim | Hidden size of the feedforward network |
| num_layers | Number of transformer blocks |
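One way to keep these parameters together is a small configuration object. The class below is a hypothetical sketch, not the notebook's own code, and the example values are arbitrary. The divisibility check is an assumption that follows from splitting the embedding evenly across heads.

```python
from dataclasses import dataclass


@dataclass
class MiniGPTConfig:
    vocab_size: int   # total number of tokens in the vocabulary
    block_size: int   # maximum sequence length
    embed_dim: int    # size of token embeddings
    num_heads: int    # number of attention heads
    hidden_dim: int   # hidden size of the feedforward network
    num_layers: int   # number of transformer blocks

    def __post_init__(self):
        if self.embed_dim % self.num_heads != 0:
            raise ValueError("embed_dim must be divisible by num_heads")

    @property
    def head_dim(self):
        return self.embed_dim // self.num_heads


cfg = MiniGPTConfig(vocab_size=100, block_size=16, embed_dim=32,
                    num_heads=4, hidden_dim=128, num_layers=2)
print(cfg.head_dim)  # 8
```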
Overall Flow
Input Tokens
↓
Token Embeddings
↓
Positional Embeddings
↓
Transformer Blocks
↓
LayerNorm
↓
Linear Layer
↓
Vocabulary Logits
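It can help to follow the tensor shapes through this flow. The helper below is a hypothetical sketch that only computes the shape after each stage, assuming a batch dimension and the usual (batch, sequence, feature) layout; the sizes are arbitrary.

```python
def shape_trace(batch_size, seq_len, embed_dim, vocab_size):
    """Shape of the data after each stage of the flow."""
    hidden = (batch_size, seq_len, embed_dim)
    return [
        ("Input Tokens", (batch_size, seq_len)),
        ("Token Embeddings", hidden),
        ("Positional Embeddings", hidden),  # added element-wise, shape unchanged
        ("Transformer Blocks", hidden),     # every block keeps the shape
        ("LayerNorm", hidden),
        ("Linear Layer", (batch_size, seq_len, vocab_size)),  # vocabulary logits
    ]


for stage, shape in shape_trace(batch_size=2, seq_len=5, embed_dim=32, vocab_size=100):
    print(f"{stage:<22} {shape}")
```

Only the first and last stages change the shape: embeddings add the feature dimension, and the final linear layer maps it to one score per vocabulary entry at every position.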
MiniGPT is trained autoregressively. It predicts the next token using previous tokens. This is the core idea behind GPT style language models.
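For training, next-token prediction is commonly set up by pairing each input window with the same window shifted one position to the right. The function below is a hypothetical sketch of that pairing over a list of token IDs.

```python
def next_token_pairs(token_ids, block_size):
    """Build (input, target) windows where target is input shifted by one."""
    pairs = []
    for start in range(len(token_ids) - block_size):
        x = token_ids[start:start + block_size]
        y = token_ids[start + 1:start + block_size + 1]
        pairs.append((x, y))
    return pairs


ids = [0, 1, 2, 3, 4]  # e.g. Every moment is a beginning
for x, y in next_token_pairs(ids, block_size=3):
    print(x, "->", y)
# [0, 1, 2] -> [1, 2, 3]
# [1, 2, 3] -> [2, 3, 4]
```

At every position the target is the token that follows, so one window provides `block_size` prediction examples at once.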
Blog Post
Read the full breakdown and insights in the accompanying blog posts.
- A Visual Guide to LLMs (Part 1): Text to Numbers: Tokenization and Embeddings
- A Visual Guide to LLMs (Part 2): Inside the Transformer Architecture
Newsletter
✅ Learn AI for FREE with visuals, easy-to-follow insights.
✅ Get cutting-edge topics like GenAI, RAGs, and LLMs in your inbox every week.
Contributing
We welcome contributions! If you have improvements, new notebooks, or fixes to suggest:
- Fork the repository.
- Create a feature branch: `git checkout -b feature/YourTopic`.
- Add or update notebooks in the `notebooks/` folder.
- Commit your changes: `git commit -m 'Add or update YourTopic notebook'`.
- Push your branch: `git push origin feature/YourTopic`.
- Open a pull request for review.
License
This project is licensed under the MIT License.
⭐️ If you find this repository helpful, please consider giving it a star!
Keywords: AI, Machine Learning, Deep Learning, PyTorch, Generative AI, LLMs, Transformers