@pallavishekhar_: Math Behind Gradient Descent Read here: https://outcomeschool.com/blog/math-behind-gradient-descent…


Summary

This blog post explains the math behind gradient descent, the fundamental optimization algorithm used to train machine learning models, with a step-by-step numeric example and intuition.


Cached at: 05/26/26, 03:12 PM



Math Behind Gradient Descent

Source: https://outcomeschool.com/blog/math-behind-gradient-descent

In this blog, we will learn about the math behind gradient descent with a step-by-step numeric example.

Gradient descent is the most fundamental optimization algorithm used to train machine learning and deep learning models. Understanding the math behind it gives us a clear picture of how models actually learn. Do not worry, we will go through each concept step by step so that everything is easy to understand.

We will cover the following topics:

  • What is a Loss Function
  • What is Gradient Descent
  • The Intuition Behind Gradient Descent
  • The Math Behind Gradient Descent
  • Step-by-Step Numeric Example
  • Gradient Descent with Multiple Parameters
  • The Role of Learning Rate
  • Types of Gradient Descent
  • Gradient Descent in Python

I am Amit Shekhar, Founder @Outcome School. I have taught and mentored many developers whose efforts landed them high-paying tech jobs, helped many tech companies solve their unique problems, and created many open-source libraries that are used by top companies. I am passionate about sharing knowledge through open source, blogs, and videos.

I teach AI and Machine Learning at Outcome School.

Let’s get started.

The Big Picture

Before we go into the details, let’s understand the big picture.

A model learns by adjusting its weights so that its predictions become closer to the actual values. Gradient descent is the algorithm that tells the model how to adjust these weights. It keeps nudging the weights in the direction that reduces the error, step by step, until the error is as small as possible.

In simple words:

Gradient Descent = A simple way to slide down the error curve step by step until we reach the lowest point.

What is a Loss Function

Before we learn about gradient descent, we must first understand what a loss function is.

A loss function is a function that measures how far the model’s prediction is from the actual value. In simple words, it tells us how wrong the model is.

Let’s say we are building a model that predicts house prices. The actual price of a house is 60 and our model predicts 50. The error is 60 - 50 = 10. One common way to measure this error is to square it. So, the loss becomes (60 - 50)² = 100.

We square the error for two reasons. First, squaring makes all errors positive so that negative and positive errors do not cancel each other out. Second, squaring penalizes larger errors more than smaller ones.

When we have many examples, we average these squared errors across the dataset. This is called Mean Squared Error (MSE), and it is one of the most common loss functions in machine learning.
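The averaging described above can be sketched in a few lines of Python. This is a minimal illustration, not a library implementation: the function name `mean_squared_error` and the three-house dataset are assumptions made up for this example, while the single house (actual 60, predicted 50) comes from the text.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sum(squared_errors) / len(squared_errors)


# The single house from the text: actual price 60, predicted price 50.
print(mean_squared_error([60], [50]))  # 100.0

# Three hypothetical houses (illustrative numbers, not from the article).
actual_prices = [60, 80, 100]
predicted_prices = [50, 85, 100]
print(mean_squared_error(actual_prices, predicted_prices))  # (100 + 25 + 0) / 3
```

Here, the second call shows why squaring matters: the errors +10 and -5 do not cancel each other out, and the larger error contributes four times as much to the loss as the smaller one.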

**Our goal during training is to make this loss as small as possible.** And this is where gradient descent comes into the picture.

What is Gradient Descent

Let’s break down the term: Gradient Descent = Gradient + Descent.

  • Gradient means the slope or steepness of a surface in a particular direction.
  • Descent means going downward.

So, Gradient Descent means going downward in the direction of the steepest slope.

In simple words, **gradient descent is an optimization algorithm that finds the values of weights that minimize the loss function.** It does this by repeatedly adjusting the weights in the direction that reduces the loss.

The Intuition Behind Gradient Descent

The best way to learn this is by taking an example.

Suppose we are standing on a hill and it is completely foggy. We cannot see the bottom of the valley. The only thing we can feel is the slope of the ground under our feet. Our goal is to reach the bottom of the valley (the lowest point).

So, what would we do? We would feel the slope and take a step in the direction where the ground goes down. If the slope is steep, we take a bigger step. If the slope is gentle, we take a smaller step. We keep repeating this until the ground feels flat, which means we have reached the bottom.

This is exactly what gradient descent does. The hill is the loss function. The bottom of the valley is the minimum loss. The slope we feel is the gradient. And each step we take is a weight update.

The Math Behind Gradient Descent

Now, let’s understand the actual math behind gradient descent. We will keep it simple and go step by step.

What is a Derivative

A derivative tells us the rate of change of a function. In simple words, it tells us the slope of the function at a given point.

Let’s say we are driving a car. The speedometer tells us how fast our position is changing with time. That speed is the derivative of our position with respect to time.

Similarly, in gradient descent, we need to know the slope of the loss function at the current weight value. The derivative tells us exactly that.

For a function f(w), the derivative is written as below:

f'(w) = df/dw

Here, f'(w) tells us how much the output of f changes when we change w by a tiny amount.

  • If the derivative is positive, the function is going up (slope is upward).
  • If the derivative is negative, the function is going down (slope is downward).
  • If the derivative is zero, the function is flat (we are at a minimum, a maximum, or a flat point in between).
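The three cases above can be checked numerically. The sketch below estimates the slope with a central difference, which is one common approximation and not something the article itself uses. The function f(w) = (w - 3)² is assumed here for illustration (it is the same example the later sections work with), and the helper name `numerical_derivative` is hypothetical.

```python
def f(w):
    # Example function, assumed for illustration: f(w) = (w - 3)^2
    return (w - 3) ** 2


def numerical_derivative(func, w, h=1e-6):
    """Central-difference estimate of the slope of func at the point w."""
    return (func(w + h) - func(w - h)) / (2 * h)


# Left of the minimum, at the minimum, and right of the minimum.
for w in (0.0, 3.0, 5.0):
    slope = numerical_derivative(f, w)
    print(f"w = {w}: slope is approximately {slope:.4f}")
```

Here, the slope is about -6 at w = 0 (the function is going down), about 0 at w = 3 (the function is flat), and about 4 at w = 5 (the function is going up).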

The Update Rule

Now that we know what the derivative (gradient) tells us, we can define how to update the weight. The update rule of gradient descent is as below:

w_new = w_old - α * f'(w_old)

Here:

  • w_old is the current value of the weight
  • w_new is the updated value of the weight
  • α (alpha) is the learning rate, a small positive number that controls the step size
  • f'(w_old) is the gradient (derivative) at the current weight

Now, the question is: why do we subtract the gradient?

Because we want to go downhill (reduce the loss). If the gradient is positive (slope going up), subtracting it moves us to the left (downhill). If the gradient is negative (slope going down), subtracting a negative number means adding, so we move to the right (also downhill). This way, no matter which side of the minimum we are on, we always move toward the minimum.

This is the beauty of the minus sign in the update rule.
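To see the minus sign at work, here is a minimal sketch of a single update applied from both sides of a minimum. It assumes the example loss f(w) = (w - 3)², whose derivative is 2(w - 3) and whose minimum is at w = 3. The starting points 0.0 and 6.0 and the helper names are chosen only for illustration.

```python
def gradient(w):
    # Derivative of the assumed example loss f(w) = (w - 3)^2
    return 2 * (w - 3)


def update(w_old, learning_rate=0.1):
    """One gradient descent step: w_new = w_old - learning_rate * f'(w_old)."""
    return w_old - learning_rate * gradient(w_old)


# Left of the minimum: the gradient is negative, so w increases toward 3.
from_left = update(0.0)
# Right of the minimum: the gradient is positive, so w decreases toward 3.
from_right = update(6.0)

print(f"from 0.0 -> {from_left:.2f}")   # moves right
print(f"from 6.0 -> {from_right:.2f}")  # moves left
```

Here, both starting points end up closer to 3 after one step, even though they moved in opposite directions.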

Step-by-Step Numeric Example

The best way to understand this is by taking a concrete example with actual numbers.

Let’s say our loss function is:

f(w) = (w - 3)²

Here, the minimum of this function is at w = 3 because (3 - 3)² = 0. But let’s assume the model does not know this. It starts with an initial guess and uses gradient descent to find the minimum.

The derivative of f(w) = (w - 3)² is:

f'(w) = 2(w - 3)

Now, let’s start with w = 0 and a learning rate α = 0.1, and apply the update rule step by step.

| Step | w (before) | Gradient: 2(w - 3) | New w: w - 0.1 * gradient |
|---|---|---|---|
| 1 | 0 | 2(0 - 3) = -6 | 0 - 0.1 * (-6) = 0.6 |
| 2 | 0.6 | 2(0.6 - 3) = -4.8 | 0.6 - 0.1 * (-4.8) = 1.08 |
| 3 | 1.08 | 2(1.08 - 3) = -3.84 | 1.08 - 0.1 * (-3.84) = 1.464 |
| 4 | 1.464 | 2(1.464 - 3) = -3.072 | 1.464 - 0.1 * (-3.072) = 1.7712 |
| 5 | 1.7712 | 2(1.7712 - 3) = -2.4576 | 1.7712 - 0.1 * (-2.4576) = 2.0170 |

Here, we can see that w is getting closer and closer to 3 with each step. The gradient is also getting smaller with each step, which means the steps are getting smaller as we approach the minimum. This is exactly how gradient descent converges to the minimum.
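The five steps above can be reproduced with a short loop. This is a minimal sketch using the same function, starting point, and learning rate as the example. The `history` list is only an assumption added here so the values can be inspected afterwards.

```python
w = 0.0
learning_rate = 0.1
history = []  # (step, w before, gradient, w after), kept for inspection

for step in range(1, 6):
    grad = 2 * (w - 3)                 # derivative of (w - 3)^2 at the current w
    new_w = w - learning_rate * grad   # the update rule
    history.append((step, w, grad, new_w))
    print(f"Step {step}: w = {w:.4f}, gradient = {grad:.4f}, new w = {new_w:.4f}")
    w = new_w
```

Here, substituting the gradient into the update rule gives w_new - 3 = (1 - 2α)(w_old - 3). With α = 0.1, each step multiplies the remaining distance to 3 by 0.8, which is why the steps keep shrinking as we approach the minimum.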

To learn gradient descent, loss functions, and optimizers hands-on with real projects, check out the AI and Machine Learning Program by Outcome School.

Gradient Descent with Multiple Parameters

In the example above, we had only one weight w. But in a real neural network, we have millions of weights. So, how does gradient descent handle multiple weights?

Let’s say we have two weights w1 and w2, and our loss function is L(w1, w2). We need to find how the loss changes with respect to each weight separately. This is called a partial derivative.

Think of it this way. Suppose we are adjusting a TV. The TV has two knobs - one for volume and one for brightness. To understand the effect of each knob, we turn one knob at a time while keeping the other fixed. That is exactly what a partial derivative does.

The partial derivative of L with respect to w1 is written as below:

∂L/∂w1

Here, the symbol ∂ is just a fancy way of writing “derivative with respect to one variable while keeping everything else fixed.”

The gradient is the collection of all partial derivatives. For two weights, the gradient is:

gradient = [∂L/∂w1, ∂L/∂w2]

And the update rule for each weight becomes:

w1_new = w1_old - α * ∂L/∂w1

w2_new = w2_old - α * ∂L/∂w2

Each weight gets updated independently using its own partial derivative. This is how gradient descent scales to millions or even billions of parameters in a neural network.
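A small sketch with two weights makes this concrete. The loss L(w1, w2) = (w1 - 1)² + (w2 + 2)² is a hypothetical example chosen here because its minimum at (1, -2) is easy to verify by hand. It does not come from the article.

```python
def loss(w1, w2):
    # Hypothetical two-weight loss with its minimum at w1 = 1, w2 = -2.
    return (w1 - 1) ** 2 + (w2 + 2) ** 2


def gradient(w1, w2):
    """Return the partial derivatives [dL/dw1, dL/dw2] at the point (w1, w2)."""
    dL_dw1 = 2 * (w1 - 1)   # w2 is treated as a constant
    dL_dw2 = 2 * (w2 + 2)   # w1 is treated as a constant
    return dL_dw1, dL_dw2


w1, w2 = 0.0, 0.0
learning_rate = 0.1

for _ in range(100):
    # Both partial derivatives are evaluated at the same (old) point
    # before either weight is changed.
    g1, g2 = gradient(w1, w2)
    w1, w2 = w1 - learning_rate * g1, w2 - learning_rate * g2

print(f"w1 = {w1:.4f}, w2 = {w2:.4f}, loss = {loss(w1, w2):.8f}")
```

Here, both weights are updated with the same rule, each using its own partial derivative. The same loop works for any number of weights once the gradient is stored as a list or an array.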

The Role of Learning Rate

The learning rate (α) controls how big each step is during gradient descent. Choosing the right learning rate is very important. Let’s see what happens with different learning rates using our same function f(w) = (w - 3)² starting from w = 0.

Learning rate too small (α = 0.01):

Step 1: gradient = 2(0 - 3) = -6, new w = 0 - 0.01 * (-6) = 0.06
Step 2: gradient = 2(0.06 - 3) = -5.88, new w = 0.06 - 0.01 * (-5.88) = 0.1188

Here, we can see that w is barely moving. In fact, after 20 steps at α = 0.01, w is only around 0.997, still nowhere near 3. Compare this with α = 0.1, which reached around 2.02 in just 5 steps. Training with a tiny learning rate will be extremely slow.

Learning rate just right (α = 0.1):

As we saw in our numeric example above, w moves steadily toward 3. This is the ideal case.

Learning rate too large (α = 1.5):

Step 1: gradient = 2(0 - 3) = -6, new w = 0 - 1.5 * (-6) = 9
Step 2: gradient = 2(9 - 3) = 12, new w = 9 - 1.5 * 12 = -9

Here, w jumped from 0 to 9 (overshooting past 3), and then from 9 to -9 (even farther away). The value is diverging instead of converging. This means the learning rate is too large and gradient descent will never find the minimum.

So, the learning rate must be chosen carefully. If it is too small, training is slow. If it is too large, training becomes unstable. In practice, values like 0.001 or 0.01 are commonly used as a starting point. For large models like Transformers, even smaller values like 1e-4 are common.
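The three cases can be compared side by side with a small experiment. This is a minimal sketch on the same function f(w) = (w - 3)² starting from w = 0. The helper name `run_gradient_descent` and the choice of 20 steps are assumptions made for this illustration.

```python
def run_gradient_descent(learning_rate, steps, w=0.0):
    """Run gradient descent on f(w) = (w - 3)^2 and return the final w."""
    for _ in range(steps):
        gradient = 2 * (w - 3)
        w = w - learning_rate * gradient
    return w


for learning_rate in (0.01, 0.1, 1.5):
    final_w = run_gradient_descent(learning_rate, steps=20)
    print(f"learning rate {learning_rate}: w after 20 steps = {final_w:.4f}")
```

Here, after 20 steps, α = 0.01 reaches only about 0.997, α = 0.1 reaches about 2.965, and α = 1.5 ends up millions of units away from 3. For this particular function, each step multiplies the distance to 3 by (1 - 2α), so the updates shrink that distance only when α is between 0 and 1. Other loss functions have a different safe range.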

Types of Gradient Descent

So far, we have worked with a simple function f(w) = (w - 3)² to learn the math. But in practice, the loss is computed over a training dataset. The dataset can be very large (millions of examples), and computing the gradient over all examples at once can be very slow. This is where different types of gradient descent come into the picture.

**Batch Gradient Descent:** This is what we have been discussing. It uses the entire training dataset to compute the gradient at each step. The gradient is accurate, but it is slow for large datasets.

**Stochastic Gradient Descent (SGD):** Instead of using the entire dataset, SGD uses only one randomly chosen data point to compute the gradient at each step. This makes each step much faster, but the gradient is noisy because it is based on just one example.

**Note:** In modern deep learning, the term “SGD” is often used loosely to mean mini-batch SGD. For example, torch.optim.SGD in PyTorch works with any batch size, not just one. The name stuck even though the batch size is usually greater than one in practice.

**Mini-Batch Gradient Descent:** This is the middle ground. It uses a small batch of data points (commonly 32, 64, or 128, and sometimes much larger for big models on modern GPUs) to compute the gradient. It is faster than batch gradient descent and less noisy than SGD.

Let me tabulate the differences for your better understanding:

| Type | Data Used Per Step | Speed | Gradient Accuracy |
|---|---|---|---|
| Batch Gradient Descent | Entire dataset | Slow | High |
| Stochastic Gradient Descent | 1 data point | Fast | Low (noisy) |
| Mini-Batch Gradient Descent | A batch (e.g., 32 to 1024) | Moderate | Moderate |

In practice, mini-batch gradient descent is the most commonly used approach because it provides a good balance between speed and accuracy.
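The three types differ only in how many examples feed each gradient computation, so one training loop can cover all of them by changing the batch size. The sketch below is an illustration under stated assumptions: the synthetic dataset (y is roughly 2x with a little noise), the one-weight model prediction = w * x, the learning rate, and the helper names are all made up for this example.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable

# Synthetic data (an assumption for this sketch): y = 2x plus small noise.
xs = [i / 10 for i in range(1, 33)]
ys = [2.0 * x + rng.gauss(0, 0.05) for x in xs]


def batch_gradient(w, batch):
    """Gradient of the MSE loss for the model prediction = w * x on one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def train(batch_size, epochs=50, learning_rate=0.05):
    w = 0.0
    data = list(zip(xs, ys))
    for _ in range(epochs):
        rng.shuffle(data)  # visit the examples in a new random order each epoch
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            w -= learning_rate * batch_gradient(w, batch)
    return w


print(f"SGD (batch size 1):         w = {train(1):.3f}")
print(f"Mini-batch (batch size 4):  w = {train(4):.3f}")
print(f"Batch (batch size {len(xs)}):      w = {train(len(xs)):.3f}")
```

Here, all three runs end near w = 2, but they get there differently. With a batch size of 1, the weight is updated 32 times per pass over the data using noisy gradients. With the full batch, it is updated once per pass using the exact gradient of the dataset loss.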

To master SGD, mini-batch gradient descent, and hyperparameter tuning hands-on, we have a complete program - check out the AI and Machine Learning Program by Outcome School.

Gradient Descent in Python

Now, let’s see gradient descent in action with Python code. We will use the same function f(w) = (w - 3)² as below:

w = 0.0
learning_rate = 0.1

for step in range(50):
    gradient = 2 * (w - 3)
    w = w - learning_rate * gradient
    loss = (w - 3) ** 2
    print(f"Step {step + 1}: w = {w:.4f}, loss = {loss:.6f}")

Here, we start with w = 0.0 and a learning rate of 0.1. At each step, we compute the gradient, update the weight, and print the current value of w and the loss.

The final step prints:

Step 50: w = 3.0000, loss = 0.000000

After 50 steps, w is essentially 3 and the loss is essentially 0. The model has found the minimum.

This is how gradient descent works in code. In real deep learning frameworks like PyTorch and TensorFlow, the same principle is applied but the gradients are computed automatically using backpropagation. We have a detailed blog on Math Behind Backpropagation that explains how these gradients are actually computed step by step.

Putting It All Together

Let’s recap what we have learned:

  • The loss function measures how wrong the model’s prediction is.
  • Gradient descent is the algorithm that minimizes the loss by adjusting the weights.
  • The gradient (derivative) tells us the direction and steepness of the slope.
  • The update rule w_new = w_old - α * gradient moves the weight toward the minimum.
  • The learning rate controls how big each step is.
  • In practice, mini-batch gradient descent is used for training large models.

This is how the math behind gradient descent works, and this is the foundation of how every neural network learns.

Prepare yourself for AI Engineering Interview: AI Engineering Interview Questions

That’s it for now.

Thanks

Amit Shekhar Founder @Outcome School

