Gradient Descent from Spiderman: All in One Plate

Abhishek Biswas
7 min read · Mar 20, 2023


Understanding gradient descent in layman's terms: a story-first approach with only a little math and a few tiny code sketches.

Table of contents:

  • Understanding Gradient Descent from Spiderman and the Goblin
  • Associated Terminologies
  • Types of Gradient Descent
  • Why is gradient descent called gradient descent?
  • Tabular comparison

Understanding Gradient Descent from Spiderman and the Goblin:

Spiderman swung through the city using spider-webs, following the Goblin as he darted through the streets. The Goblin was up to something, and Spiderman was determined to find out what it was.

As Spiderman followed the Goblin, he noticed that the streets were sloping downward. The gradient slope was leading them towards the river. Spiderman knew that he needed to catch the Goblin before he reached the water.

But as Spiderman followed the slope downwards, he found himself getting closer and closer to the river. He realized that he was stuck in a local optimum, a point where the gradient slope had plateaued and was no longer leading him towards the global optimum, the point of minimum loss.

Spiderman knew he needed to find a way out of the local optimum and get back on track towards the global optimum. He had to converge towards the minimum-loss point so that he could intercept the Goblin as soon as possible. He decided to adjust his learning rate, the step size taken in each iteration, to help him find the global optimum more quickly.

With a lower learning rate, Spiderman could take smaller steps and follow the gradient slope more closely, avoiding getting trapped in local optima. With each step, Spiderman moved closer to the global optimum, his algorithm converging towards the minimum-loss point.

But the Goblin wasn’t giving up. He tried to lead Spiderman astray by running through the alleys and changing direction frequently. Spiderman had to adjust his learning rate and be careful not to overshoot the minimum-loss point, or he would diverge from the solution; whether he got stuck or diverged, he would lose the Goblin.

Finally, after many iterations, Spiderman caught up to the Goblin and saved the day. He had used gradient descent to optimize his path and converge towards the global optimum. By adjusting his web-shooter each time he shot a web to swing from one position to the next (his learning rate), he avoided getting stuck in local optima and avoided diverging from the solution. With his powerful algorithm, Spiderman had defeated the Goblin.

Terminologies:

Cost function:

  • The cost function is a measure of how well a machine learning model is performing. In the story, Spiderman wants to catch the Goblin as quickly as possible, so his cost function is the time it takes to catch him.
  • The goal of gradient descent is to minimize the cost function by adjusting the model’s parameters. A small code sketch of one such cost function follows this list.
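
To see what a cost function can look like outside the story, here is a tiny Python sketch; the straight-line model y ≈ w*x + b and all of the data are made up purely for illustration:

```python
# Minimal sketch of a cost function: the mean squared error of a line
# y ≈ w*x + b on some made-up data. A lower cost means a better fit.

data = [(x, 2 * x + 1) for x in range(10)]   # synthetic points on y = 2x + 1

def cost(w, b):
    errors = [((w * x + b) - y) ** 2 for x, y in data]
    return sum(errors) / len(errors)

print(cost(0.0, 0.0))  # a bad guess: high cost (133.0 on this data)
print(cost(2.0, 1.0))  # the true line: cost is 0.0
```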

Learning rate:

  • The learning rate determines the size of the steps taken in gradient descent. In the story, Spiderman adjusts his learning rate to take smaller steps when he gets close to the river.
  • If the learning rate is too high, Spiderman might overshoot the target and diverge from the solution. If it’s too low, he might take too long to reach the target. A small code sketch after this list shows both failure modes.
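
A minimal sketch of how the learning rate changes behaviour, assuming a toy one-dimensional cost f(x) = x² (nothing here comes from the story; the numbers are only illustrative):

```python
# Minimal sketch: the effect of the learning rate in 1-D gradient descent.
# f(x) = x**2 has its minimum at x = 0, and its gradient is f'(x) = 2*x.

def gradient_descent(learning_rate, start=5.0, steps=20):
    x = start
    for _ in range(steps):
        grad = 2 * x                   # gradient of f at the current point
        x = x - learning_rate * grad   # step against the gradient
    return x

print(gradient_descent(0.001))  # too low: after 20 steps, still far from 0
print(gradient_descent(0.1))    # reasonable: ends up very close to 0
print(gradient_descent(1.1))    # too high: overshoots and blows up
```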

Optimization:

  • Optimization refers to the process of finding the best values of the model’s parameters to minimize the cost function. In the story, Spiderman is optimizing his path to catch the Goblin as quickly as possible.
  • Gradient descent is an optimization algorithm used in machine learning to find the optimal values of the model’s parameters.

Local minimum:

  • A local minimum is a point in the cost function where the gradient slope plateaus and no longer leads towards the global minimum. In the story, Spiderman gets stuck in a local minimum when he gets too close to the river. The Goblin preferred to lure Spiderman towards the river, where it is harder for Spiderman to swing from wall to wall; he was using the local minimum as a decoy for Peter (a.k.a. Spiderman).
  • To avoid getting stuck in local minima, Spiderman adjusts his learning rate and takes smaller steps.

Global minimum:

  • The global minimum is the point in the cost function with the lowest value. In the story, the global minimum is the point where Spiderman catches the Goblin as quickly as possible.
  • Gradient descent aims to converge towards the global minimum by following the gradient slope of the cost function. A small sketch after this list shows how the starting point alone can decide whether you end up in a local or the global minimum.
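
Here is a small illustrative sketch (again with made-up numbers) of a cost curve that has both a shallow local minimum and a deeper global minimum; plain gradient descent lands in one or the other depending only on where it starts:

```python
# Minimal sketch: gradient descent can end in a local or the global minimum
# depending on the starting point.
# f(x) = x**4 - 2*x**2 + 0.5*x has a deeper (global) valley near x ≈ -1
# and a shallower (local) valley near x ≈ +1.

def grad_f(x):
    return 4 * x**3 - 4 * x + 0.5   # derivative of f

def descend(start, learning_rate=0.01, steps=500):
    x = start
    for _ in range(steps):
        x -= learning_rate * grad_f(x)
    return x

print(descend(start=-2.0))  # settles near the global minimum (around -1.06)
print(descend(start=2.0))   # gets stuck in the local minimum (around +0.93)
```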

Types of Gradient Descent:

After catching the Goblin a few times, Spiderman (a.k.a. Peter) decides to create a set of algorithms so that, in the future, he can use them to optimize his chasing technique and catch the Goblin faster.

Stochastic Gradient Descent:

He realizes that he doesn’t need to swing through the entire city to catch the Goblin. Instead, he can swing through just a few random buildings and still catch him.

In Spiderman’s case, the cost function is the time it takes to catch the Goblin.

Spiderman randomly selects a single building (or just a handful of them) to swing through, rather than following the entire gradient slope. He estimates the gradient slope of the cost function from this tiny random sample of data (his past moves while he was chasing the Goblin).

This gradient slope tells him which direction he needs to move in to catch the Goblin as quickly as possible.

Spiderman then adjusts his position by taking a step in the opposite direction of the gradient slope, multiplied by the learning rate. The learning rate determines how big of a step he takes.

He repeats this calculation many times to find an optimal path towards the global minimum (the lowest point, where he can catch the Goblin most quickly).

Stochastic Gradient Descent can be faster than other optimization algorithms because it uses smaller, more frequent steps. However, it can also be noisier because it does not use all the available data for each update. In Spiderman’s case, it ensures that he’s still taking a good path to catch the Goblin, but it might not be the absolute best path.
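
A rough Python sketch of the stochastic idea, using a made-up line-fitting problem y ≈ w*x + b (none of the data or names come from the article): the parameters are nudged after looking at one randomly chosen example at a time.

```python
import random

# Minimal sketch of stochastic gradient descent: fit y ≈ w*x + b by updating
# w and b after every single randomly chosen example.
# The data is synthetic, generated from the line y = 2x + 1.

data = [(x, 2 * x + 1) for x in range(10)]
w, b = 0.0, 0.0
learning_rate = 0.01

for step in range(5000):
    x, y = random.choice(data)        # one random example per update
    error = (w * x + b) - y           # prediction error on that example
    w -= learning_rate * error * x    # gradient of the squared error w.r.t. w
    b -= learning_rate * error        # gradient of the squared error w.r.t. b

print(w, b)  # ends up close to w = 2, b = 1, after a noisy journey
```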

Batch Gradient Descent:

Spiderman starts to notice a pattern in his path. He decides to use all the previous data he collected to optimize his path. This way, he can make sure that his path is the best one possible, using all the information available to him.

To do this, Spiderman uses a machine learning algorithm called Batch Gradient Descent. Batch Gradient Descent uses the entire dataset to calculate the gradient slope of the cost function. In Spiderman’s case, the cost function is the time it takes to catch the Goblin.

To optimize his path using Batch Gradient Descent, Spiderman first randomly selects a starting point. Then, he calculates the gradient slope of the cost function using all the previous data he collected. This gradient slope tells him which direction he needs to move in to catch the Goblin as quickly as possible.

Spiderman then adjusts his position by taking a step in the opposite direction of the gradient slope, multiplied by the learning rate. The learning rate determines how big of a step he takes.

He repeats this process over and over until he reaches the global minimum, which is the point where he catches the Goblin as quickly as possible.
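
For contrast with the stochastic sketch above, here is the batch version on the same made-up line-fitting problem: every update uses the gradient averaged over the entire dataset.

```python
# Minimal sketch of batch gradient descent: fit y ≈ w*x + b, where every
# update uses the gradient averaged over ALL of the (synthetic) examples.

data = [(x, 2 * x + 1) for x in range(10)]
w, b = 0.0, 0.0
learning_rate = 0.05

for step in range(1000):
    grad_w = grad_b = 0.0
    for x, y in data:                 # a full pass over the dataset per step
        error = (w * x + b) - y
        grad_w += error * x
        grad_b += error
    w -= learning_rate * grad_w / len(data)   # average gradient, scaled by the rate
    b -= learning_rate * grad_b / len(data)

print(w, b)  # converges smoothly to roughly w = 2, b = 1
```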

But while Peter was doing these calculations, he understood that although batch gradient descent can be a very good algorithm, computing the gradient over his entire history of past moves at every step is very time consuming. That pushed him to think about working with small random subsets of his data instead.

Mini-batch gradient descent:

Mini-batch gradient descent combines ideas from both batch gradient descent and stochastic gradient descent. The training dataset is divided into small batches, and an update is run on each of those batches. This strikes a balance between the computational efficiency of batch gradient descent and the speed of stochastic gradient descent.

Why Mini-batch gradient descent?

Because each update is computed on only a small batch, every step is much cheaper than a full pass over the data, while the averaged gradient is less noisy than the single-example updates of stochastic gradient descent. This compromise is why mini-batch gradient descent is the variant most commonly used in practice.
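
A matching sketch of the mini-batch version on the same made-up problem, where each update averages the gradient over a small random batch:

```python
import random

# Minimal sketch of mini-batch gradient descent: fit y ≈ w*x + b, where each
# update averages the gradient over a small random batch of (synthetic) examples.

data = [(x, 2 * x + 1) for x in range(10)]
w, b = 0.0, 0.0
learning_rate = 0.02
batch_size = 4

for step in range(2000):
    batch = random.sample(data, batch_size)   # a small random batch per update
    grad_w = grad_b = 0.0
    for x, y in batch:
        error = (w * x + b) - y
        grad_w += error * x
        grad_b += error
    w -= learning_rate * grad_w / batch_size
    b -= learning_rate * grad_b / batch_size

print(w, b)  # noisier than the batch version, cheaper per step, still near w = 2, b = 1
```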

Why is gradient descent called gradient descent?

Gradient descent is called so because it involves descending along the gradient (or slope) of a function in order to find its minimum value.

In machine learning, we often use gradient descent to optimize a model’s parameters such as weights and biases, by minimizing a cost or loss function. The gradient of this cost function is the vector of partial derivatives of the function with respect to each parameter, and it tells us the direction in which the function increases the most rapidly.

To find the minimum of the function, we need to move in the opposite direction of the gradient. By taking small steps in the direction of the negative gradient, we gradually descend along the function towards its minimum value. This process is repeated iteratively until the algorithm converges to a minimum.
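
In symbols, a single update step is usually written as θ ← θ − η · ∇J(θ), where θ stands for the model’s parameters, η is the learning rate, and ∇J(θ) is the gradient of the cost function at the current parameters.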

Therefore, gradient descent is called gradient descent because it involves descending along the gradient of the function to find its minimum value.

Tabular comparison:
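
In short, the three variants can be compared like this:

| Variant | Data used per update | Cost per update | Gradient noise |
| --- | --- | --- | --- |
| Batch gradient descent | The entire dataset | High (slow on large datasets) | Low (smooth, stable path) |
| Stochastic gradient descent | A single example | Very low (fast updates) | High (noisy, zig-zag path) |
| Mini-batch gradient descent | A small random batch | Low | Moderate (a practical middle ground) |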



Written by Abhishek Biswas

Technologist | Writer | Mentor | Industrial Ambassador | Mighty Polymath
