Please note that I will not be covering the mathematical portions, but rather the big ideas.
Supervised learning can come in different forms, but at the end of the day we are trying to learn a function that maps our training inputs to outputs. When that function is a good predictor of a continuous output given some inputs, we have a regression problem. When the function's outputs are limited to a few discrete values, we have a classification problem.
Cost functions are used to measure the accuracy of our hypothesis function (the function we learned) by measuring the difference between our predicted values and the true outputs, then computing the average error by way of the "Mean Squared Error" function. In a single-variable regression problem the hypothesis function reduces to the equation of a line.
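To make that concrete, here is a minimal sketch of a Mean Squared Error cost function for a line hypothesis. The names (`hypothesis`, `mean_squared_error`, `theta0`, `theta1`) are my own illustrative choices, not from the original course material:

```python
# Hypothesis for single-variable regression: the equation of a line.
def hypothesis(theta0, theta1, x):
    return theta0 + theta1 * x

# Mean Squared Error: average of squared differences between
# predicted values and the true outputs.
def mean_squared_error(theta0, theta1, xs, ys):
    m = len(xs)
    return sum((hypothesis(theta0, theta1, x) - y) ** 2
               for x, y in zip(xs, ys)) / m

# Example: the points lie on y = 2x, so theta0=0, theta1=2 is a perfect fit.
xs = [1, 2, 3]
ys = [2, 4, 6]
print(mean_squared_error(0, 2, xs, ys))  # 0.0
```

A perfect hypothesis gives a cost of zero; the worse the fit, the larger the average squared error.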
The objective becomes to minimize the cost (or error) function. If you have ever taken calculus, you can do that by taking the derivative of the function, setting it equal to zero, solving for the parameter, and using that value in your learned function.
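As a sketch of that calculus approach, take the simplest possible model, a line through the origin, h(x) = theta * x (a simplification I'm choosing for illustration). Setting the derivative of the squared-error cost to zero and solving gives a closed-form answer:

```python
# For h(x) = theta * x, the MSE derivative is zero when
# theta = sum(x * y) / sum(x * x)  -- the one-parameter least-squares solution.
def solve_theta(xs, ys):
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

xs = [1, 2, 3]
ys = [2, 4, 6]
print(solve_theta(xs, ys))  # 2.0 -- recovers the slope of y = 2x
```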
However, that isn't always easy with a large set of parameters. You can also use contour plots, which act like maps pointing toward the parameter values that minimize the cost function.
An easy way to think about gradient descent: imagine you're a blind person trying to find a ball in a hilly area. You don't know where the ball is, but you know it has rolled into the deepest valley around. Because you can't see the depth of your surroundings, you use your feet to feel the steepness of the ground in front of you, and you take small steps in the steepest downhill direction. This is exactly what the gradient descent algorithm does. It takes small steps, gauges the steepness (the derivative of the cost function), and moves in that direction until it reaches the minimum of the cost function.
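The "feel the steepness, step downhill" loop can be sketched in a few lines. This uses the same one-parameter model h(x) = theta * x for simplicity, and the learning rate (`alpha`) and step count are illustrative choices, not values from the course:

```python
# Minimal gradient descent for h(x) = theta * x.
def gradient_descent(xs, ys, alpha=0.01, steps=1000):
    theta = 0.0  # start anywhere on the "hill"
    m = len(xs)
    for _ in range(steps):
        # Feel the steepness: derivative of MSE with respect to theta.
        grad = (2 / m) * sum(x * (theta * x - y) for x, y in zip(xs, ys))
        theta -= alpha * grad  # take a small step downhill
    return theta

xs = [1, 2, 3]
ys = [2, 4, 6]
print(round(gradient_descent(xs, ys), 4))  # 2.0
```

The learning rate controls the size of each step: too large and you overshoot the valley, too small and it takes many steps to get there.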
You can find the first part of this series here.