Near the end of the last post I started talking about gradient descent. Now for the most part you can think of gradient as surface that your trying to find the lowest point in. However once you get into systems that require you factor in more then just one feature it stops being a 3rd surface and become an abstract idea of being a surface. Now you have to optimize for N different cost functions that represent the N number of features.

To help in do this gradient descent effectively you have to properly calibrate the learning rate and adjust your training set using feature scaling. Each of these two things will help in making gradient descent find the right solution faster.

Feature Scaling: You want to do this when there are wild discrepancies in the range of values. For example for one of the features could be the size of the house in meter squared (in the 100s), and other could be number of previous owns ( 1- 5). You want to think about these types of things because it may cause your gradient decent algorithm to jump back and fort, making it harder to find the global minima. When you do feature scaling you are simply trying to get all your features into the same range of values.

Learning Rate: As you know the learning rate is effectually your step size as you go down the 3rd surface, trying to find the global minima. Some times the gradient descent algorithm may step so far ahead it may miss the minima, and keeps missing it since its step size (learning rate) is so high. However if you making your learning rate too small it may take a very long time before you find your minima since your have to take so many more steps.

Another way to find the values that minimize the cost function is to use the Normal equation.

Where theta is the value that minimizes the cost function.

Where X is your feature matrix

Where Y is the known output

Using the Normal equation we can compute these values in a straight forward process without the use of iteration. However this comes at a price of speed when the feature matrixes get very big, since to compute the inverse of a (n x n) matrix is roughly O(n^3). In these cases gradient descent is going to be the better choice.