Getting Started with Matlab – Part 1:
A lot of the math used for Machine Learning (ML) is linear algebra; in fact, a lot of engineering in general is based on it. A matrix of numbers could represent anything from a circuit-based control system to a concrete pillar under a bridge. This also means that CAD software runs on it as well. It’s basically everywhere.
Linear algebra can be done by hand, or you can do it using a programming language. You could be hardcore and write your own super-optimized matrix inversion algorithms in C, or use Python’s NumPy, or even R. However, the industry standard is Matlab, which holds a virtual monopoly over the realm of engineering programming languages. When I say “engineering” I mean the hard, physically based stuff like designing aircraft, rockets, or buildings.
Some would argue that Matlab is easier than Python. It is, but only for linear algebra; you would not want to build your website in it. Matlab works through a REPL interface: all your variables are stored in memory and persist in your workspace, and this workspace can be saved and loaded again when you need it.
Below are some of Matlab’s more basic commands.
Near the end of the last post I started talking about gradient descent. For the most part you can think of the cost function as a surface on which you’re trying to find the lowest point. However, once you get into systems that require you to factor in more than one feature, it stops being a 3D surface and becomes the abstract idea of a surface. Now you have to minimize a single cost function over N parameters, roughly one for each of the N features.
To do gradient descent effectively you have to properly calibrate the learning rate and adjust your training set using feature scaling. Both of these help gradient descent find the right solution faster.
Feature Scaling: You want to do this when there are wild discrepancies in the range of values between features. For example, one of the features could be the size of the house in square meters (in the 100s), and another could be the number of previous owners (1 to 5). You want to think about these things because they may cause your gradient descent algorithm to jump back and forth, making it harder to find the global minimum. When you do feature scaling you are simply trying to get all your features into the same range of values.
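As a rough illustration of the idea, here is a small Python sketch of one common way to scale features, mean normalization. The function name `mean_normalize` and the sample numbers are made up for this example; other scaling schemes work too.

```python
def mean_normalize(values):
    """Scale one feature column to roughly the range [-0.5, 0.5].

    Each value has the column mean subtracted and is then divided by
    the column's range (max - min).
    """
    mean = sum(values) / len(values)
    spread = max(values) - min(values)
    if spread == 0:
        # A constant feature carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mean) / spread for v in values]


# Hypothetical training data: two features on very different scales.
house_sizes = [100.0, 150.0, 200.0, 250.0, 300.0]  # square meters
previous_owners = [1.0, 2.0, 3.0, 4.0, 5.0]

scaled_sizes = mean_normalize(house_sizes)
scaled_owners = mean_normalize(previous_owners)
```

After scaling, both features live in the same range, so neither one dominates the steps that gradient descent takes.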
Learning Rate: As you know, the learning rate is effectively your step size as you go down the 3D surface, trying to find the global minimum. Sometimes the gradient descent algorithm may step so far ahead that it misses the minimum, and it keeps missing it because its step size (learning rate) is too high. However, if you make your learning rate too small, it may take a very long time to find the minimum, since you have to take so many more steps.
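To see both failure modes, here is a toy Python sketch that runs gradient descent on the simple cost J(theta) = theta^2, whose minimum is at theta = 0. The function name `descend`, the starting point, and the three learning rates are arbitrary choices for illustration.

```python
def descend(learning_rate, steps=50, start=10.0):
    """Run gradient descent on J(theta) = theta**2 and return the final theta."""
    theta = start
    for _ in range(steps):
        gradient = 2 * theta  # derivative of theta**2
        theta = theta - learning_rate * gradient
    return theta


too_large = descend(1.1)     # overshoots the minimum and gets further away each step
too_small = descend(0.001)   # moves in the right direction, but barely
reasonable = descend(0.1)    # ends up very close to the minimum at 0
```

With the same number of steps, the large rate ends up far worse than where it started, the tiny rate has hardly moved, and the moderate rate has essentially arrived.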
Another way to find the values that minimize the cost function is to use the Normal equation:
theta = (X^T X)^(-1) X^T Y
Where theta is the value that minimizes the cost function,
X is your feature matrix,
and Y is the known output.
Using the Normal equation we can compute these values in a straightforward process without iteration. However, this comes at the price of speed when the feature matrices get very big, since computing the inverse of an (n x n) matrix is roughly O(n^3). In these cases gradient descent is going to be the better choice.
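Here is a minimal pure-Python sketch of the Normal equation on a tiny made-up data set. All the helper names (`transpose`, `matmul`, `solve`, `normal_equation`) are hypothetical, and instead of explicitly inverting X^T X it solves the equivalent linear system (X^T X) theta = X^T Y with Gaussian elimination, which gives the same theta. In practice you would let Matlab or NumPy do this for you.

```python
def transpose(m):
    return [list(row) for row in zip(*m)]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def solve(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    aug = [list(a[i]) + [b[i]] for i in range(n)]  # augmented matrix [a | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular (or very close to it)")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        known = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - known) / aug[i][i]
    return x


def normal_equation(X, y):
    """Return theta satisfying (X^T X) theta = X^T y."""
    Xt = transpose(X)
    XtX = matmul(Xt, X)
    Xty = [sum(a * b for a, b in zip(row, y)) for row in Xt]
    return solve(XtX, Xty)


# Made-up data generated from y = 1 + 2x; the first column of ones is the intercept term.
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]]
y = [3.0, 5.0, 7.0, 9.0]
theta = normal_equation(X, y)  # approximately [1.0, 2.0]
```

Note that there is no learning rate and no loop over training steps here; all the cost sits in the elimination step, which is the part that grows roughly as O(n^3) with the number of features.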
Please note that I will not be covering the mathematical portions, but rather the big ideas.
The function we learn can come in different forms; however, at the end of the day we are trying to learn a function that maps our training inputs to outputs, such that it becomes a good predictor of an output given an input. This is a regression problem. Functions whose outputs are limited to a few discrete values, given various inputs, are used for classification problems.
Cost functions are used to measure the accuracy of our hypothesis function (the function we learned) by measuring the difference between our predicted value and the true output, and then computing the average error by way of the “Mean Squared Error” function. In a single-variable regression problem the hypothesis function reduces to the equation of a line.
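As a small Python sketch of that idea, here is a straight-line hypothesis and its mean squared error on a made-up data set. The names `hypothesis` and `mse_cost` are just for this example. Some courses divide by 2m instead of m to make the calculus tidier; that changes the number but not where the minimum is.

```python
def hypothesis(theta0, theta1, x):
    """Single-variable hypothesis: the equation of a line."""
    return theta0 + theta1 * x


def mse_cost(theta0, theta1, xs, ys):
    """Average of the squared differences between predictions and true outputs."""
    m = len(xs)
    total = sum((hypothesis(theta0, theta1, x) - y) ** 2 for x, y in zip(xs, ys))
    return total / m


# Made-up training data that lies exactly on the line y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

perfect_fit = mse_cost(0.0, 2.0, xs, ys)  # the line y = 2x has zero error
poor_fit = mse_cost(0.0, 1.0, xs, ys)     # the line y = x misses every point
```

The lower the cost, the better the line fits the training data.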
The objective becomes to minimize the cost (or error) function. If you have ever taken Calculus, you can do that by taking the derivative of the function, setting it equal to zero, solving for the parameter, and using that value in your learned function.
However, it’s not always easy when given a large set of parameters. Therefore you can also use contour plots, which act like maps to the parameter values that minimize the cost function.
An easy way to think about gradient descent is to imagine you’re a blind person trying to find a ball in a hilly area. You don’t know where the ball is, but you know it has rolled into the deepest valley in the area. Since you’re blind, you can’t see the terrain around you, so you have to use your feet to feel the steepness of the ground. You then take little steps in the steepest downhill direction. This is exactly what the gradient descent algorithm does: it takes a little step, gauges the steepness, and moves in that direction until it reaches a minimum of the cost function, where the derivative is zero.
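The walk down the hill can be sketched in a few lines of Python. This is a bare-bones version for a single-feature line fit using the mean squared error as the cost; the function name `gradient_descent`, the learning rate, the step count, and the data are all assumptions picked for the example.

```python
def gradient_descent(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y = theta0 + theta1 * x by repeatedly stepping downhill on the MSE."""
    theta0, theta1 = 0.0, 0.0
    m = len(xs)
    for _ in range(steps):
        # How far off each prediction currently is.
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        # "Feel the steepness": partial derivatives of the MSE.
        grad0 = (2.0 / m) * sum(errors)
        grad1 = (2.0 / m) * sum(e * x for e, x in zip(errors, xs))
        # Take a little step in the downhill direction.
        theta0 -= learning_rate * grad0
        theta1 -= learning_rate * grad1
    return theta0, theta1


# Made-up data generated from y = 1 + 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
theta0, theta1 = gradient_descent(xs, ys)  # close to (1.0, 2.0)
```

Each pass through the loop is one "step with your feet": measure the slope at the current position, then move a little way against it.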
You can find the first part of this series here.
You can see why I started this series here.
What is machine learning?
- My definition:
- When you tell a machine to learn from experience rather than explicitly giving it a bunch of instructions.
- Course Definitions:
- “the field of study that gives computers the ability to learn without being explicitly programmed.” – Arthur Samuel
- “A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.” – Tom Mitchell
You can think of this like teaching a small child how to do something you already know, such as counting objects or throwing a ball. Another way of thinking about it is that you give the machine the data, knowing that there is some relationship there, and then have the machine find it by itself.
This type of learning comes in two different flavours:
Given a bunch of data, we are asked to predict what will happen next. An example could be: “given all the historical data about housing prices, what will be the price of a house in 2020?” We are mapping input data to a continuous output.
Take the input data and give me discrete outputs (classifications). For example, you could take data on students and predict which students will become engineers. Here we still know what factors really influence the result, which still makes it supervised learning.
We have mountains of data that we think is random and has no structure. We have no idea what the relationships are between the variables. So we let our machine loose on the data to discover the relationships between the different variables. And it starts to cluster the data into different piles.
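To make the "piles" idea concrete, here is a tiny Python sketch using k-means, which is one common clustering technique (this post does not commit to a specific one). The function name `kmeans_1d`, the one-dimensional data, and the starting guesses for the pile centers are all made up for illustration.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Group 1-D points into piles around the given starting centroids."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: put every point in the pile with the nearest center.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the average of its pile.
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Unlabeled, made-up data with two obvious groups.
points = [1.0, 1.5, 2.0, 9.0, 9.5, 10.0]
centroids, clusters = kmeans_1d(points, centroids=[0.0, 5.0])
```

Nobody told the algorithm which points belong together; it only got the raw numbers and the number of piles to look for.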