Neural Network Types

I have been working on my capstone project for the last little bit. It involves using neural networks to solve the problem of segmenting medical images.

Here is a little of what I have learned. I think I am going to start doing a series about this, mainly because machine learning is easier than it seems, and the more people who realize that, the more innovation will happen 😁

* You will want a good understanding of a basic, generic neural network before reading on.

So let's get on with it. There are several different types of neural networks, such as Generative Adversarial Networks (GANs), Convolutional Neural Networks (CNNs), and Recurrent Neural Networks (RNNs). Each has its own area and application where it works best. However, they all generally use the same principles.

In GANs, one part of the NN is called the generator. The generator generates new data instances, while the other part, the discriminator, evaluates them for authenticity. The discriminator decides whether each instance of data it reviews belongs to the actual training dataset or not. The goal of the discriminator, when shown an instance from the real world, is to recognize it as authentic.

The generalized flow of GAN events is as follows: the generator takes in random numbers and returns an image. The generated image is fed into the discriminator alongside a stream of images taken from the actual dataset. The discriminator takes in both real and fake images and returns a probability for each, a number between 0 and 1, with 1 representing a prediction of authenticity and 0 representing fake. The system then enters a double feedback loop: the discriminator is in a feedback loop with the ground truth, and the generator is in a feedback loop with the discriminator.
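To make the double feedback loop a bit more concrete, here is a minimal sketch of the two error signals. It assumes binary cross-entropy as the loss, which is a common choice but is my assumption here, and the probabilities are made-up numbers rather than the output of a real network. The function names are just for illustration.

```python
import math

def binary_cross_entropy(prediction, target):
    """Loss for one probability in (0, 1) against a 0/1 target."""
    eps = 1e-12  # keeps log() away from zero
    prediction = min(max(prediction, eps), 1.0 - eps)
    return -(target * math.log(prediction)
             + (1 - target) * math.log(1.0 - prediction))

def discriminator_loss(real_probs, fake_probs):
    """Discriminator vs. ground truth: real images should score 1, fakes 0."""
    real_loss = sum(binary_cross_entropy(p, 1) for p in real_probs) / len(real_probs)
    fake_loss = sum(binary_cross_entropy(p, 0) for p in fake_probs) / len(fake_probs)
    return real_loss + fake_loss

def generator_loss(fake_probs):
    """Generator vs. discriminator: the generator wants its fakes scored as 1."""
    return sum(binary_cross_entropy(p, 1) for p in fake_probs) / len(fake_probs)

# Hypothetical discriminator outputs for a batch of real and generated images.
real_probs = [0.9, 0.8, 0.95]
fake_probs = [0.1, 0.3, 0.2]

print("discriminator loss:", discriminator_loss(real_probs, fake_probs))
print("generator loss:", generator_loss(fake_probs))
```

With these numbers the discriminator is doing well (low loss) and the generator is doing badly (high loss). In training, each side would update its own weights to push its own loss down.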


CNNs for semantic image segmentation (SIS) are similar to ordinary GANs in the sense that they are made up of two main parts. The first part is known as the encoder, which is responsible for extracting the features of the image. The second part is known as the decoder, which is responsible for decoding those features back into an image-sized output.


The encoding part of a CNN is a stack of Convolutional (C), Activation (A), and Pooling (P) layers. In the convolutional layer, filters are passed along the image, taking dot products to create feature maps. The result gets passed through an activation layer. If the CA step gets repeated too many times the feature maps start to degrade, therefore P layers are used. These P layers summarize small regions of the feature map (for example by averaging them or keeping the maximum), which helps pick out the key features. Most CNN architectures have several CAP stacks before the result gets passed into the decoding part of the CNN.
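Here is a rough sketch of one CAP stack in plain Python, so you can see the dot products and the pooling happen. The tiny image, the edge-detecting filter, and the function names are all made up for illustration. Real networks learn their filters and use optimized libraries.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1),
    taking a dot product at each position."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    feature_map = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            total = 0.0
            for i in range(k_rows):
                for j in range(k_cols):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        feature_map.append(row)
    return feature_map

def relu(feature_map):
    """Activation layer: negative values become zero."""
    return [[max(0.0, v) for v in row] for row in feature_map]

def average(values):
    return sum(values) / len(values)

def pool2x2(feature_map, reduce=max):
    """Summarize each non-overlapping 2x2 window with `reduce`."""
    pooled = []
    for r in range(0, len(feature_map) - 1, 2):
        row = []
        for c in range(0, len(feature_map[0]) - 1, 2):
            window = [feature_map[r][c], feature_map[r][c + 1],
                      feature_map[r + 1][c], feature_map[r + 1][c + 1]]
            row.append(reduce(window))
        pooled.append(row)
    return pooled

# A 5x5 image with a vertical edge between the 2nd and 3rd columns.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
# A hand-picked filter that responds to dark-to-bright vertical edges.
kernel = [[-1, 1],
          [-1, 1]]

activated = relu(convolve2d(image, kernel))   # C then A
print(pool2x2(activated))                     # P with max pooling
print(pool2x2(activated, reduce=average))     # P with average pooling
```

The 5x5 image shrinks to a 4x4 feature map and then to a 2x2 pooled map, and the edge still shows up in the left-hand column of the result.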


The decoding part of the NN goes through the inverse operations of the encoder, since by the time the feature maps reach the decoder they have been significantly compressed. The layers in the decoding portion go through the process of deconvolution and upsampling (for example by reversing the max pooling step, often called max unpooling) to bring the feature maps back up towards the size of the original image.
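To give a feel for what growing a compressed feature map back out looks like, here is a sketch of the simplest upsampling there is: nearest-neighbour, where every value is just repeated. This is only an illustration and an assumption on my part about what to show. Actual segmentation networks typically use learned upsampling layers, and `upsample_nearest` is a made-up helper name.

```python
def upsample_nearest(feature_map, factor=2):
    """Repeat every value `factor` times along both rows and columns."""
    upsampled = []
    for row in feature_map:
        wide_row = [value for value in row for _ in range(factor)]
        for _ in range(factor):
            upsampled.append(list(wide_row))
    return upsampled

compressed = [[1, 2],
              [3, 4]]

for row in upsample_nearest(compressed):
    print(row)
```

A 2x2 map becomes a 4x4 map. No new information is created, which is why decoders pair upsampling with further convolutions to sharpen the result.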

RNNs are a type of NN where connections between units form a directed graph along a sequence. This allows them to exhibit dynamic temporal behavior for a time sequence. Unlike feedforward neural networks, RNNs can use their internal state (memory) to process sequences of inputs. This makes them applicable to tasks such as unsegmented, connected handwriting recognition or speech recognition.

Recurrent networks are distinguished from feedforward networks by that feedback loop connected to their past decisions, ingesting their own outputs moment after moment as input. It is often said that recurrent networks have memory. Adding memory to neural networks has a purpose: There is information in the sequence itself, and recurrent nets use it to perform tasks that feedforward networks can’t.

That sequential information is preserved in the recurrent network’s hidden state, which manages to span many time steps as it cascades forward to affect the processing of each new example. It is finding correlations between events separated by many moments, and these correlations are called “long-term dependencies”. This is because an event downstream in time depends upon, and is a function of, one or more events that came before. One way to think about RNNs is this: they are a way to share weights over time.
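The "sharing weights over time" idea fits in a few lines. Below is a minimal sketch of a one-unit recurrent network with a tanh activation. The weights are hand-picked numbers rather than trained values, and `rnn_forward` is a name I made up for this example.

```python
import math

def rnn_forward(inputs, w_input, w_hidden, bias, initial_state=0.0):
    """Run a one-unit RNN over a sequence.
    The same three weights are reused at every time step."""
    hidden = initial_state
    states = []
    for x in inputs:
        # The new hidden state mixes the current input with the previous state.
        hidden = math.tanh(w_input * x + w_hidden * hidden + bias)
        states.append(hidden)
    return states

# Two sequences that differ only in their first element.
with_event = rnn_forward([1.0, 0.0, 0.0], w_input=0.5, w_hidden=0.8, bias=0.0)
without_event = rnn_forward([0.0, 0.0, 0.0], w_input=0.5, w_hidden=0.8, bias=0.0)

print(with_event)
print(without_event)
```

The last two inputs are identical in both sequences, yet the final hidden states differ. That leftover trace of the first input is the "memory" carried in the hidden state.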

Machine Learning Explained – Part 1.4

Getting Started with Matlab – Part 1:

A lot of the math used for Machine Learning (ML) is linear algebra. Actually, a lot of engineering in general is based on the stuff. A matrix of numbers could represent everything from a circuit-based control system to a concrete pillar under a bridge. This also means that all the CAD software runs on the stuff as well. It's basically everywhere.

Linear algebra can be done by hand, or you could do it using a programming language. You could be hardcore and write your own super-optimized matrix inversion algorithms in C, or use Python’s NumPy, or even R. However, the industry standard is Matlab; these guys hold a virtual monopoly over the realm of engineering programming languages. When I say “engineering” I mean the hard, physically based stuff like designing aircraft and rockets, or buildings.

Some would argue that Matlab is easier than Python. It is, but only for linear algebra; you would not want to make your website in this stuff. Matlab works using a REPL interface: all your variables are stored in memory and persist in your workspace, and this workspace can be saved and loaded up again when you need it.

Below are some of Matlab’s more basic commands.

% The ; means we are stopping the current row and starting a new row
A = [1, 2, 3; 4, 5, 6; 7, 8, 9; 10, 11, 12] % 4 X 3 matrix
% making a vector and storing it in v
v = [1;2;3] % 3 X 1 matrix
[m,n] = size(A) % the size function returns the number of rows and columns and puts them in m and n
% now you can access m and n as two different variables
% You could also store it this way
% in this case size will return a vector that has the size
% it will be [4,3]
dim_A = size(A)
% You can access different parts of the matrix
A_23 = A(2,3) % looking for the element in the 2nd row and 3rd column
A = [1, 2, 4; 5, 3, 2]
B = [1, 3, 4; 1, 1, 1]
s = 2 % just like python there is no static typing
% adds element-wise
add_AB = A + B
% subtraction element-wise
sub_AB = A - B
% scalar multiplication
mult_As = A * s
% Divide A by s
div_As = A / s
% scalar addition, adds s to every element of A (element-wise, just like A / s)
add_As = A + s
v = [1;1;1]
% Multiplying A * v (2 X 3 times 3 X 1 gives a 2 X 1 result)
A_Times_v = A * v % easy as pie 🙂
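% taking the transpose of a matrix (rows become columns)
A_transpose = A'
% Taking the inverse of a matrix; inv only works on square, non-singular matrices
% and A above is 2 X 3, so we use a square matrix C here
C = [2, 1; 1, 3]
C_inv = inv(C) % also easy as pie 🙂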
% note bigger matrices take longer




Machine Learning Explained – Part 1.3


Near the end of the last post I started talking about gradient descent. For the most part you can think of the cost function as a surface on which you are trying to find the lowest point. However, once you get into systems that require you to factor in more than just one feature, it stops being a 3D surface and becomes an abstract idea of a surface. You are still minimizing a single cost function, but now it depends on N different parameters, one for each of the N features.

To do gradient descent effectively you have to properly calibrate the learning rate and adjust your training set using feature scaling. Both of these will help gradient descent find the right solution faster.

Feature Scaling: You want to do this when there are wild discrepancies in the ranges of values. For example, one of the features could be the size of the house in square meters (in the 100s), and another could be the number of previous owners (1-5). You want to think about these types of things because they may cause your gradient descent algorithm to jump back and forth, making it harder to find the global minimum. When you do feature scaling you are simply trying to get all your features into the same range of values.
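As a quick illustration, here is a sketch of one common way to scale features, mean normalization: subtract the mean, then divide by the range. The house data is made up, and `mean_normalize` is just a name I picked for the helper.

```python
def mean_normalize(values):
    """Rescale values to roughly [-1, 1]: subtract the mean, divide by the range."""
    mean = sum(values) / len(values)
    value_range = max(values) - min(values)
    if value_range == 0:
        # Every value is identical, so there is nothing to scale.
        return [0.0 for _ in values]
    return [(v - mean) / value_range for v in values]

# Made-up training data: two features on very different scales.
house_sizes = [80.0, 120.0, 200.0, 400.0]    # square meters
previous_owners = [1.0, 2.0, 5.0, 3.0]

print(mean_normalize(house_sizes))
print(mean_normalize(previous_owners))
```

After scaling, both features sit in the same small range around zero, so neither one dominates the steps gradient descent takes.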

Learning Rate: As you know, the learning rate is effectively your step size as you go down the 3D surface, trying to find the global minimum. Sometimes the gradient descent algorithm may step so far ahead that it misses the minimum, and it keeps missing it since its step size (learning rate) is so high. However, if you make your learning rate too small it may take a very long time before you find your minimum, since you have to take so many more steps.
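You can watch both failure modes on a toy problem. The sketch below runs gradient descent on f(x) = x^2, whose minimum is at x = 0 and whose gradient is 2x. The function and the three learning rates are picked by me purely for illustration.

```python
def gradient_descent(learning_rate, start=10.0, steps=50):
    """Minimize f(x) = x**2 by repeatedly stepping against its gradient, 2*x."""
    x = start
    for _ in range(steps):
        gradient = 2 * x
        x = x - learning_rate * gradient
    return x

print("good rate (0.1):     ", gradient_descent(0.1))    # ends very close to 0
print("too small (0.001):   ", gradient_descent(0.001))  # has barely moved from 10
print("too large (1.1):     ", gradient_descent(1.1))    # overshoots further each step
```

With the same 50 steps, the middle-of-the-road rate lands near the minimum, the tiny rate is still most of the way up the slope, and the large rate bounces from side to side while moving further away.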

Another way to find the values that minimize the cost function is to use the Normal equation.

$$\theta = (X^{T} X)^{-1} X^{T} Y$$

Where theta is the vector of values that minimizes the cost function.

Where X is your feature matrix.

Where Y is the known output.

Using the Normal equation we can compute these values in a straightforward process without the use of iteration. However, this comes at the price of speed when the feature matrix gets very big, since computing the inverse of an (n x n) matrix, where n is the number of features, is roughly O(n^3). In these cases gradient descent is going to be the better choice.
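Here is a small sketch of the Normal equation in plain Python, theta = (X^T X)^-1 X^T Y, for a model with just an intercept and one feature, so the matrix to invert is only 2 x 2. The data is made up to follow y = 1 + 2x exactly, and the helper functions are written by hand for illustration. In practice you would use a linear algebra library.

```python
def transpose(m):
    return [list(column) for column in zip(*m)]

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def inverse_2x2(m):
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular, so it has no inverse")
    return [[d / det, -b / det],
            [-c / det, a / det]]

def normal_equation(X, Y):
    """theta = (X^T X)^-1 X^T Y, for a feature matrix with two columns."""
    Xt = transpose(X)
    return matmul(matmul(inverse_2x2(matmul(Xt, X)), Xt), Y)

# First column of ones is the intercept term, second column is the feature.
X = [[1, 0], [1, 1], [1, 2], [1, 3]]
Y = [[1], [3], [5], [7]]   # generated from y = 1 + 2x

theta = normal_equation(X, Y)
print(theta)   # approximately [[1], [2]], up to floating point rounding
```

There is no loop over training steps and no learning rate to tune. The cost is all in the matrix inverse, which is why this stops being attractive once the number of features gets large.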

Machine Learning Explained – Part 1.1

You can see why I started this series here.

What is machine learning?

  • My definition:
    • When you tell a machine to learn from experience rather than explicitly giving it a bunch of instructions.
  • Course Definitions:
    • “the field of study that gives computers the ability to learn without being explicitly programmed.” – Arthur Samuel
    • “A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.” – Tom Mitchell

Supervised Learning:

You can think of this like teaching a small child how to do something you already know, such as counting objects or throwing a ball. Another way of thinking about it is that you give the machine the data, knowing that there is some relationship there, and then have the machine find it by itself.

This type of learning comes in two different flavours:


Regression: Given a bunch of data, we are asked to predict what will happen next. An example could be: “given all the historical data about housing prices, what will be the price of a house in 2020?” We are mapping input data to a continuous function.


Classification: Take the input data, and give me discrete outputs (classifications). For example, you could take data on students and predict which students will become engineers. Here we still know the right answers for the examples we train on, which still makes it supervised learning.
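The two flavours above can be shown side by side with the same simple technique. The sketch below uses nearest neighbours, which is my pick for illustration and not something the course prescribes: for a continuous output it averages the closest known values, and for a discrete output it takes a vote among the closest known labels. All the data and names are made up.

```python
def nearest_neighbours(training_data, query, k):
    """Return the k (input, output) pairs whose input is closest to the query."""
    return sorted(training_data, key=lambda pair: abs(pair[0] - query))[:k]

def predict_regression(training_data, query, k=2):
    """Continuous output: average the outputs of the k nearest examples."""
    neighbours = nearest_neighbours(training_data, query, k)
    return sum(output for _, output in neighbours) / len(neighbours)

def predict_classification(training_data, query, k=3):
    """Discrete output: majority vote among the k nearest examples."""
    neighbours = nearest_neighbours(training_data, query, k)
    labels = [label for _, label in neighbours]
    return max(set(labels), key=labels.count)

# Made-up labelled data: (house size in square meters, price in thousands).
house_prices = [(50, 150.0), (80, 240.0), (100, 300.0), (150, 450.0)]
# Made-up labelled data: (math grade, what the student became).
students = [(55, "other"), (60, "other"), (70, "other"),
            (85, "engineer"), (90, "engineer"), (95, "engineer")]

print(predict_regression(house_prices, 90))     # a number
print(predict_classification(students, 88))     # a label
```

In both cases the training data comes with known answers. The only difference is whether the answer is a number on a continuous scale or one of a fixed set of labels.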

Unsupervised Learning:

We have mountains of data that we think is random and has no structure. We have no idea what the relationships are between the variables. So we let our machine loose on the data to discover the relationships between the different variables. And it starts to cluster the data into different piles.
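Here is a rough sketch of the "piles" idea using k-means, one well-known clustering method, on one-dimensional data. The points, the starting guesses for the pile centers, and the function name are all invented for this example.

```python
def k_means_1d(points, centroids, iterations=10):
    """Cluster 1-D points around the given starting centroids."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Step 1: put every point in the pile whose center is closest.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Step 2: move each center to the average of its pile.
        for i, cluster in enumerate(clusters):
            if cluster:  # an empty pile keeps its old center
                centroids[i] = sum(cluster) / len(cluster)
    return centroids, clusters

# Unlabelled data: nobody told the machine there are two groups here.
points = [1.0, 1.5, 2.0, 9.0, 10.0, 11.0]

centers, piles = k_means_1d(points, centroids=[0.0, 5.0])
print(centers)
print(piles)
```

Notice that no labels were given at any point. The grouping comes entirely from how the points sit relative to each other.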