Complete Deep Learning Intuition

Edward Wang · Published in The Startup · 18 min read · Nov 30, 2020


It’s way easier than you would think.

Much of the content below is based on the Intro to Deep Learning with PyTorch course by Facebook AI. If you want to learn more, take the course, or just take a look here.

Imagine this: you’re a computer science student and want to know your chances of making it into your dream university.

Below is a graph that shows whether or not students were accepted into a university. Two pieces of data have been used: grades and test scores, each on a scale of 0–10. Applicants who were accepted are in blue; those rejected are in red.

You have been an okay student, with a test score of 7 and grades of 6. Can you determine whether you would be accepted?

Congrats! The answer is YES.

Now how do we know this? When you looked at the graph, you likely found the point (7, 6). Seeing that many blue dots surround it, and that it sits deep in the blue region, you could assume that you would be accepted.

You may not know it, but what you just did is exactly what many machine learning algorithms strive to do: look at some data points and, based on previous trends and patterns, determine the label for a new one.

But what about that murky region around the center? It would definitely be harder to tell whether a student with grades and a test score of 6 gets accepted or rejected. So let’s set a clear boundary.

Finding the equation

That’s much better. Although this line isn’t perfect, it gives us a very good idea of whether a student makes the cut. We can therefore say with confidence that students on the right side of the line will be accepted, and those on the left will not be. But perhaps what’s more valuable is that the line gives us an equation.

The equation of this line, in this case, is 2x₁ + x₂ -18 = 0.

What’s exciting is that by using this linear equation, we can extract a mathematical formula to determine whether students get accepted. The score is simply 2·test + grades - 18. If the resulting number is greater than or equal to 0, congrats on your acceptance! If not, better luck next time.
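As a quick sanity check, that acceptance rule can be sketched in a few lines of Python (the function name is just for illustration):

```python
def is_accepted(test, grades):
    # Score the student against the line 2*test + grades - 18 = 0
    score = 2 * test + grades - 18
    return score >= 0

print(is_accepted(7, 6))  # True: the student from our example makes it
```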

We call this line the decision boundary. A good way to remember the name is that this boundary separates the algorithm’s two possible decisions.

So let’s revisit our earlier question: does a student with grades and a test score of 6 get accepted or rejected? Well, 2·6 + 6 - 18 = 0, and 0 is exactly the minimum score for acceptance, meaning this student is indeed accepted.

All linear equations will follow this format: W₁x₁ + W₂x₂ + b = 0. Complicated? Not at all. Let’s refer back to the equation 2x₁ + x₂ - 18 = 0.

In this equation, W₁ is 2, and the number it multiplies is x₁, in this case the test score. W is called the weight, and all you need to do is multiply x₁ by it. x is the input, in our example the students’ test scores and grades.

Weights exist because we do not want to adjust the x inputs themselves; w lets us manipulate the equation without touching the inputs.

The same goes for W₂. There was no number before x₂ in the previous equation because its weight is 1, meaning we can simply write x₂.

Last but not least, we have b, which stands for bias. b can be either positive or negative; in our case, it was -18.

The Dimension Problem

Getting back on track, what if we had one more piece of data, such as class rank? Instead of visualizing the data points in 2 dimensions, we would use 3. The equation would no longer be W₁x₁ + W₂x₂ + b; instead it would be W₁x₁ + W₂x₂ + W₃x₃ + b.

Remember, this is because we have one more input (x₃), or, in our example, class rank. This doesn’t seem like a big problem right now, but what if we had 4 different pieces of data, or 5, 20, or even 1,000? What do we do then?

The three different data categories (grades, test, and class rank)

Well, one common solution is to simplify the equation W₁x₁ + W₂x₂ + … + Wₙxₙ + b to the vector equation Wx + b. Here W represents the vector (W₁, …, Wₙ) and x represents (x₁, …, xₙ), where n is the number of dimensions. In other words, W and x stand in for every single weight and input in the full sum W₁x₁ + W₂x₂ + … + Wₙxₙ + b.

This is extremely convenient, as Wx + b can express any number of dimensions easily.
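To see why the vector form is convenient, here is a minimal sketch in plain Python; the same function handles 2, 3, or 1,000 dimensions (the name `linear` is illustrative):

```python
def linear(W, x, b):
    # Wx + b: dot product of weights and inputs, plus the bias
    return sum(w_i * x_i for w_i, x_i in zip(W, x)) + b

# our 2-D example: 2*x1 + 1*x2 - 18 at the point (7, 6)
print(linear([2, 1], [7, 6], -18))  # 2
```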

Another thing to note with multiple dimensions is the decision boundary. In the 2D example, we used a one-dimensional decision boundary (a line). In the 3D example above, we used a two-dimensional one (a plane). Notice the pattern? The decision boundary is always one dimension below the space it resides in; we call it a hyperplane. In n dimensions, our decision boundary will always have n-1 dimensions.

When we think about neural nets, the image we often picture is this.

An intimidating collection of connected circles and lines. What do these circles mean? What do the lines mean? Why are they interconnected? Well, like many movie clichés, it turns out you already knew the answer. Let’s just put the pieces together.

Perceptrons

Perceptrons are the building blocks of neural networks and a more efficient way to represent our equation from before. In this basic perceptron, the large circle (or node) represents the graph, while the two smaller nodes represent the inputs. In this case, the point (7, 6) is in the blue area, showing us that the student makes it into the university.

You might remember that our initial equation also contained a bias, our equation being 2x₁ + x₂ - 18 = 0. So where does that fit in?

Whether you think of the larger node as an equation or a graph, both determine whether the plotted x is over or under the linear boundary.

In this case, we add a third node with a value of 1. But remember, our bias is -18, so why is our x here 1? Well, it turns out the lines we see are all weighted. In our case, the test line has a weight of 2, the grade line has a weight of 1, and the bias line has a weight of -18.

If you’re worried about the equation at the top of the middle node, just realize that this is a different format for expressing Wx + b.

The general convention for a perceptron has x as the input, multiplied by some weight and put into an equation. We then check whether the output clears a certain threshold. In this case, if Wx + b is greater than or equal to 0, the student makes it into university, while if it is below 0 the student does not.
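That convention can be sketched as a tiny perceptron in code, assuming a simple step activation with a threshold of 0:

```python
def step(t):
    # Fire (1) if the score clears the threshold of 0, otherwise 0
    return 1 if t >= 0 else 0

def perceptron(x, W, b):
    # Weighted sum of the inputs plus the bias, passed through the step
    return step(sum(w * xi for w, xi in zip(W, x)) + b)

# our student: test 7, grades 6, weights (2, 1), bias -18
print(perceptron([7, 6], [2, 1], -18))  # 1 -> accepted
```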

Finding the Linear Boundary

You may be wondering how we determine the correct linear boundary in the first place. That is exactly what this next section covers; figuring out the correct linear boundary lies at the heart of machine learning. Let’s start at the most basic level.

Imagine we have some points and a random line, as in the diagram below. Some points are correctly classified and some are not; so far this is purely the result of randomization, not anything we have done. Now, in order to find the linear boundary, think logically: what would a misclassified point want? Would it want the linear boundary to come closer or move farther away?

Correct! The point would want the linear boundary to come closer so it can be classified correctly.

In order to express this mathematically, the first thing we need to recognize is that we have two distinct regions. Let’s say blue is the positive region, while red is the negative region. Keep this in mind, as we will be performing slightly different operations depending on the region.

Now, our misclassified point resides at (4, 5), and our linear boundary has the equation 3x₁ + 4x₂ - 10 = 0, meaning that if the score is above 0 the point is classified blue, while below 0 it is classified red. However, as we can see, this line is fundamentally flawed, as the red point resides in the blue area. Let’s change that.

Now, in order to fix the linear boundary, all we need to do is subtract the location of the point, along with a 1 for the bias, from the weights and bias of the linear boundary (3, 4, -10).

Perfect, right? Well, almost. The problem with this new line is that the shift is drastic. Algorithms often have more than one misclassified point, so it’s key that we take small steps in order to take all of these points into consideration. This is where the learning rate comes in.

The learning rate is the rate of change of the linear boundary. Our change before was a massive subtraction from the previous line. Such a drastic change jumps straight to one hypothetical solution without considering the answers in between. This is similar to guessing a number from 0 to 10: you could jump straight from 0 to 10, or you could consider all the numbers in between. Chances are the answer resides somewhere in the middle.

The learning rate traditionally ranges from about 0.001 to 0.01. In this scenario, for simplicity, we are going to set our learning rate to 0.1, multiplying the numbers we subtract by 0.1.

This allows for smaller changes, letting us better test whether a line is correct and find the best fit.

Now let’s say we have a misclassified point located in the negative area. What would we do? Well, the only change is addition instead of subtraction. So instead of subtracting from 3, 4, and -10, for example, we would add.

Now you try it out! The point here is located at (1, 1), our learning rate is 0.1, and our bias input is 1. Can you determine the equation of the new linear boundary?

If your answer was 3.1x₁ + 4.1x₂ - 9.9 = 0, you’re correct! Because our point’s coordinates and bias input are all 1 and our learning rate is 0.1, we simply add 0.1 to each of our numbers.

Now our algorithm can slowly make changes to find the optimal boundary.
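The whole update rule (often called the perceptron trick) can be sketched like this; the direction of the nudge depends on which region the point belongs in:

```python
def perceptron_trick(W, b, point, should_be_positive, lr=0.1):
    # Nudge the boundary toward a misclassified point:
    # add when the point belongs in the positive region, subtract otherwise
    sign = 1 if should_be_positive else -1
    W = [w + sign * lr * x for w, x in zip(W, point)]
    b = b + sign * lr * 1  # the bias input is always 1
    return W, b

# the worked example: point (1, 1) that belongs in the positive region
new_W, new_b = perceptron_trick([3, 4], -10, [1, 1], should_be_positive=True)
print(new_W, new_b)  # roughly [3.1, 4.1] and -9.9
```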

Error Function

The concept of an error function will be a recurring theme in the next section. An error function, put simply, is just the distance between where you are and where you want to be. Take, for example, the door in your room: the error function would be the distance between you and the door. As we approach the door the error function shrinks, and as we move away from it, it grows.

Gradient Descent

You may remember me talking about the perfect linear boundary, which we want to slowly move toward, but how does the algorithm know what the perfect linear boundary is? Well, let’s introduce gradient descent.

Now let’s say we were on a mountain and wanted to descend quickly. How would we do so? We would analyze the different paths we could take, then step down the steepest one. At this new position, we would look around again and step down the steepest path again. This is the intuition behind gradient descent, in which the height of the mountain is our error function. Every step is chosen to decrease the error, meaning the algorithm takes steps (the size of our learning rate) to reduce the error function slowly.

As a quantitative value, what would the error function look like? Would it simply be 2 or 3? Or some decimal number?

Let’s take a look at the diagram below. It looks like two points are incorrectly classified so it seems pretty intuitive that the error function here would simply be 2. Right?

Actually, this is not the case at all. The problem with an error function like this is that it is a discrete error function. A discrete error function is black or white: your points are either correctly classified or not. This is problematic because our learning rate keeps changes relatively small, and any small change would still return the same error of 2.

This is analogous to using gradient descent to descend a staircase. When we are at the top of a flat step, every direction we move gives the same error, which confuses the algorithm and breaks it, as no step reduces the error function. Instead, we use a continuous error function, where even a misclassified point yields a value that tells us how close we are to classifying it correctly.

Sigmoid

Previously we used a simple 1 or 0 output to represent yes or no. This time we will output a continuous number, obtained through the sigmoid activation function.

When any number is passed through the sigmoid function, it returns a value between 0 and 1. The larger the number, the closer the result is to 1; the smaller the number, the closer it is to 0. An input of 0 lands exactly on 0.5.

An activation function determines the output of your neural network. This is analogous to the brain, which determines which neurons fire or not. In our case, inputs are passed through the sigmoid activation function to return some value.

Take, for example, a neural net that returned the numbers 5, 1, and -5. For 5, the sigmoid would return about 0.993, extremely close to 1; for 1, it would return about 0.731, a bit above 0.5; and for -5, it would return about 0.007, extremely close to 0.
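The sigmoid itself is only one line of code, and we can check those three inputs directly:

```python
import math

def sigmoid(t):
    # Squash any real number into the range (0, 1)
    return 1 / (1 + math.exp(-t))

print(round(sigmoid(5), 3))   # 0.993
print(round(sigmoid(1), 3))   # 0.731
print(round(sigmoid(-5), 3))  # 0.007
print(sigmoid(0))             # 0.5
```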

For the purposes of intuition, you only need to know that sigmoid (σ) is a function we use to determine the probability of a point having a certain label. If you’re interested, the math behind it is covered in the course linked above.

The probability space

In order to determine the probability of each point being correctly classified, we create a probability space. This probability space is based on our linear boundary, with the boundary itself sitting at 0.5. This means that if a point lies exactly on the linear boundary, there is a 50% chance it’s red and a 50% chance it’s blue.

In this image, the probability space shows the probability of each point being blue; if it showed the probability of being red, the values would be inverted.

The probability space changes constantly as your line changes; any adjustment to the line changes what the probability space looks like. Now, let’s figure out how to assess how well our linear boundary is doing.
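One way to compute this probability space, consistent with the sigmoid section above, is to feed the linear score Wx + b through the sigmoid; a point exactly on the boundary then scores 0.5. A minimal sketch (the function name is illustrative):

```python
import math

def prob_blue(x, W, b):
    # Sigmoid of the linear score Wx + b: far into blue -> near 1,
    # far into red -> near 0, exactly on the boundary -> 0.5
    score = sum(w * xi for w, xi in zip(W, x)) + b
    return 1 / (1 + math.exp(-score))

# a point exactly on the boundary 2*x1 + x2 - 18 = 0
print(prob_blue([6, 6], [2, 1], -18))  # 0.5
```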

Maximum likelihood

Now let’s say we had two models with two distinct lines. To us, it should be pretty evident which model is better. But how would the algorithm determine this?

Well, using the equation of the linear boundary, we can determine which points each model believes have a high probability of being blue and which do not. The further a point is into the blue region, the larger its probability of being blue. In fact, let’s assign each point a value for how likely it is to be blue.

Now let’s multiply these probabilities to obtain an output. For each point we want the probability of its actual color: blue points keep their blue-probability, while red points get the complement. The red point with a blue-probability of 0.9 therefore contributes 0.1, and the red point in the red region, with a blue-probability of 0.3, contributes 0.7. Multiplying the four values, 0.1 × 0.6 × 0.7 × 0.2, gives 0.0084. Let’s try it with the second scenario.

Note that these probabilities are for each point’s actual color, so 0.8 means a 0.8 probability of red for a red point, while 0.9 means a 0.9 probability of blue for a blue point. Multiplying 0.8 × 0.7 × 0.6 × 0.9 gives a value of 0.3024.

So what do these numbers mean? They are the probability of all the points being correctly classified at once. In our first scenario, the chance was under 1%, while in the second it was about 30%. We can clearly see which model is superior.

Therefore, by using the probability space, we are able to determine our model’s performance.
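The maximum-likelihood comparison boils down to one product per model. Using the four per-point probabilities from the two scenarios above:

```python
from math import prod

# probability of each point actually being its true color
model_a = [0.1, 0.6, 0.7, 0.2]
model_b = [0.8, 0.7, 0.6, 0.9]

print(prod(model_a))  # ~0.0084
print(prod(model_b))  # ~0.3024
```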

There is, sadly, an inherent issue with using multiplication to determine this probability. In scenarios with hundreds or thousands of data points, the product becomes a vanishingly small number with potentially thousands of decimal places. Why not sum instead?

Cross entropy

If there were ever a “secret sauce” in machine learning, logarithms would be it. The logarithm of a product, ln(x·y), is equal to the sum of the logarithms, ln(x) + ln(y). So let’s use logarithms. When we apply them to our math, it looks something like this.

ln(0.6) + ln(0.2) + ln(0.1) + ln(0.7) = (-0.51) + (-1.61) + (-2.3) + (-0.36)

It looks substantially neater, but working with negative numbers adds complications we don’t need. So let’s take the negative of the logarithm.

Now let’s try to calculate what our previous models looked like.

Algorithm A

= -ln(0.6) - ln(0.2) - ln(0.1) - ln(0.7)

= (0.51) + (1.61) + (2.3) + (0.36)

= 4.8

On your own, try out Algorithm B as well:

-ln(0.7) - ln(0.9) - ln(0.8) - ln(0.6)

Algorithm B

= -ln(0.7) - ln(0.9) - ln(0.8) - ln(0.6)

= (0.36) + (0.11) + (0.22) + (0.51)

= 1.2

Odd: the better algorithm returned a lower number. This is because the algorithm now returns cross-entropy. Entropy means chaos and unpredictability, the opposite of what we are trying to achieve, so our goal is now to minimize cross-entropy.
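The whole cross-entropy calculation is just a negative sum of logs, which we can check against both algorithms:

```python
import math

def cross_entropy(probs):
    # Sum of the negative logs of each point's probability of its true label
    return -sum(math.log(p) for p in probs)

print(round(cross_entropy([0.6, 0.2, 0.1, 0.7]), 1))  # 4.8  (Algorithm A)
print(round(cross_entropy([0.7, 0.9, 0.8, 0.6]), 1))  # 1.2  (Algorithm B)
```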

With this, we’re now able to assess our linear boundary, give it feedback on its performance, and find the perfect fit.

Neural Network architecture

Up to this point we’ve mostly been dealing with linear boundaries; however, real-world problems are not always so clear-cut. Often you’ll have a curve like this one.

Do not be intimidated. This curve serves the same purpose as the decision boundaries we have encountered before; the distinction is that it uses two different lines instead of one. To get the curve, we combine our two linear models into a third model, in a way that is roughly similar to addition.

But how does this actually work? Let’s look at it point by point. We know that each linear model gives a probability space, so let’s determine each model’s probability for a single point.

The probabilities in this case are 0.7 and 0.8. Adding 0.7 + 0.8 gives 1.5. Of course, we need to do one more thing, as a probability of 1.5 is mathematically impossible.

Do you recall a particular function that would give us a number between 0 and 1? If you guessed the sigmoid function, you’re absolutely right! Applying it, we can determine that the probability of this specific point being blue is 0.82 in our new model. Doing this systematically with every point lets us determine the boundary of our new combined model.
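That combination step, for a single point, looks like this in code:

```python
import math

def sigmoid(t):
    return 1 / (1 + math.exp(-t))

# the two linear models' probabilities for one point
p1, p2 = 0.7, 0.8

# their sum exceeds 1, so squash it back into (0, 1)
combined = sigmoid(p1 + p2)
print(round(combined, 2))  # 0.82
```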

Now, what if we wanted to rely on one model more than the other? This is where weights come in: using weights, we can prioritize one linear model over the other by multiplying each model’s output by a different number.

However, multiplying our probabilities by weights, adding them, and pushing the sum through a sigmoid can give us a number much closer to 1 than we intended. This is why we add a bias, to subtract or add as needed, ensuring the new point’s probability won’t be drastically high or low.

You may have noticed that the structure of the linear equations above seems eerily similar to that of a perceptron. This is no coincidence and is core to how linear equations and perceptrons link together. If we look at the equation below, we can see the inputs (x) being multiplied by some weight (5 or -2). We can also see a bias of positive 8 to ensure the output is not skewed.

We can check this again with our second equation, which has a bias of -1 and weights of 7 and -3.

The correlation between probability spaces and perceptrons checks out. Now let’s combine these two linear equations into a neural network. You may remember that the second layer has weights of 7 and 5; this, along with its bias of -6, lets us put it together with our previous equations.

For formatting’s sake, we can clean up our neural network further and eliminate the recurring x₁’s and x₂’s in the first layer.

And congrats you’ve just built your first neural network! Savor this moment but not for too long! The best is yet to come.

Feedforward Mathematics

Feedforward is essentially the process a neural network takes from input to output, which is what we have been doing all along. Let’s take a closer, more in-depth look at the mathematics.

Note that in the 3×2 matrix, the superscript (1) indicates these weights belong to the first layer, and W₂₁, for example, means the first weight for the second input; these labels just keep the weights organized. The σ symbol stands for sigmoid, and ŷ (y-hat) stands for our final prediction.

Alright, let’s go through the math layer by layer. First, we multiply our first layer (x₁, x₂, 1) by our weights; each input has its own weights, and in the equation, the weights in the same row are the ones being multiplied together. We then pass the result of input × weights through a sigmoid to get values between 0 and 1. Next, we multiply these results by the second layer of weights, and pass that output through a sigmoid one more time to get our final prediction.
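Putting it together, a full feedforward pass for a two-layer network can be sketched in plain Python. The hidden-layer weights below are made up for illustration; only the second-layer weights (7, 5) and bias -6 come from the example above:

```python
import math

def sigmoid(t):
    return 1 / (1 + math.exp(-t))

def feedforward(x, W1, b1, W2, b2):
    # Layer 1: each hidden node takes a weighted sum of the inputs
    # plus its bias, then applies the sigmoid
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    # Layer 2: weighted sum of the hidden outputs, sigmoid once more
    return sigmoid(sum(w * h for w, h in zip(W2, hidden)) + b2)

# hypothetical first-layer weights; second layer (7, 5) with bias -6
y_hat = feedforward([7, 6], [[2, 1], [-1, 3]], [-18, 4], [7, 5], -6)
print(0 < y_hat < 1)  # True: the output is always a valid probability
```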

Conclusion

That’s it! A complete, semi-mathematical guide to neural network intuition. If you don’t grasp 100% of the content, don’t worry; many senior engineers don’t either. A general understanding of how all the pieces feed into each other is much more valuable.

Key Takeaways

  • A neural network sets a line, called the decision boundary, to separate different outputs
  • The number of inputs equals the number of dimensions the data is plotted in, following the formula Wx + b
  • The linear boundary moves in small increments, set by the learning rate, using gradient descent
  • The linear boundary receives feedback on its performance through a continuous output derived from the probability space
  • We calculate that continuous output by using the negative logarithm to determine the network’s cross-entropy and how well it’s doing
  • Curved boundaries are combinations of linear boundaries, which can be represented by perceptrons
  • The math behind neural networks is inputs multiplied by weights, passed through an activation function; the more layers, the more times we multiply by weights and apply an activation function

Do you see those hands below? If you enjoyed this article, light them up blue, and the Medium AI will recommend more articles like this!
