In my previous post, I introduced neural networks at a very high level. If you haven’t read that post, I suggest reading it first before continuing. It will give you an idea of what we are going to accomplish today.
In this post, I am going to show you how to create your own neural network from scratch in Python using just NumPy.
For this tutorial, we are going to train a network to compute an XOR gate (its output is 1 exactly when its two inputs differ). This example is simple enough to show the components required for training.
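As a quick refresher, the XOR truth table is easy to verify in Python:

```python
# XOR truth table: the result is 1 only when the two inputs differ
for a in (0, 1):
    for b in (0, 1):
        print(f"{a} XOR {b} = {a ^ b}")
```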
Here is how the network is going to look:
The first layer is the input layer, the second is the hidden layer and the last layer is the output layer.
Let’s start coding by initializing the training data.
import numpy as np

def sigmoid(x):
    return 1.0/(1.0 + np.exp(-x))

def sigmoid_prime(x):
    return sigmoid(x)*(1 - sigmoid(x))

# Create dataset; the third input column is fixed at 1
X = np.array([[0, 1, 1],
              [1, 0, 1],
              [1, 1, 1],
              [0, 0, 1]])
# XOR labels for the first two columns of each row
Y = np.array([[1],
              [1],
              [0],
              [0]])
I have also defined the sigmoid function and its derivative, as we will need them later.
Creating the network
np.random.seed(1)  # good practice
hidden_weights = np.random.randn(3, 4)
output_weights = np.random.randn(4, 1)
hidden_bias = np.random.randn(4)
output_bias = np.random.randn(1)
Since the first layer is the input layer, we essentially deal with two sets of weights. The first set is from the 3 input units to the 4 hidden units. The second set maps the hidden units to the output unit.
I have also included bias for completeness. While this simple example works just fine without biases, including them in the calculation keeps our network consistent with the one shown in the previous post.
The randn function draws weights from a standard normal distribution (mean 0, variance 1). There are cleverer ways to initialize weights, but this is good enough for the current example.
Finally, remember to seed the random number generator. This initializes the weights identically in each run; you don’t want the initialization to change between executions. This becomes crucial when optimizing hyperparameters: with changing random initializations, a lucky draw of weights could lead you to believe an improvement came from a particular hyperparameter choice, leaving you with a sub-par model.
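To illustrate the point, re-seeding the generator reproduces exactly the same draw:

```python
import numpy as np

np.random.seed(1)            # fix the generator state
first = np.random.randn(3, 4)

np.random.seed(1)            # same seed -> same state
second = np.random.randn(3, 4)

print(np.array_equal(first, second))  # True: the two draws are identical
```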
If you remember from the previous post, the output of a unit is a = σ(w · x + b), where w represents the weights, x represents the output from the previous layer, and b is the bias.
The code for the forward pass is very simple.
hidden_activation = sigmoid(np.dot(X, hidden_weights) + hidden_bias)
output_activation = sigmoid(np.dot(hidden_activation, output_weights) + output_bias)
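A quick way to sanity-check the forward pass is to look at the array shapes. This is a small sketch assuming the definitions above:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

np.random.seed(1)
X = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 1], [0, 0, 1]])
hidden_weights = np.random.randn(3, 4)
output_weights = np.random.randn(4, 1)
hidden_bias = np.random.randn(4)
output_bias = np.random.randn(1)

hidden_activation = sigmoid(np.dot(X, hidden_weights) + hidden_bias)
output_activation = sigmoid(np.dot(hidden_activation, output_weights) + output_bias)

# (4 examples, 4 hidden units) and (4 examples, 1 output unit)
print(hidden_activation.shape, output_activation.shape)  # (4, 4) (4, 1)
```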
This is where things get interesting. Up to now, we have just created a simple neural network that takes in data and makes a prediction. Initially, the network will not predict the right value of Y for a given input. To be able to train the network we need a loss function, which tells us by how much the network's prediction deviates from the actual output.
We are going to use the Mean Squared Error loss function, defined as

L = (1/n) Σᵢ (yᵢ − ŷᵢ)²

Here n is the total number of training examples, yᵢ represents the actual label for the i-th training example, and ŷᵢ is the network's prediction for it.
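In NumPy the loss is a one-liner; the predicted values below are made up purely for illustration:

```python
import numpy as np

def mse(y_true, y_pred):
    # mean of the squared per-example errors
    return np.mean((y_true - y_pred) ** 2)

y_true = np.array([1.0, 1.0, 0.0, 0.0])  # XOR labels
y_pred = np.array([0.9, 0.8, 0.2, 0.1])  # hypothetical network outputs
print(mse(y_true, y_pred))               # ≈ 0.025
```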
Our goal is to adjust the weights so that the predicted output is closer to the required output, that is, so the loss is reduced. Let’s just focus on a single training example.
For one example, the loss can be written as

L = ½ (y − a)²

where a = σ(z) and z = w · x + b. (The ½ is a common convention; it only scales the gradient and cancels neatly when we differentiate.)
Here a is the activation from the last layer. We want to see how much effect the weights of the last layer have on the loss. Recall that the derivative at a point on a graph represents its slope.
The above image represents a curve with its derivative at a point drawn in red. The loss function can be thought of as such a curve, but in multiple dimensions (it actually forms a surface). Trying to reduce the loss is similar to moving down toward the valley in the above curve.
We can continue calculating the slope (the derivative) and taking a step toward the valley. This is essentially what backpropagation, combined with gradient descent, does. Taking the partial derivative of the loss with respect to the last layer's weights (ignoring the bias for now) gives

∂L/∂w = (a − y) · σ′(z) · x

where x here is the activation coming into the last layer.
Further, taking the derivative with respect to the bias gives

∂L/∂b = (a − y) · σ′(z)
This shows how much effect the weights in the last layer have on the final loss. Similarly, we can write the loss in terms of the weights in the hidden layer.
Taking the partial derivative with respect to the hidden-layer weights will give their effect on the loss function; by the chain rule, the output layer's error term is passed back through the output weights and multiplied by the hidden layer's sigmoid derivative. To become comfortable with the calculation, I suggest deriving it from the above equation on your own.
σ′(x) is the derivative of σ(x). With a little calculation it can be shown that σ′(x) = σ(x)(1 − σ(x)). This is what we have used in the sigmoid_prime function above.
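The identity can be sanity-checked against a numerical (central-difference) derivative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    # analytic form: sigma'(x) = sigma(x) * (1 - sigma(x))
    return sigmoid(x) * (1 - sigmoid(x))

x = 0.5
eps = 1e-6
# numerical slope from two nearby points on the curve
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
print(abs(numeric - sigmoid_prime(x)) < 1e-6)  # True: the two agree
```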
After iterating over the entire training set and accumulating the changes in weights, the neural network is updated as follows.
# Since the stored activations are already sigmoid outputs, sigma'(z) = a * (1 - a);
# calling sigmoid_prime on an activation would apply the sigmoid twice.
delta_output_weights = (output_activation - Y) * output_activation * (1 - output_activation)
delta_hidden_weights = delta_output_weights.dot(output_weights.T) * hidden_activation * (1 - hidden_activation)

output_weights -= hidden_activation.T.dot(delta_output_weights)
hidden_weights -= X.T.dot(delta_hidden_weights)
output_bias -= delta_output_weights.sum(axis=0)
hidden_bias -= delta_hidden_weights.sum(axis=0)
Putting It All Together
Finally, the entire code looks like this
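Since the original embed is not shown here, below is a minimal sketch assembling the snippets above into one script. The iteration count (5000) and print frequency are my own choices, not taken from the original post:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# XOR dataset; the third input column is fixed at 1
X = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 1], [0, 0, 1]])
Y = np.array([[1], [1], [0], [0]])

np.random.seed(1)
hidden_weights = np.random.randn(3, 4)
output_weights = np.random.randn(4, 1)
hidden_bias = np.random.randn(4)
output_bias = np.random.randn(1)

for epoch in range(5000):
    # Forward pass
    hidden_activation = sigmoid(np.dot(X, hidden_weights) + hidden_bias)
    output_activation = sigmoid(np.dot(hidden_activation, output_weights) + output_bias)

    # Backward pass: since a = sigmoid(z), sigmoid'(z) = a * (1 - a)
    delta_output = (output_activation - Y) * output_activation * (1 - output_activation)
    delta_hidden = delta_output.dot(output_weights.T) * hidden_activation * (1 - hidden_activation)

    # Gradient-descent update (implicit learning rate of 1)
    output_weights -= hidden_activation.T.dot(delta_output)
    hidden_weights -= X.T.dot(delta_hidden)
    output_bias -= delta_output.sum(axis=0)
    hidden_bias -= delta_hidden.sum(axis=0)

    if epoch % 500 == 0:
        print("Loss:", np.mean((output_activation - Y) ** 2))

print("Predictions:", output_activation.ravel())
```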
And you can see the result below
Loss: 0.5013730688788007
Loss: 0.4664830127093482
Loss: 0.24207358013744795
Loss: 0.09891100378926312
Loss: 0.087323152676015
Loss: 0.06892222408037366
Loss: 0.05528837086156295
Loss: 0.027425185939300573
Thanks for reading 😎