Inside Artificial Intelligence II

Technical Supplement

This piece contains supplementary material to our article, Inside Artificial Intelligence. The first and main section provides a heavily annotated version of the artificial neural network code, explaining step by step what it does, so that it should be possible to follow the gist of the program without any coding knowledge. The second is the bare Python code for the neural net, included simply to illustrate how compact it is. The third and final section briefly covers the mathematics of the learning cycle.

Although the coding structure is different, the overall design of the neural network is heavily influenced by the work of Tariq Rashid.


The Annotated Program

# This is a simple neural network that tries to recognise handwritten numerals

# Firstly it sets up (initialises) everything that it needs

# Next it ‘trains’ itself to recognise characters

# Finally it is ‘tested’ – it has to work out what number an input character represents

import numpy as np

# NumPy is a mathematics library

# We’ll use it to make calculations with matrices (tables) easier

tr_file = open("mnist_train.csv", 'r')

# The training data is in a ‘Comma Separated Values’ (csv) file, opened in ‘read’ mode

# The files are from the Modified National Institute of Standards and Technology dataset

tr_strings = tr_file.readlines()

# The readlines ‘method’ converts the training data into a list of strings

# There’s one string of numbers on each line

# Each string encodes one image of a handwritten numeral (0, 1, 2,..,9)

# We’ll explain how it does this a little later


tr_file.close()

# Close the file, just to be on the safe side and to free up resources

test_file = open("mnist_test.csv", 'r')

test_strings = test_file.readlines()


test_file.close()

# Same procedure for the test file

# The training data is used to teach the net. It can then be let loose on the test data.

# The network comprises three layers, labelled ‘i’, ‘j’ and ‘k’.

i_nodes = 784

# The i layer is for the input ‘signal’

# The input – representing the image – encodes a square grid

# The grid is 28 by 28 squares (pixels) in size

# The numerical value of each square is a measure of the amount of ‘ink’ on that square

# Each line in the training and test data files is a string of these values

# Each line thereby ‘encodes’ an image as a string of values

# Each number – each square value – is fed into an ‘input node’ in the i-layer

# That means the i-layer needs 28×28 = 784 nodes to ‘see’ one image

j_nodes = 200

# The j layer sits between the input and output layers. It’s a ‘hidden’ layer

# The number of nodes in the hidden layer is arbitrary. It’s up to us

k_nodes = 10

# The k layer represents the output

# The net is trying to work out what each hand-written numeral actually is

# It signals its answer by figuratively ‘lighting up’ the node corresponding to that number

# We therefore need ten output nodes – one for each possible answer (0..9)

learning_rate = 0.2

# This variable allows us to ‘moderate’ the adjustment the net makes in response to an error

# The idea is to avoid ‘over-compensating’ for the error

epochs = 4

# The number of learning cycles – how many times the net runs through the training data.

# It’s an arbitrary number. We can experiment with it

Wij = (np.random.rand(j_nodes, i_nodes)-0.5)

# All the i nodes have a connection to all of the j nodes.

# These connections are represented by a matrix (a table) with j rows and i columns.

# The value of each connection is a ‘weight’, W.

# The weight modifies the signal from an i node to a j node (by multiplying it).

# np.random.rand returns random values from 0 up to (but not including) 1

# Subtracting 0.5, we’ve initialised all the weights with random values between -0.5 and 0.5.

# Technically we could do better than this with some maths, but this should work fine

Wjk = (np.random.rand(k_nodes, j_nodes)-0.5)

# In the same way, we initialise all the connections between the j and k nodes.

def S(x):

    return 1/(1+np.exp(-x))

# This is the formula for the sigmoid function, a kind of smoothed step up

# This outputs a value rapidly approaching 1 once the input passes a ‘threshold’ level.

# Below the threshold, the output should rapidly approach zero.

# In this case, the threshold level is 0 (at which the output is 0.5):

# – Negative input values result in output lower than 0.5 (but never reaching 0)

# – Positive input values result in output higher than 0.5 (but never reaching 1)

# It’s analogous to the ‘firing’ of a neuron when the input signals are strong enough

# This bit of code defines a function ‘S’ that we can use to do this later on

# We could change this formula to create a different activation function

# If we did, we wouldn’t have to change any of the other code
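As a quick illustration of the behaviour just described, the sketch below evaluates the same sigmoid formula at a handful of inputs. The sample values are arbitrary and chosen only for illustration; they are not part of the original program.

```python
import numpy as np

def S(x):
    # Sigmoid: mathematically the output always lies between 0 and 1
    return 1 / (1 + np.exp(-x))

# Arbitrary sample inputs (an assumption for illustration only)
samples = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
outputs = S(samples)

for x, y in zip(samples, outputs):
    print(f"S({x:+.0f}) = {y:.5f}")
```

Because NumPy applies the formula element by element, the same function handles a single number or a whole column of node inputs.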

def train(input_list, target_list):

    # This bit defines the function that we’ll use to teach the network to recognise numerals

    global Wij, Wjk

    # We’ll need to access the two sets of weights defined above

    targets = np.array(target_list, ndmin=2).T

    # The targets are ten nodes representing the numerals 0 to 9

    # The node representing the correct answer has the value 0.99 – just below 1

    # All the other nodes are 0.01 – just above 0

    # The target list is converted into a matrix to make it easier to compute with

    # The ‘.T’ at the end means the matrix is ‘transposed’

    # This converts it from a horizontal list to a column ‘vector’

    # The target list for a given number is passed to this function 

    # We’ll see how this works later

    inputs = np.array(input_list, ndmin=2).T

    # The input_list is a list of 28×28 ‘greyscale’ values, one for each pixel

    # np.array converts this list into a matrix, and .T transposes it into a column

    j_in = np.dot(Wij, inputs)

    # This is a very condensed bit of ‘matrix multiplication’

    # It takes each input value and multiplies it by a weight

    # The answer passes as an input to a j (hidden) node

    # E.g. i node 1 is multiplied by Weight[1,2] and the result passed as an input to j node 2

    # Generally, input node X is multiplied by weight[X, Y] and passed to j node Y

    # (In the code, with j rows and i columns, that weight is stored at row Y, column X of Wij)

    # The weighted values passed to each j node are aggregated (added up)

    # E.g j node 1 is input node 1 times weight[1,1] plus input node 2 times weight[2,1] etc

    # This one line of code does this for all 784 input nodes and 200 hidden nodes

    # (using 784 x 200 = 156,800 individual weights)

    j_out = S(j_in)

    # This puts all the 200 j node inputs just calculated through the sigmoid function

    # The result is 200 j node outputs for passing to the next layer

    k_in = np.dot(Wjk, j_out)

    # As with the input to j node weights, this line does the same massive calculation

    # This time the j node outputs are weighted and passed to the k nodes as inputs

    # So j node output X is multiplied by Weight[X, Y] and passed to (added to) k node Y

    # This happens for all X and Y

    # This time there are 200 j nodes and only 10 k nodes, so 2,000 weights

    k_out = S(k_in)

    # Finally, each of the 10 inputs to the k nodes is shoved through the sigmoid function

    # This gives ten values, one for each potential answer

    # The closer the value is to 1, the more the net thinks that node is the correct answer

    # The closer the value is to 0, the more the net thinks that node is a wrong answer

    k_errors = targets - k_out

    # On each k node, the error is the difference between the target value and the output

    # E.g. if the character is actually ‘0’, the first target value is 0.99 and all the rest are 0.01

    # So the error in the first output node value is 0.99 minus that value

    # And the error in the other nine output values is in each case 0.01 minus the value

    # We’ll use these errors to tweak the j to k weights

    j_errors = np.dot(Wjk.T, k_errors)

    # But we also need to ‘back propagate’ the errors so we can tweak the i to j weights too

    # To do this, we attribute a notional error to each j node

    # This is based on the weight it contributed to the output error

    # I.e. we take each k error, and apply what were the j to k weights to it 

    # We transpose the weights to do this, so the notional error at j node 1 is:

    # Error at k node 1 times weight[1,1] plus

    # Error at k node 2 times weight[1,2] and so on

    Wjk += learning_rate * np.dot((k_errors*k_out*(1-k_out)), np.transpose(j_out))

    Wij += learning_rate * np.dot((j_errors*j_out*(1-j_out)), np.transpose(inputs))

    # This is the key bit of the training function

    # The j to k weights are adjusted based on the output errors

    # The i to j weights are adjusted based on the notional errors attributed to the j nodes

    # We use calculus to work out the way in which the errors change with the weights 

    # (i.e. we calculate the ‘slope’ of the error versus weight curve)

    # The next section provides more detail on the mathematics

    # We multiply the slope by the learning rate 

    # (so we take a fraction of the slope, in this case 0.2)

    # We add this (it may be a negative number though) to the old weight to get the new one

    # If the slope is low (i.e. shallow) we may be near the minimum error

    # In this case the weight isn’t changed much

    # If the slope is high (steep) we may be a long way from the minimum error, 

    # In this case we make a bigger adjustment
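To make the weighted sums behind the matrix multiplication concrete, the following sketch uses a deliberately tiny network (3 input nodes and 2 hidden nodes, with made-up weights and inputs) and checks that np.dot gives the same result as adding up the products one by one. It assumes the weight matrix is stored with one row per j node and one column per i node, as described for Wij earlier.

```python
import numpy as np

# Toy sizes and values, chosen only for illustration
i_nodes, j_nodes = 3, 2
Wij = np.array([[0.1, -0.2, 0.3],    # weights into j node 1
                [0.4,  0.5, -0.1]])  # weights into j node 2
inputs = np.array([[1.0], [0.5], [0.2]])  # column of i node values

# The condensed form: one matrix multiplication
j_in = np.dot(Wij, inputs)

# The long-hand form: add up each input times its weight
j_in_by_hand = np.zeros((j_nodes, 1))
for j in range(j_nodes):
    for i in range(i_nodes):
        j_in_by_hand[j, 0] += Wij[j, i] * inputs[i, 0]

print(j_in.ravel())
print(np.allclose(j_in, j_in_by_hand))

# Weight counts for the full-size network: i to j, then j to k
print(784 * 200, 200 * 10)
```

The double loop is what the single np.dot call replaces; at full size it would run through every one of the i to j weights for each image.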

def test(input_list):

    # This section defines the function that will try to recognise a character

    # We’ll invoke it once we’ve finished training

    # It works like the ‘train’ function

    # But once it’s calculated output values it simply returns them

    # (It doesn’t try to calculate errors or adjust the weights anymore)

    inputs = np.array(input_list, ndmin=2).T

    j_in = np.dot(Wij, inputs)

    j_out = S(j_in)

    k_in = np.dot(Wjk, j_out)

    k_out = S(k_in)

    return k_out


# The main program starts here

for e in range(epochs):

    # For each training cycle

    for r in tr_strings:

        # For each encoded training image – there are 1,000 in the training data set

        tr_values = r.split(',')

        # Each image is simply a string of values separated by commas

        # The ‘split’ method returns a list of all 785 separate values (one label plus 784 pixels)

        scaled_inputs = (np.asarray(tr_values[1:], dtype=float)/255*0.99) + 0.01

        # Each pixel has a value from 0 to 255 encoding the grayscale of that square

        # We need to rescale these numbers to give values between 0.01 and 1

        # (This avoids inputs of exactly 0)

        target_outputs = np.zeros(k_nodes) + 0.01

        # Sets up a list of ten possible values, and initially sets them all to just above zero

        target_outputs[int(tr_values[0])] = 0.99

        # Each training record is a list of 785 numbers: one label plus 784 encoding the image

        # Most of these numbers represent the ‘ink’ in each pixel

        # However, the very first number (with ‘index’ 0) identifies the actual numeral itself

        # E.g. if the picture is of a handwritten ‘7’, tr_values[0] is the number 7

        # That number is then used to set the target value of the correct output

        # In this example, the 8th target output value is set to just below 1

        # (Why the 8th? Because output node 1 represents ‘0’, so node 8 represents ‘7’)

        # That means that all of the target values of the output nodes are set:

        #   – close to zero for the outputs not matching the correct numeral

        #   – close to one for the output node representing the right answer

        train(scaled_inputs, target_outputs)

        # Training the network by feeding it: 

        # – the encoded input image

        # – the target values encoding the correct answer
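The encoding steps inside the loop can be tried on a single made-up record. In this sketch the record (a label of 7 followed by 784 pixel values, almost all zero) is invented for illustration, and np.asarray with dtype=float is used here for the conversion to floating-point numbers.

```python
import numpy as np

k_nodes = 10

# A made-up record: the label 7, then 784 pixel values (illustrative only)
pixels = [0] * 784
pixels[300] = 255   # one fully inked pixel
pixels[301] = 128   # one roughly half-inked pixel
record = "7," + ",".join(str(p) for p in pixels)

values = record.split(',')

# Rescale the 784 pixel values from the range 0..255 to 0.01..1.0
scaled_inputs = (np.asarray(values[1:], dtype=float) / 255 * 0.99) + 0.01

# Target vector: 0.01 everywhere except 0.99 at the labelled numeral
target_outputs = np.zeros(k_nodes) + 0.01
target_outputs[int(values[0])] = 0.99

print(len(values), scaled_inputs.shape)
print(scaled_inputs.min(), scaled_inputs.max())
print(target_outputs)
```

The label is stripped off by the slice values[1:], which is why the record holds one more value than there are input nodes.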

right = 0

wrong = 0

# We’re going to set the network loose on a library of 10,000 handwritten numerals

# These two variables will keep a tally of how many it’s got right and wrong

for r in test_strings:

    # For each image…

    test_values = r.split(',')

    # As with the training cycles, start by converting the comma separated values into a list

    correct_answer = int(test_values[0])

    # As explained above, the first value is the actual number represented by the image

    scaled_inputs = (np.asarray(test_values[1:], dtype=float)/255*0.99)+0.01

    # As with training, rescale the input values from the range 0 to 255 to the range 0.01 to 1


    out = test(scaled_inputs)

    # ‘out’ is a list of ten output values (0 through 9) from the test

    judgement = np.argmax(out)

    # We take the biggest output value (the value closest to 1) to be the network’s answer

    # E.g. if node 3 has the highest value, the ‘judgement’ is 2 (we start counting at 0, not 1)

    if judgement == correct_answer:

        right += 1


    else:

        wrong += 1

print("Right: %d, Wrong: %d, Performance: %.1f%%" % (right, wrong, (right*100/(right+wrong))))

# Display the number of right and wrong answers, and the percentage the net got right
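The scoring step can also be illustrated on its own. The output values below are invented for the example (they are not real network output), and the label is hypothetical; the sketch shows how np.argmax turns ten node values into a single answer and how the performance line is formatted.

```python
import numpy as np

# Invented output column for one test image: index 2 holds the largest value
out = np.array([[0.02], [0.01], [0.91], [0.05], [0.03],
                [0.01], [0.02], [0.10], [0.04], [0.01]])

judgement = int(np.argmax(out))   # index of the largest value
correct_answer = 2                # hypothetical label for this image

right, wrong = 0, 0
if judgement == correct_answer:
    right += 1
else:
    wrong += 1

summary = "Right: %d, Wrong: %d, Performance: %.1f%%" % (
    right, wrong, right * 100 / (right + wrong))
print(summary)
```

Because the indices start at 0, the index returned by np.argmax is already the numeral the net is guessing, with no further conversion needed.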


The Bare Code

Although we use some shortcuts from NumPy that make the matrix algebra especially compact, the program itself is very short. The complexity (‘intelligence’, if you prefer) lies in the 158,800 weights (156,800 between the i and j layers plus 2,000 between the j and k layers), not in the code itself.
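As a rough sketch of how compact the whole procedure is, the block below condenses the steps from the annotated program into a single function. This is a reconstruction under assumptions, not the article's original listing: the function name run, its parameters and the seeded random generator are hypothetical, added so the code can be tried on any list of CSV records without the MNIST files.

```python
import numpy as np

def S(x):
    return 1 / (1 + np.exp(-x))

def run(tr_strings, test_strings, i_nodes=784, j_nodes=200, k_nodes=10,
        learning_rate=0.2, epochs=4, seed=None):
    # 'run' and its parameters are hypothetical; defaults follow the article
    rng = np.random.default_rng(seed)
    Wij = rng.random((j_nodes, i_nodes)) - 0.5
    Wjk = rng.random((k_nodes, j_nodes)) - 0.5

    def encode(values):
        # Pixel values 0..255 rescaled to 0.01..1.0, as a column
        scaled = (np.asarray(values[1:], dtype=float) / 255 * 0.99) + 0.01
        return np.array(scaled, ndmin=2).T

    def forward(inputs):
        j_out = S(np.dot(Wij, inputs))
        return j_out, S(np.dot(Wjk, j_out))

    for e in range(epochs):
        for r in tr_strings:
            values = r.split(',')
            inputs = encode(values)
            targets = np.zeros((k_nodes, 1)) + 0.01
            targets[int(values[0])] = 0.99
            j_out, k_out = forward(inputs)
            k_errors = targets - k_out
            j_errors = np.dot(Wjk.T, k_errors)
            Wjk += learning_rate * np.dot(k_errors * k_out * (1 - k_out), j_out.T)
            Wij += learning_rate * np.dot(j_errors * j_out * (1 - j_out), inputs.T)

    right = wrong = 0
    for r in test_strings:
        values = r.split(',')
        _, k_out = forward(encode(values))
        if np.argmax(k_out) == int(values[0]):
            right += 1
        else:
            wrong += 1
    return right, wrong

# Tiny made-up data set: two 4-pixel 'images' labelled 0 and 1
toy = ["0,255,255,0,0", "1,0,0,255,255"] * 20
right, wrong = run(toy, toy, i_nodes=4, j_nodes=3, k_nodes=2, seed=0)
print("Right: %d, Wrong: %d" % (right, wrong))
```

With the article's data, the two lists would instead come from calling readlines on mnist_train.csv and mnist_test.csv, leaving the layer sizes at their defaults.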


The Mathematics of Learning

This is a very brief summary of the adjustment mechanism for the weights. For a more relaxed explanation, see Make Your Own Neural Network, by Tariq Rashid.
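As a hedged sketch of that mechanism, the equations below restate in matrix form the forward pass, the errors and the two weight updates used by the training function. The symbols are assumptions introduced here rather than the article's own notation: I is the input column, O_j and O_k are the outputs of the hidden and output layers, T is the target column, alpha is the learning rate, and the circled dot means element-by-element multiplication.

```latex
\begin{aligned}
O_j &= S(W_{ij}\, I), \qquad O_k = S(W_{jk}\, O_j), \qquad S(x) = \frac{1}{1 + e^{-x}} \\
E_k &= T - O_k, \qquad E_j = W_{jk}^{\mathsf{T}}\, E_k \\
\frac{dS}{dx} &= S(x)\,\bigl(1 - S(x)\bigr) \\
\Delta W_{jk} &= \alpha\, \bigl(E_k \odot O_k \odot (1 - O_k)\bigr)\, O_j^{\mathsf{T}} \\
\Delta W_{ij} &= \alpha\, \bigl(E_j \odot O_j \odot (1 - O_j)\bigr)\, I^{\mathsf{T}}
\end{aligned}
```

For the output layer, the slope of the squared error (the sum over k of (T_k - O_k) squared) with respect to each j to k weight is -2 times the corresponding entry of the bracketed product multiplied by the transposed O_j. Adding alpha times that product therefore moves each weight downhill, with the constant factor absorbed into the learning rate. For the hidden layer, the same form is applied to the notional errors E_j.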
