### Technical Supplement

This piece contains supplementary material to our article, Inside Artificial Intelligence. The first and main section provides a heavily annotated version of the artificial neural network code, explaining what it’s doing step by step, so that it should be possible to follow the gist of the program even without any coding knowledge. The second is the bare Python code for the neural net, included simply to illustrate how compact it is. The third and final section briefly covers the mathematics of the learning cycle.

Although the coding structure is different, the overall design of the neural network is heavily influenced by the work of Tariq Rashid.

***

**The Annotated Program**

# This is a simple neural network that tries to recognise handwritten numerals

# Firstly it sets up (initialises) everything that it needs

# Next it ‘trains’ itself to recognise characters

# Finally it is ‘tested’ – it has to work out what number an input character represents

**import numpy as np**

# NumPy is a mathematics library

# We’ll use it to make calculations with matrices (tables) easier

**tr_file = open("mnist_train.csv", 'r')**

# The training data is in a ‘Comma Separated Values’ (csv) file, opened in ‘read’ mode

# The files are from the Modified National Institute of Standards and Technology dataset

**tr_strings = tr_file.readlines()**

# The readlines ‘method’ converts the training data into a list of strings

# There’s one string of numbers on each line

# Each string encodes one image of a handwritten numeral (0, 1, 2,..,9)

# We’ll explain how it does this a little later

**tr_file.close()**

# Close the file, just to be on the safe side and to release the resources it was using
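As an aside, Python can close the file for us. The sketch below is an alternative to the open / readlines / close sequence, not the code used in this program. It writes a tiny made-up file (`tiny.csv`, a hypothetical name) so that it can run on its own, then reads it back inside a `with` block, which closes the file automatically.

```python
import os
import tempfile

# Create a tiny stand-in file so the example is self-contained.
# (The file name and its contents are invented for illustration.)
path = os.path.join(tempfile.mkdtemp(), "tiny.csv")
with open(path, "w") as f:
    f.write("7,0,255\n3,12,0\n")

# The 'with' block closes the file as soon as the indented code finishes
with open(path, "r") as tr_file:
    tr_strings = tr_file.readlines()

print(len(tr_strings))   # 2 lines were read
print(tr_file.closed)    # True: no explicit close() needed
```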

**test_file = open("mnist_test.csv", 'r')**

**test_strings = test_file.readlines()**

**test_file.close()**

# Same procedure for the test file

# The training data is used to teach the net. It can then be let loose on the test data.

# The network comprises three layers, labelled ‘i’, ‘j’ and ‘k’.

**i_nodes = 784**

# The i layer is for the input ‘signal’

# The input – representing the image – encodes a square grid

# The grid is 28 by 28 squares (pixels) in size

# The numerical value of each square is a measure of the amount of ‘ink’ on that square

# Each line in the training and test data files is a string of these values

# Each line thereby ‘encodes’ an image as a string of values

# Each number – each square value – is fed into an ‘input node’ in the i-layer

# That means the i-layer needs 28×28 = 784 nodes to ‘see’ one image
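To make the encoding concrete, here is a minimal sketch that decodes one line into its label and a 28 by 28 grid. The helper `decode_line` and the line `fake_line` are invented for illustration; they are not part of the program.

```python
import numpy as np

def decode_line(line, side=28):
    """Split one CSV line into (label, side x side grid of pixel values)."""
    values = line.strip().split(',')
    label = int(values[0])                        # first value: the numeral
    pixels = np.array(values[1:], dtype=float)    # remaining values: the 'ink'
    return label, pixels.reshape(side, side)

# A made-up record: label 7, all pixels blank except the last one
fake_line = "7," + ",".join(["0"] * 783 + ["255"]) + "\n"

label, pixels = decode_line(fake_line)
print(label, pixels.shape)   # 7 (28, 28)
```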

**j_nodes = 200**

# The j layer sits between the input and output layers. It’s a ‘hidden’ layer

# The number of nodes in the hidden layer is arbitrary. It’s up to us

**k_nodes = 10**

# The k layer represents the output

# The net is trying to work out what each hand-written numeral actually is

# It signals its answer by figuratively ‘lighting up’ the node corresponding to that number

# We therefore need ten output nodes – one for each possible answer (0..9)

**learning_rate = 0.2**

# This variable allows us to ‘moderate’ the adjustment the net makes in response to an error

# The idea is to avoid ‘over-compensating’ for the error

**epochs = 4**

# The number of learning cycles – how many times the net runs through the training data.

# It’s an arbitrary number. We can experiment with it

**Wij = (np.random.rand(j_nodes, i_nodes)-0.5)**

# All the i nodes have a connection to all of the j nodes.

# These connections are represented by a matrix (a table) with j rows and i columns.

# The value of each connection is a ‘weight’, W.

# The weight modifies the signal from an i node to a j node (by multiplying it).

# random.rand returns a random value between 0 and 1

# Subtracting 0.5, we’ve initialised all the weights with random values between -0.5 and 0.5.

# Technically we could do better than this with some maths, but this should work fine
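One common refinement, shown here only as a sketch and not used in this program, is to draw the starting weights from a normal distribution whose spread shrinks as the number of incoming connections grows. The name `Wij_alt`, the fixed seed and the choice of 1/sqrt(i_nodes) as the spread are assumptions made for the example.

```python
import numpy as np

i_nodes, j_nodes = 784, 200

# Seeded generator so the example gives the same numbers each run
rng = np.random.default_rng(0)

# Centre the weights on 0 with a standard deviation of 1/sqrt(incoming links)
Wij_alt = rng.normal(0.0, i_nodes ** -0.5, size=(j_nodes, i_nodes))

print(Wij_alt.shape)   # (200, 784)
print(Wij_alt.std())   # close to 1/28
```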

**Wjk = (np.random.rand(k_nodes, j_nodes)-0.5)**

# In the same way, we initialise all the connections between the j and k nodes.

**def S(x):**

** return 1/(1+np.exp(-x))**

# This is the formula for the sigmoid function, a kind of smoothed step up

# This outputs a value rapidly approaching 1 once the input passes a ‘threshold’ level.

# Below the threshold, the output should rapidly approach zero.

# In this case, the threshold level is 0 (at which the output is 0.5):

# – Negative input values result in output lower than 0.5 (but never reaching 0)

# – Positive input values result in output higher than 0.5 (but never reaching 1)

# It’s analogous to the ‘firing’ of a neuron when the input signals are strong enough

# This bit of code defines a function ‘S’ that we can use to do this later on

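# We could change this formula to create a different activation function

# If we did, we would also have to change the ‘output times (1 minus output)’ terms in the weight updates further down, because they are the slope of the sigmoid specifically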
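A quick way to get a feel for the sigmoid is to evaluate it at a few points. This small sketch repeats the definition of `S` so that it runs on its own.

```python
import numpy as np

def S(x):
    return 1 / (1 + np.exp(-x))

# Well below, exactly at, and well above the threshold of 0
for x in (-5, 0, 5):
    print(x, round(float(S(x)), 4))
# -5 0.0067
# 0 0.5
# 5 0.9933
```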

**def train(input_list, target_list):**

# This bit defines the function that we’ll use to teach the network to recognise numerals

** global Wij, Wjk**

# We’ll need to access the two sets of weights defined above

** targets = np.array(target_list, ndmin=2).T**

# The targets are ten nodes representing the numerals 0 to 9

# The node representing the correct answer has the value 0.99 – just below 1

# All the other nodes are 0.01 – just above 0

# The target list is converted into a matrix to make it easier to compute with

# The ‘.T’ on the end means the matrix is ‘transposed’

# This converts it from a horizontal list to a column ‘vector’

# The target list for a given number is passed to this function

# We’ll see how this works later

** inputs = np.array(input_list, ndmin=2).T**

# The input_list is a list of 28×28 ‘greyscale’ values, one for each pixel

# np.array converts this list into a (vertically transposed) matrix
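The effect of `ndmin=2` and `.T` is easiest to see from the shapes involved. In this sketch the target list is written out by hand (with ‘3’ as the correct answer) purely for illustration.

```python
import numpy as np

target_list = [0.01] * 10
target_list[3] = 0.99          # pretend the correct answer is '3'

row = np.array(target_list, ndmin=2)   # one row, ten columns
col = row.T                            # ten rows, one column

print(row.shape, col.shape)   # (1, 10) (10, 1)
```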

** j_in = np.dot(Wij, inputs)**

# This is a very condensed bit of ‘matrix multiplication’

# It takes each input value and multiplies it by a weight

# The answer passes as an input to a j (hidden) node

# E.g. i node 1 is multiplied by Weight[1,2] and the result passed as an input to j node 2

# Generally, input node X is multiplied by weight[X, Y] and passed to j node Y (in the Wij matrix itself, with its j rows and i columns, this weight sits in row Y, column X)

# The weighted values passed to each j node are aggregated (added up)

# E.g. j node 1 is input node 1 times weight[1,1] plus input node 2 times weight[2,1] etc

# This one line of code does this for all 784 input nodes and 200 hidden nodes

# (using 784 x 200 = 156,800 individual weights)
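The same calculation on a toy scale may help. This sketch uses three input nodes and two hidden nodes with made-up weights, and checks the first hidden node by hand.

```python
import numpy as np

# 2 hidden nodes (rows) by 3 input nodes (columns); values are made up
W = np.array([[0.1, 0.2, 0.3],
              [0.4, 0.5, 0.6]])
inputs = np.array([1.0, 2.0, 3.0], ndmin=2).T   # column of 3 inputs

j_in = np.dot(W, inputs)                        # column of 2 weighted sums

# The first hidden node, worked out the long way
by_hand = 0.1 * 1.0 + 0.2 * 2.0 + 0.3 * 3.0

print(j_in.shape)                 # (2, 1)
print(j_in[0, 0], by_hand)        # both approximately 1.4
```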

** j_out = S(j_in)**

# This puts all the 200 j node inputs just calculated through the sigmoid function

# The result is 200 j node outputs for passing to the next layer

** k_in = np.dot(Wjk, j_out)**

# As with the input to j node weights, this line does the same massive calculation

# This time the j node outputs are weighted and passed to the k nodes as inputs

# So j node output X is multiplied by Weight[X, Y] and passed to (added to) k node Y

# This happens for all X and Y

# This time there are 200 j nodes and only 10 k nodes, so 2,000 weights

** k_out = S(k_in)**

# Finally, each of the 10 inputs to the k nodes is put through the sigmoid function

# This gives ten values, one for each potential answer

# The closer the value is to 1, the more the net thinks that node is the correct answer

# The closer the value is to 0, the more the net thinks that node is a wrong answer

** k_errors = targets - k_out**

# On each k node, the error is the difference between the target value and the output

# E.g. if the character is actually ‘0’, the first target value is 0.99 and all the rest are 0.01

# So the error in the first output node value is 0.99 minus that value

# And the error in the other nine output values is in each case 0.01 minus the value

# We’ll use these errors to tweak the j to k weights

** j_errors=np.dot(Wjk.T, k_errors)**

# But we also need to ‘back propagate’ the errors so we can tweak the i to j weights too

# To do this, we attribute a notional error to each j node

# This is based on the weight it contributed to the output error

# I.e. we take each k error, and apply what were the j to k weights to it

# We transpose the weights to do this, so the notional error at j node 1 is:

# Error at k node 1 times weight[1,1] plus

# Error at k node 2 times weight[1,2] and so on
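Here is the same back propagation step on a toy scale: three hidden nodes, two output nodes, and made-up weights and errors. It is a sketch of the one line above, nothing more.

```python
import numpy as np

# 2 output nodes (rows) by 3 hidden nodes (columns); values are made up
Wjk = np.array([[0.5, 0.1, 0.2],
                [0.3, 0.4, 0.6]])
k_errors = np.array([[0.8],
                     [-0.2]])

# Share each output error back along the connections that produced it
j_errors = np.dot(Wjk.T, k_errors)

# Hidden node 1 by hand: 0.8 * 0.5 + (-0.2) * 0.3
by_hand = 0.8 * 0.5 + (-0.2) * 0.3

print(j_errors.shape)              # (3, 1)
print(j_errors[0, 0], by_hand)     # both approximately 0.34
```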

** Wjk += learning_rate * np.dot((k_errors*k_out*(1-k_out)),np.transpose(j_out))**

** Wij += learning_rate * np.dot((j_errors*j_out*(1-j_out)),np.transpose(inputs))**

# This is the key bit of the training function

# The j to k weights are adjusted based on the output errors

# The i to j weights are adjusted based on the notional errors attributed to the j nodes

# We use calculus to work out the way in which the errors change with the weights

# (i.e. we calculate the ‘slope’ of the error versus weight curve)

# The next section provides more detail on the mathematics

# We multiply the slope by the learning rate

# (so we take a fraction of the slope, in this case 0.2)

# We add this (it may be a negative number though) to the old weight to get the new one

# If the slope is low (i.e. shallow) we may be near the minimum error

# In this case the weight isn’t changed much

# If the slope is high (steep) we may be a long way from the minimum error,

# In this case we make a bigger adjustment
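To see that a single adjustment really does nudge the error downwards, the sketch below applies the j to k update once to a toy layer (three hidden outputs, two targets, made-up starting weights) and compares the squared error before and after. The helper `squared_error` is invented for the example.

```python
import numpy as np

def S(x):
    return 1 / (1 + np.exp(-x))

learning_rate = 0.2
j_out = np.array([[0.9], [0.1], [0.5]])      # made-up hidden outputs
targets = np.array([[0.99], [0.01]])         # made-up targets
Wjk = np.array([[0.2, -0.1, 0.05],
                [0.3, 0.1, -0.2]])           # made-up starting weights

def squared_error(W):
    return float(np.sum((targets - S(np.dot(W, j_out))) ** 2))

before = squared_error(Wjk)

# One update, exactly as in the train function
k_out = S(np.dot(Wjk, j_out))
k_errors = targets - k_out
update = learning_rate * np.dot(k_errors * k_out * (1 - k_out), j_out.T)
Wjk_new = Wjk + update

after = squared_error(Wjk_new)
print(before, after)   # the second number is smaller than the first
```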

**def test(input_list):**

# This section defines the function that will try to recognise a character

# We’ll invoke it once we’ve finished training

# It works like the ‘train’ function

# But once it’s calculated output values it simply returns them

# (It doesn’t try to calculate errors or adjust the weights anymore)

** inputs = np.array(input_list, ndmin=2).T**

** j_in = np.dot(Wij, inputs)**

** j_out = S(j_in)**

** k_in = np.dot(Wjk, j_out)**

** k_out = S(k_in)**

** return(k_out)**

# The main program starts here

**for e in range(epochs):**

# For each training cycle

** for r in tr_strings:**

# For each encoded training image – there are 1,000 in the training data set

** tr_values = r.split(',')**

# Each image is simply a string of values separated by commas

# The ‘split’ method returns a list of all 785 separate values (one label followed by 784 pixel values)

** scaled_inputs = (np.asarray(tr_values[1:], dtype=float)/255*0.99) + 0.01**

# Each pixel has a value from 0 to 255 encoding the grayscale of that square

# We need to rescale these numbers to give a value between 0.01 and 1

# (We avoid values of exactly 0, because a zero input would leave the weights attached to it unadjusted)
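The rescaling can be checked on a handful of pixel values. In this sketch the three raw values are chosen by hand; note that a blank pixel (0) maps to 0.01, while a fully inked pixel (255) maps to 1.

```python
import numpy as np

raw = np.array([0, 128, 255], dtype=float)   # blank, mid-grey, full ink

scaled = raw / 255 * 0.99 + 0.01

print(scaled)   # lowest value is 0.01, highest is 1.0
```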

** target_outputs = np.zeros(k_nodes) + 0.01**

# Sets up a list of ten possible values, and initially sets them all to just above zero

** target_outputs[int(tr_values[0])] = 0.99**

# Each list of training values holds 785 numbers describing one image

# Most of these numbers represent the ‘ink’ in each pixel

# However, the very first number (with ‘index’ 0) identifies the actual numeral itself

# E.g. if the picture is of a handwritten ‘7’, tr_values[0] is the number 7

# That number is then used to set the target value of the correct output

# In this example, the 8th target output value is set to just below 1

# (Why the 8th? Because output node 1 represents ‘0’, so node 8 represents ‘7’)

# That means that all of the target values of the output nodes are set:

# – close to zero for the outputs not matching the correct numeral

# – close to one for the output node representing the right answer
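Put together, the two target lines produce a list like the one in this sketch, where the label 7 is hard-coded for illustration rather than read from the data.

```python
import numpy as np

k_nodes = 10
label = 7                                  # pretend the image shows a '7'

target_outputs = np.zeros(k_nodes) + 0.01  # every answer starts 'nearly off'
target_outputs[label] = 0.99               # switch the right answer 'nearly on'

print(target_outputs)
# index 7 (the 8th position) holds 0.99, every other position holds 0.01
```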

** train(scaled_inputs, target_outputs)**

# Training the network by feeding it:

# – the encoded input image

# – the target values encoding the correct answer

**right = 0**

**wrong = 0**

# We’re going to set the network loose on a library of 10,000 handwritten numerals

# These two variables will keep a tally of how many it’s got right and wrong

**for r in test_strings:**

# For each image…

** test_values = r.split(',')**

# As with the training cycles, start by converting the comma separated values into a list

** correct_answer = int(test_values[0])**

# As explained above, the first value is the actual number represented by the image

** scaled_inputs = (np.asarray(test_values[1:], dtype=float)/255*0.99)+0.01**

# As with training, rescale the input values from the range 0 to 255 to the range 0.01 to 1

** out=test(scaled_inputs)**

# ‘out’ is a list of ten output values (0 through 9) from the test

** judgement = np.argmax(out)**

# We take the biggest output value (the value closest to 1) to be the network’s answer

# E.g. if node 3 has the highest value, the ‘judgement’ is 2 (we start counting at 0, not 1)
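A small sketch of how `np.argmax` picks the answer, using ten made-up output values arranged as a column, which is the shape the test function returns.

```python
import numpy as np

# Made-up outputs for the ten nodes; the third node is the strongest
out = np.array([[0.02], [0.01], [0.91], [0.05], [0.03],
                [0.01], [0.02], [0.04], [0.01], [0.02]])

judgement = np.argmax(out)   # position of the largest value, counting from 0

print(judgement)   # 2
```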

** if judgement == correct_answer:**

** right += 1**

** else:**

** wrong += 1**

**print("Right: %d, Wrong: %d, Performance: %.1f%%" % (right, wrong, (right*100/(right+wrong))))**

# Display the number of right and wrong answers, and the percentage the net got right

***

**The Bare Code**

Although we use some shortcuts from NumPy that make the matrix algebra especially compact, the program itself is very short. The complexity (‘intelligence’, if you prefer) lies in the 158,800 weights, not the code itself.
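The number of weights follows directly from the layer sizes set at the top of the program, as this short sketch shows (the variable names `weights_ij`, `weights_jk` and `total_weights` are ours).

```python
i_nodes, j_nodes, k_nodes = 784, 200, 10

weights_ij = i_nodes * j_nodes   # input to hidden connections
weights_jk = j_nodes * k_nodes   # hidden to output connections
total_weights = weights_ij + weights_jk

print(weights_ij, weights_jk, total_weights)   # 156800 2000 158800
```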

***

#### The Mathematics of Learning

A very brief summary of the adjustment mechanism for the weights. For a more relaxed explanation, see *Make Your Own Neural Network*, by Tariq Rashid.
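As a hedged sketch of what the two weight update lines in the `train` function compute, using our own shorthand rather than any notation from the article: write $o_i$, $o_j$ and $o_k$ for the outputs of the input, hidden and output layers, $t_k$ for the targets, $\alpha$ for the learning rate, $S$ for the sigmoid, and $W_{jk}$ for the weight on the connection from hidden node $j$ to output node $k$.

The error to be reduced, and the slope of the sigmoid:

$$
E = \sum_k (t_k - o_k)^2, \qquad o_k = S\Big(\sum_j W_{jk}\, o_j\Big), \qquad S'(x) = S(x)\,\big(1 - S(x)\big)
$$

The slope of the error with respect to one hidden to output weight, by the chain rule:

$$
\frac{\partial E}{\partial W_{jk}} = -2\,(t_k - o_k)\; o_k (1 - o_k)\; o_j
$$

Stepping against that slope (the constant 2 is absorbed into the learning rate) gives the first update line:

$$
\Delta W_{jk} = \alpha\; e_k\; o_k (1 - o_k)\; o_j, \qquad e_k = t_k - o_k
$$

The second update line has the same form, with a notional error attributed to each hidden node:

$$
e_j = \sum_k W_{jk}\, e_k, \qquad \Delta W_{ij} = \alpha\; e_j\; o_j (1 - o_j)\; o_i
$$

Note that $e_j$ here is the weighted share of the output errors that the code computes. It is a simplification: a strict application of the chain rule would also carry the factor $o_k (1 - o_k)$ inside the sum.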