How Do Machines Actually Think?
“I kept dreaming of a world I thought I’d never see. And then, one day, I got in…”
Kevin Flynn (Jeff Bridges), Tron: Legacy, Disney Movies (2010)
The previous article in this series surveyed the strategic context for AI as a trending technology. It was a high level view.
This article sets out to give the professional but non-specialist ‘lay reader’ some real insight into how AI actually does what it does. It is very much a low level view. It will prove a demanding read, but get through it and twenty minutes from now you’ll be able to boast that you know how AI works to anyone who’ll listen.
Why bother? My experience has been that non-specialists who can muster the patience (and, to labour the point, it really will demand your patience) to be led through this ‘down among the weeds’ journey typically emerge with real, light-bulb moment insights. Most of us have an extremely vague view of what AI is, which apart from anything else is a poor basis for opinion forming.
It’s not that you shouldn’t be worried about AI, but it’s quite possible that you’re worried for the wrong reasons.
Obviously, the sensible place to start is with matchboxes.
In the early 1960s, a Bletchley Park veteran called Donald Michie designed a system dubbed ‘MENACE’, which progressively learned to play better and better games of tic-tac-toe. MENACE stood for ‘Matchbox Educable Noughts-and-Crosses Engine’.
It works like this. We take a collection of matchboxes (without the matches), and we stick drawings on each of them. The drawings are of a noughts-and-crosses grid, and show all of the possible positions that MENACE could face. Each drawing (so each matchbox) shows one of those positions. To keep down the number of matchboxes, we assume that MENACE plays the ‘odd’ moves (first, third, fifth etc) while the human plays the even moves. So the empty grid is the board which MENACE faces on its first move. There’s then a matchbox for every possible position that MENACE may face after each of the human’s moves (second, fourth, etc).
Each matchbox contains tokens – they might be different coloured buttons or beads for example. We have a simple system that identifies each button with a square on the grid. For example, a red button might mean ‘top left square’, a blue button might mean ‘middle square’, and so on. Each token in a matchbox represents a legal move from the position shown in the drawing on that box. Tokens are excluded from that particular box if they represent a square that’s already been taken.
It’s MENACE to move first. We pick up the box with the empty grid, give it a shake, and pull out a token at random (no cheating). The token tells us where MENACE puts its cross. Now it’s our turn – put a nought somewhere – then go find the matchbox which has a picture of this position on it. A token from that one gives MENACE’s next move. Then it’s our turn again. And so on.
At the end of the game, one of three things has happened. If MENACE won, we put the tokens back in the boxes they came from but we add extra tokens of the same type, increasing the chance of picking that token next time that same position is encountered. If the machine has lost, we set aside the tokens we took out and put the boxes back without them, reducing the chance that we’ll get the same move next time. If it’s a draw, we go back and replace the tokens in the boxes they came from. Overall, we’ve reinforced a winning sequence of moves, or suppressed a losing one.
And, as if by magic, over a long series of games, MENACE gets better and better at playing the game. It starts out clueless and ends a master.
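For the curious, the matchbox procedure can be sketched in a few lines of Python. This is a minimal illustration, not the original design: the names (boxes, choose_move, reinforce), the single starting token per legal move, the one-token reward and the refilling of an emptied box are all assumptions made for the example.

```python
import random

# One "matchbox" per board position. A position is a 9-character string
# read row by row; each token is the index (0-8) of a square to play.
boxes = {}

def legal_moves(board):
    """Indices of the squares that are still empty."""
    return [i for i, square in enumerate(board) if square == " "]

def choose_move(board, rng=random):
    """Shake the box for this position and draw one token at random."""
    if not boxes.get(board):
        # Assumption: a new (or emptied) box gets one token per legal move.
        boxes[board] = legal_moves(board)
    return rng.choice(boxes[board])

def reinforce(history, result):
    """Adjust the boxes once a game is over.

    history: the (board, move) pairs MENACE played, in order.
    result:  "win", "loss" or "draw", from MENACE's point of view.
    """
    for board, move in history:
        if result == "win":
            boxes[board].append(move)   # extra token: move becomes likelier
        elif result == "loss":
            boxes[board].remove(move)   # token set aside: move becomes rarer
        # After a draw the token just goes back, so nothing changes.

# One MENACE move from the empty grid, followed by a pretend win.
empty = " " * 9
move = choose_move(empty)
reinforce([(empty, move)], "win")
```

Playing a full game would mean calling choose_move on each of MENACE’s turns, recording the (board, move) pairs, and calling reinforce once when the game ends.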
Now, you may be thinking that matchboxes and buttons are as far from AI as it’s possible to get, but actually, at a fundamental level, this is how machine learning works to this day. You really don’t need to read any further – you now have a keen instinct for how AI actually functions.
Frankly I’m sensing some scepticism, so I’m going to try to take you under the skin of something a bit more up to date. In fairness, if you talk to your grandkids about the next bit it’ll sound more convincing than the stuff about matchboxes. But it’s going to be hard work, it’ll put demands on your attention and concentration, and at the end of it you won’t really, fundamentally, know anything new. Feel free to bail out now, or to jump to the concluding section.
If you’re among the remaining stalwarts, let’s crack on.
When we’re talking about AI, we’re usually talking about some form of ‘machine learning’, and when we’re talking about machine learning, we’re usually talking about ‘neural networks’. These are (very) loosely based on the way that biological neurons work, and in what follows we’ll look at how a simple neural network might tackle basic ‘optical character recognition’. In this case, the characters we’ll be recognising are single, handwritten numerals – 0 through 9. We’re going to ‘train’ the network, from scratch, to recognise numerals handwritten by different people, and then let it loose on more examples to see how it fares once it’s trained.
What follows, by the way, describes a real, working neural network I programmed a while ago (in order to try to get a better understanding of this stuff for myself – I don’t work in this field). If you’d like to indulge your inner geek, a heavily annotated version of the code can be found on our website. It’s explained in some detail, so it’s not necessary to be familiar with programming languages to follow the gist of what’s going on.
A point perhaps worth making is that a neural network is usually, as here, a software simulation running on a general-purpose computer (in this case a PC) – a program in other words – rather than a physically realised network.
Our network comprises three separate layers – an input layer, a ‘hidden’ layer and an output layer. We can add more hidden layers if we like, but they work the same way so we’ll keep it simple. Figuratively speaking, each layer is made up of ‘nodes’, and every node in a layer is connected to every node in the next layer. The nodes themselves are really just values which can be varied.
Figure One below illustrates what happens in a small but representative part of the network.
First we have to get the handwritten character into the network in a format which it can work with. We use the input layer for this. As illustrated, the character is actually encoded in a computer as a grid of squares, or ‘pixels’. Each pixel has a value which basically represents how dark the square is (its ‘grayscale’ value). Because the image is 28 by 28 pixels, we need 784 nodes in the input layer – one for each pixel. The value of each input node represents the grayscale value of the corresponding square – figuratively, the amount and darkness of the ‘ink’ on that pixel. The darker the square, the higher the value. Figure One shows one particular pixel being fed into an input node, with the illustrative value 0.8.
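As a sketch of this encoding step, the snippet below turns a flattened 28 by 28 image into 784 input values. It assumes the raw pixels are whole numbers from 0 to 255 that are simply divided by 255; the function name to_input_nodes is made up for the illustration, and the real code may scale things slightly differently.

```python
def to_input_nodes(pixels):
    """Rescale 0-255 grayscale values to the range 0 to 1, one per node."""
    return [p / 255 for p in pixels]

image = [0] * (28 * 28)   # a blank 28 x 28 image, flattened row by row
image[0] = 204            # one fairly dark pixel (204 / 255 is 0.8)
inputs = to_input_nodes(image)

print(len(inputs))        # 784
```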
As an aside, although we’re talking about an image here, you can see that in principle we could present other forms of input to the network, such as speech, in pretty much the same way, as long as they can be represented digitally.
The job of this network is to tell us what it thinks the input character is. Because the answer is a digit between 0 and 9, we can use ten output nodes, one for each potential answer. The output node with the highest value represents the network’s guess (‘high’ in this case means close to 1; ‘low’ means close to zero). Sometimes, as we’ll see, it hedges its bets, but figuratively speaking it’s trying to ‘light up’ the node representing the character. In Figure One, the network is indicating pretty strongly that the image is ‘4’ by feeding a high value into the output node that represents ‘4’.
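Reading off the network’s guess is just a matter of finding the output node with the largest value. A minimal sketch, with invented output values and a hypothetical helper called guess:

```python
def guess(outputs):
    """Return the digit whose output node has the highest value."""
    return max(range(len(outputs)), key=lambda digit: outputs[digit])

# Ten made-up output values, one per digit 0 to 9.
outputs = [0.02, 0.01, 0.05, 0.03, 0.93, 0.04, 0.01, 0.02, 0.06, 0.01]
print(guess(outputs))  # 4
```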
Between the input and output layers, we have a hidden layer. The hidden layer can have an arbitrary number of nodes. Our example uses 200, for no particular reason.
Once the grayscale value of a pixel has been fed into its own input node, it’s passed on to all of the hidden nodes. So the value of 0.8 is passed on to all 200 hidden nodes, for example. However, in each case (for each hidden node) the value is multiplied by a number (called a ‘weight’), which is unique to each connection between a particular input node and a particular hidden node.
The illustration shows the weight from the first input node to the first hidden node as being 0.75. The net multiplies the value of 0.8 by 0.75 (to give 0.6), and passes this value to the first hidden node. Although it’s not shown on the diagram, the same thing happens between the first input node and every other hidden node. In every case the weight is different, so the value of 0.8 is multiplied by 200 different weights, one for each hidden node.
The same thing happens for every other input node. Each of the 784 input nodes therefore has its own unique set of 200 weights (that’s 156,800 separate weights in total). So in the illustration, the second input node has the value 0.2. This is passed to the first hidden node, but with a different weight from the one we just used for the first input node, in this case -0.5, to give a value passed on of -0.1.
The purpose of the weights is effectively to strengthen or weaken the ‘signal’ from an input node to a hidden node.
Every hidden node receives a number, which is a pixel’s grayscale value times a weight, from every input node. So a weighted version of every pixel is passed on to every hidden node. Each hidden node aggregates all of its 784 inputs into a single value, simply by adding them together. This is all just arithmetic. Massively parallel arithmetic.
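The arithmetic for a single hidden node can be sketched as follows, using only the two connections with the illustrative values from the text (0.8 with weight 0.75, and 0.2 with weight -0.5). The helper name weighted_sum is an assumption for the example; a real hidden node here would add up 784 such products.

```python
def weighted_sum(inputs, weights):
    """Add up input * weight over every connection into one hidden node."""
    return sum(x * w for x, w in zip(inputs, weights))

inputs = [0.8, 0.2]
weights = [0.75, -0.5]

# 0.8 * 0.75 + 0.2 * (-0.5), which is 0.6 - 0.1, so about 0.5
total = weighted_sum(inputs, weights)
```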
Where do the weights come from? Initially (meaning, for the very first image in the training data set) they’re just random values, in our case between -0.5 and +0.5.
Er, hang on. Doesn’t that mean we’ve just taken the input image and effectively scrambled it into 200 random numbers? Well, yes we have. Much as the matchboxes initially play completely random tic-tac-toe moves. Bear with me.
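The random starting point is easy to sketch. The layout below (one row of weights per receiving node) and the name random_weights are assumptions for illustration; only the layer sizes and the range of -0.5 to +0.5 come from the description above.

```python
import random

def random_weights(n_from, n_to, rng=random):
    """One row per receiving node; each weight lies between -0.5 and +0.5."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_from)]
            for _ in range(n_to)]

input_to_hidden = random_weights(784, 200)    # 200 rows of 784 weights
hidden_to_output = random_weights(200, 10)    # 10 rows of 200 weights

total = sum(len(row) for row in input_to_hidden + hidden_to_output)
print(total)  # 158800
```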
What happens next is where the analogy to neurons comes in. The idea is that a neuron will ‘fire’ when the accumulated signals from its dendrites are strong enough – they reach a certain ‘threshold’ value. Similarly, each hidden node takes its input, which is the sum of all the weighted signals from the input layer, and pushes it through an ‘activation function’ that mimics this effect, giving a high output (close to one) for a strongly positive input, a low output (close to zero) for a strongly negative input, and something in between for inputs near zero. Our net uses a ‘sigmoid’ function to do this – see Figure Two.
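For reference, the standard sigmoid (logistic) function is short enough to write out. The sketch below assumes that standard form and prints its value at three sample inputs.

```python
import math

def sigmoid(x):
    """Squash any number into the range 0 to 1."""
    return 1.0 / (1.0 + math.exp(-x))

print(round(sigmoid(-6), 3))  # 0.002
print(round(sigmoid(0), 3))   # 0.5
print(round(sigmoid(6), 3))   # 0.998
```

An input of exactly zero gives 0.5, the ‘undecided’ midpoint between firing and not firing.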
These output values from the hidden layer are then pushed through their own sets of weights to each of the output layer nodes. There are ten output nodes, so each hidden node has ten individual output weights. As with the hidden layer, each output node aggregates the signals from all of the 200 hidden nodes (by adding them together). It’s absolutely the same process.
The final stage is that these signals to the output nodes are themselves shoved through the activation function, to give a set of ten output values. Voila.
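The whole journey from input to output can be sketched with a toy network. The sizes here (2 inputs, 2 hidden nodes, 3 outputs) and all of the weights are invented so that the example stays readable; the network in the article is 784, 200 and 10. The function names layer and forward are likewise assumptions.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def layer(values, weights):
    """Feed values through one layer: weighted sums, then the sigmoid.

    weights[j][i] is the weight from sending node i to receiving node j.
    """
    return [sigmoid(sum(w * v for w, v in zip(row, values)))
            for row in weights]

def forward(inputs, w_hidden, w_output):
    """Input layer to hidden layer, then hidden layer to output layer."""
    hidden = layer(inputs, w_hidden)
    return layer(hidden, w_output)

w_hidden = [[0.75, -0.5], [0.1, 0.4]]               # 2 hidden nodes
w_output = [[0.3, -0.2], [-0.6, 0.5], [0.2, 0.9]]   # 3 output nodes
outputs = forward([0.8, 0.2], w_hidden, w_output)
```

The same layer function is used twice, which is the point made above: the hidden-to-output step is the same process as the input-to-hidden step.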
Now, if this were the first run, it’s unlikely that we’d have an unambiguous answer (meaning just one output node with a high value), and equally unlikely that it would be right even if we did. So how does it improve? In pretty much the same way as MENACE did, except that this time instead of adjusting the numbers of buttons in the matchboxes, the net adjusts the weights.
But how? An important feature of the training data set is that it covertly includes the actual, correct answers. In this case, the very first pixel of an image is stolen and used to store the number that image represents (rather than a grayscale value). Let’s say it’s a four, as in Figure One. That means that if the net were performing well, it would output a value close to 1 for output node ‘4’. Every other output node – all the wrong answers – would be valued at close to zero. So we have an easy measure of the ‘error’ at each output node – it’s simply the difference between the actual output and what it should have been.
This glosses over a couple of very significant questions: for ‘real world’ neural networks working on, say, facial recognition, where do those ‘correct’ answers come from (a clue: humans, usually); and are they actually correct?
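The error measure itself is simple enough to sketch. The target values of 0.99 and 0.01 follow the footnote on this point; the helper names and the sign convention (target minus actual output) are assumptions for the example.

```python
def targets_for(label, n_outputs=10):
    """0.99 for the correct digit's node, 0.01 for every other node."""
    return [0.99 if digit == label else 0.01 for digit in range(n_outputs)]

def output_errors(targets, outputs):
    """Error at each output node: what it should be minus what it was."""
    return [t - o for t, o in zip(targets, outputs)]

outputs = [0.5] * 10   # an untrained network sitting on the fence
errors = output_errors(targets_for(4), outputs)
```

Here the error for node ‘4’ is about +0.49 (the output was too low) and about -0.49 for every other node (the outputs were too high).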
In any case, the clever bit is to adjust the weights between the hidden nodes and the output nodes to try to reduce – ideally to minimise – the error, by nudging them a bit. You may, hazily or otherwise, remember from school days that in principle we can find the minimum of some function using ‘differential calculus’. This sounds hideous but all it basically does is work out how steep a slope is. Imagine that you’re walking down the side of a U-shaped valley towards the valley floor. High up the valley sides it’s steep, but as you get closer to the bottom the slope gets shallower and shallower until eventually it’s completely flat. If you walked any further, it would start getting steeper again. Now we want to minimise the errors. If you think of the error as how far up you are, it makes sense to make a big adjustment if the slope is steep (you’re a long way from the minimum), but a smaller adjustment if it’s shallow (so you don’t overshoot too far). In this way, although the mathematics itself is a bit tortuous it turns out that it’s possible to tweak each weight based on how each error varies with that specific weight (the slope of the ‘error curve’). After all, the error is just a (mathematical) combination of the outputs from the hidden layer, the weights and the sigmoid formula itself.
It sounds a bit surreal, but all we’ve really done is a sophisticated equivalent of adjusting the numbers of buttons in the matchboxes.
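To make the valley picture concrete, here is a sketch that walks a single weight downhill. It assumes the error being minimised is half the squared difference between target and output, and it introduces a ‘learning rate’ that scales each step; neither detail is spelled out above, and the starting values are invented.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_one_weight(weight, x, target, learning_rate=0.5, steps=200):
    """Repeatedly nudge a single weight downhill on its error curve."""
    history = []
    for _ in range(steps):
        output = sigmoid(weight * x)
        error = target - output
        history.append(abs(error))
        # Slope of the error (half the squared difference) with respect
        # to the weight: a steep slope gives a big nudge, a shallow slope
        # gives a small one.
        slope = -error * output * (1.0 - output) * x
        weight -= learning_rate * slope
    return weight, history

weight, history = train_one_weight(weight=-0.3, x=0.8, target=0.99)
```

The list history records how far the output was from the target at each step, and it shrinks as the weight creeps towards the valley floor.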
We’re not quite done yet though. Remember the first set of weights, between the input layer and the hidden layer? We need to tweak them too. Firstly we ascribe a kind of notional error to each of the hidden nodes. We do this by effectively reversing the hidden-layer to output-layer weights: the error on a given output node is allocated to each hidden node in proportion to the strength of the weights which (effectively) contributed to that error. Once we’ve done this, we do exactly the same mathematical wrangling as before to adjust the first set of weights.
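One common way of sharing the blame, sketched below, is to pass each output error back along every connection, scaled by that connection’s weight, and add up what arrives at each hidden node. Treat this as an assumption about the method rather than a transcription of the real code; the weights and errors are invented.

```python
def hidden_errors(w_output, output_errors):
    """Share each output node's error back among the hidden nodes.

    w_output[k][j] is the weight from hidden node j to output node k.
    Each hidden node collects error from every output node it feeds,
    scaled by the weight on that connection.
    """
    n_hidden = len(w_output[0])
    return [sum(w_output[k][j] * output_errors[k]
                for k in range(len(w_output)))
            for j in range(n_hidden)]

w_output = [[0.5, -0.25],   # weights into output node 0
            [0.25, 1.0]]    # weights into output node 1
errors = hidden_errors(w_output, [0.8, -0.4])
# Hidden node 0: 0.5 * 0.8 + 0.25 * (-0.4), roughly 0.3
# Hidden node 1: -0.25 * 0.8 + 1.0 * (-0.4), roughly -0.6
```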
We do this for every image in the training data set (our example used 1,000 images). The net isn’t just learning how to identify one image; it needs to identify any image (of a numeral, in this case). When it moves on from one image to the next, the weights effectively contain the accumulated ‘knowledge’ from all the previous images. They are no longer random.
We can usually improve the accuracy of the net by running this training cycle a few times with the same training data: we no longer start each training cycle with random weights.
And that’s how to train your dragon. Our example was subsequently let loose on 10,000 images of test data and achieved an accuracy of over 96% so, although it may be fairly basic, it’s no dullard.
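Putting the pieces together, the whole training cycle can be sketched as a toy. Everything specific here is an assumption chosen to keep the example small: a network of 4 inputs, 3 hidden nodes and 2 outputs, two made-up four-pixel ‘images’, a learning rate of 0.3 and 200 passes over the data. It is not the article’s network, just the same recipe in miniature.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def feed(values, weights):
    """Weighted sums into each receiving node, then the sigmoid."""
    return [sigmoid(sum(w * v for w, v in zip(row, values)))
            for row in weights]

def nudge(weights, errors, outputs, inputs, rate):
    """Move each weight a little way down its own error slope."""
    for k, row in enumerate(weights):
        step = rate * errors[k] * outputs[k] * (1.0 - outputs[k])
        for j in range(len(row)):
            row[j] += step * inputs[j]

def train_once(pixels, targets, w_hid, w_out, rate):
    """Forward pass, measure the error, then adjust both sets of weights."""
    hidden = feed(pixels, w_hid)
    outputs = feed(hidden, w_out)
    out_err = [t - o for t, o in zip(targets, outputs)]
    # Share the output errors back to the hidden nodes via the weights.
    hid_err = [sum(w_out[k][j] * out_err[k] for k in range(len(w_out)))
               for j in range(len(hidden))]
    nudge(w_out, out_err, outputs, hidden, rate)
    nudge(w_hid, hid_err, hidden, pixels, rate)

def total_error(data, w_hid, w_out):
    """Sum of squared output errors over the whole data set."""
    return sum((t - o) ** 2
               for pixels, targets in data
               for t, o in zip(targets, feed(feed(pixels, w_hid), w_out)))

rng = random.Random(1)
w_hid = [[rng.uniform(-0.5, 0.5) for _ in range(4)] for _ in range(3)]
w_out = [[rng.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(2)]

# Two made-up four-pixel "images", one per class.
data = [([1.0, 1.0, 0.0, 0.0], [0.99, 0.01]),
        ([0.0, 0.0, 1.0, 1.0], [0.01, 0.99])]

before = total_error(data, w_hid, w_out)
for epoch in range(200):          # several passes over the same data
    for pixels, targets in data:
        train_once(pixels, targets, w_hid, w_out, rate=0.3)
after = total_error(data, w_hid, w_out)
```

Comparing before and after shows the total error falling as the weights soak up what the training data has to teach. Note that the weights are created once, outside the loop, so every pass builds on the last.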
It’s instructive to take a look at some of the answers that the network got wrong. Typically what would happen is that instead of picking a clear winner with an output close to 1 (and every other output close to zero), in these cases there would be a couple of fairly high outputs, say around 0.6, as if the machine couldn’t quite make up its mind. And, lo and behold, looking at the offending handwritten characters they were indeed ambiguous – a seven that looked like a one, a five that looked like a three, and so on. The net was having the same difficulty that a human would.
If you managed to make it this far (and congratulations if you did by the way) you might be a bit puzzled about where the ‘intelligence’ is in all of this, artificial or otherwise.
The learning itself lives in the weights (not the nodes, or ‘neurons’, interestingly enough). It’s also spread among them in a highly complex way, and it’s pretty much intractable. Once we’ve started training the machine, we don’t really know how it’s learning exactly. It’s a ‘black box’. There are ways of tackling this ignorance, but it remains the case that we can’t really point to where the knowledge lives – it’s highly distributed throughout a vast number of connections. If you take a look at the code, even though you may not be able to read it, you’ll at least see that it’s pretty short. The intelligence (or, at least, the complexity) sits in the network, not the computer program as such, which is actually fairly simple. This opacity is a very live issue for more complex neural networks dealing with real world problems, where a high degree of trust in the technology is needed (such as self-driving cars or medical diagnosis).
Our network has a total of 994 nodes, which act like neurons, and 158,800 individual weights, which act like synaptic connections. A nematode worm has around 300 neurons and 7,000 synaptic connections, and a human has around 86 billion neurons and a quadrillion synaptic connections. So it’s perhaps not surprising if the net doesn’t look particularly smart, but it’s also hard to shake the feeling that it’s not doing anything that we would actually recognise as intelligence at all. It’s pattern matching, but it has no understanding about what it’s seeing, and it’s not really learning to recognise ‘forms’ in the same way that we do, let alone applying common sense or bringing to bear broader considerations. All of these remain valid concerns for much more sophisticated AI applications than this.
This article started with a question: how do machines actually think? And the fairly clear answer seems to be: they don’t. To this point, Edsger Dijkstra once said “The question of whether a computer can think is no more interesting than the question of whether a submarine can swim.”
Fair enough I suppose. But if, until now, you were worried about an Artificial Intelligence crowning itself God Emperor of Earth, rather than mistaking your kidney tumour for a blancmange, it’s my hope that this article may have given you pause for thought.
Machine Learning for Absolute Beginners by Oliver Theobald is an introductory survey that gives a broader perspective than we could cover here. Make Your Own Neural Network by Tariq Rashid on the other hand is an excellent ‘hobby level’ account of neural network design for those with some Python experience, and which had a big influence on this ‘project’.
 At some point in the last half century this morphed into ‘Machine Educable Noughts and Crosses Engine’, but I prefer the original version.
 Restricting MENACE to odd moves cuts the number of matchboxes to around 300. It’s fairly straightforward to code this up in Python or something nowadays though, so you don’t have to worry about collecting enough matchboxes, in which case there’s no reason to impose this restriction. On the other hand, the legendary mathematics journalist Martin Gardner designed a game he dubbed ‘hexapawn’, which can be played with only 30 boxes – a good place to start if you prefer the old fashioned approach. I have happy memories of building this thing as a child.
 I’ve simplified this a bit. In the original design there’s some finessing around how many tokens are used for each move (more for the earlier moves). In the event of MENACE winning, three tokens of the ‘winning’ type are added to each box in the winning sequence, and in the event of a draw, an extra token of the relevant type is added as well as replacing the ones used in the game.
 I’ve modified this network to use extra hidden layers in fact, and not noticed any significant improvement in performance.
 It’s a technical point, but the values are actually re-scaled so that instead of being between 0 and 255, which is how they’re encoded in the image, the values in the input nodes are between 0 and 1, but the proportions are still the same.
 We refer in the article to output node ‘4’, but as a technicality this is actually node number 5, since we start at 0.
 The net actually uses 0.99 to represent ‘correct’ (rather than 1.0), and 0.01 for ‘wrong’ (rather than 0.0), for technical reasons that don’t impact at all on the thrust of this article.
 If you’re curious, the annotated code annex includes a run through of the mathematics involved.
 Of course the error depends on all of the weights, among other things, but it’s possible to do these slope (or ‘gradient’) calculations in such a way as to isolate the contribution to the ‘error function’ of individual weights – a wondrous mathematical tool called ‘partial differentiation’.
 See our discussion of the ‘Moravec Paradox’ in Strategy for Humans.