Natural Language Processing and an introduction to Neural Networks

Tom Ribaroff
9 min read · Oct 30, 2019

Humans are born with an evolutionarily honed ability to understand language, and we pick it up from a very early age. Yet as quickly as we take to natural language, we struggle to build machines that can generate, understand and process it.

NLP (natural language processing) describes the collection of techniques used for all things language, computationally speaking. Computer science’s first foray into linguistics began in the 1950s with Turing’s famous article “Computing Machinery and Intelligence”, our first introduction to the Turing test as a criterion of intelligence, and it has since developed into a rich subject. Interestingly, most of the recent major advancements in the ability of machines to process language have come from the machine learning community: while they may be inspired by aspects of linguistics, the profound advancements have not come from the linguistics community itself.

Here are some examples of problems that are currently being worked on:

Machine translation is a good example. Translating individual words takes nothing more than a list of every word with its translation and a quick lookup. Sentences, however, are more interesting: how do you get a computer to take the grammar and context of the words into account so that it translates them more accurately?

Topic modelling is another: taking a collection of documents and determining which topics they cover.

Sentiment analysis is important too — it could take in some text and analyse and return the sentiment. Is it positive or negative? What emotions does it cover? Is it sarcastic? Imagine a call centre needing to sort through thousands of voicemails — it would be much faster to redirect the angry ones to the right people and the positive ones to other people.

NER (named entity recognition) is another: recognising all the proper names, places and brands mentioned in a text. It would allow us to pull out the key bits of information to search on. ‘When are the movie times?’ would have ‘movie times’ picked out and searched for specifically, making searches more accurate.

This raises the question: “Where do we even start?”

One place could be with Facebook. They have set out a framework that lets computer scientists focus their efforts on the same problems, which they argue cover a lot of this ground in a succinct way: the bAbI data set, written to promote research into automated text understanding and reasoning. It provides a set of tasks to set your computer. The first set of tasks is simple text comprehension: 10,000 examples of short, simple stories, each with a simple question attached. There are 10,000 answers to train your network on, and 1,000 queries kept separate as testing data to check how effective the program you’ve written is at understanding these stories. The data set’s tasks have been set up in ascending difficulty, starting with our example: chaining facts. It goes on to simple induction, deduction and many more. Separating these tasks into 20 groups lets developers see where the weaknesses are in the system they are developing, and work on them directly. It is also a measure of the strength of your system if you can get it working on fewer than 10,000 examples.

So, what techniques are used to build systems that can read sentences and make some sense of them?

In the early days, the techniques were mainly statistical. Some of you will certainly have heard of the ‘bag of words’ approach. It adds up the frequency of the words used, but places no weight on their order. We might also have the program compare its results against other common documents to give our sums context. Word frequency does capture some value from a sentence, and early search engines used variations of this, but it’s very crude and not a great measure of most things. Could you mark an English essay by asking how often each word is used in it? It wouldn’t be very accurate, since frequency doesn’t tell you how the words have been used, but it certainly would be consistent, and that’s a plus! Funnily enough, grammar and spelling checkers do use similar primitive statistical methods, so it’s not completely useless today. It is also helpful for tracking language change: if a word stops appearing across the whole internet, that is a good sign the language has moved on.
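As a rough sketch of the bag of words idea, the snippet below counts word frequencies with Python’s standard library. The function name `bag_of_words` and the two example sentences are made up for illustration. Notice that two sentences with different meanings end up with identical counts, which is exactly the weakness described above.

```python
from collections import Counter


def bag_of_words(text):
    """Count how often each word appears, ignoring case and word order."""
    words = text.lower().replace(".", "").split()
    return Counter(words)


# Two sentences with different meanings but the same words
first = bag_of_words("The dog chased the cat.")
second = bag_of_words("The cat chased the dog.")

print(first["the"])     # how many times "the" appears
print(first == second)  # word order is lost, so the two bags are equal
```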

The technique that has made the most progress in this field in the last two decades is the neural network. I’ll attempt to briefly describe how neural networks fit in to help us solve this issue. We’ll take the example above and show a model that helps solve it.

Reminder: “Basic factoid QA with single supporting fact”.

A selection of our Q&As for the computer to read through

It might seem like a daunting task to design something that allows a machine to read this and answer relevant questions, but let’s split the task into sections:

Our overview for tackling the problem

We already have our data, so that’s a great start! Essentially, we are taking the data, chopping it up into a more readable format, and feeding it to our engine, which will do the hard work for us: “learning” what rules these sentences generally follow, what patterns they copy and so on. We can assume step one is complete, so let’s move on.

Part 2 — Preprocessing

Where we aim to be at the end of the next step

Now to start chopping up the data so that it is more readable for a computer! It’s not too complicated: you can write a function in Python that splits a sentence every time it sees a blank space, and then adds the individual words to a new list.
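A minimal sketch of such a function might look like the following. The name `tokenize` is hypothetical, and treating the full stop and question mark as tokens of their own is an extra assumption on top of the plain whitespace split described above.

```python
def tokenize(sentence):
    """Split a sentence on blank spaces and collect the words in a list."""
    tokens = []
    for chunk in sentence.split():
        word = chunk.rstrip(".?")  # strip sentence-ending punctuation
        if word:
            tokens.append(word)
        if chunk != word:
            # Assumption: keep the punctuation as a separate token
            tokens.append(chunk[len(word):])
    return tokens


print(tokenize("Mary moved to the bathroom."))
print(tokenize("Where is Mary?"))
```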

Part 3 — Feature Extraction

The next step is a little more interesting: we need to take our chopped-up sentences and format them into new lists and collections so that our computer knows what it’s working with. Helpfully, this data set leaves us with only 22 unique words, which makes our task seem a little less daunting. The computer will then have a list of 22 words, and our task now is basically to figure out which combinations of these 22 words result in which answers. As a computer has no real understanding of what a sentence is, we could in theory feed it completely random combinations of words as practice examples to learn from; it would have no filter on what makes sense in real life.
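One common way to do this formatting, sketched below, is to build a vocabulary of unique words, give each word an integer, and pad every sentence to the same length. The two toy stories, the `vectorize` helper and the choice of 0 as a padding value are all assumptions for illustration; the real data set would produce the 22-word vocabulary mentioned above.

```python
# Toy stories, already split into words
stories = [
    ["Mary", "moved", "to", "the", "bathroom"],
    ["John", "went", "to", "the", "hallway"],
]

# Collect every unique word across the stories, sorted for a stable order
vocab = sorted({word for story in stories for word in story})

# Map each word to an integer, reserving 0 for padding
word_index = {word: i + 1 for i, word in enumerate(vocab)}


def vectorize(tokens, word_index, max_len):
    """Turn words into integers and left-pad with zeros to a fixed length."""
    ids = [word_index[word] for word in tokens]
    return [0] * (max_len - len(ids)) + ids


vector = vectorize(["Mary", "went", "to", "the", "hallway"], word_index, 6)
print(len(vocab))  # number of unique words in the toy stories
print(vector)      # the sentence as a fixed-length list of integers
```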

When we did simple linear regression, we were supplied with data points (x, y) that we were trying to fit a line to. We can see our current example as being similar: we have all our x values (the stories and questions) and our y values (the answers) given to us. What we have to find is the equivalent of the line of best fit, a function that maps one to the other, and then we’ll have a complete model!

Part 4 — Building your Memory Network Model

Brace yourself, this next bit involves some pretty gross generalisations, but we have to start somewhere! Neural networks are a concept based on, you guessed it, the neurons in your brain. A sensory neuron, like the ones connected to your eyes, is connected to millions of other neurons that help process and filter the input to your eye. Some of these neurons have more ‘weight’ than others. A neuron that senses dark red light might have a heavy weight that activates the danger neurons connected to your adrenal glands, as red is traditionally the colour of danger. The neurons that sense dark blue light might also be connected to the danger neurons, but the weight would be much lower, as we know blue isn’t traditionally a dangerous colour.

Imagine the same situation here: words like ‘the’, ‘and’ and ‘it’ wouldn’t be the most important hints towards getting the correct answer in the model, but the names and places would be! Say our sentence had three particular words: “Mary”, “bathroom” and “hallway”. We’d hope our function would take in these inputs and, when asked “Where is Mary?”, respond with “bathroom” or “hallway”, depending on the other words in the sentence, their order and so on. On a simple level, we’d expect the “bathroom” and “hallway” neurons to have strong weight, and the “kitchen” neuron to have weak weight, when fed these inputs in this example.

Here it is on a slightly more detailed level:

We can take our 22 unique words and combine them in every possible way by drawing branches between them. Furthest on the left, we make nodes for all the possible words that could be used in our questions. We “activate” these nodes depending on whether the word they represent is in our story. Depending on how many nodes have been lit up before it, each node further down the line could activate a little bit, or a lot. We finish at the final nodes, which are all the potential answers. In theory, one of the middle sets of nodes could help us make sense of the order of words in the sentence. Another set of nodes could be for capitalised words. The computer decides what it wants to make important in its sorting, not us! The branches themselves are where we incorporate weight: think of a metal wire that conducts electricity really well if we want lots of weight to be transferred to the next nodes, and is heavily insulated if we don’t want much weight to be transferred down.

Eventually we end up at the nodes furthest on the right, which are ‘lit up’ by a certain amount. We’ve used scaling functions along the way to make sure the strength with which we ‘light up’ each node can be represented by a number between zero and one. Then the node on the right that is lit up the brightest is our answer.
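To make this picture concrete, here is a minimal sketch of one pass through such a network in plain Python. The four input words, the single middle layer of three nodes and every weight are hand-picked assumptions for illustration (a real network learns its weights, and the real model starts from 22 words). The sigmoid is one common choice of scaling function; nothing above says which one was actually used.

```python
import math


def sigmoid(x):
    """Squash any number into the range between zero and one."""
    return 1.0 / (1.0 + math.exp(-x))


def layer(inputs, weights):
    """Each node sums its weighted inputs, then applies the sigmoid."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs))) for row in weights]


# Input nodes: 1.0 if the word is in the story, otherwise 0.0
words = ["Mary", "bathroom", "hallway", "kitchen"]
inputs = [1.0, 1.0, 0.0, 0.0]

# Hand-picked weights on the branches; one row per receiving node
hidden_weights = [[2.0, 3.0, -1.0, -1.0],
                  [2.0, -1.0, 3.0, -1.0],
                  [2.0, -1.0, -1.0, 3.0]]
output_weights = [[4.0, -2.0, -2.0],
                  [-2.0, 4.0, -2.0],
                  [-2.0, -2.0, 4.0]]
answers = ["bathroom", "hallway", "kitchen"]

hidden = layer(inputs, hidden_weights)
outputs = layer(hidden, output_weights)

# The brightest output node is the network's answer
best = max(range(len(outputs)), key=lambda i: outputs[i])
print(answers[best])
```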

Part 5 — Training the Model

How good is this answer? Not very good at the start, for sure! We will define something called a cost function, where we add up all the values in the nodes that weren’t supposed to be lit up. We know the correct answer, so we can measure how wrong our system has been in guessing it. In this context, we will have a number between zero and one for each node, and adding up all these numbers, except for the one in the correct node, gives us a value for how incorrect our guess is. Now our job is just to minimise this error value, the cost function!
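The cost described here can be sketched in a few lines. The output values and answer list below are invented for illustration. The second function shows a widely used alternative, the sum of squared differences from a target of one for the correct node and zero for the rest, which also penalises a correct node that is only dimly lit; it is included as a comparison and is an assumption, not something the description above specifies.

```python
def cost(outputs, correct_index):
    """Add up the activations of every node except the correct one."""
    return sum(v for i, v in enumerate(outputs) if i != correct_index)


def squared_error_cost(outputs, correct_index):
    """Alternative: squared distance from the ideal output (1 for correct, 0 elsewhere)."""
    return sum((v - (1.0 if i == correct_index else 0.0)) ** 2
               for i, v in enumerate(outputs))


answers = ["bathroom", "hallway", "kitchen"]
outputs = [0.7, 0.2, 0.4]            # made-up activations of the final nodes
correct = answers.index("bathroom")  # we know the right answer during training

print(cost(outputs, correct))                # sum of the two wrong nodes
print(squared_error_cost(outputs, correct))  # also counts the dim correct node
```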

Every time we send a practice question and answer through this system, we will be able to see how incorrect our weighting system is, and we give the system an ability to tweak the weights accordingly to try and bring this value down.
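One standard way to do this tweaking is gradient descent: after each example, nudge every weight a little in the direction that lowers the cost. The sketch below does this for a tiny one-layer network with sigmoid outputs and a squared-error cost. The toy data, the learning rate of 0.5 and the 200 passes are all assumptions chosen to keep the example small, and this is not the memory network itself.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict(weights, inputs):
    """One layer: weighted sum of the inputs, scaled to between 0 and 1."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs))) for row in weights]


def cost(outputs, target):
    """Squared-error cost for a single example."""
    return sum((o - t) ** 2 for o, t in zip(outputs, target))


def train_step(weights, inputs, target, learning_rate=0.5):
    """Nudge each weight downhill on the cost for one example."""
    outputs = predict(weights, inputs)
    for j, (out, want) in enumerate(zip(outputs, target)):
        # Derivative of (out - want)^2 with respect to the node's weighted sum
        delta = 2 * (out - want) * out * (1 - out)
        for i, x in enumerate(inputs):
            weights[j][i] -= learning_rate * delta * x


# Toy data: inputs flag which place word appeared; target marks the answer
examples = [([1.0, 0.0], [1.0, 0.0]),
            ([0.0, 1.0], [0.0, 1.0])]
weights = [[0.0, 0.0], [0.0, 0.0]]

before = sum(cost(predict(weights, x), t) for x, t in examples)
for _ in range(200):
    for x, t in examples:
        train_step(weights, x, t)
after = sum(cost(predict(weights, x), t) for x, t in examples)

print(before, after)  # the total cost should have dropped
```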

“What we mean when we talk about a network learning, is just that we are minimising this cost function”

After 10,000 examples, the memory network has been reducing and reducing this cost function as much as it can, and actually begins to light up the right nodes!

Part 6 — Tests and Results

It’s a brilliant theory, and on this simple level it manages to be 98% accurate with our test data, a great start!
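Accuracy here simply means the fraction of test questions answered correctly. A minimal sketch, with a hypothetical `accuracy` helper and made-up predictions (these do not reproduce the 98% figure):

```python
def accuracy(predictions, answers):
    """Fraction of test questions the model answered correctly."""
    correct = sum(1 for p, a in zip(predictions, answers) if p == a)
    return correct / len(answers)


# Invented example: three of the four guesses match the true answers
predicted = ["bathroom", "hallway", "kitchen", "garden"]
expected = ["bathroom", "hallway", "kitchen", "office"]

print(accuracy(predicted, expected))
```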

This very basic overview has hopefully given us an insight into how we might tackle these scary problems head on, and shown that they’re manageable!

Thanks to:


Applied Machine Learning blog and Data Skeptic podcast

Comrades Titus, Ioana and Joe

3Blue1Brown YouTube channel