Improving Stock Market Predictions using Natural Language Processing

Predicting how stock prices will change is a notoriously tricky business. Millions of people's jobs depend on it, and the many different approaches range from pure superstition to cutting-edge machine learning models.

LSTM models are a popular tool for making quick predictions about stock movements, or time series in general. By observing historical price changes, they can "see" patterns that humans cannot and make remarkably accurate predictions. What LSTM models struggle with is predicting the crashes that occasionally happen due to external factors (like a housing bubble popping). The disclaimer that investment advisors often give — "past prices are not necessarily an indication of future prices" — certainly rings true when crashes are involved, but in times of calm, LSTM models are remarkably useful. I made some quick baseline models of the FTSE 100 and S&P 500 (using the Keras library in Python) and was impressed at how easy it was to get reasonably accurate models of the ebbs and flows of the price. After giving the model the prices from 2010–2017 to learn from, see below how well it predicted 2018 and 2019.

My Baseline FTSE Model — Green is what happened; Orange is what my Model predicted to happen

The question remains: how might we improve these models? Simple: give them more data, not just past prices. With more information to learn from, an LSTM model will discover more patterns, and therefore make even more accurate predictions.

What extra data should we give our models? I theorised that it would be helpful to incorporate data on the sentiment of the newspapers we read.

My line of thought when imagining what data might improve my models

The idea that there might be some underlying patterns in newspaper headlines (which could give a model added predictive power) came to me after a conversation with a data scientist at the Financial Times in London. The FT's archive of published content is enormous (decades' worth of daily newspapers, for starters). Given that it has been the newspaper of choice in the City of London for decades, and continues to influence decisions made in the financial world, could we give this published content to our models "to read" and make better predictions than the analysts reading them?

It might sound like a scary space-age prospect — computers reading newspapers to make informed predictions — but Natural Language Processing is here, and it is powerful!

I recommend this article to anyone who wants to familiarise themselves with how powerful NLP is becoming:

Bump in the road

Unfortunately, the Financial Times was not the well of knowledge I wanted it to be. The FT "neither has clean databases of their past published content nor would they give it out for free if they did". However, I found that the Guardian and the New York Times both have publicly available databases of their content.

It’s worth mentioning that my definition of ‘improvement’ will be a reduction in the Mean Squared Error (MSE) on the validation set (the 2018 and 2019 values). The MSE is the average of the squared differences between what we predicted would happen and what actually happened. Better predictions = lower MSE.
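As a minimal illustration of the metric (a stdlib-only sketch; the function name and example prices are mine):

```python
def mse(predictions, actuals):
    """Mean Squared Error: the average of the squared prediction errors."""
    assert len(predictions) == len(actuals)
    return sum((p - a) ** 2 for p, a in zip(predictions, actuals)) / len(predictions)

# Two predictions, each 10 points away from the actual price:
print(mse([7100, 7150], [7110, 7140]))  # 100.0
```

Squaring the errors means a few wildly wrong predictions hurt the score far more than many slightly wrong ones.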


Here was my plan:

1 — Obtain Financial Data (from Pandas Data Reader)

2 — Obtain Newspaper Data (from APIs produced by the Newspapers)

3 — Analyse the Newspaper Data using Sentiment Analysis (TextBlob)

4 — Create a Baseline LSTM Model using only the Financial Data (using Keras)

5 — Add the Newspaper Data into the LSTM Models and see if they improve

Obtaining Financial Data

Pandas Data Reader is a brilliant tool which does the hard work for us. It requires little more than specifying a date range and the ticker symbol for the data we want (in this case ^UKX, the symbol for the FTSE 100), and it returns a nice clean dataset.
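A sketch of what that fetch might look like (the article doesn't show its exact code; the `stooq` source and the function name are my assumptions, and the import is deferred so the sketch loads without pandas-datareader installed):

```python
def fetch_prices(symbol="^UKX", start="2010-01-01", end="2019-12-31"):
    """Fetch daily price data via pandas-datareader.

    Note: the 'stooq' data source is one of several that carry FTSE 100
    data under the ^UKX symbol; swap in whichever source you have access to.
    """
    import pandas_datareader.data as pdr  # deferred: an optional dependency
    return pdr.DataReader(symbol, "stooq", start=start, end=end)
```

Calling `fetch_prices()` returns a DataFrame of daily open/high/low/close prices indexed by date.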

Only a few lines of code, and you get all of this!

Obtaining Newspaper Data

This was a little more tricky. Both the Guardian and the NYT have APIs (portals that allow us to request data, under some strict rules). For example, I could request articles from the NYT for a particular month, but I then had to do the hard work of filtering out the meaningless items myself (each request returned thousands of articles written that month). I made a point of including only articles I thought likely to influence the financial markets (front page, business, tech, manufacturing, etc.).

I wrote a few functions that collectively took the results of the API requests, and filtered through them, returning only the Headlines on days that I had specified:
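A simplified sketch of that filtering step (the field names mimic the shape of NYT Archive API responses as I recall them, and the section list is illustrative, not the article's exact choice):

```python
SECTIONS = {"Business Day", "Technology", "Front Page"}  # illustrative filter

def filter_headlines(docs, wanted_dates, sections=SECTIONS):
    """Keep only headlines from chosen sections on chosen dates.

    Each item in `docs` is assumed to look like an NYT Archive API doc:
    'headline' -> {'main': ...}, 'section_name', and an ISO 'pub_date'.
    Returns a dict mapping 'YYYY-MM-DD' to a list of headlines.
    """
    out = {}
    for doc in docs:
        date = doc["pub_date"][:10]  # 'YYYY-MM-DD' prefix of the ISO timestamp
        if date in wanted_dates and doc.get("section_name") in sections:
            out.setdefault(date, []).append(doc["headline"]["main"])
    return out
```

Running thousands of raw docs through a filter like this leaves one short list of market-relevant headlines per trading day.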

Analysing the Newspaper Data

One drawback of LSTM models is that they accept only numerical data. So, how can we represent our headlines numerically? TextBlob.

TextBlob is a sentiment analysis library: it reads the text you give it and returns a rating of how subjective the sentence is, as well as a rating of its polarity (-1 for very negative, 1 for very positive).

I fed all the Headlines through TextBlob and received a rating for how positive or negative the news that day was. Assuming there is a correlation between positive news stories and upswings in the stock market (and negative stories and downswings), this might help our model’s accuracy, so those are the numbers I stored for later use.
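The aggregation step might look something like this (a sketch of my own; the scoring function is passed in so the example runs without TextBlob installed — in practice it would be `lambda h: TextBlob(h).sentiment.polarity`):

```python
def daily_sentiment(headlines_by_day, score):
    """Average per-headline polarity scores into one number per day.

    `score` is any callable mapping a headline to a value in [-1, 1];
    with TextBlob this would be: lambda h: TextBlob(h).sentiment.polarity.
    Days with no headlines are dropped rather than scored as zero.
    """
    return {
        day: sum(score(h) for h in headlines) / len(headlines)
        for day, headlines in headlines_by_day.items() if headlines
    }
```

The output — one polarity number per trading day — is exactly the extra feature column the model needs.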

Creating a Baseline Model for comparison

Using Keras, another brilliant Python library, I was quickly able to set up baseline models for the FTSE and S&P. It made more sense to me to try to improve a decent baseline model, as showing that I improved a weak model with my newspaper data would be meaningless (nobody wants a poor model in the first place). So I tweaked the parameters of the models until they were reasonably accurate (changing the number of epochs, the batch size and the optimisation function, for anyone interested), and then noted the MSE of the best models.
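A baseline along these lines might be built as follows (the layer sizes, optimiser and loss are my guesses at a reasonable configuration, not the article's exact settings; the Keras import is deferred so the sketch reads without TensorFlow installed):

```python
def build_baseline(lookback, n_features=1):
    """A small LSTM regressor: `lookback` days in, next-day price out.

    Hyperparameters here are illustrative; in the project they were tuned
    by hand (epochs, batch size, optimiser) until the MSE looked reasonable.
    """
    from tensorflow.keras import Sequential      # deferred: heavy dependency
    from tensorflow.keras.layers import LSTM, Dense
    model = Sequential([
        LSTM(50, input_shape=(lookback, n_features)),
        Dense(1),                                # single output: next-day price
    ])
    model.compile(optimizer="adam", loss="mse")  # loss matches the comparison metric
    return model
```

Compiling with an `mse` loss means the training objective is the same quantity used to judge improvement on the validation set.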

Adding in the Newspaper Data

Just like preparing your data to feed your model, you’ve got to line your coins up before pushing them into the coin machine!

This was a relatively quick final stage, thanks to my hard work up to this point, but I did need to be careful about the format of the data I fed into my model. LSTM models are fiddly about formatting, requiring you to shape your data into pre-determined shapes before inserting it (think of those coin machines on arcade games: you’ve got to line your coins up in a certain way before pushing them in). When you go from univariate data (just financial) to multivariate data (financial + newspaper), this becomes a whole level more complicated. I ran the final models and took a closer look at their performance.
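The reshaping step can be sketched like this (function and variable names are mine; Keras LSTMs expect input of shape `(samples, timesteps, features)`):

```python
import numpy as np

def make_windows(series, lookback):
    """Shape a (n_days, n_features) array into LSTM input.

    Returns X of shape (samples, lookback, n_features) and y holding
    the next day's value of feature 0 (the price). In the multivariate
    case, feature 1 would carry that day's newspaper sentiment score.
    """
    X = np.array([series[i:i + lookback] for i in range(len(series) - lookback)])
    y = np.array([series[i + lookback, 0] for i in range(len(series) - lookback)])
    return X, y

# 10 days x 2 features (price, sentiment) -> 7 sliding windows of 3 days each
demo = np.arange(20, dtype=float).reshape(10, 2)
X, y = make_windows(demo, lookback=3)
print(X.shape, y.shape)  # (7, 3, 2) (7,)
```

Going multivariate only changes the last dimension: the same windowing code works whether each day carries one feature or several.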

Looking great to my eyes — but let’s have a detailed look at the results

The Results

A 2% drop in the loss of the validation set — evidence of success!

All this hard work seems to have made a little bit of difference to our final FTSE model. Our MSE on the validation set has gone down, so our predictions are closer to being correct. This suggests our model is more likely to be accurate for 2021 and beyond.

Unfortunately, my S&P model didn’t show any improvement in MSE, which could be down to many reasons, but is most likely because my New York Times data was less informative.

Future Work and Improvements

This project was just the tip of the iceberg for me. Here are a few examples of things I plan to work on in the future:

1 — I plan to improve the quality of my newspaper data. Filtering for headlines that relate to financial matters only might help remove some of the noise. I’d also like to try other methods of sentiment analysis — polarity of the headlines might not be the best method for extracting information. Bag of Words would be my next approach (to see whether particular words in headlines correlate with specific trends in the stock market), but I hesitated to do that this time round purely because of the size of the arrays I would need to create and feed into the models.

2 — I’d like to improve the quality of my modelling. I would approach this by running the training on some form of cloud computing (likely AWS) so that I can cut down on training time and push the limits of my modelling. More epochs, more layers, more types of optimisation, etc.

3 — I’d like to experiment with different kinds of data. Newspaper headlines are a fun idea, but what about using the actual content of newspaper articles? Or statistics from Google searches? Data from the OECD? Twitter trends? Spotify statistics? Who knows what butterfly effect might link one type of data to changes in the stock market. With more computing power, there is no limit to what you can try.

4 — I’d like to try predicting specific stock prices, rather than the entire S&P500, and then perhaps filter the newspaper data to include only articles relating to subjects in that category. For example, use Apple Stock price (AAPL) and screen on news about Apple, technology, and California.

5 — I’d like to try different newspapers, potentially using the Wayback Machine to scrape data from past newspaper websites.

6 — I would like to try training my model on the dates running up to stock market crashes, to develop a tool that predicts them before they happen.

If you’d like to have a closer look at this project, please feel free to look at the Github Repo I have for it:


