Mark Salmon, Professor, Cambridge University; Director of Research, Centre for Advanced Financial Engineering and Advisor, Old Mutual Global Investors
May 22, 2018
View an extract of this session held at the London Big Data and Machine Learning Revolution event in April 2018.
A review of recent academic literature on the dangers of machine learning and on attempts to ensure causation rather than correlation in its use. Applications of machine learning in genome and cancer research have recognized this critical issue for some time, and the case is obviously equally strong in finance, where money may be allocated on the basis of completely spurious data-driven models. We will look at developments in "Post Model Selective Inference" and "Counter-factual Causal Prediction" with working examples.
I think everybody in the room understands what the problems with machine learning are: fundamentally, they are overfitting and the confusion between correlation and causation. What I want to say is that there are routes to solutions, some of them already being developed, that we can exploit, and I want you to understand what they are.
The Left had an article on the problems with machine learning about four weeks ago, talking about spurious correlation, and The Economist did two years ago. So this is not a new issue, and it is about time we really came to grips with it.
So the real issue is how to formally handle overfitting. Machine learning is very good at model reduction and that's essentially what we want to do. We want to identify the correct features or the correct factors or the correct strategies that are actually statistically relevant.
So there are three issues that I thought about when I was asked to talk. One is inference: bringing statistical inference back into machine learning. Another is causal inference and causal prediction. And the third is quality rather than quantity. What I'm going to talk about now is post-model selective inference and multiple hypothesis testing.
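As a rough illustration of why multiple hypothesis testing matters for feature selection, the sketch below searches 200 candidate features for a relationship with a target when everything is pure noise. It is not taken from the talk: the data are synthetic, the sample sizes are arbitrary assumptions, and the Bonferroni correction is used only as the simplest stand-in for the more refined post-selection methods the speaker refers to.

```python
import math
import random
from statistics import NormalDist


def pearson(x, y):
    """Sample Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def p_value(r, n):
    """Approximate two-sided p-value for H0: correlation = 0 (Fisher z-transform)."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


rng = random.Random(0)
n_obs, n_features, alpha = 100, 200, 0.05  # illustrative sizes, not from the talk

# The target and every candidate feature are independent noise,
# so no feature is truly relevant.
target = [rng.gauss(0, 1) for _ in range(n_obs)]
features = [[rng.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_features)]

correlations = [pearson(f, target) for f in features]
p_values = [p_value(r, n_obs) for r in correlations]

# Naive testing treats each feature as if it were the only one examined.
naive_hits = sum(p < alpha for p in p_values)
# Bonferroni divides the significance level by the number of tests performed.
bonferroni_hits = sum(p < alpha / n_features for p in p_values)
best_abs_r = max(abs(r) for r in correlations)

print(f"'significant' at 5% without correction: {naive_hits}")
print(f"significant after Bonferroni correction: {bonferroni_hits}")
print(f"largest absolute correlation found by searching: {best_abs_r:.2f}")
```

With a 5% test applied 200 times, roughly ten noise features would be expected to look significant by chance alone, which is the sense in which an uncorrected search can select features that are not statistically relevant.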
So I am going to exploit and report on the work that these names at the top of the slide have been producing over the last couple of years, and if you want to Google and find out where the literature is going, these are the people you should be looking at.
There's a lot of work by the people who developed machine learning. Essentially, machine learning is not static; it's evolving very rapidly, and the same guys who developed machine learning tools like Elastic Net and Lasso are now also moving forward into causality.
So we have to come to terms with that in finance. It's been in the biomedical area for some time, and we need to adopt the same techniques that they are using because we can't afford to make mistakes. In the same way that they can't in cancer detection, we can't allocate money to models that don't make any sense.
So machine learning is in a state of transition, in the middle ground between Breiman's two extremes of theory and data models. The theory model is the classic statistical model whereby you start off with a theory or a model, you get some data, you estimate it and then you draw some inference. Breiman criticized that paradigm, that culture, and said the artificiality of having a theory model to start off with is wrong; you should use the data to find out what model is supported by the data.
So Breiman's alternative, the data model, is where you get some data, you derive the model from the data and then maybe you do some inference, but how was not specified. That's the big vacuum that's been left there, and that's the vacuum that we can now fill with these inference techniques.
The other thing is that there's a big difference between prediction and causality. A lot of machine learning is orientated towards prediction, but a prediction is not an end in itself. Once you've predicted something, it gets put into a decision problem, and there's a loss function associated with that decision; it may be a portfolio allocation.
If you're loss averse, then really you should have an asymmetric loss function and have that loss function applied to your prediction. So you have a prediction which is optimal for the asymmetric loss function of the final decision. Most prediction in machine learning is done with the mean squared error loss function, which implies a quadratic loss.
I don't think many of us are equally hurt if we're five minutes early for a train or five minutes late for a train. We are loss averse. Many decisions are covered by asymmetric loss functions. So we need to think more clearly about what these predictions are doing, in terms of how they are optimal in the sense of the loss functions relevant for the decision that follows from the prediction.
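The train example can be made concrete with a small sketch. Everything here is an assumption for illustration: the journey times are simulated, and the pinball (quantile) loss with a 0.9 weight on lateness is just one possible asymmetric loss, not one named in the talk. The code finds the single prediction that minimizes average loss under a quadratic loss and under the asymmetric one.

```python
import random


def squared_loss(actual, predicted):
    """Quadratic loss: early and late errors of equal size hurt equally."""
    return (actual - predicted) ** 2


def asymmetric_loss(actual, predicted, late_weight=0.9):
    """Pinball loss: a journey longer than predicted (being late) costs more."""
    if actual >= predicted:
        return late_weight * (actual - predicted)
    return (1.0 - late_weight) * (predicted - actual)


def best_constant_prediction(outcomes, loss, grid):
    """Return the grid value with the lowest average loss over the outcomes."""
    return min(grid, key=lambda p: sum(loss(y, p) for y in outcomes) / len(outcomes))


rng = random.Random(1)
# Hypothetical journey times to the station, in minutes.
journey_minutes = [rng.gauss(30, 5) for _ in range(2000)]
grid = [10 + 0.1 * i for i in range(401)]  # candidate predictions, 10.0 to 50.0

sample_mean = sum(journey_minutes) / len(journey_minutes)
best_squared = best_constant_prediction(journey_minutes, squared_loss, grid)
best_asymmetric = best_constant_prediction(journey_minutes, asymmetric_loss, grid)

print(f"sample mean journey time:          {sample_mean:.1f}")
print(f"optimal under quadratic loss:      {best_squared:.1f}")
print(f"optimal under asymmetric loss:     {best_asymmetric:.1f}")
```

Under the quadratic loss the best prediction sits at the sample mean, while under this asymmetric loss it moves to a high quantile of the journey times, so a loss-averse traveller would allow noticeably more time than the mean suggests.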
The dangers of machine learning - Correlation is not causation.
Now we can do that with standard machine learning techniques, as I'll show you in a minute. I have some spurious correlation examples. So clearly eating ice cream triggers shark attacks. This is actual data from somewhere in the U.S., and the real causality is obvious: in the summer a seasonal factor causes people to go to the beach and into the sea. There are more shark attacks, and more ice cream is eaten.
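The ice cream and shark attack pattern can be reproduced with a minimal simulation. The numbers below are synthetic and are not the U.S. data shown in the talk; the variable names and the noise levels are assumptions. A common seasonal driver (temperature) feeds both series, and the partial correlation shows what is left of the relationship once that driver is controlled for.

```python
import math
import random


def pearson(x, y):
    """Sample Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_correlation(x, y, z):
    """Correlation between x and y after removing the linear effect of z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


rng = random.Random(2)
n_days = 5000

# Standardized temperature is the hidden seasonal factor.
temperature = [rng.gauss(0, 1) for _ in range(n_days)]
# Both series depend on temperature but not on each other.
ice_cream = [t + 0.5 * rng.gauss(0, 1) for t in temperature]
shark_attacks = [t + 0.5 * rng.gauss(0, 1) for t in temperature]

raw = pearson(ice_cream, shark_attacks)
controlled = partial_correlation(ice_cream, shark_attacks, temperature)

print(f"raw correlation, ice cream vs shark attacks: {raw:.2f}")
print(f"partial correlation given temperature:       {controlled:.2f}")
```

The raw correlation is strong even though neither series influences the other, and it largely disappears once temperature is accounted for. This only works here because the confounder is observed; it is a sketch of the problem, not of the causal methods discussed in the session.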
The next one is a bit more interesting: apparently there's a very strong relationship between the consumption of chocolate and winning Nobel Prizes. This must be something slightly more important; maybe we should look at individual Nobel Prize winners and see whether they are fat or thin.
This is looking at Google clicks. The blue line is clicks on auto sales and the red line is clicks on Indian restaurants. This is precisely what we need to remove from our application of machine learning in finance. Feature selection is very difficult, and that's why we need new tools to remove these problems. We have the wrong variables in our set.
Three most common approaches: