## Abstract

A review of recent academic literature that attempts to establish causation rather than mere correlation in applications of machine learning, and of the dangers when it does not. Applications of machine learning in genome and cancer research have recognized this critical issue for some time, and the case is equally strong in finance, where money may be allocated on the basis of completely spurious data-driven models. We will look at developments in "Post Model Selective Inference" and "Counter-factual Causal Prediction" with working examples.

### Overview

- Prediction vs. Explanation; Correlation vs. Causation. Spurious correlation - how do we handle overfitting formally? Model reduction, Machine Learning is good at this! So the question is how do we make it better?
- Post model selective inference and Multiple Hypothesis testing
- Introducing causal inference and causal prediction into Machine Learning, and vice versa; JMLR had a special issue on Causality in 2007!
- Big Data; Quality not Quantity
- Why are we interested in Causality, and why isn’t Prediction sufficient? A decision problem follows prediction; prediction is not an end in itself. Robustness, conditionality, invariance to regimes, interpretability
- Essentially an issue of spurious overfitting
- Cross-validation? Its theoretical basis is difficult and unclear, and it is invalid for causal models: counterfactual data, non-i.i.d. data, sample splitting
- When p
- When p > n, unpenalized estimation is infeasible, so penalization: LASSO, ANET, etc.
- Data-determined trade-off between bias and variance, beyond classical inference (BLUE): unbiasedness is not important for Prediction but is for Causality; the criteria differ
- Family-wise error rate: controls the probability of making at least one false rejection
- False discovery proportion: controls the probability that the proportion of false rejections in a given sample exceeds a user-specified level
- False discovery rate: controls the expectation, across many samples, of the proportion of false rejections
- Machine Learning needs the traditional scientific method of confidence intervals to report the value of a (data) model and to remove overfitting. This is not me saying this but the originators of many of today’s Machine Learning techniques.
- Machine Learning is evolving and will be very different in the near future: it will be common in the finance industry to use Multiple Hypothesis Testing and Post Model Selective Inference, simply because we cannot afford to make mistakes.
- This will make the application of Machine Learning and its results more robust, reliable and interpretable, a transition that has already taken place in biomedical research.
- Integration of Causality and Machine Learning might take longer but will have much more profound effects: integration of Machine Learning into structural statistical models, beyond prediction.
- Bottom line: would you invest in a strategy you do not understand or could not explain or sell? If not, use the correct statistical methods.
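The error-rate criteria listed above can be made concrete with a small sketch. It is not taken from the talk: it assumes the Bonferroni rule as one way to control the family-wise error rate and the Benjamini-Hochberg step-up rule as one way to control the false discovery rate (the latter under independent tests), and the p-values are hypothetical numbers chosen for illustration. Control of the false discovery proportion is not shown.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject H_i when p_i <= alpha / m; controls the family-wise error rate."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Step-up rule: find the largest rank k with p_(k) <= k * alpha / m and
    reject the k smallest p-values; controls the false discovery rate
    when the tests are independent."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            k_max = rank
    reject = [False] * m
    for idx in order[:k_max]:
        reject[idx] = True
    return reject


# Hypothetical p-values for ten candidate strategies (illustrative numbers only).
p_values = [0.001, 0.008, 0.012, 0.021, 0.04, 0.06, 0.2, 0.4, 0.6, 0.9]
naive = [p <= 0.05 for p in p_values]  # each test judged on its own at 5%

print(sum(naive), sum(bonferroni(p_values)), sum(benjamini_hochberg(p_values)))
```

On these illustrative numbers the naive rule rejects five hypotheses, Bonferroni one, and Benjamini-Hochberg three, which shows how much the choice of error criterion changes what counts as a discovery.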

## Machine Learning: Next Steps in Finance?

I think everybody in the room understands what the problems with machine learning are: fundamentally they are overfitting and the confusion between correlation and causation. What I want to say is that there are routes to solutions, some of them already being developed, that we can exploit, and I want you to understand what they are.

The Left had an article on the problems with machine learning about four weeks ago, talking about spurious correlation, and The Economist did two years ago. So this is not a new issue, and it is about time we really came to grips with it.

So the real issue is how to formally handle overfitting. Machine learning is very good at model reduction and that's essentially what we want to do. We want to identify the correct features or the correct factors or the correct strategies that are actually statistically relevant.
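As one hedged illustration of model reduction, here is a minimal LASSO sketch for the p > n case mentioned in the overview. Everything in it is an assumption made for the example: the data are simulated, only three of sixty features truly matter, the penalty `lam=0.5` is picked by hand, there is no intercept, and the solver is a plain cyclic coordinate descent rather than any particular library's implementation.

```python
import random

random.seed(2)
n, p = 40, 60                                    # more candidate features than observations
true_beta = [3.0, -3.0, 3.0] + [0.0] * (p - 3)   # only the first three features matter

X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [sum(b * x for b, x in zip(true_beta, row)) + random.gauss(0, 1) for row in X]


def soft_threshold(z, lam):
    """Shrink z towards zero by lam; values inside [-lam, lam] become exactly zero."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso(X, y, lam, sweeps=200):
    """Minimise (1/2n) * ||y - X b||^2 + lam * ||b||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # equals y - X b while b is all zeros
    col_scale = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(sweeps):
        for j in range(p):
            # Covariance of feature j with the residual that excludes its own fit.
            rho = sum(X[i][j] * residual[i] for i in range(n)) / n + col_scale[j] * beta[j]
            new_beta = soft_threshold(rho, lam) / col_scale[j]
            if new_beta != beta[j]:
                delta = new_beta - beta[j]
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                beta[j] = new_beta
    return beta


beta_hat = lasso(X, y, lam=0.5)
selected = [j for j, b in enumerate(beta_hat) if b != 0.0]
print(selected)
```

The penalty sets most coefficients to exactly zero, which is the model reduction. What the sketch does not give is any statement about whether the surviving features are statistically relevant, and that is the gap post model selective inference is meant to address.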

So there are three issues that I thought about when I was asked to talk. One is inference: bringing statistical inference back into machine learning. Another is causal inference and causal prediction. And the third is quality rather than quantity. What I'm going to talk about now is post model selective inference and multiple hypothesis testing.

So I am going to exploit and report on the work that the names at the top of the slide have been producing over the last couple of years, and if you want to Google and find where the literature is going, these are the people you should be looking at.

There's a lot of work by the people who developed machine learning. Essentially machine learning is not static, it's evolving very rapidly, and the same guys that developed machine learning tools like Elastic Net and LASSO are now also moving forward into causality.

So we have to come to terms with that in finance. It's been in the biomedical area for some time, and we need to adopt the same techniques that they are using because we can't afford to make mistakes. In the same way that they can't in cancer detection, we can't allocate money to models that don't make any sense.

So machine learning is in a state of transition, in the middle ground between Breiman's two extremes of theory and data models. The theory model is the classic statistical model whereby you start off with a theory or a model, you get some data, you estimate it and then you draw some inference. Breiman criticized that paradigm, that culture, and said the artificiality of having a theory model to start off with is wrong: you should use the data to find out what model is supported by the data.

So Breiman's alternative, the data model, is where you get some data, you derive the model from the data and then maybe you do some inference, but that wasn't specified. That's the big vacuum that's been left there, and that's the vacuum that we can now fill with these inference techniques.

The other thing is that there's a big difference between prediction and causality. A lot of machine learning is orientated towards prediction, but a prediction is not an end in itself. Once you've predicted something it gets put into a decision problem, and there's a loss function associated with that decision; it may be a portfolio allocation.

If you're loss averse, then really you should have an asymmetric loss function and have that loss function applied to your prediction. So you have a prediction which is optimal for the asymmetric loss function of the final decision. Most prediction in machine learning is done with the mean squared error loss function, which implies a quadratic loss.

I don't think many of us are equally hurt if we're five minutes early for a train or five minutes late for a train. We are loss averse. Many decisions are covered by asymmetric loss functions. So we need to think more clearly about what these predictions are doing, in terms of how they are optimal in the sense of the loss functions relevant for the decision that follows from the prediction.
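To make the point about loss functions concrete, here is a small sketch under stated assumptions: the outcomes are simulated standard normal draws (think of a journey time to the station), and the asymmetric loss is a simple piecewise-linear one in which under-predicting costs four times as much as over-predicting. The names `under_cost` and `over_cost` and the 4:1 ratio are illustrative choices, not something from the talk.

```python
import random
import statistics

random.seed(0)
# Simulated outcomes (not real data), e.g. standardised journey times.
outcomes = [random.gauss(0.0, 1.0) for _ in range(2000)]


def quadratic_loss(prediction, outcome):
    """Symmetric loss: being early or late by the same amount hurts equally."""
    return (outcome - prediction) ** 2


def asymmetric_loss(prediction, outcome, under_cost=4.0, over_cost=1.0):
    """Linear loss that charges under_cost per unit when the outcome exceeds
    the prediction and over_cost per unit when it falls short of it."""
    error = outcome - prediction
    return under_cost * error if error > 0 else -over_cost * error


def best_prediction(loss, candidates):
    """Grid search for the single prediction with the smallest total loss."""
    return min(candidates, key=lambda c: sum(loss(c, y) for y in outcomes))


grid = [i / 100 for i in range(-200, 201)]
best_quadratic = best_prediction(quadratic_loss, grid)
best_asymmetric = best_prediction(asymmetric_loss, grid)

print(best_quadratic, statistics.mean(outcomes))
print(best_asymmetric, statistics.quantiles(outcomes, n=5)[-1])  # 80% quantile
```

Under quadratic loss the best single prediction sits at the sample mean. Under the 4:1 asymmetric loss it moves up to roughly the 80% quantile, since 4 / (4 + 1) = 0.8. The same data therefore call for a different prediction once the decision's loss function is taken into account.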

## What We Already Know about Machine Learning

**The dangers of machine learning - Correlation is not causation.** Now we can do that with standard machine learning techniques, as I'll show you in a minute. I have some spurious correlation examples. So clearly eating ice cream triggers shark attacks. This is actual data from somewhere in the U.S., and the causality is obvious: in the summer a seasonal factor is causing people to go to the beach and go into the sea. There are more shark attacks, and more ice cream is eaten.
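The ice cream and shark attack story can be reproduced with simulated data. The sketch below is an assumption-laden toy, not the data from the slide: a hypothetical `temperature` series drives both variables, neither affects the other, and all coefficients and noise levels are made up for illustration.

```python
import random

random.seed(1)


def pearson(x, y):
    """Sample Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def residuals(y, x):
    """Residuals from a least-squares regression of y on x with an intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return [b - my - slope * (a - mx) for a, b in zip(x, y)]


# Simulated data: temperature drives both series; neither causes the other.
temperature = [random.uniform(5, 35) for _ in range(1000)]
ice_cream = [2.0 * t + random.gauss(0, 8) for t in temperature]
shark_attacks = [0.3 * t + random.gauss(0, 2) for t in temperature]

raw_corr = pearson(ice_cream, shark_attacks)
partial_corr = pearson(residuals(ice_cream, temperature),
                       residuals(shark_attacks, temperature))
print(round(raw_corr, 2), round(partial_corr, 2))
```

The raw correlation is strong, while the correlation of the residuals after removing temperature is close to zero. This only works because the common cause is known and measured here, which is rarely the case with a large set of candidate features.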

The next one is a bit more interesting: apparently there's a very high relationship between the consumption of chocolate and Nobel Prize winning. This must be something important; maybe we should look at individual Nobel Prize winners and see whether they are fat or thin.

### Correlation is not causation: Variable

This is looking at Google clicks. The blue line is clicks on auto sales and the red line is clicks on Indian restaurants. This is precisely what we need to remove from our application of machine learning in finance. Feature selection is very difficult, and that's why we need new tools to remove these problems. We have the wrong variables in our set.
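A rough sketch of how wrong variables get into the set: screen enough unrelated series against a target and some will look significant by chance. The numbers below are assumptions for illustration (200 pure-noise features, 100 observations), and the cut-off 1.96 / sqrt(n) is only the usual large-sample approximation to a two-sided 5% test on a correlation.

```python
import random

random.seed(3)
n_obs, n_features = 100, 200


def pearson(x, y):
    """Sample Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


# The target and every candidate feature are independent noise by construction.
target = [random.gauss(0, 1) for _ in range(n_obs)]
noise_features = [[random.gauss(0, 1) for _ in range(n_obs)]
                  for _ in range(n_features)]

correlations = [pearson(f, target) for f in noise_features]
critical = 1.96 / n_obs ** 0.5  # approximate 5% two-sided cut-off
spurious = [j for j, r in enumerate(correlations) if abs(r) > critical]

print(len(spurious), max(abs(r) for r in correlations))
```

With a 5% test applied 200 times, around ten features would be expected to pass even though none has any relationship with the target. A selection step that keeps them has found the equivalent of the auto sales and Indian restaurant chart.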

## What do we do about it?

### Sparse modelling - LASSO

### Multiple Hypothesis Testing

### Why Multiple Testing Matters

### Multiple Testing Problem

### Controlling Errors of Inference

Three most common approaches:

## Conclusion