A practitioner’s point of view on what is challenging in big data and machine learning investing and what to do about it: demystifying the “magic box” and sharing best practices and real-life examples of applying machine learning to investing, including NLP with RavenPack. View an extract of Understanding and Overcoming the Weak Points of Big Data and Machine Learning by Andrej Rusakov, Co-founding Partner, Data Capital Management, held at the London Big Data and Machine Learning Revolution event in April 2018.
Machine learning (ML) is NOT a magic box
ML investing requires an entirely new setup of the firm, one that is drastically different from a traditional setup. Machine learning is changing the way people invest, just as it has enhanced other fields. I will give you an overview of why it is not a solution that will completely revolutionize everything, and of what is hard in what we are doing.
Contrary to common belief, this magic black box is not easy to implement. The barriers to entry are still pretty high for people who take it seriously, both from a team perspective, since you have a lot of brain power to invest in, and from a capital perspective.
What We Are Trying to Achieve is the Following
First, we need to build an assembly line whereby we take data, clean it, structure it, and provide it to researchers so they can easily experiment with it and build models on it.
Secondly, we need to find structure within those data sets that is predictive of the returns we are trying to model, and then execute on those predictions and manage the risk. It sounds relatively simple at a high level, but if you dig deeper, you need a whole new team to do this, different approaches to data handling, and robust and differentiated technology, which in most cases means things like cloud services and all the underlying architecture that goes along with them. You also need support functions that are slightly different. How do you explain what your model does to investors and your compliance officer if they have no idea?
Typical Approach Does Not Work
The typical approach usually looks like this:
- Hire 10 PhDs and demand that each of them produce an investment strategy within 6 months
- This approach typically backfires because each of these PhDs will frantically search for investment opportunities and eventually settle for: (a) a false positive that looks great in an overfit backtest, or (b) a standard factor model, which is an overcrowded strategy with a low Sharpe ratio but at least has academic support
- Both outcomes are disappointing
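Outcome (a) can be illustrated with a small simulation. The sketch below is a toy, not the speaker's method: it assumes 500 hypothetical strategies whose daily returns are pure Gaussian noise with zero true edge, picks the one with the best in-sample Sharpe ratio, and then checks that same strategy on a later, unseen period. All names and parameters (`simulate_selection`, the 1% daily volatility, the period lengths) are illustrative assumptions.

```python
import math
import random

TRADING_DAYS = 252  # assumed annualization factor


def annualized_sharpe(returns):
    """Annualized Sharpe ratio of a list of daily returns (risk-free rate assumed zero)."""
    n = len(returns)
    mean = math.fsum(returns) / n
    var = math.fsum((r - mean) ** 2 for r in returns) / (n - 1)
    return mean / math.sqrt(var) * math.sqrt(TRADING_DAYS)


def simulate_selection(n_strategies=500, in_days=252, out_days=1008, seed=0):
    """Generate noise-only strategies and select the best in-sample performer.

    Returns (best in-sample Sharpe, out-of-sample Sharpe of that same strategy,
    list of (in-sample, out-of-sample) Sharpe pairs for every strategy).
    """
    rng = random.Random(seed)
    results = []
    for _ in range(n_strategies):
        # Zero-mean noise: by construction there is nothing real to discover.
        daily = [rng.gauss(0.0, 0.01) for _ in range(in_days + out_days)]
        in_sample = annualized_sharpe(daily[:in_days])
        out_sample = annualized_sharpe(daily[in_days:])
        results.append((in_sample, out_sample))
    best_in, best_out = max(results, key=lambda pair: pair[0])
    return best_in, best_out, results


if __name__ == "__main__":
    best_in, best_out, _ = simulate_selection()
    print(f"best in-sample Sharpe:      {best_in:.2f}")
    print(f"same strategy out-of-sample: {best_out:.2f}")
```

Because the winner is chosen after looking at many candidates, its in-sample Sharpe ratio is inflated by selection alone, and there is no reason to expect it to persist out of sample. The more strategies a team tries, the stronger this effect becomes.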
Why is it Hard?
- The complexities involved in developing a true investment strategy are overwhelming:
- Data collection, curation, processing, structure
- HPC infrastructure
- Software development
- Feature analysis
- Execution simulators
Even if the firm provides you with shared services in those areas, you are like a worker at a BMW factory who has been asked to build the entire car alone, using all the workshops around you.
- One week you need to be a master welder, another week an electrician, another week a mechanical engineer… try, fail and circle back to welding. It is a futile endeavor
Challenge 1: Data Acquisition
Identifying new strategies requires large teams with very diverse skill sets working together.
Challenge 2: Data Integration
Data are just summaries of thousands of stories.
This is a problem that most of us are familiar with. How do you deal with the bias-variance tradeoff? This is something we face every day. You have to make sure your model predicts the real distribution of outcomes in the most plausible way, but you don’t want to overfit the model to the point where it is basically showing you noise.
There is no one-size-fits-all solution in machine learning. Each method is fit for a particular purpose, and you have to be smart about which one you pick.
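The bias-variance tradeoff can be made concrete with a toy example. The sketch below is an illustration under stated assumptions, not a financial model: it fits a k-nearest-neighbors regressor to noisy samples of a sine curve and compares training and test error for three values of k. The function names, the sine target, the noise level, and the choices of k are all hypothetical.

```python
import math
import random


def make_data(n, rng, noise=0.5):
    """Draw n points (x, y) with y = sin(x) + Gaussian noise, x uniform on [0, 2*pi]."""
    data = []
    for _ in range(n):
        x = rng.uniform(0.0, 2.0 * math.pi)
        data.append((x, math.sin(x) + rng.gauss(0.0, noise)))
    return data


def knn_predict(train, x, k):
    """Predict y at x as the average y of the k training points nearest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / k


def mse(train, data, k):
    """Mean squared error of the k-NN predictor on the given data set."""
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in data) / len(data)


def bias_variance_demo(seed=0):
    """Return {k: (train MSE, test MSE)} for a flexible, a moderate, and a rigid model."""
    rng = random.Random(seed)
    train = make_data(200, rng)
    test = make_data(200, rng)
    report = {}
    for k in (1, 15, 200):
        report[k] = (mse(train, train, k), mse(train, test, k))
    return report


if __name__ == "__main__":
    for k, (train_err, test_err) in bias_variance_demo().items():
        print(f"k={k:3d}  train MSE={train_err:.3f}  test MSE={test_err:.3f}")
```

With k = 1 the model reproduces every training point, noise included, so its training error is zero while its test error is high (high variance). With k equal to the full training set it predicts the same average everywhere and misses the curve entirely (high bias). An intermediate k should generalize best, which is the balance the paragraph above describes.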
Applying ML to finance is NOT straightforward. A “plug & play” approach of taking ad tech ML tools and applying them to finance does not work.
Most practitioners do not deeply understand the math behind the ML they use. It is a good idea to deeply understand the math behind the ML method, or to be able to interpret its results, and ideally both.