Panelists address the key areas financial institutions should keep in mind when evaluating alternative data, so that they avoid wasting resources on datasets that are unlikely to provide value. They share their experience of the pitfalls that quant and fundamental investors should avoid, and of how to be successful with alternative data. They also discuss the attributes that a potentially performing alternative dataset needs to have.
Moderated by: Dan Furstenberg, Head of Data Strategy, Jefferies.
Panelists: Peter Hafez, Chief Data Scientist, RavenPack; Leigh Drogen, CEO, Estimize; Rich Brown, Managing Director, Schonfeld Strategic Techworx; and Michael Mayhew, Principal, Integrity Research. The panel was held at the London Big Data and Machine Learning Revolution event in April 2018. You can also access the full video below.
How Should Firms Go About Sourcing New Datasets?
I would like to put you all on the spot. You are all from different areas (buy side, sell side and systematic), so from each of your vantage points, what are the biggest pitfalls when sourcing data, both for your own companies and for the clients you are working with? And how have you evolved your process in the last year or so in terms of an audit function?
From a data sourcing perspective, there are about 1,500 vendors out there in the data vendor space. So what you really have to do is separate the wheat from the chaff: find out whether there is a proven methodology in how they've assembled the database and how they are bringing it to market, when the out-of-sample period truly starts, and how mature their internal processes are.
How many people are you using to support your audit function? Is it one, is it 10? Is it a lawyer or is it two data scientists? What's the composition, loosely, and what infrastructure do you think is needed to do this effectively, both for systematic and fundamental?
I think for a multi-manager platform, it's a little more challenging. We certainly have rules and procedures in place that each of the managers has to follow in order to be able to analyze the data, but the data coming in sort of starts with my team, and then we have a matrix environment which includes both legal and compliance. So we do the typical things that people are doing around due diligence questionnaires and really digging into the individual vendors, whether they're the true alternative ones or even just some of the traditional market data vendors that, you know, have a product that's a little bit on the edge of alternative.
And the background of the people auditing the data, does it help you if it is engineering or statistics? What do you find works best for them?
You know, I think in general it's a function of the legal and compliance team. So we don't bring in any data scientists to truly go into the pure audit function. There are certain data sets that we are comfortable with, and the procedures and due diligence that we do are put together in that way. With other data sets, I think there are certain limitations to what they're doing now, so they don't fit some of our strategies.
Leigh and Peter, you are real data providers of sorts. But you're also co-founders and top executives. You're betting your company, your life, every day that the data you're providing is accurate. Walk us through this, as I think that's an incredibly stressful proposition. I would lose sleep over it. How are you protecting your data, your brand? I mean, with all due respect, you're leaders in your space, but you're not a thousand-person company, so how do you think about the process and the infrastructure as well?
For us it starts with how we collect the data itself. We're naturally a crowdsourced platform where we're asking buy-side, independent, non-professional and industry experts to all come to the platform and give us their forward-looking views. And so instead of having that be a complete free-for-all, we obviously use a set of algorithms, and a manual review after those algorithms, to make sure we don't have crazy data and to make sure we don't have people attempting to misuse the platform. So we don't really lose sleep over that, because we built those algorithms a long time ago.
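As an illustration of the kind of automated screen described above, here is a minimal sketch in Python. It is not Estimize's actual algorithm, which the panel does not describe. It assumes a simple median-absolute-deviation rule, and the function name `flag_outliers`, the threshold of 5 and the sample EPS estimates are all made up for the example. Flagged values would then go to the manual review step rather than being dropped outright.

```python
from statistics import median

def flag_outliers(estimates, threshold=5.0):
    """Split estimates into (kept, flagged) using a median-absolute-deviation rule.

    Hypothetical helper: an estimate is flagged when it sits more than
    `threshold` MADs away from the median of all submitted estimates.
    """
    med = median(estimates)
    mad = median(abs(x - med) for x in estimates)
    if mad == 0:
        # No measurable spread, so the rule cannot score anything;
        # keep everything and leave the decision to manual review.
        return list(estimates), []
    kept, flagged = [], []
    for x in estimates:
        score = abs(x - med) / mad
        (flagged if score > threshold else kept).append(x)
    return kept, flagged

# Made-up quarterly EPS estimates for one company; 10.5 looks like a typo for 1.05.
eps_estimates = [1.02, 1.05, 0.98, 1.01, 1.04, 10.5, 0.99]
kept, flagged = flag_outliers(eps_estimates)
print("kept:", kept)
print("flagged for manual review:", flagged)
```

A median-based rule is used here because a single extreme submission barely moves the median, whereas it would distort a mean and standard deviation computed from the same small sample.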
What we do lose sleep over, or let's say try really hard to affect, is what happens when we hand data over to the buy side to test. It's not just about throwing the data over the wall and hoping that they come back and say ‘great, we found awesome stuff’. That simply doesn't work, no matter how sophisticated the team is. What does work, and what we've worked really hard on over the years, is producing both event studies and quantitative research.
This runs from the basic descriptive work that they can trust all the way down to the level of factor models in Python notebooks, which we can hand over with the full code and strategy. That's what, you know, I lose sleep over: how do we make sure that we best present the data to a buy-side firm, so that it's as easy as possible for them to move it through their process?
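To make the idea of an event study concrete, here is a minimal sketch in Python. It is not the research the speaker hands to clients. It assumes the simplest market-adjusted definition, where the abnormal return is the stock return minus the market return, and it uses made-up daily returns. The function names and the two-day horizon are hypothetical.

```python
from statistics import mean

def cumulative_abnormal_return(stock_returns, market_returns, event_index, horizon):
    """Sum of (stock - market) daily returns over `horizon` days after the event day."""
    start = event_index + 1
    end = start + horizon
    if end > len(stock_returns):
        raise ValueError("not enough data after the event")
    window = zip(stock_returns[start:end], market_returns[start:end])
    return sum(s - m for s, m in window)

def event_study(stock_returns, market_returns, event_indices, horizon=2):
    """Return (average CAR, list of per-event CARs) across all events."""
    cars = [
        cumulative_abnormal_return(stock_returns, market_returns, i, horizon)
        for i in event_indices
    ]
    return mean(cars), cars

# Made-up daily returns; events (for example, a signal firing) on days 0 and 4.
stock = [0.000, 0.010, 0.020, 0.000, 0.000, 0.030, 0.010, 0.000]
market = [0.000, 0.002, 0.004, 0.000, 0.000, 0.005, 0.001, 0.000]
average_car, per_event_cars = event_study(stock, market, event_indices=[0, 4], horizon=2)
print("average 2-day CAR:", round(average_car, 4))
```

A fuller study would typically estimate expected returns from a factor model rather than subtracting the market return, and would test whether the average is statistically different from zero.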
I would say from our side, we are a data provider, and we are obviously consuming a lot of data ourselves. We take in a lot of newsfeeds, and we are obviously checking the quality of those sources before we let them into the products. So in the process that we go through, first of all we have to take in the data and normalize the content, turning it into a RavenPack XML format.
Then we would run an initial classification of the content and give it to the Data Science Team to do some tests on it, to see, you know, what is in the mix of things that we can extract from the content. How does it interact with what we already have? Is the data consistent over time, or is the volume that we see in particular sources moving around a lot? Are there holes in the archives? Can you trust the timestamps? All of these types of things. So there's a lot of quality assurance from that side.
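Two of the checks mentioned above, holes in the archive and trustworthy timestamps, can be sketched in a few lines of Python. This is an illustration only, not RavenPack's process. It assumes the archive has already been reduced to a story count per calendar day and to pairs of (published, received) timestamps. The function names, the 24-hour lag limit and the sample data are hypothetical.

```python
from datetime import date, datetime, timedelta, timezone

def find_archive_gaps(daily_counts, start, end):
    """Return every date in [start, end] for which the archive holds no stories."""
    gaps = []
    day = start
    while day <= end:
        if daily_counts.get(day, 0) == 0:
            gaps.append(day)
        day += timedelta(days=1)
    return gaps

def suspicious_timestamps(records, max_lag=timedelta(hours=24)):
    """Flag (published, received) pairs that arrive before publication or far too late."""
    flagged = []
    for published, received in records:
        lag = received - published
        if lag < timedelta(0) or lag > max_lag:
            flagged.append((published, received))
    return flagged

# Made-up story counts for one source; 4 April 2018 is missing entirely.
counts = {date(2018, 4, 2): 120, date(2018, 4, 3): 115, date(2018, 4, 5): 130}
gaps = find_archive_gaps(counts, date(2018, 4, 2), date(2018, 4, 5))

utc = timezone.utc
records = [
    (datetime(2018, 4, 2, 9, 0, tzinfo=utc), datetime(2018, 4, 2, 9, 0, 2, tzinfo=utc)),   # plausible
    (datetime(2018, 4, 2, 10, 0, tzinfo=utc), datetime(2018, 4, 2, 9, 59, tzinfo=utc)),    # received before published
    (datetime(2018, 4, 1, 9, 0, tzinfo=utc), datetime(2018, 4, 3, 9, 0, tzinfo=utc)),      # 48-hour lag
]
bad_records = suspicious_timestamps(records)
print("archive gaps:", gaps)
print("suspicious timestamp pairs:", len(bad_records))
```

Whether a day with zero stories is a genuine hole depends on the source. A source that does not publish at weekends, for instance, would need a calendar-aware version of the gap check.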
Then of course, in our own delivery of product, we have a great technology and Operations Team that makes sure that everything is handled as it should be and that our systems are up and running at all times. Uptime is at 99.99993%, so we are doing really well on that side. And then it's just a matter of tracking what's being produced in real time, in terms of volumes across various regions and categories and so on, to see if something is out of the ordinary and to try to capture it if we see something.
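The volume tracking described above could look something like the following Python sketch. The panel does not say how the monitoring is actually implemented, so this assumes a plain z-score of the latest count against a short history for each (region, category) bucket. The function name `volume_alerts`, the threshold of 3 and the counts are made up.

```python
import math
from statistics import mean, stdev

def volume_alerts(history, current, z_threshold=3.0):
    """Return {bucket: z-score} for buckets whose latest count is out of the ordinary.

    history: bucket -> list of past counts; current: bucket -> latest count.
    A bucket missing from `current` is treated as having produced nothing.
    """
    alerts = {}
    for bucket, past in history.items():
        if len(past) < 2:
            continue  # too little history to estimate a spread
        mu = mean(past)
        sigma = stdev(past)
        diff = current.get(bucket, 0) - mu
        if sigma == 0:
            # Perfectly flat history: any change at all is worth a look.
            if diff != 0:
                alerts[bucket] = math.copysign(math.inf, diff)
            continue
        z = diff / sigma
        if abs(z) > z_threshold:
            alerts[bucket] = z
    return alerts

# Made-up hourly story counts per (region, category) bucket.
history = {
    ("Europe", "earnings"): [100, 104, 98, 102, 96],
    ("Asia", "earnings"): [50, 52, 49, 51, 48],
}
current = {("Europe", "earnings"): 101, ("Asia", "earnings"): 5}
alerts = volume_alerts(history, current)
for bucket, z in alerts.items():
    print(bucket, "z-score:", round(z, 1))
```

In practice the baseline would need to account for time of day, weekends and scheduled events such as earnings seasons, which a flat history like this one ignores.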
I'm curious, when you have found errors, when you have found mistakes, which bucket do they typically fall in, Peter?
I would say, in terms of things that look a little strange, that we are not always in control of what we are receiving from our vendors. If you look at what we would call more ‘premium sources’, there you have good consistency. But as soon as you go to the web and get content, there start to be some potential issues.
You can see certain sources dropping off or coming back in, or, you know, it's a little less consistent. These are sometimes the issues that we have, or we have a vendor suddenly shutting off a particular source for us without telling us, something like that. Then we have to go and chase them and try to find other ways of getting the sources back.