How to Select the Right Alternative Datasets

RavenPack | March 11, 2020

Hedge fund managers used to actively seek out novel alternative datasets to give them an edge. Now, following exponential growth in the sector and a mushrooming of available datasets, they face the opposite problem: how to make the right choice from so many options.

From Sea of Data to a Dataset Sea

The number of alternative datasets is exploding, and for some buyers this has resulted in an ‘embarrassment of riches’.

“We are seeing tons of new sources. I used to love talking to people, but now I hate it,” says Matei Zatreanu, founder of System2, a data science service provider. “It’s like, another one of these, another one of those, and it’s trying to figure out how to filter through all of that noise.”

[Figure: Alternative Datasets Released]

For many asset managers (AMs) the solution has been to set up a dedicated data science team to evaluate all the competing alternative datasets.

Others have taken the more radical step of building online ‘evaluation portals’ where prospective vendors submit their datasets for pre-selection testing; examples include funds such as Balyasny and WorldQuant.

Data-to-end-user matchmaking agencies such as Battlefin have sprung up, regularly hosting conferences, events, and data science strategy competitions that provide a new forum for hedge fund manager ‘data-dating’.

A class of alt data middlemen has emerged who can handle the whole process from selection to strategy development; one example is System2, which helps smaller funds and AMs access the alternative data space.

[Figure: Alternative Datasets Selection Process]

It is becoming increasingly important for funds to devise ways of prioritizing the most valuable alternative datasets. But what do they look for when screening prospects, and how could they improve their selection processes?

Getting a Good Deal

Of key importance is ascertaining the data’s quality: where it is sourced from, how far back it goes, and how structured it is. If the data lacks sufficient history, for example, it may be impossible to backtest properly. If it is unstructured, the user will have to spend time and resources cleaning and formatting it.
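As a rough illustration of these checks, the sketch below measures how far back a sample goes and what share of its records are complete. The record layout, the field names, and the ten-year threshold are all hypothetical assumptions made for the example, not part of any vendor's format.

```python
from datetime import date


def screen_dataset(records, required_fields, min_years=10.0):
    """Summarize history depth and field completeness for a list of dict records.

    The 'date' key and the min_years threshold are illustrative assumptions.
    """
    dates = [r["date"] for r in records]
    years = (max(dates) - min(dates)).days / 365.25
    # A record counts as complete when every required field is present and not None.
    complete = sum(
        all(r.get(f) is not None for f in required_fields) for r in records
    )
    return {
        "years_of_history": years,
        "complete_share": complete / len(records),
        "enough_history": years >= min_years,
    }


# Made-up sample rows, purely for demonstration.
sample = [
    {"date": date(2000, 1, 3), "ticker": "AAA", "sentiment": 0.2},
    {"date": date(2010, 6, 1), "ticker": "AAA", "sentiment": None},
    {"date": date(2020, 3, 11), "ticker": "BBB", "sentiment": -0.1},
]
report = screen_dataset(sample, ["ticker", "sentiment"])
print(report)
```

A real screen would look at far more than this (point-in-time accuracy, survivorship, coverage by asset), but even a summary like this can flag a dataset that is too short to backtest or too patchy to use without cleaning.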

An example of a dataset that ticks all the boxes is RavenPack’s news sentiment data, which is sourced from the highest-quality media sites: Dow Jones, Barron’s, the Wall Street Journal. The data also goes back all the way to 2000, allowing for large-sample backtesting, and it is already structured so users can start analyzing it immediately.

Asking the Right Questions

It is critical to frame the investment ‘hypothesis’ clearly at the start of the search, as this will determine which of the competing datasets offers the best chance of testing it.

“It has less to do with the datasets and more to do with what kind of questions you are asking of the datasets,” says Matei Zatreanu.

Zatreanu’s company, System2, follows a process in which it first establishes the companies its clients want to trade, then what those companies’ KPIs are, and then which datasets are best suited to evaluating the KPIs in question.

Sometimes datasets, particularly those that are large, may also require creativity and inventiveness on the part of the analyst to get the most out of them.

“Perhaps it is better to invest your resources on somewhat proven datasets and say ‘now I want to become an expert in that’ because a lot of these datasets are rich and can be used in many different ways,” says Peter Hafez, Chief Data Scientist at RavenPack.

He cites one example from a study by Citigroup in which researchers at the bank approached the use of RavenPack news sentiment data in a novel way by focusing on sentiment around CAPEX news stories. The theory behind the approach is that high CAPEX is a sign of rising base costs and thus could warn investors that the stock might experience lower returns in the future.

“Empirical evidence suggests high reported CAPEX firms experience poor future stock returns,” says Hafez.

Sometimes the process can require lateral, or out-of-the-box, thinking.

Zatreanu illustrates this with the example of how a somewhat novel investment metric such as ‘customer lifetime value’ can be used to discover underlying strength or weakness in a stock.

“If I know that topline revenues are fine but then I drill down into it and I notice that the most loyal cohort - the one that has been the most profitable to me as a company - is churning out, and I am replacing it with people who are much more fickle, even if my top line might be fine and I might be getting the quarter right, the long-term health of that company is not there,” says Zatreanu.

The Bottom Line

The cost of a new dataset is a consideration but it is not always easy to determine, and many vendors admit to using arbitrary means to price their data.

The AUM of the buyer is an important consideration, according to Milind Sharma, CEO of hedge fund QuantZ, since a large AM such as BlackRock arguably has the resources to arbitrage out the value of the data on its own, leaving other buyers with little remaining alpha.

Others disagree, however, arguing that even data which has been heavily arbed is still of value to investors because it becomes an ‘underlying factor’ in the market and, in the words of one vendor, “if you do not have that data you are at a disadvantage.”


“Any unique dataset over time goes through a cycle of first being an emergent property and having some reflexivity to the market and being able to generate alpha,” says Eric Weinberg, Executive in Residence, Great Hill Partners, “but then look at datasets like I/B/E/S which have been arbed for the last 20-30 years. I/B/E/S has now become an underlying property of the market - i.e. that earnings revisions expectations are highly correlated to returns - and if you don’t have that data you are at a disadvantage.”

It is also true that some of the bigger, more ‘multidimensional’ datasets, such as RavenPack news sentiment, tend to retain their value more effectively.


“Because the data is so deep and rich, there are many ways of slicing and dicing it, reducing the chances of signals getting easily arbitraged away,” says Armando Gonzalez, CEO and co-founder of RavenPack.

Returning to the problem of pricing, one way to approach it is to compare the cost of the data with the fees of a discretionary money manager who delivers the same returns on a two-and-twenty basis.

“If I am selling you a signal as opposed to you are allocating to a manager, and if we can hypothetically predict that it gives you 20%,” says Sharma, “then if we had a manager doing 20% at two-and-twenty what is that? That’s 600 bps, right? So now we are starting to put some structure in the problem.”

Indeed, he bases this example on his own leveraged buyout portfolio model, which incorporates RavenPack sentiment data to optimize the selection of stocks considered ripe for takeover.
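The 600 bps figure in Sharma's example follows from simple arithmetic: a 2% management fee plus a 20% performance fee on a 20% return. The sketch below reproduces it under simplifying assumptions that are ours rather than Sharma's: the performance fee is charged on the whole gross return, with no hurdle rate or high-water mark, and the function name is purely illustrative.

```python
def manager_fee_bps(gross_return, mgmt_fee=0.02, perf_fee=0.20):
    """Total annual fee in basis points of invested capital.

    Assumes the performance fee applies to the full gross return
    (no hurdle, no high-water mark) and is zero in a losing year.
    """
    total_fee = mgmt_fee + perf_fee * max(gross_return, 0.0)
    return total_fee * 10_000


fee = manager_fee_bps(0.20)  # a manager returning 20% gross
print(f"{fee:.0f} bps")      # prints "600 bps": 200 bps management + 400 bps performance
```

On this reasoning, a signal expected to deliver the same 20% could be benchmarked against roughly 600 bps of the capital deployed on it, which gives buyer and vendor a starting point for negotiation rather than a definitive price.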

Whilst price may be a consideration, it is unlikely to be the driving factor given the increasing demand for alternative datasets from an industry hungry for alpha. With buy-side players now expected to spend 70% more on alternative data in 2020 than a year ago, the advantage in the alt data game looks to be firmly with the vendors, not the buyers.

[Figure: Buy-Side Spend on Alternative Datasets]

Don’t Regress From Statistics

A final consideration is that choosing the right alternative dataset often requires a knowledge of statistics. This is especially true when comparing the performance of two different datasets. Yet, surprisingly, it is a factor many users have overlooked.

Part of the reason is that some alternative datasets are so large that users have thought they can carry out simple ‘counting cars in parking lots’ type analyses without the need for ‘belt and bootstraps’ level confidence.


“There is a saying that those who ignore statistics are doomed to reinvent it,” says Matei Zatreanu. “And that is exactly what we are seeing now; people have not been focusing on stats and all of a sudden they think I need to start thinking about these things in more statistical ways.”

Even a seemingly simple exercise such as comparing two regression lines requires a knowledge of statistics.

“I ran all these models, how do I evaluate that these models are actually producing different results or they are within the same confidence interval as each other?” says Zatreanu.
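One common way to approach Zatreanu's question for two simple regressions is to test whether the difference between the two slopes is large relative to their combined standard error. The sketch below is a minimal illustration of that idea, not a method described by Zatreanu. It assumes two independent samples and uses a normal approximation for the test statistic (a t distribution would be more accurate for small samples); the data are synthetic and the function names are hypothetical.

```python
import math
from statistics import NormalDist, fmean


def slope_and_se(x, y):
    """Ordinary least squares slope of y on x, with its standard error."""
    n = len(x)
    mx, my = fmean(x), fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    se = math.sqrt(rss / (n - 2) / sxx)
    return slope, se


def compare_slopes(x1, y1, x2, y2):
    """Two-sided z-test for equal slopes in two independent regressions."""
    b1, se1 = slope_and_se(x1, y1)
    b2, se2 = slope_and_se(x2, y2)
    z = (b1 - b2) / math.sqrt(se1 ** 2 + se2 ** 2)
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Synthetic data: deterministic alternating "noise" around known slopes.
x = list(range(20))
y_a = [0.5 * xi + 0.3 * (-1) ** xi for xi in x]        # true slope 0.5
y_b = [0.5 * xi + 1.0 - 0.3 * (-1) ** xi for xi in x]  # true slope 0.5
y_c = [0.8 * xi - 0.3 * (-1) ** xi for xi in x]        # true slope 0.8

z_same, p_same = compare_slopes(x, y_a, x, y_b)
z_diff, p_diff = compare_slopes(x, y_a, x, y_c)
print(f"same slope:      z={z_same:.2f}, p={p_same:.3f}")
print(f"different slope: z={z_diff:.2f}, p={p_diff:.3g}")
```

A large p-value means the two fitted lines are statistically indistinguishable at this sample size, while a very small one suggests that the models really are producing different results.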

The conclusion is clear: a certain level of statistical knowledge is invaluable for ascertaining the strengths and weaknesses of different models and of the datasets they are derived from.

Look for Pedigree

Given the huge increase in data types available and the unregulated ‘wild west’ quality of the industry, it sometimes pays to go with an established vendor, the value of whose data has already been proven.

RavenPack is an example of such an established vendor, having been in the business since 2003, at the start of the alternative data boom.

There is also a plethora of research conducted using the data, which evidences its value as a generator of alpha.

“If they are giving us a so-called signal, did they make ridiculous assumptions? That is actually where I think there is a need for some industry standards to evolve,” says Milind Sharma.

Our research section includes studies conducted not just by ourselves but also by third parties and academics.

RavenPack Analytics

Asset managers can easily tackle alternative data with the RavenPack Analytics Platform, which includes sentiment data on over 250,000 individual entities and 6,800+ market-moving events, available on visual dashboards or via web APIs.
