It All Starts with Sources
Many of RavenPack’s clients use our insights for investment decision-making in
fast-paced financial markets, so it is essential for the source data to be both accurate
and up-to-date. News reports must first be ingested and then scanned with minimal lag in
order to provide clients with the critical information edge they need to
generate returns that beat the market.
RavenPack already integrates a wide range of reputable, high-profile news and market data
sources - including Dow Jones Newswires, Factiva and Alliance News - as well as thousands of
lesser-known publications, blogs and social media sites. However, the process of
identifying new sources is always ongoing.
When a potential new source has been identified, it is the job of the Content Management
team to trial the new source’s data and make sure it lives up to its claims and then,
assuming the trial is successful, to prepare and manage the ingestion of the live news
feed.
News feeds differ in the transmission protocols they use to communicate with third
parties, so the Content Management team often has to customize software on the RavenPack
side to accept feeds from new publishers.
“Every source talks its own language,” says Juan Sánchez Gómez,
Head of Content Management at RavenPack. “By which I mean the API or
channel they use to publish the documents. These can be different, some use HTTP, others
FTP, TCP - there are many.”
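To make that idea concrete, a multi-protocol ingestion layer might be organised along the lines of the sketch below, in which each source gets a small adapter that speaks its protocol and hands raw documents to a shared pipeline. The class names, endpoints and login details are assumptions for the example, not RavenPack’s actual implementation.

```python
# Minimal sketch of a per-source ingestion layer (illustrative, not RavenPack's code).
# Each adapter speaks one transmission protocol and yields raw documents
# to a shared downstream pipeline.
from abc import ABC, abstractmethod
from ftplib import FTP
from typing import Iterator
import urllib.request


class FeedAdapter(ABC):
    """Common interface implemented by every source-specific adapter."""

    @abstractmethod
    def fetch(self) -> Iterator[bytes]:
        """Yield raw documents exactly as the provider transmits them."""


class HttpFeedAdapter(FeedAdapter):
    def __init__(self, url: str):
        self.url = url

    def fetch(self) -> Iterator[bytes]:
        # Poll an HTTP endpoint and yield the response body as one raw document.
        with urllib.request.urlopen(self.url) as response:
            yield response.read()


class FtpFeedAdapter(FeedAdapter):
    def __init__(self, host: str, path: str):
        self.host, self.path = host, path

    def fetch(self) -> Iterator[bytes]:
        # Download every file the provider has dropped into its FTP directory.
        with FTP(self.host) as ftp:
            ftp.login()  # anonymous login here; a real feed would use credentials
            for name in ftp.nlst(self.path):
                chunks: list[bytes] = []
                ftp.retrbinary(f"RETR {name}", chunks.append)
                yield b"".join(chunks)
```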
After the content is ingested, the next step is to normalize it - converting it from its
raw format into a RavenPack-friendly format.
“Sources send documents in any format. XML, HTML, JSON. All the file
formats you can imagine. This means we need to become experts in the channels and the
formats. We have to understand them at a level most people don’t realize so we can
process them as if they were our own documents. We can then translate them into a
standard format that we can use internally,” says
Sánchez Gómez.
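As a rough illustration of what translating those formats into a single internal representation could involve, the sketch below maps XML, HTML and JSON payloads onto one minimal record. The field names and parsing rules are invented for the example and do not describe RavenPack’s internal schema.

```python
# Illustrative normalization step: map XML, HTML and JSON payloads onto one
# internal record shape. Field names are invented for the example.
import json
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from html.parser import HTMLParser


@dataclass
class NormalizedDoc:
    source: str
    headline: str
    body: str


class _TextExtractor(HTMLParser):
    """Tiny HTML-to-text helper, sufficient for this sketch only."""

    def __init__(self):
        super().__init__()
        self.chunks: list[str] = []

    def handle_data(self, data: str) -> None:
        self.chunks.append(data)


def normalize(raw: bytes, fmt: str, source: str) -> NormalizedDoc:
    """Convert a raw payload in a known format into the internal record."""
    text = raw.decode("utf-8")
    if fmt == "json":
        payload = json.loads(text)
        return NormalizedDoc(source, payload["title"], payload["body"])
    if fmt == "xml":
        root = ET.fromstring(text)
        return NormalizedDoc(source, root.findtext("headline", ""), root.findtext("body", ""))
    if fmt == "html":
        extractor = _TextExtractor()
        extractor.feed(text)
        chunks = [c.strip() for c in extractor.chunks if c.strip()]
        headline = chunks[0] if chunks else ""
        return NormalizedDoc(source, headline, " ".join(chunks[1:]))
    raise ValueError(f"Unsupported format: {fmt}")
```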
Making the Grade
Before accepting a news feed, Sánchez Gómez and his team trial the data to test whether
it measures up. Everything gets tested, and information latency - the time it takes for
new data to arrive once it has been published - is a key consideration. This stage of the
selection process is known as ‘Validation’.
“Every provider claims that their feed is really fast, that their content is highly
structured, or that they compare very favorably to their competitors, and that’s
part of the sales process. So our role is to test and make data-driven decisions. We go
into these content evaluation iterations and we do a trial. We consume a few years of
the feed, 1, 2 or 5 years, and we also connect in real-time, so as to confirm that what
they said is correct,” says the Head of Content Management.
The process of validation can take weeks or months, after which a decision is made as to
whether or not to integrate a new source into RavenPack’s data ecosystem.
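One way to make those latency claims testable during a trial is simply to compare each item’s provider-supplied publication timestamp with the moment it actually arrived, roughly as in the sketch below. The function names, fields and sample figures are assumptions for illustration, not a description of RavenPack’s validation tooling.

```python
# Illustrative latency check for a feed trial: compare the provider's
# publication timestamp with the time each item actually arrived.
from datetime import datetime, timedelta, timezone
from statistics import median


def latency_seconds(published_at: datetime, received_at: datetime) -> float:
    """Seconds between the provider's publication time and our receipt time."""
    return (received_at - published_at).total_seconds()


def summarize_trial(samples: list[tuple[datetime, datetime]]) -> dict:
    """Boil a trial's (published_at, received_at) pairs down to a few headline numbers."""
    latencies = sorted(latency_seconds(p, r) for p, r in samples)
    return {
        "count": len(latencies),
        "median_s": median(latencies),
        "p95_s": latencies[int(0.95 * (len(latencies) - 1))],
        "worst_s": latencies[-1],
    }


if __name__ == "__main__":
    base = datetime(2024, 1, 1, tzinfo=timezone.utc)
    # Fake trial data: items that arrived 2, 5 and 30 seconds after publication.
    fake = [(base, base + timedelta(seconds=s)) for s in (2, 5, 30)]
    print(summarize_trial(fake))
```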
Post-Validation
If a new source is given the green light, the process of onboarding a longer-term archive
of the source’s content begins - this can often involve downloading content
stretching as far back as the year 2000.
Since online news publications often change both transmission protocols and content
formatting over time, further technical adaptations are often required at this stage so
that the RavenPack system can ingest legacy files. “Turning multiple generations
of formats and specs into a single, normalized representation that is easier to query is
part of the added value of RavenPack as an analytics provider,” says Sánchez Gómez.
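In practice that can mean detecting which generation of a provider’s spec a given archive file follows and routing it to the right parser before normalization. The sketch below shows one simple way that could be done; the version labels, detection rules and parsers are purely illustrative.

```python
# Illustrative dispatch over several generations of one provider's format.
# Version labels, detection rules and parsers are invented for the example.
import json
import xml.etree.ElementTree as ET


def detect_generation(raw: bytes) -> str:
    """Guess which historical spec a legacy archive file follows."""
    text = raw.lstrip()
    if text.startswith(b"{"):
        return "v3-json"   # most recent spec: JSON payloads
    if b"<newsMessage" in text:
        return "v2-xml"    # middle generation: namespaced XML
    return "v1-xml"        # earliest archives: flat XML


PARSERS = {
    "v3-json": json.loads,
    "v2-xml": ET.fromstring,
    "v1-xml": ET.fromstring,
}


def parse_legacy(raw: bytes):
    """Route a legacy file to the parser for its generation, ready for normalization."""
    return PARSERS[detect_generation(raw)](raw)
```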
Many content providers are also aggregators of third-party content from other sources, and
part of the process of trialing and onboarding content involves distinguishing and
categorizing data according to origination.
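A simple way to picture that distinction is to tag each document with both the provider it arrived through and the outlet that actually produced it, as in the sketch below. The metadata key used here is an assumption for the example.

```python
# Illustrative tagging of documents by origination when the provider is an
# aggregator. The metadata key used here is invented for the example.
from dataclasses import dataclass


@dataclass
class SourcedDoc:
    provider: str     # the feed the document arrived through
    originator: str   # the outlet that actually produced the content
    headline: str


def tag_origination(provider: str, metadata: dict, headline: str) -> SourcedDoc:
    # Aggregated items typically name the original outlet in their metadata;
    # fall back to the provider itself for first-party content.
    originator = metadata.get("original_source", provider)
    return SourcedDoc(provider=provider, originator=originator, headline=headline)
```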
Coping with Revisions
It is very common for news services to update an existing article as new information
affecting the story comes to light. How does the RavenPack system differentiate between the
different versions?
“Our approach to working with revisions is event-based. For us these are two
different events. One is the first event with the original document that was published
which generates one normalized document, which makes the market react in a certain way,
and maybe 10 minutes later you get an update. This is a different event, that may
include new paragraphs, a different heading, whatever they change; but this is a
different event that may make the market react in a different way. So for us they are
like two different documents,” says Juan Sánchez Gómez.
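Modelled naively, that event-based treatment might look like the sketch below: every version of a story becomes its own normalized event, linked to the original only by a shared story identifier. The field names and identifiers are assumptions for the example, not RavenPack’s data model.

```python
# Illustrative event-based handling of revisions: every version of a story
# becomes its own event, linked by a shared story identifier.
# Field names and identifiers are invented for the example.
from dataclasses import dataclass
from datetime import datetime, timezone
from itertools import count

_event_ids = count(1)


@dataclass(frozen=True)
class NewsEvent:
    event_id: int      # unique per version, so revisions never overwrite anything
    story_id: str      # shared by the original and all of its updates
    revision: int      # 0 for the original, 1, 2, ... for updates
    headline: str
    received_at: datetime


def record_version(story_id: str, revision: int, headline: str) -> NewsEvent:
    """Emit a new event for this version instead of mutating the earlier one."""
    return NewsEvent(
        event_id=next(_event_ids),
        story_id=story_id,
        revision=revision,
        headline=headline,
        received_at=datetime.now(timezone.utc),
    )


# The original article and an update ten minutes later are two separate events:
original = record_version("DJ-123", 0, "Company X reports record earnings")
update = record_version("DJ-123", 1, "Company X reports record earnings, shares jump")
```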
Maintaining the Feed
RavenPack’s system boasts a track record of 99.98% uptime. This is achieved by operating
two separate data centres, each running the entire platform workflow in parallel, so that
if one goes down the other can step in and take over.
“We have a very good system, we have two locations and two data centres. One is in
Virginia and the other is in Ireland. They are both very stable but this allows the
consumers to switch between them, so if consumers in Virginia have problems connecting
to our services they can switch to consuming content in Ireland, or the other way
around,” says Sánchez Gómez.
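From a consumer’s point of view, switching between the two locations can be as simple as trying one endpoint and falling back to the other, roughly as sketched below. The hostnames are placeholders, not RavenPack’s real service URLs, and the retry logic is deliberately simplified.

```python
# Illustrative client-side failover between two regional endpoints.
# The hostnames are placeholders, not RavenPack's real service URLs.
import urllib.error
import urllib.request

ENDPOINTS = [
    "https://feed-us-east.example.com/latest",   # e.g. the Virginia data centre
    "https://feed-eu-west.example.com/latest",   # e.g. the Ireland data centre
]


def fetch_with_failover(endpoints: list[str], timeout: float = 5.0) -> bytes:
    """Try each endpoint in order and return the first successful response."""
    last_error: Exception | None = None
    for url in endpoints:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as response:
                return response.read()
        except (urllib.error.URLError, TimeoutError) as error:
            last_error = error   # this location is unreachable; try the other one
    raise ConnectionError("All endpoints are unavailable") from last_error
```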
The Next Generation
RavenPack constantly improves its technology to process data more quickly, efficiently
and cheaply.
For instance, the latest infrastructure can ingest and process more news - over 300 million
records a month in total - compared to over 65 million previously. As the technology keeps evolving,
so does the ability of our clients to garner new and unique insights from the data we provide.