Exploring Content Acquisition

How RavenPack captures and aggregates millions of documents each day

Gathering quality news is the first stage in the process of turning unstructured text into actionable insights.

Juan Sánchez Gómez, RavenPack’s Head of Content Management, gives us a guided tour of that critical step.

RavenPack uses sophisticated natural language processing (NLP) to scan and make sense of unstructured documents. Think of the millions of articles published every day across news sites, social media, and other alternative data sources, and now imagine processing all of them so that users can derive valuable insights.

The platform relies on ingesting high quality news sources in order to generate its news analytics. Let’s explore the very first step in our process: how these outside sources are selected and integrated into the RavenPack platform before they are even processed by the NLP engine.

It All Starts with Sources

Many of RavenPack’s clients use our insights for investment decision-making in fast-paced financial markets, so it is essential that the source data be both accurate and up to date. News reports must be ingested and scanned with minimal lag in order to give clients the critical information edge they need to generate market-beating returns.

RavenPack already integrates a wide range of reputable, high-profile news and market data sources, including Dow Jones Newswires, Factiva, and Alliance News, as well as thousands of lesser-known publications, blogs, and social media sites. However, the process of identifying new sources is always ongoing.

When a potential new source has been identified, it is the job of the Content Management team to trial the source’s data and make sure it lives up to its claims. If the trial is successful, the team then prepares and manages the ingestion of the live news feed.

News feeds differ in the transmission protocols they use to communicate with third parties, so the Content Management team often has to customize software on the RavenPack side to accept feeds from new publishers.

“Every source talks its own language,” says Juan Sánchez Gómez, Head of Content Management at RavenPack. “By which I mean the API or channel they use to publish the documents. These vary: some use HTTP, others FTP or TCP - there are many.”
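
Each feed handler has to speak the publisher’s protocol while presenting a single interface to the rest of the pipeline. The sketch below illustrates one way to structure this; the class and method names are hypothetical, not RavenPack’s actual code.

```python
from abc import ABC, abstractmethod
from io import BytesIO
from typing import Iterator


class FeedAdapter(ABC):
    """Common interface that hides each publisher's transport protocol."""

    @abstractmethod
    def fetch_raw_documents(self) -> Iterator[bytes]:
        """Yield raw documents exactly as the publisher transmitted them."""


class HttpFeedAdapter(FeedAdapter):
    """Polls a publisher's HTTP endpoint for new documents."""

    def __init__(self, endpoint: str):
        self.endpoint = endpoint

    def fetch_raw_documents(self) -> Iterator[bytes]:
        import urllib.request

        with urllib.request.urlopen(self.endpoint) as response:
            yield response.read()


class FtpFeedAdapter(FeedAdapter):
    """Downloads files from a publisher's FTP server."""

    def __init__(self, host: str, path: str):
        self.host = host
        self.path = path

    def fetch_raw_documents(self) -> Iterator[bytes]:
        from ftplib import FTP

        with FTP(self.host) as ftp:
            ftp.login()  # anonymous login; a real feed would use credentials
            buffer = BytesIO()
            ftp.retrbinary(f"RETR {self.path}", buffer.write)
            yield buffer.getvalue()
```

Downstream code can then iterate over fetch_raw_documents() without caring whether a given publisher speaks HTTP, FTP, or something else.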

After the content is ingested, the next step is to normalize it: to convert it from its raw format into a RavenPack-friendly format.

“Sources send documents in any format - XML, HTML, JSON, all the file formats you can imagine. This means we need to become experts in both the channels and the formats. We have to understand them at a deeper level than most people realize, so that we can process them as if they were our own documents. We can then translate them into a standard format that we can use internally,” says Sánchez Gómez.
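
One way to picture normalization is as a set of per-format parsers that all emit the same internal record. The following is a minimal sketch under assumed field names; it is not RavenPack’s internal schema.

```python
import json
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class NormalizedDocument:
    """The single internal shape every incoming document is converted into."""
    headline: str
    body: str
    source: str


def normalize_json(raw: bytes, source: str) -> NormalizedDocument:
    # Field names like "title" and "text" vary by provider; these are
    # illustrative defaults, not any real provider's schema.
    record = json.loads(raw)
    return NormalizedDocument(record["title"], record["text"], source)


def normalize_xml(raw: bytes, source: str) -> NormalizedDocument:
    root = ET.fromstring(raw)
    return NormalizedDocument(
        root.findtext("headline", default=""),
        root.findtext("body", default=""),
        source,
    )


doc = normalize_json(b'{"title": "Markets rally", "text": "..."}', "example-wire")
print(doc.headline)  # Markets rally
```

Whatever a source sends, everything after this step sees only NormalizedDocument records.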

Making the Grade

Before accepting a news feed, Sánchez Gómez and his team trial the data to test whether it measures up; this stage of the selection process is known as ‘validation’. Everything gets tested, and information latency - the time it takes for newly published data to arrive - is a key consideration.

“Every provider claims that their feed is really fast, that their content is highly structured, or that they compare very favorably to their competitors - that’s part of the sales process. So our role is to test and make data-driven decisions. We go into these content evaluation iterations and we do a trial. We consume a few years of the feed - one, two, or five years - and we also connect in real time to confirm that what they said is correct,” says the Head of Content Management.
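
Information latency can be measured as the gap between a provider’s own publication timestamp and the moment the document arrives. A minimal sketch, assuming the provider stamps documents with an ISO-8601 timestamp:

```python
from datetime import datetime, timezone


def information_latency(published_at: str, received_at: datetime) -> float:
    """Seconds between a provider's publication timestamp and our receipt.

    `published_at` is assumed to come from the document's own metadata;
    `received_at` is stamped on arrival.
    """
    published = datetime.fromisoformat(published_at)
    return (received_at - published).total_seconds()


# Example: a document published at 14:30:00 UTC and received 1.8 s later.
latency = information_latency(
    "2024-01-15T14:30:00+00:00",
    datetime(2024, 1, 15, 14, 30, 1, 800000, tzinfo=timezone.utc),
)
print(f"latency: {latency:.1f} s")  # latency: 1.8 s
```

Aggregated over a trial period, measurements like this show whether a provider’s speed claims hold up.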

The process of validation can take weeks or months, after which a decision is made as to whether or not to integrate a new source into RavenPack’s data ecosystem.


If a new source is given the green light, the process of onboarding a longer-term archive of the source’s content begins - this often involves downloading content stretching as far back as the year 2000.

Since online news publications often change both transmission protocols and content formatting over time, further technical adaptations are often required at this stage so that the RavenPack system can ingest legacy files. “Turning multiple generations of formats and specs into a single, normalized representation that is easier to query is part of the added value of RavenPack as an analytics provider,” says Sánchez Gómez.

Many content providers are also aggregators of third-party content, so part of trialing and onboarding a source involves distinguishing and categorizing documents according to their origination.
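
One simple way to keep track of origination is to record both the delivering feed and the original outlet on every document. A minimal illustrative sketch; the names are made up:

```python
from dataclasses import dataclass


@dataclass
class SourceAttribution:
    """Distinguishes who delivered a document from who originally wrote it."""
    provider: str    # the feed we ingested it from (possibly an aggregator)
    originator: str  # the outlet that originally published the story


# An aggregator relaying a story from a local newspaper:
attribution = SourceAttribution(provider="example-aggregator",
                                originator="example-local-daily")
```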

Coping with Revisions

It is very common for news services to update an existing article as new information affecting the story comes to light. How does the RavenPack system differentiate between the versions?

“Our approach to working with revisions is event-based. For us these are two different events. The first is the original document that was published, which generates one normalized document and makes the market react in a certain way. Maybe 10 minutes later you get an update - new paragraphs, a different heading, whatever they change - but this is a different event that may make the market react in a different way. So for us they are two different documents,” says Sánchez Gómez.
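
In an event-based model like the one described, each revision becomes a new immutable record that points back to the version it updates, and each is analyzed on its own. The sketch below is illustrative; the field names are assumptions rather than RavenPack’s schema.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class NewsEvent:
    """One immutable event: an original publication or a revision."""
    event_id: str
    story_id: str                  # stable identifier shared by all versions
    headline: str
    published_at: datetime
    revises: Optional[str] = None  # event_id of the version this updates


original = NewsEvent("evt-1", "story-42", "Company X beats estimates",
                     datetime(2024, 1, 15, 14, 30))

# Ten minutes later the wire revises the story; this is a *new* event,
# analyzed independently because the market may react to it differently.
update = NewsEvent("evt-2", "story-42",
                   "Company X beats estimates, raises guidance",
                   datetime(2024, 1, 15, 14, 40), revises="evt-1")
```

The shared story_id still ties the versions together, so users can follow how a story evolved without collapsing the two market-moving events into one.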

Maintaining the Feed

RavenPack’s system boasts a track record of 99.98% uptime. This is achieved by operating two separate data centres that run the platform’s entire workflow in parallel, so that if one goes down the other can step in and take over.

“We have a very good system with two locations and two data centres - one in Virginia and the other in Ireland. Both are very stable, but this setup allows consumers to switch between them: if consumers have problems connecting to our services in Virginia, they can switch to consuming content from Ireland, or the other way around,” says Sánchez Gómez.
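
On the consumer side, failing over between two regions can be as simple as trying each endpoint in turn. A minimal sketch with placeholder URLs, not RavenPack’s actual service addresses:

```python
import urllib.error
import urllib.request

# Placeholder endpoints for the two regions; not real RavenPack URLs.
ENDPOINTS = [
    "https://feed.us-east.example.com/latest",
    "https://feed.eu-west.example.com/latest",
]


def fetch_with_failover(endpoints: list[str], timeout: float = 5.0) -> bytes:
    """Try each regional endpoint in turn, returning the first response."""
    last_error: Exception | None = None
    for url in endpoints:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as response:
                return response.read()
        except (urllib.error.URLError, TimeoutError) as error:
            last_error = error  # region unreachable; try the next one
    raise RuntimeError(f"all regions unavailable: {last_error}")
```

Because both data centres run the full workflow in parallel, either endpoint serves the same content, so a switch is transparent to the consumer.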

The Next Generation

RavenPack constantly improves its technology to process data faster, more efficiently, and at lower cost.

For instance, the latest infrastructure can ingest and process over 300 million records a month, compared with over 65 million previously. As the technology keeps evolving, so does the ability of our clients to garner new and unique insights from the data we provide.