Get on the Fast Track

Focus on machine learning, not data preparation

The problem

Training data preparation is the longest part of an ML project

Machine learning model outcomes are only as good as the data that trained them, but traditional approaches to generating textual training data are inconsistent, time-consuming, and costly.

For optimal results, training data should be:

  • Diverse, drawing on a variety of curated, domain-focused sources
  • Context- and topic-aware, to improve the results of the model
  • Thoroughly annotated, articulating relevant relationships and entities
  • Consistent in quality at scale, to produce the largest training datasets
  • Delivered promptly, to stay ahead of the competition and iterate model training seamlessly

Comparing traditional approaches to training data generation

In-house projects provide great control over data quality, but at the expense of focus, as Machine Learning Engineers shift their attention away from modeling to data preparation. Any additional iteration to update training data compounds the problem.

Outsourced projects require detailed specifications to ensure that data preparation, often offshored, meets the exact needs of the model. Multiple rounds of revisions increase costs and delays.

Crowdsourced projects lower production costs but increase the quality-analysis effort needed to control and harmonize the output, which can further delay data availability.

[Chart: in-house, outsourced, and crowdsourced projects compared on quality, cost, and duration, alongside the optimal solution]

RavenPack Text Analytics delivers a compelling alternative: consistently high quality thanks to a world-class NLP infrastructure, immediate availability with billions of sentences already processed, and high cost-effectiveness with continuous updates to the training archives and knowledge graph.

RavenPack Text Analytics

What’s included

RavenPack Text Analytics is specifically designed to address the challenges of working with unstructured textual content in the business, finance, social, and legal sectors — from historical archives to real-time feeds.


A Schema built for Language Models

  • Quickly and easily create the right corpus of training data to feed into downstream NLP tasks.
  • RavenPack’s normalized JSON schemas support billions of documents across thousands of sources.
  • Story, paragraph & sentence coordinates are provided as text-based bounding boxes for all detections.
  • Co-reference capabilities; entity-, topic-, document- & sentence-level sentiment indicators; and novelty & relevance analytics for powerful deduplication.
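The novelty and relevance analytics lend themselves to straightforward corpus deduplication. A minimal sketch in Python, assuming records shaped like the sample in this section (the example records and the 90-point relevance threshold are our own illustration):

```python
# Assemble a deduplicated training corpus from annotated sentence records.
# Keeps only records above a relevance threshold, and only the first
# record seen for each event_similarity_key (near-duplicates share a key).
def build_corpus(records, min_relevance=90):
    seen_keys = set()
    corpus = []
    for rec in records:
        if rec.get("event_relevance", 0) < min_relevance:
            continue  # drop low-relevance detections
        key = rec.get("event_similarity_key")
        if key in seen_keys:
            continue  # near-duplicate of an event already kept
        seen_keys.add(key)
        corpus.append(rec["text"])
    return corpus

# Hypothetical records for illustration only.
records = [
    {"text": "Microsoft invests in OpenAI.",      "event_similarity_key": "A1", "event_relevance": 100},
    {"text": "Microsoft to invest in OpenAI.",    "event_similarity_key": "A1", "event_relevance": 95},
    {"text": "Weather was mentioned in passing.", "event_similarity_key": "B2", "event_relevance": 20},
]
print(build_corpus(records))  # only the first Microsoft/OpenAI sentence survives
```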
Retrieve the text, and analytics like sentiment, for each sentence:

    "text": "Microsoft invests \"billions\" in OpenAI, owner of ChatGPT",
    "sentence_sentiment": 0.21,
    "sentence_sentiment_confidence": 0.68

Retrieve the named entities identified, with their coordinates in the sentence:

    "entities": [
      { "entity": "Microsoft Corp.", "coordinates": [[0, 9]],   "rp_entity_id": "228D42" },
      { "entity": "OpenAI",          "coordinates": [[32, 38]], "rp_entity_id": "UWCVKI" },
      { "entity": "ChatGPT",         "coordinates": [[49, 56]], "rp_entity_id": "6C78CE" }
    ]

Retrieve events detected in the sentence:

    "event": {
      "topic": "business",
      "group": "acquisitions-mergers",
      "type": "stake",
      "event_similarity_key": "75F5CC5…15FDCD9B35",
      "event_similarity_days": 0.000040,
      "event_relevance": 100,
      "coordinates": [[0, 37]],
      "paragraph_index": 0,
      "sentence_count": 1,
      "sentence_index": 0,
      "event_sentiment": 0.09
    }
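Because the coordinates are plain character offsets into the sentence text, recovering each entity mention is a simple string slice. A minimal sketch using the sample sentence above (the record layout is condensed for illustration):

```python
# Character-offset "bounding boxes": slice each entity mention out of
# the sentence text using its [start, end) coordinate pair.
sentence = 'Microsoft invests "billions" in OpenAI, owner of ChatGPT'
entities = [
    {"rp_entity_id": "228D42", "coordinates": [[0, 9]]},
    {"rp_entity_id": "UWCVKI", "coordinates": [[32, 38]]},
    {"rp_entity_id": "6C78CE", "coordinates": [[49, 56]]},
]

# Map each entity id to the literal text it covers in this sentence.
mentions = {
    e["rp_entity_id"]: sentence[start:end]
    for e in entities
    for start, end in e["coordinates"]
}
print(mentions)  # {'228D42': 'Microsoft', 'UWCVKI': 'OpenAI', '6C78CE': 'ChatGPT'}
```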

Easy access to massive archives and real-time feeds

  • RavenPack works with clients on historical archives of web, social, news, filings, and internal content in over 1,000 file formats.
  • The infrastructure processes, annotates, and normalizes documents into a consistent Point-In-Time ready archive, and supports real-time feeds.
  • 400 billion tokens available from curated public sources, and 200 billion more with premium subscriptions.
[Chart: archive depth by source — Web and Social Data, up to 330 billion tokens (from 3 to 23+ years of history); SEC Filings Data, up to 100 billion tokens]

Prebuilt taxonomies and reference data

  • RavenPack brings 20 years of experience modeling financial and business language.
  • We deliver out-of-the-box support for pretraining your large language models:
    • Billions of sentences covering important financial events, ESG controversies, general business, financial, economic, and legal concepts
    • Point-In-Time-compatible knowledge graph that covers people, companies, and places.
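Point-In-Time compatibility means each identifier carries a validity window, so lookups are made as of a date. A minimal sketch, using the Microsoft listing dates shown in this section (the record layout and `as_of` helper are our own illustration):

```python
from datetime import date

# Point-in-time identifiers: each is valid over a [since, till) window;
# till=None means the identifier is still active.
identifiers = [
    {"type": "TCKR", "value": "MSFT",     "since": date(1986, 3, 13), "till": None},
    {"type": "LSTG", "value": "XAMS:MSF", "since": date(2000, 9, 26), "till": date(2017, 10, 4)},
]

def as_of(identifiers, id_type, when):
    """Return the identifier of `id_type` valid on date `when`, or None."""
    for ident in identifiers:
        if ident["type"] != id_type:
            continue
        if ident["since"] <= when and (ident["till"] is None or when < ident["till"]):
            return ident["value"]
    return None

print(as_of(identifiers, "LSTG", date(2010, 1, 1)))  # XAMS:MSF
print(as_of(identifiers, "LSTG", date(2020, 1, 1)))  # None (delisted in 2017)
```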
Example: the knowledge graph entries linked to Microsoft Corp. (rp_entity_id 228D42):

  • 7.1 million COMPANIES — Microsoft Corp., covered since 25-06-1981, with aliases such as Microsoft Inc. (since 25-06-1981)
  • 25 point-in-time identifiers and 311 more bond securities for Microsoft Corp., including 19 active securities — e.g. ISIN US5949181045; ticker MSFT since 13-03-1986; listing XAMS:MSF from 26-09-2000 till 04-10-2017
  • 409 Microsoft units and subsidiaries, including their own subsidiaries and beyond — e.g. Lumenisity Ltd. (5FSTNC) since 09-12-2022; Skype S.A.R.L (BA72R9) since 11-05-2011; Skype Technologies SA (SDQD92) since 14-10-2005; Skype Communications SARL (3LGEFL)
  • 60,000 PRODUCTS — 388 products and 63 product types for Microsoft (out of 844 product types), e.g. Lumia 532 Dual SIM (mobile phone, since 14-01-2015); Office X for Mac (productivity software, since 24-10-2001); Surface Pro 7 (tablet computer, since 10-02-2019); Microsoft 365 Copilot (AI platform, since 16-03-2023)
  • 4.5 million PERSONS — 2,879 executives for Microsoft, including Bill Gates, Judson Althoff, Chris C. Capossela, Kathleen T. Jogan, Amy E. Hood, and Venkata Satya Nadella
  • 5,000 POSITIONS — positions held by each executive, e.g. for Bill Gates: Co-Founder since 01-01-1981; Advisor since 04-02-2004; Chief Software Architect from 01-01-2000 till 01-06-2006; Chief Executive Officer from 01-01-1981 till 01-01-2000; Chairman from 01-01-1981 till 04-02-2014
RavenPack Training Data in action

See for Yourself

As a simple test, RavenPack pretrained RoBERTa with 500,000 sentences from our training data, sampled in multiple ways, then asked each model to complete the following sentence:

High <mask> is going to affect us.

Here are the words that each trained model suggested for the mask:

Base RoBERTa

The baseline model uses a generalist training corpus. As a result, it produces an unfocused suggestion list:

Suggestion Probability
Stress 13%
Water 12%
Heat 5.9%
Lighting 4.3%
Pressure 3.8%

RoBERTa pretrained with 500,000 randomly sampled sentences

With domain-specific training sentences, the fill-mask suggestions show a stronger prevalence of business and finance words.

Suggestion Probability
Pressure 11%
Demand 9.2%
Inflation 6.2%
Water 4.6%
Fever 3.5%

RoBERTa pretrained with 500,000 sentences with business concepts

When training sentences contain targeted business concepts, fill-mask suggestions become more accurate and relevant.

Suggestion Probability
Inflation 18%
Unemployment 17%
Demand 4.4%
Prices 4%
Pricing 3.4%
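One way to see the shift is to compare how much of each model's top-5 probability mass lands on business and economics terms. A small sketch using the three suggestion tables above (the choice of which terms count as "business" is our own, illustrative classification):

```python
# Top-5 fill-mask suggestions (%) for each model, taken from the tables above.
suggestions = {
    "base":            {"stress": 13.0, "water": 12.0, "heat": 5.9, "lighting": 4.3, "pressure": 3.8},
    "random_sample":   {"pressure": 11.0, "demand": 9.2, "inflation": 6.2, "water": 4.6, "fever": 3.5},
    "business_sample": {"inflation": 18.0, "unemployment": 17.0, "demand": 4.4, "prices": 4.0, "pricing": 3.4},
}

# Illustrative classification; "pressure" is ambiguous, so it is excluded.
business_terms = {"demand", "inflation", "unemployment", "prices", "pricing"}

for model, probs in suggestions.items():
    mass = sum(p for term, p in probs.items() if term in business_terms)
    print(f"{model}: {mass:.1f}% of top-5 mass on business terms")
```

The probability mass on business terms rises from 0% for the base model, to about 15% with randomly sampled domain sentences, to roughly 47% with business-concept sentences.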

Stay in the loop

Explore the significance of training data and its pivotal role in advancing Large Language Models with these articles by RavenPack.

Request more Information