Annotated Archives

Train Powerful Language Models, Faster, and More Reliably with the World's Largest Annotated Topic Database

Request information

Built for Scale

Used to train our own language models, the RavenPack Training Data Archives are designed to address the challenges of working with unstructured text in the business, finance, social, and legal sectors, from historical archives to real-time feeds. Pre-annotated by hundreds of thousands of rules, they give you the highest-quality training data.

We spent over a decade crafting best-in-class text training archives for building NLP models, so you don't have to.

Why it works

Streamline your machine learning workflows and shorten your time to market with training data infrastructure proven in production.

How our Archives are Prepared

The RavenPack Training Data Archives are produced by our proven natural language processing infrastructure that has processed terabytes of unstructured data over 15 years.

Content Ingestion & Schema Normalization

From existing web and filings feeds, or from your own content in over 1,000 formats, RavenPack turns text into a unified schema ready for NLP tasks, with a single representation, bounding boxes, and story, paragraph, and sentence coordinates.

Entity Extraction, Co-referencing & Knowledge Graph Detection

RavenPack identifies, co-references, and tags entities and concepts from our constantly improving RavenPack Knowledge Graph of 12 million entities, including companies, places, and people.

Extra-Large Scale Topic Classification & Relationship Extraction

Using millions of pre-curated semantic templates, RavenPack identifies 7,400 topics and how entities relate to them. The taxonomy covers business, legal, society, ESG, and more.

Sentiment, Relevance & Other Analytics

Proprietary algorithms then score each detected sentence, entity, and topic for Relevance, Sentiment, Novelty, and Similarity, enabling powerful deduplication analytics.

Knowledge Graph & Data Feed Output

The knowledge graph and archives are generated historically, version controlled, and deployed via APIs, with daily and even real-time updates available.
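The pipeline above produces per-sentence records carrying entity tags, topic labels, and analytics scores. As a minimal sketch, assuming a hypothetical record layout (the field names below are illustrative, not RavenPack's actual schema), the output of the analytics step and the deduplication it enables might look like this:

```python
from dataclasses import dataclass, field

@dataclass
class AnnotatedSentence:
    """One annotated sentence record (hypothetical field names)."""
    text: str
    story_id: str                 # document-level identifier
    paragraph_idx: int            # paragraph position within the story
    sentence_idx: int             # sentence position within the paragraph
    entities: list = field(default_factory=list)  # knowledge-graph entity IDs
    topics: list = field(default_factory=list)    # taxonomy topic labels
    relevance: float = 0.0        # how central the tagged entities are
    sentiment: float = 0.0        # polarity score for the sentence
    novelty: float = 0.0          # low novelty flags near-duplicate content

def dedupe(records, min_novelty=0.2):
    """Keep only sufficiently novel sentences, dropping near-duplicates."""
    return [r for r in records if r.novelty >= min_novelty]
```

In practice a novelty-style score lets a training pipeline discard repeated wire stories before tokenization, rather than relying on expensive pairwise text comparison downstream.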

Billions of sentences

Choose sentences containing specific concepts, drawn from 12 million named entities, across the archives available, or work with us to turn premium content you subscribe to, and even internal content, into training data:
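Selecting sentences by concept amounts to filtering annotated records on their entity or topic tags. The sketch below assumes a hypothetical JSONL delivery format with `text`, `entities`, and `topics` fields; the actual archive layout may differ:

```python
import json

def select_sentences(lines, entity_ids=(), topics=()):
    """Yield sentence texts whose annotations match any requested
    entity ID or topic label (illustrative record layout)."""
    want_entities, want_topics = set(entity_ids), set(topics)
    for line in lines:
        rec = json.loads(line)
        if (want_entities & set(rec.get("entities", []))
                or want_topics & set(rec.get("topics", []))):
            yield rec["text"]
```

A pass like this over pre-annotated sentences replaces the usual keyword-matching stage of corpus construction, since entity co-referencing and topic tagging have already been done upstream.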

Web and Social Archive

40,000 sources with content processed over 15 years

History     3 years        5 years        16+ years
Documents   388 Million    605 Million    1.2 Billion
Sentences   10 Billion     15 Billion     30 Billion
Tokens      ~110 Billion   ~165 Billion   ~330 Billion

SEC Filings Archive

Tap a structured archive of annotated SEC filings

History     3 years       5 years       23+ years
Documents   2.1 Million   3.4 Million   14.1 Million
Sentences   1.4 Billion   2.4 Billion   9 Billion
Tokens      ~16 Billion   ~27 Billion   ~100 Billion
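A quick arithmetic check on the tables above: in both archives the token counts work out to roughly 11 tokens per sentence, a useful rule of thumb when budgeting training compute against sentence counts.

```python
# Figures (in billions) taken directly from the two tables above.
web = {"sentences_b": [10, 15, 30], "tokens_b": [110, 165, 330]}
sec = {"sentences_b": [1.4, 2.4, 9], "tokens_b": [16, 27, 100]}

def tokens_per_sentence(archive):
    return [t / s for t, s in zip(archive["tokens_b"], archive["sentences_b"])]
```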

Request more Information