Entities Detection | Technology

What happens when new companies, like Stripe, or currencies, like Bitcoin, start appearing in the news? To keep up with the pace of the market, we wanted to create a tool using a Deep Learning approach to assist our teams in detecting new entities that are not currently present in our database so they can be considered for inclusion.

The differences between off-the-shelf NER and NER at RavenPack

HuggingFace offers models pre-trained for the NER task, based on various architectures. They perform well without further tuning, as you can see in this example:

Aluminum prices have declined in recent months on concerns about the eurozone crisis and its implications for demand, with the London Metal Exchange (LME) three-month aluminum price down to $ 2,100 a tonne from its peak of $ 2,800 in May.

The models are trained using the English version of the standard CoNLL-2003 Named Entity Recognition dataset. They are capable of recognizing four types of entities:

LOC: location.
PER: person.
ORG: organizations.
MISC: miscellaneous.

However, for us, this is not enough. You can see the difference between what the off-the-shelf models were able to identify and what our current system identifies:

At RavenPack we have a rich corpus of data and use 16 different entity types that matter for understanding text from a financial standpoint. As the non-exhaustive list below shows, some are similar to those of the pre-trained models, like PEOP and PER, while others are finer-grained: we distinguish between COMP for companies and ORGA for organizations.

PEOP: people.
ORGA: organizations like colleges, NGOs, etc.
COMP: companies.
PROD: products.

For these reasons, we are going to leverage the capabilities of HuggingFace and pre-trained models, and fine-tune them for a NER task using a custom dataset.

Preprocessing our datasets for a NER task

For privacy reasons, we cannot share the dataset and preprocessing functions. Nevertheless, what fine-tuning a model for NER requires is splitting our sentences into tokens, each with its corresponding entity label. Let’s use the sentence “Ping An, for instance, has about 20% market share in healthcare insurance.” as an example.

In our systems, we have the following entities:

Ping An: COMP (company)
Healthcare insurance: SECT (sector)

We will split the sentence on whitespace and BIO-annotate our entities.

The O stands for tokens that are outside any entity, and the ‘B’ or ‘I’ prefix on the entity type is the BIO annotation that marks the beginning or the middle/end of an entity in our text. This transforms our 16 entity types into 33 different classes (a B and an I tag for each entity type, plus O). This turns the NER task into a multiclass token classification problem.
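The class count can be checked with a short sketch. The helper name build_label_list is hypothetical, and only the five entity types named in this article are used, since the full list of 16 is not published here.

```python
def build_label_list(entity_types):
    """Return the token-level classes: "O" plus a B_ and an I_ tag per entity type."""
    labels = ["O"]
    for entity_type in entity_types:
        labels.append(f"B_{entity_type}")
        labels.append(f"I_{entity_type}")
    return labels


# Assumption: only the types mentioned in the article, not the real list of 16.
known_types = ["PEOP", "ORGA", "COMP", "PROD", "SECT"]
labels = build_label_list(known_types)

# Integer ids are what a token classification model is trained on.
tag2id = {tag: index for index, tag in enumerate(labels)}
id2tag = {index: tag for tag, index in tag2id.items()}

print(len(labels))  # 2 * 5 + 1 = 11 classes for these five types
```

With n entity types the scheme always yields 2n + 1 classes, which gives the 33 classes above for n = 16.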

Our preprocessing function transforms sentence by sentence from a corpus of NUMBER OF STORIES, creating two lists:

texts: a list of lists, each sublist containing the tokens of a sentence. [['Ping','An','for',...],]
tags: a list of lists, each sublist containing the tags of the sentence. [['B_COMP','I_COMP','O',...],]
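Since the real preprocessing functions are not shared, the following is only a minimal sketch of how the two lists could be produced for the example sentence. The function bio_annotate, its entity dictionary input, and the stripping of trailing punctuation are all assumptions made for illustration.

```python
def bio_annotate(sentence, entities):
    """Split a sentence on whitespace and BIO-tag the given entities.

    `entities` maps an entity's surface text to its type, e.g. {"Ping An": "COMP"}.
    Matching is case-insensitive and exact on whole tokens (a simplification).
    """
    # Assumption: trailing punctuation is dropped so that "An," becomes "An".
    tokens = [tok.rstrip(",.;:!?") for tok in sentence.split()]
    tokens = [tok for tok in tokens if tok]
    tags = ["O"] * len(tokens)
    lowered = [tok.lower() for tok in tokens]

    for surface, entity_type in entities.items():
        words = surface.lower().split()
        if not words:
            continue
        n = len(words)
        for start in range(len(tokens) - n + 1):
            window_free = all(tag == "O" for tag in tags[start:start + n])
            if lowered[start:start + n] == words and window_free:
                tags[start] = f"B_{entity_type}"
                for k in range(start + 1, start + n):
                    tags[k] = f"I_{entity_type}"
    return tokens, tags


sentence = "Ping An, for instance, has about 20% market share in healthcare insurance."
entities = {"Ping An": "COMP", "Healthcare insurance": "SECT"}

tokens, sentence_tags = bio_annotate(sentence, entities)
texts = [tokens]          # one sublist of tokens per sentence
tags = [sentence_tags]    # one sublist of tags per sentence

print(texts[0][:3])  # ['Ping', 'An', 'for']
print(tags[0][:3])   # ['B_COMP', 'I_COMP', 'O']
```

A production version would more likely start from character offsets of known entities than from string matching, but the output shape is the same.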

Preparing the data to train the model

Now it is time to tokenize our texts and align our labels using HuggingFace tokenizers. Because we rely on DistilBERT as the base model to fine-tune, due to latency requirements, we use the DistilBERT fast tokenizer for this task.

Now we arrive at a common obstacle with using pre-trained models for token-level classification, which the HuggingFace guide describes using the W-NUT corpus as its example: many of the tokens in the corpus are not in DistilBERT’s vocabulary. BERT and many models like it use a method called WordPiece tokenization, meaning that single words are split into multiple tokens such that each token is likely to be in the vocabulary. For example, DistilBERT’s tokenizer would split the Twitter handle @huggingface into the tokens ['@', 'hugging', '##face']. This is a problem for us because we have exactly one tag per token. If the tokenizer splits a token into multiple sub-tokens, then we will end up with a mismatch between our tokens and our labels.
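To make the splitting concrete, here is a toy sketch of greedy longest-match-first WordPiece splitting, preceded by a simple punctuation split. The tiny vocabulary is invented for the example, and lowercasing assumes an uncased model; the real DistilBERT tokenizer uses a vocabulary of tens of thousands of entries and additional normalization.

```python
import string


def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into WordPiece tokens."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


def tokenize(text, vocab):
    """Lowercase, split punctuation into separate tokens, then WordPiece each chunk."""
    tokens = []
    for word in text.lower().split():
        chunk = ""
        for ch in word:
            if ch in string.punctuation:
                if chunk:
                    tokens.extend(wordpiece(chunk, vocab))
                    chunk = ""
                tokens.extend(wordpiece(ch, vocab))
            else:
                chunk += ch
        if chunk:
            tokens.extend(wordpiece(chunk, vocab))
    return tokens


# Hypothetical toy vocabulary, just large enough for the example.
toy_vocab = {"@", "hugging", "##face", "ping", "an"}
print(tokenize("@huggingface", toy_vocab))  # ['@', 'hugging', '##face']
```

One whitespace-separated word with one tag has become three sub-tokens, which is exactly the mismatch described above.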

One way to handle this is to train only on the tag label of the first sub-token of a split token. We can do this in Transformers by setting the labels we wish to ignore to -100. In the example above, if the label for @huggingface is 3 (indexing B-corporation), we would set the labels of ['@', 'hugging', '##face'] to [3, -100, -100].
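Both labelling strategies can be sketched in a few lines. The word_ids list below is written by hand so the example runs without the library; with a fast tokenizer it would come from the encoding's word_ids() method. The function name align_labels and the choice to copy the word's label onto every sub-token in the second strategy are assumptions, not necessarily what was done at RavenPack.

```python
IGNORE_INDEX = -100  # label value described above as ignored during training


def align_labels(word_ids, word_labels, label_all_subtokens=False):
    """Expand one label per word into one label per sub-token.

    `word_ids` gives, for each sub-token, the index of the word it came from,
    or None for special tokens such as [CLS] and [SEP].
    """
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(IGNORE_INDEX)             # special token
        elif word_id != previous or label_all_subtokens:
            aligned.append(word_labels[word_id])     # first sub-token, or keep all
        else:
            aligned.append(IGNORE_INDEX)             # later sub-token, masked
        previous = word_id
    return aligned


# Sub-tokens: [CLS] @ hugging ##face [SEP], all three pieces from word 0.
word_ids = [None, 0, 0, 0, None]
word_labels = [3]

print(align_labels(word_ids, word_labels))        # [-100, 3, -100, -100, -100]
print(align_labels(word_ids, word_labels, True))  # [-100, 3, 3, 3, -100]
```

The first call masks everything but the first sub-token; the second keeps a label on every sub-token, so only the special tokens are ignored.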

In our case, we need to weigh this trade-off. On the one hand, some of our financial data tokens will most likely not be present in DistilBERT’s vocabulary. On the other hand, masking the sub-tokens would make the task of identifying the whole company as an entity more difficult. After testing both approaches, we decided to keep all labels. We kick off by importing the necessary libraries:

Tokenizing the dataset

DistilBERT’s tokenizer performs punctuation splitting and WordPiece tokenization, which in turn requires filling some gaps in our training dataset so that every sub-token is assigned the proper label for the NER task. We relied on the general and TensorFlow guides provided by HuggingFace for this final preprocessing step. In addition, we convert the datasets to TensorFlow format, as TensorFlow is the library that we use at RavenPack for our models.

Training the model

Now that we have our dataset, it is time to fine-tune the model. For this, we will use the TensorFlow implementation of DistilBERT and fine-tune it for 3 epochs. We chose DistilBERT because latency is a constraining factor in our products, as we strive to deliver real-time results to our clients; compared to BERT or RoBERTa, DistilBERT is smaller and provides the best trade-off between inference speed and accuracy.

DistilBERT has 65 million parameters, and we could consider fine-tuning only the weights of the classification head. However, after experimenting we decided to keep all parameters trainable.
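A quick back-of-the-envelope calculation shows how small the head is compared with the rest of the network. It assumes the classification head is a single linear layer from DistilBERT's 768-dimensional hidden states to the 33 classes, which is how the Transformers token classification models are built; the 65 million figure is the one quoted above.

```python
hidden_size = 768          # hidden dimension of the base DistilBERT model
num_labels = 33            # B/I tags for 16 entity types, plus O
total_params = 65_000_000  # figure quoted in this article

# A linear layer has one weight per (input, output) pair plus one bias per output.
head_params = hidden_size * num_labels + num_labels
head_share = head_params / total_params

print(head_params)  # 25377
print(f"{head_share:.2%} of all parameters")
```

Training only the head would therefore update well under a tenth of a percent of the weights, which helps explain why leaving the whole network trainable can make a noticeable difference.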

We are almost there. Now it is time to choose the hyperparameters and train the model.

Learning rate: 5e-5
Batch size: 16
Epochs: 3
Optimizer: Adam
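As a rough sketch, these settings can be kept in one place and used to work out how many optimizer steps the run performs. The TrainingConfig class is hypothetical, and the example count of 20,000 is the training-run size mentioned in this article; the calculation assumes a final partial batch would still count as a step.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    """Hyperparameters listed above, gathered in one immutable object."""
    learning_rate: float = 5e-5
    batch_size: int = 16
    epochs: int = 3
    optimizer: str = "adam"


def steps_per_epoch(num_examples, batch_size):
    """Number of batches needed to see every example once."""
    return math.ceil(num_examples / batch_size)


config = TrainingConfig()
num_examples = 20_000  # training examples per epoch, as reported in the article

per_epoch = steps_per_epoch(num_examples, config.batch_size)
total_steps = per_epoch * config.epochs

print(per_epoch, total_steps)  # 1250 3750
```

Knowing the total step count is useful if a learning-rate schedule or warm-up is added later, since both are usually expressed in steps.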

After running the training loop for three epochs, each taking around 3 minutes for 20,000 examples on a GeForce RTX 3090, the model is now able to detect our different entity types. Let’s assess the performance on some examples. Note that after fine-tuning the model and integrating it into the NER pipeline with the DistilBERT tokenizer, the entity labels are numeric and we must transform them to our format. We have created the helper function detect_entities to wrap the pipeline.
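The conversion step might look like the sketch below. It is not the actual detect_entities helper, which is not shown in this article. It assumes labels arrive either as integer class ids or as strings of the form LABEL_<id>, the default naming in Transformers when a model config carries no label names, and the id2tag mapping is invented for the example.

```python
def decode_label(raw_label, id2tag):
    """Map a numeric class id, or a 'LABEL_<id>' string, to a BIO tag."""
    if isinstance(raw_label, str):
        # Keep only the number after the last underscore, e.g. "LABEL_2" -> 2.
        raw_label = int(raw_label.rsplit("_", 1)[-1])
    return id2tag[raw_label]


# Hypothetical mapping; the real one covers all 33 classes.
id2tag = {0: "O", 1: "B-COMP", 2: "I-COMP", 3: "B-SECT", 4: "I-SECT"}

# Hypothetical (token, label) pairs in the shape a pipeline wrapper might receive.
raw_predictions = [("Ping", "LABEL_1"), ("An", "LABEL_2"), ("for", "LABEL_0")]
decoded = [(token, decode_label(label, id2tag)) for token, label in raw_predictions]

print(decoded)  # [('Ping', 'B-COMP'), ('An', 'I-COMP'), ('for', 'O')]
```

An alternative is to store the mapping in the model configuration when fine-tuning, so that predictions already carry readable tags.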

In addition to this conversion, we can choose whether or not to group the entities. Going back to our example, without grouping we obtain {'Ping':'B-COMP','An':'I-COMP',...}, whereas with grouping all consecutive labels belonging to the same type are merged: {'Ping An':'COMP'}. We can also remove the ‘O’ labels with the option remove_labels = True.
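The grouping behaviour described here can be approximated as follows. The function group_entities is a hypothetical stand-in for that part of the helper; only the remove_labels option name comes from the article. It returns a list of pairs rather than a dictionary so that an entity mentioned twice in a sentence is not silently overwritten.

```python
def group_entities(tokens, tags, remove_labels=False):
    """Merge consecutive B/I tags of the same type into one (text, type) pair."""
    spans = []  # each item is [list_of_words, entity_type]
    for token, tag in zip(tokens, tags):
        if tag == "O":
            prefix, entity_type = "O", "O"
        else:
            # Works for both "B-COMP" and "B_COMP" style tags.
            prefix, entity_type = tag[0], tag[2:]
        continues_previous = (
            bool(spans) and prefix == "I" and spans[-1][1] == entity_type
        )
        if continues_previous:
            spans[-1][0].append(token)
        else:
            spans.append([[token], entity_type])
    return [
        (" ".join(words), entity_type)
        for words, entity_type in spans
        if not (remove_labels and entity_type == "O")
    ]


tokens = ["Ping", "An", "for", "instance"]
tags = ["B-COMP", "I-COMP", "O", "O"]

print(group_entities(tokens, tags))
# [('Ping An', 'COMP'), ('for', 'O'), ('instance', 'O')]
print(group_entities(tokens, tags, remove_labels=True))
# [('Ping An', 'COMP')]
```

An I tag that does not follow a tag of the same type starts a new span here, which is a lenient choice for handling imperfect model output.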

Evaluating the model

Avon said second-quarter profit plunged 70% as the world’s largest direct seller of cosmetics sold fewer items and continued to lose sales representatives in key markets.

French healthcare company Sanofi-Aventis (SNY) will report its third-quarter earnings on October 31.

Conclusion and comments

In this article we have explored how to fine-tune a pre-trained model for a NER task, using custom financial data that we create at RavenPack. We were able to achieve great performance both in latency and accuracy leveraging our rich datasets and the pre-trained models and libraries of HuggingFace. These techniques underlie many of the internal workflows at RavenPack, empowering our editorial staff to achieve greater efficiency and deliver better data.
