What happens when new companies, like Stripe, or
currencies, like Bitcoin, start appearing in the news? To keep up with the
pace of the market, we wanted to create a tool using a Deep Learning
approach to assist our teams in detecting new entities that are not
currently present in our database so they can be considered for inclusion.
The differences between off-the-shelf NER and NER at RavenPack
HuggingFace offers pre-trained models for the NER task based on various architectures. They perform really well without further tuning, as you can see in this example:
Aluminum prices have declined in recent
months on concerns about the eurozone crisis and its implications for
demand, with the London Metal Exchange (LME) three-month aluminum price
down to $ 2,100 a tonne from its peak of $ 2,800 in May.
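A quick way to reproduce this kind of output is the HuggingFace NER pipeline. The snippet below is a minimal sketch: the default checkpoint is a BERT model fine-tuned on CoNLL-2003, and the exact entities and scores it returns may differ slightly from what is shown here.

```python
from transformers import pipeline

# Off-the-shelf NER: the default checkpoint is a BERT model fine-tuned on CoNLL-2003.
ner = pipeline("ner", aggregation_strategy="simple")

text = ("Aluminum prices have declined in recent months on concerns about the "
        "eurozone crisis and its implications for demand, with the London Metal "
        "Exchange (LME) three-month aluminum price down to $2,100 a tonne from "
        "its peak of $2,800 in May.")

for entity in ner(text):
    print(entity["word"], entity["entity_group"], round(entity["score"], 3))
# Expect ORG spans such as 'London Metal Exchange' and 'LME', among others.
```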
The models are trained using the English version of the standard CoNLL-2003 Named Entity Recognition dataset. They are capable of recognizing four types of entities:
- LOC: location.
- PER: person.
- ORG: organizations.
- MISC: miscellaneous.
However, for us, this is not enough. You can see
the difference between what the off-the-shelf models were able to identify
and what our current system identifies:
Aluminum prices have declined in recent
months on concerns about the eurozone crisis and its implications for
demand, with the London Metal Exchange (LME) three-month aluminum price
down to $ 2,100 a tonne from its peak of $ 2,800 in May.
At RavenPack we have a rich corpus of data and use 16 different types of entities that are important for a better understanding of text from a financial standpoint. As you can see in the non-exhaustive list below, some are similar to those of the pre-trained models, like PEOP and PER, while others are more fine-grained: for example, we distinguish between COMP for companies and ORGA for organizations.
- PEOP: people.
- ORGA: organizations like colleges, NGOs, etc.
- COMP: companies.
- PROD: products.
For these reasons, we are going to leverage the capabilities of HuggingFace and pre-trained models, and fine-tune them for a NER task using a custom dataset.
Preprocessing our datasets for a NER task
For privacy reasons, we cannot share the dataset and preprocessing functions. Nevertheless, what is required to fine-tune a model for NER is to convert our sentences into tokens, each with its corresponding entity label. Let’s use the sentence “Ping An, for instance, has about 20% market share in healthcare insurance.” as an example.
In our systems, we have the following entities:
- Ping An: COMP (company)
- Healthcare insurance: SECT (sector)
We will split the sentence by whitespace and BIO-annotate our entities. The ‘O’ stands for tokens that are outside entities, and the ‘B’ or ‘I’ at the beginning of the entity type is the BIO annotation marking the beginning or the middle/end of the entity in our text. This transforms our 16 entities into 33 different classes (B/I + the entity type, plus the O), turning NER into a multiclass token classification problem.
Our preprocessing function goes sentence by sentence through a corpus of NUMBER OF STORIES, creating two lists (illustrated in the sketch after this list):
- texts: a list of lists, each sublist containing the tokens of a sentence: [['Ping','An','for',...],]
- tags: a list of lists, each sublist containing the tags of the sentence: [['B_COMP','I_COMP','O',...],]
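Since we cannot share the real preprocessing code, the sketch below only illustrates the idea: whitespace splitting plus BIO tagging from character-level entity spans. The function name and the character spans are hypothetical, and a production version would also handle punctuation attached to tokens (so that ‘An,’ becomes ‘An’, as in the lists above).

```python
def bio_annotate(sentence, entity_spans):
    """Illustrative sketch: whitespace-split a sentence and BIO-tag each token
    using (start, end, type) character spans of known entities."""
    tokens, tags, cursor = [], [], 0
    for token in sentence.split():
        start = sentence.index(token, cursor)
        end = start + len(token)
        cursor = end
        tag = "O"
        for ent_start, ent_end, ent_type in entity_spans:
            if start < ent_end and end > ent_start:  # token overlaps the entity span
                tag = ("B_" if start == ent_start else "I_") + ent_type
                break
        tokens.append(token)
        tags.append(tag)
    return tokens, tags

# Hypothetical character spans for the example sentence.
sentence = "Ping An, for instance, has about 20% market share in healthcare insurance."
spans = [(0, 7, "COMP"), (53, 73, "SECT")]
print(bio_annotate(sentence, spans))
# (['Ping', 'An,', 'for', ...], ['B_COMP', 'I_COMP', 'O', ...])
```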
Preparing the data to train the model
Now it is time to tokenize our texts and align the labels using HuggingFace tokenizers. Since latency requirements lead us to rely on DistilBERT as our base model to fine-tune, we use the DistilBERT fast tokenizer for this task.
Now we arrive at a common obstacle with using pre-trained models for token-level classification: many of the tokens in our corpus are not in DistilBERT’s vocabulary. BERT and many models like it use a method called WordPiece tokenization, meaning that single words are split into multiple tokens such that each token is likely to be in the vocabulary. For example, DistilBERT’s tokenizer would split the Twitter handle @huggingface into the tokens [‘@’, ‘hugging’, ‘##face’]. This is a problem for us because we have exactly one tag per token. If the tokenizer splits a token into multiple sub-tokens, then we will end up with a mismatch between our tokens and our labels.
One way to handle this is to only train on the tag labels for the first sub-token of a split token. We can do this in Transformers by setting the labels we wish to ignore to -100. In the example above, if the label for @huggingface is 3 (indexing B-corporation), we would set the labels of [‘@’, ‘hugging’, ‘##face’] to [3, -100, -100].
In our case, we need to weigh this trade-off. On the one hand, it is true that some of our financial data tokens will most likely not be present in DistilBERT’s vocabulary. On the other hand, masking the sub-tokens would make the task of identifying the whole company as an entity more difficult. After a test comparing both approaches, we decided to keep all labels. We kick off by importing the necessary libraries:
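The original notebook is not reproduced here, but a plausible set of imports for the rest of this walkthrough would be:

```python
import tensorflow as tf
from transformers import (
    DistilBertTokenizerFast,
    TFDistilBertForTokenClassification,
    pipeline,
)
```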
Tokenizing the dataset
DistilBERT performs punctuation splitting and WordPiece tokenization, which in turn requires filling some gaps in our training dataset to assign the proper label for the NER task. We have relied on the general and TensorFlow guides provided by HuggingFace to do this final preprocessing step. In addition, we will convert the datasets to TensorFlow, as it is the library that we are using at RavenPack for our models.
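Following those guides, the alignment step could look roughly like the sketch below. In line with the decision above, each word’s label is propagated to all of its sub-tokens instead of being masked with -100; only special tokens and padding are ignored. Here texts and tags are the lists built earlier, tag2id is an assumed mapping from our 33 BIO classes to integer ids, and the distilbert-base-uncased checkpoint is an assumption.

```python
tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")

def tokenize_and_align_labels(texts, tags, tag2id):
    # Tokenize pre-split sentences; the fast tokenizer keeps track of word ids.
    encodings = tokenizer(texts, is_split_into_words=True, truncation=True, padding=True)
    labels = []
    for i, sentence_tags in enumerate(tags):
        word_ids = encodings.word_ids(batch_index=i)
        label_ids = []
        for word_id in word_ids:
            if word_id is None:
                label_ids.append(-100)  # [CLS], [SEP] and padding stay ignored
            else:
                # Keep the word-level label on every sub-token.
                label_ids.append(tag2id[sentence_tags[word_id]])
        labels.append(label_ids)
    return encodings, labels

encodings, labels = tokenize_and_align_labels(texts, tags, tag2id)
train_dataset = tf.data.Dataset.from_tensor_slices((dict(encodings), labels))
```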
Training the model
Now that we have our dataset, it is time to fine-tune the model. For this, we will use the TensorFlow implementation of DistilBERT and fine-tune it for 3 epochs. The reason we have chosen DistilBERT is that latency is a constraining factor in our products, as we strive to deliver real-time results to our clients; compared to BERT or RoBERTa, DistilBERT is considerably smaller and provides the best trade-off between inference speed and accuracy.
DistilBERT has 65 million parameters, and we could consider fine-tuning only the weights of the classification head. However, after experimenting, we decided to keep all parameters trainable.
We are almost there. Now it is time to choose the hyperparameters and train the model (a sketch of the training code follows the list):
- Learning rate: 5e-5
- Batch size: 16
- Epochs: 3
- Optimizer: Adam
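Putting these together, a sketch of the training setup with the hyperparameters above might look like this, building on the train_dataset created earlier. The checkpoint name is again an assumption, and recent transformers versions supply the token-classification loss internally when compile() receives no explicit loss.

```python
model = TFDistilBertForTokenClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=33,  # 16 entity types x (B/I) + O
)
# All parameters stay trainable; we do not freeze the DistilBERT body.

optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
model.compile(optimizer=optimizer)  # the model's internal token-classification loss is used

model.fit(train_dataset.shuffle(1000).batch(16), epochs=3)
```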
After running the training loop for three epochs, each taking around 3 minutes for 20,000 examples on a GeForce RTX 3090, the model is now able to detect our different entity types. Let’s assess the performance on some examples. It is important to note that after fine-tuning the model and integrating it into the NER pipeline with the DistilBERT tokenizer, the entity labels will be numeric, so we must transform them to our format. We have created the helper function detect_entities to wrap the pipeline.
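The implementation of detect_entities is not shown in this post; a simplified sketch of such a wrapper, assuming the id-to-tag mapping is attached to the model config, the model and tokenizer from the snippets above, and glossing over sub-word merging, could look like this:

```python
# Report tag names such as B_COMP instead of raw ids like LABEL_3.
model.config.id2label = {i: tag for tag, i in tag2id.items()}
model.config.label2id = tag2id

# ignore_labels=[] keeps 'O' tokens so the wrapper can decide whether to drop them.
ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, ignore_labels=[])

def detect_entities(text, group_entities=True, remove_labels=True):
    """Hypothetical wrapper: map labels to our tags, optionally merge consecutive
    tokens of the same entity type, and optionally drop 'O' tokens."""
    results = {}
    current_word, current_type = "", None
    for token in ner_pipeline(text):
        label = token["entity"]                               # e.g. 'B_COMP', 'I_COMP' or 'O'
        entity_type = "O" if label == "O" else label.split("_", 1)[-1]
        word = token["word"]
        if not group_entities:
            if not (remove_labels and label == "O"):
                results[word] = label
            continue
        if label.startswith("I_") and entity_type == current_type:
            current_word += " " + word                        # extend the running entity
        else:
            if current_type is not None and not (remove_labels and current_type == "O"):
                results[current_word] = current_type
            current_word, current_type = word, entity_type
    if group_entities and current_type is not None and not (remove_labels and current_type == "O"):
        results[current_word] = current_type
    return results
```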
In addition to this conversion, we can choose whether or not to group the entities. Going back to our example, without grouping we would obtain {'Ping':'B_COMP','An':'I_COMP',...}, whereas with grouping all consecutive labels belonging to the same type are merged: {'Ping An':'COMP'}. We can also remove the ‘O’ labels with the option remove_labels = True.
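With the sketch above, the two modes would look roughly like this (outputs are illustrative):

```python
sentence = "Ping An, for instance, has about 20% market share in healthcare insurance."

detect_entities(sentence, group_entities=False, remove_labels=False)
# {'Ping': 'B_COMP', 'An': 'I_COMP', 'for': 'O', ...}

detect_entities(sentence, group_entities=True, remove_labels=True)
# {'Ping An': 'COMP', 'healthcare insurance': 'SECT'}
```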
Evaluating the model
Avon said second-quarter
profit plunged 70% as the world’s largest direct
seller of cosmetics sold fewer items and continued to
lose sales representatives in key markets.
French healthcare company
Sanofi-Aventis (SNY) will report its third-quarter
earnings on October 31.
Conclusion and comments
In this article we have explored how to fine-tune a pre-trained model for a NER task using custom financial data that we create at RavenPack. We were able to achieve great performance in both latency and accuracy by leveraging our rich datasets and the pre-trained models and libraries of HuggingFace. These techniques underlie many of the internal workflows at RavenPack, empowering our editorial staff to achieve greater efficiency and deliver better data.