Sentiment at the Sentence Level
As you can probably tell, we take Sentiment & Classification seriously; it is our core business. In this article, our focus is a deep learning model designed to predict sentence-level sentiment.
Our initial model was a regression model with an output between -1 (extremely negative) and 1 (extremely positive). After some quick tests, we decided to move to a classification problem, dividing the [-1, 1] range into 41 bins. The classification approach has additional benefits, such as computing probabilities for each bin, which allows us to build a confidence score associated with the sentiment score.
Note: state-of-the-art sentiment analysis models typically have at most three outputs (Positive, Neutral, Negative), but we are dealing with a more granular range of 41 levels (20 positive, 20 negative, and 1 neutral).
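As a rough illustration of the output side, here is a minimal sketch in plain Python/NumPy (the exact bin edges and the choice of the winning-bin probability as the confidence score are our own simplifications, not the production logic):

import numpy as np

N_BINS = 41  # 20 negative, 1 neutral, 20 positive

def score_to_bin(score):
    """Map a continuous sentiment score in [-1, 1] to one of the 41 bins."""
    edges = np.linspace(-1.0, 1.0, N_BINS + 1)
    return int(np.clip(np.digitize(score, edges) - 1, 0, N_BINS - 1))

def bin_to_score(bin_index):
    """Map a bin index back to the centre of its interval in [-1, 1]."""
    width = 2.0 / N_BINS
    return -1.0 + width * (bin_index + 0.5)

def sentiment_and_confidence(probabilities):
    """Given the 41 predicted probabilities, return a sentiment score and a confidence."""
    best = int(np.argmax(probabilities))
    return bin_to_score(best), float(probabilities[best])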
The Problem
We tested dozens of Natural Language Processing models, analyzed multiple architectures, embeddings, losses, cell types… a rich variety of approaches, but they all share one thing in common: the input.
The input we initially focused on was the sentence itself, something arguably shared across the majority of NLP models. From there, we started observing differences: how to tokenize, how to encode, or what embeddings to use. We optimized the model by experimenting with embeddings, tokenizer combinations, and different architectures.
Finally, when we analyzed the model errors, we faced a long tail of production issues: certain negations and uncommon phrase constructions were resulting in model inaccuracies. Certain cell types, like LSTMs, or the use of convolutions helped the model better understand phrase construction, but the results weren’t perfect.
Some of those phrase constructions included:
- Positive outputs caused by negative causes. For example, a sentence like “Due to [something bad], [something good] is happening”. The sentence should have a positive sentiment, but the negative part often confused the model and guided it to a negative sentiment. This also happened in the opposite direction: negative output due to positive causes.
- Negative words used to describe positive things (or the other way around). For example, “smaller loss than expected” or “results beating on the bottom line”.
- Negative/positive words used to explain a positive/negative outcome: “The forecasts were pessimistic, because of fears of recession, but the results were good”.
We also faced other issues, with some words having a significant impact on the sentiment output. Certain terms (like bitcoin) or company names were always associated with positive or negative sentences and were driving the output towards a specific sentiment. Those “entities” were biasing the output of the network.
First Attempt: Blank Some Entities
We didn’t want entities to bias the network’s sentiment output, so our first idea was to blank them. Let’s see an example:
Gamestop has announced a new trade-in offer where it
will pay you $200 for your current-gen system when you apply the credit towards the
purchase of either a new Xbox One or PS4.
If we blank the entity names, we would feed the network with
this input:
* has announced a new trade-in offer where it will pay
you $200 for your current-gen system when you apply the credit towards the purchase
of either a new * or *.
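As a rough sketch of that first attempt (assuming the character spans of the detected entities are already available; the helper below is hypothetical, not our production code):

def blank_entities(sentence, entity_spans):
    """Replace each detected entity span (start, end character offsets) with '*'."""
    result = []
    last = 0
    for start, end in sorted(entity_spans):
        result.append(sentence[last:start])
        result.append("*")
        last = end
    result.append(sentence[last:])
    return "".join(result)

sentence = ("Gamestop has announced a new trade-in offer where it will pay you $200 "
            "for your current-gen system when you apply the credit towards the "
            "purchase of either a new Xbox One or PS4.")
# Spans located here by simple string search, purely for illustration.
spans = [(sentence.find(e), sentence.find(e) + len(e))
         for e in ("Gamestop", "Xbox One", "PS4")]
print(blank_entities(sentence, spans))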
By removing the entities, we were losing information from the sentence… Yet, processing entities within the sentences introduced unwanted bias. How do you ensure the model isn’t biased towards specific entities and still captures phrase construction, all while ensuring performance isn’t impacted? By adding additional information to the network: a second input.
Second Attempt: Add a Secondary Input to the Network
But… how can we detect those entities before training the network? Well, working at RavenPack has some benefits. We maintain an extensive Point-In-Time knowledge base that we use to detect entities in financial news.
Our current systems process sentences using the RavenPack Enhanced Annotator. Our domain-specific Annotator is designed for financial language and is capable of generating Part-of-Speech tags, noun & verb phrases, and NER-specific tags like currencies, places, persons, products, companies, date periods, reporting periods, etc.
We decided to expose these tags from the current systems to our machine learning models and included them as a second input. That second input is merged into the neural network body after the words are translated into embeddings.
The Annotated Input
This “annotator on steroids” can provide up to 80 different tags per word. Here we have the previous example after using the tagger:
GameStop ['entity_D42DBA', '$NOUN-PHRASE', 'NOUN', '$COMPANY']
has ['$IGNORABLE-TEXT', 'VERB', '$VERB-PHRASE']
announced ['VERB', '$VERB-PHRASE']
a ['$NOUN-PHRASE', '$IGNORABLE-TEXT', 'DET', '$RATING']
new ['$NOUN-PHRASE', '$IGNORABLE-TEXT', 'ADJ']
trade-in ['$NOUN-PHRASE', '$IGNORABLE-TEXT', 'NOUN']
offer ['$NOUN-PHRASE', 'NOUN']
where ['ADV']
it ['PRON']
will ['VERB', '$VERB-PHRASE', '$IGNORABLE-TEXT']
pay ['VERB', '$VERB-PHRASE', '%RESPECT-NOUN']
you ['PRON']
$200 ['$NOUN-PHRASE','NUMBER','$CURRENCY-NAME']
for ['$PREP-PHRASE', 'PREP']
your ['$PREP-PHRASE', 'ADJ', '$IGNORABLE-TEXT']
current-gen ['$PREP-PHRASE', 'NOUN']
system ['$PREP-PHRASE', 'NOUN']
when ['ADV']
you ['PRON']
apply ['$VERB-PHRASE', 'VERB']
the ['$IGNORABLE-TEXT', '$NOUN-PHRASE', 'DET']
credit ['NOUN', '$NOUN-PHRASE']
towards ['PREP', '$PREP-PHRASE']
the ['$IGNORABLE-TEXT', 'DET', '$PREP-PHRASE']
purchase ['NOUN', '$PREP-PHRASE']
of ['PREP']
either ['CONJ']
a ['DET', '$IGNORABLE-TEXT', '$RATING', '$NOUN-PHRASE']
new ['$IGNORABLE-TEXT', '$NOUN-PHRASE', 'ADJ']
Xbox ['entity_8131DB', '$COMPANY', '$NOUN-PHRASE', '$PRODUCT', 'NOUN']
One ['entity_8131DB', '$COMPANY', '$NOUN-PHRASE', '$NUMBER', '$PRODUCT', 'NOUN']
or ['CONJ']
PS4. ['$COMPANY', 'entity_61EB58', '$NOUN-PHRASE', '.', '$NUMBER', '$PRODUCT', 'NOUN']
entity_D42DBA refers to the company entity GameStop Corp.
entity_8131DB refers to the product entity Xbox One from the company Microsoft Corp.
entity_61EB58 refers to the product entity PlayStation 4 from the company Sony Interactive Entertainment LLC.
This second input is a tensor with dimension n_words * 80 that we mask to enable or disable some of the tags. Let’s see how that looks in our code. We use TensorFlow in Python:
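The snippet below is a simplified sketch of what such a two-input model can look like (the layer sizes, layer names, and the single-LSTM body are illustrative, not our production configuration):

import tensorflow as tf
from tensorflow.keras import layers

MAX_WORDS = 100      # maximum sentence length (illustrative)
VOCAB_SIZE = 50000   # vocabulary size (illustrative)
EMBED_DIM = 128      # word embedding size (illustrative)
N_TAGS = 80          # tags per word, as described above
N_CLASSES = 41       # sentiment bins

# First input: the tokenized sentence as word ids, translated into embeddings.
word_ids = layers.Input(shape=(MAX_WORDS,), dtype="int32", name="word_ids")
word_embeddings = layers.Embedding(VOCAB_SIZE, EMBED_DIM)(word_ids)

# Second input: one multi-hot tag vector per word (n_words x 80).
tags = layers.Input(shape=(MAX_WORDS, N_TAGS), dtype="float32", name="tags")

# Static mask for enabling or disabling individual tags (all enabled here).
tag_mask = tf.constant([1.0] * N_TAGS, dtype=tf.float32)
masked_tags = layers.Lambda(lambda t: t * tag_mask, name="tag_mask")(tags)

# Merge the tag features with the word embeddings and feed the sequence to an LSTM.
merged = layers.Concatenate(axis=-1)([word_embeddings, masked_tags])
sequence = layers.LSTM(128)(merged)

# 41-way softmax over the sentiment bins, trained with KL divergence.
sentiment = layers.Dense(N_CLASSES, activation="softmax", name="sentiment_bins")(sequence)

model = tf.keras.Model(inputs=[word_ids, tags], outputs=sentiment)
model.compile(optimizer="adam", loss=tf.keras.losses.KLDivergence())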
We can remove entity names, numbers, and dates from a sentence because we provide this info in the tagger input. This helps our model train without bias, while simultaneously feeding the network additional information about the sentence structure.
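For illustration, a minimal sketch of that substitution (the set of tag names to blank and the placeholder scheme are simplified examples, not our exact production rules):

# Tags marking tokens whose surface form we drop from the sentence input;
# the tag names below follow the example above and are only illustrative.
BLANKED_TAGS = {"$COMPANY", "$PRODUCT", "$PERSON", "NUMBER", "$NUMBER"}

def neutralize_tokens(tokens, token_tags):
    """Replace entity names, numbers, and dates with a generic placeholder.

    tokens is the list of words and token_tags the list of tag sets
    produced by the annotator for each word."""
    neutral = []
    for word, tags in zip(tokens, token_tags):
        hit = BLANKED_TAGS.intersection(tags)
        neutral.append(sorted(hit)[0] if hit else word)
    return neutral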
This structure, with number awareness, date awareness, and extra information, allowed our models to achieve marked accuracy improvements, especially in the long tail of errors we observed. While our first attempt was already optimized, this additional information provided the tools for avoiding mistakes caused by some difficult phrase constructions, leading to a boost in robustness.
Benchmarks & Findings
In order to benchmark our optimized approach, we built three models with the exact same architecture, embeddings, and outputs, but we changed the input.
The first model contains a single input: the sentence.
The second model contains the sentence input and the tags input, but we masked some tags: entity, company, team, person, organization, or product. We are excluding them because we want the model to also work without our entity detections. Even without those tags, this model still has access to the POS tags, phrase structure info, and other detections like commodities, currencies, or places.
The third model uses all available tags.
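In terms of the tag mask from the sketch above, the three configurations roughly correspond to the following (the tag-to-column mapping is illustrative; the real mapping lives in our annotator configuration):

import numpy as np

N_TAGS = 80
# Illustrative positions of the entity-related tag columns:
# entity, company, team, person, organization, product.
ENTITY_TAG_COLUMNS = [0, 1, 2, 3, 4, 5]

mask_all_tags = np.ones(N_TAGS, dtype=np.float32)      # third model: every tag enabled

mask_no_entities = np.ones(N_TAGS, dtype=np.float32)   # second model: entity-related tags hidden
mask_no_entities[ENTITY_TAG_COLUMNS] = 0.0

# The first model simply has no tag input at all, which is equivalent to
# masking every column under the scheme above.
mask_no_tags = np.zeros(N_TAGS, dtype=np.float32)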
The models were trained with 9 million sentences over 5 epochs. The results shown below belong to a 256k-sentence test set not used in training.
As you may observe, the model without tags is already pretty optimized. The Mean Absolute Error (MAE) is almost the same, but by looking at the other metrics we can observe the consistency of the improvement.
We also present sign accuracy, as it is commonly used in financial models where sentiment direction is important. In this case, we reduce our 41 sentiment levels to 3 (Negative, Neutral, Positive).
The R-squared shows how well our results fit the baseline linguistic model based on templates.
To finish, we use the Kullback-Leibler Divergence (also our loss function) because we output 41 probabilities, one for each of the 41 outputs, and we compare that with the multiple sentiments present in the template-based system.
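For reference, here is a sketch of how the sign accuracy and the KL divergence can be computed from the 41-bin outputs (the assumption that bins 0-19 are negative, bin 20 is neutral, and bins 21-40 are positive is our own illustrative ordering):

import numpy as np

def bin_to_sign(bin_index):
    """Collapse a 41-level bin into -1 (negative), 0 (neutral) or 1 (positive).
    Assumes bins 0-19 are negative, 20 is neutral, 21-40 are positive."""
    if bin_index < 20:
        return -1
    if bin_index == 20:
        return 0
    return 1

def sign_accuracy(true_bins, predicted_bins):
    """Share of sentences where the predicted sentiment direction matches."""
    matches = [bin_to_sign(t) == bin_to_sign(p)
               for t, p in zip(true_bins, predicted_bins)]
    return float(np.mean(matches))

def kl_divergence(target_probs, predicted_probs, eps=1e-9):
    """Mean Kullback-Leibler divergence between target and predicted 41-bin distributions."""
    target = np.asarray(target_probs) + eps
    predicted = np.asarray(predicted_probs) + eps
    return float(np.mean(np.sum(target * np.log(target / predicted), axis=-1)))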
Sentiment Examples
While these improvements may not seem that significant at a macro scale, we can observe real improvements when we look at the complex sentence structures that are being misinterpreted by the single-input model but are now correctly interpreted by the model with all our tags: