Machine Learning @ RavenPack: Improving Sentiment Models With Better Inputs

June 29, 2020

At RavenPack we are developing our next-gen platform, and we are sharing some of our research findings with the NLP community at large.


At RavenPack, we’re in the business of Giving Meaning to Unstructured Data. Leaving the marketing hyperbole aside, we’re constantly looking to innovate and improve our text processing algorithms, models, and the data we feed them. Over the years, we’ve experimented with many different approaches and improvements to our core Named Entity Recognition (NER), Classification, and Sentiment analysis tasks, with varying degrees of success.

The final approach we adopted always came down to the basics: precision vs recall, especially at our scale. This became such an obsession for us that we built our own text processing library from the ground up in LISP (we’re always hiring LISP programmers!) to ensure accuracy and consistency in our output.

To better benchmark our performance, we have built the largest sentiment analysis dataset with millions of sentences labeled with a score between -1 and 1. This represents a balanced sample from over 20 years of our classified news data.

“And it couldn’t have come at a better time. NLP’s ImageNet moment had arrived, and we had all the right ingredients to push the envelope at RavenPack further”

Over a series of articles, we plan to publish some of our findings to the NLP community at large. We’re actively building our Text Analysis infrastructure & APIs for use across the community and would love your feedback.

Sentiment at the Sentence Level

As you can tell, we take Sentiment & Classification seriously; it is our core business. For today’s topic, our focus is a deep learning model designed to predict sentence sentiment.

Our initial model was a regression model with an output between -1 (extremely negative) and 1 (extremely positive). After some quick tests, we decided to recast the task as a classification problem, dividing the [-1, 1] range into 41 bins. The classification approach has additional benefits, such as computing probabilities for each bin, which allows us to build a confidence score associated with the sentiment score.

Note: state-of-the-art sentiment analysis models have typically had a maximum of three outputs (Positive, Neutral, Negative), but we are dealing with a more granular range of 41 levels (20 positive, 20 negative, and 1 neutral).
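To make the binning concrete, here is a minimal Python sketch. It assumes the 41 bins are evenly spaced with centers every 0.05 from -1 to 1, and it uses the probability of the most likely bin as a stand-in confidence score. Both choices are illustrative assumptions, not a description of our production formulas, and the function names are hypothetical.

```python
N_BINS = 41   # 20 negative, 1 neutral, 20 positive

def score_to_bin(score: float) -> int:
    """Map a score in [-1, 1] to the index of the nearest bin center."""
    clipped = max(-1.0, min(1.0, score))
    return int(round((clipped + 1.0) * (N_BINS - 1) / 2.0))

def bin_to_score(index: int) -> float:
    """Return the center of a bin as a sentiment score (assumes uniform spacing)."""
    return -1.0 + index * 2.0 / (N_BINS - 1)

def summarize(probs):
    """Turn 41 bin probabilities into (expected score, confidence).

    Confidence is taken as the probability of the most likely bin,
    which is an illustrative choice only.
    """
    if len(probs) != N_BINS:
        raise ValueError("expected one probability per bin")
    expected = sum(p * bin_to_score(i) for i, p in enumerate(probs))
    return expected, max(probs)

# A prediction that puts all its mass on the bin centered at 0.5.
probs = [0.0] * N_BINS
probs[score_to_bin(0.5)] = 1.0
expected_score, confidence = summarize(probs)
```

With a regression output there is only a single number per sentence; the per-bin probabilities are what make a confidence estimate of this kind possible.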

The Problem

We tested dozens of Natural Language Processing models, analyzed multiple architectures, embeddings, losses, cell types… a rich variety of approaches, but they all share one thing in common: the input.

The input we initially focused on was the sentence itself, something shared across the majority of NLP models. From there, we started observing differences: how to tokenize, how to encode, or which embeddings to use. We optimized the model by experimenting with embeddings, tokenizer combinations, and different architectures.

Finally, when we analyzed the model errors, we faced a long tail of production issues: certain negations and uncommon phrase constructions were resulting in model inaccuracies. Certain cell types like LSTM, or the use of convolutions, helped the model better capture phrase construction, but the results weren’t perfect.

Some of those phrase constructions included:

  • Positive outcomes with negative causes. For example, a sentence like “Due to [something bad], [something good] is happening”. The sentence should have a positive sentiment, but the negative part often confused the model and pushed it towards negative sentiment. This also happened in the opposite direction: negative outcomes with positive causes.
  • Negative words used to describe positive things (or the other way around). For example, “smaller loss than expected” or “Results beating on the bottom line”.
  • Negative/positive words used to explain a positive/negative outcome: “The forecasts were pessimistic, because of fears of recession, but the results were good”.
  • We also faced other issues, with some words having a significant impact on the sentiment output. Certain terms (like bitcoin) or company names were always associated with positive or negative sentences and were driving the output to a specific sentiment. Those “entities” were biasing the output of the network.

    First Attempt: Blank Some Entities

    We didn’t want the network to bias the sentiment because of entities, so our first idea was to blank them.

    Let’s see an example:

    Gamestop has announced a new trade-in offer where it will pay you $200 for your current-gen system when you apply the credit towards the purchase of either a new Xbox One or PS4.

    If we blank the entity names, we would feed the network with this input:

    *has announced a new trade-in offer where it will pay you $200 for your current-gen system when you apply the credit towards the purchase of either a new* or *.
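The blanking step can be sketched in a few lines of Python. This is only an illustration: the exact-match `find_spans` helper is a hypothetical stand-in for a real entity detector, the example sentence is shortened, and the `*` placeholder is an assumed convention.

```python
def blank_entities(sentence: str, entity_spans, placeholder: str = "*") -> str:
    """Replace each (start, end) character span with a placeholder."""
    pieces, cursor = [], 0
    for start, end in sorted(entity_spans):
        pieces.append(sentence[cursor:start])   # text before the entity
        pieces.append(placeholder)              # the entity itself is dropped
        cursor = end
    pieces.append(sentence[cursor:])            # text after the last entity
    return "".join(pieces)

def find_spans(sentence: str, names):
    """Toy detector: locate known names by exact string match (first hit only)."""
    spans = []
    for name in names:
        pos = sentence.find(name)
        if pos != -1:
            spans.append((pos, pos + len(name)))
    return spans

sentence = "Gamestop will pay you $200 towards a new Xbox One or PS4."
spans = find_spans(sentence, ["Gamestop", "Xbox One", "PS4"])
blanked = blank_entities(sentence, spans)
# blanked == "* will pay you $200 towards a new * or *."
```

After blanking, the network can no longer tell a company from a product, or either from any other kind of entity.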

    By removing the entities, we were losing information from the sentence. Yet processing entities within the sentences introduced unwanted bias. How do you ensure the model isn’t biased towards specific entities and still captures phrase construction, all without hurting performance? By adding additional information to the network: a second input.

    Second Attempt: Add a Secondary Input to the Network

    But… how can we detect those entities before training the network? Well, working at RavenPack has some benefits. We maintain an extensive Point-In-Time knowledge base that we use to detect entities in financial news.

    Our current systems process sentences using the RavenPack Enhanced Annotator. Our domain-specific Annotator is designed for financial language and is capable of generating Part of Speech tags, noun & verb phrases, and NER-specific tags like currencies, places, persons, products, companies, date periods, reporting periods, etc.

    We decided to expose these tags from the current systems to our machine learning models and included them as a second input. That second input was merged in the neural network body, after translating the words into embeddings.
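One simple way to picture the merge is as a per-word concatenation of the embedding vector and the tag vector. The NumPy sketch below assumes exactly that. The article does not state which merge operation is used, so concatenation, the embedding size, and the function name are all assumptions made for illustration. The 80 tags per word is the figure given for the annotator.

```python
import numpy as np

N_TAGS = 80        # tags per word, the figure given for the annotator
EMBED_DIM = 16     # hypothetical embedding size, chosen only for this sketch

def merge_inputs(word_embeddings: np.ndarray, tag_matrix: np.ndarray) -> np.ndarray:
    """Concatenate per-word embeddings with per-word tag indicators."""
    if word_embeddings.shape[0] != tag_matrix.shape[0]:
        raise ValueError("both inputs must describe the same number of words")
    return np.concatenate([word_embeddings, tag_matrix], axis=-1)

rng = np.random.default_rng(0)
n_words = 5
embeddings = rng.normal(size=(n_words, EMBED_DIM))   # stand-in for learned embeddings
tags = np.zeros((n_words, N_TAGS))
tags[0, 3] = 1.0                                     # word 0 carries hypothetical tag 3
merged = merge_inputs(embeddings, tags)              # shape: (5, 96)
```

Each word is then represented by its embedding plus an explicit description of what the annotator knows about it.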


    The Annotated Input

    This “annotator on steroids” can provide up to 80 different tags per word. Here we have the previous example after using the tagger:

    [Figure: annotated input]

    entity_D42DBA refers to the company entity GameStop Corp.
    entity_8131DB refers to the product entity Xbox One from the company Microsoft Corp.
    entity_61EB58 refers to the product entity PlayStation 4 from the company Sony Interactive Entertainment LLC

    This second input is a tensor of dimension n_words × 80 that we mask to enable or disable some of the tags. Let’s see how that looks in our code. We use TensorFlow in Python:

    [Figure: TensorFlow in Python]
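As a rough illustration of the masking idea, the sketch below builds a 0/1 vector over the 80 tags and multiplies it into the tag tensor. It uses NumPy so that it runs on its own, and the same broadcasted multiplication carries over to TensorFlow tensors. The tag names and their positions are hypothetical, since the annotator's real tag layout is not given here.

```python
import numpy as np

N_TAGS = 80
# Hypothetical tag layout: the real tag order is not given in this article.
TAG_INDEX = {"company": 0, "product": 1, "person": 2, "noun_phrase": 3, "currency": 4}

def build_tag_mask(disabled, n_tags: int = N_TAGS) -> np.ndarray:
    """Return a 0/1 vector with zeros at the positions of disabled tags."""
    mask = np.ones(n_tags, dtype=np.float32)
    for name in disabled:
        mask[TAG_INDEX[name]] = 0.0
    return mask

def apply_tag_mask(tag_matrix: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Zero out disabled tag columns; broadcasting applies the mask to every word."""
    return tag_matrix * mask

# Four words, with a few tags switched on.
tags = np.zeros((4, N_TAGS), dtype=np.float32)
tags[0, TAG_INDEX["company"]] = 1.0
tags[0, TAG_INDEX["noun_phrase"]] = 1.0
tags[2, TAG_INDEX["currency"]] = 1.0

masked = apply_tag_mask(tags, build_tag_mask(["company", "product", "person"]))
```

Because the mask is applied to the input rather than baked into the architecture, the same network can be trained or evaluated with different subsets of tags.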

    We can remove entity names, numbers, and dates from a sentence because we provide this info in the tagger input. This helps our model train without bias, while simultaneously feeding the network additional information about the sentence structure.

    This structure, with its number awareness, date awareness, and extra information, allowed our models to achieve marked accuracy improvements, especially in the long tail of errors we observed. While our first attempt was already optimized, this additional information provided the tools for avoiding mistakes caused by some difficult phrase constructions, leading to a boost in robustness.

    Benchmarks & Findings

    In order to benchmark our optimized approach, we built three models with the same exact architecture, embeddings, and outputs, but we changed the input.

    The first model contains a single input: the sentence.

    The second model contains the sentence input and the tags input, but we masked some tags: entity, company, team, person, organization, and product. We exclude them because we also want the model to work without our entity detections. Even without those tags, this model still has access to the POS, phrase structure info, and other detections like commodities, currencies, or places.

    The third model uses all available tags.

    The models were trained with 9 million sentences in 5 epochs. The results shown below belong to the 256k test set, not used in training.

    [Figure: benchmark results on the test set]

    As you may observe, the model without tags is already pretty optimized. The Mean Absolute Error (MAE) is almost the same, but by looking at the other metrics we can observe the consistency of the improvement.

    We also present sign accuracy, as it is commonly used in financial models where sentiment direction is important. In this case, we reduce our 41 sentiment levels to 3 (Negative, Neutral, Positive).
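The reduction from 41 levels to 3 can be sketched as follows. The sketch assumes bins are indexed 0 to 40 from most negative to most positive, so that index 20 is the neutral bin. The function names are hypothetical and the sample predictions are made up for illustration.

```python
NEUTRAL_BIN = 20   # middle of 41 bins, assuming bins are ordered from -1 to 1

def bin_to_sign(index: int) -> int:
    """Collapse a bin index to -1 (negative), 0 (neutral) or 1 (positive)."""
    return (index > NEUTRAL_BIN) - (index < NEUTRAL_BIN)

def sign_accuracy(predicted_bins, true_bins) -> float:
    """Fraction of sentences whose predicted direction matches the label."""
    if len(predicted_bins) != len(true_bins):
        raise ValueError("inputs must have the same length")
    hits = sum(bin_to_sign(p) == bin_to_sign(t)
               for p, t in zip(predicted_bins, true_bins))
    return hits / len(true_bins)

# Made-up example: three of four predictions have the right direction.
predicted = [35, 20, 3, 22]
labels = [28, 20, 10, 18]
accuracy = sign_accuracy(predicted, labels)   # 0.75
```

Note that a prediction can be several bins away from the label and still count as correct here, as long as it falls on the same side of neutral.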

    The R-squared shows how well our results fit the baseline linguistic model based on templates.

    Finally, we use the Kullback-Leibler divergence (also our loss function) because we output 41 probabilities, one for each of the 41 possible outputs, and we compare them with the multiple sentiments present in the templates-based system.
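For two discrete distributions over the 41 bins, the divergence is KL(t ‖ p) = Σ t_i · log(t_i / p_i). The pure-Python sketch below computes it with the target distribution as the first argument, which is the usual convention when KL divergence serves as a training loss. The `eps` guard and the example distributions are assumptions made for illustration.

```python
import math

N_BINS = 41

def kl_divergence(target, predicted, eps: float = 1e-12) -> float:
    """KL(target || predicted) for two discrete distributions of equal length."""
    if len(target) != len(predicted):
        raise ValueError("distributions must have the same length")
    total = 0.0
    for t, p in zip(target, predicted):
        if t > 0.0:                                  # terms with t == 0 contribute nothing
            total += t * math.log(t / max(p, eps))   # eps avoids division by zero
    return total

uniform = [1.0 / N_BINS] * N_BINS
one_hot = [0.0] * N_BINS
one_hot[30] = 1.0

same = kl_divergence(uniform, uniform)        # identical distributions give 0
spread = kl_divergence(one_hot, uniform)      # equals log(41), about 3.71
```

Unlike MAE on a single score, this metric penalizes a prediction for placing probability mass on the wrong bins even when its most likely bin is correct.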

    Sentiment Examples

    While these improvements may not seem that significant at a macro scale, we can observe real improvements when we look at the complex sentence structures that are being misinterpreted by the single-input model but are now correctly interpreted by the model with all our tags:

    [Figure: sentiment examples]

    Final Thoughts

    To summarize, in this article we introduced some of the key problems we’ve faced in implementing our NextGen NLP Sentiment Classifier along with RavenPack’s Enhanced Annotator service.
