
The Stages of NLP

The process of NLP takes place in three steps. The first is to ‘mark up’ each individual word of a sentence according to its role and function within the sentence. The second is the detection of significant events, and the third is the calculation of sentiment based on predefined parameters and modifying factors.

Marking Up

In simple terms, marking up text is a categorization exercise in which words are tagged as belonging to predefined high-level categories. For example, a word might be marked up as a ‘noun’, ‘adjective’, ‘object’, ‘subject’, ‘predicate’, ‘company’, ‘product’, ‘asset’, ‘stock’, ‘bond’ or some other designation. The same word can belong to multiple categories depending on the level of abstraction: the word ‘iPhone’, for example, belongs to the categories ‘noun’, ‘object’, ‘product’ and ‘electronic device’ all at the same time.

At RavenPack, the team of developers led by Andrew Lawson, RavenPack’s VP of Analytics, write the programmes that mark up text. They do so in Lisp, a specialist programming language.

The process involves codifying the rules of grammar and language usage in their many and varied forms. A word’s context in relation to nearby words can also provide clues as to its role, purpose and meaning. Another method is to use long lists of rules derived from real-world usage. To programme the system to assign a word to the category ‘company’, for example, a programmer might include rules stating that the word is a noun, is likely to be capitalized, and ends in one of the suffixes Inc, Corp, SL, Ltd or SA, or any number of other suffixes depending on the country of incorporation. At the same time, the programme would have to take into account that not all mentions of a company end in a suffix, and that not all capitalized nouns are companies: they may be the names of people, countries or NGOs, all of which are also capitalized. The programme can narrow down the possibilities, however, by using nearby words and phrases, or other words in the same article, as clues, and so arrive at an accurate classification of the word.
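The suffix and context rules described above can be sketched in a few lines. This is only an illustration, written in Python rather than the Lisp the team uses. The suffix list comes from the examples in the text, while the clue words, the three-token window and the function name `looks_like_company` are assumptions made for the example, not RavenPack’s actual rules.

```python
# Suffixes taken from the examples in the text; a real list would be far
# longer and would vary by country of incorporation.
COMPANY_SUFFIXES = {"Inc", "Corp", "SL", "Ltd", "SA"}

# Hypothetical context clues: nearby words that make a company reading likelier.
COMPANY_CLUES = {"shares", "earnings", "ceo", "acquired", "revenue"}


def looks_like_company(tokens, start, end):
    """Guess whether the name tokens[start:end] refers to a company.

    Rule 1: a capitalized name directly followed by a corporate suffix.
    Rule 2: a capitalized name with a company-related clue word nearby.
    """
    name = tokens[start:end]
    # Every word of the candidate name must start with a capital letter.
    if not name or not all(t[:1].isupper() for t in name):
        return False
    # Rule 1: corporate suffix right after the name (trailing punctuation ignored).
    if end < len(tokens) and tokens[end].rstrip(".,") in COMPANY_SUFFIXES:
        return True
    # Rule 2: a clue word within three tokens on either side of the name.
    window = tokens[max(0, start - 3):start] + tokens[end:end + 3]
    return any(t.lower().strip(".,") in COMPANY_CLUES for t in window)
```

With these rules, ‘Acme’ in ‘Acme Corp. reported strong results’ is accepted because of its suffix, and ‘Tesla’ in ‘Shares of Tesla rose’ is accepted because of the nearby word ‘Shares’, whereas ‘Angela Merkel’ in ‘Angela Merkel visited Paris’ is rejected: it is capitalized but has neither a suffix nor a clue word.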

The Linguistics Prof Turned NLP Developer

Oliver Mason is an academic in the field of linguistics turned developer who works as a Lisp programmer on Andrew’s team. Oliver is tasked with refining the way the system identifies the different parts of speech in a sentence and marks up text.

“My area is syntactical analysis. Looking at a sentence and identifying the different components in a sentence and what are their roles and functions. What is the subject, what is the predicate, what is the object? The subject is the agent and the predicate is the action that they are doing. There are other components, like the so-called adjunct, which gives you information on time and place, or any other substantial information about an event,” says Oliver.

“At the moment this is done, but not in a very scalable way. There are some very simple patterns that try to match a noun at the beginning of a sentence with a subject and a verb with a predicate. It doesn’t work very reliably because language is very complex and there are many different ways in which you can express the same thing and they are all slightly different in the structures they use.

“The next step is then to group words together that belong to one unit or phrase and then have that phrase fulfill a particular role in the sentence. I’m working on using formulas to describe what we call ‘grammar’. It isn’t quite the same as the grammar you learn when you learn a language, this grammar is a set of rules or patterns describing what we find in the text.”
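The two steps Oliver describes, grouping words into phrases and then giving each phrase a role, can be illustrated with a deliberately naive sketch. It is written in Python rather than Lisp, and it assumes the words have already been tagged with a small hypothetical tag set (`DET`, `ADJ`, `NOUN`, `VERB`). The single ‘noun phrase, verb, noun phrase’ pattern is exactly the kind of simple rule he says does not scale, so it should be read as an example of the idea, not of the system itself.

```python
def chunk_noun_phrases(tagged):
    """Group runs of DET/ADJ/NOUN tokens into noun phrases (NP).

    `tagged` is a list of (word, tag) pairs; other tokens stay on their own.
    """
    chunks, current = [], []
    for word, tag in tagged:
        if tag in {"DET", "ADJ", "NOUN"}:
            current.append(word)
        else:
            if current:
                chunks.append(("NP", " ".join(current)))
                current = []
            chunks.append((tag, word))
    if current:
        chunks.append(("NP", " ".join(current)))
    return chunks


def assign_roles(chunks):
    """Apply one naive pattern: NP VERB NP -> subject, predicate, object."""
    roles = {}
    for label, text in chunks:
        if label == "NP" and not roles:
            roles["subject"] = text      # first noun phrase, before any verb
        elif label == "VERB" and "predicate" not in roles:
            roles["predicate"] = text    # first verb
        elif label == "NP" and "predicate" in roles and "object" not in roles:
            roles["object"] = text       # first noun phrase after the verb
    return roles


tagged = [("The", "DET"), ("big", "ADJ"), ("company", "NOUN"),
          ("bought", "VERB"), ("a", "DET"), ("startup", "NOUN")]
roles = assign_roles(chunk_noun_phrases(tagged))
```

Here `roles` maps ‘subject’ to ‘The big company’, ‘predicate’ to ‘bought’ and ‘object’ to ‘a startup’. A passive sentence, or one that opens with an adjunct of time or place, would already defeat this pattern, which is why a fuller set of grammar rules is needed.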

Over 12 Million Entities

To help in the word classification process, Andrew’s team references information held in a vast entity database that contains the names of over 12 million different entities. These include the names of companies, financial assets, politicians, countries and company executives. The entity database is continuously updated by professional, full-time editorial staff in RavenPack’s Product department.

To illustrate the process, let us take as an example the word ‘Apple’, which can be defined as belonging to the category of either company or fruit and, within the company category, to the subcategory of either music company (as in the case of the Beatles’ recording company) or tech company (the one founded by Steve Jobs). How could a Lisp programmer code the NLP system to distinguish between these three possible classifications of the same word? One way would be to cross-reference information in the entity database and search the text for words associated with each of the three possible definitions.

“If we find something that mentions an iPhone in the same story or the name of an executive who works for Apple, then we know it is the tech company. This gives us a way of double-checking our entity detection is accurate,” says Andrew.

It is this combination of code, which operates mainly at the level of grammar and usage, and the concrete terms held in the entity database that allows the system to mark up words in text accurately.
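The cross-referencing idea for ‘Apple’ can be sketched as a lookup of associated terms. The example is in Python rather than Lisp, and the miniature `ENTITY_DB`, its associated terms and the function name `disambiguate` are all invented for illustration. The real entity database and its structure are not described in detail here.

```python
import re

# Hypothetical miniature entity database: each candidate sense of "Apple"
# lists terms associated with it. All entries are invented for illustration.
ENTITY_DB = {
    "Apple (tech company)": {"iphone", "ipad", "steve jobs", "tim cook"},
    "Apple (record company)": {"beatles", "record", "album", "label"},
    "apple (fruit)": {"orchard", "juice", "harvest", "pie"},
}


def disambiguate(text, candidates=ENTITY_DB):
    """Return the sense whose associated terms appear most often in `text`,
    or None when no associated term appears at all."""
    lowered = text.lower()
    scores = {
        sense: sum(
            # Whole-word match so that "pie" does not match "piece".
            bool(re.search(r"\b" + re.escape(term) + r"\b", lowered))
            for term in terms
        )
        for sense, terms in candidates.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

A story that mentions an iPhone or an Apple executive resolves to the tech company, one that mentions the Beatles and a record label resolves to the record company, and a text with no associated terms is left unresolved instead of being guessed at.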

“I think a big part of the success of our system is that it is in two parts: it relies both on our code, which is my team’s responsibility, and also on data. Entity detection relies on entity information so someone has to maintain a database that knows about Apple the electronics company, Apple the record company and Apple the fruit. Product maintains the database of 12 million entities, across companies, organisations, countries, politicians, employees, sports teams, products, types of products etc. So a large part of what we do relies on that database.”

Detecting Events

The next stage in the NLP process is to locate what are known as events. These are significant occurrences that impact investor sentiment. Examples would include quarterly earnings releases, reports of layoffs or bankruptcies. The system can identify over 7,400 different event types.

The system can identify events using predefined patterns in the English language that correspond to the sequence of word types that are normally used to report the event in the press.

“Next thing up is we find events and we do so because we have a system that understands predefined patterns in the English language that may indicate an event. So we have a system that searches very efficiently across documents and looks for text like ‘someone talking about something’ for example. Importantly it relies very heavily on entity detection because it looks for things where ‘someone or something does something to something else’ and knowing the entity helps identify the event. So you get to a point where you are not looking at text as text, you are looking at a series of concepts. So a company buys a company...”

Once an event has been located by the system, it logs all the key facts about the event in information fields. These would include the name of the company that is the main actor, the date, any key figures released, products involved, rating agencies, or any other facts that are usually associated with the event.
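The ‘a company buys a company’ pattern, and the logging of key facts into fields, can be sketched as follows. This is a simplified Python illustration, not the production approach: the company names, the verb list, the `Event` fields and the function name `detect_acquisition` are all assumptions, and entity detection is replaced here by a short hard-coded list of known companies.

```python
import re
from dataclasses import dataclass

# Hypothetical names standing in for the output of entity detection.
KNOWN_COMPANIES = ["Acme Corp", "Globex Ltd", "Initech Inc"]


@dataclass
class Event:
    """Key facts logged for a detected event (fields are illustrative)."""
    event_type: str
    actor: str
    target: str


def detect_acquisition(sentence):
    """Match the concept sequence COMPANY <buy verb> COMPANY."""
    company = "|".join(re.escape(c) for c in KNOWN_COMPANIES)
    pattern = re.compile(
        rf"(?P<actor>{company})\s+"
        r"(?:buys|bought|acquires|has acquired|acquired)\s+"
        rf"(?P<target>{company})"
    )
    match = pattern.search(sentence)
    if match is None:
        return None
    return Event("acquisition", match.group("actor"), match.group("target"))
```

For the sentence ‘Acme Corp acquires Globex Ltd for $2bn’ this returns an event of type ‘acquisition’ with ‘Acme Corp’ as actor and ‘Globex Ltd’ as target. The pattern only fires because both names are recognised as companies, which reflects the point made above that event detection relies heavily on entity detection.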

In the past, the patterns that search for events were created by Andrew’s team of developers, but now the Product department updates the system with any new events that it wants the NLP system to identify. In recent years, for example, the increasing focus on ESG investing has created a need for RavenPack’s NLP to find events relating to a company’s activities in relation to the environment, as well as social and governance issues related to how it treats its employees and the local community. The team has had to update the system accordingly.

Generating Sentiment Metrics

It is via event detection that sentiment metrics are generated for the principal actors or entities involved in the event. In the case of a bankruptcy, for example, the company involved would be assigned a predefined sentiment score, which would probably be negative for an event as unambiguous as bankruptcy.

For other events, however, there may be variables that modify the base sentiment score attached to the event. In the case of earnings releases, for example, the base sentiment score would be strongly affected by the difference between the previous set of earnings results and the new results. The NLP system is capable of calculating the difference and adjusting the final sentiment score accordingly. The tone of the words around the event can also alter the sentiment score.
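The idea of a base score adjusted by modifying factors can be made concrete with a small sketch. Every number in it is an assumption made for illustration: the scale from -1.0 to +1.0, the base scores, the cap of 0.5 on the earnings adjustment and the 0.1 weight on tone are invented, and the function name `sentiment_score` is hypothetical. RavenPack’s actual parameters are not given in the text.

```python
# Hypothetical base scores on an assumed scale from -1.0 to +1.0.
BASE_SCORES = {"bankruptcy": -0.9, "earnings-release": 0.0}


def clamp(value, low, high):
    """Keep `value` within the closed interval [low, high]."""
    return max(low, min(high, value))


def sentiment_score(event_type, previous=None, current=None, tone=0.0):
    """Combine a base score with modifying factors.

    previous/current: earlier and new earnings figures (earnings events only).
    tone: assumed tone of the surrounding words, from -1.0 to +1.0.
    """
    score = BASE_SCORES[event_type]
    if event_type == "earnings-release" and previous and current is not None:
        # Relative change in earnings, capped so it cannot dominate the score.
        change = (current - previous) / abs(previous)
        score += clamp(change, -0.5, 0.5)
    # Tone of nearby words nudges the score up or down (assumed weight 0.1).
    score += 0.1 * tone
    return clamp(score, -1.0, 1.0)
```

Under these assumptions a bankruptcy scores -0.9 regardless of any earnings figures, while an earnings release that rises from 100 to 120 gains 0.2 on its neutral base, and a negative tone in the surrounding text pulls either score lower.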

The Product department, together with panels of financial experts, sets the base sentiment score, parameters and modifying factors for each event.

Serving up Analytics

The final stage in the NLP process is the generation of the raw analytics data, including the sentiment and volume metrics that deliver valuable insights to end users.

For clients who request the data in raw format rather than accessing RavenPack’s analytics through the RavenPack web application, Andrew’s team manage the data transfer from the NLP analytics engine directly to the client, since some clients have bespoke analytics requirements.
