The Stages of NLP
The NLP process takes place in three steps. The first is to
‘mark up’ each individual word of a sentence according to its role and
function within the sentence. The second is the detection of significant events,
and the third is the calculation of sentiment based on predefined parameters and modifying
factors.
Marking Up
In simple terms, marking up text is a categorization exercise whereby words are tagged as
belonging to different predefined high level categories. Examples of this would be to
mark up a word as either a ‘noun’, ‘adjective’,
‘object’, ‘subject’, ‘predicate’,
‘company’, ‘product’, ‘asset’, ‘stock’,
‘bond’ or other designation. The same word can belong to multiple categories
depending on the level of abstraction - an example might be the word 'iPhone' which
belongs to the categories of ‘noun’, ‘object’,
‘product’, and ‘electronic device’ all at the same time.
At RavenPack, the programmes tasked with marking up text are written in Lisp, a
specialist computer language, by the team of developers led by Andrew Lawson,
RavenPack’s VP of Analytics.
The process involves the codifying of rules of grammar and language usage in its many and
varied forms. A word’s context in relation to nearby words can also provide clues
as to its role, purpose and meaning. Another method is to use long lists of rules
derived from real-world usage. To programme the system to define a word as belonging to
the category ‘company’, for example, a programmer might write rules
requiring that the word be a noun that is likely to be capitalized and that ends
in one of the following suffixes: Inc, Corp, SL, Ltd, or SA - or any number of other
suffixes depending on the country of incorporation. At the same time, the programme would
also have to take into account the fact that not all mentions of a company will end in a
suffix, and not all capitalized nouns are also companies - they may be the names of
people, countries or NGOs - all of which are also capitalized. The programme might be
able to narrow down the possibilities, however, using nearby words or phrases or other
words in the same article as clues, further honing the definition process so as to
finally reach an accurate classification of the word.
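The suffix rule described above can be sketched in a few lines. This is only an illustration in Python, not RavenPack's Lisp code: the function name and the suffix list (taken from the examples in the text) are assumptions, and a real rule set would be far larger.

```python
# Suffixes mentioned in the text; a real system would hold many more,
# varying with the country of incorporation (assumption for illustration).
COMPANY_SUFFIXES = {"Inc", "Corp", "SL", "Ltd", "SA"}


def looks_like_company(phrase: str) -> bool:
    """Return True if every name word is capitalized and the last word is a known suffix."""
    words = phrase.replace(",", " ").split()
    if len(words) < 2:
        return False  # a bare suffix or a single word is not enough evidence
    suffix = words[-1].rstrip(".")  # treat "Inc." and "Inc" alike
    name_words = words[:-1]
    return suffix in COMPANY_SUFFIXES and all(w[0].isupper() for w in name_words)


print(looks_like_company("Acme Widgets Ltd"))
print(looks_like_company("Angela Merkel"))
```

As the text notes, a rule like this misses company mentions that carry no suffix, which is why context clues and other evidence are still needed.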
The Linguistics Prof Turned NLP Developer
Oliver Mason is an academic in the field of linguistics turned developer who works as a
Lisp programmer on Andrew’s team. Oliver is tasked with refining the way the
system identifies the different parts of speech in a sentence and marks up
text.
“My area is syntactical analysis. Looking at a sentence and identifying the
different components in a sentence and what are their roles and functions. What is the
subject, what is the predicate, what is the object? The subject is the agent and the
predicate is the action that they are doing. There are other components, like the
so-called adjunct, which gives you information on time and place, or any other
substantial information about an event,” says Oliver.
“At the moment this is done, but not in a very scalable way. There are some very
simple patterns that try to match a noun at the beginning of a sentence with a subject
and a verb with a predicate. It doesn’t work very reliably because language is
very complex and there are many different ways in which you can express the same thing
and they are all slightly different in the structures they use.
“The next step is then to group words together that belong to one unit or phrase
and then have that phrase fulfill a particular role in the sentence. I’m working on
using formulas to describe what we call ‘grammar’. It isn’t quite the
same as the grammar you learn when you learn a language, this grammar is a set of rules
or patterns describing what we find in the text.”
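To make the idea of grouping words into phrases and then assigning roles more concrete, here is a minimal Python sketch. It assumes the words have already been tagged with parts of speech, and it uses one hypothetical pattern (an optional determiner, any adjectives, then one or more nouns) as its entire "grammar". The tag names and function names are illustrative, not those of RavenPack's system.

```python
from typing import Dict, List, Tuple

Tagged = Tuple[str, str]  # (word or phrase, tag)


def chunk_noun_phrases(tokens: List[Tagged]) -> List[Tagged]:
    """Group runs matching DET? ADJ* NOUN+ into single (phrase, "NP") units."""
    out: List[Tagged] = []
    i = 0
    while i < len(tokens):
        j = i
        if tokens[j][1] == "DET":
            j += 1
        while j < len(tokens) and tokens[j][1] == "ADJ":
            j += 1
        k = j
        while k < len(tokens) and tokens[k][1] == "NOUN":
            k += 1
        if k > j:  # at least one noun found: emit the whole span as a phrase
            out.append((" ".join(word for word, _ in tokens[i:k]), "NP"))
            i = k
        else:  # no noun phrase starts here: keep the token unchanged
            out.append(tokens[i])
            i += 1
    return out


def assign_roles(chunks: List[Tagged]) -> Dict[str, str]:
    """First NP before the verb is the subject; first NP after it is the object."""
    roles: Dict[str, str] = {}
    verb_seen = False
    for text, tag in chunks:
        if tag == "VERB" and not verb_seen:
            roles["predicate"] = text
            verb_seen = True
        elif tag == "NP":
            roles.setdefault("object" if verb_seen else "subject", text)
    return roles


sentence = [("The", "DET"), ("large", "ADJ"), ("company", "NOUN"),
            ("bought", "VERB"), ("a", "DET"), ("small", "ADJ"), ("rival", "NOUN")]
print(assign_roles(chunk_noun_phrases(sentence)))
```

A single pattern like this breaks down quickly on passives, questions and nested clauses, which is the scalability problem Oliver describes.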
Over 12 Million Entities
To help in the word classification process, Andrew’s team references information
held in a vast entity database that contains the names of over 12 million different
entities. These include the names of companies, financial assets, politicians,
countries and company executives. This entity database is continuously
updated by professional, full-time editorial staff in RavenPack’s Product department.
To illustrate the process, let us take as an example the word ‘Apple’ which
can be either defined as belonging to the category of company or fruit, and within the
company category as belonging to either the sub category of music company (such as in
the case of the Beatles’ recording company) or as the tech company founded by
Steve Jobs. How could a Lisp programmer code the NLP system to distinguish between these
three possible classifications of the same word? One way would be by
cross-referencing information in the entity database and searching for words associated
with the three different possible definitions.
“If we find something that mentions an iPhone in the same story or the name of an
executive who works for Apple, then we know it is the tech company. This gives us a way
of double-checking our entity detection is accurate,” says Andrew.
It is by using this combination of code that operates mainly at the level of grammar and
usage with reference to the concrete terms held in the entity database that the system
can accurately mark up words in text.
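The cross-referencing idea can be illustrated with a toy example. The sketch below, in Python, stands in for the entity database with a small dictionary; every entry, name and scoring choice in it is an assumption made for illustration, not RavenPack data or logic.

```python
from typing import Optional

# Toy stand-in for an entity database: each possible sense of a name is
# listed with a few associated terms (all entries are illustrative).
ENTITY_DB = {
    "Apple": {
        "tech company": {"iphone", "ipad", "steve jobs"},
        "record company": {"beatles", "album", "record label"},
        "fruit": {"orchard", "harvest", "cider"},
    }
}


def disambiguate(name: str, story: str) -> Optional[str]:
    """Pick the sense whose associated terms appear most often in the story."""
    text = story.lower()
    # Crude substring matching; counts how many associated terms occur at all.
    scores = {
        sense: sum(term in text for term in terms)
        for sense, terms in ENTITY_DB[name].items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None  # None: no supporting evidence


print(disambiguate("Apple", "Apple unveiled a new iPhone today."))
```

Returning `None` when nothing matches reflects the double-checking role described in the quote: without supporting evidence in the story, the sketch declines to choose.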
“I think a big part of the success of our system is that it is in two parts: it
relies both on our code, which is my team’s responsibility, and also on data.
Entity detection relies on entity information so someone has to maintain a database that
knows about Apple the electronics company, Apple the record company and Apple the fruit.
Product maintains the database of 12 million entities, across companies, organisations,
countries, politicians, employees, sports teams, products, types of products etc. So a
large part of what we do relies on that database.”
Detecting Events
The next stage in the NLP process is to locate what are known as events. These are
significant occurrences that impact investor sentiment. Examples would include quarterly
earnings releases, reports of layoffs or bankruptcies. The system can identify over 7,400 different event types.
The system can identify events using predefined patterns in the English language that
correspond to the sequence of word types that are normally used to report the event in
the press.
“Next thing up is we find events and we do so because we have a system that
understands predefined patterns in the English language that may indicate an event. So
we have a system that searches very efficiently across documents and looks for text like ‘someone
talking about something’ for example. Importantly it relies very heavily on entity
detection because it looks for things where ‘someone or something does something
to something else’ and knowing the entity helps identify the event. So you get to
a point where you are not looking at text as text, you are looking at a series of
concepts. So a company buys a company...”
Once an event has been located by the system, it logs all the key facts about the event
in information fields. This would include the name of the company that is the main
actor, the date, any key figures released, products involved, rating agencies, or any
other facts that are usually associated with the event.
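The "company buys a company" pattern, and the logging of facts into fields, might look roughly like the following Python sketch. It assumes the text has already been marked up into (text, category) pairs; the verb list, the field names and the `Event` record are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Event:
    """Information fields logged for a detected event (illustrative subset)."""
    event_type: str
    actor: str
    target: str


# Hypothetical verbs signalling an acquisition.
ACQUISITION_VERBS = {"buys", "bought", "acquires", "acquired"}


def detect_acquisition(marked_up: List[Tuple[str, str]]) -> Optional[Event]:
    """Look for COMPANY, acquisition verb, COMPANY in three adjacent positions."""
    for i in range(len(marked_up) - 2):
        (actor, actor_tag), (verb, _), (target, target_tag) = marked_up[i:i + 3]
        if (actor_tag == "COMPANY"
                and verb.lower() in ACQUISITION_VERBS
                and target_tag == "COMPANY"):
            return Event("acquisition", actor=actor, target=target)
    return None  # pattern not found


story = [("Acme Corp", "COMPANY"), ("buys", "VERB"), ("Widget Ltd", "COMPANY")]
print(detect_acquisition(story))
```

The pattern matches on categories rather than on the literal words, which is the shift from "text as text" to "a series of concepts" described in the quote above.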
In the past, the patterns that search for events were created by Andrew’s team of
developers, but now the Product department updates the system with any new events that
they want the NLP to identify. In recent years, for example, the increasing focus on
ESG-investing has led to the need for RavenPack’s NLP to be able to find events
relating to a company’s activities in relation to the environment, as well as
social and governance issues related to how it treats its employees and the local
community, and the team has had to update the system accordingly.
Generating Sentiment Metrics
It is via event detection that sentiment metrics are generated for the principal actors
or entities involved in the event. In the case of a bankruptcy, for example, the name of
the company involved would be assigned a predefined sentiment score, which would
probably be a negative sentiment in the case of an event as unambiguous as
bankruptcy.
For other events, however, there may be variables modifying the base sentiment score
attached to the event. In the case of earnings releases, for example, the base sentiment
score would be highly impacted by the difference between the previous set of earnings
results and the new results. The NLP system is capable of calculating the difference and
adjusting the final sentiment score accordingly. The tone of the words around the event
can also vary the sentiment score.
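As a rough illustration of a base score adjusted by modifying factors, consider the Python sketch below. The scale from -1 to 1, the additive formula and the parameter names are all assumptions chosen for simplicity; the text does not specify how RavenPack actually computes its scores.

```python
def earnings_sentiment(previous: float, current: float,
                       base: float = 0.0, tone: float = 0.0) -> float:
    """Hypothetical score in [-1, 1]: base score shifted by the relative
    change in earnings and by the tone of the surrounding words."""
    if previous == 0:
        change = 0.0  # relative change is undefined; apply no adjustment
    else:
        change = (current - previous) / abs(previous)
    raw = base + change + tone
    return max(-1.0, min(1.0, raw))  # clamp to the assumed scale


# Earnings up from 100 to 120 with neutral tone.
print(earnings_sentiment(100, 120))
# Earnings halved and reported in a slightly negative tone.
print(earnings_sentiment(100, 50, tone=-0.1))
```

An unambiguous event such as a bankruptcy would, under the same assumptions, simply carry a fixed negative base score with no earnings adjustment.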
The Product department, together with panels of financial experts, sets the
base sentiment score, parameters and modifying factors for each event.
Serving up Analytics
The final stage in the NLP process is the generation of the raw analytics data, including
the sentiment and volume metrics that deliver valuable insights to end users.
For clients who request the data in a raw format and do not use the RavenPack web
application to access RavenPack’s analytics, Andrew’s team manage the data
transfer from the NLP analytics engine directly to clients, since some clients have
bespoke analytics requirements.