Dimensionality reduction approaches for sentence embeddings
Being able to capture the context of a word or sentence provides useful features for downstream tasks, such as classification or named entity recognition. These contextual representations, known as embeddings, are ubiquitous in current NLP approaches.
Using transfer learning, we can leverage pre-trained models created from a rich corpus of data and use their output as the input to nimbler models. Examples include the Universal Sentence Encoder and other models available in spaCy or Hugging Face.
These models are extremely good at capturing context, but they have a fixed number of outputs. What happens, then, when we want to reduce the size of the embedding due to storage or other constraints?
In this article, we will explore two approaches to reduce the output of Google's Universal Sentence Encoder from 512 to 128 dimensions.
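As a reference point, the 512-dimensional embeddings can be obtained directly from TensorFlow Hub. Below is a minimal sketch (the example sentences are illustrative):

```python
import tensorflow_hub as hub

# Load the pre-trained Universal Sentence Encoder (512-dimensional output).
use = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

sentences = [
    "The quarterly earnings beat analyst expectations.",
    "Shares dropped after the profit warning.",
]

# Each sentence is mapped to a single 512-dimensional vector.
embeddings = use(sentences)
print(embeddings.shape)  # (2, 512)
```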
Creating a baseline
With any dimensionality reduction approach there will be some loss of information. It is important to assess this loss using a benchmark such as STS (Semantic Textual Similarity), so that we can discard techniques that sacrifice too much of the inherent relationships in the data for the sake of compression.
On the other hand, if we are using a model pre-trained on a general corpus, like Wikipedia, but our downstream task is domain-specific, such as detecting financial sentiment or medical entities, we can argue that some dimensionality reduction techniques might make the reduced embeddings more relevant to the data of that particular domain.
To create the dimensionality-reduced baseline, we will use PCA to build a reduction head that we will append after the Universal Sentence Encoder to form the full pipeline.
PCA baseline
Principal components are the directions in the data that capture most of the information; geometrically, they are the directions along which the data shows the maximal amount of variance. As they are well known in the Machine Learning community and easy to implement, we will use them as the benchmark for any other approach.
It is important to bear in mind, though, that although PCA is a perfectly valid approach, its performance will decay if the live data it receives differs greatly from the data it was originally fitted on.
To fit our PCA model, we will use 10 million sentence embeddings from our financial data corpus and keep the first 128 principal components.
To compare the baseline with the original 512-dimensional version, we compute the Pearson correlation between the vector of pairwise similarities of the raw embedded sentences and that of the reduced versions. In both cases, we achieve a Pearson correlation of 78% on the STS benchmark and an MRR (Mean Reciprocal Rank) of 77% on a golden dataset.
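A sketch of this similarity comparison, assuming the full and reduced embeddings are already available as NumPy arrays (function and variable names are illustrative, not the original evaluation code):

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics.pairwise import cosine_similarity

def similarity_correlation(full_embeddings, reduced_embeddings):
    """Pearson correlation between the pairwise similarities computed on the
    original 512-dimensional embeddings and on their reduced counterparts."""
    # Upper-triangular indices: skip the diagonal and duplicated pairs.
    n = full_embeddings.shape[0]
    upper = np.triu_indices(n, k=1)

    sims_full = cosine_similarity(full_embeddings)[upper]
    sims_reduced = cosine_similarity(reduced_embeddings)[upper]

    correlation, _ = pearsonr(sims_full, sims_reduced)
    return correlation
```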
PCA model implementation
To implement the model, we use scikit-learn to fit the PCA and wrap the projection in a Keras layer so that it can be used in a sequential model.
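A minimal sketch of this idea, assuming the training embeddings are available as a (n_samples, 512) array; the exact class structure of the original implementation may differ:

```python
import tensorflow as tf
from sklearn.decomposition import PCA

# training_embeddings: (n_samples, 512) array of USE sentence embeddings (assumed available).
# Fit PCA on the sample of sentence embeddings, keeping 128 components.
pca = PCA(n_components=128)
pca.fit(training_embeddings)

# PCA's transform is (x - mean) @ components.T, i.e. an affine map, so it can be
# expressed as a frozen Dense layer: kernel = components.T, bias = -mean @ components.T.
pca_layer = tf.keras.layers.Dense(128, trainable=False)
pca_layer.build((None, 512))
pca_layer.set_weights([
    pca.components_.T.astype("float32"),
    (-pca.mean_ @ pca.components_.T).astype("float32"),
])

# The reduction head can now be appended after the Universal Sentence Encoder.
reduction_head = tf.keras.Sequential([pca_layer])
```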
Deep Learning approaches
Using PCA is straightforward, but it imposes orthogonality constraints on the dimensions, which limits the flexibility of the dimensionality reduction. We wanted to avoid this constraint in order to capture other patterns in the data. The approach had to:
- Be trained with unlabelled data,
- Be able to capture nonlinear behavior, and
- Produce a normalized output.
We decided to create an autoencoder and use the encoding block, up to the bottleneck, as the dimensionality-reducing head. An autoencoder is a great fit because the input and the target of the training data are the same. This enabled us to use our corpus without any further labelling or annotation work and to train the model in an unsupervised fashion.
After several experiments with the hyperparameters and
model design, the final model had the following characteristics:
- Two dense layers of 256 and 128 neurons with SELU (Scaled Exponential Linear Unit) activations.
- A normalization layer at the bottleneck.
- Tied weights between decoder and encoder.
For the training, the final hyperparameters
were:
- Epochs: 100
- Learning rate: 0.0001
- Batch size: 128
- Optimizer: Adam, with its remaining parameters at their default values
Autoencoder model implementation
We implemented a custom layer to tie the weights of the encoder and the decoder, and wrapped everything in a class that trains the autoencoder and extracts the encoding (dimensionality-reduction) block from it.
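Below is a minimal sketch of such a tied-weights autoencoder in Keras, following the architecture and hyperparameters listed above (the DenseTranspose layer and the train_embeddings variable are illustrative, not the exact production code):

```python
import tensorflow as tf


class DenseTranspose(tf.keras.layers.Layer):
    """Decoder layer that reuses the transposed kernel of a given encoder
    Dense layer, so that encoder and decoder weights stay tied."""

    def __init__(self, dense, activation=None, **kwargs):
        super().__init__(**kwargs)
        self.dense = dense
        self.activation = tf.keras.activations.get(activation)

    def build(self, batch_input_shape):
        # Only the bias is a new variable; the kernel is shared with `dense`.
        self.bias = self.add_weight(
            name="bias", shape=[self.dense.kernel.shape[0]], initializer="zeros"
        )
        super().build(batch_input_shape)

    def call(self, inputs):
        z = tf.matmul(inputs, self.dense.kernel, transpose_b=True)
        return self.activation(z + self.bias)


# Encoder: 512 -> 256 -> 128 with SELU activations and an L2-normalized bottleneck.
dense_1 = tf.keras.layers.Dense(256, activation="selu")
dense_2 = tf.keras.layers.Dense(128, activation="selu")

inputs = tf.keras.Input(shape=(512,))
bottleneck = tf.keras.layers.Lambda(
    lambda t: tf.math.l2_normalize(t, axis=-1)
)(dense_2(dense_1(inputs)))
encoder = tf.keras.Model(inputs, bottleneck)

# Decoder mirrors the encoder with tied (transposed) weights.
reconstruction = DenseTranspose(dense_1)(
    DenseTranspose(dense_2, activation="selu")(bottleneck)
)
autoencoder = tf.keras.Model(inputs, reconstruction)

autoencoder.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4), loss="mse"
)

# Input and target are the same embeddings: unsupervised training.
# train_embeddings: (n_samples, 512) array of USE sentence embeddings (assumed available).
autoencoder.fit(train_embeddings, train_embeddings, epochs=100, batch_size=128)

# The encoder alone is the 512 -> 128 dimensionality-reduction head.
reduced = encoder.predict(train_embeddings)
```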
Conclusions and next steps
The autoencoder model outperformed the PCA baseline, with a Pearson correlation of 94.1%, an MRR of 82%, and an nDCG of 99.6% on a golden dataset.
During training, it was extremely easy to overfit on the embedded sentences and produce autoencoders capable of reconstructing or denoising the input. A meaningful encoding block, in terms of dimensionality reduction capabilities, was obtained only when the encoder weights were tied to the decoder weights.
Moreover, with a focus on similarity search, a further improvement was achieved when the bottleneck was normalized during training. This, alongside the 4x reduction in dimensionality, has an impact on both storage size and query times.
Takeaways
- In this article we have explored how to reduce the size of sentence embeddings from 512 to 128 dimensions with minimal information loss.
- The best architecture to achieve this was an autoencoder with tied weights, bottleneck normalization and a SELU activation function.
- Storage requirements are reduced by a factor of 4, and a speed-up in query time is achieved.