Dimensionality reduction approaches for sentence embeddings
Being able to capture the context of a word or sentence provides useful features for downstream tasks, such as classification or named entity recognition. These contextual representations, known as embeddings, are ubiquitous in current NLP approaches.
Using transfer learning, we can leverage pre-trained models created from a rich corpus of data and use their output as the input to nimbler models. Examples include the Universal Sentence Encoder and other models available in spaCy or Hugging Face.
These models are extremely good at capturing context, but they have a fixed number of outputs. What happens, then, when we want to reduce the size of the embedding due to storage or other constraints?
In this article, we will explore two approaches to reduce the output of Google's Universal Sentence Encoder from 512 to 128 dimensions.
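As a reference point, the 512-dimensional embeddings can be obtained directly from TensorFlow Hub. Below is a minimal sketch (the example sentences are illustrative):

```python
import tensorflow_hub as hub

# Load the pre-trained Universal Sentence Encoder (512-dimensional output).
use = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

sentences = [
    "The quarterly earnings beat analyst expectations.",
    "Shares dropped after the profit warning.",
]

# Each sentence is mapped to a single 512-dimensional vector.
embeddings = use(sentences)
print(embeddings.shape)  # (2, 512)
```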
Creating a baseline
With any dimensionality reduction approach there will be some loss of information. It is important to assess this loss using a benchmark such as STS (Semantic Textual Similarity), so that we can discard techniques that sacrifice too much of the inherent relationships in the data for the sake of compression.
On the other hand, if we are using a model pre-trained on a general corpus, like Wikipedia, but our downstream task is domain-specific, such as detecting financial sentiment or medical entities, we can argue that some dimensionality reduction techniques might make the reduced embeddings more relevant to the data of that particular domain.
To create the dimensionality-reduced baseline, we will use PCA to build a reduction head that we will append after the Universal Sentence Encoder to form the full pipeline.
PCA baseline
Principal components are the directions in the data that capture most of the information; geometrically, they are the directions along which the data shows the maximal amount of variance. As they are well known in the Machine Learning community and easy to implement, we will use them as the benchmark for any other approach.
It is important to bear in mind, though, that although PCA is a perfectly valid approach, its performance will decay if the live data it receives differs greatly from the data it was originally fitted on.
To fit our PCA model, we will use 10 million sentence embeddings from our financial data corpus and keep the first 128 principal components.
To compare the baseline with the original 512-dimensional version, we compute the Pearson correlation between the vector of pairwise similarities of the raw embedded sentences and that of the reduced versions. In both cases, we achieve a Pearson correlation of 78% on the STS benchmark and an MRR (Mean Reciprocal Rank) of 77% on a golden dataset.
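A sketch of this similarity comparison, assuming the full and reduced embeddings are already available as NumPy arrays (function and variable names are illustrative, not the original evaluation code):

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics.pairwise import cosine_similarity

def similarity_correlation(full_embeddings, reduced_embeddings):
    """Pearson correlation between the pairwise similarities computed on the
    original 512-dimensional embeddings and on their reduced counterparts."""
    # Upper-triangular indices: skip the diagonal and duplicated pairs.
    n = full_embeddings.shape[0]
    upper = np.triu_indices(n, k=1)

    sims_full = cosine_similarity(full_embeddings)[upper]
    sims_reduced = cosine_similarity(reduced_embeddings)[upper]

    correlation, _ = pearsonr(sims_full, sims_reduced)
    return correlation
```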
PCA model implementation
To implement the model, we use scikit-learn to fit the PCA and wrap the projection in a Keras layer so that it can be used in a sequential model.
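A minimal sketch of this idea, assuming the training embeddings are available as a (n_samples, 512) array; the exact class structure of the original implementation may differ:

```python
import tensorflow as tf
from sklearn.decomposition import PCA

# training_embeddings: (n_samples, 512) array of USE sentence embeddings (assumed available).
# Fit PCA on the sample of sentence embeddings, keeping 128 components.
pca = PCA(n_components=128)
pca.fit(training_embeddings)

# PCA's transform is (x - mean) @ components.T, i.e. an affine map, so it can be
# expressed as a frozen Dense layer: kernel = components.T, bias = -mean @ components.T.
pca_layer = tf.keras.layers.Dense(128, trainable=False)
pca_layer.build((None, 512))
pca_layer.set_weights([
    pca.components_.T.astype("float32"),
    (-pca.mean_ @ pca.components_.T).astype("float32"),
])

# The reduction head can now be appended after the Universal Sentence Encoder.
reduction_head = tf.keras.Sequential([pca_layer])
```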
Deep Learning approaches
Using PCA is straightforward, but it imposes orthogonality constraints on the dimensions, which limits the flexibility of the dimensionality reduction. We wanted to avoid this constraint in order to capture other patterns in the data. The approach had to:
- Be trained with unlabelled data,
- Be able to capture nonlinear behavior, and
- Produce a normalized output.
We decided to create an autoencoder and use the encoding block, up to the bottleneck, as the dimensionality-reducing head. An autoencoder is a great fit because the input and the target of the training data are the same. This enabled us to use our corpus without any further labelling or annotation work and to train the model in an unsupervised fashion.
After several experiments with the hyperparameters and
model design, the final model had the following characteristics:
- Two dense layers of 256 and 128 neurons with SELU (Scaled Exponential Linear Unit) activations.
- A normalization layer at the bottleneck.
- Tied weights between decoder and encoder.
For the training, the final hyperparameters
were:
- Epochs: 100
- Learning rate: 0.0001
- Batch size: 128
- Optimizer: Adam, with its remaining parameters at their default values
Autoencoder model implementation
We implemented a custom layer to tie the weights of the encoder and the decoder, and wrapped everything in a class that trains the autoencoder and extracts the encoding (dimensionality-reduction) block from it.
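Below is a minimal sketch of such a tied-weights autoencoder in Keras, following the architecture and hyperparameters listed above (the DenseTranspose layer and the train_embeddings variable are illustrative, not the exact production code):

```python
import tensorflow as tf


class DenseTranspose(tf.keras.layers.Layer):
    """Decoder layer that reuses the transposed kernel of a given encoder
    Dense layer, so that encoder and decoder weights stay tied."""

    def __init__(self, dense, activation=None, **kwargs):
        super().__init__(**kwargs)
        self.dense = dense
        self.activation = tf.keras.activations.get(activation)

    def build(self, batch_input_shape):
        # Only the bias is a new variable; the kernel is shared with `dense`.
        self.bias = self.add_weight(
            name="bias", shape=[self.dense.kernel.shape[0]], initializer="zeros"
        )
        super().build(batch_input_shape)

    def call(self, inputs):
        z = tf.matmul(inputs, self.dense.kernel, transpose_b=True)
        return self.activation(z + self.bias)


# Encoder: 512 -> 256 -> 128 with SELU activations and an L2-normalized bottleneck.
dense_1 = tf.keras.layers.Dense(256, activation="selu")
dense_2 = tf.keras.layers.Dense(128, activation="selu")

inputs = tf.keras.Input(shape=(512,))
bottleneck = tf.keras.layers.Lambda(
    lambda t: tf.math.l2_normalize(t, axis=-1)
)(dense_2(dense_1(inputs)))
encoder = tf.keras.Model(inputs, bottleneck)

# Decoder mirrors the encoder with tied (transposed) weights.
reconstruction = DenseTranspose(dense_1)(
    DenseTranspose(dense_2, activation="selu")(bottleneck)
)
autoencoder = tf.keras.Model(inputs, reconstruction)

autoencoder.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4), loss="mse"
)

# Input and target are the same embeddings: unsupervised training.
# train_embeddings: (n_samples, 512) array of USE sentence embeddings (assumed available).
autoencoder.fit(train_embeddings, train_embeddings, epochs=100, batch_size=128)

# The encoder alone is the 512 -> 128 dimensionality-reduction head.
reduced = encoder.predict(train_embeddings)
```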
Conclusions and next steps
The autoencoder model outperformed the PCA baseline, with a Pearson correlation of 94.1%, an MRR of 82%, and an nDCG of 99.6% on a golden dataset.
During training, it was extremely easy to overfit on the embedded sentences and produce autoencoders capable of reconstructing or denoising the input. A meaningful encoding block, in terms of dimensionality reduction capabilities, was obtained only when the encoder weights were tied to the decoder weights.
Moreover, with a focus on similarity search, a further improvement was achieved when the bottleneck was normalized during training. This, alongside the 4x reduction in dimensionality, has an impact on both storage size and query times.
Takeaways
- In this article we have explored how to reduce the size of sentence embeddings from 512 to 128 dimensions with minimal information loss.
- The best architecture to achieve this was an autoencoder with tied weights, bottleneck normalization and a SELU activation function.
- Storage requirements are reduced by a factor of 4, and a speed-up in query time is achieved.