The magic dust behind ChatGPT

March 21, 2023

The world is in awe of the surprising and versatile interactions with Large Language Models such as ChatGPT — from popping out quirky lyrics to passing bar exams, it’s no wonder that Language AI has captured our imagination. What’s fueling this revolution?

AI revolution

In January 2023, ChatGPT reached over 100 million users, making it the fastest-growing consumer application to date. Tech giants are following suit: Google is preparing to launch Bard AI, while Microsoft has already launched a limited preview of Bing with ChatGPT.

Several other similar options are already available.

What’s driving this revolution

Advancements in Language AI are driven by Natural Language Processing (NLP) and Machine Learning (ML), technologies that enable computers to process, understand and generate human language. The global NLP market alone is projected to reach USD 91 billion by 2030, growing at a compound annual growth rate (CAGR) of 27%. Similarly, the ML market is expected to grow from USD 21.17 billion in 2022 to USD 209.91 billion by 2029, a CAGR of 38.8%.
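
As a quick sanity check on the growth figures, here is a minimal sketch of how a CAGR is computed from a start value, an end value and a number of years. The function name `cagr` is our own choice for illustration; the inputs are the ML market figures quoted above.

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate as a fraction (0.10 means 10% per year)."""
    return (end_value / start_value) ** (1 / years) - 1


# ML market figures quoted above, in billions of USD.
ml_cagr = cagr(start_value=21.17, end_value=209.91, years=2029 - 2022)
print(f"Implied ML market CAGR: {ml_cagr:.1%}")
```

Over the seven years from 2022 to 2029, the implied rate comes out at about 38.8%, which matches the quoted figure.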

Both NLP and ML work with data, which is the real source of magic behind Large Language Models.

But not just any data. The success of language AI applications is directly connected to the quality of the training data used to develop them.

What is training data?

Training data is a set of examples used to teach a machine learning model to make accurate predictions. The model uses the input-output pairs in the training data to learn how to map inputs to the correct outputs. Training data is the foundation of the entire project, since it is all the model has to learn from.

For example, in a sentiment analysis task, the training data might consist of a set of reviews along with their corresponding sentiment labels, such as:

  • fabulous > positive;
  • unacceptable > negative;
  • functional > neutral.

The model then uses this data to learn how to predict the sentiment of new reviews.
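
To make this concrete, here is a deliberately tiny sketch of learning from input-output pairs. It is not how a Large Language Model is trained: it simply counts which words appear under which label and lets the known words of a new review "vote". The functions `train` and `predict` are hypothetical names, and the last two training examples are made up for illustration.

```python
from collections import Counter, defaultdict

# Input-output pairs: (review text, sentiment label).
# The first three come from the list above; the last two are made-up examples.
training_data = [
    ("fabulous", "positive"),
    ("unacceptable", "negative"),
    ("functional", "neutral"),
    ("fabulous quality", "positive"),
    ("unacceptable delay", "negative"),
]


def train(examples):
    """Count how often each word appears under each label."""
    word_label_counts = defaultdict(Counter)
    for text, label in examples:
        for word in text.lower().split():
            word_label_counts[word][label] += 1
    return word_label_counts


def predict(model, text, default="neutral"):
    """Sum the label counts of known words and return the most frequent label."""
    votes = Counter()
    for word in text.lower().split():
        votes.update(model.get(word, {}))
    return votes.most_common(1)[0][0] if votes else default


model = train(training_data)
print(predict(model, "fabulous service"))      # a review the model has not seen
print(predict(model, "unacceptable quality"))  # mixed signals, decided by counts
```

Even in this toy version, the predictions are only as good as the examples: a word the model never saw with a label contributes nothing to the decision.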

In another example, for a language translation task, the training data would consist of sentence pairs in the source language and their translations in the target language, such as "Spring is coming" paired with "La primavera está llegando". The model uses this data to learn how to translate new sentences.
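
A translation dataset of this kind can be sketched as a list of sentence pairs. The example below is an illustration under our own assumptions: the `SentencePair` class and `split_corpus` helper are hypothetical names, only the first pair comes from the text, and holding out part of the data for validation is common practice that we add here rather than something described above.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class SentencePair:
    source: str  # sentence in the source language (English here)
    target: str  # its translation in the target language (Spanish here)


# The first pair is the example from the text; the others are illustrative.
corpus = [
    SentencePair("Spring is coming", "La primavera está llegando"),
    SentencePair("Good morning", "Buenos días"),
    SentencePair("Thank you very much", "Muchas gracias"),
    SentencePair("Where is the station?", "¿Dónde está la estación?"),
    SentencePair("I like coffee", "Me gusta el café"),
]


def split_corpus(pairs, validation_fraction=0.2, seed=0):
    """Shuffle the pairs and hold out a fraction for validation."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_validation = max(1, int(len(shuffled) * validation_fraction))
    return shuffled[n_validation:], shuffled[:n_validation]


train_pairs, validation_pairs = split_corpus(corpus)
print(len(train_pairs), "training pairs,", len(validation_pairs), "held out")
```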

The more quality samples we offer the machine, the more accurate the output will be. For instance, GPT-3, the model family that ChatGPT builds on, has 175 billion parameters and was trained on roughly 570 GB of books, articles, websites and other textual data scraped from the Internet.

Some types of training data commonly used in language AI models:

  • Supervised textual data
    Labeled text data, where the input and output are given
  • Unsupervised textual data
    Unlabeled text data, where the model has to find patterns or relationships within the data
  • Structured data
    Data that is organized in a structured format such as tables or databases
  • Semi-structured data
    Data that contains elements of structure, but also includes unstructured information such as text
  • Multilingual data
    Data in multiple languages, used to train models for language translation or multilingual text classification
  • Speech data
    Audio data used to train models for speech recognition or text-to-speech synthesis
  • Images and video data
    Visual data used to train models for image and video captioning, object detection, etc.
  • Audio and music data
    Audio data used to train models for music classification, genre recognition, etc.
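
To show how the first four of these types differ in practice, here is a small sketch of what individual records might look like. All field names and values are made up for illustration, and `is_labeled` is a hypothetical helper, not part of any library.

```python
import json

# Supervised textual data: each input comes with its expected output.
supervised = [{"text": "fabulous", "label": "positive"}]

# Unsupervised textual data: raw text only, no labels.
unsupervised = ["Spring is coming", "The station is closed today"]

# Structured data: fixed columns, as in a table or database row.
structured = [{"review_id": 1, "stars": 5, "verified_purchase": True}]

# Semi-structured data: some fixed fields plus free text (JSON here).
semi_structured = json.loads(
    '{"review_id": 2, "stars": 1, "comment": "Delivery was unacceptable"}'
)


def is_labeled(record) -> bool:
    """A record counts as labeled if it is a mapping carrying a 'label' key."""
    return isinstance(record, dict) and "label" in record


print(is_labeled(supervised[0]))    # the supervised record has a label
print(is_labeled(unsupervised[0]))  # the raw sentence does not
```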

For supervised algorithms to learn effectively and make precise predictions from these types of training data, the data must first be labeled with relevant tags or annotations.

What is data labeling and why does it matter?

Data labeling is a crucial step in the development of many machine learning and artificial intelligence applications, as it ensures the accuracy and quality of the data used to train these systems. It is important to have high-quality, accurate and consistent data labels in order to train effective machine learning models.

Data labeling is the process of assigning one or more tags or labels to a given dataset to enable the classification, organization, and retrieval of data. It involves manually annotating the data with meaningful and relevant metadata, which helps in making the data understandable and accessible to machines for training and analysis. It can involve different types of tasks such as object recognition, image and speech recognition, sentiment analysis, and text classification, among others.
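
Label quality can be checked with very simple tooling. The sketch below is one possible audit, under our own assumptions: it flags labels outside an agreed label set and texts that received more than one label. The `audit` function, the allowed labels and the annotations are all hypothetical.

```python
from collections import defaultdict

ALLOWED_LABELS = {"positive", "negative", "neutral"}

# Hypothetical annotations: (text, label) pairs from a labeling pass.
annotations = [
    ("fabulous", "positive"),
    ("functional", "neutral"),
    ("functional", "positive"),   # conflicts with the line above
    ("unacceptable", "negativ"),  # typo: not an allowed label
]


def audit(records, allowed):
    """Return (invalid records, texts that received more than one valid label)."""
    invalid = [(text, label) for text, label in records if label not in allowed]
    labels_by_text = defaultdict(set)
    for text, label in records:
        if label in allowed:
            labels_by_text[text].add(label)
    conflicts = {
        text: sorted(labels)
        for text, labels in labels_by_text.items()
        if len(labels) > 1
    }
    return invalid, conflicts


invalid, conflicts = audit(annotations, ALLOWED_LABELS)
print("Invalid labels:", invalid)
print("Conflicting labels:", conflicts)
```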

Where does labeled training data come from?

Training data can come from a variety of sources, depending on the particular task or application being considered. Here are some examples:

Hand-labeled data

Data that has been labeled by humans, typically through a process of manual annotation. For example, a dataset of images might be labeled with the objects or scenes present in each image, or a dataset of text might be labeled with the sentiment or topic of each piece of text.


Crowdsourcing platforms like Amazon Mechanical Turk or CrowdFlower can be used to collect labeled data from a large number of people at a relatively low cost. This is particularly useful for tasks that require a large amount of labeled data, such as image or speech recognition.
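
When several people label the same item, their answers have to be combined into one label. A common and simple approach is majority voting, sketched below. This illustrates the general idea and is not the API of any particular platform; the item ids, labels and the `majority_vote` helper are made up.

```python
from collections import Counter

# Hypothetical crowd annotations: item id -> labels from different workers.
crowd_labels = {
    "review_1": ["positive", "positive", "neutral"],
    "review_2": ["negative", "negative", "negative"],
    "review_3": ["positive", "negative"],  # tie: no majority
}


def majority_vote(labels, min_agreement=0.5):
    """Return the top label if more than min_agreement of workers chose it, else None."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count / len(labels) > min_agreement else None


aggregated = {item: majority_vote(labels) for item, labels in crowd_labels.items()}
print(aggregated)
```

Items that come back as `None` have no clear majority and would typically be sent for another round of labeling or reviewed by an expert.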

Web scraping

Data can be gathered from websites by using automated tools to extract relevant information. This can be useful for tasks like text classification or sentiment analysis.
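
As a minimal sketch of the extraction step, the example below pulls paragraph text out of an HTML snippet using only Python's standard library. The page content is made up, and the `ParagraphExtractor` class is our own illustration; a real scraper would also need to download the pages.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the text found inside <p> elements."""

    def __init__(self):
        super().__init__()
        self._in_paragraph = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_paragraph = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_paragraph = False

    def handle_data(self, data):
        # Text inside nested tags such as <b> is still part of the paragraph.
        if self._in_paragraph:
            self.paragraphs[-1] += data


# Made-up page content; a real scraper would download this over HTTP.
page_html = """
<html><body>
  <h1>Customer reviews</h1>
  <p>Fabulous product, works as described.</p>
  <p>Delivery was <b>unacceptable</b>.</p>
</body></html>
"""

parser = ParagraphExtractor()
parser.feed(page_html)
reviews = [" ".join(paragraph.split()) for paragraph in parser.paragraphs]
print(reviews)
```

The extracted reviews are still unlabeled at this point. They would need to go through a labeling step before they could be used for supervised tasks such as sentiment analysis.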

Sensor data

In applications such as self-driving cars or health monitoring, data can be collected from sensors such as cameras, LIDAR, or accelerometers.

Existing datasets

There are many existing datasets that have been created and made publicly available, such as the ImageNet dataset for image recognition or the MNIST dataset for handwritten digit recognition.

The race for more powerful AI models is fundamentally a race for high-quality, reliable training data. As AI algorithms become more sophisticated and capable, they require vast amounts of such data to learn from. In fact, access to high-quality training data has become a key differentiator in the development of AI technologies. As the competition to develop the most advanced AI systems intensifies, the ability to collect, label, and process this data has become an increasingly critical factor in achieving innovation and competitive advantage.

Key takeaways:

  • Advancements in Language AI depend heavily on the quality of training data that the models are trained on.
  • Training data is a set of examples used to teach a machine learning model to make accurate predictions.
  • The more quality samples we offer the machine, the more accurate the output will be.
  • Data labeling is a crucial step in ensuring the accuracy and quality of the data used to train language models.