The Trials and Tribulations of Training Data

July 24, 2023

Machine Learning developers share the ground truth.

Aspiring and seasoned machine learning (ML) developers are well aware of the crucial role that training data plays in the success of their projects. However, the reality is far from the ideal scenario of effortlessly plugging in pre-processed datasets.

Having looked at what training data is and what qualities make or break a textual training data set, we now delve into the bottlenecks that ML developers face, with real-world examples that illustrate the impact of biases and misconceptions in training data.

Industry veterans emphasize the importance of Quality Data

If you are a machine learning engineer, you can expect to spend roughly 80% of your time sourcing, labeling, cleaning and debugging data and only 20% creating the model and putting it into production. Even though the spotlight often shines on groundbreaking models and cutting-edge algorithms, the success or failure of these endeavors is determined by data quality. Experts in the field stress the paramount importance of this often-overlooked aspect.

Andrej Karpathy, a former Senior Director of AI at Tesla and a research scientist and founding member of OpenAI, highlights:

“In most practical applications today, the neural net architectures and the training systems are increasingly standardized into a commodity, so most of the active 'software development' takes the form of curating, growing, massaging and cleaning labeled datasets. This is fundamentally altering the programming paradigm by which we iterate on our software, as the teams split in two: the 2.0 programmers (data labellers) edit and grow the datasets, while a few 1.0 programmers maintain and iterate on the surrounding training code infrastructure, analytics, visualizations and labeling interfaces.”

[Figure: "Amount of lost sleep over...", comparing datasets with models & algorithms during a PhD and at Tesla.]
In a talk, Karpathy showed that, at Tesla, the focus has shifted from the actual modeling to handling the datasets.

Andrew Ng, Founder and CEO of Landing AI and Co-founder of Coursera:

“The biggest bottleneck in machine learning is getting good data. The data is the single most important factor in the success of any machine learning project.”

Ian Goodfellow, Co-author of the Deep Learning book and former Director of Machine Learning at Apple:

“I've seen projects fail because they couldn't get enough good data.”

Jeremy Howard, Founder of

“The hardest part of machine learning is getting the data, cleaning the data, and making sure the data is representative of the problem you're trying to solve.”

Ready-to-Use Datasets

Expectations versus Reality

ML developers often have to either create training data from scratch or clean existing data, and both take significant time. An anonymous ML engineer says:

“I once spent a week cleaning up a dataset of cat images, only to find out that half of them were actually pictures of dogs.”

An early ML developer describes what he thought it would be like working with training data: “The datasets are excellent because they are ready to be used with machine learning algorithms right out of the box. You’d download the data, choose your algorithm, call the .fit() function, pass it the data and all of a sudden the loss value would start going down and you’d be left with an accuracy metric. Magic.”

…and what it was actually like: messy, nerve-racking and time-consuming.

“Then I got a job as a machine learning engineer. I thought, finally, I can apply what I’ve been learning to real-world problems. Roadblock. The client sent us the data. I looked at it. What was this? Words, time stamps, more words, rows with missing data, columns, lots of columns. Where were the numbers? ‘How do I deal with this data?’”
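
A first pass over data like this is often just a profile of each column: how many values are missing and what kind of values the rest look like. The following is a minimal sketch using only the Python standard library. The sample rows, the column names, and the `guess_kind` and `profile_columns` helpers are hypothetical and are not taken from the engineer's project.

```python
import csv
import io
from datetime import datetime

# Hypothetical sample resembling the client data described above:
# free text, time stamps, and missing cells.
RAW = """customer,signup_time,comment,plan
Acme,2021-03-04 10:15:00,great service,pro
Globex,,needs follow up,
Initech,2021-03-06 08:00:00,,basic
"""

def guess_kind(value):
    """Classify a non-empty cell as 'number', 'timestamp', or 'text'."""
    try:
        float(value)
        return "number"
    except ValueError:
        pass
    try:
        # Assumed time stamp layout; real data may need several formats.
        datetime.strptime(value, "%Y-%m-%d %H:%M:%S")
        return "timestamp"
    except ValueError:
        return "text"

def profile_columns(rows):
    """Return {column: {'missing': int, 'kinds': {kind: count}}}."""
    profile = {}
    for row in rows:
        for column, value in row.items():
            stats = profile.setdefault(column, {"missing": 0, "kinds": {}})
            value = (value or "").strip()
            if not value:
                stats["missing"] += 1
            else:
                kind = guess_kind(value)
                stats["kinds"][kind] = stats["kinds"].get(kind, 0) + 1
    return profile

rows = list(csv.DictReader(io.StringIO(RAW)))
profile = profile_columns(rows)
for column, stats in profile.items():
    print(column, stats)
```

A profile like this does not clean anything by itself, but it shows which columns need parsing, encoding, or imputation before a model can consume them.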

One Reddit user writes:

“What's usually longest is when we need to create training data. In successful projects I think the slower ones took months to get to the point of having enough high-quality data to build something useful. Though we often keep working to get more data and improve annotator agreement for a while, depending on the importance of the project.”
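
Annotator agreement, mentioned in the quote above, is commonly quantified with a chance-corrected statistic such as Cohen's kappa. The sketch below computes it for two annotators in plain Python. The labels are invented for illustration, and the Reddit user does not say which agreement measure their team used.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each annotator labeled at random
    # according to their own label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1 - expected)

# Hypothetical annotations of eight documents by two annotators.
annotator_1 = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "neg"]
annotator_2 = ["pos", "neg", "neg", "neg", "pos", "neg", "pos", "pos"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.2f}")
```

In this made-up example the annotators agree on six of eight items, but half of that agreement would be expected by chance, so kappa comes out at 0.50 rather than 0.75.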

The banana skins of training data

Biases and misrepresentations in training data can lead to erroneous predictions and limited generalization capabilities. In a talk titled "Autopilot Vision and Learning", Andrej Karpathy shared an example of a self-driving car model mistaking a soda can for a stop sign due to insufficient data on unusual lighting conditions.

One of the challenges we have is that the training data is often biased towards certain conditions. For example, we might have a lot of data for stop signs in good lighting conditions, but not as much data for stop signs in unusual lighting conditions, like at night or in fog. This can lead to the model making mistakes in unusual conditions. For example, we once had a case where the model mistook a soda can for a stop sign because the training data didn't have enough examples of stop signs in low-light conditions.
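
One way to catch this kind of gap before training is to count how many examples each class has under each capture condition and flag the thin slices. The following is a minimal sketch. The metadata records, the `min_count` threshold, and the `underrepresented_slices` helper are hypothetical and do not describe Tesla's tooling.

```python
from collections import Counter

def underrepresented_slices(records, min_count):
    """Return {(label, condition): count} for slices below min_count."""
    counts = Counter((r["label"], r["condition"]) for r in records)
    labels = {label for label, _ in counts}
    conditions = {condition for _, condition in counts}
    # Include combinations that never occur, since a count of zero
    # is the most severe gap.
    return {
        (label, condition): counts[(label, condition)]
        for label in sorted(labels)
        for condition in sorted(conditions)
        if counts[(label, condition)] < min_count
    }

# Hypothetical image metadata: mostly daylight examples.
records = (
    [{"label": "stop_sign", "condition": "day"}] * 50
    + [{"label": "stop_sign", "condition": "night"}] * 3
    + [{"label": "speed_limit", "condition": "day"}] * 40
    + [{"label": "speed_limit", "condition": "fog"}] * 12
)
gaps = underrepresented_slices(records, min_count=10)
for slice_key, count in gaps.items():
    print(slice_key, count)
```

The flagged slices are candidates for targeted data collection or labeling, which is usually cheaper than discovering the gap after deployment.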

Data scientist and ML researcher Rachel Thomas shared a story on Fast Forward Labs about a project to predict heart disease. The model performed well on the training data, but when tested on new data, it failed to generalize:

The reason for this was that the training data was biased. It contained mostly data from white men, and it did not represent the diversity of the population as a whole. This led the model to make biased predictions. For example, the model was more likely to predict that a white man had heart disease than a black woman, even though the risk of heart disease is actually higher for black women.
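
A failure like this tends to stay hidden behind a single aggregate metric, so it helps to report performance separately for each demographic group. The sketch below does that in plain Python. The groups, labels, predictions, and the `accuracy_by_group` helper are made up for illustration and are not data or code from the project described above.

```python
from collections import defaultdict

def accuracy_by_group(groups, y_true, y_pred):
    """Return {group: accuracy} computed over each group's examples."""
    if not (len(groups) == len(y_true) == len(y_pred)):
        raise ValueError("inputs must have the same length")
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, truth, prediction in zip(groups, y_true, y_pred):
        total[group] += 1
        correct[group] += int(truth == prediction)
    return {group: correct[group] / total[group] for group in total}

# Hypothetical evaluation set: group A dominates, group B is scarce.
groups = ["A"] * 8 + ["B"] * 2
y_true = [1, 0, 1, 0, 1, 0, 1, 0, 1, 1]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0, 0, 0]

overall = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
per_group = accuracy_by_group(groups, y_true, y_pred)
print(f"overall accuracy: {overall:.2f}")
for group, accuracy in per_group.items():
    print(f"group {group}: {accuracy:.2f}")
```

In this toy data the overall accuracy of 0.80 masks the fact that every group B example is misclassified, which is the pattern a per-group breakdown is meant to expose.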

The challenges associated with training data in machine learning projects are many. From the unavailability of ready-to-use datasets to biases, misrepresentations, and unexpected pitfalls, ML developers face a rollercoaster ride in their pursuit of quality data. Acknowledging and addressing these challenges is crucial for the success of ML projects, as developers strive to create models that can withstand real-world scenarios and produce reliable results.
