This article is part of Demystifying AI, a series of posts that (try to) disambiguate the jargon and myths surrounding AI.
Machine learning models can perform wonderful things—if they have enough training data. Unfortunately, for many applications, access to quality data remains a barrier.
One solution to this problem is “data augmentation,” a technique that generates new training examples from existing ones. Data augmentation is a low-cost and effective method to improve the performance and accuracy of machine learning models in data-constrained environments.
Overfitting in machine learning models
When machine learning models are trained on limited examples, they tend to “overfit.” Overfitting happens when an ML model performs accurately on its training examples but fails to generalize to unseen data.
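To make the idea concrete, here is a deliberately extreme sketch of overfitting: a "model" that simply memorizes its training examples. The toy data, the function names, and the rule behind the labels are all assumptions made up for illustration. Real models overfit in subtler ways, but the gap between training accuracy and accuracy on unseen data looks similar.

```python
def train_memorizer(examples):
    """'Train' by storing every (features, label) pair verbatim."""
    table = {tuple(features): label for features, label in examples}
    labels = [label for _, label in examples]
    # For inputs never seen in training, fall back to the most common label.
    fallback = max(set(labels), key=labels.count)

    def predict(features):
        return table.get(tuple(features), fallback)

    return predict


def accuracy(predict, examples):
    """Fraction of examples whose predicted label matches the true label."""
    correct = sum(predict(features) == label for features, label in examples)
    return correct / len(examples)


# Hypothetical toy data: the label is 1 when the two features sum to more than 10.
train = [((1, 2), 0), ((8, 9), 1), ((3, 3), 0), ((7, 6), 1), ((2, 5), 0)]
unseen = [((9, 9), 1), ((6, 7), 1), ((1, 1), 0), ((10, 4), 1)]

model = train_memorizer(train)
print(accuracy(model, train))   # 1.0: perfect on the examples it memorized
print(accuracy(model, unseen))  # 0.25: poor on data it has never seen
```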
There are several ways to reduce overfitting in machine learning, such as choosing a different algorithm, modifying the model’s architecture, and adjusting hyperparameters. But ultimately, the main remedy for overfitting is adding more quality data to the training dataset.
For example, consider the convolutional neural network (CNN), a type of machine learning architecture that is especially good for image classification tasks. Without a large and diverse set of training examples, a CNN will end up misclassifying images in the real world. On the other hand, if a CNN is trained on images of objects from different angles and under different lighting conditions, it will become more robust at identifying them in the real world.
However, gathering extra training examples can be expensive, time-consuming, or sometimes impossible. This challenge becomes even more difficult in supervised learning applications where training examples must be labeled by human experts.
One of the ways to increase the diversity of the training dataset is to create copies of the existing data and make small modifications to them. This is called “data augmentation.”
For example, say you have twenty images of ducks in your image classification dataset. By creating copies of your duck images and flipping them horizontally, you double the number of training examples for the “duck” class. You can use other transformations such as rotation, cropping, zooming, and translation. You can also combine the transformations to further expand your collection of unique training examples.
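The duck example can be sketched in a few lines of plain Python. This is a minimal illustration, not a production pipeline: images are represented as nested lists of pixel values, and the function names are hypothetical. In practice you would more likely work with arrays or tensors and a library's built-in transforms.

```python
def flip_horizontal(image):
    """Mirror the image left to right by reversing every row."""
    return [row[::-1] for row in image]


def rotate_90_clockwise(image):
    """Rotate the image 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def augment_with_flips(dataset):
    """Return the original (image, label) pairs plus a flipped copy of each."""
    augmented = list(dataset)
    for image, label in dataset:
        # The label is unchanged: a mirrored duck is still a duck.
        augmented.append((flip_horizontal(image), label))
    return augmented


# A tiny 2x3 stand-in for a real photo (assumed data, for illustration only).
duck = [[1, 2, 3],
        [4, 5, 6]]
dataset = [(duck, "duck") for _ in range(20)]

augmented = augment_with_flips(dataset)
print(len(augmented))  # 40: twenty originals plus twenty flipped copies
```

Combining transformations is just function composition, for example `rotate_90_clockwise(flip_horizontal(duck))`.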
Data augmentation does not need to be limited to geometric manipulation. Adding noise, changing color settings, and other effects such as blur and sharpening filters can also help in repurposing existing training examples as new data.
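Non-geometric transformations follow the same pattern. The sketch below assumes 8-bit grayscale images stored as nested lists and uses hypothetical helper names. It adds Gaussian noise and scales brightness, clamping the results to the valid 0 to 255 range.

```python
import random


def clamp(value, low=0, high=255):
    """Keep a pixel value inside the valid 8-bit range."""
    return max(low, min(high, value))


def add_noise(image, scale, rng):
    """Add Gaussian noise with standard deviation `scale` to every pixel."""
    return [[clamp(round(pixel + rng.gauss(0, scale))) for pixel in row]
            for row in image]


def adjust_brightness(image, factor):
    """Multiply every pixel by `factor` (above 1 brightens, below 1 darkens)."""
    return [[clamp(round(pixel * factor)) for pixel in row] for row in image]


rng = random.Random(42)  # seeded so the augmentation is reproducible
image = [[100, 200],
         [50, 0]]

noisy = add_noise(image, scale=10, rng=rng)
brighter = adjust_brightness(image, 1.5)
print(brighter)  # [[150, 255], [75, 0]]
```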
Data augmentation is especially useful for supervised learning because you already have the labels and don’t need to put in extra effort to annotate the new examples. Data augmentation is also useful for other classes of machine learning algorithms such as unsupervised learning, contrastive learning, and generative models.
Data augmentation has become a standard practice for training machine learning models for computer vision applications. Popular machine learning and deep learning programming libraries have easy-to-use functions to integrate data augmentation into the ML training pipeline.
Data augmentation is not limited to images and can be applied to other types of data. For text datasets, nouns and verbs can be replaced with their synonyms. In audio data, training examples can be modified by adding noise or changing the playback speed.
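Here is a rough sketch of both ideas. The synonym table is a made-up stand-in for a real thesaurus, the function names are hypothetical, and audio is represented as a plain list of samples. The text function ignores punctuation and capitalization, which a real implementation would need to handle.

```python
import random

# Hypothetical, hand-written synonym table (a real system would use a thesaurus).
SYNONYMS = {
    "quick": ["fast", "speedy"],
    "happy": ["glad", "cheerful"],
}


def replace_synonyms(sentence, synonyms, rng, probability=0.5):
    """Swap each known word for a random synonym with the given probability."""
    words = []
    for word in sentence.split():
        options = synonyms.get(word.lower())
        if options and rng.random() < probability:
            words.append(rng.choice(options))
        else:
            words.append(word)
    return " ".join(words)


def add_audio_noise(samples, scale, rng):
    """Add Gaussian noise with standard deviation `scale` to every sample."""
    return [sample + rng.gauss(0, scale) for sample in samples]


def change_speed(samples, rate):
    """Resample by linear interpolation; rate > 1 plays faster (fewer samples)."""
    n_out = max(1, int(len(samples) / rate))
    resampled = []
    for i in range(n_out):
        position = i * rate
        left = int(position)
        right = min(left + 1, len(samples) - 1)
        fraction = position - left
        resampled.append(samples[left] * (1 - fraction) + samples[right] * fraction)
    return resampled


rng = random.Random(0)
print(replace_synonyms("the quick dog is happy", SYNONYMS, rng, probability=1.0))

audio = [0, 1, 2, 3, 4, 5, 6, 7]
print(len(change_speed(audio, 2)))    # 4: twice as fast, half as many samples
print(len(change_speed(audio, 0.5)))  # 16: half speed, twice as many samples
```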
Limits of data augmentation
Data augmentation is not a silver bullet for all your data problems. Think of it as a low-cost performance booster for your ML models: depending on your target application, you still need a fairly large training dataset to begin with.
In some applications, training data might be too limited for data augmentation to help. In these cases, you must collect more data until you reach a minimum threshold before you can use data augmentation. Sometimes, you can use transfer learning, where you train an ML model on a general dataset (e.g., ImageNet) and then repurpose it by fine-tuning its higher layers on the limited data you have for your target application.
Data augmentation also doesn’t address other problems, such as biases that exist in the training dataset. The augmentation process itself may need to be adjusted to address other potential problems, such as class imbalance.
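One way to adjust augmentation for class imbalance is to generate extra copies only for the under-represented classes. The sketch below is one possible approach under simple assumptions, not a recommended recipe: the function names and toy data are hypothetical, and the `augment` callback stands in for whatever transformation suits your data. Note that this evens out the class counts but does nothing about biases already present in the original examples.

```python
import random
from collections import Counter


def balance_with_augmentation(dataset, augment, rng):
    """Add augmented copies of minority-class examples until all classes match the largest."""
    by_label = {}
    for example, label in dataset:
        by_label.setdefault(label, []).append(example)
    target = max(len(examples) for examples in by_label.values())

    balanced = list(dataset)
    for label, examples in by_label.items():
        for _ in range(target - len(examples)):
            # Pick a random original from this class and transform it.
            balanced.append((augment(rng.choice(examples)), label))
    return balanced


def flip_horizontal(image):
    """Mirror the image left to right by reversing every row."""
    return [row[::-1] for row in image]


# Assumed toy dataset: five "duck" images but only two "goose" images.
dataset = [([[1, 2], [3, 4]], "duck") for _ in range(5)]
dataset += [([[5, 6], [7, 8]], "goose") for _ in range(2)]

balanced = balance_with_augmentation(dataset, flip_horizontal, random.Random(0))
counts = Counter(label for _, label in balanced)
print(counts["duck"], counts["goose"])  # 5 5
```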
Used wisely, data augmentation can be a powerful tool in the machine learning engineer’s toolbox.