Real-world data is in most cases incomplete, noisy, and inconsistent.
With the exponentially growing data generation and the increasing number of heterogeneous data sources, the probability of gathering anomalous or incorrect data is quite high.
But only high-quality data can lead to accurate models and, ultimately, accurate predictions. Hence, it’s crucial to process data for the best possible quality. This step of processing data is called data preprocessing, and it’s one of the essential steps in data science, machine learning, and artificial intelligence.
What is data preprocessing?
Data preprocessing is the process of transforming raw data into a useful, understandable format. Real-world or raw data usually has inconsistent formatting and human errors, and can also be incomplete. Data preprocessing resolves such issues, making datasets more complete and easier to analyze.
It’s a crucial process that can affect the success of data mining and machine learning projects. It makes knowledge discovery from datasets faster and can ultimately affect the performance of machine learning models.
In other words, data preprocessing is transforming data into a form that computers can easily work on. It makes data analysis or visualization easier and increases the accuracy and speed of the machine learning algorithms that train on the data.
Why is data preprocessing required?
As you know, a database is a collection of data points. Data points are also called observations, data samples, events, and records.
Each sample is described using different characteristics, also known as features or attributes. Data preprocessing is essential to effectively build models with these features.
Numerous problems can arise while collecting data. You may have to aggregate data from different data sources, leading to mismatching data formats, such as integer and float.
If you’re aggregating data from two or more independent datasets, the gender field may have two different values for men: man and male. Likewise, if you’re aggregating data from ten different datasets, a field that’s present in eight of them may be missing from the remaining two.
By preprocessing data, we make it easier to interpret and use. This process eliminates inconsistencies or duplicates in data, which can otherwise negatively affect a model’s accuracy. Data preprocessing also ensures that there aren’t any incorrect or missing values due to human error or bugs. In short, employing data preprocessing techniques makes the database more complete and accurate.
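The gender-field example above can be sketched in a few lines. This is a hypothetical illustration: the label values and the mapping are invented, and real datasets would need a mapping built from an audit of the actual values present.

```python
# Hypothetical sketch: harmonizing a "gender" field whose values differ
# between two merged sources ("man" vs. "male"). Labels are invented.
raw_genders = ["man", "male", "female", "woman", "male"]

# Map each source-specific label to one canonical value.
canonical = {"man": "male", "male": "male", "woman": "female", "female": "female"}
cleaned_genders = [canonical[g] for g in raw_genders]
```

The same dictionary-mapping pattern works for any categorical field with inconsistent spellings or encodings.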
Characteristics of quality data
For machine learning algorithms, nothing is more important than quality training data. Their performance or accuracy depends on how relevant, representative, and comprehensive the data is.
Before looking at how data is preprocessed, let’s look at some factors contributing to data quality.
- Accuracy: As the name suggests, accuracy means that the information is correct. Outdated information, typos, and redundancies can affect a dataset’s accuracy.
- Consistency: The data should have no contradictions. Inconsistent data may give you different answers to the same question.
- Completeness: The dataset shouldn’t have missing values or empty fields. This characteristic allows data scientists to perform accurate analyses, as they have access to a complete picture of the situation the data describes.
- Validity: A dataset is considered valid if the data samples appear in the correct format, are within a specified range, and are of the right type. Invalid datasets are hard to organize and analyze.
- Timeliness: Data should be collected as soon as the event it represents occurs. As time passes, every dataset becomes less accurate and useful as it doesn’t represent the current reality. Therefore, the topicality and relevance of data is a critical data quality characteristic.
The four stages of data preprocessing
For machine learning models, data is fodder.
An incomplete training set can lead to unintended consequences such as bias, leading to an unfair advantage or disadvantage for a particular group of people. Incomplete or inconsistent data can negatively affect the outcome of data mining projects as well. To resolve such problems, the process of data preprocessing is used.
There are four stages of data preprocessing: cleaning, integration, reduction, and transformation.
1. Data cleaning
Data cleaning or cleansing is the process of cleaning datasets by accounting for missing values, removing outliers, correcting inconsistent data points, and smoothing noisy data. In essence, the motive behind data cleaning is to offer complete and accurate samples for machine learning models.
The techniques used in data cleaning are specific to the data scientist’s preferences and the problem they’re trying to solve. Here’s a quick look at the issues that are solved during data cleaning and the techniques involved.
The problem of missing data values is quite common. It may happen during data collection or due to some specific data validation rule. In such cases, you need to collect additional data samples or look for additional datasets.
The issue of missing values can also arise when you concatenate two or more datasets to form a bigger dataset. If not all fields are present in both datasets, it’s better to delete such fields before merging.
Here are some ways to account for missing data:
- Manually fill in the missing values. This can be a tedious and time-consuming approach and is not recommended for large datasets.
- Make use of a standard value to replace the missing data value. You can use a global constant like “unknown” or “N/A” to replace the missing value. Although a straightforward approach, it isn’t foolproof.
- Fill the missing value with the most probable value. To predict the probable value, you can use algorithms like logistic regression or decision trees.
- Use a central tendency to replace the missing value. Central tendency is the tendency of a value to cluster around its mean, mode, or median.
If more than 50 percent of the values in a row or column are missing, it’s better to delete the entire row or column, unless it’s possible to fill in the values using any of the above methods.
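Two of the approaches above, central-tendency replacement and the 50-percent deletion rule, can be sketched with plain Python. The column values here are invented for illustration, with `None` standing in for a missing entry.

```python
from statistics import mean

# Hypothetical "age" column with missing entries marked as None.
ages = [23, None, 31, 27, None, 25]

observed = [a for a in ages if a is not None]
fill_value = mean(observed)  # central-tendency replacement (here, the mean)
imputed = [a if a is not None else fill_value for a in ages]

# Drop the column entirely if more than 50% of its values are missing.
missing_ratio = (len(ages) - len(observed)) / len(ages)
keep_column = missing_ratio <= 0.5
```

Swapping `mean` for `median` or the mode changes which central tendency is used without altering the rest of the logic.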
A large amount of meaningless data is called noise. More precisely, it’s the random variance in a measured variable or data having incorrect attribute values. Noise includes duplicate or semi-duplicates of data points, data segments of no value for a specific research process, or unwanted information fields.
For example, if you need to predict whether a person can drive, information about their hair color, height, or weight will be irrelevant.
An outlier can be treated as noise, although some consider it a valid data point. Suppose you’re training an algorithm to detect tortoises in pictures. The image dataset may contain images of turtles wrongly labeled as tortoises. This can be considered noise.
However, there can be a tortoise’s image that looks more like a turtle than a tortoise. That sample can be considered an outlier and not necessarily noise. This is because we want to teach the algorithm all possible ways to detect tortoises, and so, deviation from the group is essential.
For numeric values, you can use a scatter plot or box plot to identify outliers.
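The box-plot rule can also be applied numerically: values that fall more than 1.5 times the interquartile range (IQR) beyond the quartiles, i.e., outside the plot’s whiskers, are flagged as outliers. The quartile estimates below use a simple index-based method, which is an assumption for brevity; library routines compute them more carefully.

```python
# Flag values outside the box plot's whiskers: beyond 1.5 * IQR from Q1/Q3.
def iqr_outliers(values):
    s = sorted(values)
    n = len(s)
    q1 = s[n // 4]            # rough quartile estimates (simplifying assumption)
    q3 = s[(3 * n) // 4]
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lo or v > hi]

print(iqr_outliers([10, 12, 11, 13, 12, 98]))  # → [98]
```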
The following are some methods used to solve the problem of noise:
- Regression: Regression analysis can help determine the variables that have an impact. This will enable you to work with only the essential features instead of analyzing large volumes of data. Both linear regression and multiple linear regression can be used for smoothing the data.
- Binning: Binning methods can be used on a collection of sorted data. The sorted values are divided into “bins,” meaning smaller segments of the same size, and each value is then smoothed using the values around it in its bin. Binning techniques include smoothing by bin means and smoothing by bin medians.
- Clustering: Clustering algorithms such as k-means clustering can be used to group data and detect outliers in the process.
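Of the methods above, smoothing by bin means is easy to show directly: sort the values, split them into equal-size bins, and replace every value with its bin’s mean. The sample values are invented for illustration.

```python
# Smoothing by bin means: sort, split into equal-size bins,
# then replace each value with the mean of its bin.
def smooth_by_bin_means(values, bin_size):
    s = sorted(values)
    smoothed = []
    for i in range(0, len(s), bin_size):
        bin_vals = s[i:i + bin_size]
        m = sum(bin_vals) / len(bin_vals)
        smoothed.extend([m] * len(bin_vals))
    return smoothed

print(smooth_by_bin_means([4, 8, 15, 21, 21, 24, 25, 28, 34], 3))
# → [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```

Smoothing by bin medians works the same way, with the median replacing the mean.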
2. Data integration
Since data is collected from various sources, data integration is a crucial part of data preparation. Integration may lead to several inconsistent and redundant data points, ultimately leading to models with inferior accuracy.
Here are some approaches to integrate data:
- Data consolidation: Data is physically brought together and stored in a single place. Having all data in one place increases efficiency and productivity. This step typically involves using data warehouse software.
- Data virtualization: In this approach, an interface provides a unified and real-time view of data from multiple sources. In other words, data can be viewed from a single point of view.
- Data propagation: Involves copying data from one location to another with the help of specific applications. This process can be synchronous or asynchronous and is usually event-driven.
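At its simplest, data consolidation means merging records about the same entities from different sources into one store. The sketch below is a toy illustration with invented field names; real consolidation typically runs through data warehouse or ETL tooling.

```python
# Hypothetical consolidation: two sources describe the same customers by id;
# merge them into a single store. Field names and values are invented.
crm = {1: {"name": "Ada"}, 2: {"name": "Alan"}}
billing = {1: {"plan": "pro"}, 2: {"plan": "free"}}

consolidated = {
    cid: {**crm.get(cid, {}), **billing.get(cid, {})}
    for cid in crm.keys() | billing.keys()
}
```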
3. Data reduction
As the name suggests, data reduction is used to reduce the amount of data and thereby reduce the costs associated with data mining or data analysis.
It offers a condensed representation of the dataset. Although this step reduces the volume, it maintains the integrity of the original data. This data preprocessing step is especially crucial when working with big data as the amount of data involved would be gigantic.
The following are some techniques used for data reduction.
Dimensionality reduction

Dimensionality reduction, also known as dimension reduction, reduces the number of features or input variables in a dataset.
The number of features or input variables of a dataset is called its dimensionality. The higher the number of features, the more troublesome it is to visualize the training dataset and create a predictive model.
In some cases, most of these attributes are correlated, hence redundant; therefore, dimensionality reduction algorithms can be used to reduce the number of random variables and obtain a set of principal variables.
There are two segments of dimensionality reduction: feature selection and feature extraction.
In feature selection, we try to find a subset of the original set of features. This allows us to get a smaller subset that can be used for modeling the problem. On the other hand, feature extraction reduces data in a high-dimensional space to a lower-dimensional space, that is, a space with fewer dimensions.
The following are some ways to perform dimensionality reduction:
- Principal component analysis (PCA): A statistical technique used to extract a new set of variables from a large set of variables. The newly extracted variables are called principal components. This method works only for features with numerical values.
- High correlation filter: A technique used to find highly correlated features and remove them; otherwise, a pair of highly correlated variables can increase the multicollinearity in the dataset.
- Missing values ratio: This method removes attributes having missing values more than a specified threshold.
- Low variance filter: Involves removing normalized attributes having variance less than a threshold value as minor changes in data translate to less information.
- Random forest: This technique can be used to assess the importance of each feature in a dataset, allowing us to keep just the most important ones.
Other dimensionality reduction techniques include factor analysis, independent component analysis, and linear discriminant analysis (LDA).
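The high-correlation filter from the list above can be sketched without any libraries: compute pairwise Pearson correlation between feature columns and drop one column of each highly correlated pair. The feature names and values below are invented; `height_in` is deliberately an almost exact rescaling of `height_cm`.

```python
# Minimal high-correlation filter sketch. Feature data is invented.
def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

features = {
    "height_cm": [150, 160, 170, 180],
    "height_in": [59.1, 63.0, 66.9, 70.9],  # near-duplicate of height_cm
    "test_score": [88, 70, 95, 61],
}

dropped = set()
names = list(features)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        if a not in dropped and b not in dropped:
            if abs(pearson(features[a], features[b])) > 0.95:
                dropped.add(b)

kept = [n for n in names if n not in dropped]  # height_in is filtered out
```

The 0.95 threshold is an arbitrary choice for the sketch; in practice it is tuned to the dataset.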
Feature subset selection
Feature subset selection is the process of selecting a subset of features or attributes that contribute the most or are the most important.
Suppose you’re trying to predict whether a student will pass or fail by looking at historical data of similar students. You have a dataset with four features: roll number, total marks, study hours, and extracurricular activities.
In this case, roll numbers do not affect students’ performance and can be eliminated. The new subset will have just three features and will be more efficient than the original set.
This data reduction approach can help create faster and more cost-efficient machine learning models. Attribute subset selection can also be performed in the data transformation step.
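The student example above reduces to keeping a named subset of columns. The records below are invented stand-ins for the four features from the text.

```python
# Feature subset selection: keep only the features that matter.
# Records are invented; feature names follow the student example.
students = [
    {"roll_number": 1, "total_marks": 82, "study_hours": 6, "extracurricular": 1},
    {"roll_number": 2, "total_marks": 45, "study_hours": 2, "extracurricular": 0},
]

selected_features = ["total_marks", "study_hours", "extracurricular"]
reduced = [{k: row[k] for k in selected_features} for row in students]
```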
Numerosity reduction

Numerosity reduction is the process of replacing the original data with a smaller form of data representation. There are two ways to perform this: parametric and non-parametric methods.
Parametric methods use models for data representation. Log-linear and regression methods are used to create such models. In contrast, non-parametric methods store reduced data representations using clustering, histograms, data cube aggregation, and data sampling.
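Of the non-parametric methods, data sampling is the simplest to show: replace the full dataset with a random sample that preserves its overall character. The 10 percent sampling rate below is an arbitrary choice for the sketch.

```python
import random

# Non-parametric numerosity reduction by simple random sampling:
# keep a 10% sample in place of the full dataset.
random.seed(0)  # fixed seed for a repeatable sketch
full_data = list(range(1000))
sample = random.sample(full_data, k=len(full_data) // 10)
```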
4. Data transformation
Data transformation is the process of converting data from one format to another. In essence, it involves methods for transforming data into appropriate formats that the computer can learn efficiently from.
For example, speed can be measured in miles per hour, meters per second, or kilometers per hour, so a dataset may store a car’s speed values in different units. Before feeding this data to an algorithm, we need to convert all values to the same unit.
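The speed example can be sketched directly. The conversion factors are standard (1 mph = 1.609344 km/h; 1 m/s = 3.6 km/h); the record layout is invented for illustration.

```python
# Transform mixed speed units into a single unit (km/h).
TO_KMH = {"mph": 1.609344, "m/s": 3.6, "km/h": 1.0}

speeds = [(100, "km/h"), (60, "mph"), (10, "m/s")]
speeds_kmh = [round(value * TO_KMH[unit], 2) for value, unit in speeds]
# → [100.0, 96.56, 36.0]
```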
The following are some strategies for data transformation.
Smoothing

Smoothing is a statistical approach for removing noise from the data with the help of algorithms. It helps highlight the most valuable features in a dataset and predict patterns. It also involves eliminating outliers from the dataset to make the patterns more visible.
Aggregation refers to pooling data from multiple sources and presenting it in a unified format for data mining or analysis. Aggregating data from various sources to increase the number of data points is essential, as only then will the ML model have enough examples to learn from.
Discretization involves converting continuous data into sets of smaller intervals. For example, it’s more efficient to place people in categories such as “teen,” “young adult,” “middle age,” or “senior” than using continuous age values.
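The age-category example above can be sketched as a small binning function. The cut-off ages are assumptions chosen for illustration, not part of the original text.

```python
# Discretize continuous ages into the categories from the text.
# Cut-off ages (20, 35, 60) are illustrative assumptions.
def age_group(age):
    if age < 20:
        return "teen"
    elif age < 35:
        return "young adult"
    elif age < 60:
        return "middle age"
    return "senior"

groups = [age_group(a) for a in [16, 28, 45, 72]]
# → ["teen", "young adult", "middle age", "senior"]
```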
Generalization involves converting low-level data features into high-level data features. For instance, categorical attributes such as home address can be generalized to higher-level definitions such as city or state.
Normalization refers to the process of converting all data variables into a specific range. In other words, it’s used to scale the values of an attribute so that it falls within a smaller range, for example, 0 to 1. Decimal scaling, min-max normalization, and z-score normalization are some methods of data normalization.
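Min-max normalization, one of the methods just listed, maps each value linearly into the range 0 to 1 via (v - min) / (max - min). A minimal sketch with invented values:

```python
# Min-max normalization: scale each value into [0, 1].
def min_max(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

normalized = min_max([10, 20, 30, 50])  # → [0.0, 0.25, 0.5, 1.0]
```

Z-score normalization follows the same shape, subtracting the mean and dividing by the standard deviation instead.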
Feature construction involves constructing new features from the given set of features. This method simplifies the original dataset and makes it easier to analyze, mine, or visualize the data.
Concept hierarchy generation
Concept hierarchy generation lets you create a hierarchy between features even when one isn’t explicitly specified. For example, if you have a house address dataset containing data about street, city, state, and country, this method can be used to organize the data in hierarchical form.
Accurate data, accurate results
Machine learning algorithms are like kids. They have little to no understanding of what’s favorable or unfavorable. Like how kids start repeating foul language picked up from adults, inaccurate or inconsistent data easily influences ML models. The key is to feed them high-quality, accurate data, for which data preprocessing is an essential step.