Data Preprocessing: Exploring the Keys to Data Preparation
In this article, we’ll explore what data preprocessing is, why it’s important, and how to clean, transform, integrate and reduce our data.
Data preprocessing is a fundamental step in data analysis and machine learning. It’s an intricate process that sets the stage for the success of any data-driven endeavor.
At its core, data preprocessing encompasses an array of techniques to transform raw, unrefined data into a structured and coherent format ripe for insightful analysis and modeling.
This vital preparatory phase is the backbone for extracting valuable knowledge and wisdom from data, empowering decision-making and predictive modeling across diverse domains.
The need for data preprocessing arises from real-world data’s inherent imperfections and complexities. Often acquired from different sources, raw data tends to be riddled with missing values, outliers, inconsistencies, and noise. These flaws can obstruct the analytical process, endangering the reliability and accuracy of the conclusions drawn. Moreover, data collected from various channels may vary in scales, units, and formats, making direct comparisons arduous and potentially misleading.
Data preprocessing typically involves several steps, including data cleaning, data transformation, data integration, and data reduction. We’ll explore each of these in turn below.
Data cleaning involves identifying and correcting errors, inconsistencies, and inaccuracies in the data. Some standard techniques used in data cleaning include:
- handling missing values
- handling duplicates
- handling outliers
Let’s discuss each of these data-cleaning techniques in turn.
Handling missing values is an essential part of data preprocessing. Under this technique, we deal with observations that contain missing data. We'll discuss three standard methods for handling missing values: removing observations (rows) with missing values, imputing missing values with statistical measures, and imputing missing values with machine-learning algorithms.
We'll demonstrate each technique with a custom dataset, explain its output, and discuss each of these methods for handling missing values in turn.
The simplest way to deal with missing values is to drop the rows that contain them. This method usually isn't recommended, as it can shrink our dataset by removing rows that contain essential data.
Let’s understand this method with the help of an example. We create a custom dataset with age, income, and education data. We introduce missing values by setting some values to NaN (not a number). NaN is a special floating-point value that indicates an invalid or undefined result. The observations with NaN will be dropped with the help of the dropna() function from the Pandas library:
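(Below is a minimal sketch of this approach, assuming a small custom dataset with illustrative age, income, and education values.)

```python
import pandas as pd
import numpy as np

# A small custom dataset with some missing (NaN) values
df = pd.DataFrame({
    'age': [25, np.nan, 30, np.nan, 40],
    'income': [50000, 60000, np.nan, 55000, 65000],
    'education': ['Bachelor', 'Master', np.nan, 'PhD', 'Bachelor']
})

print("Original dataset:")
print(df)

# Drop every row that contains at least one missing value
cleaned_df = df.dropna()

print("Cleaned dataset:")
print(cleaned_df)
```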
The output of the above code is given below. Note that the actual console output won't appear in a bordered table format; we're presenting it that way here to make it easier to interpret.
Original dataset
Cleaned dataset
The observations with missing values are removed in the cleaned dataset, so only the observations without missing values are kept. You'll find that only rows 0 and 4 remain in the cleaned dataset.
Dropping rows or columns with missing values can significantly reduce the number of observations in our dataset. This may affect the accuracy and generalization of our machine-learning model. Therefore, we should use this approach cautiously and only when we have a large enough dataset or when the missing values aren’t essential for analysis.
Imputation is a more sophisticated way to deal with missing data than simply dropping it. It replaces the missing values with a statistic, such as the mean, median, or mode, or with a constant value.
This time, we create a custom dataset with age, income, gender, and marital_status data with some missing (NaN) values. We then impute the missing values with the median using the fillna() function from the Pandas library:
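(A minimal sketch with illustrative values. Because the categorical gender and marital_status columns have no median, this sketch fills them with the mode instead.)

```python
import pandas as pd
import numpy as np

# Custom dataset with missing values in numeric and categorical columns
df = pd.DataFrame({
    'age': [25, np.nan, 30, 35, np.nan],
    'income': [50000, 60000, np.nan, 55000, 65000],
    'gender': ['F', 'M', np.nan, 'M', 'F'],
    'marital_status': ['Single', np.nan, 'Married', 'Single', 'Married']
})

print("Original dataset:")
print(df)

imputed_df = df.copy()

# Numeric columns: fill missing values with the column median
for col in ['age', 'income']:
    imputed_df[col] = imputed_df[col].fillna(imputed_df[col].median())

# Categorical columns: fill missing values with the most frequent value (mode)
for col in ['gender', 'marital_status']:
    imputed_df[col] = imputed_df[col].fillna(imputed_df[col].mode()[0])

print("Imputed dataset:")
print(imputed_df)
```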
The output of the above code in table form is shown below.
Original dataset
Imputed dataset
In the imputed dataset, the missing values in the numeric age and income columns are replaced with their respective column medians. Categorical columns such as gender and marital_status have no median, so their missing values are typically filled with the mode (the most frequent value) instead.
Machine-learning algorithms provide a sophisticated way to deal with missing values based on features of our data. For example, the KNNImputer class from the Scikit-learn library is a powerful way to impute missing values. Let’s understand this with the help of a code example:
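(A minimal sketch assuming a small numeric dataset with made-up age and income values. KNNImputer fills each missing entry using the rows that are closest to it on the features that are present.)

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Custom numeric dataset with missing values
df = pd.DataFrame({
    'age': [25, np.nan, 30, 35, 40],
    'income': [50000, 60000, np.nan, 55000, 65000]
})

print("Original dataset:")
print(df)

# Impute each missing value from its 2 nearest neighbours,
# measured on the features that are not missing
imputer = KNNImputer(n_neighbors=2)
imputed_array = imputer.fit_transform(df)

imputed_df = pd.DataFrame(imputed_array, columns=df.columns)

print("Dataset after imputing with KNNImputer:")
print(imputed_df)
```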
The output of this code is shown below.
Original Dataset
Dataset after imputing with KNNImputer
The above example demonstrates that imputing missing values with machine learning can produce more realistic and accurate values than imputing with statistics, as it considers the relationships between the features and the missing values. However, this approach can also be more computationally expensive and complex, as it requires choosing and tuning a suitable machine-learning algorithm and its parameters. We should therefore use this approach when we have sufficient data and the missing values aren't random or trivial for our analysis.
It's important to note that many machine-learning algorithms can handle missing values internally. XGBoost, LightGBM, and CatBoost are good examples of algorithms that support missing values. These algorithms handle missing values internally, for example by learning which branch of a split missing values should follow, or by ignoring them when computing splits. However, this approach doesn't work well on all types of data and can introduce bias and noise into our model.
We often have to deal with data containing duplicate rows, that is, rows with the same values in every column. This process involves identifying and removing duplicated rows from the dataset.
Here, the duplicated() and drop_duplicates() functions can help us. The duplicated() function finds the duplicated rows in the data, while the drop_duplicates() function removes them. Keep in mind that this technique can also remove legitimate data, so it's worth analyzing the data before applying it:
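(A minimal sketch with a made-up name/age/income dataset containing two duplicated rows.)

```python
import pandas as pd

# Custom dataset in which rows 2 and 4 duplicate earlier rows
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Alice', 'Dave', 'Bob'],
    'age': [25, 30, 25, 40, 30],
    'income': [50000, 60000, 50000, 70000, 60000]
})

print("Original dataset:")
print(df)

# Flag rows that are exact duplicates of an earlier row
duplicates = df[df.duplicated()]
print("Duplicate rows:")
print(duplicates)

# Keep only the first occurrence of each row
deduplicated_df = df.drop_duplicates()
print("Deduplicated dataset:")
print(deduplicated_df)
```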
The output of the above code is shown below.
Original dataset
Duplicate rows
Deduplicated dataset
The duplicate rows are identified based on the name, age, and income columns and removed from the original dataset, leaving only unique rows in the deduplicated dataset.
In real-world data analysis, we often come across data with outliers. Outliers are unusually small or large values that deviate significantly from the other observations in a dataset. Such outliers are first identified, and then either removed or the data is transformed to reduce their influence. Let's look at this in more detail.
As we’ve already seen, the first step is to identify the outliers in our dataset. Various statistical techniques can be used for this, such as the interquartile range (IQR), z-score, or Tukey methods.
We'll mainly look at the z-score, a common technique for identifying outliers in a dataset.
The z-score measures how many standard deviations an observation is from the mean of the dataset. The formula for calculating the z-score of an observation x is:
z = (x - μ) / σ
where μ is the mean of the dataset and σ is its standard deviation.
The threshold for the z-score method is typically chosen based on the level of significance or the desired level of confidence in identifying outliers. A commonly used threshold is a z-score of 3, meaning any observation with a z-score greater than 3 or less than -3 is considered an outlier.
Once the outliers are identified, they can be removed from the dataset using techniques such as trimming, which simply removes the observations with extreme values. However, it's important to carefully analyze the dataset and determine the appropriate technique for handling outliers.
Alternatively, the data can be transformed using mathematical functions such as logarithmic, square root, or inverse functions to reduce the impact of outliers on the analysis:
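(The sketch below uses a made-up age column with one extreme value, 200. With only a handful of observations, a single outlier can inflate the standard deviation so much that it never reaches a z-score of 3, so this example uses a dozen rows.)

```python
import pandas as pd

# Custom dataset with one extreme value (200) in the age column
df = pd.DataFrame({
    'age': [22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 200]
})

print("Original dataset:")
print(df)

# Calculate the mean and standard deviation of the age column
mean_age = df['age'].mean()
std_age = df['age'].std()

# Z-score: how many standard deviations each observation is from the mean
df['z_score'] = (df['age'] - mean_age) / std_age

# Any observation whose absolute z-score exceeds the threshold is an outlier
threshold = 3
outliers = df[df['z_score'].abs() > threshold]
print("Outliers:")
print(outliers[['age']])

# Keep only the observations within the threshold and drop the helper column
df_no_outliers = df[df['z_score'].abs() <= threshold].drop(columns='z_score')
print("Dataset without outliers:")
print(df_no_outliers)
```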
In this example, we’ve created a custom dataset with outliers in the age column. We then apply the outlier handling technique to identify and remove outliers from the dataset. We first calculate the mean and standard deviation of the data, and then identify the outliers using the z-score method. The z-score is calculated for each observation in the dataset, and any observation that has a z-score greater than the threshold value (in this case, 3) is considered an outlier. Finally, we remove the outliers from the dataset.
The output of the above code in table form is shown below.
Original dataset
Outliers
Dataset without outliers
The outlier (200) in the age column has been removed from the original dataset, leaving the dataset without outliers.
Data transformation is another method in data processing to improve data quality by modifying it. This transformation process involves converting the raw data into a more suitable format for analysis by adjusting the data’s scale, distribution, or format.
Let’s look at an example:
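(A minimal sketch with made-up spending values, chosen so that the square-rooted values run from exactly 1.0 to 6.0.)

```python
import numpy as np
import pandas as pd

# Custom dataset in which one large spending value skews the distribution
df = pd.DataFrame({
    'spending': [1, 2, 3, 4, 5, 36]
})

print("Original dataset:")
print(df)

# The square root transformation pulls in the long right tail,
# making the distribution closer to normal
df['sqrt_spending'] = np.sqrt(df['spending'])

print("Transformed dataset:")
print(df)
```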
In this example, our custom dataset has a variable called spending, and a significant outlier in this variable is skewing the data. The square root transformation turns the skewed spending variable into a more normal distribution, with the transformed values stored in a new variable called sqrt_spending. The transformed values range from 1.0 to 6.0, making them more suitable for data analysis.
The output of the above code in table form is shown below.
Original dataset
Transformed dataset
The data integration technique combines data from various sources into a single, unified view. This helps to increase the completeness and diversity of the data, as well as resolve any inconsistencies or conflicts that may exist between the different sources. Data integration is helpful for data mining, enabling data analysis spread across multiple systems or platforms.
Let’s suppose we have two datasets. One contains customer IDs and their purchases, while the other dataset contains information on customer IDs and demographics, as given below. We intend to combine these two datasets for a more comprehensive customer behavior analysis.
Customer Purchase Dataset
Customer Demographics Dataset
To integrate these datasets, we need to map the common variable, the customer ID, and combine the data. We can use the Pandas library in Python to accomplish this:
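(A sketch using small illustrative versions of the two datasets; the column names here are assumptions.)

```python
import pandas as pd

# Customer purchase dataset: customer IDs and purchase amounts
purchases = pd.DataFrame({
    'customer_id': [1, 2, 3, 4],
    'purchase_amount': [100, 250, 175, 320]
})

# Customer demographics dataset: customer IDs, age, and gender
demographics = pd.DataFrame({
    'customer_id': [1, 2, 3, 4],
    'age': [25, 34, 29, 41],
    'gender': ['F', 'M', 'F', 'M']
})

# Merge the two datasets on the shared customer_id column
merged_df = pd.merge(purchases, demographics, on='customer_id')

print(merged_df)
```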
The output of the above code in table form is shown below.
We’ve used the merge() function from the Pandas library. It merges the two datasets based on the common customer ID variable. It results in a unified dataset containing purchase information and customer demographics. This integrated dataset can now be used for more comprehensive analysis, such as analyzing purchasing patterns by age or gender.
Data reduction is another commonly used technique in data preprocessing. It's used when we have a lot of data containing plenty of irrelevant information. This method reduces the volume of data without losing the most critical information.
There are different methods of data reduction, such as those listed below:
- Data cube aggregation
- Dimensionality reduction (see the sketch after this list)
- Data compression
- Numerosity reduction
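As an illustration of one of these methods, dimensionality reduction, here's a minimal sketch that uses PCA from Scikit-learn to project four made-up, correlated features down to two principal components:

```python
import pandas as pd
from sklearn.decomposition import PCA

# Custom dataset with four correlated numeric features
df = pd.DataFrame({
    'feature_1': [2.5, 0.5, 2.2, 1.9, 3.1, 2.3],
    'feature_2': [2.4, 0.7, 2.9, 2.2, 3.0, 2.7],
    'feature_3': [1.2, 0.3, 1.4, 1.1, 1.6, 1.3],
    'feature_4': [0.8, 0.2, 0.9, 0.7, 1.1, 0.9]
})

# Project the four features down to two principal components
pca = PCA(n_components=2)
reduced = pca.fit_transform(df)

reduced_df = pd.DataFrame(reduced, columns=['PC1', 'PC2'])
print(reduced_df)
print("Explained variance ratio:", pca.explained_variance_ratio_)
```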
Data preprocessing is essential, because the quality of the data directly affects the accuracy and reliability of the analysis or model. By properly preprocessing the data, we can improve the performance of the machine learning models and obtain more accurate insights from the data.
Preparing data for machine learning is like getting ready for a big party. Like cleaning and tidying up a room, data preprocessing involves fixing inconsistencies, filling in missing information, and ensuring that all data points are compatible. Using techniques such as data cleaning, data transformation, data integration, and data reduction, we create a well-prepared dataset that allows computers to identify patterns and learn effectively.
It's recommended that we explore the data in depth, understand its patterns, and find the reasons for missingness before choosing an approach. Validation and test sets are also important for evaluating the performance of different techniques.
Rehman Ahmad Chaudhary is a blogger, writer, computer vision enthusiast, programming guy, and founder of StudyEnablers.