Scraping data from the web is becoming increasingly challenging: dynamic content, anti-scraping defenses, and CAPTCHAs are three of the biggest obstacles.
But the problems don't end there.
Most datasets have missing values, duplicates, and errors lurking beneath the surface that can significantly affect your results and analysis.
This beginner's guide lays out how to clean a dataset in Python, which is a great tool for this task.
If you use Python to scrape the web, knowing how to use it to clean your datasets will allow you to make the most of the web data you've collected.
This guide will show you why cleaning data is important and how to find and fix common problems. By the end, your dataset will be clean and ready to analyze.
We've also included a cheat sheet to help with future cleaning tasks.
Types of dirty data
Before cleaning data, it’s important to know what you’re dealing with. Dirty data (i.e., the information in a dataset before it undergoes cleaning) comes in several forms:
1. Missing values
Missing values are the gaps in your dataset and are very common. They may occur for various reasons, most often because data wasn't collected or was lost during transfer.
Missing values can be fixed in two ways:
- Deletion: Removing the rows that contain missing values. This can work but may discard too much data.
- Imputation: Filling in the missing value with a statistical measure (mean, median, etc.), a placeholder string, or 0 (see the sketch below).
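Here's a minimal sketch of both approaches, assuming Pandas is imported as pd and the DataFrame has a hypothetical numeric 'age' column:
# Option 1 - deletion: drop any row that contains a missing value
df_dropped = df.dropna()
# Option 2 - imputation: fill missing ages with the column mean (or a fixed value)
df_imputed = df.copy()
df_imputed['age'] = df_imputed['age'].fillna(df['age'].mean())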
2. Duplicates
Duplicates are copies of the same data point. They may occur when data is collected from various sources and/or when errors in data entry or processing occur.
Duplicates are also a prevalent type of dirty data, and many Python libraries (including Pandas) have built-in functions to remove duplicates from a dataset.
3. Outliers
In datasets, outliers are certain data points that are significantly different from the others. Outliers may exist because of errors in data collection or from other legitimate but rare events.
To handle outliers, you can remove them or use methods to reduce their impact on analysis. The choice depends on the dataset and your goals.
4. Erroneous data
These are incorrect values in a dataset. For example, if a certain dataset contains the ages of a group of people and one age value indicates “200”, it’s erroneous.
Erroneous data can be corrected only if the true value is known. Otherwise, it should be treated as an outlier or, most often, as missing data.
5. Inconsistencies
Inconsistencies happen when the same data is recorded differently. For example, a country column might list the United States as "US", "USA", "America", and "United States".
Inconsistencies can be fixed by standardizing the data to a single format.
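For example, here's a minimal sketch that standardizes a hypothetical 'country' column using the Pandas replace() method:
# Map every variant spelling to a single canonical value
country_map = {'US': 'United States', 'USA': 'United States', 'America': 'United States'}
df['country'] = df['country'].replace(country_map)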
6. Misformatted data
Misformatted data refers to data that isn't in the correct form or structure, e.g., numbers stored as strings.
Misformatted data can be easily handled by converting it to a suitable format.
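For instance, here's a sketch that converts a hypothetical 'price' column stored as strings into numeric values (assuming Pandas is imported as pd):
# Convert strings such as "42.5" to floats; values that can't be parsed become NaN
df['price'] = pd.to_numeric(df['price'], errors='coerce')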
Setting up your environment for data cleaning with Python
- Python 3.9+: Make sure you have installed Python 3.9 or higher and set up the environment (recent Pandas releases require Python 3.9+, and this tutorial uses DataFrame.map(), which needs Pandas 2.1 or later).
- Installing and importing Pandas: Run the command below in the terminal to install Pandas.
pip install pandas
To import Pandas into the script, use the following code.
import pandas as pd
- Download the dataset - This tutorial uses a dataset of tech jobs in Chicago, scraped from Indeed using the Indeed Scraper from Apify Store.
Understanding the data
The above dataset covers details such as company information, job type, location, position, posting date, and salary. It has 261 rows and 27 columns.
By analyzing this dataset, you can determine job market trends, salary distributions, and other information.
Loading and viewing the data
The Pandas library offers built-in functions for reading data from different file types such as CSV, JSON, and Excel.
Since the sample dataset is a CSV file, you can use the following code to read it.
df = pd.read_csv('Indeed_tech jobs_chicago_2024.csv') # loading the CSV file
To print the data in the dataset, simply run the print() function.
print(df)
print(df.shape)
While the print() call simply displays the dataset (which is optional), the df.shape command returns the number of rows and columns in the sample dataset.
How to clean data with Pandas
With the environment set up, you can start cleaning the dataset.
When cleaning any type of dirty data, you should make sure to answer these two questions first:
- What do I need to do?
- What’s the easiest way to do it?
1. Addressing duplicates
The first step of any data cleaning program is to remove duplicates in the dataset. This ensures that the dataset consists of unique values. To do that, you can simply use the Pandas drop_duplicates() function.
df = df.drop_duplicates()
If you run the command df.shape before and after the above line, you can see how many rows have been removed, which is 0 in this case.
2. Cleaning columns
To start cleaning the remaining data, you should determine the types of data you need for the analysis. Start by looking at the column names present in the dataset, as they describe the information each field holds.
print(df.columns) # prints columns' names
The above code will return the following result.
Index(['company', 'companyInfo/companyDescription', 'companyInfo/companyLogo',
'companyInfo/indeedUrl', 'companyInfo/rating',
'companyInfo/reviewCount', 'companyInfo/url', 'description',
'descriptionHTML', 'externalApplyLink', 'id', 'isExpired', 'jobType',
'jobType/0', 'jobType/1', 'jobType/2', 'location', 'positionName',
'postedAt', 'postingDateParsed', 'rating', 'reviewsCount', 'salary',
'scrapedAt', 'searchInput', 'url', 'urlInput'],
dtype='object')
For job market analysis, you typically need:
- Company name
- Company’s rating
- Description
- Job position
- Job type
- Salary
- Listing date
- Location
There are many unwanted columns in the sample dataset. To clean those, you can use two methods:
- Remove the unwanted columns - Deletion
- Keep the wanted columns - Selection
Since we need only 8 out of 27 columns, it’s easier to use selection.
import pandas as pd
df = pd.read_csv('Indeed_tech jobs_chicago_2024.csv') # loading the CSV file
# Specifying the columns needed
columns_to_keep = ['company', 'companyInfo/rating', 'description', 'jobType/0', 'location', 'positionName', 'postingDateParsed', 'salary']
df = df[columns_to_keep] # updating the dataframe
The code selects specific columns from the dataset. It uses the exact column names to create a new, updated DataFrame. This process removes unnecessary information, leaving only the data you need.
Next, you can edit the column names of the DataFrame for better readability and accessibility.
The following changes can be made to achieve this goal:
- Modifying “companyInfo/rating” to “Company Rating”.
- Modifying “jobType/0” to “Job Type”.
- Modifying “postingDateParsed” to “Date Posted”.
While there are several ways to do this, the easiest is to create a dictionary with the existing column name as the key and the modified name as the value. Then, call the Pandas rename() function.
# Keep only the columns you want
df = df[columns_to_keep] # updating the dataframe
# key = existing name
# value = new name
renamed_columns = {'companyInfo/rating': 'Company Rating',
                   'jobType/0': 'Job Type',
                   'postingDateParsed': 'Date Posted'}
# call the rename() method
df.rename(columns=renamed_columns, inplace=True)
# inplace=True modifies the existing DataFrame instead of creating a new one
This results in a more readable DataFrame.
3. Fixing inconsistencies
As mentioned earlier, inconsistencies occur when the same data exists in different formats. In this modified DataFrame, inconsistencies are only in the Salary column.
If you look at the Salary column of the dataset, you can see that the data is recorded per hour as well as per year.
To fix this, you have to convert these values to a single unit (salary per year) and keep them as float values for further calculations.
While you can use any algorithm for this, a straightforward method is as follows: remove the text and convert the values to floats. For values under 100 (hourly rates), multiply by 2,000 to approximate a yearly salary; keep values of 100 and above as they are. This standardizes all salaries to a yearly format for analysis.
However, in this DataFrame, some data points include salary ranges.
To clean salary ranges, keep the hyphen but remove other non-numeric characters. Then, calculate the average of the range to get a single salary value. This prevents merging ranges into incorrect large numbers. A function can perform these tasks efficiently.
import re

def clean_salary_column(df, column_name='salary'):
    def process_salary(salary):
        # Convert salary to string
        salary = str(salary)
        # Remove all non-numeric characters except for dots and hyphens
        salary = re.sub(r'[^\d.-]', '', salary)
        # Calculate the average if there's a range, else convert to float
        if '-' in salary:
            salary = sum(map(float, salary.split('-'))) / 2
        else:
            salary = float(salary) if salary != '' else 0
        # Values below 100 are hourly rates: multiply by 2,000 to approximate a yearly salary
        return salary if salary >= 100 else salary * 2000

    # Apply the processing function to the salary column
    df[column_name] = df[column_name].apply(process_salary)
    # Replace 0.0 float values with None for future operations
    df[column_name] = df[column_name].apply(lambda x: None if x == 0.0 else x)
    return df
Here, you can see that we’ve imported re, a module from Python’s standard library. It is used for regular expression operations like pattern matching, substitution, and splitting. Two operations in this code are worth a closer look (shown below as equivalent standalone column operations):
df['salary'] = df['salary'].apply(lambda x: re.sub(r'[^\d.-]', '', x))
This line uses the re.sub() function to remove all characters in the Salary column that are not digits (\d), periods (.), or hyphens (-).
The [^...] notation matches any character not listed inside the brackets. So, [^\d.-] matches any character that is not a digit, period, or hyphen.
df['salary'] = df['salary'].apply(lambda x: sum(map(float, x.split('-')))/2 if '-' in x else float(x) if x != '' else 0)
This line doesn’t use the re library directly, but it splits each string on a hyphen (-) to handle salary ranges.
If a salary value contains a hyphen, it’s split into two parts; each part is converted to a float, and the average of the two parts is calculated.
For those string and regular expression operations, the data points must be in string format, which is why each salary value is converted to str first.
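To apply it, call the function on the DataFrame (a usage sketch):
df = clean_salary_column(df, column_name='salary')
print(df['salary'].head())  # check the first few cleaned values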
Once you run the code, the salary data is converted to float values representing yearly salaries. Ranges are replaced with their average values.
4. Managing missing values
This dataset has occasional missing values in three columns: Company Rating, Job Type, and Salary. When imported as a Pandas DataFrame, these gaps are filled with "NaN" values. Other columns are complete.
There are a few solutions to remedy this.
- Imputation: Filling the null values with statistical measures such as the mean or median, or with a fixed value like 0 or “N/A”.
- Deletion: Complete deletion of rows with empty values.
- Forward fill or backward fill (propagation): This fills gaps with nearby values. It works well when missing data can be reasonably estimated from adjacent entries, like in a list of similar job salaries.
- Interpolation: This will perform a linear interpolation to estimate the missing value.
- Replace with an estimated value: Use other data to make an educated guess. For example, in a school dataset, you might estimate missing ages based on grade levels.
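As a rough sketch of how propagation and interpolation would look on a column like Salary (we don't use them for this dataset):
# Forward fill / backward fill: copy the previous or next valid value into the gap
df['salary'] = df['salary'].ffill()
df['salary'] = df['salary'].bfill()
# Linear interpolation: estimate each gap from the surrounding numeric values
df['salary'] = df['salary'].interpolate(method='linear')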
For the Job Type and Company Rating columns, fill in missing values with "N/A". This simple approach works because these fields don't affect numerical analysis. You can use the Pandas fillna() method as below:
df['Company Rating'] = df['Company Rating'].fillna("N/A")
df['Job Type'] = df['Job Type'].fillna("N/A")
Remember, these columns now hold string values ("N/A") rather than nulls.
5. Cleaning the Salary column
If you look back at the previous section, all the Salary values were converted to floats, and null values became 0. In other words, we've already used imputation there.
Using zeros or averages for missing values could skew analysis. Since the dataset is diverse, imputing mean or median values might introduce false information. The best approach here is to delete rows with missing salary data.
Even if this method reduces data volume, the remaining information allows for more accurate conclusions.
Pandas has a built-in function to remove null values, called dropna(). But since it doesn't treat 0 as a null value, you'll first have to convert the 0 values back to None.
print("before deleting null values: ", df.shape)
# Replacing 0.0 float values with None
df['salary'] = df['salary'].apply(lambda x: None if x == 0.0 else x)
df = df.dropna(subset=['salary'])
# to find number of rows removed
print("after deleting null values: ", df.shape)
As expected, about 66% of the data has been removed. Yet, you can analyze the available data more accurately than by using other methods.
before deleting null values: (261, 8)
after deleting null values: (89, 8)
6. Cleaning text data
Cleaning text data is also important in a large dataset like this. It includes the following:
- Removing whitespaces: Removing unwanted spaces before, after, and between words/letters.
- Converting to lowercase: This prevents the computer from treating identical terms with different capitalization as unique values. For example, "Coca-Cola" and "coca-cola" would be recognized as the same company.
- Removing special characters.
Here are two methods to make these changes:
- apply(): applies a function to one column (a Series) at a time, which makes it suitable for targeted fixes such as removing special characters or whitespace from a single column. It can't transform the entire DataFrame in one call.
- map(): applies a function element-wise to every value in the DataFrame, which makes it the right tool for DataFrame-wide changes. We'll discuss this in the next section.
Note: In older Pandas versions, the element-wise DataFrame method was called applymap(), but it has been deprecated in favor of DataFrame.map() in newer versions.
Cleaning text with the apply() method
The Date Posted column contains some extra characters (possibly the time) that need to be cleaned. Using apply(), you can remove these unwanted characters and keep only the posted date.
Note: There's no space between these values. After the date, the remaining text always starts with a capital T (the ISO 8601 date/time separator), which can be used to split the date from the unwanted text, as below:
# split() only works on strings, so convert the column first
df['Date Posted'] = df['Date Posted'].astype(str)
# split the text on 'T' and keep only the date part
df['Date Posted'] = df['Date Posted'].apply(lambda x: x.split('T')[0])
Similarly, if there were a space between the date and the time, you could simply split on ' ' instead of 'T'.
Text processing with map()
For any changes related to the whole dataset, using map() is the best option.
Cleaning unwanted whitespace is one such change that should be applied to the whole dataset. To do that, you can map the strip() function over every data point in the DataFrame.
But since strip() only works on strings, you have to add a condition to check whether each data point is a string. Values that aren't strings are left unchanged.
df = df.map(lambda x: x.strip() if isinstance(x, str) else x)
Even though you can't see a significant change in the DataFrame, comparing it before and after the operation will reveal the differences.
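For example, you could take a snapshot before stripping and count the cells that changed (a rough sketch):
before = df.copy()
df = df.map(lambda x: x.strip() if isinstance(x, str) else x)
# Count the cells whose values differ after stripping
changed = (before != df) & ~(before.isna() & df.isna())
print(changed.sum().sum(), "values were modified")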
Converting to lowercase is another change that should be applied to the complete dataset. You can use lower() along with map() for this:
df = df.map(lambda x: x.lower() if isinstance(x, str) else x)
This returns a dataset without uppercase characters.
7. Managing outliers
Outliers can skew data analysis, so it's often best to identify and remove them. In this dataset, only the Salary column may contain outliers. Here's a simple method to detect them:
- Find the mean of the salaries.
- Find the standard deviation of the salaries.
- Determine the cut-off value for outliers: the standard approach is to multiply the standard deviation by 3. This is known as the three-sigma rule of thumb: any data point that lies more than three standard deviations from the mean is considered an outlier.
- Find the absolute difference between each data point and the mean salary; if it's greater than the cutoff value, mark the point as an outlier.
The following code can be used.
mean_salary = df['salary'].mean() #Find the mean
std_salary = df['salary'].std() #Find standard deviation
cutoff = std_salary * 3 #determining the cutoff
df['Outlier'] = df['salary'].apply(lambda x: abs(x - mean_salary) > cutoff)
# Creates a new Boolean column marking outliers
# Print rows where 'Outlier' is True
outliers = df[df['Outlier']]
print(outliers)
When run, the above code adds a new column of Boolean values to your DataFrame and prints the rows flagged as outliers.
A scatter plot of the salary data would show a single point lying far from the rest of the distribution.
Depending on your analysis, this one point may not be enough to lead to a wrong conclusion, so you can either keep or remove the outlier.
If you want to remove the outlier(s), remove the rows in which the Outlier column value is True, and then remove the entire column from the DataFrame.
# Identifying indices of rows with outliers
outlier_indices = df[df['Outlier']].index
# Removing these rows from the DataFrame
df.drop(index=outlier_indices, inplace=True)
# Removing the 'Outlier' column as it's no longer needed
df.drop(columns=['Outlier'], inplace=True)
As a result, one row of the dataset has been removed, and no more outliers are present.
8. Merging DataFrames
Merging or combining two DataFrames based on a common column or index is useful when data is integrated from different sources.
You can use the Pandas merge() function to achieve this.
import pandas as pd
# Sample data
data1 = {'ID': [1, 2, 3], 'Name': ['Rose', 'Nick', 'Leo']}
data2 = {'ID': [1, 2, 4], 'Age': [25, 30, 35]}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
# Merging DataFrames
merged_df = pd.merge(df1, df2, on='ID')
print(merged_df)
In the above code, the two sample DataFrames are combined using the ID column.
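By default, merge() performs an inner join, so only the IDs present in both DataFrames (1 and 2) are kept:
   ID  Name  Age
0   1  Rose   25
1   2  Nick   30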
9. Concatenating DataFrames
Unlike merging DataFrames, concatenating combines DataFrames either vertically (by adding rows) or horizontally (by adding columns).
Vertical concatenation:
import pandas as pd
# Sample data
data1 = {'ID': [1, 2, 3], 'Name': ['Rose', 'Nick', 'Leo']}
data2 = {'ID': [4, 5, 6], 'Name': ['Eric', 'Mark', 'Adam']}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
# Concatenate DataFrames vertically
concatenated_df_vertical = pd.concat([df1, df2], ignore_index=True)
Output:
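   ID  Name
0   1  Rose
1   2  Nick
2   3   Leo
3   4  Eric
4   5  Mark
5   6  Adam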
Horizontal concatenation:
import pandas as pd
# Sample data
data1 = {'ID': [1, 2, 3], 'Name': ['Rose', 'Nick', 'Leo']}
data2 = {'Age': [25, 30, 35]}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
# Concatenate DataFrames horizontally
concatenated_df_horizontal = pd.concat([df1, df2], axis=1)
Output:
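   ID  Name  Age
0   1  Rose   25
1   2  Nick   30
2   3   Leo   35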
Saving cleaned data
After completing the data cleaning process, you can save your Pandas DataFrame in various file formats using built-in functions.
Saving to CSV
# Save to CSV
df.to_csv('cleaned_data.csv', index=False)
Saving to Excel
# Save to Excel
df.to_excel('cleaned_data.xlsx', index=False)
Saving to JSON
# Save to JSON
df.to_json('cleaned_data.json', orient='records', lines=True)
Advanced data transformation techniques
Data transformation is key to knowing how to clean data in Python for machine learning. It goes beyond basic cleaning tasks like handling missing values or removing duplicates. Advanced techniques convert raw data into formats better suited for analysis and modeling.
This section explores advanced data transformation methods valuable for anyone learning to clean data in Python. Remember, these techniques aren't universal - each dataset is unique and requires a tailored approach. We'll use small sample datasets to illustrate these concepts.
1. Normalization and standardization
These are data transformation techniques that serve two different purposes.
- Normalization: Scales data to a range of [0, 1] or [-1, 1], which is useful when features have different units or scales.
- Standardization: Transforms data to have a mean of 0 and a standard deviation of 1, which makes it easier to compare features that follow a normal distribution.
Example 1: The following code uses MinMaxScaler to normalize data in a DataFrame with different scales.
from sklearn.preprocessing import MinMaxScaler
import pandas as pd
# Sample dataset
data = {'feature1': [10, 20, 30, 40, 50], 'feature2': [100, 200, 300, 400, 500]}
df = pd.DataFrame(data)
# Initialize the scaler
scaler = MinMaxScaler()
# Normalize the data
normalized_data = scaler.fit_transform(df)
normalized_df = pd.DataFrame(normalized_data, columns=df.columns)
print(normalized_df)
Output:
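   feature1  feature2
0      0.00      0.00
1      0.25      0.25
2      0.50      0.50
3      0.75      0.75
4      1.00      1.00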
Example 2: The following code uses StandardScaler to standardize the sample dataset so that each feature has a mean of 0 and a standard deviation of 1.
from sklearn.preprocessing import StandardScaler
import pandas as pd
# Sample data
data = {'feature1': [10, 20, 30, 40, 50], 'feature2': [100, 200, 300, 400, 500]}
df = pd.DataFrame(data)
# Initialize the scaler
scaler = StandardScaler()
# Standardize the data
standardized_data = scaler.fit_transform(df)
standardized_df = pd.DataFrame(standardized_data, columns=df.columns)
Output:
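   feature1  feature2
0 -1.414214 -1.414214
1 -0.707107 -0.707107
2  0.000000  0.000000
3  0.707107  0.707107
4  1.414214  1.414214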
2. Encoding categorical variables
Categorical variables contain distinct categories or levels rather than numeric values. They represent finite sets of discrete options.
However, these variables can't be used directly in machine learning and must be transformed into numerical features. This is called "encoding categorical variables". There are two types of encoding, compared in the table below:
|  | One-Hot Encoding | Label Encoding |
|---|---|---|
| Task | Converts each unique category into a new binary column. Each column corresponds to a single category, and the values are 0 or 1. | Assigns a unique integer to each category. |
| Usage | Used when categorical variables do not have an inherent order. Suitable for nominal data (categories without order), like colors, fruits, etc. | Used when categorical variables have an inherent order. Suitable for ordinal data (categories with order), like rankings, ratings, etc. |
One-Hot Encoding
One-hot encoding converts each unique category into a binary column. The presence of a category is marked with 1, its absence with 0.
Example: One-hot encoding using pandas.get_dummies():
import pandas as pd
# Sample data
data = {'object': ['ball', 'bat', 'ball', 'doll']}
df = pd.DataFrame(data)
# One-Hot Encoding
encoded_df = pd.get_dummies(df, columns=['object'])
Result:
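In recent Pandas versions, get_dummies() produces Boolean columns (older versions use 0/1):
   object_ball  object_bat  object_doll
0         True       False        False
1        False        True        False
2         True       False        False
3        False       False         True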
Label Encoding
In Label Encoding, an integer is assigned to each unique category.
Example: Label encoding using LabelEncoder:
from sklearn.preprocessing import LabelEncoder
import pandas as pd
# Sample data
data = {'object': ['ball', 'bat', 'ball', 'doll']}
df = pd.DataFrame(data)
# Initialize the encoder
label_encoder = LabelEncoder()
# Label Encoding
df['object_encoded'] = label_encoder.fit_transform(df['object'])
Result:
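  object  object_encoded
0   ball               0
1    bat               1
2   ball               0
3   doll               2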
3. Log transformation
Log transformation replaces each value with its logarithm, simplifying analysis, reducing the impact of outliers, and evening out the data distribution.
Example: Applying a log transformation to a simple dataset with a Price column that contains highly skewed data, using NumPy's log().
import pandas as pd
import numpy as np
# Sample data
data = {'Price': [100, 150, 200, 300, 1000, 5000, 10000, 20000, 50000, 100000]}
df = pd.DataFrame(data)
df['Log_Price'] = np.log(df['Price'])
Output:
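    Price  Log_Price
0     100   4.605170
1     150   5.010635
2     200   5.298317
3     300   5.703782
4    1000   6.907755
5    5000   8.517193
6   10000   9.210340
7   20000   9.903488
8   50000  10.819778
9  100000  11.512925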
4. Binning
Binning groups continuous data into intervals or "bins", reducing minor errors and making patterns more visible.
Example: Applying the binning technique to a dataset of ages using pd.cut().
import pandas as pd
data = {'Age': [22, 25, 47, 52, 46, 56, 72, 77, 48, 37]}
df = pd.DataFrame(data)
bins = [0, 30, 50, 100] # Define bin edges
labels = ['Young', 'Middle-aged', 'Senior'] # Define bin labels
df['Age_Group'] = pd.cut(df['Age'], bins=bins, labels=labels)
Result:
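   Age    Age_Group
0   22        Young
1   25        Young
2   47  Middle-aged
3   52       Senior
4   46  Middle-aged
5   56       Senior
6   72       Senior
7   77       Senior
8   48  Middle-aged
9   37  Middle-aged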
You can find a simple cheat sheet for data cleaning in the GitHub repository.
Data cleaning tips
Here are a few tips you can follow in any data-cleaning workflow in Python.
#1. Understand your data first
Before cleaning, examine the dataset closely. Familiarize yourself with the data points, formats, and potential conflicts. Consider jotting down key observations to guide your cleaning process.
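A few quick Pandas calls go a long way here (a sketch using the dataset loaded earlier):
print(df.head())      # preview the first rows
df.info()             # column names, dtypes, and non-null counts
print(df.describe())  # summary statistics for numeric columns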
#2. Store raw data separately
Before cleaning your dataset, make sure to save a copy of the original data in a separate location, such as a cloud storage or backup service. Additionally, create a backup DataFrame alongside your working one. Keeping an external copy protects against data loss and ensures you can restart the process if needed while maintaining a reference to the original data.
backup_df = df.copy()
#3. Comment your code
Comments in code help preserve your reasoning and decision-making process. Focus on explaining why certain actions are taken, not just what they do. For example, note your rationale for outlier detection methods. This makes it easier to understand and maintain your work over time.
mean_salary = df['salary'].mean()
std_salary = df['salary'].std()
cutoff = std_salary * 3
# "three sigma of thumb" rule
# cut-off mark is 3 times the standard deviation from the mean value
df['Outlier'] = df['salary'].apply(lambda x: abs(x - mean_salary) > cutoff)
# if any data point is greater than its value minus the mean salary, it's an outlier
#4. Keep a data-cleaning log
Staying organized by keeping a log of your cleaning process gives you a clear overview of what comes next and prevents mistakes and oversights.
If detailed logging disrupts your coding flow, use a simple checklist of required changes for each dataset.
Cleaning "Chicago Job Market" - 25/07:
- Removed duplicates to ensure unique entries.
- Filled missing values in 'column_name' with mean value to maintain data consistency.
- Standardized 'category_column' to lowercase for uniformity.
Conclusion
Getting rid of dirty data is a critical step to achieving accurate datasets. We hope this tutorial gives you the confidence to start data cleaning using Python and Pandas. Now you know how to handle missing values, remove duplicates, fix inconsistencies, manage outliers, and transform data for better analysis and modeling.