What is data collection for machine learning?

Learn the whole process of collecting data for AI, from acquisition to generating new datasets.


The domain of data science now encapsulates almost everything, and, looking at the current pace of this field, it's apparent that it'll be used in every sphere in the upcoming years.

As the name suggests, "data science" revolves around data. It has been said that "Data is the new currency" and "Data is the new oil". The rise of AI has made this all the more true today. Machine learning models require vast amounts of data. Larger datasets lead to greater accuracy, resulting in better model performance.

Although the field of data science has evolved a lot during the last two decades, the challenges of data collection for AI remain. There's no doubt that today, we have access to more data than ever before, but gathering and molding the data according to the needs of the model demands a great deal of time and money. That's why big companies like Google and Microsoft spend millions of dollars just on data collection and cleaning.

Why is data collection difficult?

As we mentioned earlier, data collection for machine learning is challenging, but why is that? Here are a few reasons:

  • Quality: Getting lots of data is one thing, but ensuring it's accurate and relevant to the problem is a whole different challenge.
  • Diverse Sources: Data comes from many different sources, like online platforms, sensors, and direct feedback. Each has its own format, making it difficult to handle.
  • Time: Collecting meaningful data isn't always quick. It can be a lengthy process to gather just the right information. It unintentionally skews results if you’re not careful.
  • Integration: Merging data from different sources can be like trying to blend oil and water. You need to be very careful to ensure consistency.
  • Bias: Sometimes, the data you collect is biased and doesn't accurately represent the bigger picture, which is usually required for better results.

These are the most common problems almost every AI expert or data scientist faces. In this article, we'll cover all these problems and their solutions and find out how to generate better datasets for AI and machine learning models.

What do you need to get started?

To completely understand the content and code samples showcased in this post, you should have the following:

  • Python installed
  • Python libraries (apify-client, imblearn, matplotlib, nlpaug, nltk, pandas, sklearn)

To install all these libraries in your environment, open your terminal and enter the following command:

pip install apify-client imblearn matplotlib pandas sklearn

It should be a quick install.

You'll use the Apify API token in this tutorial. Here's how to get it:

  1. Sign up or log in to your Apify account (free, no credit card needed)
  2. Navigate to the integrations section in the account settings
  3. Here, you'll find your unique API token. Keep it handy; you'll need it shortly!

The aim of this tutorial will be a data pipeline with the following steps:

  • Data acquisition
  • Data ingestion
  • Data preprocessing
  • Data generation

Ultimately, the data will be ready for any ML/DL model. Let's start with the first step.

How to acquire data for AI models

The first step in the data collection phase is data acquisition, which means gathering the data from a source. The source can be a website, camera, sound recorder, or any other source of data relevant to your use case.

But extracting data from a source (scraping a website, for example) is a challenging task. The most time-consuming part is creating a scraper for that particular website. The Apify platform has this issue in hand. It provides hundreds of scrapers to fetch data from any website by following just a few steps.

In this tutorial, we'll use two Amazon review scrapers from Apify Store:

These are two different scrapers even though they almost share the same name and perform similar tasks. Developers can create any tool (called Actors) on Apify and share it with other users, either for free or for a small fee. A cornerstone of the Apify platform is making it possible for developers to monetize their code.

So let's use these two Actors to fetch the reviews of two different products from the Amazon store.

Note: The reason for using two different scrapers is that we want to implement two data streams to better understand the concepts of the "data ingestion" part that follows.

The code to scrape reviews of two different products from Amazon is given below:

from apify_client import ApifyClient
import json
# Initialize the ApifyClient with API token
client = ApifyClient("apify_api_JaJdodNpT1eYrxklaWpG2gsL2N291O14MiAf")

# Prepare the first Actor input
first_run_input = { 
    "url": "https://www.amazon.com/Apple-iPhone-11-64GB-White/dp/B08BHKSZ5P/ref=sr_1_1?keywords=iphone&qid=1692321184&sr=8-1&th=1" 

# Prepare the second Actor input
second_run_input = {
    "productUrls": [
            "url": "https://www.amazon.com/iPhone-Pro-1TB-Alpine-Green/dp/B0BGYCHZWF/ref=sr_1_1?crid=1KM34M1FH1SYO&keywords=iphone&qid=1692616691&sprefix=iphone%2Caps%2C384&sr=8-1&th=1"
    "proxyConfiguration": { "useApifyProxy": True },
print("Scraping the data for the first product. You can see the running actors on Apify console")
# Run the first Actor and wait for it to finish
first_run = client.actor("bebity/amazon-reviews-scraper").call(run_input=first_run_input)
first_run_data = list(client.dataset(first_run["defaultDatasetId"]).iterate_items())
# Save the data in a Json file 
with open('first_run.json', 'w') as f:
    json.dump(first_run_data, f)
print("\nFirst five entries from first_run:")
for _, entry in enumerate(first_run_data[:5]):

print("\nScraping the data for the second product. You can see the running actors on Apify console")
# Run the second Actor and wait for it to finish
second_run = client.actor("junglee/amazon-reviews-scraper").call(run_input=second_run_input)
second_run_data = list(client.dataset(second_run["defaultDatasetId"]).iterate_items())
# Save the data in a Json file 
with open('second_run.json', 'w') as f:
    json.dump(second_run_data, f)
print("\nFirst five entries from second_run:")
for _, entry in enumerate(second_run_data[:5]):

In a nutshell, we're using the Apify client to scrape reviews for two different products. First, initialize the Apify client with an API token for authentication. Then set up the URLs as input. After that, trigger two separate Actors to collect their reviews and put the reviews in json files. Finally, print the scraped data. We'll use that data in the next step.

Note: We'll develop a classification model with five classes, ranging from 1 star to 5 stars. This model will predict a rating based on the customer's review or description

Data ingestion for AI

Data ingestion is essentially the process of collecting and importing data from different sources, cleaning and structuring it, and then storing it for the next steps.

In this case, you have two data sources, and you may need to structure them to merge them and train the model.

Note: There are two data sources - both are scraped data from Amazon. It's important to note that the sources could be different - one from Amazon and another from a different website. However, this way it's easier to learn the concept of formatting and managing data from two sources.

Using Pandas for data ingestion

The first thing to do is put the data in Pandas data frames. Let's do that now.

import pandas as pd

# Convert the first_run.json file into a dataframe
first_run_df = pd.read_json('first_run.json')

# Convert the second_run.json file into a dataframe
second_run_df = pd.read_json('second_run.json')

# To confirm, you can print the first few rows of each dataframe:

The next step is to select the columns that you need for the model and remove the remaining ones. Let's print the columns and see what we have:


The output would be something like this:

Index(['ratingScore', 'reviewTitle', 'reviewUrl', 'reviewReaction',
       'reviewedIn', 'date', 'country', 'countryCode', 'reviewDescription',
       'isVerified', 'variant', 'reviewImages', 'position', 'productAsin',
       'reviewCategoryUrl', 'totalCategoryRatings', 'totalCategoryReviews'],
Index(['reviewContent', 'title', 'date', 'clientName', 'notation',
       'profilePicture', 'commentImages', 'commentLink'],

Let’s go with the following columns:

  • reviewContent or reviewDescription
  • ratingScore or notation
  • reviewTitle or title
  • reviewUrl or commentLink
  • date

Now let's remove the remaining columns from both data frames.

# For the first_run DataFrame
first_run_df = first_run_df[['reviewContent', 'title', 'date','notation']]
# For the second_run DataFrame
second_run_df = second_run_df[['reviewDescription', 'ratingScore', 'reviewTitle', 'date']]

Finding and filling missing values

The next step is to find missing values from the columns and fill them.

# Check for null values in both datasets
null_values_dataset_1 = first_run_df.isnull().sum()
null_values_dataset_2 = second_run_df.isnull().sum()

# Identify columns with null values
columns_with_null_values_1 = null_values_dataset_1[null_values_dataset_1 > 0]
columns_with_null_values_2 = null_values_dataset_2[null_values_dataset_2 > 0]

You can see some columns with missing values here. We'll drop those entries and generate new data from the remaining dataset in the next steps.

# Drop rows with any NULL values for the first_run DataFrame
first_run_df = first_run_df.dropna()
# Drop rows with any NULL values for the second_run DataFrame
second_run_df = second_run_df.dropna()

This code will drop all the rows with missing values.

Note: The target column is "ratings" or "notations." If there are any missing values in the target column, it's not advisable to fill in those values. Instead, it's preferable to drop those rows. Filling in missing values in the target column can introduce bias or inaccuracies.

Making the data types and values consistent

The next step is to make the data types and values from the data sources consistent. In this case, if you look at the data column in both data frames, the format and the values of the columns differ. Let's make them consistent.

# Format the date in the first dataframe
first_run_df['date'] = first_run_df['date'].str.extract(r'on (\w+ \d+, \d+)')
first_run_df['date'] = pd.to_datetime(first_run_df['date'], format='%B %d, %Y')
# Format the date in the second dataframe
second_run_df['date'] = pd.to_datetime(second_run_df['date'], format='%Y-%m-%d')
first_run_df.head(), second_run_df.head()

The last step in this pipeline is to make the names of the columns consistent in both the data frames and merge them.

# Rename the columns in first dataframe for consistency
    'reviewContent': 'reviewDescription',
    'title': 'reviewTitle',
    'notation': 'ratingScore'
}, inplace=True)
# Merge the two dataframes
merged_df = pd.concat([first_run_df, second_run_df], ignore_index=True)

The refined data is now ready to be saved in any database. It could be an SQL database or any other tool. In this case, we'll save the data in a csv file.

# Name the file
file_name = "merged_reviews.csv"
# Save the file as csv
merged_df.to_csv(file_name, index=False)

AI and data augmentation

Data augmentation is the process of generating new synthetic data from the current data if the current data has fewer samples. Augmentation methods are very popular in computer vision applications but are just as powerful for natural language processing. In computer vision, you just flip the images to generate a new data entry, but in NLP, you change the text by applying different methods. One of them is synonym replacement.

For this method, replace the synonyms of the current text to generate a new text with the same meaning. It's a very delicate technique because just one wrong synonym or word can change the whole context of the text. For this, we'll use nlpaug , a very powerful library that replaces synonyms.

pip install nlpaug

After this, you can use the wordnet library to help with synonyms.

import nlpaug.augmenter.word as naw
# Initialize the SynonymAug augmenter. "n" represents the number of synonyms in each sentence
description_aug = naw.SynonymAug(aug_src='wordnet', aug_max=5)
# define a new list to store the augmented rows
new_rows = []

# Loop through each row of the dataset
for _, row in merged_df.iterrows():
    # Augment the reviewDescription columns. "n" denotes the number of sentences we want to augment
    augmented_description = description_aug.augment(row['reviewDescription'], n=1)
    # Create a new row with the augmented data and other columns unchanged
    new_row = row.copy()
    if len(augmented_description) != 0:
        new_row['reviewDescription'] = augmented_description[0]

# Convert list of new rows to a DataFrame
new_rows_df = pd.DataFrame(new_rows)

# Append the new rows DataFrame
merged_df_augmented = pd.concat([merged_df, new_rows_df], ignore_index=True)

The code above uses the nlpaug library to augment the reviewDescription column with synonyms from wordnet and generate a new version of each description. It then appends the augmented descriptions.

Data preprocessing for machine learning

The preprocessing stage involves transforming raw textual data into a structured and clean format that can be easily fed into machine learning or deep learning models. This phase is crucial because, as the saying goes, "garbage in, garbage out." If the data is not cleaned enough, the model will eventually give bad results.

To clean the data, you need to go through several steps:

  • Lower casing
  • Removing punctuation
  • Tokenization
  • Removing stopwords
  • Lemmatization

If nltk is not installed in your environment, you can install it using the following command:

pip install nltk

Pass reviewDescription through all these steps:

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
import re
# Download necessary NLTK data
# Initialize the lemmatizer
lemmatizer = WordNetLemmatizer()
# Set the stop words
stop_words = set(stopwords.words('english'))
# Perform preprocessing methods to each row
def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()
    # Remove punctuation
    text = re.sub(r'[^\w\s]', '', text)
    # Tokenization
    tokens = word_tokenize(text)
    # Remove stopwords and lemmatization
    tokens = [lemmatizer.lemmatize(token) for token in tokens if token not in stop_words]
    return ' '.join(tokens)

# Apply preprocessing to reviewDescription columns
merged_df['reviewDescription'] = merged_df['reviewDescription'].apply(preprocess_text)

In this code, we used the Natural Language Toolkit (NLTK) to clean the text. We changed the text to lowercase, removed punctuation, and broke it into individual words as tokens. Then we removed common stopwords like "and" or "the" and simplified each word to its root form. Finally, we transformed the tokenized words back into sentences.

What if the data contains imbalanced classes?

After all that hassle, you don't want to have imbalanced data classes, do you? Imbalanced classes may contain 90 samples of 1 class and just 10 samples of the other. In this case, the model will be biased towards the first class, no matter how well-designed it is.

It's very important to have a balanced dataset for an unbiased and optimized model performance. To achieve this, you need to implement methods that generate the data points of the minority class, ensuring the dataset is better suited for the model.

How to generate data for AI

In the end, it's very important to have a complete overview of the dataset. If the data is biased, you need to try to make it unbiased to improve the model's performance. Let's take a look at the distribution of the classes for the target column, that is ratingScore.

import matplotlib.pyplot as plt
# Count the number of samples for each class in ratingScore column
rating_counts = merged_df['ratingScore'].value_counts()
# Plot
plt.figure(figsize=(8, 6))
rating_counts.plot(kind='bar', color='skyblue')
# Set the title of the plot 
plt.title('Class distribution for ratingScore')
# Set the x and y labels
plt.xlabel('Rating Score')
plt.ylabel('Number of Reviews')

The resulting graph of this example would look a little bit like this:

Graph of class distribution for review description in the training set - graph 1

This means you have an unequal distribution of classes in the dataset, and you need to increase the data points or data samples of the minority classes. For this, you'll use the Synthetic Minority Oversampling Technique (SMOTE). This technique generates synthetic samples of the minority classes and tries to balance the class distribution.

Before applying SMOTE, you need to first perform two steps:

  • Vectorization: Convert text into a numerical format so the model can understand and process it.
  • Train-Test Split: Separate the data to ensure the model learns from one portion and gets tested on an unseen portion, keeping the evaluation genuine.
# Import the necessary libraries
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
# Define the features
tfidf_vectorizer_description = TfidfVectorizer(max_features=5000)
# Transform reviews to vesctors
X_description = tfidf_vectorizer_description.fit_transform(merged_df['reviewDescription'])
# Set the target column
y = merged_df['ratingScore']
# Apply train and test split
X_description_train, X_description_test, y_description_train, y_description_test = train_test_split(X_description, y, test_size=0.2, random_state=42)

The above code divides the data into train and test splits with a proportion of 80% training and 20% test.

Now, we're ready to apply SMOTE on the training data.

from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=42,k_neighbors=4)
X_description_train_resampled, y_description_train_resampled = smote.fit_resample(X_description_train, y_description_train)

The code above will generate synthetic samples for the minority classes and make an equal distribution if you try to make the same plot again.

description_counts_train = y_description_train_resampled.value_counts()
# Plotting
plt.figure(figsize=(8, 6))
description_counts_train.plot(kind='bar', color='skyblue')
plt.title('Class distribution for reviewDescription in the training set')
plt.ylabel('Number of Samples')

You'll see a graph like this:

Graph of class distribution for review description in the training set - graph 2

This equal distribution will help the model to generalize the concepts. After this step, the data is ready to be fed to the model, and probably the model will perform better.

An AI model is only as good as the data

We've covered almost the entire pipeline in this blog, from acquiring data to generating new samples before training a model. You learned that no matter how good the model is, it will only perform well if you feed it quality data. You learned about removing irrelevant data and only retaining what's necessary, and you discovered the importance of equal representation of all classes in our training sample.

In machine learning or deep learning, as important as the model is, perhaps the data fed to it is even more crucial.

Get started now

Step up your web scraping and automation