Machine learning models require vast amounts of data: larger, higher-quality datasets generally lead to more accurate models and better overall performance.
Although the field of data science has evolved a lot during the last two decades, the challenges of data collection for AI remain. What are these challenges?
- Quality: Getting lots of data is one thing, but making sure it's accurate and relevant to the problem is a whole different thing.
- Diverse sources: Data comes from many different sources, like online platforms, sensors, and direct feedback. Each has its own format, which makes it difficult to handle.
- Time: Collecting meaningful data isn't always quick. It can be a lengthy process to gather just the right information, while the wrong information can skew results.
- Integration: Merging data from different sources can be like trying to blend oil and water. You need to be very careful to ensure consistency.
- Bias: Sometimes the data you collect is biased and doesn't accurately represent the bigger picture, which skews the model's results.
In this article
We'll cover all these problems and their solutions and find out how to generate better datasets for AI and machine learning models.
The aim will be a data pipeline with the following steps:
- Data acquisition
- Data ingestion
- Data augmentation
- Data preprocessing
- Generating a dataset
Ultimately, the data will be ready for any ML/DL model.
Getting started
To completely understand the content and code samples showcased in this post, you should have Python installed and the following libraries:
- `apify-client`
- `imblearn`
- `matplotlib`
- `nlpaug`
- `nltk`
- `pandas`
- `sklearn`
To install all these libraries in your environment, open your terminal and enter the following command:
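If you're installing from PyPI, something like this should do the trick (note that `sklearn` and `imblearn` are published there as `scikit-learn` and `imbalanced-learn`):

```bash
pip install apify-client imbalanced-learn matplotlib nlpaug nltk pandas scikit-learn
```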
It should be a quick install.
We'll also use an Apify API token in this tutorial. Here's how to get yours:
- Sign up or log in to your Apify account (free, no credit card needed)
- Navigate to the integrations section in the account settings
- Here, you'll find your unique API token. Keep it handy; you'll need it shortly!

How to collect data for machine learning
Step by step
Step 1. Data acquisition
The first step in the data collection phase is data acquisition, which means collecting the data from a source. The source can be a website, camera, sound recorder, or any other source of data relevant to your use case.
But extracting data from a source (scraping a website, for example) is a challenging task, and the most time-consuming part is creating a scraper for that particular website. The Apify platform has this covered: it provides hundreds of ready-made scrapers that can fetch data from almost any website in just a few steps.
In this tutorial, we'll use two Amazon review scrapers (Actors) from Apify Store. The reason for using two different scrapers is that we want to implement two data streams to better understand the concepts of the "data ingestion" part that follows.
Here's the code to scrape reviews of two different products from Amazon:
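Here's a minimal sketch of what that code can look like. The Actor IDs, product URLs, and the `productUrls` input field below are placeholders; check the input schema of the Actors you pick in Apify Store for the exact field names:

```python
import json

from apify_client import ApifyClient

# Authenticate with your Apify API token (placeholder below)
client = ApifyClient("YOUR_APIFY_API_TOKEN")


def scrape_reviews(actor_id: str, product_url: str, output_file: str) -> list[dict]:
    # Call the Actor and wait for the run to finish
    run = client.actor(actor_id).call(run_input={"productUrls": [{"url": product_url}]})

    # Fetch the scraped reviews from the run's default dataset
    items = client.dataset(run["defaultDatasetId"]).list_items().items

    # Store the raw reviews in a json file for later use
    with open(output_file, "w") as f:
        json.dump(items, f, indent=2)

    return items


# Hypothetical Actor IDs and product URLs - replace them with your own choices
items_1 = scrape_reviews("FIRST_ACTOR_ID", "https://www.amazon.com/dp/PRODUCT_1", "reviews_1.json")
items_2 = scrape_reviews("SECOND_ACTOR_ID", "https://www.amazon.com/dp/PRODUCT_2", "reviews_2.json")

print(items_1[:2])
print(items_2[:2])
```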
In a nutshell, we're using the Apify client to scrape reviews for two different products.
- First, initialize the Apify client with an API token for authentication.
- Then set up the URLs as input.
- After that, trigger two separate Actors to collect their reviews and save the reviews in `json` files.
- Finally, print the scraped data. We'll use that data in the next step.
Step 2. Data ingestion for ML
Data ingestion is essentially the process of collecting and importing data from different sources, cleaning and structuring it, and then storing it for the next steps.
In this case, you have two data sources, and you need to structure them consistently so you can merge them and train the model.
#1. Using Pandas for data ingestion
The first thing to do is put the data in Pandas data frames. Let's do that now.
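Assuming the two Actor runs returned lists of review dictionaries (here `items_1` and `items_2`, as in the sketch above), this is all it takes:

```python
import pandas as pd

# One data frame per data stream
df1 = pd.DataFrame(items_1)
df2 = pd.DataFrame(items_2)

print(df1.shape, df2.shape)
```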
#2. Selecting columns
The next step is to select the columns that you need for the model and remove the remaining ones. Let's print the columns and see what we have:
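For example:

```python
print(df1.columns)
print(df2.columns)
```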
The output would be something like this:
Let’s go with the following columns:
- `reviewContent` or `reviewDescription`
- `ratingScore` or `notation`
- `reviewTitle` or `title`
- `reviewUrl` or `commentLink`
- `date`
Now let's remove the remaining columns from both data frames.
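A sketch, assuming `df1` holds the first set of column names and `df2` the second (swap them if yours are the other way around):

```python
# Keep only the columns we need for the model
df1 = df1[["reviewContent", "ratingScore", "reviewTitle", "reviewUrl", "date"]]
df2 = df2[["reviewDescription", "notation", "title", "commentLink", "date"]]
```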
#3. Finding and filling missing values
The next step is to find missing values from the columns and fill them.
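Counting the nulls per column is enough to spot them:

```python
# Number of missing values per column in each data frame
print(df1.isnull().sum())
print(df2.isnull().sum())
```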
You can see some columns with missing values here. We'll drop those entries and generate new data from the remaining dataset in the next steps.
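Dropping them is a one-liner per data frame:

```python
# Remove every row that has at least one missing value
df1 = df1.dropna()
df2 = df2.dropna()
```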
This code will drop all the rows with missing values.
#4. Making the data types and values consistent
The next step is to make the data types and values from the two sources consistent. In this case, if you look at the date column in both data frames, the format and the values differ. Let's make them consistent.
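One way to do that, assuming `date` is the inconsistent column, is to parse both columns with pandas and normalize them to a single string format:

```python
# Parse whatever format each scraper used, then normalize to YYYY-MM-DD
# (errors="coerce" turns unparseable values into NaT, which you can drop afterwards)
df1["date"] = pd.to_datetime(df1["date"], errors="coerce").dt.strftime("%Y-%m-%d")
df2["date"] = pd.to_datetime(df2["date"], errors="coerce").dt.strftime("%Y-%m-%d")
```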
The last step in this pipeline is to make the column names consistent in both data frames and merge them.
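Here's a sketch that standardizes on the `reviewDescription` / `ratingScore` / `reviewTitle` / `reviewUrl` / `date` naming used in the rest of this tutorial (the mapping assumes the column layout from the earlier sketch):

```python
# Align the column names of both data frames
df1 = df1.rename(columns={"reviewContent": "reviewDescription"})
df2 = df2.rename(columns={
    "notation": "ratingScore",
    "title": "reviewTitle",
    "commentLink": "reviewUrl",
})

# Merge the two data streams into a single data frame
df = pd.concat([df1, df2], ignore_index=True)
print(df.shape)
```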
The refined data is now ready to be saved in any database. It could be an SQL database or any other tool. In this case, we'll save the data in a `csv` file.
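The file name below is just an example:

```python
df.to_csv("amazon_reviews.csv", index=False)
```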
Step 3. Data augmentation
Data augmentation is the process of generating new synthetic data from the current data if the current data has fewer samples. Augmentation methods are very popular in computer vision applications but are just as powerful for natural language processing. In computer vision, you just flip the images to generate a new data entry, but in NLP, you change the text by applying different methods. One of them is synonym replacement.
For this method, you replace words in the current text with their synonyms to generate new text with the same meaning. It's a delicate technique because just one wrong synonym can change the whole context of the text. For this, we'll use `nlpaug`, a powerful library for synonym replacement.
After this, you can use the `wordnet` corpus (shipped with NLTK) as the source of synonyms.
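Here's a sketch of the augmentation step, applied to the merged data frame `df` from the previous step (note that recent `nlpaug` versions return a list from `augment()`):

```python
import nltk
import nlpaug.augmenter.word as naw

# Resources the synonym augmenter relies on (cached after the first download)
nltk.download("wordnet")
nltk.download("omw-1.4")
nltk.download("averaged_perceptron_tagger")

# Synonym replacement backed by wordnet
aug = naw.SynonymAug(aug_src="wordnet")


def augment_text(text: str) -> str:
    result = aug.augment(text)
    # Older nlpaug versions return a string, newer ones a list of strings
    return result[0] if isinstance(result, list) else result


# Generate an augmented copy of every review and append it to the dataset
augmented = df.copy()
augmented["reviewDescription"] = augmented["reviewDescription"].apply(augment_text)
df = pd.concat([df, augmented], ignore_index=True)
```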
The code above uses the `nlpaug` library to augment the `reviewDescription` column with synonyms from `wordnet` and generate a new version of each description. It then appends the augmented descriptions to the dataset.
Step 4. Data preprocessing
The preprocessing stage involves transforming raw textual data into a structured and clean format that can be easily fed into machine learning or deep learning models. This phase is crucial because, as the saying goes, "garbage in, garbage out." If the data is not cleaned enough, the model will eventually give bad results.
To clean the data, you need to go through several steps:
- Lower casing
- Removing punctuation
- Tokenization
- Removing stopwords
- Lemmatization
We'll do all this using `nltk`. If you don't have it installed in your environment, you can install it using the following command:
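```bash
pip install nltk
```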
Now, pass `reviewDescription` through all these steps:
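A sketch of the cleaning pipeline, applied to the `reviewDescription` column of `df`:

```python
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Tokenizer models, stopword list, and lemmatizer data
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()


def preprocess(text: str) -> str:
    text = text.lower()                                                # lower casing
    text = text.translate(str.maketrans("", "", string.punctuation))   # remove punctuation
    tokens = word_tokenize(text)                                       # tokenization
    tokens = [t for t in tokens if t not in stop_words]                # remove stopwords
    tokens = [lemmatizer.lemmatize(t) for t in tokens]                 # lemmatization
    return " ".join(tokens)                                            # back to a sentence


df["reviewDescription"] = df["reviewDescription"].apply(preprocess)
```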
In this code, we used the Natural Language Toolkit (NLTK) to clean the text. We did the following:
- Changed the text to lowercase.
- Removed punctuation.
- Broke it into individual words as tokens.
- Removed common stopwords like "and" or "the" and simplified each word to its root form.
- Transformed the tokenized words back into sentences.
What if the data contains imbalanced classes?
After all that hassle, you don't want to end up with imbalanced classes, do you? An imbalanced dataset might contain 90 samples of one class and just 10 of another. In that case, the model will be biased towards the majority class, no matter how well-designed it is.
It's very important to have a balanced dataset for unbiased and optimized model performance. To achieve this, you need to implement methods that generate additional data points for the minority class, making the dataset better suited for the model.
Step 5. Generating a dataset for ML
In the end, it's very important to have a complete overview of the dataset. If the data is biased, you need to try to make it unbiased to improve the model's performance. Let's take a look at the distribution of classes for the target column, `ratingScore`.
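A quick bar chart of the class counts is enough for this:

```python
import matplotlib.pyplot as plt

# How many reviews fall into each rating class?
df["ratingScore"].value_counts().sort_index().plot(kind="bar")
plt.xlabel("ratingScore")
plt.ylabel("Number of reviews")
plt.title("Class distribution of the target column")
plt.show()
```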
The resulting graph of this example would look a little bit like this:

This means you have an unequal distribution of classes in the dataset, and you need to increase the data points or data samples of the minority classes. For this, you'll use the Synthetic Minority Oversampling Technique (SMOTE). This technique generates synthetic samples of the minority classes and tries to balance the class distribution.
Before applying SMOTE, you need to first perform two steps:
- Vectorization: Convert text into a numerical format so the model can understand and process it.
- Train-Test Split: Separate the data to ensure the model learns from one portion and gets tested on an unseen portion, keeping the evaluation genuine.
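Here's a sketch of both steps. TF-IDF is just one reasonable vectorizer choice, and the 80/20 split matches the proportion described below:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Vectorization: turn the cleaned review text into numerical features
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df["reviewDescription"])
y = df["ratingScore"]

# Train-test split: 80% for training, 20% held out for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```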
The above code divides the data into train and test splits with a proportion of 80% training and 20% test.
Now, we're ready to apply SMOTE on the training data.
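SMOTE from `imblearn` does the heavy lifting (each minority class needs at least a handful of samples for its nearest-neighbour interpolation to work):

```python
from collections import Counter

from imblearn.over_sampling import SMOTE

# Oversample the minority classes in the training split only
smote = SMOTE(random_state=42)
X_train_balanced, y_train_balanced = smote.fit_resample(X_train, y_train)

# Check the new class counts
print(Counter(y_train_balanced))
```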
The code above generates synthetic samples for the minority classes and evens out the class distribution. If you make the same plot again, you'll see a graph like this:

This equal distribution will help the model to generalize the concepts. After this step, the data is ready to be fed to the machine learning model, which will almost certainly perform better.
An ML model is only as good as the data
We've covered almost the entire pipeline for data collection for machine learning, from acquiring data to generating new samples before training a model.
We learned how to remove irrelevant data and retain only what's necessary, and we saw the importance of equal representation of all classes in the training sample.
This shows how important data quality is in machine learning, deep learning, and AI in general. The models are, of course, important, but perhaps the data fed to them is even more crucial.