Every AI model is trained on data, which comes in many forms. The largest and most diverse data source for AI training is the web, but collecting web data and making it usable for AI models comes with challenges.
In this blog post, we'll look at different ways to collect training datasets, the limitations of these methods, and the best way to gather web data for AI.
Sources of training data
We'll go through four common data collection methods:
- Public datasets
- Crowdsourcing
- Web APIs
- Web scraping
Public datasets
There are many public datasets available, from the UCI Machine Learning Repository to Hugging Face. Hugging Face datasets remain one of the best options because the community keeps contributing new ones, while many other repositories are becoming outdated.
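For example, a public dataset can be pulled from the Hugging Face Hub in a few lines with the datasets library. A minimal sketch, assuming the library is installed and using the IMDB dataset purely as an illustration:
from datasets import load_dataset  # pip install datasets

# Download a public dataset from the Hugging Face Hub (IMDB is just an example)
dataset = load_dataset("imdb", split="train")
print(dataset.num_rows, "examples")
print(dataset[0]["text"][:200])  # peek at the first training example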
Even Hugging Face datasets have limitations, though. If you need fresh, up-to-date information so your AI model has the latest data on events, brands, or your company's products and policies, retrieving current data from the web is a better option.
Crowdsourcing
Crowdsourcing in the context of AI training data means outsourcing data collection, labeling, or validation to a large group of people, usually via platforms like Amazon Mechanical Turk or Toloka. It is costly, and the data carries the biases of the contributors. A well-known example is the distinctive English style and over-used words that show up in ChatGPT-written text.
Web APIs
APIs are easy to program against and provide a focused, structured interface. The issue is coverage: relatively few websites offer an API, many of those are behind a paywall, and uptime can be a problem, too.
Here's a simple example of how you can retrieve web data via an API. We'll demonstrate with the Wikipedia API, which is straightforward and requires no authentication.
Install the wikipedia package using pip:
pip install wikipedia
For a basic example, we can fetch the summary and content of an article, say the Treaty of Versailles:
import wikipedia

wikipedia.set_lang("en")
topic = "Treaty of Versailles"
summary = wikipedia.summary(topic, sentences=5)  # short summary of the article
print(f"Summary for '{topic}':\n")
print(summary)
page = wikipedia.page(topic)   # full article object
print(page.content[:300])      # first part of the full article text
We can further extend it to have a proper dataset and store the results in CSV (or some other preferred format).
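A minimal sketch of that extension, assuming an arbitrary list of topics and writing everything to a single CSV file:
import csv
import wikipedia

wikipedia.set_lang("en")
topics = ["Treaty of Versailles", "League of Nations", "Woodrow Wilson"]  # arbitrary examples

# Fetch a short summary for each topic and store the results as rows in a CSV file
with open("wikipedia_dataset.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["topic", "summary"])
    for topic in topics:
        writer.writerow([topic, wikipedia.summary(topic, sentences=3)])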
Web scraping
API usage is great, but limited. For a start, many websites don't offer an API, and some of those that do put it behind a paywall. APIs can also impose rate or pagination limits, and in some cases what the API returns differs from what the live page actually shows.
Web scraping is a cost-effective method that addresses those challenges by returning the web content in its original form. It has its own share of issues, as we'll see shortly. But let's start with an example.
import requests
from bs4 import BeautifulSoup

url = "https://www.nature.com/articles/171737a0"
headers = {"User-Agent": "Mozilla/5.0"}  # pretend to be a regular browser

response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, "html.parser")

paragraphs = soup.find_all("p")
for p in paragraphs[:7]:  # just the first few paragraphs are enough for the demo
    print(p.get_text(strip=True))
The output contains clutter and needs to be preprocessed or handled during scraping:
Thank you for visiting nature.com. You are using a browser version with limited support for CSS. To obtain
the best experience, we recommend you use a more up to date browser (or turn off compatibility mode in
Internet Explorer). In the meantime, to ensure continued support, we are displaying the site without styles
and JavaScript.
Advertisement
Naturevolume171,pages737–738 (1953)Cite this article
235kAccesses
8598Citations
2292Altmetric
Metricsdetails
The challenges of web scraping for AI training data
The data returned by scraping is quite raw (as we saw above): it doesn't differentiate between the main content and boilerplate like headers, footers, and ads. So we need to be smart about it and apply additional checks in the scraping code.
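As a rough illustration of such checks, the BeautifulSoup example above could be narrowed to the page's main article element and filtered by paragraph length; the selector and length threshold below are assumptions that won't generalize to every site:
import requests
from bs4 import BeautifulSoup

url = "https://www.nature.com/articles/171737a0"
response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
soup = BeautifulSoup(response.content, "html.parser")

# Prefer the main <article> element when the page has one, and drop very short
# strings that are usually boilerplate ("Advertisement", metrics counters, etc.)
article = soup.find("article") or soup
paragraphs = [
    p.get_text(strip=True)
    for p in article.find_all("p")
    if len(p.get_text(strip=True)) > 80
]
print("\n\n".join(paragraphs[:5]))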
But the challenges don't stop there. Websites also actively try to suppress web scraping with measures like:
- CAPTCHA
- Blocking the caller IP(s)
- Rate limits (a basic client-side workaround is sketched below)
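CAPTCHAs and IP blocking are hard to get around with plain requests, but rate limits can at least be softened by backing off and retrying when the server answers with HTTP 429. A minimal sketch, with arbitrary retry counts and delays:
import time
import requests

def fetch_with_backoff(url, max_retries=5):
    """Retry a GET request with exponential backoff while the server rate-limits us."""
    delay = 1
    for _ in range(max_retries):
        response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
        if response.status_code != 429:  # 429 = Too Many Requests
            return response
        time.sleep(delay)  # wait before retrying
        delay *= 2         # exponential backoff
    raise RuntimeError(f"Still rate-limited after {max_retries} attempts: {url}")

response = fetch_with_backoff("https://www.nature.com/articles/171737a0")
print(response.status_code)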
Apify Actors: efficient web scraping solutions
Actors are pre-built web scraping and automation solutions on the Apify platform that solve many of the problems of web data collection.
These serverless cloud programs take input (either as JSON or via GUI fields) and return output in your preferred format (JSON, CSV, XML, and others). They use techniques like rotating IP addresses and human-like browser fingerprints to get past CAPTCHAs and blocking.
For AI training, Website Content Crawler would be the most suitable choice. Some quick advantages over traditional web scraping are:
- Clean results – It excludes headers, footers, and other unnecessary data and returns the web content as clean Markdown (HTML and plain-text options are also available). It can also download hosted files in formats like PDF and XLS.
- Integrations – The Actor integrates with OpenAI, LangChain, and vector databases, making it a great choice for AI training data; at the time of writing, it has more than 6,000 monthly active users. A short LangChain example follows this list.
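As a sketch of the LangChain side, based on the community integration at the time of writing (the token and URL are placeholders; the apify-client and langchain-community packages need to be installed):
import os
from langchain_community.utilities import ApifyWrapper
from langchain_core.documents import Document

os.environ["APIFY_API_TOKEN"] = "apify_api_xxxxx"  # replace with your token

apify = ApifyWrapper()

# Run Website Content Crawler and map each result item to a LangChain Document
loader = apify.call_actor(
    actor_id="apify/website-content-crawler",
    run_input={"startUrls": [{"url": "https://www.nature.com/articles/171737a0"}]},
    dataset_mapping_function=lambda item: Document(
        page_content=item["text"] or "", metadata={"source": item["url"]}
    ),
)
docs = loader.load()  # documents ready for chunking, embedding, or a vector store
print(docs[0].page_content[:200])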
Example of Website Content Crawler in action
Working with the Actor is simple. Create a free Apify account, grab your API token from the settings tab, and install the client library (pip install apify-client). With the token, you can run the Actor either from the GUI or directly from a Python script.
We'll use the same webpage as in the previous example to show the difference between using the Actor and conventional web scraping.
from apify_client import ApifyClient

client = ApifyClient("apify_api_xxxxx")  # replace with your actual Apify API token

# Crawler parameters
run = client.actor("apify/website-content-crawler").call(run_input={
    "startUrls": [
        {"url": "https://www.nature.com/articles/171737a0"}
    ],
    "crawlerType": "cheerio",
    "extractText": True,
    "textFormat": "markdown",
    "includePdfLinks": True,
    "maxPagesPerCrawl": 1
})

# Output
dataset_items = client.dataset(run["defaultDatasetId"]).list_items().items
print(dataset_items[0]["text"][:750])
Here, the output is much cleaner and focuses only on the main content, as you can see:
Molecular Structure of Nucleic Acids: A Structure for Deoxyribose Nucleic Acid
Article
Published: 25 April 1953
Nature volume 171, pages 737–738 (1953)Cite this article
235k Accesses
8598 Citations
2292 Altmetric
Metrics details
WE wish to suggest a structure for the salt of deoxyribose nucleic acid (D.N.A.). This structure has novel features which are of considerable biological interest.
A structure for nucleic acid has already been proposed by Pauling and Corey1. They kindly made their manuscript available to us in advance of publication. Their model consists of three intertwined chains, with the phosphates near the fibre axis, and the bases on the outside. In our opinion, this structure is unsatisfactory for two reasons : (1) We believe
And that's just the tip of the iceberg. Features like anti-blocking, CAPTCHA bypass, and ready-made integrations mean you can collect data even from the most complex websites and feed it straight into your preferred AI frameworks and databases.
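For example, once a run finishes, the cleaned text can be dumped into a JSONL file, a common input format for fine-tuning pipelines. A minimal sketch continuing from the client code above (the file name is arbitrary):
import json

# Write each crawled page's cleaned text (plus its URL) to a JSONL file
with open("training_data.jsonl", "w", encoding="utf-8") as f:
    for item in dataset_items:
        f.write(json.dumps({"url": item["url"], "text": item["text"]}) + "\n")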
To summarize, here's a table comparing the methods we've covered:
| Method | Data freshness | Effort required | Cost | Legal risk | Control |
|---|---|---|---|---|---|
| Public datasets | Medium | Low | Free | Low | Low |
| APIs | Medium to high | Medium | Variable | Low | Medium |
| Scraping | High | High | Free | Medium to high | Medium to high |
| Apify Actors | High | Low | Free tier available | Low | High |
Get better data for AI
Training data is key to AI models. No wonder AI leaders like Andrew Ng have been stressing the importance of data-centric AI.
In this article, we went through different ways of getting training data: the limitations of public datasets, crowdsourcing, and APIs, and the challenges of web scraping. Finally, we saw why Apify Actors are the best tools for collecting this data.
If you need a ready-made solution for AI training data, sign up for an Apify account and try Website Content Crawler for free.