Every AI model is trained on data, which comes in many forms. The largest and most diverse data source for AI training is the web, but collecting web data and making it usable for AI models comes with challenges.
In this blog post, we'll look at different ways to collect training datasets, the limitations of these methods, and the best way to gather web data for AI.
Sources of training data
We'll go through four common data collection methods:
- Public datasets
- Crowdsourcing
- Web APIs
- Web scraping
Public datasets
There are many public datasets available, from the UCI Machine Learning Repository to Hugging Face. Hugging Face datasets remain one of the best options because the community keeps contributing new ones, while many other repositories are becoming outdated.
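For example, a public dataset can be pulled from the Hugging Face Hub in a few lines with the datasets library. A minimal sketch, assuming the library is installed and using the IMDB dataset purely as an illustration:
from datasets import load_dataset  # pip install datasets

# Download a public dataset from the Hugging Face Hub (IMDB is just an example)
dataset = load_dataset("imdb", split="train")
print(dataset.num_rows, "examples")
print(dataset[0]["text"][:200])  # peek at the first training example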
Even Hugging Face datasets have limitations, though. If you need fresh, up-to-date information so your AI model has the latest data on events, brands, or your company's products and policies, retrieving current data from the web is a better option.
Crowdsourcing
Crowdsourcing in the context of AI training data means outsourcing data collection, labeling, or validation to a large group of people, usually via platforms like Amazon Mechanical Turk or Toloka. It is costly, and the data carries the biases of the contributors. A well-known example is the distinctive English style and over-used words that show up in ChatGPT-written text.
Web APIs
APIs are easy to program against and provide a focused, structured interface. The issue is coverage: relatively few websites offer an API, many of those are behind a paywall, and uptime can be a problem, too.
Here's a simple example of how you can retrieve web data via an API. We'll demonstrate with the Wikipedia API, which is straightforward and requires no authentication.
Install the wikipedia package using pip:
pip install wikipedia
For a basic example, we can fetch the summary and content of an article, say the Treaty of Versailles:
import wikipedia

wikipedia.set_lang("en")
topic = "Treaty of Versailles"
summary = wikipedia.summary(topic, sentences=5)  # short summary of the article
print(f"Summary for '{topic}':\n")
print(summary)
page = wikipedia.page(topic)   # full article object
print(page.content[:300])      # first part of the full article text
We can further extend it to have a proper dataset and store the results in CSV (or some other preferred format).
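A minimal sketch of that extension, assuming an arbitrary list of topics and writing everything to a single CSV file:
import csv
import wikipedia

wikipedia.set_lang("en")
topics = ["Treaty of Versailles", "League of Nations", "Woodrow Wilson"]  # arbitrary examples

# Fetch a short summary for each topic and store the results as rows in a CSV file
with open("wikipedia_dataset.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["topic", "summary"])
    for topic in topics:
        writer.writerow([topic, wikipedia.summary(topic, sentences=3)])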
Web scraping
API usage is great, but limited. For a start, many websites don't offer an API, and some of those that do put it behind a paywall. APIs can also impose rate or pagination limits, and in some cases what the API returns differs from what the live page actually shows.
Web scraping is a cost-effective method that addresses those challenges by returning the web content in its original form. It has its own share of issues, as we'll see shortly. But let's start with an example.
import requests
from bs4 import BeautifulSoup

url = "https://www.nature.com/articles/171737a0"
headers = {"User-Agent": "Mozilla/5.0"}  # pretend to be a regular browser

response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, "html.parser")

paragraphs = soup.find_all("p")
for p in paragraphs[:7]:  # just the first few paragraphs are enough for the demo
    print(p.get_text(strip=True))
The output contains clutter and needs to be preprocessed or handled during scraping:
Thank you for visiting nature.com. You are using a browser version with limited support for CSS. To obtain
the best experience, we recommend you use a more up to date browser (or turn off compatibility mode in
Internet Explorer). In the meantime, to ensure continued support, we are displaying the site without styles
and JavaScript.
Advertisement
Naturevolume171,pages737–738 (1953)Cite this article
235kAccesses
8598Citations
2292Altmetric
Metricsdetails
The challenges of web scraping for AI training data
The data returned by scraping is quite raw (as we saw above): it doesn't differentiate between the main content and boilerplate like headers, footers, and ads. So we need to be smart about it and apply additional checks in the scraping code.
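As a rough illustration of such checks, the BeautifulSoup example above could be narrowed to the page's main article element and filtered by paragraph length; the selector and length threshold below are assumptions that won't generalize to every site:
import requests
from bs4 import BeautifulSoup

url = "https://www.nature.com/articles/171737a0"
response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
soup = BeautifulSoup(response.content, "html.parser")

# Prefer the main <article> element when the page has one, and drop very short
# strings that are usually boilerplate ("Advertisement", metrics counters, etc.)
article = soup.find("article") or soup
paragraphs = [
    p.get_text(strip=True)
    for p in article.find_all("p")
    if len(p.get_text(strip=True)) > 80
]
print("\n\n".join(paragraphs[:5]))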
But the challenges don't stop there. Websites also actively try to suppress web scraping with measures like:
- CAPTCHA
- Blocking the caller IP(s)
- Rate limits (a basic client-side workaround is sketched below)
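CAPTCHAs and IP blocking are hard to get around with plain requests, but rate limits can at least be softened by backing off and retrying when the server answers with HTTP 429. A minimal sketch, with arbitrary retry counts and delays:
import time
import requests

def fetch_with_backoff(url, max_retries=5):
    """Retry a GET request with exponential backoff while the server rate-limits us."""
    delay = 1
    for _ in range(max_retries):
        response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
        if response.status_code != 429:  # 429 = Too Many Requests
            return response
        time.sleep(delay)  # wait before retrying
        delay *= 2         # exponential backoff
    raise RuntimeError(f"Still rate-limited after {max_retries} attempts: {url}")

response = fetch_with_backoff("https://www.nature.com/articles/171737a0")
print(response.status_code)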
Apify Actors: efficient web scraping solutions
Actors are pre-built web scraping and automation solutions on the Apify platform that solve many of the problems of web data collection.
These serverless cloud programs take input (either as JSON or via GUI fields) and return output in your preferred format (JSON, CSV, XML, and others). They use techniques like rotating IP addresses and human-like browser fingerprints to get past CAPTCHAs and blocking.
For AI training, Website Content Crawler would be the most suitable choice. Some quick advantages over traditional web scraping are:
- Clean results – It excludes headers, footers, and other unnecessary data and returns the web content as clean Markdown (HTML and plain-text options are also available). It can also download hosted files in formats like PDF and XLS.
- Integrations – The Actor integrates with OpenAI, LangChain, and vector databases, making it a great choice for AI training data; at the time of writing, it has more than 6,000 monthly active users. A short LangChain example follows this list.
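As a sketch of the LangChain side, based on the community integration at the time of writing (the token and URL are placeholders; the apify-client and langchain-community packages need to be installed):
import os
from langchain_community.utilities import ApifyWrapper
from langchain_core.documents import Document

os.environ["APIFY_API_TOKEN"] = "apify_api_xxxxx"  # replace with your token

apify = ApifyWrapper()

# Run Website Content Crawler and map each result item to a LangChain Document
loader = apify.call_actor(
    actor_id="apify/website-content-crawler",
    run_input={"startUrls": [{"url": "https://www.nature.com/articles/171737a0"}]},
    dataset_mapping_function=lambda item: Document(
        page_content=item["text"] or "", metadata={"source": item["url"]}
    ),
)
docs = loader.load()  # documents ready for chunking, embedding, or a vector store
print(docs[0].page_content[:200])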
Example of Website Content Crawler in action
Working with the Actor is simple. Create a free Apify account, grab your API token from the settings tab, and install the client library (pip install apify-client). With the token, you can run the Actor either from the GUI or directly from a Python script.
We'll use the same webpage as in the previous example to show the difference between using the Actor and conventional web scraping.
from apify_client import ApifyClient

client = ApifyClient("apify_api_xxxxx")  # replace with your actual Apify API token

# Crawler parameters
run = client.actor("apify/website-content-crawler").call(run_input={
    "startUrls": [
        {"url": "https://www.nature.com/articles/171737a0"}
    ],
    "crawlerType": "cheerio",
    "extractText": True,
    "textFormat": "markdown",
    "includePdfLinks": True,
    "maxPagesPerCrawl": 1
})

# Output
dataset_items = client.dataset(run["defaultDatasetId"]).list_items().items
print(dataset_items[0]["text"][:750])
Here, the output is much cleaner and focuses only on the main content, as you can see:
Molecular Structure of Nucleic Acids: A Structure for Deoxyribose Nucleic Acid
Article
Published: 25 April 1953
Nature volume 171, pages 737–738 (1953)Cite this article
235k Accesses
8598 Citations
2292 Altmetric
Metrics details
WE wish to suggest a structure for the salt of deoxyribose nucleic acid (D.N.A.). This structure has novel features which are of considerable biological interest.
A structure for nucleic acid has already been proposed by Pauling and Corey1. They kindly made their manuscript available to us in advance of publication. Their model consists of three intertwined chains, with the phosphates near the fibre axis, and the bases on the outside. In our opinion, this structure is unsatisfactory for two reasons : (1) We believe
And that's just the tip of the iceberg. Features like anti-blocking, CAPTCHA bypass, and ready-made integrations mean you can collect data even from the most complex websites and feed it straight into your preferred AI frameworks and databases.
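For example, once a run finishes, the cleaned text can be dumped into a JSONL file, a common input format for fine-tuning pipelines. A minimal sketch continuing from the client code above (the file name is arbitrary):
import json

# Write each crawled page's cleaned text (plus its URL) to a JSONL file
with open("training_data.jsonl", "w", encoding="utf-8") as f:
    for item in dataset_items:
        f.write(json.dumps({"url": item["url"], "text": item["text"]}) + "\n")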
To summarize, here's a table comparing the methods we've covered:
| Method | Data freshness | Effort required | Cost | Legal risk | Control |
|---|---|---|---|---|---|
| Public datasets | Medium | Low | Free | Low | Low |
| APIs | Medium to high | Medium | Variable | Low | Medium |
| Scraping | High | High | Free | Medium to high | Medium to high |
| Apify Actors | High | Low | Free tier available | Low | High |
Get better data for AI
Training data is key to AI models. No wonder AI leaders like Andrew Ng have been stressing the importance of data-centric AI.
In this article, we went through different ways of getting training data: the limitations of public datasets, crowdsourcing, and APIs, and the challenges of web scraping. Finally, we saw why Apify Actors are the best tools for collecting this data.
If you need a ready-made solution for AI training data, sign up for an Apify account and try Website Content Crawler for free.