AI data collection (how to feed your LLM)

Outdated data is always a problem for AI models. It results in LLM responses that miss current trends or factual accuracy. The best solution to this problem is automating web data collection to supply AI with fresh information, but this solution is also fraught with problems.

Transforming raw web data into a format that machines can use requires careful cleaning and structuring. You need to improve data quality by filtering out irrelevant or redundant information before it reaches your AI systems.

Website Content Crawler, a purpose-built serverless cloud program (or Actor), handles this process efficiently. It crawls entire websites and eliminates duplicate entries and irrelevant information, which is necessary for constructing a high-quality dataset.

Step-by-step guide to AI data collection

To use Website Content Crawler and follow along with this tutorial, sign in or sign up for a forever-free plan (no credit card required).

Get started with Website Content Crawler for AI data collection

When logged in to your account, you'll be taken to your dashboard (Apify Console), where you'll see the User Interface we're about to go through.

1. Start URLs

For this example, we'll use the default input and scrape the Apify docs using the following start URL: https://docs.apify.com/academy/web-scraping-for-beginners.

Step 1. Insert URLs to crawl in the Start URLs field

In this case, the crawler will only crawl the links beginning with https://docs.apify.com/academy/web-scraping-for-beginners. Other links, such as https://docs.apify.com, will be ignored.

You can add other URLs to the list, as well. These will be added to the crawler queue, and the Actor will process them one by one.

You can use the Text file option for batch processing if you have lots of URLs and want to crawl them all. You can either upload a file with a list that has each URL on a separate line, or you can provide a URL of the file.

2. Crawler settings

Crawler type

The default crawler type is Adaptive, which switches between a headless browser (Firefox + Playwright) and a raw HTTP client (Cheerio). This crawler dynamically adapts, combining the advantages of both crawlers.

The Raw HTTP client (Cheerio) is extremely fast (up to 20x faster than a headless browser) and uses less memory. However, it cannot handle pages with client-side JavaScript rendering. On the other hand, a headless browser can process dynamic pages and is effective at handling anti-bot blocking, but is generally slower, consumes more memory, and therefore costs more.

In most cases, the adaptive switching option saves you time by automatically determining which crawler to use based on the page type, eliminating the need for manual decision-making.

Step 2. Choose the crawler type based on the type of web pages you need data from

If you want full control and need to render client-side JavaScript, choose the headless browser option. If client-side JavaScript rendering isn’t required, opt for the Raw HTTP client (Cheerio).

Exclude URLs (globs)

By default, the crawler will visit all the web pages in the Start URLs field (plus all the linked pages - but only if their path prefixes match). However, there might be some you don’t want to visit. If that's the case, you can use the exclude URLs (globs) option.

AI data collection. Exclude URLs — Fill in webpages you don‘t want to visit by using the exclude URLs option

You can also check if the glob matches what you want with the Test Glob button.

AI data collection. Glob matches — Check whether glob matches your interest with the Test Glob tool

Initial cookies

Cookies are sometimes used to identify the user with the server it’s trying to access. You can use the initial cookies option if you want to access content behind a log-in or authenticate your crawler with the website you’re scraping. Here are a couple of examples.

AI data collection. Cookies — Use the initial cookie option to scrape the content behind a log-in

3. HTML processing

There are two steps to HTML processing: a) waiting for content to load and b) processing the HTML from the web page (data cleaning). Although the UI doesn't strictly follow this order, I've decided to break it up this way: 3. HTML processing and 4. Data cleaning.

Wait for dynamic content

Some web pages have lazy loading, which is when the web page is loading more content as you scroll down. In such cases, you can tell the crawler to wait for dynamic content to load. The crawler will wait for up to 10 seconds as long as the web page is changing.

Maximum scroll height

The maximum scroll height is the height at which you scroll down before starting to process the page. This is there just to prevent infinite scrolling. Imagine an online store loading more and more products as you scroll, for example.

Once the content is loaded, the crawler may try to click on the cookie modals. With the remove cookie warnings option, it will click and hide the modals. It's enabled by default.

Expand clickable elements

The expand clickable elements option lets you add a selector of things the crawler should click on. If you don't select this, the Actor won't crawl any links in collapsed content. So, use this option to scrape content from collapsed sections of the webpage.

4. Data cleaning

Remove HTML elements

You can clean the data by removing HTML elements. These are the selectors of things you don’t want to include in your results (banners, ads, menus, alerts, and so on). The default setting covers most things, but you can add more to the list if you need to. This way, you'll have only the content you need to feed your language model.

Step 4. Data cleaning - remove banners, ads, menus, and other HTML elements

HTML transformer

With this option, you can try to remove more elements, but it may strip useful parts of the content you want to extract. So, if you discover that this is the case after running the Actor, you can choose None.

Remove duplicate text lines

You can remove duplicate text lines if the crawler keeps seeing the same line again and again. You can enable this in case you keep seeing some parts of footers or menus in your output, but you don’t want to look for the correct CSS selectors. The Actor strips the repeated content after 4 or 5 occurrences. This will prevent saving the same information repeatedly and so keep the data clean.

5. Output settings

You can save the data as HTML or Markdown or save screenshots if you're using a headless browser. The Save files option deserves some special attention, though.

Step 5. Choose the output setting that fits your needs

If you choose Save files, the crawler inspects the web page, and whenever it sees a link that goes to, say a PDF, Word doc, or Excel sheet, it will download it to the Apify key-value store.

6. Running the Actor

With the UI, you can execute code with the click of a button (the Start button at the bottom of the screen).

While running, you'll see what the crawler is up to in the log and can check if it's experiencing any issues. You can abort the run at any point.

When the crawler has completed a successful run, you can retrieve the data from the output tab.

Step 6. Run the Actor for AI data collection — Step 6. Run the Actor

7. Storing and exporting the data

The results of the Actor are stored in the default Dataset associated with the Actor run, from where you can access it via API and export it to formats like JSON, XML, or CSV. You need only click the Export results button to view or download the data in your preferred format.

Here's the Markdown from the first of the 26 results from this demo run.

AI data collection - Data exported in Markdown — Markdown data for AI

8. Integrate your data with LangChain, Pinecone, and other tools

You can now use the data you've collected to feed and fine-tune LLMs by integrating your data with LangChain or with a vector database such as Pinecone or any Pinecone alternatives.

For a detailed example, check out this tutorial on how to use LangChain and Pinecone with Apify.

Alternatively, you can orchestrate everything using platforms like Make.com to automate crawling and data updates. Apify offers a rich ecosystem and integrates with major automation platforms and AI tools.

Get quality data for AI

A high-quality dataset is the foundation of any AI system, and making sure that your data is fresh, relevant, and well-structured is the key to better model performance. As we've demonstrated, Website Content Crawler simplifies the entire process, from collecting web data to refining it for AI applications, all without requiring deep technical expertise.

If you're serious about improving your AI’s accuracy with real-time web data, try Website Content Crawler. Sign up for a free account and start building cleaner, more reliable datasets in just a few clicks.