AI web scraping tools: do they really work?

Can AI do web scraping? I tried a couple of “AI-powered” tools and a GPT model. Here is what I found.

What is AI-powered web scraping?

The internet is currently saturated with terms like “AI-driven” and “AI-powered” web scrapers. But is “AI web scraping” really a thing?

While there are certainly some excellent web scraping tools for AI out there, most of those advertised as “AI-powered web scrapers” are… well, just web scrapers. “AI” has been stuffed in there because that’s what everyone’s obsessed with at the moment.

So, I did my due diligence: I tried out a few of these so-called AI scraping tools, identified two that are arguably worthy of the name, and also explored using GPT models for web data extraction.

Here are my findings.

AI-powered web scrapers

BrowseAI

First up is BrowseAI. It’s basically an Apify-like SaaS platform. The Chrome extension/web app lets you record user actions in the browser, then upload the recording to the platform and run it there.

You provide the URL of a page you want to scrape, like so:

AI web scraper - BrowseAI - Apify Blog URL

Naturally, I chose the Apify blog because the content is awesome!

AI web scraper - BrowseAI - Apify Blog

After a while, the web page opens in your browser:

AI web scraper - BrowseAI - opening Apify blog web page

You can then extract data with point-and-click tools that automatically recognize repeating components:

AI web scraper - BrowseAI - repeating components
AI web scraper - BrowseAI - point and click tools

Now you can pick parts of those repeating components (things like title and author in this case), select them by clicking, and name the columns in the resulting table:

AI web scraper - BrowseAI - resulting table

Pros of BrowseAI
  • The recorder has intuitive controls and a smart UI for selecting data to scrape.
  • It’s a no-code solution, so it’s easy for those who are not developers to use it.
Cons of BrowseAI
  • The performance dips when recording.
  • Because it’s a no-code solution, there’s little room for customization.
Is BrowseAI an AI web scraping tool?

Credit where it’s due: BrowseAI is a well-chosen name because the only AI-like feature is the browser recorder.

BrowseAI is basically Apify plus a recorder. It provides “Prebuilt Robots” (which are essentially what Apify Actors are) and a platform to run the bots on (just like the Apify platform).

While BrowseAI is a pretty neat no-code web scraping tool, I wouldn’t go so far as to call it an AI-powered web scraper. And if you’re a dev who wants more customization, anti-blocking features, proxies, datasets, and other crucial things for serious data extraction projects, web scraping with Apify is an alternative solution you should consider.

Kadoa.com

Next up is Kadoa.com, an online service that uses generative AI models for automated data extraction.

With Kadoa Playground, you input a URL, and the service will analyze the page using AI models to extract data automatically.

You can then select which data you want to scrape, making the process quick and efficient.

This can be especially useful for those who need to collect large amounts of data from websites for research or business purposes.

AI web scraper - Kadoa - Automated data extraction with AI

Again, I went with the Apify blog. Did I mention how awesome it is?

AI web scraper - Kadoa - Apify Blog

After analyzing the page, the service asks what data you want to extract. In this case, it found out that blog.apify.com contains links to blog posts and articles, so it offered to scrape these:

AI web scraper - Kadoa - Apify blog posts

After picking “Blog posts”, Kadoa gave me the option to customize the scrape even more:

AI web scraper - Kadoa - customize the scrape

What's cool is that it didn't ask for CSS or XPath selectors but let me describe the fields I wanted in plain natural language (English, in this case).

As I wanted to scrape titles of the blog posts, their respective authors, and the publication dates, I just added 3 fields named “title”, “author”, and “pub_date”.

There’s no required syntax, as Kadoa makes extensive use of generative AI models to deal with that.

After a while, the service gave me the result as a neatly formatted JSON array:

AI web scraper - Kadoa - JSON array

Pros of Kadoa.com
  • Fast and easy to use.
Cons of Kadoa.com
  • The whole project is still in an early phase, so it has some limitations: the playground doesn't work for generic homepages, sites behind a login, sites with anti-scraping protections, or sites that require click automation.

If you’re a developer who needs to scrape those things (and frankly, for any large-scale scraping task, you really do need to), then Website Content Crawler is an alternative you should consider.

Is Kadoa an AI web scraping tool?

I think Kadoa is worthy of the “AI” in “AI-powered web scraping”. The AI is what makes Kadoa very easy for non-developers to use.

Using GPT models for data extraction

So, those are two ready-made “AI web scraping” products you could try, but another possibility is to use AI (LLMs in this case) directly.

For example, you can build a scraper with Crawlee that extracts text from a page, feeds it to an LLM, and says, “Make a JSON out of this.”

Prompt:

Here follows body.innerText for a blog listing page. Extract article names, descriptions, author names, and creation dates, and format those as a JSON array...
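
The flow described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not a definitive implementation: it assumes the scraper (Crawlee in the text) has already produced the page text, and `call_llm` is a hypothetical callable standing in for whatever model client you use. `fake_llm` is also invented, so the example runs offline; its canned reply reuses one article from the Apify blog.

```python
import json
from typing import Callable, Dict, List

# The prompt from the article, plus an explicit list of keys (an addition
# made here so the reply is easier to parse).
PROMPT_TEMPLATE = (
    "Here follows body.innerText for a blog listing page. "
    "Extract article names, descriptions, author names, and creation dates, "
    "and format those as a JSON array of objects with the keys "
    '"articleName", "description", "authorName", and "creationDate". '
    "Return only the JSON array.\n\n{page_text}"
)


def build_prompt(page_text: str) -> str:
    """Embed the scraped page text in the extraction prompt."""
    return PROMPT_TEMPLATE.format(page_text=page_text)


def extract_articles(page_text: str, call_llm: Callable[[str], str]) -> List[Dict]:
    """Send the prompt to the model and parse its reply as a JSON array."""
    reply = call_llm(build_prompt(page_text))
    data = json.loads(reply)  # raises json.JSONDecodeError on malformed replies
    if not isinstance(data, list):
        raise ValueError("expected a JSON array from the model")
    return data


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real model call; returns a canned reply."""
    return json.dumps([
        {
            "articleName": "What is generative AI?",
            "description": "What you need to know about generative AI "
                           "and what it means for developers.",
            "authorName": "Theo Vasilis",
            "creationDate": "Jun 23, 2023",
        }
    ])


articles = extract_articles("text scraped from the listing page", fake_llm)
print(articles[0]["articleName"])
```

Passing the model call in as a parameter keeps the sketch independent of any particular provider and makes the parsing step easy to test on its own.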

You can see an example in the OpenAI playground:

AI web scraping - data extraction with GPT model - prompt

And here's the response:

[
  {
    "articleName": "How web scraping and AI are helping to find missing children",
    "description": "The Missing Children initiative began with a Facebook page. Web scraping Facebook for data labeling has taken it to a whole new level and the initiative is now reuniting families all over Egypt.",
    "authorName": "Theo Vasilis",
    "creationDate": "Aug 7, 2023"
  },
  {
    "articleName": "Google Maps scraping manual: how to extract reviews, images, restaurants, and more 📍 📚",
    "description": "Welcome to your comprehensive guide to extracting valuable data from Google Maps. In this manual, we will walk you through various techniques and tools to help you scrape images, extract restaurant data, gather contact details, scrape reviews, and much more.",
    "authorName": "Natasha Lekh",
    "creationDate": "Jul 26, 2023"
  },
  {
    "articleName": "What is generative AI?",
    "description": "What you need to know about generative AI and what it means for developers.",
    "authorName": "Theo Vasilis",
    "creationDate": "Jun 23, 2023"
  },
  {
    "articleName": "Apify tutorial contest 🏆",
    "description": "Create guides on how to use the Apify platform or Crawlee to build web scrapers, and win up to $1,000 of free platform credits for your own projects!",
    "authorName": "Theo Vasilis",
    "creationDate": "May 19, 2023"
  },
  {
    "articleName": "Web scraping for AI: how to collect data for LLMs",
    "description": "A tutorial that shows you how to crawl, extract, and process web data to feed, fine-tune, or train large language models.",
    "authorName": "Theo Vasilis",
    "creationDate": "Aug 10, 2023"
  },
  {
    "articleName": "Puppeteer tutorial: submitting forms, clicking buttons, and handling inputs",
    "description": "Find out how to use Puppeteer to handle forms, buttons, and inputs. Learn about type method, click method, and how to deal with text fields, dropdowns, and checkboxes.",
    "authorName": "Ayodele Aransiola",
    "creationDate": "Aug 9, 2023"
  },
  {
    "articleName": "How web scraping and AI are helping to find missing children",
    "description": "The Missing Children initiative began with a Facebook page. Web scraping Facebook for data labeling has taken it to a whole new level and the initiative is now reuniting families all over Egypt.",
    "authorName": "Theo Vasilis",
    "creationDate": "Aug 7, 2023"
  },
  {
    "articleName": "Selenium Grid: what it is and how to set it up",
    "description": "Learn about the Selenium Grid architecture and explore its use in scenarios like large test suites, cross-browser testing, and continuous integration. This comprehensive guide also includes steps for setting up Selenium Grid and practical tips for efficient parallel test execution.",
    "authorName": "Percival Villalva",
    "creationDate": "Aug 3, 2023"
  },
  {
    "articleName": "Is web scraping legal?",
    "description": "Web scraping is legal if you scrape data publicly available on the internet. But you should be careful when scraping personal data or intellectual property. We cover the confusion surrounding the legality of web scraping and give you tips for compliant and ethical scrapers.",
    "authorName": "Ondra Urban",
    "creationDate": "Aug 3, 2023"
  },
  {
    "articleName": "Traditional NLP techniques and the rise of LLMs",
    "description": "The field of NLP has changed with the rise of LLMs, but NLP still has a role to play. Apply NLP techniques to scraped data and learn about tokenization, stemming, lemmatization, removing stop words, and more NLP techniques.",
    "authorName": "Usama Jamil",
    "creationDate": "Aug 2, 2023"
  },
  {
    "articleName": "10 reasons tourists hate European landmarks (according to data from Google Maps)",
    "description": "A small data project to visualize and analyze bad Google Maps reviews of popular European landmarks.",
    "authorName": "Natasha Lekh",
    "creationDate": "Aug 1, 2023"
  },
  {
    "articleName": "Python and machine learning",
    "description": "Learn how Python and machine learning intersect to solve complex problems that defeat traditional programming methods. Find out about Pandas, TensorFlow, Scikit-learn, and how they can transform data.",
    "authorName": "Percival Villalva",
    "creationDate": "Jul 31, 2023"
  },
  {
    "articleName": "Top 5 books on AI",
    "description": "Explore the world of AI through a comprehensive selection of books recommended by business leaders. These reads provide an in-depth understanding of AI's history, machine learning, generative AI, diversity in AI, and AI for cybersecurity.",
    "authorName": "Guest Author",
    "creationDate": "Jul 29, 2023"
  }
]

This approach is more resilient to page changes than regular scraping approaches because it doesn’t rely on CSS selectors, which can stop working after a redesign or whenever developers change the page layout.
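
To make the difference concrete, here is a small sketch using only Python's standard library. The two HTML snippets and their class names (`post-title`, `card__heading`, and so on) are invented for illustration and are not taken from any real page. The class-based extractor stands in for a CSS selector, and `visible_text` is a rough approximation of `body.innerText`.

```python
from html.parser import HTMLParser


class ClassTextExtractor(HTMLParser):
    """Collect text inside elements carrying a given class.

    Toy stand-in for a CSS class selector; it does not handle void
    elements such as <br> nested inside a match.
    """

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.depth = 0  # greater than 0 while inside a matching element
        self.found = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.depth or self.wanted_class in classes:
            self.depth += 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.found.append(data.strip())


class VisibleTextExtractor(HTMLParser):
    """Collect all text nodes, ignoring markup entirely."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())


def by_class(html, css_class):
    parser = ClassTextExtractor(css_class)
    parser.feed(html)
    return parser.found


def visible_text(html):
    parser = VisibleTextExtractor()
    parser.feed(html)
    return "\n".join(parser.chunks)


# Hypothetical markup before and after a redesign: same content, new classes.
OLD_HTML = ('<article><h2 class="post-title">What is generative AI?</h2>'
            '<span class="author">Theo Vasilis</span></article>')
NEW_HTML = ('<article><h3 class="card__heading">What is generative AI?</h3>'
            '<p class="byline">Theo Vasilis</p></article>')

print(by_class(OLD_HTML, "post-title"))  # selector matches the old markup
print(by_class(NEW_HTML, "post-title"))  # selector finds nothing after the redesign
print(visible_text(OLD_HTML) == visible_text(NEW_HTML))  # text input is unchanged
```

The selector-based extractor silently returns nothing once the class names change, while the text handed to the model is identical in both versions.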

Data consistency depends heavily on the prompt you provide. Be as specific as possible and always describe the schema, including the exact attribute names you expect (such as “title”, “author_name”, and “publication_date”).
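
One way to keep the prompt and the downstream checks in sync is to derive both from a single schema definition. The sketch below is an assumption about how you might do that, not a prescribed method: the field descriptions, the `schema_instructions` and `check_record` helpers, and the choice of `YYYY-MM-DD` dates are all illustrative.

```python
from datetime import date

# Single source of truth for attribute names and their meaning.
# The descriptions are illustrative; adapt them to your own data.
SCHEMA = {
    "title": "the article title, copied verbatim",
    "author_name": "the author's full name",
    "publication_date": "the publication date as YYYY-MM-DD",
}


def schema_instructions():
    """Render the schema as text to append to the extraction prompt."""
    lines = ['- "{}": {}'.format(key, meaning) for key, meaning in SCHEMA.items()]
    return ("Return a JSON array of objects with exactly these keys:\n"
            + "\n".join(lines))


def check_record(record):
    """Return a list of problems found in one extracted record."""
    problems = []
    missing = set(SCHEMA) - set(record)
    extra = set(record) - set(SCHEMA)
    if missing:
        problems.append("missing keys: {}".format(sorted(missing)))
    if extra:
        problems.append("unexpected keys: {}".format(sorted(extra)))
    try:
        date.fromisoformat(str(record.get("publication_date", "")))
    except ValueError:
        problems.append("publication_date is not YYYY-MM-DD")
    return problems


good = {"title": "What is generative AI?",
        "author_name": "Theo Vasilis",
        "publication_date": "2023-06-23"}
drifted = {"articleName": "What is generative AI?",
           "author_name": "Theo Vasilis",
           "publication_date": "Jun 23, 2023"}

print(schema_instructions())
print(check_record(good))
print(check_record(drifted))
```

Records that drift from the schema (a renamed key, a free-form date) are flagged instead of silently flowing into your dataset.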

Also, keep in mind that the GPT model only sees its current context and does not retain references to articles, authors, or concepts it encountered on previously processed pages.

As a result, while it will transform pages to JSON arrays just fine, it will struggle to transform the data to RDF (Resource Description Framework).

AI web scraping - data extraction with GPT model - RDF graph
Example of an RDF graph (taken from Stardog). Utilizing GPT for creating these graphs from web content can be difficult because of the limited context size (e.g., I’m scraping the Love Me Do page, but I don’t know how to refer to the “The Beatles” node because its “definition” was on a different page).

RDF models are cool because they can answer questions like, “Which friends of John Lennon live in Liverpool and have worked with him on at least two albums?” which is something regular databases struggle with because you have to label all the entities and their relations manually.
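
To show what such a question looks like against triple-shaped data, here is a toy sketch in plain Python. It is a stand-in for a real RDF store and query language, not an implementation of either. All triples are invented placeholders (`person_a`, `album_1`, and so on) and are not facts about real people.

```python
# Toy (subject, predicate, object) triples. Every entry is a made-up
# placeholder used only to exercise the query below.
TRIPLES = {
    ("person_a", "friend_of", "john_lennon"),
    ("person_a", "lives_in", "liverpool"),
    ("person_a", "worked_on", "album_1"),
    ("person_a", "worked_on", "album_2"),
    ("person_b", "friend_of", "john_lennon"),
    ("person_b", "lives_in", "london"),
    ("person_b", "worked_on", "album_1"),
    ("person_b", "worked_on", "album_2"),
    ("person_c", "friend_of", "john_lennon"),
    ("person_c", "lives_in", "liverpool"),
    ("person_c", "worked_on", "album_1"),
    ("john_lennon", "worked_on", "album_1"),
    ("john_lennon", "worked_on", "album_2"),
}


def objects(subject, predicate):
    """All objects linked from `subject` via `predicate`."""
    return {o for s, p, o in TRIPLES if s == subject and p == predicate}


def subjects(predicate, obj):
    """All subjects linked to `obj` via `predicate`."""
    return {s for s, p, o in TRIPLES if p == predicate and o == obj}


def liverpool_collaborators(person, min_albums=2):
    """Friends of `person` who live in Liverpool and share enough albums."""
    friends = subjects("friend_of", person)
    residents = subjects("lives_in", "liverpool")
    own_albums = objects(person, "worked_on")
    return sorted(
        friend
        for friend in friends & residents
        if len(objects(friend, "worked_on") & own_albums) >= min_albums
    )


print(liverpool_collaborators("john_lennon"))
```

The query only works because every page's facts use the same identifier for the same entity (`john_lennon` here), which is exactly what is hard to guarantee when each page is processed in a separate context.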

Connecting a large language model with a web crawler may seem like the go-to solution for parsing webpages and creating RDF graphs out of them, but it’s tricky because of the limited context memory of today’s LLMs.

Can AI do web scraping?

So, can you use AI to do web scraping, and – more to the point – should you?

As with other uses of GPT models, AI tools are most helpful to those who know their field well enough to moderate and correct them.

If you don't know how to code, you shouldn’t trust an AI to do it for you.

If you’re a developer, then you may find GPT models helpful for certain aspects of web scraping, especially if you’re particularly good at “prompt engineering”, but I don’t think they’re ready to steal your job just yet.

Theo Vasilis
Writer, Python dabbler, and crafter of web scraping tutorials. Loves to inform, inspire, and illuminate. Interested in human and machine learning alike.
