Hi, we're Apify, a full-stack web scraping and browser automation platform. This article about AI and web scraping trends in 2023 and predictions for 2024 was inspired by our work on getting better data for AI. Check us out.
Web scraping and AI: where are we, and how did we get here?
Remember, remember,
the 30th of November.
Of course, I'm talking about November 2022, when ChatGPT hit the scene. Since that fateful day, artificial intelligence has had a few brand makeovers.
It all began with apocalyptic speculation reminiscent of science-fiction movies and fears of job losses on an unprecedented scale. What followed was a little more boring and somewhat more predictable: 'AI' became the mother of all buzzwords, and companies joined a zombie-like scramble to sprinkle the term onto every product.
Less than a year after the launch of ChatGPT, the term AI is nearly bereft of meaning. It seems that almost anything that involves a computer doing what computers do is “AI-powered” technology.
➡️ Getting your software to organize your photos by date? AI!
➡️ A computer recognizing pixels in an image? AI!
➡️ Robotic process automation? AI again!
➡️ Generating browser fingerprints to avoid your web scraper getting detected? You guessed it! AI!
Almost anything we were automating or using software for before November 2022 was rebranded as “AI-powered” in 2023.
All of this has blurred the lines between artificial intelligence and web scraping. So, before we look at what the future may hold in 2024, let's remind ourselves of what has happened and the state of AI/web scraping today.
AI and the internet 🌎
Web scraping was a silent hero/culprit in the widespread adoption of AI models. Hardly surprising. The web is the largest and most convenient repository of information we've ever known, and web scraping is the most efficient method of extracting that data. All the well-known LLMs (large language models), such as ChatGPT, Bard, PaLM, and Gopher, were trained on data extracted from the web. The same goes for image models like Stable Diffusion, DALL-E, and Midjourney.
That was the first use case of web scraping in the domain of AI and machine learning: extracting data for training datasets.
The rise of vector databases 🔢
Before ChatGPT arrived, the worlds of machine learning and databases were on a collision course. AI represented data as vectors, and no existing database could manage them efficiently. Hence the rise of vector databases.
Designed to handle the unique structure of vector embeddings (dense vectors of numbers that represent text), these databases can index vectors for easy search and retrieval by comparing values and finding those that are most similar to one another. That makes them ideal for NLP and AI-driven applications.
The first on the scene (a year before ChatGPT) was Pinecone, but a range of open-source Pinecone alternatives swiftly rose up to challenge it.
Such vector databases can be used to extend large language models with long-term memory based on your own data. You query the database for documents relevant to a prompt, then add them to the context, which customizes the final response the LLM gives.
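To make the pattern concrete, here's a minimal sketch in Python, with plain NumPy cosine similarity standing in for a real vector database and a placeholder embed() function standing in for a real embedding model:

```python
# A minimal sketch of the retrieval pattern a vector database enables.
# Plain NumPy cosine similarity stands in for a real vector index, and
# embed() is a placeholder for a real embedding model (e.g., OpenAI's).
import numpy as np

def embed(text: str) -> np.ndarray:
    # Placeholder: deterministic pseudo-random vectors. A real system
    # would call an embedding model here.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(384)

documents = [
    "Our API rate limit is 30 requests per second.",
    "Refunds are processed within 14 days of purchase.",
]
doc_vectors = np.stack([embed(d) for d in documents])

def retrieve(query: str) -> str:
    q = embed(query)
    # Cosine similarity between the query and every stored document.
    scores = doc_vectors @ q / (
        np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q)
    )
    return documents[int(np.argmax(scores))]

# Stuff the most relevant document into the prompt as extra context.
question = "How fast can I call the API?"
prompt = f"Context:\n{retrieve(question)}\n\nQuestion: {question}"
```

A real vector database does essentially this, but with approximate nearest-neighbor indexes so the search stays fast across millions of vectors.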
That was the second use case for web scraping in the domain of AI: providing data to feed vector databases.
Enter LangChain 🦜🔗
The arrival of LangChain in October 2022 was a big deal for AI. Unlike the aforementioned vector databases, which are designed specifically for storing vectors, LangChain is a more generic library that simplifies the process of integrating different vector databases into an application.
LangChain connects to the AI models, frameworks, and platforms you want to use, such as OpenAI, Hugging Face, and Haystack, and links them with outside sources. That means you can chain commands together so an AI model knows what it needs to do to produce the answers or perform the tasks you require.
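As a rough illustration, here's what a simple chain looked like with LangChain's 2023-era Python API (the library has evolved quickly, so treat this as a sketch rather than canonical usage):

```python
# A minimal chaining sketch using LangChain's 2023-era API.
# Assumes `pip install langchain openai` and OPENAI_API_KEY in the environment.
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

prompt = PromptTemplate(
    input_variables=["topic"],
    template="Summarize the latest developments in {topic} in two sentences.",
)
chain = LLMChain(llm=OpenAI(temperature=0.7), prompt=prompt)
print(chain.run(topic="web scraping"))
```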
LangChain quickly became, and remains, the library of choice for building on top of AI models. Among other things, it's a vital ingredient in creating custom AI chatbots, which in the past few months have become the number one use case for AI (take Intercom's AI chatbot, for example).
Thus, a third use case for AI-targeted web scraping arose:
Integrating scraped data with LangChain means you can customize AI models with up-to-date information, thus overcoming the memory limitations of LLMs and the knowledge cutoff imposed by their training data.
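A hedged sketch of that pattern, again with LangChain's 2023-era API: scraped text goes into a FAISS vector store, and a RetrievalQA chain answers questions from it. The scraped_texts placeholder stands in for real scraper output:

```python
# Retrieval-augmented answering over scraped text (LangChain 2023-era API).
# Assumes `pip install langchain openai faiss-cpu` and an OPENAI_API_KEY.
from langchain.llms import OpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.chains import RetrievalQA

scraped_texts = [
    "Pricing page: the Pro plan costs $49/month...",  # placeholder content
    "Changelog: version 2.0 adds scheduled runs...",  # placeholder content
]

store = FAISS.from_texts(scraped_texts, OpenAIEmbeddings())
qa = RetrievalQA.from_chain_type(llm=OpenAI(), retriever=store.as_retriever())
print(qa.run("How much does the Pro plan cost?"))
```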
As of September 27, 2023, GPT-4's knowledge is no longer restricted to the data used at the time of training.
GPT plugins 🔌
Another development that unlocked a vast range of use cases is GPT plugins. Plugins are tools designed specifically for language models that help ChatGPT access up-to-date information, run computations, or use third-party services. In this respect, they fulfill a function similar to LangChain.
The advantage of LangChain over ChatGPT plugins, however, is its compatibility with most available LLMs. Plugin developers are restricted to defining specific actions or HTTP endpoints for ChatGPT to call, which makes it challenging for third-party developers to build tools around an LLM of their choice.
What has web scraping got to do with this?
You can use a ChatGPT plugin, like Code Interpreter (now included in GPT-4), to craft code to scrape websites with complicated webpage structures or with active anti-scraping protocols. It can help visualize outputs, parse, debug, and execute code, integrate with software binaries, and do other programming-related things.
You can use it with Google Search Scraper to fetch Google Search results, or with Website Content Crawler to download individual web pages or files.
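For instance, here's a short sketch using the Apify Python client to run Website Content Crawler and read the crawled pages (the input shown is a simplified subset of the Actor's schema, and the token is a placeholder):

```python
# Run Apify's Website Content Crawler from Python and read the crawled
# pages from the run's dataset. Assumes `pip install apify-client`.
from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")  # placeholder token

run = client.actor("apify/website-content-crawler").call(
    run_input={"startUrls": [{"url": "https://docs.apify.com"}]},
)

for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    # Each item is one crawled page; "url" and "text" are output fields.
    print(item["url"], item.get("text", "")[:200])
```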
Web scraping for AI vs. AI for web scraping 💻
That brings us to using AI for web scraping. Until now, I've referred to AI web scraping use cases in the context of training machine learning models, feeding vector databases, customizing and fine-tuning LLMs, and creating chatbots. That's web scraping for AI. Now, there's also AI for web scraping. The question is, do AI web scraping tools really work? And are they really AI?
As I said at the beginning, just about everything that involves technology is being labeled as AI these days. This is no less true for web scraping, with those two letters - AI - being inserted into just about every web scraping product out there.
AI has become synonymous with 'no-code' and 'low-code' web scraping tools. We had those before AI was put into the hands of consumers. Rebranding them doesn't change how they work or what they're capable of.
That being said, there are some tools out there that really do use generative AI to handle some aspects of web scraping.
Kadoa, for example, lets you scrape without CSS or XPath selectors: you provide commands in natural language, and no selector syntax is required. However, the capabilities of such tools are currently limited. For serious, large-scale data extraction projects, you need infrastructure, not just an “AI-powered” web scraper.
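To illustrate the general idea behind selector-free scraping (a sketch of the pattern, not Kadoa's actual implementation), you can hand raw HTML to an LLM with a natural-language instruction. This example uses the 2023-era openai 0.x Python API and assumes an API key is configured:

```python
# Selector-free extraction: ask an LLM to pull structured data out of
# raw HTML instead of writing CSS/XPath selectors.
# Assumes `pip install openai` (the 2023-era 0.x API) and an API key.
import json
import openai

def extract_listings(html: str) -> list:
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[
            {
                "role": "system",
                "content": "Extract every product's name and price from the "
                           "HTML. Reply with a JSON array of {name, price} "
                           "objects and nothing else.",
            },
            # Truncate so the page fits in the model's context window.
            {"role": "user", "content": html[:12000]},
        ],
    )
    return json.loads(response["choices"][0]["message"]["content"])
```

The trade-off is cost and speed: an LLM call per page is far more expensive than a selector, which is one reason these tools don't yet scale to serious extraction workloads.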
The rising tide of multimodal AI 🌊
The idea of AI having eyes and ears might sound like something from science fiction, but to an extent, it's already happening.
Until recently, AI has meant almost nothing but chatbots (think of the web interfaces of Bard, GPT-powered Bing, and Claude), but new modes of using AI have appeared in the past few months.
When GPT-4 was released, it was heralded as a multimodal AI. It turned out to be a damp squib. Its multimodality never saw the light of day due to costs. But Bing has now introduced multimodal features to its creative mode, which uses GPT-4, and OpenAI’s ChatGPT now includes GPT-4V(ision), which can analyze graphics and photos. Google is also planning to release its multimodal competitor, Gemini, by the end of 2023.
Multimodal AI lets the model see and understand images. In other words, AI is now capable of describing and explaining photos, memes, and videos. That means it could even create reviews for products based on their photos. Of course, AI remains as prone to error as ever and still suffers from hallucinations. Nonetheless, it has become apparent that multimodal AI allows us to do things we couldn't do before.
The same is true for audio. Some may not be aware that OpenAI introduced an automatic speech recognition system called Whisper back in September 2022. This speech-to-text system, which powers voice input in the ChatGPT mobile app, is more accurate and efficient than the likes of Siri and Google Assistant. Rather than having to dictate every word of a command, you need only state your intent. Say something like, “I need you to write a letter explaining…”, and Whisper will transcribe it so the model can produce what you asked for.
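Here's a minimal transcription sketch using the open-source whisper package (assumes `pip install openai-whisper` plus ffmpeg on the system; voice_note.mp3 is a placeholder file name):

```python
# Minimal speech-to-text with the open-source Whisper package.
import whisper

model = whisper.load_model("base")           # small, fast checkpoint
result = model.transcribe("voice_note.mp3")  # speech -> text
print(result["text"])
```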
How does this relate to web scraping? Let's find out in my predictions for AI and web scraping in 2024…
AI and web scraping predictions for 2024
Better AI chatbots 🤖
I've never quite understood why companies love chatbots so much, given how much customers hate them. However, the combination of web scraping and AI means that chatbots can have the information needed to answer many queries quickly and efficiently.
Scraping your own website and feeding the content to a large language model means your chatbot can draw on your site or documentation to give true and accurate answers to customer inquiries.
If done properly, chatbots will only get better in 2024.
Text scraping won't be enough 🖼️
Multimodal AI is likely to be the next big thing in 2024, which means scraping text won't be enough to meet the needs of users. Even for language models like GPT, images, video, and audio will be just as important as the written word. Multimodal models need such content extracted and labeled for training, so methods and tools for scraping multimedia will be in big demand.
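As a taste of what that involves, here's a simple sketch that collects image URLs from a page with requests and BeautifulSoup and downloads them (both packages assumed installed; error handling and file-type detection are omitted for brevity):

```python
# A simple sketch of multimedia scraping: collect image URLs from a page
# and download them. Assumes `pip install requests beautifulsoup4`.
import os
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def download_images(page_url: str, out_dir: str = "images") -> None:
    os.makedirs(out_dir, exist_ok=True)
    soup = BeautifulSoup(requests.get(page_url, timeout=30).text, "html.parser")
    for i, img in enumerate(soup.find_all("img", src=True)):
        img_url = urljoin(page_url, img["src"])  # resolve relative URLs
        data = requests.get(img_url, timeout=30).content
        with open(os.path.join(out_dir, f"img_{i}.jpg"), "wb") as f:
            f.write(data)
```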
AI-powered web scrapers 🦾
AI-powered web scraping is still limited, but inroads are being made. Any companies claiming their web scrapers are AI-powered when they're not are going to be left by the wayside.
These new 'no/low-code' scraping tools won't only be of use to those with little coding knowledge; they'll also make the lives of seasoned developers easier. Scrapers built with tools like GPT can be more resilient to page changes because they don't depend on CSS selectors, which can stop working after a redesign or a change to the page layout.
Will AI be able to do large-scale scraping? 🤔
The big question is, will AI be able to handle large-scale scraping tasks? The greatest obstacles any serious scraping project faces are anti-bot protections: CAPTCHAs, Cloudflare bans, honeypot traps, and the like. Dealing with issues like these requires robust infrastructure and sophisticated web scraping techniques, such as smart proxy rotation and browser fingerprint generation.
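For a sense of scale, even the most basic form of proxy rotation, sketched below with the requests library and hypothetical proxy URLs, is only a tiny piece of that infrastructure; "smart" rotation layers session persistence, retries, health checks, and fingerprint management on top:

```python
# A deliberately naive sketch of proxy rotation with requests. The proxy
# URLs are hypothetical placeholders; production-grade rotation needs
# far more machinery than a simple round-robin.
import itertools

import requests

PROXIES = [
    "http://user:pass@proxy1.example.com:8000",
    "http://user:pass@proxy2.example.com:8000",
]
proxy_pool = itertools.cycle(PROXIES)

def fetch(url: str) -> requests.Response:
    proxy = next(proxy_pool)  # round-robin through the pool
    return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=30)
```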
AI may be able to help make some web scraping tasks easier, but given the current limits of AI models, it's unlikely that they'll be able to handle these things without a great deal of human intervention.
Ethical and legal considerations for AI and web scraping ⚖️
With the legal landscape slowly shifting under our feet, it's hard to predict where legislation around the use of AI and web scraping will take us.
Debates about the legal limits of web scraping and the ethics of data collection are nothing new, but there's also a debate about whether AI-generated content is protected by copyright law.
Legislation doesn't move as fast as AI does, and for now, it would seem that AI-generated content is not yet eligible for copyright protection. But we'll have to see how copyright law evolves to find out what limitations (if any) will be placed on scraping and using AI-generated content.
AI models will need retraining with large datasets 📚
One more thing to watch out for (possibly as early as 2024) is the need to frequently retrain even the largest of language models and other generative AI. The problems of AI degradation and model collapse mean that these models will only get worse otherwise.
The only viable solution is to retrain AI models with new sources of ground truth, manual data labeling, and large volumes of data.
We've learned that generating synthetic data only adds to the problem, while pre-packaged datasets are outdated and hard to customize. So, what will be the best way to collect data for AI? As always, by scraping the web.
Data is our business, and business is good
What do AI and web scraping have in common? Data. Lots and lots of data!
From training and retraining LLMs and feeding vector databases to creating and customizing AI chatbots, web data extraction remains the go-to solution for AI systems and applications.
For as long as data remains the fuel for AI, which will almost certainly be the case in 2024, web scraping will stay in business.
Extract text content from the web to feed your vector databases and fine-tune or train large language models.