Bing and Bard can search, and ChatGPT can be used to process any live web page with a bit of help from web scraping. All great fun, but there’s an even more interesting use case for combining large language models and scraping. By crawling a website and ingesting its content using large language models (LLMs), you enable a new level of interaction – it's like talking to the website directly.
The same goes for documentation, knowledge bases, help articles, blogs, research, or any other content. No more search boxes and guessing which terms will lead you to the right page: given the right data, an LLM can give you an easily understandable, natural-language answer to any question about the content.
This functionality can be used to create a custom AI chatbot, feed and fine-tune any LLM, or generate personalized content on the fly that accurately reflects a brand tone. The ingested data can also be processed by the LLM to update or improve it.
Using LLM web scraping to talk to any website
Apify recently released a new Apify Actor to make it easy to ingest content from any website. Website Content Crawler performs a deep crawl of a website and automatically removes headers, footers, menus, ads, and other noise from the web pages in order to return only text content that can be directly fed to the LLM.
It has a simple input configuration so that it can be easily integrated into customer-facing products. It scales gracefully and works for small sites as well as sites with millions of pages. The results can be retrieved via the API in formats such as JSON or CSV and fed directly to your LLM, a vector database, or ChatGPT.
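As a rough sketch of what retrieving results over the API can look like (the dataset ID and token below are placeholders; the URL shape follows Apify's public dataset-items endpoint):

```python
# Minimal sketch: downloading crawl results from the Apify dataset API.
# "YOUR_DATASET_ID" is a placeholder for the ID of your Actor run's dataset.

def dataset_items_url(dataset_id: str, fmt: str = "json") -> str:
    """Build the public API URL for downloading a dataset in a given format."""
    return f"https://api.apify.com/v2/datasets/{dataset_id}/items?format={fmt}"

url = dataset_items_url("YOUR_DATASET_ID", fmt="json")
# With the `requests` library installed, you could then fetch the items:
#   import requests
#   items = requests.get(url, params={"token": "YOUR_APIFY_TOKEN"}).json()
```

Swapping `fmt="json"` for `"csv"` gives you the same data in CSV, which some vector-database import pipelines prefer.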
Website Content Crawler has an integration for LangChain and an Apify Dataset Loader for LlamaIndex. So go ahead and try it out for your own website or build on it. Incorporate it into your custom AI chatbot, create apps on it, whatever you can imagine.
Here’s a step-by-step guide on how to use it.
How to extract data to feed your LLM
Step 1. Get Website Content Crawler
Step 2. Enter the URL of the website you want to scrape
Website Content Crawler will run just fine on the default settings, so you can click Start if you want to take it for a quick test drive. The default example will crawl a single page from the Apify documentation.
Step 3. Configure input parameters to control the crawl
Website Content Crawler can do extremely deep crawls, so you will definitely want to set some limits to minimize your platform usage (every free Apify account comes with $5 of prepaid usage, which should be enough to test or scrape small websites).
Each of these settings will adjust the crawler behavior. Here’s a quick overview of the main ones:
- Max crawling depth: the maximum number of links away from the start URL that the crawler will recursively follow.
Check out the input parameters for a full description of all settings.
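To make the limits concrete, here is a hypothetical input configuration for a small, capped crawl. The field names (`startUrls`, `maxCrawlDepth`, `maxCrawlPages`) are assumptions based on the Actor's input schema, so check them against the current input parameter docs:

```python
# Hypothetical input for a limited crawl of the Apify documentation.
# Field names are assumed from the Actor's input schema and may differ.
run_input = {
    "startUrls": [{"url": "https://docs.apify.com/"}],
    "maxCrawlDepth": 2,    # follow links at most 2 hops from the start URL
    "maxCrawlPages": 100,  # stop after 100 pages to cap platform usage
}
```

A depth of 2 and a 100-page cap is usually plenty for a test drive and keeps you well within the free prepaid usage.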
Once you’ve established sensible limits, you can go ahead and crawl any website. Try it on your own documentation or knowledge base.
Step 4. Refine HTML processing and output settings
By default, Website Content Crawler strips unwanted content such as headers, navigation, and footers from its output, so you don't feed them to your LLM. You can also customize which HTML elements to ignore.
And there are plenty of output settings for you to experiment with, such as saving HTML or Markdown, screenshots, and so on.
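As an illustration, output-related settings might look like the following. The field names here (`removeElementsCssSelector`, `saveMarkdown`, `saveScreenshots`) are assumptions modeled on the Actor's input schema, not a definitive reference:

```python
# Hypothetical output settings; field names are assumed and should be
# verified against the Actor's input schema.
run_input = {
    "startUrls": [{"url": "https://docs.apify.com/"}],
    # CSS selector for elements to strip before text extraction
    "removeElementsCssSelector": "nav, footer, header, aside",
    "saveMarkdown": True,     # also keep a Markdown rendering of each page
    "saveScreenshots": False, # skip screenshots to keep runs cheap
}
```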
Step 5. Feed the content to your LLM
Once the crawl is finished, you can export the scraped content in JSON, HTML, and a range of other formats, so choose whatever works for your LLM.
Here’s an extract of the scraped content from Web Scraping for Beginners in JSON format:
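As a rough, simplified illustration of the shape of each result (hypothetical values; real records include additional metadata fields), every page comes back as a JSON object pairing the URL with its cleaned text:

```python
# Illustrative (not real) shape of one crawl result record: the page URL
# plus the cleaned text content, ready to feed to an LLM.
record = {
    "url": "https://docs.apify.com/academy/web-scraping-for-beginners",
    "text": "Web scraping for beginners\nLearn the basics of ...",
}
```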
It really is that easy for you to talk to any website with GPT, Llama, Alpaca, or any other large language model. You can use Website Content Crawler for your own AI and web scraping projects or build upon it for your customers. Enhance the performance of your LLMs, create personalized content, develop custom chatbots, and improve existing content with summarization, proofreading, translation, or style changes.
Give your LLM a memory with LangChain
The LangChain framework is designed to simplify the creation of applications using large language models. LangChain acts as an abstraction layer that handles integration with APIs, cloud storage platforms, other large language models, and an extensive range of other services, enabling document analysis, custom AI chatbot creation, code analysis, and data manipulation.
Check out this guide on how to get started with LangChain and some more examples of how combining Website Content Crawler and LangChain can be used to easily create ChatGPT-like query interfaces for websites.
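The glue code between the two is small: LangChain's Apify dataset loader takes a function that maps each crawler result item to a document. A minimal sketch of that mapping (the `"text"` and `"url"` field names are assumed from the crawler's output):

```python
# Map one Website Content Crawler result item to the (text, metadata)
# pair a document loader expects. Field names "text" and "url" are
# assumed from the crawler's output format.

def item_to_document_fields(item: dict) -> tuple:
    """Extract page text and a source-metadata dict from a dataset item."""
    return item.get("text", ""), {"source": item.get("url", "")}

# With LangChain installed, this plugs into the loader roughly like:
#   from langchain_community.document_loaders import ApifyDatasetLoader
#   from langchain_core.documents import Document
#   loader = ApifyDatasetLoader(
#       dataset_id="YOUR_DATASET_ID",
#       dataset_mapping_function=lambda item: Document(
#           page_content=item["text"], metadata={"source": item["url"]}
#       ),
#   )

text, meta = item_to_document_fields(
    {"url": "https://example.com", "text": "Hello"}
)
```

Keeping the source URL in the metadata lets the chatbot cite which page an answer came from.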
Internal Apify AI hackathon: how to create a custom AI chatbot
Apify is really excited about what LLMs can do, and we recently held an internal AI hackathon to spend a couple of days of intense work on projects our devs found exciting. One of the most interesting results was a guide on how to create a custom AI chatbot with Python.