How to use ChatGPT for web scraping (step-by-step guide)

If you know web scraping basics, this tutorial will teach you how to speed things up with AI.


We’re Apify, a full-stack web scraping and browser automation platform. A big part of what we do is getting better data for AI. But this time, we'll show you how to use AI to get data.


I'll level with you.

If you know nothing at all about web scraping, this tutorial won't help you much.

ChatGPT can't do all the hard work for you.

However, if you know web scraping basics, let me assure you that this guide will teach you how to use AI as a web scraping assistant and save a ton of time coding.

I'll show you how to scrape X (formerly known as Twitter), which is a JavaScript-rendered website. That means the content isn't in the initial HTML source code but is loaded or generated by JavaScript after the initial page load.

So, not the easiest of sites to scrape.

But believe me, AI will make it a lot easier.

Scrape data with ChatGPT

What you'll need

To follow this, you don't even need to know a programming language. But you will need:

How to scrape a website with ChatGPT

Step 1. Choose the website and scraping tools

For this tutorial, I'm going to scrape X posts (you probably still call them tweets) from the Apify X account: https://x.com/apify.

I happen to know that X is a dynamic website.

I also happen to know that to extract data from such a site, I need a tool that can execute JavaScript and interact with web pages.

My favorite framework for that is Playwright for Python, so that's the tool I'll go for here.

Step 2. Inspect the website and identify the right tags

You probably know that web scraping involves identifying the appropriate tags when you inspect the web page:

Press F12 on Windows or Option + ⌘ + I on Mac to open DevTools on Google Chrome

So, let's do some tag-hunting for Apify tweets.

Right-click on a post to inspect the elements you need to scrape:

Look for the elements and attributes you want to scrape

The div element contains the lang attribute, which is often used on tweet text to denote the language of the tweet.

I'll choose this so the web scraping script focuses on tweets rather than other types of content on the page.

Step 3. Prompt ChatGPT

Now that you know the URL, tool, and tags you want, it's time to prompt your AI web scraping assistant.

For this particular case, here's the prompt I gave ChatGPT:

#Scrape https://x.com/apify using Python Playwright.

#Maximize the window, wait 10 seconds, and locate all the elements with the following XPath: "div" tag, attribute name "lang".

#Print the text inside these elements.

#Scroll down until it has scraped and printed at least 20 different tweets.

Prompt explanation:

  • Maximizing the window: This helps to load more content initially and reduces the need for excessive scrolling. It also helps ensure that responsive websites display their content in the desktop view, which often includes more information than mobile views.
  • Waiting 10 seconds: A fixed wait gives the page time to load so that dynamically loaded content, like tweets, is rendered before the script tries to interact with it. This helps avoid errors caused by accessing elements that aren't available yet.
  • XPath: The XPath expression //div[@lang] targets all div elements that have a lang attribute so we can extract relevant content (tweets) only.
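To see what that XPath expression selects without opening a browser, here's a minimal sketch using Python's built-in xml.etree.ElementTree, which supports a limited subset of XPath. The HTML snippet is made up for illustration and is far simpler than X's real markup. Note that ElementTree needs the relative form .//div[@lang] rather than //div[@lang].

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed markup standing in for a page (not X's real HTML)
html = """
<body>
  <div lang="en">First tweet text</div>
  <div class="sidebar">Who to follow</div>
  <div><div lang="en">Second tweet text</div></div>
  <span lang="en">Not a div</span>
</body>
"""

root = ET.fromstring(html)

# Select every div, at any depth, that carries a lang attribute
matches = [el.text for el in root.findall(".//div[@lang]")]
print(matches)
```

Only the two div elements with a lang attribute are selected. The sidebar div, the wrapper div, and the span are skipped.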

📌
I already had Playwright installed on my system. If you don't have it, you'll need to install it and its browser binaries first:
pip install playwright
playwright install

After telling me to set up Playwright and follow the same steps I gave it, ChatGPT provided me with this code:

import asyncio
from playwright.async_api import async_playwright

async def scrape_x_com_apify():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        context = await browser.new_context()
        page = await context.new_page()
        
        # Maximize the window
        await page.set_viewport_size({"width": 1920, "height": 1080})
        
        # Navigate to the URL
        await page.goto('https://x.com/apify')
        
        # Wait for the page to load
        await page.wait_for_timeout(10000)
        
        tweet_texts = set()
        
        while len(tweet_texts) < 20:
            # Locate all elements with the given XPath
            tweet_elements = await page.query_selector_all('//div[@lang]')
            
            # Extract and print text from each element
            for element in tweet_elements:
                text = await element.inner_text()
                tweet_texts.add(text)
                print(text)
                
                if len(tweet_texts) >= 20:
                    break
            
            if len(tweet_texts) < 20:
                # Scroll down to load more tweets
                await page.evaluate('window.scrollBy(0, window.innerHeight)')
                await page.wait_for_timeout(2000)
        
        await browser.close()

# Run the function
asyncio.run(scrape_x_com_apify())

Did it work? Let's find out.

Step 4. Run the code in your terminal

Copy and paste the GPT-generated script into your code editor of choice. Here it is in a Jupyter notebook in VS Code:

Run ChatGPT-generated code in your terminal

It started well: it opened a Chrome browser and navigated to the Apify Twitter page:

Playwright opening the Chrome browser and navigating to the given URL

However, it failed to print the data. Instead, I got this runtime error:

The GPT-generated web scraping script produced a RuntimeError

Which brings us to the next step.

Step 5. Get ChatGPT to debug the code

ChatGPT is pretty good at rectifying code. The best thing is you don't even need to spend time typing. Just taking a screenshot and uploading it usually does the trick.

So, I took a screenshot of the error I received and uploaded it into ChatGPT.

It recognized the problem and responded with a solution:

The error message indicates that asyncio.run() cannot be called from a running event loop. This typically occurs when running asynchronous code in environments like Jupyter notebooks or certain IDEs.

To resolve this, you can use a different approach to run your asyncio code within these environments. Instead of using asyncio.run(), use the event loop directly.

It gave me some adapted code to run. So, I tried again.
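If you want to see why this error shows up, here's a minimal sketch that reproduces it outside a notebook. The scrape and notebook_cell functions are hypothetical stand-ins, not part of the generated script. The point is that a notebook cell already runs inside an event loop, so calling asyncio.run() there fails, while awaiting the coroutine directly works.

```python
import asyncio

async def scrape():
    # Hypothetical stand-in for the real scraping coroutine
    return "scraped"

async def notebook_cell():
    # Simulates a notebook cell: this code already runs inside an event loop
    coro = scrape()
    try:
        asyncio.run(coro)  # the same call the generated script ends with
    except RuntimeError as exc:
        coro.close()  # avoid a "never awaited" warning for the unused coroutine
        error = str(exc)
    else:
        error = None
    # Awaiting the coroutine directly works inside a running loop
    result = await scrape()
    return error, result

error, result = asyncio.run(notebook_cell())
print(error)
print(result)
```

In a Jupyter notebook, ending the cell with await scrape_x_com_apify() instead of asyncio.run(scrape_x_com_apify()) is typically the simplest way around this, because IPython supports top-level await.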


Step 6. Iterate

AI rarely gets things right the first or even the second time. So, if it doesn't succeed, don't give up.

I ran the adapted code and ended up with a similar error:

The adapted GPT-generated code also produced a RuntimeError

Once again, I showed ChatGPT the screenshot and got this response:

To avoid the error that occurs because the event loop is already running, we can use an alternative approach by creating a new event loop or using nest_asyncio to allow nested use of asyncio.run. The nest_asyncio library can patch the running event loop to allow it to be re-entered.

First, install nest_asyncio using pip if you haven't already:
pip install nest_asyncio
Then modify the script as follows:
import asyncio
import nest_asyncio
from playwright.async_api import async_playwright

nest_asyncio.apply()

async def scrape_x_com_apify():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        context = await browser.new_context()
        page = await context.new_page()
        
        # Maximize the window
        await page.set_viewport_size({"width": 1920, "height": 1080})
        
        # Navigate to the URL
        await page.goto('https://x.com/apify')
        
        # Wait for the page to load
        await page.wait_for_timeout(10000)
        
        tweet_texts = set()
        
        while len(tweet_texts) < 20:
            # Locate all elements with the given XPath
            tweet_elements = await page.query_selector_all('//div[@lang]')
            
            # Extract and print text from each element
            for element in tweet_elements:
                text = await element.inner_text()
                tweet_texts.add(text)
                print(text)
                
                if len(tweet_texts) >= 20:
                    break
            
            if len(tweet_texts) < 20:
                # Scroll down to load more tweets
                await page.evaluate('window.scrollBy(0, window.innerHeight)')
                await page.wait_for_timeout(2000)
        
        await browser.close()

# Run the function using the event loop
asyncio.run(scrape_x_com_apify())

When I ran the modified script, it proceeded to open the Chrome browser and navigate to the web page...

And success! It printed the data in my terminal:

Output (truncated):
Scrollable Content
12,500+ Stars on GitHub We are giving away 3 Crawlee shirts to celebrate this milestone Rules: - Follow @apify - Retweet this tweet - Star us at: https://apify.it/3UZ22lv - Reply with your GitHub username Deadline: 7th June 2024
And finally there's also an universal #webscraping and #crawling library for #JavaScript / #NodeJS, similar to @ScrapyProject for #Python that was around for years, but also working with headless Chrome and Puppeteer. Better late than never https://github.com/apifytech/apify-js…
That moment when @Microsoft starts using your open-source web scraping libraries... https://github.com/microsoft/accessibility-insights-service… #webscraping #opensource
Today we're launching Crawlee on Product Hunt Please show us some love and support, to encourage the team who spent years building this open-source library! Learn more about Crawlee in the thread 1/5 https://producthunt.com/posts/crawlee
We've just launched Crawlee on @ycombinator's Launch YC! Crawlee is an open-source Node.js library for developing web scrapers and crawlers — an essential tool to acquire data for fine-tuning LLMs and RAG. 11K stars on GitHub and counting
Our price watcher, done together with @topmonks @keboola and @BizzTreat for the biggest three Czech e-shops and their Black Friday discounts made it into @ForbesCesko magazine. And we're in good company opposite delicious @DiplomaticoRum https://blog.apify.com/black-friday-in-czechia-magical-prices-one-year-on-4ca9b25d2256…
Excited to sponsor @MariyaSha888 in the new #Python Simplified Code Jam - an initiative that incentivizes experienced and #beginnercoders alike to build exciting stuff together Check out Mariya's channel and register if you're up for a challenge: https://youtube.com/watch?v=tRlEkCLQ-fk…
This is how we hacked @WebExpo, Central Europe's largest conference about tech and web Using #HTTP status codes we redirected visitors of their opening rooftop party to our office next door, where they could enjoy our #API beer #guerrilla #marketing #webexpo
It's been almost 5 years since we first launched on @ProductHunt as Apifier. Since then, we've grown from two guys in a small house in Mountain View into a company of 30 amazing people and completely reinvented all of our products. Thanks for your support https://producthunt.com/posts/apify-1
Need to extract vital information locked away in PDF documents for your AI projects? Here's a solution! Apify's PDF Text Extractor + @LangChainAI! Read the tutorial below to find out how to combine them for QA from PDF docs
We just released our new #Google search scraper. Now you can get your or your competitions's #SEO or paid rankings in search results in just a few seconds. Check out the full blog post.
Another important court ruling in favor of web scraping: Scraping a public website without the approval of the website owner isn't a violation of the Computer Fraud and Abuse Act (via @arstechnica) https://arstechnica.com/tech-policy/2019/09/web-scraping-doesnt-violate-anti-hacking-law-appeals-court-rules/…
Packed house! 150+ attendees at yesterday's Prague Fall LLM Meetup in our Lucerna office! Huge thx to co-organizers @weights_biases, @kaya_vc & speakers @KocmiTom, @jiri_moravcik, @hansramsl, Tomáš Mikolov, @PetrBrzek. Stay tuned for more LLM- and GenAI-related events
Crawlee has surpassed 10,000 stars on GitHub! We are thrilled to see it helping so many developers simplify their web scraping and web crawling workflows And if you haven't already, give Crawlee a try (and maybe even a star) at the link below https://apify.it/40PnMms
1/ Today, we're launching a major innovation—developers can now charge for the usage of their actors on Apify. This gives a clear incentive to keep building and improving software that brings value to the community.
We've joined the fight against #COVID19 by turning official pages with statistics into #APIs that can be used by other apps. Check out the latest data from multiple countries. Want to see data from pages without APIs not yet listed? Tell us! #coronavirus https://bit.ly/2x1qYBG
30 girls from @czechitas party with Apify nerds at our #Lucerna #rooftop office Some really wild conversations about tech and software
Apify ChatGPT Plugin is here! The new plugin is a web browser for ChatGPT on steroids—it can scrape Google Search results, crawl websites and query web pages found, and browse individual web pages. And yes, it's free.
Apify is now integrated to @FlowiseAI You can now use Apify with the open-source UI visual tool to build your customized #LLM flow using @LangChainAI, written in Node TypeScript or JavaScript Learn more about the integration https://apify.it/3sQwHa7
U.S. Appeals Court preliminarily reaffirms web scraping is not hacking.
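One thing worth tightening in the generated script: its loop only stops once 20 texts are collected, so it never ends if the page yields fewer (for example, behind a login wall), and it can print the same text more than once. Here's a hedged sketch of one way to handle both. The collect_tweets function and its max_scrolls parameter are my own names, not something ChatGPT produced, and I haven't tuned the limits for X.

```python
import asyncio

async def collect_tweets(page, target=20, max_scrolls=15):
    """Collect up to `target` distinct texts from div[lang] elements.

    Gives up after `max_scrolls` scroll rounds instead of looping forever.
    """
    seen = []  # a list keeps first-seen order
    for _ in range(max_scrolls):
        for element in await page.query_selector_all('//div[@lang]'):
            text = (await element.inner_text()).strip()
            if text and text not in seen:
                seen.append(text)
                print(text)  # print each distinct text only once
                if len(seen) >= target:
                    return seen
        # Scroll down one viewport and give new content time to load
        await page.evaluate('window.scrollBy(0, window.innerHeight)')
        await page.wait_for_timeout(2000)
    return seen

async def main():
    # Imported here so collect_tweets can be tried without Playwright installed
    from playwright.async_api import async_playwright

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        context = await browser.new_context()
        page = await context.new_page()
        await page.set_viewport_size({"width": 1920, "height": 1080})
        await page.goto('https://x.com/apify')
        await page.wait_for_timeout(10000)
        tweets = await collect_tweets(page)
        print(f"Collected {len(tweets)} distinct texts")
        await browser.close()

# In a script: asyncio.run(main())
# In a Jupyter notebook: await main()
```

If the function returns fewer than 20 texts, that's a signal to check what the page actually rendered rather than to keep scrolling.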


How to not get blocked

I've shown you how to perform a simple web scraping task with ChatGPT, but code isn't enough for any serious, large-scale scraping project.

Extracting data from modern websites poses challenges beyond dynamic content: CAPTCHAs, IP blocks and bans, browser fingerprints, and more.

To overcome these problems, you need infrastructure that provides you with things like smart proxy management and browser fingerprint generation.
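At the smallest scale, Playwright lets you route a browser through a single proxy via the proxy option of chromium.launch(). Here's a minimal sketch of building those launch options. The build_launch_options helper, the proxy address, and the credentials are all hypothetical placeholders, and this covers only the proxy part, not rotation or fingerprinting.

```python
def build_launch_options(proxy_server, username=None, password=None, headless=True):
    """Build keyword arguments for Playwright's chromium.launch()."""
    proxy = {"server": proxy_server}
    if username and password:
        # Credentials are only added when both are provided
        proxy.update({"username": username, "password": password})
    return {"headless": headless, "proxy": proxy}

# Placeholder values: replace with a proxy you actually have access to
options = build_launch_options("http://proxy.example.com:8000", "user", "secret")
print(options)

# Usage inside an async Playwright script:
#     browser = await p.chromium.launch(**options)
```

A single static proxy won't get you far on a site like X, which is where managed proxy pools come in.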

The best solution? Deploy your code to a cloud platform like Apify.

The Apify platform was built to serve large-scale and high-performance web scraping and automation needs. It provides easy access to compute instances, storage, proxies, scheduling, webhooks, integrations, and more.

Check us out.

Another way to use ChatGPT for web scraping

Extended GPT Scraper is a web scraping tool that combines web content extraction with OpenAI's GPT language model. It scrapes websites, converts the content to markdown, and then uses GPT to process or transform the scraped text based on user-provided instructions. This allows for automated summarization, content analysis, or other text manipulations of web content at scale.


Summary: Tips for scraping with ChatGPT

  1. Use DevTools to inspect the target website.
  2. Pick a programming language and determine the right tools for the scraping task.
  3. Be as specific as possible in your prompt and always describe the schema (div tags, attribute names, etc.).
  4. Always run the code yourself.
  5. Upload screenshots to ChatGPT to help you fix errors.

Frequently asked questions

Can ChatGPT read a website?

ChatGPT can access and read some websites through its browsing mode. However, when browsing, it's not clear whether it's running searches on the Bing index or actually visiting pages. Regardless, there are some websites ChatGPT can't read, such as paywalled content, dynamic content, and private sites.

Can ChatGPT do web scraping?

No. ChatGPT can't do web scraping directly. But it can provide guidance, code examples, and explanations on how to use frameworks and libraries for scraping tasks. For actual web scraping, you would need to use a programming environment that can execute code and interact with websites.

How do I get ChatGPT to scrape websites?

ChatGPT is a language model designed to process and generate text based on the input it receives. It doesn't have the capability to interact with external websites or execute code. However, GPT Scraper and Extended GPT Scraper are two Apify tools you can use to extract content from a website and then pass that content to the OpenAI API.

Theo Vasilis
I used to write books. Then I took an arrow in the knee. Now I'm a technical content marketer, crafting tutorials for developers and conversion-focused content for SaaS.
