ChatGPT web scraping guide for 2025 (all you need to know)

This tutorial will show you how to use ChatGPT as a web scraping assistant

You don't need to be a web scraping master if you know how to get ChatGPT to help you out. This guide will teach you how to use it as a web scraping assistant and save a ton of time coding.

Can ChatGPT scrape websites?

No, ChatGPT can't do web scraping directly. But it can provide guidance, code examples, and explanations on how to use frameworks and libraries for scraping tasks. For actual web scraping, you would need to use a programming environment that can execute code and interact with websites.

That's precisely what this tutorial will demonstrate, only we'll be using ChatGPT to help us.

We'll scrape X (formerly known as Twitter), which is a JavaScript-rendered website. That means content is not present in the HTML source code but is loaded or generated after the initial page load. So, not the easiest of sites to scrape. We'll demonstrate how ChatGPT makes it a lot easier.


To follow this tutorial, you don't even need to know a programming language. But you will need a ChatGPT account and a Python environment to run the generated code.

ChatGPT web scraping tutorial

We're going to scrape X posts (you probably still call them tweets) from the Apify X account: https://x.com/apify. As X is a dynamic website, we'll need a tool that can execute JavaScript and interact with web pages.

1. Take care of prerequisites

One of the best frameworks for scraping dynamic websites like X is Playwright for Python, so that's the tool we'll use here. Install it with pip, then download the browser binaries Playwright needs:

pip install playwright
playwright install chromium

2. Inspect the website and identify the right tags

Web scraping starts with identifying the right tags and attributes when you inspect the web page. So, right-click the page, choose 'Inspect', and let's do some tag-hunting for Apify tweets:

Press F12 on Windows or Option + ⌘ + I on Mac to open DevTools on Google Chrome

Right-click on a post to inspect the elements you need to scrape:

Look for the elements and attributes you want to scrape

The div element containing the tweet text carries a lang attribute, which denotes the language of the tweet.

We'll choose this so the web scraping script focuses on tweets rather than other types of content on the page.

3. Craft a prompt for ChatGPT

Now that you know the URL, the tool, and the tags you want, it's time to prompt your AI web scraping assistant.

For this particular case, here's the prompt we'll give ChatGPT:

#Scrape https://x.com/apify using Python Playwright.

#Maximize the window, wait 10 seconds, and locate all the elements with the following XPath: "div" tag, attribute name "lang".

#Print the text inside these elements.

#Scroll down until it has scraped and printed at least 20 different tweets.

Prompt explanation:

  • Maximizing the window: This helps to load more content initially and reduces the need for excessive scrolling. It also helps ensure that responsive websites display their content in the desktop view, which often includes more information than mobile views.
  • Waiting 10 seconds: Waiting allows the webpage to load fully to make sure all elements, including dynamically loaded content like tweets, are completely rendered before the script tries to interact with them. This helps avoid issues where the script might try to access elements that aren't yet available.
  • XPath: The XPath expression //div[@lang] targets all div elements that have a lang attribute so we can extract relevant content (tweets) only.
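Before handing an XPath expression to ChatGPT, it's worth sanity-checking that it matches what you think it does. Here's a quick check using Python's standard library against a toy document (the HTML below is a made-up stand-in for X's rendered DOM, not its real markup):

```python
import xml.etree.ElementTree as ET

# A tiny stand-in for X's rendered page: only the tweet-text
# <div> elements carry a lang attribute.
html = """
<body>
  <div class="nav">Menu</div>
  <div lang="en">First tweet text</div>
  <div lang="en">Second tweet text</div>
</body>
"""

root = ET.fromstring(html)
# .//div[@lang] selects div elements that have a lang attribute,
# mirroring the //div[@lang] expression in the prompt.
tweets = [div.text for div in root.findall(".//div[@lang]")]
print(tweets)  # → ['First tweet text', 'Second tweet text']
```

The navigation div without a lang attribute is excluded, which is exactly why the prompt filters on that attribute.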

After telling us to set up Playwright and restating the steps we'd given it, ChatGPT provided us with this code:

import asyncio
from playwright.async_api import async_playwright

async def scrape_x_com_apify():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        context = await browser.new_context()
        page = await context.new_page()
        
        # Maximize the window
        await page.set_viewport_size({"width": 1920, "height": 1080})
        
        # Navigate to the URL
        await page.goto('https://x.com/apify')
        
        # Wait for the page to load
        await page.wait_for_timeout(10000)
        
        tweet_texts = set()
        
        while len(tweet_texts) < 20:
            # Locate all elements with the given XPath
            tweet_elements = await page.query_selector_all('//div[@lang]')
            
            # Extract and print text from each element
            for element in tweet_elements:
                text = await element.inner_text()
                tweet_texts.add(text)
                print(text)
                
                if len(tweet_texts) >= 20:
                    break
            
            if len(tweet_texts) < 20:
                # Scroll down to load more tweets
                await page.evaluate('window.scrollBy(0, window.innerHeight)')
                await page.wait_for_timeout(2000)
        
        await browser.close()

# Run the function
asyncio.run(scrape_x_com_apify())

4. Review the code

Once ChatGPT responds with code, it’s always a good idea to review it first. Just check that it does what it's supposed to do. Make sure it uses the libraries or packages you want to include. If you find any glitches, provide feedback to ChatGPT and ask it to generate the code again.
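One glitch worth catching during review: the generated while loop never terminates if the page yields fewer than 20 tweets (for example, if a login wall appears). Below is a minimal sketch of a scroll cap you could ask ChatGPT to add; collect and fetch_batch are hypothetical names for illustration, not part of the generated script:

```python
# Sketch: bound the number of scroll attempts so the loop can't
# spin forever when fewer tweets than expected ever load.
def collect(fetch_batch, target=20, max_scrolls=15):
    seen = set()
    for _ in range(max_scrolls):
        seen.update(fetch_batch())  # stands in for query_selector_all + inner_text
        if len(seen) >= target:
            break
        # in the real script: scroll down and wait here
    return seen

# Simulated batches standing in for successive page states:
batches = iter([["a", "b"], ["b", "c"], ["c", "d"]])
print(len(collect(lambda: next(batches, []), target=4)))  # → 4
```

With the cap in place, the worst case is max_scrolls iterations instead of an infinite loop.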

5. Run and debug the code

Copy and paste the GPT-generated script into your code editor of choice. Here it is in a Jupyter Notebook in VS Code:

Run ChatGPT-generated code in your terminal

It started well: it opened a Chrome browser and navigated to the Apify Twitter page:

Playwright opening the Chrome browser and navigating to the given URL

However, it failed to print the data. Instead, we got this runtime error:

The GPT-generated web scraping script produced a RuntimeError

This brings us to debugging.

ChatGPT is pretty good at rectifying code errors. The best thing is you don't even need to spend time typing. Just taking a screenshot and uploading it usually does the trick.

So, we took a screenshot of the error and uploaded it into ChatGPT.

It recognized the problem and responded with a solution:

The error message indicates that asyncio.run() cannot be called from a running event loop. This typically occurs when running asynchronous code in environments like Jupyter notebooks or certain IDEs. To resolve this, you can use a different approach to run your asyncio code within these environments. Instead of using asyncio.run(), use the event loop directly.
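ChatGPT's diagnosis is easy to verify in isolation: wherever an event loop is already running (as one is inside any Jupyter notebook), a nested asyncio.run() raises exactly this RuntimeError. A minimal demonstration:

```python
import asyncio

async def inner():
    return 42

async def outer():
    # Inside outer() an event loop is already running, so this nested
    # asyncio.run() call fails the same way the scraper script did.
    try:
        asyncio.run(inner())
    except RuntimeError as err:
        return str(err)

message = asyncio.run(outer())
print(message)  # → asyncio.run() cannot be called from a running event loop
```

Running the scraper from a plain terminal avoids the problem entirely, because there's no pre-existing loop; the nest_asyncio fix below is only needed inside notebook-style environments.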

It gave us some adapted code to run. So, let's try again:

The adapted GPT-generated code also produced a RuntimeError

Another error. We showed ChatGPT the screenshot and got this response:

To avoid the error that occurs because the event loop is already running, we can use an alternative approach by creating a new event loop or using nest_asyncio to allow nested use of asyncio.run. The nest_asyncio library can patch the running event loop to allow it to be re-entered. First, install nest_asyncio using pip if you haven't already:
pip install nest_asyncio
Then modify the script as follows:
import asyncio
import nest_asyncio
from playwright.async_api import async_playwright

nest_asyncio.apply()

async def scrape_x_com_apify():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        context = await browser.new_context()
        page = await context.new_page()
        
        # Maximize the window
        await page.set_viewport_size({"width": 1920, "height": 1080})
        
        # Navigate to the URL
        await page.goto('https://x.com/apify')
        
        # Wait for the page to load
        await page.wait_for_timeout(10000)
        
        tweet_texts = set()
        
        while len(tweet_texts) < 20:
            # Locate all elements with the given XPath
            tweet_elements = await page.query_selector_all('//div[@lang]')
            
            # Extract and print text from each element
            for element in tweet_elements:
                text = await element.inner_text()
                tweet_texts.add(text)
                print(text)
                
                if len(tweet_texts) >= 20:
                    break
            
            if len(tweet_texts) < 20:
                # Scroll down to load more tweets
                await page.evaluate('window.scrollBy(0, window.innerHeight)')
                await page.wait_for_timeout(2000)
        
        await browser.close()

# Run the function using the event loop
asyncio.run(scrape_x_com_apify())

When we ran the modified script, it proceeded to open the Chrome browser and navigate to the web page...

And success! It printed the data in the terminal:

Output (truncated):
12,500+ Stars on GitHub We are giving away 3 Crawlee shirts to celebrate this milestone Rules: - Follow @apify - Retweet this tweet - Star us at: https://apify.it/3UZ22lv - Reply with your GitHub username Deadline: 7th June 2024 And finally there's also an universal #webscraping and #crawling library for #JavaScript / #NodeJS, similar to @ScrapyProject for #Python that was around for years, but also working with headless Chrome and Puppeteer. Better late than never https://github.com/apifytech/apify-js… That moment when @Microsoft starts using your open-source web scraping libraries... https://github.com/microsoft/accessibility-insights-service… #webscraping #opensource Today we're launching Crawlee on Product Hunt Please show us some love and support, to encourage the team who spent years building this open-source library! Learn more about Crawlee in the thread 1/5 https://producthunt.com/posts/crawlee We've just launched Crawlee on @ycombinator's Launch YC! Crawlee is an open-source Node.js library for developing web scrapers and crawlers — an essential tool to acquire data for fine-tuning LLMs and RAG. 11K stars on GitHub and counting Our price watcher, done together with @topmonks @keboola and @BizzTreat for the biggest three Czech e-shops and their Black Friday discounts made it into @ForbesCesko magazine. 
And we're in good company opposite delicious @DiplomaticoRum https://blog.apify.com/black-friday-in-czechia-magical-prices-one-year-on-4ca9b25d2256… Excited to sponsor @MariyaSha888 in the new #Python Simplified Code Jam - an initiative that incentivizes experienced and #beginnercoders alike to build exciting stuff together Check out Mariya's channel and register if you're up for a challenge: https://youtube.com/watch?v=tRlEkCLQ-fk… This is how we hacked @WebExpo, Central Europe's largest conference about tech and web Using #HTTP status codes we redirected visitors of their opening rooftop party to our office next door, where they could enjoy our #API beer #guerrilla #marketing #webexpo It's been almost 5 years since we first launched on @ProductHunt as Apifier. Since then, we've grown from two guys in a small house in Mountain View into a company of 30 amazing people and completely reinvented all of our products. Thanks for your support https://producthunt.com/posts/apify-1 Need to extract vital information locked away in PDF documents for your AI projects? Here's a solution! Apify's PDF Text Extractor + @LangChainAI! Read the tutorial below to find out how to combine them for QA from PDF docs We just released our new #Google search scraper. Now you can get your or your competitions's #SEO or paid rankings in search results in just a few seconds. Check out the full blog post. Another important court ruling in favor of web scraping: Scraping a public website without the approval of the website owner isn't a violation of the Computer Fraud and Abuse Act (via @arstechnica) https://arstechnica.com/tech-policy/2019/09/web-scraping-doesnt-violate-anti-hacking-law-appeals-court-rules/… Packed house! 150+ attendees at yesterday's Prague Fall LLM Meetup in our Lucerna office! Huge thx to co-organizers @weights_biases, @kaya_vc & speakers @KocmiTom, @jiri_moravcik, @hansramsl, Tomáš Mikolov, @PetrBrzek. 
Stay tuned for more LLM- and GenAI-related events Crawlee has surpassed 10,000 stars on GitHub! We are thrilled to see it helping so many developers simplify their web scraping and web crawling workflows And if you haven't already, give Crawlee a try (and maybe even a star) at the link below https://apify.it/40PnMms 1/ Today, we're launching a major innovation—developers can now charge for the usage of their actors on Apify. This gives a clear incentive to keep building and improving software that brings value to the community. We've joined the fight against #COVID19 by turning official pages with statistics into #APIs that can be used by other apps. Check out the latest data from multiple countries. Want to see data from pages without APIs not yet listed? Tell us! #coronavirus https://bit.ly/2x1qYBG 30 girls from @czechitas party with Apify nerds at our #Lucerna #rooftop office Some really wild conversations about tech and software Apify ChatGPT Plugin is here! The new plugin is a web browser for ChatGPT on steroids—it can scrape Google Search results, crawl websites and query web pages found, and browse individual web pages. And yes, it's free. Apify is now integrated to @FlowiseAI You can now use Apify with the open-source UI visual tool to build your customized #LLM flow using @LangChainAI, written in Node TypeScript or JavaScript Learn more about the integration https://apify.it/3sQwHa7 U.S. Appeals Court preliminarily reaffirms web scraping is not hacking.


How not to get blocked when scraping with ChatGPT

We've shown you how to perform a simple web scraping task with ChatGPT, but code isn't enough for any serious, large-scale scraping project.

Extracting data from modern websites poses challenges beyond dynamic content: CAPTCHAs, IP blocks and bans, browser fingerprints, and more.

To prevent your scraper from getting blocked, you need infrastructure that provides you with things like smart proxy management and browser fingerprint generation.
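Playwright itself exposes the hook that such infrastructure plugs into: proxy settings passed at browser launch. Here's a sketch with placeholder credentials (proxy.example.com, user, and pass are made-up values, not a real endpoint):

```python
# Hypothetical proxy settings -- substitute your provider's details.
proxy_config = {
    "server": "http://proxy.example.com:8000",
    "username": "user",
    "password": "pass",
}

# In the scraper from earlier, this would be passed at launch time:
# browser = await p.chromium.launch(headless=False, proxy=proxy_config)
print(sorted(proxy_config))  # → ['password', 'server', 'username']
```

A platform handles the harder part on top of this: rotating those proxies, retiring blocked IPs, and generating realistic browser fingerprints.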

The best solution? Deploy your code to a cloud platform like Apify.

The Apify platform was built to serve large-scale and high-performance web scraping and automation needs. It provides easy access to compute instances, storage, proxies, scheduling, webhooks, integrations, and more. Check us out.


Tips for scraping with ChatGPT

  1. Use DevTools to inspect the target website.
  2. Pick a programming language and determine the right tools for the scraping task.
  3. Be as specific as possible and always describe the schema (div tags, attribute names, etc.).
  4. Always run the code yourself.
  5. Upload screenshots to ChatGPT to help you fix errors.

Conclusion

Although ChatGPT can't do web scraping directly, we've demonstrated how you can use it to help you scrape web content, even from dynamically loaded pages. By using Playwright for Python with ChatGPT, we successfully extracted tweets from X. This goes to show that you don't need to be a web scraping master or AI engineer to use ChatGPT for web scraping.

Frequently asked questions

Can ChatGPT read websites?

ChatGPT can access and read some websites through its browsing mode. However, when browsing, it's not clear whether it's running searches on the Bing index or actually visiting pages. Regardless, there are some websites ChatGPT can't read, such as paywalled content, dynamic content, and private sites.

Can ChatGPT do web scraping?

No. ChatGPT can't do web scraping directly. But it can provide guidance, code examples, and explanations on how to use frameworks and libraries for scraping tasks. For actual web scraping, you would need to use a programming environment that can execute code and interact with websites.

How do I get ChatGPT to scrape websites?

ChatGPT is a language model designed to process and generate text based on the input it receives. It doesn't have the capability to interact with external websites or execute code. However, GPT Scraper and Extended GPT Scraper are two Apify tools you can use to extract content from a website and then pass that content to the OpenAI API.
