We’re Apify, a full-stack web scraping and browser automation platform. A big part of what we do is getting better data for AI. But this time, we'll show you how to use AI to get data.
I'll level with you.
If you know nothing at all about web scraping, this tutorial won't help you much.
ChatGPT can't do all the hard work for you.
However, if you know web scraping basics, let me assure you that this guide will teach you how to use AI as a web scraping assistant and save a ton of time coding.
I'll show you how to scrape X (formerly known as Twitter), which is a JavaScript-rendered website. That means content is not present in the HTML source code but is loaded or generated after the initial page load.
So, not the easiest of sites to scrape.
But believe me, AI will make it a lot easier.
What you'll need
To follow this, you don't even need to know a programming language. But you will need:
- Any code editor (I'll be using a Jupyter Notebook inside VSCode).
- Python downloaded and installed on your system.
- A ChatGPT account (free or paid).
How to scrape a website with ChatGPT
Step 1. Choose the website and scraping tools
For this tutorial, I'm going to scrape X posts (you probably still call them tweets) from the Apify X account: https://x.com/apify.
I happen to know that X is a dynamic website.
I also happen to know that to extract data from such a site, I need a tool that can execute JavaScript and interact with web pages.
My favorite framework for that is Playwright for Python, so that's the tool I'll go for here.
Step 2. Inspect the website and identify the right tags
You probably know that web scraping involves identifying the appropriate tags when you inspect the web page:
So, let's do some tag-hunting for Apify tweets.
Right-click on a post to inspect the elements you need to scrape:
The div element contains the lang attribute, which is often used on tweet text to denote the language of the tweet.
I'll choose this so the web scraping script focuses on tweets rather than other types of content on the page.
Step 3. Prompt ChatGPT
Now you know the URL, tool, and tags you want, it's time to prompt your AI web scraping assistant.
For this particular case, here's the prompt I gave ChatGPT:
#Maximize the window, wait 10 seconds, and locate all the elements with the following XPath: "div" tag, attribute name "lang".
#Print the text inside these elements.
#Scroll down until it has scraped and printed at least 20 different tweets.
Prompt explanation:
- Maximizing the window: This helps to load more content initially and reduces the need for excessive scrolling. It also helps ensure that responsive websites display their content in the desktop view, which often includes more information than mobile views.
- Waiting 10 seconds: Waiting allows the webpage to load fully to make sure all elements, including dynamically loaded content like tweets, are completely rendered before the script tries to interact with them. This helps avoid issues where the script might try to access elements that aren't yet available.
- XPath: The XPath expression //div[@lang] targets all div elements that have a lang attribute so we can extract relevant content (tweets) only.
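To see what an attribute-presence XPath like this selects, here is a minimal sketch that uses only Python's standard library. The markup is a made-up, simplified stand-in for a page, not X's real HTML, and ElementTree needs the relative form .//div[@lang] rather than //div[@lang]:

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup (NOT the real X page structure).
SAMPLE_MARKUP = """
<main>
  <div class="sidebar">Who to follow</div>
  <article>
    <div lang="en">First example post</div>
  </article>
  <article>
    <div lang="en">Second example post</div>
    <div class="stats">12 likes</div>
  </article>
</main>
""".strip()

root = ET.fromstring(SAMPLE_MARKUP)

# [@lang] keeps only the div elements that carry a lang attribute,
# whatever its value; the sidebar and stats divs are skipped.
tweet_texts = [div.text for div in root.findall(".//div[@lang]")]
print(tweet_texts)
```

Only the two div elements with a lang attribute are returned, which is the same filtering the Playwright script relies on in the browser.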
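ChatGPT first told me to set up Playwright (the package alone isn't enough; the playwright install command downloads the browser binaries it drives):
pip install playwright
playwright install
Then, following the steps I gave it, ChatGPT provided me with this code: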
import asyncio
from playwright.async_api import async_playwright

async def scrape_x_com_apify():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        context = await browser.new_context()
        page = await context.new_page()

        # Maximize the window
        await page.set_viewport_size({"width": 1920, "height": 1080})

        # Navigate to the URL
        await page.goto('https://x.com/apify')

        # Wait for the page to load
        await page.wait_for_timeout(10000)

        tweet_texts = set()
        while len(tweet_texts) < 20:
            # Locate all elements with the given XPath
            tweet_elements = await page.query_selector_all('//div[@lang]')

            # Extract and print text from each element
            for element in tweet_elements:
                text = await element.inner_text()
                tweet_texts.add(text)
                print(text)
                if len(tweet_texts) >= 20:
                    break

            if len(tweet_texts) < 20:
                # Scroll down to load more tweets
                await page.evaluate('window.scrollBy(0, window.innerHeight)')
                await page.wait_for_timeout(2000)

        await browser.close()

# Run the function
asyncio.run(scrape_x_com_apify())
Did it work? Let's find out.
Step 4. Run the code in your terminal
Copy-paste the GPT-generated script in your code editor of choice. Here it is in Jupyter Notebook on VSCode:
It started well: it opened a Chromium browser and navigated to the Apify X page:
However, it failed to print the data. Instead, I got this runtime error: asyncio.run() cannot be called from a running event loop.
Which brings us to the next step.
Step 5. Get ChatGPT to debug the code
ChatGPT is pretty good at rectifying code. The best thing is you don't even need to spend time typing. Just taking a screenshot and uploading it usually does the trick.
So, I took a screenshot of the error I received and uploaded it into ChatGPT.
It recognized the problem and responded with a solution:
The error message indicates that asyncio.run() cannot be called from a running event loop. This typically occurs when running asynchronous code in environments like Jupyter notebooks or certain IDEs. To resolve this, you can use a different approach to run your asyncio code within these environments. Instead of using asyncio.run(), use the event loop directly.
It gave me some adapted code to run. So, I tried again.
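The error itself is easy to reproduce without Playwright. A Jupyter notebook already runs an event loop, so calling asyncio.run() from a cell is much like calling it from inside a coroutine. This minimal sketch (inner and outer are illustrative names only) triggers the same kind of RuntimeError in a plain Python script:

```python
import asyncio

async def inner():
    return "done"

async def outer():
    # We are now inside a running event loop, like code in a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # asyncio.run() refuses to start a second loop here
    except RuntimeError as exc:
        coro.close()  # tidy up the coroutine that never got to run
        return str(exc)
    return "no error"

message = asyncio.run(outer())
print(message)
```

In recent versions of Jupyter, a notebook cell can usually await a coroutine directly (for example, await scrape_x_com_apify() instead of asyncio.run(...)), though whether that works depends on the notebook setup.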
Step 6. Iterate
AI rarely gets things right the first or even the second time. So, if it doesn't succeed, don't give up.
I ran the adapted code and ended up with a similar error:
Once again, I showed ChatGPT the screenshot and got this response:
To avoid the error that occurs because the event loop is already running, we can use an alternative approach by creating a new event loop or using nest_asyncio to allow nested use of asyncio.run. The nest_asyncio library can patch the running event loop to allow it to be re-entered. First, install nest_asyncio using pip if you haven't already:
pip install nest_asyncio
Then modify the script as follows:
import asyncio
import nest_asyncio
from playwright.async_api import async_playwright

nest_asyncio.apply()

async def scrape_x_com_apify():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        context = await browser.new_context()
        page = await context.new_page()

        # Maximize the window
        await page.set_viewport_size({"width": 1920, "height": 1080})

        # Navigate to the URL
        await page.goto('https://x.com/apify')

        # Wait for the page to load
        await page.wait_for_timeout(10000)

        tweet_texts = set()
        while len(tweet_texts) < 20:
            # Locate all elements with the given XPath
            tweet_elements = await page.query_selector_all('//div[@lang]')

            # Extract and print text from each element
            for element in tweet_elements:
                text = await element.inner_text()
                tweet_texts.add(text)
                print(text)
                if len(tweet_texts) >= 20:
                    break

            if len(tweet_texts) < 20:
                # Scroll down to load more tweets
                await page.evaluate('window.scrollBy(0, window.innerHeight)')
                await page.wait_for_timeout(2000)

        await browser.close()

# Run the function using the event loop
asyncio.run(scrape_x_com_apify())
When I ran the modified script, it proceeded to open the Chromium browser and navigate to the web page...
And success! It printed the tweet text in my terminal.
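One thing worth knowing about the generated loop: it keeps scrolling until it has 20 unique texts, so on a page with fewer posts it would never stop, and it prints a text again every time the same element is found after a scroll. Below is a hedged sketch of a more defensive collection step, written as plain Python so it runs without a browser. The names collect_unique and max_idle_rounds and the simulated batches are illustrative assumptions, not part of ChatGPT's output:

```python
def collect_unique(batches, target=20, max_idle_rounds=3):
    """Collect unique texts in first-seen order.

    Stops when `target` texts are collected, or when `max_idle_rounds`
    consecutive batches contribute nothing new.
    """
    seen = set()
    ordered = []
    idle_rounds = 0
    for batch in batches:
        added = 0
        for text in batch:
            if text not in seen:
                seen.add(text)
                ordered.append(text)
                added += 1
                if len(ordered) >= target:
                    return ordered
        # Reset the counter on progress, otherwise count an idle round.
        idle_rounds = 0 if added else idle_rounds + 1
        if idle_rounds >= max_idle_rounds:
            break
    return ordered

# Each inner list stands in for the texts found after one scroll.
simulated_scrolls = [["a", "b"], ["b", "c"], ["c"], ["c"], ["c"], ["d"]]
result = collect_unique(simulated_scrolls, target=20)
print(result)
```

In the real script, each batch would be the list of inner_text() values gathered after one scroll, and printing only newly added texts would avoid the duplicates.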
How to not get blocked
I've shown you how to perform a simple web scraping task with ChatGPT, but code isn't enough for any serious, large-scale scraping project.
Extracting data from modern websites poses challenges beyond dynamic content: CAPTCHAs, IP blocks and bans, browser fingerprints, and more.
To overcome these problems, you need infrastructure that provides you with things like smart proxy management and browser fingerprint generation.
The best solution? Deploy your code to a cloud platform like Apify.
The Apify platform was built to serve large-scale and high-performance web scraping and automation needs. It provides easy access to compute instances, storage, proxies, scheduling, webhooks, integrations, and more.
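To give a rough idea of what the simplest form of proxy management involves, here is a sketch of round-robin proxy rotation. The proxy URLs are placeholders, and proxy_settings is a hypothetical helper; the dictionaries it yields use the server, username, and password keys that Playwright's launch() accepts in its proxy option:

```python
from itertools import cycle

# Placeholder endpoints; replace with real proxies from your provider.
PROXY_SERVERS = [
    "http://proxy-1.example.com:8000",
    "http://proxy-2.example.com:8000",
]

def proxy_settings(servers, username=None, password=None):
    """Yield proxy dictionaries forever, cycling through `servers`."""
    for server in cycle(servers):
        settings = {"server": server}
        if username and password:
            settings["username"] = username
            settings["password"] = password
        yield settings

rotation = proxy_settings(PROXY_SERVERS)
first, second, third = next(rotation), next(rotation), next(rotation)
print(first, second, third)
```

Each browser launch could then take the next entry, for example await p.chromium.launch(proxy=next(rotation)). Rotation alone doesn't handle CAPTCHAs or browser fingerprints, which is where managed infrastructure comes in.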
Another way to use ChatGPT for web scraping
Extended GPT Scraper is a web scraping tool that combines web content extraction with OpenAI's GPT language model. It scrapes websites, converts the content to markdown, and then uses GPT to process or transform the scraped text based on user-provided instructions. This allows for automated summarization, content analysis, or other text manipulations of web content at scale.
Summary: Tips for scraping with ChatGPT
- Use DevTools to inspect the target website.
- Pick a programming language and determine the right tools for the scraping task.
- Be as specific as possible and always describe the schema (div tags, attribute names, etc.).
- Always run the code yourself.
- Upload screenshots to ChatGPT to help you fix errors.
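If you write prompts like this often, the tips above can be captured in a small template. This is only an illustrative sketch: build_scraping_prompt is a hypothetical helper and its wording is an assumption, not the exact prompt used earlier in this tutorial.

```python
def build_scraping_prompt(url, tool, xpath, min_items, wait_seconds=10):
    """Compose a specific scraping prompt: target URL, tool, selector, stop condition."""
    lines = [
        f"Write a Python script that uses {tool} to scrape {url}.",
        f"Wait {wait_seconds} seconds after loading the page, then locate "
        f"all elements matching the XPath {xpath}.",
        "Print the text inside these elements.",
        f"Scroll down until at least {min_items} different items have been "
        "scraped and printed.",
    ]
    return "\n".join(lines)

prompt = build_scraping_prompt(
    url="https://x.com/apify",
    tool="Playwright",
    xpath="//div[@lang]",
    min_items=20,
)
print(prompt)
```

Keeping the URL, tool, selector, and stop condition as explicit parameters makes it harder to forget one of them when you move to a different site.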
Frequently asked questions
Can ChatGPT read a website?
ChatGPT can access and read some websites through its browsing mode. However, when browsing, it's not clear whether it's running searches on the Bing index or actually visiting pages. Regardless, there are some websites ChatGPT can't read, such as paywalled content, dynamic content, and private sites.
Can ChatGPT do web scraping?
No. ChatGPT can't do web scraping directly. But it can provide guidance, code examples, and explanations on how to use frameworks and libraries for scraping tasks. For actual web scraping, you would need to use a programming environment that can execute code and interact with websites.
How do I get ChatGPT to scrape websites?
ChatGPT is a language model designed to process and generate text based on the input it receives. It doesn't have the capability to interact with external websites or execute code. However, GPT Scraper and Extended GPT Scraper are two Apify tools you can use to extract content from a website and then pass that content to the OpenAI API.