We’re Apify, a full-stack web scraping and browser automation platform. A big part of what we do is getting better data for AI. But this time, we'll show you how to use AI to get data.
I'll level with you.
If you know nothing at all about web scraping, this tutorial won't help you much.
I'll show you how to scrape X (formerly known as Twitter), which is a JavaScript-rendered website. That means content is not present in the HTML source code but is loaded or generated after the initial page load.
#Maximize the window, wait 10 seconds, and locate all the elements with the following XPath: "div" tag, attribute name "lang".
#Print the text inside these elements.
#Scroll down until it has scraped and printed at least 20 different tweets.
Prompt explanation:
Maximizing the window: This helps to load more content initially and reduces the need for excessive scrolling. It also helps ensure that responsive websites display their content in the desktop view, which often includes more information than mobile views.
Waiting 10 seconds: Waiting allows the webpage to load fully to make sure all elements, including dynamically loaded content like tweets, are completely rendered before the script tries to interact with them. This helps avoid issues where the script might try to access elements that aren't yet available.
XPath: The XPath expression //div[@lang] targets all div elements that have a lang attribute so we can extract relevant content (tweets) only.
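To see how that predicate filters out non-tweet markup, here's a minimal, self-contained sketch using Python's standard-library ElementTree. The HTML snippet is invented for illustration; on the real page, Playwright evaluates the same kind of expression against the rendered DOM:

```python
import xml.etree.ElementTree as ET

# A simplified stand-in for a rendered page: only the divs with a
# `lang` attribute hold tweet text; the other div is page chrome.
page = ET.fromstring(
    '<body>'
    '<div>navigation</div>'
    '<div lang="en">First tweet</div>'
    '<div lang="cs">Second tweet</div>'
    '</body>'
)

# ElementTree's limited XPath support is enough for this predicate:
# select every div, at any depth, that carries a lang attribute.
tweets = [div.text for div in page.findall('.//div[@lang]')]
print(tweets)  # ['First tweet', 'Second tweet']
```

The navigation div is skipped because it has no `lang` attribute, which is exactly why this expression isolates tweets on X.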
I already had Playwright installed on my system. If you don't have it, you'll need to install the package (note the lowercase name) and its browser binaries first:
pip install playwright
playwright install
After telling me to set up Playwright and follow the same steps I gave it, ChatGPT provided me with this code:
import asyncio
from playwright.async_api import async_playwright

async def scrape_x_com_apify():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        context = await browser.new_context()
        page = await context.new_page()

        # Maximize the window
        await page.set_viewport_size({"width": 1920, "height": 1080})

        # Navigate to the URL
        await page.goto('https://x.com/apify')

        # Wait for the page to load
        await page.wait_for_timeout(10000)

        tweet_texts = set()
        while len(tweet_texts) < 20:
            # Locate all elements with the given XPath
            tweet_elements = await page.query_selector_all('//div[@lang]')

            # Extract and print text from each element
            for element in tweet_elements:
                text = await element.inner_text()
                tweet_texts.add(text)
                print(text)
                if len(tweet_texts) >= 20:
                    break

            if len(tweet_texts) < 20:
                # Scroll down to load more tweets
                await page.evaluate('window.scrollBy(0, window.innerHeight)')
                await page.wait_for_timeout(2000)

        await browser.close()

# Run the function
asyncio.run(scrape_x_com_apify())
Did it work? Let's find out.
Step 4. Run the code in your terminal
Copy-paste the GPT-generated script into your code editor of choice. Here it is in a Jupyter Notebook in VS Code:
It started well: it opened a Chrome browser and navigated to the Apify Twitter page:
However, it failed to print the data. Instead, I got this runtime error:
Which brings us to the next step.
Step 5. Get ChatGPT to debug the code
ChatGPT is pretty good at fixing code. The best part is that you don't even need to spend time typing: taking a screenshot and uploading it usually does the trick.
So, I took a screenshot of the error I received and uploaded it to ChatGPT.
It recognized the problem and responded with a solution:
The error message indicates that asyncio.run() cannot be called from a running event loop. This typically occurs when running asynchronous code in environments like Jupyter notebooks or certain IDEs.
To resolve this, you can use a different approach to run your asyncio code within these environments. Instead of using asyncio.run(), use the event loop directly.
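What "use the event loop directly" means can be sketched in a few lines. The coroutine here is a placeholder, not the real scraper:

```python
import asyncio

async def scrape():
    # Placeholder for the real scraping coroutine.
    await asyncio.sleep(0)
    return "done"

# Instead of asyncio.run(), create and drive an event loop explicitly.
loop = asyncio.new_event_loop()
try:
    result = loop.run_until_complete(scrape())
finally:
    loop.close()

print(result)  # done
```

Note that this works in a plain script, but inside a Jupyter notebook an event loop is already running in the main thread, so `run_until_complete()` on a second loop can still fail there (in notebooks, a top-level `await scrape()` is usually the simplest way out). That caveat matters for what happened next.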
It gave me some adapted code to run. So, I tried again.
AI rarely gets things right the first or even the second time. So, if it doesn't succeed, don't give up.
I ran the adapted code and ended up with a similar error:
Once again, I showed ChatGPT the screenshot and got this response:
To avoid the error that occurs because the event loop is already running, we can use an alternative approach by creating a new event loop or using nest_asyncio to allow nested use of asyncio.run. The nest_asyncio library can patch the running event loop to allow it to be re-entered.
First, install nest_asyncio using pip if you haven't already:
pip install nest_asyncio
Then modify the script as follows:
import asyncio
import nest_asyncio
from playwright.async_api import async_playwright

nest_asyncio.apply()

async def scrape_x_com_apify():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        context = await browser.new_context()
        page = await context.new_page()

        # Maximize the window
        await page.set_viewport_size({"width": 1920, "height": 1080})

        # Navigate to the URL
        await page.goto('https://x.com/apify')

        # Wait for the page to load
        await page.wait_for_timeout(10000)

        tweet_texts = set()
        while len(tweet_texts) < 20:
            # Locate all elements with the given XPath
            tweet_elements = await page.query_selector_all('//div[@lang]')

            # Extract and print text from each element
            for element in tweet_elements:
                text = await element.inner_text()
                tweet_texts.add(text)
                print(text)
                if len(tweet_texts) >= 20:
                    break

            if len(tweet_texts) < 20:
                # Scroll down to load more tweets
                await page.evaluate('window.scrollBy(0, window.innerHeight)')
                await page.wait_for_timeout(2000)

        await browser.close()

# Run the function using the patched event loop
asyncio.run(scrape_x_com_apify())
When I ran the modified script, it proceeded to open the Chrome browser and navigate to the web page...
And success! It printed the data in my terminal:
Output (truncated):
Scrollable Content
12,500+ Stars on GitHub
We are giving away 3 Crawlee shirts to celebrate this milestone
Rules:
- Follow @apify
- Retweet this tweet
- Star us at: https://apify.it/3UZ22lv
- Reply with your GitHub username
Deadline: 7th June 2024
And finally there's also an universal #webscraping and #crawling library for #JavaScript / #NodeJS, similar to @ScrapyProject for #Python that was around for years, but also working with headless Chrome and Puppeteer. Better late than never https://github.com/apifytech/apify-js…
That moment when @Microsoft starts using your open-source web scraping libraries... https://github.com/microsoft/accessibility-insights-service… #webscraping #opensource
Today we're launching Crawlee on Product Hunt
Please show us some love and support, to encourage the team who spent years building this open-source library!
Learn more about Crawlee in the thread
1/5
https://producthunt.com/posts/crawlee
We've just launched Crawlee on @ycombinator's Launch YC!
Crawlee is an open-source Node.js library for developing web scrapers and crawlers — an essential tool to acquire data for fine-tuning LLMs and RAG. 11K stars on GitHub and counting
Our price watcher, done together with @topmonks @keboola and @BizzTreat for the biggest three Czech e-shops and their Black Friday discounts made it into @ForbesCesko magazine. And we're in good company opposite delicious @DiplomaticoRum https://blog.apify.com/black-friday-in-czechia-magical-prices-one-year-on-4ca9b25d2256…
Excited to sponsor @MariyaSha888 in the new #Python Simplified Code Jam - an initiative that incentivizes experienced and #beginnercoders alike to build exciting stuff together
Check out Mariya's channel and register if you're up for a challenge: https://youtube.com/watch?v=tRlEkCLQ-fk…
This is how we hacked @WebExpo, Central Europe's largest conference about tech and web Using #HTTP status codes we redirected visitors of their opening rooftop party to our office next door, where they could enjoy our #API beer #guerrilla #marketing #webexpo
It's been almost 5 years since we first launched on @ProductHunt as Apifier. Since then, we've grown from two guys in a small house in Mountain View into a company of 30 amazing people and completely reinvented all of our products.
Thanks for your support https://producthunt.com/posts/apify-1
Need to extract vital information locked away in PDF documents for your AI projects? Here's a solution! Apify's PDF Text Extractor + @LangChainAI! Read the tutorial below to find out how to combine them for QA from PDF docs
We just released our new #Google search scraper. Now you can get your or your competitions's #SEO or paid rankings in search results in just a few seconds. Check out the full blog post.
Another important court ruling in favor of web scraping: Scraping a public website without the approval of the website owner isn't a violation of the Computer Fraud and Abuse Act (via @arstechnica) https://arstechnica.com/tech-policy/2019/09/web-scraping-doesnt-violate-anti-hacking-law-appeals-court-rules/…
Packed house! 150+ attendees at yesterday's Prague Fall LLM Meetup in our Lucerna office!
Huge thx to co-organizers @weights_biases, @kaya_vc & speakers @KocmiTom, @jiri_moravcik, @hansramsl, Tomáš Mikolov, @PetrBrzek. Stay tuned for more LLM- and GenAI-related events
Crawlee has surpassed 10,000 stars on GitHub! We are thrilled to see it helping so many developers simplify their web scraping and web crawling workflows And if you haven't already, give Crawlee a try (and maybe even a star) at the link below https://apify.it/40PnMms
1/ Today, we're launching a major innovation—developers can now charge for the usage of their actors on Apify. This gives a clear incentive to keep building and improving software that brings value to the community.
We've joined the fight against #COVID19 by turning official pages with statistics into #APIs that can be used by other apps. Check out the latest data from multiple countries. Want to see data from pages without APIs not yet listed? Tell us! #coronavirus https://bit.ly/2x1qYBG
30 girls from @czechitas party with Apify nerds at our #Lucerna #rooftop office Some really wild conversations about tech and software
Apify ChatGPT Plugin is here!
The new plugin is a web browser for ChatGPT on steroids—it can scrape Google Search results, crawl websites and query web pages found, and browse individual web pages. And yes, it's free.
Apify is now integrated to @FlowiseAI You can now use Apify with the open-source UI visual tool to build your customized #LLM flow using @LangChainAI, written in Node TypeScript or JavaScript Learn more about the integration https://apify.it/3sQwHa7
U.S. Appeals Court preliminarily reaffirms web scraping is not hacking.
I've shown you how to perform a simple web scraping task with ChatGPT, but code isn't enough for any serious, large-scale scraping project.
Extracting data from modern websites poses challenges beyond dynamic content: CAPTCHAs, IP blocks and bans, browser fingerprints, and more.
To overcome these problems, you need infrastructure that provides you with things like smart proxy management and browser fingerprint generation.
The best solution? Deploy your code to a cloud platform like Apify.
The Apify platform was built to serve large-scale and high-performance web scraping and automation needs. It provides easy access to compute instances, storage, proxies, scheduling, webhooks, integrations, and more.
Extended GPT Scraper is a web scraping tool that combines web content extraction with OpenAI's GPT language model. It scrapes websites, converts the content to markdown, and then uses GPT to process or transform the scraped text based on user-provided instructions. This allows for automated summarization, content analysis, or other text manipulations of web content at scale.
Pick a programming language and determine the right tools for the scraping task.
Be as specific as possible and always describe the schema (div tags, attribute names, etc.).
Always run the code yourself.
Upload screenshots to ChatGPT to help you fix errors.
Frequently asked questions
Can ChatGPT read a website?
ChatGPT can access and read some websites through its browsing mode. However, when browsing, it's not clear whether it's running searches on the Bing index or actually visiting pages. Regardless, there are some websites ChatGPT can't read, such as paywalled content, dynamic content, and private sites.
Can ChatGPT do web scraping?
No. ChatGPT can't do web scraping directly. But it can provide guidance, code examples, and explanations on how to use frameworks and libraries for scraping tasks. For actual web scraping, you would need to use a programming environment that can execute code and interact with websites.
How do I get ChatGPT to scrape websites?
ChatGPT is a language model designed to process and generate text based on the input it receives. It doesn't have the capability to interact with external websites or execute code. However, GPT Scraper and Extended GPT Scraper are two Apify tools you can use to extract content from a website and then pass that content to the OpenAI API.
I used to write books. Then I took an arrow in the knee. Now I'm a technical content marketer, crafting tutorials for developers and conversion-focused content for SaaS.