If you've been reading up on web scraping, you've probably come across the terms "dynamic page" or "dynamic loading." To a seasoned web-automation engineer, these terms are part of daily life, but to someone fairly new to scraping, they can sound confusing and a bit daunting. By the end of this article, any confusion you have about them should be cleared up.
What makes a page "dynamic"? 🤔
Some websites deliver all of a page's content in the initial response, while others return only part of it at first and then render more "dynamically" (by updating the DOM) in response to actions you take (scrolling, clicking, hovering, etc.). The latter is what we call a dynamic site, and once you know what that entails, it's quite easy to spot one!
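Here are some defining factors of a dynamic site. If a page meets any of these criteria, it's most likely dynamic:
It lazy-loads content: images, videos, or posts appear only once you scroll to them.
It fetches more data from the backend after the initial load and renders it onto the page.
It is built with a front-end JavaScript framework or library, such as React.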
Here is a GIF of me scrolling through my Twitter feed:
Take note of these factors:
When we reach the bottom of the page, the scroll bar jumps back up, indicating that more content has been fetched from the backend and rendered onto the page dynamically after the page's initial load.
As we scroll, images and videos for posts are not visible for a split second (they are loading).
Going a step further, we can open the browser's DevTools and look at the DOM (the browser's live representation of the page's HTML) to watch these elements being rendered as we scroll, confirming our hunch that the page is in fact dynamic:
Twitter is most definitely a dynamically loaded web app, as it meets all of the criteria defined in the previous section. Not only does it lazy-load content and dynamically fetch page data to be rendered, but it is also written in React (one of the most popular front-end JavaScript libraries).
Still not sure whether or not it's dynamic?
To reiterate, a dynamic web page is simply a page where the DOM gets updated in some way after the initial load. In almost every case, a web app uses JavaScript to render that new content. This means we can disable JavaScript, reload the page, and see what is left: whatever still shows up was delivered on the first load. Tools like the "Quick JavaScript Switcher" Chrome extension make this a one-click check.
We want to scrape this article on the Apify blog. Let's go ahead and utilize our new JavaScript Switcher tool to discern whether it's a static or dynamic page.
With JavaScript Enabled
With JavaScript Disabled
Because JavaScript has been disabled, the theme of the blog site switches back to light mode instead of my preferred dark mode; however, that is insignificant. All of the article's data and content on the page remain the same between our two GIFs, which means that all of the content we want to scrape is provided on the first load of the page. This page is NOT dynamic.
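The same check can be approximated in code. Here is a minimal sketch, assuming Node.js 18+ (for the built-in `fetch`): grab the page's raw HTML without running any JavaScript, then see whether the text we want to scrape is already in it. The function names, the sample HTML strings, and the expected snippets are all made up for illustration, and `fetchRawHtml` is defined but not called so the example runs without network access.

```javascript
// Returns the snippets that do NOT appear in the raw HTML string.
// An empty result suggests the data is present on the first load.
function findMissingSnippets(rawHtml, expectedSnippets) {
    const haystack = rawHtml.toLowerCase();
    return expectedSnippets.filter(
        (snippet) => !haystack.includes(snippet.toLowerCase()),
    );
}

// Fetches a page the way a plain HTTP client would: no JavaScript is executed.
// Assumes Node.js 18+ for the global fetch. Not called in this example.
async function fetchRawHtml(url) {
    const response = await fetch(url);
    if (!response.ok) {
        throw new Error(`Request failed with status ${response.status}`);
    }
    return response.text();
}

// Hypothetical inputs, inlined so the example runs offline.
const staticHtml =
    '<html><body><article><h1>Scraping 101</h1><p>Intro text</p></article></body></html>';
const dynamicHtml =
    '<html><body><div id="root"></div><script src="/app.js"></script></body></html>';
const expected = ['Scraping 101', 'Intro text'];

console.log(findMissingSnippets(staticHtml, expected)); // []
console.log(findMissingSnippets(dynamicHtml, expected)); // [ 'Scraping 101', 'Intro text' ]
```

Treat the result as a hint rather than proof. A plain substring match can miss text that is split across tags or HTML-encoded, and some sites embed their data as JSON inside a script tag, so it's still worth confirming in the browser.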
Why does this matter when building a web scraper?
Whether or not a page is dynamic heavily affects the tools we use and the approaches we take when building a scraper.
When a page is non-dynamic, we can retrieve all of its content with a single GET request and parse the returned HTML, which allows us to use tools like CheerioScraper or CheerioCrawler.
The extra complexity comes when the page IS dynamic. Because content (usually containing the data we want to scrape) loads dynamically, we must use a tool that automates a browser, such as PuppeteerScraper or PlaywrightCrawler. Tools like Puppeteer and Playwright drive headless browsers, which execute the page's JavaScript (essential for sites built with libraries like React) and let us programmatically take actions on a page (just like a real human does!).
This is not necessarily a problem; however, the downside of these tools is that they are significantly slower and more resource-intensive than something like Cheerio, because a browser has to download the page's resources, execute its JavaScript, and render the result before we can extract anything. Because of this, when dealing with a dynamic site, it's a good idea to first explore other avenues, such as finding the web app's API (for example, by watching the requests the page makes) and calling it directly instead of automating a headless browser and scraping the HTML for data.
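To make that order of preference concrete, here is a small sketch of the decision described above. The `chooseApproach` helper, its two input flags, and the returned labels are all hypothetical names for illustration; in practice you would answer both questions by inspecting the page yourself.

```javascript
// Hypothetical helper: picks a scraping approach from two observations
// about the target page. The returned labels are illustrative, not tool names.
function chooseApproach({ contentInInitialHtml, hasUsableApi }) {
    if (contentInInitialHtml) {
        // Non-dynamic page: a single GET request plus an HTML parser is enough.
        return 'http-request-and-html-parser';
    }
    if (hasUsableApi) {
        // Dynamic page, but the data comes from an API we can call directly.
        return 'call-the-api';
    }
    // Dynamic page with no usable API: fall back to a headless browser.
    return 'headless-browser';
}

console.log(chooseApproach({ contentInInitialHtml: true, hasUsableApi: false }));
// http-request-and-html-parser
console.log(chooseApproach({ contentInInitialHtml: false, hasUsableApi: true }));
// call-the-api
console.log(chooseApproach({ contentInInitialHtml: false, hasUsableApi: false }));
// headless-browser
```

The headless browser comes last on purpose: it handles almost any page, but it is the slowest of the three options.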
A challenge for you 💪
Hopefully, we've been able to clarify some things about dynamic web pages. To end on a strong note, here is a small challenge.
For each of these links, determine whether or not the page is dynamic: