If you've been reading up on web scraping, you've most likely come across the terms "dynamic page" or "dynamic loading" at least once. To seasoned web-automation engineers, these terms are part of daily life, but to someone fairly new to scraping, they can sound confusing and a bit daunting. By the end of this article, any confusion you have about them should be cleared up.
What makes a page "dynamic"? 🤔
Some websites return all of a page's content in the initial response, while others return only part of it at first and then render more "dynamically" (by updating the DOM) in response to actions you take, such as scrolling, clicking, or hovering. The latter is what we call a dynamic site, and once you know what that entails, it's quite easy to spot one!
Here are some defining factors of a dynamic site. If a page meets any of these criteria, it's most likely dynamic:
- It has lazy-loading for content such as posts/images (depends on the type of lazy-loading being used).
- After certain actions on the page, new "Fetch/XHR" requests appear in the Network tab of Chrome DevTools, fetching page data that is then rendered into the DOM.
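
The second criterion can be illustrated with a rough sketch. Assume the page's network activity has been recorded into a simple list; the `RecordedRequest` shape, the `findLateDataRequests` helper, and the sample data below are all made up for illustration and are not any tool's actual export format. Fetch/XHR requests that start after the initial load has finished are the ones that hint at dynamic content:

```typescript
// Hypothetical shape for one recorded network request (not a real export format).
interface RecordedRequest {
    url: string;
    // Resource type, loosely mirroring the filters in the browser's Network tab.
    type: 'document' | 'script' | 'image' | 'fetch' | 'xhr';
    // Milliseconds elapsed since navigation started.
    startedAtMs: number;
}

// Returns the Fetch/XHR requests that started after the initial page load finished.
function findLateDataRequests(
    requests: RecordedRequest[],
    loadFinishedAtMs: number,
): RecordedRequest[] {
    return requests.filter(
        (req) =>
            (req.type === 'fetch' || req.type === 'xhr') &&
            req.startedAtMs > loadFinishedAtMs,
    );
}

// Made-up sample: the last request fires only after the user scrolls.
const recorded: RecordedRequest[] = [
    { url: 'https://example.com/', type: 'document', startedAtMs: 0 },
    { url: 'https://example.com/app.js', type: 'script', startedAtMs: 120 },
    { url: 'https://example.com/api/posts?page=1', type: 'fetch', startedAtMs: 300 },
    { url: 'https://example.com/api/posts?page=2', type: 'fetch', startedAtMs: 5400 },
];

// Assume the initial load finished 1.5 seconds after navigation started.
const lateRequests = findLateDataRequests(recorded, 1500);
console.log(lateRequests.map((req) => req.url));
```

Only the `page=2` request is reported here, since it is the only data request that starts after the assumed load time. Data requests that fire during the initial load can also point to client-side rendering, so treat this as one signal rather than a definitive test.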
Example of a dynamic page
Let's take a look at
Here is a GIF of me scrolling through my Twitter feed:
Take note of these factors:
- When we reach the bottom of the page, the scroll bar jumps back up, indicating that more content has been fetched from the backend and rendered onto the page dynamically after the page's initial load.
- As we scroll, images and videos for posts are not visible for a split second (they are loading).
Going a step further, we can take a look at the DOM (the HTML of the site) to watch these elements being rendered live as we scroll and confirm our hunch that the page is in fact dynamic:
Still not sure whether or not it's dynamic?
We want to scrape
Why does this matter when building a web scraper?
Whether or not a page is dynamic heavily affects the tools we use and the approaches we take when building a scraper.
When a page is non-dynamic, we can easily retrieve all of its content within one single GET request, which allows us to use tools like CheerioScraper or CheerioCrawler.
When a page is dynamic, however, a single request often doesn't return everything, so we may need tools that automate a headless browser to load the page and trigger the actions that render the remaining content. This is not necessarily a problem; however, the downside of these tools and the methods that come along with them is that they are significantly less performant than something like Cheerio, since automating browser actions simply takes longer. Because of this, when dealing with a dynamic site, it's a good idea to first explore other avenues of scraping, such as sniffing out the web app's API and using that instead of automating a headless browser and scraping the HTML for data.
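
A quick way to tell which case applies is to compare the raw HTML from a single GET request with what you see in the browser. The sketch below assumes you already have that raw HTML as a string; the `findMissingSnippets` helper and the sample markup are hypothetical and only illustrate the idea. If text you can see on the rendered page is missing from the raw HTML, it was most likely added dynamically:

```typescript
// Hypothetical helper: returns the snippets of visible text that do NOT appear
// in the HTML returned by a single plain GET request (case-insensitive match).
function findMissingSnippets(initialHtml: string, visibleSnippets: string[]): string[] {
    const haystack = initialHtml.toLowerCase();
    return visibleSnippets.filter(
        (snippet) => haystack.indexOf(snippet.toLowerCase()) === -1,
    );
}

// Stand-in for the body of a single GET response (made-up markup).
// The feed container is empty because a script fills it in later.
const initialHtml = `
<html>
  <body>
    <h1>Latest posts</h1>
    <div id="feed"></div>
    <script src="/app.js"></script>
  </body>
</html>`;

// Text we could read in the browser once the page had finished rendering.
const seenInBrowser = ['Latest posts', 'My first post'];

const missing = findMissingSnippets(initialHtml, seenInBrowser);
console.log(
    missing.length === 0
        ? 'All content is in the initial HTML; the page looks non-dynamic.'
        : 'Likely dynamic; missing from the initial HTML: ' + missing.join(', '),
);
```

A plain substring match is deliberately naive: HTML entities, extra whitespace, or text split across tags can cause false alarms, so it works best with short, distinctive snippets.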
A challenge for you 💪
Hopefully, we've been able to clarify some things about dynamic web pages. To end on a strong note, here is a small challenge.
For each of these links, determine whether or not the page is dynamic: