How to block cookie modals when scraping

Scraping the web for LLM training data shouldn’t mean collecting the same cookie disclaimer thousands of times. We dug into why headless scrapers still capture these modals, and how we finally built a reliable way to block them in Apify’s Website Content Crawler.

You’re scraping website content to feed into an LLM, but every page contains the same useless paragraph: “This website uses cookies…” It’s bloating your dataset and making your results look messy.

What’s happening? EU privacy rules (the ePrivacy Directive and the GDPR) require websites to get consent before setting non-essential cookies, so most sites show a consent modal on load. Some even embed their entire cookie policy in the DOM. That means your crawler ends up scraping paragraphs of irrelevant legalese, over and over.

These dialogs aren’t just visual clutter. They waste tokens, pollute your dataset, and consume compute cycles you could spend on useful content.

Even when you're using a modern, headless scraper, this stuff still slips through.

That’s exactly the problem we ran into with Website Content Crawler, Apify’s flagship data collection tool for AI — so we dug deep to fix it.

What we tried (and why most things didn’t work)

We’ve been refining how Website Content Crawler handles cookie modals. While Crawlee, the web scraping framework that powers the crawler, attempts to mitigate them with the I don’t care about cookies (IDCAC) browser extension, that method is now outdated and unreliable.

The main reason is that the extension hasn’t been maintained since November 2023. Many new variations of cookie dialogs have appeared since then, and its rules don’t cover them.

That being said, the way Crawlee makes IDCAC work with a headless browser controlled by Playwright is interesting from a technical point of view. We essentially download the extension from the Firefox addons page, pull out the relevant scripts, and add some clever monkey-patching code that allows the extension to be directly executed by a headless browser.

If you’re interested, you can check out the details on Apify’s GitHub.

Attempt 1: Trying alternative browser extensions

Since IDCAC went into hibernation, forks such as I still don’t care about cookies and alternative add-ons like Cookie Dialog Monster and Consent-O-Matic have appeared. uBlock Origin (yes, the ad blocker that you probably use) can also block cookie dialogs if you enable the Annoyances filters in its settings.

We evaluated these and several other privacy-related browser extensions.

  • Some couldn’t be wrapped using the approach we used with IDCAC, or would require tons of hacky work to adjust.
  • Others didn’t block cookie modals consistently.
  • Bottom line: No existing browser extension offered both satisfactory coverage and ease of integration.

Attempt 2: Using EasyList filters directly

EasyList filters are text-based rules that tell browser extensions what content to block. They support several types of filtering:

  • Network filters: Block requests to specific URLs or domains (e.g., blocking tracker scripts or ad servers)
  • Cosmetic filters: Hide DOM elements using CSS selectors (e.g., ##.cookie-banner to hide cookie banners)
  • Script filters: Prevent specific JavaScript from executing

These filters use pattern matching with wildcards, domain restrictions, and element hiding syntax. Popular filter lists include EasyList (ads), EasyPrivacy (trackers), and various "Annoyances" lists specifically targeting cookie modals and pop-ups.
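To make the three rule types concrete, here is a minimal sketch that sorts filter-list lines by their syntax. It is a deliberate simplification for illustration: real parsers handle many more cases (options, exceptions, regular expressions), and the `classifyFilter` helper, the `example.com` domains, and the sample rules are all hypothetical.

```typescript
type FilterKind = "ignored" | "cosmetic" | "scriptlet" | "network";

// Classify one filter-list line by its syntax (simplified, not a full parser).
function classifyFilter(line: string): FilterKind {
  const rule = line.trim();
  // Blank lines, "!" comments, and "[Adblock Plus 2.0]"-style headers carry no rule.
  if (rule === "" || rule.startsWith("!") || rule.startsWith("[")) return "ignored";
  // uBlock Origin-style scriptlet injection: domain##+js(name, args...)
  if (rule.includes("##+js(")) return "scriptlet";
  // Element hiding ("##"), extended CSS ("#?#"), and hiding exceptions ("#@#").
  if (rule.includes("##") || rule.includes("#?#") || rule.includes("#@#")) return "cosmetic";
  // Anything else is treated as a network rule matched against request URLs.
  return "network";
}

// Hypothetical sample rules, one per category.
const sampleRules = [
  "! Title: some cookie list",
  "||consent.example.com^$third-party",
  "##.cookie-banner",
  "example.com###cookie-notice",
  "example.com##+js(set-constant, consentGiven, true)",
];

for (const rule of sampleRules) {
  console.log(`${classifyFilter(rule).padEnd(9)} ${rule}`);
}
```

Cosmetic rules are the ones that matter most for cookie banners, since they name the DOM elements to hide rather than the requests to block.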

One key advantage of using EasyList filters is that they're open source and continuously updated by the community. Unlike IDCAC, which maintained its own proprietary ruleset that eventually went stale, EasyList benefits from thousands of contributors who regularly add new patterns as websites evolve their cookie consent implementations.

EasyList-format lists are used predominantly by ad blockers such as uBlock Origin, Adblock Plus, and AdGuard to filter unwanted content across the web. The “Annoyances” lists, however, also block cookie banners quite effectively.

There are open source libraries for working with EasyList filters, such as Brave’s adblock-rust. Unfortunately, its public API only provides an Engine class that can be used to match content against a parsed EasyList filter set.

If you want to tie this in with network request filtering and with the DOM of the page loaded in a headless browser, you have to implement that on your own, using the Engine class. The DOM part is especially tricky, because a cookie dialog may be added dynamically after a delay. Many other libraries that provide support for EasyList have similarly restricted functionality.
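To give a sense of the wiring this involves, here is a rough sketch of the do-it-yourself route. It is not how adblock-rust or Website Content Crawler works: `shouldBlock` is a hypothetical stand-in for whatever matching call your filter engine exposes, the selectors are assumed to come from its cosmetic rules, and `PageLike` and `RouteLike` are minimal structural types that mirror the Playwright methods of the same names, so the sketch compiles without Playwright installed.

```typescript
// Minimal shapes of the Playwright objects this sketch relies on.
interface RouteLike {
  request(): { url(): string };
  abort(): Promise<void>;
  continue(): Promise<void>;
}

interface PageLike {
  route(pattern: string, handler: (route: RouteLike) => Promise<void>): Promise<void>;
  addInitScript(script: string): Promise<void>;
}

// Build a browser-side script that removes matching elements now and again
// whenever the DOM changes, so dialogs injected after a delay are caught too.
function buildRemovalScript(selectors: string[]): string {
  const selector = JSON.stringify(selectors.join(","));
  return `(() => {
  const sweep = () => document.querySelectorAll(${selector}).forEach((el) => el.remove());
  const start = () => {
    sweep();
    new MutationObserver(sweep).observe(document.documentElement, { childList: true, subtree: true });
  };
  if (document.readyState === "loading") document.addEventListener("DOMContentLoaded", start);
  else start();
})();`;
}

// Connect the network and DOM halves to a page before navigating.
async function wireUpBlocking(
  page: PageLike,
  shouldBlock: (url: string) => boolean, // hypothetical engine lookup
  selectors: string[],
): Promise<void> {
  // Network half: every request goes through the matcher.
  await page.route("**/*", async (route) => {
    if (shouldBlock(route.request().url())) await route.abort();
    else await route.continue();
  });
  // DOM half: the script runs in every document the page loads.
  if (selectors.length > 0) await page.addInitScript(buildRemovalScript(selectors));
}
```

Even this toy version has costs: every request passes through a route handler, and the sweep reruns on every DOM mutation. A production integration has to be far more selective on both counts.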

Final solution: Ghostery’s Adblocker for Playwright

What finally worked is Ghostery’s adblocker integration for Playwright — a lightweight, actively maintained library designed to intercept and block unwanted content.

Compared to low-level libraries such as adblock-rust, it supports Playwright out of the box. This means that we don’t have to bother with setting up network interception or observing DOM mutations to prevent cookie dialogs from popping up; the library handles all of that.
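For reference, basic usage of the library looks roughly like the sketch below. It assumes the `@ghostery/adblocker-playwright` and `playwright` packages are installed and that Node.js 18+ provides a global `fetch`. The filter list URL (EasyList Cookie List) is our illustrative assumption, not necessarily what Website Content Crawler ships with, and `enableCookieBlocking` and `scrapeText` are hypothetical helper names.

```typescript
import { PlaywrightBlocker } from "@ghostery/adblocker-playwright";
import { chromium } from "playwright";
import type { Page } from "playwright";

// Assumption: a cookie-focused list; swap in whichever lists suit your targets.
const FILTER_LISTS = ["https://secure.fanboy.co.nz/fanboy-cookiemonster.txt"];

let blockerPromise: Promise<PlaywrightBlocker> | undefined;

// Download and parse the lists once, then share the engine across pages.
function getBlocker(): Promise<PlaywrightBlocker> {
  // The library's README passes cross-fetch here; the global fetch is assumed to work too.
  blockerPromise ??= PlaywrightBlocker.fromLists(fetch, FILTER_LISTS);
  return blockerPromise;
}

// Attach network and cosmetic blocking to a page before navigating.
async function enableCookieBlocking(page: Page): Promise<void> {
  const blocker = await getBlocker();
  await blocker.enableBlockingInPage(page);
}

// Load a URL with blocking enabled and return the visible body text.
async function scrapeText(url: string): Promise<string> {
  const browser = await chromium.launch();
  try {
    const page = await browser.newPage();
    await enableCookieBlocking(page);
    await page.goto(url, { waitUntil: "domcontentloaded" });
    return await page.innerText("body");
  } finally {
    await browser.close();
  }
}
```

Building the blocker once and reusing it across pages matters here, because downloading and parsing filter lists is the slow step.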

Still, there were a bunch of challenges:

  • Hand-picking appropriate filter lists
  • Figuring out how to update them and how to prevent the parsing from slowing down the Website Content Crawler
  • Figuring out how to prevent network request interception from tanking Playwright’s performance

In the end, this gave us benefits beyond removing annoying cookie popups: Website Content Crawler can now also block ads and, optionally, images, video, and other bandwidth-intensive resources. This improves both performance and output quality.

When using Website Content Crawler, cookie modals are now reliably filtered out. This reduces token waste, speeds up crawls, and keeps your dataset clean for downstream LLM tasks.

No extra setup required. Just run the crawler as usual.
