How to handle forms and authentication in Scrapy

Learn how to use Scrapy to deal with complex login forms and token authentication


Hi! We're Apify, a full-stack web scraping and browser automation platform. If you're interested in using Python for web scraping, this article shows when and how to use Scrapy to deal with complex login forms and token authentication.


When scraping a website, you'll likely encounter pages that require submitting a login form. Scrapy - the most powerful framework for web scraping in Python - provides ways to handle forms and authentication. While Web scraping with Scrapy 101 shows you how to use Scrapy's features in general, this short guide equips you with some Scrapy know-how to deal with login forms and handle token authentication.

Handling login forms with FormRequest

For straightforward login forms, use Scrapy's FormRequest. Here's how to do it:

  1. Identify how the form works: Open the login page and inspect the form using your browser's developer tools. Then try to submit the form and inspect what request it makes and which parameters it sends. We'll use this form as an example. Note the form action URL and the input field names (uname and psw) for the next steps.
Using Scrapy to open a login page and submit a form
  2. Craft the FormRequest: In your spider (for example, in start_requests), yield a FormRequest with the following arguments:
  • URL: The login form action URL from the network tab (https://www.w3schools.com/action_page.php).
  • formdata: A dictionary containing the login credentials (e.g., {'uname': 'your_username', 'psw': 'your_password'}).
  • callback: The function to handle the response after login.

Example:

import scrapy
from scrapy import FormRequest

class LoginSpider(scrapy.Spider):
    name = "login"

    def start_requests(self):
        login_url = "https://example.com/login"
        # Submit the login form with your credentials
        yield FormRequest(login_url,
                          formdata={'username': 'your_username', 'password': 'your_password'},
                          callback=self.after_login)

    def after_login(self, response):
        # Check for a successful login before proceeding with scraping
        if b'Login successful' in response.body:
            # Your scraping logic here
            pass

Managing complex login forms with CSRF tokens

Some logins involve multiple steps or dynamic elements. In such cases, we need to extend the previous example by extracting some additional data before sending the request. We can do this with Scrapy's from_response and xpath methods:

  1. Scrape the initial login page: Use a standard Request to fetch the login page.
  2. Extract login details: Within your parse method, use XPath or CSS selectors to extract necessary information like hidden form fields or CSRF tokens (often required for security).
  3. Craft the FormRequest dynamically: Build a new FormRequest using formdata populated with extracted data.

Remember:

  • Use response.xpath or response.css for element selection.
  • Refer to Scrapy's documentation for detailed selector syntax.

Example (assuming a hidden CSRF token is present):

import scrapy
from scrapy import FormRequest, Request

class CsrfLoginSpider(scrapy.Spider):
    name = "csrf_login"

    def start_requests(self):
        login_url = "https://example.com/login"
        # Fetch the login page first so we can read its hidden fields
        yield Request(login_url, callback=self.parse_login)

    def parse_login(self, response):
        # Extract the hidden CSRF token from the login form
        csrf_token = response.xpath('//input[@name="_csrf"]/@value').get()
        yield FormRequest.from_response(response,
                                        formdata={'username': 'your_username', 'password': 'your_password', '_csrf': csrf_token},
                                        callback=self.after_login)

# ... after_login logic from previous example
💡 Scrapy makes XPath selectors easy to use. Learn how XPath works and how to use it in Scrapy here.
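
For instance, the CSRF token extraction above can be written with either selector style. A minimal sketch, assuming the same hidden _csrf input as in the previous example:

# Both lines extract the value attribute of <input name="_csrf" value="..."> from the response
csrf_token = response.xpath('//input[@name="_csrf"]/@value').get()
csrf_token = response.css('input[name="_csrf"]::attr(value)').get()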

Scrapy and token authentication

For token-based authentication, you'll need to extract the token from the login response and include it in subsequent requests:

  1. Extract the token: Use XPath or CSS selectors to identify the element containing the token after a successful login.
  2. Include the token in headers: Use the headers argument in your Request objects to send the token in the Authorization header.

Example (assuming a token is stored in a hidden element):

# Continuing inside the spider from the previous examples
def after_login(self, response):
    # Pull the token that the site stored in a hidden element after login
    token = response.xpath('//input[@id="auth_token"]/@value').get()
    yield Request("https://example.com/protected-data",
                  headers={'Authorization': f'Bearer {token}'},
                  callback=self.parse_protected_data)  # handle the protected page in this callback
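
Not every site stores the token in the HTML: many return it in a JSON body instead. Here's a minimal sketch of that variant, assuming a hypothetical /api/login endpoint that responds with something like {"token": "..."} (the URLs and field names are illustrative):

import json

import scrapy
from scrapy.http import JsonRequest

class TokenAuthSpider(scrapy.Spider):
    name = 'token_auth'

    def start_requests(self):
        # Many APIs expect the credentials as a JSON body rather than a form
        yield JsonRequest(
            'https://example.com/api/login',
            data={'username': 'your_username', 'password': 'your_password'},
            callback=self.after_login,
        )

    def after_login(self, response):
        # Assumes the login endpoint returns {"token": "..."}
        token = json.loads(response.text).get('token')
        yield scrapy.Request(
            'https://example.com/api/protected-data',
            headers={'Authorization': f'Bearer {token}'},
            callback=self.parse_data,
        )

    def parse_data(self, response):
        # Your scraping logic here
        pass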

What about CAPTCHAs and JavaScript validations?

Some forms, especially signup forms, may have CAPTCHAs or JavaScript validations that need to be handled. Here are a few solutions to these issues.

CAPTCHAs

  1. Manual solving: For low-volume tasks, you might manually solve CAPTCHAs when they appear. This is the simplest approach, but doesn't scale well.
  2. ReCAPTCHA tokens: For sites using Google's reCAPTCHA, some services can generate tokens based on site keys. These tokens can then be submitted with your form (see the sketch after this list).
  3. Avoidance (recommended): The best way to deal with CAPTCHAs is to avoid getting them in the first place. You can avoid CAPTCHAs by mimicking human behavior more closely (e.g., randomizing click locations and intervals, using a browser with a full execution environment like Selenium). The best framework for getting your bots to emulate human behavior is Crawlee, as it automates the process. It's a Node.js library, but support for Python is coming very soon!
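
To illustrate option 2, here's a minimal sketch of submitting a solved token with a login form. It assumes a solving service has already handed you a token (not shown) and that the form expects it in the standard g-recaptcha-response field; the method name and callback are illustrative:

from scrapy import FormRequest

# Inside your spider: submit the form together with a solved CAPTCHA token.
# solved_token would come from a CAPTCHA-solving service (not shown here).
def submit_with_captcha(self, response, solved_token):
    yield FormRequest.from_response(
        response,
        formdata={
            'username': 'your_username',
            'password': 'your_password',
            # reCAPTCHA v2 forms typically expect the token in this field
            'g-recaptcha-response': solved_token,
        },
        callback=self.after_login,
    )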

JavaScript validations

  1. Headless browsers: Tools like Playwright, Selenium, or Puppeteer can automate a real browser, including executing JavaScript. This means JavaScript validations and dynamically generated tokens can be handled almost as they would be in a manual browsing session (see the sketch after this list).
  2. AJAX handling: If validations involve AJAX requests, you can mimic those requests directly in your script. You can use the Developer Tools in Chrome or Firefox to inspect what requests are made.
  3. API endpoints: Sometimes, the functionality behind form submissions (especially validations) is powered by API endpoints. These can sometimes be used directly to bypass the need for dealing with the form or its JavaScript validations on a web page.
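
If you'd rather stay within Scrapy while rendering JavaScript, the scrapy-playwright plugin is one option for the headless-browser approach above. A minimal sketch, assuming you've installed it with pip install scrapy-playwright and playwright install; the spider name and URL are placeholders:

import scrapy

class JsLoginSpider(scrapy.Spider):
    name = 'js_login'

    # Route requests through Playwright so the page's JavaScript is executed
    custom_settings = {
        'DOWNLOAD_HANDLERS': {
            'http': 'scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler',
            'https': 'scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler',
        },
        'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor',
    }

    def start_requests(self):
        yield scrapy.Request(
            'https://example.com/login',
            meta={'playwright': True},  # render this page in a real browser
            callback=self.parse_login,
        )

    def parse_login(self, response):
        # The response now reflects the DOM after JavaScript has run
        pass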

Moving your Scrapy project to the cloud

If you want to take your web scraping projects to the next level with the Apify cloud platform and take advantage of its proxies, scheduling, storage, code templates, and more, these resources will help you get started:

Integrating Scrapy with Apify

Python code templates

Theo Vasilis
I used to write books. Then I took an arrow in the knee. Now I'm a technical content marketer, crafting tutorials for developers and conversion-focused content for SaaS.
