Extracting bulk information from websites that are not designed for bulk data access is no easy task.
That's why I'll explain everything you need to know about scraping real estate data.
I'll cover the process from start to finish, including choosing the right tools and efficiently parsing the data you need.
Web scraping in real estate
Web scraping is the process of automatically extracting useful data from a website with the help of programs called web scrapers.
Real estate web scraping helps you gather current information about the industry so you can improve market awareness, manage your online reputation, develop standout content, stay ahead of your competitors, analyze market trends faster, and ultimately achieve greater business predictability.
How to scrape real estate data on Zillow with Python
In this section, I'll walk you through a comprehensive step-by-step guide on how to scrape real estate data from Zillow.
There are many methods, tools, and technologies you can use to build your scraper. However, to save you development time, I'll use an Apify Python template to bootstrap the scraper quickly.
You'll learn how to build your scraper using the Apify SDK for Python - a toolkit that simplifies building Apify Actors and scrapers in Python. Specifically, I'll use the Playwright + Chrome template to build the Zillow scraper with Python.
1. Prerequisites and preparing your environment
To follow along, you need to satisfy the following conditions:
- Have Python installed on your computer.
- Have a basic understanding of CSS selectors.
- Be comfortable navigating the browser DevTools to find and select page elements.
- Have a text editor installed on your machine, such as VSCode, PyCharm, or any editor of choice.
- Have basic terminal/command-line knowledge to run commands for initializing projects, installing packages, deploying sites, etc.
- Apify CLI installed globally by running this command:
npm -g install apify-cli
- Have an Apify account. You can create a new account on the Apify platform.
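Once the Apify CLI is installed, you can also log in to your Apify account from the terminal, which you'll need later to deploy your Actor:
apify login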
Assuming you satisfy these requirements, let's proceed with setting up your development environment for scraping real estate data.
2. Getting started with Apify templates
Let's get started with the Playwright + Chrome template from Apify to build your real estate scraper.
In Apify, scrapers are referred to as Actors, so I'll be using the term "Actor" to refer to the scraper throughout this article.
Start by cloning the Playwright + Chrome template from the Apify templates repository.
Next, click “Use locally”, then copy and paste the command into your terminal to create a new Actor from the template.
apify create my-actor -t python-playwright
Replace my-actor with the name of the Actor you want to create. For this example, I'm naming my Actor zillow-scraper, so I'll run this command:
apify create zillow-scraper -t python-playwright
This Apify CLI command builds your zillow-scraper from the python-playwright template.
The command will install all necessary libraries and display a bunch of logs on your terminal while running. This will take a couple of minutes.
The file tree below shows your folder structure:
├───.actor
│ └───actor.json
│ └───Dockerfile
│ └───input_schema.json
├───.idea
│ └───inspectionProfiles
├───.venv
├───src
│ └───__main__.py
│ └───__pycache__
│ └───main.py
└───storage
├───datasets
│ └───default
├───key_value_stores
│ └───default
│ └───INPUT.json
└───request_queues
└───default
Each file in the .actor folder performs a specific function. Here's a description of each:
- actor.json: The .actor/actor.json file, located at the root of your Actor's directory, is where you define the main configuration for your Apify Actor. This file acts as the connection between your local development and the Apify platform. It contains important details like your Actor's name, version, build tag, and environment variables.
- Dockerfile: The Dockerfile specifies the base image your Actor will be built on. Apify offers various base images suited for different Actor types and requirements.
- input_schema.json: The input_schema.json file, defined as a JSON object, outlines the expected input for your Apify Actor. This schema specifies the field types supported by Apify, allowing the platform to generate a user-friendly interface for your Actor.
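For reference, here's roughly what a generated .actor/actor.json can look like for this project. This is illustrative only - the exact contents produced by the template may differ:
{
    "actorSpecification": 1,
    "name": "zillow-scraper",
    "title": "Zillow scraper",
    "version": "0.0",
    "buildTag": "latest",
    "environmentVariables": {}
}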
Because we're only focusing on building your real estate data Actor, you won't modify any of these files. You'll only make changes to the src/main.py and storage/key_value_stores/default/INPUT.json files.
3. Using Chrome DevTools to inspect the Zillow search page
Now, I'll explain how to use Chrome DevTools to understand Zillow’s site structure to extract data about properties for sale and rent.
Before scraping the Zillow search page, you need to inspect it. To do this, follow these steps:
- Open the Zillow website in an incognito tab in your browser.
- Enter the location where you want to search for a home.
- Hit Enter. You'll be taken to a page with real estate listings that match your search.
- Inspect the page in your Chrome browser by pressing Ctrl + Shift + I on Windows or Cmd + Option + I on macOS. This will open the DevTools inspect window.
The following HTML elements and CSS selectors represent the data to be scraped from each card on the Zillow search page (you can verify them with the quick sketch after this list):
- $(".photo-cards"): the wrapper or container for all list item cards
- $("article.property-card img").src: returns the first image URL of each card
- $("span[data-test=property-card-price]").textContent: returns the price and apartment availability
- $("a.property-card-link").href: returns the URL of each home
- $("address[data-test=property-card-addr]").textContent: returns the address of each home
Now that you know the HTML elements and CSS classes to target to get the data you need, in the next section, you will learn how to use Playwright and Chrome to build your Actor.
4. Building your Zillow Actor
Now it's time to start writing code and playing around with files specific to the Apify SDK.
The INPUT.json file is used to accept input with the Apify SDK. Open INPUT.json and replace its content with the following code:
{
"url": "https://www.zillow.com/baltimore-md/rentals/",
"homepage": "https://zillow.com"
}
In the code snippet above, url represents the search page, while homepage corresponds to the homepage URL of the Zillow website. Both will be useful later in our code.
Next, you'll modify the code in the src/main.py file. Start by deleting all of its code, as you'll be building your Zillow Actor from scratch.
Then paste in the code below. This snippet reads the input using the Actor.get_input() method, which gives you access to the input record in the Actor's default key-value store.
from apify import Actor


async def main() -> None:
    async with Actor:
        actor_input = await Actor.get_input() or {}
        url = actor_input.get('url')
        homepage = actor_input.get('homepage')

        if not homepage:
            Actor.log.info('Homepage not specified, using default value...')
            homepage = "https://zillow.com"

        if not url:
            Actor.log.info('No URL specified in actor input, exiting...')
            await Actor.exit()
In the code snippet above, you're getting the url and homepage values from the Actor's input, which comes from the INPUT.json file. If no homepage value is set, the code falls back to https://zillow.com.
5. Writing your code
Now you need to write the rest of the code to complete your Actor's functionality.
The Zillow real estate data extractor will scrape the following information from the search page: imageUrl, price, link, and address. Replace the contents of src/main.py with the complete code below:
import random
from typing import Any

from apify import Actor
from playwright.async_api import async_playwright


async def main() -> None:
    async with Actor:
        # Structure of input is defined in input_schema.json
        actor_input = await Actor.get_input() or {}
        url = actor_input.get('url')
        homepage = actor_input.get('homepage')

        if not homepage:
            Actor.log.info('Homepage not specified, using default value...')
            homepage = "https://zillow.com"

        if not url:
            Actor.log.info('No URL specified in actor input, exiting...')
            await Actor.exit()

        data: list[dict[str, str | Any]] = []  # Define an empty list to hold the scraped details

        user_agent_strings = [
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.2227.0 '
            'Safari/537.36',
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 '
            'Safari/537.36',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) '
            'Chrome/109.0.3497.92 Safari/537.36',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 '
            'Safari/537.36',
        ]

        async with async_playwright() as p:
            # Launch a Chromium browser instance
            browser = await p.chromium.launch(headless=False)
            # Create a new browser context with a randomly selected user agent string
            context = await browser.new_context(
                user_agent=user_agent_strings[random.randint(0, len(user_agent_strings) - 1)])
            page = await context.new_page()
            await page.goto(url)

            # Wait for the card container to load
            await page.wait_for_selector(".photo-cards")

            # Extract data from each card
            card_elements = await page.query_selector_all(".property-card")
            for card in card_elements:
                image_url = await card.query_selector("img")
                price_size = await card.query_selector("span[data-test=property-card-price]")
                link = await card.query_selector("a.property-card-link")
                address = await card.query_selector("address[data-test=property-card-addr]")

                card_details = {
                    "imageUrl": await image_url.evaluate("el => el.src"),
                    "price": await price_size.inner_text(),
                    "link": homepage + await link.get_attribute("href"),  # Build the full URL
                    "address": await address.inner_text()
                }

                # Append card_details to data
                data.append(card_details)
                Actor.log.info(f"Extracted details: {card_details}")

            # Close the browser before leaving the Playwright context
            await browser.close()

        # Save the scraped real estate data to the Apify Dataset
        await Actor.push_data(data)
In the code snippet above, you're doing the following:
1. Initializing the data structure:
- You create an empty list, data, to store the scraped apartment information.
2. Launching Playwright in an asynchronous context:
- You use an async with statement to manage Playwright resources asynchronously.
- Inside the context, you launch a Chromium browser instance using p.chromium.launch(headless=False), so you can watch the browser while developing locally.
- You create a new browser context with a randomly selected user agent string and a new page object (page) to interact with the webpage.
- You navigate the page to the target Zillow URL using page.goto(url).
3. Waiting for content and extracting cards:
- You wait for the element with the class .photo-cards to load, ensuring all apartment cards are available on the page.
- You use page.query_selector_all(".property-card") to find all elements representing individual apartment listings.
4. Iterating over each card and extracting its details (the loop):
- You iterate through each card element in the previously retrieved list card_elements.
- For each card, you use card.query_selector to target specific elements within the card: the img element for the image URL, the span[data-test=property-card-price] element for the price information, the a.property-card-link element for the apartment details link, and the address[data-test=property-card-addr] element for the address information.
- You use evaluate on the retrieved image_url element to extract its src attribute.
- You use inner_text on the retrieved price_size element to extract its visible text content.
- You construct the full URL for the apartment details by combining the base URL (homepage) with the relative URL retrieved using get_attribute("href") on the link element.
- You use inner_text on the retrieved address element to extract its visible text content (the address).
5. Organizing the extracted data and appending it to a list:
- You create a dictionary, card_details, to store the extracted information for each apartment: imageUrl (the image URL obtained using evaluate), price (the price and availability information obtained using inner_text), link (the full URL for the apartment details constructed earlier), and address (the address obtained using inner_text).
- You append the card_details dictionary to the data list, accumulating scraped information for each apartment.
- You log the extracted details for each apartment using Actor.log.info.
6. Saving the scraped data and closing resources:
- You close the browser and use Actor.push_data(data) to save the scraped apartment information (data) to the Apify Dataset.
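For reference, each item pushed to the dataset is a dictionary with these four keys. The values below are purely illustrative:
{
    "imageUrl": "https://photos.zillowstatic.com/fp/example-photo.jpg",
    "price": "$1,500+/mo",
    "link": "https://zillow.com/apartments/baltimore-md/example-apartments/",
    "address": "123 Example St, Baltimore, MD"
}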
6. How to run your Zillow real estate Actor
To run your Actor, enter the following command in your terminal:
apify run
7. Downloading your data in various formats
Apify allows you to deploy your Actor to Apify Console. Run the command below to deploy your Actor to the cloud:
apify push
After successfully deploying your Actor to the Apify cloud platform, you can see your newly deployed Actor on the My Actors page.
You can export your dataset by clicking the “Export” button. The supported formats include JSON, CSV, XML, Excel, HTML Table, RSS, and JSONL.
Select the file format you want your data in, then click “Download”.
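If you'd rather download the results programmatically than through the Console UI, you can fetch the dataset items from the Apify API. Here's a minimal sketch using the requests library - the dataset ID and token are placeholders you'd replace with your own values:
import requests

DATASET_ID = "<YOUR_DATASET_ID>"  # shown on the run's Storage tab in Apify Console
API_TOKEN = "<YOUR_APIFY_TOKEN>"  # found in your Apify Console account settings

# The format parameter accepts json, jsonl, csv, xlsx, html, xml, or rss
response = requests.get(
    f"https://api.apify.com/v2/datasets/{DATASET_ID}/items",
    params={"format": "json", "token": API_TOKEN},
)
response.raise_for_status()
items = response.json()
print(f"Downloaded {len(items)} listings")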
See the complete code for this article with comments hosted on GitHub Gist.
Check out Apify's ready-made Zillow Scraper, which allows you to extract data about properties for sale and rent on Zillow using the Zillow API, but with no daily call limits.
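If you'd like to call that ready-made Actor from your own Python code, the apify-client package can start a run and fetch its results. This is a rough sketch - the Actor ID and run input below are placeholders, so check the scraper's page in Apify Store for its actual ID and input schema:
from apify_client import ApifyClient

client = ApifyClient("<YOUR_APIFY_TOKEN>")

# Placeholder Actor ID and input - see the scraper's Store page for the real ones
run = client.actor("<ZILLOW_SCRAPER_ACTOR_ID>").call(
    run_input={"searchUrls": [{"url": "https://www.zillow.com/baltimore-md/rentals/"}]},
)

# Iterate over the items the run stored in its default dataset
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)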
Use cases of web scraping in real estate
- Price optimization (comparison and monitoring): Real estate businesses can use competitors' data to understand market values and customer expectations, which can help them optimize prices (a small analysis sketch follows this list).
- Property market research: Web data from property listings, real estate agencies, and public records can be used to identify undervalued properties, areas with projected growth, and emerging hotspots.
- Investment scouting: Real estate web scraping can provide insights to help real estate businesses make data-driven investments.
- Lead generation: Web scraping can be used to generate leads for marketing to potential clients. Real estate companies can identify individuals who are interested in buying or selling property by scanning forums, social media, and other venues where property discussions take place.
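As a tiny example of the price-optimization use case, here's a sketch of how you might analyze the prices scraped earlier with pandas. It assumes you've exported your dataset as a zillow.csv file with the imageUrl, price, link, and address columns - the file name and price format are assumptions:
import re

import pandas as pd

df = pd.read_csv("zillow.csv")  # dataset exported from Apify in CSV format

def parse_price(text: str) -> float | None:
    # Pull the first dollar amount out of strings like "$1,500+/mo"
    match = re.search(r"\$([\d,]+)", str(text))
    return float(match.group(1).replace(",", "")) if match else None

df["price_usd"] = df["price"].apply(parse_price)

# Basic market snapshot of the listings you scraped
print(df["price_usd"].describe())  # count, mean, min, max, quartiles
print(df.nsmallest(5, "price_usd")[["address", "price_usd"]])  # cheapest listings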
Websites you can scrape for real estate data
Below are some popular real estate websites you can scrape for data. I've also highlighted how easy each site is to scrape and the kind of data it provides. The data typically includes listing details like price, photos, description, number of bedrooms/bathrooms, square footage, etc. Some sites provide additional data like neighborhood statistics, home value estimates, sale history, etc.
| Website | Ease of Scraping | Data Provided |
| --- | --- | --- |
| http://zillow.com/ | Moderate | Property listings, prices, details, estimates, neighborhood data |
| http://realtor.com/ | Difficult | Listings, prices, agent information, neighborhood data |
| http://redfin.com/ | Easy | Listings, prices, estimates, sale history, agent insights |
| http://rightmove.co.uk/ | Moderate | UK property listings, prices, descriptions, local area stats |
| http://idealista.com/ | Easy | Spanish/Portuguese listings, prices, details, neighborhood data |
| http://immobilienscout24.de/ | Difficult | German property listings, prices, descriptions, reports |
| http://funda.nl/ | Easy | Dutch residential listings, prices, photos, neighborhood data |
| http://cian.ru/ | Difficult | Russian real estate listings across types, prices, statistics |
| http://zoopla.co.uk/ | Moderate | UK listings, prices, area stats, value estimates |
How the classification was made:
- Easy: The website allows scraping and has few anti-scraping measures
- Moderate: Some anti-scraping measures like rate-limiting, CAPTCHA
- Difficult: Heavy anti-scraping protections like IP blocking, JS rendering
Conclusion and next steps
Now you know how to use the Apify platform and its Actor framework to build scalable web scrapers. You learned how to use Playwright to interact with Zillow web pages and extract valuable real estate data. You also saw some popular real estate websites you can collect data from, along with how difficult each one is to scrape.
Time to use what you learned here and try out this Zillow real estate data scraper on Apify Store!