Extracting bulk information from websites that are not designed for bulk data access is no easy task.
That's why I'll explain everything you need to know about scraping real estate data.
I'll cover the process from start to finish, including choosing the right tools and efficiently parsing the data you need.
Web scraping in real estate
Web scraping is the process of automatically extracting useful data from a website with the help of programs called web scrapers.
Real estate web scraping helps you gather current information about the industry so you can improve market awareness, manage your online reputation, develop standout content, stay ahead of your competitors, analyze market trends faster, and ultimately achieve greater business predictability.
How to scrape real estate data on Zillow with Python
In this section, I'll walk you through a comprehensive step-by-step guide on how to scrape real estate data from Zillow.
There are many methods, tools, and technologies you can use to build your scraper. However, to save you development time, I'll use an Apify Python template to bootstrap the scraper quickly.
You'll learn how to build your scraper using the Apify SDK for Python - a toolkit that simplifies building Apify Actors and scrapers in Python. Specifically, I'll use the Playwright + Chrome template to build the Zillow scraper with Python.
1. Prerequisites and preparing your environment
To follow along, you need to satisfy the following conditions:
- Have Python installed on your computer.
- Have a basic understanding of CSS selectors.
- Be comfortable navigating the browser DevTools to find and select page elements.
- Have a text editor installed on your machine, such as VSCode, PyCharm, or any editor of choice.
- Have basic terminal/command-line knowledge to run commands for initializing projects, installing packages, deploying sites, etc.
- Apify CLI installed globally by running this command:
npm -g install apify-cli
- Have an Apify account. You can create a new account on the Apify platform.
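Once the Apify CLI is installed, you can also log in to your Apify account from the terminal, which you'll need later to deploy your Actor:
apify login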
Assuming you satisfy these requirements, let's proceed with setting up your development environment for scraping real estate data.
2. Getting started with Apify templates
Let's get started with the Playwright + Chrome template from Apify to build your real estate scraper.
In Apify, scrapers are referred to as Actors, so I'll be using the term "Actor" to refer to the scraper throughout this article.
Start by cloning the Playwright + Chrome template from the Apify templates repository.
Next, click “Use locally”, then copy and paste the command into your terminal to create a new Actor from the template.
apify create my-actor -t python-playwright
Replace my-actor with the name of the Actor you want to create. For this example, I'm naming my Actor zillow-scraper, so I'll run this command:
apify create zillow-scraper -t python-playwright
This Apify CLI command builds your zillow-scraper from the python-playwright template.
The command will install all necessary libraries and display a bunch of logs on your terminal while running. This will take a couple of minutes.
The file tree below shows your folder structure:
├───.actor
│ └───actor.json
│ └───Dockerfile
│ └───input_schema.json
├───.idea
│ └───inspectionProfiles
├───.venv
├───src
│ └───__main__.py
│ └───__pycache__
│ └───main.py
└───storage
├───datasets
│ └───default
├───key_value_stores
│ └───default
│ └───INPUT.json
└───request_queues
└───default
Each file in the .actor folder performs a specific function. Here's a description of each:
- actor.json: The .actor/actor.json file, located at the root of your Actor's directory, is where you define the main configuration for your Apify Actor. This file acts as the connection between your local development and the Apify platform. It contains important details like your Actor's name, version, build tag, and environment variables.
- Dockerfile: The Dockerfile specifies the base image your Actor will be built on. Apify offers various base images suited for different Actor types and requirements.
- input_schema.json: The input_schema.json file, defined as a JSON object, outlines the expected input for your Apify Actor. This schema specifies the field types supported by Apify, allowing the platform to generate a user-friendly interface for your Actor.
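For reference, here's roughly what a generated .actor/actor.json can look like for this project. This is illustrative only - the exact contents produced by the template may differ:
{
    "actorSpecification": 1,
    "name": "zillow-scraper",
    "title": "Zillow scraper",
    "version": "0.0",
    "buildTag": "latest",
    "environmentVariables": {}
}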
Because we're only focusing on building your real estate data Actor, you won't modify any of these files. You'll only make changes to the src/main.py and storage/key_value_stores/default/INPUT.json files.
3. Using Chrome DevTools to inspect the Zillow search page
Now, I'll explain how to use Chrome DevTools to understand Zillow’s site structure to extract data about properties for sale and rent.
Before scraping the Zillow search page, you need to inspect it. To do this, follow these steps:
- Open the Zillow website in an incognito tab in your browser.
- Enter the location where you want to search for a home.
- Hit Enter. You'll be taken to a page with real estate listings that match your search.
- Inspect the page in your Chrome browser by pressing Ctrl + Shift + I on Windows or Cmd + Option + I on macOS. This will open the DevTools inspect window.
The following HTML elements and CSS selectors represent the data to be scraped from each card on the Zillow search page (you can verify them with the quick sketch after this list):
- $(".photo-cards"): the wrapper or container for all list item cards
- $("article.property-card img").src: returns the first image URL of each card
- $("span[data-test=property-card-price]").textContent: returns the price and apartment availability
- $("a.property-card-link").href: returns the URL of each home
- $("address[data-test=property-card-addr]").textContent: returns the address of each home
Now that you know the HTML elements and CSS classes to target to get the data you need, in the next section, you will learn how to use Playwright and Chrome to build your Actor.
4. Building your Zillow Actor
Now it's time to start writing code and playing around with files specific to the Apify SDK.
The INPUT.json file is used to accept input with the Apify SDK. Open INPUT.json and replace its content with the following code:
{
"url": "https://www.zillow.com/baltimore-md/rentals/",
"homepage": "https://zillow.com"
}
In the code snippet above, url represents the search page, while homepage corresponds to the homepage URL of the Zillow website. Both will be useful later in our code.
Next, you'll modify the code in the src/main.py file. Start by deleting all of its code, as you'll be building your Zillow Actor from scratch.
Then paste in the code below. This snippet reads the input using the Actor.get_input() method, which gives you access to the input record in the Actor's default key-value store.
from apify import Actor


async def main() -> None:
    async with Actor:
        actor_input = await Actor.get_input() or {}
        url = actor_input.get('url')
        homepage = actor_input.get('homepage')

        if not homepage:
            Actor.log.info('Homepage not specified, using default value...')
            homepage = "https://zillow.com"

        if not url:
            Actor.log.info('No URL specified in actor input, exiting...')
            await Actor.exit()
In the code snippet above, you're getting the url and homepage values from the Actor's input, which comes from the INPUT.json file. If no homepage value is set, the code falls back to https://zillow.com.
5. Writing your code
Now you need to write the rest of the code to complete your Actor's functionality.
The Zillow real estate data extractor will scrape the following information from the search page: imageUrl, price, link, and address. Replace the contents of src/main.py with the complete code below:
import random
from typing import Any

from apify import Actor
from playwright.async_api import async_playwright


async def main() -> None:
    async with Actor:
        # Structure of input is defined in input_schema.json
        actor_input = await Actor.get_input() or {}
        url = actor_input.get('url')
        homepage = actor_input.get('homepage')

        if not homepage:
            Actor.log.info('Homepage not specified, using default value...')
            homepage = "https://zillow.com"

        if not url:
            Actor.log.info('No URL specified in actor input, exiting...')
            await Actor.exit()

        data: list[dict[str, str | Any]] = []  # Define an empty list to hold the scraped details

        user_agent_strings = [
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.2227.0 '
            'Safari/537.36',
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 '
            'Safari/537.36',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) '
            'Chrome/109.0.3497.92 Safari/537.36',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 '
            'Safari/537.36',
        ]

        async with async_playwright() as p:
            # Launch a Chromium browser instance
            browser = await p.chromium.launch(headless=False)
            # Create a new browser context with a randomly selected user agent string
            context = await browser.new_context(
                user_agent=user_agent_strings[random.randint(0, len(user_agent_strings) - 1)])
            page = await context.new_page()
            await page.goto(url)

            # Wait for the card container to load
            await page.wait_for_selector(".photo-cards")

            # Extract data from each card
            card_elements = await page.query_selector_all(".property-card")
            for card in card_elements:
                image_url = await card.query_selector("img")
                price_size = await card.query_selector("span[data-test=property-card-price]")
                link = await card.query_selector("a.property-card-link")
                address = await card.query_selector("address[data-test=property-card-addr]")

                card_details = {
                    "imageUrl": await image_url.evaluate("el => el.src"),
                    "price": await price_size.inner_text(),
                    "link": homepage + await link.get_attribute("href"),  # Build the full URL
                    "address": await address.inner_text()
                }

                # Append card_details to data
                data.append(card_details)
                Actor.log.info(f"Extracted details: {card_details}")

            # Close the browser before leaving the Playwright context
            await browser.close()

        # Save the scraped real estate data to the Apify Dataset
        await Actor.push_data(data)
In the code snippet above, you're doing the following:
1. Initializing the data structure:
- You create an empty list, data, to store the scraped apartment information.
2. Launching Playwright in an asynchronous context:
- You use an async with statement to manage Playwright resources asynchronously.
- Inside the context, you launch a Chromium browser instance using p.chromium.launch(headless=False), so you can watch the browser while developing locally.
- You create a new browser context with a randomly selected user agent string and a new page object (page) to interact with the webpage.
- You navigate the page to the target Zillow URL using page.goto(url).
3. Waiting for content and extracting cards:
- You wait for the element with the class .photo-cards to load, ensuring all apartment cards are available on the page.
- You use page.query_selector_all(".property-card") to find all elements representing individual apartment listings.
4. Iterating over each card and extracting its details (the loop):
- You iterate through each card element in the previously retrieved list card_elements.
- For each card, you use card.query_selector to target specific elements within the card: the img element for the image URL, the span[data-test=property-card-price] element for the price information, the a.property-card-link element for the apartment details link, and the address[data-test=property-card-addr] element for the address information.
- You use evaluate on the retrieved image_url element to extract its src attribute.
- You use inner_text on the retrieved price_size element to extract its visible text content.
- You construct the full URL for the apartment details by combining the base URL (homepage) with the relative URL retrieved using get_attribute("href") on the link element.
- You use inner_text on the retrieved address element to extract its visible text content (the address).
5. Organizing the extracted data and appending it to a list:
- You create a dictionary, card_details, to store the extracted information for each apartment: imageUrl (the image URL obtained using evaluate), price (the price and availability information obtained using inner_text), link (the full URL for the apartment details constructed earlier), and address (the address obtained using inner_text).
- You append the card_details dictionary to the data list, accumulating scraped information for each apartment.
- You log the extracted details for each apartment using Actor.log.info.
6. Saving the scraped data and closing resources:
- You close the browser and use Actor.push_data(data) to save the scraped apartment information (data) to the Apify Dataset.
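For reference, each item pushed to the dataset is a dictionary with these four keys. The values below are purely illustrative:
{
    "imageUrl": "https://photos.zillowstatic.com/fp/example-photo.jpg",
    "price": "$1,500+/mo",
    "link": "https://zillow.com/apartments/baltimore-md/example-apartments/",
    "address": "123 Example St, Baltimore, MD"
}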
6. How to run your Zillow real estate Actor
To run your Actor, enter the following command in your terminal:
apify run
7. Downloading your data in various formats
Apify allows you to deploy your Actor to Apify Console. Run the command below to deploy your Actor to the cloud:
apify push
After successfully deploying your Actor to the Apify cloud platform, you can see your newly deployed Actor on the My Actors page.
You can export your dataset by clicking the “Export” button. The supported formats include JSON, CSV, XML, Excel, HTML Table, RSS, and JSONL.
Select the file format you want your data in, then click “Download”.
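If you'd rather download the results programmatically than through the Console UI, you can fetch the dataset items from the Apify API. Here's a minimal sketch using the requests library - the dataset ID and token are placeholders you'd replace with your own values:
import requests

DATASET_ID = "<YOUR_DATASET_ID>"  # shown on the run's Storage tab in Apify Console
API_TOKEN = "<YOUR_APIFY_TOKEN>"  # found in your Apify Console account settings

# The format parameter accepts json, jsonl, csv, xlsx, html, xml, or rss
response = requests.get(
    f"https://api.apify.com/v2/datasets/{DATASET_ID}/items",
    params={"format": "json", "token": API_TOKEN},
)
response.raise_for_status()
items = response.json()
print(f"Downloaded {len(items)} listings")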
See the complete code for this article with comments hosted on GitHub Gist.
Check out Apify's ready-made Zillow Scraper, which allows you to extract data about properties for sale and rent on Zillow using the Zillow API, but with no daily call limits.
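If you'd like to call that ready-made Actor from your own Python code, the apify-client package can start a run and fetch its results. This is a rough sketch - the Actor ID and run input below are placeholders, so check the scraper's page in Apify Store for its actual ID and input schema:
from apify_client import ApifyClient

client = ApifyClient("<YOUR_APIFY_TOKEN>")

# Placeholder Actor ID and input - see the scraper's Store page for the real ones
run = client.actor("<ZILLOW_SCRAPER_ACTOR_ID>").call(
    run_input={"searchUrls": [{"url": "https://www.zillow.com/baltimore-md/rentals/"}]},
)

# Iterate over the items the run stored in its default dataset
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)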
Use cases of web scraping in real estate
- Price optimization (comparison and monitoring): Real estate businesses can use competitors' data to understand market values and customer expectations, which can help them optimize prices (a small analysis sketch follows this list).
- Property market research: Web data from property listings, real estate agencies, and public records can be used to identify undervalued properties, areas with projected growth, and emerging hotspots.
- Investment scouting: Real estate web scraping can provide insights to help real estate businesses make data-driven investments.
- Lead generation: Web scraping can be used to generate leads for marketing to potential clients. Real estate companies can identify individuals who are interested in buying or selling property by scanning forums, social media, and other venues where property discussions take place.
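As a tiny example of the price-optimization use case, here's a sketch of how you might analyze the prices scraped earlier with pandas. It assumes you've exported your dataset as a zillow.csv file with the imageUrl, price, link, and address columns - the file name and price format are assumptions:
import re

import pandas as pd

df = pd.read_csv("zillow.csv")  # dataset exported from Apify in CSV format

def parse_price(text: str) -> float | None:
    # Pull the first dollar amount out of strings like "$1,500+/mo"
    match = re.search(r"\$([\d,]+)", str(text))
    return float(match.group(1).replace(",", "")) if match else None

df["price_usd"] = df["price"].apply(parse_price)

# Basic market snapshot of the listings you scraped
print(df["price_usd"].describe())  # count, mean, min, max, quartiles
print(df.nsmallest(5, "price_usd")[["address", "price_usd"]])  # cheapest listings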
Websites you can scrape for real estate data
Below are some popular real estate websites you can scrape for data. I've also highlighted how easy each site is to scrape and the kind of data it provides. The data typically includes listing details like price, photos, description, number of bedrooms/bathrooms, square footage, etc. Some sites provide additional data like neighborhood statistics, home value estimates, sale history, etc.
| Website | Ease of Scraping | Data Provided |
| --- | --- | --- |
| http://zillow.com/ | Moderate | Property listings, prices, details, estimates, neighborhood data |
| http://realtor.com/ | Difficult | Listings, prices, agent information, neighborhood data |
| http://redfin.com/ | Easy | Listings, prices, estimates, sale history, agent insights |
| http://rightmove.co.uk/ | Moderate | UK property listings, prices, descriptions, local area stats |
| http://idealista.com/ | Easy | Spanish/Portuguese listings, prices, details, neighborhood data |
| http://immobilienscout24.de/ | Difficult | German property listings, prices, descriptions, reports |
| http://funda.nl/ | Easy | Dutch residential listings, prices, photos, neighborhood data |
| http://cian.ru/ | Difficult | Russian real estate listings across types, prices, statistics |
| http://zoopla.co.uk/ | Moderate | UK listings, prices, area stats, value estimates |
How the classification was made:
- Easy: The website allows scraping and has few anti-scraping measures
- Moderate: Some anti-scraping measures like rate-limiting, CAPTCHA
- Difficult: Heavy anti-scraping protections like IP blocking, JS rendering
Conclusion and next steps
Now you know how to use the Apify platform and its Actor framework to build scalable web scrapers. You learned how to use Playwright to interact with Zillow web pages and extract valuable real estate data. You also saw some popular real estate websites you can collect data from, along with how difficult each one is to scrape.
Time to use what you learned here and try out this Zillow real estate data scraper on Apify Store!