Extracting bulk information from websites that are not designed for bulk data access is no easy task.
That's why I'll explain everything you need to know about scraping real estate data.
I'll cover the process from start to finish, including choosing the right tools and efficiently parsing the data you need.
Web scraping in real estate
Web scraping is the process of automatically extracting useful data from a website with the help of programs called web scrapers.
Real estate web scraping helps gather current information in that industry to improve market awareness, manage online reputation, develop standout content, stay ahead of your competitors, analyze market trends faster, and eventually achieve greater business predictability.
How to scrape real estate data on Zillow with Python
In this section, I'll walk you through a comprehensive step-by-step guide on how to scrape real estate data from Zillow.
You'll learn how to build your scraper using the Apify SDK for Python, a toolkit that simplifies building Apify Actors and scrapers in Python. I'll use the Playwright + Chrome template specifically to build the Zillow scraper with Python.
1. Prerequisites and preparing your environment
To follow along, you need to satisfy the following conditions:
Each file in the .actor folder performs a specific function. Below is a description of those files:
actor.json: The .actor/actor.json file, located at the root of your Actor's directory, is where you define the main configuration for your Apify Actor. This file acts as the connection between your local development environment and the Apify platform. It contains important details like your Actor's name, version, build tag, and environment variables.
Dockerfile: The Dockerfile specifies the base image your Actor will be built on. Apify offers various base images suited to different Actor types and requirements.
input_schema.json: The input_schema.json file, defined as a JSON object, outlines the expected input for your Apify Actor. This schema specifies the field types supported by Apify, allowing the platform to generate a user-friendly interface for your Actor.
Because we're focusing only on building your real estate data Actor, you won't modify any of these files. You'll make changes only in src/main.py and storage/key_value_stores/default/INPUT.json.
3. Using Chrome DevTools to inspect the Zillow search page
Now, I'll explain how to use Chrome DevTools to understand Zillow’s site structure to extract data about properties for sale and rent.
Before scraping the Zillow search page, you need to inspect it. To do this, follow these steps:
Open the Zillow website in an incognito tab in your browser
Enter the location where you want to search for real estate
Hit Enter
You'll be taken to a page with real estate listings that match your search
Inspect the page in Chrome by pressing Ctrl + Shift + I on Windows or Cmd + Option + I on macOS. This opens the DevTools Inspect window
The following HTML elements and CSS classes represent the data to be scraped from each card on the Zillow search page:
$(".photo-cards"): represents the wrapper or container for all list item cards
$("article.property-card img").src: returns the first image URL of each card
$("span[data-test=property-card-price]").textContent: returns the price and apartment availability
$("a.property-card-link").href: returns the URL of each home
$("address[data-test=property-card-addr]").textContent: returns the address of each home
Now that you know the HTML elements and CSS classes to target to get the data you need, in the next section, you will learn how to use Playwright and Chrome to build your Actor.
4. Building your Zillow Actor
Now it's time to start writing code and playing around with files specific to the Apify SDK.
The INPUT.json file is used to pass input to the Apify SDK. Open INPUT.json and replace its content with the following code:
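A plausible INPUT.json looks like this (the url value below is illustrative; substitute the search URL for your own location):

```json
{
    "url": "https://www.zillow.com/los-angeles-ca/",
    "homepage": "https://zillow.com"
}
```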
In the code snippet above, the url represents the search page, while the homepage corresponds to the homepage URL of the Zillow website. This will be useful later in our code.
Next, you'll update the code in the src/main.py file.
Start by deleting its existing content, as you'll be building your Zillow Actor from scratch.
Next, paste the code below. This snippet reads the input URL using the Actor.get_input() method, which lets you access the input record in the Actor's default key-value store.
```python
from apify import Actor


async def main() -> None:
    async with Actor:
        actor_input = await Actor.get_input() or {}
        url = actor_input.get('url')
        homepage = actor_input.get('homepage')
        if not homepage:
            Actor.log.info('Homepage not specified, using default value...')
            homepage = "https://zillow.com"
        if not url:
            Actor.log.info('No URL specified in actor input, exiting...')
            await Actor.exit()
```
In the code snippet above, you're reading the url and homepage values from the Actor's input, which comes from the INPUT.json file.
If no homepage value is set, https://zillow.com is assigned as the default.
5. Writing your code
Now, you need to write the code to complete your Actor functionality.
The Zillow real estate data extractor will scrape the following information: imageUrl, price, link, and address from the search page.
```python
import random
from typing import Any

from apify import Actor
from playwright.async_api import async_playwright


async def main() -> None:
    async with Actor:
        # Structure of input is defined in input_schema.json
        actor_input = await Actor.get_input() or {}
        url = actor_input.get('url')
        homepage = actor_input.get('homepage')
        if not homepage:
            Actor.log.info('Homepage not specified, using default value...')
            homepage = "https://zillow.com"
        if not url:
            Actor.log.info('No URL specified in actor input, exiting...')
            await Actor.exit()

        data: list[dict[str, str | Any]] = []  # Define an empty list
        user_agent_strings = [
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.2227.0 '
            'Safari/537.36',
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 '
            'Safari/537.36',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) '
            'Chrome/109.0.3497.92 Safari/537.36',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 '
            'Safari/537.36',
        ]

        async with async_playwright() as p:
            # Launch a headless Chromium browser instance
            browser = await p.chromium.launch(headless=True)
            # Create a new browser context with a randomly selected user agent string
            context = await browser.new_context(
                user_agent=user_agent_strings[random.randint(0, len(user_agent_strings) - 1)])
            page = await context.new_page()
            await page.goto(url)
            # Wait for all cards to load
            await page.wait_for_selector(".photo-cards")
            # Extract data from each card
            card_elements = await page.query_selector_all(".property-card")
            for card in card_elements:
                image_url = await card.query_selector("img")
                price_size = await card.query_selector("span[data-test=property-card-price]")
                link = await card.query_selector("a.property-card-link")
                address = await card.query_selector("address[data-test=property-card-addr]")
                card_details = {
                    "imageUrl": await image_url.evaluate("el => el.src"),
                    "price": await price_size.inner_text(),
                    "link": homepage + await link.get_attribute("href"),  # Get full URL
                    "address": await address.inner_text()
                }
                # Append card_details to data
                data.append(card_details)
                Actor.log.info(f"Extracted details: {card_details}")
            await browser.close()

        # Save the scraped data to the Apify Dataset
        await Actor.push_data(data)
```
In the code snippet above, you're doing the following:
1. Initializing data structure:
You create an empty list data to store the scraped apartment information.
2. Launching Playwright in an asynchronous context:
You use an async with statement to manage Playwright resources asynchronously.
Inside the context, you launch a headless Chromium browser instance using p.chromium.launch(headless=True).
You create a new page object (page) to interact with the webpage.
You navigate the page to the target Zillow URL using page.goto(url).
3. Waiting for content and extracting cards:
You wait for the element with class .photo-cards to load, ensuring all apartment cards are available on the page.
You use page.query_selector_all(".property-card") to find all elements representing individual apartment listings.
4. Iterating through each card and extracting details (loop):
You iterate through each card element in the previously retrieved list card_elements.
For each card, you use:
card.query_selector to target specific elements within the card:
img element for the image URL.
span[data-test=property-card-price] element for price information.
a.property-card-link element for the apartment details link.
address[data-test=property-card-addr] element for the address information.
You use evaluate on the retrieved image_url element to extract its src attribute.
You use inner_text on the retrieved price_size element to extract its visible text content.
You construct the full URL for the apartment details by combining the base URL (homepage) with the relative URL retrieved using get_attribute("href") on the link element.
You use inner_text on the retrieved address element to extract its visible text content (address).
5. Organizing extracted data and appending it to a list:
You create a dictionary card_details to store the extracted information for each apartment, including:
imageUrl: The image URL obtained using evaluate.
price: The price and size information obtained using inner_text.
link: The full URL for apartment details, constructed earlier.
address: The address information obtained using inner_text.
You append the card_details dictionary to the data list, accumulating scraped information for each apartment.
You log the extracted details for each apartment using Actor.log.info.
6. Saving the scraped data and closing resources:
You use Actor.push_data(data) to save the scraped apartment information (data) to the Apify Dataset.
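One caveat about the link construction in step 4: plain string concatenation (homepage + href) breaks if the href is ever an absolute URL rather than a relative path. A more robust sketch uses the standard library's urljoin (the function name here is illustrative):

```python
from urllib.parse import urljoin


def build_listing_url(homepage: str, href: str) -> str:
    """Join a (possibly relative) card href with the site's homepage.

    urljoin handles both relative hrefs ('/homedetails/...') and
    absolute ones ('https://www.zillow.com/...') correctly, which
    plain string concatenation does not.
    """
    return urljoin(homepage, href)


print(build_listing_url("https://zillow.com", "/homedetails/123-Main-St/"))
# → https://zillow.com/homedetails/123-Main-St/
```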
6. How to run your Zillow real estate Actor
To run your Actor locally, run the Apify CLI command `apify run` in your terminal. When you're ready, deploy it to the Apify cloud platform with `apify push`.
After successfully deploying your Actor to the Apify cloud platform, you can see your newly deployed Actor on the My Actors page.
You can export your dataset by clicking the “Export” button. The supported formats include JSON, CSV, XML, Excel, HTML Table, RSS, and JSONL.
Select the file format you want your data to be in, then click “Download”.
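If you'd rather post-process the results yourself, records in the shape this scraper produces can be converted to CSV with Python's standard library. The sample item below is purely illustrative:

```python
import csv
import io

# An illustrative record in the shape the scraper produces
items = [
    {
        "imageUrl": "https://photos.zillowstatic.com/example.jpg",
        "price": "$450,000",
        "link": "https://zillow.com/homedetails/example/",
        "address": "123 Main St, Los Angeles, CA",
    },
]

# Write the records to an in-memory CSV buffer
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["imageUrl", "price", "link", "address"])
writer.writeheader()
writer.writerows(items)
print(buf.getvalue())
```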
See the complete code for this article with comments hosted on GitHub Gist.
Check out Apify’s ready-made Zillow Scraper, which allows you to extract data about properties for sale and rent on Zillow using the Zillow API but with no daily call limits.
Use cases of web scraping in real estate
Price optimization (comparison and monitoring): Real estate businesses can use competitors' data to understand market values and customer expectations, which can help them optimize prices.
Property market research: Web data from property listings, real estate agencies, and public records can be used to identify undervalued properties, areas with projected growth, and emerging hotspots.
Investment scouting: Real estate web scraping can provide insights to help real estate businesses make data-driven investments.
Lead generation: Web scraping can be used to generate leads for marketing to potential clients. Real estate companies can identify individuals who are interested in buying or selling property by scanning forums, social media, and other venues where property discussions take place.
Websites you can scrape for real estate data
Below are some popular real estate websites you can scrape for real estate data, along with how easy they are to scrape and the kind of data they provide. The data typically includes listing details like price, photos, description, number of bedrooms/bathrooms, square footage, etc. Some sites provide additional data like neighborhood statistics, home value estimates, and sale history.
Easy: The website allows scraping and has few anti-scraping measures.
Moderate: Some anti-scraping measures, like rate limiting or CAPTCHAs.
Difficult: Heavy anti-scraping protections, like IP blocking and mandatory JavaScript rendering.
Conclusion and next steps
Now you know how to use the Apify platform and its Actor framework for building scalable web scrapers. You learned how to use Playwright to interact with Zillow web pages and extract valuable real estate data. You also saw some popular real estate websites from which you can collect valuable data, and the complexity of scraping data from these websites.
Time to use what you learned here and try out this Zillow real estate data scraper on Apify Store!