Training a machine learning model with Zillow data for price prediction

Learn how to train a regression model to predict monthly rental pricing - from preparing and cleaning raw data to evaluating model performance.

Accurately predicting rental prices is a common regression problem with practical applications in real estate analytics. It requires handling multiple variables — location, size, amenities, property age — and building a model that generalizes well to new listings.

In this tutorial, we use Zillow listings from Cambridge, Somerville, and Boston as our machine learning training data to build a regression model for monthly rent. You’ll learn the full process: preparing and cleaning raw data, engineering features, and evaluating model performance — skills directly applicable to other machine learning prediction tasks. The Jupyter Notebook and dataset are available on GitHub or can be opened in Google Colab using the links below.

How to build a rental price prediction model

1) Setup

Install required libraries

To start, let’s install and import all the packages we’ll need for this tutorial. From here on out, every code example will follow Jupyter Notebook formatting, meaning each snippet represents its own cell, just like in a real notebook.

# Install required packages
%pip install pandas scikit-learn matplotlib joblib scikit-optimize apify-client python-dotenv

Import libraries

# Import all the libraries we'll need for this tutorial
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pathlib import Path

# Machine learning imports
from sklearn.model_selection import train_test_split, cross_validate, GridSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score

print("✅ All packages installed and libraries imported successfully!")

2) Data preparation

2.1 Data loading

As mentioned earlier, in this tutorial, we’ll work with real estate data scraped from rental listings in the Greater Boston area (Boston, Cambridge, and Somerville). The data was collected from Zillow using Apify’s Zillow Search Scraper, and you can download the dataset from the GitHub link here.

💡
You can also run the scraper yourself to get a dataset for any city you want. Go to the Scraping Zillow data for machine learning section to learn how.

With the dataset in hand, let’s load it and start exploring. Here’s what we expect to find:

  • Property details (bedrooms, bathrooms, area)
  • Location information (addresses, coordinates)
  • Rental prices (our target variable)

data_path = Path("./data/boston-rental-properties.json")
with open(data_path) as f:
    data = json.load(f)

print(f"🏠 Successfully loaded {len(data):,} rental listings!")

2.2 Data exploration

By now, you should have your data ready and loaded into your code. Let’s start by taking a peek at its structure to get a better sense of what we’re working with. Think of this step as getting to know the ingredients before cooking a meal.

To that end, let's examine a couple of sample records to understand what information we have available:

# Print the first two listings in the dataset
data[:2]

After running it, you'll get an output similar to the one below.

Output example
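
For reference, each record is a JSON object; a stripped-down sketch of the structure looks roughly like this (the values below are invented for illustration, and the real output contains many more fields):

{
    "address": "123 Example St APT 2, Cambridge, MA 02139",
    "addressCity": "Cambridge",
    "addressZipcode": "02139",
    "beds": 2,
    "baths": 1,
    "area": 850,
    "unformattedPrice": 2800,
    "latLong": {"latitude": 42.37, "longitude": -71.11}
}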

2.3 Creating a working dataset

The Apify Zillow Scraper gave us a well-structured, detailed dataset to work with, saving us a ton of time right from the start. But raw data alone isn’t ready to be fed into a model just yet.

We still need to pull out only the information that’s actually useful for predicting rent prices. Sticking with our kitchen analogy, this step is like picking the right ingredients from a fully stocked pantry.

What data are we looking for?

  • Property characteristics: beds, baths, area (square footage)
  • Location data: latitude, longitude (extracted from nested latLong field)
  • Target variable: unformattedPrice (what we want to predict)
  • Context: address, city, zipcode (for reference, though we won't use these in modeling)

Let's extract these fields and create a clean pandas DataFrame:

# First, let's see all available columns in our data
df_full = pd.DataFrame(data)
print(f"📋 All available columns ({len(df_full.columns)} total):")
print(list(df_full.columns))

# Now extract only the columns we need for rent prediction
print(f"\n🎯 Selecting relevant columns for our model...")
df_work = df_full[['address', 'addressCity', 'addressZipcode', 'beds', 'baths', 'area', 'unformattedPrice']].copy()

# Extract latitude and longitude from the nested latLong field
print(f"📍 Extracting location coordinates...")
latitude = df_full['latLong'].apply(lambda x: x.get('latitude') if isinstance(x, dict) else None)
longitude = df_full['latLong'].apply(lambda x: x.get('longitude') if isinstance(x, dict) else None)

# Add coordinates to our filtered dataframe
df_work['latitude'] = latitude
df_work['longitude'] = longitude

print(f"✅ Created working dataset with {len(df_work)} listings and {len(df_work.columns)} columns")
print(f"\n📊 Our working dataset preview:")
df_work.head()
Working dataset

The data is looking much more digestible, but we are not done. We selected our features, but now we're facing another issue: some rows are missing key data points that we will need to train our model. Let’s handle them in the next section.

3) Data cleaning

You might’ve noticed some pesky NaN values when we visualized the dataset earlier. Real-world data is rarely perfect. And just like you wouldn’t start cooking without washing your veggies (at least, I hope you wouldn’t), we shouldn’t train a model without giving our data a good clean first.

Why does cleaning matter?

  • Missing values can crash our model or give weird predictions
  • Outliers (like a 50-bedroom apartment) can throw off our model's learning
  • Impossible values (negative square footage) indicate data errors

How can we clean this dataset?

  1. Remove missing values: Drop listings missing critical information
  2. Remove outliers: Use statistical methods to identify and remove extreme values
  3. Validate data: Ensure all values make sense (positive areas, reasonable prices)

3.1 Handling missing values

Let's start by checking for missing data and dropping some “faulty” rows while we are at it:

# First, let's see how much missing data we have
missing_data = df_work.isnull().sum()
for column, missing_count in missing_data.items():
    if missing_count > 0:
        percentage = (missing_count / len(df_work)) * 100
        print(f"{column}: {missing_count} missing ({percentage:.1f}%)")

print(f"\n📊 Total listings before cleaning: {len(df_work):,}")

# Remove listings with missing critical data
listings_before_drop = len(df_work)
df_filtered = df_work.dropna()
listings_after_drop = len(df_filtered)

removed_count = listings_before_drop - listings_after_drop
removal_percentage = (removed_count / listings_before_drop) * 100

print(f"🗑️  Removed {removed_count:,} listings due to missing data ({removal_percentage:.1f}%)")
print(f"✅ Clean dataset now has {listings_after_drop:,} listings")
Cleaning the dataset

Over half our rows were missing at least one key feature, so we ended up tossing about 55% of the dataset. Sounds bad, but bad data hurts a model more than less data; quality wins over quantity.

In real projects, you’d usually fill in those gaps (imputation, predictions, extra data sources), but we’ll skip that here to keep things simple. Just keep in mind that handling missing data well can save more samples and boost model performance.
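
If you do want to try imputation instead of dropping rows, here’s a minimal sketch of what that could look like with scikit-learn’s SimpleImputer. We won’t use this going forward, and note that you should still drop rows missing the target price rather than impute it:

# Sketch only: median imputation for missing feature values (not used in this tutorial)
from sklearn.impute import SimpleImputer

# Never impute the target - drop listings without a price first
df_imputed = df_work.dropna(subset=['unformattedPrice']).copy()

# Fill missing numeric feature values with each column's median
feature_cols = ['beds', 'baths', 'area', 'latitude', 'longitude']
imputer = SimpleImputer(strategy='median')
df_imputed[feature_cols] = imputer.fit_transform(df_imputed[feature_cols])

print(f"Rows kept with imputation: {len(df_imputed):,} (vs. {len(df_filtered):,} after dropping)")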

3.2 Outlier detection and removal

The rows with missing data are gone, but they’re not the only threat to dataset quality. Outliers are common and dangerous in any dataset, especially user-generated data with little quality control, like Zillow listings, where users can enter wrong or unrealistic values.

What are outliers?

In short, outliers are data points that are extremely different from the typical values. For example:

  • A 20-bedroom apartment (might be a data entry error)
  • A $50,000/month studio (probably a typo)
  • A 10,000 square foot "apartment" (might be a commercial space)

How to deal with outliers?

There are many ways of handling outliers, and the right choice depends on the characteristics of the data you’re working with. For our use case, we’ll detect outliers with the IQR (Interquartile Range) method. Here’s how it works:

  • Calculate Q1 (25th percentile) and Q3 (75th percentile)
  • Anything below Q1 - 1.5×IQR or above Q3 + 1.5×IQR is considered an outlier

# Remove outliers using the IQR method
def remove_outliers_iqr(df, columns):
    """
    Remove outliers using the IQR (Interquartile Range) method.
    """

    df_clean = df.copy()

    print(f"\n🔧 Applying outlier removal:")

    for column in columns:
        # Calculate quartiles and IQR
        Q1 = df_clean[column].quantile(0.25)
        Q3 = df_clean[column].quantile(0.75)
        IQR = Q3 - Q1

        # Define outlier bounds
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR

        # Remove outliers
        before_count = len(df_clean)
        df_clean = df_clean[(df_clean[column] >= lower_bound) & (df_clean[column] <= upper_bound)]
        after_count = len(df_clean)

        removed = before_count - after_count
        print(f"   • {column}: removed {removed} outliers (bounds: {lower_bound:.0f} - {upper_bound:.0f})")

    # Additional business logic filter
    before_area_filter = len(df_clean)
    df_clean = df_clean[df_clean['area'] < 9000]  # Remove unreasonably large areas
    area_removed = before_area_filter - len(df_clean)
    if area_removed > 0:
        print(f"   • Additional filter: removed {area_removed} listings with area > 9000 sq ft")

    return df_clean

# Apply outlier removal
numeric_cols = ['beds', 'baths', 'area', 'unformattedPrice']
df_clean = remove_outliers_iqr(df_filtered, numeric_cols)

# Summary of cleaning results
initial_count = len(df_filtered)
final_count = len(df_clean)
total_removed = initial_count - final_count
removal_rate = (total_removed / initial_count) * 100

print(f"\n📊 CLEANING SUMMARY:")
print(f"   • Started with: {initial_count:,} listings")
print(f"   • Removed: {total_removed:,} outliers ({removal_rate:.1f}%)")
print(f"   • Final dataset: {final_count:,} clean listings")
print(f"   • Data retention rate: {(final_count/initial_count)*100:.1f}%")
Applying outlier removal

Outliers removed, and we’re left with 3,556 clean, complete listings out of the 10k we started with. Not bad, especially since we didn’t have to use any fancy data imputation tricks to get here.

4) Feature engineering

Now that our data is clean, it's time to prepare it for machine learning.

We’ll start by adding new features to our dataset through a process called feature engineering. This means creating new features from existing ones, essentially giving our model extra “clues” to help it make better predictions.

# Step 1: Select our base features for modeling
base_features = ['beds', 'baths', 'area', 'latitude', 'longitude']

# Step 2: Engineer additional features that might be helpful
df_clean['totalRooms'] = df_clean['beds'] + df_clean['baths']
print("✅ Created 'totalRooms' = beds + baths")

# Final feature list
final_features = base_features + ['totalRooms']
print(f"\n📊 Final feature set ({len(final_features)} features):")
print(final_features)

# Prepare our feature matrix (X) and target variable (y)
X = df_clean[final_features]
y = df_clean['unformattedPrice']

print(f"\n📈 Dataset Summary:")
print(f"   • Features (X): {X.shape[0]:,} samples × {X.shape[1]} features")
print(f"   • Target (y): {len(y):,} rent prices")
print(f"   • Target range: ${y.min():,.0f} - ${y.max():,.0f}")
print(f"   • Average rent: ${y.mean():,.0f}")

# Quick peek at our engineered dataset
print(f"\n👀 Feature Preview:")
X.head()
Feature engineering

And just like that, our newly engineered feature is part of the dataset. It might seem trivial now, but later you’ll see that even a single, simple but meaningful feature can make a noticeable difference in our model’s performance.

5) Modeling

5.1 Train/Test split + baseline model

Now that all features are in place, it’s time for one of the most important steps in building a model: splitting the data.

Train/Test Split means dividing our data into two parts:

  • Training set (80%): The data our model learns from
  • Test set (20%): The data we keep hidden until the end to see how well the model really performs on something new

This split is essential for catching overfitting: we want our model to handle fresh, never-before-seen rental listings, not just ace the “exam” by memorizing the practice questions.

# Create train/test split to evaluate our model properly
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2,      # 20% for testing, 80% for training
    random_state=42     # For reproducible results
)

print(f"📊 Split summary:")
print(f"   • Training set: {len(X_train):,} listings ({len(X_train)/len(X)*100:.0f}%)")
print(f"   • Test set: {len(X_test):,} listings ({len(X_test)/len(X)*100:.0f}%)")
print(f"   • Average rent in training: ${y_train.mean():,.0f}")
print(f"   • Average rent in test: ${y_test.mean():,.0f}")

# Quick baseline: Let's see how our first simple linear regression model performs
baseline_model = LinearRegression()
baseline_model.fit(X_train, y_train)

# Make predictions and evaluate
y_pred_baseline = baseline_model.predict(X_test)
baseline_mae = mean_absolute_error(y_test, y_pred_baseline)
baseline_r2 = r2_score(y_test, y_pred_baseline)

print(f"\n📈 Baseline Performance:")
print(f"   • MAE (Mean Absolute Error): ${baseline_mae:,.0f}")
print(f"   • R² (R-squared): {baseline_r2:.3f}")
print(f"\n💡 What this means:")
print(f"   • On average, our predictions are off by ${baseline_mae:,.0f}")
print(f"   • Our model explains {baseline_r2:.1%} of rent price variation")
print("   • Room for improvement - let\'s try more advanced models!")
Split summary

We kicked things off with a simple Linear Regression model, but with an R² of just 0.394 (explaining only 39.4% of rent price variation), it’s not exactly impressive. Next, we’ll try out a different model to see if we can do better.

5.2 Model comparison - Finding the best algorithm

Our baseline linear regression gave us a starting point, but its performance is far from ideal. When a model scores this low, it often means the underlying data patterns are more complex than the model can capture.

In cases like this, a good next step is to try different algorithms, compare them using cross-validation, and pick a “champion” model, one that gives us the strongest baseline for further tuning and optimization later on.

Why try different models?

  • Linear Regression assumes a straight-line relationship between features and price
  • Random Forest can capture complex, non-linear patterns (like "location matters more for expensive areas")
  • Real estate prices often have complex relationships that linear models miss

How to compare models

  • Use cross-validation (test on 5 different data splits) for reliable performance estimates
  • Compare MAE (average prediction error in dollars) and R² (explained variance)
  • Watch for overfitting because we want a model that generalizes, not one that just memorizes the training data

Let's see which algorithm works best for our data.

# Define our models to compare
models = {
    'Linear Regression': LinearRegression(),
    'Random Forest': RandomForestRegressor(
        n_estimators=100,       # Number of trees
        max_depth=10,           # Maximum tree depth  
        min_samples_split=5,    # Minimum samples to split
        random_state=42,        # For reproducible results
        n_jobs=-1               # Use all CPU cores
    )
}

def evaluate_models_with_cv(models, X, y, cv_folds=5):
    """
    Evaluate multiple models using cross-validation.

    Returns detailed performance metrics including overfitting analysis.
    """
    results = {}

    for name, model in models.items():
        print(f"🎯 Evaluating {name}...")

        # Perform cross-validation with multiple metrics
        cv_results = cross_validate(
            model, X, y, 
            cv=cv_folds,
            scoring=['neg_mean_absolute_error', 'r2'],
            return_train_score=True
        )

        # Calculate test metrics
        test_mae = -cv_results['test_neg_mean_absolute_error'].mean()
        test_mae_std = cv_results['test_neg_mean_absolute_error'].std()
        test_r2 = cv_results['test_r2'].mean()
        test_r2_std = cv_results['test_r2'].std()

        # Calculate train metrics (to check overfitting)
        train_mae = -cv_results['train_neg_mean_absolute_error'].mean()
        train_r2 = cv_results['train_r2'].mean()

        # Overfitting indicators
        overfit_mae = train_mae - test_mae  # Positive = overfitting
        overfit_r2 = train_r2 - test_r2    # Positive = overfitting

        # Store results
        results[name] = {
            'test_mae': test_mae,
            'test_mae_std': test_mae_std,
            'test_r2': test_r2,
            'test_r2_std': test_r2_std,
            'overfit_mae': overfit_mae,
            'overfit_r2': overfit_r2
        }

        # Print detailed results
        print(f"   📊 Test MAE: ${test_mae:,.0f} (±${test_mae_std:,.0f})")
        print(f"   📊 Test R²: {test_r2:.3f} (±{test_r2_std:.3f})\n")

    return results

# Run the model comparison
model_results = evaluate_models_with_cv(models, X_train, y_train)

# Create and display comparison summary
comparison_df = pd.DataFrame(model_results).T
comparison_df = comparison_df.round(2)

# Display the key metrics
print(comparison_df[['test_mae', 'test_r2', 'overfit_mae']])

# Identify the best model
best_model_name = comparison_df['test_r2'].idxmax()
best_r2 = comparison_df.loc[best_model_name, 'test_r2']
best_mae = comparison_df.loc[best_model_name, 'test_mae']

print(f"\n🥇 Best model: {best_model_name}")
print(f"   • R²: {best_r2:.3f} (explains {best_r2:.1%} of price variation)")
print(f"   • MAE: ${best_mae:,.0f} (average prediction error)")
Linear regression vs Random Forest model

From our comparison, the Random Forest model clearly outperformed Linear Regression on this dataset. For simplicity, we only compared it against one other model here, but the same approach can be used to test multiple algorithms against each other and choose the best baseline for your data.
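
If you want to extend the comparison, you can reuse the same evaluate_models_with_cv helper with extra candidates. For example (a sketch with two additional scikit-learn regressors that aren’t part of the original notebook):

# Sketch: compare additional algorithms with the same cross-validation helper
from sklearn.linear_model import Ridge
from sklearn.ensemble import GradientBoostingRegressor

more_models = {
    'Ridge Regression': Ridge(alpha=1.0),
    'Gradient Boosting': GradientBoostingRegressor(random_state=42),
}
more_results = evaluate_models_with_cv(more_models, X_train, y_train)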

5.3 Hyperparameter tuning

We’ve identified our best-performing algorithm: Random Forest. Now it’s time to fine-tune it and squeeze out as much performance as we can.

Hyperparameter tuning isn’t strictly necessary, especially when our baseline is already doing reasonably well without it, but it’s often worth the effort.

Even if the gains aren’t as dramatic as adding more data, engineering better features, or switching to a more suitable algorithm, tuning can still give your model a valuable performance boost.

Why bother tuning hyperparameters?

  • Squeeze out extra accuracy and stability from the model
  • Find the sweet spot between underfitting and overfitting
  • Adapt the model to the unique quirks of our dataset
  • Avoid wasting capacity. There’s no point in having 500 trees if 200 get you the same results

# Define parameter grid to search
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [10, 15, 20, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Establish a proper CV baseline for the true default RF
rf_default = RandomForestRegressor(random_state=42, n_jobs=-1)
cv_baseline = cross_validate(
    rf_default, X_train, y_train,
    cv=5,
    scoring=['neg_mean_absolute_error', 'r2'],
    return_train_score=False
)
baseline_cv_mae = -cv_baseline['test_neg_mean_absolute_error'].mean()
baseline_cv_r2 = cv_baseline['test_r2'].mean()

# Set up and run grid search - OPTIMIZING FOR MAE
print("🚀 Running GridSearchCV...")
grid_search = GridSearchCV(
    RandomForestRegressor(random_state=42, n_jobs=-1),
    param_grid,
    scoring={'neg_mae': 'neg_mean_absolute_error', 'r2': 'r2'},
    refit='neg_mae',   # refit on the parameters with the best MAE, since that's the metric we're optimizing
    cv=5,
    n_jobs=-1,
    verbose=1
)

grid_search.fit(X_train, y_train)

# Get best model and evaluate
best_rf = grid_search.best_estimator_

# Extract CV metrics for the best params
best_idx = grid_search.best_index_
cv_results = grid_search.cv_results_
tuned_cv_mae = -cv_results['mean_test_neg_mae'][best_idx]
tuned_cv_r2 = cv_results['mean_test_r2'][best_idx]

# Evaluate on the held-out test set once (final check)
y_pred_default = rf_default.fit(X_train, y_train).predict(X_test)
y_pred_tuned = best_rf.predict(X_test)

print("\n📊 CROSS-VALIDATION (5-fold) COMPARISON")
print(f"   Default RF (true defaults)  → CV MAE: ${baseline_cv_mae:,.0f} | CV R²: {baseline_cv_r2:.3f}")
print(f"   Tuned RF (best params)      → CV MAE: ${tuned_cv_mae:,.0f} | CV R²: {tuned_cv_r2:.3f}")

print("\n📊 TEST SET (held-out) COMPARISON")
print(f"   Default RF MAE: ${mean_absolute_error(y_test, y_pred_default):,.0f} | R²: {r2_score(y_test, y_pred_default):.3f}")
print(f"   Tuned RF MAE:   ${mean_absolute_error(y_test, y_pred_tuned):,.0f} | R²: {r2_score(y_test, y_pred_tuned):.3f}")
print(f"\n   Best params: {grid_search.best_params_}")
Optimizing the model to reduce MAE

In this case, we’re focusing on optimizing the model to reduce our MAE. For rental price prediction, a lower MAE means narrowing the average gap between the model’s predictions and the actual prices. We improved from an MAE of $351 to $349, a change that might look negligible at first glance.

But small gains still count, especially when they make the model more stable and reliable on unseen data. In real-world ML applications, even a modest 1% improvement can be worth its weight in gold.
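
If the metric still feels abstract, here’s a quick sanity check (not part of the original notebook) showing that MAE really is just the average dollar gap between predicted and actual rents:

# Sketch: compute MAE by hand to confirm it matches sklearn's mean_absolute_error
manual_mae = np.mean(np.abs(y_test - y_pred_tuned))
print(f"Manual MAE: ${manual_mae:,.0f}")  # should equal mean_absolute_error(y_test, y_pred_tuned)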

Why didn’t tuning give us bigger gains?

Hyperparameter tuning is valuable, but it’s not a magic switch that instantly doubles performance. Here’s why:

  • Diminishing returns – The biggest jumps usually come from better features or cleaner data, not just parameter tweaks.
  • Strong baseline – If your model is already performing well, there’s less room for dramatic improvement.
  • Trade-offs matter – Some settings boost accuracy but slow down predictions, which might not be worth it in production.
  • Random Forest is forgiving – Unlike algorithms such as gradient boosting, Random Forest isn’t overly sensitive to small parameter changes, so improvements tend to be incremental.

5.4 Feature importance

Now that our Random Forest is tuned, let’s peek under the hood to see what’s actually driving its predictions. A feature importance plot shows each feature’s relative contribution to the model’s output.

# Feature importance
feature_importance = pd.DataFrame({
    'Feature': X_train.columns,
    'Importance': best_rf.feature_importances_
}).sort_values('Importance', ascending=False)

# Visualize as a bar chart
import matplotlib.pyplot as plt
from matplotlib.ticker import PercentFormatter

top_n = min(20, len(feature_importance))
fi_top = feature_importance.head(top_n)

fig, ax = plt.subplots(figsize=(10, 6))
ax.barh(fi_top['Feature'], fi_top['Importance'], color='forestgreen', alpha=0.85)
ax.invert_yaxis()  # Highest importance at the top
ax.set_xlabel('Relative Importance')
ax.set_title('Random Forest Feature Importance')
ax.xaxis.set_major_formatter(PercentFormatter(1.0))  # RF importances sum to 1

# Annotate each bar
for i, v in enumerate(fi_top['Importance'].values):
    ax.text(v + 0.005, i, f"{v:.3f}", va='center')

plt.tight_layout()
plt.show()
Random Forest feature importance

Surprise, surprise, the feature we engineered earlier ended up topping the importance chart. But at first glance, this might feel contradictory. The least important features are the number of baths and beds individually, yet the most important one – totalRooms – is just the sum of those two “least important” features. Why?

In short, Random Forest cares about how features interact with the target, not just their standalone power. Beds and baths on their own might not have strong, consistent relationships with rent, maybe because some listings have huge bedrooms but low rent, or small units with luxury finishes. But combining them into totalRooms captures a more robust signal: overall space. This aggregated feature smooths out the noise and becomes a more reliable predictor.
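
If you want to double-check this interpretation, permutation importance is a useful complement to the built-in importances: it measures how much test-set performance drops when a single feature’s values are shuffled. Here’s a short sketch (not part of the original notebook):

# Sketch: permutation importance as a cross-check on the built-in feature importances
from sklearn.inspection import permutation_importance

perm = permutation_importance(best_rf, X_test, y_test, n_repeats=10, random_state=42, n_jobs=-1)
perm_df = pd.DataFrame({
    'Feature': X_test.columns,
    'Importance': perm.importances_mean
}).sort_values('Importance', ascending=False)
print(perm_df)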

5.5 Sanity-checking the model

Before we put too much trust in a model, it’s worth taking a quick peek “under the hood.” Plotting a few diagnostics can reveal where the model performs well, where it struggles, and whether its errors are balanced.

So, why bother?

  • Spot bias – If the model keeps overpricing cheaper rentals or underpricing expensive ones, we might be missing key features.
  • Catch error trends – If errors grow with price, we may need extra information (like location perks) or data transformations.
  • Compare models – The model with predictions hugging the diagonal and residuals clustered around zero usually comes out on top.

# Set up the visualization
fig, axes = plt.subplots(2, 2, figsize=(16, 12))
fig.suptitle('Linear Regression vs Tuned Random Forest', 
             fontsize=16, fontweight='bold')

# Prepare data for both models
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)
y_pred_lr = lr_model.predict(X_test)
y_pred_rf = best_rf.predict(X_test)

# Calculate metrics
lr_mae = mean_absolute_error(y_test, y_pred_lr)
lr_r2 = r2_score(y_test, y_pred_lr)
rf_mae = mean_absolute_error(y_test, y_pred_rf)
rf_r2 = r2_score(y_test, y_pred_rf)

# 1. Random Forest: Predicted vs Actual
axes[0, 0].scatter(y_test, y_pred_rf, alpha=0.6, color='forestgreen', s=30)
axes[0, 0].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 
               'r--', lw=2, label='Perfect Prediction')
axes[0, 0].set_xlabel("Actual Rent ($)", fontsize=12)
axes[0, 0].set_ylabel("Predicted Rent ($)", fontsize=12)
axes[0, 0].set_title(f"Tuned Random Forest\nMAE: ${rf_mae:,.0f} | R²: {rf_r2:.3f}", 
                    fontsize=13, fontweight='bold')
axes[0, 0].grid(True, alpha=0.3)
axes[0, 0].legend()

# 2. Random Forest: Residuals plot
residuals_rf = y_test - y_pred_rf
axes[1, 0].scatter(y_pred_rf, residuals_rf, alpha=0.6, color='forestgreen', s=30)
axes[1, 0].axhline(y=0, color='r', linestyle='--', lw=2)
axes[1, 0].set_xlabel("Predicted Rent ($)", fontsize=12)
axes[1, 0].set_ylabel("Residuals ($)", fontsize=12)
axes[1, 0].set_title("Random Forest Residuals", 
                    fontsize=13, fontweight='bold')
axes[1, 0].grid(True, alpha=0.3)

# 3. Linear Regression: Predicted vs Actual  
axes[0, 1].scatter(y_test, y_pred_lr, alpha=0.6, color='steelblue', s=30)
axes[0, 1].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 
               'r--', lw=2, label='Perfect Prediction')
axes[0, 1].set_xlabel("Actual Rent ($)", fontsize=12)
axes[0, 1].set_ylabel("Predicted Rent ($)", fontsize=12)
axes[0, 1].set_title(f"Linear Regression\nMAE: ${lr_mae:,.0f} | R²: {lr_r2:.3f}", 
                    fontsize=13, fontweight='bold')
axes[0, 1].grid(True, alpha=0.3)
axes[0, 1].legend()

# 4. Linear Regression: Residuals plot
residuals_lr = y_test - y_pred_lr
axes[1, 1].scatter(y_pred_lr, residuals_lr, alpha=0.6, color='steelblue', s=30)
axes[1, 1].axhline(y=0, color='r', linestyle='--', lw=2)
axes[1, 1].set_xlabel("Predicted Rent ($)", fontsize=12)
axes[1, 1].set_ylabel("Residuals ($)", fontsize=12)
axes[1, 1].set_title("Linear Regression Residuals", 
                    fontsize=13, fontweight='bold')
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.subplots_adjust(top=0.90)
plt.show()
Linear regression vs tuned Random Forest

What the Random Forest plots tell us (and how it compares with Linear Regression)

  • Predicted vs. Actual (top left) – The green points hug the diagonal much more tightly than the blue points from Linear Regression, showing what a “better fit” looks like in practice. Most mid-range prices line up well, though the model still underestimates some of the highest rents and slightly overestimates a few at the lower end. This suggests there are factors, like luxury amenities or very specific neighborhood effects, that our current features don’t fully capture.
  • Residuals vs. Predicted (bottom left) – Residuals stay centered around zero with no strong patterns, which is exactly what we want. Compared to Linear Regression’s residuals, the spread here is tighter and more balanced, especially in the $3,000–$4,000 range, where our data is densest. At the extremes, errors widen a bit, again hinting at missing features for those outlier cases.
💡
Takeaway: The point of having these plots side by side is to highlight the visual difference between a solid model and one that’s underperforming. The Random Forest gives us more stable, accurate predictions for most listings, while also giving us clear hints about where additional features could help, especially at the pricing extremes.
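
One way to quantify that “errors widen at the extremes” observation (a quick sketch, not part of the original notebook) is to compute the average error separately for each price band:

# Sketch: average absolute error per price quartile of the test set
errors = (y_test - y_pred_rf).abs()
price_band = pd.qcut(y_test, q=4, labels=['cheapest 25%', 'lower mid', 'upper mid', 'priciest 25%'])
print(errors.groupby(price_band).mean().round(0))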

6) The results

6.1 Model performance

In this guide, we trained a rental price prediction model using structured Zillow data collected with an Apify scraper. Our resulting model achieved an average prediction error (MAE) of about $349 and explains roughly 64% of price variation. Not bad at all, given our limited dataset and just a touch of feature engineering.

6.2 Key takeaways

This tutorial demonstrated how effective it is to start with well-structured data, and how much time you save when you don’t have to wrestle with web scraping yourself. Apify Store has over 6,000 ready-made scrapers just like the one we used here; that's plenty of raw material for your next ML project.

7) How to improve further

Our final plots showed the model holding up pretty well, but they also hinted at where it struggles, and how we could improve it. If you’re up for a challenge, here are a few next steps:

a. Gather more data

A bigger, more diverse dataset helps the model generalize better. Try adding listings from Apartments.com or even Airbnb, depending on your use case.

b. Enrich what you have

We dropped over half our original 10,000 listings due to missing key info. Filling those gaps could dramatically increase the amount of training data: you could impute missing values, predict them from other fields, or pull the missing details from additional data sources.

c. Double down on feature engineering

We only added one new feature here, but you can create many more. For example, distance to major points of interest, like universities, parks, or transit hubs, often correlates strongly with rent and could give the model a serious boost.
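
As an illustration of that last idea, here’s a sketch of a distance-based feature using the haversine formula (the downtown Boston coordinates below are approximate and just for demonstration):

# Sketch: straight-line distance (km) from each listing to an approximate downtown Boston point
def haversine_km(lat, lon, lat0, lon0):
    """Great-circle distance in kilometers between (lat, lon) and (lat0, lon0)."""
    lat, lon, lat0, lon0 = map(np.radians, [lat, lon, lat0, lon0])
    a = np.sin((lat - lat0) / 2) ** 2 + np.cos(lat) * np.cos(lat0) * np.sin((lon - lon0) / 2) ** 2
    return 2 * 6371 * np.arcsin(np.sqrt(a))

df_clean['distToDowntown'] = haversine_km(df_clean['latitude'], df_clean['longitude'], 42.3601, -71.0589)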

8) Scraping Zillow data for machine learning

In this article, we’ve been working with a Zillow listings dataset pulled using Apify’s Zillow Search Scraper. But let’s be honest, you probably don’t care that much about predicting rent prices in the Greater Boston area. You’d rather do it for your own city, right?

In this final section, I’ll show you exactly how I gathered the Greater Boston dataset so you can grab data for any location you want to build your own prediction model.

8.1 Apify scraper setup

You can access the scraper we’re using here.

If you don’t have an Apify account yet, you can create one for free.

Once you’re logged in, you’ll see the scraper’s UI, similar to the one below. The inputs are simple; you just need a Zillow search URL with the filters and settings you want. From there, click Start to begin scraping, and in just a few minutes, your dataset will be ready to export.

Below, I’ll also show you how to use the Apify Python client to call this scraper programmatically and pull the data directly into your code.

Zillow Search Scraper - Apify Console

Here’s where you’ll grab the Zillow URL to give the scraper as input. In the example below, we’re looking at a Zillow search for rentals in Cambridge, MA.

Zillow search for rentals in Cambridge, MA

Now that your input URL is ready and your Apify account is set up, the next step is to grab your APIFY_API_TOKEN and save it in a .env file for the upcoming code snippets to work.

To find your API token, head to your Apify Console, go to Settings → API & Integrations, and copy your personal token.

Apify API tokens
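
The .env file itself only needs a single line (replace the placeholder with your actual token):

# .env
APIFY_API_TOKEN=your_apify_api_token_here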

8.2 Using the Apify Python client

With your APIFY_API_TOKEN safely stored in a .env file, you’re ready to run the scraper using the Apify Python client (the apify-client package we installed earlier). Just swap out the dummy URL in the code below for your own.

from apify_client import ApifyClient
from dotenv import load_dotenv
from pathlib import Path
import os, json

load_dotenv()
client = ApifyClient(os.getenv("APIFY_API_TOKEN"))

run_input = {
    "searchUrls": [
        {
            # Paste any Zillow search URL you’d run in the browser
            "url": "https://www.zillow.com/boston-ma/rentals/?searchQueryState=..."
        }
    ]
}

run = client.actor("maxcopell/zillow-scraper").call(run_input=run_input)
items = list(client.dataset(run["defaultDatasetId"]).iterate_items())

Path("data").mkdir(exist_ok=True)
with open("data/zillow_rentals.json", "w") as f:
    json.dump(items, f)

data = items
print(f"🏠 Scraped {len(data):,} rental listings!")

You now have a dataset for your city, similar to the one we used in this tutorial.

Use Apify for machine learning training data

Zillow Scraper isn't the only tool on Apify Store that collects web data for training machine learning models. There are over 6,000 tools built to extract data from specific websites at scale. You can use that data to give ML applications the information they need to learn and perform the tasks you're training them for.
