In this article, you will learn how to build a simple scraper to extract valuable information like job titles, companies, job URLs, and locations on LinkedIn. You'll also learn the intricacies of crawling LinkedIn jobs using Python.
Imagine collecting, analyzing, and visualizing information such as job openings without lifting a finger. Automated web scraping allows you to do this. Web scrapers can be useful for recruiters, salespeople, and marketers to gather data on potential clients or employees.
If you're as excited about learning LinkedIn job scraping as I am about writing about it, let's dive right in 🚀
Prerequisites and preparing your environment
To follow along with this article and understand the content and code samples showcased here, you need to satisfy the following:
Be comfortable navigating browser DevTools to find and select page elements.
Have a text editor installed on your machine, such as VSCode, PyCharm, or any editor of choice.
Basic terminal/command line knowledge to run commands for initializing projects, installing packages, deploying sites, etc.
Apify CLI installed globally by running this command: npm -g install apify-cli. Check out other installation methods in our documentation.
An account with Apify. Create a new account on the Apify platform.
Assuming you satisfy the prerequisites above, let's begin by setting up your development environment for scraping LinkedIn jobs.
Building a simple scraper to crawl smaller or less established websites is relatively straightforward and minimally complex. However, when you try web scraping on websites with greater traffic, such as LinkedIn, with dynamically loading pages and JavaScript, you may be faced with a set of challenges.
Beyond merely scraping LinkedIn jobs, you need to get past LinkedIn's sophisticated anti-scraping measures. This requires you to scrape responsibly and adopt techniques such as proxy and IP rotation, rate limiting, and respecting robots.txt to avoid detection and being blocked by LinkedIn's firewalls.
The Actor templates allow you to quickly build the foundation of your scraper and benefit from the Apify platform's features right from the start. This saves you valuable development time and gets you scraping faster. Head over to the Actor templates repository and choose one.
Libraries included in the template
Beautiful Soup is a Python library for extracting data from HTML and XML files. It requires minimal code and presents itself as a lightweight option for efficiently tackling basic scraping tasks.
HTTPX offers a comprehensive set of features for making HTTP requests in Python. It allows for both synchronous and asynchronous programming styles and even has an integrated command-line client.
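To get a feel for how the two libraries fit together before touching the Actor code, here's a minimal, standalone sketch. The URL and function name are just placeholders for illustration:

import asyncio

from bs4 import BeautifulSoup
from httpx import AsyncClient

async def fetch_page_title(url: str) -> None:
    # Fetch the page asynchronously with HTTPX
    async with AsyncClient() as client:
        response = await client.get(url, follow_redirects=True)
    # Parse the HTML with Beautiful Soup and print the <title> tag
    soup = BeautifulSoup(response.text, 'html.parser')
    print(soup.title.text.strip() if soup.title else 'No title found')

asyncio.run(fetch_page_title('https://www.example.com'))

This is exactly the fetch-then-parse pattern the Actor template uses, just without the Apify plumbing around it.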
Running your Actor locally
To start using the Apify template to scaffold your project, run the following command in your terminal:
apify create python-actor-example -t python-start
The above command does the following:
Uses the Apify CLI to create a new folder named python-actor-example using the python-start template from Apify.
Installs all necessary libraries, as shown in the screenshot below. This will take a couple of minutes.
Below is the folder structure generated by the template:
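In text form, the generated project looks roughly like this (the exact files can vary slightly between template versions; requirements.txt, Dockerfile, and README.md shown here are typical, not guaranteed):

python-actor-example/
├── .actor/                      # Actor configuration (actor.json, input_schema.json, Dockerfile)
├── src/
│   ├── __main__.py              # entry point that runs main()
│   └── main.py                  # the scraping logic you'll edit
├── storage/
│   └── key_value_stores/
│       └── default/
│           └── INPUT.json       # local input for test runs
├── requirements.txt             # Python dependencies
└── README.md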
Each of these files and folders has its own function. I covered what each one does, especially the .actor folder, in my earlier article on Automating Data Collection with Apify: From Script to Deployment.
To run your Actor, navigate to the newly created Actor's folder and run it locally. Run the following commands to do this:
cd python-actor-example
apify run
The above commands change your working directory to python-actor-example and then use the Apify CLI run command to run the scraper locally.
In the next section, I'll walk you through making changes to some of these files. You'll also learn how to use Chrome DevTools to understand LinkedIn’s site structure and scrape LinkedIn jobs.
2. Using Chrome DevTools to understand LinkedIn’s site structure
There are two ways to search for LinkedIn jobs:
Advanced search (with cookies): Search as an authenticated user (you'll need to provide your LinkedIn cookies)
Basic search (without cookies): Search as an unauthenticated user (this entails crawling through the LinkedIn job search page as a visitor; you won't need to provide your cookies). This is safer and supports proxy rotation for faster scraping with concurrency.
We'll focus on the second method: scraping LinkedIn job data as a visitor.
Apply the filters you need to narrow down your search. In this case, I'm searching for jobs that match 'Software Developer' in 'United States', posted 'Any time'.
Once you're happy with the results, copy the entire URL from your browser's address bar (see screenshot below).
Open Chrome DevTools by pressing F12 or right-clicking anywhere on the page and choosing Inspect. Inspecting the page from the URL above reveals the following:
Querying these elements in the DevTools console returns the respective details about each job posting: job title, company name, job URL, job location, and date posted.
This information will constitute the data returned by the LinkedIn job scraper.
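If you prefer to verify those selectors from Python rather than the console, here's a rough equivalent using the same classes we'll rely on later in this article. The search URL is a placeholder; use the one you copied from your own search:

import asyncio

from bs4 import BeautifulSoup
from httpx import AsyncClient

async def preview_first_job(search_url: str) -> None:
    # Fetch the search results page as a visitor (no cookies)
    async with AsyncClient() as client:
        response = await client.get(search_url, follow_redirects=True)
    soup = BeautifulSoup(response.text, 'html.parser')

    card = soup.find('div', class_='base-card')  # first job posting card
    if card is None:
        print('No job cards found')
        return
    title = card.find('h3', class_='base-search-card__title')
    company = card.find('a', class_='hidden-nested-link')
    location = card.find('span', class_='job-search-card__location')
    print(title.text.strip() if title else None,
          company.text.strip() if company else None,
          location.text.strip() if location else None, sep=' | ')

# Replace the URL below with the one you copied from your browser
asyncio.run(preview_first_job('https://www.linkedin.com/jobs/search?keywords=Software%20Developer&location=United%20States'))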
4. Dealing with infinite scrolling
LinkedIn is one of those social media websites that adopted infinite scrolling. Infinite scrolling replaces pagination to improve the user experience and increase engagement. While infinite scrolling has numerous advantages, it poses serious issues for web scraping.
You might have noticed that as soon as you scroll down on the LinkedIn job search page, additional content keeps loading. This means that new job listings are loaded dynamically as you scroll down the page, making it difficult to extract data from all available jobs in a single request.
To solve this infinite scroll problem, I'll modify the search page URL using what the DevTools Network tab reveals.
Using DevTools to bypass infinite scroll
Open the Network tab: Open your browser's developer tools and navigate to the Network tab. It displays all the requests the page makes as it loads.
Search for job requests: When you scroll down on the LinkedIn jobs page, new requests appear in the Network tab. Look for requests related to fetching new job listings; these contain the keyword "search" in the URL.
Identify pagination parameters: Once you've identified the relevant request, click on it and examine the details. Look for parameters in the URL that change as you scroll down. You'll notice that the "start" parameter begins at 0 and increases as more jobs are loaded onto the page (see the paging sketch below).
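Based on that observation, a rough way to page through results is to append an increasing start offset to the search URL you copied. The page size of 25 and the "empty response means no more results" stop condition below are assumptions you should verify against what you see in the Network tab:

import asyncio

from httpx import AsyncClient

async def fetch_all_pages(search_url: str, max_jobs: int = 100) -> list:
    """Fetch successive result pages by increasing the 'start' offset."""
    pages_html = []
    async with AsyncClient() as client:
        for start in range(0, max_jobs, 25):  # 25 results per page is an assumption
            separator = '&' if '?' in search_url else '?'
            response = await client.get(f'{search_url}{separator}start={start}', follow_redirects=True)
            if not response.text.strip():
                break  # no more results came back, stop paging
            pages_html.append(response.text)
    return pages_html

Each returned HTML page can then be parsed with Beautiful Soup exactly like the single page in the next section.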
In the next step, we'll focus on the main purpose of this article: how to scrape LinkedIn job postings.
5. Scraping LinkedIn job data
You'll need to revisit the Actor you created earlier. Open the INPUT.json file at ./storage/key_value_stores/default/INPUT.json and replace https://apify.com/ with the LinkedIn job search page URL.
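Using the search from earlier, the file might end up looking something like this (the exact URL depends on the filters you applied):

{
    "url": "https://www.linkedin.com/jobs/search?keywords=Software%20Developer&location=United%20States"
}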
Now, open your ./src/main.py file, and let's make some changes to this file to allow scraping of the different job details about each job posting.
Copy and paste this:
# main.py

# leave other imports as they are

async def main() -> None:
    """
    The main coroutine is being executed using `asyncio.run()`, so do not attempt to make a normal function
    out of it, it will not work. Asynchronous execution is required for communication with the Apify platform,
    and it also enhances performance in the field of web scraping significantly.
    """
    async with Actor:
        # Structure of input is defined in input_schema.json
        actor_input = await Actor.get_input() or {}
        url = actor_input.get('url')

        # Create an asynchronous HTTPX client
        async with AsyncClient() as client:
            # Fetch the HTML content of the page
            response = await client.get(url, follow_redirects=True)

            # Parse the HTML content using Beautiful Soup
            soup = BeautifulSoup(response.content, 'html.parser')

            # Extract all job details from the page
            jobs = []
            # Job details are within the ".base-card" class
            for job_card in soup.find_all("div", class_="base-card"):
                job_details = {}

                # Extract job title
                job_title = job_card.find("h3", class_="base-search-card__title")
                if job_title:
                    job_details["title"] = job_title.text.strip()
                else:
                    job_details["title"] = None  # Handle cases where title is missing

                # Extract company name
                company_name = job_card.find("a", class_="hidden-nested-link")
                if company_name:
                    job_details["company"] = company_name.text.strip()
                else:
                    job_details["company"] = None  # Handle cases where company name is missing

                # Extract job URL
                job_url = job_card.find("a", class_="base-card__full-link")
                if job_url:
                    job_details["job_url"] = job_url["href"]
                else:
                    job_details["job_url"] = None  # Handle cases where job URL is missing

                # Extract job location
                job_location = job_card.find("span", class_="job-search-card__location")
                if job_location:
                    job_details["location"] = job_location.text.strip()
                else:
                    job_details["location"] = None  # Handle cases where location is missing

                # Extract date posted (assuming class reflects posting time)
                date_posted = job_card.find("time", class_="job-search-card__listdate--new")
                if date_posted:
                    job_details["date_posted"] = date_posted.text.strip()
                else:
                    job_details["date_posted"] = None  # Handle cases where date is missing

                Actor.log.info(f"Extracted job details: {job_details}")
                jobs.append(job_details)

            # Save jobs data to Apify Dataset
            await Actor.push_data(jobs)
What the above code snippet does
The url corresponds to the URL from the INPUT.json file. The code fetches that page and then uses Beautiful Soup to find all elements that represent individual job listings, based on the "base-card" class, which is the parent CSS class for each job card.
I loop through each job listing and extract the job title, job URL, company name, job location, and date posted, storing None when any of these details is missing. Finally, I use the Apify SDK to save the list of extracted job details into a dataset.
To run your scraper, use the Apify CLI again from the project folder:
apify run
Next, we'll go through how to export the scraped data to a CSV file.
Data cleaning is important because it ensures you have data of the highest quality, which can prevent errors, increase productivity, and improve data analysis and decision-making.
For this reason, you'll also need to import the re library, Python's built-in regular expression module.
Update the code above to the following:
# other imports remain unchanged
# Import libraries for data cleaning
import re
import csv

# ...

            # Save jobs data to a CSV file (after the loop that builds `jobs`)
            with open("scraped_linkedin_jobs.csv", "w", newline="", encoding="utf-8") as csvfile:
                fieldnames = jobs[0].keys()  # Get field names from the first job dictionary
                writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
                writer.writeheader()
                writer.writerows(jobs)

            # Save jobs data to Apify Dataset
            await Actor.push_data(jobs)
Inside the loop that builds each job_details dictionary, I added a block to clean the data. It iterates through each key-value pair in the dictionary.
If a value exists, it uses a regular expression (re.sub) to strip out any character that isn't a word character (\w) or whitespace (\s), replacing it with an empty string (''). This helps ensure cleaner data for the CSV file.
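That cleaning block isn't reproduced in the snippet above, but a minimal sketch of what it could look like, assuming a [^\w\s] pattern, is:

# Inside the for loop, just before jobs.append(job_details):
for key, value in job_details.items():
    if value:
        # Strip any character that isn't a word character or whitespace
        # (note: this also removes characters like ':' and '/' from the job URL)
        job_details[key] = re.sub(r'[^\w\s]', '', value)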
I opened a CSV file for writing with open(). I then retrieved the field names (column headers) from the keys of the first job_details dictionary.
Using the csv library, I created a DictWriter object specifying the fieldnames and wrote the header row (writer.writeheader()).
Finally, I used writer.writerows(jobs) to write all the job details dictionaries to the CSV file.
I'm still using the Apify SDK to save the list of extracted job details into a dataset.
7. Bypassing anti-bot detection
When scraping small or less established websites, you might not face any blocking issues. But if you try to scrape established websites or dynamically loading pages with JavaScript, you might run into issues such as blocked requests, IP address blocking and blacklisting, Cloudflare errors (403, 1020), and more.
To bypass these restrictions, you must employ some responsible behaviors while scraping. These include the techniques mentioned earlier: rotating proxies and IP addresses, rate limiting your requests, and respecting robots.txt.
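As one small illustration (not LinkedIn-specific advice), you could rotate the User-Agent header and add a randomized delay between requests. The header values, delay range, and helper name below are illustrative assumptions, not the only way to do it:

import asyncio
import random

from httpx import AsyncClient

# A small pool of common browser User-Agent strings (illustrative values)
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0 Safari/537.36',
]

async def polite_get(client: AsyncClient, url: str):
    """Fetch a URL with a randomized User-Agent and a short random delay."""
    headers = {'User-Agent': random.choice(USER_AGENTS)}
    await asyncio.sleep(random.uniform(1, 3))  # simple rate limiting between requests
    return await client.get(url, headers=headers, follow_redirects=True)

For heavier workloads, this is also where a proxy service such as Apify Proxy would come in, so that requests don't all originate from a single IP address.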
8. Scheduling and monitoring your LinkedIn job scraper
You can configure your scraper to run on a schedule or in response to triggers, which lets you scrape data at designated times. The Apify platform also allows you to manage and monitor your scraper.
Follow the steps below to learn how to deploy and automate your scraper (Actor).
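While the scheduling itself is configured in the Apify Console, deployment from the terminal typically comes down to two CLI commands, run from the project folder:

apify login    # authenticate the CLI with your Apify account
apify push     # build and deploy the Actor to the Apify platform

Once the Actor is on the platform, you can create a Schedule for it in Apify Console and monitor its runs from there.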
Is it legal to scrape LinkedIn jobs?
Yes, it is legal to scrape LinkedIn for publicly available information, such as job postings. But it's important to remember that this public information might still include personal details. We wrote a blog post on the legal aspects of web scraping and ethical considerations where you can learn more.
Is there a LinkedIn API that can be used with Python?
Yes, LinkedIn offers an official API that you can use to create applications that help LinkedIn members. However, it might not be the most suitable solution for your project.
In this article, I walked you through everything you need to know to build your LinkedIn job posting scraper with Python and Apify.
You learned how important LinkedIn is as a source of valuable data, including job listings, profiles, and connections. This data can be useful for various purposes like lead generation, talent sourcing, and market research.