Learn how to automatically scrape emails, phone numbers, and Facebook, Twitter, LinkedIn and Instagram profiles from web pages using a new actor on Apify called Contact Information Scraper (vdrmota/contact-info-scraper).
Note: tech-savvy users can find the regular expressions used for extraction at the end of the article.
Searching for contact information on the web can be painful. If you’re lucky, finding an email address may be a matter of a few clicks. But what if you also want to find phone numbers, Facebook pages, LinkedIn profiles, Twitter handles, and Instagram profiles? Today, everyone has different habits when it comes to their online presence, so getting in touch with someone often entails finding all possible ways of contacting them. Doing this job manually is a nightmare, especially if multiple web pages or websites need to be inspected. Luckily, you can automate this job using a technique called web scraping, which lets you automatically extract meaningful data from websites.
Tip: To better understand the value of lead generation, read Use Web Scraping to Fuel Your Sales and Marketing Growth Engine.
In Apify Store, there is a new actor called Contact Information Scraper (vdrmota/contact-info-scraper). Its job is to automatically crawl web pages of your choice, scrape their contact information, and save it so that you can download it in Excel, CSV, JSON, or another format. Note that actors are cloud programs running on the Apify platform, which is a great tool for web scraping, automation, and data extraction tasks.
Before you start, you’ll need to create an Apify account and verify your email address. This only takes a minute and a basic account is free, with no credit card required. The free account has usage limits, but they are sufficient for crawling a couple of hundred web pages.
Running the actor
To get started, head over to the Contact Information Scraper actor page and click the Use Actor button.
You will be redirected to the Apify app, where you can enter settings (actor input configuration) such as the website URLs. You can enter multiple website URLs and the actor will automatically scrape all of them.
Tip: A recently added feature lets you limit how many URLs the actor will crawl within each domain. For example, you can load 5,000 domains and crawl at most 10 web pages on each of them.
Once you’re ready, click the Run button.
The actor will start and you will then see a log where you can monitor its progress. As the actor runs, you can view the results by clicking on the Dataset tab.
You can download the results in formats such as Excel, CSV, or JSON.
Or you can preview the results.
In JSON format, the results look like this:
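For illustration, a single result record might look something like the following. The field names and values here are an approximation for illustration only and may not match the actor's current output exactly:

```json
{
  "url": "https://www.example.com/",
  "domain": "example.com",
  "emails": ["info@example.com"],
  "phones": ["+1 234 567 8900"],
  "linkedIns": ["https://www.linkedin.com/in/alan-turing"],
  "twitters": ["https://www.twitter.com/apifytech"],
  "instagrams": ["https://www.instagram.com/old_prague"],
  "facebooks": ["https://www.facebook.com/apifytech"]
}
```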
Under input, the actor has several input options that let you specify which pages will be crawled:
- Start URLs — A list of URLs of web pages where the crawler should start. You can enter multiple URLs, a text file with URLs, or even a Google Sheets document.
- Maximum link depth — Specifies how many links away from the web pages in Start URLs the crawler will visit. If zero, the actor ignores links and only crawls the Start URLs.
- Stay within the domain — If enabled, the actor only follows links that are on the same domain as the referring page. For example, if this setting is enabled and the actor finds a link to http://www.another-domain.com/ on the page http://www.example.com/some-page, it will not crawl the second page, since www.example.com is not the same domain as www.another-domain.com.
Note that the actor accepts additional input options to specify proxy servers, limit the number of pages, etc. See Actor input for details.
The technology behind the actor
When started, the actor loads the web pages provided in Start URLs. It does so using Google's headless Chrome browser, with the help of the Puppeteer library. Through Puppeteer, the actor can simulate user input on the web page, such as clicks and scrolling. The actor looks for any links to other pages on the website and crawls them recursively, using the PuppeteerCrawler class provided by the Apify SDK.
Once the web page has loaded, the actor downloads the web page’s HTML source code. By using headless Chrome, the downloaded HTML represents the actual content of the web page that the user would see, including dynamic content loaded using AJAX. This allows the actor to extract all contact details as they are presented on the pages.
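To give a taste of how this extraction works, here is a simplified sketch of pulling email addresses out of downloaded HTML with a regular expression. The pattern below is a deliberately basic illustration, not the actor's actual expression:

```python
import re

# Simplified email pattern for illustration only; the actor's real
# expression handles many more edge cases.
EMAIL_RE = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")

def extract_emails(html):
    """Return a de-duplicated list of email-like strings found in HTML."""
    seen = []
    for match in EMAIL_RE.findall(html):
        if match not in seen:
            seen.append(match)
    return seen

html = '<a href="mailto:info@example.com">info@example.com</a> or sales@example.org'
print(extract_emails(html))  # ['info@example.com', 'sales@example.org']
```

Because the HTML comes from headless Chrome, even addresses injected into the page by JavaScript end up in the string the expression scans.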
Regular expression for LinkedIn profiles
The expression finds and extracts LinkedIn profile URLs such as:
https://www.linkedin.com/in/alan-turing
en.linkedin.com/in/alan-turing
linkedin.com/in/alan-turing
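A sketch of a pattern matching the URL shapes above might look like this in Python. It is an approximation, not the actor's own expression:

```python
import re

# Approximate pattern (an assumption, not the actor's exact expression):
# optional scheme, optional short subdomain such as "www" or "en",
# then linkedin.com/in/ followed by the profile slug.
LINKEDIN_RE = re.compile(
    r"(?:https?://)?(?:[a-z]{2,3}\.)?linkedin\.com/in/[A-Za-z0-9_%-]+",
    re.IGNORECASE,
)

text = ("https://www.linkedin.com/in/alan-turing "
        "en.linkedin.com/in/alan-turing "
        "linkedin.com/in/alan-turing")
print(LINKEDIN_RE.findall(text))  # three matches, one per URL variant
```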
Regular expression for Twitter handles
The expression finds and extracts Twitter profile URLs such as:
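A hedged sketch of such a pattern in Python might be the following. Both the pattern and the "apifytech" handle are illustrative assumptions, not the actor's actual expression:

```python
import re

# Approximate pattern (an assumption, not the actor's exact expression):
# optional scheme and "www.", then twitter.com/ followed by a handle of
# up to 15 word characters (Twitter's handle length limit).
TWITTER_RE = re.compile(
    r"(?:https?://)?(?:www\.)?twitter\.com/[A-Za-z0-9_]{1,15}",
    re.IGNORECASE,
)

# "apifytech" is a hypothetical handle used only for illustration.
text = "https://www.twitter.com/apifytech and twitter.com/apifytech"
print(TWITTER_RE.findall(text))  # both URL forms are matched
```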
Regular expression for Facebook profiles
The expression finds and extracts Facebook profile and page URLs such as:
https://www.facebook.com/apifytech
facebook.com/apifytech
fb.com/apifytech
https://www.facebook.com/profile.php?id=123456789
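A sketch of a pattern covering the URL shapes above might look like this in Python. It is an approximation, not the actor's own expression:

```python
import re

# Approximate pattern (an assumption, not the actor's exact expression):
# matches facebook.com and fb.com page URLs, plus the numeric
# profile.php?id=... form, with an optional scheme and "www.".
FACEBOOK_RE = re.compile(
    r"(?:https?://)?(?:www\.)?(?:facebook|fb)\.com/"
    r"(?:profile\.php\?id=\d+|[A-Za-z0-9.]+)",
    re.IGNORECASE,
)

text = ("https://www.facebook.com/apifytech facebook.com/apifytech "
        "fb.com/apifytech https://www.facebook.com/profile.php?id=123456789")
print(FACEBOOK_RE.findall(text))  # four matches, one per URL variant
```

Note the alternation order: the profile.php?id= branch comes first, so numeric profile URLs are captured whole rather than being cut off at the question mark.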
Regular expression for Instagram profiles
The expression finds and extracts Instagram profile URLs such as:
https://www.instagram.com/old_prague
www.instagram.com/old_prague/
instagr.am/old_prague
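A sketch of a pattern matching the URL shapes above might look like this in Python. It is an approximation, not the actor's own expression:

```python
import re

# Approximate pattern (an assumption, not the actor's exact expression):
# matches instagram.com and the short instagr.am domain, with an
# optional scheme and "www.".
INSTAGRAM_RE = re.compile(
    r"(?:https?://)?(?:www\.)?(?:instagram\.com|instagr\.am)/[A-Za-z0-9_.]+",
    re.IGNORECASE,
)

text = ("https://www.instagram.com/old_prague "
        "www.instagram.com/old_prague/ instagr.am/old_prague")
print(INSTAGRAM_RE.findall(text))  # three matches, one per URL variant
```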
And that’s everything you need to know to get started using this actor. Be sure to check out other actors in Apify Store.
Happy scraping of contact details!