Extract emails, phone numbers and social profiles from websites

Administrator
Learn how to automatically scrape emails, phone numbers, and Facebook, Twitter, LinkedIn and Instagram profiles from web pages using a new actor on Apify called Contact Information Scraper (vdrmota/contact-info-scraper).

Note: tech-savvy users can find the regular expressions used for extraction at the end of the article.
Photo by Brett Jordan on Unsplash

Searching for contact information on the web can be painful. If you’re lucky, finding an email address may be a matter of a few clicks. But what if you also want to find phone numbers, Facebook pages, LinkedIn profiles, Twitter handles, and Instagram profiles? Today, everyone has different habits when it comes to their online presence, so getting in touch with someone often entails finding all possible ways of contacting them. Doing this job manually is a nightmare, especially if multiple web pages or websites need to be inspected. Luckily, you can automate this job using a technique called web scraping, which lets you automatically extract meaningful data from websites.

Tip: To better understand the value of lead generation, read Use Web Scraping to Fuel Your Sales and Marketing Growth Engine.

In Apify Store, there is a new actor called Contact Information Scraper (vdrmota/contact-info-scraper). Its job is to automatically crawl web pages of your choice, scrape the contact information from them, and save it so that you can download it in Excel, CSV, JSON, or another format. Note that actors are cloud programs running on the Apify platform, which is built for web scraping, automation, and data extraction tasks.

Before you start, you’ll need to create an Apify account and verify your email address. This only takes a minute and a basic account is free, with no credit card required. The free account has usage limits, but they are sufficient for crawling a couple of hundred web pages.

Running the actor

To get started, head over to the Contact Information Scraper actor page and click the Use Actor button.

Contact Information Scraper (vdrmota/contact-info-scraper) actor page

You will be redirected to the Apify app, where you can enter settings (actor input configuration) such as the website URLs. You can enter multiple website URLs and the actor will automatically scrape all of them.

Tip: A recently added feature lets you limit how many pages the actor crawls within each domain. For example, you can load 5,000 domains and limit the crawl to 10 web pages on each of them.

Once you’re ready, click the Run button.

Actor input configuration

The actor will start and you will then see a log where you can monitor its progress. As the actor runs, you can view the results by clicking on the Dataset tab.

Output in the datasets

You can download the results in formats such as Excel, CSV, or JSON.

Or you can preview the results.

HTML dataset preview
JSON dataset preview

In JSON format, the results look like this:

The actor has several input options that let you specify which pages to crawl:

  • Start URLs — A list of URLs of web pages where the crawler should start. You can enter multiple URLs, a text file with URLs, or even a Google Sheets document.
  • Maximum link depth — Specifies how many links away from the web pages in Start URLs the crawler will go. If zero, the actor ignores all links and only crawls the Start URLs.
  • Stay within the domain — If enabled, the actor will only follow links that are on the same domain as the referring page. For example, if this setting is enabled and the actor finds a link to http://www.another-domain.com/ on the page http://www.example.com/some-page, it will not crawl the linked page, since www.example.com is not the same domain as www.another-domain.com.

Note that the actor accepts additional input options to specify proxy servers, limit the number of pages, etc. See Actor input for details.
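
As a rough illustration of how Maximum link depth and Stay within the domain interact, here is a minimal JavaScript sketch. The shouldFollowLink function and its option names (maxDepth, sameDomain) are hypothetical and are not taken from the actor's source code. The sketch also assumes that "same domain" means an exact hostname match, which may differ from the actor's real behavior.

```javascript
// Hypothetical helper, not the actor's actual code.
// linkDepth is the number of links between a Start URL and the linked page
// (Start URLs are depth 0, pages linked from them are depth 1, and so on).
function shouldFollowLink(referrerUrl, linkUrl, linkDepth, options) {
    const { maxDepth, sameDomain } = options;

    // With maxDepth 0, every link has depth >= 1 and is ignored.
    if (linkDepth > maxDepth) return false;

    // Without the domain restriction, any link within the depth limit is followed.
    if (!sameDomain) return true;

    // Assumption: "same domain" is treated as an exact hostname match.
    return new URL(referrerUrl).hostname === new URL(linkUrl).hostname;
}

const options = { maxDepth: 1, sameDomain: true };

// The example from the list above: different hostnames, so the link is skipped.
console.log(shouldFollowLink(
    'http://www.example.com/some-page',
    'http://www.another-domain.com/',
    1,
    options,
)); // false

// A link that stays on www.example.com is followed.
console.log(shouldFollowLink(
    'http://www.example.com/some-page',
    'http://www.example.com/contact',
    1,
    options,
)); // true
```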

The technology behind the actor

The actor is built in Node.js and uses Apify SDK — an open-source web scraping and automation library. The full source code of the actor is available on GitHub.

When started, the actor loads the web pages provided in Start URLs. It does so using Google’s headless Chrome browser, with the help of the Puppeteer library. Through Puppeteer, the actor can simulate user input on the web page, such as clicks and scrolling. The actor also looks for links to other pages on the website and crawls them recursively, using the PuppeteerCrawler class provided by Apify SDK.

Once the web page has loaded, the actor downloads the web page’s HTML source code. By using headless Chrome, the downloaded HTML represents the actual content of the web page that the user would see, including dynamic content loaded using AJAX. This allows the actor to extract all contact details as they are presented on the pages.

To extract contact details from the HTML, the actor uses regular expressions. Below are a few of the regular expressions used to extract contact information from HTML (ECMAScript / JavaScript format). They are shown with doubled backslashes, as they would be written inside a JavaScript string literal, so replace each \\ with a single \ before testing them on regex101.com. Note that all the expressions are provided by the Social Utils in Apify SDK (see source code).

Regular expression for LinkedIn profiles

(?<!\\w)(?:http(?:s)?:\\/\\/)?(?:(?:[a-z]+\\.)?linkedin\\.com\\/in\\/)([a-z0-9\\-_%]{2,60})(?![a-z0-9\\-_%])(?:/)?

The expression finds and extracts LinkedIn profile URLs such as:

https://www.linkedin.com/in/alan-turing
en.linkedin.com/in/alan-turing
linkedin.com/in/alan-turing
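
To show how such an expression can be applied, here is a minimal sketch that runs the LinkedIn pattern over a piece of HTML and collects the captured profile slugs. The findLinkedInSlugs helper and the sample HTML are made up for illustration, and the use of the global and case-insensitive flags is an assumption rather than something confirmed by the actor's code. The pattern is written with single backslashes because String.raw keeps them literal.

```javascript
// Pattern from the article, with single backslashes (String.raw keeps them as-is).
const LINKEDIN_PATTERN = String.raw`(?<!\w)(?:http(?:s)?:\/\/)?(?:(?:[a-z]+\.)?linkedin\.com\/in\/)([a-z0-9\-_%]{2,60})(?![a-z0-9\-_%])(?:/)?`;

// Hypothetical helper: returns the unique profile slugs found in a text.
// Assumption: the pattern is applied globally and case-insensitively ('gi').
function findLinkedInSlugs(text) {
    const regex = new RegExp(LINKEDIN_PATTERN, 'gi');
    const slugs = new Set();
    for (const match of text.matchAll(regex)) {
        // match[0] is the whole URL, match[1] is the first capture group (the slug).
        slugs.add(match[1]);
    }
    return [...slugs];
}

// Made-up sample HTML containing the three URL forms listed above.
const sampleHtml = `
    <a href="https://www.linkedin.com/in/alan-turing">Profile</a>
    <span>en.linkedin.com/in/alan-turing</span>
    <p>linkedin.com/in/alan-turing</p>
`;

console.log(findLinkedInSlugs(sampleHtml)); // [ 'alan-turing' ]
```

All three URL forms yield the same captured slug, so collecting the results in a Set leaves a single entry.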

Regular expression for Twitter handles

(?<!\\w)(?:http(?:s)?:\\/\\/)?(?:www.)?(?:twitter.com)\\/(?!(?:oauth|account|tos|privacy|signup|home|hashtag|search|login|widgets|i|settings|start|share|intent|oct)(?:[\\'\\"\\?\\.\\/]|$))([a-z0-9_]{1,15})(?![a-z0-9_])(?:/)?

The expression finds and extracts Twitter profile URLs such as:

https://www.twitter.com/apify
twitter.com/apify
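
The Twitter expression also contains a negative lookahead that skips reserved paths such as share or intent, which are not user profiles. The sketch below illustrates that behavior. The findTwitterHandles helper and the sample strings are hypothetical, and applying the pattern with the 'gi' flags is an assumption.

```javascript
// Pattern from the article, with single backslashes (String.raw keeps them as-is).
const TWITTER_PATTERN = String.raw`(?<!\w)(?:http(?:s)?:\/\/)?(?:www.)?(?:twitter.com)\/(?!(?:oauth|account|tos|privacy|signup|home|hashtag|search|login|widgets|i|settings|start|share|intent|oct)(?:[\'\"\?\.\/]|$))([a-z0-9_]{1,15})(?![a-z0-9_])(?:/)?`;

// Hypothetical helper: returns every handle captured by the pattern, in order.
// Assumption: the pattern is applied globally and case-insensitively ('gi').
function findTwitterHandles(text) {
    const regex = new RegExp(TWITTER_PATTERN, 'gi');
    return [...text.matchAll(regex)].map((match) => match[1]);
}

// A regular profile URL yields its handle.
console.log(findTwitterHandles('https://www.twitter.com/apify')); // [ 'apify' ]

// Reserved paths followed by a delimiter are rejected by the negative lookahead.
console.log(findTwitterHandles('twitter.com/share?url=x')); // []
console.log(findTwitterHandles('twitter.com/intent/tweet')); // []

// A handle that merely starts with a reserved word is still accepted.
console.log(findTwitterHandles('twitter.com/imagine')); // [ 'imagine' ]
```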

Regular expression for Facebook profiles and pages

(?<!\\w)(?:http(?:s)?:\\/\\/)?(?:www.)?(?:facebook.com|fb.com)\\/(?!(?:rsrc\\.php|apps|groups|events|l\\.php|friends|images|photo.php|chat|ajax|dyi|common|policies|login|recover|reg|help|security|messages|marketplace|pages|live|bookmarks|games|fundraisers|saved|gaming|salesgroups|jobs|people|ads|ad_campaign|weather|offers|recommendations|crisisresponse|onthisday|developers|settings|connect|business|plugins|intern|sharer)(?:[\\'\\"\\?\\.\\/]|$))(profile\\.php\\?id\\=[0-9]{3,20}|(?!profile\\.php)[a-z0-9\\.]{5,51})(?![a-z0-9\\.])(?:/)?

The expression finds and extracts Facebook profile and page URLs such as:

https://www.facebook.com/apifytech
facebook.com/apifytech
fb.com/apifytech
https://www.facebook.com/profile.php?id=123456789

Regular expression for Instagram profiles

(?<!\\w)(?:http(?:s)?:\\/\\/)?(?:(?:www\\.)?(?:instagram\\.com|instagr\\.am)\\/)([a-z0-9_.]{2,30})(?![a-z0-9_.])(?:/)?

The expression finds and extracts Instagram profile URLs such as:

https://www.instagram.com/old_prague
www.instagram.com/old_prague/
instagr.am/old_prague

If you need to have more control over the crawling and data extraction process, you can fork the actor on GitHub and build your own version. For more details, see our Actors documentation.

And that’s everything you need to know to get started using this actor. Be sure to check out other actors in Apify Store.

Happy scraping of contact details!


