Introduction and requirements
Web scraping (also known as web data extraction or data scraping) is the automated process of collecting data from the web in a comprehensible and structured format.
To perform data extraction, we need to use an HTTP client, like Axios, to send requests to the target website and retrieve information such as the website’s HTML code.
Next, we feed the obtained code to an HTML parser, in this case, Cheerio, which will help us select specific elements in the code and extract their data.
Our goal in this tutorial is to build a Hacker News scraper using the Axios and Cheerio Node.js libraries to extract the rank, link, title, author, and points from each article displayed on the first page of the website.
Requirements
- Have Node.js installed
- Familiarity with JavaScript ES6+
- Basic understanding of CSS selectors
- Basic understanding of the browser DevTools
Initial setup
First, let's create a new directory hacker-news-scraper to house our scraper, then move into it and create a new file named main.js. We can either do it manually or straight from the terminal by using the following commands:
mkdir hacker-news-scraper
cd hacker-news-scraper
touch main.js
Still in the terminal, let's initialize our Node.js development project and install Axios and Cheerio. Finally, we can open our project in our code editor of choice. Since I'm using VS Code, I can type the command code . to open the current directory in VS Code.
npm init -y
npm install axios cheerio
code .
Right after we open our project, we can expect to see a node_modules folder and the main.js, package-lock.json, and package.json files.
Next, let's add "type": "module" to our package.json file. This gives us access to import declarations and top-level await, which means we can use the await keyword outside of async functions.
Since we are already in the package.json file, let's also add a script to run our scraper with the command npm start. To do that, we just have to add the entry "start": "node main.js" to the existing "scripts" object.
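For reference, the relevant part of package.json might look like the fragment below after both changes. This is only a sketch: the other fields generated by npm init -y (name, version, dependencies, and the default test script) are omitted here and should be left in place.

```json
{
  "type": "module",
  "scripts": {
    "start": "node main.js"
  }
}
```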
And now, we are ready to move to the next step and start writing some code in our main.js file.
How to make an HTTP GET request with Axios
In the main.js file, we will use Axios to make a GET request to our target website, save the page's HTML code to a variable named html, and log it to the console.
Code
import axios from "axios";
const response = await axios.get("https://news.ycombinator.com/");
const html = response.data;
console.log(html);
Output
After running the npm start command, we expect to see the page's HTML code printed to the console.
Great! Now that we are properly targeting the page's HTML code, it's time to use Cheerio to parse the code and extract the specific data we want.
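The request above assumes the server always answers successfully. As a minimal sketch, and not part of the original scraper, the hypothetical helper fetchHtml below wraps the call in a try/catch. It relies on Axios rejecting the promise for non-2xx status codes (its default behavior) and attaching the server's reply to error.response. The helper name, the httpClient parameter, and the 10-second timeout are assumptions made for illustration; the client is passed in as an argument so the helper can also be tried with a stub.

```javascript
// Hypothetical helper: fetch a page's HTML with an Axios-like client.
// `httpClient` is expected to expose get(url, config), as Axios does.
async function fetchHtml(httpClient, url) {
    try {
        // `timeout` is an Axios request option in milliseconds (value assumed here)
        const response = await httpClient.get(url, { timeout: 10000 });
        return response.data;
    } catch (error) {
        // Axios sets error.response when the server replied with a non-2xx status
        const reason = error.response
            ? `HTTP status ${error.response.status}`
            : error.message;
        throw new Error(`Request to ${url} failed: ${reason}`);
    }
}
```

In main.js, this could be called as const html = await fetchHtml(axios, "https://news.ycombinator.com/"); in place of the direct axios.get call.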
Parsing the data with Cheerio
Next, let's use Cheerio to parse the HTML data and scrape the contents from all the articles on the first page of Hacker News.
import axios from "axios";
import * as cheerio from "cheerio";
const response = await axios.get("https://news.ycombinator.com/");
const html = response.data;
// Use Cheerio to parse the HTML
const $ = cheerio.load(html);
Now that Cheerio is loading and parsing the HTML, we can use the variable $ to select elements on the page.
But before we select an element, let's use the developer tools to inspect the page and find what selectors we need to use to target the data we want to extract.
When analyzing the website's structure, we can find each article's rank and title by selecting the element containing the class athing.
So, let's use Cheerio to select all elements containing the athing class and save them to a variable named articles.
Next, to verify we have successfully selected the correct elements, let's loop through each article and log its text contents to the console.
Code
import axios from "axios";
import * as cheerio from "cheerio";
const response = await axios.get("https://news.ycombinator.com/");
const html = response.data;
// Use Cheerio to parse the HTML
const $ = cheerio.load(html);
// Select all the elements with the class name "athing"
const articles = $(".athing");
// Loop through the selected elements
for (const article of articles) {
    const text = $(article).text().trim();
    // Log each article's text content to the console
    console.log(text);
}
Output
1. Hyundai Head Unit Hacking (xakcop.com)
2. The Art of Knowing When to Quit (jim-nielsen.com)
3. Tailscale bug allowed a person to share nodes from other tailnets without auth (tailscale.com)
4. Show HN: Plus – Self-updating screenshots (plusdocs.com)
5. Ruby 3.2’s YJIT is Production-Ready (shopify.engineering)
6. In the past, I've had students call my problem sets “emotionally trying” (twitter.com/shengwuli)
7. EV batteries alone could satisfy short-term grid storage demand as early as 2030 (nature.com)
8. Ask HN: Has anyone worked at the US National Labs before?
9. Let's build GPT: from scratch, in code, spelled out by Andrej Karpathy [video] (youtube.com)
10. Is a venture studio right for you? (steveblank.com)
11. How Do AIs' Political Opinions Change as They Get Smarter and Better-Trained? (astralcodexten.substack.com)
12. I Am the Ghost Here (guernicamag.com)
13. ChrysaLisp (github.com/vygr)
14. Git security vulnerabilities announced (github.blog)
15. Show HN: A tool for motion-capturing 3D characters using a VR headset (diegomacario.github.io)
16. A flurry of new studies identifies causes of the Industrial Revolution (economist.com)
17. My grandfather was almost shot down at the White House (2018) (nones-leonard.medium.com)
18. Common Lisp and Music Composition (ldbeth.sdf.org)
19. Cultivating Depth and Stillness in Research (andymatuschak.org)
20. Patterns (YC S21) is hiring (patterns.app)
21. A Not-So-Brief History of the United States Navy Steel Band (panonthenet.com)
22. Glitching a microcontroller to unlock the bootloader (grazfather.github.io)
23. The Metapict Blog – TikZ like figures using Racket (soegaard.github.io)
24. Show HN: Stack-chan – Open-source companion robot easy to assemble and customize (github.com/meganetaaan)
25. A new scan to detect and cure the commonest cause of high blood pressure (qmul.ac.uk)
26. Learning Physics With Ringworld (2010) (tor.com)
27. What’s going on in the world of extensions (blog.mozilla.org)
28. UT-Austin blocks access to TikTok on campus Wi-Fi networks (texastribune.org)
29. The Amagasaki Derailment [video] (youtube.com)
30. We could stumble into AI catastrophe (effectivealtruism.org)
Great! We've managed to access each element's rank and title. However, we are still missing the article's URL, points, and author.
In the next step, we will use Cheerio's find method to grab the missing values and organize the obtained data in a JavaScript object.
The Cheerio find method
The find method returns the descendants of each element in the current set of matched elements, filtered by a selector.
In the context of our scraper, we can use find to select specific descendants of each article element.
Returning to the Hacker News website, we can find the selectors we need to extract our target data. The rank, title, and link sit inside each athing row, while the author and points sit in the table row that immediately follows it.
Code
Here's what our code looks like now:
import axios from "axios";
import * as cheerio from "cheerio";
const response = await axios.get("https://news.ycombinator.com/");
const html = response.data;
// Use Cheerio to parse the HTML
const $ = cheerio.load(html);
// Select all the elements with the class name "athing"
const articles = $(".athing");
// Loop through the selected elements
for (const article of articles) {
    // Organize the extracted data in an object
    const structuredData = {
        url: $(article).find(".titleline a").attr("href"),
        rank: $(article).find(".rank").text().replace(".", ""),
        title: $(article).find(".titleline").text(),
        author: $(article).find("+tr .hnuser").text(),
        points: $(article).find("+tr .score").text().replace(" points", ""),
    };
    // Log each element's structured data to the console
    console.log(structuredData);
}
Output
And after running node main.js, we can expect the following output:
{
url: 'https://cookieplmonster.github.io/2023/01/15/remastering-colin-mcrae-rally-3-silentpatch/',
rank: '1',
title: 'Remastering Colin McRae Rally 3 with SilentPatch (cookieplmonster.github.io)',
author: 'breakingcups',
points: '246'
}
{
url: 'https://www.geoffreylitt.com/2023/01/08/for-your-next-side-project-make-a-browser-extension.html',
rank: '2',
title: 'For your next side project, make a browser extension (geoffreylitt.com)',
author: 'Glench',
points: '130'
}
{
url: 'https://www.fosslife.org/awk-power-and-promise-40-year-old-language',
rank: '3',
title: 'Awk: Power and Promise of a 40 yr old language (2021) (fosslife.org)',
author: 'sargstuff',
points: '58'
}
{
url: 'https://jackevansevo.github.io/revisiting-kde.html',
rank: '4',
title: 'Revisiting KDE (jackevansevo.github.io)',
author: 'rc00',
points: '190'
}
{
url: 'https://community.stadia.com/t5/Stadia-General/A-Gift-from-the-Stadia-Team-amp-Bluetooth-Controller/m-p/85936#M34875',
rank: '5',
title: 'Google announces update to unlock Stadia controllers to work with other devices (stadia.com)',
author: 'anderspitman',
points: '290'
}
{
url: 'https://furbo.org/2023/01/15/the-shit-show/',
rank: '6',
title: 'The Shit Show (furbo.org)',
author: 'chazeon',
points: '393'
}
{
url: 'https://chriswarrick.com/blog/2023/01/15/how-to-improve-python-packaging/',
rank: '7',
title: 'How to improve Python packaging (chriswarrick.com)',
author: 'Kwpolska',
points: '157'
}
{
url: 'https://www.instructables.com/DIY-Raspberry-Orange-Pi-NAS-That-Really-Looks-Like/',
rank: '8',
title: 'DIY Raspberry / Orange Pi NAS That Looks Like a NAS – 2023 Edition (instructables.com)',
author: 'axiomdata316',
points: '91'
}
{
url: 'https://viewfromthewing.com/what-we-know-now-about-friday-nights-near-disaster-at-jfk-airport/',
rank: '9',
title: 'What we know now about Friday night’s near-disaster at JFK airport (viewfromthewing.com)',
author: 'bgc',
points: '86'
}
{
url: 'https://nliu.net/posts/2021-03-19-interview.html',
rank: '10',
title: 'Subverting the software interview (2021) (nliu.net)',
author: 'g0xA52A2A',
points: '137'
}
{
url: 'https://chriskiehl.com/article/practical-lenses',
rank: '11',
title: 'Making Lenses Practical in Java (chriskiehl.com)',
author: 'goostavos',
points: '36'
}
{
url: 'https://www.construct.net/en/blogs/ashleys-blog-2/rts-devlog-beat-lag-1607',
rank: '12',
title: 'How to beat lag when developing a multiplayer RTS game (construct.net)',
author: 'AshleysBrain',
points: '51'
}
{
url: 'https://arxiv.org/abs/2201.12601',
rank: '13',
title: 'A formula for the nth digit of 𝜋 and 𝜋^n (arxiv.org)',
author: 'georgehill',
points: '206'
}
{
url: 'https://www.infoq.com/articles/architecture-skeptics-guide/',
rank: '14',
title: 'A skeptic’s guide to software architecture decisions (infoq.com)',
author: 'valand',
points: '17'
}
{
url: 'https://blog.alexewerlof.com/p/tech-debt-day',
rank: '15',
title: "We invested 10% to pay back tech debt; Here's what happened (alexewerlof.com)",
author: 'hanifbbz',
points: '9'
}
{
url: 'https://www.neuralframes.com',
rank: '16',
title: 'Show HN: Create your own video clips with Stable Diffusion (neuralframes.com)',
author: 'nicollegah',
points: '166'
}
{
url: 'https://maritime.org/tour/seashadow/index.php',
rank: '17',
title: 'Virtual Tour of the Hughes Mining Barge and Sea Shadow (maritime.org)',
author: 'walrus01',
points: '8'
}
{
url: 'https://www.theregister.com/2023/01/14/in_brief_security/',
rank: '18',
title: 'NSA asks Congress to let it get on with that warrantless data harvesting, again (theregister.com)',
author: 'LinuxBender',
points: '164'
}
{
url: 'https://skio.com/careers/',
rank: '19',
title: 'Skio (YC S20) Is Hiring (skio.com)',
author: '',
points: ''
}
{
url: 'item?id=34392783',
rank: '20',
title: 'Tell HN: Repurposing old iPads as home security cameras',
author: 'evo_9',
points: '73'
}
{
url: 'item?id=34388866',
rank: '21',
title: 'Ask HN: How do you trust that your personal machine is not compromised?',
author: 'coderatlarge',
points: '400'
}
{
url: 'https://blog.revolutionanalytics.com/2014/01/the-fourier-transform-explained-in-one-sentence.html',
rank: '22',
title: 'The Fourier Transform, explained in one sentence (2014) (revolutionanalytics.com)',
author: 'signa11',
points: '397'
}
{
url: 'https://github.com/furkanonder/beetrace',
rank: '23',
title: 'Trace your Python process line by line with minimal overhead (github.com/furkanonder)',
author: 'fywvzqhvnn',
points: '35'
}
{
url: 'item?id=34393273',
rank: '24',
title: 'Tell HN: Windows 10 might have tricked you into using a online account',
author: 'xchip',
points: '71'
}
{
url: 'https://github.com/Enerccio/SLT',
rank: '25',
title: 'SLT – A Common Lisp Language Plugin for Jetbrains IDE Lineup (github.com/enerccio)',
author: 'gjvc',
points: '117'
}
{
url: 'https://www.fastcompany.com/90270226/the-origins-of-silicon-valleys-garage-myth',
rank: '26',
title: "The Origins of Silicon Valley's Garage Myth that (2018) (fastcompany.com)",
author: '2-718-281-828',
points: '21'
}
{
url: 'https://goodereader.com/blog/kindle/amazon-is-no-longer-allowing-downloading-kindle-unlimited-titles-via-usb',
rank: '27',
title: 'Amazon is no longer allowing downloading Kindle Unlimited titles via USB (goodereader.com)',
author: 'dodgermax',
points: '130'
}
{
url: 'https://gitlab.com/tsoding/porth',
rank: '28',
title: "Porth, It's Like Forth but in Python (gitlab.com/tsoding)",
author: 'Alifatisk',
points: '91'
}
{
url: 'https://goldensyrupgames.com/blog/2023-01-14-gobgp-windows/',
rank: '29',
title: 'BGP on Windows Desktop (goldensyrupgames.com)',
author: 'GSGBen',
points: '22'
}
{
url: 'https://useadrenaline.com',
rank: '30',
title: 'Show HN: AI-powered code correction that teaches you along the way (useadrenaline.com)',
author: 'jshobrook',
points: '65'
}
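The logged objects keep every field as a string: rank and points are text, job postings such as rank 19 have an empty author and points, and links to Hacker News' own discussions (for example item?id=34392783) are relative. As a minimal sketch of one way to tidy this up, the hypothetical helper normalizeArticle below converts the numeric fields, replaces empty values with null, and resolves relative links against the site's base URL using the built-in URL class. The helper name and the choice of null for missing values are assumptions, not part of the original scraper.

```javascript
const BASE_URL = "https://news.ycombinator.com/";

// Hypothetical helper: tidy up one scraped object (all fields arrive as strings)
function normalizeArticle(article) {
    return {
        // Relative links such as "item?id=123" are resolved against the base URL
        url: new URL(article.url, BASE_URL).href,
        rank: Number.parseInt(article.rank, 10),
        title: article.title,
        // Job postings have no author or score, so empty strings become null
        author: article.author || null,
        points: article.points === "" ? null : Number.parseInt(article.points, 10),
    };
}

console.log(
    normalizeArticle({
        url: "item?id=34392783",
        rank: "20",
        title: "Tell HN: Repurposing old iPads as home security cameras",
        author: "evo_9",
        points: "73",
    })
);
```

Inside the scraper's loop, the call would be normalizeArticle(structuredData). Absolute links pass through unchanged, since the base URL is only used when the scraped link is relative.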
Congratulations! We've just scraped information from all the articles displayed on the first page of Hacker News using Axios and Cheerio.
In theory, we've accomplished our goal. However, there are still challenges that we might come across when scraping the web, and getting blocked is one of the most common issues web scrapers face. Fortunately, there are multiple ways to avoid getting blocked when crawling.
Avoid being blocked with Axios
Hacker News is a simple website without any aggressive anti-bot protections in place, so we were able to scrape it without running into any major blocking issues.
Complex websites might employ different techniques to detect and block bots, such as analyzing the data encoded in HTTP requests received by the server, fingerprinting, CAPTCHAs, and more.
Avoiding all types of blocking can be a very challenging task, and its difficulty varies according to your target website and the scale of your scraping activities.
Nevertheless, there are some simple techniques, like passing the correct User-Agent header, that can already help our scrapers pass basic website verifications.
What is the User-Agent header?
The User-Agent header informs the server about the operating system, vendor, and version of the requesting client. This is relevant because any inconsistencies in the information the website receives can alert it about suspicious bot-like activity, leading to our scrapers getting blocked.
One of the ways we can avoid this is by passing custom headers to the HTTP request we made earlier using Axios, thus ensuring that the User-Agent used matches the one from the machine sending the request.
You can check your own User-Agent by accessing the http://whatsmyuseragent.org/ website. For example, this is my computer's User-Agent:
With this information, we can now pass the User-Agent header to our Axios HTTP request.
How to use the User-Agent header in Axios
In order to verify that Axios is indeed sending the specified headers, let's create a new file named headers-test.js and send a request to the website https://httpbin.org/.
To send custom headers using Axios, we pass a request configuration object (named params in this example) as the second argument to axios.get:
import axios from "axios";
const params = {
    headers: {
        "User-Agent":
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36",
    },
};
const response = await axios.get("https://httpbin.org/headers", params);
console.log(response.data);
After running the node headers-test.js command, we can expect to see our request headers printed to the console:
As we can verify by checking the User-Agent, Axios used the custom headers we passed as a parameter to the request.
In contrast, this is what the User-Agent for the same request would look like if we didn't pass any custom parameters:
Cool, now that we know how to properly pass custom headers to an Axios HTTP request, we can implement the same logic in our Hacker News scraper.
Required headers, cookies, and tokens
Setting the proper User-Agent header helps our scraper pass basic checks, but it is not enough to overcome the more sophisticated anti-bot systems found on modern websites.
There are many other types of information, such as additional headers, cookies, and access tokens, that we might be required to send with our request in order to get to the data we want. If you want to know more about the topic, check out the Dealing with headers, cookies, and tokens section of the Apify Web Scraping Academy.
An alternative to Axios
Despite Axios being a solid choice for scraping, it was not primarily designed for the needs of modern web scraping, and because of that, it requires extra setup to ensure that its requests are not easily blocked.
Got-scraping, on the other hand, is an open-source HTTP client maintained by Apify, which was made for scraping. Its purpose is to send browser-like requests out of the box, helping our scrapers blend in with the website traffic.
To demonstrate that, let's first add Got-scraping to our project by running the following command:
npm install got-scraping
Now, let's go back to our headers-test.js file and modify the code to use Got-scraping instead of Axios.
import { gotScraping } from "got-scraping";
const response = await gotScraping.get("https://httpbin.org/headers");
console.log(response.body);
Next, run the command node headers-test.js to see the headers that Got-scraping automatically added to the request.
Note that Got-scraping included the correct User-Agent without us having to pass any additional parameters to the request like we did for Axios.
Not only that, but it also included additional headers that will help our requests look more "human-like" and not be blocked by the target website.
Conclusion and final code
We've shown you how to combine Axios and Cheerio when web scraping in Node.js and avoid getting blocked. So now, let's wrap it all up with full code for using Axios and Got-scraping.
Using Axios:
import axios from "axios";
import * as cheerio from "cheerio";
const params = {
    headers: {
        "User-Agent":
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36",
    },
};
const response = await axios.get("https://news.ycombinator.com/", params);
const html = response.data;
// Use Cheerio to parse the HTML
const $ = cheerio.load(html);
// Select all the elements with the class name "athing"
const articles = $(".athing");
// Loop through the selected elements
for (const article of articles) {
    // Organize the extracted data in an object
    const structuredData = {
        url: $(article).find(".titleline a").attr("href"),
        rank: $(article).find(".rank").text().replace(".", ""),
        title: $(article).find(".titleline").text(),
        author: $(article).find("+tr .hnuser").text(),
        points: $(article).find("+tr .score").text().replace(" points", ""),
    };
    // Log each element's structured data to the console
    console.log(structuredData);
}
Using Got-scraping:
import { gotScraping } from "got-scraping";
import * as cheerio from "cheerio";
const response = await gotScraping.get("https://news.ycombinator.com/");
const html = response.body;
// Use Cheerio to parse the HTML
const $ = cheerio.load(html);
// Select all the elements with the class name "athing"
const articles = $(".athing");
// Loop through the selected elements
for (const article of articles) {
    // Organize the extracted data in an object
    const structuredData = {
        url: $(article).find(".titleline a").attr("href"),
        rank: $(article).find(".rank").text().replace(".", ""),
        title: $(article).find(".titleline").text(),
        author: $(article).find("+tr .hnuser").text(),
        points: $(article).find("+tr .score").text().replace(" points", ""),
    };
    // Log each element's structured data to the console
    console.log(structuredData);
}