Hi! We're Apify, a full-stack web scraping and browser automation platform.
In this guide, we'll explore the fundamentals of parsing HTML content and the various methods and libraries for parsing HTML in JavaScript. We'll also cover best practices and common issues, as well as practical applications.
Why you need this guide
As demand for data increases, so does the need to extract, process, and make sense of it. But much of this data lives on web pages. So, how can you access it programmatically?
The solution lies in automating the process of analyzing, extracting, and transforming raw data into a structured, easily readable format. This process is known as HTML parsing, and we'll show you how to do it with JavaScript.
So let's get started!
Parsing HTML files in JavaScript
Parsing HTML involves analyzing a collection of HTML strings, including HTML tags, attributes, and their values, to generate a structured representation, the Document Object Model (DOM). The idea is to map out the entire page so we can easily extract specific data or pinpoint elements to interact with, like buttons and forms.
HTML is designed to be programmatically parsable. The process involves breaking down the HTML document into its key constituent elements, down to the smallest components, and constructing the DOM.
For instance, let's consider the following HTML example:
Without diving deep into the technical implementations, the HTML document goes through several steps while parsing, including tokenization. This generates a parse tree that eventually becomes the DOM tree, as shown in the illustration below:
This process makes it possible to traverse the HTML tree through each node to understand its structure, manipulate the DOM, and extract relevant data from HTML documents programmatically.
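To make the tokenization and tree-building steps more concrete, here is a deliberately simplified sketch. The tokenize and buildTree functions are hypothetical helpers written for this illustration, not part of any library. They ignore attributes, comments, void elements such as <br>, and the error recovery that real HTML parsers perform.

```javascript
// Toy tokenizer: splits markup into start tags, end tags, and text.
// Illustration only; real parsers follow the HTML specification's state machine.
function tokenize(html) {
  const tokens = [];
  const pattern = /<\/([a-zA-Z][a-zA-Z0-9]*)\s*>|<([a-zA-Z][a-zA-Z0-9]*)[^>]*>|([^<]+)/g;
  for (const [, endTag, startTag, text] of html.matchAll(pattern)) {
    if (endTag) {
      tokens.push({ type: 'endTag', name: endTag.toLowerCase() });
    } else if (startTag) {
      tokens.push({ type: 'startTag', name: startTag.toLowerCase() });
    } else if (text.trim()) {
      // Whitespace-only text between tags is skipped to keep the output small.
      tokens.push({ type: 'text', value: text.trim() });
    }
  }
  return tokens;
}

// Toy tree builder: a stack tracks the currently open element.
function buildTree(tokens) {
  const root = { name: '#document', children: [] };
  const stack = [root];
  for (const token of tokens) {
    const parent = stack[stack.length - 1];
    if (token.type === 'startTag') {
      const node = { name: token.name, children: [] };
      parent.children.push(node);
      stack.push(node);
    } else if (token.type === 'endTag') {
      // Assumes well-formed input: every end tag closes the innermost open element.
      if (stack.length > 1) stack.pop();
    } else {
      parent.children.push({ name: '#text', value: token.value });
    }
  }
  return root;
}

const tokens = tokenize('<div><h1>Hello, World!</h1><p>This is a paragraph.</p></div>');
const tree = buildTree(tokens);
console.log(JSON.stringify(tree, null, 2));
```

The resulting object nests the h1 and p nodes under the div node, which is the same parent-child shape the DOM exposes, only without the rest of the DOM API.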
JavaScript HTML parsers
There are several libraries and APIs available that you can use to parse HTML in JavaScript, each with unique features and use cases. We'll cover the following:
DOMParser API
Cheerio
Axios (paired with Cheerio to fetch and parse live pages)
Parse5
JSDOM
1. Using the DOMParser API
The DOMParser API is a built-in interface in the browser environment that allows you to parse HTML source code from a string into a DOM document. This is useful for manipulating and extracting data from HTML content programmatically.
In a nutshell, the DOMParser API provides the DOMParser constructor, which you can use to create a new DOM parser object. From this object, you can call the parseFromString method to parse HTML content into a DOM document.
Let's take a look at an example:
const htmlString = `
<div>
<h1>Hello, World!</h1>
<p>This is a paragraph.</p>
</div>
`;
const parser = new DOMParser();
// Named doc rather than document to avoid clashing with the browser's global document object
const doc = parser.parseFromString(htmlString, 'text/html');
console.log(doc);
In this example, we first define a string, htmlString, that contains some HTML: a div element wrapping an h1 heading and a paragraph.
To parse this content, we first create a new instance of the DOMParser object and then use its parseFromString method to parse htmlString into a DOM Document object.
The output, in this case, is the complete DOM tree structure, including the additional document elements that the parser adds automatically, such as the <html>, <head>, and <body> elements.
Please note that the parseFromString method takes two arguments: the HTML string you want to parse and a content type, which, in this case, is 'text/html'. It's important to specify the kind of content you want to parse because this method can parse XML content as well.
To extract relevant data, you can then use the querySelector method to find an element in the parsed document and read its text content.
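As a minimal sketch of that extraction step, the hypothetical extractText helper below parses a string and returns the text of the first element matching a selector. DOMParser only exists in browsers, so the example call is guarded and simply does nothing when the code is run in plain Node.js.

```javascript
// Hypothetical helper (not part of the DOMParser API itself).
function extractText(htmlString, selector) {
  const parsed = new DOMParser().parseFromString(htmlString, 'text/html');
  const element = parsed.querySelector(selector);
  // querySelector returns null when nothing matches, so guard before reading textContent.
  return element ? element.textContent.trim() : null;
}

// DOMParser is a browser global; skip the demo where it is not defined.
if (typeof DOMParser !== 'undefined') {
  const sampleHtml = '<div><h1>Hello, World!</h1><p>This is a paragraph.</p></div>';
  console.log(extractText(sampleHtml, 'h1'));
  console.log(extractText(sampleHtml, 'footer'));
}
```

Returning null for a missing element is a design choice for this sketch. It avoids the TypeError you would get from reading textContent on a null result.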
Depending on your specific use case, you might also want to convert the DOM tree back into an HTML string. This process is known as serializing.
For this, you need to use a different interface, that is, the XMLSerializer.
Here's an example:
const serializer = new XMLSerializer();
const serializedHTML = serializer.serializeToString(document);
console.log(serializedHTML); // prints the current page HTML as string
Just as with DOMParser, you first create a new instance of XMLSerializer. Then, you call the serializeToString method, passing in the DOM document, to convert it back into an HTML string.
Serializing the DOM tree can be useful in scenarios where you need to manipulate the HTML structure programmatically and then convert it back to a string for further processing or rendering.
2. Using Cheerio
Cheerio is a fast, flexible, and lean implementation of core jQuery that provides a fairly workable API to parse and manipulate HTML with a familiar jQuery-like syntax.
Node.js runs outside the browser, so it has no built-in DOM or DOMParser for parsing and manipulating markup. However, by using the Cheerio library, you can parse and traverse HTML documents in a Node.js environment.
Before using Cheerio, ensure you have a sufficient understanding of Node.js and its related technologies; you'll be working on the server side from here on in this guide.
Then, go ahead and initialize a Node.js project (in your preferred project directory) with the following command on your terminal.
npm init -y
This command will create a package.json file in the root of your project directory to manage dependencies, scripts, and configurations.
Next, install Cheerio. You can use your preferred package manager to install the package, or you can use npm:
npm install cheerio
The first step in working with Cheerio is to load the HTML content that you want to parse. This is required to ensure Node.js has direct access to the markup content in the server environment.
To load the HTML content, we'll use the cheerio.load() method, passing the HTML string you want to parse as its argument.
To parse using Cheerio, create an app.js file, and paste this code:
import * as cheerio from 'cheerio';
const htmlString = `
<div>
<h1>Hello, World!</h1>
<p>This is a paragraph.</p>
</div>
`;
const $ = cheerio.load(htmlString);
In this example, we first load the HTML string into Cheerio. Then, we can use jQuery-like syntax to query and manipulate the DOM. For example, logging $('h1').text() prints Hello, World!, and logging $('p').text() prints This is a paragraph.
Make sure to include "type": "module" in the package.json file to enable the use of ES6 modules in your Node.js app.
To test this out, run the script with Node.js and observe the output in your terminal.
node app.js
3. Using Axios
In the above example, we parsed and extracted data from a small, hard-coded HTML string. But what happens when you want to parse the HTML of a live website?
In this case, the same principles apply; however, you need to first download the site's raw HTML using tools like Axios or the Fetch API. For this tutorial, we'll use Axios.
First, run the following command on your terminal to install it:
npm install axios
Now, let's take a look at an example:
import axios from 'axios';
import * as cheerio from 'cheerio';
async function fetchAndParse(url) {
try {
const response = await axios.get(url);
if (response.status !== 200) {
throw new Error(`Failed to fetch webpage: ${url}`);
}
const html = response.data;
const $ = cheerio.load(html);
const title = $('title').text();
console.log(`Page Title: ${title}`);
} catch (error) {
console.error('Error fetching or parsing webpage:', error);
}
}
fetchAndParse('https://jsonplaceholder.typicode.com');
Once you download the webpage, load it using Cheerio and then parse it. After loading the content, it's very easy to traverse the HTML document from the parsed object to extract as much data as possible for further processing or use in your applications.
Though very basic, the above code snippet is a good example of a web scraper.
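If you'd rather not add a dependency, the Fetch API mentioned earlier could handle the download step instead. The sketch below assumes a runtime with a global fetch (modern browsers and Node.js 18 or later). The fetchHtml name and its fetchImpl parameter are hypothetical and exist only for this illustration; fetchImpl lets you swap in a stub when testing.

```javascript
// Hypothetical helper: downloads a page and returns its raw HTML as a string.
// fetchImpl defaults to the global fetch but can be replaced, for example in tests.
async function fetchHtml(url, fetchImpl = globalThis.fetch) {
  const response = await fetchImpl(url);
  // Unlike Axios, fetch does not reject on HTTP error statuses, so check response.ok.
  if (!response.ok) {
    throw new Error(`Failed to fetch ${url}: HTTP ${response.status}`);
  }
  return response.text();
}

// Usage (performs a real network request, so it is left commented out here):
// fetchHtml('https://jsonplaceholder.typicode.com')
//   .then((html) => console.log(`Downloaded ${html.length} characters`))
//   .catch((error) => console.error(error.message));
```

The returned string can then be passed to cheerio.load() in the same way as response.data in the Axios example.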
4. Using Parse5
Parse5 is a flexible HTML parser that provides a simple API for parsing and serializing HTML documents. It is designed to be used as a building block for other tools but can also be used to parse HTML directly for simple tasks.
To use Parse5, install it in your local environment:
npm install --save parse5
Now, let's take a look at an example. You can paste this code into your app.js file.
import * as parse5 from 'parse5';
const htmlString = `
<div>
<h1>Hello, World!</h1>
<p>This is a paragraph.</p>
</div>
`;
const document = parse5.parse(htmlString);
// Traverse the document to find elements
function traverse(node) {
  if (node.nodeName === 'h1') {
    console.log(node.childNodes[0].value); // Outputs: Hello, World!
  }
  if (node.childNodes) {
    node.childNodes.forEach(traverse);
  }
}
traverse(document);
In this example, similar to other parsing technologies, we need to load and parse the HTML string into a document. We then traverse the document tree to find and manipulate elements.
However, the difference occurs when you want to traverse the DOM object to extract values. For large web pages, it's not as straightforward to traverse and retrieve content using Parse5.
Note: Parse5 does not provide a default export, so a default import such as import parse5 from 'parse5' fails. In an ES module, use a namespace import (import * as parse5 from 'parse5') or a named import (import { parse } from 'parse5') instead. The traditional require syntax only works in CommonJS modules, so it is not available in app.js once package.json sets "type": "module".
5. Using JSDOM
JSDOM is another great library that simulates a browser-like environment for Node.js, allowing you to parse and interact with HTML as if you were in a real browser.
Using JSDOM is relatively simple. It expects you to pass valid HTML content as a string to its constructor. Then, it'll parse that HTML just like a browser does. From there, you can then manipulate, extract, or interact with the HTML elements as needed.
To use JSDOM, install it with the following command:
npm install jsdom
Here's a basic example of using JSDOM:
import { JSDOM } from 'jsdom';
const htmlString = `
<div>
<h1>Hello, World!</h1>
<p>This is a paragraph.</p>
</div>
`;
const dom = new JSDOM(htmlString);
const document = dom.window.document;
const heading = document.querySelector('h1').textContent;
console.log(heading); // Outputs: Hello, World!
In this example, it's important to note that the object returned from the constructor contains both the data about the parsed HTML document and the metadata JSDOM uses to parse that document. To access the actual document, you need to read the window property, which is similar to the window property in a browser.
After that, you read the document property, which brings you into the actual DOM, just like if you were working in JavaScript in a regular browser. From there, you can use the DOM methods to select and manipulate elements.
Bonus: Parsing HTML using regex
Parsing HTML using regex is generally not recommended due to the complexity and potential for errors, especially when dealing with complex HTML structures. However, for simple HTML parsing tasks, regex can sometimes be a viable option.
A more suitable use of regex is validating input in client-side forms, such as checking whether a value the user entered, for example a password, matches the expected format.
Here's a simple example of using regex to parse HTML content. However, it's important to note that, in this example, we're essentially parsing a string using the native match() JavaScript method rather than performing a fully-fledged HTML parse. The method returns the result matching the provided regex; otherwise, it returns null.
const htmlString = `
<div>
<h1>Hello, World!</h1>
<p>This is a paragraph.</p>
</div>
`;
const headingMatch = htmlString.match(/<h1>(.*?)<\/h1>/);
if (headingMatch) {
console.log(headingMatch[1]); // Outputs: Hello, World!
}
In this example, we use the regex pattern /<h1>(.*?)<\/h1>/ to match the <h1> tag and extract its content.
Let's break down this pattern:
/<h1>: This part matches the opening <h1> tag.
(.*?): The parentheses create a capturing group to access the content inside the <h1> tag. The .*? is a non-greedy quantifier that matches any character sequence (except newline) as few times as possible.
<\/h1>/: This part matches the closing </h1> tag.
The non-greedy quantifier .*? is crucial here. Without the ?, the greedy quantifier .* would match as much content as possible, potentially including content outside the <h1> tag.
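The difference between the two quantifiers is easiest to see on a string with more than one heading. The following sketch uses a made-up input string and also shows how matchAll() with the g flag could collect every heading rather than only the first one.

```javascript
const twoHeadings = '<h1>First</h1><p>Intro</p><h1>Second</h1>';

// Non-greedy: stops at the first closing tag.
const lazyMatch = twoHeadings.match(/<h1>(.*?)<\/h1>/);
console.log(lazyMatch[1]); // First

// Greedy: runs on to the last closing tag and swallows the markup in between.
const greedyMatch = twoHeadings.match(/<h1>(.*)<\/h1>/);
console.log(greedyMatch[1]); // First</h1><p>Intro</p><h1>Second

// The g flag plus matchAll() returns every match instead of just the first.
const allHeadings = [...twoHeadings.matchAll(/<h1>(.*?)<\/h1>/g)].map((match) => match[1]);
console.log(allHeadings); // [ 'First', 'Second' ]
```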
While regular expressions can work well for basic HTML parsing tasks, it's generally recommended to use dedicated HTML parsing libraries like Cheerio or JSDOM.
These libraries provide a more robust and efficient way to parse and manipulate HTML documents while offering a wide range of methods and functionality for working with the DOM.
If you decide to use regular expressions for simple HTML parsing tasks, make sure to thoroughly test your patterns against various scenarios and edge cases to ensure they work as expected and do not introduce vulnerabilities or errors in your application.
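One lightweight way to do that testing is to run a pattern against a small table of sample inputs. The samples and the tolerantPattern below are hypothetical and chosen for illustration. The tolerant variant handles these particular cases, but it is still not a general HTML parser (an attribute value containing > would break it, for instance).

```javascript
const samples = [
  { label: 'plain tag', html: '<h1>Title</h1>' },
  { label: 'tag with an attribute', html: '<h1 class="main">Title</h1>' },
  { label: 'content spanning lines', html: '<h1>\nTitle\n</h1>' },
  { label: 'uppercase tag', html: '<H1>Title</H1>' },
];

// The pattern from the example above.
const simplePattern = /<h1>(.*?)<\/h1>/;
// Hypothetical variant: allows attributes, newlines in the content, and any letter case.
const tolerantPattern = /<h1\b[^>]*>([\s\S]*?)<\/h1>/i;

// Counts how many samples a pattern matches.
function countMatches(pattern) {
  return samples.filter(({ html }) => pattern.test(html)).length;
}

console.log(`simple pattern matched ${countMatches(simplePattern)} of ${samples.length}`);
console.log(`tolerant pattern matched ${countMatches(tolerantPattern)} of ${samples.length}`);
```

The simple pattern only matches the plain tag, which shows how quickly a regex that works on one sample can fail on real-world markup.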
Common issues with parsing HTML in JavaScript
There are several issues that might affect the parsing process, including:
1. Malformed HTML content
This is the most basic problem you might encounter. Browsers and parsers can handle malformed HTML differently, leading to inconsistent results.
Malformed HTML might include missing or mismatched tags, incorrect attribute names or values, or extra or missing characters.
To address this issue, make sure to check the syntax of your HTML code carefully and ensure that it is well-formed and valid.
However, you don’t need to worry as much since most code editors have syntax highlighting and error detection features that can help you identify and fix these types of problems.
2. Erroneous web scripts
JavaScript code running on your web page with errors, conflicts, or compatibility issues can affect the HTML parsing and DOM manipulation, causing unexpected or undesired behavior.
Make sure to thoroughly test your JavaScript code and ensure that it's free of errors and conflicts.
You can also use tools like linters and unit tests to catch and fix any issues before they impact the HTML parsing process.
3. Performance bottlenecks
Parsing large HTML documents can be slow and memory-intensive, leading to performance issues if not optimized properly.
Several techniques can improve the performance of your apps, including progressive rendering, lazy loading, and optimizing the HTML structure to reduce the overall complexity and size of the document.
Additionally, you can profile your code and identify any areas that may be causing performance issues.
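For a quick first measurement before reaching for a full profiler, you could time the parsing step directly. In this sketch, timeIt is a hypothetical helper and the tag-counting callback is only a stand-in workload; in a real project you would pass in your actual parsing call instead. It assumes a runtime where performance is a global, such as browsers or recent Node.js versions.

```javascript
// Hypothetical helper: runs fn once and reports how long it took.
function timeIt(label, fn) {
  const start = performance.now();
  const result = fn();
  const durationMs = performance.now() - start;
  console.log(`${label}: ${durationMs.toFixed(2)} ms`);
  return { result, durationMs };
}

// Synthetic document with 50,000 list items, used purely as test input.
const largeHtml = '<ul>' + '<li>item</li>'.repeat(50000) + '</ul>';

// Stand-in workload: count the opening <li> tags.
const { result: itemCount } = timeIt('count <li> tags', () => (largeHtml.match(/<li>/g) || []).length);
console.log(`Found ${itemCount} items`);
```

A single run is noisy, so it is usually worth repeating the measurement a few times and comparing the numbers before and after an optimization.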
4. Security vulnerabilities
There's always the risk of security vulnerabilities when working with HTML, particularly Cross-Site Scripting (XSS) attacks.
XSS attacks happen when malicious code gets injected into a web page through unsanitized user input.
To avoid this, it's absolutely crucial that you always sanitize and validate any user-generated HTML input before using it in your application.
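As a minimal sketch of one such safeguard, the hypothetical escapeHtml helper below converts the characters that have special meaning in HTML into entities, so user input is displayed as text rather than interpreted as markup. It is not a full sanitizer. If you need to allow some HTML through, a dedicated sanitization library such as DOMPurify is a more appropriate choice.

```javascript
// Characters with special meaning in HTML and their entity replacements.
const HTML_ESCAPES = {
  '&': '&amp;',
  '<': '&lt;',
  '>': '&gt;',
  '"': '&quot;',
  "'": '&#39;',
};

// Hypothetical helper: escapes a value so it can be inserted into HTML as plain text.
function escapeHtml(value) {
  return String(value).replace(/[&<>"']/g, (character) => HTML_ESCAPES[character]);
}

console.log(escapeHtml('<script>alert("xss")</script>'));
// &lt;script&gt;alert(&quot;xss&quot;)&lt;/script&gt;
```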
Use cases of HTML parsing
HTML parsing has several use cases, largely in web development as well as data processing and analysis.
1. Web scraping and data extraction
While some web data is available through dedicated APIs, a significant portion is only accessible through the website's HTML content. To access this data, you'll need to set up automated data collection processes and use parsers to read and process it.
This is essentially what web scraping is: using scrapers (automated scripts) to extract data from websites. The process involves sending HTTP requests to a website, retrieving the HTML content, and then parsing the HTML to locate and extract the desired data.
Let's look at an example of web scraping using the popular Hacker News website as a case study.
We can use HTML parsing to extract relevant data from the website, such as the titles, URLs, and scores of the top stories.
In this example, inside the scrapeHackerNews() function, we first send a GET request to the Hacker News homepage using Axios and then use Cheerio to load the HTML content into a jQuery-like object $.
The function will then iterate through the tr elements on the page, which represent the individual stories on Hacker News. It identifies the rows that contain the story information (those with the 'athing' class) and extracts the title, URL, and score for each story. These story details are then stored in an array called stories.
Finally, call the following function to log the title, URL, and points for each story in the stories array to the console:
It's important to note that web scraping can be a challenging task since websites often change their HTML structure and layout, making it necessary to regularly update the scraping logic. Additionally, some websites may put measures in place to detect and prevent scraping, such as rate limiting or IP blocking.
You should be aware of these obstacles and implement appropriate strategies, such as using rotating proxies, to ensure the reliability and sustainability of your scrapers.
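Alongside proxies, one simple strategy for coping with rate limiting is to retry failed requests with a growing delay between attempts. The withRetry helper below is a hypothetical sketch of that idea, and its default values are arbitrary assumptions. It wraps any async task, so it could be combined with whichever HTTP client you use.

```javascript
// Resolves after the given number of milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical helper: runs task, retrying with exponential backoff when it throws.
async function withRetry(task, { retries = 3, baseDelayMs = 500 } = {}) {
  let lastError;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await task(attempt);
    } catch (error) {
      lastError = error;
      if (attempt < retries) {
        // Wait 500 ms, 1000 ms, 2000 ms, ... with the default baseDelayMs.
        await sleep(baseDelayMs * 2 ** attempt);
      }
    }
  }
  // Every attempt failed, so surface the most recent error.
  throw lastError;
}

// Usage sketch (downloadPage is a placeholder for your own request function):
// const html = await withRetry(() => downloadPage('https://news.ycombinator.com'));
```

Spacing out the retries gives the target site time to recover and makes it less likely that a temporary block turns into a permanent one.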
2. Accessibility auditing
Accessibility auditing is the process of analyzing web pages to identify potential accessibility issues that may prevent users with disabilities from accessing or interacting with the content effectively.
This process involves parsing the HTML structure of a web page to programmatically check for various accessibility criteria.
There are various accessibility factors you can evaluate. However, these will largely depend on your end users and your specific accessibility strategy. But generally, you can analyze your site's HTML to check for:
Proper use of semantic HTML tags: Ensuring that the page uses the appropriate HTML tags (e.g., <h1>, <h2>, <h3>, etc.) to define the document structure and hierarchy. This helps screen readers and other assistive technologies understand the content's organization.
Alternative text for images: Verifying that all images on the page have meaningful alternative text (the alt attribute) that describes the purpose and content of the image.
Keyboard accessibility: Checking that all interactive elements on the page (e.g., links, buttons, form controls) can be accessed and operated using only the keyboard without requiring a mouse.
Let's look at an example that uses the Cheerio library to perform an accessibility audit on an HTML file. Go ahead and create an audit.js file, and paste this code:
import * as cheerio from 'cheerio';
import fs from 'fs';
async function auditAccessibility(htmlFilePath) {
try {
const html = await fs.promises.readFile(htmlFilePath, 'utf8');
const $ = cheerio.load(html);
// Check for proper use of semantic HTML tags
const headings = $('h1, h2, h3, h4, h5, h6');
if (headings.length === 0) {
console.log('No headings found on the page. Consider adding meaningful headings.');
} else {
let prevHeadingLevel = 0;
headings.each((index, heading) => {
const headingLevel = parseInt(heading.tagName[1]);
if (headingLevel > prevHeadingLevel + 1) {
console.log(`Heading level jump from ${prevHeadingLevel} to ${headingLevel} on the page. Consider using proper heading hierarchy.`);
}
prevHeadingLevel = headingLevel;
});
}
// Check for alternative text on images
const images = $('img');
images.each((index, img) => {
const altText = $(img).attr('alt');
if (!altText || altText.trim() === '') {
console.log(`Image at index ${index} does not have alternative text. Consider adding a meaningful alt attribute.`);
}
});
// Check for keyboard accessibility
const focusableElements = $('a, button, input, textarea, select, [tabindex]');
focusableElements.each((index, element) => {
const tabIndex = $(element).attr('tabindex');
if (tabIndex && parseInt(tabIndex) < 0) {
console.log(`Element at index ${index} has a negative tabindex value, which makes it inaccessible via keyboard.`);
}
});
console.log('Accessibility audit complete.');
} catch (error) {
console.error('Error during accessibility audit:', error);
}
}
// Usage example
auditAccessibility('index.html');
In this example, the auditAccessibility() function performs an accessibility audit when called. It takes a file path to an HTML file as its input and reads the file's contents using fs.promises.readFile(), storing it in the html variable.
Next, we use the cheerio.load() function to parse the HTML content and create a jQuery-like $ object, which can be used to traverse and manipulate the HTML structure.
Essentially, the function then performs the following accessibility checks:
Check for heading elements: It first checks for all the heading elements on the page using the $() function. If no headings are found, it logs a warning message.
Check for alt texts: It finds all the img elements on the page and checks if each one has a meaningful alt attribute. If an image is missing alternative text, a warning message is logged.
Check for keyboard accessibility: It looks for all the interactive elements on the page (links, buttons, form controls, etc.) that can receive keyboard focus. It checks if any of these elements have a negative tabindex value, which can make them inaccessible to keyboard users, and logs a warning message if any are found.
To test this out, create an index.html file, and copy and paste this content:
This index.html file includes examples of the issues that the auditAccessibility() function would detect, such as a missing heading, an image without alternative text, and interactive elements with negative tabindex values. Therefore, when you run the script with node audit.js, the corresponding warnings are logged to your terminal.
Summary: Best practices for HTML parsing in JavaScript
There are several best practices you can implement to ensure you achieve the right results when parsing HTML in JavaScript. Here are some of the most important ones:
Use the right tool for the job
Choose a parser that fits your specific needs. For simple tasks, DOMParser might suffice, while libraries like JSDOM or Cheerio are better suited for more complex operations.
Validate and sanitize input
Always validate and sanitize the input HTML to avoid security risks such as XSS attacks. This is especially important when the parsed output feeds into other parts of your application.
Optimize performance
Parsing large amounts of HTML can be slow and memory-intensive, so you may need to optimize your code for better performance. For example, with Cheerio, you can use methods like .find() with specific selectors to target only the elements you need instead of iterating through the entire document structure.
Avoid regular expressions for complex parsing
While regular expressions can be useful for simple parsing tasks, you should avoid using them for complex HTML parsing tasks.
Next steps
Now, it's time to unleash your creativity, test out a few ideas, and build some awesome projects!
This is just the tip of the iceberg. Beyond the technologies we have covered, there are other fantastic options to explore, like HTMLParser2, node-html-parser, and others.
Feel free to check out these additional resources to learn more: