Web scraping AI: product matching done by data extraction

Product matching, also known as product mapping, poses a significant challenge for e-commerce shops and online retailers when it comes to pricing intelligence. This process of identifying, categorizing, and matching products across different websites can be daunting. Performing this task manually on a daily basis requires great dedication. Is there a way to make product matching scalable?

At Apify, we recognized the demand for not only extracting product data from separate e-commerce stores but also matching them. To address this need, we developed an AI-model-based Product Matcher 🔗. This tutorial provides an easy-to-follow guide on how to utilize this tool effectively.

🏎 💨

Already know the theory? Skip to the tutorial part!

🗺 What is product matching for?

In today's competitive e-commerce landscape, shoppers have a myriad of options, with thousands of products and hundreds of retailers to choose from. They can effortlessly compare prices across multiple e-commerce websites before ultimately settling for the best – often the most affordable – option. Since the price tag plays such a big role here, retailers face the constant pressure of adapting their prices each day in their quest to keep up with the market and still manage to balance their offer against the return on sales.

So how can you keep track of all those price changes? Well, the process of comparing prices across websites consists of identifying, then categorizing, and finally matching products. But all websites look different and contain slightly different information even about the same products. The lack of any standard layout for e-commerce products makes this undertaking truly challenging in technical terms and, no, Amazon ASINs won’t always help here. Hence, product matching or, as it is also known, product mapping.

Product mapping is usually the answer to conducting competitor analysis, implementing dynamic pricing on the website, or even writing an in-depth article comparing various online stores. You might think there’s some clever app for this, but believe it or not, on average, human beings are still better than machines at identifying a product and comparing its price across websites. But getting pricing intelligence via manual mapping is not scalable or consistent.

**The same item looks different across online stores: eBay vs. Barnes&Noble layouts**

💪 What about manual mapping?

The existing pricing intelligence solutions come with a set of challenges. One of the main issues is lag: many of these solutions work retrospectively, providing insights well after a price change has happened. This delay in obtaining actionable information often hinders retailers from gaining a competitive edge. Other challenges facing the existing solutions on the market include:

Inconsistencies across product names and descriptions
Variations across brands, retailers, and sellers
Difficulty in training algorithms to identify accurate matches
Frequent unavailability of universal product identifiers (such as UPC, GTIN, or ASIN)
Limited capabilities for processing large volumes of data

**Different attributes of the products in online shops**

Inaccurate and untimely matches may lead to general pricing errors, preventable mistakes during the promotional season, blind spots in merchandising, overstocking, and a worse customer experience overall.

What is out there on the market for retailers seeking to perform product mapping? Two things: quality without robustness, aka manual mapping, or raw data without insights, aka web scraping.

👎 Manual mapping

These solutions rely heavily on manual data aggregation and have limited capabilities in terms of product matching. They are typically built by professionals who may not specialize in data analysis, leading to operational challenges such as manual research, data entry, maintenance and updates. Just imagine a team of people regularly checking two websites and writing down their comparisons into an Excel sheet. Due to the significant amount of human intervention and effort involved, manual mapping tends to be expensive, slow, and difficult to scale.

📄 Web scraping

Scraping-based solutions by themselves are timely and definitely more on the automated side of things, but they get slowed down because of the need for data normalization. Web scraping by itself is just raw web data, and data on its own won’t deliver relevant and actionable insights, especially when dealing with large volumes during high-stakes seasons involving promotional campaigns.

In summary, pricing intelligence solutions often suffer from issues like delayed information, limited capabilities, operational challenges, and high costs. Overcoming these challenges is necessary to ensure retailers can effectively acquire accurate and timely insights for a competitive advantage in the market.

What is needed is a comprehensive pricing intelligence solution able to match products across various websites, quickly and accurately. A solution that would either outperform manual mapping or accelerate it. Cue AI Product Mapping powered by scraping.

🤖 What is AI Product Matcher?

AI Product Matcher 🔗 is an AI model able to compare two items in two different web stores and identify whether they are the same. Part of being a web scraping company is extracting data, including product data; it is what our customers do on the Apify platform daily and successfully so. But raw product data is just that – data. Quite a few of our customers requested a solution able not only to get the product data regularly but also to analyze it and find product matches. As a result, we’ve created this Matcher Actor as an extension of our regular data extraction tools.

Using AI Product Matcher will enable you to:

Monitor exact product matches across your industry
Make sure you only get real-time data via web scraping
Turn that e-commerce data into actionable insights
Complement or replace manual mapping
Get realistic estimates for upcoming promo campaigns

Discover the story behind building the AI Product Matcher:

💡

Find out more about building functional AI models for web scraping.

❓ How to match products using AI

Step 1. Find AI Product Matcher

Head over to Apify Store and find AI Product Matcher in the AI category. Then click on the ▷ Try actor button.

**Step 1. Find the AI Product Matcher in Apify Store**

If you already have an account on Apify and are signed in, you’ll find the Matcher right there. Otherwise, you’ll be asked to create a new account using your email.

No matter which way you get there, you will always end up in Apify Console. Apify Console is your workspace for interacting with all your scraping and web automation tools, including the AI Product Matcher. Now you can move on to the next steps.

✏️

Tip: If you want to simply see how the Product Matcher works, you can go with its prefilled input by skipping all steps and clicking on the Start button.

Step 2. Create the datasets to work with

Before we can start thinking about matching, we need to find products to match. The easiest way to get product data is by scraping websites in real-time. This can be done in several different ways:

We scrape: use one of the many scrapers available on the Apify Store (see, for example, how you can use our Amazon scraper).
You scrape: make your own scraper on the Apify platform by using one of the ready-made boilerplates or our open-source scraping library, Crawlee.
You already have the dataset: upload your own dataset to the Apify Platform using our API (with clients available in JavaScript and in Python). Here’s an example of how you can accomplish that in Python:

from apify_client import ApifyClient
import pandas as pd

apify_client = ApifyClient('### Insert your API access token here ###')

data_to_upload = pd.read_csv("productsFromEshop1.csv").fillna("")
dataset_collection_client = apify_client.datasets()
dataset_info = dataset_collection_client.get_or_create(
    name='productsFromEshop1' # the name is just for your convenience, it can be anything you want
)
data_client = apify_client.dataset(dataset_info['id'])
data_client.push_items(data_to_upload.to_dict(orient='records'))

After you’ve prepared datasets containing products from different online stores, let’s move on to configuring the matcher itself.

Step 3. Add the datasets

In order for Product Matcher to run, you need to tell it two things:

Which datasets of products do you want to look for matches in?
What format are these datasets in?

Let’s start with the datasets themselves. No matter how you got your datasets onto the platform, each of them will be assigned an id. All you need to do in the first section called Separate datasets is copy-paste those dataset ids.

**Step 3. Add the datasets (just their ids)**

In our default example, we have two datasets, each containing chairs and laptops. The first dataset has chairs and laptops from store 1, and the second has chairs and laptops from store 2. Our goal is to see if there are any chair or laptop matches between the two online stores.

Step 3.B

Alternatively, you could go to the Pair datasets field. In this case, you can give the Matcher a pair dataset, meaning a dataset where each row contains information about both products from the different online shops you want to compare. In our example, our pair dataset would have had chairs and laptops from both shops.

**Alternatively for Step 3, give the Matcher a pair dataset.**

❗️

Note: if you use Step 3.B, the Separate datasets input above will be ignored.
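One way to picture a pair dataset is as a list of rows where both products are flattened into a single record. The sketch below builds such rows in plain Python by combining every product from one shop with every product from the other. The `shop1_` and `shop2_` column prefixes and the sample records are assumptions made for illustration; they are not taken from the Matcher's readme.

```python
from itertools import product


def build_pair_rows(shop1_items, shop2_items, prefix1="shop1_", prefix2="shop2_"):
    """Combine every product from shop 1 with every product from shop 2.

    The prefixes are illustrative assumptions; use whatever column names
    you later reference in the attribute mappings.
    """
    rows = []
    for first, second in product(shop1_items, shop2_items):
        row = {f"{prefix1}{key}": value for key, value in first.items()}
        row.update({f"{prefix2}{key}": value for key, value in second.items()})
        rows.append(row)
    return rows


# Hypothetical sample data: two products per shop.
shop1 = [
    {"url": "https://shop1.example/chair", "name": "Office chair", "price": 120.0},
    {"url": "https://shop1.example/laptop", "name": "Laptop 14", "price": 899.0},
]
shop2 = [
    {"url": "https://shop2.example/chair", "name": "Ergonomic office chair", "price": 115.0},
    {"url": "https://shop2.example/notebook", "name": "Laptop 14 inch", "price": 910.0},
]

pair_rows = build_pair_rows(shop1, shop2)
print(len(pair_rows))  # 2 x 2 products give 4 candidate pairs
```

Keep in mind that the number of rows grows with the product of the two dataset sizes, so for large catalogs you may want to narrow down the candidates (for example by category) before pairing.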

Step 4. Specify attributes in your product datasets

The second part is a little more complicated: you need to fill out the Attribute mappings section. In this input, you need to specify which fields in your datasets contain the information the Matcher needs to work with and compare: name of the product, its price, description, code number, specification. You do this by creating a JSON object for each online store.

In this object, each property represents a piece of information the Matcher needs, and the value of each property tells it which column in the dataset you’ve provided contains this piece of information. For a more detailed explanation of what each property should be, check the readme of AI Product Matcher 🔗.

You can omit some of the properties if you don’t have the corresponding information available. The exception is id, which is required and must be a unique identifier; the product URL will usually suffice. However, you should keep in mind that the accuracy of the Matcher depends on how much information it is given. So the fewer properties you give to the Matcher, the less accurate the result will be.
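To make the idea concrete, here is a minimal sketch of what such a mapping could look like, written as a Python dictionary and serialized to JSON. The `eshop1`/`eshop2` keys, the property names other than `id`, and all column names are assumptions for illustration; check the Matcher's readme for the exact names it expects. The small helper only checks the one rule stated above (every shop needs an `id`) and that each mapped column exists in a sample row.

```python
import json

# Hypothetical mapping: property the Matcher needs -> column in your dataset.
attribute_mapping = {
    "eshop1": {
        "id": "url",
        "name": "title",
        "price": "price",
        "description": "description",
    },
    "eshop2": {
        "id": "productUrl",
        "name": "productName",
        "price": "currentPrice",
    },
}


def check_mapping(mapping, sample_rows):
    """Return a list of problems found in the mapping (empty if none)."""
    problems = []
    for shop, properties in mapping.items():
        if "id" not in properties:
            problems.append(f"{shop}: missing required 'id' property")
        row = sample_rows.get(shop, {})
        for prop, column in properties.items():
            if column not in row:
                problems.append(f"{shop}: column '{column}' for '{prop}' not in dataset")
    return problems


# One made-up row per shop, shaped like the datasets the mapping refers to.
sample_rows = {
    "eshop1": {"url": "https://shop1.example/chair", "title": "Office chair",
               "price": 120.0, "description": "Mesh back"},
    "eshop2": {"productUrl": "https://shop2.example/chair",
               "productName": "Ergonomic office chair", "currentPrice": 115.0},
}

problems = check_mapping(attribute_mapping, sample_rows)
print(problems or "mapping looks consistent")
print(json.dumps(attribute_mapping, indent=2))
```

Running a check like this before starting the Actor can catch a mistyped column name early, which is cheaper than discovering it after a paid run.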

Step 4.B Precision or Recall

You can also configure several more things: what the output dataset will look like and whether the Matcher should be focused on precision or recall. You can find an explanation of these fields and how to work with them in the Matcher’s readme 🔗.

But for now, the default settings in these extra fields should be enough.

Step 5. Run the Matcher and check the output

Once you are finished with the configuration, you can run the Matcher by clicking the Start button at the bottom of the page. Wait until the Actor finishes the run. You will know it’s done when you see the Succeeded label appear at the top left. In our case, it only took 49 seconds!

**Step 5. Run the Matcher and check the output**

As you can see from the screenshot above, the run we did gave us six considered pairs in total, only two of them being true matches. To see the results, click on the Storage tab. If you pick HTML table as the format to view the results, they should look like this:

**Sample output containing 6 results, two of which are found matches**

Let’s decipher them. As you can see, each row represents one combination of a product from the first shop with a product from the second shop. The Predicted_match and Predicted_scores columns contain information about whether the two products are the same and how certain the Matcher is about that.

✏️

The Matcher assigns each pair a score between 0 and 1. A score of 1 means it considers the two products the same, and 0 means it considers them different. The decimals in between indicate how similar the two products might be: 0.9 means they are very likely the same, and 0.1 means they are very likely different.

In our case, we have two product matches: a laptop and a chair. We’ve found the same laptop and the same chair across both stores, with scores of 0.93 and 0.9, meaning the Matcher is very sure about it.
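If you export the results and want to post-process them yourself, a filter along these lines may be a useful starting point. It is a sketch only: the `Predicted_match` and `Predicted_scores` column names come from the output above, but the `id1`/`id2` columns, the 1/0 encoding of `Predicted_match`, and all rows except the two match scores (0.93 and 0.9) are made up for illustration.

```python
def confident_matches(rows, min_score=0.9):
    """Keep rows flagged as matches whose score is at least `min_score`."""
    return [
        row for row in rows
        if row["Predicted_match"] and row["Predicted_scores"] >= min_score
    ]


# Six considered pairs, two of them matches, mirroring the example run.
# Identifiers and the non-match scores are hypothetical.
rows = [
    {"id1": "shop1/laptop", "id2": "shop2/laptop", "Predicted_match": 1, "Predicted_scores": 0.93},
    {"id1": "shop1/chair", "id2": "shop2/chair", "Predicted_match": 1, "Predicted_scores": 0.9},
    {"id1": "shop1/laptop", "id2": "shop2/chair", "Predicted_match": 0, "Predicted_scores": 0.02},
    {"id1": "shop1/chair", "id2": "shop2/laptop", "Predicted_match": 0, "Predicted_scores": 0.01},
    {"id1": "shop1/laptop", "id2": "shop2/laptop-pro", "Predicted_match": 0, "Predicted_scores": 0.35},
    {"id1": "shop1/chair", "id2": "shop2/stool", "Predicted_match": 0, "Predicted_scores": 0.2},
]

for match in confident_matches(rows):
    print(match["id1"], "<->", match["id2"], match["Predicted_scores"])
```

Raising `min_score` keeps only the pairs the Matcher is most certain about, while lowering it lets through more borderline pairs that you may want to review by hand.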

As mentioned in Step 4.B, you can also add more columns to the output dataset if needed.

Step 6. Set up matching to run regularly

Now that you have the Matcher set up, you can easily run it again and again just by changing the input datasets and hitting the Start button. You can also set up the entire pipeline, from the scrapers to the Matcher, and schedule it to run automatically on our platform at regular intervals, making sure you are always on top of pricing changes.
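If you prefer to drive the pipeline from your own code rather than from Apify Console, the flow could be sketched roughly as follows. This is an assumption-laden outline, not the official setup: the function expects a client that behaves like `ApifyClient` from the `apify_client` package, the Actor ids are placeholders you supply, and the shape of the Matcher's input is delegated to a `build_matcher_input` callable that you write according to the Matcher's readme.

```python
def run_pipeline(client, scraper_runs, matcher_actor_id, build_matcher_input):
    """Run each scraper, then the Matcher, and return the Matcher's output items.

    `client` is expected to behave like apify_client.ApifyClient.
    `scraper_runs` is a list of (actor_id, run_input) tuples.
    `build_matcher_input` turns the scraped dataset ids into the Matcher's input.
    """
    dataset_ids = []
    for actor_id, run_input in scraper_runs:
        # call() starts the Actor and waits for the run to finish.
        run = client.actor(actor_id).call(run_input=run_input)
        dataset_ids.append(run["defaultDatasetId"])

    matcher_run = client.actor(matcher_actor_id).call(
        run_input=build_matcher_input(dataset_ids)
    )
    # Read the Matcher's results from the default dataset of its run.
    return client.dataset(matcher_run["defaultDatasetId"]).list_items().items
```

In practice you would pass in `ApifyClient('<your API token>')` together with the ids of the scrapers and the Matcher you use, and trigger the function from whatever scheduler you already rely on. Keeping the client as a parameter also makes the flow easy to dry-run with a stand-in object before spending any credits.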


❓ FAQ

🎨 Can the Matcher identify the same item in different colors?

Yes. The Matcher is trained to consider different color variants of the same product to be the same product. For instance, different color variants of the same smartphone will be counted as the same smartphone even though they might slightly differ in price.

💸 How much does it cost to do product matching with this AI?

One thing to keep in mind is that the Actor is paid: it charges a small amount based on the number of results it produces. If you’ve just created your Apify account, your Free plan should have enough free credits to try the Actor, but if you want to run it regularly or on any kind of scale, you will need to upgrade to a paid plan.

⚖️ Will the results of the Matcher be different depending on Precision or Recall?

Yes. Since no ML model is perfect, we've added a Precision/Recall tradeoff option that determines which type of mistake the model focuses on minimizing. You can choose between Precision, which aims for highly accurate matches but may result in more false negatives, or Recall, which prioritizes finding as many true positives as possible, even if it means more false positives. For more details about the Precision/Recall choice, see the readme.
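To see why the two goals pull in opposite directions, consider a generic worked example. It assumes that a pair counts as a match when its score reaches a chosen threshold, which is a common way to implement such a tradeoff but is not confirmed to be how the Matcher works internally. The scores and ground-truth labels are made up.

```python
def precision_recall(scored_pairs, threshold):
    """Compute precision and recall when pairs scoring >= threshold count as matches.

    `scored_pairs` is a list of (score, is_true_match) tuples.
    """
    tp = sum(1 for score, truth in scored_pairs if score >= threshold and truth)
    fp = sum(1 for score, truth in scored_pairs if score >= threshold and not truth)
    fn = sum(1 for score, truth in scored_pairs if score < threshold and truth)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall


# Made-up scores with hand-assigned ground truth, for illustration only.
scored_pairs = [
    (0.95, True), (0.85, True), (0.60, True),
    (0.70, False), (0.30, False), (0.10, False),
]

for threshold in (0.8, 0.5):
    precision, recall = precision_recall(scored_pairs, threshold)
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
# threshold=0.8: precision=1.00, recall=0.67
# threshold=0.5: precision=0.75, recall=1.00
```

With the stricter threshold, every reported match is correct but one true match is missed. With the looser threshold, all true matches are found at the cost of one false positive.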

📜 Can I change the output dataset?

Yes. The Matcher's output format is not set in stone, so if you want to see the product names right away instead of their ids, for instance, you can set that up. You can do it using the output_mapping Actor input, which is very similar to input_mapping. As with the input mapping, you'll have to specify the output attributes separately for each online shop. For more details about output dataset configuration, head over to the readme.

➡️ What’s next?

If you encounter any issues while using the AI Product Matcher or would like to suggest improvements that would be useful to you, add them to the Issues tab. As mentioned above, you can also find more information on the Matcher in its readme page, including how accurate you can expect it to be in practice.

If you need to process large numbers of products or if the Matcher isn’t accurate enough for your use case, it could be cheaper and more effective to have us prepare a tailor-made solution for you. If that’s the case, feel free to contact our enterprise team.