Scalable product matching across multiple online shops is a challenge faced by companies and individuals for many different reasons. You might need it to implement dynamic pricing for your own website, perform competitor analysis on a daily basis or write an article rigorously comparing various online stores.
After seeing how many people wanted us at Apify to not only extract product data from separate e-commerce stores but also match them, we went ahead and created an AI-model-based Product Matcher. This is a tutorial on how you can easily use it.
🗺 What is product mapping for?
In today's competitive e-commerce landscape, shoppers have a myriad of options, with thousands of products and hundreds of retailers to choose from. They can effortlessly compare prices across multiple e-commerce websites before ultimately settling for the best – often the most affordable – option. Since the price tag plays such a big role here, retailers face the constant pressure of adapting their prices each day in their quest to keep up with the market and still manage to balance their offer against the return on sales.
So how can you keep track of all those price changes? Well, the process of comparing prices across websites consists of identifying, then categorizing, and finally matching products. But all websites look different and contain slightly different information even about the same products. The lack of any standard layout for e-commerce products makes this undertaking truly challenging in technical terms and, no, Amazon ASINs won’t always help here. Hence, product matching or, as it is also known, product mapping.
You might think there’s some clever app for this, but believe it or not, on average, human beings are still better than machines at identifying a product and comparing its price across websites. But getting pricing intelligence via manual mapping is not scalable or consistent.
🤌 What about existing pricing intelligence solutions?
The existing pricing intelligence solutions come with a set of challenges. One of the main issues is lag: many of these solutions work retrospectively, providing insights well after a price change has happened. This delay in obtaining actionable information often prevents retailers from gaining a competitive edge. Other challenges facing the existing solutions on the market include:
- Inconsistencies across product names and descriptions
- Variations across brands, retailers, and sellers
- Difficulty in training algorithms to identify accurate matches
- Frequent unavailability of universal product identifiers (such as UPC, GTIN, or ASIN)
- Limited capabilities for processing large volumes of data
Inaccurate and untimely matches may lead to general pricing errors, preventable mistakes during the promotional season, blind spots in merchandising, overstocking, and a worse customer experience overall.
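To make the first challenge on the list more tangible, here is a minimal sketch of a naive approach: comparing product names with plain string similarity from Python's standard library. The two product names are made-up examples, and this is not how AI Product Matcher works internally; it only illustrates why inconsistent naming makes simple text comparison unreliable.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Return a similarity ratio between 0 and 1 for two product names (case-insensitive)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Hypothetical listings of the same laptop in two different shops.
shop1_name = 'Lenovo ThinkPad X1 Carbon Gen 10, 14" i7 16GB 512GB'
shop2_name = "ThinkPad X1 Carbon G10 (14 inch, Core i7, 16 GB RAM, 512 GB SSD) by Lenovo"

score = name_similarity(shop1_name, shop2_name)
print(f"similarity: {score:.2f}")
```

Even though both names describe the same product, the ratio ends up below 1.0, and picking a cut-off that works across thousands of products and several shops is hard to do by hand.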
What is out there on the market for retailers seeking to perform product mapping? Two things: quality without robustness, aka manual mapping, or raw data without insights, aka web scraping.
🤌 Manual mapping
These solutions rely heavily on manual data aggregation and have limited capabilities in terms of product matching. They are typically built by professionals who may not specialize in data analysis, leading to operational challenges such as manual research, data entry, maintenance and updates. Just imagine a team of people regularly checking two websites and writing down their comparisons into an Excel sheet. Due to the significant amount of human intervention and effort involved, manual mapping tends to be expensive, slow, and difficult to scale.
📄 Web scraping
Scraping-based solutions are timely and definitely more on the automated side of things, but they get slowed down by the need for data normalization. Web scraping on its own produces raw web data, and raw data won’t deliver relevant and actionable insights, especially when dealing with large volumes during high-stakes seasons involving promotional campaigns.
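As an illustration of what data normalization can involve, here is a small sketch that turns scraped price strings into numbers. The input formats and the `normalize_price` helper are our own assumptions for this example, and the heuristic (a trailing separator followed by one or two digits is the decimal part) will not cover every locale.

```python
import re
from typing import Optional


def normalize_price(raw: str) -> Optional[float]:
    """Parse a scraped price string such as '$1,299.00' or '1 299,00 €' into a float.

    Returns None when the string contains no digits.
    """
    # Keep only digits and separators, then drop separators left dangling at the edges.
    cleaned = re.sub(r"[^\d.,]", "", raw).strip(".,")
    if not cleaned:
        return None

    # Heuristic: a separator followed by 1-2 digits at the end marks the decimal part.
    decimal_match = re.search(r"[.,](\d{1,2})$", cleaned)
    if decimal_match:
        integer_part = re.sub(r"[.,]", "", cleaned[: decimal_match.start()])
        decimal_part = decimal_match.group(1)
    else:
        # Everything else is treated as thousands separators.
        integer_part = re.sub(r"[.,]", "", cleaned)
        decimal_part = "0"

    return float(f"{integer_part or '0'}.{decimal_part}")


for raw_price in ["$1,299.00", "1 299,00 €", "1.299 Kč", "USD 49.9", "Free"]:
    print(raw_price, "->", normalize_price(raw_price))
```

Every shop formats prices, names, and specifications a little differently, so this kind of cleanup has to happen before any comparison makes sense.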
In summary, pricing intelligence solutions often suffer from issues like delayed information, limited capabilities, operational challenges, and high costs. Overcoming these challenges is necessary to ensure retailers can effectively acquire accurate and timely insights for a competitive advantage in the market.
What is needed is a comprehensive pricing intelligence solution able to match products across various websites, quickly and accurately. A solution that would either outperform manual mapping or accelerate it. Cue AI product mapping powered by scraping.
🦾 What is AI Product Matcher?
AI Product Matcher is an AI model able to compare two items in two different web stores and identify whether they are the same. Part of being a web scraping company is extracting data, including product data; our customers do this successfully on the Apify platform every day. But raw product data is just that – data. Quite a few of our customers requested a solution able not only to get the product data regularly but also to analyze it and find product matches. As a result, we’ve created this Matcher Actor as an extension of our regular data extraction tools.
Using AI Product Matcher will enable you to:
- Monitor exact product matches across your industry
- Make sure you only get real-time data via web scraping
- Turn that e-commerce data into actionable insights
- Complement or replace manual mapping
- Get realistic estimates for upcoming promo campaigns
🥾 How to match products using AI
Step 1. Find AI Product Matcher
Head over to Apify Store and find AI Product Matcher in the AI category. Then click on the ▷ Try actor button.
If you already have an account on Apify and are signed in, you’ll find the Matcher right there. Otherwise, you’ll be asked to create a new account using your email.
No matter which way you get there, you will always end up in Apify Console. Apify Console is your workspace for interacting with all your scraping and web automation tools, including the AI Product Matcher. Now you can move on to the next steps.
Step 2. Create the datasets to work with
Before we can start thinking about matching, we need to find products to match. The easiest way to get product data is by scraping websites in real time. This can be done in several different ways:
- We scrape: use one of the many scrapers available on the Apify Store (see, for example, how you can use our Amazon scraper).
- You scrape: make your own scraper on the Apify platform by using one of the ready-made boilerplates or our open-source scraping library, Crawlee.
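- You upload: if you already have product data, for example in a CSV file, you can upload it to a dataset on the Apify platform using the Apify API client for Python:

```python
import pandas as pd
from apify_client import ApifyClient

apify_client = ApifyClient('### Insert your API access token here ###')

# Load the products and replace missing values with empty strings
data_to_upload = pd.read_csv("productsFromEshop1.csv").fillna("")

# Create the dataset, or reuse it if one with this name already exists
dataset_collection_client = apify_client.datasets()
dataset_info = dataset_collection_client.get_or_create(
    name='productsFromEshop1'  # the name is just for your convenience, it can be anything you want
)

# Push the products to the dataset, one record per row
data_client = apify_client.dataset(dataset_info['id'])
data_client.push_items(data_to_upload.to_dict(orient='records'))
```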
After you’ve prepared datasets containing products from different online stores, let’s move on to configuring the matcher itself.
Step 3. Add the datasets
In order for Product Matcher to run, you need to tell it two things:
- Which datasets of products do you want to look for matches in?
- What format are these datasets in?
Let’s start with the datasets themselves. No matter how you got your datasets onto the platform, each of them will be assigned an id. All you need to do in the first section called Separate datasets is copy-paste those dataset ids.
In our default example, we have two datasets, each containing chairs and laptops. The first dataset has chairs and laptops from store 1, and the second has chairs and laptops from store 2. Our goal is to see if there are any chair or laptop matches between the two online stores.
Alternatively, you could go to the Pair datasets field. In this case, you can give the Matcher a pair dataset, meaning a dataset where each row contains information about both products from the different online shops you want to compare. In our example, our pair dataset would have had chairs and laptops from both shops.
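To give a rough idea of what a pair dataset could look like, here is a minimal sketch that combines every product from one shop with every product from the other into a single row. The field names and the `1`/`2` suffixes are hypothetical, chosen only for this illustration; check the Matcher’s readme for the format it actually expects.

```python
from itertools import product


def build_pair_rows(shop1_items, shop2_items):
    """Combine every product from shop 1 with every product from shop 2 into one row each.

    Fields from shop 1 get the suffix '1' and fields from shop 2 get the suffix '2'
    (a naming convention assumed for this example only).
    """
    rows = []
    for item1, item2 in product(shop1_items, shop2_items):
        row = {f"{key}1": value for key, value in item1.items()}
        row.update({f"{key}2": value for key, value in item2.items()})
        rows.append(row)
    return rows


# Hypothetical products from two shops.
shop1 = [
    {"url": "https://shop1.example/chair", "name": "Office chair", "price": 120.0},
    {"url": "https://shop1.example/laptop", "name": "Laptop 14 inch", "price": 999.0},
]
shop2 = [
    {"url": "https://shop2.example/p/1", "name": "Ergonomic office chair", "price": 115.0},
    {"url": "https://shop2.example/p/2", "name": "14-inch laptop", "price": 1010.0},
]

pair_rows = build_pair_rows(shop1, shop2)
print(len(pair_rows), "candidate pairs")
print(pair_rows[0])
```

Keep in mind that the number of rows is the product of the two dataset sizes, so for larger catalogs you would probably want to narrow down the candidates first, for example by category.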
Step 4. Specify attributes in your product datasets
The second part is a little more complicated - you need to fill out the Attribute mappings section. In this input, you need to specify which fields in your datasets contain the information the Matcher needs to work with and compare: name of the product, its price, description, code number, specification. You do this by creating a JSON object for each online store.
In this object, each property represents a piece of information the Matcher needs, and the value of each property tells it which column in the dataset you’ve provided contains this piece of information. For a more detailed explanation of what each property should be, check the readme of AI Product Matcher 🔗.
You can omit some of the properties if you don’t have the corresponding information available. The only exception is id, which is required and should be a unique identifier; the product URL will usually suffice. However, you should keep in mind that the accuracy of the Matcher depends on how much information it is given. So the fewer properties you give to the Matcher, the less accurate the result will be.
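Here is a sketch of what such attribute mappings could look like, built in Python and printed as JSON. Only the id property is named in this tutorial; the other property names, the `eshop1`/`eshop2` keys, and all column names are assumptions made for illustration, so take the exact names from the readme. The small `validate_mapping` helper is also our own addition: it checks that id is mapped and that every mapped column exists in a sample row.

```python
import json

REQUIRED_PROPERTIES = {"id"}


def validate_mapping(mapping: dict, sample_row: dict) -> list:
    """Check one shop's mapping against a sample dataset row.

    Raises ValueError if a required property is missing and returns the list
    of mapped columns that do not exist in the sample row.
    """
    missing_required = REQUIRED_PROPERTIES - mapping.keys()
    if missing_required:
        raise ValueError(f"Missing required properties: {sorted(missing_required)}")
    return [column for column in mapping.values() if column not in sample_row]


# Hypothetical mappings: property the Matcher needs -> column in your dataset.
input_mapping = {
    "eshop1": {"id": "url", "name": "title", "price": "price", "description": "description"},
    "eshop2": {"id": "productUrl", "name": "productName", "price": "currentPrice"},
}

# A hypothetical row from the first shop's dataset.
sample_row_eshop1 = {
    "url": "https://shop1.example/chair",
    "title": "Office chair",
    "price": 120.0,
    "description": "Adjustable office chair with armrests",
}

print(json.dumps(input_mapping, indent=2))
print("Unknown columns:", validate_mapping(input_mapping["eshop1"], sample_row_eshop1))
```

Note that the second shop’s mapping leaves out the description, which is allowed, but as explained above it gives the Matcher less to work with.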
Step 4.B Precision or Recall
You can also configure several more things: what the output dataset will look like and whether the Matcher should be focused on precision or recall. You can find an explanation of these fields and how to work with them in the Matcher’s readme 🔗.
But for now, the default settings in these extra fields should be enough.
Step 5. Run the Matcher and check the output
Once you are finished with the configuration, you can run the Matcher by clicking the Start button at the bottom of the page. Wait until the Actor finishes the run; you will know it’s done when the Succeeded label appears at the top left. In our case, it only took 49 seconds!
The run we did gave us six considered pairs in total, only two of them being true matches. To see the results, click on the Storage tab and pick a format to view them in, for example HTML table.
Let’s decipher them. Each row represents one possible combination of a product from the first shop and a product from the second. The Predicted_match and Predicted_scores columns tell you whether the Matcher considers the two products to be the same and how certain it is about that.
In our case, we have two product matches: a laptop and a chair. We’ve found the same laptop and same chair across both stores, with scores 0.93 and 0.9 - meaning the Matcher is very sure about it.
As mentioned in Step 4.B, you can also add more columns to the output dataset if needed.
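If you prefer to process the output in code rather than in the Console, the sketch below filters result rows down to confident matches. The Predicted_match and Predicted_scores column names come from the output described above; the id1/id2 columns, the value types (a truthy flag and a float score), and the toy rows are assumptions. In practice, you could load the rows with the Python API client, for example via `ApifyClient(token).dataset(dataset_id).list_items().items`.

```python
def confirmed_matches(rows, min_score=0.9):
    """Keep only rows flagged as a match with a score of at least min_score."""
    return [
        row
        for row in rows
        if row.get("Predicted_match") and row.get("Predicted_scores", 0.0) >= min_score
    ]


# Toy output rows; the two matches use the scores from our example run,
# the remaining rows and all ids are made up for illustration.
toy_rows = [
    {"id1": "shop1/laptop", "id2": "shop2/laptop", "Predicted_match": 1, "Predicted_scores": 0.93},
    {"id1": "shop1/chair", "id2": "shop2/chair", "Predicted_match": 1, "Predicted_scores": 0.9},
    {"id1": "shop1/laptop", "id2": "shop2/chair", "Predicted_match": 0, "Predicted_scores": 0.02},
    {"id1": "shop1/chair", "id2": "shop2/laptop", "Predicted_match": 0, "Predicted_scores": 0.01},
]

for match in confirmed_matches(toy_rows):
    print(match["id1"], "<->", match["id2"], match["Predicted_scores"])
```

Raising `min_score` gives you a shorter list of matches you can trust more, which is handy when the matches feed directly into automated repricing.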
Step 6. Set up matching to run regularly
Now that you have the Matcher set up, you can easily run it again and again just by changing the input datasets and hitting the Start button. You can also set up the entire pipeline, from the scrapers to the Matcher, and schedule it to run automatically on our platform at regular intervals, making sure you are always on top of pricing changes.
🎨 Can the Matcher identify the same item in different colors?
Yes. The Matcher is trained to consider different color variants of the same product to be the same product. For instance, different color variants of the same smartphone will be counted as the same smartphone even though they might slightly differ in price.
💸 How much does it cost to do product matching with this AI?
One thing to keep in mind is that the Actor is paid: you are charged a small amount based on the number of results it produces. You can find the precise price in the Matcher’s readme. If you’ve just created your Apify account, your Free plan should have enough free credits to try the Actor, but if you want to run it regularly or on any kind of scale, you will need to upgrade to a paid plan.
⚖️ Will the results of the Matcher be different depending on Precision or Recall?
Yes. Since no ML model is perfect, we've added a Precision/Recall tradeoff option that determines which type of mistake the model focuses on minimizing. You can choose between Precision, which aims for highly accurate matches but may result in more false negatives, or Recall, which prioritizes finding as many true positives as possible, even if it means more false positives. For more details about the Precision/Recall choice, see the readme.
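To see the tradeoff in numbers, here is a small sketch that computes precision and recall for the same set of scored pairs at two different decision thresholds. Treating the option as a score threshold is our simplification for this example and not necessarily how the Matcher implements it; the scores and labels are toy data.

```python
def precision_recall(scores, labels, threshold):
    """Compute (precision, recall) when pairs scoring >= threshold are predicted as matches."""
    predicted = [score >= threshold for score in scores]
    true_positives = sum(p and l for p, l in zip(predicted, labels))
    false_positives = sum(p and not l for p, l in zip(predicted, labels))
    false_negatives = sum(not p and l for p, l in zip(predicted, labels))

    # By convention, report 1.0 when there is nothing to be wrong about.
    precision = true_positives / (true_positives + false_positives) if true_positives + false_positives else 1.0
    recall = true_positives / (true_positives + false_negatives) if true_positives + false_negatives else 1.0
    return precision, recall


# Toy data: model scores for six candidate pairs and whether each pair is truly a match.
scores = [0.93, 0.90, 0.62, 0.55, 0.30, 0.10]
labels = [True, True, False, True, False, False]

strict = precision_recall(scores, labels, threshold=0.8)   # precision-focused
lenient = precision_recall(scores, labels, threshold=0.5)  # recall-focused

print(f"strict:  precision={strict[0]:.2f}, recall={strict[1]:.2f}")
print(f"lenient: precision={lenient[0]:.2f}, recall={lenient[1]:.2f}")
```

With the strict threshold, every predicted match is correct but one true match is missed; with the lenient threshold, all true matches are found at the cost of one false positive.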
📜 Can I change the output dataset?
Yes. The format of the Matcher's output is not set in stone, so if you want to see the product names right away instead of their ids, for instance, you can set that up. You can do that using the output_mapping Actor input, which is very similar to input_mapping. Same as with the input mapping, you'll have to specify the output attributes separately for each online shop. For more details about output dataset configuration, head over to the readme.
➡️ What’s next?
If you encounter any issues while using the AI Product Matcher or would like to suggest improvements that would be useful to you, add them to the Issues tab. As mentioned above, you can also find more information on the Matcher in its readme page, including how accurate you can expect it to be in practice.
If you need to process large numbers of products or if the Matcher isn’t accurate enough for your use case, it could be cheaper and more effective to have us prepare a tailor-made solution for you. If that’s the case, feel free to contact our enterprise team.