A guide to data collection for training computer vision models

How to create datasets from image and video data for machine learning applications.


Hi, we're Apify, a full-stack web scraping and browser automation platform. This article about computer vision was inspired by our work on getting better data for AI.

What is computer vision?

Have you ever used Google Translate by pointing your smartphone at a sign in a foreign language to get an almost immediate translation of it? I have, and I thank computer vision for it every time.

You've heard of self-driving cars, right? Computer vision is behind those, too.

What about face detection? Computer vision again.

The history of computer vision goes back to 1959, almost as far back as AI itself. Computer vision is a field of artificial intelligence that helps computers “see” the world around them. It involves developing algorithms, techniques, and systems that allow computers to analyze and extract meaning from images, videos, and other visual data.

The applications of computer vision are very broad, covering fields as wide-ranging as automotive manufacturing, optical character recognition, and face detection.

It doesn't look like this field is going to slow down any time soon, either, with the market for computer vision predicted to reach $82.1 billion by 2032. And with the rise of multimodal AI, computer vision use cases are likely to expand.

Datasets: ground truth for machine learning

For computer vision to work, AI models require access to datasets that serve as their "ground truth" for learning. The process of collecting data for such datasets is pivotal in the development of efficient computer vision models, as the quality and quantity of the data directly influence their accuracy and performance.

What is meant by ground truth?

"Ground truth" refers to the correct values or labels used for training and evaluation purposes. It serves as a reference or benchmark against which the performance of an AI model is measured. Ground truth data is critical for supervised learning, where models are trained using labeled examples and then assessed for their ability to make accurate predictions.

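As a minimal sketch of how ground truth acts as a benchmark, the snippet below compares a model's predicted labels with ground-truth labels and reports the share that match. The function name and the sample labels are illustrative assumptions, not taken from any particular library.

```python
def accuracy(ground_truth, predictions):
    """Fraction of predictions that match the ground-truth labels."""
    if len(ground_truth) != len(predictions):
        raise ValueError("ground_truth and predictions must have the same length")
    if not ground_truth:
        raise ValueError("cannot compute accuracy on an empty dataset")
    correct = sum(1 for truth, pred in zip(ground_truth, predictions) if truth == pred)
    return correct / len(ground_truth)


# Hypothetical labels for four images.
ground_truth = ["giraffe", "zebra", "giraffe", "lion"]
predictions = ["giraffe", "giraffe", "giraffe", "lion"]
print(accuracy(ground_truth, predictions))  # 3 of 4 predictions match
```

Accuracy is only one possible metric; the point is that every metric of this kind needs trusted labels to compare against.
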
Data collection for AI and computer vision

What is data collection in AI and computer vision?

Data collection is a broad term, but in AI it refers to the process of aggregating relevant data and structuring it into datasets suitable for machine learning. The choice of data type, such as video sequences, frames, photos, or patterns, depends on the specific problem the AI model aims to solve.

In the domain of computer vision, AI models are trained using image datasets to make predictions related to things such as image classification, object detection, and image segmentation. These image or video datasets must contain meaningful information for training the model to recognize patterns and make predictions based on them.

For example, in industrial automation, image data is collected to identify specific part defects. Cameras capture footage from assembly lines, and the resulting video frames or photos form the dataset.

Data sources for computer vision

Generating a high-quality machine learning dataset requires identifying sources that will be used to train the model. There are two ways of sourcing and collecting image or video data for computer vision tasks.

Public image datasets

Public machine learning datasets are readily available online and are often open-source and free to use. However, it's important to review the dataset's licensing terms, as some may require payment for commercial projects. Public datasets work well for common computer vision tasks but may not cover unique or highly specific problems.

Custom datasets

Custom datasets can be created by collecting data using web scrapers, cameras, and other sensor-equipped devices like mobile phones or webcams. Third-party dataset service providers can assist in collecting data for machine learning tasks, and machine learning frameworks such as TensorFlow and PyTorch provide ready-made datasets and loaders (for example, through TensorFlow Datasets and torchvision) that can be used alongside custom data.

Image annotation and data labeling

Once data is collected, the next step is image annotation and data labeling, where humans manually provide information about the ground truth within the data. This involves indicating the location and characteristics of objects that the AI model should learn to recognize. For example, training a deep learning model to detect giraffes would involve annotating each image or video frame with bounding boxes around the giraffes linked to the label "giraffe." The trained model can then identify giraffes in new images.
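
To make the giraffe example concrete, here is a minimal sketch of how a bounding-box annotation might be represented and compared with a model's prediction using intersection over union (IoU), a common overlap measure in object detection. The class name, field names, file name, and coordinates are all hypothetical; real annotation tools and formats define their own schemas.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundingBox:
    """Axis-aligned box in pixel coordinates, with a class label."""
    label: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def area(self) -> float:
        return max(0.0, self.x_max - self.x_min) * max(0.0, self.y_max - self.y_min)


def iou(a: BoundingBox, b: BoundingBox) -> float:
    """Intersection over union of two boxes (0.0 = no overlap, 1.0 = identical)."""
    inter_w = min(a.x_max, b.x_max) - max(a.x_min, b.x_min)
    inter_h = min(a.y_max, b.y_max) - max(a.y_min, b.y_min)
    if inter_w <= 0 or inter_h <= 0:
        return 0.0
    intersection = inter_w * inter_h
    union = a.area() + b.area() - intersection
    return intersection / union


# Hypothetical annotation for one image: file name plus its labeled boxes.
annotation = {
    "image": "savanna_001.jpg",
    "boxes": [BoundingBox("giraffe", 40, 10, 140, 210)],
}

# A hypothetical model prediction to compare against the human-drawn box.
predicted = BoundingBox("giraffe", 50, 10, 150, 210)
print(round(iou(annotation["boxes"][0], predicted), 3))
```

The human-drawn box plays the role of ground truth here: the closer the IoU is to 1.0, the better the predicted box matches the annotation.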

Data preparation and characteristics of image data

Most computer vision models are trained on datasets comprising hundreds or thousands of images. The quality of these images is crucial to the AI model's ability to classify or predict outcomes accurately. There are several key characteristics that can help identify a good image dataset:

  • Quality: Images should be detailed enough for the AI model to identify and locate target objects effectively.
  • Variety: Diverse images in the dataset improve the model's performance in various scenarios.
  • Quantity: More data is generally better, as training on a large, accurately labeled dataset increases the model's chances of making accurate predictions.
  • Density: The number of objects per image also matters, as images containing more labeled objects give the model more examples to learn from.
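
The characteristics above can be turned into rough, measurable indicators. The sketch below is one possible way to do that, assuming each image is described by a small record with its dimensions and object labels; the record layout, the resolution thresholds, and the function name are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical records: one dict per image with its size and object labels.
dataset = [
    {"file": "a.jpg", "width": 1280, "height": 720, "labels": ["giraffe", "giraffe"]},
    {"file": "b.jpg", "width": 320, "height": 240, "labels": ["zebra"]},
    {"file": "c.jpg", "width": 1920, "height": 1080, "labels": ["giraffe", "lion", "zebra"]},
]


def summarize(records, min_width=640, min_height=480):
    """Report simple quality, variety, quantity, and density indicators."""
    label_counts = Counter(label for r in records for label in r["labels"])
    total_objects = sum(label_counts.values())
    # Quality proxy: flag images below an (assumed) minimum resolution.
    low_res = [
        r["file"] for r in records
        if r["width"] < min_width or r["height"] < min_height
    ]
    return {
        "quantity": len(records),                 # number of images
        "variety": dict(label_counts),            # objects per class
        "density": total_objects / len(records) if records else 0.0,  # objects per image
        "low_resolution": low_res,
    }


summary = summarize(dataset)
print(summary)
```

Resolution and label counts are only proxies. Variety in lighting, angle, and background, for instance, cannot be read from metadata like this and usually needs manual review.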

Video data collection

While computer vision models are predominantly trained on image datasets, certain applications, like video classification, motion detection, and human activity recognition, require video data. Videos are essentially sequences of images, and the process of collecting video data involves identifying sources, scraping video content, recording video files, extracting frames, and preprocessing the data for machine learning.
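
Frame extraction usually means sampling frames at a fixed interval rather than keeping every one. As a minimal sketch, the helper below only computes which frame indices to keep; actually decoding the video would require a library such as OpenCV, which is left out here. The function name and parameters are hypothetical.

```python
def frame_indices(total_frames: int, fps: float, every_seconds: float) -> list[int]:
    """Indices of frames to keep when sampling one frame every `every_seconds`."""
    if fps <= 0 or every_seconds <= 0:
        raise ValueError("fps and every_seconds must be positive")
    # Number of frames between two kept frames; never less than 1.
    step = max(1, round(fps * every_seconds))
    return list(range(0, total_frames, step))


# A hypothetical 10-second clip at 30 fps, sampled once every 2 seconds.
print(frame_indices(total_frames=300, fps=30, every_seconds=2))
```

Sampling at intervals keeps the dataset smaller and avoids storing many near-identical consecutive frames.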

The best way to collect image and video data

To train computer vision models, you need vast amounts of data. The go-to solution for collecting real-time data at scale for computer vision and other AI applications is web scraping: a method of retrieving unstructured data from websites and converting it into a structured format that machines can process. One way to go about this is to build your own custom scrapers (you can learn how in these free web scraping courses). Another option is to use pre-built scraping and automation tools for extracting image and video data from the web.
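
As a rough illustration of what a custom scraper does at its simplest, the sketch below pulls image URLs out of a page's HTML using only Python's standard library. The class name, function name, and sample HTML are made up for this example, and the page content is hard-coded; a real scraper would download pages over HTTP and would need to respect each site's terms of use.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageLinkParser(HTMLParser):
    """Collects absolute URLs from the src attribute of <img> tags."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.image_urls: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src")
        if src:
            # Resolve relative paths against the page URL.
            self.image_urls.append(urljoin(self.base_url, src))


def extract_image_urls(html: str, base_url: str) -> list[str]:
    parser = ImageLinkParser(base_url)
    parser.feed(html)
    return parser.image_urls


# Hypothetical page content; a real scraper would download this over HTTP.
sample_html = """
<html><body>
  <img src="/photos/giraffe.jpg" alt="A giraffe">
  <img src="https://cdn.example.com/zebra.png">
  <img alt="no source">
</body></html>
"""
print(extract_image_urls(sample_html, "https://example.com/gallery/"))
```

This turns unstructured HTML into a structured list of URLs, which is the core idea of scraping. Pages that load images with JavaScript would need a browser-based approach instead.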

The challenges of web scraping for computer vision

  • Getting blocked

Anyone who has done large-scale data extraction knows that the biggest challenge for web scraping is getting blocked by anti-bot protections.

To deal with these and other challenges, you don't just need a web scraper but infrastructure that sets you up to scrape successfully at scale.

  • Unclean data

Another challenge, particularly in the field of computer vision, where high-quality images are required, is data cleanliness.

The web is full of low-quality images, videos, and audio. So you not only need to perform web scraping, but you also need to clean and process web data to feed AI models.
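
Two simple cleaning steps are dropping images that are too small and removing exact duplicates. The sketch below shows one way to do both with the standard library; the tuple layout, the 224-pixel thresholds, and the function name are assumptions for illustration, and the byte strings stand in for real file contents.

```python
import hashlib


def clean_images(images, min_width=224, min_height=224):
    """Drop exact duplicates and images below a minimum resolution.

    `images` is an iterable of (name, width, height, raw_bytes) tuples.
    Returns the names of the images that were kept, in input order.
    """
    seen_hashes = set()
    kept = []
    for name, width, height, data in images:
        if width < min_width or height < min_height:
            continue  # too small to be useful for training
        digest = hashlib.sha256(data).hexdigest()
        if digest in seen_hashes:
            continue  # byte-for-byte duplicate of an image already kept
        seen_hashes.add(digest)
        kept.append(name)
    return kept


# Hypothetical scraped items; the bytes stand in for real file contents.
scraped = [
    ("a.jpg", 800, 600, b"image-one"),
    ("a_copy.jpg", 800, 600, b"image-one"),
    ("thumb.jpg", 64, 64, b"image-two"),
    ("b.jpg", 1024, 768, b"image-three"),
]
print(clean_images(scraped))
```

Hashing the raw bytes only catches identical files. Resized or re-encoded copies of the same picture would need a different technique, such as perceptual hashing.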

At Apify, we're well aware of these challenges and have a lot of experience in dealing with them. So, if you want a reliable platform to create your own web scrapers for computer vision models, Apify provides the infrastructure you need. If you prefer a ready-made scraper designed to handle the complexities of a particular website, there's a range of web scraping and automation tools for AI available in Apify Store, where developers publish the micro-apps (Actors) they've created for web scraping and automation projects.

Take your pick!

Theo Vasilis
Writer, Python dabbler, and crafter of web scraping tutorials. Loves to inform, inspire, and illuminate. Interested in human and machine learning alike.
