Synthetic data generation vs. real data for AI

Gartner predicts that 60% of all data used for AI and machine learning will be synthetic by the end of 2024. This is bad news for AI!

Hi, we're Apify, a full-stack web scraping and browser automation platform. This article about synthetic vs. real data was inspired by our work on getting better data for AI.

What is synthetic data?

Synthetic data is artificially created data used to replace real data in machine learning applications. It's generated by computer algorithms and used to train neural networks and generative AI models.

Synthetic data generation and the 3 Big Bangs

Although synthetic data goes back to the 1970s, it didn't get much attention until the “Big Bang of AI” in 2012, when a team of researchers used artificially generated training examples (augmented versions of real images) to help train AlexNet, the deep neural network that won the ImageNet image classification competition by a huge margin.

The second Big Bang of AI occurred in 2017 with the arrival of Transformers - the deep learning architectures on which today's generative AI models are based.

Despite these two significant events in AI history, in 2021, only 1% of all data for AI was synthetic. But, according to Gartner, that number is expected to rise to 60% by the end of 2024. And by 2030, synthetic data is predicted to overshadow real-world data completely.

It's easy to understand why. The third Big Bang occurred in 2022, when a Generative Pre-trained Transformer was put into the hands of consumers for the first time in the form of ChatGPT.

Since then, the data race for AI models has been escalating at a breathtaking rate. This, in turn, has caused the popularity of synthetic data generation to rise sharply. But why is synthetic data generation the answer to data demand?

Why use synthetic data for AI?

The shift towards synthetic data usage in AI development is fueled by several practical factors.

Privacy issues

Firstly, synthetic data helps overcome privacy issues associated with using real-world data, especially when the data contains sensitive personal information. By using synthetic data that mimics real user data, developers can train models without risking privacy breaches.
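To make this concrete, here's a minimal sketch of replacing sensitive fields with synthetic stand-ins. It uses the open-source Faker library, which is our example choice rather than anything prescribed by a particular AI workflow, and the record fields are arbitrary:

```python
from faker import Faker  # pip install faker

Faker.seed(0)  # make the synthetic records reproducible
fake = Faker()

# Synthetic "customer" records that look realistic but contain no real personal data,
# so a model can be trained or tested without touching actual PII.
synthetic_customers = [
    {
        "name": fake.name(),
        "email": fake.email(),
        "address": fake.address(),
        "signup_date": fake.date_between(start_date="-2y", end_date="today").isoformat(),
    }
    for _ in range(1_000)
]

print(synthetic_customers[0])
```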

Ease and customizability

Moreover, synthetic data is essential for scenarios where real data is scarce or difficult to collect. For instance, in autonomous vehicle development, simulating millions of driving hours with diverse conditions is safer and more feasible than recording actual driving data. Synthetic data generation also makes it easier to customize data to meet the specific needs of a business.

Speed

Synthetic data isn't captured from real-world events. That means it's possible to construct a dataset much faster than by collecting real data. In other words, huge volumes of artificial data can be made available in a shorter period of time.

Cost

Lastly, synthetic data can be generated at a lower cost than gathering and labeling real-world data. This makes the development of AI models more efficient and allows for rapid iteration and improvement.

A single image that would cost $6 from a labeling service can be artificially generated for 6 cents.

– Paul Walborsky, co-founder of AI.Reverie

What's the problem with synthetic data?

Notwithstanding the advantages of generating synthetic data, there's a major problem with it.

A study called The Curse of Recursion: Training on Generated Data Makes Models Forget has demonstrated that training an AI model on synthetic data or samples from another generative model can induce a distribution shift, which, over time, causes model collapse. This, in turn, causes the model to misperceive the underlying learning task.

The study concludes that access to the original data source needs to be preserved, and additional data not generated artificially must remain available over time.

The problem with synthetic data can be mitigated by a hybrid approach that augments real-world data, combining real and synthetic examples. This technique, known as data augmentation, uses real-world datasets to create new synthetic examples while maintaining the quality and diversity of the training dataset. But the prediction that purely synthetic data will overshadow real data is worrying.
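As a rough illustration of the idea (not a production pipeline), the sketch below derives two synthetic variants from each real image using NumPy; the specific transforms and noise level are arbitrary assumptions:

```python
import numpy as np

def augment_images(images: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Derive synthetic training examples from real images via flips and light pixel noise."""
    flipped = images[:, :, ::-1]                                   # mirror each image left-to-right
    noisy = np.clip(images + rng.normal(0.0, 5.0, images.shape), 0, 255)
    return np.concatenate([images, flipped, noisy]).astype(images.dtype)

rng = np.random.default_rng(42)
real = rng.integers(0, 256, size=(8, 32, 32)).astype(np.float64)   # stand-in for a real dataset
augmented = augment_images(real, rng)
print(augmented.shape)  # (24, 32, 32): each real image plus two synthetic variants
```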

Synthetic data can introduce bias or drift away from realism, which is why it's the main driver of model collapse. Furthermore, the quality degradation this causes means AI models need to be retrained with new sources of ground truth in the form of fresh real-world datasets. Otherwise, our AI models won't just fail to improve; they'll get worse at what we trained them to do.

Advantages of real data for AI

Despite the rise of synthetic data, real data offers invaluable benefits for AI. It captures the complexity and unpredictability of the real world, which is often difficult to emulate through synthetic means. Training AI models on real-world data ensures they're exposed to the actual scenarios they'll encounter post-deployment. Naturally, this improves their reliability and performance.

Real data can also help in validating and testing AI models trained on synthetic data. This dual approach ensures that the models are not only fed with diverse and comprehensive data during training but also checked against real-world benchmarks to guarantee their efficacy.

How to collect real-world data for AI models

Given that real-world data collection is necessary for training AI models, what are the options? AI needs vast volumes of data, so how do you collect it?

There are several methods depending on the volume and type of data required.

Existing databases and crowdsourcing

For structured data, organizations can tap into existing databases and records. For unstructured data, like images or sounds, crowdsourcing platforms can be valuable, where individuals contribute data in exchange for compensation or on a voluntary basis.

Partnerships and collaborations

Another method is through partnerships and collaborations with institutions and companies that already possess vast amounts of relevant data. These collaborations can provide access to a wide array of real-world data while benefiting all parties involved.

IoT devices

Additionally, IoT devices and sensors are prolific data collectors and can provide a continuous stream of real-world information that can be used for AI models, especially in domains like environmental monitoring, healthcare, and smart cities.

The problem with all of the above? Scale and customizability. To collect enormous quantities of relevant, up-to-date data, there's really only one solution when it comes to the needs of AI:

Web scraping

Web scraping involves using software to extract information from websites. This technique can harvest large amounts of data by navigating the web automatically, mimicking human browsing to collect specific data from various web pages.

The advantages of web scraping are numerous. For one, it enables the collection of data at scale, which is beneficial for training AI models that require extensive datasets to improve their accuracy. It's also a time-efficient method, as, once set up, web scrapers can gather data much faster than a human could manually.

Web scraping tools are usually customizable, allowing for the targeted collection of data. This means that if you're looking to train a model on a specific type of data, scrapers can be programmed to look for and collect just that. It's particularly useful for gathering structured data such as product information, prices, descriptions, and reviews from e-commerce sites or for collecting unstructured data like posts and comments from social media platforms.
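For a sense of what that targeted collection looks like in practice, here's a minimal sketch of a scraper for structured product data. The URL, page structure, and CSS selectors are hypothetical; a real site would need its own selectors (and respect for its terms and robots.txt):

```python
import requests
from bs4 import BeautifulSoup  # pip install requests beautifulsoup4

def scrape_products(url: str) -> list[dict]:
    """Fetch one listing page and extract product name, price, and review count."""
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    products = []
    for card in soup.select("div.product-card"):  # hypothetical selectors for a hypothetical shop
        products.append({
            "name": card.select_one("h2.title").get_text(strip=True),
            "price": card.select_one("span.price").get_text(strip=True),
            "reviews": card.select_one("span.review-count").get_text(strip=True),
        })
    return products

if __name__ == "__main__":
    for item in scrape_products("https://example.com/category/laptops"):
        print(item)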

Moreover, web scraping democratizes data collection. Even individuals or small companies without access to large datasets can collect the data they need from the public domain. This levels the playing field and fosters innovation in AI.

The challenges? The two primary obstacles are scrapers getting blocked by anti-bot protections and scraping dynamic web pages. But these challenges are not insurmountable…
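Dynamic pages, for example, are typically handled by rendering them in a headless browser before extraction. A minimal sketch using Playwright (one common choice, not the only one) might look like this; the target URL is a placeholder:

```python
# pip install playwright && playwright install chromium
from playwright.sync_api import sync_playwright

def scrape_dynamic_page(url: str) -> str:
    """Render a JavaScript-heavy page in a headless browser before reading its HTML."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")  # wait until client-side rendering settles
        html = page.content()
        browser.close()
    return html

if __name__ == "__main__":
    print(len(scrape_dynamic_page("https://example.com")))
```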

Collecting real-world data for AI is not unfeasible

Given the amount of data required for AI applications, web scraping at that scale isn't feasible with a standalone scraping tool alone. You need the infrastructure, tooling, and expertise to tackle the challenges of opening websites and extracting data for LLMs and other AI applications legally, ethically, and at scale.

Apify provides all of these things. Its platform gives developers easy access to serverless computation, data storage, distributed queues, and hundreds of web scrapers built by other developers. It's also deeply integrated with Crawlee, an open-source web scraping library that allows you to crawl and scrape websites at scale.
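To give a sense of what running a pre-built scraper on the platform looks like, here's a minimal sketch using the apify-client Python package. The Actor name and input fields are illustrative assumptions; you'd substitute your own API token, Actor, and target URLs:

```python
from apify_client import ApifyClient  # pip install apify-client

client = ApifyClient("YOUR_APIFY_TOKEN")  # replace with your own API token

# Run a pre-built scraper (an "Actor") and wait for it to finish.
# The Actor name and input shown here are illustrative.
run = client.actor("apify/website-content-crawler").call(
    run_input={"startUrls": [{"url": "https://example.com"}]}
)

# Iterate over the items the run stored in its default dataset.
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)
```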

In addition to its vast range of pre-built data extraction tools, Apify offers Enterprise solutions with its team of experts who know how to handle the challenges of collecting data from arbitrary websites.

So, should you come to realize that synthetic data just won't cut it and you need a way to collect real data for AI, we'll be here for you!

Theo Vasilis
Writer, Python dabbler, and crafter of web scraping tutorials. Loves to inform, inspire, and illuminate. Interested in human and machine learning alike.
