Artificial intelligence is everywhere and is used to generate all kinds of content, from text to video to music, as well as acting as a conversational partner for millions of people. However, AI can’t conjure all this content out of thin air; it needs to draw from a well of some kind. So where does AI get its information?
The short answer is: from us, and from the repositories we’ve created. Some of these databases are perfectly okay to use, but as AI’s needs have grown, so has its creators’ willingness to color outside the lines.
A quick breakdown of how LLMs work
Large language models use a set of sophisticated algorithms and machine-learning techniques to learn statistical patterns from a vast reservoir of data. This process is called training. Once trained, the LLM doesn’t look its answers up in a stored copy of that data; instead, it uses the patterns it has learned to generate new text whenever it’s prompted, adding its own spin.
For example, if you ask ChatGPT for a poem in the style of Shakespeare about the love you have for your laptop, the model draws on the patterns it picked up from Shakespeare’s verse during training and generates new lines in that style, this time about your laptop rather than whomever the Bard was pining after.
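To make the train-then-generate idea concrete, here is a minimal, hypothetical sketch in Python: a toy word-level model that counts which word tends to follow which in a tiny made-up “training corpus,” then samples new text from those counts. Real LLMs learn billions of neural-network parameters rather than a simple lookup table, but the overall shape is the same: learn patterns first, then generate from them.

```python
import random
from collections import defaultdict, Counter

# A toy "training corpus" standing in for the terabytes real LLMs are fed.
corpus = (
    "shall i compare thee to a summer's day "
    "thou art more lovely and more temperate"
).split()

# "Training": count which word follows which (a simple bigram model).
follows = defaultdict(Counter)
for current_word, next_word in zip(corpus, corpus[1:]):
    follows[current_word][next_word] += 1

def generate(prompt_word: str, length: int = 8) -> str:
    """'Inference': sample likely next words, one at a time."""
    word, output = prompt_word, [prompt_word]
    for _ in range(length):
        candidates = follows.get(word)
        if not candidates:
            break
        # Pick the next word in proportion to how often it followed this one.
        choices, weights = zip(*candidates.items())
        word = random.choices(choices, weights=weights)[0]
        output.append(word)
    return " ".join(output)

print(generate("thou"))  # e.g. "thou art more lovely and more temperate"
```

Run it a few times and the output varies slightly, which is also one reason the same prompt rarely produces the exact same answer from an LLM twice.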

When we’re talking about AI, the technology itself is only part of the story; the data used to train these models is just as important. Without the algorithms, the data would just be inert bits and bytes, while without the training data, the LLM’s code would be nothing more than an interesting exercise in programming.
What kind of data does an LLM need?
For an LLM to work, it needs a lot of data. To give you an idea, GPT-3 was trained on 570GB of so-called filtered, or clean, data, which is a huge amount of text. For comparison, if you were to strip Wikipedia of all images and formatting, the text files alone would come in under 25GB. GPT-4, meanwhile, is speculated to have needed roughly ten times as much data as its predecessor, though no official figures have been published.
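For a sense of scale, here is a rough back-of-the-envelope calculation in Python; the bytes-per-character and characters-per-word figures are assumptions for illustration, not published numbers.

```python
# Rough back-of-the-envelope maths for the scale of 570GB of plain text.
# The bytes-per-character and characters-per-word figures are assumptions,
# not published numbers.
dataset_bytes = 570 * 1024**3          # 570 GB of filtered text
bytes_per_char = 1                     # plain ASCII-ish English text
chars_per_word = 6                     # ~5 letters plus a space, on average

words = dataset_bytes / (bytes_per_char * chars_per_word)
print(f"~{words / 1e9:.0f} billion words")            # roughly 100 billion

# For comparison: a 25GB text-only Wikipedia dump at the same ratio.
wikipedia_words = 25 * 1024**3 / chars_per_word
print(f"Wikipedia: ~{wikipedia_words / 1e9:.1f} billion words")
```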
The data can’t be just anything, either. The criteria for what data to feed an LLM differ depending on what you’ll be using it for, but roughly speaking, you need to tailor what you put in to what you’re trying to get out. If your model is meant to generate images, you want to draw from an image-heavy dataset like ImageNet, while a more text-based LLM may use Wikipedia or BookCorpus, which holds the contents of over 7,000 self-published books.
To go back to the earlier example, to get a Shakespearean sonnet on laptops, your LLM needs to know about Shakespeare and laptops, and it needs enough command of English to process the request. AI isn’t magic; it can’t draw information from thin air. It will try if you let it, though, and these so-called hallucinations are a serious issue.
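As a hypothetical illustration of what that tailoring can look like in practice, the short Python sketch below filters a general pile of text down to documents relevant to a chosen topic before it is handed to a model for training. The documents, keywords, and keep-or-discard rule are all invented for the example; real data pipelines use far more sophisticated filtering, deduplication, and quality scoring.

```python
# Hypothetical sketch: narrowing a general corpus to a topic before training.
# Documents, keywords, and the keep/discard rule are invented for illustration.
documents = [
    "Shall I compare thee to a summer's day? Thou art more lovely...",
    "The new laptop ships with 16GB of RAM and a 14-inch display.",
    "Quarterly earnings rose 3% on stronger advertising revenue.",
    "Sonnet 130: My mistress' eyes are nothing like the sun.",
]

keywords = {"sonnet", "thee", "thou", "laptop", "ram", "display"}

def is_relevant(doc: str, min_hits: int = 1) -> bool:
    """Keep a document if it mentions enough on-topic words."""
    words = {w.strip(".,?!:;'\"").lower() for w in doc.split()}
    return len(words & keywords) >= min_hits

training_set = [doc for doc in documents if is_relevant(doc)]
print(f"Kept {len(training_set)} of {len(documents)} documents")
for doc in training_set:
    print("-", doc[:50])
```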
The ethics of gathering AI data
To train an LLM, you need databases, and the more the better. However, this raises the question of where these databases get their data from. This is where things can get murky, as they all rely on some kind of scraped data. While web scraping is generally legal, not all data can be legally scraped, and there are ethical issues, too.
Probably the best example of an often-used data archive is Common Crawl, a nonprofit project that crawls the web and makes the results freely available, capturing everything from web page content to metadata. This data is on the open web and can be accessed by anybody at any time; Common Crawl has simply bundled it all up in one place. If anything sensitive ends up in the archive, you could argue that it got there by accident rather than by design.
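To show what this kind of crawling looks like at the smallest possible scale, here is a hedged Python sketch that uses only the standard library to fetch a single page and strip out its visible text, roughly the raw material that ends up in large crawl archives. The URL is just a placeholder; a real crawler also honors robots.txt, follows links, and records headers and other metadata.

```python
# Minimal sketch of what a single "scrape" looks like, using only the
# Python standard library. Real crawl archives like Common Crawl's cover
# billions of pages and also store metadata alongside the page content.
from html.parser import HTMLParser
from urllib.request import urlopen

class TextExtractor(HTMLParser):
    """Collects the text between tags, ignoring scripts and styles."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())

url = "https://example.com/"          # placeholder page for illustration
html = urlopen(url).read().decode("utf-8", errors="replace")

parser = TextExtractor()
parser.feed(html)
print("\n".join(parser.parts))        # the page's visible text
```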
Another popular source is Wikipedia, which allows anybody to reuse its text (though not necessarily its images, which carry their own licenses). Plenty of AI companies have used text scraped from Wikipedia’s entries to train their LLMs. Much the same goes for the MNIST dataset, a freely available collection of tens of thousands of handwritten digits that has long been used to train image-recognition models.
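Openly available datasets like MNIST are trivial to obtain, which is part of why they turn up in so many training pipelines. The snippet below shows one common way to download it via Keras; it assumes TensorFlow is installed, and other libraries such as torchvision offer an equivalent download.

```python
# One common way to fetch the openly available MNIST digits, via Keras.
# Requires TensorFlow to be installed (pip install tensorflow); libraries
# such as torchvision offer an equivalent download.
from tensorflow.keras.datasets import mnist

(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

print(train_images.shape)   # (60000, 28, 28): 60,000 greyscale digit images
print(test_images.shape)    # (10000, 28, 28)
print(train_labels[:10])    # the digits 0-9 those first images depict
```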
Dodgy data
However, not all the data used by LLMs is this freely available, and over the last few years most major LLM developers have faced some kind of scandal over the provenance of their training data.
For example, BookCorpus, the popular dataset used to train early GPT models, was made up of books by independent, self-published authors who had put their work online for free. However, downloading those works wholesale and reusing them to train LLMs was likely against the terms of service of Smashwords, the site that hosted them. Some authors protested, but to no avail.
A little further over the line is Meta, Facebook’s parent company, which The Atlantic discovered had used the Books3 dataset, a collection of thousands of pirated books. The dataset has since become the subject of intense legal debate, as authors have weighed in against having their work used by an LLM without their permission and without payment.
Barrel scrapings
You also have to wonder what gets scraped up for LLM training purposes. A good example is this story from Bayerischer Rundfunk, the Bavarian public broadcaster, in which a reporter tried to trace how a picture of her ended up in the LAION database, one of the main data sources for Stable Diffusion. Along the way, she discovered that many other people have had their pictures swept into such databases without ever being asked for permission.
In fact, there seem to be some serious issues with the way data is gathered for AI training. As Scientific American explains, as LLMs expand, so does their thirst for new data, pushing AI companies to stretch the boundaries of what’s acceptable further and further. Researchers have found everything from medical files to private pictures inside these datasets.
You may also wonder how happy Reddit users were to find out that OpenAI would be training its LLMs on posts made on the social media platform. Some of these posts were years old, written long before generative AI was a thing, so nobody could have known their words would end up as training material.
Man vs. machine
It seems that anything on the web, from published articles to social media comments to forum posts, is fair game for AI companies to feed their LLMs, whether or not anybody involved gave permission. If an AI company wants a dataset, it simply takes it, with little regard for the people who created it or appear in it.
There doesn’t seem to be a limit to the data needed for these machines, raising all kinds of questions around data ownership. Should AI models be allowed to gobble up whatever data they want, or should there be limits on what they’re allowed to take? These questions will likely accompany us into the AI future.