Artificial intelligence needs data, and lots of it, but not all data is treated equally. What kind of data does AI use, and does it matter what you feed it and how?
AI and data types
When you ask an AI a question or ask it to do something, it doesn’t generate an answer out of thin air. It draws on the data it has been trained on, a massive body of information that shapes everything the AI knows. We talk about this in more depth in our article on how to feed your LLM.
Naturally, it’s important that you use the right data. If you’re training a text-generating LLM like ChatGPT, there’s no point in feeding it images, nor does an image generator need to know the complete works of Shakespeare.
However, an LLM doesn’t process all of this data in the same way. Some data is more accessible than other data. There are, roughly speaking, three types of data that machines can process.
Structured data
The first type is structured data, which in most cases is data that has been organized into a table and is therefore incredibly easy for a machine to “read”, though humans can benefit from this, too. Spreadsheets are the prime example of structured data, with columns and rows clearly denoting what goes where, and what everything is.
A table is the perfect way to feed data to an LLM because all the information has already been sorted, or, to use a more technical term, labeled. A string of nine digits has been marked as a phone number, say, and not as a product serial number. A word like “norm” has been labeled as a first name, a surname, or a noun. The guesswork has been taken out of putting the data together.
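To make the idea concrete, here is a minimal sketch of what “already labeled” looks like in practice. The CSV content, column names, and the `read_rows` helper are made up for illustration; the point is only that the header row tells the machine which nine-digit string is a phone number and which is a serial number.

```python
import csv
import io

# Hypothetical spreadsheet exported as CSV.
# The header row acts as the label for every value beneath it.
CSV_TEXT = """first_name,phone_number,serial_number
Norm,555123456,883201774
Ada,555987654,104557120
"""


def read_rows(csv_text):
    """Return each row as a dict keyed by its column header."""
    return list(csv.DictReader(io.StringIO(csv_text)))


rows = read_rows(CSV_TEXT)
for row in rows:
    # No guessing needed: the column name says what each value is.
    print(row["first_name"], "->", row["phone_number"])
```

Both numeric columns hold nine digits, yet the program never has to work out which is which, because the table has already done that job.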
Unstructured data
As you can guess, unstructured data is the opposite of structured data. There are no tables, tags, or anything else that makes it easier for a machine to ingest. That said, unstructured doesn’t mean complete chaos. It’s just data that can’t be easily parsed or classified by a machine, like large slabs of text, images, audio, and video.
For example, if you give an LLM a body of text or an image, it won’t know what to look for or where to place what it finds. This is where data labeling comes in: humans “teach” the machine what’s what. After a few rounds, the machine starts labeling things for itself, with a human checking and correcting its work, until eventually it can recognize things within a certain margin of error.
This teaching process is an important part of machine learning, and really isn’t that different from how we teach small children. Teach an AI enough, and it should be able, eventually, to create its own structure from unstructured data - though it will always need some help.
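The label, guess, and correct cycle described above can be sketched in a few lines. This is a toy under stated assumptions, not how a real model learns: `KeywordLabeler` and `review` are hypothetical names, and the “model” simply counts which label each word has been seen with.

```python
from collections import Counter, defaultdict


class KeywordLabeler:
    """Toy stand-in for a model: it remembers which label each word appeared with."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)

    def learn(self, text, label):
        """Store a human-confirmed label for every word in the text."""
        for word in text.lower().split():
            self.word_counts[word][label] += 1

    def predict(self, text):
        """Return the label with the most votes, or None if no word is known."""
        votes = Counter()
        for word in text.lower().split():
            votes.update(self.word_counts.get(word, Counter()))
        return votes.most_common(1)[0][0] if votes else None


def review(labeler, text, human_label):
    """The machine guesses first, then the human's label is fed back in."""
    guess = labeler.predict(text)
    labeler.learn(text, human_label)
    return guess


labeler = KeywordLabeler()

# Round 1: humans label everything by hand.
for text, label in [("cute cat sleeping", "animal"), ("red sports car", "vehicle")]:
    labeler.learn(text, label)

# Later rounds: the machine labels, a human checks and corrects.
first_guess = review(labeler, "black cat", "animal")         # one known word: "cat"
second_guess = review(labeler, "delivery truck", "vehicle")  # no known words yet
print(first_guess, second_guess, labeler.predict("truck"))
```

The first guess is right because “cat” has been seen before. The second comes back empty, and it is the human correction that lets the labeler recognize “truck” from then on.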
Semi-structured data
Of course, not all data can be neatly categorized as structured or unstructured. Some types of data are a bit of both and are best described as semi-structured. These will have certain labels and metadata, but not everything an LLM needs to get to work without human help.
Good examples are JSON files, which have some of the labels a machine needs to make sense of what it’s presented with, in the form of keys that name each value, but don’t have a fixed tabular structure. Files written in markup languages such as Markdown have the same issue, as they lack a rigid data model.
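Here is a small sketch of why JSON sits in between. The records, field names, and the `flatten` and `to_table` helpers below are invented for the example; they show one possible way to push loosely shaped records into rows and columns, with gaps wherever a record lacks a field.

```python
import json

# Hypothetical records: every value has a key, but the records don't share one shape.
JSON_TEXT = """
[
  {"name": "Norm", "contact": {"phone": "555123456"}},
  {"name": "Ada", "contact": {"email": "ada@example.com"}, "tags": ["vip"]}
]
"""


def flatten(record, prefix=""):
    """Turn nested objects into flat 'parent.child' column names."""
    flat = {}
    for key, value in record.items():
        column = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{column}."))
        else:
            flat[column] = value
    return flat


def to_table(records):
    """Build a header plus rows; missing fields become None."""
    flat_records = [flatten(record) for record in records]
    columns = sorted({column for record in flat_records for column in record})
    rows = [[record.get(column) for column in columns] for record in flat_records]
    return columns, rows


columns, rows = to_table(json.loads(JSON_TEXT))
print(columns)
for row in rows:
    print(row)
```

The keys carry useful labels, but the empty cells in the resulting table show what is still missing before the data counts as fully structured.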
Turning unstructured data into structured data
While you could label unstructured data by hand before training an LLM or other AI on it, there is, thankfully, a shortcut you can use. You could, for example, write a script or program that does at least some of the work for you. That requires some programming knowledge, of course, but you don’t have to do all the work yourself.
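As a rough idea of what such a script might look like, the sketch below pulls labeled fields out of a sentence of free text. The patterns are deliberately simple assumptions (a nine-digit run counts as a phone number, and the email pattern is far from complete), and `extract_contacts` is a hypothetical helper, so treat it as a starting point rather than a finished tool.

```python
import re

# Simplified patterns, chosen for illustration only.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\b\d{9}\b")  # assumes phone numbers are exactly nine digits


def extract_contacts(text):
    """Pull labeled fields out of free text."""
    return {
        "emails": EMAIL_RE.findall(text),
        "phones": PHONE_RE.findall(text),
    }


TEXT = "Call Norm on 555123456 or write to norm@example.com. Order ref 12345."
contacts = extract_contacts(TEXT)
print(contacts)
```

A script like this only does part of the work. It cannot tell a nine-digit phone number from a nine-digit serial number, which is exactly the kind of guesswork that proper labeling removes.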
This is where Apify comes in. Apify Store offers many web scrapers and other tools that can turn unstructured data into structured information, or at least semi-structured data. Website Content Crawler, for example, can go to any site, extract the data you need, and export it as a Markdown or JSON file. Doing this labels a lot of data in one fell swoop, saving you a lot of time.
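Once a page has been exported as Markdown, its headings give you something to hang further structure on. The sketch below is one possible way to do that, not part of any Apify tool: the sample text is made up rather than real crawler output, and `split_sections` is a hypothetical helper that ignores edge cases such as `#` characters inside code blocks.

```python
import re

# Matches Markdown headings such as "# Title" or "### Subsection".
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")

# Hypothetical Markdown export of a web page.
MARKDOWN = """# Pricing
Plans start at $10.

## Free tier
No card needed.
"""


def split_sections(markdown_text):
    """Split Markdown into records with a heading level, a title, and body text."""
    sections = []
    current = {"level": 0, "title": "", "body": []}
    for line in markdown_text.splitlines():
        match = HEADING_RE.match(line)
        if match:
            sections.append(current)
            current = {"level": len(match.group(1)), "title": match.group(2), "body": []}
        else:
            current["body"].append(line)
    sections.append(current)
    return [
        {"level": s["level"], "title": s["title"], "text": "\n".join(s["body"]).strip()}
        for s in sections
        # Drop the empty placeholder that exists before the first heading.
        if s["title"] or any(line.strip() for line in s["body"])
    ]


sections = split_sections(MARKDOWN)
for section in sections:
    print(section)
```

Each record now has the same three fields, so the result can go straight into a table or a JSON file.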
Website Content Crawler is a more general tool, though. We also offer more specific ones on our platform, like Universal AI GPT Scraper, which claims it can structure any website’s data for you. Another example is Image to JSON Extractor, which can take an image’s information and add some order to it.
Adding structure to data is one of the biggest hurdles when training AI bots, but smart use of human labor and automated tools on Apify Store can make it possible, even easy.