What is data ingestion for large language models?

Data ingestion for LLMs is super easy… said no one ever! The fact is, it’s a complex process that involves collecting, preprocessing, and preparing data. Find out how to go about gathering and processing the data for your own large language models.


Robots reading books - Data ingestion for LLMs (image produced by DeepAI)

In the blink of an eye, LLMs went from being something only AI geeks knew and cared about to something everyone is trying to cash in on. Influential people famous for anything but AI are offering their take on it in interviews, and YouTubers who had never touched the subject are suddenly giving us their two cents, whether we wanted their opinion or not.

You get the idea. Everyone is going nuts for large language models because of ChatGPT and all the spin-offs and sequels that will inevitably ensue. But how do you go about getting the data needed to train one of your very own robot overlords?

Machine learning, smart apps, and real-time analytics: all begin with data and tons of it. And we’re not only talking about structured data, such as databases, but also unstructured data (videos, images, text messages, and whatnot). Getting the data from your data source to your data storage for processing, preparation, and training is a vital step known as data ingestion.

What is data ingestion?

Data ingestion is basically the process of collecting, processing, and preparing data for analysis or machine learning. In the context of large language models, data ingestion involves collecting vast quantities of text data (e.g. by web scraping), preprocessing it (cleaning, normalization, tokenization), and preparing it for training the LLM (feature engineering). If those terms raise more questions for you than answers, don’t panic: all is explained below.

How does data ingestion work?

Data ingestion is a complex process that involves multiple layers and processes, but for the sake of time and clarity, I’ll break it down into four layers (three of which I briefly mentioned earlier): collection, preprocessing, feature engineering, and storage.

Data collection

The first layer involves collecting data from various sources such as the web, social media, or text documents. The data collected needs to be relevant to the task the LLM is being trained for. For example, if the LLM is being trained to perform sentiment analysis, the data collected should include a large number of reviews, comments, and social media posts. So, the first step to data ingestion for LLMs is to define the data requirements. What types of data are needed to train the model?

Once you’ve figured that out, you need to start gathering the data. The most common and popular form of web data collection is web scraping: an automated method of extracting data from websites. You can either build a scraper with a web scraping library (try Crawlee) or use a ready-made scraping tool. Such tools come in two varieties: universal scrapers designed for web data extraction from any site, and site-specific scrapers, for example, a Google Maps scraper or a Twitter scraper.
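To make the collection step concrete, here’s a minimal sketch of extracting visible text from a scraped page using only Python’s standard library. In a real pipeline you’d fetch the HTML over HTTP (or use a library like Crawlee); the inline page and the `TextCollector` class here are illustrative, not part of any particular tool.

```python
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Collect visible text from an HTML page, skipping scripts and styles."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that isn't inside a skipped tag
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())

# In a real pipeline this HTML would come from an HTTP request;
# an inline page keeps the sketch self-contained.
html = "<html><body><h1>Reviews</h1><script>var x=1;</script><p>Great product!</p></body></html>"
parser = TextCollector()
parser.feed(html)
print(parser.chunks)  # ['Reviews', 'Great product!']
```

Dedicated scraping libraries handle the hard parts this sketch ignores: crawling, JavaScript rendering, retries, and rate limiting.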



Data preprocessing

Once the data has been collected, it needs to be preprocessed before it can be used to train your T-1000 LLM. Preprocessing involves several steps, including cleaning the data, normalization, and tokenization.

  • Data cleaning

Data cleaning involves identifying and correcting or removing inaccurate, incomplete, or irrelevant data. If you want to ensure data quality and consistency, you’ve got to do some data chores. This typically involves things like removing duplicates, fixing missing or incorrect values, and removing outliers.
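A cleaning pass might look like the sketch below. The function name and the length thresholds are arbitrary choices for illustration; real pipelines tune these rules (and add near-duplicate detection, language filtering, and so on) to their corpus.

```python
def clean_corpus(docs):
    """Basic cleaning: drop empty entries, exact duplicates, and length outliers."""
    seen = set()
    cleaned = []
    for doc in docs:
        text = doc.strip()
        if not text:                              # drop missing/empty entries
            continue
        if text in seen:                          # drop exact duplicates
            continue
        if len(text) < 5 or len(text) > 10_000:   # drop length outliers (arbitrary bounds)
            continue
        seen.add(text)
        cleaned.append(text)
    return cleaned

raw = ["Great phone!", "Great phone!", "", "   ", "ok", "Terrible battery life."]
print(clean_corpus(raw))  # ['Great phone!', 'Terrible battery life.']
```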

  • Normalization

Normalization means transforming data into a standard format that allows for easy comparison and analysis. This step is particularly important when dealing with text data, as it helps to reduce the dimensionality of the data and makes it easier to compare and analyze. Typical examples include converting all text to lowercase, removing punctuation, and removing stop words.
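The three typical examples above can be sketched in a few lines. The tiny stop-word list here is purely illustrative; real pipelines use a curated list (or skip stop-word removal entirely, since modern LLMs usually keep them).

```python
import string

STOP_WORDS = {"the", "a", "an", "is", "and", "to"}  # tiny illustrative stop list

def normalize(text):
    """Lowercase, strip punctuation, and remove stop words."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    words = [w for w in text.split() if w not in STOP_WORDS]
    return " ".join(words)

print(normalize("The battery is GREAT, and the screen is sharp!"))
# battery great screen sharp
```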

  • Tokenization

Tokenization involves breaking down the text into individual words or phrases, which will be used to create the vocabulary for the language model. This is especially important in natural language processing (NLP) because it allows for the analysis of individual words or phrases within the text. This tokenization can be done at a word level, character level, or subword level.
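Here’s what word-level tokenization plus vocabulary building could look like in its simplest form. Production LLMs use trained subword tokenizers (BPE, WordPiece, and the like) rather than a plain whitespace split; this sketch just shows the idea of mapping tokens to vocabulary ids.

```python
def word_tokenize(corpus):
    """Word-level tokenization plus a vocabulary mapping token -> id."""
    tokenized = [doc.split() for doc in corpus]
    vocab = {}
    for doc in tokenized:
        for tok in doc:
            vocab.setdefault(tok, len(vocab))  # assign ids in order of first appearance
    return tokenized, vocab

docs = ["large language models", "language models learn"]
tokens, vocab = word_tokenize(docs)
print(tokens)  # [['large', 'language', 'models'], ['language', 'models', 'learn']]
print(vocab)   # {'large': 0, 'language': 1, 'models': 2, 'learn': 3}
```

Character-level tokenization is just `list(text)`; subword tokenizers sit between the two, splitting rare words into smaller pieces so the vocabulary stays manageable.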

Feature engineering

Feature engineering involves creating features from preprocessed data. Features are numerical representations of the text that the LLM can understand.

There are several feature engineering techniques that can be used, such as word embeddings, which represent words as dense vectors of real numbers that capture their meaning. Word embeddings are produced by techniques that use neural networks, such as Word2Vec.

We could divide this feature engineering stage into three steps:

  • Split
    First, you need to divide the data into training, validation, and testing sets. Use the training set to teach the LLM and the validation and testing sets to evaluate the machine’s performance.
  • Augment
    Next, increase the size and diversity of the data by adding new examples, synthesizing new data, or transforming existing data.
  • Encode
    Finally, do the encoding by embedding data into tokens or vectors.
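The three steps above can be sketched as follows. The split fractions, the reversed-word "augmentation," and the vocabulary are all placeholder choices for illustration; real pipelines shuffle before splitting and use more meaningful augmentation (paraphrasing, back-translation, etc.).

```python
def split_data(docs, train_frac=0.8, val_frac=0.1):
    """Split a corpus into train/validation/test sets (shuffle first in practice)."""
    n = len(docs)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    return docs[:n_train], docs[n_train:n_train + n_val], docs[n_train + n_val:]

def augment(docs):
    """Toy augmentation: add a reversed-word copy of each document."""
    return docs + [" ".join(reversed(d.split())) for d in docs]

def encode(doc, vocab):
    """Encode a document as a list of integer token ids."""
    return [vocab[tok] for tok in doc.split() if tok in vocab]

vocab = {"large": 0, "language": 1, "models": 2}
docs = [f"doc {i}" for i in range(10)]
train, val, test = split_data(docs)
print(len(train), len(val), len(test))            # 8 1 1
print(encode("large language models", vocab))      # [0, 1, 2]
```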


Data storage

Once the data has been preprocessed and features have been created, it needs to be stored in a format that can be easily accessed by the language model during training. The data can be stored in a database or file system, and the format may be structured or unstructured.
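One common file-system choice for training data is JSON Lines: one JSON record per line, easy to stream without loading the whole corpus into memory. A minimal save/load round trip might look like this (file path and record schema are illustrative):

```python
import json
import os
import tempfile

def save_jsonl(records, path):
    """Store encoded examples as JSON Lines, one training example per line."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")

def load_jsonl(path):
    """Read the records back, one JSON object per line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

records = [{"tokens": [0, 1, 2]}, {"tokens": [1, 2, 3]}]
path = os.path.join(tempfile.gettempdir(), "corpus.jsonl")
save_jsonl(records, path)
print(load_jsonl(path) == records)  # True
```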


That’s it!... Not!

Even after your data is collected, preprocessed, engineered, and stored, you should continuously monitor the quality and relevance of the data and update it as needed to improve the performance of your large language model. Otherwise, your LLM may soon become as obsolete as the T-800.

Let’s sum up

Assuming you have the stomach for a recap on data ingestion, let’s conclude with a quick run-down of the process:

  1. Define the data requirements for training the LLM.
  2. Collect the data (scrape websites, databases, or public datasets).
  3. Organize the data (cleaning, preprocessing, normalization, tokenization).
  4. Split, augment, and encode your data (feature engineering).
  5. Save and store for easy access by the LLM during training.
  6. Monitor to ensure data quality and relevance.

Now you have some basic idea of what training your very own LLM involves, but I can’t be held responsible for what you choose to do with this information. If you lose control over your LLM once AI reaches the technological singularity, that’s on you!
