What is data ingestion for large language models?

Learn about the process of collecting and preparing data for your own LLMs. Tools and tips included.

Hi, we're Apify, a full-stack web scraping and browser automation platform. This article about data ingestion was inspired by our work on getting better data for AI.

This article was first published on April 18, 2023, and updated on November 23, 2023.

In the blink of an eye, LLMs went from being something only AI geeks knew and cared about to something everyone is trying to cash in on. Influential people famous for anything but AI are offering their take on it in interviews, and YouTubers who have never touched the subject are suddenly giving us their two cents, whether we want their opinion or not.

You get the idea. Everyone has gone nuts for large language models because of ChatGPT and all the spin-offs and sequels that have ensued - not least because anyone can now build their own GPTs with ease. But how do you go about getting the data needed to train a really powerful and versatile AI model?

Machine learning, smart apps, and real-time analytics all begin with data, and tons of it. And we're not only talking about structured data, such as database records, but also unstructured data (videos, images, text messages, and whatnot). Getting the data from your data source to your data storage for processing, preparation, and training is a vital step known as data ingestion.

What is data ingestion?

Data ingestion for LLMs is super easy… said no one ever! The fact is it's a complicated process of collecting, processing, and preparing data for AI and machine learning. In the context of large language models, data ingestion involves gathering vast quantities of text data (web scraping), preprocessing it (cleaning, normalization, tokenization), and preparing it for training the LLM (feature engineering). If those terms raise more questions for you than answers, don’t panic: all is explained below.

Challenges of data ingestion for LLMs

Ingesting data into cloud data lakes and warehouses can be tricky due to the diversity of data sources and the complexity of capturing data. Limited out-of-the-box connectivity, manual monitoring of ingestion jobs, and schema drift are significant hurdles. To find solutions, let's first explore how data ingestion works and then see how to overcome these challenges. I'll also cover a few of them in the list of tools and tips at the end.

How does data ingestion work?

Data ingestion is a complex process that involves multiple layers and processes, but for the sake of time and clarity, I’ll break it down into four layers (three of which I briefly mentioned earlier): collection, preprocessing, feature engineering, and storage.

Data collection

The first layer involves collecting data from various sources such as the web, social media, or text documents. The data collected needs to be relevant to the task the LLM is being trained for. For example, if the LLM is being trained to perform sentiment analysis, the data collected should include a large number of reviews, comments, and social media posts. So, the first step to data ingestion for LLMs is to define the data requirements. What types of data are needed to train the model?

Once you’ve figured that out, you need to start gathering the data. The most common and popular form of web data collection is web scraping, which is an automated method of extracting data from websites. Two ways to do this are by building a scraper with a web scraping library (try Crawlee) or by using a ready-made scraping tool. Two types of such tools are universal scrapers designed for web data extraction from any site, and site-specific scrapers, for example, a Google Maps scraper or a Twitter scraper.
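
If you decide to build your own scraper, the core of the job is just fetching pages and extracting the text you need. Here's a minimal Python sketch using the requests and beautifulsoup4 libraries (the URL is a placeholder, and a real crawler would add link discovery, rate limiting, and error handling):

```python
# A minimal scraping sketch: fetch one page and pull out its paragraph text.
# Assumes the requests and beautifulsoup4 packages are installed; the URL is
# a placeholder, not a real data source.
import requests
from bs4 import BeautifulSoup

def scrape_paragraphs(url: str) -> list[str]:
    """Download a page and return the text of every <p> element."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    return [p.get_text(strip=True) for p in soup.find_all("p")]

if __name__ == "__main__":
    for paragraph in scrape_paragraphs("https://example.com/reviews"):
        print(paragraph)
```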

Preprocessing

Once the data has been collected, it needs to be preprocessed before it can be used to train your T-1000 LLM. Preprocessing involves several steps, including cleaning the data, normalization, and tokenization (all three are sketched in code after the list below).

  • Data cleaning

Data cleaning involves identifying and correcting or removing inaccurate, incomplete, or irrelevant data. If you want to ensure data quality and consistency, you’ve got to do some data chores. This typically involves things like removing duplicates, fixing missing or incorrect values, and removing outliers.

  • Normalization

Normalization means transforming data into a standard format that allows for easy comparison and analysis. This step is particularly important when dealing with text data, as it helps to reduce the dimensionality of the data and makes it easier to compare and analyze. Typical examples include converting all text to lowercase, removing punctuation, and removing stop words.

  • Tokenization

Tokenization involves breaking down the text into individual words or phrases, which will be used to create the vocabulary for the language model. This is especially important in natural language processing (NLP) because it allows for the analysis of individual words or phrases within the text. Tokenization can be done at the word, character, or subword level.
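
To make those three preprocessing steps concrete, here's a toy Python sketch. The stop-word list and sample documents are illustrative only; a production pipeline would lean on a library like NLTK or spaCy, or on a model-specific tokenizer:

```python
# Toy preprocessing pass: dedupe and drop empty documents (cleaning),
# lowercase and strip punctuation (normalization), then split into
# word-level tokens and remove a few stop words (tokenization).
import string

STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "is"}  # tiny example set

def clean(documents: list[str]) -> list[str]:
    """Data cleaning: remove duplicates and empty or whitespace-only entries."""
    seen, cleaned = set(), []
    for doc in documents:
        doc = doc.strip()
        if doc and doc not in seen:
            seen.add(doc)
            cleaned.append(doc)
    return cleaned

def normalize(text: str) -> str:
    """Normalization: lowercase and strip punctuation."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))

def tokenize(text: str) -> list[str]:
    """Word-level tokenization with stop-word removal."""
    return [tok for tok in text.split() if tok not in STOP_WORDS]

docs = ["Great product, loved it!", "Great product, loved it!", "  ", "Not worth the price."]
tokens = [tokenize(normalize(doc)) for doc in clean(docs)]
print(tokens)  # [['great', 'product', 'loved', 'it'], ['not', 'worth', 'price']]
```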

Feature engineering

Feature engineering involves creating features from preprocessed data. Features are numerical representations of the text that the LLM can understand.

There are several feature engineering techniques that can be used, such as word embeddings, which represent the text as a dense vector of real numbers to capture the meaning of the words. Word embeddings are produced by techniques that use neural networks, such as Word2Vec.
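
As an illustration, here's roughly what training word embeddings with gensim's Word2Vec looks like, assuming the gensim package is installed and you already have a tokenized corpus from the preprocessing step (the sentences below are placeholders):

```python
# Sketch of training word embeddings with gensim's Word2Vec on tokenized
# sentences. The corpus and hyperparameters are toy values for illustration.
from gensim.models import Word2Vec

sentences = [
    ["great", "product", "loved", "it"],
    ["not", "worth", "price"],
    ["great", "value", "for", "price"],
]

model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=20)

vector = model.wv["price"]             # dense vector representing the word "price"
print(vector.shape)                    # (50,)
print(model.wv.most_similar("great"))  # words closest in the embedding space
```

Note that modern LLMs usually learn their embeddings as part of the model itself rather than from a separate Word2Vec step; the sketch above just shows the general idea of turning words into dense vectors.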

We could divide this feature engineering stage into three steps (sketched in code after the list):

  • Split
    First, you need to divide the data into training, validation, and testing sets. Use the training set to teach the LLM and the validation and testing sets to evaluate the machine’s performance.
  • Augment
    Next, increase the size and diversity of the data by adding new examples, synthesizing new data, or transforming existing data.
  • Encode
    Finally, do the encoding by embedding data into tokens or vectors.
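
Here's a toy sketch of those three steps. The 90/5/5 split, the crude synonym-swap "augmentation", and the vocabulary-based encoding are illustrative choices, not a standard recipe:

```python
# Split / augment / encode on a toy tokenized corpus.
import random

random.seed(0)
corpus = [["great", "product"], ["not", "worth", "price"], ["great", "value"]] * 20

# 1. Split into training, validation, and testing sets (roughly 90/5/5).
random.shuffle(corpus)
n_train, n_val = int(len(corpus) * 0.9), int(len(corpus) * 0.05)
train = corpus[:n_train]
val = corpus[n_train:n_train + n_val]
test = corpus[n_train + n_val:]

# 2. Augment: a crude synonym swap that adds extra training examples.
SYNONYMS = {"great": "excellent", "product": "item"}
train += [[SYNONYMS.get(tok, tok) for tok in sent] for sent in train[:10]]

# 3. Encode: map every token to an integer ID from a vocabulary.
vocab = {tok: i for i, tok in enumerate(sorted({tok for sent in train for tok in sent}))}
encoded = [[vocab[tok] for tok in sent] for sent in train]

print(len(train), len(val), len(test))  # 64 3 3 with this toy corpus
print(encoded[0])                       # the first training example as token IDs
```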

Storage

Once the data has been preprocessed and features have been created, it needs to be stored in a format that can be easily accessed by the language model during training. The data can be stored in a vector database, and the format may be structured or unstructured.
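
As a stand-in for a real vector database, here's a toy in-memory store that keeps embeddings in a NumPy matrix and retrieves the closest texts by cosine similarity. In practice you'd reach for a dedicated vector store such as FAISS, Pinecone, or Chroma, plus durable storage for the raw text:

```python
# Toy in-memory "vector database": store (text, embedding) pairs and
# retrieve the nearest texts to a query vector by cosine similarity.
import numpy as np

class ToyVectorStore:
    def __init__(self, dim: int):
        self.vectors = np.empty((0, dim), dtype=np.float32)
        self.texts: list[str] = []

    def add(self, text: str, vector: np.ndarray) -> None:
        self.vectors = np.vstack([self.vectors, vector.astype(np.float32)])
        self.texts.append(text)

    def search(self, query: np.ndarray, k: int = 3) -> list[str]:
        sims = self.vectors @ query / (
            np.linalg.norm(self.vectors, axis=1) * np.linalg.norm(query) + 1e-9
        )
        return [self.texts[i] for i in np.argsort(-sims)[:k]]

store = ToyVectorStore(dim=50)
store.add("great product", np.random.rand(50))
store.add("not worth the price", np.random.rand(50))
print(store.search(np.random.rand(50), k=1))  # the closest stored text
```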

That’s it!... Not!

Even after your data is collected, preprocessed, engineered, and stored, you should continuously monitor the quality and relevance of the data and update it as needed to improve the performance of your large language model. Otherwise, your LLM may soon become as obsolete as the T-800.
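
What might that monitoring look like? Here's a toy quality report you could run on a schedule; the record layout and the one-year staleness cutoff are assumptions made for the example:

```python
# Toy data-quality check over the stored corpus: count empty documents,
# duplicates, and records older than a cutoff so you know when to refresh.
from datetime import datetime, timedelta, timezone

def quality_report(records: list[dict]) -> dict:
    texts = [record["text"].strip() for record in records]
    cutoff = datetime.now(timezone.utc) - timedelta(days=365)
    return {
        "total": len(records),
        "empty": sum(1 for text in texts if not text),
        "duplicates": len(texts) - len(set(texts)),
        "stale": sum(1 for record in records if record["scraped_at"] < cutoff),
    }

records = [
    {"text": "Great product", "scraped_at": datetime(2022, 1, 1, tzinfo=timezone.utc)},
    {"text": "Great product", "scraped_at": datetime.now(timezone.utc)},
    {"text": "  ", "scraped_at": datetime.now(timezone.utc)},
]
print(quality_report(records))  # {'total': 3, 'empty': 1, 'duplicates': 1, 'stale': 1}
```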

Useful tools and tips for data ingestion

Web crawlers and scrapers for data collection

Website Content Crawler is an Apify Actor that can perform a deep crawl of one or more websites and extract text content from the web pages. It's useful for downloading data from websites such as documentation, knowledge bases, help sites, or blogs. It was specifically designed to extract data for feeding, fine-tuning, or training large language models.

PDF Text Extractor allows you to extract text from PDF files and supports chunking of the text to prepare the data for usage with large language models.
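
To show what chunking means in practice, here's a generic word-count chunker with overlap. It only illustrates the idea and isn't the tool's actual implementation:

```python
# Split a long document into overlapping chunks of at most `chunk_size`
# words so each piece fits comfortably into an LLM's context window.
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 20) -> list[str]:
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + chunk_size]))
        start += chunk_size - overlap
    return chunks

long_document = "word " * 1000
print(len(chunk_text(long_document)))  # 6 chunks for this 1,000-word document
```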

Automated data loading tools

  • Airbyte: An open-source platform specializing in extracting and loading data. It handles the setup of data pipelines, offering scheduling, syncing, and monitoring functions. Suitable for businesses not requiring data transformation before uploading to data lakes/databases.
  • Matillion: A data loading and ETL tool designed for small to medium-sized businesses to migrate data from various sources to a cloud database, with over 70 connectors to different data sources.
  • Fivetran: Focused on larger enterprises, Fivetran offers scalable data loading solutions with hundreds of prebuilt data connectors. It's an excellent choice for large businesses with extensive data ingestion needs.

Data preparation tools

  • Talend: A comprehensive ETL tool with more than 1,000 connectors, suitable for large organizations. It enables the aggregation of data from various sources and supports data warehousing solutions like Snowflake, AWS, Google, or Azure.
  • Dropbase: Offers an instant database tool for turning offline files into live databases, ideal for startups and SMBs that need to do so without significant expense.
  • Alteryx: An automated analytics tool for data ingestion, especially useful for visualizations and data science in a low-code environment, though it may be cost-prohibitive for small businesses.
  • Trifacta: A cloud-based data preparation tool focusing on ELT processes, allowing analysts to prepare and clean data without developer intervention. It's great for agile organizations and teams of analysts.

Tools for LLMs

  • Dagster: Plays a crucial role in orchestrating services for LLM training, running ingestion tasks, and structuring data for LLMs. It's effective when paired with tools like Airbyte and LangChain for maintaining data freshness and scalability.
  • Airbyte: Used in LLM training pipelines for data ingestion, supporting hundreds of data sources and allowing custom implementations.
  • LangChain: Combines with tools like Airbyte and Dagster in LLM training pipelines; its retrieval QA module is particularly useful for pulling contextual information out of a vector store.

Dealing with out-of-the-box connectivity

This challenge can be overcome by using data integration tools that offer pre-built connectors to a wide range of data sources. The aforementioned Airbyte, Fivetran, and Talend come with numerous connectors, which can help in easily integrating various data sources with minimal custom configuration.

Automated monitoring of ingestion jobs

The aforementioned Airbyte and Fivetran provide features for scheduling and syncing data ingestion tasks, which reduces the need for manual monitoring. Additionally, employing a data orchestration tool like Dagster can help in automating the workflow, including the monitoring of ingestion jobs.

Handling schema drift

Schema drift occurs when the structure of incoming data changes. To handle this, you can use data ingestion and ETL tools that support dynamic schema handling. For instance, Trifacta and Talend offer features that can automatically detect schema changes and adjust data pipelines accordingly. This ensures that the data ingestion process remains stable even when the data format or structure changes over time.
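
As a simple illustration of what detecting schema changes involves, here's a toy check that compares the fields of an incoming batch against an expected schema before loading (the field names are made up for the example):

```python
# Toy schema drift check: report fields that disappeared from or were
# added to the incoming records compared with the expected schema.
EXPECTED_SCHEMA = {"id", "text", "source", "timestamp"}

def detect_drift(records: list[dict]) -> tuple[set, set]:
    seen_fields = {key for record in records for key in record}
    missing = EXPECTED_SCHEMA - seen_fields
    added = seen_fields - EXPECTED_SCHEMA
    return missing, added

batch = [{"id": 1, "text": "hello", "source": "web", "lang": "en"}]
missing, added = detect_drift(batch)
print(missing, added)  # {'timestamp'} {'lang'}
```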

A recap of the data ingestion process

Assuming you have the stomach for a recap on data ingestion, let’s conclude with a quick run-down of the process:

  1. Define the data requirements for training the LLM.
  2. Collect the data (scrape websites, databases, or public datasets).
  3. Preprocess the data (cleaning, normalization, tokenization).
  4. Split, augment, and encode your data (feature engineering).
  5. Save and store for easy access by the LLM during training.
  6. Monitor to ensure data quality and relevance.

Now you have some basic idea of what training your very own LLM involves, but I can’t be held responsible for what you choose to do with this information. If you lose control over your LLM once AI reaches the technological singularity, that’s on you!

