In the context of AI and machine learning, particularly large language models, vector databases are really hot right now! People are investing in vector databases like crazy. But what are they?
Before I answer that, I’d better explain vectors. Thankfully, this part is quite simple. A vector is an array of numbers like this:
[0, 1, 2, 3, 4, … ]
Doesn’t seem very impressive, does it? But what’s really cool about these numbers is that they can represent more complex objects such as words, sentences, images, and audio files in an embedding.
What are embeddings, you ask? In the context of large language models, embeddings represent text as dense vectors of numbers that capture the meaning of words. Words with similar meanings are mapped to vectors that sit close together, and the same idea works for similar features in just about any other data type. These embeddings can then be used for search engines, recommendation systems, and generative AIs such as ChatGPT.
The question is, where do you store these embeddings, and how do you query them quickly? Vector databases are the answer. These databases store vectors and index them by similarity so that they can be queried with very low latency. In other words, given a query vector, a vector database compares it against the stored vectors and returns those that are most similar to it. That makes vector databases ideal for AI-driven applications.
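To make "finding those that are most similar" concrete, here is a minimal sketch of a similarity search in plain Python. It assumes cosine similarity as the measure and uses made-up three-dimensional vectors; the names `cosine_similarity`, `top_k`, and `vectors` are illustrative and not taken from any particular database. It also compares the query against every stored vector, whereas real vector databases typically rely on approximate nearest-neighbor indexes to stay fast at scale.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, vectors, k=2):
    """Return the k (id, score) pairs most similar to the query, best first."""
    scored = [(vid, cosine_similarity(query, vec)) for vid, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Made-up 3-dimensional "embeddings"; real ones have hundreds of dimensions.
vectors = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "car": [0.1, 0.9, 0.2],
}

# A query vector that points in roughly the same direction as "cat" and "kitten".
results = top_k([0.85, 0.15, 0.05], vectors, k=2)
print(results)
```

With these toy values, "cat" and "kitten" come out ahead of "car" because their vectors point in nearly the same direction as the query.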
Why are vector databases important for LLMs?
The main reason vector databases are in vogue is that they can extend large language models with long-term memory. You begin with a general-purpose model, like GPT-4, LLaMA, or LaMDA, but then you provide your own data in a vector database. When a user gives a prompt, you can query relevant documents from your database and add them to the context, which will customize the final response. What’s more, vector databases integrate with tools like LangChain, which chain LLM calls together with other components such as prompts and external data sources.
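The retrieve-then-prompt flow described above can be sketched in a few lines. This is only an illustration under loud assumptions: `embed` is a hypothetical stand-in that counts words instead of calling a real embedding model, the documents live in a plain list rather than a vector database, and `build_prompt` is a made-up helper, not part of LangChain or any other library.

```python
import math
import re
from collections import Counter


def embed(text):
    """Hypothetical stand-in for an embedding model: a sparse word-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def similarity(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_prompt(question, documents, k=1):
    """Pick the k most similar documents and place them in the prompt as context."""
    query_vec = embed(question)
    ranked = sorted(documents, key=lambda doc: similarity(query_vec, embed(doc)), reverse=True)
    context = "\n".join(ranked[:k])
    return f"Answer using this context:\n{context}\n\nQuestion: {question}"


# Your own data, which the general-purpose model has never seen.
documents = [
    "Our support desk is open Monday to Friday, 9am to 5pm.",
    "Refunds are processed within 14 days of the return request.",
]

prompt = build_prompt("When is the support desk open?", documents)
print(prompt)
```

The resulting `prompt` string is what you would send to the model. In a real setup, the ranking step would be a query against the vector database rather than a sort over a Python list.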
Here are a few of the top vector databases around, but things are moving so fast in AI, who knows how quickly this list might change?
Pinecone is a very popular but closed-source vector database for machine learning applications. Once you have vector embeddings, you can manage and search through them in Pinecone to power semantic search, recommenders, and other applications that rely on relevant information retrieval.
Chroma is an AI-native, open-source embedding database. Its early releases were based on ClickHouse under the hood, though later versions moved to a different storage backend. It’s a vector store designed from the ground up to make it easy to build AI applications with embeddings.
Weaviate and Milvus
I’ve put Weaviate and Milvus together because both are open-source options written largely in Go (Milvus also has a C++ core). Both allow you to store data objects alongside vector embeddings generated by machine learning models, and both are built to scale.
Qdrant is a vector similarity engine developed entirely in Rust, which helps keep it fast and reliable even under high load. Each vector can carry a payload that supports a wide variety of data types and query conditions, and filtering on that payload makes Qdrant useful for all sorts of neural-network or semantic-based matching, faceted search, and other applications.
That’s all well and good, but you can’t do much with a vector database if you don’t have data in the first place, right? So now it’s time to present a great web scraping tool for feeding your vector databases: Website Content Crawler.
Website Content Crawler (let's just call it WCC for brevity) was specifically designed to extract web data for feeding, fine-tuning, or training large language models. It automatically removes headers, footers, menus, ads, and other noise from web pages in order to return only the text content that can be directly fed to the models.
WCC has a simple input configuration, which means it can be easily integrated into customer-facing products. Customers only need to enter the URL of the website they want to be indexed by LLMs. The results can be retrieved via an API in formats such as JSON or CSV, which can be fed directly into your vector database or language model.
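As a rough idea of what "feeding" a vector database might look like, here is a minimal sketch that turns crawled JSON into records ready for embedding. The input shape is an assumption: the field names `url` and `text` are hypothetical and may not match WCC's actual output, and `chunk_text` and `to_records` are illustrative helpers. The sketch splits each page into smaller chunks, since long pages are commonly broken up before they are embedded.

```python
import json


def chunk_text(text, max_words=50):
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def to_records(crawl_json, max_words=50):
    """Turn crawled pages into records with an id, a text chunk, and metadata."""
    records = []
    for page in json.loads(crawl_json):
        # "url" and "text" are assumed field names, not a documented schema.
        for n, chunk in enumerate(chunk_text(page["text"], max_words)):
            records.append({
                "id": f"{page['url']}#{n}",
                "text": chunk,
                "metadata": {"url": page["url"]},
            })
    return records


# A made-up crawl result: one page containing 120 words.
sample = json.dumps([{"url": "https://example.com/docs", "text": "word " * 120}])
records = to_records(sample)
print(len(records), records[0]["id"])
```

Each record's `text` would then be passed to an embedding model, and the resulting vector stored in the database together with the `id` and `metadata`.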