We’re Apify, and our mission is to make the web more programmable. This article about alternatives to Pinecone was inspired by our work on getting better data for AI. Check us out.
In a previous blog post, I introduced you to Pinecone, one of the most popular vector databases around. If you want to know more about Pinecone or vector databases in general, I suggest you read that post first.
Oh, you’re still here? Then I guess you’re already familiar with Pinecone and vector databases. But that doesn’t mean I won’t subject you to some preamble. Before we get to our list of open-source alternatives, let’s at least touch upon why they’re so important for large language models.
How do vector databases help with LLMs?
Large language models have brought generative AI into the mainstream, but they have a couple of drawbacks:
1) Large language models have a word limit
LLMs have a limited context window, so they can accept only a certain number of tokens as input. If you want to work with more than a few thousand words at once, you either need to fine-tune the model by training it on new data or extract only the text that's relevant to your prompt.
That’s where vector embeddings come in. An ‘embedding’ is a numerical vector representation of text that captures its semantic meaning. By splitting content into manageable chunks and embedding each one, you can retrieve just the pieces that fit into the limited context of a language model like ChatGPT. However, this creates a new problem: where do you store all these embeddings? Vector databases are the answer.
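As a rough illustration, here's what the chunk-then-retrieve idea boils down to in plain Python. This is a toy sketch, not any particular library's API: the `chunk_text` helper is hypothetical, and the hard-coded vectors stand in for what a real embedding model would produce.

```python
import math

def chunk_text(text, max_words=200):
    """Split text into chunks of at most max_words words, so each
    chunk can be embedded and later fed into a limited context window."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

def cosine_similarity(a, b):
    """Compare two embedding vectors; 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

chunks = chunk_text("word " * 450)  # 450 words -> 3 chunks
print(len(chunks))                  # -> 3

# In a real pipeline, an embedding model turns each chunk (and the
# user's query) into a vector; here, toy 2-D vectors stand in for them.
chunk_vectors = [[0.1, 0.9], [0.8, 0.2], [0.5, 0.5]]
query_vector = [0.9, 0.1]
best = max(range(len(chunk_vectors)),
           key=lambda i: cosine_similarity(query_vector, chunk_vectors[i]))
print(best)  # -> 1, the chunk whose vector points most nearly the same way
```

A vector database does the same similarity lookup, but persistently and over millions of vectors instead of three.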
2) Large language models are stuck in the past
LLMs are trained on data with a fixed cutoff date - ChatGPT's was originally September 2021. Want ChatGPT to help you with content related to current news or finding the best available properties on Airbnb this summer? It will do one of two things: throw its hands up and bleat on about its identity as an AI language model to justify its inability to do what you want, or - worse - hallucinate a fake answer.
As of September 27, 2023, GPT-4's knowledge is no longer limited to data before September 2021
Recently, paying users of ChatGPT gained access to the internet through third-party tools and the use of OpenAI plugins, but an even better solution is to use web scraping to provide GPT or whatever LLM you’re using with the information needed to answer your questions. But this creates another problem: if you have a huge dataset for your LLM, you need a way to store it and pass it on to your language model. Vector databases are again the solution.
Website Content Crawler automatically removes headers, footers, menus, ads, and other noise from web pages in order to return only text content that can be directly fed to language models to create chatbots and other useful AI tools.
Why use a Pinecone alternative?
Pinecone is a service that stores vector data in a cloud-based Pinecone-managed database. Your applications interact with the Pinecone service through APIs to store and retrieve vector data. And while Pinecone is the industry leader when it comes to vector databases, there’s one thing about it that some developers aren’t too keen on: it isn’t open source, which means you don't have the option to host your own instance.
We love open source at Apify (check out our open-source web scraping and automation library, Crawlee), and I’m willing to bet many of you do, too. So here are six popular open-source Pinecone alternatives you might want to explore (all links to these alternatives will take you to their GitHub repo).
Transfer results from Actors to the Pinecone vector database, enabling Retrieval-Augmented Generation (RAG) or semantic search over data extracted from the web.
6 Pinecone alternatives that are open source
Weaviate
Around the time of its Series A round in early 2022, Weaviate’s open-source downloads passed the two million mark, and its Series B in April 2023 raised $50 million. That's certainly enough to make us pay attention to Weaviate as a Pinecone alternative. Apart from being open source, there’s another difference between Pinecone and Weaviate: Pinecone is a more general-purpose vector database that can be used for multiple data types (images, audio, sensor data), while Weaviate is designed specifically for natural-language or numerical data based on contextualized word embeddings.
Milvus
Like Weaviate, Milvus is an open-source vector database written largely in Go. It was created by the startup Zilliz, which reached $113 million in investment last year. The Milvus vector database is designed from the ground up to handle embedding vectors converted from unstructured data. It can run queries over input vectors and is capable of indexing vectors at huge scale.
Chroma
Chroma is another provider that attracted plenty of investment for its embedding database this year. It lets you build Python or JavaScript LLM apps with memory and provides a local, ephemeral storage option: vector data is kept on the machine running your application, with no external service or database required to store it.
Qdrant
Qdrant is a vector similarity engine developed entirely in Rust, making it fast and reliable even under high load. Payloads attached to vectors support a wide variety of data types and query conditions, and Qdrant's filtering makes it useful for all sorts of neural-network or semantic matching, faceted search, and other applications.
Transfer results from Actors to the Qdrant vector database to train AI models or supply them with fresh web content.
Faiss
Faiss stands for Facebook AI Similarity Search. It's a library that lets you quickly search for similar multimedia documents using approximate nearest-neighbor search at huge scale. Faiss is fundamentally an index rather than a database: it solves the approximate nearest-neighbor problem, not the storage problem.
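To make that distinction concrete, here's the problem Faiss accelerates, written as an exact brute-force scan in plain Python. This is a toy sketch for intuition, not Faiss's API; the function names are made up for illustration.

```python
import math

def l2_distance(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(query, vectors):
    """Exact nearest-neighbor search: compares the query against every
    stored vector, so each lookup costs O(n * d). Faiss exists because
    this linear scan doesn't scale to millions of high-dimensional
    vectors - its indexes (IVF, HNSW, product quantization) answer the
    same question approximately, but far faster."""
    return min(range(len(vectors)),
               key=lambda i: l2_distance(query, vectors[i]))

vectors = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
print(nearest([0.9, 1.2], vectors))  # -> 1
```

Note that nothing here persists the vectors anywhere - which is exactly why Faiss is usually paired with separate storage, while a vector database bundles both jobs.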
LlamaIndex
Formerly known as GPT Index, LlamaIndex is a data framework for building LLM applications. It provides tools for data ingestion, structuring, retrieval, and integration with multiple application frameworks. LlamaIndex gives you the ability to query your data for any downstream LLM use case, whether it’s question-answering, summarization, or a component in a chatbot.
The Apify Actor Loader is designed to do just that and can subsequently be used as a Tool in a LangChain Agent.
View it on GitHub.
Combine vector databases with LangChain
I can't leave the subject of vector databases without saying a few words about LangChain, which has quickly become the library of choice for building on top of generative AI models.
Unlike the aforementioned libraries, which are specifically designed for their vector database services or indexes, LangChain is a more generic library that simplifies the process of integrating different vector databases into an application. That means you can use multiple databases and switch between them without committing to one specific service or its implementation.
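The benefit here is the abstraction pattern itself: your application codes against one interface, and the backing store is pluggable. A schematic sketch of the idea in plain Python - these are not LangChain's actual classes, just an illustration of the pattern:

```python
from abc import ABC, abstractmethod

class VectorStore(ABC):
    """Minimal interface an application might code against."""
    @abstractmethod
    def add(self, ids, vectors): ...
    @abstractmethod
    def query(self, vector, k=1): ...

class InMemoryStore(VectorStore):
    """One interchangeable backend; a Pinecone- or Qdrant-backed class
    implementing the same two methods could be swapped in without
    touching application code."""
    def __init__(self):
        self.data = {}

    def add(self, ids, vectors):
        self.data.update(zip(ids, vectors))

    def query(self, vector, k=1):
        def sq_dist(v):
            return sum((a - b) ** 2 for a, b in zip(v, vector))
        return sorted(self.data, key=lambda i: sq_dist(self.data[i]))[:k]

store: VectorStore = InMemoryStore()
store.add(["a", "b"], [[0.0, 1.0], [1.0, 0.0]])
print(store.query([0.9, 0.1], k=1))  # -> ['b']
```

LangChain ships ready-made wrappers playing the role of `InMemoryStore` for each of the databases above, which is what makes switching between them cheap.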
Learn their functionalities and differences
You can integrate LangChain with Pinecone and all the vector databases mentioned above. You can also integrate LangChain with Apify, which you can use to collect data for your vector databases.