
ChromaDocumentStore

Chroma is an open-source vector database that can store collections of documents along with their metadata, create embeddings for documents and queries, and search collections with filters on document metadata or content. Chroma also supports multi-modal embedding functions.

Chroma can be used in-memory, as an embedded database, or in a client-server fashion. When running in-memory, Chroma can still keep its contents on disk across different sessions. This allows users to quickly put together prototypes using the in-memory version and later move to production, where the client-server version is deployed.

📘

By default, the Haystack integration runs Chroma in-memory, without storing data across different sessions. The Connection Options section describes the persistent and remote modes.

Initialization

First, install the Chroma integration, which will install Haystack and Chroma if they are not already present. The following command is all you need to start:

pip install chroma-haystack

To store data in Chroma, create a ChromaDocumentStore instance and write documents with:

from haystack_integrations.document_stores.chroma import ChromaDocumentStore
from haystack import Document

document_store = ChromaDocumentStore()
document_store.write_documents([
    Document(content="This is the first document."),
    Document(content="This is the second document.")
])
print(document_store.count_documents())

In this case, since we didn’t pass any embeddings along with our documents, Chroma will create them for us using its default embedding function.

Connection Options

  1. In-Memory Mode (Local): Chroma can be set up as a local Document Store for fast and lightweight usage. You can use this option during development or small-scale experiments. Set up a local in-memory instance of ChromaDocumentStore like this:

    from haystack_integrations.document_stores.chroma import ChromaDocumentStore
    
    document_store = ChromaDocumentStore()
    
  2. Persistent Storage: If you need to retain documents between sessions, enable persistent storage by specifying a path where Chroma stores its data on disk:

    from haystack_integrations.document_stores.chroma import ChromaDocumentStore
    
    document_store = ChromaDocumentStore(persist_path="your_directory_path")
    
  3. Remote Connection: You can connect to a remote Chroma database through HTTP. This is suitable for distributed setups where multiple clients might interact with the same remote Chroma instance.

    Note that this option is incompatible with in-memory or persistent storage modes.

    First, start a Chroma server:

    chroma run --path /db_path
    

    Then, initialize the Document Store with host and port parameters:

    from haystack_integrations.document_stores.chroma import ChromaDocumentStore
    
    document_store = ChromaDocumentStore(host="host_address", port="port_number")
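
The three modes above are mutually exclusive, so it can help to pick the constructor arguments in one place. The following is a minimal sketch, not part of the integration: `chroma_connection_kwargs` is a hypothetical helper, the requirement that `host` and `port` come together is this sketch's own choice, and `localhost`/`8000` are illustrative values only.

```python
from typing import Any, Dict, Optional


def chroma_connection_kwargs(
    persist_path: Optional[str] = None,
    host: Optional[str] = None,
    port: Optional[int] = None,
) -> Dict[str, Any]:
    """Return keyword arguments for exactly one connection mode (hypothetical helper)."""
    remote = host is not None or port is not None
    # A remote connection cannot be mixed with on-disk persistence.
    if remote and persist_path is not None:
        raise ValueError("Remote connection cannot be combined with persist_path")
    if remote:
        # Assumption of this sketch: a remote connection needs both values.
        if host is None or port is None:
            raise ValueError("Remote connection needs both host and port")
        return {"host": host, "port": port}
    if persist_path is not None:
        return {"persist_path": persist_path}
    # No arguments at all selects the in-memory mode.
    return {}


print(chroma_connection_kwargs())
print(chroma_connection_kwargs(persist_path="your_directory_path"))
print(chroma_connection_kwargs(host="localhost", port=8000))
```

The returned dictionary can then be unpacked into the constructor, for example `ChromaDocumentStore(**chroma_connection_kwargs(persist_path="your_directory_path"))`.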
    

Supported Retrievers

The Haystack Chroma integration comes with two Retriever components. Both rely on the Chroma query API, but they take different inputs, so you can pick the one that best fits your pipeline:

  • ChromaQueryTextRetriever: This Retriever takes a plain-text query string as input and returns a list of matching documents. Chroma creates the embedding for the query using its default embedding function.
  • ChromaEmbeddingRetriever: This Retriever takes the embedding of a single query as input and returns a list of matching documents. The query must be embedded before it is passed to this component, for example with an embedder component.
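
The embed-then-retrieve flow that ChromaEmbeddingRetriever expects can be sketched as below. This is an illustration under assumptions rather than the integration's own code: `retrieve_by_embedding` is a hypothetical helper, and it assumes the embedder's `run(text=...)` returns a dictionary with an `"embedding"` entry and the retriever's `run(query_embedding=..., top_k=...)` returns a dictionary with a `"documents"` entry. The commented wiring shows how real components would likely be plugged in.

```python
from typing import Any, List


def retrieve_by_embedding(text_embedder: Any, retriever: Any, query: str, top_k: int = 3) -> List[Any]:
    """Embed `query` first, then hand the vector to an embedding retriever."""
    # Assumed embedder contract: run(text=...) returns {"embedding": [...]}.
    query_embedding = text_embedder.run(text=query)["embedding"]
    # The retriever receives the precomputed embedding, not the raw text.
    result = retriever.run(query_embedding=query_embedding, top_k=top_k)
    return result["documents"]


# Assumed wiring with real components (needs the packages and a model download):
#   from haystack.components.embedders import SentenceTransformersTextEmbedder
#   from haystack_integrations.components.retrievers.chroma import ChromaEmbeddingRetriever
#
#   embedder = SentenceTransformersTextEmbedder()
#   embedder.warm_up()
#   retriever = ChromaEmbeddingRetriever(document_store=document_store)
#   documents = retrieve_by_embedding(embedder, retriever, "first document")
```

For this to return meaningful matches, the stored documents should carry embeddings produced by the same model as the query. With ChromaQueryTextRetriever the embedding step disappears, because the query string is passed to the retriever directly.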

Additional References

🧑‍🍳 Cookbook: Use Chroma for RAG and Indexing