Chroma

Chroma integration for Haystack

Module haystack_integrations.components.retrievers.chroma.retriever

ChromaQueryTextRetriever

A component for retrieving documents from a Chroma database using the query API.

Example usage:

from haystack import Pipeline
from haystack.components.converters import TextFileToDocument
from haystack.components.writers import DocumentWriter

from haystack_integrations.document_stores.chroma import ChromaDocumentStore
from haystack_integrations.components.retrievers.chroma import ChromaQueryTextRetriever

file_paths = ...

# Chroma runs in-memory here, so both pipelines below must share the same document store instance
document_store = ChromaDocumentStore()

indexing = Pipeline()
indexing.add_component("converter", TextFileToDocument())
indexing.add_component("writer", DocumentWriter(document_store))
indexing.connect("converter", "writer")
indexing.run({"converter": {"sources": file_paths}})

querying = Pipeline()
querying.add_component("retriever", ChromaQueryTextRetriever(document_store))
results = querying.run({"retriever": {"query": "Variable declarations", "top_k": 3}})

for d in results["retriever"]["documents"]:
    print(d.meta, d.score)

ChromaQueryTextRetriever.__init__

def __init__(document_store: ChromaDocumentStore,
             filters: Optional[Dict[str, Any]] = None,
             top_k: int = 10,
             filter_policy: Union[str, FilterPolicy] = FilterPolicy.REPLACE)

Arguments:

  • document_store: an instance of ChromaDocumentStore.
  • filters: filters to narrow down the search space.
  • top_k: the maximum number of documents to retrieve.
  • filter_policy: Policy to determine how filters are applied.
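The effect of filter_policy can be sketched in plain Python. This is a simplified illustration, not the library's implementation: the FilterPolicy enum below is a local stand-in assumed to have REPLACE and MERGE members, resolve_filters is a hypothetical helper, and the merge rule (combine both filters under AND) is an assumption made for the example.

```python
from enum import Enum
from typing import Any, Dict, Optional


class FilterPolicy(Enum):
    """Local stand-in for the FilterPolicy enum (assumed members)."""

    REPLACE = "replace"
    MERGE = "merge"


def resolve_filters(
    policy: FilterPolicy,
    init_filters: Optional[Dict[str, Any]],
    runtime_filters: Optional[Dict[str, Any]],
) -> Optional[Dict[str, Any]]:
    """Hypothetical helper: choose the filters one run() call would apply."""
    # No runtime filters: fall back to the ones given at initialization.
    if runtime_filters is None:
        return init_filters
    # REPLACE (or nothing to merge with): runtime filters win outright.
    if policy is FilterPolicy.REPLACE or init_filters is None:
        return runtime_filters
    # Simplified MERGE: both conditions must hold.
    return {"operator": "AND", "conditions": [init_filters, runtime_filters]}


init = {"field": "meta.lang", "operator": "==", "value": "en"}
runtime = {"field": "meta.year", "operator": ">=", "value": 2020}

replaced = resolve_filters(FilterPolicy.REPLACE, init, runtime)
merged = resolve_filters(FilterPolicy.MERGE, init, runtime)
```

With REPLACE, the filters passed at run time are used on their own. With the simplified MERGE above, the initialization filters still constrain every query.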

ChromaQueryTextRetriever.run

@component.output_types(documents=List[Document])
def run(query: str,
        filters: Optional[Dict[str, Any]] = None,
        top_k: Optional[int] = None)

Run the retriever on the given input data.

Arguments:

  • query: The input data for the retriever. In this case, a plain-text query.
  • filters: Filters applied to the retrieved Documents. The way runtime filters are applied depends on the filter_policy chosen at retriever initialization. See init method docstring for more details.
  • top_k: The maximum number of documents to retrieve. If not specified, the default value from the constructor is used.

Raises:

  • ValueError: If the specified document store is not found or is not a ChromaDocumentStore instance.

Returns:

A dictionary with the following keys:

  • documents: List of documents returned by the search engine.

ChromaQueryTextRetriever.run_async

@component.output_types(documents=List[Document])
async def run_async(query: str,
                    filters: Optional[Dict[str, Any]] = None,
                    top_k: Optional[int] = None)

Asynchronously run the retriever on the given input data.

Asynchronous methods are only supported for HTTP connections.

Arguments:

  • query: The input data for the retriever. In this case, a plain-text query.
  • filters: Filters applied to the retrieved Documents. The way runtime filters are applied depends on the filter_policy chosen at retriever initialization. See init method docstring for more details.
  • top_k: The maximum number of documents to retrieve. If not specified, the default value from the constructor is used.

Raises:

  • ValueError: If the specified document store is not found or is not a ChromaDocumentStore instance.

Returns:

A dictionary with the following keys:

  • documents: List of documents returned by the search engine.

ChromaQueryTextRetriever.from_dict

@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "ChromaQueryTextRetriever"

Deserializes the component from a dictionary.

Arguments:

  • data: Dictionary to deserialize from.

Returns:

Deserialized component.

ChromaQueryTextRetriever.to_dict

def to_dict() -> Dict[str, Any]

Serializes the component to a dictionary.

Returns:

Dictionary with serialized data.
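A to_dict / from_dict round trip might look roughly like the sketch below. ToyRetriever is a hypothetical class, not ChromaQueryTextRetriever, and the "type" / "init_parameters" layout is an assumption about the dictionary format. The example mainly shows why filter_policy accepts either a string or an enum member: the enum is written out as a string and rebuilt on load.

```python
from enum import Enum
from typing import Any, Dict, Union


class FilterPolicy(Enum):
    """Local stand-in for the FilterPolicy enum (assumed members)."""

    REPLACE = "replace"
    MERGE = "merge"


class ToyRetriever:
    """Hypothetical component used only to illustrate the round trip."""

    def __init__(
        self,
        top_k: int = 10,
        filter_policy: Union[str, FilterPolicy] = FilterPolicy.REPLACE,
    ):
        self.top_k = top_k
        # Accept either the enum member or its string value.
        self.filter_policy = FilterPolicy(filter_policy)

    def to_dict(self) -> Dict[str, Any]:
        # Only plain data goes into the dictionary; the enum becomes a string.
        return {
            "type": f"{type(self).__module__}.{type(self).__name__}",
            "init_parameters": {
                "top_k": self.top_k,
                "filter_policy": self.filter_policy.value,
            },
        }

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "ToyRetriever":
        # Re-run __init__ with the stored parameters.
        return cls(**data["init_parameters"])


original = ToyRetriever(top_k=3, filter_policy=FilterPolicy.MERGE)
restored = ToyRetriever.from_dict(original.to_dict())
```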

ChromaEmbeddingRetriever

A component for retrieving documents from a Chroma database using embeddings.

ChromaEmbeddingRetriever.__init__

def __init__(document_store: ChromaDocumentStore,
             filters: Optional[Dict[str, Any]] = None,
             top_k: int = 10,
             filter_policy: Union[str, FilterPolicy] = FilterPolicy.REPLACE)

Arguments:

  • document_store: an instance of ChromaDocumentStore.
  • filters: filters to narrow down the search space.
  • top_k: the maximum number of documents to retrieve.
  • filter_policy: Policy to determine how filters are applied.

ChromaEmbeddingRetriever.run

@component.output_types(documents=List[Document])
def run(query_embedding: List[float],
        filters: Optional[Dict[str, Any]] = None,
        top_k: Optional[int] = None)

Run the retriever on the given input data.

Arguments:

  • query_embedding: the query embeddings.
  • filters: Filters applied to the retrieved Documents. The way runtime filters are applied depends on the filter_policy chosen at retriever initialization. See init method docstring for more details.
  • top_k: the maximum number of documents to retrieve. If not specified, the default value from the constructor is used.

Returns:

a dictionary with the following keys:

  • documents: List of documents returned by the search engine.

ChromaEmbeddingRetriever.run_async

@component.output_types(documents=List[Document])
async def run_async(query_embedding: List[float],
                    filters: Optional[Dict[str, Any]] = None,
                    top_k: Optional[int] = None)

Asynchronously run the retriever on the given input data.

Asynchronous methods are only supported for HTTP connections.

Arguments:

  • query_embedding: the query embeddings.
  • filters: Filters applied to the retrieved Documents. The way runtime filters are applied depends on the filter_policy chosen at retriever initialization. See init method docstring for more details.
  • top_k: the maximum number of documents to retrieve. If not specified, the default value from the constructor is used.

Returns:

a dictionary with the following keys:

  • documents: List of documents returned by the search engine.

ChromaEmbeddingRetriever.from_dict

@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "ChromaEmbeddingRetriever"

Deserializes the component from a dictionary.

Arguments:

  • data: Dictionary to deserialize from.

Returns:

Deserialized component.

ChromaEmbeddingRetriever.to_dict

def to_dict() -> Dict[str, Any]

Serializes the component to a dictionary.

Returns:

Dictionary with serialized data.

Module haystack_integrations.document_stores.chroma.document_store

ChromaDocumentStore

A document store using Chroma as the backend.

We use the collection.get API to implement the document store protocol; the collection.search API is used by the retrievers instead.

ChromaDocumentStore.__init__

def __init__(collection_name: str = "documents",
             embedding_function: str = "default",
             persist_path: Optional[str] = None,
             host: Optional[str] = None,
             port: Optional[int] = None,
             distance_function: Literal["l2", "cosine", "ip"] = "l2",
             metadata: Optional[dict] = None,
             **embedding_function_params)

Creates a new ChromaDocumentStore instance.

It is meant to be connected to a Chroma collection.

Note: for the component to be part of a serializable pipeline, its init parameters must be serializable. This is why the embedding function is configured by passing a string that is looked up in a registry.
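The serialization constraint can be illustrated with the standard json module. This is only a sketch: toy_embedding_function is a hypothetical callable, and JSON stands in for whatever format a pipeline is actually saved in.

```python
import json


def toy_embedding_function(texts):
    """Hypothetical callable standing in for an embedding function."""
    return [[float(len(text))] for text in texts]


# Passing the callable itself: the parameters cannot be dumped to JSON.
params_with_callable = {
    "collection_name": "documents",
    "embedding_function": toy_embedding_function,
}
try:
    json.dumps(params_with_callable)
    callable_is_serializable = True
except TypeError:
    callable_is_serializable = False

# Passing a name instead: the parameters are plain data and round-trip cleanly.
params_with_name = {"collection_name": "documents", "embedding_function": "default"}
round_tripped = json.loads(json.dumps(params_with_name))
```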

Arguments:

  • collection_name: the name of the collection to use in the database.
  • embedding_function: the name of the embedding function used to embed the query.
  • persist_path: Path for local persistent storage. Cannot be used in combination with host and port. If none of persist_path, host, and port is specified, the database will be in-memory.
  • host: The host address for the remote Chroma HTTP client connection. Cannot be used with persist_path.
  • port: The port number for the remote Chroma HTTP client connection. Cannot be used with persist_path.
  • distance_function: The distance metric for the embedding space.
      • "l2" computes the Euclidean (straight-line) distance between vectors, where smaller scores indicate more similarity.
      • "cosine" computes the cosine similarity between vectors, with higher scores indicating greater similarity.
      • "ip" stands for inner product, where higher scores indicate greater similarity between vectors.
    Note: distance_function can only be set during the creation of a collection. To change the distance metric of an existing collection, consider cloning the collection.
  • metadata: a dictionary of chromadb collection parameters passed directly to chromadb's client method create_collection. If it contains the key "hnsw:space", the value will take precedence over the distance_function parameter above.
  • embedding_function_params: additional parameters to pass to the embedding function.
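The rules described for these arguments can be sketched as two small functions. Both helpers are hypothetical and not part of the integration. Treating a host or a port on its own as an HTTP connection is an assumption made for the example, and the real constructor may validate its arguments differently.

```python
from typing import Any, Dict, Optional


def connection_mode(
    persist_path: Optional[str] = None,
    host: Optional[str] = None,
    port: Optional[int] = None,
) -> str:
    """Hypothetical helper: which kind of Chroma client the arguments imply."""
    wants_http = host is not None or port is not None
    if persist_path is not None and wants_http:
        raise ValueError("persist_path cannot be combined with host and port")
    if wants_http:
        return "http"
    if persist_path is not None:
        return "persistent"
    # None of persist_path, host, and port given: in-memory database.
    return "in-memory"


def effective_distance(
    distance_function: str = "l2",
    metadata: Optional[Dict[str, Any]] = None,
) -> str:
    """Hypothetical helper: "hnsw:space" in metadata overrides distance_function."""
    return (metadata or {}).get("hnsw:space", distance_function)


modes = [
    connection_mode(),
    connection_mode(persist_path="./chroma_db"),
    connection_mode(host="localhost", port=8000),
]
space = effective_distance("l2", {"hnsw:space": "cosine"})
```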

ChromaDocumentStore.count_documents

def count_documents() -> int

Returns how many documents are present in the document store.

Returns:

how many documents are present in the document store.

ChromaDocumentStore.count_documents_async

async def count_documents_async() -> int

Asynchronously returns how many documents are present in the document store.

Asynchronous methods are only supported for HTTP connections.

Returns:

how many documents are present in the document store.

ChromaDocumentStore.filter_documents

def filter_documents(
        filters: Optional[Dict[str, Any]] = None) -> List[Document]

Returns the documents that match the filters provided.

For a detailed specification of the filters, refer to the documentation.

Arguments:

  • filters: the filters to apply to the document list.

Returns:

a list of Documents that match the given filters.

ChromaDocumentStore.filter_documents_async

async def filter_documents_async(
        filters: Optional[Dict[str, Any]] = None) -> List[Document]

Asynchronously returns the documents that match the filters provided.

Asynchronous methods are only supported for HTTP connections.

For a detailed specification of the filters, refer to the documentation.

Arguments:

  • filters: the filters to apply to the document list.

Returns:

a list of Documents that match the given filters.

ChromaDocumentStore.write_documents

def write_documents(documents: List[Document],
                    policy: DuplicatePolicy = DuplicatePolicy.FAIL) -> int

Writes (or overwrites) documents into the store.

Arguments:

  • documents: A list of documents to write into the document store.
  • policy: Not supported at the moment.

Raises:

  • ValueError: When input is not valid.

Returns:

The number of documents written

ChromaDocumentStore.write_documents_async

async def write_documents_async(
        documents: List[Document],
        policy: DuplicatePolicy = DuplicatePolicy.FAIL) -> int

Asynchronously writes (or overwrites) documents into the store.

Asynchronous methods are only supported for HTTP connections.

Arguments:

  • documents: A list of documents to write into the document store.
  • policy: Not supported at the moment.

Raises:

  • ValueError: When input is not valid.

Returns:

The number of documents written

ChromaDocumentStore.delete_documents

def delete_documents(document_ids: List[str]) -> None

Deletes all documents whose ids match the given document_ids from the document store.

Arguments:

  • document_ids: the document ids to delete

ChromaDocumentStore.delete_documents_async

async def delete_documents_async(document_ids: List[str]) -> None

Asynchronously deletes all documents whose ids match the given document_ids from the document store.

Asynchronous methods are only supported for HTTP connections.

Arguments:

  • document_ids: the document ids to delete

ChromaDocumentStore.search

def search(queries: List[str],
           top_k: int,
           filters: Optional[Dict[str, Any]] = None) -> List[List[Document]]

Search the documents in the store using the provided text queries.

Arguments:

  • queries: the list of queries to search for.
  • top_k: top_k documents to return for each query.
  • filters: a dictionary of filters to apply to the search. Accepts filters in haystack format.

Returns:

matching documents for each query.

ChromaDocumentStore.search_async

async def search_async(
        queries: List[str],
        top_k: int,
        filters: Optional[Dict[str, Any]] = None) -> List[List[Document]]

Asynchronously search the documents in the store using the provided text queries.

Asynchronous methods are only supported for HTTP connections.

Arguments:

  • queries: the list of queries to search for.
  • top_k: top_k documents to return for each query.
  • filters: a dictionary of filters to apply to the search. Accepts filters in haystack format.

Returns:

matching documents for each query.

ChromaDocumentStore.search_embeddings

def search_embeddings(
        query_embeddings: List[List[float]],
        top_k: int,
        filters: Optional[Dict[str, Any]] = None) -> List[List[Document]]

Perform vector search on the stored documents, passing the embeddings of the queries instead of their text.

Arguments:

  • query_embeddings: a list of embeddings to use as queries.
  • top_k: the maximum number of documents to retrieve.
  • filters: a dictionary of filters to apply to the search. Accepts filters in haystack format.

Returns:

a list of lists of documents that match the given filters.
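The shape of the result, one inner list per query embedding, can be illustrated with a brute-force search over a toy in-memory mapping. This is not how Chroma performs the search. toy_search_embeddings and the stored dictionary are hypothetical, filters are left out, and plain Euclidean distance is used so that smaller scores mean closer matches.

```python
import math
from typing import Dict, List, Tuple


def l2_distance(a: List[float], b: List[float]) -> float:
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def toy_search_embeddings(
    query_embeddings: List[List[float]],
    stored: Dict[str, List[float]],
    top_k: int,
) -> List[List[Tuple[str, float]]]:
    """Hypothetical brute-force search: one ranked result list per query."""
    results = []
    for query in query_embeddings:
        scored = [(doc_id, l2_distance(query, emb)) for doc_id, emb in stored.items()]
        # Smaller distance means a closer match, so sort ascending.
        scored.sort(key=lambda pair: pair[1])
        results.append(scored[:top_k])
    return results


stored = {"a": [0.0, 0.0], "b": [1.0, 0.0], "c": [0.0, 3.0]}
hits = toy_search_embeddings([[0.9, 0.0], [0.0, 2.5]], stored, top_k=2)
```

Here hits[0] holds the matches for the first query embedding and hits[1] those for the second, each capped at top_k entries.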

ChromaDocumentStore.search_embeddings_async

async def search_embeddings_async(
        query_embeddings: List[List[float]],
        top_k: int,
        filters: Optional[Dict[str, Any]] = None) -> List[List[Document]]

Asynchronously perform vector search on the stored documents, passing the embeddings of the queries instead of their text.

Asynchronous methods are only supported for HTTP connections.

Arguments:

  • query_embeddings: a list of embeddings to use as queries.
  • top_k: the maximum number of documents to retrieve.
  • filters: a dictionary of filters to apply to the search. Accepts filters in haystack format.

Returns:

a list of lists of documents that match the given filters.

ChromaDocumentStore.from_dict

@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "ChromaDocumentStore"

Deserializes the component from a dictionary.

Arguments:

  • data: Dictionary to deserialize from.

Returns:

Deserialized component.

ChromaDocumentStore.to_dict

def to_dict() -> Dict[str, Any]

Serializes the component to a dictionary.

Returns:

Dictionary with serialized data.

Module haystack_integrations.document_stores.chroma.errors

ChromaDocumentStoreError

Parent class for all ChromaDocumentStore exceptions.

ChromaDocumentStoreFilterError

Raised when a filter is not valid for a ChromaDocumentStore.

ChromaDocumentStoreConfigError

Raised when a configuration is not valid for a ChromaDocumentStore.

Module haystack_integrations.document_stores.chroma.utils

get_embedding_function

def get_embedding_function(function_name: str, **kwargs) -> EmbeddingFunction

Load an embedding function by name.

Arguments:

  • function_name: the name of the embedding function.
  • kwargs: additional arguments to pass to the embedding function.

Raises:

  • ChromaDocumentStoreConfigError: if the function name is invalid.

Returns:

the loaded embedding function.
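A name-based lookup of this kind is often built on a dictionary registry. The sketch below is a guess at the general pattern, not the module's actual code. TOY_REGISTRY, ToyConfigError, and the length-based embedder are all hypothetical names introduced for illustration.

```python
from typing import Callable, Dict, List

Embedder = Callable[[List[str]], List[List[float]]]


class ToyConfigError(Exception):
    """Hypothetical stand-in for ChromaDocumentStoreConfigError."""


def _length_embedder(**kwargs) -> Embedder:
    # Toy factory: "embeds" each text as its scaled length.
    scale = float(kwargs.get("scale", 1.0))
    return lambda texts: [[scale * len(text)] for text in texts]


# Maps a serializable name to a factory that builds the embedding function.
TOY_REGISTRY: Dict[str, Callable[..., Embedder]] = {"default": _length_embedder}


def toy_get_embedding_function(function_name: str, **kwargs) -> Embedder:
    """Look up a factory by name and build the function with the extra arguments."""
    try:
        factory = TOY_REGISTRY[function_name]
    except KeyError as exc:
        raise ToyConfigError(f"Invalid function name: {function_name}") from exc
    return factory(**kwargs)


embed = toy_get_embedding_function("default", scale=2.0)
vectors = embed(["abc", "hello"])
```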