Version: 2.27

Chroma

haystack_integrations.components.retrievers.chroma.retriever

ChromaQueryTextRetriever

A component for retrieving documents from a Chroma database using the query API.

Example usage:

python

from haystack import Pipeline
from haystack.components.converters import TextFileToDocument
from haystack.components.writers import DocumentWriter

from haystack_integrations.document_stores.chroma import ChromaDocumentStore
from haystack_integrations.components.retrievers.chroma import ChromaQueryTextRetriever

file_paths = ...

# Chroma is used in-memory so we use the same instances in the two pipelines below
document_store = ChromaDocumentStore()

indexing = Pipeline()
indexing.add_component("converter", TextFileToDocument())
indexing.add_component("writer", DocumentWriter(document_store))
indexing.connect("converter", "writer")
indexing.run({"converter": {"sources": file_paths}})

querying = Pipeline()
querying.add_component("retriever", ChromaQueryTextRetriever(document_store))
results = querying.run({"retriever": {"query": "Variable declarations", "top_k": 3}})

for d in results["retriever"]["documents"]:
    print(d.meta, d.score)

init

python

__init__(
    document_store: ChromaDocumentStore,
    filters: dict[str, Any] | None = None,
    top_k: int = 10,
    filter_policy: str | FilterPolicy = FilterPolicy.REPLACE,
)

Parameters:

document_store (ChromaDocumentStore) – an instance of ChromaDocumentStore.
filters (dict[str, Any] | None) – filters to narrow down the search space.
top_k (int) – the maximum number of documents to retrieve.
filter_policy (str | FilterPolicy) – Policy to determine how filters are applied.

run

python

run(
    query: str, filters: dict[str, Any] | None = None, top_k: int | None = None
) -> dict[str, Any]

Run the retriever on the given input data.

Parameters:

query (str) – The input data for the retriever. In this case, a plain-text query.
filters (dict[str, Any] | None) – Filters applied to the retrieved Documents. The way runtime filters are applied depends on the filter_policy chosen at retriever initialization. See init method docstring for more details.
top_k (int | None) – The maximum number of documents to retrieve. If not specified, the default value from the constructor is used.

Returns:

dict[str, Any] – A dictionary with the following keys:
documents: List of documents returned by the search engine.

Raises:

ValueError – If the specified document store is not found or is not a MemoryDocumentStore instance.

run_async

python

run_async(
    query: str, filters: dict[str, Any] | None = None, top_k: int | None = None
) -> dict[str, Any]

Asynchronously run the retriever on the given input data.

Asynchronous methods are only supported for HTTP connections.

Parameters:

query (str) – The input data for the retriever. In this case, a plain-text query.
filters (dict[str, Any] | None) – Filters applied to the retrieved Documents. The way runtime filters are applied depends on the filter_policy chosen at retriever initialization. See init method docstring for more details.
top_k (int | None) – The maximum number of documents to retrieve. If not specified, the default value from the constructor is used.

Returns:

dict[str, Any] – A dictionary with the following keys:
documents: List of documents returned by the search engine.

Raises:

ValueError – If the specified document store is not found or is not a MemoryDocumentStore instance.

from_dict

python

from_dict(data: dict[str, Any]) -> ChromaQueryTextRetriever

Deserializes the component from a dictionary.

Parameters:

data (dict[str, Any]) – Dictionary to deserialize from.

Returns:

ChromaQueryTextRetriever – Deserialized component.

to_dict

python

to_dict() -> dict[str, Any]

Serializes the component to a dictionary.

Returns:

dict[str, Any] – Dictionary with serialized data.

ChromaEmbeddingRetriever

A component for retrieving documents from a Chroma database using embeddings.

init

python

__init__(
    document_store: ChromaDocumentStore,
    filters: dict[str, Any] | None = None,
    top_k: int = 10,
    filter_policy: str | FilterPolicy = FilterPolicy.REPLACE,
)

Parameters:

document_store (ChromaDocumentStore) – an instance of ChromaDocumentStore.
filters (dict[str, Any] | None) – filters to narrow down the search space.
top_k (int) – the maximum number of documents to retrieve.
filter_policy (str | FilterPolicy) – Policy to determine how filters are applied.

run

python

run(
    query_embedding: list[float],
    filters: dict[str, Any] | None = None,
    top_k: int | None = None,
) -> dict[str, Any]

Run the retriever on the given input data.

Parameters:

query_embedding (list[float]) – the query embeddings.
filters (dict[str, Any] | None) – Filters applied to the retrieved Documents. The way runtime filters are applied depends on the filter_policy chosen at retriever initialization. See init method docstring for more details.
top_k (int | None) – the maximum number of documents to retrieve. If not specified, the default value from the constructor is used.

Returns:

dict[str, Any] – a dictionary with the following keys:
documents: List of documents returned by the search engine.

run_async

python

run_async(
    query_embedding: list[float],
    filters: dict[str, Any] | None = None,
    top_k: int | None = None,
) -> dict[str, Any]

Asynchronously run the retriever on the given input data.

Asynchronous methods are only supported for HTTP connections.

Parameters:

query_embedding (list[float]) – the query embeddings.
filters (dict[str, Any] | None) – Filters applied to the retrieved Documents. The way runtime filters are applied depends on the filter_policy chosen at retriever initialization. See init method docstring for more details.
top_k (int | None) – the maximum number of documents to retrieve. If not specified, the default value from the constructor is used.

Returns:

dict[str, Any] – a dictionary with the following keys:
documents: List of documents returned by the search engine.

from_dict

python

from_dict(data: dict[str, Any]) -> ChromaEmbeddingRetriever

Deserializes the component from a dictionary.

Parameters:

data (dict[str, Any]) – Dictionary to deserialize from.

Returns:

ChromaEmbeddingRetriever – Deserialized component.

to_dict

python

to_dict() -> dict[str, Any]

Serializes the component to a dictionary.

Returns:

dict[str, Any] – Dictionary with serialized data.

haystack_integrations.document_stores.chroma.document_store

ChromaDocumentStore

A document store using Chroma as the backend.

We use the collection.get API to implement the document store protocol, the collection.search API will be used in the retriever instead.

init

python

__init__(
    collection_name: str = "documents",
    embedding_function: str = "default",
    persist_path: str | None = None,
    host: str | None = None,
    port: int | None = None,
    distance_function: Literal["l2", "cosine", "ip"] = "l2",
    metadata: dict | None = None,
    client_settings: dict[str, Any] | None = None,
    **embedding_function_params: Any
)

Creates a new ChromaDocumentStore instance. It is meant to be connected to a Chroma collection.

Note: for the component to be part of a serializable pipeline, the init parameters must be serializable, reason why we use a registry to configure the embedding function passing a string.

Parameters:

collection_name (str) – the name of the collection to use in the database.
embedding_function (str) – the name of the embedding function to use to embed the query
persist_path (str | None) – Path for local persistent storage. Cannot be used in combination with host and port. If none of persist_path, host, and port is specified, the database will be in-memory.
host (str | None) – The host address for the remote Chroma HTTP client connection. Cannot be used with persist_path.
port (int | None) – The port number for the remote Chroma HTTP client connection. Cannot be used with persist_path.
distance_function (Literal['l2', 'cosine', 'ip']) – The distance metric for the embedding space.
"l2" computes the Euclidean (straight-line) distance between vectors, where smaller scores indicate more similarity.
"cosine" computes the cosine similarity between vectors, with higher scores indicating greater similarity.
"ip" stands for inner product, where higher scores indicate greater similarity between vectors. Note: distance_function can only be set during the creation of a collection. To change the distance metric of an existing collection, consider cloning the collection.
metadata (dict | None) – a dictionary of chromadb collection parameters passed directly to chromadb's client method create_collection. If it contains the key "hnsw:space", the value will take precedence over the distance_function parameter above.
client_settings (dict[str, Any] | None) – a dictionary of Chroma Settings configuration options passed to chromadb.config.Settings. These settings configure the underlying Chroma client behavior. For available options, see Chroma's config.py. Note: specifying these settings may interfere with standard client initialization parameters. This option is intended for advanced customization.
embedding_function_params (Any) – additional parameters to pass to the embedding function.

count_documents

python

count_documents() -> int

Returns how many documents are present in the document store.

Returns:

int – how many documents are present in the document store.

count_documents_async

python

count_documents_async() -> int

Asynchronously returns how many documents are present in the document store.

Asynchronous methods are only supported for HTTP connections.

Returns:

int – how many documents are present in the document store.

filter_documents

python

filter_documents(filters: dict[str, Any] | None = None) -> list[Document]

Returns the documents that match the filters provided.

For a detailed specification of the filters, refer to the documentation.

Parameters:

filters (dict[str, Any] | None) – the filters to apply to the document list.

Returns:

list[Document] – a list of Documents that match the given filters.

filter_documents_async

python

filter_documents_async(filters: dict[str, Any] | None = None) -> list[Document]

Asynchronously returns the documents that match the filters provided.

Asynchronous methods are only supported for HTTP connections.

For a detailed specification of the filters, refer to the documentation.

Parameters:

filters (dict[str, Any] | None) – the filters to apply to the document list.

Returns:

list[Document] – a list of Documents that match the given filters.

write_documents

python

write_documents(
    documents: list[Document], policy: DuplicatePolicy = DuplicatePolicy.FAIL
) -> int

Writes (or overwrites) documents into the store.

Parameters:

documents (list[Document]) – A list of documents to write into the document store.
policy (DuplicatePolicy) – Not supported at the moment.

Returns:

int – The number of documents written

Raises:

ValueError – When input is not valid.

write_documents_async

python

write_documents_async(
    documents: list[Document], policy: DuplicatePolicy = DuplicatePolicy.FAIL
) -> int

Asynchronously writes (or overwrites) documents into the store.

Asynchronous methods are only supported for HTTP connections.

Parameters:

documents (list[Document]) – A list of documents to write into the document store.
policy (DuplicatePolicy) – Not supported at the moment.

Returns:

int – The number of documents written

Raises:

ValueError – When input is not valid.

delete_documents

python

delete_documents(document_ids: list[str]) -> None

Deletes all documents with a matching document_ids from the document store.

Parameters:

document_ids (list[str]) – the document ids to delete

delete_documents_async

python

delete_documents_async(document_ids: list[str]) -> None

Asynchronously deletes all documents with a matching document_ids from the document store.

Asynchronous methods are only supported for HTTP connections.

Parameters:

document_ids (list[str]) – the document ids to delete

delete_by_filter

python

delete_by_filter(filters: dict[str, Any]) -> int

Deletes all documents that match the provided filters.

Parameters:

filters (dict[str, Any]) – The filters to apply to select documents for deletion. For filter syntax, see Haystack metadata filtering

Returns:

int – The number of documents deleted.

delete_by_filter_async

python

delete_by_filter_async(filters: dict[str, Any]) -> int

Asynchronously deletes all documents that match the provided filters.

Asynchronous methods are only supported for HTTP connections.

Parameters:

filters (dict[str, Any]) – The filters to apply to select documents for deletion. For filter syntax, see Haystack metadata filtering

Returns:

int – The number of documents deleted.

update_by_filter

python

update_by_filter(filters: dict[str, Any], meta: dict[str, Any]) -> int

Updates the metadata of all documents that match the provided filters.

Note: This operation is not atomic. Documents matching the filter are fetched first, then updated. If documents are modified between the fetch and update operations, those changes may be lost.

Parameters:

filters (dict[str, Any]) – The filters to apply to select documents for updating. For filter syntax, see Haystack metadata filtering
meta (dict[str, Any]) – The metadata fields to update. This will be merged with existing metadata.

Returns:

int – The number of documents updated.

update_by_filter_async

python

update_by_filter_async(filters: dict[str, Any], meta: dict[str, Any]) -> int

Asynchronously updates the metadata of all documents that match the provided filters.

Asynchronous methods are only supported for HTTP connections.

Note: This operation is not atomic. Documents matching the filter are fetched first, then updated. If documents are modified between the fetch and update operations, those changes may be lost.

Parameters:

filters (dict[str, Any]) – The filters to apply to select documents for updating. For filter syntax, see Haystack metadata filtering
meta (dict[str, Any]) – The metadata fields to update. This will be merged with existing metadata.

Returns:

int – The number of documents updated.

delete_all_documents

python

delete_all_documents(*, recreate_index: bool = False) -> None

Deletes all documents in the document store.

A fast way to clear all documents from the document store while preserving any collection settings and mappings.

Parameters:

recreate_index (bool) – Whether to recreate the index after deleting all documents.

delete_all_documents_async

python

delete_all_documents_async(*, recreate_index: bool = False) -> None

Asynchronously deletes all documents in the document store.

A fast way to clear all documents from the document store while preserving any collection settings and mappings.

Parameters:

recreate_index (bool) – Whether to recreate the index after deleting all documents.

search

python

search(
    queries: list[str], top_k: int, filters: dict[str, Any] | None = None
) -> list[list[Document]]

Search the documents in the store using the provided text queries.

Parameters:

queries (list[str]) – the list of queries to search for.
top_k (int) – top_k documents to return for each query.
filters (dict[str, Any] | None) – a dictionary of filters to apply to the search. Accepts filters in haystack format.

Returns:

list[list[Document]] – matching documents for each query.

search_async

python

search_async(
    queries: list[str], top_k: int, filters: dict[str, Any] | None = None
) -> list[list[Document]]

Asynchronously search the documents in the store using the provided text queries.

Asynchronous methods are only supported for HTTP connections.

Parameters:

queries (list[str]) – the list of queries to search for.
top_k (int) – top_k documents to return for each query.
filters (dict[str, Any] | None) – a dictionary of filters to apply to the search. Accepts filters in haystack format.

Returns:

list[list[Document]] – matching documents for each query.

search_embeddings

python

search_embeddings(
    query_embeddings: list[list[float]],
    top_k: int,
    filters: dict[str, Any] | None = None,
) -> list[list[Document]]

Perform vector search on the stored document, pass the embeddings of the queries instead of their text.

Parameters:

query_embeddings (list[list[float]]) – a list of embeddings to use as queries.
top_k (int) – the maximum number of documents to retrieve.
filters (dict[str, Any] | None) – a dictionary of filters to apply to the search. Accepts filters in haystack format.

Returns:

list[list[Document]] – a list of lists of documents that match the given filters.

search_embeddings_async

python

search_embeddings_async(
    query_embeddings: list[list[float]],
    top_k: int,
    filters: dict[str, Any] | None = None,
) -> list[list[Document]]

Asynchronously perform vector search on the stored document, pass the embeddings of the queries instead of their text.

Asynchronous methods are only supported for HTTP connections.

Parameters:

query_embeddings (list[list[float]]) – a list of embeddings to use as queries.
top_k (int) – the maximum number of documents to retrieve.
filters (dict[str, Any] | None) – a dictionary of filters to apply to the search. Accepts filters in haystack format.

Returns:

list[list[Document]] – a list of lists of documents that match the given filters.

count_documents_by_filter

python

count_documents_by_filter(filters: dict[str, Any]) -> int

Returns the number of documents that match the provided filters.

Parameters:

filters (dict[str, Any]) – The filters to apply to count documents. For filter syntax, see Haystack metadata filtering

Returns:

int – The number of documents that match the filters.

count_documents_by_filter_async

python

count_documents_by_filter_async(filters: dict[str, Any]) -> int

Asynchronously returns the number of documents that match the provided filters.

Asynchronous methods are only supported for HTTP connections.

Parameters:

filters (dict[str, Any]) – The filters to apply to count documents. For filter syntax, see Haystack metadata filtering

Returns:

int – The number of documents that match the filters.

count_unique_metadata_by_filter

python

count_unique_metadata_by_filter(
    filters: dict[str, Any], metadata_fields: list[str]
) -> dict[str, int]

Returns the number of unique values for each specified metadata field of the documents that match the provided filters.

Parameters:

filters (dict[str, Any]) – The filters to apply to count documents. For filter syntax, see Haystack metadata filtering
metadata_fields (list[str]) – List of field names to calculate unique values for. Field names can include or omit the "meta." prefix.

Returns:

dict[str, int] – A dictionary mapping each metadata field name to the count of its unique values among the filtered documents.

count_unique_metadata_by_filter_async

python

count_unique_metadata_by_filter_async(
    filters: dict[str, Any], metadata_fields: list[str]
) -> dict[str, int]

Asynchronously returns the number of unique values for each specified metadata field of the documents that match the provided filters.

Asynchronous methods are only supported for HTTP connections.

Parameters:

filters (dict[str, Any]) – The filters to apply to count documents. For filter syntax, see Haystack metadata filtering
metadata_fields (list[str]) – List of field names to calculate unique values for. Field names can include or omit the "meta." prefix.

Returns:

dict[str, int] – A dictionary mapping each metadata field name to the count of its unique values among the filtered documents.

get_metadata_fields_info

python

get_metadata_fields_info() -> dict[str, dict[str, str]]

Returns information about the metadata fields in the collection.

Since ChromaDB doesn't maintain a schema, this method samples documents to infer field types.

If we populated the collection with documents like:

python

Document(content="Doc 1", meta={"category": "A", "status": "active", "priority": 1})
Document(content="Doc 2", meta={"category": "B", "status": "inactive"})

This method would return:

python

{
    'category': {'type': 'keyword'},
    'status': {'type': 'keyword'},
    'priority': {'type': 'long'},
}

Returns:

dict[str, dict[str, str]] – Dictionary mapping field names to their type information.

get_metadata_fields_info_async

python

get_metadata_fields_info_async() -> dict[str, dict[str, str]]

Asynchronously returns information about the metadata fields in the collection.

Asynchronous methods are only supported for HTTP connections.

Since ChromaDB doesn't maintain a schema, this method samples documents to infer field types.

If we populated the collection with documents like:

python

Document(content="Doc 1", meta={"category": "A", "status": "active", "priority": 1})
Document(content="Doc 2", meta={"category": "B", "status": "inactive"})

This method would return:

python

{
    'category': {'type': 'keyword'},
    'status': {'type': 'keyword'},
    'priority': {'type': 'long'},
}

Returns:

dict[str, dict[str, str]] – Dictionary mapping field names to their type information.

get_metadata_field_min_max

python

get_metadata_field_min_max(metadata_field: str) -> dict[str, Any]

Returns the minimum and maximum values for the given metadata field.

Parameters:

metadata_field (str) – The metadata field to get the minimum and maximum values for. Can include or omit the "meta." prefix.

Returns:

dict[str, Any] – A dictionary with the keys "min" and "max", where each value is the minimum or maximum value of the metadata field across all documents. Returns:

python

  {"min": None, "max": None}

if field doesn't exist or has no values.

get_metadata_field_min_max_async

python

get_metadata_field_min_max_async(metadata_field: str) -> dict[str, Any]

Asynchronously returns the minimum and maximum values for the given metadata field.

Asynchronous methods are only supported for HTTP connections.

Parameters:

metadata_field (str) – The metadata field to get the minimum and maximum values for. Can include or omit the "meta." prefix.

Returns:

dict[str, Any] – A dictionary with the keys "min" and "max", where each value is the minimum or maximum value of the metadata field across all documents. Returns:

python

  {"min": None, "max": None}

if field doesn't exist or has no values.

get_metadata_field_unique_values

python

get_metadata_field_unique_values(
    metadata_field: str,
    search_term: str | None = None,
    from_: int = 0,
    size: int = 10,
) -> tuple[list[str], int]

Returns unique values for a metadata field, optionally filtered by a search term in the content field, with pagination support.

Parameters:

metadata_field (str) – The metadata field to get unique values for. Can include or omit the "meta." prefix.
search_term (str | None) – Optional search term to filter documents by matching in the content field.
from_ (int) – The offset to start returning values from (for pagination).
size (int) – The maximum number of unique values to return.

Returns:

tuple[list[str], int] – A tuple containing list of unique values and total count of unique values.

get_metadata_field_unique_values_async

python

get_metadata_field_unique_values_async(
    metadata_field: str,
    search_term: str | None = None,
    from_: int = 0,
    size: int = 10,
) -> tuple[list[str], int]

Asynchronously returns unique values for a metadata field, optionally filtered by a search term in the content field, with pagination support.

Asynchronous methods are only supported for HTTP connections.

Parameters:

metadata_field (str) – The metadata field to get unique values for. Can include or omit the "meta." prefix.
search_term (str | None) – Optional search term to filter documents by matching in the content field.
from_ (int) – The offset to start returning values from (for pagination).
size (int) – The maximum number of unique values to return.

Returns:

tuple[list[str], int] – A tuple containing list of unique values and total count of unique values.

from_dict

python

from_dict(data: dict[str, Any]) -> ChromaDocumentStore

Deserializes the component from a dictionary.

Parameters:

data (dict[str, Any]) – Dictionary to deserialize from.

Returns:

ChromaDocumentStore – Deserialized component.

to_dict

python

to_dict() -> dict[str, Any]

Serializes the component to a dictionary.

Returns:

dict[str, Any] – Dictionary with serialized data.

haystack_integrations.document_stores.chroma.errors

ChromaDocumentStoreError

Bases: DocumentStoreError

Parent class for all ChromaDocumentStore exceptions.

ChromaDocumentStoreFilterError

Bases: FilterError, ValueError

Raised when a filter is not valid for a ChromaDocumentStore.

ChromaDocumentStoreConfigError

Bases: ChromaDocumentStoreError

Raised when a configuration is not valid for a ChromaDocumentStore.

haystack_integrations.document_stores.chroma.utils

get_embedding_function

python

get_embedding_function(function_name: str, **kwargs: Any) -> EmbeddingFunction

Load an embedding function by name.

Parameters:

function_name (str) – the name of the embedding function.
kwargs (Any) – additional arguments to pass to the embedding function.

Returns:

EmbeddingFunction – the loaded embedding function.

Raises:

ChromaDocumentStoreConfigError – if the function name is invalid.

haystack_integrations.components.retrievers.chroma.retriever​

ChromaQueryTextRetriever​

init​

run​

run_async​

from_dict​

to_dict​

ChromaEmbeddingRetriever​

init​

run​

run_async​

from_dict​

to_dict​

haystack_integrations.document_stores.chroma.document_store​

ChromaDocumentStore​

init​

count_documents​

count_documents_async​

filter_documents​

filter_documents_async​

write_documents​

write_documents_async​

delete_documents​

delete_documents_async​

delete_by_filter​

delete_by_filter_async​

update_by_filter​

update_by_filter_async​

delete_all_documents​

delete_all_documents_async​

search​

search_async​

search_embeddings​

search_embeddings_async​

count_documents_by_filter​

count_documents_by_filter_async​

count_unique_metadata_by_filter​

count_unique_metadata_by_filter_async​

get_metadata_fields_info​

get_metadata_fields_info_async​

get_metadata_field_min_max​

get_metadata_field_min_max_async​

get_metadata_field_unique_values​

get_metadata_field_unique_values_async​

from_dict​

to_dict​

haystack_integrations.document_stores.chroma.errors​

ChromaDocumentStoreError​

ChromaDocumentStoreFilterError​

ChromaDocumentStoreConfigError​

haystack_integrations.document_stores.chroma.utils​

get_embedding_function​

haystack_integrations.components.retrievers.chroma.retriever

ChromaQueryTextRetriever

init

run

run_async

from_dict

to_dict

ChromaEmbeddingRetriever

init

run

run_async

from_dict

to_dict

haystack_integrations.document_stores.chroma.document_store

ChromaDocumentStore

init

count_documents

count_documents_async

filter_documents

filter_documents_async

write_documents

write_documents_async

delete_documents

delete_documents_async

delete_by_filter

delete_by_filter_async

update_by_filter

update_by_filter_async

delete_all_documents

delete_all_documents_async

search

search_async

search_embeddings

search_embeddings_async

count_documents_by_filter

count_documents_by_filter_async

count_unique_metadata_by_filter

count_unique_metadata_by_filter_async

get_metadata_fields_info

get_metadata_fields_info_async

get_metadata_field_min_max

get_metadata_field_min_max_async

get_metadata_field_unique_values

get_metadata_field_unique_values_async

from_dict

to_dict

haystack_integrations.document_stores.chroma.errors

ChromaDocumentStoreError

ChromaDocumentStoreFilterError

ChromaDocumentStoreConfigError

haystack_integrations.document_stores.chroma.utils

get_embedding_function