Module haystack_integrations.components.retrievers.chroma.retriever

ChromaQueryTextRetriever

A component for retrieving documents from a Chroma database using the query API.

Example usage:

from haystack import Pipeline
from haystack.components.converters import TextFileToDocument
from haystack.components.writers import DocumentWriter

from haystack_integrations.document_stores.chroma import ChromaDocumentStore
from haystack_integrations.components.retrievers.chroma import ChromaQueryTextRetriever

file_paths = ...

# Chroma is used in-memory so we use the same instances in the two pipelines below
document_store = ChromaDocumentStore()

indexing = Pipeline()
indexing.add_component("converter", TextFileToDocument())
indexing.add_component("writer", DocumentWriter(document_store))
indexing.connect("converter", "writer")
indexing.run({"converter": {"sources": file_paths}})

querying = Pipeline()
querying.add_component("retriever", ChromaQueryTextRetriever(document_store))
results = querying.run({"retriever": {"query": "Variable declarations", "top_k": 3}})

for d in results["retriever"]["documents"]:
    print(d.meta, d.score)

ChromaQueryTextRetriever.__init__

def __init__(document_store: ChromaDocumentStore,
             filters: Optional[Dict[str, Any]] = None,
             top_k: int = 10,
             filter_policy: Union[str, FilterPolicy] = FilterPolicy.REPLACE)

Arguments:

document_store: an instance of ChromaDocumentStore.
filters: filters to narrow down the search space.
top_k: the maximum number of documents to retrieve.
filter_policy: Policy to determine how filters are applied.

ChromaQueryTextRetriever.run

@component.output_types(documents=List[Document])
def run(query: str,
        filters: Optional[Dict[str, Any]] = None,
        top_k: Optional[int] = None)

Run the retriever on the given input data.

Arguments:

query: The input data for the retriever. In this case, a plain-text query.
filters: Filters applied to the retrieved Documents. The way runtime filters are applied depends on the filter_policy chosen at retriever initialization. See the __init__ method docstring for more details.
top_k: The maximum number of documents to retrieve. If not specified, the default value from the constructor is used.

Raises:

ValueError: If the specified document store is not found or is not a ChromaDocumentStore instance.

Returns:

A dictionary with the following keys:

documents: List of documents returned by the search engine.
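
The interplay between constructor defaults and the arguments passed to run can be sketched in plain Python. This is a simplified illustration and not the library's implementation: Policy is a hypothetical stand-in for FilterPolicy, the MERGE member and the AND-based merge are assumptions made for the example, and resolve_run_args is a made-up helper name.

```python
from enum import Enum
from typing import Any, Dict, Optional, Tuple


class Policy(Enum):
    """Hypothetical stand-in for FilterPolicy (names assumed for this sketch)."""

    REPLACE = "replace"
    MERGE = "merge"


def resolve_run_args(
    init_filters: Optional[Dict[str, Any]],
    init_top_k: int,
    policy: Policy,
    filters: Optional[Dict[str, Any]] = None,
    top_k: Optional[int] = None,
) -> Tuple[Optional[Dict[str, Any]], int]:
    """Combine constructor defaults with the values passed at run time."""
    # A runtime top_k wins; otherwise fall back to the constructor default.
    effective_top_k = init_top_k if top_k is None else top_k

    if filters is None:
        # No runtime filters: keep whatever was configured at init.
        effective_filters = init_filters
    elif policy is Policy.REPLACE or init_filters is None:
        # REPLACE: runtime filters take over completely.
        effective_filters = filters
    else:
        # Simplified merge (assumption): a document must satisfy both filters.
        effective_filters = {"operator": "AND", "conditions": [init_filters, filters]}
    return effective_filters, effective_top_k


init_filters = {"field": "meta.lang", "operator": "==", "value": "en"}
runtime_filters = {"field": "meta.page", "operator": ">", "value": 10}
print(resolve_run_args(init_filters, 10, Policy.REPLACE, runtime_filters, top_k=3))
```

With the default REPLACE policy, passing filters to run discards the constructor filters for that call, so a caller who wants both sets applied has to say so explicitly.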

ChromaQueryTextRetriever.from_dict

@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "ChromaQueryTextRetriever"

Deserializes the component from a dictionary.

Arguments:

data: Dictionary to deserialize from.

Returns:

Deserialized component.

ChromaQueryTextRetriever.to_dict

def to_dict() -> Dict[str, Any]

Serializes the component to a dictionary.

Returns:

Dictionary with serialized data.
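
As a rough illustration of what a to_dict / from_dict round trip involves, the sketch below uses a hypothetical ToyRetriever class rather than the real component. The "type" / "init_parameters" layout is an assumption about the general shape of the serialized dictionary; the actual keys produced by ChromaQueryTextRetriever may differ.

```python
import json
from typing import Any, Dict, Optional


class ToyRetriever:
    """Hypothetical component used only to illustrate the round trip."""

    def __init__(self, collection_name: str, top_k: int = 10,
                 filters: Optional[Dict[str, Any]] = None):
        self.collection_name = collection_name
        self.top_k = top_k
        self.filters = filters

    def to_dict(self) -> Dict[str, Any]:
        # Only plain, JSON-friendly values are stored, so the result can be
        # written to disk as part of a pipeline definition.
        return {
            "type": f"{type(self).__module__}.{type(self).__name__}",
            "init_parameters": {
                "collection_name": self.collection_name,
                "top_k": self.top_k,
                "filters": self.filters,
            },
        }

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "ToyRetriever":
        # Rebuild the instance from the stored constructor arguments.
        return cls(**data["init_parameters"])


original = ToyRetriever("documents", top_k=3)
payload = json.dumps(original.to_dict())          # serialize to text
restored = ToyRetriever.from_dict(json.loads(payload))
print(restored.to_dict() == original.to_dict())
```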

ChromaEmbeddingRetriever

A component for retrieving documents from a Chroma database using embeddings.

ChromaEmbeddingRetriever.run

@component.output_types(documents=List[Document])
def run(query_embedding: List[float],
        filters: Optional[Dict[str, Any]] = None,
        top_k: Optional[int] = None)

Run the retriever on the given input data.

Arguments:

query_embedding: the embedding of the query.
filters: Filters applied to the retrieved Documents. The way runtime filters are applied depends on the filter_policy chosen at retriever initialization. See the __init__ method docstring for more details.
top_k: the maximum number of documents to retrieve. If not specified, the default value from the constructor is used.

Returns:

a dictionary with the following keys:

documents: List of documents returned by the search engine.

Module haystack_integrations.document_stores.chroma.document_store

ChromaDocumentStore

A document store using Chroma as the backend.

We use the collection.get API to implement the document store protocol; the collection.query API is used by the retrievers instead.

ChromaDocumentStore.__init__

def __init__(collection_name: str = "documents",
             embedding_function: str = "default",
             persist_path: Optional[str] = None,
             host: Optional[str] = None,
             port: Optional[int] = None,
             distance_function: Literal["l2", "cosine", "ip"] = "l2",
             metadata: Optional[dict] = None,
             **embedding_function_params)

Creates a new ChromaDocumentStore instance.

It is meant to be connected to a Chroma collection.

Note: for the component to be part of a serializable pipeline, the init parameters must be serializable. This is why the embedding function is configured through a registry, by passing its name as a string.

Arguments:

collection_name: the name of the collection to use in the database.
embedding_function: the name of the embedding function to use to embed the query.
persist_path: Path for local persistent storage. Cannot be used in combination with host and port. If none of persist_path, host, and port is specified, the database will be in-memory.
host: The host address for the remote Chroma HTTP client connection. Cannot be used with persist_path.
port: The port number for the remote Chroma HTTP client connection. Cannot be used with persist_path.
distance_function: The distance metric for the embedding space.
"l2" computes the Euclidean (straight-line) distance between vectors, where smaller scores indicate more similarity.
"cosine" computes the cosine similarity between vectors, with higher scores indicating greater similarity.
"ip" stands for inner product, where higher scores indicate greater similarity between vectors.
Note: distance_function can only be set during the creation of a collection. To change the distance metric of an existing collection, consider cloning the collection.
metadata: a dictionary of chromadb collection parameters passed directly to chromadb's client method create_collection. If it contains the key "hnsw:space", the value will take precedence over the distance_function parameter above.
embedding_function_params: additional parameters to pass to the embedding function.
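
To make the three distance_function options and the "hnsw:space" precedence rule more concrete, here is a small pure-Python sketch. It computes the underlying quantities as the argument descriptions state them; the scores Chroma actually reports may be scaled or transformed differently. The helper names (euclidean, inner_product, cosine_similarity, effective_space) are invented for this example.

```python
import math
from typing import Dict, List, Optional


def euclidean(a: List[float], b: List[float]) -> float:
    """Straight-line distance: smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def inner_product(a: List[float], b: List[float]) -> float:
    """Dot product: larger means more similar."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Dot product of the normalized vectors: larger means more similar."""
    norms = math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    return inner_product(a, b) / norms


def effective_space(distance_function: str,
                    metadata: Optional[Dict[str, str]] = None) -> str:
    """An "hnsw:space" entry in metadata overrides distance_function."""
    return (metadata or {}).get("hnsw:space", distance_function)


a, b = [3.0, 4.0], [0.0, 0.0]
print(euclidean(a, b))                                  # 5.0
print(inner_product(a, a))                              # 25.0
print(effective_space("l2", {"hnsw:space": "cosine"}))  # cosine
```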

ChromaDocumentStore.count_documents

def count_documents() -> int

Returns how many documents are present in the document store.

Returns:

how many documents are present in the document store.

ChromaDocumentStore.filter_documents

def filter_documents(
        filters: Optional[Dict[str, Any]] = None) -> List[Document]

Returns the documents that match the filters provided.

Filters can be provided as a dictionary supporting filtering by ids, metadata, and document content. Metadata filters should use the "meta.<metadata_key>" syntax, while content-based filters use the "content" field directly. Content filters support the contains and not contains operators, while id filters only support the == operator.

Due to Chroma's distinction between metadata filters and document filters, filters with "field": "content" (i.e., document content filters) and metadata fields must be supplied separately. For details on Chroma filters, see the Chroma documentation.

Example:

filter_1 = {
    "operator": "AND",
    "conditions": [
        {"field": "meta.name", "operator": "==", "value": "name_0"},
        {"field": "meta.number", "operator": "not in", "value": [2, 9]},
    ],
}
filter_2 = {
    "operator": "AND",
    "conditions": [
        {"field": "content", "operator": "contains", "value": "FOO"},
        {"field": "content", "operator": "not contains", "value": "BAR"},
    ],
}
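
The rule that content filters and metadata filters must be supplied separately can be checked before a filter is sent to the store. The sketch below is one possible pre-flight check, not something the integration provides; collect_fields and mixes_content_and_metadata are hypothetical helper names.

```python
from typing import Any, Dict, Set


def collect_fields(filters: Dict[str, Any]) -> Set[str]:
    """Return every "field" referenced anywhere in a (possibly nested) filter."""
    if "conditions" in filters:
        fields: Set[str] = set()
        for condition in filters["conditions"]:
            fields |= collect_fields(condition)
        return fields
    return {filters["field"]}


def mixes_content_and_metadata(filters: Dict[str, Any]) -> bool:
    """True if one filter touches both document content and metadata."""
    fields = collect_fields(filters)
    has_content = "content" in fields
    has_metadata = any(name.startswith("meta.") for name in fields)
    return has_content and has_metadata


metadata_only = {
    "operator": "AND",
    "conditions": [
        {"field": "meta.name", "operator": "==", "value": "name_0"},
        {"field": "meta.number", "operator": "not in", "value": [2, 9]},
    ],
}
mixed = {
    "operator": "AND",
    "conditions": [
        {"field": "meta.name", "operator": "==", "value": "name_0"},
        {"field": "content", "operator": "contains", "value": "FOO"},
    ],
}
print(mixes_content_and_metadata(metadata_only))  # False
print(mixes_content_and_metadata(mixed))          # True
```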

If you need to apply a logical operator (e.g., "AND", "OR") to multiple conditions at the same level, you can provide a list of dictionaries under "conditions". Each entry can itself be a nested logical group, like in the example below:

filters = {
    "operator": "OR",
    "conditions": [
        {"field": "meta.author", "operator": "==", "value": "author_1"},
        {
            "operator": "AND",
            "conditions": [
                {"field": "meta.tag", "operator": "==", "value": "tag_1"},
                {"field": "meta.page", "operator": ">", "value": 100},
            ],
        },
        {
            "operator": "AND",
            "conditions": [
                {"field": "meta.tag", "operator": "==", "value": "tag_2"},
                {"field": "meta.page", "operator": ">", "value": 200},
            ],
        },
    ],
}

Arguments:

filters: the filters to apply to the document list.

Returns:

a list of Documents that match the given filters.
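
To show how a nested filter of this shape is meant to be read, the following sketch evaluates one against plain dictionaries in memory. It only illustrates the semantics: the real store translates filters into Chroma queries rather than evaluating them in Python. The matches and lookup helpers are hypothetical, and each document is assumed to carry every field the filter refers to.

```python
import operator
from typing import Any, Callable, Dict

# Comparison operators appearing in the filter examples, plus close relatives.
COMPARATORS: Dict[str, Callable[[Any, Any], bool]] = {
    "==": operator.eq,
    "!=": operator.ne,
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "in": lambda value, options: value in options,
    "not in": lambda value, options: value not in options,
    "contains": lambda text, needle: needle in text,
    "not contains": lambda text, needle: needle not in text,
}


def lookup(doc: Dict[str, Any], field: str) -> Any:
    """Resolve "meta.<key>" against doc["meta"], anything else against doc."""
    if field.startswith("meta."):
        return doc["meta"][field[len("meta."):]]
    return doc[field]


def matches(doc: Dict[str, Any], filters: Dict[str, Any]) -> bool:
    """Recursively evaluate a filter dictionary against one document."""
    if "conditions" in filters:
        results = [matches(doc, condition) for condition in filters["conditions"]]
        if filters["operator"] == "AND":
            return all(results)
        if filters["operator"] == "OR":
            return any(results)
        raise ValueError(f"Unsupported logical operator: {filters['operator']}")
    compare = COMPARATORS[filters["operator"]]
    return compare(lookup(doc, filters["field"]), filters["value"])


nested = {
    "operator": "OR",
    "conditions": [
        {"field": "meta.author", "operator": "==", "value": "author_1"},
        {
            "operator": "AND",
            "conditions": [
                {"field": "meta.tag", "operator": "==", "value": "tag_1"},
                {"field": "meta.page", "operator": ">", "value": 100},
            ],
        },
    ],
}
doc = {"content": "FOO", "meta": {"author": "author_2", "tag": "tag_1", "page": 150}}
print(matches(doc, nested))  # True: the AND branch holds
```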

ChromaDocumentStore.write_documents

def write_documents(documents: List[Document],
                    policy: DuplicatePolicy = DuplicatePolicy.FAIL) -> int

Writes (or overwrites) documents into the store.

Arguments:

documents: A list of documents to write into the document store.
policy: Not supported at the moment.

Raises:

ValueError: When input is not valid.

Returns:

The number of documents written.
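
The "writes (or overwrites)" behaviour can be pictured with a dictionary keyed by document id. This is a toy model under stated assumptions, not the store's code: documents are plain dicts with an "id" key, the return value is taken to be the size of the input list, and write_to_toy_store is a hypothetical name.

```python
from typing import Any, Dict, List


def write_to_toy_store(store: Dict[str, Dict[str, Any]],
                       documents: List[Dict[str, Any]]) -> int:
    """Insert documents by id; an existing id is silently overwritten."""
    if not isinstance(documents, list) or not all(
        isinstance(doc, dict) and "id" in doc for doc in documents
    ):
        # Mirrors the documented ValueError for invalid input.
        raise ValueError("documents must be a list of dicts with an 'id' key")
    for doc in documents:
        store[doc["id"]] = doc
    return len(documents)


store: Dict[str, Dict[str, Any]] = {}
write_to_toy_store(store, [{"id": "1", "content": "old"}, {"id": "2", "content": "b"}])
written = write_to_toy_store(store, [{"id": "1", "content": "new"}])
print(written, len(store), store["1"]["content"])  # 1 2 new
```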

ChromaDocumentStore.delete_documents

def delete_documents(document_ids: List[str]) -> None

Deletes all documents with matching document_ids from the document store.

Arguments:

document_ids: the document ids to delete

ChromaDocumentStore.search

def search(queries: List[str],
           top_k: int,
           filters: Optional[Dict[str, Any]] = None) -> List[List[Document]]

Search the documents in the store using the provided text queries.

Arguments:

queries: the list of queries to search for.
top_k: the maximum number of documents to return for each query.
filters: a dictionary of filters to apply to the search. Accepts filters in haystack format.

Returns:

matching documents for each query.

ChromaDocumentStore.search_embeddings

def search_embeddings(
        query_embeddings: List[List[float]],
        top_k: int,
        filters: Optional[Dict[str, Any]] = None) -> List[List[Document]]

Perform vector search on the stored documents, passing the embeddings of the queries instead of their text.

Arguments:

query_embeddings: a list of embeddings to use as queries.
top_k: the maximum number of documents to retrieve.
filters: a dictionary of filters to apply to the search. Accepts filters in haystack format.

Returns:

a list of lists of documents that match the given filters.
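
The List[List[Document]] return shape (one inner list per query embedding, each capped at top_k) can be illustrated with a brute-force search over a handful of vectors. This toy version ranks by Euclidean distance and returns ids only; the real method delegates to Chroma's index and returns Document objects. toy_search_embeddings is a hypothetical name.

```python
import math
from typing import Dict, List


def toy_search_embeddings(stored: Dict[str, List[float]],
                          query_embeddings: List[List[float]],
                          top_k: int) -> List[List[str]]:
    """Return, for each query embedding, the ids of its top_k nearest vectors."""
    results: List[List[str]] = []
    for query in query_embeddings:
        # Rank every stored vector by its distance to this query.
        ranked = sorted(stored, key=lambda doc_id: math.dist(stored[doc_id], query))
        results.append(ranked[:top_k])
    return results


stored = {"a": [0.0, 0.0], "b": [1.0, 0.0], "c": [5.0, 5.0]}
hits = toy_search_embeddings(stored, [[0.1, 0.0], [5.0, 4.0]], top_k=2)
print(hits)  # [['a', 'b'], ['c', 'b']]
```

The outer list always has the same length and order as query_embeddings, so results can be paired with their queries using zip.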

ChromaDocumentStore.from_dict

@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "ChromaDocumentStore"

Deserializes the component from a dictionary.

Arguments:

data: Dictionary to deserialize from.

Returns:

Deserialized component.

ChromaDocumentStore.to_dict

def to_dict() -> Dict[str, Any]

Serializes the component to a dictionary.

Returns:

Dictionary with serialized data.

Module haystack_integrations.document_stores.chroma.errors

ChromaDocumentStoreError

Parent class for all ChromaDocumentStore exceptions.

ChromaDocumentStoreFilterError

Raised when a filter is not valid for a ChromaDocumentStore.

ChromaDocumentStoreConfigError

Raised when a configuration is not valid for a ChromaDocumentStore.

Module haystack_integrations.document_stores.chroma.utils

get_embedding_function

def get_embedding_function(function_name: str, **kwargs) -> EmbeddingFunction

Load an embedding function by name.

Arguments:

function_name: the name of the embedding function.
kwargs: additional arguments to pass to the embedding function.

Raises:

ChromaDocumentStoreConfigError: if the function name is invalid.

Returns:

the loaded embedding function.
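
A name-based registry of this kind might look like the sketch below. Everything in it is hypothetical: the "length" entry, the make_length_embedder factory, the ToyConfigError class (standing in for ChromaDocumentStoreConfigError) and get_toy_embedding_function are invented for illustration and do not reflect the names the integration actually registers.

```python
from typing import Callable, Dict, List

Embedder = Callable[[List[str]], List[List[float]]]


class ToyConfigError(Exception):
    """Hypothetical stand-in for ChromaDocumentStoreConfigError."""


def make_length_embedder(scale: float = 1.0) -> Embedder:
    """Toy factory: embeds each text as a one-dimensional, scaled length."""
    def embed(texts: List[str]) -> List[List[float]]:
        return [[len(text) * scale] for text in texts]
    return embed


# Maps a serializable string name to a factory that builds the function.
REGISTRY: Dict[str, Callable[..., Embedder]] = {"length": make_length_embedder}


def get_toy_embedding_function(function_name: str, **kwargs) -> Embedder:
    """Look the factory up by name and build it with the extra parameters."""
    try:
        factory = REGISTRY[function_name]
    except KeyError:
        raise ToyConfigError(f"Unknown embedding function: {function_name!r}") from None
    return factory(**kwargs)


embed = get_toy_embedding_function("length", scale=2.0)
print(embed(["ab", "abcd"]))  # [[4.0], [8.0]]
```

Because only the string name and keyword arguments need to be stored, a component configured this way stays serializable even though the function itself is not.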
