Chroma integration for Haystack
Module haystack_integrations.components.retrievers.chroma.retriever
ChromaQueryTextRetriever
A component for retrieving documents from a Chroma database using the query
API.
Example usage:
from haystack import Pipeline
from haystack.components.converters import TextFileToDocument
from haystack.components.writers import DocumentWriter
from haystack_integrations.document_stores.chroma import ChromaDocumentStore
from haystack_integrations.components.retrievers.chroma import ChromaQueryTextRetriever
file_paths = ...
# Chroma is used in-memory so we use the same instances in the two pipelines below
document_store = ChromaDocumentStore()
indexing = Pipeline()
indexing.add_component("converter", TextFileToDocument())
indexing.add_component("writer", DocumentWriter(document_store))
indexing.connect("converter", "writer")
indexing.run({"converter": {"sources": file_paths}})
querying = Pipeline()
querying.add_component("retriever", ChromaQueryTextRetriever(document_store))
results = querying.run({"retriever": {"query": "Variable declarations", "top_k": 3}})
for d in results["retriever"]["documents"]:
    print(d.meta, d.score)
ChromaQueryTextRetriever.__init__
def __init__(document_store: ChromaDocumentStore,
             filters: Optional[Dict[str, Any]] = None,
             top_k: int = 10,
             filter_policy: Union[str, FilterPolicy] = FilterPolicy.REPLACE)
Arguments:
document_store: an instance of ChromaDocumentStore.
filters: filters to narrow down the search space.
top_k: the maximum number of documents to retrieve.
filter_policy: Policy to determine how filters are applied.
ChromaQueryTextRetriever.run
@component.output_types(documents=List[Document])
def run(query: str,
        filters: Optional[Dict[str, Any]] = None,
        top_k: Optional[int] = None)
Run the retriever on the given input data.
Arguments:
query: The input data for the retriever. In this case, a plain-text query.
filters: Filters applied to the retrieved Documents. The way runtime filters are applied depends on the filter_policy chosen at retriever initialization. See the __init__ method docstring for more details.
top_k: The maximum number of documents to retrieve. If not specified, the default value from the constructor is used.
Raises:
ValueError: If the specified document store is not found or is not a ChromaDocumentStore instance.
Returns:
A dictionary with the following keys:
documents
: List of documents returned by the search engine.
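How run() reconciles its arguments with the values given to the constructor can be sketched in plain Python. The helper below is hypothetical (resolve_run_args is not part of the integration). It assumes that the REPLACE policy lets run-time filters override the init-time filters entirely, and that a MERGE policy combines both sets under a logical AND; the FilterPolicy documentation is the authority on the exact semantics.

```python
from typing import Any, Dict, Optional, Tuple

Filters = Optional[Dict[str, Any]]


def resolve_run_args(
    init_filters: Filters,
    init_top_k: int,
    run_filters: Filters = None,
    run_top_k: Optional[int] = None,
    policy: str = "replace",
) -> Tuple[Filters, int]:
    """Hypothetical helper: pick the filters and top_k one run() call would use."""
    # top_k falls back to the constructor default when it is not given at run time.
    top_k = init_top_k if run_top_k is None else run_top_k

    if run_filters is None:
        # No run-time filters: the init-time filters apply unchanged.
        filters = init_filters
    elif policy == "merge" and init_filters is not None:
        # Assumed MERGE behaviour: a document must satisfy both filter sets.
        filters = {"operator": "AND", "conditions": [init_filters, run_filters]}
    else:
        # Assumed REPLACE behaviour: the run-time filters win.
        filters = run_filters
    return filters, top_k


init = {"field": "meta.lang", "operator": "==", "value": "en"}
runtime = {"field": "meta.year", "operator": ">", "value": 2020}

replaced = resolve_run_args(init, 10, runtime, 3)
merged = resolve_run_args(init, 10, runtime, policy="merge")
print(replaced)
print(merged)
```

With the default policy, the run-time filters and top_k=3 are used as given; with the assumed merge policy, both filters end up under one AND group and top_k falls back to 10.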
ChromaQueryTextRetriever.from_dict
@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "ChromaQueryTextRetriever"
Deserializes the component from a dictionary.
Arguments:
data
: Dictionary to deserialize from.
Returns:
Deserialized component.
ChromaQueryTextRetriever.to_dict
def to_dict() -> Dict[str, Any]
Serializes the component to a dictionary.
Returns:
Dictionary with serialized data.
ChromaEmbeddingRetriever
A component for retrieving documents from a Chroma database using embeddings.
ChromaEmbeddingRetriever.run
@component.output_types(documents=List[Document])
def run(query_embedding: List[float],
        filters: Optional[Dict[str, Any]] = None,
        top_k: Optional[int] = None)
Run the retriever on the given input data.
Arguments:
query_embedding: the query embeddings.
filters: Filters applied to the retrieved Documents. The way runtime filters are applied depends on the filter_policy chosen at retriever initialization. See the __init__ method docstring for more details.
top_k: the maximum number of documents to retrieve. If not specified, the default value from the constructor is used.
Returns:
a dictionary with the following keys:
documents
: List of documents returned by the search engine.
Module haystack_integrations.document_stores.chroma.document_store
ChromaDocumentStore
A document store using Chroma as the backend.
We use the collection.get API to implement the document store protocol; the collection.search API is used by the retrievers instead.
ChromaDocumentStore.__init__
def __init__(collection_name: str = "documents",
             embedding_function: str = "default",
             persist_path: Optional[str] = None,
             host: Optional[str] = None,
             port: Optional[int] = None,
             distance_function: Literal["l2", "cosine", "ip"] = "l2",
             metadata: Optional[dict] = None,
             **embedding_function_params)
Creates a new ChromaDocumentStore instance.
It is meant to be connected to a Chroma collection.
Note: for the component to be part of a serializable pipeline, the init parameters must be serializable. This is why the embedding function is configured through a registry and selected by passing a string.
Arguments:
collection_name: the name of the collection to use in the database.
embedding_function: the name of the embedding function to use to embed the query.
persist_path: Path for local persistent storage. Cannot be used in combination with host and port. If none of persist_path, host, and port is specified, the database will be in-memory.
host: The host address for the remote Chroma HTTP client connection. Cannot be used with persist_path.
port: The port number for the remote Chroma HTTP client connection. Cannot be used with persist_path.
distance_function: The distance metric for the embedding space. "l2" computes the Euclidean (straight-line) distance between vectors, where smaller scores indicate more similarity. "cosine" computes the cosine similarity between vectors, with higher scores indicating greater similarity. "ip" stands for inner product, where higher scores indicate greater similarity between vectors. Note: distance_function can only be set during the creation of a collection. To change the distance metric of an existing collection, consider cloning the collection.
metadata: a dictionary of chromadb collection parameters passed directly to chromadb's client method create_collection. If it contains the key "hnsw:space", the value will take precedence over the distance_function parameter above.
embedding_function_params: additional parameters to pass to the embedding function.
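To make the scoring direction of the three distance_function options concrete, here is a minimal plain-Python sketch of the formulas as they are described above. The function names are illustrative only, and the values Chroma actually reports may be scaled or transformed differently, so treat this as an aid to intuition and check the Chroma documentation for exact definitions.

```python
import math
from typing import Sequence


def inner_product(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product ("ip"): larger means more similar."""
    return sum(x * y for x, y in zip(a, b))


def l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance ("l2"): smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity ("cosine"): depends only on the angle between vectors."""
    norm_a = math.sqrt(inner_product(a, a))
    norm_b = math.sqrt(inner_product(b, b))
    return inner_product(a, b) / (norm_a * norm_b)


query = [1.0, 0.0]
same_direction = [2.0, 0.0]  # points the same way as the query, but is longer
orthogonal = [0.0, 1.0]      # unrelated direction

for name, vector in (("same_direction", same_direction), ("orthogonal", orthogonal)):
    print(name, l2(query, vector), cosine(query, vector), inner_product(query, vector))
```

Note how the same-direction vector is a perfect match under cosine even though its Euclidean distance from the query is not zero; this is one reason the choice of metric matters when embeddings are not normalized.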
ChromaDocumentStore.count_documents
def count_documents() -> int
Returns how many documents are present in the document store.
Returns:
how many documents are present in the document store.
ChromaDocumentStore.filter_documents
def filter_documents(
        filters: Optional[Dict[str, Any]] = None) -> List[Document]
Returns the documents that match the filters provided.
Filters can be provided as a dictionary supporting filtering by ids, metadata, and document content. Metadata filters should use the "meta.<metadata_key>" syntax, while content-based filters use the "content" field directly. Content filters support the contains and not contains operators, while id filters only support the == operator.
Due to Chroma's distinction between metadata filters and document filters, filters with "field": "content" (i.e., document content filters) and metadata fields must be supplied separately. For details on Chroma filters, see the Chroma documentation.
Example:
filter_1 = {
    "operator": "AND",
    "conditions": [
        {"field": "meta.name", "operator": "==", "value": "name_0"},
        {"field": "meta.number", "operator": "not in", "value": [2, 9]},
    ],
}
filter_2 = {
    "operator": "AND",
    "conditions": [
        {"field": "content", "operator": "contains", "value": "FOO"},
        {"field": "content", "operator": "not contains", "value": "BAR"},
    ],
}
To combine several groups of conditions under one logical operator (e.g., "AND", "OR"), provide them as a list of dictionaries under "conditions". Each dictionary may itself be a nested operator group, as in the example below:
filters = {
    "operator": "OR",
    "conditions": [
        {"field": "meta.author", "operator": "==", "value": "author_1"},
        {
            "operator": "AND",
            "conditions": [
                {"field": "meta.tag", "operator": "==", "value": "tag_1"},
                {"field": "meta.page", "operator": ">", "value": 100},
            ],
        },
        {
            "operator": "AND",
            "conditions": [
                {"field": "meta.tag", "operator": "==", "value": "tag_2"},
                {"field": "meta.page", "operator": ">", "value": 200},
            ],
        },
    ],
}
Arguments:
filters: the filters to apply to the document list.
Returns:
a list of Documents that match the given filters.
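To show how a nested filter dictionary is meant to be read, the sketch below evaluates the filter syntax against documents represented as plain dictionaries. It is not how the integration works internally (the real store translates filters into Chroma queries), and the names COMPARATORS, lookup, and matches are hypothetical. It covers only the operators used in the examples above and does not enforce Chroma's rule that content and metadata filters be supplied separately.

```python
from typing import Any, Callable, Dict

# Hypothetical comparison table covering the operators used in the examples above.
COMPARATORS: Dict[str, Callable[[Any, Any], bool]] = {
    "==": lambda actual, expected: actual == expected,
    ">": lambda actual, expected: actual is not None and actual > expected,
    "not in": lambda actual, expected: actual not in expected,
    "contains": lambda actual, expected: actual is not None and expected in actual,
    "not contains": lambda actual, expected: actual is None or expected not in actual,
}


def lookup(doc: Dict[str, Any], field: str) -> Any:
    """Resolve "id", "content" or "meta.<key>" against a document given as a dict."""
    if field.startswith("meta."):
        return doc.get("meta", {}).get(field[len("meta."):])
    if field in ("id", "content"):
        return doc.get(field)
    raise ValueError(f"Unsupported field: {field}")


def matches(doc: Dict[str, Any], flt: Dict[str, Any]) -> bool:
    """Return True when the document satisfies the (possibly nested) filter."""
    if "conditions" in flt:
        # Logical group: evaluate every child, then combine with AND / OR.
        outcomes = [matches(doc, condition) for condition in flt["conditions"]]
        if flt["operator"] == "AND":
            return all(outcomes)
        if flt["operator"] == "OR":
            return any(outcomes)
        raise ValueError(f"Unsupported logical operator: {flt['operator']}")
    # Leaf condition: compare one field against the given value.
    return COMPARATORS[flt["operator"]](lookup(doc, flt["field"]), flt["value"])


docs = [
    {"id": "1", "content": "a", "meta": {"author": "author_1", "tag": "tag_9", "page": 1}},
    {"id": "2", "content": "b", "meta": {"author": "author_2", "tag": "tag_1", "page": 150}},
    {"id": "3", "content": "c", "meta": {"author": "author_2", "tag": "tag_2", "page": 150}},
]
nested_filter = {
    "operator": "OR",
    "conditions": [
        {"field": "meta.author", "operator": "==", "value": "author_1"},
        {"operator": "AND", "conditions": [
            {"field": "meta.tag", "operator": "==", "value": "tag_1"},
            {"field": "meta.page", "operator": ">", "value": 100},
        ]},
        {"operator": "AND", "conditions": [
            {"field": "meta.tag", "operator": "==", "value": "tag_2"},
            {"field": "meta.page", "operator": ">", "value": 200},
        ]},
    ],
}
matched_ids = [doc["id"] for doc in docs if matches(doc, nested_filter)]
print(matched_ids)
```

The first document matches through its author, the second through the tag_1 group, and the third fails because its page number does not exceed 200.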
ChromaDocumentStore.write_documents
def write_documents(documents: List[Document],
                    policy: DuplicatePolicy = DuplicatePolicy.FAIL) -> int
Writes (or overwrites) documents into the store.
Arguments:
documents: A list of documents to write into the document store.
policy: Not supported at the moment.
Raises:
ValueError
: When input is not valid.
Returns:
The number of documents written
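The "writes (or overwrites)" behaviour can be pictured with a toy in-memory stand-in keyed by document id. Everything here is hypothetical (ToyDoc and ToyStore are not part of the integration, and the real store delegates to a Chroma collection); the sketch only assumes that a document whose id already exists replaces the stored one, and that invalid input raises ValueError as documented.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ToyDoc:
    """Hypothetical minimal document: just an id and some content."""
    id: str
    content: str


class ToyStore:
    """Hypothetical in-memory stand-in illustrating overwrite-by-id semantics."""

    def __init__(self) -> None:
        self._docs: Dict[str, ToyDoc] = {}

    def write_documents(self, documents: List[ToyDoc]) -> int:
        if not isinstance(documents, list) or not all(isinstance(d, ToyDoc) for d in documents):
            raise ValueError("Please provide a list of documents.")
        for doc in documents:
            # An existing id is silently replaced (assumed overwrite behaviour).
            self._docs[doc.id] = doc
        return len(documents)

    def count_documents(self) -> int:
        return len(self._docs)


store = ToyStore()
first_write = store.write_documents([ToyDoc("a", "old text"), ToyDoc("b", "other")])
second_write = store.write_documents([ToyDoc("a", "new text")])
print(first_write, second_write, store.count_documents())
```

After the second call the store still holds two documents, and the one with id "a" carries the new content.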
ChromaDocumentStore.delete_documents
def delete_documents(document_ids: List[str]) -> None
Deletes all documents whose ids appear in document_ids from the document store.
Arguments:
document_ids
: the document ids to delete
ChromaDocumentStore.search
def search(queries: List[str],
           top_k: int,
           filters: Optional[Dict[str, Any]] = None) -> List[List[Document]]
Search the documents in the store using the provided text queries.
Arguments:
queries: the list of queries to search for.
top_k: top_k documents to return for each query.
filters: a dictionary of filters to apply to the search. Accepts filters in Haystack format.
Returns:
matching documents for each query.
ChromaDocumentStore.search_embeddings
def search_embeddings(
        query_embeddings: List[List[float]],
        top_k: int,
        filters: Optional[Dict[str, Any]] = None) -> List[List[Document]]
Performs a vector search on the stored documents, using the embeddings of the queries instead of their text.
Arguments:
query_embeddings: a list of embeddings to use as queries.
top_k: the maximum number of documents to retrieve.
filters: a dictionary of filters to apply to the search. Accepts filters in Haystack format.
Returns:
a list of lists of documents that match the given filters.
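The List[List[Document]] return shape (one result list per query, each capped at top_k) can be illustrated with a brute-force nearest-neighbour search over a small dictionary of vectors. This is only a sketch: toy_search_embeddings is a hypothetical name, it ranks by Euclidean distance, it ignores filters, and Chroma itself uses an approximate index rather than an exhaustive scan.

```python
import math
from typing import Dict, List, Sequence, Tuple


def toy_search_embeddings(
    store: Dict[str, Sequence[float]],
    query_embeddings: List[Sequence[float]],
    top_k: int,
) -> List[List[Tuple[str, float]]]:
    """Hypothetical brute-force search: one ranked (id, distance) list per query."""
    results = []
    for query in query_embeddings:
        # Score every stored vector against this query.
        scored = [(doc_id, math.dist(query, emb)) for doc_id, emb in store.items()]
        # Smaller distance means a closer match, so sort ascending.
        scored.sort(key=lambda pair: pair[1])
        results.append(scored[:top_k])
    return results


store = {"a": [0.0, 0.0], "b": [1.0, 0.0], "c": [5.0, 5.0]}
hits = toy_search_embeddings(store, [[0.0, 0.0], [4.0, 4.0]], top_k=2)
for query_hits in hits:
    print([doc_id for doc_id, _ in query_hits])
```

The outer list has the same length as the list of queries, so results can be matched back to queries by position.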
ChromaDocumentStore.from_dict
@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "ChromaDocumentStore"
Deserializes the component from a dictionary.
Arguments:
data
: Dictionary to deserialize from.
Returns:
Deserialized component.
ChromaDocumentStore.to_dict
def to_dict() -> Dict[str, Any]
Serializes the component to a dictionary.
Returns:
Dictionary with serialized data.
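The to_dict / from_dict pair is what allows a store to be saved with a pipeline and rebuilt later. The round trip can be sketched with a toy class, shown below. The dictionary layout used here (a "type" string plus "init_parameters") is an assumption modelled on Haystack's usual convention, and ToyStore is hypothetical; the point is only that every init parameter is a plain, serializable value.

```python
import json
from typing import Any, Dict, Optional


class ToyStore:
    """Hypothetical store whose init parameters are all plain, serializable values."""

    def __init__(
        self,
        collection_name: str = "documents",
        embedding_function: str = "default",
        persist_path: Optional[str] = None,
    ) -> None:
        self.collection_name = collection_name
        self.embedding_function = embedding_function
        self.persist_path = persist_path

    def to_dict(self) -> Dict[str, Any]:
        # Assumed layout: the class path plus the arguments needed to rebuild it.
        return {
            "type": f"{type(self).__module__}.{type(self).__name__}",
            "init_parameters": {
                "collection_name": self.collection_name,
                "embedding_function": self.embedding_function,
                "persist_path": self.persist_path,
            },
        }

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "ToyStore":
        return cls(**data["init_parameters"])


original = ToyStore(collection_name="notes")
# Going through JSON shows that the dictionary contains only serializable values.
payload = json.dumps(original.to_dict())
restored = ToyStore.from_dict(json.loads(payload))
print(restored.collection_name, restored.embedding_function)
```

Because the embedding function is stored as a name rather than as a callable, the dictionary survives the trip through JSON unchanged.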
Module haystack_integrations.document_stores.chroma.errors
ChromaDocumentStoreError
Parent class for all ChromaDocumentStore exceptions.
ChromaDocumentStoreFilterError
Raised when a filter is not valid for a ChromaDocumentStore.
ChromaDocumentStoreConfigError
Raised when a configuration is not valid for a ChromaDocumentStore.
Module haystack_integrations.document_stores.chroma.utils
get_embedding_function
def get_embedding_function(function_name: str, **kwargs) -> EmbeddingFunction
Load an embedding function by name.
Arguments:
function_name: the name of the embedding function.
kwargs: additional arguments to pass to the embedding function.
Raises:
ChromaDocumentStoreConfigError
: if the function name is invalid.
Returns:
the loaded embedding function.
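The name-based lookup described here follows a common registry pattern, sketched below with toy components. All names are hypothetical (REGISTRY, ToyConfigError, get_toy_embedding_function, and the "length" embedder are invented for illustration), and the real function resolves names to Chroma's own embedding functions.

```python
from typing import Callable, Dict, List

Embedder = Callable[[List[str]], List[List[float]]]


class ToyConfigError(Exception):
    """Hypothetical stand-in for a configuration error."""


def make_length_embedder(scale: float = 1.0) -> Embedder:
    """Toy factory: 'embeds' each text as a single number based on its length."""
    def embed(texts: List[str]) -> List[List[float]]:
        return [[len(text) * scale] for text in texts]
    return embed


# The registry maps a serializable name to a factory for the actual function.
REGISTRY: Dict[str, Callable[..., Embedder]] = {"length": make_length_embedder}


def get_toy_embedding_function(function_name: str, **kwargs) -> Embedder:
    """Look the factory up by name and build the function with the extra arguments."""
    try:
        factory = REGISTRY[function_name]
    except KeyError:
        raise ToyConfigError(f"Invalid function name: {function_name}") from None
    return factory(**kwargs)


embed = get_toy_embedding_function("length", scale=2.0)
vectors = embed(["ab", "abcd"])
print(vectors)
```

Only the string "length" and the keyword arguments need to be stored in a serialized pipeline; the callable itself is rebuilt on load.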