Chroma integration for Haystack
Module haystack_integrations.components.retrievers.chroma.retriever
ChromaQueryTextRetriever
A component for retrieving documents from a Chroma database using the query API.
Example usage:
from haystack import Pipeline
from haystack.components.converters import TextFileToDocument
from haystack.components.writers import DocumentWriter
from haystack_integrations.document_stores.chroma import ChromaDocumentStore
from haystack_integrations.components.retrievers.chroma import ChromaQueryTextRetriever
file_paths = ...
# Chroma is used in-memory so we use the same instances in the two pipelines below
document_store = ChromaDocumentStore()
indexing = Pipeline()
indexing.add_component("converter", TextFileToDocument())
indexing.add_component("writer", DocumentWriter(document_store))
indexing.connect("converter", "writer")
indexing.run({"converter": {"sources": file_paths}})
querying = Pipeline()
querying.add_component("retriever", ChromaQueryTextRetriever(document_store))
results = querying.run({"retriever": {"query": "Variable declarations", "top_k": 3}})
for d in results["retriever"]["documents"]:
    print(d.meta, d.score)
ChromaQueryTextRetriever.__init__
def __init__(document_store: ChromaDocumentStore,
filters: Optional[Dict[str, Any]] = None,
top_k: int = 10)
Arguments:
document_store
: an instance of ChromaDocumentStore.
filters
: filters to narrow down the search space.
top_k
: the maximum number of documents to retrieve.
ChromaQueryTextRetriever.run
@component.output_types(documents=List[Document])
def run(query: str,
_: Optional[Dict[str, Any]] = None,
top_k: Optional[int] = None)
Run the retriever on the given input data.
Arguments:
query
: The input data for the retriever. In this case, a plain-text query.
top_k
: The maximum number of documents to retrieve. If not specified, the default value from the constructor is used.
Raises:
ValueError
: If the specified document store is not found or is not a ChromaDocumentStore instance.
Returns:
A dictionary with the following keys:
documents
: List of documents returned by the search engine.
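The top_k fallback described above can be sketched without Haystack installed. ToyRetriever below is a hypothetical stand-in, not the integration's code, and its substring match is only a placeholder for the real search. It shows how a per-call value overrides the constructor default.

```python
from typing import Dict, List, Optional


class ToyRetriever:
    """Hypothetical stand-in used only to illustrate the top_k fallback."""

    def __init__(self, corpus: List[str], top_k: int = 10):
        self.corpus = corpus
        self.top_k = top_k  # default used when run() receives no top_k

    def run(self, query: str, top_k: Optional[int] = None) -> Dict[str, List[str]]:
        # A per-call top_k wins; otherwise fall back to the constructor value.
        limit = self.top_k if top_k is None else top_k
        # Naive substring match stands in for the real similarity search.
        hits = [text for text in self.corpus if query.lower() in text.lower()]
        return {"documents": hits[:limit]}


retriever = ToyRetriever(["var a", "var b", "var c", "let d"], top_k=2)
default_result = retriever.run("var")            # limited by the constructor default
override_result = retriever.run("var", top_k=3)  # limited by the per-call value
```

Comparing against None rather than using `top_k or self.top_k` keeps an explicit value from being silently replaced by the default.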
ChromaQueryTextRetriever.from_dict
@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "ChromaQueryTextRetriever"
Deserializes the component from a dictionary.
Arguments:
data
: Dictionary to deserialize from.
Returns:
Deserialized component.
ChromaQueryTextRetriever.to_dict
def to_dict() -> Dict[str, Any]
Serializes the component to a dictionary.
Returns:
Dictionary with serialized data.
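As a rough illustration of the to_dict / from_dict round trip, the sketch below uses a hypothetical ToyComponent. The dictionary layout (a type name plus the init parameters) is an assumption made for this example and may differ from what the real component emits.

```python
from typing import Any, Dict, Optional


class ToyComponent:
    """Hypothetical component; the dictionary layout below is an assumption."""

    def __init__(self, top_k: int = 10, filters: Optional[Dict[str, Any]] = None):
        self.top_k = top_k
        self.filters = filters

    def to_dict(self) -> Dict[str, Any]:
        # Store only plain, JSON-friendly values so the result can be saved.
        return {
            "type": type(self).__name__,
            "init_parameters": {"top_k": self.top_k, "filters": self.filters},
        }

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "ToyComponent":
        # Rebuild the instance by passing the saved parameters back to __init__.
        return cls(**data["init_parameters"])


original = ToyComponent(top_k=3, filters={"type": "article"})
restored = ToyComponent.from_dict(original.to_dict())
```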
ChromaEmbeddingRetriever
A component for retrieving documents from a Chroma database using embeddings.
ChromaEmbeddingRetriever.run
@component.output_types(documents=List[Document])
def run(query_embedding: List[float],
filters: Optional[Dict[str, Any]] = None,
top_k: Optional[int] = None)
Run the retriever on the given input data.
Arguments:
query_embedding
: the query embeddings.
Returns:
a dictionary with the following keys:
documents
: List of documents returned by the search engine.
Module haystack_integrations.document_stores.chroma.document_store
ChromaDocumentStore
A document store using Chroma as the backend.
We use the collection.get API to implement the document store protocol; the collection.search API is used in the retriever instead.
ChromaDocumentStore.__init__
def __init__(collection_name: str = "documents",
embedding_function: str = "default",
persist_path: Optional[str] = None,
**embedding_function_params)
Initializes the store. The constructor is not part of the Store Protocol, so its signature can be customized to your needs. For example, parameters needed to set up a database client would be passed to this method.
Note: for the component to be part of a serializable pipeline, the init parameters must be serializable. This is why the embedding function is configured by passing a string, which is resolved through a registry.
Arguments:
collection_name
: the name of the collection to use in the database.
embedding_function
: the name of the embedding function to use to embed the query.
persist_path
: where to store the database. If None, the database will be in-memory.
embedding_function_params
: additional parameters to pass to the embedding function.
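The note about serializable init parameters can be illustrated with a small sketch. REGISTRY, make_length_embedder and the "length" name are all hypothetical. The point is that a name and its keyword parameters survive a JSON round trip, whereas a function object would not.

```python
import json
from typing import Callable, Dict, List

# Hypothetical type: an embedding callable maps texts to vectors.
EmbeddingFn = Callable[[List[str]], List[List[float]]]


def make_length_embedder(scale: float = 1.0) -> EmbeddingFn:
    # Toy embedding: a single dimension holding the scaled text length.
    return lambda texts: [[len(t) * scale] for t in texts]


# Hypothetical registry: names map to factories that build embedding callables.
REGISTRY: Dict[str, Callable[..., EmbeddingFn]] = {"length": make_length_embedder}

# The init parameters are plain strings and numbers, so they survive JSON.
init_params = {"embedding_function": "length", "scale": 0.5}
saved = json.dumps(init_params)

# After loading, the name is resolved back into a callable.
loaded = json.loads(saved)
embed = REGISTRY[loaded.pop("embedding_function")](**loaded)
vectors = embed(["abcd", "ab"])
```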
ChromaDocumentStore.count_documents
def count_documents() -> int
Returns how many documents are present in the document store.
Returns:
how many documents are present in the document store.
ChromaDocumentStore.filter_documents
def filter_documents(
filters: Optional[Dict[str, Any]] = None) -> List[Document]
Returns the documents that match the filters provided.
Filters are defined as nested dictionaries. The keys of the dictionaries can be a logical operator ("$and", "$or", "$not"), a comparison operator ("$eq", "$ne", "$in", "$nin", "$gt", "$gte", "$lt", "$lte") or a metadata field name.
Logical operator keys take a dictionary of metadata field names and/or logical operators as value. Metadata field names take a dictionary of comparison operators as value. Comparison operator keys take a single value or (in case of "$in") a list of values as value. If no logical operator is provided, "$and" is used as the default operation. If no comparison operator is provided, "$eq" (or "$in" if the comparison value is a list) is used as the default operation.
Example:
filters = {
    "$and": {
        "type": {"$eq": "article"},
        "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
        "rating": {"$gte": 3},
        "$or": {
            "genre": {"$in": ["economy", "politics"]},
            "publisher": {"$eq": "nytimes"}
        }
    }
}
# or simpler using default operators
filters = {
    "type": "article",
    "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
    "rating": {"$gte": 3},
    "$or": {
        "genre": ["economy", "politics"],
        "publisher": "nytimes"
    }
}
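To make the default-operator rules concrete, here is a minimal sketch that expands the shorthand form into the explicit one. The functions expand_conditions and normalize are hypothetical helpers written for this illustration; they are not part of the integration and only encode the rules stated above.

```python
from typing import Any, Dict

LOGICAL = {"$and", "$or", "$not"}


def expand_conditions(conditions: Dict[str, Any]) -> Dict[str, Any]:
    """Make the comparison operators explicit inside one logical block."""
    expanded: Dict[str, Any] = {}
    for key, value in conditions.items():
        if key in LOGICAL:
            # Logical operators hold a dictionary or a list of dictionaries.
            if isinstance(value, list):
                expanded[key] = [expand_conditions(item) for item in value]
            else:
                expanded[key] = expand_conditions(value)
        elif isinstance(value, dict):
            expanded[key] = value  # comparison operators already given
        elif isinstance(value, list):
            expanded[key] = {"$in": value}  # default for list values
        else:
            expanded[key] = {"$eq": value}  # default for single values
    return expanded


def normalize(filters: Dict[str, Any]) -> Dict[str, Any]:
    """Wrap the filter in "$and" unless it is already one logical operator."""
    expanded = expand_conditions(filters)
    if len(expanded) == 1 and next(iter(expanded)) in LOGICAL:
        return expanded
    return {"$and": expanded}


shorthand = {
    "type": "article",
    "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
    "rating": {"$gte": 3},
    "$or": {"genre": ["economy", "politics"], "publisher": "nytimes"},
}
explicit = normalize(shorthand)
```

Applied to the shorthand filter, normalize yields the same dictionary as the explicit example.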
To use the same logical operator multiple times on the same level, logical operators can take a list of dictionaries as value.
Example:
filters = {
    "$or": [
        {
            "$and": {
                "Type": "News Paper",
                "Date": {
                    "$lt": "2019-01-01"
                }
            }
        },
        {
            "$and": {
                "Type": "Blog Post",
                "Date": {
                    "$gte": "2019-01-01"
                }
            }
        }
    ]
}
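The sketch below shows one way such a filter, including the list form, could be evaluated against a document's metadata. It is an in-memory illustration, not how the store translates filters for Chroma. The matches and field_matches helpers are hypothetical, a missing field is assumed never to match, and "$not" is assumed to negate the conjunction of its conditions.

```python
from typing import Any, Dict, List, Union

COMPARATORS = {
    "$eq": lambda field, value: field == value,
    "$ne": lambda field, value: field != value,
    "$in": lambda field, value: field in value,
    "$nin": lambda field, value: field not in value,
    "$gt": lambda field, value: field > value,
    "$gte": lambda field, value: field >= value,
    "$lt": lambda field, value: field < value,
    "$lte": lambda field, value: field <= value,
}
LOGICAL = ("$and", "$or", "$not")


def field_matches(meta: Dict[str, Any], field: str, condition: Any) -> bool:
    if field not in meta:
        return False  # assumption: a missing field never matches
    if isinstance(condition, dict):
        # Every comparison operator given for the field must hold.
        return all(COMPARATORS[op](meta[field], v) for op, v in condition.items())
    if isinstance(condition, list):
        return meta[field] in condition  # default "$in"
    return meta[field] == condition  # default "$eq"


def matches(meta: Dict[str, Any],
            filters: Union[Dict[str, Any], List[Dict[str, Any]]],
            op: str = "$and") -> bool:
    """Return True if `meta` satisfies `filters` combined with `op`."""
    if isinstance(filters, list):
        # Each dictionary in the list is evaluated on its own.
        results = [matches(meta, item) for item in filters]
    else:
        results = [
            matches(meta, value, key) if key in LOGICAL else field_matches(meta, key, value)
            for key, value in filters.items()
        ]
    if op == "$or":
        return any(results)
    if op == "$not":
        return not all(results)  # assumption: negates the conjunction
    return all(results)


filters = {
    "$or": [
        {"$and": {"Type": "News Paper", "Date": {"$lt": "2019-01-01"}}},
        {"$and": {"Type": "Blog Post", "Date": {"$gte": "2019-01-01"}}},
    ]
}
old_paper = matches({"Type": "News Paper", "Date": "2018-06-01"}, filters)
old_blog = matches({"Type": "Blog Post", "Date": "2018-06-01"}, filters)
new_blog = matches({"Type": "Blog Post", "Date": "2020-01-01"}, filters)
```

The date comparisons work on plain strings here only because ISO-formatted dates sort lexicographically.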
Arguments:
filters
: the filters to apply to the document list.
Returns:
a list of Documents that match the given filters.
ChromaDocumentStore.write_documents
def write_documents(documents: List[Document],
policy: DuplicatePolicy = DuplicatePolicy.FAIL) -> int
Writes (or overwrites) documents into the store.
Arguments:
documents
: A list of documents to write into the document store.
policy
: Not supported at the moment.
Raises:
ValueError
: When input is not valid.
Returns:
The number of documents written
ChromaDocumentStore.delete_documents
def delete_documents(document_ids: List[str]) -> None
Deletes from the document store all documents whose ID is listed in document_ids.
Arguments:
document_ids
: the IDs of the documents to delete.
ChromaDocumentStore.search
def search(queries: List[str], top_k: int) -> List[List[Document]]
Search the documents in the store using the provided text queries.
Arguments:
queries
: the list of queries to search for.
top_k
: the number of documents to return for each query.
Returns:
matching documents for each query.
ChromaDocumentStore.search_embeddings
def search_embeddings(
query_embeddings: List[List[float]],
top_k: int,
filters: Optional[Dict[str, Any]] = None) -> List[List[Document]]
Performs a vector search on the stored documents, using the embeddings of the queries instead of their text.
Arguments:
query_embeddings
: a list of embeddings to use as queries.
top_k
: the maximum number of documents to retrieve.
filters
: a dictionary of filters to apply to the search. Accepts filters in Haystack format.
Returns:
a list of lists of documents that match the given filters.
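The nested return shape (one result list per query embedding) can be illustrated with a toy in-memory search. toy_search_embeddings and the store dictionary are hypothetical. Cosine similarity is chosen here only for illustration; the metric Chroma actually applies depends on how the collection is configured.

```python
import math
from typing import Dict, List


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def toy_search_embeddings(store: Dict[str, List[float]],
                          query_embeddings: List[List[float]],
                          top_k: int) -> List[List[str]]:
    """Return, for each query embedding, the ids of its top_k closest entries."""
    results = []
    for query in query_embeddings:
        # Rank all stored ids by similarity to this query, best first.
        ranked = sorted(store, key=lambda doc_id: cosine(query, store[doc_id]), reverse=True)
        results.append(ranked[:top_k])
    return results


# Hypothetical store: document id -> embedding.
store = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
hits = toy_search_embeddings(store, [[1.0, 0.0], [0.0, 1.0]], top_k=2)
```

The outer list has one entry per query embedding, and each inner list holds at most top_k results.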
ChromaDocumentStore.from_dict
@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "ChromaDocumentStore"
Deserializes the component from a dictionary.
Arguments:
data
: Dictionary to deserialize from.
Returns:
Deserialized component.
ChromaDocumentStore.to_dict
def to_dict() -> Dict[str, Any]
Serializes the component to a dictionary.
Returns:
Dictionary with serialized data.
Module haystack_integrations.document_stores.chroma.errors
ChromaDocumentStoreError
Parent class for all ChromaDocumentStore exceptions.
ChromaDocumentStoreFilterError
Raised when a filter is not valid for a ChromaDocumentStore.
ChromaDocumentStoreConfigError
Raised when a configuration is not valid for a ChromaDocumentStore.
Module haystack_integrations.document_stores.chroma.utils
get_embedding_function
def get_embedding_function(function_name: str, **kwargs) -> EmbeddingFunction
Load an embedding function by name.
Arguments:
function_name
: the name of the embedding function.
kwargs
: additional arguments to pass to the embedding function.
Raises:
ChromaDocumentStoreConfigError
: if the function name is invalid.
Returns:
the loaded embedding function.
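A minimal sketch of name-based loading with an error for unknown names follows. StoreError and StoreConfigError are local stand-ins for ChromaDocumentStoreError and ChromaDocumentStoreConfigError. FUNCTIONS, load_function and the "length" entry are hypothetical, since the names the real registry accepts are not listed here.

```python
from typing import Callable, Dict


class StoreError(Exception):
    """Stand-in for ChromaDocumentStoreError (the parent class)."""


class StoreConfigError(StoreError):
    """Stand-in for ChromaDocumentStoreConfigError."""


# Hypothetical registry: each name maps to a factory taking keyword arguments.
FUNCTIONS: Dict[str, Callable[..., Callable[[str], int]]] = {
    "length": lambda **kwargs: (lambda text: len(text) + kwargs.get("offset", 0)),
}


def load_function(function_name: str, **kwargs) -> Callable[[str], int]:
    try:
        factory = FUNCTIONS[function_name]
    except KeyError:
        # Report an unknown name as a configuration error.
        raise StoreConfigError(f"Invalid function name: {function_name}") from None
    return factory(**kwargs)


embed = load_function("length", offset=1)

try:
    load_function("does-not-exist")
    caught = None
except StoreError as err:  # catching the parent class also catches config errors
    caught = err
```

Because the configuration error derives from the parent class, callers can handle every store-related failure with a single except clause.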