Document Stores
document_store
BM25DocumentStats
A dataclass for managing document statistics for BM25 retrieval.
Parameters:
- freq_token (dict[str, int]) – A Counter of token frequencies in the document.
- doc_len (int) – Number of tokens in the document.
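As a rough illustration of what these two fields hold, the sketch below tokenizes a string with the default bm25_tokenization_regex listed in the InMemoryDocumentStore constructor and counts the tokens. DocStats and compute_stats are hypothetical stand-ins rather than the library's code, and lowercasing the text before tokenizing is an assumption.

```python
import re
from collections import Counter
from dataclasses import dataclass

# Default value of bm25_tokenization_regex, written as a raw string.
TOKEN_PATTERN = re.compile(r"(?u)\b\w+\b")


@dataclass
class DocStats:
    """Hypothetical stand-in mirroring the two documented fields."""

    freq_token: dict[str, int]
    doc_len: int


def compute_stats(text: str) -> DocStats:
    # Assumption: the text is lowercased before tokenizing.
    tokens = TOKEN_PATTERN.findall(text.lower())
    return DocStats(freq_token=dict(Counter(tokens)), doc_len=len(tokens))


stats = compute_stats("The cat sat on the mat.")
print(stats.freq_token["the"], stats.doc_len)
```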
InMemoryDocumentStore
Stores data in memory. The data is ephemeral: it is lost when the process ends unless you explicitly write it to disk with save_to_disk.
init
__init__(
bm25_tokenization_regex: str = "(?u)\\b\\w+\\b",
bm25_algorithm: Literal["BM25Okapi", "BM25L", "BM25Plus"] = "BM25L",
bm25_parameters: dict | None = None,
embedding_similarity_function: Literal[
"dot_product", "cosine"
] = "dot_product",
index: str | None = None,
async_executor: ThreadPoolExecutor | None = None,
return_embedding: bool = True,
) -> None
Initializes the DocumentStore.
Parameters:
- bm25_tokenization_regex (str) – The regular expression used to tokenize the text for BM25 retrieval.
- bm25_algorithm (Literal['BM25Okapi', 'BM25L', 'BM25Plus']) – The BM25 algorithm to use. One of "BM25Okapi", "BM25L", or "BM25Plus".
- bm25_parameters (dict | None) – Parameters for the BM25 implementation, as a dictionary. For example: {'k1': 1.5, 'b': 0.75, 'epsilon': 0.25}. You can learn more about these parameters at https://github.com/dorianbrown/rank_bm25.
- embedding_similarity_function (Literal['dot_product', 'cosine']) – The similarity function used to compare Document embeddings. One of "dot_product" (default) or "cosine". To choose the most appropriate function, check the documentation of your embedding model.
- index (str | None) – A specific index to store the documents in. If not specified, a random UUID is used. Using the same index allows you to share documents across multiple InMemoryDocumentStore instances.
- async_executor (ThreadPoolExecutor | None) – Optional ThreadPoolExecutor to use for async calls. If not provided, a single-threaded executor is created and used.
- return_embedding (bool) – Whether to return the embedding of the retrieved Documents. Default is True.
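To make the difference between the two embedding_similarity_function options concrete, here is a plain-Python sketch of both metrics. The helper names are illustrative and this is not the store's implementation.

```python
import math


def dot_product(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine(a: list[float], b: list[float]) -> float:
    # Cosine similarity is the dot product of the length-normalized vectors.
    norm_a = math.sqrt(dot_product(a, a))
    norm_b = math.sqrt(dot_product(b, b))
    return dot_product(a, b) / (norm_a * norm_b)


query = [1.0, 0.0]
short_doc = [1.0, 0.0]
long_doc = [3.0, 4.0]

# Dot product rewards vector magnitude; cosine only looks at direction.
print(dot_product(query, short_doc), dot_product(query, long_doc))
print(cosine(query, short_doc), cosine(query, long_doc))
```

With the dot product, long_doc outranks short_doc because of its larger magnitude; with cosine, short_doc wins because it points in exactly the query's direction. For embeddings that are already normalized to unit length, both metrics produce the same ranking.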
shutdown
Explicitly shuts down the executor, but only if this instance created it. An executor passed in through async_executor is left running.
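The ownership rule can be sketched with the standard library alone. TinyStore below is a hypothetical class, not the real store; it only assumes that the async methods run their synchronous counterparts on the executor.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor
from typing import Optional


class TinyStore:
    """Hypothetical store showing the executor-ownership pattern."""

    def __init__(self, async_executor: Optional[ThreadPoolExecutor] = None) -> None:
        self._docs: list[str] = []
        # Own the executor only if the caller did not supply one.
        self._owns_executor = async_executor is None
        self.executor = async_executor or ThreadPoolExecutor(max_workers=1)

    def count_documents(self) -> int:
        return len(self._docs)

    async def count_documents_async(self) -> int:
        # Run the synchronous method on the executor so the event loop stays free.
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(self.executor, self.count_documents)

    def shutdown(self) -> None:
        # Never shut down an executor that belongs to the caller.
        if self._owns_executor:
            self.executor.shutdown(wait=True)


store = TinyStore()
count = asyncio.run(store.count_documents_async())
store.shutdown()

shared = ThreadPoolExecutor(max_workers=2)
borrowed = TinyStore(async_executor=shared)
borrowed.shutdown()  # no-op: the caller still owns the shared executor
still_usable = shared.submit(lambda: "ok").result()
shared.shutdown()
print(count, still_usable)
```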
storage
Utility property that returns the storage used by this instance of InMemoryDocumentStore.
to_dict
Serializes the component to a dictionary.
Returns:
dict[str, Any] – Dictionary with serialized data.
from_dict
Deserializes the component from a dictionary.
Parameters:
- data (dict[str, Any]) – The dictionary to deserialize from.
Returns:
InMemoryDocumentStore – The deserialized component.
save_to_disk
Write the database and its data to disk as a JSON file.
Parameters:
- path (str) – The path to the JSON file.
load_from_disk
Load the database and its data from disk as a JSON file.
Parameters:
- path (str) – The path to the JSON file.
Returns:
InMemoryDocumentStore – The loaded InMemoryDocumentStore.
count_documents
Returns the number of documents present in the DocumentStore.
filter_documents
Returns the documents that match the filters provided.
Parameters:
- filters (dict[str, Any] | None) – The filters to apply. For a detailed specification of the filters, refer to the documentation.
Returns:
list[Document] – A list of Documents that match the given filters.
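The filter specification itself lives in the documentation referred to above. For orientation only, the sketch below assumes a dictionary layout with field, operator, and value keys for comparisons and a nested conditions list for AND/OR, and evaluates it against plain dictionaries. The matches and lookup helpers are hypothetical and cover only a handful of operators.

```python
from typing import Any

COMPARATORS = {
    "==": lambda a, b: a == b,
    "!=": lambda a, b: a != b,
    ">": lambda a, b: a is not None and a > b,
    "<": lambda a, b: a is not None and a < b,
    "in": lambda a, b: a in b,
}


def lookup(doc: dict[str, Any], field: str) -> Any:
    # Walk dotted paths such as "meta.year"; missing keys yield None.
    value: Any = doc
    for part in field.split("."):
        if not isinstance(value, dict) or part not in value:
            return None
        value = value[part]
    return value


def matches(doc: dict[str, Any], filters: dict[str, Any]) -> bool:
    op = filters["operator"]
    if op in ("AND", "OR"):
        results = (matches(doc, cond) for cond in filters["conditions"])
        return all(results) if op == "AND" else any(results)
    if op not in COMPARATORS:
        raise ValueError(f"Unsupported operator: {op}")
    return COMPARATORS[op](lookup(doc, filters["field"]), filters["value"])


docs = [
    {"content": "a", "meta": {"lang": "en", "year": 2021}},
    {"content": "b", "meta": {"lang": "de", "year": 2023}},
    {"content": "c", "meta": {"lang": "en", "year": 2024}},
]
filters = {
    "operator": "AND",
    "conditions": [
        {"field": "meta.lang", "operator": "==", "value": "en"},
        {"field": "meta.year", "operator": ">", "value": 2022},
    ],
}
selected = [d["content"] for d in docs if matches(d, filters)]
print(selected)
```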
write_documents
write_documents(
documents: list[Document], policy: DuplicatePolicy = DuplicatePolicy.NONE
) -> int
Refer to the DocumentStore.write_documents() protocol documentation.
If policy is set to DuplicatePolicy.NONE, it defaults to DuplicatePolicy.FAIL.
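The fallback from NONE to FAIL can be sketched as follows. Policy and write are hypothetical stand-ins; the SKIP and OVERWRITE members, and the use of ValueError for the failure case, are assumptions about the protocol rather than details stated here.

```python
from enum import Enum


class Policy(Enum):
    """Hypothetical mirror of the duplicate policies assumed in this sketch."""

    NONE = "none"
    SKIP = "skip"
    OVERWRITE = "overwrite"
    FAIL = "fail"


def write(storage: dict[str, str], docs: dict[str, str], policy: Policy = Policy.NONE) -> int:
    # NONE falls back to FAIL, as stated above.
    if policy is Policy.NONE:
        policy = Policy.FAIL
    written = 0
    for doc_id, content in docs.items():
        if doc_id in storage:
            if policy is Policy.FAIL:
                # Assumption: the real store raises its own error type here.
                raise ValueError(f"ID '{doc_id}' already exists.")
            if policy is Policy.SKIP:
                continue
        storage[doc_id] = content
        written += 1
    return written


storage: dict[str, str] = {}
first = write(storage, {"1": "first", "2": "second"})
skipped = write(storage, {"2": "changed", "3": "third"}, Policy.SKIP)
overwritten = write(storage, {"2": "changed"}, Policy.OVERWRITE)
print(first, skipped, overwritten, storage["2"])
```

The return value counts only the documents actually written, which is why the SKIP call reports 1 even though two documents were passed in.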
delete_documents
Deletes all documents with matching document_ids from the DocumentStore.
Parameters:
- document_ids (list[str]) – The document_ids to delete.
delete_all_documents
Deletes all documents in the document store.
update_by_filter
Updates the metadata of all documents that match the provided filters.
Parameters:
- filters (dict[str, Any]) – The filters to apply to select documents for updating. For filter syntax, see filter_documents.
- meta (dict[str, Any]) – The metadata fields to update. These will be merged with existing metadata.
Returns:
int – The number of documents updated.
Raises:
ValueError – If filters have invalid syntax.
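A minimal sketch of what merging with existing metadata means, assuming a shallow merge in which the supplied fields overwrite fields of the same name and all other fields are kept:

```python
existing_meta = {"lang": "en", "status": "draft"}
update = {"status": "published", "reviewed": True}

# Keys in the update win; keys not mentioned in it are kept.
merged = {**existing_meta, **update}
print(merged)
```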
delete_by_filter
Deletes all documents that match the provided filters.
Parameters:
- filters (dict[str, Any]) – The filters to apply to select documents for deletion. For filter syntax, see filter_documents.
Returns:
int – The number of documents deleted.
Raises:
ValueError – If filters have invalid syntax.
count_documents_by_filter
Returns the number of documents that match the provided filters.
Parameters:
- filters (dict[str, Any]) – The filters to apply. For a detailed specification of the filters, refer to the documentation.
Returns:
int – The number of documents that match the filters.
count_unique_metadata_by_filter
count_unique_metadata_by_filter(
filters: dict[str, Any], metadata_fields: list[str]
) -> dict[str, int]
Returns the number of unique values for each specified metadata field from documents matching the filters.
Parameters:
- filters (dict[str, Any]) – The filters to apply. For a detailed specification of the filters, refer to the documentation.
- metadata_fields (list[str]) – List of field names to count unique values for. Field names can include or omit the "meta." prefix.
Returns:
dict[str, int] – A dictionary mapping each metadata field name (without "meta." prefix) to the count of its unique values among the filtered documents.
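The prefix handling and the shape of the result can be sketched over plain dictionaries that stand in for already-filtered documents. count_unique is a hypothetical helper and assumes hashable metadata values.

```python
from typing import Any


def count_unique(docs: list[dict[str, Any]], metadata_fields: list[str]) -> dict[str, int]:
    counts: dict[str, int] = {}
    for field in metadata_fields:
        # Accept both "meta.author" and "author"; report the name without the prefix.
        key = field[len("meta."):] if field.startswith("meta.") else field
        values = {doc["meta"][key] for doc in docs if key in doc["meta"]}
        counts[key] = len(values)
    return counts


docs = [
    {"meta": {"author": "ana", "lang": "en"}},
    {"meta": {"author": "ana", "lang": "en"}},
    {"meta": {"author": "bo", "lang": "en"}},
]
unique_counts = count_unique(docs, ["meta.author", "lang"])
print(unique_counts)
```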
get_metadata_fields_info
Returns information about the metadata fields present in the stored documents.
Types are inferred from the stored values (keyword, int, float, boolean).
Returns:
dict[str, dict[str, str]] – A dictionary mapping each metadata field name to a dict with a "type" key.
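One way to infer the four listed types from Python values is sketched below. infer_type and fields_info are hypothetical; letting the first value seen decide a field's type, and mapping everything that is not a bool, int, or float to keyword, are assumptions.

```python
from typing import Any


def infer_type(value: Any) -> str:
    # bool must be checked before int because bool is a subclass of int.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "int"
    if isinstance(value, float):
        return "float"
    return "keyword"


def fields_info(docs: list[dict[str, Any]]) -> dict[str, dict[str, str]]:
    info: dict[str, dict[str, str]] = {}
    for doc in docs:
        for name, value in doc["meta"].items():
            # Assumption: the first value seen for a field decides its type.
            if name not in info:
                info[name] = {"type": infer_type(value)}
    return info


docs = [
    {"meta": {"title": "Intro", "year": 2024, "rating": 4.5, "public": True}},
    {"meta": {"title": "Notes", "year": 2023}},
]
info = fields_info(docs)
print(info)
```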
get_metadata_field_min_max
Returns the minimum and maximum values for the given metadata field across all documents.
Parameters:
- metadata_field (str) – The metadata field name. Can include or omit the "meta." prefix.
Returns:
dict[str, Any] – A dictionary with "min" and "max" keys. Returns {"min": None, "max": None} if the field is missing or has no values.
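The documented return shape, including the fallback for a missing field, can be sketched like this. field_min_max is a hypothetical helper operating on plain dictionaries, and it assumes the field's values are mutually comparable.

```python
from typing import Any


def field_min_max(docs: list[dict[str, Any]], metadata_field: str) -> dict[str, Any]:
    prefix = "meta."
    key = metadata_field[len(prefix):] if metadata_field.startswith(prefix) else metadata_field
    values = [doc["meta"][key] for doc in docs if doc["meta"].get(key) is not None]
    if not values:
        # Field missing everywhere, or present only with None values.
        return {"min": None, "max": None}
    return {"min": min(values), "max": max(values)}


docs = [{"meta": {"year": 2019}}, {"meta": {"year": 2024}}, {"meta": {}}]
year_range = field_min_max(docs, "meta.year")
missing = field_min_max(docs, "pages")
print(year_range, missing)
```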
get_metadata_field_unique_values
get_metadata_field_unique_values(
metadata_field: str, search_term: str | None = None
) -> tuple[list[str], int]
Returns unique values for a metadata field, optionally filtered by a search term in content.
Parameters:
- metadata_field (str) – The metadata field name. Can include or omit the "meta." prefix.
- search_term (str | None) – If set, only documents whose content contains this term (case-insensitive) are considered.
Returns:
tuple[list[str], int] – A tuple of (list of unique values, total count of unique values).
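The effect of search_term can be sketched as a case-insensitive substring check on the content before the values are collected. unique_values is a hypothetical helper; converting values to strings and sorting them are assumptions made so the output is deterministic.

```python
from typing import Any, Optional


def unique_values(
    docs: list[dict[str, Any]], metadata_field: str, search_term: Optional[str] = None
) -> tuple[list[str], int]:
    prefix = "meta."
    key = metadata_field[len(prefix):] if metadata_field.startswith(prefix) else metadata_field
    if search_term is not None:
        needle = search_term.lower()
        # Keep only documents whose content contains the term, ignoring case.
        docs = [d for d in docs if needle in (d.get("content") or "").lower()]
    values = sorted({str(d["meta"][key]) for d in docs if key in d["meta"]})
    return values, len(values)


docs = [
    {"content": "Berlin is a city", "meta": {"city": "Berlin"}},
    {"content": "The BERLIN wall", "meta": {"city": "Berlin"}},
    {"content": "Paris in spring", "meta": {"city": "Paris"}},
]
all_cities = unique_values(docs, "meta.city")
berlin_only = unique_values(docs, "city", search_term="berlin")
print(all_cities, berlin_only)
```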
bm25_retrieval
bm25_retrieval(
query: str,
filters: dict[str, Any] | None = None,
top_k: int = 10,
scale_score: bool = False,
) -> list[Document]
Retrieves documents that are most relevant to the query using BM25 algorithm.
Parameters:
- query (str) – The query string.
- filters (dict[str, Any] | None) – A dictionary with filters to narrow down the search space.
- top_k (int) – The number of top documents to retrieve. Default is 10.
- scale_score (bool) – Whether to scale the scores of the retrieved documents. Default is False.
Returns:
list[Document] – A list of the top_k documents most relevant to the query.
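For intuition about how a BM25 ranking comes about, the sketch below scores a tiny corpus with an Okapi-style formula and a smoothed IDF. It is not the store's implementation: the default algorithm here is BM25L, which normalizes term frequency differently, and bm25_rank, its k1 and b defaults, and the lowercasing step are all assumptions. The corpus must be non-empty.

```python
import math
import re
from collections import Counter

TOKENIZE = re.compile(r"(?u)\b\w+\b").findall


def bm25_rank(query: str, corpus: list[str], top_k: int = 10, k1: float = 1.5, b: float = 0.75):
    tokenized = [TOKENIZE(text.lower()) for text in corpus]
    freqs = [Counter(tokens) for tokens in tokenized]
    n_docs = len(corpus)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    scores = []
    for tokens, freq in zip(tokenized, freqs):
        score = 0.0
        for term in TOKENIZE(query.lower()):
            n_t = sum(1 for f in freqs if term in f)  # documents containing the term
            if n_t == 0:
                continue
            idf = math.log((n_docs - n_t + 0.5) / (n_t + 0.5) + 1.0)
            tf = freq[term]
            # Term frequency saturates with k1 and is normalized by document length via b.
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(tokens) / avg_len))
        scores.append(score)
    ranked = sorted(range(n_docs), key=lambda i: scores[i], reverse=True)
    return [(corpus[i], scores[i]) for i in ranked[:top_k]]


corpus = [
    "the cat sat on the mat",
    "dogs chase cats in the park",
    "a cat and another cat",
]
results = bm25_rank("cat", corpus, top_k=2)
for text, score in results:
    print(f"{score:.3f}  {text}")
```

The third text ranks first because it contains the query term twice and is slightly shorter than average. The second text scores zero, since this tokenizer treats "cats" and "cat" as different tokens.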
embedding_retrieval
embedding_retrieval(
query_embedding: list[float],
filters: dict[str, Any] | None = None,
top_k: int = 10,
scale_score: bool = False,
return_embedding: bool | None = False,
) -> list[Document]
Retrieves documents that are most similar to the query embedding using a vector similarity metric.
Parameters:
- query_embedding (list[float]) – Embedding of the query.
- filters (dict[str, Any] | None) – A dictionary with filters to narrow down the search space.
- top_k (int) – The number of top documents to retrieve. Default is 10.
- scale_score (bool) – Whether to scale the scores of the retrieved Documents. Default is False.
- return_embedding (bool | None) – Whether to return the embedding of the retrieved Documents. If set to None, the value of the return_embedding parameter set at component initialization is used. Default is False.
Returns:
list[Document] – A list of the top_k documents most relevant to the query.
Raises:
ValueError – If filters have invalid syntax.
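The interplay between top_k and the return_embedding fallback can be sketched as follows. TinyVectorStore is hypothetical, uses the dot product as its only metric, and models nothing beyond these two parameters.

```python
import heapq
from typing import Optional


class TinyVectorStore:
    """Hypothetical store; only the top_k and return_embedding logic is modeled."""

    def __init__(self, return_embedding: bool = True) -> None:
        self.return_embedding = return_embedding
        self.docs: list[dict] = []

    def embedding_retrieval(
        self,
        query_embedding: list[float],
        top_k: int = 10,
        return_embedding: Optional[bool] = False,
    ) -> list[dict]:
        # None means "use the value chosen at initialization".
        keep = self.return_embedding if return_embedding is None else return_embedding

        def score(doc: dict) -> float:
            return sum(q * d for q, d in zip(query_embedding, doc["embedding"]))

        results = []
        for doc in heapq.nlargest(top_k, self.docs, key=score):
            hit = {"content": doc["content"], "score": score(doc)}
            if keep:
                hit["embedding"] = doc["embedding"]
            results.append(hit)
        return results


store = TinyVectorStore(return_embedding=True)
store.docs = [
    {"content": "a", "embedding": [1.0, 0.0]},
    {"content": "b", "embedding": [0.0, 1.0]},
    {"content": "c", "embedding": [0.7, 0.7]},
]
without = store.embedding_retrieval([1.0, 0.0], top_k=2)
with_default = store.embedding_retrieval([1.0, 0.0], top_k=2, return_embedding=None)
print([hit["content"] for hit in without], "embedding" in with_default[0])
```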
count_documents_async
Returns the number of documents present in the DocumentStore.
filter_documents_async
Returns the documents that match the filters provided.
Parameters:
- filters (dict[str, Any] | None) – The filters to apply. For a detailed specification of the filters, refer to the documentation.
Returns:
list[Document] – A list of Documents that match the given filters.
write_documents_async
write_documents_async(
documents: list[Document], policy: DuplicatePolicy = DuplicatePolicy.NONE
) -> int
Refer to the DocumentStore.write_documents() protocol documentation.
If policy is set to DuplicatePolicy.NONE, it defaults to DuplicatePolicy.FAIL.
delete_documents_async
Deletes all documents with matching document_ids from the DocumentStore.
Parameters:
- document_ids (list[str]) – The document_ids to delete.
update_by_filter_async
Updates the metadata of all documents that match the provided filters.
Parameters:
- filters (dict[str, Any]) – The filters to apply to select documents for updating. For filter syntax, see filter_documents.
- meta (dict[str, Any]) – The metadata fields to update. These will be merged with existing metadata.
Returns:
int – The number of documents updated.
count_documents_by_filter_async
Returns the number of documents that match the provided filters.
Parameters:
- filters (dict[str, Any]) – The filters to apply. For a detailed specification of the filters, refer to the documentation.
Returns:
int – The number of documents that match the filters.
count_unique_metadata_by_filter_async
count_unique_metadata_by_filter_async(
filters: dict[str, Any], metadata_fields: list[str]
) -> dict[str, int]
Returns the number of unique values for each specified metadata field from documents matching the filters.
Parameters:
- filters (dict[str, Any]) – The filters to apply. For a detailed specification of the filters, refer to the documentation.
- metadata_fields (list[str]) – List of field names to count unique values for. Field names can include or omit the "meta." prefix.
Returns:
dict[str, int] – A dictionary mapping each metadata field name (without "meta." prefix) to the count of its unique values among the filtered documents.
get_metadata_fields_info_async
Returns information about the metadata fields present in the stored documents.
Types are inferred from the stored values (keyword, int, float, boolean).
Returns:
dict[str, dict[str, str]] – A dictionary mapping each metadata field name to a dict with a "type" key.
get_metadata_field_min_max_async
Returns the minimum and maximum values for the given metadata field across all documents.
Parameters:
- metadata_field (str) – The metadata field name. Can include or omit the "meta." prefix.
Returns:
dict[str, Any] – A dictionary with "min" and "max" keys. Returns {"min": None, "max": None} if the field is missing or has no values.
get_metadata_field_unique_values_async
get_metadata_field_unique_values_async(
metadata_field: str, search_term: str | None = None
) -> tuple[list[str], int]
Returns unique values for a metadata field, optionally filtered by a search term in content.
Parameters:
- metadata_field (str) – The metadata field name. Can include or omit the "meta." prefix.
- search_term (str | None) – If set, only documents whose content contains this term (case-insensitive) are considered.
Returns:
tuple[list[str], int] – A tuple of (list of unique values, total count of unique values).
delete_all_documents_async
Deletes all documents in the document store.
bm25_retrieval_async
bm25_retrieval_async(
query: str,
filters: dict[str, Any] | None = None,
top_k: int = 10,
scale_score: bool = False,
) -> list[Document]
Retrieves documents that are most relevant to the query using BM25 algorithm.
Parameters:
- query (str) – The query string.
- filters (dict[str, Any] | None) – A dictionary with filters to narrow down the search space.
- top_k (int) – The number of top documents to retrieve. Default is 10.
- scale_score (bool) – Whether to scale the scores of the retrieved documents. Default is False.
Returns:
list[Document] – A list of the top_k documents most relevant to the query.
embedding_retrieval_async
embedding_retrieval_async(
query_embedding: list[float],
filters: dict[str, Any] | None = None,
top_k: int = 10,
scale_score: bool = False,
return_embedding: bool = False,
) -> list[Document]
Retrieves documents that are most similar to the query embedding using a vector similarity metric.
Parameters:
- query_embedding (list[float]) – Embedding of the query.
- filters (dict[str, Any] | None) – A dictionary with filters to narrow down the search space.
- top_k (int) – The number of top documents to retrieve. Default is 10.
- scale_score (bool) – Whether to scale the scores of the retrieved Documents. Default is False.
- return_embedding (bool) – Whether to return the embedding of the retrieved Documents. Default is False.
Returns:
list[Document] – A list of the top_k documents most relevant to the query.