Version: 2.26-unstable

Document Stores

document_store

BM25DocumentStats

A dataclass for managing document statistics for BM25 retrieval.

Parameters:

freq_token (dict[str, int]) – A Counter of token frequencies in the document.
doc_len (int) – Number of tokens in the document.

InMemoryDocumentStore

Stores data in-memory. It's ephemeral and cannot be saved to disk.

init

python

__init__(
    bm25_tokenization_regex: str = "(?u)\\b\\w\\w+\\b",
    bm25_algorithm: Literal["BM25Okapi", "BM25L", "BM25Plus"] = "BM25L",
    bm25_parameters: dict | None = None,
    embedding_similarity_function: Literal[
        "dot_product", "cosine"
    ] = "dot_product",
    index: str | None = None,
    async_executor: ThreadPoolExecutor | None = None,
    return_embedding: bool = True,
)

Initializes the DocumentStore.

Parameters:

bm25_tokenization_regex (str) – The regular expression used to tokenize the text for BM25 retrieval.
bm25_algorithm (Literal['BM25Okapi', 'BM25L', 'BM25Plus']) – The BM25 algorithm to use. One of "BM25Okapi", "BM25L", or "BM25Plus".
bm25_parameters (dict | None) – Parameters for BM25 implementation in a dictionary format. For example: {'k1':1.5, 'b':0.75, 'epsilon':0.25} You can learn more about these parameters by visiting https://github.com/dorianbrown/rank_bm25.
embedding_similarity_function (Literal['dot_product', 'cosine']) – The similarity function used to compare Documents embeddings. One of "dot_product" (default) or "cosine". To choose the most appropriate function, look for information about your embedding model.
index (str | None) – A specific index to store the documents. If not specified, a random UUID is used. Using the same index allows you to store documents across multiple InMemoryDocumentStore instances.
async_executor (ThreadPoolExecutor | None) – Optional ThreadPoolExecutor to use for async calls. If not provided, a single-threaded executor will be initialized and used.
return_embedding (bool) – Whether to return the embedding of the retrieved Documents. Default is True.

shutdown

python

shutdown()

Explicitly shutdown the executor if we own it.

storage

python

storage: dict[str, Document]

Utility property that returns the storage used by this instance of InMemoryDocumentStore.

to_dict

python

to_dict() -> dict[str, Any]

Serializes the component to a dictionary.

Returns:

dict[str, Any] – Dictionary with serialized data.

from_dict

python

from_dict(data: dict[str, Any]) -> InMemoryDocumentStore

Deserializes the component from a dictionary.

Parameters:

data (dict[str, Any]) – The dictionary to deserialize from.

Returns:

InMemoryDocumentStore – The deserialized component.

save_to_disk

python

save_to_disk(path: str) -> None

Write the database and its' data to disk as a JSON file.

Parameters:

path (str) – The path to the JSON file.

load_from_disk

python

load_from_disk(path: str) -> InMemoryDocumentStore

Load the database and its' data from disk as a JSON file.

Parameters:

path (str) – The path to the JSON file.

Returns:

InMemoryDocumentStore – The loaded InMemoryDocumentStore.

count_documents

python

count_documents() -> int

Returns the number of how many documents are present in the DocumentStore.

filter_documents

python

filter_documents(filters: dict[str, Any] | None = None) -> list[Document]

Returns the documents that match the filters provided.

For a detailed specification of the filters, refer to the DocumentStore.filter_documents() protocol documentation.

Parameters:

filters (dict[str, Any] | None) – The filters to apply to the document list.

Returns:

list[Document] – A list of Documents that match the given filters.

write_documents

python

write_documents(
    documents: list[Document], policy: DuplicatePolicy = DuplicatePolicy.NONE
) -> int

Refer to the DocumentStore.write_documents() protocol documentation.

If policy is set to DuplicatePolicy.NONE defaults to DuplicatePolicy.FAIL.

delete_documents

python

delete_documents(document_ids: list[str]) -> None

Deletes all documents with matching document_ids from the DocumentStore.

Parameters:

document_ids (list[str]) – The object_ids to delete.

delete_all_documents

python

delete_all_documents() -> None

Deletes all documents in the document store.

update_by_filter

python

update_by_filter(filters: dict[str, Any], meta: dict[str, Any]) -> int

Updates the metadata of all documents that match the provided filters.

Parameters:

filters (dict[str, Any]) – The filters to apply to select documents for updating. For filter syntax, see filter_documents.
meta (dict[str, Any]) – The metadata fields to update. These will be merged with existing metadata.

Returns:

int – The number of documents updated.

Raises:

ValueError – if filters have invalid syntax.

delete_by_filter

python

delete_by_filter(filters: dict[str, Any]) -> int

Deletes all documents that match the provided filters.

Parameters:

filters (dict[str, Any]) – The filters to apply to select documents for deletion. For filter syntax, see filter_documents.

Returns:

int – The number of documents deleted.

Raises:

ValueError – if filters have invalid syntax.

bm25_retrieval

python

bm25_retrieval(
    query: str,
    filters: dict[str, Any] | None = None,
    top_k: int = 10,
    scale_score: bool = False,
) -> list[Document]

Retrieves documents that are most relevant to the query using BM25 algorithm.

Parameters:

query (str) – The query string.
filters (dict[str, Any] | None) – A dictionary with filters to narrow down the search space.
top_k (int) – The number of top documents to retrieve. Default is 10.
scale_score (bool) – Whether to scale the scores of the retrieved documents. Default is False.

Returns:

list[Document] – A list of the top_k documents most relevant to the query.

embedding_retrieval

python

embedding_retrieval(
    query_embedding: list[float],
    filters: dict[str, Any] | None = None,
    top_k: int = 10,
    scale_score: bool = False,
    return_embedding: bool | None = False,
) -> list[Document]

Retrieves documents that are most similar to the query embedding using a vector similarity metric.

Parameters:

query_embedding (list[float]) – Embedding of the query.
filters (dict[str, Any] | None) – A dictionary with filters to narrow down the search space.
top_k (int) – The number of top documents to retrieve. Default is 10.
scale_score (bool) – Whether to scale the scores of the retrieved Documents. Default is False.
return_embedding (bool | None) – Whether to return the embedding of the retrieved Documents. If not provided, the value of the return_embedding parameter set at component initialization will be used. Default is False.

Returns:

list[Document] – A list of the top_k documents most relevant to the query.

Raises:

ValueError – if filters have invalid syntax.

count_documents_async

python

count_documents_async() -> int

Returns the number of how many documents are present in the DocumentStore.

filter_documents_async

python

filter_documents_async(filters: dict[str, Any] | None = None) -> list[Document]

Returns the documents that match the filters provided.

For a detailed specification of the filters, refer to the DocumentStore.filter_documents() protocol documentation.

Parameters:

filters (dict[str, Any] | None) – The filters to apply to the document list.

Returns:

list[Document] – A list of Documents that match the given filters.

write_documents_async

python

write_documents_async(
    documents: list[Document], policy: DuplicatePolicy = DuplicatePolicy.NONE
) -> int

Refer to the DocumentStore.write_documents() protocol documentation.

If policy is set to DuplicatePolicy.NONE defaults to DuplicatePolicy.FAIL.

delete_documents_async

python

delete_documents_async(document_ids: list[str]) -> None

Deletes all documents with matching document_ids from the DocumentStore.

Parameters:

document_ids (list[str]) – The object_ids to delete.

bm25_retrieval_async

python

bm25_retrieval_async(
    query: str,
    filters: dict[str, Any] | None = None,
    top_k: int = 10,
    scale_score: bool = False,
) -> list[Document]

Retrieves documents that are most relevant to the query using BM25 algorithm.

Parameters:

query (str) – The query string.
filters (dict[str, Any] | None) – A dictionary with filters to narrow down the search space.
top_k (int) – The number of top documents to retrieve. Default is 10.
scale_score (bool) – Whether to scale the scores of the retrieved documents. Default is False.

Returns:

list[Document] – A list of the top_k documents most relevant to the query.

embedding_retrieval_async

python

embedding_retrieval_async(
    query_embedding: list[float],
    filters: dict[str, Any] | None = None,
    top_k: int = 10,
    scale_score: bool = False,
    return_embedding: bool = False,
) -> list[Document]

Retrieves documents that are most similar to the query embedding using a vector similarity metric.

Parameters:

query_embedding (list[float]) – Embedding of the query.
filters (dict[str, Any] | None) – A dictionary with filters to narrow down the search space.
top_k (int) – The number of top documents to retrieve. Default is 10.
scale_score (bool) – Whether to scale the scores of the retrieved Documents. Default is False.
return_embedding (bool) – Whether to return the embedding of the retrieved Documents. Default is False.

Returns:

list[Document] – A list of the top_k documents most relevant to the query.

document_store​

BM25DocumentStats​

InMemoryDocumentStore​

init​

shutdown​

storage​

to_dict​

from_dict​

save_to_disk​

load_from_disk​

count_documents​

filter_documents​

write_documents​

delete_documents​

delete_all_documents​

update_by_filter​

delete_by_filter​

bm25_retrieval​

embedding_retrieval​

count_documents_async​

filter_documents_async​

write_documents_async​

delete_documents_async​

bm25_retrieval_async​

embedding_retrieval_async​

document_store

BM25DocumentStats

InMemoryDocumentStore

init

shutdown

storage

to_dict

from_dict

save_to_disk

load_from_disk

count_documents

filter_documents

write_documents

delete_documents

delete_all_documents

update_by_filter

delete_by_filter

bm25_retrieval

embedding_retrieval

count_documents_async

filter_documents_async

write_documents_async

delete_documents_async

bm25_retrieval_async

embedding_retrieval_async