Document Stores
document_store
BM25DocumentStats
A dataclass for managing document statistics for BM25 retrieval.
Parameters:
- freq_token (dict[str, int]) – A Counter of token frequencies in the document.
- doc_len (int) – Number of tokens in the document.
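As a rough illustration of what these two fields hold, the sketch below tokenizes a text with the default pattern from the InMemoryDocumentStore signature and counts the result. `DocStats` and `compute_stats` are hypothetical names for this example, and the lowercasing step is an assumption rather than documented behavior.

```python
import re
from collections import Counter
from dataclasses import dataclass

# Default pattern from the InMemoryDocumentStore signature: words of two or more characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


@dataclass
class DocStats:
    """Hypothetical stand-in for BM25DocumentStats, for illustration only."""
    freq_token: dict[str, int]
    doc_len: int


def compute_stats(text: str) -> DocStats:
    # Lowercasing is an assumption of this sketch, not something the reference states.
    tokens = TOKEN_PATTERN.findall(text.lower())
    return DocStats(freq_token=Counter(tokens), doc_len=len(tokens))


stats = compute_stats("A cat sat on the mat. The cat slept.")
print(stats.freq_token["cat"], stats.doc_len)
```

Single-character tokens such as "A" do not match the pattern, so they count toward neither field.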
InMemoryDocumentStore
Stores data in memory. The store is ephemeral: its contents live only as long as the process, unless you write them out explicitly with save_to_disk.
init
__init__(
bm25_tokenization_regex: str = "(?u)\\b\\w\\w+\\b",
bm25_algorithm: Literal["BM25Okapi", "BM25L", "BM25Plus"] = "BM25L",
bm25_parameters: dict | None = None,
embedding_similarity_function: Literal[
"dot_product", "cosine"
] = "dot_product",
index: str | None = None,
async_executor: ThreadPoolExecutor | None = None,
return_embedding: bool = True,
)
Initializes the DocumentStore.
Parameters:
- bm25_tokenization_regex (str) – The regular expression used to tokenize the text for BM25 retrieval.
- bm25_algorithm (Literal['BM25Okapi', 'BM25L', 'BM25Plus']) – The BM25 algorithm to use. One of "BM25Okapi", "BM25L", or "BM25Plus".
- bm25_parameters (dict | None) – Parameters for the BM25 implementation, as a dictionary. For example: {'k1':1.5, 'b':0.75, 'epsilon':0.25}. To learn more about these parameters, see https://github.com/dorianbrown/rank_bm25.
- embedding_similarity_function (Literal['dot_product', 'cosine']) – The similarity function used to compare Document embeddings. One of "dot_product" (default) or "cosine". To choose the most appropriate function, check the documentation of your embedding model.
- index (str | None) – A specific index to store the documents in. If not specified, a random UUID is used. Using the same index allows you to store documents across multiple InMemoryDocumentStore instances.
- async_executor (ThreadPoolExecutor | None) – Optional ThreadPoolExecutor to use for async calls. If not provided, a single-threaded executor is created and used.
- return_embedding (bool) – Whether to return the embedding of the retrieved Documents. Default is True.
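To make the choice between "dot_product" and "cosine" concrete, here is a minimal sketch of the two measures in plain Python. The helper names are hypothetical and this is not the store's own implementation. It only shows that dot product rewards vector magnitude while cosine compares direction alone.

```python
import math


def dot_product(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine(a: list[float], b: list[float]) -> float:
    # Dot product divided by the product of the two vector lengths.
    norm = math.sqrt(dot_product(a, a)) * math.sqrt(dot_product(b, b))
    return dot_product(a, b) / norm if norm else 0.0


query = [1.0, 0.0]
short_doc = [1.0, 0.0]   # same direction as the query
long_doc = [3.0, 3.0]    # larger magnitude, different direction

print(dot_product(query, short_doc), dot_product(query, long_doc))
print(cosine(query, short_doc), cosine(query, long_doc))
```

With dot product the longer vector ranks first, while with cosine the vector pointing the same way as the query does. For embedding models that emit unit-length vectors the two measures produce the same ranking.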
shutdown
Explicitly shuts down the executor if this instance owns it.
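The ownership rule can be sketched as follows: an executor is shut down by the store only if the store created it. `TinyStore` is a hypothetical class, and running the synchronous method through `run_in_executor` is an assumption about how async variants could be built, not a description of the library's internals.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor
from typing import Optional


class TinyStore:
    """Hypothetical store used only to illustrate executor ownership."""

    def __init__(self, async_executor: Optional[ThreadPoolExecutor] = None):
        self._docs = {}
        # Own the executor only when the caller did not supply one.
        self._owns_executor = async_executor is None
        self.executor = async_executor or ThreadPoolExecutor(max_workers=1)

    def count_documents(self) -> int:
        return len(self._docs)

    async def count_documents_async(self) -> int:
        # Run the synchronous method on the executor so the event loop stays free.
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(self.executor, self.count_documents)

    def shutdown(self) -> None:
        # A caller-supplied executor is left alone; its owner decides when to stop it.
        if self._owns_executor:
            self.executor.shutdown(wait=True)


store = TinyStore()
print(asyncio.run(store.count_documents_async()))
store.shutdown()
```

Passing a shared executor lets several stores reuse one thread pool, and the caller then remains responsible for shutting it down.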
storage
Utility property that returns the storage used by this instance of InMemoryDocumentStore.
to_dict
Serializes the component to a dictionary.
Returns:
dict[str, Any] – Dictionary with serialized data.
from_dict
Deserializes the component from a dictionary.
Parameters:
- data (dict[str, Any]) – The dictionary to deserialize from.
Returns:
InMemoryDocumentStore – The deserialized component.
save_to_disk
Writes the database and its data to disk as a JSON file.
Parameters:
- path (str) – The path to the JSON file.
load_from_disk
Loads the database and its data from a JSON file on disk.
Parameters:
- path (str) – The path to the JSON file.
Returns:
InMemoryDocumentStore – The loaded InMemoryDocumentStore.
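The save and load pair amounts to a JSON round trip. The sketch below shows the idea with standalone functions. The file layout (a top-level "documents" key) and the document fields are assumptions made for this example and do not describe the actual file format.

```python
import json
import os
import tempfile


def save_to_disk(docs: list[dict], path: str) -> None:
    # The on-disk layout used here is an assumption of this sketch.
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"documents": docs}, f)


def load_from_disk(path: str) -> list[dict]:
    with open(path, encoding="utf-8") as f:
        return json.load(f)["documents"]


docs = [{"id": "1", "content": "hello", "meta": {"lang": "en"}}]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "store.json")
    save_to_disk(docs, path)
    restored = load_from_disk(path)

print(restored == docs)
```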
count_documents
Returns the number of documents present in the DocumentStore.
filter_documents
Returns the documents that match the filters provided.
For a detailed specification of the filters, refer to the DocumentStore.filter_documents() protocol documentation.
Parameters:
- filters (dict[str, Any] | None) – The filters to apply to the document list.
Returns:
list[Document] – A list of Documents that match the given filters.
write_documents
write_documents(
documents: list[Document], policy: DuplicatePolicy = DuplicatePolicy.NONE
) -> int
Refer to the DocumentStore.write_documents() protocol documentation.
If policy is set to DuplicatePolicy.NONE, it defaults to DuplicatePolicy.FAIL.
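The fallback from NONE to FAIL can be sketched with a small stand-in. `Policy` is a hypothetical enum for this example: only NONE and FAIL are named in the text above, SKIP and OVERWRITE are included as assumptions, and `ValueError` stands in for whatever error the real store raises.

```python
from enum import Enum


class Policy(Enum):
    """Hypothetical mirror of DuplicatePolicy, for illustration only."""
    NONE = "none"
    SKIP = "skip"            # assumed member
    OVERWRITE = "overwrite"  # assumed member
    FAIL = "fail"


def write_documents(storage: dict[str, str], docs: dict[str, str], policy: Policy = Policy.NONE) -> int:
    # NONE falls back to FAIL, as described above.
    if policy is Policy.NONE:
        policy = Policy.FAIL
    written = 0
    for doc_id, content in docs.items():
        if doc_id in storage:
            if policy is Policy.FAIL:
                raise ValueError(f"Document {doc_id!r} already exists")
            if policy is Policy.SKIP:
                continue
        storage[doc_id] = content
        written += 1
    return written


storage = {"a": "old"}
print(write_documents(storage, {"a": "new", "b": "x"}, Policy.SKIP))
print(write_documents(storage, {"a": "new"}, Policy.OVERWRITE))
```

In this sketch the return value counts only the documents actually written, so skipped duplicates are not included.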
delete_documents
Deletes all documents with matching document_ids from the DocumentStore.
Parameters:
- document_ids (list[str]) – The IDs of the documents to delete.
delete_all_documents
Deletes all documents in the document store.
update_by_filter
Updates the metadata of all documents that match the provided filters.
Parameters:
- filters (dict[str, Any]) – The filters to apply to select documents for updating. For filter syntax, see filter_documents.
- meta (dict[str, Any]) – The metadata fields to update. These will be merged with existing metadata.
Returns:
int – The number of documents updated.
Raises:
ValueError – If the filters have invalid syntax.
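What "merged with existing metadata" means in practice can be sketched as below. The example uses a plain predicate in place of the real filter syntax, `update_by_predicate` is a hypothetical helper, and letting new values overwrite existing keys on conflict is an assumption of this sketch.

```python
from typing import Any, Callable


def update_by_predicate(
    docs: list[dict[str, Any]],
    matches: Callable[[dict[str, Any]], bool],
    meta: dict[str, Any],
) -> int:
    updated = 0
    for doc in docs:
        if matches(doc):
            # Merge: new keys are added, conflicting keys take the new value, the rest are kept.
            doc["meta"] = {**doc["meta"], **meta}
            updated += 1
    return updated


docs = [
    {"id": "1", "meta": {"lang": "en", "status": "draft"}},
    {"id": "2", "meta": {"lang": "de", "status": "draft"}},
]
count = update_by_predicate(
    docs,
    lambda d: d["meta"]["lang"] == "en",
    {"status": "final", "reviewed": True},
)
print(count, docs[0]["meta"])
```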
delete_by_filter
Deletes all documents that match the provided filters.
Parameters:
- filters (dict[str, Any]) – The filters to apply to select documents for deletion. For filter syntax, see filter_documents.
Returns:
int – The number of documents deleted.
Raises:
ValueError – If the filters have invalid syntax.
bm25_retrieval
bm25_retrieval(
query: str,
filters: dict[str, Any] | None = None,
top_k: int = 10,
scale_score: bool = False,
) -> list[Document]
Retrieves the documents that are most relevant to the query using the BM25 algorithm.
Parameters:
- query (str) – The query string.
- filters (dict[str, Any] | None) – A dictionary with filters to narrow down the search space.
- top_k (int) – The number of top documents to retrieve. Default is 10.
- scale_score (bool) – Whether to scale the scores of the retrieved documents. Default is False.
Returns:
list[Document] – A list of the top_k documents most relevant to the query.
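For intuition about BM25 ranking and top_k truncation, here is a minimal sketch of the classic Okapi-style score with a non-negative IDF. It is not the store's implementation and not the BM25L variant used by default. `bm25_top_k` is a hypothetical function, and the values of `k1` and `b` are taken from the example parameter dictionary in the constructor documentation.

```python
import math
import re
from collections import Counter

TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(text: str) -> list[str]:
    return TOKEN_PATTERN.findall(text.lower())


def bm25_top_k(query: str, corpus: list[str], top_k: int = 10,
               k1: float = 1.5, b: float = 0.75) -> list[tuple[float, str]]:
    if not corpus:
        return []
    docs = [tokenize(text) for text in corpus]
    avg_len = sum(len(d) for d in docs) / len(docs)
    # Document frequency: the number of documents in which each token occurs.
    df = Counter(token for d in docs for token in set(d))
    scored = []
    for text, tokens in zip(corpus, docs):
        freqs = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            f = freqs[term]
            if f == 0:
                continue
            idf = math.log((len(docs) - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # Term frequency saturates with k1 and is normalized by document length via b.
            score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(tokens) / avg_len))
        scored.append((score, text))
    # Highest score first, truncated to top_k.
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]


corpus = ["the cat sat on the mat", "dogs chase cats", "stock markets fell today"]
for score, text in bm25_top_k("cat mat", corpus, top_k=2):
    print(f"{score:.3f}  {text}")
```

Matching is exact at the token level in this sketch, so "cats" does not match the query term "cat".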
embedding_retrieval
embedding_retrieval(
query_embedding: list[float],
filters: dict[str, Any] | None = None,
top_k: int = 10,
scale_score: bool = False,
return_embedding: bool | None = False,
) -> list[Document]
Retrieves documents that are most similar to the query embedding using a vector similarity metric.
Parameters:
- query_embedding (list[float]) – Embedding of the query.
- filters (dict[str, Any] | None) – A dictionary with filters to narrow down the search space.
- top_k (int) – The number of top documents to retrieve. Default is 10.
- scale_score (bool) – Whether to scale the scores of the retrieved Documents. Default is False.
- return_embedding (bool | None) – Whether to return the embedding of the retrieved Documents. If not provided, the value of the return_embedding parameter set at component initialization will be used. Default is False.
Returns:
list[Document] – A list of the top_k documents most relevant to the query.
Raises:
ValueError – If the filters have invalid syntax.
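The interaction between top_k and the return_embedding fallback can be sketched as follows. `TinyVectorStore` is a hypothetical class, documents are plain dictionaries, and treating None as "use the value chosen at initialization" follows the parameter description above rather than any inspected implementation.

```python
import heapq
from typing import Optional


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


class TinyVectorStore:
    """Hypothetical store illustrating top_k ranking and the return_embedding fallback."""

    def __init__(self, return_embedding: bool = True):
        self.return_embedding = return_embedding
        self.docs = []

    def embedding_retrieval(self, query_embedding: list[float], top_k: int = 10,
                            return_embedding: Optional[bool] = None) -> list[dict]:
        # None means "fall back to the value chosen at initialization".
        keep = self.return_embedding if return_embedding is None else return_embedding
        best = heapq.nlargest(top_k, self.docs, key=lambda d: dot(query_embedding, d["embedding"]))
        return [{"id": d["id"], "embedding": d["embedding"] if keep else None} for d in best]


store = TinyVectorStore(return_embedding=True)
store.docs = [
    {"id": "a", "embedding": [1.0, 0.0]},
    {"id": "b", "embedding": [0.0, 1.0]},
    {"id": "c", "embedding": [0.7, 0.7]},
]
hits = store.embedding_retrieval([1.0, 0.0], top_k=2, return_embedding=None)
print([h["id"] for h in hits])
```

Passing `return_embedding=False` explicitly overrides the instance setting and returns the same documents without their vectors.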
count_documents_async
Returns the number of documents present in the DocumentStore.
filter_documents_async
Returns the documents that match the filters provided.
For a detailed specification of the filters, refer to the DocumentStore.filter_documents() protocol documentation.
Parameters:
- filters (dict[str, Any] | None) – The filters to apply to the document list.
Returns:
list[Document] – A list of Documents that match the given filters.
write_documents_async
write_documents_async(
documents: list[Document], policy: DuplicatePolicy = DuplicatePolicy.NONE
) -> int
Refer to the DocumentStore.write_documents() protocol documentation.
If policy is set to DuplicatePolicy.NONE, it defaults to DuplicatePolicy.FAIL.
delete_documents_async
Deletes all documents with matching document_ids from the DocumentStore.
Parameters:
- document_ids (list[str]) – The IDs of the documents to delete.
bm25_retrieval_async
bm25_retrieval_async(
query: str,
filters: dict[str, Any] | None = None,
top_k: int = 10,
scale_score: bool = False,
) -> list[Document]
Retrieves the documents that are most relevant to the query using the BM25 algorithm.
Parameters:
- query (str) – The query string.
- filters (dict[str, Any] | None) – A dictionary with filters to narrow down the search space.
- top_k (int) – The number of top documents to retrieve. Default is 10.
- scale_score (bool) – Whether to scale the scores of the retrieved documents. Default is False.
Returns:
list[Document] – A list of the top_k documents most relevant to the query.
embedding_retrieval_async
embedding_retrieval_async(
query_embedding: list[float],
filters: dict[str, Any] | None = None,
top_k: int = 10,
scale_score: bool = False,
return_embedding: bool = False,
) -> list[Document]
Retrieves documents that are most similar to the query embedding using a vector similarity metric.
Parameters:
- query_embedding (list[float]) – Embedding of the query.
- filters (dict[str, Any] | None) – A dictionary with filters to narrow down the search space.
- top_k (int) – The number of top documents to retrieve. Default is 10.
- scale_score (bool) – Whether to scale the scores of the retrieved Documents. Default is False.
- return_embedding (bool) – Whether to return the embedding of the retrieved Documents. Default is False.
Returns:
list[Document] – A list of the top_k documents most relevant to the query.