Module document_store

BM25DocumentStats

A dataclass for managing document statistics for BM25 retrieval.

Arguments:

freq_token: A Counter of token frequencies in the document.
doc_len: Number of tokens in the document.
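As a rough illustration of what these two fields hold, the sketch below builds them for one short text. `DocStatsSketch` and `build_stats` are hypothetical names, and the lowercase-and-split tokenization is a simplifying assumption rather than the store's own tokenizer.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class DocStatsSketch:
    """Hypothetical stand-in with the two documented fields."""

    freq_token: Counter  # token -> number of occurrences in the document
    doc_len: int  # total number of tokens in the document


def build_stats(text: str) -> DocStatsSketch:
    # Assumption: lowercase and split on whitespace to get tokens.
    tokens = text.lower().split()
    return DocStatsSketch(freq_token=Counter(tokens), doc_len=len(tokens))


stats = build_stats("The cat sat on the mat")
print(stats.freq_token["the"], stats.doc_len)
```

Keeping the per-document counts and length precomputed means a BM25 scorer does not need to re-tokenize every document for each query.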

InMemoryDocumentStore

Stores data in memory. The store is ephemeral: its contents are lost when the process exits unless you write them out with save_to_disk.

InMemoryDocumentStore.__init__

def __init__(bm25_tokenization_regex: str = r"(?u)\b\w\w+\b",
             bm25_algorithm: Literal["BM25Okapi", "BM25L",
                                     "BM25Plus"] = "BM25L",
             bm25_parameters: Optional[Dict] = None,
             embedding_similarity_function: Literal["dot_product",
                                                    "cosine"] = "dot_product",
             index: Optional[str] = None,
             async_executor: Optional[ThreadPoolExecutor] = None,
             return_embedding: bool = True)

Initializes the DocumentStore.

Arguments:

bm25_tokenization_regex: The regular expression used to tokenize the text for BM25 retrieval.
bm25_algorithm: The BM25 algorithm to use. One of "BM25Okapi", "BM25L", or "BM25Plus".
bm25_parameters: Parameters for the BM25 implementation, as a dictionary. For example: {'k1': 1.5, 'b': 0.75, 'epsilon': 0.25}. To learn more about these parameters, see https://github.com/dorianbrown/rank_bm25.
embedding_similarity_function: The similarity function used to compare Document embeddings. One of "dot_product" (default) or "cosine". To choose the most appropriate function, check the documentation of your embedding model.
index: A specific index to store the documents. If not specified, a random UUID is used. Using the same index allows you to store documents across multiple InMemoryDocumentStore instances.
async_executor: Optional ThreadPoolExecutor to use for async calls. If not provided, a single-threaded executor will be initialized and used.
return_embedding: Whether to return the embedding of the retrieved Documents. Default is True.
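To see what the default bm25_tokenization_regex keeps and drops, it can be applied directly with Python's `re` module. The `tokenize` helper below is a hypothetical name used only for this illustration; how the store applies the pattern internally (for example, any lowercasing) is not shown here.

```python
import re

# The default pattern from the signature above: runs of two or more word characters.
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"


def tokenize(text: str, pattern: str = DEFAULT_PATTERN) -> list:
    # Hypothetical helper: returns every non-overlapping match of the pattern.
    return re.findall(pattern, text)


tokens = tokenize("A BM25 index is a bag of words, e.g. caf\u00e9")
print(tokens)
```

Single-character tokens such as "A", "a", "e" and "g" do not match, while two-character words like "is" and "of" and non-ASCII words are kept.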

InMemoryDocumentStore.__del__

def __del__()

Cleanup when the instance is being destroyed.

InMemoryDocumentStore.shutdown

def shutdown()

Explicitly shuts down the executor if this instance owns it.
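The ownership rule can be sketched as follows: an executor created by the instance is shut down by it, while one passed in by the caller is left alone. `ExecutorOwnerSketch` and its attributes are hypothetical names for illustration, and running the synchronous method through `run_in_executor` is an assumption about how the async variants could be built, not a description of the library's internals.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor
from typing import Optional


class ExecutorOwnerSketch:
    """Hypothetical class showing the 'shut down only what you own' pattern."""

    def __init__(self, async_executor: Optional[ThreadPoolExecutor] = None):
        # We own the executor only if the caller did not supply one.
        self._owns_executor = async_executor is None
        self.executor = async_executor or ThreadPoolExecutor(max_workers=1)
        self._items = {"a": 1, "b": 2}

    def count(self) -> int:
        return len(self._items)

    async def count_async(self) -> int:
        # Run the synchronous method on the executor without blocking the event loop.
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(self.executor, self.count)

    def shutdown(self) -> None:
        if self._owns_executor:
            self.executor.shutdown(wait=True)


owner = ExecutorOwnerSketch()
result = asyncio.run(owner.count_async())
owner.shutdown()  # shuts down the executor created in __init__

shared = ThreadPoolExecutor(max_workers=2)
borrower = ExecutorOwnerSketch(async_executor=shared)
borrower.shutdown()  # no-op: the caller still owns `shared`
still_usable = shared.submit(lambda: "ok").result()
shared.shutdown()
```

With this pattern, a caller who shares one executor across several objects stays responsible for shutting it down.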

InMemoryDocumentStore.storage

@property
def storage() -> Dict[str, Document]

Utility property that returns the storage used by this instance of InMemoryDocumentStore.

InMemoryDocumentStore.to_dict

def to_dict() -> Dict[str, Any]

Serializes the component to a dictionary.

Returns:

Dictionary with serialized data.

InMemoryDocumentStore.from_dict

@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "InMemoryDocumentStore"

Deserializes the component from a dictionary.

Arguments:

data: The dictionary to deserialize from.

Returns:

The deserialized component.

InMemoryDocumentStore.save_to_disk

def save_to_disk(path: str) -> None

Write the database and its data to disk as a JSON file.

Arguments:

path: The path to the JSON file.

InMemoryDocumentStore.load_from_disk

@classmethod
def load_from_disk(cls, path: str) -> "InMemoryDocumentStore"

Load the database and its data from a JSON file on disk.

Arguments:

path: The path to the JSON file.

Returns:

The loaded InMemoryDocumentStore.
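The save and load pair amounts to a JSON round trip. The sketch below shows that idea with the standard library only; the `{"documents": [...]}` layout and the `save_sketch` and `load_sketch` helpers are assumptions made for illustration, not the file format the store actually writes.

```python
import json
import os
import tempfile
from typing import Dict


def save_sketch(documents: Dict[str, dict], path: str) -> None:
    # Assumed layout: one JSON object holding the list of documents.
    with open(path, "w", encoding="utf-8") as handle:
        json.dump({"documents": list(documents.values())}, handle)


def load_sketch(path: str) -> Dict[str, dict]:
    with open(path, "r", encoding="utf-8") as handle:
        payload = json.load(handle)
    # Rebuild the id -> document mapping from the stored list.
    return {doc["id"]: doc for doc in payload["documents"]}


original = {
    "1": {"id": "1", "content": "hello"},
    "2": {"id": "2", "content": "world"},
}
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "store.json")
    save_sketch(original, path)
    restored = load_sketch(path)
```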

InMemoryDocumentStore.count_documents

def count_documents() -> int

Returns the number of documents present in the DocumentStore.

InMemoryDocumentStore.filter_documents

def filter_documents(
        filters: Optional[Dict[str, Any]] = None) -> List[Document]

Returns the documents that match the filters provided.

For a detailed specification of the filters, refer to the DocumentStore.filter_documents() protocol documentation.

Arguments:

filters: The filters to apply to the document list.

Returns:

A list of Documents that match the given filters.

InMemoryDocumentStore.write_documents

def write_documents(documents: List[Document],
                    policy: DuplicatePolicy = DuplicatePolicy.NONE) -> int

Refer to the DocumentStore.write_documents() protocol documentation.

If policy is set to DuplicatePolicy.NONE, it defaults to DuplicatePolicy.FAIL.
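The fallback from NONE to FAIL can be pictured with a small dictionary-backed sketch. `PolicySketch` and `write_sketch` are hypothetical stand-ins; the SKIP and OVERWRITE members mirror the other values of DuplicatePolicy, and raising `ValueError` on a duplicate is a simplification chosen for this example.

```python
from enum import Enum
from typing import Dict, List


class PolicySketch(Enum):
    """Hypothetical stand-in for DuplicatePolicy."""

    NONE = "none"
    SKIP = "skip"
    OVERWRITE = "overwrite"
    FAIL = "fail"


def write_sketch(
    store: Dict[str, dict],
    docs: List[dict],
    policy: PolicySketch = PolicySketch.NONE,
) -> int:
    # NONE means "no preference", which this sketch resolves to FAIL.
    if policy is PolicySketch.NONE:
        policy = PolicySketch.FAIL

    written = 0
    for doc in docs:
        if doc["id"] in store:
            if policy is PolicySketch.FAIL:
                raise ValueError(f"Document {doc['id']!r} already exists")
            if policy is PolicySketch.SKIP:
                continue
            # OVERWRITE falls through and replaces the stored document.
        store[doc["id"]] = doc
        written += 1
    return written


store: Dict[str, dict] = {}
first = write_sketch(store, [{"id": "1", "content": "a"}])
skipped = write_sketch(store, [{"id": "1", "content": "b"}], PolicySketch.SKIP)
replaced = write_sketch(store, [{"id": "1", "content": "c"}], PolicySketch.OVERWRITE)
```

In this sketch the return value counts only the documents actually stored, so a skipped duplicate contributes nothing to it.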

InMemoryDocumentStore.delete_documents

def delete_documents(document_ids: List[str]) -> None

Deletes all documents with matching document_ids from the DocumentStore.

Arguments:

document_ids: The IDs of the documents to delete.

InMemoryDocumentStore.bm25_retrieval

def bm25_retrieval(query: str,
                   filters: Optional[Dict[str, Any]] = None,
                   top_k: int = 10,
                   scale_score: bool = False) -> List[Document]

Retrieves the documents most relevant to the query using the BM25 algorithm.

Arguments:

query: The query string.
filters: A dictionary with filters to narrow down the search space.
top_k: The number of top documents to retrieve. Default is 10.
scale_score: Whether to scale the scores of the retrieved documents. Default is False.

Returns:

A list of the top_k documents most relevant to the query.
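For intuition about how BM25 ranks documents, here is a minimal pure-Python sketch of the Okapi variant with a non-negative IDF. It is not the store's implementation (the default algorithm above is BM25L); the function names, the toy corpus, and the pre-tokenized input are all assumptions for illustration.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def bm25_okapi_scores(
    query_tokens: List[str],
    corpus: Dict[str, List[str]],
    k1: float = 1.5,
    b: float = 0.75,
) -> Dict[str, float]:
    n_docs = len(corpus)
    avg_len = sum(len(tokens) for tokens in corpus.values()) / n_docs
    freqs = {doc_id: Counter(tokens) for doc_id, tokens in corpus.items()}

    scores: Dict[str, float] = {}
    for doc_id, tokens in corpus.items():
        score = 0.0
        for term in query_tokens:
            # Document frequency: how many documents contain the term.
            df = sum(1 for counts in freqs.values() if term in counts)
            if df == 0:
                continue
            idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
            tf = freqs[doc_id][term]
            # Length normalization: longer-than-average documents are penalized.
            norm = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores[doc_id] = score
    return scores


def top_k_docs(scores: Dict[str, float], k: int = 10) -> List[Tuple[str, float]]:
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]


corpus = {
    "d1": ["cats", "chase", "mice"],
    "d2": ["dogs", "chase", "cats", "and", "cats"],
    "d3": ["birds", "sing"],
}
scores = bm25_okapi_scores(["cats"], corpus)
ranking = top_k_docs(scores, k=2)
```

Here "d2" ranks above "d1" because it mentions the query term twice, even after the length penalty, and "d3" scores zero because it never mentions it.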

InMemoryDocumentStore.embedding_retrieval

def embedding_retrieval(
        query_embedding: List[float],
        filters: Optional[Dict[str, Any]] = None,
        top_k: int = 10,
        scale_score: bool = False,
        return_embedding: Optional[bool] = False) -> List[Document]

Retrieves documents that are most similar to the query embedding using a vector similarity metric.

Arguments:

query_embedding: Embedding of the query.
filters: A dictionary with filters to narrow down the search space.
top_k: The number of top documents to retrieve. Default is 10.
scale_score: Whether to scale the scores of the retrieved Documents. Default is False.
return_embedding: Whether to return the embedding of the retrieved Documents. If not provided, the value of the return_embedding parameter set at component initialization will be used. Default is False.

Returns:

A list of the top_k documents most relevant to the query.
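The choice between "dot_product" and "cosine" changes the ranking when embeddings are not normalized. The sketch below makes that concrete with two-dimensional toy vectors; `rank` and the helper functions are hypothetical names, not part of the store's API.

```python
import math
from typing import Dict, List, Tuple


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine(a: List[float], b: List[float]) -> float:
    # Dot product divided by the product of the vector norms.
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def rank(
    query: List[float],
    embeddings: Dict[str, List[float]],
    similarity: str = "dot_product",
    top_k: int = 10,
) -> List[Tuple[str, float]]:
    fn = dot if similarity == "dot_product" else cosine
    scored = [(doc_id, fn(query, emb)) for doc_id, emb in embeddings.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_k]


embeddings = {"long": [3.0, 3.0], "aligned": [1.0, 0.0]}
query = [1.0, 0.0]

by_dot = rank(query, embeddings, "dot_product")
by_cosine = rank(query, embeddings, "cosine")
```

The dot product favors "long" because of its larger magnitude, while cosine similarity favors "aligned" because it points in exactly the query's direction.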

InMemoryDocumentStore.count_documents_async

async def count_documents_async() -> int

Returns the number of documents present in the DocumentStore.

InMemoryDocumentStore.filter_documents_async

async def filter_documents_async(
        filters: Optional[Dict[str, Any]] = None) -> List[Document]

Returns the documents that match the filters provided.

For a detailed specification of the filters, refer to the DocumentStore.filter_documents() protocol documentation.

Arguments:

filters: The filters to apply to the document list.

Returns:

A list of Documents that match the given filters.

InMemoryDocumentStore.write_documents_async

async def write_documents_async(
        documents: List[Document],
        policy: DuplicatePolicy = DuplicatePolicy.NONE) -> int

Refer to the DocumentStore.write_documents() protocol documentation.

If policy is set to DuplicatePolicy.NONE, it defaults to DuplicatePolicy.FAIL.

InMemoryDocumentStore.delete_documents_async

async def delete_documents_async(document_ids: List[str]) -> None

Deletes all documents with matching document_ids from the DocumentStore.

Arguments:

document_ids: The IDs of the documents to delete.

InMemoryDocumentStore.bm25_retrieval_async

async def bm25_retrieval_async(query: str,
                               filters: Optional[Dict[str, Any]] = None,
                               top_k: int = 10,
                               scale_score: bool = False) -> List[Document]

Retrieves the documents most relevant to the query using the BM25 algorithm.

Arguments:

query: The query string.
filters: A dictionary with filters to narrow down the search space.
top_k: The number of top documents to retrieve. Default is 10.
scale_score: Whether to scale the scores of the retrieved documents. Default is False.

Returns:

A list of the top_k documents most relevant to the query.

InMemoryDocumentStore.embedding_retrieval_async

async def embedding_retrieval_async(
        query_embedding: List[float],
        filters: Optional[Dict[str, Any]] = None,
        top_k: int = 10,
        scale_score: bool = False,
        return_embedding: bool = False) -> List[Document]

Retrieves documents that are most similar to the query embedding using a vector similarity metric.

Arguments:

query_embedding: Embedding of the query.
filters: A dictionary with filters to narrow down the search space.
top_k: The number of top documents to retrieve. Default is 10.
scale_score: Whether to scale the scores of the retrieved Documents. Default is False.
return_embedding: Whether to return the embedding of the retrieved Documents. Default is False.

Returns:

A list of the top_k documents most relevant to the query.
