Stores your texts and metadata and provides them to the Retriever at query time.
Module document_store
BM25DocumentStats
A dataclass for managing document statistics for BM25 retrieval.
Arguments:
freq_token
: A Counter of token frequencies in the document.
doc_len
: Number of tokens in the document.
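To make the two fields concrete, here is a minimal sketch of how such statistics could be computed for one document. `DocStats` and `build_stats` are hypothetical names used for illustration, and whitespace splitting with lowercasing is an assumption, not the tokenization the store itself uses.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class DocStats:
    """Hypothetical stand-in mirroring the two fields described above."""

    freq_token: Counter  # token -> number of occurrences in the document
    doc_len: int         # total number of tokens in the document


def build_stats(text: str) -> DocStats:
    # Assumption: lowercase and split on whitespace for simplicity.
    tokens = text.lower().split()
    return DocStats(freq_token=Counter(tokens), doc_len=len(tokens))


stats = build_stats("the cat sat on the mat")
print(stats.doc_len, stats.freq_token["the"])
```

Keeping the per-document counts and length precomputed means a BM25 scorer does not need to re-tokenize every document for each query.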
InMemoryDocumentStore
Stores data in-memory. It's ephemeral and cannot be saved to disk.
InMemoryDocumentStore.__init__
def __init__(bm25_tokenization_regex: str = r"(?u)\b\w\w+\b",
bm25_algorithm: Literal["BM25Okapi", "BM25L",
"BM25Plus"] = "BM25L",
bm25_parameters: Optional[Dict] = None,
embedding_similarity_function: Literal["dot_product",
"cosine"] = "dot_product")
Initializes the DocumentStore.
Arguments:
bm25_tokenization_regex
: The regular expression used to tokenize the text for BM25 retrieval.
bm25_algorithm
: The BM25 algorithm to use. One of "BM25Okapi", "BM25L", or "BM25Plus".
bm25_parameters
: Parameters for the BM25 implementation, given as a dictionary. For example: {'k1':1.5, 'b':0.75, 'epsilon':0.25}. You can learn more about these parameters at https://github.com/dorianbrown/rank_bm25. By default, no parameters are set.
embedding_similarity_function
: The similarity function used to compare Document embeddings. One of "dot_product" (default) or "cosine". To choose the most appropriate function, check the documentation of your embedding model.
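The default tokenization regex can be explored with the standard `re` module. The sketch below only shows which tokens the pattern extracts; the `tokenize` helper and the lowercasing step are assumptions for illustration, not the store's own code.

```python
import re

# Default pattern from the signature above: runs of two or more word characters.
DEFAULT_REGEX = r"(?u)\b\w\w+\b"


def tokenize(text: str, pattern: str = DEFAULT_REGEX) -> list:
    # Assumption: lowercase before matching so that tokens compare case-insensitively.
    return re.findall(pattern, text.lower())


tokens = tokenize("A BM25 store is a fast in-memory index")
print(tokens)
```

Single-character words such as "a" are dropped because the pattern requires at least two word characters, and the hyphen splits "in-memory" into two tokens.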
InMemoryDocumentStore.to_dict
def to_dict() -> Dict[str, Any]
Serializes the component to a dictionary.
Returns:
Dictionary with serialized data.
InMemoryDocumentStore.from_dict
@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "InMemoryDocumentStore"
Deserializes the component from a dictionary.
Arguments:
data
: The dictionary to deserialize from.
Returns:
The deserialized component.
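A round trip through `to_dict` and `from_dict` can be sketched as follows. `ToyStore` is a hypothetical stand-in, and the dictionary layout (a `type` key plus `init_parameters`) is an assumption made for this example; the real serialized format may differ.

```python
from typing import Any, Dict


class ToyStore:
    """Hypothetical stand-in; not the library class."""

    def __init__(self, bm25_algorithm: str = "BM25L",
                 embedding_similarity_function: str = "dot_product") -> None:
        self.bm25_algorithm = bm25_algorithm
        self.embedding_similarity_function = embedding_similarity_function

    def to_dict(self) -> Dict[str, Any]:
        # Assumed layout: class name plus the constructor arguments.
        return {
            "type": type(self).__name__,
            "init_parameters": {
                "bm25_algorithm": self.bm25_algorithm,
                "embedding_similarity_function": self.embedding_similarity_function,
            },
        }

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "ToyStore":
        # Rebuild the instance from the stored constructor arguments.
        return cls(**data["init_parameters"])


original = ToyStore(bm25_algorithm="BM25Plus", embedding_similarity_function="cosine")
restored = ToyStore.from_dict(original.to_dict())
print(restored.to_dict() == original.to_dict())
```

In this sketch only the configuration is serialized, not any stored documents.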
InMemoryDocumentStore.count_documents
def count_documents() -> int
Returns the number of documents present in the DocumentStore.
InMemoryDocumentStore.filter_documents
def filter_documents(
filters: Optional[Dict[str, Any]] = None) -> List[Document]
Returns the documents that match the filters provided.
For a detailed specification of the filters, refer to the DocumentStore.filter_documents() protocol documentation.
Arguments:
filters
: The filters to apply to the document list.
Returns:
A list of Documents that match the given filters.
InMemoryDocumentStore.write_documents
def write_documents(documents: List[Document],
policy: DuplicatePolicy = DuplicatePolicy.NONE) -> int
Refer to the DocumentStore.write_documents() protocol documentation.
If policy is set to DuplicatePolicy.NONE, it defaults to DuplicatePolicy.FAIL.
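The fallback from NONE to FAIL can be illustrated with a toy writer. `Policy`, `DuplicateError`, and `write` are hypothetical names; treating the returned integer as the number of documents written, and raising an error on a duplicate ID under FAIL, are assumptions of this sketch rather than statements about the library.

```python
from enum import Enum
from typing import Dict, List, Tuple


class Policy(Enum):
    """Hypothetical two-value policy covering only the cases named above."""

    NONE = "none"
    FAIL = "fail"


class DuplicateError(ValueError):
    """Raised in this sketch when an ID already exists under Policy.FAIL."""


def write(storage: Dict[str, str], docs: List[Tuple[str, str]],
          policy: Policy = Policy.NONE) -> int:
    # NONE means "no preference", so fall back to the strictest behavior.
    if policy is Policy.NONE:
        policy = Policy.FAIL
    written = 0
    for doc_id, content in docs:
        if doc_id in storage and policy is Policy.FAIL:
            raise DuplicateError(f"ID '{doc_id}' already exists")
        storage[doc_id] = content
        written += 1
    return written


storage: Dict[str, str] = {}
first_count = write(storage, [("1", "alpha"), ("2", "beta")])

try:
    write(storage, [("1", "changed")])
    duplicate_rejected = False
except DuplicateError:
    duplicate_rejected = True
```

Here the second call is rejected and the original content for ID "1" is left unchanged.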
InMemoryDocumentStore.delete_documents
def delete_documents(document_ids: List[str]) -> None
Deletes all documents with matching document_ids from the DocumentStore.
Arguments:
document_ids
: The IDs of the documents to delete.
InMemoryDocumentStore.bm25_retrieval
def bm25_retrieval(query: str,
filters: Optional[Dict[str, Any]] = None,
top_k: int = 10,
scale_score: bool = False) -> List[Document]
Retrieves the documents that are most relevant to the query using the BM25 algorithm.
Arguments:
query
: The query string.
filters
: A dictionary with filters to narrow down the search space.
top_k
: The number of top documents to retrieve. Default is 10.
scale_score
: Whether to scale the scores of the retrieved documents. Default is False.
Returns:
A list of the top_k documents most relevant to the query.
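For intuition about what BM25 ranking with `top_k` does, the following is a minimal pure-Python sketch of a textbook Okapi-style BM25 scorer. It is not the store's implementation (the default algorithm here is BM25L, and the store relies on its own scoring code); `bm25_top_k` is a hypothetical helper, and the `k1` and `b` defaults are taken from the example parameters shown earlier.

```python
import math
import re
from collections import Counter
from typing import List, Tuple

TOKEN_RE = re.compile(r"(?u)\b\w\w+\b")


def tokenize(text: str) -> List[str]:
    # Assumption: lowercase before applying the default tokenization regex.
    return TOKEN_RE.findall(text.lower())


def bm25_top_k(query: str, docs: List[str], top_k: int = 10,
               k1: float = 1.5, b: float = 0.75) -> List[Tuple[str, float]]:
    if not docs:
        return []
    tokenized = [tokenize(d) for d in docs]
    freqs = [Counter(t) for t in tokenized]
    # Average document length; fall back to 1.0 if every document is empty.
    avg_len = sum(len(t) for t in tokenized) / len(tokenized) or 1.0
    n_docs = len(docs)
    query_terms = tokenize(query)

    scores = []
    for tokens, freq in zip(tokenized, freqs):
        score = 0.0
        for term in query_terms:
            df = sum(1 for f in freqs if term in f)  # documents containing the term
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))  # non-negative IDF variant
            tf = freq[term]
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores.append(score)

    # Highest score first; keep only the top_k results.
    ranked = sorted(zip(docs, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]


corpus = [
    "the cat sat on the mat",
    "dogs chase cats in the park",
    "quantum computing with qubits",
]
results = bm25_top_k("cat mat", corpus, top_k=2)
```

Only the first document contains the exact tokens "cat" and "mat", so it ranks first; "cats" in the second document does not match because no stemming is applied.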
InMemoryDocumentStore.embedding_retrieval
def embedding_retrieval(query_embedding: List[float],
filters: Optional[Dict[str, Any]] = None,
top_k: int = 10,
scale_score: bool = False,
return_embedding: bool = False) -> List[Document]
Retrieves documents that are most similar to the query embedding using a vector similarity metric.
Arguments:
query_embedding
: Embedding of the query.
filters
: A dictionary with filters to narrow down the search space.
top_k
: The number of top documents to retrieve. Default is 10.
scale_score
: Whether to scale the scores of the retrieved Documents. Default is False.
return_embedding
: Whether to return the embedding of the retrieved Documents. Default is False.
Returns:
A list of the top_k documents most similar to the query embedding.
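The choice between "dot_product" and "cosine" can change the ranking, because the dot product rewards vector length while cosine similarity compares direction only. The sketch below illustrates this with plain lists; `embedding_top_k` and the sample vectors are hypothetical and do not reflect the store's internal code.

```python
import math
from typing import Dict, List, Tuple


def dot_product(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine(a: List[float], b: List[float]) -> float:
    norm = math.sqrt(dot_product(a, a)) * math.sqrt(dot_product(b, b))
    # Treat a zero-length vector as having no similarity to anything.
    return dot_product(a, b) / norm if norm else 0.0


def embedding_top_k(query_embedding: List[float],
                    embeddings: Dict[str, List[float]],
                    top_k: int = 10,
                    similarity: str = "dot_product") -> List[Tuple[str, float]]:
    fn = dot_product if similarity == "dot_product" else cosine
    scored = [(doc_id, fn(query_embedding, emb)) for doc_id, emb in embeddings.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


vectors = {
    "long": [3.0, 3.0],     # large magnitude, 45 degrees away from the query
    "aligned": [1.0, 0.0],  # small magnitude, same direction as the query
}
query = [1.0, 0.0]

by_dot = embedding_top_k(query, vectors, top_k=1, similarity="dot_product")
by_cosine = embedding_top_k(query, vectors, top_k=1, similarity="cosine")
```

With these vectors the dot product ranks "long" first, while cosine similarity ranks "aligned" first, which is why the function should match how your embedding model was trained.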