Stores your texts and metadata and provides them to the Retriever at query time.
Module haystack_experimental.document_stores.in_memory.document_store
InMemoryDocumentStore
Asynchronous version of the in-memory document store.
InMemoryDocumentStore.__init__
def __init__(bm25_tokenization_regex: str = r"(?u)\b\w\w+\b",
bm25_algorithm: Literal["BM25Okapi", "BM25L",
"BM25Plus"] = "BM25L",
bm25_parameters: Optional[Dict] = None,
embedding_similarity_function: Literal["dot_product",
"cosine"] = "dot_product",
index: Optional[str] = None,
async_executor: Optional[ThreadPoolExecutor] = None)
Initializes the DocumentStore.
Arguments:
bm25_tokenization_regex
: The regular expression used to tokenize the text for BM25 retrieval.
bm25_algorithm
: The BM25 algorithm to use. One of "BM25Okapi", "BM25L", or "BM25Plus".
bm25_parameters
: Parameters for the BM25 implementation, given as a dictionary. For example: {'k1':1.5, 'b':0.75, 'epsilon':0.25}. You can learn more about these parameters by visiting https://github.com/dorianbrown/rank_bm25.
embedding_similarity_function
: The similarity function used to compare Document embeddings. One of "dot_product" (default) or "cosine". To choose the most appropriate function, look for information about your embedding model.
index
: A specific index to store the documents. If not specified, a random UUID is used. Using the same index allows you to store documents across multiple InMemoryDocumentStore instances.
async_executor
: Optional ThreadPoolExecutor to use for async calls. If not provided, a single-threaded executor will be initialized and used.
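To get a feel for what the default bm25_tokenization_regex matches, the pattern can be tried directly with Python's re module. This is only a sketch of the regex itself: the tokenize helper is hypothetical, and any preprocessing the store applies before tokenizing (such as lowercasing) is not shown.

```python
import re
from typing import List

# Default value of bm25_tokenization_regex, copied from the signature above.
DEFAULT_REGEX = r"(?u)\b\w\w+\b"


def tokenize(text: str, pattern: str = DEFAULT_REGEX) -> List[str]:
    """Return every substring of `text` matched by `pattern` (hypothetical helper)."""
    return re.findall(pattern, text)


tokens = tokenize("A quick BM25 test of an in-memory store")
print(tokens)
# ['quick', 'BM25', 'test', 'of', 'an', 'in', 'memory', 'store']
```

The pattern keeps runs of two or more word characters, so the single-letter "A" is dropped and the hyphen splits "in-memory" into two tokens.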
InMemoryDocumentStore.count_documents_async
async def count_documents_async() -> int
Returns the number of documents present in the DocumentStore.
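The async methods rely on a ThreadPoolExecutor, as described for the async_executor argument. One common way to wire that up is asyncio's run_in_executor; the sketch below shows the pattern with a hypothetical TinyStore class and is an assumption about the general approach, not the library's actual code.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor
from typing import List, Optional


class TinyStore:
    """Hypothetical stand-in for a store; not the Haystack implementation."""

    def __init__(self, async_executor: Optional[ThreadPoolExecutor] = None):
        self._docs: List[str] = []
        # Fall back to a single-threaded executor when none is supplied.
        self._executor = async_executor or ThreadPoolExecutor(max_workers=1)

    def count_documents(self) -> int:
        return len(self._docs)

    async def count_documents_async(self) -> int:
        # Run the synchronous method on the executor so the event loop is not blocked.
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(self._executor, self.count_documents)


async def main() -> int:
    store = TinyStore()
    store._docs.extend(["first", "second"])
    return await store.count_documents_async()


count = asyncio.run(main())
print(count)  # 2
```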
InMemoryDocumentStore.filter_documents_async
async def filter_documents_async(
filters: Optional[Dict[str, Any]] = None) -> List[Document]
Returns the documents that match the filters provided.
For a detailed specification of the filters, refer to the DocumentStore.filter_documents() protocol documentation.
Arguments:
filters
: The filters to apply to the document list.
Returns:
A list of Documents that match the given filters.
InMemoryDocumentStore.write_documents_async
async def write_documents_async(
documents: List[Document],
policy: DuplicatePolicy = DuplicatePolicy.NONE) -> int
Refer to the DocumentStore.write_documents() protocol documentation.
If policy is set to DuplicatePolicy.NONE, it defaults to DuplicatePolicy.FAIL.
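The duplicate-handling rules can be illustrated with a small, library-free sketch. The DuplicatePolicy enum, DuplicateError class, and write_documents function below are local stand-ins that mirror the names in this reference; documents are simplified to (id, content) pairs, which is an assumption made for brevity.

```python
from enum import Enum
from typing import Dict, List, Tuple


class DuplicatePolicy(Enum):
    # Local stand-in that mirrors the policy names used in this reference.
    NONE = "none"
    SKIP = "skip"
    OVERWRITE = "overwrite"
    FAIL = "fail"


class DuplicateError(Exception):
    """Local stand-in for the error raised on duplicate ids."""


def write_documents(storage: Dict[str, str],
                    documents: List[Tuple[str, str]],
                    policy: DuplicatePolicy = DuplicatePolicy.NONE) -> int:
    """Write (id, content) pairs into `storage` and return how many were written."""
    if policy == DuplicatePolicy.NONE:
        policy = DuplicatePolicy.FAIL  # NONE falls back to FAIL, as described above
    written = 0
    for doc_id, content in documents:
        if doc_id in storage:
            if policy == DuplicatePolicy.FAIL:
                raise DuplicateError(f"ID '{doc_id}' already exists.")
            if policy == DuplicatePolicy.SKIP:
                continue  # keep the stored document, do not count it
        storage[doc_id] = content
        written += 1
    return written


storage: Dict[str, str] = {}
first = write_documents(storage, [("1", "alpha"), ("2", "beta")])
skipped = write_documents(storage, [("2", "BETA"), ("3", "gamma")], DuplicatePolicy.SKIP)
overwritten = write_documents(storage, [("2", "BETA")], DuplicatePolicy.OVERWRITE)
print(first, skipped, overwritten)  # 2 1 1
```

With SKIP the returned count is lower than the input length because the existing id "2" is left untouched; with OVERWRITE every input document is counted.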
InMemoryDocumentStore.delete_documents_async
async def delete_documents_async(document_ids: List[str]) -> None
Deletes all documents with matching document_ids from the DocumentStore.
Arguments:
document_ids
: The IDs of the documents to delete.
InMemoryDocumentStore.bm25_retrieval_async
async def bm25_retrieval_async(query: str,
filters: Optional[Dict[str, Any]] = None,
top_k: int = 10,
scale_score: bool = False) -> List[Document]
Retrieves the documents that are most relevant to the query using the BM25 algorithm.
Arguments:
query
: The query string.
filters
: A dictionary with filters to narrow down the search space.
top_k
: The number of top documents to retrieve. Default is 10.
scale_score
: Whether to scale the scores of the retrieved documents. Default is False.
Returns:
A list of the top_k documents most relevant to the query.
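To make the ranking step concrete, here is a minimal sketch of keyword scoring with the textbook Okapi BM25 formula. It is not the store's implementation: the store defaults to BM25L, the k1 and b values are illustrative, lowercasing is an assumption, and bm25_top_k is a hypothetical helper operating on plain strings rather than Document objects.

```python
import math
import re
from collections import Counter
from typing import List, Tuple

TOKEN_RE = re.compile(r"(?u)\b\w\w+\b")


def tokenize(text: str) -> List[str]:
    # Lowercasing is an assumption of this sketch.
    return TOKEN_RE.findall(text.lower())


def bm25_top_k(query: str, docs: List[str], top_k: int = 10,
               k1: float = 1.5, b: float = 0.75) -> List[Tuple[str, float]]:
    """Rank `docs` against `query` with textbook Okapi BM25 and return the top_k."""
    tokenized = [tokenize(d) for d in docs]
    avg_len = sum(len(t) for t in tokenized) / len(tokenized)
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scored = []
    for doc, tokens in zip(docs, tokenized):
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scored.append((doc, score))
    # Highest score first, then keep only the requested number of results.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


docs = [
    "Berlin is the capital of Germany",
    "Paris is the capital of France",
    "Python is a programming language",
]
results = bm25_top_k("capital of France", docs, top_k=2)
print([doc for doc, _ in results])
```

The document about Paris matches all three query terms and ranks first; the one about Berlin matches two and ranks second.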
InMemoryDocumentStore.embedding_retrieval_async
async def embedding_retrieval_async(
query_embedding: List[float],
filters: Optional[Dict[str, Any]] = None,
top_k: int = 10,
scale_score: bool = False,
return_embedding: bool = False) -> List[Document]
Retrieves documents that are most similar to the query embedding using a vector similarity metric.
Arguments:
query_embedding
: Embedding of the query.
filters
: A dictionary with filters to narrow down the search space.
top_k
: The number of top documents to retrieve. Default is 10.
scale_score
: Whether to scale the scores of the retrieved Documents. Default is False.
return_embedding
: Whether to return the embedding of the retrieved Documents. Default is False.
Returns:
A list of the top_k documents most relevant to the query.
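The choice between "dot_product" and "cosine" can change the ranking, because the dot product rewards long vectors while cosine similarity only looks at direction. The following sketch uses hypothetical helper functions and toy two-dimensional embeddings to show the difference; it does not reproduce the store's code or its score scaling.

```python
import math
from typing import Callable, Dict, List, Tuple


def dot_product(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine(a: List[float], b: List[float]) -> float:
    # Dot product divided by the product of the two vector lengths.
    return dot_product(a, b) / (math.sqrt(dot_product(a, a)) * math.sqrt(dot_product(b, b)))


def top_k_similar(query_embedding: List[float],
                  embeddings: Dict[str, List[float]],
                  top_k: int = 10,
                  similarity: Callable[[List[float], List[float]], float] = dot_product
                  ) -> List[Tuple[str, float]]:
    """Score every stored embedding against the query and return the top_k names."""
    scored = [(name, similarity(query_embedding, emb)) for name, emb in embeddings.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


embeddings = {
    "long": [3.0, 4.0],      # large magnitude, points away from the query
    "aligned": [0.9, 0.1],   # small magnitude, nearly parallel to the query
}
query = [1.0, 0.0]

by_dot = top_k_similar(query, embeddings, top_k=1, similarity=dot_product)
by_cos = top_k_similar(query, embeddings, top_k=1, similarity=cosine)
print(by_dot[0][0], by_cos[0][0])  # long aligned
```

If the embedding model produces normalized vectors, both functions yield the same ordering, which is why the model's documentation is the best guide for this setting.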
Module haystack_experimental.document_stores.opensearch.document_store
OpenSearchDocumentStore
OpenSearchDocumentStore.__init__
def __init__(*,
hosts: Optional[Hosts] = None,
index: str = "default",
max_chunk_bytes: int = DEFAULT_MAX_CHUNK_BYTES,
embedding_dim: int = 768,
return_embedding: bool = False,
method: Optional[Dict[str, Any]] = None,
mappings: Optional[Dict[str, Any]] = None,
settings: Optional[Dict[str, Any]] = DEFAULT_SETTINGS,
create_index: bool = True,
http_auth: Any = None,
use_ssl: Optional[bool] = None,
verify_certs: Optional[bool] = None,
timeout: Optional[int] = None,
**kwargs)
Creates a new OpenSearchDocumentStore instance.
The embedding_dim, method, mappings, and settings arguments are only used if the index does not exist and needs to be created. If the index already exists, its current configuration is used.
For more information on connection parameters, see the official OpenSearch documentation.
Arguments:
hosts
: List of hosts running the OpenSearch client. Defaults to None.
index
: Name of the index in OpenSearch. If it doesn't exist, it will be created. Defaults to "default".
max_chunk_bytes
: Maximum size of the requests in bytes. Defaults to 100MB.
embedding_dim
: Dimension of the embeddings. Defaults to 768.
return_embedding
: Whether to return the embedding of the retrieved Documents.
method
: The method definition of the underlying configuration of the approximate k-NN algorithm. Please see the official OpenSearch docs for more information. Defaults to None.
mappings
: The mapping of how the documents are stored and indexed. Please see the official OpenSearch docs for more information. If None, it uses the embedding_dim and method arguments to create default mappings. Defaults to None.
settings
: The settings of the index to be created. Please see the official OpenSearch docs for more information. Defaults to {"index.knn": True}.
create_index
: Whether to create the index if it doesn't exist. Defaults to True.
http_auth
: http_auth param passed to the underlying connection class. For basic authentication with the default connection class Urllib3HttpConnection, this can be:
- a tuple of (username, password)
- a list of [username, password]
- a string of "username:password"
For AWS authentication with Urllib3HttpConnection, pass an instance of AWSAuth. Defaults to None.
use_ssl
: Whether to use SSL. Defaults to None.
verify_certs
: Whether to verify certificates. Defaults to None.
timeout
: Timeout in seconds. Defaults to None.
**kwargs
: Optional arguments that OpenSearch takes. For the full list of supported kwargs, see the official OpenSearch reference.
OpenSearchDocumentStore.to_dict
def to_dict() -> Dict[str, Any]
Serializes the component to a dictionary.
Returns:
Dictionary with serialized data.
OpenSearchDocumentStore.from_dict
@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "OpenSearchDocumentStore"
Deserializes the component from a dictionary.
Arguments:
data
: Dictionary to deserialize from.
Returns:
Deserialized component.
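A to_dict / from_dict pair is expected to round-trip: deserializing the serialized dictionary should give back an equivalent object. The sketch below shows that contract with a hypothetical TinyStore class. The "type" / "init_parameters" layout of the dictionary is an assumption of this sketch, not something this reference specifies.

```python
from typing import Any, Dict


class TinyStore:
    """Hypothetical store used only to illustrate the round trip."""

    def __init__(self, index: str = "default", embedding_dim: int = 768):
        self.index = index
        self.embedding_dim = embedding_dim

    def to_dict(self) -> Dict[str, Any]:
        # The "type" / "init_parameters" layout is an assumption of this sketch.
        return {
            "type": f"{type(self).__module__}.{type(self).__name__}",
            "init_parameters": {"index": self.index, "embedding_dim": self.embedding_dim},
        }

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "TinyStore":
        # Rebuild the object from the constructor arguments captured by to_dict().
        return cls(**data["init_parameters"])


original = TinyStore(index="articles", embedding_dim=384)
restored = TinyStore.from_dict(original.to_dict())
print(restored.index, restored.embedding_dim)  # articles 384
```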
Module haystack_experimental.document_stores.types.protocol
DocumentStore
Stores Documents to be used by the components of a Pipeline.
Classes implementing this protocol often store the documents permanently and allow specialized components to perform retrieval on them by embedding, by keyword, by a hybrid of the two, and so on, depending on the backend used.
In order to retrieve documents, consider using a Retriever that supports the DocumentStore implementation that you're using.
DocumentStore.to_dict
def to_dict() -> Dict[str, Any]
Serializes this store to a dictionary.
DocumentStore.from_dict
@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "DocumentStore"
Deserializes the store from a dictionary.
DocumentStore.count_documents
def count_documents() -> int
Returns the number of documents stored.
DocumentStore.filter_documents
def filter_documents(
filters: Optional[Dict[str, Any]] = None) -> List[Document]
Returns the documents that match the filters provided.
Filters are defined as nested dictionaries that can be of two types:
- Comparison
- Logic
Comparison dictionaries must contain the keys:
- field
- operator
- value
Logic dictionaries must contain the keys:
- operator
- conditions
The conditions key must be a list of dictionaries, either of type Comparison or Logic.
The operator value in Comparison dictionaries must be one of:
- ==
- !=
- >
- >=
- <
- <=
- in
- not in
The operator value in Logic dictionaries must be one of:
- NOT
- OR
- AND
A simple filter:
filters = {"field": "meta.type", "operator": "==", "value": "article"}
A more complex filter:
filters = {
    "operator": "AND",
    "conditions": [
        {"field": "meta.type", "operator": "==", "value": "article"},
        {"field": "meta.date", "operator": ">=", "value": 1420066800},
        {"field": "meta.date", "operator": "<", "value": 1609455600},
        {"field": "meta.rating", "operator": ">=", "value": 3},
        {
            "operator": "OR",
            "conditions": [
                {"field": "meta.genre", "operator": "in", "value": ["economy", "politics"]},
                {"field": "meta.publisher", "operator": "==", "value": "nytimes"},
            ],
        },
    ],
}
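The filter grammar above can be read as a small recursive data structure. The following sketch evaluates such a filter against plain dictionaries. The matches and get_field helpers are hypothetical, documents are simplified to nested dicts, and treating NOT as "not all conditions hold" is an assumption of this sketch; actual Document Store backends translate filters in their own way.

```python
import operator
from typing import Any, Dict

COMPARISONS = {
    "==": operator.eq,
    "!=": operator.ne,
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "in": lambda value, options: value in options,
    "not in": lambda value, options: value not in options,
}


def get_field(document: Dict[str, Any], path: str) -> Any:
    # Resolve dotted paths such as "meta.type" against nested dictionaries.
    current: Any = document
    for key in path.split("."):
        current = current[key]
    return current


def matches(filters: Dict[str, Any], document: Dict[str, Any]) -> bool:
    if "conditions" in filters:  # Logic dictionary: recurse into each condition
        results = [matches(cond, document) for cond in filters["conditions"]]
        op = filters["operator"]
        if op == "AND":
            return all(results)
        if op == "OR":
            return any(results)
        if op == "NOT":
            return not all(results)  # assumption: NOT negates the conjunction
        raise ValueError(f"Unknown logic operator: {op}")
    # Comparison dictionary: look up the field and apply the operator
    compare = COMPARISONS[filters["operator"]]
    return compare(get_field(document, filters["field"]), filters["value"])


docs = [
    {"id": "1", "meta": {"type": "article", "rating": 4, "genre": "economy"}},
    {"id": "2", "meta": {"type": "article", "rating": 2, "genre": "sports"}},
    {"id": "3", "meta": {"type": "blog", "rating": 5, "genre": "politics"}},
]
filters = {
    "operator": "AND",
    "conditions": [
        {"field": "meta.type", "operator": "==", "value": "article"},
        {"field": "meta.rating", "operator": ">=", "value": 3},
        {"field": "meta.genre", "operator": "in", "value": ["economy", "politics"]},
    ],
}
matching_ids = [d["id"] for d in docs if matches(filters, d)]
print(matching_ids)  # ['1']
```

This sketch raises KeyError when a document lacks the requested field; a real backend has to decide how missing fields are handled.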
Arguments:
filters
: The filters to apply to the document list.
Returns:
A list of Documents that match the given filters.
DocumentStore.write_documents
def write_documents(documents: List[Document],
policy: DuplicatePolicy = DuplicatePolicy.NONE) -> int
Writes Documents into the DocumentStore.
Arguments:
documents
: A list of Document objects.
policy
: The policy to apply when a Document with the same id already exists in the DocumentStore.
- DuplicatePolicy.NONE: Default policy, behaviour depends on the Document Store.
- DuplicatePolicy.SKIP: If a Document with the same id already exists, it is skipped and not written.
- DuplicatePolicy.OVERWRITE: If a Document with the same id already exists, it is overwritten.
- DuplicatePolicy.FAIL: If a Document with the same id already exists, an error is raised.
Raises:
DuplicateError
: If policy is set to DuplicatePolicy.FAIL and a Document with the same id already exists.
Returns:
The number of Documents written.
If DuplicatePolicy.OVERWRITE is used, this number is always equal to the number of documents in the input list.
If DuplicatePolicy.SKIP is used, this number can be lower than the number of documents in the input list.
DocumentStore.delete_documents
def delete_documents(document_ids: List[str]) -> None
Deletes all documents with matching document_ids from the DocumentStore.
Fails with MissingDocumentError if no document with a given id is present in the DocumentStore.
Arguments:
document_ids
: The IDs of the documents to delete.