API Reference

Stores your texts and metadata and provides them to the Retriever at query time.

Module haystack_experimental.document_stores.in_memory.document_store

InMemoryDocumentStore

Asynchronous version of the in-memory document store.

InMemoryDocumentStore.__init__

def __init__(bm25_tokenization_regex: str = r"(?u)\b\w\w+\b",
             bm25_algorithm: Literal["BM25Okapi", "BM25L",
                                     "BM25Plus"] = "BM25L",
             bm25_parameters: Optional[Dict] = None,
             embedding_similarity_function: Literal["dot_product",
                                                    "cosine"] = "dot_product",
             index: Optional[str] = None,
             async_executor: Optional[ThreadPoolExecutor] = None)

Initializes the DocumentStore.

Arguments:

  • bm25_tokenization_regex: The regular expression used to tokenize the text for BM25 retrieval.
  • bm25_algorithm: The BM25 algorithm to use. One of "BM25Okapi", "BM25L", or "BM25Plus".
  • bm25_parameters: Parameters for the BM25 implementation, given as a dictionary. For example: {'k1':1.5, 'b':0.75, 'epsilon':0.25}. You can learn more about these parameters by visiting https://github.com/dorianbrown/rank_bm25.
  • embedding_similarity_function: The similarity function used to compare Document embeddings. One of "dot_product" (default) or "cosine". To choose the most appropriate function, look for information about your embedding model.
  • index: A specific index to store the documents. If not specified, a random UUID is used. Using the same index allows you to store documents across multiple InMemoryDocumentStore instances.
  • async_executor: Optional ThreadPoolExecutor to use for async calls. If not provided, a single-threaded executor will be initialized and used.
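To see what the default bm25_tokenization_regex matches, the sketch below applies it with Python's re module. The tokenize helper and the lowercasing step are assumptions made for illustration, not the store's actual tokenization code.

```python
import re
from typing import List

# Default pattern from the signature above: runs of two or more word characters.
BM25_TOKENIZATION_REGEX = r"(?u)\b\w\w+\b"


def tokenize(text: str) -> List[str]:
    """Hypothetical helper: lowercase the text, then extract tokens with the regex."""
    return re.findall(BM25_TOKENIZATION_REGEX, text.lower())


tokens = tokenize("A quick test of BM25: is 'x' kept?")
print(tokens)
```

Single-character tokens such as "a" and "x" do not match the pattern, so they never reach the ranking step.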

InMemoryDocumentStore.count_documents_async

async def count_documents_async() -> int

Returns the number of documents present in the DocumentStore.
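The async_executor argument suggests that the async methods hand work to a ThreadPoolExecutor. A minimal sketch of that pattern, assuming a synchronous method wrapped with asyncio's run_in_executor, could look like this. TinyStore is a hypothetical stand-in, not the library class, and the real implementation may differ.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor
from typing import List, Optional


class TinyStore:
    """Hypothetical store used only to illustrate the executor pattern."""

    def __init__(self, async_executor: Optional[ThreadPoolExecutor] = None):
        self._docs: List[str] = []
        # Fall back to a single-threaded executor when none is supplied.
        self._executor = async_executor or ThreadPoolExecutor(max_workers=1)

    def count_documents(self) -> int:
        return len(self._docs)

    async def count_documents_async(self) -> int:
        # Run the blocking call on the executor so the event loop stays free.
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(self._executor, self.count_documents)


async def main() -> int:
    store = TinyStore()
    store._docs.extend(["a", "b", "c"])
    return await store.count_documents_async()


count = asyncio.run(main())
print(count)
```

Passing your own executor lets several stores share one thread pool; the single-threaded default keeps calls serialized.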

InMemoryDocumentStore.filter_documents_async

async def filter_documents_async(
        filters: Optional[Dict[str, Any]] = None) -> List[Document]

Returns the documents that match the filters provided.

For a detailed specification of the filters, refer to the DocumentStore.filter_documents() protocol documentation.

Arguments:

  • filters: The filters to apply to the document list.

Returns:

A list of Documents that match the given filters.

InMemoryDocumentStore.write_documents_async

async def write_documents_async(
        documents: List[Document],
        policy: DuplicatePolicy = DuplicatePolicy.NONE) -> int

Refer to the DocumentStore.write_documents() protocol documentation.

If policy is set to DuplicatePolicy.NONE, it defaults to DuplicatePolicy.FAIL.
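The duplicate policies described in the write_documents() protocol documentation, together with the NONE-to-FAIL fallback noted above, can be sketched with plain dictionaries. The DuplicatePolicy and DuplicateError classes below are local stand-ins defined for this example, and documents are simplified to (id, content) tuples.

```python
from enum import Enum
from typing import Dict, List, Tuple


class DuplicatePolicy(Enum):
    """Local stand-in for the library enum."""
    NONE = "none"
    SKIP = "skip"
    OVERWRITE = "overwrite"
    FAIL = "fail"


class DuplicateError(Exception):
    """Local stand-in for the library exception."""


def write_documents(storage: Dict[str, str],
                    documents: List[Tuple[str, str]],
                    policy: DuplicatePolicy = DuplicatePolicy.NONE) -> int:
    # Mirror the documented fallback: NONE behaves like FAIL.
    if policy == DuplicatePolicy.NONE:
        policy = DuplicatePolicy.FAIL
    written = 0
    for doc_id, content in documents:
        if doc_id in storage:
            if policy == DuplicatePolicy.SKIP:
                continue
            if policy == DuplicatePolicy.FAIL:
                raise DuplicateError(f"ID '{doc_id}' already exists.")
        storage[doc_id] = content
        written += 1
    return written


storage = {"1": "old"}
skipped = write_documents(storage, [("1", "new"), ("2", "two")], DuplicatePolicy.SKIP)
kept_old = storage["1"]
overwritten = write_documents(storage, [("1", "new"), ("2", "two")], DuplicatePolicy.OVERWRITE)
try:
    write_documents(storage, [("1", "again")])
    failed = False
except DuplicateError:
    failed = True
```

With SKIP the return value can be lower than the input length; with OVERWRITE it equals the input length.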

InMemoryDocumentStore.delete_documents_async

async def delete_documents_async(document_ids: List[str]) -> None

Deletes all documents with matching document_ids from the DocumentStore.

Arguments:

  • document_ids: The IDs of the documents to delete.

InMemoryDocumentStore.bm25_retrieval_async

async def bm25_retrieval_async(query: str,
                               filters: Optional[Dict[str, Any]] = None,
                               top_k: int = 10,
                               scale_score: bool = False) -> List[Document]

Retrieves documents that are most relevant to the query using the BM25 algorithm.

Arguments:

  • query: The query string.
  • filters: A dictionary with filters to narrow down the search space.
  • top_k: The number of top documents to retrieve. Default is 10.
  • scale_score: Whether to scale the scores of the retrieved documents. Default is False.

Returns:

A list of the top_k documents most relevant to the query.
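As a rough illustration of BM25 ranking with a top_k cutoff, here is a small Okapi-style scorer in pure Python. It is a sketch under stated assumptions: the store's default is BM25L and the rank_bm25 formulas differ in detail, so the idf variant, the k1 and b defaults, and the bm25_top_k helper are illustrative only.

```python
import math
import re
from collections import Counter
from typing import List, Tuple

TOKEN_RE = re.compile(r"(?u)\b\w\w+\b")


def tokenize(text: str) -> List[str]:
    return TOKEN_RE.findall(text.lower())


def bm25_top_k(query: str, docs: List[str], top_k: int = 10,
               k1: float = 1.5, b: float = 0.75) -> List[Tuple[float, str]]:
    """Hypothetical helper: score every document and keep the top_k best."""
    if not docs:
        return []
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Number of documents that contain each term.
    doc_freq = Counter(term for t in tokenized for term in set(t))
    scored = []
    for text, tokens in zip(docs, tokenized):
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scored.append((score, text))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


docs = [
    "the cat sat on the mat",
    "dogs chase cats in the park",
    "stock markets fell sharply today",
]
results = bm25_top_k("cat mat", docs, top_k=2)
```

Only the first document contains the exact tokens "cat" and "mat", so it ranks first; "cats" is a different token under this tokenization.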

InMemoryDocumentStore.embedding_retrieval_async

async def embedding_retrieval_async(
        query_embedding: List[float],
        filters: Optional[Dict[str, Any]] = None,
        top_k: int = 10,
        scale_score: bool = False,
        return_embedding: bool = False) -> List[Document]

Retrieves documents that are most similar to the query embedding using a vector similarity metric.

Arguments:

  • query_embedding: Embedding of the query.
  • filters: A dictionary with filters to narrow down the search space.
  • top_k: The number of top documents to retrieve. Default is 10.
  • scale_score: Whether to scale the scores of the retrieved Documents. Default is False.
  • return_embedding: Whether to return the embedding of the retrieved Documents. Default is False.

Returns:

A list of the top_k documents most similar to the query embedding.
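The choice between "dot_product" and "cosine" matters when embeddings are not normalized. The sketch below implements both metrics in plain Python to show how they can rank the same vectors differently; the function names and example vectors are illustrative and do not reflect the store's internal code.

```python
import math
from typing import List


def dot_product(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine(a: List[float], b: List[float]) -> float:
    # Dot product divided by the product of the vector lengths.
    norm = math.sqrt(dot_product(a, a)) * math.sqrt(dot_product(b, b))
    return dot_product(a, b) / norm if norm else 0.0


query = [1.0, 0.0]
short_aligned = [1.0, 0.0]   # same direction as the query, small magnitude
long_off_axis = [3.0, 3.0]   # 45 degrees off, large magnitude

dot_scores = (dot_product(query, short_aligned), dot_product(query, long_off_axis))
cos_scores = (cosine(query, short_aligned), cosine(query, long_off_axis))
```

Dot product rewards magnitude and prefers long_off_axis here, while cosine looks only at direction and prefers short_aligned. For unit-length embeddings the two metrics give the same ordering.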

Module haystack_experimental.document_stores.opensearch.document_store

OpenSearchDocumentStore

OpenSearchDocumentStore.__init__

def __init__(*,
             hosts: Optional[Hosts] = None,
             index: str = "default",
             max_chunk_bytes: int = DEFAULT_MAX_CHUNK_BYTES,
             embedding_dim: int = 768,
             return_embedding: bool = False,
             method: Optional[Dict[str, Any]] = None,
             mappings: Optional[Dict[str, Any]] = None,
             settings: Optional[Dict[str, Any]] = DEFAULT_SETTINGS,
             create_index: bool = True,
             http_auth: Any = None,
             use_ssl: Optional[bool] = None,
             verify_certs: Optional[bool] = None,
             timeout: Optional[int] = None,
             **kwargs)

Creates a new OpenSearchDocumentStore instance.

The embedding_dim, method, mappings, and settings arguments are only used if the index does not exist and needs to be created. If the index already exists, its current configuration is used.

For more information on connection parameters, see the official OpenSearch documentation.

Arguments:

  • hosts: List of hosts running the OpenSearch client. Defaults to None.
  • index: Name of the index in OpenSearch. If it doesn't exist, it will be created. Defaults to "default".
  • max_chunk_bytes: Maximum size of the requests in bytes. Defaults to 100 MB.
  • embedding_dim: Dimension of the embeddings. Defaults to 768.
  • return_embedding: Whether to return the embedding of the retrieved Documents. Defaults to False.
  • method: The method definition of the underlying configuration of the approximate k-NN algorithm. See the official OpenSearch docs for more information. Defaults to None.
  • mappings: The mapping of how the documents are stored and indexed. See the official OpenSearch docs for more information. If None, it uses the embedding_dim and method arguments to create default mappings. Defaults to None.
  • settings: The settings of the index to be created. See the official OpenSearch docs for more information. Defaults to {"index.knn": True}.
  • create_index: Whether to create the index if it doesn't exist. Defaults to True.
  • http_auth: http_auth param passed to the underlying connection class. Defaults to None. For basic authentication with the default connection class Urllib3HttpConnection, this can be:
      • a tuple of (username, password)
      • a list of [username, password]
      • a string of "username:password"
    For AWS authentication with Urllib3HttpConnection, pass an instance of AWSAuth.
  • use_ssl: Whether to use SSL. Defaults to None.
  • verify_certs: Whether to verify certificates. Defaults to None.
  • timeout: Timeout in seconds. Defaults to None.
  • **kwargs: Optional arguments that OpenSearch takes. For the full list of supported kwargs, see the official OpenSearch reference.

OpenSearchDocumentStore.to_dict

def to_dict() -> Dict[str, Any]

Serializes the component to a dictionary.

Returns:

Dictionary with serialized data.

OpenSearchDocumentStore.from_dict

@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "OpenSearchDocumentStore"

Deserializes the component from a dictionary.

Arguments:

  • data: Dictionary to deserialize from.

Returns:

Deserialized component.
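The to_dict/from_dict pair is meant to round-trip a store's configuration. The sketch below shows the idea with a hypothetical TinyStore class; the "type" and "init_parameters" keys are an assumption of this example, and the dictionary layout produced by the real class may differ.

```python
from typing import Any, Dict


class TinyStore:
    """Hypothetical store that serializes only its constructor arguments."""

    def __init__(self, index: str = "default", embedding_dim: int = 768):
        self.index = index
        self.embedding_dim = embedding_dim

    def to_dict(self) -> Dict[str, Any]:
        return {
            # Assumed layout: a type tag plus the arguments needed to rebuild the object.
            "type": f"{type(self).__module__}.{type(self).__name__}",
            "init_parameters": {
                "index": self.index,
                "embedding_dim": self.embedding_dim,
            },
        }

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "TinyStore":
        return cls(**data["init_parameters"])


original = TinyStore(index="articles", embedding_dim=384)
restored = TinyStore.from_dict(original.to_dict())
```

Only configuration is serialized in this sketch, not the stored documents or any live connection.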

Module haystack_experimental.document_stores.types.protocol

DocumentStore

Stores Documents to be used by the components of a Pipeline.

Classes implementing this protocol often store the documents permanently and allow specialized components to perform retrieval on them, for example by embedding, by keyword, or by a hybrid of both, depending on the backend used.

In order to retrieve documents, consider using a Retriever that supports the DocumentStore implementation that you're using.

DocumentStore.to_dict

def to_dict() -> Dict[str, Any]

Serializes this store to a dictionary.

DocumentStore.from_dict

@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "DocumentStore"

Deserializes the store from a dictionary.

DocumentStore.count_documents

def count_documents() -> int

Returns the number of documents stored.

DocumentStore.filter_documents

def filter_documents(
        filters: Optional[Dict[str, Any]] = None) -> List[Document]

Returns the documents that match the filters provided.

Filters are defined as nested dictionaries that can be of two types:

  • Comparison
  • Logic

Comparison dictionaries must contain the keys:

  • field
  • operator
  • value

Logic dictionaries must contain the keys:

  • operator
  • conditions

The conditions key must be a list of dictionaries, either of type Comparison or Logic.

The operator value in Comparison dictionaries must be one of:

  • ==
  • !=
  • >
  • >=
  • <
  • <=
  • in
  • not in

The operator values in Logic dictionaries must be one of:

  • NOT
  • OR
  • AND

A simple filter:

filters = {"field": "meta.type", "operator": "==", "value": "article"}

A more complex filter:

filters = {
    "operator": "AND",
    "conditions": [
        {"field": "meta.type", "operator": "==", "value": "article"},
        {"field": "meta.date", "operator": ">=", "value": 1420066800},
        {"field": "meta.date", "operator": "<", "value": 1609455600},
        {"field": "meta.rating", "operator": ">=", "value": 3},
        {
            "operator": "OR",
            "conditions": [
                {"field": "meta.genre", "operator": "in", "value": ["economy", "politics"]},
                {"field": "meta.publisher", "operator": "==", "value": "nytimes"},
            ],
        },
    ],
}
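To make the filter semantics concrete, here is a minimal sketch of an evaluator for Comparison and Logic dictionaries. It works on plain dictionaries instead of Document objects, and the matches and get_field helpers are hypothetical. Treating NOT as the negation of AND over its conditions, and treating ordering comparisons against a missing field as False, are assumptions of this sketch.

```python
import operator
from typing import Any, Callable, Dict


def _ordered(op: Callable[[Any, Any], bool]) -> Callable[[Any, Any], bool]:
    # Ordering comparisons against a missing field are treated as False.
    def compare(left: Any, right: Any) -> bool:
        return left is not None and op(left, right)
    return compare


COMPARISONS: Dict[str, Callable[[Any, Any], bool]] = {
    "==": operator.eq,
    "!=": operator.ne,
    ">": _ordered(operator.gt),
    ">=": _ordered(operator.ge),
    "<": _ordered(operator.lt),
    "<=": _ordered(operator.le),
    "in": lambda left, right: left in right,
    "not in": lambda left, right: left not in right,
}


def get_field(document: Dict[str, Any], path: str) -> Any:
    """Resolve a dotted path such as 'meta.type'; return None if it is missing."""
    value: Any = document
    for key in path.split("."):
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value


def matches(document: Dict[str, Any], filters: Dict[str, Any]) -> bool:
    if "field" in filters:  # Comparison dictionary
        compare = COMPARISONS[filters["operator"]]
        return compare(get_field(document, filters["field"]), filters["value"])
    results = [matches(document, c) for c in filters["conditions"]]  # Logic dictionary
    if filters["operator"] == "AND":
        return all(results)
    if filters["operator"] == "OR":
        return any(results)
    if filters["operator"] == "NOT":
        return not all(results)
    raise ValueError(f"Unknown operator: {filters['operator']}")


filters = {
    "operator": "AND",
    "conditions": [
        {"field": "meta.type", "operator": "==", "value": "article"},
        {"field": "meta.rating", "operator": ">=", "value": 3},
        {
            "operator": "OR",
            "conditions": [
                {"field": "meta.genre", "operator": "in", "value": ["economy", "politics"]},
                {"field": "meta.publisher", "operator": "==", "value": "nytimes"},
            ],
        },
    ],
}

good = {"meta": {"type": "article", "rating": 4, "genre": "economy", "publisher": "other"}}
bad = {"meta": {"type": "article", "rating": 2, "genre": "economy", "publisher": "nytimes"}}
good_matches = matches(good, filters)
bad_matches = matches(bad, filters)
```

The nested OR is evaluated recursively and its single boolean result is one of the conditions of the outer AND.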

Arguments:

  • filters: The filters to apply to the document list.

Returns:

A list of Documents that match the given filters.

DocumentStore.write_documents

def write_documents(documents: List[Document],
                    policy: DuplicatePolicy = DuplicatePolicy.NONE) -> int

Writes Documents into the DocumentStore.

Arguments:

  • documents: A list of Document objects.
  • policy: The policy to apply when a Document with the same id already exists in the DocumentStore.
      • DuplicatePolicy.NONE: Default policy, behaviour depends on the Document Store.
      • DuplicatePolicy.SKIP: If a Document with the same id already exists, it is skipped and not written.
      • DuplicatePolicy.OVERWRITE: If a Document with the same id already exists, it is overwritten.
      • DuplicatePolicy.FAIL: If a Document with the same id already exists, an error is raised.

Raises:

  • DuplicateError: If policy is set to DuplicatePolicy.FAIL and a Document with the same id already exists.

Returns:

The number of Documents written. If DuplicatePolicy.OVERWRITE is used, this number is always equal to the number of documents in input. If DuplicatePolicy.SKIP is used, this number can be lower than the number of documents in the input list.

DocumentStore.delete_documents

def delete_documents(document_ids: List[str]) -> None

Deletes all documents with matching document_ids from the DocumentStore.

Fails with MissingDocumentError if no document with this id is present in the DocumentStore.

Arguments:

  • document_ids: The IDs of the documents to delete.