Abstract classes for the Document Stores and Keyword Document Stores.

Module base

BaseDocumentStore

class BaseDocumentStore(BaseComponent)

Base class for implementing Document Stores.

BaseDocumentStore.write_documents

@abstractmethod
def write_documents(documents: Union[List[dict], List[Document]],
                    index: Optional[str] = None,
                    batch_size: int = 10_000,
                    duplicate_documents: Optional[str] = None,
                    headers: Optional[Dict[str, str]] = None)

Indexes documents for later queries.

Arguments:

  • documents: A list of Python dictionaries or a list of Haystack Document objects. For documents as dictionaries, the format is {"text": ""}. Optionally, include metadata via {"text": "", "meta": {"name": "", "author": "somebody", ...}}. Metadata can be used for filtering and is accessible in the responses of the Finder.
  • index: Optional name of index where the documents shall be written to. If None, the DocumentStore's default index (self.index) will be used.
  • batch_size: Number of documents that are passed to bulk function at a time.
  • duplicate_documents: How to handle duplicate documents. Options: 'skip' (ignore the duplicate documents), 'overwrite' (update any existing documents with the same ID when adding documents), 'fail' (raise an error if the ID of the document being added already exists).
  • headers: Custom HTTP headers to pass to document store client if supported (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='} for basic authentication)

Returns:

None
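
The three duplicate_documents options can be illustrated with a minimal sketch. The class InMemorySketchStore, the exception DuplicateDocumentError, the "id" key, and the fallback to 'overwrite' when the option is None are all assumptions made for this illustration; they are not part of the Haystack API, and a concrete Document Store may behave differently.

```python
from typing import Dict, List, Optional


class DuplicateDocumentError(ValueError):
    """Hypothetical error raised by this sketch for duplicate_documents='fail'."""


class InMemorySketchStore:
    """Hypothetical dictionary-backed store, used only to show the duplicate options."""

    def __init__(self) -> None:
        self.docs: Dict[str, dict] = {}

    def write_documents(self, documents: List[dict],
                        duplicate_documents: Optional[str] = None) -> None:
        # Assumption of this sketch: None falls back to 'overwrite'.
        policy = duplicate_documents or "overwrite"
        if policy not in ("skip", "overwrite", "fail"):
            raise ValueError(f"Unknown duplicate_documents option: {policy!r}")
        for doc in documents:
            doc_id = doc["id"]
            if doc_id in self.docs:
                if policy == "skip":
                    continue  # keep the stored version, ignore the new one
                if policy == "fail":
                    raise DuplicateDocumentError(f"Document {doc_id!r} already exists")
            self.docs[doc_id] = doc  # new document, or 'overwrite' of an existing one


store = InMemorySketchStore()
store.write_documents([{"id": "1", "text": "first version"}])

store.write_documents([{"id": "1", "text": "ignored"}], duplicate_documents="skip")
text_after_skip = store.docs["1"]["text"]

store.write_documents([{"id": "1", "text": "second version"}], duplicate_documents="overwrite")
text_after_overwrite = store.docs["1"]["text"]
```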

BaseDocumentStore.get_all_documents

@abstractmethod
def get_all_documents(
        index: Optional[str] = None,
        filters: Optional[FilterType] = None,
        return_embedding: Optional[bool] = None,
        batch_size: int = 10_000,
        headers: Optional[Dict[str, str]] = None) -> List[Document]

Get documents from the document store.

Arguments:

  • index: Name of the index to get the documents from. If None, the DocumentStore's default index (self.index) will be used.

  • filters: Optional filters to narrow down the search space to documents whose metadata fulfill certain conditions. Filters are defined as nested dictionaries. The keys of the dictionaries can be a logical operator ("$and", "$or", "$not"), a comparison operator ("$eq", "$in", "$gt", "$gte", "$lt", "$lte") or a metadata field name. Logical operator keys take a dictionary of metadata field names and/or logical operators as value. Metadata field names take a dictionary of comparison operators as value. Comparison operator keys take a single value or (in case of "$in") a list of values as value. If no logical operator is provided, "$and" is used as default operation. If no comparison operator is provided, "$eq" (or "$in" if the comparison value is a list) is used as default operation.

    __Example__:
    
    ```python
    filters = {
        "$and": {
            "type": {"$eq": "article"},
            "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
            "rating": {"$gte": 3},
            "$or": {
                "genre": {"$in": ["economy", "politics"]},
                "publisher": {"$eq": "nytimes"}
            }
        }
    }
    ```
    
  • return_embedding: Whether to return the document embeddings.

  • batch_size: Number of documents that are passed to bulk function at a time.

  • headers: Custom HTTP headers to pass to document store client if supported (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='} for basic authentication)
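
To make the filter semantics described above concrete, here is a small evaluator sketch for the dictionary form of filters. The function names matches and field_matches are hypothetical, and the treatment of "$not" as "not all nested conditions hold" is an assumption of this sketch. Real Document Stores typically translate filters into backend queries rather than evaluating them in Python.

```python
import operator
from typing import Any, Dict

# Comparison operators named in the filter description.
COMPARISONS = {
    "$eq": operator.eq,
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
    "$in": lambda value, options: value in options,
}


def field_matches(actual: Any, condition: Any) -> bool:
    """Check one metadata value against a comparison dictionary or a bare value."""
    if not isinstance(condition, dict):
        # Default operator: "$in" for lists, "$eq" for everything else.
        default_op = "$in" if isinstance(condition, list) else "$eq"
        condition = {default_op: condition}
    if actual is None:
        return False  # assumption: a missing field never matches
    return all(COMPARISONS[op](actual, expected) for op, expected in condition.items())


def matches(meta: Dict[str, Any], filters: Dict[str, Any], logic: str = "$and") -> bool:
    """Return True if `meta` satisfies `filters`; "$and" is the default logic."""
    results = []
    for key, value in filters.items():
        if key in ("$and", "$or", "$not"):
            results.append(matches(meta, value, logic=key))
        else:
            results.append(field_matches(meta.get(key), value))
    if logic == "$or":
        return any(results)
    if logic == "$not":
        return not all(results)  # assumption of this sketch
    return all(results)


example_filters = {
    "$and": {
        "type": {"$eq": "article"},
        "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
        "rating": {"$gte": 3},
        "$or": {
            "genre": {"$in": ["economy", "politics"]},
            "publisher": {"$eq": "nytimes"},
        },
    }
}

article = {"type": "article", "date": "2018-05-01", "rating": 4,
           "genre": "economy", "publisher": "guardian"}
low_rated = dict(article, rating=2)

article_matches = matches(article, example_filters)
low_rated_matches = matches(low_rated, example_filters)
```

Dates compare correctly here only because ISO-formatted strings sort lexicographically in date order.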

BaseDocumentStore.get_all_documents_generator

@abstractmethod
def get_all_documents_generator(
    index: Optional[str] = None,
    filters: Optional[FilterType] = None,
    return_embedding: Optional[bool] = None,
    batch_size: int = 10_000,
    headers: Optional[Dict[str,
                           str]] = None) -> Generator[Document, None, None]

Get documents from the document store. Under the hood, documents are fetched in batches from the document store and yielded as individual documents. This method can be used to iteratively process a large number of documents without having to load all documents in memory.

Arguments:

  • index: Name of the index to get the documents from. If None, the DocumentStore's default index (self.index) will be used.
  • filters: Optional filters to narrow down the search space to documents whose metadata fulfill certain conditions. Filters are defined as nested dictionaries. The keys of the dictionaries can be a logical operator ("$and", "$or", "$not"), a comparison operator ("$eq", "$in", "$gt", "$gte", "$lt", "$lte") or a metadata field name. Logical operator keys take a dictionary of metadata field names and/or logical operators as value. Metadata field names take a dictionary of comparison operators as value. Comparison operator keys take a single value or (in case of "$in") a list of values as value. If no logical operator is provided, "$and" is used as default operation. If no comparison operator is provided, "$eq" (or "$in" if the comparison value is a list) is used as default operation.

    __Example__:

    ```python
    filters = {
        "$and": {
            "type": {"$eq": "article"},
            "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
            "rating": {"$gte": 3},
            "$or": {
                "genre": {"$in": ["economy", "politics"]},
                "publisher": {"$eq": "nytimes"}
            }
        }
    }
    ```

  • return_embedding: Whether to return the document embeddings.
  • batch_size: When working with a large number of documents, batching can help reduce memory footprint.
  • headers: Custom HTTP headers to pass to document store client if supported (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='} for basic authentication)
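
The batch-then-yield pattern described for this method can be sketched as follows. The helper fetch_batch and the list-backed storage are hypothetical stand-ins for a real backend call. The sketch only shows how a generator can hide batching from the caller.

```python
from typing import Dict, Generator, List

calls: List[int] = []  # records the offsets requested, for illustration only


def fetch_batch(storage: List[Dict], offset: int, batch_size: int) -> List[Dict]:
    """Hypothetical backend call: at most `batch_size` documents starting at `offset`."""
    calls.append(offset)
    return storage[offset:offset + batch_size]


def get_all_documents_generator(storage: List[Dict],
                                batch_size: int = 10_000) -> Generator[Dict, None, None]:
    """Fetch documents batch by batch, but yield them one at a time."""
    offset = 0
    while True:
        batch = fetch_batch(storage, offset, batch_size)
        if not batch:
            return  # backend is exhausted
        yield from batch
        offset += len(batch)


storage = [{"id": str(i)} for i in range(7)]
docs = list(get_all_documents_generator(storage, batch_size=3))
```

Only one batch is held in memory at a time, so the caller can loop over the generator without materializing the full result list.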

BaseDocumentStore.get_all_labels_aggregated

def get_all_labels_aggregated(
        index: Optional[str] = None,
        filters: Optional[FilterType] = None,
        open_domain: bool = True,
        drop_negative_labels: bool = False,
        drop_no_answers: bool = False,
        aggregate_by_meta: Optional[Union[str, list]] = None,
        headers: Optional[Dict[str, str]] = None) -> List[MultiLabel]

Return all labels in the DocumentStore, aggregated into MultiLabel objects.

This aggregation step helps, for example, if you collected multiple possible answers for one question and you now want all answers bundled together in one place for evaluation. How they are aggregated is defined by the open_domain and aggregate_by_meta parameters.

If the questions are being asked to a single document (i.e. SQuAD style), you should set open_domain=False to aggregate by question and document. If the questions are being asked to your full collection of documents, you should set open_domain=True to aggregate just by question. If the questions are being asked to a subset of your document set (e.g. product review use cases), you should set open_domain=True and populate aggregate_by_meta with the names of Label meta fields to aggregate by question and your custom meta fields. For example, in a product review use case, you might set aggregate_by_meta=["product_id"] so that Labels with the same question but different answers from different documents are aggregated into one MultiLabel object, provided that they have the same product_id (to be found in Label.meta["product_id"]).

Arguments:

  • index: Name of the index to get the labels from. If None, the DocumentStore's default index (self.index) will be used.

  • filters: Optional filters to narrow down the search space to documents whose metadata fulfill certain conditions. Filters are defined as nested dictionaries. The keys of the dictionaries can be a logical operator ("$and", "$or", "$not"), a comparison operator ("$eq", "$in", "$gt", "$gte", "$lt", "$lte") or a metadata field name. Logical operator keys take a dictionary of metadata field names and/or logical operators as value. Metadata field names take a dictionary of comparison operators as value. Comparison operator keys take a single value or (in case of "$in") a list of values as value. If no logical operator is provided, "$and" is used as default operation. If no comparison operator is provided, "$eq" (or "$in" if the comparison value is a list) is used as default operation.

    __Example__:
    
    ```python
    filters = {
        "$and": {
            "type": {"$eq": "article"},
            "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
            "rating": {"$gte": 3},
            "$or": {
                "genre": {"$in": ["economy", "politics"]},
                "publisher": {"$eq": "nytimes"}
            }
        }
    }
    ```
    
  • open_domain: When True, labels are aggregated purely based on the question text alone. When False, labels are aggregated in a closed domain fashion based on the question text and also the id of the document that the label is tied to. In this setting, this function might return multiple MultiLabel objects with the same question string.

  • headers: Custom HTTP headers to pass to document store client if supported (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='} for basic authentication)

  • aggregate_by_meta: The names of the Label meta fields by which to aggregate. For example: ["product_id"]

  • drop_negative_labels: When True, labels with incorrect answers and documents are dropped.

  • drop_no_answers: When True, labels with no answers are dropped.
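
As a rough illustration of how open_domain and aggregate_by_meta change the grouping, the sketch below builds a grouping key per label. Labels are modeled as plain dictionaries with hypothetical "query", "document_id", and "meta" keys, and the helpers aggregation_key and group_labels are assumptions for this example, not the library's implementation.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


def aggregation_key(label: dict, open_domain: bool,
                    aggregate_by_meta: Optional[List[str]] = None) -> Tuple:
    """Build the key that decides which labels end up in the same group."""
    key = [label["query"]]
    if not open_domain:
        key.append(label["document_id"])  # closed domain: question AND document
    for field in aggregate_by_meta or []:
        key.append(label["meta"].get(field))  # custom meta fields, e.g. product_id
    return tuple(key)


def group_labels(labels: List[dict], open_domain: bool = True,
                 aggregate_by_meta: Optional[List[str]] = None) -> Dict[Tuple, List[dict]]:
    groups: Dict[Tuple, List[dict]] = defaultdict(list)
    for label in labels:
        groups[aggregation_key(label, open_domain, aggregate_by_meta)].append(label)
    return dict(groups)


labels = [
    {"query": "Is it durable?", "document_id": "d1", "meta": {"product_id": "A"}},
    {"query": "Is it durable?", "document_id": "d2", "meta": {"product_id": "A"}},
    {"query": "Is it durable?", "document_id": "d3", "meta": {"product_id": "B"}},
]

by_question = group_labels(labels, open_domain=True)
by_question_and_doc = group_labels(labels, open_domain=False)
by_question_and_product = group_labels(labels, open_domain=True,
                                       aggregate_by_meta=["product_id"])
```

With these three labels, grouping by question alone gives one group, grouping by question and document gives three, and grouping by question and product_id gives two.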

BaseDocumentStore.normalize_embedding

@staticmethod
def normalize_embedding(emb: np.ndarray) -> None

Performs L2 normalization of embedding vectors in place. Input can be a single vector (1D array) or a matrix (2D array).
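
For readers unfamiliar with L2 normalization, the following dependency-free sketch shows the arithmetic on plain Python lists. The actual method operates on NumPy arrays. The function name l2_normalize_in_place and the choice to leave zero vectors untouched are assumptions of this sketch.

```python
import math
from typing import List


def l2_normalize_in_place(vectors: List[List[float]]) -> None:
    """Divide each row by its Euclidean (L2) norm, modifying the rows in place."""
    for row in vectors:
        norm = math.sqrt(sum(x * x for x in row))
        if norm == 0.0:
            continue  # assumption: zero vectors are left unchanged
        for i, x in enumerate(row):
            row[i] = x / norm


vectors = [[3.0, 4.0], [0.0, 2.0]]
result = l2_normalize_in_place(vectors)  # returns None; `vectors` itself is changed
```

After the call, every non-zero row has unit length, which makes the dot product of two rows equal to their cosine similarity.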

BaseDocumentStore.add_eval_data

def add_eval_data(filename: str,
                  doc_index: str = "eval_document",
                  label_index: str = "label",
                  batch_size: Optional[int] = None,
                  preprocessor: Optional[PreProcessor] = None,
                  max_docs: Optional[Union[int, bool]] = None,
                  open_domain: bool = False,
                  headers: Optional[Dict[str, str]] = None)

Adds a SQuAD-formatted file to the DocumentStore in order to be able to perform evaluation on it.

If a jsonl file and a batch_size is passed to the function, documents are loaded batchwise from disk and also indexed batchwise to the DocumentStore in order to prevent out of memory errors.

Arguments:

  • filename: Name of the file containing evaluation data (json or jsonl)
  • doc_index: Elasticsearch index where evaluation documents should be stored
  • label_index: Elasticsearch index where labeled questions should be stored
  • batch_size: Optional number of documents that are loaded and processed at a time. When set to None (default) all documents are processed at once.
  • preprocessor: Optional PreProcessor to preprocess evaluation documents. It can be used for splitting documents into passages (and assigning labels to corresponding passages). Currently the PreProcessor does not support split_by sentence, cleaning nor split_overlap != 0. When set to None (default) preprocessing is disabled.
  • max_docs: Optional number of documents that will be loaded. When set to None (default) all available eval documents are used.
  • open_domain: Set this to True if your file is an open domain dataset where two different answers to the same question might be found in different contexts.
  • headers: Custom HTTP headers to pass to document store client if supported (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='} for basic authentication)
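
The batchwise loading mentioned above (a jsonl file combined with batch_size) can be sketched as a generator that reads a fixed number of lines at a time. The function read_jsonl_in_batches and the toy records are hypothetical. The sketch does not parse SQuAD structure, it only shows how batching avoids loading the whole file.

```python
import json
import os
import tempfile
from itertools import islice
from typing import Iterator, List


def read_jsonl_in_batches(path: str, batch_size: int) -> Iterator[List[dict]]:
    """Yield lists of at most `batch_size` parsed records from a jsonl file."""
    with open(path, encoding="utf-8") as handle:
        while True:
            lines = list(islice(handle, batch_size))  # read the next chunk of lines
            if not lines:
                return
            yield [json.loads(line) for line in lines if line.strip()]


# Write a small throwaway jsonl file to demonstrate the batching.
records = [{"id": i} for i in range(5)]
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False,
                                 encoding="utf-8") as tmp:
    for record in records:
        tmp.write(json.dumps(record) + "\n")
    path = tmp.name

try:
    batches = list(read_jsonl_in_batches(path, batch_size=2))
finally:
    os.remove(path)

batch_sizes = [len(batch) for batch in batches]
```

Each yielded batch could then be preprocessed and indexed before the next one is read from disk.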

BaseDocumentStore.delete_index

@abstractmethod
def delete_index(index: str)

Delete an existing index. The index including all data will be removed.

Arguments:

  • index: The name of the index to delete.

Returns:

None

BaseDocumentStore.run

def run(documents: List[Union[dict, Document]],
        index: Optional[str] = None,
        headers: Optional[Dict[str, str]] = None,
        id_hash_keys: Optional[List[str]] = None)

Run requests of document stores.

Comment: We will gradually introduce the primitives. The document stores also accept dicts and parse them to documents. In the future, however, only documents themselves will be accepted. Parsing the dictionaries in the run function is therefore only an interim solution until the run function also accepts documents.

Arguments:

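  • documents: A list of documents, given as dicts or as Document objects.
  • headers: Custom HTTP headers to pass to document store client if supported (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='} for basic authentication)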
  • index: Optional name of index where the documents shall be written to. If None, the DocumentStore's default index (self.index) will be used.
  • id_hash_keys: List of the fields that the hashes of the ids are generated from.
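
The idea behind id_hash_keys, deriving a document ID from selected fields, can be sketched as below. The helper make_id, the use of SHA-256 over a JSON dump, and the default of hashing only "text" are assumptions for illustration. The library's actual hashing scheme may differ.

```python
import hashlib
import json
from typing import List, Optional


def make_id(doc: dict, id_hash_keys: Optional[List[str]] = None) -> str:
    """Derive a deterministic ID from the selected top-level fields of `doc`."""
    keys = id_hash_keys or ["text"]  # assumed default for this sketch
    payload = json.dumps({key: doc.get(key) for key in keys}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


doc_a = {"text": "hello", "meta": {"name": "a.txt"}}
doc_b = {"text": "hello", "meta": {"name": "b.txt"}}

same_by_text = make_id(doc_a) == make_id(doc_b)
same_by_text_and_meta = (make_id(doc_a, ["text", "meta"])
                         == make_id(doc_b, ["text", "meta"]))
```

Including more fields in id_hash_keys lets documents with identical text but different metadata receive distinct IDs, which matters when duplicates are detected by ID.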

BaseDocumentStore.describe_documents

def describe_documents(index=None)

Return a summary of the documents in the document store.

KeywordDocumentStore

class KeywordDocumentStore(BaseDocumentStore)

Base class for implementing Document Stores that support keyword searches.

KeywordDocumentStore.query

@abstractmethod
def query(query: Optional[str],
          filters: Optional[FilterType] = None,
          top_k: int = 10,
          custom_query: Optional[str] = None,
          index: Optional[str] = None,
          headers: Optional[Dict[str, str]] = None,
          all_terms_must_match: bool = False,
          scale_score: bool = True) -> List[Document]

Scan through documents in DocumentStore and return a small number of documents that are most relevant to the query as defined by keyword matching algorithms like BM25.

Arguments:

  • query: The query

  • filters: Optional filters to narrow down the search space to documents whose metadata fulfill certain conditions. Filters are defined as nested dictionaries. The keys of the dictionaries can be a logical operator ("$and", "$or", "$not"), a comparison operator ("$eq", "$in", "$gt", "$gte", "$lt", "$lte") or a metadata field name. Logical operator keys take a dictionary of metadata field names and/or logical operators as value. Metadata field names take a dictionary of comparison operators as value. Comparison operator keys take a single value or (in case of "$in") a list of values as value. If no logical operator is provided, "$and" is used as default operation. If no comparison operator is provided, "$eq" (or "$in" if the comparison value is a list) is used as default operation.

    __Example__:
    
    ```python
    filters = {
        "$and": {
            "type": {"$eq": "article"},
            "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
            "rating": {"$gte": 3},
            "$or": {
                "genre": {"$in": ["economy", "politics"]},
                "publisher": {"$eq": "nytimes"}
            }
        }
    }
    # or simpler using default operators
    filters = {
        "type": "article",
        "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
        "rating": {"$gte": 3},
        "$or": {
            "genre": ["economy", "politics"],
            "publisher": "nytimes"
        }
    }
    ```
    
    To use the same logical operator multiple times on the same level, logical operators take
    optionally a list of dictionaries as value.
    
    __Example__:
    
    ```python
    filters = {
        "$or": [
            {
                "$and": {
                    "Type": "News Paper",
                    "Date": {
                        "$lt": "2019-01-01"
                    }
                }
            },
            {
                "$and": {
                    "Type": "Blog Post",
                    "Date": {
                        "$gte": "2019-01-01"
                    }
                }
            }
        ]
    }
    ```
    
  • top_k: How many documents to return per query.

  • custom_query: Custom query to be executed.

  • index: The name of the index in the DocumentStore from which to retrieve documents

  • headers: Custom HTTP headers to pass to document store client if supported (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='} for basic authentication)

  • all_terms_must_match: Whether all terms of the query must match the document. If True, all query terms must be present in a document in order to be retrieved (i.e., the AND operator is used implicitly between query terms: "cozy fish restaurant" -> "cozy AND fish AND restaurant"). Otherwise, at least one query term must be present in a document in order to be retrieved (i.e., the OR operator is used implicitly between query terms: "cozy fish restaurant" -> "cozy OR fish OR restaurant"). Defaults to False.

  • scale_score: Whether to scale the similarity score to the unit interval (range of [0,1]). If true (default) similarity scores (e.g. cosine or dot_product) which naturally have a different value range will be scaled to a range of [0,1], where 1 means extremely relevant. Otherwise raw similarity scores (e.g. cosine or dot_product) will be used.
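
The difference between the implicit AND and OR behavior of all_terms_must_match can be shown with a toy matcher. The function keyword_match and its whitespace tokenization are simplifications assumed for this sketch. A real keyword store ranks documents with an algorithm such as BM25 rather than returning a boolean.

```python
def keyword_match(query: str, text: str, all_terms_must_match: bool = False) -> bool:
    """Return True if `text` contains all (AND) or at least one (OR) of the query terms."""
    terms = query.lower().split()
    tokens = set(text.lower().split())
    hits = [term in tokens for term in terms]
    return all(hits) if all_terms_must_match else any(hits)


query = "cozy fish restaurant"
partial = "a cozy restaurant by the harbor"      # no "fish"
full = "cozy fish restaurant downtown"

partial_or = keyword_match(query, partial)                              # OR semantics
partial_and = keyword_match(query, partial, all_terms_must_match=True)  # AND semantics
full_and = keyword_match(query, full, all_terms_must_match=True)
```

The document without "fish" is retrieved under the default OR behavior but excluded when all terms must match.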

KeywordDocumentStore.query_batch

@abstractmethod
def query_batch(queries: List[str],
                filters: Optional[Union[FilterType,
                                        List[Optional[FilterType]]]] = None,
                top_k: int = 10,
                custom_query: Optional[str] = None,
                index: Optional[str] = None,
                headers: Optional[Dict[str, str]] = None,
                all_terms_must_match: bool = False,
                scale_score: bool = True) -> List[List[Document]]

Scan through documents in DocumentStore and return a small number of documents that are most relevant to the provided queries as defined by keyword matching algorithms like BM25.

This method lets you find relevant documents for a single query string (output: List of Documents), or a list of query strings (output: List of Lists of Documents).

Arguments:

  • queries: Single query or list of queries.

  • filters: Optional filters to narrow down the search space to documents whose metadata fulfill certain conditions. Filters are defined as nested dictionaries. The keys of the dictionaries can be a logical operator ("$and", "$or", "$not"), a comparison operator ("$eq", "$in", "$gt", "$gte", "$lt", "$lte") or a metadata field name. Logical operator keys take a dictionary of metadata field names and/or logical operators as value. Metadata field names take a dictionary of comparison operators as value. Comparison operator keys take a single value or (in case of "$in") a list of values as value. If no logical operator is provided, "$and" is used as default operation. If no comparison operator is provided, "$eq" (or "$in" if the comparison value is a list) is used as default operation.

    __Example__:
    
    ```python
    filters = {
        "$and": {
            "type": {"$eq": "article"},
            "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
            "rating": {"$gte": 3},
            "$or": {
                "genre": {"$in": ["economy", "politics"]},
                "publisher": {"$eq": "nytimes"}
            }
        }
    }
    # or simpler using default operators
    filters = {
        "type": "article",
        "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
        "rating": {"$gte": 3},
        "$or": {
            "genre": ["economy", "politics"],
            "publisher": "nytimes"
        }
    }
    ```
    
    To use the same logical operator multiple times on the same level, logical operators take
    optionally a list of dictionaries as value.
    
    __Example__:
    
    ```python
    filters = {
        "$or": [
            {
                "$and": {
                    "Type": "News Paper",
                    "Date": {
                        "$lt": "2019-01-01"
                    }
                }
            },
            {
                "$and": {
                    "Type": "Blog Post",
                    "Date": {
                        "$gte": "2019-01-01"
                    }
                }
            }
        ]
    }
    ```
    
  • top_k: How many documents to return per query.

  • custom_query: Custom query to be executed.

  • index: The name of the index in the DocumentStore from which to retrieve documents

  • headers: Custom HTTP headers to pass to document store client if supported (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='} for basic authentication)

  • all_terms_must_match: Whether all terms of the query must match the document. If True, all query terms must be present in a document in order to be retrieved (i.e., the AND operator is used implicitly between query terms: "cozy fish restaurant" -> "cozy AND fish AND restaurant"). Otherwise, at least one query term must be present in a document in order to be retrieved (i.e., the OR operator is used implicitly between query terms: "cozy fish restaurant" -> "cozy OR fish OR restaurant"). Defaults to False.

  • scale_score: Whether to scale the similarity score to the unit interval (range of [0,1]). If true (default) similarity scores (e.g. cosine or dot_product) which naturally have a different value range will be scaled to a range of [0,1], where 1 means extremely relevant. Otherwise raw similarity scores (e.g. cosine or dot_product) will be used.
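
The type hint of filters allows either one filter or a list of filters. One plausible reading is that a single filter applies to every query, while a list supplies one filter per query. The sketch below encodes that reading. The helper expand_filters and the length check are assumptions, not behavior confirmed by this reference.

```python
from typing import Any, Dict, List, Optional, Union

FilterType = Dict[str, Any]  # simplified stand-in for the library's FilterType


def expand_filters(
    queries: List[str],
    filters: Optional[Union[FilterType, List[Optional[FilterType]]]] = None,
) -> List[Optional[FilterType]]:
    """Return exactly one filter (or None) per query."""
    if isinstance(filters, list):
        if len(filters) != len(queries):
            raise ValueError("Number of filters must match number of queries")
        return filters
    # A single filter (or None) is reused for every query.
    return [filters] * len(queries)


queries = ["cozy fish restaurant", "cheap hotel"]

shared = expand_filters(queries, {"type": "article"})
per_query = expand_filters(queries, [{"type": "article"}, None])
unfiltered = expand_filters(queries)
```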