API Reference

Sweeps through a document store and returns a set of candidate documents that are relevant to the query.

Module sparse

BM25Retriever

class BM25Retriever(BaseRetriever)

BM25Retriever.__init__

def __init__(document_store: Optional[KeywordDocumentStore] = None,
             top_k: int = 10,
             all_terms_must_match: bool = False,
             custom_query: Optional[str] = None,
             scale_score: bool = True)

Arguments:

  • document_store: An instance of one of the following DocumentStores to retrieve from: InMemoryDocumentStore, ElasticsearchDocumentStore and OpenSearchDocumentStore. If None, a document store must be passed to the retrieve method for this Retriever to work.

  • all_terms_must_match: Whether all terms of the query must match the document. If True, all query terms must be present in a document for it to be retrieved (i.e., the AND operator is used implicitly between query terms: "cozy fish restaurant" -> "cozy AND fish AND restaurant"). Otherwise, at least one query term must be present in a document for it to be retrieved (i.e., the OR operator is used implicitly between query terms: "cozy fish restaurant" -> "cozy OR fish OR restaurant"). Defaults to False.

  • custom_query: The query string containing a mandatory ${query} and an optional ${filters} placeholder. An example custom_query:

    {
        "size": 10,
        "query": {
            "bool": {
                "should": [{"multi_match": {
                    "query": ${query},                 // mandatory query placeholder
                    "type": "most_fields",
                    "fields": ["content", "title"]}}],
                "filter": ${filters}                  // optional filter placeholder
            }
        },
    }
    

For this custom_query, a sample retrieve() could be:

self.retrieve(query="Why did the revenue increase?",
              filters={"years": ["2019"], "quarters": ["Q1", "Q2"]})

Optionally, highlighting can be defined by specifying Elasticsearch's highlight settings. See https://www.elastic.co/guide/en/elasticsearch/reference/current/highlighting.html. You will find the highlighted output in the returned Document's meta field by key "highlighted".

 **Example custom_query with highlighting:**

 ```python
 {
     "size": 10,
     "query": {
         "bool": {
             "should": [{"multi_match": {
                 "query": ${query},                 // mandatory query placeholder
                 "type": "most_fields",
                 "fields": ["content", "title"]}}],
         }
     },
     "highlight": {             // enable highlighting
         "fields": {            // for fields content and title
             "content": {},
             "title": {}
         }
     },
 }
 ```

 **For this custom_query, highlighting info can be accessed by:**
```python
docs = self.retrieve(query="Why did the revenue increase?")
highlighted_content = docs[0].meta["highlighted"]["content"]
highlighted_title = docs[0].meta["highlighted"]["title"]
```
  • top_k: How many documents to return per query.
  • scale_score: Whether to scale the similarity score to the unit interval (range of [0,1]). If True (default), similarity scores (e.g. cosine or dot_product), which naturally have a different value range, are scaled to a range of [0,1], where 1 means extremely relevant. Otherwise, raw similarity scores (e.g. cosine or dot_product) are used.
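
As a rough illustration of the `${query}` and `${filters}` placeholders in custom_query, the sketch below fills such a template with Python's `string.Template`, which uses the same `${name}` syntax. The helper `render_custom_query` and the translation of filters into `terms` clauses are assumptions made for this example, not the Retriever's documented behavior. The `//` comments and trailing commas from the example above are dropped so that the rendered string parses as JSON.

```python
import json
from string import Template

# Same shape as the example custom_query, written as strict JSON.
CUSTOM_QUERY = """
{
    "size": 10,
    "query": {
        "bool": {
            "should": [{"multi_match": {
                "query": ${query},
                "type": "most_fields",
                "fields": ["content", "title"]}}],
            "filter": ${filters}
        }
    }
}
"""


def render_custom_query(template_str, query, filters):
    """Hypothetical helper: substitute both placeholders and parse the result."""
    # Assumption: each {field: [values]} filter becomes one "terms" clause.
    clauses = [{"terms": {field: values}} for field, values in filters.items()]
    rendered = Template(template_str).substitute(
        query=json.dumps(query),      # JSON-quoted query string
        filters=json.dumps(clauses),  # JSON list of filter clauses
    )
    return json.loads(rendered)


body = render_custom_query(
    CUSTOM_QUERY,
    query="Why did the revenue increase?",
    filters={"years": ["2019"], "quarters": ["Q1", "Q2"]},
)
print(json.dumps(body, indent=2))
```

Serializing the query with `json.dumps` keeps quotes inside the query text from breaking the rendered JSON.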

BM25Retriever.retrieve

def retrieve(
        query: str,
        filters: Optional[FilterType] = None,
        top_k: Optional[int] = None,
        all_terms_must_match: Optional[bool] = None,
        index: Optional[str] = None,
        headers: Optional[Dict[str, str]] = None,
        scale_score: Optional[bool] = None,
        document_store: Optional[BaseDocumentStore] = None) -> List[Document]

Scan through documents in DocumentStore and return a small number of documents that are most relevant to the query.

Arguments:

  • query: The query string.

  • filters: Optional filters to narrow down the search space to documents whose metadata fulfill certain conditions. Filters are defined as nested dictionaries. The keys of the dictionaries can be a logical operator ("$and", "$or", "$not"), a comparison operator ("$eq", "$in", "$gt", "$gte", "$lt", "$lte") or a metadata field name. Logical operator keys take a dictionary of metadata field names and/or logical operators as value. Metadata field names take a dictionary of comparison operators as value. Comparison operator keys take a single value or (in case of "$in") a list of values as value. If no logical operator is provided, "$and" is used as default operation. If no comparison operator is provided, "$eq" (or "$in" if the comparison value is a list) is used as default operation.

    __Example__:
    
    ```python
    filters = {
        "$and": {
            "type": {"$eq": "article"},
            "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
            "rating": {"$gte": 3},
            "$or": {
                "genre": {"$in": ["economy", "politics"]},
                "publisher": {"$eq": "nytimes"}
            }
        }
    }
    # or simpler using default operators
    filters = {
        "type": "article",
        "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
        "rating": {"$gte": 3},
        "$or": {
            "genre": ["economy", "politics"],
            "publisher": "nytimes"
        }
    }
    ```
    
    To use the same logical operator multiple times on the same level, logical operators take
    optionally a list of dictionaries as value.
    
    __Example__:
    
    ```python
    filters = {
        "$or": [
            {
                "$and": {
                    "Type": "News Paper",
                    "Date": {
                        "$lt": "2019-01-01"
                    }
                }
            },
            {
                "$and": {
                    "Type": "Blog Post",
                    "Date": {
                        "$gte": "2019-01-01"
                    }
                }
            }
        ]
    }
    ```
    
  • top_k: How many documents to return per query.

  • all_terms_must_match: Whether all terms of the query must match the document. When set to True, the Retriever returns only documents that contain all query terms (that means the AND operator is being used implicitly between query terms. For example, the query "cozy fish restaurant" is read as "cozy AND fish AND restaurant"). When set to False, the Retriever returns documents containing at least one query term (this means the OR operator is being used implicitly between query terms. For example, the query "cozy fish restaurant" is read as "cozy OR fish OR restaurant"). Defaults to None. If you set a value for this parameter, it overwrites self.all_terms_must_match at runtime.

  • index: The name of the index in the DocumentStore from which to retrieve documents

  • headers: Custom HTTP headers to pass to the Elasticsearch client (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='}). Check out https://www.elastic.co/guide/en/elasticsearch/reference/current/http-clients.html for more information.

  • scale_score: Whether to scale the similarity score to the unit interval (range of [0,1]). If True, similarity scores (e.g. cosine or dot_product), which naturally have a different value range, are scaled to a range of [0,1], where 1 means extremely relevant. Otherwise, raw similarity scores (e.g. cosine or dot_product) are used.

  • document_store: The DocumentStore to use for retrieval. If None, the one given in __init__ is used instead.
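
To make the nested filter syntax described for the filters argument more concrete, here is a minimal sketch of how such a filter could be evaluated against one document's metadata. The functions `matches` and `_check` are hypothetical and written only for this illustration; a real DocumentStore translates filters into its own query language instead. The sketch assumes that "$not" negates the conjunction of its conditions, and it does not handle logical operators nested inside a field's comparison dictionary.

```python
import operator

# Comparison operators named in the filters description.
COMPARISONS = {
    "$eq": operator.eq,
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
    "$in": lambda value, accepted: value in accepted,
}
LOGICAL = ("$and", "$or", "$not")


def matches(meta, filters):
    """Check a metadata dict against a filter dict; sibling keys are joined with $and."""
    return all(_check(meta, key, value) for key, value in filters.items())


def _check(meta, key, value):
    if key in LOGICAL:
        if isinstance(value, list):
            # A list of dictionaries: each entry is evaluated as its own filter.
            results = [matches(meta, branch) for branch in value]
        else:
            # A single dictionary: each item is one condition.
            results = [_check(meta, k, v) for k, v in value.items()]
        if key == "$or":
            return any(results)
        # Assumption: "$not" negates the conjunction of its conditions.
        return all(results) if key == "$and" else not all(results)
    # Otherwise the key is a metadata field name.
    if key not in meta:
        return False
    if not isinstance(value, dict):
        # Default operator: "$in" for lists, "$eq" for single values.
        value = {"$in" if isinstance(value, list) else "$eq": value}
    return all(COMPARISONS[op](meta[key], operand) for op, operand in value.items())


article = {"type": "article", "date": "2018-05-01", "rating": 4,
           "genre": "sports", "publisher": "nytimes"}
filters = {
    "type": "article",
    "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
    "rating": {"$gte": 3},
    "$or": {"genre": ["economy", "politics"], "publisher": "nytimes"},
}
print(matches(article, filters))  # True
```

The date comparison works on plain strings here only because ISO-formatted dates sort lexicographically.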

BM25Retriever.retrieve_batch

def retrieve_batch(
    queries: List[str],
    filters: Optional[Union[FilterType, List[Optional[FilterType]]]] = None,
    top_k: Optional[int] = None,
    all_terms_must_match: Optional[bool] = None,
    index: Optional[str] = None,
    headers: Optional[Dict[str, str]] = None,
    batch_size: Optional[int] = None,
    scale_score: Optional[bool] = None,
    document_store: Optional[BaseDocumentStore] = None
) -> List[List[Document]]

Scan through documents in DocumentStore and return a small number of documents that are most relevant to the supplied queries.

Returns a list of lists of Documents (one per query).

Arguments:

  • queries: List of query strings.

  • filters: Optional filters to narrow down the search space to documents whose metadata fulfill certain conditions. Filters are defined as nested dictionaries. The keys of the dictionaries can be a logical operator ("$and", "$or", "$not"), a comparison operator ("$eq", "$in", "$gt", "$gte", "$lt", "$lte") or a metadata field name. Logical operator keys take a dictionary of metadata field names and/or logical operators as value. Metadata field names take a dictionary of comparison operators as value. Comparison operator keys take a single value or (in case of "$in") a list of values as value. If no logical operator is provided, "$and" is used as default operation. If no comparison operator is provided, "$eq" (or "$in" if the comparison value is a list) is used as default operation.

    __Example__:
    
    ```python
    filters = {
        "$and": {
            "type": {"$eq": "article"},
            "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
            "rating": {"$gte": 3},
            "$or": {
                "genre": {"$in": ["economy", "politics"]},
                "publisher": {"$eq": "nytimes"}
            }
        }
    }
    # or simpler using default operators
    filters = {
        "type": "article",
        "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
        "rating": {"$gte": 3},
        "$or": {
            "genre": ["economy", "politics"],
            "publisher": "nytimes"
        }
    }
    ```
    
    To use the same logical operator multiple times on the same level, logical operators take
    optionally a list of dictionaries as value.
    
    __Example__:
    
    ```python
    filters = {
        "$or": [
            {
                "$and": {
                    "Type": "News Paper",
                    "Date": {
                        "$lt": "2019-01-01"
                    }
                }
            },
            {
                "$and": {
                    "Type": "Blog Post",
                    "Date": {
                        "$gte": "2019-01-01"
                    }
                }
            }
        ]
    }
    ```
    
  • top_k: How many documents to return per query.

  • all_terms_must_match: Whether all terms of the query must match the document. When set to True, the Retriever returns only documents that contain all query terms (that means the AND operator is being used implicitly between query terms. For example, the query "cozy fish restaurant" is read as "cozy AND fish AND restaurant"). When set to False, the Retriever returns documents containing at least one query term (this means the OR operator is being used implicitly between query terms. For example, the query "cozy fish restaurant" is read as "cozy OR fish OR restaurant"). Defaults to None. If you set a value for this parameter, it overwrites self.all_terms_must_match at runtime.

  • index: The name of the index in the DocumentStore from which to retrieve documents

  • headers: Custom HTTP headers to pass to the Elasticsearch client (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='}). Check out https://www.elastic.co/guide/en/elasticsearch/reference/current/http-clients.html for more information.

  • batch_size: Not applicable.

  • scale_score: Whether to scale the similarity score to the unit interval (range of [0,1]). If True, similarity scores (e.g. cosine or dot_product), which naturally have a different value range, are scaled to a range of [0,1], where 1 means extremely relevant. Otherwise, raw similarity scores (e.g. cosine or dot_product) are used.

  • document_store: The DocumentStore to use for retrieval. If None, the one given in __init__ is used instead.
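
The interplay between the per-call all_terms_must_match argument, the value stored on the Retriever, and the one-result-list-per-query return shape can be sketched with a toy class. `ToyKeywordRetriever` is a hypothetical stand-in that matches whitespace-separated terms over an in-memory list of strings; it performs no BM25 scoring or ranking and is not the library's implementation.

```python
from typing import List, Optional


class ToyKeywordRetriever:
    """Hypothetical stand-in used only to illustrate AND/OR term matching."""

    def __init__(self, documents: List[str], all_terms_must_match: bool = False):
        self.documents = documents
        self.all_terms_must_match = all_terms_must_match
        # Pre-tokenize once so every query reuses the same term sets.
        self._tokens = [set(doc.lower().split()) for doc in documents]

    def retrieve_batch(
        self, queries: List[str], all_terms_must_match: Optional[bool] = None
    ) -> List[List[str]]:
        # A value passed at call time overrides the one set in __init__.
        if all_terms_must_match is None:
            all_terms_must_match = self.all_terms_must_match
        combine = all if all_terms_must_match else any  # AND vs. OR
        results = []
        for query in queries:
            terms = query.lower().split()
            results.append([
                doc
                for doc, tokens in zip(self.documents, self._tokens)
                if combine(term in tokens for term in terms)
            ])
        return results  # one list of documents per query


docs = ["cozy fish restaurant downtown", "fish market", "cozy cafe"]
retriever = ToyKeywordRetriever(docs)
print(retriever.retrieve_batch(["cozy fish restaurant", "market"]))
print(retriever.retrieve_batch(["cozy fish restaurant"], all_terms_must_match=True))
```

With the default OR semantics the first query matches all three toy documents, while the AND override keeps only the document that contains every term.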

BM25Retriever.timing

def timing(fn, attr_name)

Wrapper method used to time functions.

BM25Retriever.eval

def eval(label_index: str = "label",
         doc_index: str = "eval_document",
         label_origin: str = "gold-label",
         top_k: int = 10,
         open_domain: bool = False,
         return_preds: bool = False,
         headers: Optional[Dict[str, str]] = None,
         document_store: Optional[BaseDocumentStore] = None) -> dict

Performs evaluation on the Retriever.

Retriever is evaluated based on whether it finds the correct document given the query string and at which position in the ranking of documents the correct document is.

Returns a dict containing the following metrics:

- "recall": Proportion of questions for which the correct document is among the retrieved documents
- "mrr": Mean of reciprocal rank. Rewards retrievers that give relevant documents a higher rank.
  Only considers the highest ranked relevant document.
- "map": Mean of average precision for each question. Rewards retrievers that give relevant
  documents a higher rank. Considers all retrieved relevant documents. If ``open_domain=True``,
  average precision is normalized by the number of retrieved relevant documents per query.
  If ``open_domain=False``, average precision is normalized by the number of all relevant documents
  per query.

Arguments:

  • label_index: Index/Table in DocumentStore where labeled questions are stored
  • doc_index: Index/Table in DocumentStore where documents that are used for evaluation are stored
  • top_k: How many documents to return per query
  • open_domain: If True, retrieval will be evaluated by checking if the answer string to a question is contained in the retrieved docs (common approach in open-domain QA). If False, retrieval uses a stricter evaluation that checks if the retrieved document ids are within ids explicitly stated in the labels.
  • return_preds: Whether to add predictions in the returned dictionary. If True, the returned dictionary contains the keys "predictions" and "metrics".
  • headers: Custom HTTP headers to pass to document store client if supported (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='} for basic authentication)
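
The three metrics returned by eval can be sketched as follows. The function `retrieval_metrics` is a hypothetical helper: it assumes relevance has already been decided per document id (by id lookup or by answer-string containment) and only illustrates the arithmetic, including the two normalizations of average precision described above. It is not the library's evaluation code.

```python
from typing import Dict, List, Set


def retrieval_metrics(
    ranked_ids: List[List[str]],
    relevant_ids: List[Set[str]],
    open_domain: bool = False,
) -> Dict[str, float]:
    """Compute recall, MRR and MAP over a set of queries (illustrative only)."""
    recalls, reciprocal_ranks, average_precisions = [], [], []
    for ranked, relevant in zip(ranked_ids, relevant_ids):
        hits = 0
        precision_sum = 0.0
        first_hit_rank = None
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in relevant:
                hits += 1
                precision_sum += hits / rank  # precision at this rank
                if first_hit_rank is None:
                    first_hit_rank = rank
        recalls.append(1.0 if hits else 0.0)
        reciprocal_ranks.append(1.0 / first_hit_rank if first_hit_rank else 0.0)
        # open_domain=True: normalize by retrieved relevant documents;
        # open_domain=False: normalize by all relevant documents.
        denominator = hits if open_domain else len(relevant)
        average_precisions.append(precision_sum / denominator if denominator else 0.0)
    n = len(ranked_ids)
    return {
        "recall": sum(recalls) / n,
        "mrr": sum(reciprocal_ranks) / n,
        "map": sum(average_precisions) / n,
    }


ranked = [["a", "b", "c"], ["x", "y", "z"]]
relevant = [{"b", "d"}, {"q"}]
print(retrieval_metrics(ranked, relevant))
# {'recall': 0.5, 'mrr': 0.25, 'map': 0.125}
print(retrieval_metrics(ranked, relevant, open_domain=True)["map"])
# 0.25
```

In the first query the only hit is at rank 2, so its reciprocal rank is 0.5 and its average precision is 0.5 / 2 when normalized by the two relevant documents, or 0.5 / 1 when normalized by the single retrieved one.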

BM25Retriever.run

def run(root_node: str,
        query: Optional[str] = None,
        filters: Optional[FilterType] = None,
        top_k: Optional[int] = None,
        documents: Optional[List[Document]] = None,
        index: Optional[str] = None,
        headers: Optional[Dict[str, str]] = None,
        scale_score: Optional[bool] = None)

Arguments:

  • root_node: The root node of the pipeline's graph.
  • query: Query string.
  • filters: A dictionary where the keys specify a metadata field and the value is a list of accepted values for that field.
  • top_k: How many documents to return per query.
  • documents: List of Documents to Retrieve.
  • index: The name of the index in the DocumentStore from which to retrieve documents.
  • headers: Custom HTTP headers to pass to document store client if supported (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='} for basic authentication).
  • scale_score: Whether to scale the similarity score to the unit interval (range of [0,1]). If true (default), similarity scores (e.g. cosine or dot_product) which naturally have a different value range will be scaled to a range of [0,1], where 1 means extremely relevant. Otherwise, raw similarity scores (e.g. cosine or dot_product) will be used.
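
The idea behind scale_score can be illustrated with two small mapping functions. This is a sketch under stated assumptions, not the library's scaling code: cosine similarity is shifted linearly from [-1, 1] to [0, 1], and unbounded scores are squashed with a logistic function whose `temperature` divisor is a hypothetical tuning constant chosen for the example.

```python
import math


def scale_cosine(score: float) -> float:
    """Map a cosine similarity from [-1, 1] linearly onto [0, 1]."""
    return (score + 1.0) / 2.0


def scale_unbounded(score: float, temperature: float = 8.0) -> float:
    """Squash an unbounded score into (0, 1) with a logistic function.

    `temperature` is an assumed constant: larger values flatten the curve.
    """
    return 1.0 / (1.0 + math.exp(-score / temperature))


for raw in (-1.0, 0.0, 1.0):
    print(raw, scale_cosine(raw))
for raw in (0.0, 5.0, 25.0):
    print(raw, round(scale_unbounded(raw), 3))
```

Both mappings are monotonic, so scaling changes the reported score values but not the order in which documents are ranked.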

BM25Retriever.run_batch

def run_batch(root_node: str,
              queries: Optional[List[str]] = None,
              filters: Optional[Union[FilterType,
                                      List[Optional[FilterType]]]] = None,
              top_k: Optional[int] = None,
              documents: Optional[Union[List[Document],
                                        List[List[Document]]]] = None,
              index: Optional[str] = None,
              headers: Optional[Dict[str, str]] = None)

Arguments:

  • root_node: The root node of the pipeline's graph.
  • queries: The list of query strings.
  • filters: A dictionary where the keys specify a metadata field and the value is a list of accepted values for that field.
  • top_k: How many documents to return per query.
  • documents: List of Documents to retrieve.
  • index: The name of the index in the DocumentStore from which to retrieve documents.
  • headers: Custom HTTP headers to pass to document store client if supported (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='} for basic authentication).

FilterRetriever

class FilterRetriever(BM25Retriever)

Naive "Retriever" that returns all documents that match the given filters. No impact of query at all. Helpful for benchmarking, testing and if you want to do QA on small documents without an "active" retriever.

FilterRetriever.retrieve

def retrieve(
        query: str,
        filters: Optional[FilterType] = None,
        top_k: Optional[int] = None,
        index: Optional[str] = None,
        headers: Optional[Dict[str, str]] = None,
        scale_score: Optional[bool] = None,
        document_store: Optional[BaseDocumentStore] = None) -> List[Document]

Scan through documents in DocumentStore and return a small number of documents that are most relevant to the query.

Arguments:

  • query: Has no effect, can pass in empty string
  • filters: A dictionary where the keys specify a metadata field and the value is a list of accepted values for that field
  • top_k: Has no effect, pass in any int or None
  • index: The name of the index in the DocumentStore from which to retrieve documents
  • headers: Custom HTTP headers to pass to the Elasticsearch client (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='}). Check out https://www.elastic.co/guide/en/elasticsearch/reference/current/http-clients.html for more information.
  • scale_score: Whether to scale the similarity score to the unit interval (range of [0,1]). If True, similarity scores (e.g. cosine or dot_product), which naturally have a different value range, are scaled to a range of [0,1], where 1 means extremely relevant. Otherwise, raw similarity scores (e.g. cosine or dot_product) are used.
  • document_store: The DocumentStore to use for retrieval. If None, the one given in __init__ is used instead.
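
Filter-only retrieval, where the query and top_k have no effect, can be sketched in a few lines. The function `filter_only_retrieve` and the dictionary-based documents are hypothetical and exist only for this illustration; they use the simple filter form described above, in which each key names a metadata field and each value lists the accepted values.

```python
from typing import Dict, List, Optional


def filter_only_retrieve(
    documents: List[dict],
    query: str = "",
    filters: Optional[Dict[str, List[str]]] = None,
    top_k: Optional[int] = None,
) -> List[dict]:
    """Return every document whose metadata passes all filters.

    `query` and `top_k` are accepted but deliberately ignored.
    """
    if not filters:
        return list(documents)
    return [
        doc
        for doc in documents
        if all(doc["meta"].get(field) in accepted for field, accepted in filters.items())
    ]


corpus = [
    {"content": "Q1 report", "meta": {"year": "2019", "quarter": "Q1"}},
    {"content": "Q3 report", "meta": {"year": "2019", "quarter": "Q3"}},
    {"content": "Old report", "meta": {"year": "2017", "quarter": "Q1"}},
]
selected = filter_only_retrieve(
    corpus, query="anything", filters={"year": ["2019"], "quarter": ["Q1", "Q2"]}, top_k=1
)
print([doc["content"] for doc in selected])  # ['Q1 report']
```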

FilterRetriever.__init__

def __init__(document_store: Optional[KeywordDocumentStore] = None,
             top_k: int = 10,
             all_terms_must_match: bool = False,
             custom_query: Optional[str] = None,
             scale_score: bool = True)

Arguments:

  • document_store: An instance of one of the following DocumentStores to retrieve from: InMemoryDocumentStore, ElasticsearchDocumentStore and OpenSearchDocumentStore. If None, a document store must be passed to the retrieve method for this Retriever to work.

  • all_terms_must_match: Whether all terms of the query must match the document. If True, all query terms must be present in a document for it to be retrieved (i.e., the AND operator is used implicitly between query terms: "cozy fish restaurant" -> "cozy AND fish AND restaurant"). Otherwise, at least one query term must be present in a document for it to be retrieved (i.e., the OR operator is used implicitly between query terms: "cozy fish restaurant" -> "cozy OR fish OR restaurant"). Defaults to False.

  • custom_query: The query string containing a mandatory ${query} and an optional ${filters} placeholder. An example custom_query:

    {
        "size": 10,
        "query": {
            "bool": {
                "should": [{"multi_match": {
                    "query": ${query},                 // mandatory query placeholder
                    "type": "most_fields",
                    "fields": ["content", "title"]}}],
                "filter": ${filters}                  // optional filter placeholder
            }
        },
    }
    

For this custom_query, a sample retrieve() could be:

self.retrieve(query="Why did the revenue increase?",
              filters={"years": ["2019"], "quarters": ["Q1", "Q2"]})

Optionally, highlighting can be defined by specifying Elasticsearch's highlight settings. See https://www.elastic.co/guide/en/elasticsearch/reference/current/highlighting.html. You will find the highlighted output in the returned Document's meta field by key "highlighted".

 **Example custom_query with highlighting:**

 ```python
 {
     "size": 10,
     "query": {
         "bool": {
             "should": [{"multi_match": {
                 "query": ${query},                 // mandatory query placeholder
                 "type": "most_fields",
                 "fields": ["content", "title"]}}],
         }
     },
     "highlight": {             // enable highlighting
         "fields": {            // for fields content and title
             "content": {},
             "title": {}
         }
     },
 }
 ```

 **For this custom_query, highlighting info can be accessed by:**
```python
docs = self.retrieve(query="Why did the revenue increase?")
highlighted_content = docs[0].meta["highlighted"]["content"]
highlighted_title = docs[0].meta["highlighted"]["title"]
```
  • top_k: How many documents to return per query.
  • scale_score: Whether to scale the similarity score to the unit interval (range of [0,1]). If True (default), similarity scores (e.g. cosine or dot_product), which naturally have a different value range, are scaled to a range of [0,1], where 1 means extremely relevant. Otherwise, raw similarity scores (e.g. cosine or dot_product) are used.

FilterRetriever.retrieve_batch

def retrieve_batch(
    queries: List[str],
    filters: Optional[Union[FilterType, List[Optional[FilterType]]]] = None,
    top_k: Optional[int] = None,
    all_terms_must_match: Optional[bool] = None,
    index: Optional[str] = None,
    headers: Optional[Dict[str, str]] = None,
    batch_size: Optional[int] = None,
    scale_score: Optional[bool] = None,
    document_store: Optional[BaseDocumentStore] = None
) -> List[List[Document]]

Scan through documents in DocumentStore and return a small number of documents that are most relevant to the supplied queries.

Returns a list of lists of Documents (one per query).

Arguments:

  • queries: List of query strings.

  • filters: Optional filters to narrow down the search space to documents whose metadata fulfill certain conditions. Filters are defined as nested dictionaries. The keys of the dictionaries can be a logical operator ("$and", "$or", "$not"), a comparison operator ("$eq", "$in", "$gt", "$gte", "$lt", "$lte") or a metadata field name. Logical operator keys take a dictionary of metadata field names and/or logical operators as value. Metadata field names take a dictionary of comparison operators as value. Comparison operator keys take a single value or (in case of "$in") a list of values as value. If no logical operator is provided, "$and" is used as default operation. If no comparison operator is provided, "$eq" (or "$in" if the comparison value is a list) is used as default operation.

    __Example__:
    
    ```python
    filters = {
        "$and": {
            "type": {"$eq": "article"},
            "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
            "rating": {"$gte": 3},
            "$or": {
                "genre": {"$in": ["economy", "politics"]},
                "publisher": {"$eq": "nytimes"}
            }
        }
    }
    # or simpler using default operators
    filters = {
        "type": "article",
        "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
        "rating": {"$gte": 3},
        "$or": {
            "genre": ["economy", "politics"],
            "publisher": "nytimes"
        }
    }
    ```
    
    To use the same logical operator multiple times on the same level, logical operators take
    optionally a list of dictionaries as value.
    
    __Example__:
    
    ```python
    filters = {
        "$or": [
            {
                "$and": {
                    "Type": "News Paper",
                    "Date": {
                        "$lt": "2019-01-01"
                    }
                }
            },
            {
                "$and": {
                    "Type": "Blog Post",
                    "Date": {
                        "$gte": "2019-01-01"
                    }
                }
            }
        ]
    }
    ```
    
  • top_k: How many documents to return per query.

  • all_terms_must_match: Whether all terms of the query must match the document. When set to True, the Retriever returns only documents that contain all query terms (that means the AND operator is being used implicitly between query terms. For example, the query "cozy fish restaurant" is read as "cozy AND fish AND restaurant"). When set to False, the Retriever returns documents containing at least one query term (this means the OR operator is being used implicitly between query terms. For example, the query "cozy fish restaurant" is read as "cozy OR fish OR restaurant"). Defaults to None. If you set a value for this parameter, it overwrites self.all_terms_must_match at runtime.

  • index: The name of the index in the DocumentStore from which to retrieve documents

  • headers: Custom HTTP headers to pass to the Elasticsearch client (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='}). Check out https://www.elastic.co/guide/en/elasticsearch/reference/current/http-clients.html for more information.

  • batch_size: Not applicable.

  • scale_score: Whether to scale the similarity score to the unit interval (range of [0,1]). If True, similarity scores (e.g. cosine or dot_product), which naturally have a different value range, are scaled to a range of [0,1], where 1 means extremely relevant. Otherwise, raw similarity scores (e.g. cosine or dot_product) are used.

  • document_store: The DocumentStore to use for retrieval. If None, the one given in __init__ is used instead.

FilterRetriever.timing

def timing(fn, attr_name)

Wrapper method used to time functions.

FilterRetriever.eval

def eval(label_index: str = "label",
         doc_index: str = "eval_document",
         label_origin: str = "gold-label",
         top_k: int = 10,
         open_domain: bool = False,
         return_preds: bool = False,
         headers: Optional[Dict[str, str]] = None,
         document_store: Optional[BaseDocumentStore] = None) -> dict

Performs evaluation on the Retriever.

Retriever is evaluated based on whether it finds the correct document given the query string and at which position in the ranking of documents the correct document is.

Returns a dict containing the following metrics:

- "recall": Proportion of questions for which the correct document is among the retrieved documents
- "mrr": Mean of reciprocal rank. Rewards retrievers that give relevant documents a higher rank.
  Only considers the highest ranked relevant document.
- "map": Mean of average precision for each question. Rewards retrievers that give relevant
  documents a higher rank. Considers all retrieved relevant documents. If ``open_domain=True``,
  average precision is normalized by the number of retrieved relevant documents per query.
  If ``open_domain=False``, average precision is normalized by the number of all relevant documents
  per query.

Arguments:

  • label_index: Index/Table in DocumentStore where labeled questions are stored
  • doc_index: Index/Table in DocumentStore where documents that are used for evaluation are stored
  • top_k: How many documents to return per query
  • open_domain: If True, retrieval will be evaluated by checking if the answer string to a question is contained in the retrieved docs (common approach in open-domain QA). If False, retrieval uses a stricter evaluation that checks if the retrieved document ids are within ids explicitly stated in the labels.
  • return_preds: Whether to add predictions in the returned dictionary. If True, the returned dictionary contains the keys "predictions" and "metrics".
  • headers: Custom HTTP headers to pass to document store client if supported (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='} for basic authentication)

FilterRetriever.run

def run(root_node: str,
        query: Optional[str] = None,
        filters: Optional[FilterType] = None,
        top_k: Optional[int] = None,
        documents: Optional[List[Document]] = None,
        index: Optional[str] = None,
        headers: Optional[Dict[str, str]] = None,
        scale_score: Optional[bool] = None)

Arguments:

  • root_node: The root node of the pipeline's graph.
  • query: Query string.
  • filters: A dictionary where the keys specify a metadata field and the value is a list of accepted values for that field.
  • top_k: How many documents to return per query.
  • documents: List of Documents to Retrieve.
  • index: The name of the index in the DocumentStore from which to retrieve documents.
  • headers: Custom HTTP headers to pass to document store client if supported (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='} for basic authentication).
  • scale_score: Whether to scale the similarity score to the unit interval (range of [0,1]). If true (default), similarity scores (e.g. cosine or dot_product) which naturally have a different value range will be scaled to a range of [0,1], where 1 means extremely relevant. Otherwise, raw similarity scores (e.g. cosine or dot_product) will be used.

FilterRetriever.run_batch

def run_batch(root_node: str,
              queries: Optional[List[str]] = None,
              filters: Optional[Union[FilterType,
                                      List[Optional[FilterType]]]] = None,
              top_k: Optional[int] = None,
              documents: Optional[Union[List[Document],
                                        List[List[Document]]]] = None,
              index: Optional[str] = None,
              headers: Optional[Dict[str, str]] = None)

Arguments:

  • root_node: The root node of the pipeline's graph.
  • queries: The list of query strings.
  • filters: A dictionary where the keys specify a metadata field and the value is a list of accepted values for that field.
  • top_k: How many documents to return per query.
  • documents: List of Documents to retrieve.
  • index: The name of the index in the DocumentStore from which to retrieve documents.
  • headers: Custom HTTP headers to pass to document store client if supported (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='} for basic authentication).

TfidfRetriever

class TfidfRetriever(BaseRetriever)

Read all documents from a SQL backend.

Split documents into smaller units (e.g., paragraphs or pages) to reduce the computation needed when text is passed on to a Reader for QA.

It uses sklearn's TfidfVectorizer to compute a tf-idf matrix.

TfidfRetriever.__init__

def __init__(document_store: Optional[BaseDocumentStore] = None,
             top_k: int = 10,
             auto_fit=True)

Arguments:

  • document_store: an instance of a DocumentStore to retrieve documents from.
  • top_k: How many documents to return per query.
  • auto_fit: Whether to automatically update tf-idf matrix by calling fit() after new documents have been added

TfidfRetriever.retrieve

def retrieve(
        query: str,
        filters: Optional[Union[FilterType,
                                List[Optional[FilterType]]]] = None,
        top_k: Optional[int] = None,
        index: Optional[str] = None,
        headers: Optional[Dict[str, str]] = None,
        scale_score: Optional[bool] = None,
        document_store: Optional[BaseDocumentStore] = None) -> List[Document]

Scan through documents in DocumentStore and return a small number of documents that are most relevant to the query.

Arguments:

  • query: The query string.
  • filters: A dictionary where the keys specify a metadata field and the value is a list of accepted values for that field
  • top_k: How many documents to return per query.
  • index: The name of the index in the DocumentStore from which to retrieve documents
  • scale_score: Whether to scale the similarity score to the unit interval (range of [0,1]). If true similarity scores (e.g. cosine or dot_product) which naturally have a different value range will be scaled to a range of [0,1], where 1 means extremely relevant. Otherwise raw similarity scores (e.g. cosine or dot_product) will be used.
  • document_store: the docstore to use for retrieval. If None, the one given in the __init__ is used instead.

TfidfRetriever.retrieve_batch

def retrieve_batch(
    queries: List[str],
    filters: Optional[Union[FilterType, List[Optional[FilterType]]]] = None,
    top_k: Optional[int] = None,
    index: Optional[str] = None,
    headers: Optional[Dict[str, str]] = None,
    batch_size: Optional[int] = None,
    scale_score: Optional[bool] = None,
    document_store: Optional[BaseDocumentStore] = None
) -> List[List[Document]]

Scan through documents in DocumentStore and return a small number of documents that are most relevant to the supplied queries.

Returns a list of lists of Documents (one per query).

Arguments:

  • queries: List of query strings.
  • filters: A dictionary where the keys specify a metadata field and the value is a list of accepted values for that field
  • top_k: How many documents to return per query.
  • index: The name of the index in the DocumentStore from which to retrieve documents
  • batch_size: Not applicable.
  • scale_score: Whether to scale the similarity score to the unit interval (range of [0,1]). If true similarity scores (e.g. cosine or dot_product) which naturally have a different value range will be scaled to a range of [0,1], where 1 means extremely relevant. Otherwise raw similarity scores (e.g. cosine or dot_product) will be used.
  • document_store: the docstore to use for retrieval. If None, the one given in the __init__ is used instead.

TfidfRetriever.fit

def fit(document_store: BaseDocumentStore, index: Optional[str] = None)

Performs training on this class according to the TF-IDF algorithm.

TfidfRetriever.timing

def timing(fn, attr_name)

Wrapper method used to time functions.

TfidfRetriever.eval

def eval(label_index: str = "label",
         doc_index: str = "eval_document",
         label_origin: str = "gold-label",
         top_k: int = 10,
         open_domain: bool = False,
         return_preds: bool = False,
         headers: Optional[Dict[str, str]] = None,
         document_store: Optional[BaseDocumentStore] = None) -> dict

Performs evaluation on the Retriever.

Retriever is evaluated based on whether it finds the correct document given the query string and at which position in the ranking of documents the correct document is.

Returns a dict containing the following metrics:

- "recall": Proportion of questions for which correct document is among retrieved documents
- "mrr": Mean of reciprocal rank. Rewards retrievers that give relevant documents a higher rank.
  Only considers the highest ranked relevant document.
- "map": Mean of average precision for each question. Rewards retrievers that give relevant
  documents a higher rank. Considers all retrieved relevant documents. If ``open_domain=True``,
  average precision is normalized by the number of retrieved relevant documents per query.
  If ``open_domain=False``, average precision is normalized by the number of all relevant documents
  per query.

Arguments:

  • label_index: Index/Table in DocumentStore where labeled questions are stored
  • doc_index: Index/Table in DocumentStore where documents that are used for evaluation are stored
  • top_k: How many documents to return per query
  • open_domain: If True, retrieval will be evaluated by checking if the answer string to a question is contained in the retrieved docs (common approach in open-domain QA). If False, retrieval uses a stricter evaluation that checks if the retrieved document ids are within ids explicitly stated in the labels.
  • return_preds: Whether to add predictions in the returned dictionary. If True, the returned dictionary contains the keys "predictions" and "metrics".
  • headers: Custom HTTP headers to pass to document store client if supported (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='} for basic authentication)
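
The three metrics can be sketched as follows. This is a minimal illustration that assumes relevance is decided by document id in both modes (the open-domain mode described above matches answer strings instead); the function name and inputs are hypothetical, not part of the library.

```python
from typing import Dict, List, Set


def retrieval_metrics(ranked: List[List[str]],
                      relevant: List[Set[str]],
                      open_domain: bool = False) -> Dict[str, float]:
    """ranked[i] holds the retrieved ids for query i, best first;
    relevant[i] holds the ids labeled as correct for query i."""
    recall_hits, rr_sum, ap_sum = 0, 0.0, 0.0
    for ranked_ids, gold in zip(ranked, relevant):
        hits = 0
        precisions = []
        first_rank = None
        for rank, doc_id in enumerate(ranked_ids, start=1):
            if doc_id in gold:
                hits += 1
                precisions.append(hits / rank)  # precision at this rank
                if first_rank is None:
                    first_rank = rank
        if first_rank is not None:
            recall_hits += 1                # at least one correct document found
            rr_sum += 1.0 / first_rank      # only the best-ranked hit counts
        # Normalize by retrieved relevant docs (open domain) or all relevant docs.
        denom = hits if open_domain else len(gold)
        ap_sum += sum(precisions) / denom if denom else 0.0
    n = len(ranked)
    return {"recall": recall_hits / n, "mrr": rr_sum / n, "map": ap_sum / n}


ranked = [["d1", "d2", "d3"], ["d4", "d5", "d6"]]
relevant = [{"d2", "d9"}, {"d7"}]
closed = retrieval_metrics(ranked, relevant, open_domain=False)
opened = retrieval_metrics(ranked, relevant, open_domain=True)
```

In this toy input the first query finds one of its two relevant documents at rank 2, so the two normalizations give different "map" values while "recall" and "mrr" are unchanged.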

TfidfRetriever.run

def run(root_node: str,
        query: Optional[str] = None,
        filters: Optional[FilterType] = None,
        top_k: Optional[int] = None,
        documents: Optional[List[Document]] = None,
        index: Optional[str] = None,
        headers: Optional[Dict[str, str]] = None,
        scale_score: Optional[bool] = None)

Arguments:

  • root_node: The root node of the pipeline's graph.
  • query: Query string.
  • filters: A dictionary where the keys specify a metadata field and the value is a list of accepted values for that field.
  • top_k: How many documents to return per query.
  • documents: List of Documents to Retrieve.
  • index: The name of the index in the DocumentStore from which to retrieve documents.
  • headers: Custom HTTP headers to pass to document store client if supported (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='} for basic authentication).
  • scale_score: Whether to scale the similarity score to the unit interval (range of [0,1]). If true (default), similarity scores (e.g. cosine or dot_product) which naturally have a different value range will be scaled to a range of [0,1], where 1 means extremely relevant. Otherwise, raw similarity scores (e.g. cosine or dot_product) will be used.

TfidfRetriever.run_batch

def run_batch(root_node: str,
              queries: Optional[List[str]] = None,
              filters: Optional[Union[FilterType,
                                      List[Optional[FilterType]]]] = None,
              top_k: Optional[int] = None,
              documents: Optional[Union[List[Document],
                                        List[List[Document]]]] = None,
              index: Optional[str] = None,
              headers: Optional[Dict[str, str]] = None)

Arguments:

  • root_node: The root node of the pipeline's graph.
  • queries: The list of query strings.
  • filters: A dictionary where the keys specify a metadata field and the value is a list of accepted values for that field.
  • top_k: How many documents to return per query.
  • documents: List of Documents to retrieve.
  • index: The name of the index in the DocumentStore from which to retrieve documents.
  • headers: Custom HTTP headers to pass to document store client if supported (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='} for basic authentication).

Module dense

DenseRetriever

class DenseRetriever(BaseRetriever)

Base class for all dense retrievers.

DenseRetriever.embed_queries

@abstractmethod
def embed_queries(queries: List[str]) -> np.ndarray

Create embeddings for a list of queries.

Arguments:

  • queries: List of queries to embed.

Returns:

Embeddings, one per input query, shape: (queries, embedding_dim)

DenseRetriever.embed_documents

@abstractmethod
def embed_documents(documents: List[Document]) -> np.ndarray

Create embeddings for a list of documents.

Arguments:

  • documents: List of documents to embed.

Returns:

Embeddings of documents, one per input document, shape: (documents, embedding_dim)

DenseRetriever.retrieve

@abstractmethod
def retrieve(
        query: str,
        filters: Optional[FilterType] = None,
        top_k: Optional[int] = None,
        index: Optional[str] = None,
        headers: Optional[Dict[str, str]] = None,
        scale_score: Optional[bool] = None,
        document_store: Optional[BaseDocumentStore] = None) -> List[Document]

Scan through documents in DocumentStore and return a small number of documents that are most relevant to the query.

Arguments:

  • query: The query
  • filters: A dictionary where the keys specify a metadata field and the value is a list of accepted values for that field
  • top_k: How many documents to return per query.
  • index: The name of the index in the DocumentStore from which to retrieve documents
  • headers: Custom HTTP headers to pass to document store client if supported (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='} for basic authentication)
  • scale_score: Whether to scale the similarity score to the unit interval (range of [0,1]). If true (default) similarity scores (e.g. cosine or dot_product) which naturally have a different value range will be scaled to a range of [0,1], where 1 means extremely relevant. Otherwise raw similarity scores (e.g. cosine or dot_product) will be used.
  • document_store: the docstore to use for retrieval. If None, the one given in the __init__ is used instead.
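
The exact scaling formulas are not given in this section. The sketch below shows one plausible mapping to the unit interval, assuming cosine scores are shifted from [-1, 1] and dot products are squashed with a logistic function; the function name and the divisor of 100 are assumptions made for illustration.

```python
import math


def scale_to_unit_interval(score: float, similarity: str) -> float:
    if similarity == "cosine":
        # Cosine similarity lies in [-1, 1]; shift and halve it.
        return (score + 1) / 2
    # Dot products are unbounded, so squash them with a logistic function.
    # The divisor 100 is an assumed temperature, not a documented constant.
    # Very large negative scores would overflow math.exp in this simple form.
    return 1 / (1 + math.exp(-score / 100))


cosine_scaled = scale_to_unit_interval(0.6, "cosine")
dot_scaled = scale_to_unit_interval(50.0, "dot_product")
```

Scaled scores are easier to threshold and to compare across retrievers, while raw scores keep the original geometry of the embedding space.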

DenseRetriever.timing

def timing(fn, attr_name)

Wrapper method used to time functions.

DenseRetriever.eval

def eval(label_index: str = "label",
         doc_index: str = "eval_document",
         label_origin: str = "gold-label",
         top_k: int = 10,
         open_domain: bool = False,
         return_preds: bool = False,
         headers: Optional[Dict[str, str]] = None,
         document_store: Optional[BaseDocumentStore] = None) -> dict

Performs evaluation on the Retriever.

Retriever is evaluated based on whether it finds the correct document given the query string and at which position in the ranking of documents the correct document is.

Returns a dict containing the following metrics:

- "recall": Proportion of questions for which correct document is among retrieved documents
- "mrr": Mean of reciprocal rank. Rewards retrievers that give relevant documents a higher rank.
  Only considers the highest ranked relevant document.
- "map": Mean of average precision for each question. Rewards retrievers that give relevant
  documents a higher rank. Considers all retrieved relevant documents. If ``open_domain=True``,
  average precision is normalized by the number of retrieved relevant documents per query.
  If ``open_domain=False``, average precision is normalized by the number of all relevant documents
  per query.

Arguments:

  • label_index: Index/Table in DocumentStore where labeled questions are stored
  • doc_index: Index/Table in DocumentStore where documents that are used for evaluation are stored
  • top_k: How many documents to return per query
  • open_domain: If True, retrieval will be evaluated by checking if the answer string to a question is contained in the retrieved docs (common approach in open-domain QA). If False, retrieval uses a stricter evaluation that checks if the retrieved document ids are within ids explicitly stated in the labels.
  • return_preds: Whether to add predictions in the returned dictionary. If True, the returned dictionary contains the keys "predictions" and "metrics".
  • headers: Custom HTTP headers to pass to document store client if supported (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='} for basic authentication)

DenseRetriever.run

def run(root_node: str,
        query: Optional[str] = None,
        filters: Optional[FilterType] = None,
        top_k: Optional[int] = None,
        documents: Optional[List[Document]] = None,
        index: Optional[str] = None,
        headers: Optional[Dict[str, str]] = None,
        scale_score: Optional[bool] = None)

Arguments:

  • root_node: The root node of the pipeline's graph.
  • query: Query string.
  • filters: A dictionary where the keys specify a metadata field and the value is a list of accepted values for that field.
  • top_k: How many documents to return per query.
  • documents: List of Documents to Retrieve.
  • index: The name of the index in the DocumentStore from which to retrieve documents.
  • headers: Custom HTTP headers to pass to document store client if supported (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='} for basic authentication).
  • scale_score: Whether to scale the similarity score to the unit interval (range of [0,1]). If true (default), similarity scores (e.g. cosine or dot_product) which naturally have a different value range will be scaled to a range of [0,1], where 1 means extremely relevant. Otherwise, raw similarity scores (e.g. cosine or dot_product) will be used.

DenseRetriever.run_batch

def run_batch(root_node: str,
              queries: Optional[List[str]] = None,
              filters: Optional[Union[FilterType,
                                      List[Optional[FilterType]]]] = None,
              top_k: Optional[int] = None,
              documents: Optional[Union[List[Document],
                                        List[List[Document]]]] = None,
              index: Optional[str] = None,
              headers: Optional[Dict[str, str]] = None)

Arguments:

  • root_node: The root node of the pipeline's graph.
  • queries: The list of query strings.
  • filters: A dictionary where the keys specify a metadata field and the value is a list of accepted values for that field.
  • top_k: How many documents to return per query.
  • documents: List of Documents to retrieve.
  • index: The name of the index in the DocumentStore from which to retrieve documents.
  • headers: Custom HTTP headers to pass to document store client if supported (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='} for basic authentication).

DensePassageRetriever

class DensePassageRetriever(DenseRetriever)

Retriever that uses a bi-encoder (one transformer for query, one transformer for passage). See the original paper for more details: Karpukhin, Vladimir, et al. (2020): "Dense Passage Retrieval for Open-Domain Question Answering." (https://arxiv.org/abs/2004.04906).

DensePassageRetriever.__init__

def __init__(document_store: Optional[BaseDocumentStore] = None,
             query_embedding_model: Union[
                 Path, str] = "facebook/dpr-question_encoder-single-nq-base",
             passage_embedding_model: Union[
                 Path, str] = "facebook/dpr-ctx_encoder-single-nq-base",
             model_version: Optional[str] = None,
             max_seq_len_query: int = 64,
             max_seq_len_passage: int = 256,
             top_k: int = 10,
             use_gpu: bool = True,
             batch_size: int = 16,
             embed_title: bool = True,
             use_fast_tokenizers: bool = True,
             similarity_function: str = "dot_product",
             global_loss_buffer_size: int = 150000,
             progress_bar: bool = True,
             devices: Optional[List[Union[str, "torch.device"]]] = None,
             use_auth_token: Optional[Union[str, bool]] = None,
             scale_score: bool = True)

Initializes the Retriever, including the two encoder models, from a local or remote model checkpoint.

The checkpoint format matches the Hugging Face transformers model format.

Example:

# remote model from FAIR
DensePassageRetriever(document_store=your_doc_store,
                      query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
                      passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base")
# or from local path
DensePassageRetriever(document_store=your_doc_store,
                      query_embedding_model="model_directory/question-encoder",
                      passage_embedding_model="model_directory/context-encoder")

Arguments:

  • document_store: An instance of DocumentStore from which to retrieve documents.
  • query_embedding_model: Local path or remote name of the question encoder checkpoint. The format equals the one used by Hugging Face transformers' model hub models. Currently available remote names: "facebook/dpr-question_encoder-single-nq-base"
  • passage_embedding_model: Local path or remote name of the passage encoder checkpoint. The format equals the one used by Hugging Face transformers' model hub models. Currently available remote names: "facebook/dpr-ctx_encoder-single-nq-base"
  • model_version: The version of model to use from the HuggingFace model hub. Can be tag name, branch name, or commit hash.
  • max_seq_len_query: Maximum number of tokens for the query text. Longer queries are truncated.
  • max_seq_len_passage: Maximum number of tokens for the passage/context text. Longer passages are truncated.
  • top_k: How many documents to return per query.
  • use_gpu: Whether to use all available GPUs or the CPU. Falls back on CPU if no GPU is available.
  • batch_size: Number of questions or passages to encode at once. In case of multiple gpus, this will be the total batch size.
  • embed_title: Whether to concatenate title and passage to a text pair that is then used to create the embedding. This is the approach used in the original paper and is likely to improve performance if your titles contain meaningful information for retrieval (topic, entities, etc.). The title is expected to be present in doc.meta["name"] and can be supplied in the documents before writing them to the DocumentStore like this: {"text": "my text", "meta": {"name": "my title"}}.
  • use_fast_tokenizers: Whether to use fast Rust tokenizers
  • similarity_function: Which function to apply for calculating the similarity of query and passage embeddings during training. Options: dot_product (Default) or cosine
  • global_loss_buffer_size: Buffer size for all_gather() in DDP. Increase if errors like "encoded data exceeds max_size ..." come up
  • progress_bar: Whether to show a tqdm progress bar or not. Can be helpful to disable in production deployments to keep the logs clean.
  • devices: List of torch devices (e.g. cuda, cpu, mps) to limit inference to specific devices. A list containing torch device objects and/or strings is supported (For example [torch.device('cuda:0'), "mps", "cuda:1"]). When specifying use_gpu=False the devices parameter is not used and a single cpu device is used for inference. Note: as multi-GPU training is currently not implemented for DPR, training will only use the first device provided in this list.
  • use_auth_token: The API token used to download private models from Huggingface. If this parameter is set to True, then the token generated when running transformers-cli login (stored in ~/.huggingface) will be used. Additional information can be found here https://huggingface.co/transformers/main_classes/model.html#transformers.PreTrainedModel.from_pretrained
  • scale_score: Whether to scale the similarity score to the unit interval (range of [0,1]). If true (default) similarity scores (e.g. cosine or dot_product) which naturally have a different value range will be scaled to a range of [0,1], where 1 means extremely relevant. Otherwise raw similarity scores (e.g. cosine or dot_product) will be used.
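
To make the embed_title option concrete, the sketch below builds the encoder input from a document dictionary in the {"text": ..., "meta": {"name": ...}} form shown above. The helper name is hypothetical, and falling back to the plain text when no title is present is an assumption of this sketch rather than documented behavior.

```python
from typing import Dict, Tuple, Union


def to_encoder_input(doc: Dict, embed_title: bool = True) -> Union[str, Tuple[str, str]]:
    # The title is expected under doc["meta"]["name"].
    title = doc.get("meta", {}).get("name")
    if embed_title and title:
        # A (title, passage) text pair, to be tokenized as two segments.
        return (title, doc["text"])
    # Assumed fallback: embed the passage text alone.
    return doc["text"]


doc = {"text": "my text", "meta": {"name": "my title"}}
with_title = to_encoder_input(doc, embed_title=True)
without_title = to_encoder_input(doc, embed_title=False)
```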

DensePassageRetriever.retrieve

def retrieve(
        query: str,
        filters: Optional[FilterType] = None,
        top_k: Optional[int] = None,
        index: Optional[str] = None,
        headers: Optional[Dict[str, str]] = None,
        scale_score: Optional[bool] = None,
        document_store: Optional[BaseDocumentStore] = None) -> List[Document]

Scan through the documents in a DocumentStore and return a small number of documents that are most relevant to the query.

Arguments:

  • query: The query

  • filters: Optional filters to narrow down the search space to documents whose metadata fulfill certain conditions. Filters are defined as nested dictionaries. The keys of the dictionaries can be a logical operator ("$and", "$or", "$not"), a comparison operator ("$eq", "$in", "$gt", "$gte", "$lt", "$lte") or a metadata field name. Logical operator keys take a dictionary of metadata field names and/or logical operators as value. Metadata field names take a dictionary of comparison operators as value. Comparison operator keys take a single value or (in case of "$in") a list of values as value. If no logical operator is provided, "$and" is used as default operation. If no comparison operator is provided, "$eq" (or "$in" if the comparison value is a list) is used as default operation.

    __Example__:
    
    ```python
    filters = {
        "$and": {
            "type": {"$eq": "article"},
            "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
            "rating": {"$gte": 3},
            "$or": {
                "genre": {"$in": ["economy", "politics"]},
                "publisher": {"$eq": "nytimes"}
            }
        }
    }
    # or simpler using default operators
    filters = {
        "type": "article",
        "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
        "rating": {"$gte": 3},
        "$or": {
            "genre": ["economy", "politics"],
            "publisher": "nytimes"
        }
    }
    ```
    
    To use the same logical operator multiple times on the same level, logical operators take
    optionally a list of dictionaries as value.
    
    __Example__:
    
    ```python
    filters = {
        "$or": [
            {
                "$and": {
                    "Type": "News Paper",
                    "Date": {
                        "$lt": "2019-01-01"
                    }
                }
            },
            {
                "$and": {
                    "Type": "Blog Post",
                    "Date": {
                        "$gte": "2019-01-01"
                    }
                }
            }
        ]
    }
    ```
    
  • top_k: How many documents to return per query.

  • index: The name of the index in the DocumentStore from which to retrieve documents

  • headers: Custom HTTP headers to pass to document store client if supported (e.g. {'Authorization': 'Basic API_KEY'} for basic authentication)

  • scale_score: Whether to scale the similarity score to the unit interval (range of [0,1]). If true similarity scores (e.g. cosine or dot_product) which naturally have a different value range will be scaled to a range of [0,1], where 1 means extremely relevant. Otherwise raw similarity scores (e.g. cosine or dot_product) will be used.

  • document_store: the docstore to use for retrieval. If None, the one given in the __init__ is used instead.
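
The filter rules above can be read as a small recursive evaluation over a document's metadata. The following sketch is one way to express them in plain Python; it is not the document store's implementation (stores typically translate filters into their own query language), the function names are hypothetical, and treating a missing metadata field as a non-match is an assumption.

```python
from typing import Any, Dict, Iterator, List, Union

COMPARATORS = {
    "$eq": lambda a, b: a == b,
    "$in": lambda a, b: a in b,
    "$gt": lambda a, b: a > b,
    "$gte": lambda a, b: a >= b,
    "$lt": lambda a, b: a < b,
    "$lte": lambda a, b: a <= b,
}

Filters = Union[Dict[str, Any], List[Dict[str, Any]]]


def matches(meta: Dict[str, Any], filters: Filters) -> bool:
    # Without an explicit logical operator, conditions are combined with "$and".
    return all(_conditions(meta, filters))


def _conditions(meta: Dict[str, Any], filters: Filters) -> Iterator[bool]:
    # Yield one boolean per condition found at this nesting level.
    if isinstance(filters, list):
        for sub in filters:
            yield matches(meta, sub)
        return
    for key, value in filters.items():
        if key == "$and":
            yield all(_conditions(meta, value))
        elif key == "$or":
            yield any(_conditions(meta, value))
        elif key == "$not":
            yield not all(_conditions(meta, value))
        else:
            yield _field_matches(meta.get(key), value)


def _field_matches(actual: Any, condition: Any) -> bool:
    if actual is None:
        return False  # assumption: a missing field never matches
    if isinstance(condition, dict):
        return all(COMPARATORS[op](actual, arg) for op, arg in condition.items())
    if isinstance(condition, list):
        return actual in condition  # default operator "$in" for list values
    return actual == condition      # default operator "$eq" otherwise


filters = {
    "type": "article",
    "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
    "rating": {"$gte": 3},
    "$or": {"genre": ["economy", "politics"], "publisher": "nytimes"},
}
meta = {"type": "article", "date": "2018-06-01", "rating": 4,
        "genre": "sports", "publisher": "nytimes"}
is_match = matches(meta, filters)
```

Dates compare correctly here only because they are ISO-formatted strings, which sort lexicographically in chronological order.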

DensePassageRetriever.retrieve_batch

def retrieve_batch(
    queries: List[str],
    filters: Optional[Union[FilterType, List[Optional[FilterType]]]] = None,
    top_k: Optional[int] = None,
    index: Optional[str] = None,
    headers: Optional[Dict[str, str]] = None,
    batch_size: Optional[int] = None,
    scale_score: Optional[bool] = None,
    document_store: Optional[BaseDocumentStore] = None
) -> List[List[Document]]

Scan through the documents in a DocumentStore and return a small number of documents that are most relevant to the supplied queries.

Returns a list of lists of Documents (one per query).

Arguments:

  • queries: List of query strings.
  • filters: Optional filters to narrow down the search space to documents whose metadata fulfill certain conditions. Can be a single filter that will be applied to each query or a list of filters (one filter per query).

Filters are defined as nested dictionaries. The keys of the dictionaries can be a logical operator ("$and", "$or", "$not"), a comparison operator ("$eq", "$in", "$gt", "$gte", "$lt", "$lte") or a metadata field name. Logical operator keys take a dictionary of metadata field names and/or logical operators as value. Metadata field names take a dictionary of comparison operators as value. Comparison operator keys take a single value or (in case of "$in") a list of values as value. If no logical operator is provided, "$and" is used as default operation. If no comparison operator is provided, "$eq" (or "$in" if the comparison value is a list) is used as default operation.

__Example__:

```python
filters = {
    "$and": {
        "type": {"$eq": "article"},
        "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
        "rating": {"$gte": 3},
        "$or": {
            "genre": {"$in": ["economy", "politics"]},
            "publisher": {"$eq": "nytimes"}
        }
    }
}
# or simpler using default operators
filters = {
    "type": "article",
    "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
    "rating": {"$gte": 3},
    "$or": {
        "genre": ["economy", "politics"],
        "publisher": "nytimes"
    }
}
```

To use the same logical operator multiple times on the same level, logical operators take
optionally a list of dictionaries as value.

__Example__:

```python
filters = {
    "$or": [
        {
            "$and": {
                "Type": "News Paper",
                "Date": {
                    "$lt": "2019-01-01"
                }
            }
        },
        {
            "$and": {
                "Type": "Blog Post",
                "Date": {
                    "$gte": "2019-01-01"
                }
            }
        }
    ]
}
```
  • top_k: How many documents to return per query.
  • index: The name of the index in the DocumentStore from which to retrieve documents
  • headers: Custom HTTP headers to pass to document store client if supported (e.g. {'Authorization': 'Basic API_KEY'} for basic authentication)
  • batch_size: Number of queries to embed at a time.
  • scale_score: Whether to scale the similarity score to the unit interval (range of [0,1]). If true similarity scores (e.g. cosine or dot_product) which naturally have a different value range will be scaled to a range of [0,1], where 1 means extremely relevant. Otherwise raw similarity scores (e.g. cosine or dot_product) will be used.
  • document_store: the docstore to use for retrieval. If None, the one given in the __init__ is used instead.
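
Because filters may be either a single filter or one filter per query, a caller-side helper can normalize both forms before pairing them with queries. This is a minimal sketch; the helper name, the FilterType alias, and the error raised on a length mismatch are assumptions, not library behavior.

```python
from typing import Any, Dict, List, Optional, Union

FilterType = Dict[str, Any]  # stand-in alias used only in this sketch


def expand_filters(
    filters: Optional[Union[FilterType, List[Optional[FilterType]]]],
    n_queries: int,
) -> List[Optional[FilterType]]:
    if isinstance(filters, list):
        # One filter per query: lengths have to line up.
        if len(filters) != n_queries:
            raise ValueError("Number of filters must match number of queries")
        return filters
    # A single filter (or None) is applied to every query.
    return [filters] * n_queries


queries = ["Who wrote Dune?", "What is BM25?"]
per_query = expand_filters({"type": "article"}, len(queries))
pairs = list(zip(queries, per_query))
```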

DensePassageRetriever.embed_queries

def embed_queries(queries: List[str]) -> np.ndarray

Create embeddings for a list of queries using the query encoder.

Arguments:

  • queries: List of queries to embed.

Returns:

Embeddings, one per input query, shape: (queries, embedding_dim)

DensePassageRetriever.embed_documents

def embed_documents(documents: List[Document]) -> np.ndarray

Create embeddings for a list of documents using the passage encoder.

Arguments:

  • documents: List of documents to embed.

Returns:

Embeddings of documents, one per input document, shape: (documents, embedding_dim)
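
Once queries and documents are embedded, retrieval reduces to comparing vectors. The sketch below ranks documents by dot product against one query embedding, using plain lists in place of np.ndarray so that it runs without dependencies; the function names and the toy two-dimensional embeddings are illustrative assumptions.

```python
from typing import List, Sequence, Tuple


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def top_k_by_dot_product(query_emb: Sequence[float],
                         doc_embs: Sequence[Sequence[float]],
                         top_k: int = 10) -> List[Tuple[int, float]]:
    # Score every document row against the query, then keep the best top_k.
    scores = [(i, dot(query_emb, d)) for i, d in enumerate(doc_embs)]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_k]


# Toy embeddings with shape (documents, embedding_dim) = (3, 2).
doc_embs = [[0.2, 0.9], [0.8, 0.1], [0.5, 0.5]]
query_emb = [1.0, 0.0]
best = top_k_by_dot_product(query_emb, doc_embs, top_k=2)
best_ids = [i for i, _ in best]
```

A real document store would typically use an approximate nearest-neighbor index instead of scoring every row, but the ranking criterion is the same.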

DensePassageRetriever.train

def train(data_dir: str,
          train_filename: str,
          dev_filename: Optional[str] = None,
          test_filename: Optional[str] = None,
          max_samples: Optional[int] = None,
          max_processes: int = 128,
          multiprocessing_strategy: Optional[str] = None,
          dev_split: float = 0,
          batch_size: int = 2,
          embed_title: bool = True,
          num_hard_negatives: int = 1,
          num_positives: int = 1,
          n_epochs: int = 3,
          evaluate_every: int = 1000,
          n_gpu: int = 1,
          learning_rate: float = 1e-5,
          epsilon: float = 1e-08,
          weight_decay: float = 0.0,
          num_warmup_steps: int = 100,
          grad_acc_steps: int = 1,
          use_amp: bool = False,
          optimizer_name: str = "AdamW",
          optimizer_correct_bias: bool = True,
          save_dir: str = "../saved_models/dpr",
          query_encoder_save_dir: str = "query_encoder",
          passage_encoder_save_dir: str = "passage_encoder",
          checkpoint_root_dir: Path = Path("model_checkpoints"),
          checkpoint_every: Optional[int] = None,
          checkpoints_to_keep: int = 3,
          early_stopping: Optional[EarlyStopping] = None)

Train a DensePassageRetrieval model.

Arguments:

  • data_dir: Directory where training file, dev file and test file are present
  • train_filename: training filename
  • dev_filename: development set filename, file to be used by model in eval step of training
  • test_filename: test set filename, file to be used by model in test step after training
  • max_samples: maximum number of input samples to convert. Can be used for debugging a smaller dataset.
  • max_processes: the maximum number of processes to spawn in the multiprocessing.Pool used in DataSilo. It can be set to 1 to disable the use of multiprocessing or make debugging easier.
  • multiprocessing_strategy: Set the multiprocessing sharing strategy, this can be one of file_descriptor/file_system depending on your OS. If your system has low limits for the number of open file descriptors, and you can’t raise them, you should use the file_system strategy.
  • dev_split: The proportion of the train set that will be sliced off as a dev set. Only works if dev_filename is set to None.
  • batch_size: total number of samples in 1 batch of data
  • embed_title: whether to concatenate passage title with each passage. The default setting in official DPR embeds passage title with the corresponding passage
  • num_hard_negatives: number of hard negative passages (passages that are very similar to the query, i.e. scored highly by BM25, but do not contain the answer)
  • num_positives: number of positive passages
  • n_epochs: number of epochs to train the model on
  • evaluate_every: number of training steps after which evaluation is run
  • n_gpu: number of gpus to train on
  • learning_rate: learning rate of optimizer
  • epsilon: epsilon parameter of optimizer
  • weight_decay: weight decay parameter of optimizer
  • grad_acc_steps: number of steps to accumulate gradient over before back-propagation is done
  • use_amp: Whether to use automatic mixed precision (AMP) natively implemented in PyTorch to improve training speed and reduce GPU memory usage. For more information, see Haystack Optimization (https://haystack.deepset.ai/guides/optimization) and Automatic Mixed Precision Package - Torch.amp (https://pytorch.org/docs/stable/amp.html).
  • optimizer_name: what optimizer to use (default: AdamW)
  • num_warmup_steps: number of warmup steps
  • optimizer_correct_bias: Whether to correct bias in optimizer
  • save_dir: directory where models are saved
  • query_encoder_save_dir: directory inside save_dir where query_encoder model files are saved
  • passage_encoder_save_dir: directory inside save_dir where passage_encoder model files are saved
  • checkpoint_root_dir: The Path of a directory where all train checkpoints are saved. For each individual checkpoint, a subdirectory with the name epoch{epoch_num}_step{step_num} is created.
  • checkpoint_every: Save a train checkpoint after this many steps of training.
  • checkpoints_to_keep: The maximum number of train checkpoints to save.
  • early_stopping: An initialized EarlyStopping object to control early stopping and saving of the best models. Checkpoints can be stored via setting checkpoint_every to a custom number of steps. If any checkpoints are stored, a subsequent run of train() will resume training from the latest available checkpoint.
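
The directory-related arguments combine into a predictable layout on disk. The sketch below only builds the paths described above with pathlib; the helper names are hypothetical, and whether epoch and step numbers start at zero or one is not stated here, so the values used are placeholders.

```python
from pathlib import Path
from typing import Tuple, Union


def checkpoint_dir(checkpoint_root_dir: Path, epoch_num: int, step_num: int) -> Path:
    # Each checkpoint gets its own subdirectory named epoch{epoch_num}_step{step_num}.
    return checkpoint_root_dir / f"epoch{epoch_num}_step{step_num}"


def encoder_dirs(save_dir: Union[Path, str],
                 query_encoder_save_dir: str = "query_encoder",
                 passage_encoder_save_dir: str = "passage_encoder") -> Tuple[Path, Path]:
    # Both encoders are saved in subdirectories of save_dir.
    root = Path(save_dir)
    return root / query_encoder_save_dir, root / passage_encoder_save_dir


ckpt = checkpoint_dir(Path("model_checkpoints"), epoch_num=1, step_num=500)
query_dir, passage_dir = encoder_dirs("../saved_models/dpr")
```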

DensePassageRetriever.save

def save(save_dir: Union[Path, str],
         query_encoder_dir: str = "query_encoder",
         passage_encoder_dir: str = "passage_encoder")

Save DensePassageRetriever to the specified directory.

Arguments:

  • save_dir: Directory to save to.
  • query_encoder_dir: Directory in save_dir that contains query encoder model.
  • passage_encoder_dir: Directory in save_dir that contains passage encoder model.

Returns:

None

DensePassageRetriever.load

@classmethod
def load(cls,
         load_dir: Union[Path, str],
         document_store: BaseDocumentStore,
         max_seq_len_query: int = 64,
         max_seq_len_passage: int = 256,
         use_gpu: bool = True,
         batch_size: int = 16,
         embed_title: bool = True,
         use_fast_tokenizers: bool = True,
         similarity_function: str = "dot_product",
         query_encoder_dir: str = "query_encoder",
         passage_encoder_dir: str = "passage_encoder")

Load DensePassageRetriever from the specified directory.

DensePassageRetriever.timing

def timing(fn, attr_name)

Wrapper method used to time functions.

DensePassageRetriever.eval

def eval(label_index: str = "label",
         doc_index: str = "eval_document",
         label_origin: str = "gold-label",
         top_k: int = 10,
         open_domain: bool = False,
         return_preds: bool = False,
         headers: Optional[Dict[str, str]] = None,
         document_store: Optional[BaseDocumentStore] = None) -> dict

Performs evaluation on the Retriever.

Retriever is evaluated based on whether it finds the correct document given the query string and at which position in the ranking of documents the correct document is.

Returns a dict containing the following metrics:

- "recall": Proportion of questions for which correct document is among retrieved documents
- "mrr": Mean of reciprocal rank. Rewards retrievers that give relevant documents a higher rank.
  Only considers the highest ranked relevant document.
- "map": Mean of average precision for each question. Rewards retrievers that give relevant
  documents a higher rank. Considers all retrieved relevant documents. If ``open_domain=True``,
  average precision is normalized by the number of retrieved relevant documents per query.
  If ``open_domain=False``, average precision is normalized by the number of all relevant documents
  per query.
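
The three metrics above can be computed from ranked document ids as in the following sketch. It is a plain-Python illustration of the definitions, not the library's evaluation code; the function names and the toy data are assumptions.

```python
from typing import Dict, List, Set, Tuple


def reciprocal_rank(retrieved: List[str], relevant: Set[str]) -> float:
    """1 / rank of the highest-ranked relevant document, 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def average_precision(retrieved: List[str], relevant: Set[str], open_domain: bool) -> float:
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    # open_domain=True: normalize by the retrieved relevant documents;
    # open_domain=False: normalize by all relevant documents for the query.
    denominator = hits if open_domain else len(relevant)
    return precision_sum / denominator if denominator else 0.0


def evaluate(runs: List[Tuple[List[str], Set[str]]], open_domain: bool = False) -> Dict[str, float]:
    n = len(runs)
    return {
        "recall": sum(1 for retrieved, relevant in runs if set(retrieved) & relevant) / n,
        "mrr": sum(reciprocal_rank(r, rel) for r, rel in runs) / n,
        "map": sum(average_precision(r, rel, open_domain) for r, rel in runs) / n,
    }


# Each entry: (ranked retrieved ids, ids labeled as relevant) for one question.
runs = [
    (["d3", "d1", "d7"], {"d1", "d9"}),  # relevant document at rank 2, one missed
    (["d2", "d4", "d5"], {"d2"}),        # relevant document at rank 1
    (["d6", "d8"], {"d0"}),              # nothing relevant retrieved
]
closed = evaluate(runs, open_domain=False)
opened = evaluate(runs, open_domain=True)
```

In the toy data, the first question lowers "map" under open_domain=False because the missed document d9 still counts in the denominator.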

Arguments:

  • label_index: Index/Table in DocumentStore where labeled questions are stored
  • doc_index: Index/Table in DocumentStore where documents that are used for evaluation are stored
  • top_k: How many documents to return per query
  • open_domain: If True, retrieval will be evaluated by checking if the answer string to a question is contained in the retrieved docs (common approach in open-domain QA). If False, retrieval uses a stricter evaluation that checks if the retrieved document ids are within ids explicitly stated in the labels.
  • return_preds: Whether to add predictions in the returned dictionary. If True, the returned dictionary contains the keys "predictions" and "metrics".
  • headers: Custom HTTP headers to pass to document store client if supported (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='} for basic authentication)

DensePassageRetriever.run

def run(root_node: str,
        query: Optional[str] = None,
        filters: Optional[FilterType] = None,
        top_k: Optional[int] = None,
        documents: Optional[List[Document]] = None,
        index: Optional[str] = None,
        headers: Optional[Dict[str, str]] = None,
        scale_score: Optional[bool] = None)

Arguments:

  • root_node: The root node of the pipeline's graph.
  • query: Query string.
  • filters: A dictionary where the keys specify a metadata field and the value is a list of accepted values for that field.
  • top_k: How many documents to return per query.
  • documents: List of Documents to Retrieve.
  • index: The name of the index in the DocumentStore from which to retrieve documents.
  • headers: Custom HTTP headers to pass to document store client if supported (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='} for basic authentication).
  • scale_score: Whether to scale the similarity score to the unit interval (range of [0,1]). If true (default), similarity scores (e.g. cosine or dot_product) which naturally have a different value range will be scaled to a range of [0,1], where 1 means extremely relevant. Otherwise, raw similarity scores (e.g. cosine or dot_product) will be used.
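
To make the effect of scale_score concrete, here is a minimal sketch of one way raw similarity scores can be mapped to [0,1]. The two formulas (a linear shift for cosine, a sigmoid of score / 100 for dot_product) are assumptions chosen for illustration; this reference does not state the exact mapping.

```python
import math


def scale_to_unit_interval(score: float, similarity: str) -> float:
    """Map a raw similarity score to the unit interval (illustrative formulas)."""
    if similarity == "cosine":
        # Cosine similarity lies in [-1, 1]; shift and halve it.
        return (score + 1) / 2
    # Dot products are unbounded; squash them with a sigmoid (assumed scaling factor 100).
    return 1 / (1 + math.exp(-score / 100))


raw_scores = {"cosine": 0.8, "dot_product": 250.0}
scaled = {name: scale_to_unit_interval(value, name) for name, value in raw_scores.items()}
```

Scaled scores from different similarity functions share a range, which makes thresholds easier to set, while raw scores preserve the original magnitudes.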

DensePassageRetriever.run_batch

def run_batch(root_node: str,
              queries: Optional[List[str]] = None,
              filters: Optional[Union[FilterType,
                                      List[Optional[FilterType]]]] = None,
              top_k: Optional[int] = None,
              documents: Optional[Union[List[Document],
                                        List[List[Document]]]] = None,
              index: Optional[str] = None,
              headers: Optional[Dict[str, str]] = None)

Arguments:

  • root_node: The root node of the pipeline's graph.
  • queries: The list of query strings.
  • filters: A dictionary where the keys specify a metadata field and the value is a list of accepted values for that field.
  • top_k: How many documents to return per query.
  • documents: List of Documents to Retrieve.
  • index: The name of the index in the DocumentStore from which to retrieve documents.
  • headers: Custom HTTP headers to pass to document store client if supported (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='} for basic authentication).

TableTextRetriever

class TableTextRetriever(DenseRetriever)

Retriever that uses a tri-encoder to jointly retrieve among a database consisting of text passages and tables (one transformer for the query, one transformer for text passages, one transformer for tables). See the original paper for more details: Kostić, Bogdan, et al. (2021): "Multi-modal Retrieval of Tables and Texts Using Tri-encoder Models" (https://arxiv.org/abs/2108.04049).

TableTextRetriever.__init__

def __init__(document_store: Optional[BaseDocumentStore] = None,
             query_embedding_model: Union[
                 Path,
                 str] = "deepset/bert-small-mm_retrieval-question_encoder",
             passage_embedding_model: Union[
                 Path,
                 str] = "deepset/bert-small-mm_retrieval-passage_encoder",
             table_embedding_model: Union[
                 Path, str] = "deepset/bert-small-mm_retrieval-table_encoder",
             model_version: Optional[str] = None,
             max_seq_len_query: int = 64,
             max_seq_len_passage: int = 256,
             max_seq_len_table: int = 256,
             top_k: int = 10,
             use_gpu: bool = True,
             batch_size: int = 16,
             embed_meta_fields: Optional[List[str]] = None,
             use_fast_tokenizers: bool = True,
             similarity_function: str = "dot_product",
             global_loss_buffer_size: int = 150000,
             progress_bar: bool = True,
             devices: Optional[List[Union[str, "torch.device"]]] = None,
             use_auth_token: Optional[Union[str, bool]] = None,
             scale_score: bool = True,
             use_fast: bool = True)

Initialize the Retriever, including the three encoder models (query, passage, and table), from a local or remote model checkpoint.

The checkpoint format matches the Hugging Face transformers model format.

Arguments:

  • document_store: An instance of DocumentStore from which to retrieve documents.
  • query_embedding_model: Local path or remote name of question encoder checkpoint. The format equals the one used by hugging-face transformers' modelhub models.
  • passage_embedding_model: Local path or remote name of passage encoder checkpoint. The format equals the one used by hugging-face transformers' modelhub models.
  • table_embedding_model: Local path or remote name of table encoder checkpoint. The format equals the one used by hugging-face transformers' modelhub models.
  • model_version: The version of model to use from the HuggingFace model hub. Can be tag name, branch name, or commit hash.
  • max_seq_len_query: Longest length of each query sequence. Maximum number of tokens for the query text. Longer ones will be cut down.
  • max_seq_len_passage: Longest length of each passage/context sequence. Maximum number of tokens for the passage text. Longer ones will be cut down.
  • top_k: How many documents to return per query.
  • use_gpu: Whether to use all available GPUs or the CPU. Falls back on CPU if no GPU is available.
  • batch_size: Number of questions or passages to encode at once. In case of multiple gpus, this will be the total batch size.
  • embed_meta_fields: Concatenate the provided meta fields and text passage / table to a text pair that is then used to create the embedding. This is the approach used in the original paper and is likely to improve performance if your titles contain meaningful information for retrieval (topic, entities etc.). If no value is provided, a default will be created. That default embeds name, section title and caption.
  • use_fast_tokenizers: Whether to use fast Rust tokenizers
  • similarity_function: Which function to apply for calculating the similarity of query and passage embeddings during training. Options: dot_product (Default) or cosine
  • global_loss_buffer_size: Buffer size for all_gather() in DDP. Increase if errors like "encoded data exceeds max_size ..." come up
  • progress_bar: Whether to show a tqdm progress bar or not. Can be helpful to disable in production deployments to keep the logs clean.
  • devices: List of torch devices (e.g. cuda, cpu, mps) to limit inference to specific devices. A list containing torch device objects and/or strings is supported (For example [torch.device('cuda:0'), "mps", "cuda:1"]). When specifying use_gpu=False the devices parameter is not used and a single cpu device is used for inference. Note: as multi-GPU training is currently not implemented for TableTextRetriever, training will only use the first device provided in this list.
  • use_auth_token: The API token used to download private models from Huggingface. If this parameter is set to True, then the token generated when running transformers-cli login (stored in ~/.huggingface) will be used. Additional information can be found here https://huggingface.co/transformers/main_classes/model.html#transformers.PreTrainedModel.from_pretrained
  • scale_score: Whether to scale the similarity score to the unit interval (range of [0,1]). If true (default) similarity scores (e.g. cosine or dot_product) which naturally have a different value range will be scaled to a range of [0,1], where 1 means extremely relevant. Otherwise raw similarity scores (e.g. cosine or dot_product) will be used.
  • use_fast: Whether to use the fast version of DPR tokenizers or fallback to the standard version. Defaults to True.
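
The two similarity_function options rank documents differently, as the following sketch shows. The vectors are made-up toy embeddings and the helper names are hypothetical; real embeddings come from the encoders.

```python
import math
from typing import Sequence


def dot_product(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    norm = math.sqrt(dot_product(a, a)) * math.sqrt(dot_product(b, b))
    return dot_product(a, b) / norm if norm else 0.0


query = [1.0, 0.0]
short_doc = [1.0, 0.0]  # same direction as the query, small magnitude
long_doc = [2.0, 2.0]   # different direction, larger magnitude

# dot_product rewards magnitude, cosine only compares direction.
by_dot = sorted(["short_doc", "long_doc"],
                key=lambda n: dot_product(query, {"short_doc": short_doc, "long_doc": long_doc}[n]),
                reverse=True)
by_cosine = sorted(["short_doc", "long_doc"],
                   key=lambda n: cosine(query, {"short_doc": short_doc, "long_doc": long_doc}[n]),
                   reverse=True)
```

Because the two functions can disagree on ranking, it is generally safest to query with the same similarity function the encoders were trained with.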

TableTextRetriever.retrieve_batch

def retrieve_batch(
    queries: List[str],
    filters: Optional[Union[FilterType, List[Optional[FilterType]]]] = None,
    top_k: Optional[int] = None,
    index: Optional[str] = None,
    headers: Optional[Dict[str, str]] = None,
    batch_size: Optional[int] = None,
    scale_score: Optional[bool] = None,
    document_store: Optional[BaseDocumentStore] = None
) -> List[List[Document]]

Scan through the documents in a DocumentStore and return a small number of documents that are most relevant to the supplied queries.

Returns a list of lists of Documents (one per query).

Arguments:

  • queries: List of query strings.
  • filters: Optional filters to narrow down the search space to documents whose metadata fulfill certain conditions. Can be a single filter that will be applied to each query or a list of filters (one filter per query).

Filters are defined as nested dictionaries. The keys of the dictionaries can be a logical operator ("$and", "$or", "$not"), a comparison operator ("$eq", "$in", "$gt", "$gte", "$lt", "$lte") or a metadata field name. Logical operator keys take a dictionary of metadata field names and/or logical operators as value. Metadata field names take a dictionary of comparison operators as value. Comparison operator keys take a single value or (in case of "$in") a list of values as value. If no logical operator is provided, "$and" is used as default operation. If no comparison operator is provided, "$eq" (or "$in" if the comparison value is a list) is used as default operation.

__Example__:

```python
filters = {
    "$and": {
        "type": {"$eq": "article"},
        "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
        "rating": {"$gte": 3},
        "$or": {
            "genre": {"$in": ["economy", "politics"]},
            "publisher": {"$eq": "nytimes"}
        }
    }
}
# or simpler using default operators
filters = {
    "type": "article",
    "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
    "rating": {"$gte": 3},
    "$or": {
        "genre": ["economy", "politics"],
        "publisher": "nytimes"
    }
}
```

To use the same logical operator multiple times on the same level, logical operators can
optionally take a list of dictionaries as their value.

__Example__:

```python
filters = {
    "$or": [
        {
            "$and": {
                "Type": "News Paper",
                "Date": {
                    "$lt": "2019-01-01"
                }
            }
        },
        {
            "$and": {
                "Type": "Blog Post",
                "Date": {
                    "$gte": "2019-01-01"
                }
            }
        }
    ]
}
```
  • top_k: How many documents to return per query.
  • index: The name of the index in the DocumentStore from which to retrieve documents
  • headers: Custom HTTP headers to pass to document store client if supported (e.g. {'Authorization': 'Basic API_KEY'} for basic authentication)
  • batch_size: Number of queries to embed at a time.
  • scale_score: Whether to scale the similarity score to the unit interval (range of [0,1]). If true similarity scores (e.g. cosine or dot_product) which naturally have a different value range will be scaled to a range of [0,1], where 1 means extremely relevant. Otherwise raw similarity scores (e.g. cosine or dot_product) will be used.
  • document_store: The DocumentStore to use for retrieval. If None, the one given in __init__ is used instead.
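
The filters argument accepts either one filter for all queries or one filter per query. A minimal sketch of how the two forms can be normalized is shown below. The helper `filters_per_query` and the `FilterType` alias are stand-ins for illustration, and raising on a length mismatch is an assumption.

```python
from typing import Any, Dict, List, Optional, Union

FilterType = Dict[str, Any]  # stand-in alias for this sketch


def filters_per_query(
    queries: List[str],
    filters: Optional[Union[FilterType, List[Optional[FilterType]]]] = None,
) -> List[Optional[FilterType]]:
    """Return exactly one filter (or None) for every query."""
    if filters is None or isinstance(filters, dict):
        # A single filter (or no filter) is applied to each query.
        return [filters] * len(queries)
    if len(filters) != len(queries):
        raise ValueError("Expected one filter per query.")
    return list(filters)


queries = ["Who wrote Faust?", "Revenue in Q1 2019?"]
shared = filters_per_query(queries, {"type": "article"})
individual = filters_per_query(queries, [{"genre": ["literature"]}, None])
```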

TableTextRetriever.embed_queries

def embed_queries(queries: List[str]) -> np.ndarray

Create embeddings for a list of queries using the query encoder.

Arguments:

  • queries: List of queries to embed.

Returns:

Embeddings, one per input query, shape: (queries, embedding_dim)

TableTextRetriever.embed_documents

def embed_documents(documents: List[Document]) -> np.ndarray

Create embeddings for a list of text documents and / or tables using the text passage encoder and the table encoder.

Arguments:

  • documents: List of documents to embed.

Returns:

Embeddings of documents, one per input document, shape: (documents, embedding_dim)
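
The idea of embedding a mixed list with two encoders while returning one embedding per input document, in input order, can be sketched as follows. Plain dictionaries stand in for Document objects and the toy encoders are invented for the example; none of this is the library's implementation.

```python
from typing import Callable, Dict, List, Optional

Vector = List[float]
Encoder = Callable[[List[str]], List[Vector]]


def embed_mixed(documents: List[Dict[str, str]],
                passage_encoder: Encoder,
                table_encoder: Encoder) -> List[Optional[Vector]]:
    """Route each document to the matching encoder and keep the input order."""
    embeddings: List[Optional[Vector]] = [None] * len(documents)
    text_idx = [i for i, d in enumerate(documents) if d["content_type"] == "text"]
    table_idx = [i for i, d in enumerate(documents) if d["content_type"] == "table"]
    for indices, encoder in ((text_idx, passage_encoder), (table_idx, table_encoder)):
        if not indices:
            continue
        vectors = encoder([documents[i]["content"] for i in indices])
        for i, vector in zip(indices, vectors):
            embeddings[i] = vector
    return embeddings


# Toy encoders (assumptions): they only encode the text length in different slots.
def toy_passage_encoder(texts: List[str]) -> List[Vector]:
    return [[float(len(t)), 0.0] for t in texts]


def toy_table_encoder(texts: List[str]) -> List[Vector]:
    return [[0.0, float(len(t))] for t in texts]


docs = [
    {"content": "abc", "content_type": "text"},
    {"content": "a|b", "content_type": "table"},
    {"content": "hello", "content_type": "text"},
]
result = embed_mixed(docs, toy_passage_encoder, toy_table_encoder)
```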

TableTextRetriever.train

def train(data_dir: str,
          train_filename: str,
          dev_filename: Optional[str] = None,
          test_filename: Optional[str] = None,
          max_samples: Optional[int] = None,
          max_processes: int = 128,
          dev_split: float = 0,
          batch_size: int = 2,
          embed_meta_fields: Optional[List[str]] = None,
          num_hard_negatives: int = 1,
          num_positives: int = 1,
          n_epochs: int = 3,
          evaluate_every: int = 1000,
          n_gpu: int = 1,
          learning_rate: float = 1e-5,
          epsilon: float = 1e-08,
          weight_decay: float = 0.0,
          num_warmup_steps: int = 100,
          grad_acc_steps: int = 1,
          use_amp: bool = False,
          optimizer_name: str = "AdamW",
          optimizer_correct_bias: bool = True,
          save_dir: str = "../saved_models/mm_retrieval",
          query_encoder_save_dir: str = "query_encoder",
          passage_encoder_save_dir: str = "passage_encoder",
          table_encoder_save_dir: str = "table_encoder",
          checkpoint_root_dir: Path = Path("model_checkpoints"),
          checkpoint_every: Optional[int] = None,
          checkpoints_to_keep: int = 3,
          early_stopping: Optional[EarlyStopping] = None)

Train a TableTextRetrieval model.

Arguments:

  • data_dir: Directory where training file, dev file and test file are present.
  • train_filename: Training filename.
  • dev_filename: Development set filename, file to be used by model in eval step of training.
  • test_filename: Test set filename, file to be used by model in test step after training.
  • max_samples: Maximum number of input samples to convert. Can be used for debugging a smaller dataset.
  • max_processes: The maximum number of processes to spawn in the multiprocessing.Pool used in DataSilo. It can be set to 1 to disable the use of multiprocessing or make debugging easier.
  • dev_split: The proportion of the train set that will be sliced off and used as the dev set. Only works if dev_filename is set to None.
  • batch_size: Total number of samples in 1 batch of data.
  • embed_meta_fields: Concatenate meta fields with each passage and table. If no value is provided, a default will be created. That default embeds page title, section title and caption with the corresponding table and title with corresponding text passage.
  • num_hard_negatives: Number of hard negative passages (passages that are very similar to the query, that is, they get a high BM25 score, but do not contain the answer).
  • num_positives: Number of positive passages.
  • n_epochs: Number of epochs to train the model on.
  • evaluate_every: Number of training steps after which evaluation is run.
  • n_gpu: Number of gpus to train on.
  • learning_rate: Learning rate of optimizer.
  • epsilon: Epsilon parameter of optimizer.
  • weight_decay: Weight decay parameter of optimizer.
  • grad_acc_steps: Number of steps to accumulate gradient over before back-propagation is done.
  • use_amp: Whether to use automatic mixed precision (AMP) natively implemented in PyTorch to improve training speed and reduce GPU memory usage. For more information, see Haystack Optimization (https://haystack.deepset.ai/guides/optimization) and Automatic Mixed Precision Package - Torch.amp (https://pytorch.org/docs/stable/amp.html).
  • optimizer_name: The optimizer to use (default: AdamW).
  • num_warmup_steps: Number of warmup steps.
  • optimizer_correct_bias: Whether to correct bias in optimizer.
  • save_dir: Directory where models are saved.
  • query_encoder_save_dir: Directory inside save_dir where query_encoder model files are saved.
  • passage_encoder_save_dir: Directory inside save_dir where passage_encoder model files are saved.
  • table_encoder_save_dir: Directory inside save_dir where table_encoder model files are saved.
  • checkpoint_root_dir: The Path of a directory where all train checkpoints are saved. For each individual checkpoint, a subdirectory with the name epoch{epoch_num}_step{step_num} is created.
  • checkpoint_every: Save a train checkpoint after this many steps of training.
  • checkpoints_to_keep: The maximum number of train checkpoints to save.
  • early_stopping: An initialized EarlyStopping object to control early stopping and saving of the best models.
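
The dev_split argument can be read as "hold out this fraction of the training samples for evaluation during training". A minimal sketch of such a split is shown below. The helper `split_train_dev`, the shuffling, and the fixed seed are assumptions for illustration, not the library's data handling.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_train_dev(samples: Sequence[T], dev_split: float, seed: int = 42) -> Tuple[List[T], List[T]]:
    """Slice off a dev set of roughly dev_split * len(samples) items."""
    if not 0 <= dev_split < 1:
        raise ValueError("dev_split must be in [0, 1).")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # assumed: shuffle before slicing
    n_dev = int(len(shuffled) * dev_split)
    return shuffled[n_dev:], shuffled[:n_dev]


train_samples, dev_samples = split_train_dev(range(40), dev_split=0.25)
```

With the default dev_split of 0, the dev set in this sketch is empty and every sample is used for training.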

TableTextRetriever.save

def save(save_dir: Union[Path, str],
         query_encoder_dir: str = "query_encoder",
         passage_encoder_dir: str = "passage_encoder",
         table_encoder_dir: str = "table_encoder")

Save TableTextRetriever to the specified directory.

Arguments:

  • save_dir: Directory to save to.
  • query_encoder_dir: Directory in save_dir that contains query encoder model.
  • passage_encoder_dir: Directory in save_dir that contains passage encoder model.
  • table_encoder_dir: Directory in save_dir that contains table encoder model.

Returns:

None
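
The directory arguments combine into one subdirectory per encoder below save_dir. The sketch below only builds the paths with pathlib; `encoder_dirs` is a hypothetical helper, and the files written inside each directory are not modeled.

```python
from pathlib import Path
from typing import Dict, Union


def encoder_dirs(save_dir: Union[Path, str],
                 query_encoder_dir: str = "query_encoder",
                 passage_encoder_dir: str = "passage_encoder",
                 table_encoder_dir: str = "table_encoder") -> Dict[str, Path]:
    """Resolve where each of the three encoders would be stored."""
    root = Path(save_dir)
    return {
        "query": root / query_encoder_dir,
        "passage": root / passage_encoder_dir,
        "table": root / table_encoder_dir,
    }


layout = encoder_dirs("saved_models/mm_retrieval")
```

Passing the same three directory names to load keeps saving and loading symmetric.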

TableTextRetriever.load

@classmethod
def load(cls,
         load_dir: Union[Path, str],
         document_store: BaseDocumentStore,
         max_seq_len_query: int = 64,
         max_seq_len_passage: int = 256,
         max_seq_len_table: int = 256,
         use_gpu: bool = True,
         batch_size: int = 16,
         embed_meta_fields: Optional[List[str]] = None,
         use_fast_tokenizers: bool = True,
         similarity_function: str = "dot_product",
         query_encoder_dir: str = "query_encoder",
         passage_encoder_dir: str = "passage_encoder",
         table_encoder_dir: str = "table_encoder")

Load TableTextRetriever from the specified directory.

TableTextRetriever.timing

def timing(fn, attr_name)

Wrapper method used to time functions.

TableTextRetriever.eval

def eval(label_index: str = "label",
         doc_index: str = "eval_document",
         label_origin: str = "gold-label",
         top_k: int = 10,
         open_domain: bool = False,
         return_preds: bool = False,
         headers: Optional[Dict[str, str]] = None,
         document_store: Optional[BaseDocumentStore] = None) -> dict

Performs evaluation on the Retriever.

Retriever is evaluated based on whether it finds the correct document given the query string and at which position in the ranking of documents the correct document is.

Returns a dict containing the following metrics:

- "recall": Proportion of questions for which correct document is among retrieved documents
- "mrr": Mean of reciprocal rank. Rewards retrievers that give relevant documents a higher rank.
  Only considers the highest ranked relevant document.
- "map": Mean of average precision for each question. Rewards retrievers that give relevant
  documents a higher rank. Considers all retrieved relevant documents. If ``open_domain=True``,
  average precision is normalized by the number of retrieved relevant documents per query.
  If ``open_domain=False``, average precision is normalized by the number of all relevant documents
  per query.

Arguments:

  • label_index: Index/Table in DocumentStore where labeled questions are stored
  • doc_index: Index/Table in DocumentStore where documents that are used for evaluation are stored
  • top_k: How many documents to return per query
  • open_domain: If True, retrieval will be evaluated by checking if the answer string to a question is contained in the retrieved docs (common approach in open-domain QA). If False, retrieval uses a stricter evaluation that checks if the retrieved document ids are within ids explicitly stated in the labels.
  • return_preds: Whether to add predictions in the returned dictionary. If True, the returned dictionary contains the keys "predictions" and "metrics".
  • headers: Custom HTTP headers to pass to document store client if supported (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='} for basic authentication)
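
The difference between the two open_domain modes comes down to how a retrieved document is judged correct. The sketch below uses plain dictionaries as stand-ins for documents and labels; the field names and the exact, case-sensitive string match are assumptions.

```python
from typing import Any, Dict


def is_correct(document: Dict[str, Any], label: Dict[str, Any], open_domain: bool) -> bool:
    """Decide whether one retrieved document counts as a hit for one label."""
    if open_domain:
        # Open-domain: the answer string only has to appear in the document text.
        return label["answer"] in document["content"]
    # Closed-domain: the document id must be one of the ids stated in the label.
    return document["id"] in label["document_ids"]


label = {"answer": "Goethe", "document_ids": ["doc-1"]}
same_doc = {"id": "doc-1", "content": "Faust was written by Goethe."}
other_doc = {"id": "doc-7", "content": "Goethe was born in Frankfurt."}
```

Here other_doc counts as a hit only with open_domain=True, which is why the closed-domain setting is described as stricter.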

TableTextRetriever.run

def run(root_node: str,
        query: Optional[str] = None,
        filters: Optional[FilterType] = None,
        top_k: Optional[int] = None,
        documents: Optional[List[Document]] = None,
        index: Optional[str] = None,
        headers: Optional[Dict[str, str]] = None,
        scale_score: Optional[bool] = None)

Arguments:

  • root_node: The root node of the pipeline's graph.
  • query: Query string.
  • filters: A dictionary where the keys specify a metadata field and the value is a list of accepted values for that field.
  • top_k: How many documents to return per query.
  • documents: List of Documents to Retrieve.
  • index: The name of the index in the DocumentStore from which to retrieve documents.
  • headers: Custom HTTP headers to pass to document store client if supported (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='} for basic authentication).
  • scale_score: Whether to scale the similarity score to the unit interval (range of [0,1]). If true (default), similarity scores (e.g. cosine or dot_product) which naturally have a different value range will be scaled to a range of [0,1], where 1 means extremely relevant. Otherwise, raw similarity scores (e.g. cosine or dot_product) will be used.

TableTextRetriever.run_batch

def run_batch(root_node: str,
              queries: Optional[List[str]] = None,
              filters: Optional[Union[FilterType,
                                      List[Optional[FilterType]]]] = None,
              top_k: Optional[int] = None,
              documents: Optional[Union[List[Document],
                                        List[List[Document]]]] = None,
              index: Optional[str] = None,
              headers: Optional[Dict[str, str]] = None)

Arguments:

  • root_node: The root node of the pipeline's graph.
  • queries: The list of query strings.
  • filters: A dictionary where the keys specify a metadata field and the value is a list of accepted values for that field.
  • top_k: How many documents to return per query.
  • documents: List of Documents to Retrieve.
  • index: The name of the index in the DocumentStore from which to retrieve documents.
  • headers: Custom HTTP headers to pass to document store client if supported (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='} for basic authentication).

EmbeddingRetriever

class EmbeddingRetriever(DenseRetriever)

EmbeddingRetriever.__init__

def __init__(embedding_model: str,
             document_store: Optional[BaseDocumentStore] = None,
             model_version: Optional[str] = None,
             use_gpu: bool = True,
             batch_size: int = 32,
             max_seq_len: int = 512,
             model_format: Optional[str] = None,
             pooling_strategy: str = "reduce_mean",
             query_prompt: Optional[str] = None,
             passage_prompt: Optional[str] = None,
             emb_extraction_layer: int = -1,
             top_k: int = 10,
             progress_bar: bool = True,
             devices: Optional[List[Union[str, "torch.device"]]] = None,
             use_auth_token: Optional[Union[str, bool]] = None,
             scale_score: bool = True,
             embed_meta_fields: Optional[List[str]] = None,
             api_key: Optional[str] = None,
             azure_api_version: str = "2022-12-01",
             azure_base_url: Optional[str] = None,
             azure_deployment_name: Optional[str] = None,
             api_base: str = "https://api.openai.com/v1",
             openai_organization: Optional[str] = None,
             aws_config: Optional[Dict[str, Any]] = None)

Arguments:

  • document_store: An instance of DocumentStore from which to retrieve documents.
  • embedding_model: Local path or name of a model in Hugging Face's model hub, such as 'sentence-transformers/all-MiniLM-L6-v2'. The embedding model can also be an OpenAI model ["ada", "babbage", "davinci", "curie"], a Cohere model ["embed-english-v2.0", "embed-english-light-v2.0", "embed-multilingual-v2.0"], or an AWS Bedrock model ["amazon.titan-embed-text-v1", "cohere.embed-english-v3", "cohere.embed-multilingual-v3"].
  • model_version: The version of model to use from the HuggingFace model hub. Can be tag name, branch name, or commit hash.
  • use_gpu: Whether to use all available GPUs or the CPU. Falls back on CPU if no GPU is available.
  • batch_size: Number of documents to encode at once.
  • max_seq_len: Longest length of each document sequence. Maximum number of tokens for the document text. Longer ones will be cut down.
  • model_format: Name of framework that was used for saving the model or model type. If no model_format is provided, it will be inferred automatically from the model configuration files. Options:
  1. farm : (will use _DefaultEmbeddingEncoder as embedding encoder)
  2. transformers : (will use _DefaultEmbeddingEncoder as embedding encoder)
  3. sentence_transformers : (will use _SentenceTransformersEmbeddingEncoder as embedding encoder)
  4. retribert : (will use _RetribertEmbeddingEncoder as embedding encoder)
  5. openai : (will use _OpenAIEmbeddingEncoder as embedding encoder)
  6. cohere : (will use _CohereEmbeddingEncoder as embedding encoder)
  7. bedrock : (will use _BedrockEmbeddingEncoder as embedding encoder)
  • pooling_strategy: Strategy for combining the embeddings from the model (for farm / transformers models only). Options:
  1. cls_token (sentence vector)
  2. reduce_mean (sentence vector)
  3. reduce_max (sentence vector)
  4. per_token (individual token vectors)
  • query_prompt: Model instruction for embedding texts to be used as queries.
  • passage_prompt: Model instruction for embedding texts to be retrieved.
  • emb_extraction_layer: Number of layer from which the embeddings shall be extracted (for farm / transformers models only). Default: -1 (very last layer).
  • top_k: How many documents to return per query.
  • progress_bar: If true, displays a progress bar during embedding.
  • devices: List of torch devices (e.g. cuda, cpu, mps) to limit inference to specific devices. A list containing torch device objects and/or strings is supported (For example [torch.device('cuda:0'), "mps", "cuda:1"]). When specifying use_gpu=False the devices parameter is not used and a single cpu device is used for inference. Note: As multi-GPU training is currently not implemented for EmbeddingRetriever, training will only use the first device provided in this list.
  • use_auth_token: The API token used to download private models from Huggingface. If this parameter is set to True, then the token generated when running transformers-cli login (stored in ~/.huggingface) will be used. Additional information can be found here https://huggingface.co/transformers/main_classes/model.html#transformers.PreTrainedModel.from_pretrained
  • scale_score: Whether to scale the similarity score to the unit interval (range of [0,1]). If true (default) similarity scores (e.g. cosine or dot_product) which naturally have a different value range will be scaled to a range of [0,1], where 1 means extremely relevant. Otherwise raw similarity scores (e.g. cosine or dot_product) will be used.
  • embed_meta_fields: Concatenate the provided meta fields and text passage / table to a text pair that is then used to create the embedding. This approach is also used in the TableTextRetriever paper and is likely to improve performance if your titles contain meaningful information for retrieval (topic, entities etc.). If no value is provided, a default empty list will be created.
  • api_key: The OpenAI API key or the Cohere API key. Required if one wants to use OpenAI/Cohere embeddings. For more details see https://beta.openai.com/account/api-keys and https://dashboard.cohere.ai/api-keys
  • azure_api_version: The version of the Azure OpenAI API to use. The default is 2022-12-01 version.
  • azure_base_url: The base URL for the Azure OpenAI API. If not supplied, Azure OpenAI API will not be used. This parameter is an OpenAI Azure endpoint, usually in the form `https://.openai.azure.com'
  • azure_deployment_name: The name of the Azure OpenAI API deployment. If not supplied, Azure OpenAI API will not be used.
  • api_base: The OpenAI API base URL, defaults to "https://api.openai.com/v1".
  • openai_organization: The OpenAI-Organization ID, defaults to None. For more details, see OpenAI documentation.
  • aws_config: The aws_config contains {aws_access_key, aws_secret_key, aws_region, profile_name} to use with the boto3 Session for an AWS Bedrock retriever. Defaults to 'None'.
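
The four pooling_strategy options differ in how per-token vectors are combined into the final embedding. The following sketch works on plain lists; it assumes the first token is the CLS token and ignores padding masks, which a real encoder would have to handle.

```python
from typing import List, Union

Vector = List[float]


def pool(token_vectors: List[Vector], strategy: str = "reduce_mean") -> Union[Vector, List[Vector]]:
    """Combine per-token vectors according to the chosen pooling strategy."""
    if strategy == "cls_token":
        return token_vectors[0]  # assumed: first position holds the CLS token
    if strategy == "reduce_mean":
        return [sum(column) / len(column) for column in zip(*token_vectors)]
    if strategy == "reduce_max":
        return [max(column) for column in zip(*token_vectors)]
    if strategy == "per_token":
        return token_vectors  # no pooling, one vector per token
    raise ValueError(f"Unknown pooling strategy: {strategy}")


tokens = [[1.0, 4.0], [3.0, 0.0], [2.0, 2.0]]
pooled = {name: pool(tokens, name) for name in ("cls_token", "reduce_mean", "reduce_max", "per_token")}
```

The first three strategies return a single sentence vector, while per_token keeps one vector per token.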

EmbeddingRetriever.retrieve

def retrieve(
        query: str,
        filters: Optional[FilterType] = None,
        top_k: Optional[int] = None,
        index: Optional[str] = None,
        headers: Optional[Dict[str, str]] = None,
        scale_score: Optional[bool] = None,
        document_store: Optional[BaseDocumentStore] = None) -> List[Document]

Scan through the documents in a DocumentStore and return a small number of documents that are most relevant to the query.

Arguments:

  • query: The query string.

  • filters: Optional filters to narrow down the search space to documents whose metadata fulfill certain conditions. Filters are defined as nested dictionaries. The keys of the dictionaries can be a logical operator ("$and", "$or", "$not"), a comparison operator ("$eq", "$in", "$gt", "$gte", "$lt", "$lte") or a metadata field name. Logical operator keys take a dictionary of metadata field names and/or logical operators as value. Metadata field names take a dictionary of comparison operators as value. Comparison operator keys take a single value or (in case of "$in") a list of values as value. If no logical operator is provided, "$and" is used as default operation. If no comparison operator is provided, "$eq" (or "$in" if the comparison value is a list) is used as default operation.

    __Example__:
    
    ```python
    filters = {
        "$and": {
            "type": {"$eq": "article"},
            "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
            "rating": {"$gte": 3},
            "$or": {
                "genre": {"$in": ["economy", "politics"]},
                "publisher": {"$eq": "nytimes"}
            }
        }
    }
    # or simpler using default operators
    filters = {
        "type": "article",
        "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
        "rating": {"$gte": 3},
        "$or": {
            "genre": ["economy", "politics"],
            "publisher": "nytimes"
        }
    }
    ```
    
    To use the same logical operator multiple times on the same level, logical operators can
    optionally take a list of dictionaries as value.
    
    __Example__:
    
    ```python
    filters = {
        "$or": [
            {
                "$and": {
                    "Type": "News Paper",
                    "Date": {
                        "$lt": "2019-01-01"
                    }
                }
            },
            {
                "$and": {
                    "Type": "Blog Post",
                    "Date": {
                        "$gte": "2019-01-01"
                    }
                }
            }
        ]
    }
    ```
    
  • top_k: How many documents to return per query.

  • index: The name of the index in the DocumentStore from which to retrieve documents

  • headers: Custom HTTP headers to pass to document store client if supported (e.g. {'Authorization': 'Basic API_KEY'} for basic authentication)

  • scale_score: Whether to scale the similarity score to the unit interval (range of [0,1]). If true, similarity scores (e.g. cosine or dot_product), which naturally have a different value range, are scaled to a range of [0,1], where 1 means extremely relevant. Otherwise, raw similarity scores (e.g. cosine or dot_product) are used.

  • document_store: The document store to use for retrieval. If None, the one passed to __init__ is used instead.
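
The filter semantics described above can be sketched in plain Python. This is only an illustration of the defaulting rules ("$and" when no logical operator is given, "$eq" or "$in" when no comparison operator is given), not the document store's actual implementation. The names `matches`, `_compare`, and `_field_matches` are hypothetical, "$not" is left out because its semantics are not spelled out here, and dates are assumed to be ISO-formatted strings so that string comparison orders them correctly.

```python
from typing import Any, Dict, List, Union

Filter = Union[Dict[str, Any], List[Dict[str, Any]]]


def _compare(op: str, actual: Any, expected: Any) -> bool:
    """Apply one comparison operator to a metadata value."""
    if actual is None:
        return False  # assumption: a missing metadata field never matches
    if op == "$eq":
        return actual == expected
    if op == "$in":
        return actual in expected
    if op == "$gt":
        return actual > expected
    if op == "$gte":
        return actual >= expected
    if op == "$lt":
        return actual < expected
    if op == "$lte":
        return actual <= expected
    raise ValueError(f"Unknown comparison operator: {op}")


def _field_matches(actual: Any, condition: Any) -> bool:
    """A bare list defaults to "$in"; any other bare value defaults to "$eq"."""
    if isinstance(condition, list):
        condition = {"$in": condition}
    elif not isinstance(condition, dict):
        condition = {"$eq": condition}
    return all(_compare(op, actual, expected) for op, expected in condition.items())


def matches(meta: Dict[str, Any], filters: Filter, op: str = "$and") -> bool:
    """Return True if the metadata dict satisfies the (nested) filters."""
    if op not in ("$and", "$or"):
        raise NotImplementedError(f"{op} is not covered by this sketch")
    if isinstance(filters, list):
        # A list repeats the same operator on one level; each entry is its own group.
        results = [matches(meta, clause) for clause in filters]
    else:
        results = [
            matches(meta, value, key) if key.startswith("$")
            else _field_matches(meta.get(key), value)
            for key, value in filters.items()
        ]
    return all(results) if op == "$and" else any(results)


explicit = {
    "$and": {
        "type": {"$eq": "article"},
        "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
        "rating": {"$gte": 3},
        "$or": {"genre": {"$in": ["economy", "politics"]}, "publisher": {"$eq": "nytimes"}},
    }
}
simple = {
    "type": "article",
    "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
    "rating": {"$gte": 3},
    "$or": {"genre": ["economy", "politics"], "publisher": "nytimes"},
}
article = {"type": "article", "date": "2018-05-01", "rating": 4, "genre": "economy"}
print(matches(article, explicit), matches(article, simple))
```

Both filter spellings accept the sample metadata, which is the point of the default operators: the shorter form is meant to be equivalent to the explicit one.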

EmbeddingRetriever.retrieve_batch

def retrieve_batch(
    queries: List[str],
    filters: Optional[Union[FilterType, List[Optional[FilterType]]]] = None,
    top_k: Optional[int] = None,
    index: Optional[str] = None,
    headers: Optional[Dict[str, str]] = None,
    batch_size: Optional[int] = None,
    scale_score: Optional[bool] = None,
    document_store: Optional[BaseDocumentStore] = None
) -> List[List[Document]]

Scan through the documents in a DocumentStore and return a small number of documents that are most relevant to the supplied queries.

Returns a list of lists of Documents (one per query).

Arguments:

  • queries: List of query strings.
  • filters: Optional filters to narrow down the search space to documents whose metadata fulfill certain conditions. Can be a single filter that will be applied to each query or a list of filters (one filter per query).

Filters are defined as nested dictionaries. The keys of the dictionaries can be a logical operator ("$and", "$or", "$not"), a comparison operator ("$eq", "$in", "$gt", "$gte", "$lt", "$lte") or a metadata field name. Logical operator keys take a dictionary of metadata field names and/or logical operators as value. Metadata field names take a dictionary of comparison operators as value. Comparison operator keys take a single value or (in case of "$in") a list of values as value. If no logical operator is provided, "$and" is used as default operation. If no comparison operator is provided, "$eq" (or "$in" if the comparison value is a list) is used as default operation.

__Example__:

```python
filters = {
    "$and": {
        "type": {"$eq": "article"},
        "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
        "rating": {"$gte": 3},
        "$or": {
            "genre": {"$in": ["economy", "politics"]},
            "publisher": {"$eq": "nytimes"}
        }
    }
}
# or simpler using default operators
filters = {
    "type": "article",
    "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
    "rating": {"$gte": 3},
    "$or": {
        "genre": ["economy", "politics"],
        "publisher": "nytimes"
    }
}
```

To use the same logical operator multiple times on the same level, logical operators can
optionally take a list of dictionaries as value.

__Example__:

```python
filters = {
    "$or": [
        {
            "$and": {
                "Type": "News Paper",
                "Date": {
                    "$lt": "2019-01-01"
                }
            }
        },
        {
            "$and": {
                "Type": "Blog Post",
                "Date": {
                    "$gte": "2019-01-01"
                }
            }
        }
    ]
}
```
  • top_k: How many documents to return per query.
  • index: The name of the index in the DocumentStore from which to retrieve documents
  • headers: Custom HTTP headers to pass to document store client if supported (e.g. {'Authorization': 'Basic API_KEY'} for basic authentication)
  • batch_size: Number of queries to embed at a time.
  • scale_score: Whether to scale the similarity score to the unit interval (range of [0,1]). If true, similarity scores (e.g. cosine or dot_product), which naturally have a different value range, are scaled to a range of [0,1], where 1 means extremely relevant. Otherwise, raw similarity scores (e.g. cosine or dot_product) are used.
  • document_store: The document store to use for retrieval. If None, the one passed to __init__ is used instead.
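
To make the `filters` and `batch_size` arguments more concrete, here is a minimal sketch of how a single filter can be expanded to one filter per query and how queries can be split into batches before embedding. The helpers `broadcast_filters` and `batches` and the `FilterType` alias are hypothetical stand-ins written for this example. Raising an error on a length mismatch is an assumption, not documented behavior.

```python
from typing import Any, Dict, Iterator, List, Optional, Union

FilterType = Dict[str, Any]  # stand-in alias for this sketch


def broadcast_filters(
    queries: List[str],
    filters: Optional[Union[FilterType, List[Optional[FilterType]]]],
) -> List[Optional[FilterType]]:
    """Return exactly one filter (or None) per query."""
    if isinstance(filters, list):
        if len(filters) != len(queries):
            # Assumption: a mismatch is treated as an error.
            raise ValueError("Number of filters does not match number of queries")
        return filters
    # A single filter (or None) is applied to every query.
    return [filters] * len(queries)


def batches(queries: List[str], batch_size: Optional[int]) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most batch_size queries."""
    size = batch_size or len(queries) or 1
    for start in range(0, len(queries), size):
        yield queries[start:start + size]


queries = ["first query", "second query", "third query"]
per_query = broadcast_filters(queries, {"type": "article"})
chunks = list(batches(queries, batch_size=2))
print(per_query)
print(chunks)
```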

EmbeddingRetriever.embed_queries

def embed_queries(queries: List[str]) -> np.ndarray

Create embeddings for a list of queries.

Arguments:

  • queries: List of queries to embed.

Returns:

Embeddings, one per input query, shape: (queries, embedding_dim)

EmbeddingRetriever.embed_documents

def embed_documents(documents: List[Document]) -> np.ndarray

Create embeddings for a list of documents.

Arguments:

  • documents: List of documents to embed.

Returns:

Embeddings, one per input document, shape: (docs, embedding_dim)

EmbeddingRetriever.train

def train(training_data: List[Dict[str, Any]],
          learning_rate: float = 2e-5,
          n_epochs: int = 1,
          num_warmup_steps: Optional[int] = None,
          batch_size: int = 16,
          train_loss: Literal["mnrl", "margin_mse"] = "mnrl",
          num_workers: int = 0,
          use_amp: bool = False,
          **kwargs) -> None

Trains/adapts the underlying embedding model. We only support the training of sentence-transformer embedding models.

Each training data example is a dictionary with the following keys:

  • question: the question string
  • pos_doc: the positive document string
  • neg_doc: the negative document string
  • score: the score margin

Arguments:

  • training_data: The training data in a dictionary format.
  • learning_rate: The learning rate.
  • n_epochs: The number of epochs to train for.
  • num_warmup_steps: The number of warmup steps. Behavior depends on the scheduler. For WarmupLinear (default), the learning rate increases from 0 to the maximum learning rate over this many training steps, and then decreases linearly back to zero.
  • batch_size: The batch size to use for the training. The default value is 16.
  • train_loss: The loss to use for training. If you're using a sentence-transformer embedding_model (which is the only model that training is supported for), possible values are 'mnrl' (Multiple Negatives Ranking Loss) or 'margin_mse' (MarginMSE).
  • num_workers: The number of subprocesses to use for the PyTorch DataLoader.
  • use_amp: Use Automatic Mixed Precision (AMP).
  • kwargs: Additional training keyword arguments to pass to the SentenceTransformer.fit function. See the Sentence-Transformers documentation for a full list of keyword arguments.
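
As a rough illustration, the snippet below shows what one training example with the keys listed above might look like, followed by a sketch of a linear warmup schedule of the kind described for `num_warmup_steps`. The example strings and the score value are made up, and `warmup_linear_factor` is a hypothetical helper written for this example. It mirrors the usual "linear warmup, then linear decay" shape and is not taken from the library.

```python
from typing import Any, Dict, List

# One made-up training example with the documented keys.
training_data: List[Dict[str, Any]] = [
    {
        "question": "What is the capital of France?",
        "pos_doc": "Paris is the capital of France.",
        "neg_doc": "Berlin is the capital of Germany.",
        "score": 0.8,  # illustrative score margin
    }
]


def warmup_linear_factor(step: int, warmup_steps: int, total_steps: int) -> float:
    """Multiplier applied to the maximum learning rate at a given step."""
    if step < warmup_steps:
        # Warmup phase: ramp up linearly from 0 to 1.
        return step / max(1, warmup_steps)
    # Decay phase: fall linearly from 1 back to 0 at total_steps.
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


learning_rate = 2e-5
schedule = [learning_rate * warmup_linear_factor(s, warmup_steps=2, total_steps=10)
            for s in range(11)]
print(schedule)
```

With 2 warmup steps out of 10, the learning rate peaks at step 2 and reaches zero again at step 10.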

EmbeddingRetriever.save

def save(save_dir: Union[Path, str]) -> None

Save the model to the given directory

Arguments:

  • save_dir (Union[Path, str]): The directory where the model will be saved

EmbeddingRetriever.timing

def timing(fn, attr_name)

Wrapper method used to time functions.

EmbeddingRetriever.eval

def eval(label_index: str = "label",
         doc_index: str = "eval_document",
         label_origin: str = "gold-label",
         top_k: int = 10,
         open_domain: bool = False,
         return_preds: bool = False,
         headers: Optional[Dict[str, str]] = None,
         document_store: Optional[BaseDocumentStore] = None) -> dict

Performs evaluation on the Retriever.

Retriever is evaluated based on whether it finds the correct document given the query string and at which position in the ranking of documents the correct document is.

Returns a dict containing the following metrics:

- "recall": Proportion of questions for which correct document is among retrieved documents
- "mrr": Mean of reciprocal rank. Rewards retrievers that give relevant documents a higher rank.
  Only considers the highest ranked relevant document.
- "map": Mean of average precision for each question. Rewards retrievers that give relevant
  documents a higher rank. Considers all retrieved relevant documents. If ``open_domain=True``,
  average precision is normalized by the number of retrieved relevant documents per query.
  If ``open_domain=False``, average precision is normalized by the number of all relevant documents
  per query.

Arguments:

  • label_index: Index/Table in DocumentStore where labeled questions are stored
  • doc_index: Index/Table in DocumentStore where documents that are used for evaluation are stored
  • top_k: How many documents to return per query
  • open_domain: If True, retrieval will be evaluated by checking if the answer string to a question is contained in the retrieved docs (common approach in open-domain QA). If False, retrieval uses a stricter evaluation that checks if the retrieved document ids are within ids explicitly stated in the labels.
  • return_preds: Whether to add predictions in the returned dictionary. If True, the returned dictionary contains the keys "predictions" and "metrics".
  • headers: Custom HTTP headers to pass to document store client if supported (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='} for basic authentication)
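
The three metrics can be sketched as follows. The code assumes each query's result is already reduced to a list of booleans (True where the document at that rank is relevant). The function names are hypothetical, and passing `num_relevant=None` stands in for the `open_domain=True` normalization described above. It is a sketch of the definitions, not the library's evaluation code.

```python
from typing import Dict, List, Optional


def reciprocal_rank(relevance: List[bool]) -> float:
    """1 / rank of the highest-ranked relevant document, or 0 if there is none."""
    for rank, hit in enumerate(relevance, start=1):
        if hit:
            return 1.0 / rank
    return 0.0


def average_precision(relevance: List[bool], num_relevant: Optional[int] = None) -> float:
    """Average precision over the ranks at which relevant documents appear.

    num_relevant=None normalizes by the number of retrieved relevant documents;
    otherwise the given count of all relevant documents is used.
    """
    hits = 0
    precision_sum = 0.0
    for rank, hit in enumerate(relevance, start=1):
        if hit:
            hits += 1
            precision_sum += hits / rank
    denominator = hits if num_relevant is None else num_relevant
    return precision_sum / denominator if denominator else 0.0


def evaluate(rankings: List[List[bool]],
             num_relevant: Optional[List[int]] = None) -> Dict[str, float]:
    """Aggregate recall, MRR, and MAP over all queries."""
    n = len(rankings)
    counts = num_relevant if num_relevant is not None else [None] * n
    return {
        "recall": sum(any(r) for r in rankings) / n,
        "mrr": sum(reciprocal_rank(r) for r in rankings) / n,
        "map": sum(average_precision(r, c) for r, c in zip(rankings, counts)) / n,
    }


rankings = [[False, True, False, True], [False, False, False]]
open_metrics = evaluate(rankings)
closed_metrics = evaluate(rankings, num_relevant=[4, 1])
print(open_metrics, closed_metrics)
```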

EmbeddingRetriever.run

def run(root_node: str,
        query: Optional[str] = None,
        filters: Optional[FilterType] = None,
        top_k: Optional[int] = None,
        documents: Optional[List[Document]] = None,
        index: Optional[str] = None,
        headers: Optional[Dict[str, str]] = None,
        scale_score: Optional[bool] = None)

Arguments:

  • root_node: The root node of the pipeline's graph.
  • query: Query string.
  • filters: A dictionary where the keys specify a metadata field and the value is a list of accepted values for that field.
  • top_k: How many documents to return per query.
  • documents: List of Documents to Retrieve.
  • index: The name of the index in the DocumentStore from which to retrieve documents.
  • headers: Custom HTTP headers to pass to document store client if supported (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='} for basic authentication).
  • scale_score: Whether to scale the similarity score to the unit interval (range of [0,1]). If true (default), similarity scores (e.g. cosine or dot_product) which naturally have a different value range will be scaled to a range of [0,1], where 1 means extremely relevant. Otherwise, raw similarity scores (e.g. cosine or dot_product) will be used.
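
As a rough illustration of what scaling to the unit interval can look like, the sketch below maps a cosine score from [-1, 1] linearly onto [0, 1] and squashes an unbounded dot-product score with a sigmoid. The function name, the divisor of 100, and the choice of mappings are assumptions made for this example. The mapping actually applied is defined by the document store in use.

```python
import math


def scale_to_unit_interval(score: float, similarity: str) -> float:
    """Map a raw similarity score to [0, 1] (illustrative mapping only)."""
    if similarity == "cosine":
        # Cosine similarity lies in [-1, 1]; shift and rescale it.
        return (score + 1) / 2
    # Dot product is unbounded; squash it with a numerically stable sigmoid.
    x = score / 100  # assumed scaling constant
    if x >= 0:
        return 1 / (1 + math.exp(-x))
    z = math.exp(x)
    return z / (1 + z)


print(scale_to_unit_interval(0.2, "cosine"))
print(scale_to_unit_interval(35.0, "dot_product"))
```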

EmbeddingRetriever.run_batch

def run_batch(root_node: str,
              queries: Optional[List[str]] = None,
              filters: Optional[Union[FilterType,
                                      List[Optional[FilterType]]]] = None,
              top_k: Optional[int] = None,
              documents: Optional[Union[List[Document],
                                        List[List[Document]]]] = None,
              index: Optional[str] = None,
              headers: Optional[Dict[str, str]] = None)

Arguments:

  • root_node: The root node of the pipeline's graph.
  • queries: The list of query strings.
  • filters: A dictionary where the keys specify a metadata field and the value is a list of accepted values for that field.
  • top_k: How many documents to return per query.
  • documents: List of Documents to retrieve.
  • index: The name of the index in the DocumentStore from which to retrieve documents.
  • headers: Custom HTTP headers to pass to document store client if supported (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='} for basic authentication).

MultihopEmbeddingRetriever

class MultihopEmbeddingRetriever(EmbeddingRetriever)

Retriever that applies iterative retrieval using a shared encoder for query and passage. See original paper for more details:

Xiong, Wenhan, et al. (2020): "Answering complex open-domain questions with multi-hop dense retrieval" (https://arxiv.org/abs/2009.12756)

MultihopEmbeddingRetriever.__init__

def __init__(embedding_model: str,
             document_store: Optional[BaseDocumentStore] = None,
             model_version: Optional[str] = None,
             num_iterations: int = 2,
             use_gpu: bool = True,
             batch_size: int = 32,
             max_seq_len: int = 512,
             model_format: str = "farm",
             pooling_strategy: str = "reduce_mean",
             emb_extraction_layer: int = -1,
             top_k: int = 10,
             progress_bar: bool = True,
             devices: Optional[List[Union[str, "torch.device"]]] = None,
             use_auth_token: Optional[Union[str, bool]] = None,
             scale_score: bool = True,
             embed_meta_fields: Optional[List[str]] = None)

Arguments:

  • document_store: An instance of DocumentStore from which to retrieve documents.

  • embedding_model: Local path or name of model in Hugging Face's model hub such as 'sentence-transformers/all-MiniLM-L6-v2'

  • model_version: The version of model to use from the HuggingFace model hub. Can be tag name, branch name, or commit hash.

  • num_iterations: The number of times passages are retrieved, i.e., the number of hops (Defaults to 2.)

  • use_gpu: Whether to use all available GPUs or the CPU. Falls back on CPU if no GPU is available.

  • batch_size: Number of documents to encode at once.

  • max_seq_len: Longest length of each document sequence. Maximum number of tokens for the document text. Longer ones will be cut down.

  • model_format: Name of framework that was used for saving the model or model type. If no model_format is provided, it will be inferred automatically from the model configuration files. Options:

      • 'farm' (will use _DefaultEmbeddingEncoder as embedding encoder)

      • 'transformers' (will use _DefaultEmbeddingEncoder as embedding encoder)

      • 'sentence_transformers' (will use _SentenceTransformersEmbeddingEncoder as embedding encoder)

      • 'retribert' (will use _RetribertEmbeddingEncoder as embedding encoder)

  • pooling_strategy: Strategy for combining the embeddings from the model (for farm / transformers models only). Options:

      • 'cls_token' (sentence vector)

      • 'reduce_mean' (sentence vector)

      • 'reduce_max' (sentence vector)

      • 'per_token' (individual token vectors)

  • emb_extraction_layer: Number of layer from which the embeddings shall be extracted (for farm / transformers models only). Default: -1 (very last layer).

  • top_k: How many documents to return per query.

  • progress_bar: If true displays progress bar during embedding.

  • devices: List of torch devices (e.g. cuda, cpu, mps) to limit inference to specific devices. A list containing torch device objects and/or strings is supported (For example [torch.device('cuda:0'), "mps", "cuda:1"]). When specifying use_gpu=False the devices parameter is not used and a single cpu device is used for inference. Note: As multi-GPU training is currently not implemented for EmbeddingRetriever, training will only use the first device provided in this list.

  • use_auth_token: The API token used to download private models from Hugging Face. If this parameter is set to True, the token generated when running transformers-cli login (stored in ~/.huggingface) is used. Additional information can be found here: https://huggingface.co/transformers/main_classes/model.html#transformers.PreTrainedModel.from_pretrained

  • scale_score: Whether to scale the similarity score to the unit interval (range of [0,1]). If true (default) similarity scores (e.g. cosine or dot_product) which naturally have a different value range will be scaled to a range of [0,1], where 1 means extremely relevant. Otherwise raw similarity scores (e.g. cosine or dot_product) will be used.

  • embed_meta_fields: Concatenate the provided meta fields and text passage / table to a text pair that is then used to create the embedding. This approach is also used in the TableTextRetriever paper and is likely to improve performance if your titles contain meaningful information for retrieval (topic, entities etc.). If no value is provided, a default empty list will be created.
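
To give an intuition for `num_iterations`, here is a toy sketch of iterative (multi-hop) retrieval: each hop retrieves with the original query expanded by the passages found so far. Everything in it is an assumption made for illustration. `toy_score` replaces the shared encoder with simple token overlap, and `multihop_retrieve` is a hypothetical helper, not the retriever's actual algorithm.

```python
from typing import List


def toy_score(query: str, passage: str) -> int:
    """Stand-in for embedding similarity: number of shared lowercase tokens."""
    return len(set(query.lower().split()) & set(passage.lower().split()))


def multihop_retrieve(query: str, corpus: List[str],
                      num_iterations: int = 2, top_k: int = 1) -> List[str]:
    """Retrieve top_k new passages per hop, feeding earlier hits into the next query."""
    context: List[str] = []
    current_query = query
    for _ in range(num_iterations):
        candidates = [p for p in corpus if p not in context]
        ranked = sorted(candidates, key=lambda p: toy_score(current_query, p), reverse=True)
        context.extend(ranked[:top_k])
        # Next hop: expand the original query with everything retrieved so far.
        current_query = " ".join([query] + context)
    return context


corpus = [
    "the eiffel tower was designed by gustave eiffel",
    "gustave eiffel was born in dijon",
    "dijon is famous for mustard",
]
question = "where was the designer of the eiffel tower born"
hops = multihop_retrieve(question, corpus, num_iterations=2, top_k=1)
print(hops)
```

In this toy corpus the second passage becomes the top candidate only after the first hop has added "gustave" to the query, which is the kind of chained evidence multi-hop retrieval is meant to capture.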

MultihopEmbeddingRetriever.retrieve

def retrieve(
        query: str,
        filters: Optional[FilterType] = None,
        top_k: Optional[int] = None,
        index: Optional[str] = None,
        headers: Optional[Dict[str, str]] = None,
        scale_score: Optional[bool] = None,
        document_store: Optional[BaseDocumentStore] = None) -> List[Document]

Scan through the documents in a DocumentStore and return a small number of documents that are most relevant to the query.

Arguments:

  • query: The query string.

  • filters: Optional filters to narrow down the search space to documents whose metadata fulfill certain conditions. Filters are defined as nested dictionaries. The keys of the dictionaries can be a logical operator ("$and", "$or", "$not"), a comparison operator ("$eq", "$in", "$gt", "$gte", "$lt", "$lte") or a metadata field name. Logical operator keys take a dictionary of metadata field names and/or logical operators as value. Metadata field names take a dictionary of comparison operators as value. Comparison operator keys take a single value or (in case of "$in") a list of values as value. If no logical operator is provided, "$and" is used as default operation. If no comparison operator is provided, "$eq" (or "$in" if the comparison value is a list) is used as default operation.

    __Example__:
    
    ```python
    filters = {
        "$and": {
            "type": {"$eq": "article"},
            "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
            "rating": {"$gte": 3},
            "$or": {
                "genre": {"$in": ["economy", "politics"]},
                "publisher": {"$eq": "nytimes"}
            }
        }
    }
    # or simpler using default operators
    filters = {
        "type": "article",
        "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
        "rating": {"$gte": 3},
        "$or": {
            "genre": ["economy", "politics"],
            "publisher": "nytimes"
        }
    }
    ```
    
    To use the same logical operator multiple times on the same level, logical operators can
    optionally take a list of dictionaries as value.
    
    __Example__:
    
    ```python
    filters = {
        "$or": [
            {
                "$and": {
                    "Type": "News Paper",
                    "Date": {
                        "$lt": "2019-01-01"
                    }
                }
            },
            {
                "$and": {
                    "Type": "Blog Post",
                    "Date": {
                        "$gte": "2019-01-01"
                    }
                }
            }
        ]
    }
    ```
    
  • top_k: How many documents to return per query.

  • index: The name of the index in the DocumentStore from which to retrieve documents

  • headers: Custom HTTP headers to pass to document store client if supported (e.g. {'Authorization': 'Basic API_KEY'} for basic authentication)

  • scale_score: Whether to scale the similarity score to the unit interval (range of [0,1]). If true, similarity scores (e.g. cosine or dot_product), which naturally have a different value range, are scaled to a range of [0,1], where 1 means extremely relevant. Otherwise, raw similarity scores (e.g. cosine or dot_product) are used.

  • document_store: The document store to use for retrieval. If None, the one passed to __init__ is used instead.

MultihopEmbeddingRetriever.retrieve_batch

def retrieve_batch(
    queries: List[str],
    filters: Optional[Union[FilterType, List[Optional[FilterType]]]] = None,
    top_k: Optional[int] = None,
    index: Optional[str] = None,
    headers: Optional[Dict[str, str]] = None,
    batch_size: Optional[int] = None,
    scale_score: Optional[bool] = None,
    document_store: Optional[BaseDocumentStore] = None
) -> List[List[Document]]

Scan through the documents in a DocumentStore and return a small number of documents that are most relevant to the supplied queries.

Returns a list of lists of Documents (one per query).

Arguments:

  • queries: List of query strings.
  • filters: Optional filters to narrow down the search space to documents whose metadata fulfill certain conditions. Can be a single filter that will be applied to each query or a list of filters (one filter per query).

Filters are defined as nested dictionaries. The keys of the dictionaries can be a logical operator ("$and", "$or", "$not"), a comparison operator ("$eq", "$in", "$gt", "$gte", "$lt", "$lte") or a metadata field name. Logical operator keys take a dictionary of metadata field names and/or logical operators as value. Metadata field names take a dictionary of comparison operators as value. Comparison operator keys take a single value or (in case of "$in") a list of values as value. If no logical operator is provided, "$and" is used as default operation. If no comparison operator is provided, "$eq" (or "$in" if the comparison value is a list) is used as default operation.

__Example__:

```python
filters = {
    "$and": {
        "type": {"$eq": "article"},
        "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
        "rating": {"$gte": 3},
        "$or": {
            "genre": {"$in": ["economy", "politics"]},
            "publisher": {"$eq": "nytimes"}
        }
    }
}
# or simpler using default operators
filters = {
    "type": "article",
    "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
    "rating": {"$gte": 3},
    "$or": {
        "genre": ["economy", "politics"],
        "publisher": "nytimes"
    }
}
```

To use the same logical operator multiple times on the same level, logical operators can
optionally take a list of dictionaries as value.

__Example__:

```python
filters = {
    "$or": [
        {
            "$and": {
                "Type": "News Paper",
                "Date": {
                    "$lt": "2019-01-01"
                }
            }
        },
        {
            "$and": {
                "Type": "Blog Post",
                "Date": {
                    "$gte": "2019-01-01"
                }
            }
        }
    ]
}
```
  • top_k: How many documents to return per query.
  • index: The name of the index in the DocumentStore from which to retrieve documents
  • headers: Custom HTTP headers to pass to document store client if supported (e.g. {'Authorization': 'Basic API_KEY'} for basic authentication)
  • batch_size: Number of queries to embed at a time.
  • scale_score: Whether to scale the similarity score to the unit interval (range of [0,1]). If true, similarity scores (e.g. cosine or dot_product), which naturally have a different value range, are scaled to a range of [0,1], where 1 means extremely relevant. Otherwise, raw similarity scores (e.g. cosine or dot_product) are used.
  • document_store: The document store to use for retrieval. If None, the one passed to __init__ is used instead.

MultihopEmbeddingRetriever.embed_queries

def embed_queries(queries: List[str]) -> np.ndarray

Create embeddings for a list of queries.

Arguments:

  • queries: List of queries to embed.

Returns:

Embeddings, one per input query, shape: (queries, embedding_dim)

MultihopEmbeddingRetriever.embed_documents

def embed_documents(documents: List[Document]) -> np.ndarray

Create embeddings for a list of documents.

Arguments:

  • documents: List of documents to embed.

Returns:

Embeddings, one per input document, shape: (docs, embedding_dim)

MultihopEmbeddingRetriever.train

def train(training_data: List[Dict[str, Any]],
          learning_rate: float = 2e-5,
          n_epochs: int = 1,
          num_warmup_steps: Optional[int] = None,
          batch_size: int = 16,
          train_loss: Literal["mnrl", "margin_mse"] = "mnrl",
          num_workers: int = 0,
          use_amp: bool = False,
          **kwargs) -> None

Trains/adapts the underlying embedding model. We only support the training of sentence-transformer embedding models.

Each training data example is a dictionary with the following keys:

  • question: the question string
  • pos_doc: the positive document string
  • neg_doc: the negative document string
  • score: the score margin

Arguments:

  • training_data: The training data in a dictionary format.
  • learning_rate: The learning rate.
  • n_epochs: The number of epochs to train for.
  • num_warmup_steps: The number of warmup steps. Behavior depends on the scheduler. For WarmupLinear (default), the learning rate increases from 0 to the maximum learning rate over this many training steps, and then decreases linearly back to zero.
  • batch_size: The batch size to use for the training. The default value is 16.
  • train_loss: The loss to use for training. If you're using a sentence-transformer embedding_model (which is the only model that training is supported for), possible values are 'mnrl' (Multiple Negatives Ranking Loss) or 'margin_mse' (MarginMSE).
  • num_workers: The number of subprocesses to use for the PyTorch DataLoader.
  • use_amp: Use Automatic Mixed Precision (AMP).
  • kwargs: Additional training keyword arguments to pass to the SentenceTransformer.fit function. See the Sentence-Transformers documentation for a full list of keyword arguments.

MultihopEmbeddingRetriever.save

def save(save_dir: Union[Path, str]) -> None

Save the model to the given directory

Arguments:

  • save_dir (Union[Path, str]): The directory where the model will be saved

MultihopEmbeddingRetriever.timing

def timing(fn, attr_name)

Wrapper method used to time functions.

MultihopEmbeddingRetriever.eval

def eval(label_index: str = "label",
         doc_index: str = "eval_document",
         label_origin: str = "gold-label",
         top_k: int = 10,
         open_domain: bool = False,
         return_preds: bool = False,
         headers: Optional[Dict[str, str]] = None,
         document_store: Optional[BaseDocumentStore] = None) -> dict

Performs evaluation on the Retriever.

Retriever is evaluated based on whether it finds the correct document given the query string and at which position in the ranking of documents the correct document is.

Returns a dict containing the following metrics:

- "recall": Proportion of questions for which correct document is among retrieved documents
- "mrr": Mean of reciprocal rank. Rewards retrievers that give relevant documents a higher rank.
  Only considers the highest ranked relevant document.
- "map": Mean of average precision for each question. Rewards retrievers that give relevant
  documents a higher rank. Considers all retrieved relevant documents. If ``open_domain=True``,
  average precision is normalized by the number of retrieved relevant documents per query.
  If ``open_domain=False``, average precision is normalized by the number of all relevant documents
  per query.

Arguments:

  • label_index: Index/Table in DocumentStore where labeled questions are stored
  • doc_index: Index/Table in DocumentStore where documents that are used for evaluation are stored
  • top_k: How many documents to return per query
  • open_domain: If True, retrieval will be evaluated by checking if the answer string to a question is contained in the retrieved docs (common approach in open-domain QA). If False, retrieval uses a stricter evaluation that checks if the retrieved document ids are within ids explicitly stated in the labels.
  • return_preds: Whether to add predictions in the returned dictionary. If True, the returned dictionary contains the keys "predictions" and "metrics".
  • headers: Custom HTTP headers to pass to document store client if supported (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='} for basic authentication)

MultihopEmbeddingRetriever.run

def run(root_node: str,
        query: Optional[str] = None,
        filters: Optional[FilterType] = None,
        top_k: Optional[int] = None,
        documents: Optional[List[Document]] = None,
        index: Optional[str] = None,
        headers: Optional[Dict[str, str]] = None,
        scale_score: Optional[bool] = None)

Arguments:

  • root_node: The root node of the pipeline's graph.
  • query: Query string.
  • filters: A dictionary where the keys specify a metadata field and the value is a list of accepted values for that field.
  • top_k: How many documents to return per query.
  • documents: List of Documents to Retrieve.
  • index: The name of the index in the DocumentStore from which to retrieve documents.
  • headers: Custom HTTP headers to pass to document store client if supported (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='} for basic authentication).
  • scale_score: Whether to scale the similarity score to the unit interval (range of [0,1]). If true (default), similarity scores (e.g. cosine or dot_product) which naturally have a different value range will be scaled to a range of [0,1], where 1 means extremely relevant. Otherwise, raw similarity scores (e.g. cosine or dot_product) will be used.

MultihopEmbeddingRetriever.run_batch

def run_batch(root_node: str,
              queries: Optional[List[str]] = None,
              filters: Optional[Union[FilterType,
                                      List[Optional[FilterType]]]] = None,
              top_k: Optional[int] = None,
              documents: Optional[Union[List[Document],
                                        List[List[Document]]]] = None,
              index: Optional[str] = None,
              headers: Optional[Dict[str, str]] = None)

Arguments:

  • root_node: The root node of the pipeline's graph.
  • queries: The list of query strings.
  • filters: A dictionary where the keys specify a metadata field and the value is a list of accepted values for that field.
  • top_k: How many documents to return per query.
  • documents: List of Documents to retrieve.
  • index: The name of the index in the DocumentStore from which to retrieve documents.
  • headers: Custom HTTP headers to pass to document store client if supported (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='} for basic authentication).

Module multimodal/retriever

MultiModalRetriever

class MultiModalRetriever(DenseRetriever)

MultiModalRetriever.__init__

def __init__(
        document_store: BaseDocumentStore,
        query_embedding_model: Union[Path, str],
        document_embedding_models: Dict[str, Union[Path, str]],
        query_type: str = "text",
        query_feature_extractor_params: Optional[Dict[str, Any]] = None,
        document_feature_extractors_params: Optional[Dict[str,
                                                          Dict[str,
                                                               Any]]] = None,
        top_k: int = 10,
        batch_size: int = 16,
        embed_meta_fields: Optional[List[str]] = None,
        similarity_function: str = "dot_product",
        progress_bar: bool = True,
        devices: Optional[List[Union[str, "torch.device"]]] = None,
        use_auth_token: Optional[Union[str, bool]] = None,
        scale_score: bool = True)

Retriever that uses multiple encoders to jointly retrieve among a database consisting of different data types. See the original paper for more details: Kostić, Bogdan, et al. (2021): "Multi-modal Retrieval of Tables and Texts Using Tri-encoder Models" (https://arxiv.org/abs/2108.04049).

Arguments:

  • document_store: An instance of DocumentStore from which to retrieve documents.
  • query_embedding_model: Local path or remote name of question encoder checkpoint. The format equals the one used by Hugging Face transformers' modelhub models.
  • document_embedding_models: Dictionary matching a local path or remote name of document encoder checkpoint with the content type it should handle ("text", "table", "image", and so on). The format equals the one used by Hugging Face transformers' modelhub models.
  • query_type: The content type of the query ("text", "image" and so on).
  • query_feature_extractor_params: The parameters to pass to the feature extractor of the query. If no value is provided, a default dictionary with "max_length": 64 will be set.
  • document_feature_extractors_params: The parameters to pass to the feature extractor of the documents. If no value is provided, a default dictionary with "text": {"max_length": 256} will be set.
  • top_k: How many documents to return per query.
  • batch_size: Number of questions or documents to encode at once. For multiple GPUs, this is the total batch size.
  • embed_meta_fields: Concatenate the provided meta fields to a (text) pair that is then used to create the embedding. This is likely to improve performance if your titles contain meaningful information for retrieval (topic, entities, and so on). Note that only text and table documents support this feature. If no value is provided, a default with "name" as the embedding field is created.
  • similarity_function: Which function to apply for calculating the similarity of query and document embeddings during training. Options: dot_product (default) or cosine.
  • progress_bar: Whether to show a tqdm progress bar or not. Can be helpful to disable in production deployments to keep the logs clean.
  • devices: List of GPU (or CPU) devices to limit inference to certain GPUs and not use all available ones. These strings will be converted into PyTorch devices, so use the string notation described in Tensor Attributes (https://pytorch.org/docs/stable/tensor_attributes.html?highlight=torch%20device#torch.torch.device) (e.g. ["cuda:0"]). Note: As multi-GPU training is currently not implemented for TableTextRetriever, training only uses the first device provided in this list.
  • use_auth_token: API token used to download private models from Hugging Face. If this parameter is set to True, the local token is used, which must be previously created using transformers-cli login. For more information, see the Hugging Face documentation.
  • scale_score: Whether to scale the similarity score to the unit interval (range of [0,1]). If true (default) similarity scores (e.g. cosine or dot_product) which naturally have a different value range are scaled to a range of [0,1], where 1 means extremely relevant. Otherwise raw similarity scores (for example, cosine or dot_product) are used.
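
To make the effect of `scale_score` concrete, here is a minimal sketch of one way raw scores could be mapped onto [0, 1]. It assumes a linear shift for cosine similarity (whose natural range is [-1, 1]) and a logistic squash for dot product; the function name and the divisor of 100 are illustrative assumptions, not values stated in this reference.

```python
import math


def scale_to_unit_interval(score: float, similarity: str) -> float:
    """Map a raw similarity score onto [0, 1] (illustrative mapping only)."""
    if similarity == "cosine":
        # Cosine similarity lies in [-1, 1]; shift and halve it.
        return (score + 1.0) / 2.0
    if similarity == "dot_product":
        # Dot products are unbounded; squash with a numerically stable logistic.
        x = score / 100.0  # assumed scaling constant
        if x >= 0:
            return 1.0 / (1.0 + math.exp(-x))
        e = math.exp(x)
        return e / (1.0 + e)
    raise ValueError(f"Unsupported similarity function: {similarity}")


print(scale_to_unit_interval(0.0, "cosine"))       # 0.5
print(scale_to_unit_interval(0.0, "dot_product"))  # 0.5
```

With `scale_score=False`, the raw value would be returned unchanged instead.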

MultiModalRetriever.retrieve

def retrieve(
        query: Any,
        query_type: ContentTypes = "text",
        filters: Optional[FilterType] = None,
        top_k: Optional[int] = None,
        index: Optional[str] = None,
        headers: Optional[Dict[str, str]] = None,
        scale_score: Optional[bool] = None,
        document_store: Optional[BaseDocumentStore] = None) -> List[Document]

Scan through documents in DocumentStore and return a small number of documents that are most relevant to the supplied query. Returns a list of Documents.

Arguments:

  • query: Query value. It might be text, a path, a table, and so on.
  • query_type: Type of the query ("text", "table", "image" and so on).
  • filters: Optional filters to narrow down the search space to documents whose metadata fulfill certain conditions. It can be a single filter applied to each query or a list of filters (one filter per query).
  • top_k: How many documents to return per query. Must be > 0.
  • index: The name of the index in the DocumentStore from which to retrieve documents.
  • batch_size: Number of queries to embed at a time. Must be > 0.
  • scale_score: Whether to scale the similarity score to the unit interval (range of [0,1]). If True, similarity scores (for example, cosine or dot_product) which naturally have a different value range are scaled to a range of [0,1], where 1 means extremely relevant. Otherwise raw similarity scores (for example, cosine or dot_product) are used.

MultiModalRetriever.retrieve_batch

def retrieve_batch(
    queries: List[Any],
    queries_type: ContentTypes = "text",
    filters: Optional[Union[FilterType, List[Optional[FilterType]]]] = None,
    top_k: Optional[int] = None,
    index: Optional[str] = None,
    headers: Optional[Dict[str, str]] = None,
    batch_size: Optional[int] = None,
    scale_score: Optional[bool] = None,
    document_store: Optional[BaseDocumentStore] = None
) -> List[List[Document]]

Scan through documents in DocumentStore and return a small number of documents that are most relevant to the supplied queries. Returns a list of lists of Documents (one list per query).

This method assumes all queries are of the same data type. Mixed-type query batches (for example one image and one text) are currently not supported. Group the queries by type and call retrieve() on uniform batches only.

Arguments:

  • queries: List of query values. They might be text, paths, tables, and so on.
  • queries_type: Type of the queries ("text", "table", "image" and so on).
  • filters: Optional filters to narrow down the search space to documents whose metadata fulfill certain conditions. It can be a single filter that will be applied to each query or a list of filters (one filter per query).
  • top_k: How many documents to return per query. Must be > 0.
  • index: The name of the index in the DocumentStore from which to retrieve documents.
  • batch_size: Number of queries to embed at a time. Must be > 0.
  • scale_score: Whether to scale the similarity score to the unit interval (range of [0,1]). If True, similarity scores (for example, cosine or dot_product) which naturally have a different value range are scaled to a range of [0,1], where 1 means extremely relevant. Otherwise raw similarity scores (for example, cosine or dot_product) are used.
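
Because mixed-type batches are not supported, queries need to be grouped by type before they are submitted. The sketch below shows one way to do that with plain Python; `group_queries_by_type` and the sample queries are hypothetical and only illustrate the grouping step.

```python
from collections import defaultdict
from typing import Any, Dict, List, Tuple


def group_queries_by_type(typed_queries: List[Tuple[str, Any]]) -> Dict[str, List[Any]]:
    """Group (query_type, query) pairs into one uniform batch per type."""
    groups: Dict[str, List[Any]] = defaultdict(list)
    for query_type, query in typed_queries:
        groups[query_type].append(query)
    return dict(groups)


mixed = [
    ("text", "Who wrote Hamlet?"),
    ("image", "photos/cat.jpg"),
    ("text", "What is the capital of France?"),
]
batches = group_queries_by_type(mixed)
print(batches)
```

Each uniform batch could then be submitted on its own, for example `retriever.retrieve_batch(queries=batches["text"], queries_type="text")`.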

MultiModalRetriever.timing

def timing(fn, attr_name)

Wrapper method used to time functions.

MultiModalRetriever.eval

def eval(label_index: str = "label",
         doc_index: str = "eval_document",
         label_origin: str = "gold-label",
         top_k: int = 10,
         open_domain: bool = False,
         return_preds: bool = False,
         headers: Optional[Dict[str, str]] = None,
         document_store: Optional[BaseDocumentStore] = None) -> dict

Performs evaluation on the Retriever.

Retriever is evaluated based on whether it finds the correct document given the query string and at which position in the ranking of documents the correct document is.

Returns a dict containing the following metrics:

- "recall": Proportion of questions for which correct document is among retrieved documents
- "mrr": Mean of reciprocal rank. Rewards retrievers that give relevant documents a higher rank.
  Only considers the highest ranked relevant document.
- "map": Mean of average precision for each question. Rewards retrievers that give relevant
  documents a higher rank. Considers all retrieved relevant documents. If ``open_domain=True``,
  average precision is normalized by the number of retrieved relevant documents per query.
  If ``open_domain=False``, average precision is normalized by the number of all relevant documents
  per query.

Arguments:

  • label_index: Index/Table in DocumentStore where labeled questions are stored
  • doc_index: Index/Table in DocumentStore where documents that are used for evaluation are stored
  • top_k: How many documents to return per query
  • open_domain: If True, retrieval will be evaluated by checking if the answer string to a question is contained in the retrieved docs (common approach in open-domain QA). If False, retrieval uses a stricter evaluation that checks if the retrieved document ids are within ids explicitly stated in the labels.
  • return_preds: Whether to add predictions in the returned dictionary. If True, the returned dictionary contains the keys "predictions" and "metrics".
  • headers: Custom HTTP headers to pass to document store client if supported (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='} for basic authentication)
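
The metrics described above can be illustrated with a small standalone computation. This is a sketch under simplifying assumptions, not the library's implementation: each query's result list is reduced to a list of booleans (True marks a relevant document), and the helper names are hypothetical.

```python
from typing import List, Optional


def reciprocal_rank(relevance: List[bool]) -> float:
    """Return 1 / rank of the highest-ranked relevant document, or 0 if none."""
    for position, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            return 1.0 / position
    return 0.0


def average_precision(relevance: List[bool], num_relevant_total: Optional[int] = None) -> float:
    """Average precision over one ranked list.

    If num_relevant_total is None, normalize by the number of retrieved relevant
    documents (the open_domain=True case); otherwise normalize by the number of
    all relevant documents for the query (the open_domain=False case).
    """
    hits = 0
    precision_sum = 0.0
    for position, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / position
    denominator = hits if num_relevant_total is None else num_relevant_total
    return precision_sum / denominator if denominator else 0.0


# One ranked result list per query.
rankings = [
    [False, True, False, True],
    [True, False, False, False],
    [False, False, False, False],
]
recall = sum(any(r) for r in rankings) / len(rankings)
mrr = sum(reciprocal_rank(r) for r in rankings) / len(rankings)
map_open = sum(average_precision(r) for r in rankings) / len(rankings)
print(recall, mrr, map_open)
```

Passing `num_relevant_total` lowers the average precision of a query whenever some relevant documents were not retrieved, which is why the closed-domain variant is the stricter one.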

MultiModalRetriever.run

def run(root_node: str,
        query: Optional[str] = None,
        filters: Optional[FilterType] = None,
        top_k: Optional[int] = None,
        documents: Optional[List[Document]] = None,
        index: Optional[str] = None,
        headers: Optional[Dict[str, str]] = None,
        scale_score: Optional[bool] = None)

Arguments:

  • root_node: The root node of the pipeline's graph.
  • query: Query string.
  • filters: A dictionary where the keys specify a metadata field and the value is a list of accepted values for that field.
  • top_k: How many documents to return per query.
  • documents: List of Documents to Retrieve.
  • index: The name of the index in the DocumentStore from which to retrieve documents.
  • headers: Custom HTTP headers to pass to document store client if supported (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='} for basic authentication).
  • scale_score: Whether to scale the similarity score to the unit interval (range of [0,1]). If true (default), similarity scores (e.g. cosine or dot_product) which naturally have a different value range will be scaled to a range of [0,1], where 1 means extremely relevant. Otherwise, raw similarity scores (e.g. cosine or dot_product) will be used.
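
The flat `filters` format described above (metadata field mapped to a list of accepted values) can be read as "every listed field must have one of the accepted values". The sketch below illustrates that reading on plain dictionaries; `matches_filters` is a hypothetical helper, and it covers only this flat format.

```python
from typing import Any, Dict, List


def matches_filters(meta: Dict[str, Any], filters: Dict[str, List[Any]]) -> bool:
    """Return True if every filtered field in meta has one of the accepted values."""
    return all(meta.get(field) in accepted for field, accepted in filters.items())


filters = {"years": ["2019"], "quarters": ["Q1", "Q2"]}
docs_meta = [
    {"years": "2019", "quarters": "Q1"},
    {"years": "2019", "quarters": "Q3"},
    {"years": "2020", "quarters": "Q1"},
]
kept = [meta for meta in docs_meta if matches_filters(meta, filters)]
print(kept)  # [{'years': '2019', 'quarters': 'Q1'}]
```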

MultiModalRetriever.run_batch

def run_batch(root_node: str,
              queries: Optional[List[str]] = None,
              filters: Optional[Union[FilterType,
                                      List[Optional[FilterType]]]] = None,
              top_k: Optional[int] = None,
              documents: Optional[Union[List[Document],
                                        List[List[Document]]]] = None,
              index: Optional[str] = None,
              headers: Optional[Dict[str, str]] = None)

Arguments:

  • root_node: The root node of the pipeline's graph.
  • queries: The list of query strings.
  • filters: A dictionary where the keys specify a metadata field and the value is a list of accepted values for that field.
  • top_k: How many documents to return per query.
  • documents: List of Documents to Retrieve.
  • index: The name of the index in the DocumentStore from which to retrieve documents.
  • headers: Custom HTTP headers to pass to document store client if supported (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='} for basic authentication).

Module web

WebRetriever

class WebRetriever(BaseRetriever)

The WebRetriever is an effective tool designed to extract relevant documents from the web. It leverages the WebSearch class to obtain web page results, strips the HTML from those pages, and extracts the raw text content. Depending on the operation mode, this text can be further broken down into smaller documents with the help of a PreProcessor.

The WebRetriever supports three distinct modes of operation:

  • Snippets Mode: In this mode, the WebRetriever generates a list of Document instances, where each Document represents a snippet or a segment from a web page result. It's important to note that this mode does not involve actual web page retrieval.

  • Raw Documents Mode: In this mode, the WebRetriever generates a list of Document instances, where each Document represents an entire web page (retrieved from the search result link) devoid of any HTML and containing only the raw text content.

  • Preprocessed Documents Mode: This mode is similar to the Raw Documents Mode but includes an additional step - the raw text from each retrieved web page is divided into smaller Document instances using a specified PreProcessor. If no PreProcessor is specified, the default PreProcessor is used.
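
The difference between the three modes can be sketched without any network access. In the conceptual illustration below, documents are plain strings, and `build_documents`, `fetch_text`, and `split_text` are hypothetical stand-ins for the search result handling, page retrieval, and PreProcessor steps; it is not how WebRetriever is implemented.

```python
from typing import Callable, Dict, List

MODES = ("snippets", "raw_documents", "preprocessed_documents")


def build_documents(
    search_results: List[Dict[str, str]],
    mode: str,
    fetch_text: Callable[[str], str],
    split_text: Callable[[str], List[str]],
) -> List[str]:
    """Turn search results into documents according to the chosen mode."""
    if mode not in MODES:
        raise ValueError(f"Unknown mode: {mode}")
    if mode == "snippets":
        # No page retrieval: use the snippet returned by the search engine.
        return [result["snippet"] for result in search_results]
    # Both remaining modes fetch the full page text behind each link.
    pages = [fetch_text(result["link"]) for result in search_results]
    if mode == "raw_documents":
        return pages
    # preprocessed_documents: split every page into smaller documents.
    return [chunk for page in pages for chunk in split_text(page)]


# Stand-in data so the example runs offline.
fake_pages = {"https://example.com/a": "First paragraph.\n\nSecond paragraph."}
results = [{"snippet": "First paragraph...", "link": "https://example.com/a"}]


def split_on_blank_lines(text: str) -> List[str]:
    return [part for part in text.split("\n\n") if part]


snippets = build_documents(results, "snippets", fake_pages.__getitem__, split_on_blank_lines)
raw = build_documents(results, "raw_documents", fake_pages.__getitem__, split_on_blank_lines)
chunks = build_documents(results, "preprocessed_documents", fake_pages.__getitem__, split_on_blank_lines)
print(snippets, raw, chunks, sep="\n")
```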

WebRetriever.__init__

def __init__(api_key: str,
             search_engine_provider: Union[str, SearchEngine] = "SerperDev",
             search_engine_kwargs: Optional[Dict[str, Any]] = None,
             top_search_results: Optional[int] = 10,
             top_k: Optional[int] = 5,
             mode: Literal["snippets", "raw_documents",
                           "preprocessed_documents"] = "snippets",
             preprocessor: Optional[PreProcessor] = None,
             cache_document_store: Optional[BaseDocumentStore] = None,
             cache_index: Optional[str] = None,
             cache_headers: Optional[Dict[str, str]] = None,
             cache_time: int = 1 * 24 * 60 * 60,
             allowed_domains: Optional[List[str]] = None,
             link_content_fetcher: Optional[LinkContentFetcher] = None)

Arguments:

  • api_key: API key for the search engine provider.
  • search_engine_provider: Name of the search engine provider class. The options are "SerperDev" (default), "SearchApi", "SerpAPI", "BingAPI" or "GoogleAPI"
  • search_engine_kwargs: Additional parameters to pass to the search engine provider.
  • top_search_results: Number of top search results to be retrieved.
  • top_k: Top k documents to be returned by the retriever.
  • mode: Whether to return snippets, raw documents, or preprocessed documents. Snippets are the default.
  • preprocessor: Optional PreProcessor to be used to split documents into paragraphs. If not provided, the default PreProcessor is used.
  • cache_document_store: DocumentStore to be used to cache search results.
  • cache_index: Index name to be used to cache search results.
  • cache_headers: Headers to be used to cache search results.
  • cache_time: Time in seconds to cache search results. Defaults to 24 hours.
  • allowed_domains: List of domains to restrict the search to. If not provided, the search is unrestricted.
  • link_content_fetcher: LinkContentFetcher to be used to fetch the content from the links. If not provided, the default LinkContentFetcher is used.

WebRetriever.retrieve

def retrieve(query: str,
             top_k: Optional[int] = None,
             preprocessor: Optional[PreProcessor] = None,
             cache_document_store: Optional[BaseDocumentStore] = None,
             cache_index: Optional[str] = None,
             cache_headers: Optional[Dict[str, str]] = None,
             cache_time: Optional[int] = None,
             **kwargs) -> List[Document]

Retrieve Documents in real-time from the web based on the URLs provided by the WebSearch.

This method takes a search query as input, retrieves the corresponding web documents, and returns them in a structured format suitable for further processing or analysis. The documents are retrieved at runtime, ensuring up-to-date information.

Optionally, the retrieved documents can be stored in a DocumentStore for future use, saving time and resources on repeated retrievals. This caching mechanism can significantly improve retrieval times for frequently accessed URLs.

Arguments:

  • query: The query string.
  • top_k: The number of Documents to be returned by the retriever.
  • preprocessor: The PreProcessor to be used to split documents into paragraphs.
  • cache_document_store: The DocumentStore to cache the documents to.
  • cache_index: The index name to save the documents to.
  • cache_headers: The headers to save the documents to.
  • cache_time: The time limit in seconds for the documents in the cache. If objects are older than this time, they will be deleted from the cache on the next retrieval.
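
The `cache_time` rule (entries older than the limit are deleted on the next retrieval) amounts to a timestamp comparison. The sketch below shows that comparison on a plain dictionary that maps a URL to the time it was cached; the function name and the data layout are assumptions for illustration, not the DocumentStore-based cache itself.

```python
from typing import Dict, List


def prune_expired(cache: Dict[str, float], cache_time: int, now: float) -> List[str]:
    """Delete entries older than cache_time seconds and return the removed keys."""
    expired = [url for url, stored_at in cache.items() if now - stored_at > cache_time]
    for url in expired:
        del cache[url]
    return expired


one_day = 1 * 24 * 60 * 60  # same default as cache_time in __init__
cache = {"https://example.com/old": 0.0, "https://example.com/new": 90_000.0}
removed = prune_expired(cache, cache_time=one_day, now=100_000.0)
print(removed)  # ['https://example.com/old']
```

Passing `now` explicitly keeps the example deterministic; in practice it would typically come from `time.time()`.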

WebRetriever.retrieve_batch

def retrieve_batch(queries: List[str],
                   top_p: Optional[int] = None,
                   top_k: Optional[int] = None,
                   preprocessor: Optional[PreProcessor] = None,
                   cache_document_store: Optional[BaseDocumentStore] = None,
                   cache_index: Optional[str] = None,
                   cache_headers: Optional[Dict[str, str]] = None,
                   cache_time: Optional[int] = None) -> List[List[Document]]

Batch retrieval method that fetches documents for a list of queries. Each query is passed to the retrieve method, which fetches documents from the web in real-time or from a DocumentStore cache. The fetched documents are then extended to a list of documents.

Arguments:

  • queries: List of query strings to retrieve documents for.
  • top_p: The number of documents to be returned by the retriever for each query. If None, the instance's default value is used.
  • top_k: The maximum number of documents to be retrieved for each query. If None, the instance's default value is used.
  • preprocessor: The PreProcessor to be used to split documents into paragraphs. If None, the instance's default PreProcessor is used.
  • cache_document_store: The DocumentStore to cache the documents to. If None, the instance's default DocumentStore is used.
  • cache_index: The index name to save the documents to. If None, the instance's default cache_index is used.
  • cache_headers: The headers to save the documents to. If None, the instance's default cache_headers is used.
  • cache_time: The time limit in seconds for the documents in the cache.

Returns:

A list of lists where each inner list represents the documents fetched for a particular query.

WebRetriever.timing

def timing(fn, attr_name)

Wrapper method used to time functions.

WebRetriever.eval

def eval(label_index: str = "label",
         doc_index: str = "eval_document",
         label_origin: str = "gold-label",
         top_k: int = 10,
         open_domain: bool = False,
         return_preds: bool = False,
         headers: Optional[Dict[str, str]] = None,
         document_store: Optional[BaseDocumentStore] = None) -> dict

Performs evaluation on the Retriever.

Retriever is evaluated based on whether it finds the correct document given the query string and at which position in the ranking of documents the correct document is.

Returns a dict containing the following metrics:

- "recall": Proportion of questions for which correct document is among retrieved documents
- "mrr": Mean of reciprocal rank. Rewards retrievers that give relevant documents a higher rank.
  Only considers the highest ranked relevant document.
- "map": Mean of average precision for each question. Rewards retrievers that give relevant
  documents a higher rank. Considers all retrieved relevant documents. If ``open_domain=True``,
  average precision is normalized by the number of retrieved relevant documents per query.
  If ``open_domain=False``, average precision is normalized by the number of all relevant documents
  per query.

Arguments:

  • label_index: Index/Table in DocumentStore where labeled questions are stored
  • doc_index: Index/Table in DocumentStore where documents that are used for evaluation are stored
  • top_k: How many documents to return per query
  • open_domain: If True, retrieval will be evaluated by checking if the answer string to a question is contained in the retrieved docs (common approach in open-domain QA). If False, retrieval uses a stricter evaluation that checks if the retrieved document ids are within ids explicitly stated in the labels.
  • return_preds: Whether to add predictions in the returned dictionary. If True, the returned dictionary contains the keys "predictions" and "metrics".
  • headers: Custom HTTP headers to pass to document store client if supported (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='} for basic authentication)

WebRetriever.run

def run(root_node: str,
        query: Optional[str] = None,
        filters: Optional[FilterType] = None,
        top_k: Optional[int] = None,
        documents: Optional[List[Document]] = None,
        index: Optional[str] = None,
        headers: Optional[Dict[str, str]] = None,
        scale_score: Optional[bool] = None)

Arguments:

  • root_node: The root node of the pipeline's graph.
  • query: Query string.
  • filters: A dictionary where the keys specify a metadata field and the value is a list of accepted values for that field.
  • top_k: How many documents to return per query.
  • documents: List of Documents to Retrieve.
  • index: The name of the index in the DocumentStore from which to retrieve documents.
  • headers: Custom HTTP headers to pass to document store client if supported (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='} for basic authentication).
  • scale_score: Whether to scale the similarity score to the unit interval (range of [0,1]). If true (default), similarity scores (e.g. cosine or dot_product) which naturally have a different value range will be scaled to a range of [0,1], where 1 means extremely relevant. Otherwise, raw similarity scores (e.g. cosine or dot_product) will be used.

WebRetriever.run_batch

def run_batch(root_node: str,
              queries: Optional[List[str]] = None,
              filters: Optional[Union[FilterType,
                                      List[Optional[FilterType]]]] = None,
              top_k: Optional[int] = None,
              documents: Optional[Union[List[Document],
                                        List[List[Document]]]] = None,
              index: Optional[str] = None,
              headers: Optional[Dict[str, str]] = None)

Arguments:

  • root_node: The root node of the pipeline's graph.
  • queries: The list of query strings.
  • filters: A dictionary where the keys specify a metadata field and the value is a list of accepted values for that field.
  • top_k: How many documents to return per query.
  • documents: List of Documents to Retrieve.
  • index: The name of the index in the DocumentStore from which to retrieve documents.
  • headers: Custom HTTP headers to pass to document store client if supported (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='} for basic authentication).

Module link_content

html_content_handler

def html_content_handler(response: Response) -> Optional[str]

Extracts text from HTML response text using the boilerpy3 extractor.

Arguments:

  • response: Response object from the request.

Returns:

The extracted text.

LinkContentFetcher

class LinkContentFetcher(BaseComponent)

LinkContentFetcher fetches content from a URL and converts it into a list of Document objects.

LinkContentFetcher supports the following content types:
- HTML
- PDF

LinkContentFetcher offers a few options for customizing the content extraction process:
- content_handlers: A dictionary of content handlers to use for extracting content from a response.
- processor: PreProcessor to apply to the extracted text
- raise_on_failure: A boolean indicating whether to raise an exception when a failure occurs

One can use LinkContentFetcher as a standalone component or as part of a Pipeline. Here is an example of using
LinkContentFetcher as a standalone component:

```python
from typing import List

from haystack.nodes import LinkContentFetcher
from haystack.schema import Document

link_content_fetcher = LinkContentFetcher()
dl_wiki: List[Document] = link_content_fetcher.fetch(url="https://en.wikipedia.org/wiki/Deep_learning")
print(dl_wiki)
```

One can also use LinkContentFetcher as part of a Pipeline. Here is an example of using LinkContentFetcher as part
of a Pipeline:

```python
import os
from haystack.nodes import PromptNode, LinkContentFetcher, PromptTemplate
from haystack import Pipeline

anthropic_key = os.environ.get("ANTHROPIC_API_KEY")
if not anthropic_key:
    raise ValueError("Please set the ANTHROPIC_API_KEY environment variable")


retriever = LinkContentFetcher() # optionally add additional user agents
pt = PromptTemplate(
    "Given the content below, create a summary consisting of three sections: Objectives, "
    "Implementation and Learnings/Conclusions.\n\n"
    "Each section should have at least three bullet points. "
    "In the content below disregard References section.\n\n: {documents}"
)

prompt_node = PromptNode("claude-instant-1",
                          api_key=anthropic_key,
                          max_length=512,
                          default_prompt_template=pt,
                          model_kwargs={"stream": True}
                          )

pipeline = Pipeline()
pipeline.add_node(component=retriever, name="Retriever", inputs=["Query"])
pipeline.add_node(component=prompt_node, name="PromptNode", inputs=["Retriever"])

research_papers = ["https://arxiv.org/pdf/2307.03172.pdf", "https://arxiv.org/pdf/1706.03762.pdf"]

for research_paper in research_papers:
    print(f"Research paper summary: {research_paper}")
    pipeline.run(research_paper)
    print("\n\n")
```

LinkContentFetcher.__init__

def __init__(content_handlers: Optional[Dict[str, Callable]] = None,
             processor: Optional[PreProcessor] = None,
             raise_on_failure: Optional[bool] = False,
             user_agents: Optional[List[str]] = None,
             retry_attempts: Optional[int] = None)

Creates a LinkContentFetcher instance.

Arguments:

  • content_handlers: A dictionary of content handlers to use for extracting content from a response.
  • processor: PreProcessor to apply to the extracted text
  • raise_on_failure: A boolean indicating whether to raise an exception when a failure occurs during content extraction. If False, the error is simply logged and the program continues. Defaults to False.
  • user_agents: A list of user agents to use when fetching content. Defaults to None.
  • retry_attempts: The number of times to retry fetching content. Defaults to 2.
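
The interplay of `retry_attempts`, `user_agents`, and `raise_on_failure` can be sketched as a small retry loop. This is an illustration under assumptions rather than the library's implementation: the `get` callable stands in for the HTTP request, rotating the user agent on every attempt is assumed, and `retry_attempts` is taken to mean extra attempts after the first one.

```python
import logging
from typing import Callable, List, Optional

logger = logging.getLogger(__name__)


def fetch_with_retries(
    url: str,
    get: Callable[[str, str], str],
    user_agents: List[str],
    retry_attempts: int = 2,
    raise_on_failure: bool = False,
) -> Optional[str]:
    """Call get(url, user_agent), retrying on failure with the next user agent."""
    last_error: Optional[Exception] = None
    for attempt in range(retry_attempts + 1):
        agent = user_agents[attempt % len(user_agents)]  # user_agents must be non-empty
        try:
            return get(url, agent)
        except Exception as error:  # broad on purpose: any failure triggers a retry
            last_error = error
    if raise_on_failure and last_error is not None:
        raise last_error
    # With raise_on_failure=False the error is only logged and None is returned.
    logger.warning("Could not fetch %s: %s", url, last_error)
    return None


attempted_agents: List[str] = []


def flaky_get(url: str, agent: str) -> str:
    """Stand-in request function that fails twice before succeeding."""
    attempted_agents.append(agent)
    if len(attempted_agents) < 3:
        raise ConnectionError("temporary failure")
    return f"content of {url}"


result = fetch_with_retries("https://example.com", flaky_get, ["agent-a", "agent-b"])
print(result, attempted_agents)
```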

LinkContentFetcher.fetch

def fetch(url: str,
          timeout: Optional[int] = 3,
          doc_kwargs: Optional[dict] = None) -> List[Document]

Fetches content from a URL and converts it into a list of Document objects. If no content is extracted, an empty list is returned.

Arguments:

  • url: URL to fetch content from.
  • timeout: Timeout in seconds for the request.
  • doc_kwargs: Optional kwargs to pass to the Document constructor.

Returns:

List of Document objects or an empty list if no content is extracted.

LinkContentFetcher.run

def run(query: Optional[str] = None,
        file_paths: Optional[List[str]] = None,
        labels: Optional[MultiLabel] = None,
        documents: Optional[List[Document]] = None,
        meta: Optional[dict] = None) -> Tuple[Dict, str]

Fetches content from a URL specified by query parameter and converts it into a list of Document objects.

Arguments:

  • query: The query - a URL to fetch content from.
  • file_paths: Not used.
  • labels: Not used.
  • documents: Not used.
  • meta: Not used.

Returns:

List of Document objects.

LinkContentFetcher.run_batch

def run_batch(queries: Optional[Union[str, List[str]]] = None,
              file_paths: Optional[List[str]] = None,
              labels: Optional[Union[MultiLabel, List[MultiLabel]]] = None,
              documents: Optional[Union[List[Document],
                                        List[List[Document]]]] = None,
              meta: Optional[Union[Dict[str, Any], List[Dict[str,
                                                             Any]]]] = None,
              params: Optional[dict] = None,
              debug: Optional[bool] = None)

Takes a list of queries, where each query is expected to be a URL. For each query, the method fetches content from the specified URL and transforms it into a list of Document objects. The output is a list of these document lists, where each individual list of Document objects corresponds to the content retrieved from one of the URLs.

Arguments:

  • queries: List of queries - URLs to fetch content from.
  • file_paths: Not used.
  • labels: Not used.
  • documents: Not used.
  • meta: Not used.
  • params: Not used.
  • debug: Not used.

Returns:

List of lists of Document objects.