Stores your texts and meta data and provides them to the Retriever at query time.
Module es7
ElasticsearchDocumentStore
class ElasticsearchDocumentStore(_ElasticsearchDocumentStore)
ElasticsearchDocumentStore.__init__
def __init__(host: Union[str, List[str]] = "localhost",
port: Union[int, List[int]] = 9200,
username: str = "",
password: str = "",
api_key_id: Optional[str] = None,
api_key: Optional[str] = None,
aws4auth=None,
index: str = "document",
label_index: str = "label",
search_fields: Union[str, list] = "content",
content_field: str = "content",
name_field: str = "name",
embedding_field: str = "embedding",
embedding_dim: int = 768,
custom_mapping: Optional[dict] = None,
excluded_meta_data: Optional[list] = None,
analyzer: str = "standard",
scheme: str = "http",
ca_certs: Optional[str] = None,
verify_certs: bool = True,
recreate_index: bool = False,
create_index: bool = True,
refresh_type: str = "wait_for",
similarity: str = "dot_product",
timeout: int = 300,
return_embedding: bool = False,
duplicate_documents: str = "overwrite",
scroll: str = "1d",
skip_missing_embeddings: bool = True,
synonyms: Optional[List] = None,
synonym_type: str = "synonym",
use_system_proxy: bool = False,
batch_size: int = 10_000)
A DocumentStore using Elasticsearch to store and query the documents for our search.
- Keeps all the logic to store and query documents from Elastic, incl. mapping of fields, adding filters or boosts to your queries, and storing embeddings
- You can either use an existing Elasticsearch index or create a new one via haystack
- Retrievers operate on top of this DocumentStore to find the relevant documents for a query
Arguments:
- `host`: URL(s) of the Elasticsearch nodes.
- `port`: Port(s) of the Elasticsearch nodes.
- `username`: Username (standard authentication via http_auth).
- `password`: Password (standard authentication via http_auth).
- `api_key_id`: ID of the API key (alternative authentication mode to the above http_auth).
- `api_key`: Secret value of the API key (alternative authentication mode to the above http_auth).
- `aws4auth`: Authentication for usage with AWS Elasticsearch (can be generated with the requests-aws4auth package).
- `index`: Name of the Elasticsearch index used for storing the documents that we want to search. If it does not exist yet, we will create one.
- `label_index`: Name of the Elasticsearch index used for storing labels. If it does not exist yet, we will create one.
- `search_fields`: Name of the fields used by BM25Retriever to find matches in the docs for the incoming query (using Elasticsearch's multi_match query), e.g. ["title", "full_text"].
- `content_field`: Name of the field that might contain the answer and will therefore be passed to the Reader model (e.g. "full_text"). If no Reader is used (e.g. in FAQ-style QA), the plain content of this field will just be returned.
- `name_field`: Name of the field that contains the title of the doc.
- `embedding_field`: Name of the field containing an embedding vector. Only needed when using a dense retriever (e.g. DensePassageRetriever, EmbeddingRetriever) on top.
- `embedding_dim`: Dimensionality of the embedding vector. Only needed when using a dense retriever (e.g. DensePassageRetriever, EmbeddingRetriever) on top.
- `custom_mapping`: If you want to use your own custom mapping for creating a new index in Elasticsearch, you can supply it here as a dictionary.
- `analyzer`: Specify the default analyzer from one of the built-ins when creating a new Elasticsearch index. Elasticsearch also has built-in analyzers for different languages (e.g. impacting tokenization). More info at: https://www.elastic.co/guide/en/elasticsearch/reference/7.9/analysis-analyzers.html
- `excluded_meta_data`: Name of fields in Elasticsearch that should not be returned (e.g. [field_one, field_two]). Helpful if you have fields with long, irrelevant content that you don't want to display in results (e.g. embedding vectors).
- `scheme`: 'https' or 'http', the protocol used to connect to your Elasticsearch instance.
- `ca_certs`: Root certificates for SSL: a path to certificate authority (CA) certs on disk. You can use the certifi package with certifi.where() to find where the CA certs file is located on your machine.
- `verify_certs`: Whether to be strict about CA certificates.
- `recreate_index`: If set to True, an existing Elasticsearch index will be deleted and a new one will be created using the config you are using for initialization. Be aware that all data in the old index will be lost if you choose to recreate the index, and that both the document_index and the label_index will be recreated.
- `create_index`: Whether to try creating a new index (if an index of that name already exists, we will just continue in any case). Deprecated since 2.0: in the next major version we will always try to create an index if there is no existing index (the current behaviour when create_index=True). If you are looking to recreate an existing index by deleting it first if it already exists, use the param recreate_index.
- `refresh_type`: Type of ES refresh used to control when changes made by a request (e.g. bulk) are made visible to search. If set to 'wait_for', continue only after changes are visible (slow, but safe). If set to 'false', continue directly (fast, but sometimes unintuitive behaviour when docs are not immediately available after ingestion). More info at https://www.elastic.co/guide/en/elasticsearch/reference/6.8/docs-refresh.html
- `similarity`: The similarity function used to compare document vectors. 'dot_product' is the default since it is more performant with DPR embeddings. 'cosine' is recommended if you are using a Sentence BERT model.
- `timeout`: Number of seconds after which an Elasticsearch request times out.
- `return_embedding`: Whether to return the document embeddings.
- `duplicate_documents`: How to handle duplicate documents. Parameter options: ('skip', 'overwrite', 'fail'). skip: Ignore the duplicate documents. overwrite: Update any existing documents with the same ID when adding documents. fail: An error is raised if the document ID of the document being added already exists.
- `scroll`: Determines how long the current index is fixed, e.g. during updating all documents with embeddings. Defaults to "1d" and should not be larger than this. Can also be in minutes ("5m") or hours ("15h"). For details, see https://www.elastic.co/guide/en/elasticsearch/reference/current/scroll-api.html
- `skip_missing_embeddings`: Controls queries based on vector similarity when indexed documents miss embeddings. Parameter options: (True, False). False: Raises an exception if one or more documents do not have embeddings at query time. True: The query will ignore all documents without embeddings (recommended if you concurrently index and query).
- `synonyms`: List of synonyms that can be passed when initializing Elasticsearch. For example: [ "foo, bar => baz", "foozball , foosball" ]. More info at https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-synonym-tokenfilter.html
- `synonym_type`: The synonym filter type, either synonym or synonym_graph, to handle synonyms, including multi-word synonyms, correctly during the analysis process. More info at https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-synonym-graph-tokenfilter.html
- `use_system_proxy`: Whether to use the system proxy.
- `batch_size`: Number of Documents to index at once / Number of queries to execute at once. If you face memory issues, decrease the batch_size.
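As an illustration of how these arguments fit together, the sketch below collects a few of them in a dictionary and passes them to the constructor inside a helper. The import path `haystack.document_stores` assumes a Haystack 1.x installation; the `build_store` helper and the concrete values (for example a 384-dimensional embedding model) are hypothetical choices for this example, not recommendations.

```python
from typing import Any, Dict

# Keyword arguments taken from the signature above. The values are
# illustrative assumptions, not recommended settings.
store_kwargs: Dict[str, Any] = {
    "host": "localhost",
    "port": 9200,
    "index": "document",
    "embedding_dim": 384,            # must match the embedding model in use
    "similarity": "cosine",          # suggested above for Sentence BERT models
    "duplicate_documents": "skip",
    "refresh_type": "wait_for",
}


def build_store(kwargs: Dict[str, Any]):
    """Create the document store (needs Haystack and a reachable Elasticsearch node)."""
    # Imported lazily so the sketch can be loaded without Haystack installed.
    from haystack.document_stores import ElasticsearchDocumentStore

    return ElasticsearchDocumentStore(**kwargs)
```

Calling `build_store(store_kwargs)` connects to the cluster, so it is left out of the snippet itself.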
ElasticsearchDocumentStore.query_by_embedding
def query_by_embedding(query_emb: np.ndarray,
filters: Optional[FilterType] = None,
top_k: int = 10,
index: Optional[str] = None,
return_embedding: Optional[bool] = None,
headers: Optional[Dict[str, str]] = None,
scale_score: bool = True) -> List[Document]
Find the document that is most similar to the provided query_emb
by using a vector similarity metric.
Arguments:
- `query_emb`: Embedding of the query (e.g. gathered from DPR).
- `filters`: Optional filters to narrow down the search space to documents whose metadata fulfill certain conditions. Filters are defined as nested dictionaries. The keys of the dictionaries can be a logical operator (`"$and"`, `"$or"`, `"$not"`), a comparison operator (`"$eq"`, `"$in"`, `"$gt"`, `"$gte"`, `"$lt"`, `"$lte"`) or a metadata field name. Logical operator keys take a dictionary of metadata field names and/or logical operators as value. Metadata field names take a dictionary of comparison operators as value. Comparison operator keys take a single value or (in case of `"$in"`) a list of values as value. If no logical operator is provided, `"$and"` is used as default operation. If no comparison operator is provided, `"$eq"` (or `"$in"` if the comparison value is a list) is used as default operation.

  __Example__:

  ```python
  filters = {
      "$and": {
          "type": {"$eq": "article"},
          "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
          "rating": {"$gte": 3},
          "$or": {
              "genre": {"$in": ["economy", "politics"]},
              "publisher": {"$eq": "nytimes"}
          }
      }
  }
  # or simpler using default operators
  filters = {
      "type": "article",
      "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
      "rating": {"$gte": 3},
      "$or": {
          "genre": ["economy", "politics"],
          "publisher": "nytimes"
      }
  }
  ```

  To use the same logical operator multiple times on the same level, logical operators take optionally a list of dictionaries as value.

  __Example__:

  ```python
  filters = {
      "$or": [
          {
              "$and": {
                  "Type": "News Paper",
                  "Date": {
                      "$lt": "2019-01-01"
                  }
              }
          },
          {
              "$and": {
                  "Type": "Blog Post",
                  "Date": {
                      "$gte": "2019-01-01"
                  }
              }
          }
      ]
  }
  ```
- `top_k`: How many documents to return.
- `index`: Index name for storing the docs and metadata.
- `return_embedding`: Whether to return the document embeddings.
- `headers`: Custom HTTP headers to pass to the Elasticsearch client (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='}). Check out https://www.elastic.co/guide/en/elasticsearch/reference/current/http-clients.html for more information.
- `scale_score`: Whether to scale the similarity score to the unit interval (range of [0,1]). If true (default), similarity scores (e.g. cosine or dot_product), which naturally have a different value range, will be scaled to a range of [0,1], where 1 means extremely relevant. Otherwise, raw similarity scores (e.g. cosine or dot_product) will be used.
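To make the filter semantics concrete, here is a minimal pure-Python sketch that evaluates such a filter dictionary against the metadata of a single document. It only mirrors the rules described above and is not the DocumentStore's implementation, which has to express filters as Elasticsearch queries. `matches` and `_match_field` are hypothetical helpers, `"$not"` is deliberately left out, and a field missing from the metadata is assumed not to match.

```python
from typing import Any, Dict, List, Union

# Comparison operators from the description above, as plain Python predicates.
COMPARISONS = {
    "$eq": lambda field, operand: field == operand,
    "$in": lambda field, operand: field in operand,
    "$gt": lambda field, operand: field > operand,
    "$gte": lambda field, operand: field >= operand,
    "$lt": lambda field, operand: field < operand,
    "$lte": lambda field, operand: field <= operand,
}

Filter = Union[Dict[str, Any], List[Dict[str, Any]]]


def _match_field(field_value: Any, condition: Any) -> bool:
    """Check one metadata value against a comparison dict or a bare value."""
    if not isinstance(condition, dict):
        # Default comparison: "$in" for a list, "$eq" for anything else.
        condition = {"$in": condition} if isinstance(condition, list) else {"$eq": condition}
    if field_value is None:
        return False  # assumption of this sketch: a missing field never matches
    return all(COMPARISONS[op](field_value, operand) for op, operand in condition.items())


def matches(meta: Dict[str, Any], filters: Filter, operator: str = "$and") -> bool:
    """Return True if `meta` satisfies `filters`, combined with `operator`."""
    # A list of dictionaries lets the same logical operator repeat on one level.
    clauses = filters if isinstance(filters, list) else [filters]
    results = []
    for clause in clauses:
        for key, value in clause.items():
            if key == "$not":
                raise NotImplementedError('"$not" is not covered by this sketch')
            if key in ("$and", "$or"):
                results.append(matches(meta, value, key))
            else:
                results.append(_match_field(meta.get(key), value))
    return any(results) if operator == "$or" else all(results)


meta = {"type": "article", "date": "2018-05-01", "rating": 4,
        "genre": "economy", "publisher": "other"}
simple_filter = {
    "type": "article",
    "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
    "rating": {"$gte": 3},
    "$or": {"genre": ["economy", "politics"], "publisher": "nytimes"},
}
print(matches(meta, simple_filter))  # True
```

Dates are compared as strings here, which works for ISO-formatted values such as "2018-05-01" because their lexicographic order equals their chronological order.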
ElasticsearchDocumentStore.get_document_by_id
def get_document_by_id(
id: str,
index: Optional[str] = None,
headers: Optional[Dict[str, str]] = None) -> Optional[Document]
Fetch a document by specifying its text id string
ElasticsearchDocumentStore.get_documents_by_id
def get_documents_by_id(
ids: List[str],
index: Optional[str] = None,
batch_size: int = 10_000,
headers: Optional[Dict[str, str]] = None) -> List[Document]
Fetch documents by specifying a list of text id strings.
Arguments:
- `ids`: List of document IDs. Be aware that passing a large number of ids might lead to performance issues.
- `index`: Search index where the documents are stored. If not supplied, self.index will be used.
- `batch_size`: Maximum number of results for each query. Limited to 10,000 documents by default. To reduce the pressure on the cluster, you can lower this limit, at the expense of longer retrieval times.
- `headers`: Custom HTTP headers to pass to the client (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='}). Check out https://www.elastic.co/guide/en/elasticsearch/reference/current/http-clients.html for more information.
ElasticsearchDocumentStore.get_metadata_values_by_key
def get_metadata_values_by_key(key: str,
query: Optional[str] = None,
filters: Optional[FilterType] = None,
index: Optional[str] = None,
headers: Optional[Dict[str, str]] = None,
batch_size: int = 10) -> List[dict]
Get values associated with a metadata key. The output is in the format:
[{"value": "my-value-1", "count": 23}, {"value": "my-value-2", "count": 12}, ... ]
Arguments:
- `key`: The meta key name to get the values for.
- `query`: Narrow down the scope to documents matching the query string.
- `filters`: Narrow down the scope to documents that match the given filters. Filters are defined as nested dictionaries. The keys of the dictionaries can be a logical operator (`"$and"`, `"$or"`, `"$not"`), a comparison operator (`"$eq"`, `"$in"`, `"$gt"`, `"$gte"`, `"$lt"`, `"$lte"`) or a metadata field name. Logical operator keys take a dictionary of metadata field names and/or logical operators as value. Metadata field names take a dictionary of comparison operators as value. Comparison operator keys take a single value or (in case of `"$in"`) a list of values as value. If no logical operator is provided, `"$and"` is used as default operation. If no comparison operator is provided, `"$eq"` (or `"$in"` if the comparison value is a list) is used as default operation.

  __Example__:

  ```python
  filters = {
      "$and": {
          "type": {"$eq": "article"},
          "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
          "rating": {"$gte": 3},
          "$or": {
              "genre": {"$in": ["economy", "politics"]},
              "publisher": {"$eq": "nytimes"}
          }
      }
  }
  ```
- `index`: Search index where the meta values should be searched. If not supplied, self.index will be used.
- `headers`: Custom HTTP headers to pass to the client (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='}). Check out https://www.elastic.co/guide/en/elasticsearch/reference/current/http-clients.html for more information.
- `batch_size`: Maximum number of results for each request. Limited to 10 values by default. You can increase this limit to decrease retrieval time. To reduce the pressure on the cluster, you shouldn't set this higher than 1,000.
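The output format can be reproduced on a plain list of metadata dictionaries, which may help when writing tests against this method. This is only a sketch of the result shape: `count_metadata_values` is a hypothetical helper, and ordering the entries by descending count is an assumption rather than documented behaviour.

```python
from collections import Counter
from typing import Any, Dict, List


def count_metadata_values(metas: List[Dict[str, Any]], key: str) -> List[Dict[str, Any]]:
    """Count how often each value of `key` occurs, in the documented output shape."""
    counts = Counter(meta[key] for meta in metas if key in meta)
    # most_common() sorts by descending count (an assumption of this sketch).
    return [{"value": value, "count": count} for value, count in counts.most_common()]


metas = [
    {"category": "news"},
    {"category": "blog"},
    {"category": "news"},
    {"author": "somebody"},  # no "category" key, so it is ignored
]
print(count_metadata_values(metas, "category"))
# [{'value': 'news', 'count': 2}, {'value': 'blog', 'count': 1}]
```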
ElasticsearchDocumentStore.write_documents
def write_documents(documents: Union[List[dict], List[Document]],
index: Optional[str] = None,
batch_size: Optional[int] = None,
duplicate_documents: Optional[str] = None,
headers: Optional[Dict[str, str]] = None)
Indexes documents for later queries.
If a document with the same ID already exists:
a) (Default) Manage duplication according to the `duplicate_documents` parameter.
b) If `self.update_existing_documents=True` for DocumentStore: Overwrite existing documents.
(This is only relevant if you pass your own ID when initializing a `Document`. If you don't set custom IDs for your Documents or just pass a list of dictionaries here, they automatically get UUIDs assigned. See the `Document` class for details.)
Arguments:
- `documents`: A list of Python dictionaries or a list of Haystack Document objects. For documents as dictionaries, the format is {"content": ""}. Optionally, include metadata via {"content": "", "meta": {"name": "", "author": "somebody", ...}}. You can use it for filtering and you can access it in the responses of the Finder. Advanced: If you are using your own field mapping, change the key names in the dictionary to what you have set for self.content_field and self.name_field.
- `index`: Search index where the documents should be indexed. If you don't specify it, self.index is used.
- `batch_size`: Number of documents that are passed to the bulk function at each round. If not specified, self.batch_size is used.
- `duplicate_documents`: How to handle duplicate documents. Parameter options: ('skip', 'overwrite', 'fail'). skip: Ignore the duplicate documents. overwrite: Update any existing documents with the same ID when adding documents. fail: Raises an error if the document ID of the document being added already exists.
- `headers`: Custom HTTP headers to pass to the client (for example {'Authorization': 'Basic YWRtaW46cm9vdA=='}). For more information, see HTTP/REST clients and security.
Raises:
- `DuplicateDocumentError`: Exception raised when a duplicate document is encountered.
Returns:
None
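The three `duplicate_documents` options can be pictured with an in-memory dictionary standing in for the index. This is a simplified sketch of the policy, not the store's bulk-indexing code: `write_with_policy` is a hypothetical helper, and the `DuplicateDocumentError` class defined here is a local stand-in for the exception named above.

```python
from typing import Dict, List


class DuplicateDocumentError(ValueError):
    """Local stand-in for the exception raised on duplicates."""


def write_with_policy(store: Dict[str, dict], documents: List[dict],
                      duplicate_documents: str = "overwrite") -> None:
    """Write `documents` (dicts with an "id" key) into `store` under the given policy."""
    if duplicate_documents not in ("skip", "overwrite", "fail"):
        raise ValueError(f"Unknown duplicate_documents option: {duplicate_documents}")
    for doc in documents:
        doc_id = doc["id"]
        if doc_id in store:
            if duplicate_documents == "skip":
                continue  # keep the existing document untouched
            if duplicate_documents == "fail":
                raise DuplicateDocumentError(f"Document with id '{doc_id}' already exists")
        # New document, or "overwrite" replacing the existing one.
        store[doc_id] = doc


store = {"1": {"id": "1", "content": "old text"}}
write_with_policy(store, [{"id": "1", "content": "new text"}], duplicate_documents="skip")
print(store["1"]["content"])  # old text
```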
ElasticsearchDocumentStore.write_labels
def write_labels(labels: Union[List[Label], List[dict]],
index: Optional[str] = None,
headers: Optional[Dict[str, str]] = None,
batch_size: int = 10_000)
Write annotation labels into document store.
Arguments:
- `labels`: A list of Python dictionaries or a list of Haystack Label objects.
- `index`: Search index where the labels should be stored. If not supplied, self.label_index will be used.
- `batch_size`: Number of labels that are passed to the bulk function at each round.
- `headers`: Custom HTTP headers to pass to the client (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='}). Check out https://www.elastic.co/guide/en/elasticsearch/reference/current/http-clients.html for more information.
ElasticsearchDocumentStore.update_document_meta
def update_document_meta(id: str,
meta: Dict[str, str],
index: Optional[str] = None,
headers: Optional[Dict[str, str]] = None)
Update the metadata dictionary of a document by specifying its string id
ElasticsearchDocumentStore.get_document_count
def get_document_count(filters: Optional[FilterType] = None,
index: Optional[str] = None,
only_documents_without_embedding: bool = False,
headers: Optional[Dict[str, str]] = None) -> int
Return the number of documents in the document store.
ElasticsearchDocumentStore.get_label_count
def get_label_count(index: Optional[str] = None,
headers: Optional[Dict[str, str]] = None) -> int
Return the number of labels in the document store
ElasticsearchDocumentStore.get_embedding_count
def get_embedding_count(index: Optional[str] = None,
filters: Optional[FilterType] = None,
headers: Optional[Dict[str, str]] = None) -> int
Return the count of embeddings in the document store.
ElasticsearchDocumentStore.get_all_documents
def get_all_documents(
index: Optional[str] = None,
filters: Optional[FilterType] = None,
return_embedding: Optional[bool] = None,
batch_size: int = 10_000,
headers: Optional[Dict[str, str]] = None) -> List[Document]
Get documents from the document store.
Arguments:
- `index`: Name of the index to get the documents from. If None, the DocumentStore's default index (self.index) will be used.
- `filters`: Optional filters to narrow down the documents to return. Filters are defined as nested dictionaries. The keys of the dictionaries can be a logical operator (`"$and"`, `"$or"`, `"$not"`), a comparison operator (`"$eq"`, `"$in"`, `"$gt"`, `"$gte"`, `"$lt"`, `"$lte"`) or a metadata field name. Logical operator keys take a dictionary of metadata field names and/or logical operators as value. Metadata field names take a dictionary of comparison operators as value. Comparison operator keys take a single value or (in case of `"$in"`) a list of values as value. If no logical operator is provided, `"$and"` is used as default operation. If no comparison operator is provided, `"$eq"` (or `"$in"` if the comparison value is a list) is used as default operation.

  __Example__:

  ```python
  filters = {
      "$and": {
          "type": {"$eq": "article"},
          "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
          "rating": {"$gte": 3},
          "$or": {
              "genre": {"$in": ["economy", "politics"]},
              "publisher": {"$eq": "nytimes"}
          }
      }
  }
  ```
- `return_embedding`: Whether to return the document embeddings.
- `batch_size`: When working with a large number of documents, batching can help reduce the memory footprint.
- `headers`: Custom HTTP headers to pass to the client (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='}). Check out https://www.elastic.co/guide/en/elasticsearch/reference/current/http-clients.html for more information.
ElasticsearchDocumentStore.get_all_documents_generator
def get_all_documents_generator(
index: Optional[str] = None,
filters: Optional[FilterType] = None,
return_embedding: Optional[bool] = None,
batch_size: int = 10_000,
headers: Optional[Dict[str,
str]] = None) -> Generator[Document, None, None]
Get documents from the document store. Under the hood, documents are fetched in batches from the
document store and yielded as individual documents. This method can be used to iteratively process a large number of documents without having to load all documents into memory.
Arguments:
- `index`: Name of the index to get the documents from. If None, the DocumentStore's default index (self.index) will be used.
- `filters`: Optional filters to narrow down the documents to return. Filters are defined as nested dictionaries. The keys of the dictionaries can be a logical operator (`"$and"`, `"$or"`, `"$not"`), a comparison operator (`"$eq"`, `"$in"`, `"$gt"`, `"$gte"`, `"$lt"`, `"$lte"`) or a metadata field name. Logical operator keys take a dictionary of metadata field names and/or logical operators as value. Metadata field names take a dictionary of comparison operators as value. Comparison operator keys take a single value or (in case of `"$in"`) a list of values as value. If no logical operator is provided, `"$and"` is used as default operation. If no comparison operator is provided, `"$eq"` (or `"$in"` if the comparison value is a list) is used as default operation.

  __Example__:

  ```python
  filters = {
      "$and": {
          "type": {"$eq": "article"},
          "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
          "rating": {"$gte": 3},
          "$or": {
              "genre": {"$in": ["economy", "politics"]},
              "publisher": {"$eq": "nytimes"}
          }
      }
  }
  ```
- `return_embedding`: Whether to return the document embeddings.
- `batch_size`: When working with a large number of documents, batching can help reduce the memory footprint.
- `headers`: Custom HTTP headers to pass to the client (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='}). Check out https://www.elastic.co/guide/en/elasticsearch/reference/current/http-clients.html for more information.
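The batch-then-yield pattern described for this method can be sketched in a few lines. The code below is an illustration under simplifying assumptions, not the store's implementation: `fetch_batch` is a hypothetical stand-in for one paged request and slices a local list instead of querying Elasticsearch.

```python
from typing import Iterator, List


def fetch_batch(all_items: List[str], offset: int, batch_size: int) -> List[str]:
    """Hypothetical stand-in for one paged request to the document store."""
    return all_items[offset:offset + batch_size]


def iterate_documents(all_items: List[str], batch_size: int = 10_000) -> Iterator[str]:
    """Fetch items batch by batch, but hand them to the caller one at a time."""
    offset = 0
    while True:
        batch = fetch_batch(all_items, offset, batch_size)
        if not batch:
            return  # no more results
        # Only the current batch is held in memory.
        yield from batch
        offset += len(batch)


for doc in iterate_documents(["a", "b", "c", "d", "e"], batch_size=2):
    print(doc)
```

Because the function is a generator, nothing is fetched until the caller starts iterating, and at most `batch_size` items are buffered at any time.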
ElasticsearchDocumentStore.get_all_labels
def get_all_labels(index: Optional[str] = None,
filters: Optional[FilterType] = None,
headers: Optional[Dict[str, str]] = None,
batch_size: int = 10_000) -> List[Label]
Return all labels in the document store
ElasticsearchDocumentStore.query
def query(query: Optional[str],
filters: Optional[FilterType] = None,
top_k: int = 10,
custom_query: Optional[str] = None,
index: Optional[str] = None,
headers: Optional[Dict[str, str]] = None,
all_terms_must_match: bool = False,
scale_score: bool = True) -> List[Document]
Scan through documents in DocumentStore and return a small number of documents
that are most relevant to the query as defined by the BM25 algorithm.
Arguments:
- `query`: The query.
- `filters`: Optional filters to narrow down the search space to documents whose metadata fulfill certain conditions. Filters are defined as nested dictionaries. The keys of the dictionaries can be a logical operator (`"$and"`, `"$or"`, `"$not"`), a comparison operator (`"$eq"`, `"$in"`, `"$gt"`, `"$gte"`, `"$lt"`, `"$lte"`) or a metadata field name. Logical operator keys take a dictionary of metadata field names and/or logical operators as value. Metadata field names take a dictionary of comparison operators as value. Comparison operator keys take a single value or (in case of `"$in"`) a list of values as value. If no logical operator is provided, `"$and"` is used as default operation. If no comparison operator is provided, `"$eq"` (or `"$in"` if the comparison value is a list) is used as default operation.

  __Example__:

  ```python
  filters = {
      "$and": {
          "type": {"$eq": "article"},
          "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
          "rating": {"$gte": 3},
          "$or": {
              "genre": {"$in": ["economy", "politics"]},
              "publisher": {"$eq": "nytimes"}
          }
      }
  }
  # or simpler using default operators
  filters = {
      "type": "article",
      "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
      "rating": {"$gte": 3},
      "$or": {
          "genre": ["economy", "politics"],
          "publisher": "nytimes"
      }
  }
  ```

  To use the same logical operator multiple times on the same level, logical operators take optionally a list of dictionaries as value.

  __Example__:

  ```python
  filters = {
      "$or": [
          {
              "$and": {
                  "Type": "News Paper",
                  "Date": {
                      "$lt": "2019-01-01"
                  }
              }
          },
          {
              "$and": {
                  "Type": "Blog Post",
                  "Date": {
                      "$gte": "2019-01-01"
                  }
              }
          }
      ]
  }
  ```
- `top_k`: How many documents to return per query.
- `custom_query`: Query string containing a mandatory `${query}` and an optional `${filters}` placeholder.

  **An example custom_query:**

  ```
  {
      "size": 10,
      "query": {
          "bool": {
              "should": [{"multi_match": {
                  "query": ${query},                 // mandatory query placeholder
                  "type": "most_fields",
                  "fields": ["content", "title"]}}],
              "filter": ${filters}                  // optional filters placeholder
          }
      },
  }
  ```

  **For this custom_query, a sample retrieve() could be:**

  ```python
  self.retrieve(query="Why did the revenue increase?",
                filters={"years": ["2019"], "quarters": ["Q1", "Q2"]})
  ```

  Optionally, highlighting can be defined by specifying the highlight settings. See https://www.elastic.co/guide/en/elasticsearch/reference/current/highlighting.html. You will find the highlighted output in the returned Document's meta field by key "highlighted".

  **Example custom_query with highlighting:**

  ```
  {
      "size": 10,
      "query": {
          "bool": {
              "should": [{"multi_match": {
                  "query": ${query},                 // mandatory query placeholder
                  "type": "most_fields",
                  "fields": ["content", "title"]}}],
          }
      },
      "highlight": {             // enable highlighting
          "fields": {            // for fields content and title
              "content": {},
              "title": {}
          }
      },
  }
  ```

  **For this custom_query, highlighting info can be accessed by:**

  ```python
  docs = self.retrieve(query="Why did the revenue increase?")
  highlighted_content = docs[0].meta["highlighted"]["content"]
  highlighted_title = docs[0].meta["highlighted"]["title"]
  ```
- `index`: The name of the index in the DocumentStore from which to retrieve documents.
- `headers`: Custom HTTP headers to pass to the client (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='}). Check out https://www.elastic.co/guide/en/elasticsearch/reference/current/http-clients.html for more information.
- `all_terms_must_match`: Whether all terms of the query must match the document. If true, all query terms must be present in a document in order to be retrieved (i.e. the AND operator is being used implicitly between query terms: "cozy fish restaurant" -> "cozy AND fish AND restaurant"). Otherwise, at least one query term must be present in a document in order to be retrieved (i.e. the OR operator is being used implicitly between query terms: "cozy fish restaurant" -> "cozy OR fish OR restaurant"). Defaults to false.
- `scale_score`: Whether to scale the similarity score to the unit interval (range of [0,1]). If true (default), similarity scores (e.g. cosine or dot_product), which naturally have a different value range, will be scaled to a range of [0,1], where 1 means extremely relevant. Otherwise, raw similarity scores (e.g. cosine or dot_product) will be used.
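The role of the `${query}` and `${filters}` placeholders can be illustrated with Python's `string.Template`. This is a minimal sketch, assuming the placeholders are filled by plain string substitution with JSON-encoded values; how the DocumentStore actually performs the substitution and translates Haystack filters into Elasticsearch clauses may differ. `render_custom_query` and the `terms` clause passed to it are hypothetical, and the template omits the `//` comments so that the result parses as JSON.

```python
import json
from string import Template

# Same structure as the example custom_query, without comments or trailing commas.
custom_query = """
{
    "size": 10,
    "query": {
        "bool": {
            "should": [{"multi_match": {
                "query": ${query},
                "type": "most_fields",
                "fields": ["content", "title"]}}],
            "filter": ${filters}
        }
    }
}
"""


def render_custom_query(template: str, query: str, es_filters: list) -> dict:
    """Fill both placeholders with JSON-encoded values and parse the result."""
    rendered = Template(template).substitute(
        query=json.dumps(query),         # adds quotes and escapes special characters
        filters=json.dumps(es_filters),  # assumed to be Elasticsearch filter clauses already
    )
    return json.loads(rendered)


body = render_custom_query(
    custom_query,
    query="Why did the revenue increase?",
    es_filters=[{"terms": {"years": ["2019"]}}],
)
print(body["query"]["bool"]["should"][0]["multi_match"]["query"])
# Why did the revenue increase?
```

JSON-encoding the query string before substitution keeps quotes inside the user's query from breaking the request body.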
ElasticsearchDocumentStore.query_batch
def query_batch(queries: List[str],
filters: Optional[Union[FilterType,
List[Optional[FilterType]]]] = None,
top_k: int = 10,
custom_query: Optional[str] = None,
index: Optional[str] = None,
headers: Optional[Dict[str, str]] = None,
all_terms_must_match: bool = False,
scale_score: bool = True,
batch_size: Optional[int] = None) -> List[List[Document]]
Scan through documents in DocumentStore and return a small number of documents
that are most relevant to the provided queries as defined by keyword matching algorithms like BM25.
This method lets you find relevant documents for a list of query strings (output: List of Lists of Documents).
Arguments:
- `queries`: List of query strings.
- `filters`: Optional filters to narrow down the search space to documents whose metadata fulfill certain conditions. Can be a single filter that will be applied to each query or a list of filters (one filter per query).

  Filters are defined as nested dictionaries. The keys of the dictionaries can be a logical operator (`"$and"`, `"$or"`, `"$not"`), a comparison operator (`"$eq"`, `"$in"`, `"$gt"`, `"$gte"`, `"$lt"`, `"$lte"`) or a metadata field name. Logical operator keys take a dictionary of metadata field names and/or logical operators as value. Metadata field names take a dictionary of comparison operators as value. Comparison operator keys take a single value or (in case of `"$in"`) a list of values as value. If no logical operator is provided, `"$and"` is used as default operation. If no comparison operator is provided, `"$eq"` (or `"$in"` if the comparison value is a list) is used as default operation.

__Example__:
```python
filters = {
    "$and": {
        "type": {"$eq": "article"},
        "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
        "rating": {"$gte": 3},
        "$or": {
            "genre": {"$in": ["economy", "politics"]},
            "publisher": {"$eq": "nytimes"}
        }
    }
}
# or simpler using default operators
filters = {
    "type": "article",
    "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
    "rating": {"$gte": 3},
    "$or": {
        "genre": ["economy", "politics"],
        "publisher": "nytimes"
    }
}
```
To use the same logical operator multiple times on the same level, logical operators take
optionally a list of dictionaries as value.
__Example__:
```python
filters = {
    "$or": [
        {
            "$and": {
                "Type": "News Paper",
                "Date": {
                    "$lt": "2019-01-01"
                }
            }
        },
        {
            "$and": {
                "Type": "Blog Post",
                "Date": {
                    "$gte": "2019-01-01"
                }
            }
        }
    ]
}
```
- `top_k`: How many documents to return per query.
- `custom_query`: Custom query to be executed.
- `index`: The name of the index in the DocumentStore from which to retrieve documents.
- `headers`: Custom HTTP headers to pass to the document store client if supported (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='} for basic authentication).
- `all_terms_must_match`: Whether all terms of the query must match the document. If true, all query terms must be present in a document in order to be retrieved (i.e. the AND operator is being used implicitly between query terms: "cozy fish restaurant" -> "cozy AND fish AND restaurant"). Otherwise, at least one query term must be present in a document in order to be retrieved (i.e. the OR operator is being used implicitly between query terms: "cozy fish restaurant" -> "cozy OR fish OR restaurant"). Defaults to False.
- `scale_score`: Whether to scale the similarity score to the unit interval (range of [0,1]). If true (default), similarity scores (e.g. cosine or dot_product), which naturally have a different value range, will be scaled to a range of [0,1], where 1 means extremely relevant. Otherwise, raw similarity scores (e.g. cosine or dot_product) will be used.
- `batch_size`: Number of queries that are processed at once. If not specified, self.batch_size is used.
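The effect of `all_terms_must_match` can be shown with a toy matcher. The sketch below only illustrates the implicit AND versus OR between query terms. It uses naive lowercase whitespace tokenization instead of an Elasticsearch analyzer and does no BM25 scoring, and `term_match` is a hypothetical helper.

```python
def term_match(query: str, text: str, all_terms_must_match: bool = False) -> bool:
    """Decide whether `text` would be a candidate for `query` under AND or OR semantics."""
    query_terms = query.lower().split()
    doc_terms = set(text.lower().split())
    hits = [term in doc_terms for term in query_terms]
    # AND: every query term must occur; OR: one occurrence is enough.
    return all(hits) if all_terms_must_match else any(hits)


text = "a cozy restaurant near the harbour"
print(term_match("cozy fish restaurant", text))                             # True
print(term_match("cozy fish restaurant", text, all_terms_must_match=True))  # False
```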
ElasticsearchDocumentStore.query_by_embedding_batch
def query_by_embedding_batch(
query_embs: Union[List[np.ndarray], np.ndarray],
filters: Optional[Union[FilterType,
List[Optional[FilterType]]]] = None,
top_k: int = 10,
index: Optional[str] = None,
return_embedding: Optional[bool] = None,
headers: Optional[Dict[str, str]] = None,
scale_score: bool = True,
batch_size: Optional[int] = None) -> List[List[Document]]
Find the documents that are most similar to the provided query_embs
by using a vector similarity metric.
Arguments:
query_embs
: Embeddings of the queries (e.g. gathered from DPR). Can be a list of one-dimensional numpy arrays or a two-dimensional numpy array.
filters
: Optional filters to narrow down the search space to documents whose metadata fulfill certain conditions. Filters are defined as nested dictionaries. The keys of the dictionaries can be a logical operator ("$and", "$or", "$not"), a comparison operator ("$eq", "$in", "$gt", "$gte", "$lt", "$lte") or a metadata field name. Logical operator keys take a dictionary of metadata field names and/or logical operators as value. Metadata field names take a dictionary of comparison operators as value. Comparison operator keys take a single value or (in case of "$in") a list of values as value. If no logical operator is provided, "$and" is used as default operation. If no comparison operator is provided, "$eq" (or "$in" if the comparison value is a list) is used as default operation.
__Example__:
```python
filters = {
    "$and": {
        "type": {"$eq": "article"},
        "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
        "rating": {"$gte": 3},
        "$or": {
            "genre": {"$in": ["economy", "politics"]},
            "publisher": {"$eq": "nytimes"}
        }
    }
}
# or simpler using default operators
filters = {
    "type": "article",
    "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
    "rating": {"$gte": 3},
    "$or": {
        "genre": ["economy", "politics"],
        "publisher": "nytimes"
    }
}
```
To use the same logical operator multiple times on the same level, logical operators take optionally a list of dictionaries as value.
__Example__:
```python
filters = {
    "$or": [
        {"$and": {"Type": "News Paper", "Date": {"$lt": "2019-01-01"}}},
        {"$and": {"Type": "Blog Post", "Date": {"$gte": "2019-01-01"}}}
    ]
}
```
top_k
: How many documents to return
index
: Index name for storing the docs and metadata
return_embedding
: To return document embedding
headers
: Custom HTTP headers to pass to elasticsearch client (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='}) Check out https://www.elastic.co/guide/en/elasticsearch/reference/current/http-clients.html for more information.
scale_score
: Whether to scale the similarity score to the unit interval (range of [0,1]). If true (default) similarity scores (e.g. cosine or dot_product) which naturally have a different value range will be scaled to a range of [0,1], where 1 means extremely relevant. Otherwise, raw similarity scores (e.g. cosine or dot_product) will be used.
batch_size
: Number of query embeddings to process at once. If not specified, self.batch_size is used.
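The filter semantics described for the filters argument can be tried out without a running Elasticsearch instance. The following is a minimal sketch of an in-memory evaluator. The helper names `matches` and `_compare` are hypothetical and not part of Haystack, and the handling of "$not" (true only if none of the nested conditions match, similar to an Elasticsearch `must_not` clause) is an assumption.

```python
from typing import Any, Dict, List, Union

LOGICAL_OPERATORS = {"$and", "$or", "$not"}


def _compare(op: str, actual: Any, expected: Any) -> bool:
    """Apply a single comparison operator to one metadata value."""
    if actual is None:  # the field is missing from the metadata
        return False
    if op == "$eq":
        return actual == expected
    if op == "$in":
        return actual in expected
    if op == "$gt":
        return actual > expected
    if op == "$gte":
        return actual >= expected
    if op == "$lt":
        return actual < expected
    if op == "$lte":
        return actual <= expected
    raise ValueError(f"Unsupported comparison operator: {op}")


def matches(
    meta: Dict[str, Any],
    filters: Union[Dict[str, Any], List[Dict[str, Any]]],
    logic: str = "$and",
) -> bool:
    """Return True if `meta` satisfies `filters`, combined with `logic`."""
    # A list of dictionaries allows the same operator several times on one level.
    clauses = filters if isinstance(filters, list) else [filters]
    results = []
    for clause in clauses:
        for key, value in clause.items():
            if key in LOGICAL_OPERATORS:
                results.append(matches(meta, value, key))
            elif isinstance(value, dict):
                # Field name mapped to comparison operators; all must hold.
                results.append(all(_compare(op, meta.get(key), v) for op, v in value.items()))
            elif isinstance(value, list):
                results.append(_compare("$in", meta.get(key), value))  # default for lists
            else:
                results.append(_compare("$eq", meta.get(key), value))  # default for single values
    if logic == "$or":
        return any(results)
    if logic == "$not":
        return not any(results)  # assumption: none of the conditions may match
    return all(results)


doc_meta = {"type": "article", "date": "2017-05-01", "rating": 4, "genre": "economy"}
simple_filters = {
    "type": "article",
    "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},  # ISO dates compare correctly as strings
    "rating": {"$gte": 3},
    "$or": {"genre": ["economy", "politics"], "publisher": "nytimes"},
}
print(matches(doc_meta, simple_filters))
```

In the real DocumentStore the filters are translated into an Elasticsearch query instead of being evaluated in Python; the sketch only mirrors the documented defaults ("$and" between keys, "$eq" for single values, "$in" for lists).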
ElasticsearchDocumentStore.update_embeddings
def update_embeddings(retriever: DenseRetriever,
index: Optional[str] = None,
filters: Optional[FilterType] = None,
update_existing_embeddings: bool = True,
batch_size: Optional[int] = None,
headers: Optional[Dict[str, str]] = None)
Updates the embeddings in the document store using the encoding model specified in the retriever.
This can be useful if you want to add or change the embeddings for your documents (e.g. after changing the retriever config).
Arguments:
retriever
: Retriever to use to update the embeddings.
index
: Index name to update
update_existing_embeddings
: Whether to update existing embeddings of the documents. If set to False, only documents without embeddings are processed. This mode can be used for incremental updating of embeddings, wherein only newly indexed documents get processed.
filters
: Optional filters to narrow down the documents for which embeddings are to be updated. Filters are defined as nested dictionaries. The keys of the dictionaries can be a logical operator ("$and", "$or", "$not"), a comparison operator ("$eq", "$in", "$gt", "$gte", "$lt", "$lte") or a metadata field name. Logical operator keys take a dictionary of metadata field names and/or logical operators as value. Metadata field names take a dictionary of comparison operators as value. Comparison operator keys take a single value or (in case of "$in") a list of values as value. If no logical operator is provided, "$and" is used as default operation. If no comparison operator is provided, "$eq" (or "$in" if the comparison value is a list) is used as default operation.
__Example__:
```python
filters = {
    "$and": {
        "type": {"$eq": "article"},
        "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
        "rating": {"$gte": 3},
        "$or": {
            "genre": {"$in": ["economy", "politics"]},
            "publisher": {"$eq": "nytimes"}
        }
    }
}
```
batch_size
: When working with a large number of documents, batching can help reduce memory footprint.
headers
: Custom HTTP headers to pass to the client (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='}) Check out https://www.elastic.co/guide/en/elasticsearch/reference/current/http-clients.html for more information.
Returns:
None
ElasticsearchDocumentStore.delete_all_documents
def delete_all_documents(index: Optional[str] = None,
filters: Optional[FilterType] = None,
headers: Optional[Dict[str, str]] = None)
Delete documents in an index. All documents are deleted if no filters are passed.
Arguments:
index
: Index name to delete the document from.
filters
: Optional filters to narrow down the documents to be deleted. Filters are defined as nested dictionaries. The keys of the dictionaries can be a logical operator ("$and", "$or", "$not"), a comparison operator ("$eq", "$in", "$gt", "$gte", "$lt", "$lte") or a metadata field name. Logical operator keys take a dictionary of metadata field names and/or logical operators as value. Metadata field names take a dictionary of comparison operators as value. Comparison operator keys take a single value or (in case of "$in") a list of values as value. If no logical operator is provided, "$and" is used as default operation. If no comparison operator is provided, "$eq" (or "$in" if the comparison value is a list) is used as default operation.
__Example__:
```python
filters = {
    "$and": {
        "type": {"$eq": "article"},
        "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
        "rating": {"$gte": 3},
        "$or": {
            "genre": {"$in": ["economy", "politics"]},
            "publisher": {"$eq": "nytimes"}
        }
    }
}
```
headers
: Custom HTTP headers to pass to the client (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='}) Check out https://www.elastic.co/guide/en/elasticsearch/reference/current/http-clients.html for more information.
Returns:
None
ElasticsearchDocumentStore.delete_documents
def delete_documents(index: Optional[str] = None,
ids: Optional[List[str]] = None,
filters: Optional[FilterType] = None,
headers: Optional[Dict[str, str]] = None)
Delete documents in an index. All documents are deleted if no filters are passed.
Arguments:
index
: Index name to delete the documents from. If None, the DocumentStore's default index (self.index) will be used
ids
: Optional list of IDs to narrow down the documents to be deleted.
filters
: Optional filters to narrow down the documents to be deleted. Filters are defined as nested dictionaries. The keys of the dictionaries can be a logical operator ("$and", "$or", "$not"), a comparison operator ("$eq", "$in", "$gt", "$gte", "$lt", "$lte") or a metadata field name. Logical operator keys take a dictionary of metadata field names and/or logical operators as value. Metadata field names take a dictionary of comparison operators as value. Comparison operator keys take a single value or (in case of "$in") a list of values as value. If no logical operator is provided, "$and" is used as default operation. If no comparison operator is provided, "$eq" (or "$in" if the comparison value is a list) is used as default operation.
__Example__:
```python
filters = {
    "$and": {
        "type": {"$eq": "article"},
        "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
        "rating": {"$gte": 3},
        "$or": {
            "genre": {"$in": ["economy", "politics"]},
            "publisher": {"$eq": "nytimes"}
        }
    }
}
```
If filters are provided along with a list of IDs, this method deletes the intersection of the two query results (documents that match the filters and have their ID in the list).
headers
: Custom HTTP headers to pass to the client (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='}) Check out https://www.elastic.co/guide/en/elasticsearch/reference/current/http-clients.html for more information.
Returns:
None
ElasticsearchDocumentStore.delete_labels
def delete_labels(index: Optional[str] = None,
ids: Optional[List[str]] = None,
filters: Optional[FilterType] = None,
headers: Optional[Dict[str, str]] = None)
Delete labels in an index. All labels are deleted if no filters are passed.
Arguments:
index
: Index name to delete the labels from. If None, the DocumentStore's default label index (self.label_index) will be used
ids
: Optional list of IDs to narrow down the labels to be deleted.
filters
: Optional filters to narrow down the labels to be deleted. Filters are defined as nested dictionaries. The keys of the dictionaries can be a logical operator ("$and", "$or", "$not"), a comparison operator ("$eq", "$in", "$gt", "$gte", "$lt", "$lte") or a metadata field name. Logical operator keys take a dictionary of metadata field names and/or logical operators as value. Metadata field names take a dictionary of comparison operators as value. Comparison operator keys take a single value or (in case of "$in") a list of values as value. If no logical operator is provided, "$and" is used as default operation. If no comparison operator is provided, "$eq" (or "$in" if the comparison value is a list) is used as default operation.
__Example__:
```python
filters = {
    "$and": {
        "type": {"$eq": "article"},
        "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
        "rating": {"$gte": 3},
        "$or": {
            "genre": {"$in": ["economy", "politics"]},
            "publisher": {"$eq": "nytimes"}
        }
    }
}
```
headers
: Custom HTTP headers to pass to the client (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='}) Check out https://www.elastic.co/guide/en/elasticsearch/reference/current/http-clients.html for more information.
Returns:
None
ElasticsearchDocumentStore.delete_index
def delete_index(index: str)
Delete an existing search index. The index including all data will be removed.
Arguments:
index
: The name of the index to delete.
Returns:
None
ElasticsearchDocumentStore.get_all_labels_aggregated
def get_all_labels_aggregated(
index: Optional[str] = None,
filters: Optional[FilterType] = None,
open_domain: bool = True,
drop_negative_labels: bool = False,
drop_no_answers: bool = False,
aggregate_by_meta: Optional[Union[str, list]] = None,
headers: Optional[Dict[str, str]] = None) -> List[MultiLabel]
Return all labels in the DocumentStore, aggregated into MultiLabel objects.
This aggregation step helps, for example, if you collected multiple possible answers for one question and now want all answers bundled together in one place for evaluation. How they are aggregated is defined by the open_domain and aggregate_by_meta parameters.
- If the questions are being asked to a single document (i.e. SQuAD style), you should set open_domain=False to aggregate by question and document.
- If the questions are being asked to your full collection of documents, you should set open_domain=True to aggregate just by question.
- If the questions are being asked to a subslice of your document set (e.g. product review use cases), you should set open_domain=True and populate aggregate_by_meta with the names of Label meta fields to aggregate by question and your custom meta fields.
For example, in a product review use case, you might set aggregate_by_meta=["product_id"] so that Labels with the same question but different answers from different documents are aggregated into one MultiLabel object, provided that they have the same product_id (to be found in Label.meta["product_id"]).
Arguments:
index
: Name of the index to get the labels from. If None, the DocumentStore's default index (self.index) will be used.
filters
: Optional filters to narrow down the search space to documents whose metadata fulfill certain conditions. Filters are defined as nested dictionaries. The keys of the dictionaries can be a logical operator ("$and", "$or", "$not"), a comparison operator ("$eq", "$in", "$gt", "$gte", "$lt", "$lte") or a metadata field name. Logical operator keys take a dictionary of metadata field names and/or logical operators as value. Metadata field names take a dictionary of comparison operators as value. Comparison operator keys take a single value or (in case of "$in") a list of values as value. If no logical operator is provided, "$and" is used as default operation. If no comparison operator is provided, "$eq" (or "$in" if the comparison value is a list) is used as default operation.
__Example__:
```python
filters = {
    "$and": {
        "type": {"$eq": "article"},
        "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
        "rating": {"$gte": 3},
        "$or": {
            "genre": {"$in": ["economy", "politics"]},
            "publisher": {"$eq": "nytimes"}
        }
    }
}
```
open_domain
: When True, labels are aggregated purely based on the question text alone. When False, labels are aggregated in a closed domain fashion based on the question text and also the id of the document that the label is tied to. In this setting, this function might return multiple MultiLabel objects with the same question string.
headers
: Custom HTTP headers to pass to document store client if supported (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='} for basic authentication)
aggregate_by_meta
: The names of the Label meta fields by which to aggregate. For example: ["product_id"]
drop_negative_labels
: When True, labels with incorrect answers and documents are dropped.
drop_no_answers
: When True, labels with no answers are dropped.
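The effect of open_domain and aggregate_by_meta on the grouping can be sketched with plain dictionaries. This is an illustration only: `group_labels` is a hypothetical helper, the keys "query", "document_id" and "meta" are stand-ins for the corresponding Label attributes, and the real method returns MultiLabel objects rather than lists.

```python
from collections import defaultdict
from typing import Any, Dict, List, Optional, Tuple


def group_labels(
    labels: List[Dict[str, Any]],
    open_domain: bool = True,
    aggregate_by_meta: Optional[List[str]] = None,
) -> Dict[Tuple, List[Dict[str, Any]]]:
    """Group labels by the key that each aggregation mode implies."""
    groups: Dict[Tuple, List[Dict[str, Any]]] = defaultdict(list)
    for label in labels:
        key = [label["query"]]  # always aggregate by question text
        if not open_domain:
            key.append(label["document_id"])  # closed domain: also split by document
        for field in aggregate_by_meta or []:
            key.append(label.get("meta", {}).get(field))  # custom meta fields
        groups[tuple(key)].append(label)
    return dict(groups)


labels = [
    {"query": "Is it loud?", "document_id": "d1", "meta": {"product_id": "p1"}},
    {"query": "Is it loud?", "document_id": "d2", "meta": {"product_id": "p1"}},
    {"query": "Is it loud?", "document_id": "d3", "meta": {"product_id": "p2"}},
]
for key, members in group_labels(labels, aggregate_by_meta=["product_id"]).items():
    print(key, len(members))
```

With open_domain=True and no meta fields, all three labels fall into one group; adding "product_id" splits them per product, and open_domain=False splits them per document.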
ElasticsearchDocumentStore.normalize_embedding
@staticmethod
def normalize_embedding(emb: np.ndarray) -> None
Performs L2 normalization of embeddings vector inplace. Input can be a single vector (1D array) or a matrix (2D array).
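As a rough illustration of what in-place L2 normalization means for 1D and 2D inputs, here is a minimal NumPy sketch. `l2_normalize_inplace` is a hypothetical helper and not necessarily how Haystack implements the method; leaving all-zero vectors unchanged is an assumption made to avoid division by zero.

```python
import numpy as np


def l2_normalize_inplace(emb: np.ndarray) -> None:
    """Scale a float vector, or each row of a float matrix, to unit L2 norm in place."""
    if emb.ndim == 1:
        norm = np.linalg.norm(emb)
        if norm != 0.0:
            emb /= norm  # modifies the caller's array; requires a float dtype
    else:
        norms = np.linalg.norm(emb, axis=1, keepdims=True)
        norms[norms == 0.0] = 1.0  # assumption: leave all-zero rows untouched
        emb /= norms


vector = np.array([3.0, 4.0])
l2_normalize_inplace(vector)
print(vector, np.linalg.norm(vector))
```

Because the function returns None and mutates its argument, pass a copy if the original values are still needed.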
ElasticsearchDocumentStore.add_eval_data
def add_eval_data(filename: str,
doc_index: str = "eval_document",
label_index: str = "label",
batch_size: Optional[int] = None,
preprocessor: Optional[PreProcessor] = None,
max_docs: Optional[Union[int, bool]] = None,
open_domain: bool = False,
headers: Optional[Dict[str, str]] = None)
Adds a SQuAD-formatted file to the DocumentStore in order to be able to perform evaluation on it.
If a jsonl file and a batch_size is passed to the function, documents are loaded batchwise from disk and also indexed batchwise to the DocumentStore in order to prevent out of memory errors.
Arguments:
filename
: Name of the file containing evaluation data (json or jsonl)
doc_index
: Elasticsearch index where evaluation documents should be stored
label_index
: Elasticsearch index where labeled questions should be stored
batch_size
: Optional number of documents that are loaded and processed at a time. When set to None (default) all documents are processed at once.
preprocessor
: Optional PreProcessor to preprocess evaluation documents. It can be used for splitting documents into passages (and assigning labels to corresponding passages). Currently the PreProcessor does not support split_by sentence, cleaning nor split_overlap != 0. When set to None (default) preprocessing is disabled.
max_docs
: Optional number of documents that will be loaded. When set to None (default) all available eval documents are used.
open_domain
: Set this to True if your file is an open domain dataset where two different answers to the same question might be found in different contexts.
headers
: Custom HTTP headers to pass to document store client if supported (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='} for basic authentication)
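The batchwise loading of a jsonl file mentioned above keeps only one batch of records in memory at a time. A minimal sketch of that idea, assuming one JSON object per line; `read_jsonl_in_batches` is a hypothetical helper, not the function Haystack uses internally.

```python
import json
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List


def read_jsonl_in_batches(lines: Iterable[str], batch_size: int) -> Iterator[List[Dict[str, Any]]]:
    """Yield lists of at most `batch_size` parsed records from an iterable of JSON lines."""
    # The generator parses lazily, so only the current batch is held in memory.
    records = (json.loads(line) for line in lines if line.strip())
    while True:
        batch = list(islice(records, batch_size))
        if not batch:
            return
        yield batch


# Any iterable of lines works, including an open file handle.
example_lines = ['{"id": 1}', '{"id": 2}', "", '{"id": 3}']
for batch in read_jsonl_in_batches(example_lines, batch_size=2):
    print(len(batch), batch)
```

Each yielded batch could then be indexed before the next one is read, which is what prevents out-of-memory errors on large files.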
ElasticsearchDocumentStore.run
def run(documents: List[Union[dict, Document]],
index: Optional[str] = None,
headers: Optional[Dict[str, str]] = None,
id_hash_keys: Optional[List[str]] = None)
Run requests of document stores
Comment: We will gradually introduce the primitives. The document stores also accept dicts and parse them to documents. In the future, however, only documents themselves will be accepted. Parsing the dictionaries in the run function is therefore only an interim solution until the run function also accepts documents.
Arguments:
documents
: A list of dicts that are documents.
headers
: A list of headers.
index
: Optional name of index where the documents shall be written to. If None, the DocumentStore's default index (self.index) will be used.
id_hash_keys
: List of the fields that the hashes of the ids are generated from.
ElasticsearchDocumentStore.describe_documents
def describe_documents(index=None)
Return a summary of the documents in the document store
Module es8
ElasticsearchDocumentStore
class ElasticsearchDocumentStore(_ElasticsearchDocumentStore)
ElasticsearchDocumentStore.__init__
def __init__(host: Union[str, List[str]] = "localhost",
port: Union[int, List[int]] = 9200,
username: str = "",
password: str = "",
api_key_id: Optional[str] = None,
api_key: Optional[str] = None,
aws4auth=None,
index: str = "document",
label_index: str = "label",
search_fields: Union[str, list] = "content",
content_field: str = "content",
name_field: str = "name",
embedding_field: str = "embedding",
embedding_dim: int = 768,
custom_mapping: Optional[dict] = None,
excluded_meta_data: Optional[list] = None,
analyzer: str = "standard",
scheme: str = "http",
ca_certs: Optional[str] = None,
verify_certs: bool = True,
recreate_index: bool = False,
create_index: bool = True,
refresh_type: str = "wait_for",
similarity: str = "dot_product",
timeout: int = 300,
return_embedding: bool = False,
duplicate_documents: str = "overwrite",
scroll: str = "1d",
skip_missing_embeddings: bool = True,
synonyms: Optional[List] = None,
synonym_type: str = "synonym",
use_system_proxy: bool = False,
batch_size: int = 10_000)
A DocumentStore using Elasticsearch to store and query the documents for our search.
- Keeps all the logic to store and query documents from Elastic, incl. mapping of fields, adding filters or boosts to your queries, and storing embeddings
- You can either use an existing Elasticsearch index or create a new one via haystack
- Retrievers operate on top of this DocumentStore to find the relevant documents for a query
Arguments:
host
: url(s) of elasticsearch nodes
port
: port(s) of elasticsearch nodes
username
: username (standard authentication via http_auth)
password
: password (standard authentication via http_auth)
api_key_id
: ID of the API key (alternative authentication mode to the above http_auth)
api_key
: Secret value of the API key (alternative authentication mode to the above http_auth)
aws4auth
: Authentication for usage with aws elasticsearch (can be generated with the requests-aws4auth package)
index
: Name of index in elasticsearch to use for storing the documents that we want to search. If not existing yet, we will create one.
label_index
: Name of index in elasticsearch to use for storing labels. If not existing yet, we will create one.
search_fields
: Name of fields used by BM25Retriever to find matches in the docs to our incoming query (using elastic's multi_match query), e.g. ["title", "full_text"]
content_field
: Name of field that might contain the answer and will therefore be passed to the Reader Model (e.g. "full_text"). If no Reader is used (e.g. in FAQ-Style QA) the plain content of this field will just be returned.
name_field
: Name of field that contains the title of the doc
embedding_field
: Name of field containing an embedding vector (Only needed when using a dense retriever (e.g. DensePassageRetriever, EmbeddingRetriever) on top)
embedding_dim
: Dimensionality of embedding vector (Only needed when using a dense retriever (e.g. DensePassageRetriever, EmbeddingRetriever) on top)
custom_mapping
: If you want to use your own custom mapping for creating a new index in Elasticsearch, you can supply it here as a dictionary.
analyzer
: Specify the default analyzer from one of the built-ins when creating a new Elasticsearch Index. Elasticsearch also has built-in analyzers for different languages (e.g. impacting tokenization). More info at: https://www.elastic.co/guide/en/elasticsearch/reference/7.9/analysis-analyzers.html
excluded_meta_data
: Name of fields in Elasticsearch that should not be returned (e.g. [field_one, field_two]). Helpful if you have fields with long, irrelevant content that you don't want to display in results (e.g. embedding vectors).
scheme
: 'https' or 'http', protocol used to connect to your elasticsearch instance
ca_certs
: Root certificates for SSL: it is a path to certificate authority (CA) certs on disk. You can use certifi package with certifi.where() to find where the CA certs file is located in your machine.
verify_certs
: Whether to be strict about ca certificates
recreate_index
: If set to True, an existing elasticsearch index will be deleted and a new one will be created using the config you are using for initialization. Be aware that all data in the old index will be lost if you choose to recreate the index. Be aware that both the document_index and the label_index will be recreated.
create_index
: Whether to try creating a new index (if an index of that name already exists, we will just continue in any case). Deprecated since 2.0: in the next major version we will always try to create an index if there is no existing index (the current behaviour when create_index=True). If you are looking to recreate an existing index by deleting it first if it already exists, use the recreate_index parameter.
refresh_type
: Type of ES refresh used to control when changes made by a request (e.g. bulk) are made visible to search. If set to 'wait_for', continue only after changes are visible (slow, but safe). If set to 'false', continue directly (fast, but sometimes unintuitive behaviour when docs are not immediately available after ingestion). More info at https://www.elastic.co/guide/en/elasticsearch/reference/6.8/docs-refresh.html
similarity
: The similarity function used to compare document vectors. 'dot_product' is the default since it is more performant with DPR embeddings. 'cosine' is recommended if you are using a Sentence BERT model.
timeout
: Number of seconds after which an ElasticSearch request times out.
return_embedding
: To return document embedding
duplicate_documents
: Handle duplicate documents based on parameter options. Parameter options: ('skip', 'overwrite', 'fail'). skip: Ignore the duplicate documents. overwrite: Update any existing documents with the same ID when adding documents. fail: Raise an error if the document ID of the document being added already exists.
scroll
: Determines how long the current index is fixed, e.g. during updating all documents with embeddings. Defaults to "1d" and should not be larger than this. Can also be in minutes "5m" or hours "15h". For details, see https://www.elastic.co/guide/en/elasticsearch/reference/current/scroll-api.html
skip_missing_embeddings
: Parameter to control queries based on vector similarity when indexed documents miss embeddings. Parameter options: (True, False). False: Raises an exception if one or more documents do not have embeddings at query time. True: Query will ignore all documents without embeddings (recommended if you concurrently index and query).
synonyms
: List of synonyms that can be passed when initializing elasticsearch. For example: [ "foo, bar => baz", "foozball , foosball" ] More info at https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-synonym-tokenfilter.html
synonym_type
: Synonym filter type: 'synonym' or 'synonym_graph'. The graph variant handles synonyms, including multi-word synonyms, correctly during the analysis process. More info at https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-synonym-graph-tokenfilter.html
use_system_proxy
: Whether to use system proxy.
batch_size
: Number of Documents to index at once / Number of queries to execute at once. If you face memory issues, decrease the batch_size.
ElasticsearchDocumentStore.query_by_embedding
def query_by_embedding(query_emb: np.ndarray,
filters: Optional[FilterType] = None,
top_k: int = 10,
index: Optional[str] = None,
return_embedding: Optional[bool] = None,
headers: Optional[Dict[str, str]] = None,
scale_score: bool = True) -> List[Document]
Find the documents that are most similar to the provided query_emb by using a vector similarity metric.
Arguments:
query_emb
: Embedding of the query (e.g. gathered from DPR)
filters
: Optional filters to narrow down the search space to documents whose metadata fulfill certain conditions. Filters are defined as nested dictionaries. The keys of the dictionaries can be a logical operator ("$and", "$or", "$not"), a comparison operator ("$eq", "$in", "$gt", "$gte", "$lt", "$lte") or a metadata field name. Logical operator keys take a dictionary of metadata field names and/or logical operators as value. Metadata field names take a dictionary of comparison operators as value. Comparison operator keys take a single value or (in case of "$in") a list of values as value. If no logical operator is provided, "$and" is used as default operation. If no comparison operator is provided, "$eq" (or "$in" if the comparison value is a list) is used as default operation.
__Example__:
```python
filters = {
    "$and": {
        "type": {"$eq": "article"},
        "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
        "rating": {"$gte": 3},
        "$or": {
            "genre": {"$in": ["economy", "politics"]},
            "publisher": {"$eq": "nytimes"}
        }
    }
}
# or simpler using default operators
filters = {
    "type": "article",
    "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
    "rating": {"$gte": 3},
    "$or": {
        "genre": ["economy", "politics"],
        "publisher": "nytimes"
    }
}
```
To use the same logical operator multiple times on the same level, logical operators take optionally a list of dictionaries as value.
__Example__:
```python
filters = {
    "$or": [
        {"$and": {"Type": "News Paper", "Date": {"$lt": "2019-01-01"}}},
        {"$and": {"Type": "Blog Post", "Date": {"$gte": "2019-01-01"}}}
    ]
}
```
top_k
: How many documents to return
index
: Index name for storing the docs and metadata
return_embedding
: To return document embedding
headers
: Custom HTTP headers to pass to elasticsearch client (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='}) Check out https://www.elastic.co/guide/en/elasticsearch/reference/current/http-clients.html for more information.
scale_score
: Whether to scale the similarity score to the unit interval (range of [0,1]). If true (default) similarity scores (e.g. cosine or dot_product) which naturally have a different value range will be scaled to a range of [0,1], where 1 means extremely relevant. Otherwise raw similarity scores (e.g. cosine or dot_product) will be used.
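To make the idea behind scale_score concrete, the sketch below maps raw scores to the unit interval. The exact formulas are assumptions and are not stated in this reference: cosine similarity, which lies in [-1, 1], is shifted and halved, and the unbounded dot product is squashed with a logistic function after dividing by 100. `scale_to_unit_interval` is used here as an illustrative name only.

```python
import math


def scale_to_unit_interval(score: float, similarity: str) -> float:
    """Map a raw similarity score to [0, 1] (assumed formulas, for illustration)."""
    if similarity == "cosine":
        # Cosine similarity lies in [-1, 1]; shift and rescale it linearly.
        return (score + 1.0) / 2.0
    # Dot products are unbounded; a logistic function squashes them into (0, 1).
    return 1.0 / (1.0 + math.exp(-score / 100.0))


for raw, similarity in [(0.8, "cosine"), (-1.0, "cosine"), (0.0, "dot_product"), (250.0, "dot_product")]:
    print(similarity, raw, round(scale_to_unit_interval(raw, similarity), 3))
```

Scaled scores are easier to threshold across models, while raw scores (scale_score=False) preserve the original magnitudes.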
ElasticsearchDocumentStore.get_document_by_id
def get_document_by_id(
id: str,
index: Optional[str] = None,
headers: Optional[Dict[str, str]] = None) -> Optional[Document]
Fetch a document by specifying its text id string
ElasticsearchDocumentStore.get_documents_by_id
def get_documents_by_id(
ids: List[str],
index: Optional[str] = None,
batch_size: int = 10_000,
headers: Optional[Dict[str, str]] = None) -> List[Document]
Fetch documents by specifying a list of text id strings.
Arguments:
ids
: List of document IDs. Be aware that passing a large number of ids might lead to performance issues.
index
: search index where the documents are stored. If not supplied, self.index will be used.
batch_size
: Maximum number of results for each query. Limited to 10,000 documents by default. To reduce the pressure on the cluster, you can lower this limit, at the expense of longer retrieval times.
headers
: Custom HTTP headers to pass to the client (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='}) Check out https://www.elastic.co/guide/en/elasticsearch/reference/current/http-clients.html for more information.
ElasticsearchDocumentStore.get_metadata_values_by_key
def get_metadata_values_by_key(key: str,
query: Optional[str] = None,
filters: Optional[FilterType] = None,
index: Optional[str] = None,
headers: Optional[Dict[str, str]] = None,
batch_size: int = 10) -> List[dict]
Get values associated with a metadata key. The output is in the format:
[{"value": "my-value-1", "count": 23}, {"value": "my-value-2", "count": 12}, ... ]
Arguments:
key
: the meta key name to get the values for.
query
: narrow down the scope to documents matching the query string.
filters
: Narrow down the scope to documents that match the given filters. Filters are defined as nested dictionaries. The keys of the dictionaries can be a logical operator ("$and", "$or", "$not"), a comparison operator ("$eq", "$in", "$gt", "$gte", "$lt", "$lte") or a metadata field name. Logical operator keys take a dictionary of metadata field names and/or logical operators as value. Metadata field names take a dictionary of comparison operators as value. Comparison operator keys take a single value or (in case of "$in") a list of values as value. If no logical operator is provided, "$and" is used as default operation. If no comparison operator is provided, "$eq" (or "$in" if the comparison value is a list) is used as default operation.
__Example__:
```python
filters = {
    "$and": {
        "type": {"$eq": "article"},
        "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
        "rating": {"$gte": 3},
        "$or": {
            "genre": {"$in": ["economy", "politics"]},
            "publisher": {"$eq": "nytimes"}
        }
    }
}
```
index
: search index where the meta values should be searched. If not supplied, self.index will be used.
headers
: Custom HTTP headers to pass to the client (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='}) Check out https://www.elastic.co/guide/en/elasticsearch/reference/current/http-clients.html for more information.
batch_size
: Maximum number of results for each request. Limited to 10 values by default. You can increase this limit to decrease retrieval time. To reduce the pressure on the cluster, you shouldn't set this higher than 1,000.
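The output format shown above (one entry per distinct value, with its document count) can be reproduced locally for a handful of documents. A minimal sketch, assuming documents are plain dictionaries with a "meta" key; `metadata_value_counts` is a hypothetical helper, and the real method computes the counts inside Elasticsearch.

```python
from collections import Counter
from typing import Any, Dict, List


def metadata_value_counts(docs: List[Dict[str, Any]], key: str) -> List[Dict[str, Any]]:
    """Count how often each value of `key` occurs in the documents' metadata."""
    counts = Counter(doc["meta"][key] for doc in docs if key in doc.get("meta", {}))
    # Most frequent values first, in the same shape as the documented output.
    return [{"value": value, "count": count} for value, count in counts.most_common()]


docs = [
    {"content": "a", "meta": {"type": "news"}},
    {"content": "b", "meta": {"type": "blog"}},
    {"content": "c", "meta": {"type": "news"}},
    {"content": "d", "meta": {}},
]
print(metadata_value_counts(docs, "type"))
```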
ElasticsearchDocumentStore.write_documents
def write_documents(documents: Union[List[dict], List[Document]],
index: Optional[str] = None,
batch_size: Optional[int] = None,
duplicate_documents: Optional[str] = None,
headers: Optional[Dict[str, str]] = None)
Indexes documents for later queries.
If a document with the same ID already exists:
a) (Default) Manage duplication according to the duplicate_documents parameter.
b) If self.update_existing_documents=True for DocumentStore: Overwrite existing documents.
(This is only relevant if you pass your own ID when initializing a Document. If you don't set custom IDs for your Documents or just pass a list of dictionaries here, they automatically get UUIDs assigned. See the Document class for details.)
Arguments:
documents
: A list of Python dictionaries or a list of Haystack Document objects. For documents as dictionaries, the format is {"content": ""}. Optionally: Include meta data via {"content": "", "meta": {"name": "", "author": "somebody", ...}}. You can use it for filtering and you can access it in the responses of the Finder. Advanced: If you are using your own field mapping, change the key names in the dictionary to what you have set for self.content_field and self.name_field.
index
: search index where the documents should be indexed. If you don't specify it, self.index is used.
batch_size
: Number of documents that are passed to the bulk function at each round. If not specified, self.batch_size is used.
duplicate_documents
: Handle duplicate documents based on parameter options. Parameter options: ('skip', 'overwrite', 'fail'). skip: Ignore the duplicate documents. overwrite: Update any existing documents with the same ID when adding documents. fail: Raises an error if the document ID of the document being added already exists.
headers
: Custom HTTP headers to pass to the client (for example {'Authorization': 'Basic YWRtaW46cm9vdA=='}) For more information, see HTTP/REST clients and security.
Raises:
DuplicateDocumentError
: Exception triggered when a duplicate document is found
Returns:
None
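The three duplicate_documents options can be illustrated with an in-memory dictionary standing in for the index. This is a simplified sketch: `write_with_duplicate_policy` and the local `DuplicateDocumentError` class are hypothetical stand-ins, not Haystack's implementation, and documents are assumed to be dictionaries with an "id" key.

```python
from typing import Any, Dict, List


class DuplicateDocumentError(ValueError):
    """Local stand-in for the exception raised on duplicate IDs."""


def write_with_duplicate_policy(
    store: Dict[str, Dict[str, Any]],
    documents: List[Dict[str, Any]],
    duplicate_documents: str = "overwrite",
) -> None:
    """Insert documents into `store`, resolving ID clashes by the given policy."""
    if duplicate_documents not in ("skip", "overwrite", "fail"):
        raise ValueError(f"Unknown duplicate_documents option: {duplicate_documents}")
    for doc in documents:
        if doc["id"] in store:
            if duplicate_documents == "skip":
                continue  # keep the stored version, ignore the new one
            if duplicate_documents == "fail":
                raise DuplicateDocumentError(f"Document with id {doc['id']} already exists")
        store[doc["id"]] = doc  # new document, or "overwrite" of an existing one


store: Dict[str, Dict[str, Any]] = {}
write_with_duplicate_policy(store, [{"id": "1", "content": "first"}])
write_with_duplicate_policy(store, [{"id": "1", "content": "second"}], "skip")
print(store["1"]["content"])
```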
ElasticsearchDocumentStore.write_labels
def write_labels(labels: Union[List[Label], List[dict]],
index: Optional[str] = None,
headers: Optional[Dict[str, str]] = None,
batch_size: int = 10_000)
Write annotation labels into document store.
Arguments:
labels
: A list of Python dictionaries or a list of Haystack Label objects.
index
: search index where the labels should be stored. If not supplied, self.label_index will be used.
batch_size
: Number of labels that are passed to the bulk function at each round.
headers
: Custom HTTP headers to pass to the client (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='}) Check out https://www.elastic.co/guide/en/elasticsearch/reference/current/http-clients.html for more information.
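The example header value used throughout this reference, 'Basic YWRtaW46cm9vdA==', is the Base64 encoding of the credentials admin:root. A small sketch of how such a headers dictionary could be built with the standard library; `basic_auth_header` is a hypothetical helper name.

```python
import base64
from typing import Dict


def basic_auth_header(username: str, password: str) -> Dict[str, str]:
    """Build an HTTP Basic Authorization header from a username and password."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return {"Authorization": f"Basic {token}"}


headers = basic_auth_header("admin", "root")
print(headers)  # {'Authorization': 'Basic YWRtaW46cm9vdA=='}
```

The resulting dictionary can be passed as the headers argument of the methods documented here. Base64 is an encoding, not encryption, so such headers should only be sent over HTTPS.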
ElasticsearchDocumentStore.update_document_meta
def update_document_meta(id: str,
meta: Dict[str, str],
index: Optional[str] = None,
headers: Optional[Dict[str, str]] = None)
Update the metadata dictionary of a document by specifying its string id
ElasticsearchDocumentStore.get_document_count
def get_document_count(filters: Optional[FilterType] = None,
index: Optional[str] = None,
only_documents_without_embedding: bool = False,
headers: Optional[Dict[str, str]] = None) -> int
Return the number of documents in the document store.
ElasticsearchDocumentStore.get_label_count
def get_label_count(index: Optional[str] = None,
headers: Optional[Dict[str, str]] = None) -> int
Return the number of labels in the document store
ElasticsearchDocumentStore.get_embedding_count
def get_embedding_count(index: Optional[str] = None,
filters: Optional[FilterType] = None,
headers: Optional[Dict[str, str]] = None) -> int
Return the count of embeddings in the document store.
ElasticsearchDocumentStore.get_all_documents
def get_all_documents(
index: Optional[str] = None,
filters: Optional[FilterType] = None,
return_embedding: Optional[bool] = None,
batch_size: int = 10_000,
headers: Optional[Dict[str, str]] = None) -> List[Document]
Get documents from the document store.
Arguments:
index
: Name of the index to get the documents from. If None, the DocumentStore's default index (self.index) will be used.
filters
: Optional filters to narrow down the documents to return. Filters are defined as nested dictionaries. The keys of the dictionaries can be a logical operator ("$and", "$or", "$not"), a comparison operator ("$eq", "$in", "$gt", "$gte", "$lt", "$lte") or a metadata field name. Logical operator keys take a dictionary of metadata field names and/or logical operators as value. Metadata field names take a dictionary of comparison operators as value. Comparison operator keys take a single value or (in case of "$in") a list of values as value. If no logical operator is provided, "$and" is used as default operation. If no comparison operator is provided, "$eq" (or "$in" if the comparison value is a list) is used as default operation.
__Example__:
```python
filters = {
    "$and": {
        "type": {"$eq": "article"},
        "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
        "rating": {"$gte": 3},
        "$or": {
            "genre": {"$in": ["economy", "politics"]},
            "publisher": {"$eq": "nytimes"}
        }
    }
}
```
return_embedding
: Whether to return the document embeddings.
batch_size
: When working with large number of documents, batching can help reduce memory footprint.
headers
: Custom HTTP headers to pass to the client (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='}). Check out https://www.elastic.co/guide/en/elasticsearch/reference/current/http-clients.html for more information.
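To make the filter semantics concrete, here is a small plain-Python sketch that evaluates a filter dictionary against a single metadata dictionary. It is an illustration of the rules described for the filters argument, not Haystack's implementation (which translates filters into Elasticsearch queries). The function names matches and _compare are hypothetical, and the treatment of "$not" as the negation of the AND-ed conditions is an assumption.

```python
from typing import Any, Dict, List, Union

Filter = Dict[str, Any]


def _compare(op: str, actual: Any, expected: Any) -> bool:
    """Apply one comparison operator; a missing field never matches."""
    if actual is None:
        return False
    if op == "$eq":
        return actual == expected
    if op == "$in":
        return actual in expected
    if op == "$gt":
        return actual > expected
    if op == "$gte":
        return actual >= expected
    if op == "$lt":
        return actual < expected
    if op == "$lte":
        return actual <= expected
    raise ValueError(f"Unknown comparison operator: {op}")


def matches(meta: Dict[str, Any],
            filters: Union[Filter, List[Filter]],
            logic: str = "$and") -> bool:
    """Return True if `meta` satisfies `filters` (hypothetical helper)."""
    if isinstance(filters, list):
        # A logical operator may take a list of dictionaries.
        results = [matches(meta, f) for f in filters]
    else:
        results = []
        for key, value in filters.items():
            if key in ("$and", "$or", "$not"):
                results.append(matches(meta, value, logic=key))
            elif isinstance(value, dict):
                # Field name with explicit comparison operators.
                results.append(all(_compare(op, meta.get(key), v)
                                   for op, v in value.items()))
            else:
                # Default operator: "$in" for lists, "$eq" otherwise.
                op = "$in" if isinstance(value, list) else "$eq"
                results.append(_compare(op, meta.get(key), value))
    if logic == "$or":
        return any(results)
    if logic == "$not":
        return not all(results)
    return all(results)  # "$and" is the default logical operator


explicit = {
    "$and": {
        "type": {"$eq": "article"},
        "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
        "rating": {"$gte": 3},
        "$or": {
            "genre": {"$in": ["economy", "politics"]},
            "publisher": {"$eq": "nytimes"},
        },
    }
}
# The same filter written with the default operators.
simplified = {
    "type": "article",
    "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
    "rating": {"$gte": 3},
    "$or": {"genre": ["economy", "politics"], "publisher": "nytimes"},
}
meta = {"type": "article", "date": "2018-06-01", "rating": 4,
        "genre": "economy", "publisher": "guardian"}
print(matches(meta, explicit), matches(meta, simplified))
```

Both forms select the same documents, which is the point of the default operators.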
ElasticsearchDocumentStore.get_all_documents_generator
def get_all_documents_generator(
index: Optional[str] = None,
filters: Optional[FilterType] = None,
return_embedding: Optional[bool] = None,
batch_size: int = 10_000,
headers: Optional[Dict[str,
str]] = None) -> Generator[Document, None, None]
Get documents from the document store. Under the hood, documents are fetched in batches from the document store and yielded as individual documents.
This method can be used to iteratively process a large number of documents without having to load all documents in memory.
Arguments:
index
: Name of the index to get the documents from. If None, the DocumentStore's default index (self.index) will be used.
filters
: Optional filters to narrow down the documents to return. Filters are defined as nested dictionaries. The keys of the dictionaries can be a logical operator ("$and", "$or", "$not"), a comparison operator ("$eq", "$in", "$gt", "$gte", "$lt", "$lte") or a metadata field name. Logical operator keys take a dictionary of metadata field names and/or logical operators as value. Metadata field names take a dictionary of comparison operators as value. Comparison operator keys take a single value or (in case of "$in") a list of values as value. If no logical operator is provided, "$and" is used as default operation. If no comparison operator is provided, "$eq" (or "$in" if the comparison value is a list) is used as default operation.
__Example__:
```python
filters = {
    "$and": {
        "type": {"$eq": "article"},
        "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
        "rating": {"$gte": 3},
        "$or": {
            "genre": {"$in": ["economy", "politics"]},
            "publisher": {"$eq": "nytimes"}
        }
    }
}
```
return_embedding
: Whether to return the document embeddings.
batch_size
: When working with large number of documents, batching can help reduce memory footprint.
headers
: Custom HTTP headers to pass to the client (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='}). Check out https://www.elastic.co/guide/en/elasticsearch/reference/current/http-clients.html for more information.
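The batch-then-yield pattern described for this method can be sketched in plain Python. The names iterate_in_batches and fake_fetch are hypothetical, and the offset-based paging over an in-memory list is an assumption made for illustration; the real method pages through Elasticsearch.

```python
from typing import Callable, Generator, List


def iterate_in_batches(
    fetch_batch: Callable[[int, int], List[dict]],
    batch_size: int = 10_000,
) -> Generator[dict, None, None]:
    """Fetch documents batch by batch, but yield them one at a time."""
    offset = 0
    while True:
        batch = fetch_batch(offset, batch_size)
        if not batch:
            return  # nothing left to fetch
        yield from batch
        offset += len(batch)


# Stand-in for a document store: 25 fake documents held in a list.
_FAKE_STORE = [{"id": str(i), "content": f"doc {i}"} for i in range(25)]
calls = []


def fake_fetch(offset: int, size: int) -> List[dict]:
    calls.append((offset, size))
    return _FAKE_STORE[offset:offset + size]


docs = list(iterate_in_batches(fake_fetch, batch_size=10))
print(len(docs), "documents fetched in", len(calls), "requests")
```

Only one batch is held in memory at a time, so a smaller batch_size trades more requests for a smaller memory footprint.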
ElasticsearchDocumentStore.get_all_labels
def get_all_labels(index: Optional[str] = None,
filters: Optional[FilterType] = None,
headers: Optional[Dict[str, str]] = None,
batch_size: int = 10_000) -> List[Label]
Return all labels in the document store
ElasticsearchDocumentStore.query
def query(query: Optional[str],
filters: Optional[FilterType] = None,
top_k: int = 10,
custom_query: Optional[str] = None,
index: Optional[str] = None,
headers: Optional[Dict[str, str]] = None,
all_terms_must_match: bool = False,
scale_score: bool = True) -> List[Document]
Scan through documents in DocumentStore and return a small number of documents
that are most relevant to the query as defined by the BM25 algorithm.
Arguments:
query
: The query.
filters
: Optional filters to narrow down the search space to documents whose metadata fulfill certain conditions. Filters are defined as nested dictionaries. The keys of the dictionaries can be a logical operator ("$and", "$or", "$not"), a comparison operator ("$eq", "$in", "$gt", "$gte", "$lt", "$lte") or a metadata field name. Logical operator keys take a dictionary of metadata field names and/or logical operators as value. Metadata field names take a dictionary of comparison operators as value. Comparison operator keys take a single value or (in case of "$in") a list of values as value. If no logical operator is provided, "$and" is used as default operation. If no comparison operator is provided, "$eq" (or "$in" if the comparison value is a list) is used as default operation.
__Example__:
```python
filters = {
    "$and": {
        "type": {"$eq": "article"},
        "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
        "rating": {"$gte": 3},
        "$or": {
            "genre": {"$in": ["economy", "politics"]},
            "publisher": {"$eq": "nytimes"}
        }
    }
}
# or simpler using default operators
filters = {
    "type": "article",
    "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
    "rating": {"$gte": 3},
    "$or": {
        "genre": ["economy", "politics"],
        "publisher": "nytimes"
    }
}
```
To use the same logical operator multiple times on the same level, logical operators take optionally a list of dictionaries as value.
__Example__:
```python
filters = {
    "$or": [
        {
            "$and": {
                "Type": "News Paper",
                "Date": {
                    "$lt": "2019-01-01"
                }
            }
        },
        {
            "$and": {
                "Type": "Blog Post",
                "Date": {
                    "$gte": "2019-01-01"
                }
            }
        }
    ]
}
```
top_k
: How many documents to return per query.
custom_query
: Query string containing a mandatory ${query} and an optional ${filters} placeholder.
**An example custom_query:**
```
{
    "size": 10,
    "query": {
        "bool": {
            "should": [{"multi_match": {
                "query": ${query},                 // mandatory query placeholder
                "type": "most_fields",
                "fields": ["content", "title"]}}],
            "filter": ${filters}                  // optional filters placeholder
        }
    },
}
```
**For this custom_query, a sample retrieve() could be:**
```python
self.retrieve(query="Why did the revenue increase?",
              filters={"years": ["2019"], "quarters": ["Q1", "Q2"]})
```
Optionally, highlighting can be defined by specifying the highlight settings. See https://www.elastic.co/guide/en/elasticsearch/reference/current/highlighting.html. You will find the highlighted output in the returned Document's meta field by key "highlighted".
**Example custom_query with highlighting:**
```
{
    "size": 10,
    "query": {
        "bool": {
            "should": [{"multi_match": {
                "query": ${query},                 // mandatory query placeholder
                "type": "most_fields",
                "fields": ["content", "title"]}}],
        }
    },
    "highlight": {             // enable highlighting
        "fields": {            // for fields content and title
            "content": {},
            "title": {}
        }
    },
}
```
**For this custom_query, highlighting info can be accessed by:**
```python
docs = self.retrieve(query="Why did the revenue increase?")
highlighted_content = docs[0].meta["highlighted"]["content"]
highlighted_title = docs[0].meta["highlighted"]["title"]
```
index
: The name of the index in the DocumentStore from which to retrieve documents.
headers
: Custom HTTP headers to pass to the client (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='}). Check out https://www.elastic.co/guide/en/elasticsearch/reference/current/http-clients.html for more information.
all_terms_must_match
: Whether all terms of the query must match the document. If true all query terms must be present in a document in order to be retrieved (i.e the AND operator is being used implicitly between query terms: "cozy fish restaurant" -> "cozy AND fish AND restaurant"). Otherwise at least one query term must be present in a document in order to be retrieved (i.e the OR operator is being used implicitly between query terms: "cozy fish restaurant" -> "cozy OR fish OR restaurant"). Defaults to false.
scale_score
: Whether to scale the similarity score to the unit interval (range of [0,1]). If true (default) similarity scores (e.g. cosine or dot_product) which naturally have a different value range will be scaled to a range of [0,1], where 1 means extremely relevant. Otherwise raw similarity scores (e.g. cosine or dot_product) will be used.
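The ${query} and ${filters} placeholders follow the syntax of Python's string.Template. The sketch below shows one way such a template could be filled and parsed into a request body. It is an assumption about the mechanics, not a description of Haystack's internals; render_custom_query is a hypothetical helper, and the filters are passed here as ready-made Elasticsearch clauses. The annotation comments and trailing commas from the examples above are left out because a strict JSON parser rejects them.

```python
import json
from string import Template

custom_query = """
{
    "size": 10,
    "query": {
        "bool": {
            "should": [{"multi_match": {
                "query": ${query},
                "type": "most_fields",
                "fields": ["content", "title"]}}],
            "filter": ${filters}
        }
    }
}
"""


def render_custom_query(template: str, query: str, filters: list) -> dict:
    """Fill both placeholders with JSON-encoded values and parse the result."""
    rendered = Template(template).substitute(
        query=json.dumps(query),      # adds quotes and escapes the string
        filters=json.dumps(filters),  # serialises the filter clauses
    )
    return json.loads(rendered)


body = render_custom_query(
    custom_query,
    query="Why did the revenue increase?",
    filters=[{"terms": {"years": ["2019"]}},
             {"terms": {"quarters": ["Q1", "Q2"]}}],
)
print(json.dumps(body, indent=2))
```

JSON-encoding the query before substitution is what keeps quotes inside the query text from breaking the template.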
ElasticsearchDocumentStore.query_batch
def query_batch(queries: List[str],
filters: Optional[Union[FilterType,
List[Optional[FilterType]]]] = None,
top_k: int = 10,
custom_query: Optional[str] = None,
index: Optional[str] = None,
headers: Optional[Dict[str, str]] = None,
all_terms_must_match: bool = False,
scale_score: bool = True,
batch_size: Optional[int] = None) -> List[List[Document]]
Scan through documents in DocumentStore and return a small number of documents
that are most relevant to the provided queries as defined by keyword matching algorithms like BM25.
This method lets you find relevant documents for list of query strings (output: List of Lists of Documents).
Arguments:
queries
: List of query strings.
filters
: Optional filters to narrow down the search space to documents whose metadata fulfill certain conditions. Can be a single filter that will be applied to each query or a list of filters (one filter per query).
Filters are defined as nested dictionaries. The keys of the dictionaries can be a logical operator ("$and", "$or", "$not"), a comparison operator ("$eq", "$in", "$gt", "$gte", "$lt", "$lte") or a metadata field name.
Logical operator keys take a dictionary of metadata field names and/or logical operators as value. Metadata field names take a dictionary of comparison operators as value. Comparison operator keys take a single value or (in case of "$in") a list of values as value.
If no logical operator is provided, "$and" is used as default operation. If no comparison operator is provided, "$eq" (or "$in" if the comparison value is a list) is used as default operation.
__Example__:
```python
filters = {
    "$and": {
        "type": {"$eq": "article"},
        "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
        "rating": {"$gte": 3},
        "$or": {
            "genre": {"$in": ["economy", "politics"]},
            "publisher": {"$eq": "nytimes"}
        }
    }
}
# or simpler using default operators
filters = {
    "type": "article",
    "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
    "rating": {"$gte": 3},
    "$or": {
        "genre": ["economy", "politics"],
        "publisher": "nytimes"
    }
}
```
To use the same logical operator multiple times on the same level, logical operators take
optionally a list of dictionaries as value.
__Example__:
```python
filters = {
    "$or": [
        {
            "$and": {
                "Type": "News Paper",
                "Date": {
                    "$lt": "2019-01-01"
                }
            }
        },
        {
            "$and": {
                "Type": "Blog Post",
                "Date": {
                    "$gte": "2019-01-01"
                }
            }
        }
    ]
}
```
top_k
: How many documents to return per query.
custom_query
: Custom query to be executed.
index
: The name of the index in the DocumentStore from which to retrieve documents.
headers
: Custom HTTP headers to pass to document store client if supported (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='} for basic authentication).
all_terms_must_match
: Whether all terms of the query must match the document. If true all query terms must be present in a document in order to be retrieved (i.e the AND operator is being used implicitly between query terms: "cozy fish restaurant" -> "cozy AND fish AND restaurant"). Otherwise, at least one query term must be present in a document in order to be retrieved (i.e the OR operator is being used implicitly between query terms: "cozy fish restaurant" -> "cozy OR fish OR restaurant"). Defaults to False.
scale_score
: Whether to scale the similarity score to the unit interval (range of [0,1]). If true (default) similarity scores (e.g. cosine or dot_product) which naturally have a different value range will be scaled to a range of [0,1], where 1 means extremely relevant. Otherwise, raw similarity scores (e.g. cosine or dot_product) will be used.
batch_size
: Number of queries that are processed at once. If not specified, self.batch_size is used.
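The effect of all_terms_must_match can be illustrated with a toy term matcher. This is a plain-Python simplification that splits on whitespace and ignores analyzers, stemming and scoring, so it only shows the AND versus OR behaviour; term_match is a hypothetical helper.

```python
def term_match(query: str, text: str, all_terms_must_match: bool = False) -> bool:
    """Return True if the text matches the query under AND or OR semantics."""
    terms = query.lower().split()
    words = set(text.lower().split())
    hits = [term in words for term in terms]
    # AND: every term must occur; OR: one occurring term is enough.
    return all(hits) if all_terms_must_match else any(hits)


query = "cozy fish restaurant"
text = "a cozy restaurant downtown"

# "fish" is missing from the text, so only the OR variant matches.
or_match = term_match(query, text, all_terms_must_match=False)
and_match = term_match(query, text, all_terms_must_match=True)
print(or_match, and_match)
```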
ElasticsearchDocumentStore.query_by_embedding_batch
def query_by_embedding_batch(
query_embs: Union[List[np.ndarray], np.ndarray],
filters: Optional[Union[FilterType,
List[Optional[FilterType]]]] = None,
top_k: int = 10,
index: Optional[str] = None,
return_embedding: Optional[bool] = None,
headers: Optional[Dict[str, str]] = None,
scale_score: bool = True,
batch_size: Optional[int] = None) -> List[List[Document]]
Find the documents that are most similar to the provided query_embs
by using a vector similarity metric.
Arguments:
query_embs
: Embeddings of the queries (e.g. gathered from DPR). Can be a list of one-dimensional numpy arrays or a two-dimensional numpy array.
filters
: Optional filters to narrow down the search space to documents whose metadata fulfill certain conditions. Filters are defined as nested dictionaries. The keys of the dictionaries can be a logical operator ("$and", "$or", "$not"), a comparison operator ("$eq", "$in", "$gt", "$gte", "$lt", "$lte") or a metadata field name. Logical operator keys take a dictionary of metadata field names and/or logical operators as value. Metadata field names take a dictionary of comparison operators as value. Comparison operator keys take a single value or (in case of "$in") a list of values as value. If no logical operator is provided, "$and" is used as default operation. If no comparison operator is provided, "$eq" (or "$in" if the comparison value is a list) is used as default operation.
__Example__:
```python
filters = {
    "$and": {
        "type": {"$eq": "article"},
        "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
        "rating": {"$gte": 3},
        "$or": {
            "genre": {"$in": ["economy", "politics"]},
            "publisher": {"$eq": "nytimes"}
        }
    }
}
# or simpler using default operators
filters = {
    "type": "article",
    "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
    "rating": {"$gte": 3},
    "$or": {
        "genre": ["economy", "politics"],
        "publisher": "nytimes"
    }
}
```
To use the same logical operator multiple times on the same level, logical operators take optionally a list of dictionaries as value.
__Example__:
```python
filters = {
    "$or": [
        {
            "$and": {
                "Type": "News Paper",
                "Date": {
                    "$lt": "2019-01-01"
                }
            }
        },
        {
            "$and": {
                "Type": "Blog Post",
                "Date": {
                    "$gte": "2019-01-01"
                }
            }
        }
    ]
}
```
top_k
: How many documents to return.
index
: Index name for storing the docs and metadata.
return_embedding
: To return document embedding.
headers
: Custom HTTP headers to pass to elasticsearch client (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='}). Check out https://www.elastic.co/guide/en/elasticsearch/reference/current/http-clients.html for more information.
scale_score
: Whether to scale the similarity score to the unit interval (range of [0,1]). If true (default) similarity scores (e.g. cosine or dot_product) which naturally have a different value range will be scaled to a range of [0,1], where 1 means extremely relevant. Otherwise, raw similarity scores (e.g. cosine or dot_product) will be used.
batch_size
: Number of query embeddings to process at once. If not specified, self.batch_size is used.
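This reference does not state the formula used when scale_score is enabled. The sketch below shows one plausible mapping to the unit interval, given purely as an assumption: a linear shift for cosine similarity (which lies in [-1, 1]) and a logistic squashing for the unbounded dot product. The function name and the divisor 100 are illustrative choices.

```python
import math


def scale_to_unit_interval(score: float, similarity: str) -> float:
    """Map a raw similarity score into [0, 1] (assumed formulas)."""
    if similarity == "cosine":
        # Cosine similarity lies in [-1, 1]; shift and halve it.
        return (score + 1) / 2
    # Dot products are unbounded; squash them with a logistic function.
    return 1 / (1 + math.exp(-score / 100))


print(scale_to_unit_interval(1.0, "cosine"))
print(scale_to_unit_interval(0.0, "dot_product"))
```

Whatever the exact formula, scaled scores from different similarity functions are not directly comparable with each other.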
ElasticsearchDocumentStore.update_embeddings
def update_embeddings(retriever: DenseRetriever,
index: Optional[str] = None,
filters: Optional[FilterType] = None,
update_existing_embeddings: bool = True,
batch_size: Optional[int] = None,
headers: Optional[Dict[str, str]] = None)
Updates the embeddings in the document store using the encoding model specified in the retriever.
This can be useful if you want to add or change the embeddings for your documents (e.g. after changing the retriever config).
Arguments:
retriever
: Retriever to use to update the embeddings.
index
: Index name to update.
update_existing_embeddings
: Whether to update existing embeddings of the documents. If set to False, only documents without embeddings are processed. This mode can be used for incremental updating of embeddings, wherein, only newly indexed documents get processed.
filters
: Optional filters to narrow down the documents for which embeddings are to be updated. Filters are defined as nested dictionaries. The keys of the dictionaries can be a logical operator ("$and", "$or", "$not"), a comparison operator ("$eq", "$in", "$gt", "$gte", "$lt", "$lte") or a metadata field name. Logical operator keys take a dictionary of metadata field names and/or logical operators as value. Metadata field names take a dictionary of comparison operators as value. Comparison operator keys take a single value or (in case of "$in") a list of values as value. If no logical operator is provided, "$and" is used as default operation. If no comparison operator is provided, "$eq" (or "$in" if the comparison value is a list) is used as default operation.
__Example__:
```python
filters = {
    "$and": {
        "type": {"$eq": "article"},
        "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
        "rating": {"$gte": 3},
        "$or": {
            "genre": {"$in": ["economy", "politics"]},
            "publisher": {"$eq": "nytimes"}
        }
    }
}
```
batch_size
: When working with large number of documents, batching can help reduce memory footprint.
headers
: Custom HTTP headers to pass to the client (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='}). Check out https://www.elastic.co/guide/en/elasticsearch/reference/current/http-clients.html for more information.
Returns:
None
ElasticsearchDocumentStore.delete_all_documents
def delete_all_documents(index: Optional[str] = None,
filters: Optional[FilterType] = None,
headers: Optional[Dict[str, str]] = None)
Delete documents in an index. All documents are deleted if no filters are passed.
Arguments:
index
: Index name to delete the document from.
filters
: Optional filters to narrow down the documents to be deleted. Filters are defined as nested dictionaries. The keys of the dictionaries can be a logical operator ("$and", "$or", "$not"), a comparison operator ("$eq", "$in", "$gt", "$gte", "$lt", "$lte") or a metadata field name. Logical operator keys take a dictionary of metadata field names and/or logical operators as value. Metadata field names take a dictionary of comparison operators as value. Comparison operator keys take a single value or (in case of "$in") a list of values as value. If no logical operator is provided, "$and" is used as default operation. If no comparison operator is provided, "$eq" (or "$in" if the comparison value is a list) is used as default operation.
__Example__:
```python
filters = {
    "$and": {
        "type": {"$eq": "article"},
        "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
        "rating": {"$gte": 3},
        "$or": {
            "genre": {"$in": ["economy", "politics"]},
            "publisher": {"$eq": "nytimes"}
        }
    }
}
```
headers
: Custom HTTP headers to pass to the client (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='}). Check out https://www.elastic.co/guide/en/elasticsearch/reference/current/http-clients.html for more information.
Returns:
None
ElasticsearchDocumentStore.delete_documents
def delete_documents(index: Optional[str] = None,
ids: Optional[List[str]] = None,
filters: Optional[FilterType] = None,
headers: Optional[Dict[str, str]] = None)
Delete documents in an index. All documents are deleted if neither ids nor filters are passed.
Arguments:
index
: Index name to delete the documents from. If None, the DocumentStore's default index (self.index) will be used.
ids
: Optional list of IDs to narrow down the documents to be deleted.
filters
: Optional filters to narrow down the documents to be deleted. Filters are defined as nested dictionaries. The keys of the dictionaries can be a logical operator ("$and", "$or", "$not"), a comparison operator ("$eq", "$in", "$gt", "$gte", "$lt", "$lte") or a metadata field name. Logical operator keys take a dictionary of metadata field names and/or logical operators as value. Metadata field names take a dictionary of comparison operators as value. Comparison operator keys take a single value or (in case of "$in") a list of values as value. If no logical operator is provided, "$and" is used as default operation. If no comparison operator is provided, "$eq" (or "$in" if the comparison value is a list) is used as default operation.
__Example__:
```python
filters = {
    "$and": {
        "type": {"$eq": "article"},
        "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
        "rating": {"$gte": 3},
        "$or": {
            "genre": {"$in": ["economy", "politics"]},
            "publisher": {"$eq": "nytimes"}
        }
    }
}
```
If filters are provided along with a list of IDs, this method deletes the intersection of the two query results (documents that match the filters and have their ID in the list).
headers
: Custom HTTP headers to pass to the client (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='}). Check out https://www.elastic.co/guide/en/elasticsearch/reference/current/http-clients.html for more information.
Returns:
None
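The way ids and filters combine in delete_documents (the intersection of both conditions) can be sketched over an in-memory dictionary. The helper select_for_deletion and the predicate argument standing in for a filter are hypothetical; the real method issues a delete request to Elasticsearch.

```python
from typing import Callable, Dict, List, Optional


def select_for_deletion(
    docs: Dict[str, dict],
    ids: Optional[List[str]] = None,
    predicate: Optional[Callable[[dict], bool]] = None,
) -> List[str]:
    """Return the IDs that satisfy every condition that was supplied."""
    selected = []
    for doc_id, meta in docs.items():
        if ids is not None and doc_id not in ids:
            continue  # not in the ID list
        if predicate is not None and not predicate(meta):
            continue  # does not match the filter
        selected.append(doc_id)
    return selected


docs = {
    "a": {"type": "article"},
    "b": {"type": "blog"},
    "c": {"type": "article"},
}


def is_article(meta: dict) -> bool:
    return meta["type"] == "article"


# Both conditions: only "a" is in the ID list and is an article.
both = select_for_deletion(docs, ids=["a", "b"], predicate=is_article)
# No conditions: every document is selected.
everything = select_for_deletion(docs)
print(both, everything)
```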
ElasticsearchDocumentStore.delete_labels
def delete_labels(index: Optional[str] = None,
ids: Optional[List[str]] = None,
filters: Optional[FilterType] = None,
headers: Optional[Dict[str, str]] = None)
Delete labels in an index. All labels are deleted if neither ids nor filters are passed.
Arguments:
index
: Index name to delete the labels from. If None, the DocumentStore's default label index (self.label_index) will be used.
ids
: Optional list of IDs to narrow down the labels to be deleted.
filters
: Optional filters to narrow down the labels to be deleted. Filters are defined as nested dictionaries. The keys of the dictionaries can be a logical operator ("$and", "$or", "$not"), a comparison operator ("$eq", "$in", "$gt", "$gte", "$lt", "$lte") or a metadata field name. Logical operator keys take a dictionary of metadata field names and/or logical operators as value. Metadata field names take a dictionary of comparison operators as value. Comparison operator keys take a single value or (in case of "$in") a list of values as value. If no logical operator is provided, "$and" is used as default operation. If no comparison operator is provided, "$eq" (or "$in" if the comparison value is a list) is used as default operation.
__Example__:
```python
filters = {
    "$and": {
        "type": {"$eq": "article"},
        "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
        "rating": {"$gte": 3},
        "$or": {
            "genre": {"$in": ["economy", "politics"]},
            "publisher": {"$eq": "nytimes"}
        }
    }
}
```
headers
: Custom HTTP headers to pass to the client (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='}). Check out https://www.elastic.co/guide/en/elasticsearch/reference/current/http-clients.html for more information.
Returns:
None
ElasticsearchDocumentStore.delete_index
def delete_index(index: str)
Delete an existing search index. The index including all data will be removed.
Arguments:
index
: The name of the index to delete.
Returns:
None
ElasticsearchDocumentStore.get_all_labels_aggregated
def get_all_labels_aggregated(
index: Optional[str] = None,
filters: Optional[FilterType] = None,
open_domain: bool = True,
drop_negative_labels: bool = False,
drop_no_answers: bool = False,
aggregate_by_meta: Optional[Union[str, list]] = None,
headers: Optional[Dict[str, str]] = None) -> List[MultiLabel]
Return all labels in the DocumentStore, aggregated into MultiLabel objects.
This aggregation step helps, for example, if you collected multiple possible answers for one question and now want all answers bundled together in one place for evaluation. How they are aggregated is defined by the open_domain and aggregate_by_meta parameters.
- If the questions are being asked to a single document (i.e. SQuAD style), you should set open_domain=False to aggregate by question and document.
- If the questions are being asked to your full collection of documents, you should set open_domain=True to aggregate just by question.
- If the questions are being asked to a subslice of your document set (e.g. product review use cases), you should set open_domain=True and populate aggregate_by_meta with the names of Label meta fields to aggregate by question and your custom meta fields.
For example, in a product review use case, you might set aggregate_by_meta=["product_id"] so that Labels with the same question but different answers from different documents are aggregated into one MultiLabel object, provided that they have the same product_id (to be found in Label.meta["product_id"]).
Arguments:
index
: Name of the index to get the labels from. If None, the DocumentStore's default index (self.index) will be used.
filters
: Optional filters to narrow down the search space to documents whose metadata fulfill certain conditions. Filters are defined as nested dictionaries. The keys of the dictionaries can be a logical operator ("$and", "$or", "$not"), a comparison operator ("$eq", "$in", "$gt", "$gte", "$lt", "$lte") or a metadata field name. Logical operator keys take a dictionary of metadata field names and/or logical operators as value. Metadata field names take a dictionary of comparison operators as value. Comparison operator keys take a single value or (in case of "$in") a list of values as value. If no logical operator is provided, "$and" is used as default operation. If no comparison operator is provided, "$eq" (or "$in" if the comparison value is a list) is used as default operation.
__Example__:
```python
filters = {
    "$and": {
        "type": {"$eq": "article"},
        "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
        "rating": {"$gte": 3},
        "$or": {
            "genre": {"$in": ["economy", "politics"]},
            "publisher": {"$eq": "nytimes"}
        }
    }
}
```
open_domain
: When True, labels are aggregated purely based on the question text alone. When False, labels are aggregated in a closed domain fashion based on the question text and also the id of the document that the label is tied to. In this setting, this function might return multiple MultiLabel objects with the same question string.
headers
: Custom HTTP headers to pass to document store client if supported (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='} for basic authentication).
aggregate_by_meta
: The names of the Label meta fields by which to aggregate. For example: ["product_id"]
drop_negative_labels
: When True, labels with incorrect answers and documents are dropped.
drop_no_answers
: When True, labels with no answers are dropped.
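The three aggregation modes can be illustrated by grouping plain dictionaries. The label shape used here (question, answer, document_id, meta) and the helper aggregate_labels are simplifications assumed for the sketch; the real method works on Label objects and returns MultiLabel objects.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


def aggregate_labels(
    labels: List[dict],
    open_domain: bool = True,
    aggregate_by_meta: Optional[List[str]] = None,
) -> Dict[Tuple, List[str]]:
    """Group answers by question, optionally also by document and meta fields."""
    groups: Dict[Tuple, List[str]] = defaultdict(list)
    for label in labels:
        key = [label["question"]]
        if not open_domain:
            key.append(label["document_id"])  # closed domain: per document
        for field in aggregate_by_meta or []:
            key.append(label.get("meta", {}).get(field))
        groups[tuple(key)].append(label["answer"])
    return dict(groups)


labels = [
    {"question": "Is it loud?", "answer": "No", "document_id": "d1",
     "meta": {"product_id": "p1"}},
    {"question": "Is it loud?", "answer": "Very quiet", "document_id": "d2",
     "meta": {"product_id": "p1"}},
    {"question": "Is it loud?", "answer": "Yes", "document_id": "d3",
     "meta": {"product_id": "p2"}},
]

by_question = aggregate_labels(labels, open_domain=True)
by_document = aggregate_labels(labels, open_domain=False)
by_product = aggregate_labels(labels, open_domain=True,
                              aggregate_by_meta=["product_id"])
print(len(by_question), len(by_document), len(by_product))
```

In the product review case, answers about product p1 end up in one group and answers about p2 in another, even though the question text is identical.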
ElasticsearchDocumentStore.normalize_embedding
@staticmethod
def normalize_embedding(emb: np.ndarray) -> None
Performs L2 normalization of an embedding vector in place. Input can be a single vector (1D array) or a matrix (2D array).
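A minimal NumPy sketch of in-place L2 normalization for both input shapes. The function name l2_normalize_inplace is hypothetical, and leaving zero vectors unchanged is an assumption of this sketch rather than documented behaviour.

```python
import numpy as np


def l2_normalize_inplace(emb: np.ndarray) -> None:
    """Scale a float vector, or each row of a float matrix, to unit L2 norm."""
    if emb.ndim == 1:
        norm = np.linalg.norm(emb)
        if norm != 0.0:
            emb /= norm  # in-place division keeps the same array object
    else:
        norms = np.linalg.norm(emb, axis=1, keepdims=True)
        norms[norms == 0.0] = 1.0  # assumption: zero rows stay unchanged
        emb /= norms


vec = np.array([3.0, 4.0])
l2_normalize_inplace(vec)

mat = np.array([[3.0, 4.0], [0.0, 0.0], [1.0, 0.0]])
l2_normalize_inplace(mat)
print(vec, mat, sep="\n")
```

Because the array is modified in place, the function returns None, matching the signature above. The input must have a floating-point dtype for the in-place division to work.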
ElasticsearchDocumentStore.add_eval_data
def add_eval_data(filename: str,
doc_index: str = "eval_document",
label_index: str = "label",
batch_size: Optional[int] = None,
preprocessor: Optional[PreProcessor] = None,
max_docs: Optional[Union[int, bool]] = None,
open_domain: bool = False,
headers: Optional[Dict[str, str]] = None)
Adds a SQuAD-formatted file to the DocumentStore in order to be able to perform evaluation on it.
If a jsonl file and a batch_size is passed to the function, documents are loaded batchwise from disk and also indexed batchwise to the DocumentStore in order to prevent out of memory errors.
Arguments:
filename
: Name of the file containing evaluation data (json or jsonl).
doc_index
: Elasticsearch index where evaluation documents should be stored.
label_index
: Elasticsearch index where labeled questions should be stored.
batch_size
: Optional number of documents that are loaded and processed at a time. When set to None (default) all documents are processed at once.
preprocessor
: Optional PreProcessor to preprocess evaluation documents. It can be used for splitting documents into passages (and assigning labels to corresponding passages). Currently the PreProcessor does not support split_by sentence, cleaning nor split_overlap != 0. When set to None (default) preprocessing is disabled.
max_docs
: Optional number of documents that will be loaded. When set to None (default) all available eval documents are used.
open_domain
: Set this to True if your file is an open domain dataset where two different answers to the same question might be found in different contexts.
headers
: Custom HTTP headers to pass to document store client if supported (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='} for basic authentication).
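For orientation, the sketch below writes a tiny file in the SQuAD JSON layout and checks that each answer_start offset points at the answer text. The file name tiny_eval.json, its contents and the helper answers_match_context are made up for illustration. Passing the file to add_eval_data requires a running Elasticsearch instance and is therefore not shown.

```python
import json
import os
import tempfile

# One article with one paragraph and one question (SQuAD-style layout).
squad = {
    "data": [
        {
            "title": "Example",
            "paragraphs": [
                {
                    "context": "Berlin is the capital of Germany.",
                    "qas": [
                        {
                            "id": "q1",
                            "question": "What is the capital of Germany?",
                            "answers": [{"text": "Berlin", "answer_start": 0}],
                            "is_impossible": False,
                        }
                    ],
                }
            ],
        }
    ]
}


def answers_match_context(dataset: dict) -> bool:
    """Check that every answer text is found at its answer_start offset."""
    for article in dataset["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                for answer in qa["answers"]:
                    start = answer["answer_start"]
                    end = start + len(answer["text"])
                    if context[start:end] != answer["text"]:
                        return False
    return True


path = os.path.join(tempfile.mkdtemp(), "tiny_eval.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(squad, f)

with open(path, encoding="utf-8") as f:
    loaded = json.load(f)

print(answers_match_context(loaded))
```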
ElasticsearchDocumentStore.run
def run(documents: List[Union[dict, Document]],
index: Optional[str] = None,
headers: Optional[Dict[str, str]] = None,
id_hash_keys: Optional[List[str]] = None)
Run requests of document stores.
Comment: We will gradually introduce the primitives. The document stores also accept dicts and parse them to documents. In the future, however, only documents themselves will be accepted. Parsing the dictionaries in the run function is therefore only an interim solution until the run function also accepts documents.
Arguments:
documents
: A list of documents, given as dicts or Document objects.
headers
: A dictionary of custom HTTP headers.
index
: Optional name of index where the documents shall be written to. If None, the DocumentStore's default index (self.index) will be used.
id_hash_keys
: List of the fields that the hashes of the ids are generated from.
ElasticsearchDocumentStore.describe_documents
def describe_documents(index=None)
Return a summary of the documents in the document store
Module opensearch
OpenSearchDocumentStore
class OpenSearchDocumentStore(SearchEngineDocumentStore)
OpenSearchDocumentStore.__init__
def __init__(scheme: str = "https",
username: str = "admin",
password: str = "admin",
host: Union[str, List[str]] = "localhost",
port: Union[int, List[int]] = 9200,
api_key_id: Optional[str] = None,
api_key: Optional[str] = None,
aws4auth=None,
index: str = "document",
label_index: str = "label",
search_fields: Union[str, list] = "content",
content_field: str = "content",
name_field: str = "name",
embedding_field: str = "embedding",
embedding_dim: int = 768,
custom_mapping: Optional[dict] = None,
excluded_meta_data: Optional[list] = None,
analyzer: str = "standard",
ca_certs: Optional[str] = None,
verify_certs: bool = False,
recreate_index: bool = False,
create_index: bool = True,
refresh_type: str = "wait_for",
similarity: str = "dot_product",
timeout: int = 300,
return_embedding: bool = False,
duplicate_documents: str = "overwrite",
index_type: str = "flat",
scroll: str = "1d",
skip_missing_embeddings: bool = True,
synonyms: Optional[List] = None,
synonym_type: str = "synonym",
use_system_proxy: bool = False,
knn_engine: str = "nmslib",
knn_parameters: Optional[Dict] = None,
ivf_train_size: Optional[int] = None,
batch_size: int = 10_000)
Document Store using OpenSearch (https://opensearch.org/). It is compatible with the Amazon OpenSearch Service.
In addition to native OpenSearch query & filtering, it provides efficient vector similarity search using the KNN plugin that can scale to a large number of documents.
Arguments:
host
: URL(s) of the OpenSearch nodes.
port
: Port(s) of the OpenSearch nodes.
username
: Username (standard authentication via http_auth).
password
: Password (standard authentication via http_auth).
api_key_id
: ID of the API key (alternative authentication mode to the above http_auth).
api_key
: Secret value of the API key (alternative authentication mode to the above http_auth).
aws4auth
: Authentication for usage with AWS OpenSearch Service (can be generated with the requests-aws4auth package).
index
: Name of the index in OpenSearch to use for storing the documents that we want to search. If it doesn't exist yet, we create it.
label_index
: Name of the index in OpenSearch to use for storing labels. If it doesn't exist yet, we create it.
search_fields
: Names of the fields used by BM25Retriever to find matches in the docs to our incoming query (using OpenSearch's multi_match query), e.g. ["title", "full_text"].
content_field
: Name of the field that might contain the answer and is therefore passed to the Reader model (e.g. "full_text"). If no Reader is used (e.g. in FAQ-style QA), the plain content of this field is returned.
name_field
: Name of the field that contains the title of the doc.
embedding_field
: Name of the field containing an embedding vector (only needed when using a dense retriever, e.g. DensePassageRetriever or EmbeddingRetriever, on top). Note that in OpenSearch the similarity type for efficient approximate vector similarity calculations is tied to the embedding field's data type, which cannot be changed after creation.
embedding_dim
: Dimensionality of the embedding vector (only needed when using a dense retriever, e.g. DensePassageRetriever or EmbeddingRetriever, on top).
custom_mapping
: If you want to use your own custom mapping for creating a new index in OpenSearch, you can supply it here as a dictionary.
analyzer
: Specify the default analyzer from one of the built-ins when creating a new OpenSearch index. OpenSearch also has built-in analyzers for different languages (e.g. impacting tokenization). More info at: https://opensearch.org/docs/latest/analyzers/text-analyzers/
excluded_meta_data
: Names of the fields in OpenSearch that should not be returned (e.g. [field_one, field_two]). Helpful if you have fields with long, irrelevant content that you don't want to display in results (e.g. embedding vectors).
scheme
: 'https' or 'http', the protocol used to connect to your OpenSearch instance.
ca_certs
: Root certificates for SSL: a path to certificate authority (CA) certs on disk. You can use the certifi package with certifi.where() to find where the CA certs file is located on your machine.
verify_certs
: Whether to be strict about CA certificates.
create_index
: Whether to try creating a new index. (If an index with that name already exists, we continue in any case.)
refresh_type
: Type of OpenSearch refresh used to control when changes made by a request (e.g. bulk) are made visible to search. If set to 'wait_for', continue only after changes are visible (slow, but safe). If set to 'false', continue directly (fast, but sometimes unintuitive behaviour when docs are not immediately available after ingestion). More info at https://opensearch.org/docs/latest/api-reference/document-apis/bulk/#url-parameters
similarity
: The similarity function used to compare document vectors. 'dot_product' is the default since it is more performant with DPR embeddings. 'cosine' is recommended if you are using a Sentence BERT model. Note that the use of efficient approximate vector calculations in OpenSearch is tied to the embedding_field's data type, which cannot be changed after creation. You won't be able to use approximate vector calculations on an embedding_field that was created with a different similarity value. In such cases, a fallback to exact but slow vector calculations happens and a warning is displayed.
timeout
: Number of seconds after which an OpenSearch request times out.
return_embedding
: Whether to return the document embedding.
duplicate_documents
: How to handle duplicate documents. Parameter options: 'skip', 'overwrite', 'fail'.
- skip: Ignore the duplicate documents.
- overwrite: Update any existing documents with the same ID when adding documents.
- fail: Raise an error if the document ID of the document being added already exists.
index_type
: The type of index you want to create. Choose from 'flat', 'hnsw', 'ivf', or 'ivf_pq'. 'ivf_pq' is an IVF index optimized for memory through product quantization. ('ivf' and 'ivf_pq' are only available with 'faiss' as knn_engine.) If index_type='flat', we use OpenSearch's default index settings (which is an hnsw index optimized for accuracy and memory footprint), since OpenSearch does not require a special index for exact vector similarity calculations. Note that OpenSearchDocumentStore only performs exact vector calculations if the selected knn_engine supports it (currently only knn_engine='score_script'). For the other knn_engines we use hnsw, as this usually achieves the best balance between accuracy and latency.
scroll
: Determines how long the current index is fixed, e.g. during updating all documents with embeddings. Defaults to "1d" and should not be larger than this. Can also be in minutes ("5m") or hours ("15h"). For details, see https://opensearch.org/docs/latest/api-reference/scroll/
skip_missing_embeddings
: Controls queries based on vector similarity when indexed documents miss embeddings. Parameter options: True, False.
- False: Raise an exception if one or more documents do not have embeddings at query time.
- True: The query ignores all documents without embeddings (recommended if you concurrently index and query).
synonyms
: List of synonyms that can be passed when initializing OpenSearch. For example: [ "foo, bar => baz", "foozball , foosball" ]. More info at https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-synonym-tokenfilter.html
synonym_type
: The synonym filter type to use: synonym or synonym_graph, which handles synonyms, including multi-word synonyms, correctly during the analysis process. More info at https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-synonym-graph-tokenfilter.html
knn_engine
: The engine you want to use for the nearest neighbor search by OpenSearch's KNN plug-in. Possible values: "nmslib", "faiss" or "score_script". Defaults to "nmslib". For more information, see k-NN Index.
knn_parameters
: Custom parameters for the KNN engine. Parameter names depend on the index type you use. Configurable parameters for indices of type:
- hnsw: "ef_construction", "ef_search", "m"
- ivf: "nlist", "nprobes"
- ivf_pq: "nlist", "nprobes", "m", "code_size"

If you don't specify any parameters, OpenSearch's default values are used. (With the exception of index_type='hnsw', where we use values other than OpenSearch's default ones to achieve comparability throughout DocumentStores in Haystack.) For more information on configuration of knn indices, see OpenSearch Documentation.
ivf_train_size
: Number of embeddings to use for training the IVF index. Training starts automatically once the number of indexed embeddings exceeds ivf_train_size. If None, the minimum number of embeddings recommended for training by FAISS is used (depends on the desired index type and knn parameters). If 0, training doesn't happen automatically but needs to be triggered manually via the train_index method. Default: None
batch_size
: Number of Documents to index at once / Number of queries to execute at once. If you face memory issues, decrease the batch_size.
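The constraints between index_type, knn_engine, and knn_parameters can be checked before the store is constructed. The following is a minimal sketch, not part of Haystack: `check_knn_config` and `ALLOWED_KNN_PARAMETERS` are hypothetical names, the rules encode only what the argument descriptions state, and the parameter values are illustrative.

```python
from typing import Dict, Optional

# Parameter names listed per index type in the knn_parameters description.
ALLOWED_KNN_PARAMETERS = {
    "hnsw": {"ef_construction", "ef_search", "m"},
    "ivf": {"nlist", "nprobes"},
    "ivf_pq": {"nlist", "nprobes", "m", "code_size"},
}


def check_knn_config(index_type: str, knn_engine: str,
                     knn_parameters: Optional[Dict] = None) -> None:
    """Raise ValueError if the combination contradicts the documented rules."""
    if index_type not in {"flat", "hnsw", "ivf", "ivf_pq"}:
        raise ValueError(f"Unknown index_type: {index_type!r}")
    if knn_engine not in {"nmslib", "faiss", "score_script"}:
        raise ValueError(f"Unknown knn_engine: {knn_engine!r}")
    # 'ivf' and 'ivf_pq' are only available with 'faiss'.
    if index_type in {"ivf", "ivf_pq"} and knn_engine != "faiss":
        raise ValueError(f"index_type={index_type!r} requires knn_engine='faiss'")
    # No parameter names are documented for 'flat', so nothing is checked there.
    allowed = ALLOWED_KNN_PARAMETERS.get(index_type)
    if allowed is not None:
        unknown = set(knn_parameters or {}) - allowed
        if unknown:
            raise ValueError(f"Unsupported knn_parameters for {index_type!r}: {sorted(unknown)}")


# Keyword arguments that could then be passed to OpenSearchDocumentStore(**kwargs).
kwargs = dict(
    index_type="hnsw",
    knn_engine="faiss",
    similarity="cosine",
    embedding_dim=768,
    knn_parameters={"ef_construction": 128, "m": 16},  # illustrative values
)
check_knn_config(kwargs["index_type"], kwargs["knn_engine"], kwargs["knn_parameters"])
```

Validating up front is useful because the embedding field's data type cannot be changed after the index is created.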
OpenSearchDocumentStore.write_documents
def write_documents(documents: Union[List[dict], List[Document]],
index: Optional[str] = None,
batch_size: Optional[int] = None,
duplicate_documents: Optional[str] = None,
headers: Optional[Dict[str, str]] = None)
Indexes documents for later queries in OpenSearch.
If a document with the same ID already exists in OpenSearch:
a) (Default) Throw OpenSearch's standard error message for duplicate IDs.
b) If self.update_existing_documents=True for DocumentStore: Overwrite existing documents.
(This is only relevant if you pass your own ID when initializing a Document. If you don't set custom IDs for your Documents or just pass a list of dictionaries here, they automatically get UUIDs assigned. See the Document class for details.)
Arguments:
documents
: A list of Python dictionaries or a list of Haystack Document objects. For documents as dictionaries, the format is {"content": ""}. Optionally, include metadata via {"content": "", "meta": {"name": "", "author": "somebody", ...}}. You can use it for filtering, and you can access it in the responses of the Finder. Advanced: If you are using your own OpenSearch mapping, change the key names in the dictionary to what you have set for self.content_field and self.name_field.
index
: OpenSearch index where the documents should be indexed. If you don't specify it, self.index is used.
batch_size
: Number of documents that are passed to OpenSearch's bulk function at a time.
duplicate_documents
: How to handle duplicate documents. Parameter options: 'skip', 'overwrite', 'fail'.
- skip: Ignore the duplicate documents.
- overwrite: Update any existing documents with the same ID when adding documents.
- fail: Raise an error if the document ID of the document being added already exists.
headers
: Custom HTTP headers to pass to the OpenSearch client (for example {'Authorization': 'Basic YWRtaW46cm9vdA=='}). For more information, see HTTP/REST clients and security.
Raises:
DuplicateDocumentError
: Exception triggered on a duplicate document.
Returns:
None
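The three duplicate_documents options can be illustrated with an in-memory stand-in for the index. This is a sketch of the documented behaviour only, not the library's implementation: `apply_duplicate_policy` is a hypothetical helper, the `DuplicateDocumentError` class below is a local stand-in, and the `"id"` key in the dictionaries is assumed.

```python
from typing import Dict, List


class DuplicateDocumentError(ValueError):
    """Local stand-in for the error of the same name, defined only for this sketch."""


def apply_duplicate_policy(existing: Dict[str, dict], incoming: List[dict],
                           policy: str = "overwrite") -> Dict[str, dict]:
    """Return a new id -> document mapping after applying the chosen policy."""
    if policy not in {"skip", "overwrite", "fail"}:
        raise ValueError(f"Unknown policy: {policy!r}")
    result = dict(existing)
    for doc in incoming:
        doc_id = doc["id"]
        if doc_id in result:
            if policy == "skip":
                continue  # keep the stored document untouched
            if policy == "fail":
                raise DuplicateDocumentError(f"Document {doc_id!r} already exists")
        result[doc_id] = doc  # new document, or 'overwrite' of an existing one
    return result


docs = [
    {"id": "a", "content": "OpenSearch stores documents.", "meta": {"name": "intro.txt"}},
    {"id": "b", "content": "Embeddings enable vector search.", "meta": {"name": "vectors.txt"}},
]
store = apply_duplicate_policy({}, docs)
revised = [{"id": "a", "content": "Revised text.", "meta": {}}]
skipped = apply_duplicate_policy(store, revised, policy="skip")
overwritten = apply_duplicate_policy(store, revised, policy="overwrite")
```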
OpenSearchDocumentStore.query_by_embedding
def query_by_embedding(query_emb: np.ndarray,
filters: Optional[FilterType] = None,
top_k: int = 10,
index: Optional[str] = None,
return_embedding: Optional[bool] = None,
headers: Optional[Dict[str, str]] = None,
scale_score: bool = True) -> List[Document]
Find the document that is most similar to the provided query_emb
by using a vector similarity metric.
Arguments:
query_emb
: Embedding of the query (e.g. gathered from DPR).
filters
: Optional filters to narrow down the search space to documents whose metadata fulfill certain conditions. Filters are defined as nested dictionaries. The keys of the dictionaries can be a logical operator ("$and", "$or", "$not"), a comparison operator ("$eq", "$in", "$gt", "$gte", "$lt", "$lte") or a metadata field name. Logical operator keys take a dictionary of metadata field names and/or logical operators as value. Metadata field names take a dictionary of comparison operators as value. Comparison operator keys take a single value or (in case of "$in") a list of values as value. If no logical operator is provided, "$and" is used as default operation. If no comparison operator is provided, "$eq" (or "$in" if the comparison value is a list) is used as default operation.

__Example__:

```python
filters = {
    "$and": {
        "type": {"$eq": "article"},
        "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
        "rating": {"$gte": 3},
        "$or": {
            "genre": {"$in": ["economy", "politics"]},
            "publisher": {"$eq": "nytimes"}
        }
    }
}
# or simpler using default operators
filters = {
    "type": "article",
    "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
    "rating": {"$gte": 3},
    "$or": {
        "genre": ["economy", "politics"],
        "publisher": "nytimes"
    }
}
```

To use the same logical operator multiple times on the same level, logical operators optionally take a list of dictionaries as value.

__Example__:

```python
filters = {
    "$or": [
        {
            "$and": {
                "Type": "News Paper",
                "Date": {"$lt": "2019-01-01"}
            }
        },
        {
            "$and": {
                "Type": "Blog Post",
                "Date": {"$gte": "2019-01-01"}
            }
        }
    ]
}
```

top_k
: How many documents to return.
index
: Index name for storing the docs and metadata.
return_embedding
: To return document embedding.
headers
: Custom HTTP headers to pass to OpenSearch client (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='}). Check out https://www.elastic.co/guide/en/elasticsearch/reference/current/http-clients.html for more information.
scale_score
: Whether to scale the similarity score to the unit interval (range of [0,1]). If true (default), similarity scores (e.g. cosine or dot_product), which naturally have a different value range, are scaled to a range of [0,1], where 1 means extremely relevant. Otherwise, raw similarity scores (e.g. cosine or dot_product) are used.
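To make the effect of scale_score concrete, the sketch below maps raw scores into [0, 1]. The formulas are an assumption for illustration (a linear shift for cosine, a logistic squash with a divisor of 100 for dot_product); check the library source for the exact mapping used by your version. `scale_to_unit_interval` here is a local function written for this example.

```python
import math


def scale_to_unit_interval(score: float, similarity: str) -> float:
    """Map a raw similarity score into [0, 1] (assumed formulas, see lead-in)."""
    if similarity == "cosine":
        # Cosine similarity lies in [-1, 1]; shift and halve it.
        return (score + 1) / 2
    # Dot products are unbounded; squash with a numerically stable logistic.
    x = score / 100
    if x >= 0:
        return 1 / (1 + math.exp(-x))
    z = math.exp(x)
    return z / (1 + z)


raw_scores = [-1.0, 0.0, 1.0]
scaled_cosine = [scale_to_unit_interval(s, "cosine") for s in raw_scores]
```

With scale_score=False you would instead see the raw values, which are comparable only within one similarity function.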
OpenSearchDocumentStore.query_by_embedding_batch
def query_by_embedding_batch(
query_embs: Union[List[np.ndarray], np.ndarray],
filters: Optional[Union[FilterType,
List[Optional[FilterType]]]] = None,
top_k: int = 10,
index: Optional[str] = None,
return_embedding: Optional[bool] = None,
headers: Optional[Dict[str, str]] = None,
scale_score: bool = True,
batch_size: Optional[int] = None) -> List[List[Document]]
Find the documents that are most similar to the provided query_embs
by using a vector similarity metric.
Arguments:
query_embs
: Embeddings of the queries (e.g. gathered from DPR). Can be a list of one-dimensional numpy arrays or a two-dimensional numpy array.
filters
: Optional filters to narrow down the search space to documents whose metadata fulfill certain conditions. Filters are defined as nested dictionaries. The keys of the dictionaries can be a logical operator ("$and", "$or", "$not"), a comparison operator ("$eq", "$in", "$gt", "$gte", "$lt", "$lte") or a metadata field name. Logical operator keys take a dictionary of metadata field names and/or logical operators as value. Metadata field names take a dictionary of comparison operators as value. Comparison operator keys take a single value or (in case of "$in") a list of values as value. If no logical operator is provided, "$and" is used as default operation. If no comparison operator is provided, "$eq" (or "$in" if the comparison value is a list) is used as default operation.

__Example__:

```python
filters = {
    "$and": {
        "type": {"$eq": "article"},
        "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
        "rating": {"$gte": 3},
        "$or": {
            "genre": {"$in": ["economy", "politics"]},
            "publisher": {"$eq": "nytimes"}
        }
    }
}
# or simpler using default operators
filters = {
    "type": "article",
    "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
    "rating": {"$gte": 3},
    "$or": {
        "genre": ["economy", "politics"],
        "publisher": "nytimes"
    }
}
```

To use the same logical operator multiple times on the same level, logical operators optionally take a list of dictionaries as value.

__Example__:

```python
filters = {
    "$or": [
        {
            "$and": {
                "Type": "News Paper",
                "Date": {"$lt": "2019-01-01"}
            }
        },
        {
            "$and": {
                "Type": "Blog Post",
                "Date": {"$gte": "2019-01-01"}
            }
        }
    ]
}
```

top_k
: How many documents to return.
index
: Index name for storing the docs and metadata.
return_embedding
: To return document embedding.
headers
: Custom HTTP headers to pass to OpenSearch client (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='}). Check out https://www.elastic.co/guide/en/elasticsearch/reference/current/http-clients.html for more information.
scale_score
: Whether to scale the similarity score to the unit interval (range of [0,1]). If true (default), similarity scores (e.g. cosine or dot_product), which naturally have a different value range, are scaled to a range of [0,1], where 1 means extremely relevant. Otherwise, raw similarity scores (e.g. cosine or dot_product) are used.
batch_size
: Number of query embeddings to process at once. If not specified, self.batch_size is used.
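The batch_size argument splits the query embeddings into chunks that are processed one after another. A minimal sketch of that chunking, assuming plain Python lists in place of numpy arrays; `batched` is a hypothetical helper, not a Haystack function.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], batch_size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Three two-dimensional query embeddings, processed two at a time.
query_embs = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
batches = list(batched(query_embs, batch_size=2))
```

The same slicing also applies to a two-dimensional numpy array, since slicing along the first axis selects whole embeddings.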
OpenSearchDocumentStore.query
def query(query: Optional[str],
filters: Optional[FilterType] = None,
top_k: int = 10,
custom_query: Optional[str] = None,
index: Optional[str] = None,
headers: Optional[Dict[str, str]] = None,
all_terms_must_match: bool = False,
scale_score: bool = True) -> List[Document]
Scan through documents in DocumentStore and return a small number of documents
that are most relevant to the query as defined by the BM25 algorithm.
Arguments:
query
: The query.
filters
: Optional filters to narrow down the search space to documents whose metadata fulfill certain conditions. Filters are defined as nested dictionaries. The keys of the dictionaries can be a logical operator ("$and", "$or", "$not"), a comparison operator ("$eq", "$in", "$gt", "$gte", "$lt", "$lte") or a metadata field name. Logical operator keys take a dictionary of metadata field names and/or logical operators as value. Metadata field names take a dictionary of comparison operators as value. Comparison operator keys take a single value or (in case of "$in") a list of values as value. If no logical operator is provided, "$and" is used as default operation. If no comparison operator is provided, "$eq" (or "$in" if the comparison value is a list) is used as default operation.

__Example__:

```python
filters = {
    "$and": {
        "type": {"$eq": "article"},
        "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
        "rating": {"$gte": 3},
        "$or": {
            "genre": {"$in": ["economy", "politics"]},
            "publisher": {"$eq": "nytimes"}
        }
    }
}
# or simpler using default operators
filters = {
    "type": "article",
    "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
    "rating": {"$gte": 3},
    "$or": {
        "genre": ["economy", "politics"],
        "publisher": "nytimes"
    }
}
```

To use the same logical operator multiple times on the same level, logical operators optionally take a list of dictionaries as value.

__Example__:

```python
filters = {
    "$or": [
        {
            "$and": {
                "Type": "News Paper",
                "Date": {"$lt": "2019-01-01"}
            }
        },
        {
            "$and": {
                "Type": "Blog Post",
                "Date": {"$gte": "2019-01-01"}
            }
        }
    ]
}
```

top_k
: How many documents to return per query.
custom_query
: The query string containing a mandatory ${query} and an optional ${filters} placeholder.

**An example custom_query:**

```
{
    "size": 10,
    "query": {
        "bool": {
            "should": [{"multi_match": {
                "query": ${query},                 // mandatory query placeholder
                "type": "most_fields",
                "fields": ["content", "title"]}}],
            "filter": ${filters}                  // optional filters placeholder
        }
    },
}
```

**For this custom_query, a sample `retrieve()` could be:**

```python
self.retrieve(query="Why did the revenue increase?",
              filters={"years": ["2019"], "quarters": ["Q1", "Q2"]})
```

Optionally, highlighting can be defined by specifying the highlight settings. See https://www.elastic.co/guide/en/elasticsearch/reference/current/highlighting.html. You will find the highlighted output in the returned Document's meta field by key "highlighted".

**Example custom_query with highlighting:**

```
{
    "size": 10,
    "query": {
        "bool": {
            "should": [{"multi_match": {
                "query": ${query},                 // mandatory query placeholder
                "type": "most_fields",
                "fields": ["content", "title"]}}],
        }
    },
    "highlight": {             // enable highlighting
        "fields": {            // for fields content and title
            "content": {},
            "title": {}
        }
    },
}
```

**For this custom_query, highlighting info can be accessed by:**

```python
docs = self.retrieve(query="Why did the revenue increase?")
highlighted_content = docs[0].meta["highlighted"]["content"]
highlighted_title = docs[0].meta["highlighted"]["title"]
```

index
: The name of the index in the DocumentStore from which to retrieve documents.
headers
: Custom HTTP headers to pass to the client (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='}). Check out https://www.elastic.co/guide/en/elasticsearch/reference/current/http-clients.html for more information.
all_terms_must_match
: Whether all terms of the query must match the document. If true, all query terms must be present in a document in order to be retrieved (i.e. the AND operator is used implicitly between query terms: "cozy fish restaurant" -> "cozy AND fish AND restaurant"). Otherwise, at least one query term must be present in a document in order to be retrieved (i.e. the OR operator is used implicitly between query terms: "cozy fish restaurant" -> "cozy OR fish OR restaurant"). Defaults to false.
scale_score
: Whether to scale the similarity score to the unit interval (range of [0,1]). If true (default), similarity scores (e.g. cosine or dot_product), which naturally have a different value range, are scaled to a range of [0,1], where 1 means extremely relevant. Otherwise, raw similarity scores (e.g. cosine or dot_product) are used.
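The ${query} and ${filters} placeholders in a custom_query follow the syntax of Python's `string.Template`. The sketch below shows one way such a template could be rendered into a request body; it illustrates the placeholder mechanism only. `render_custom_query` is a hypothetical helper, and how the library converts a filters dictionary into OpenSearch filter clauses is not shown (the `terms` clause is supplied by hand).

```python
import json
from string import Template

# A custom_query without comments or trailing commas, so the result is valid JSON.
custom_query = """
{
    "size": 10,
    "query": {
        "bool": {
            "should": [{"multi_match": {
                "query": ${query},
                "type": "most_fields",
                "fields": ["content", "title"]}}],
            "filter": ${filters}
        }
    }
}
"""


def render_custom_query(template: str, query: str, filters: list) -> dict:
    """Substitute JSON-encoded values for the placeholders and parse the body."""
    body = Template(template).substitute(
        query=json.dumps(query),      # adds the quotes and escapes special characters
        filters=json.dumps(filters),
    )
    return json.loads(body)


rendered = render_custom_query(
    custom_query,
    'Why did the "revenue" increase?',
    [{"terms": {"years": ["2019"]}}],
)
```

Encoding the query with `json.dumps` before substitution is why the template writes `${query}` without surrounding quotes.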
OpenSearchDocumentStore.train_index
def train_index(documents: Optional[Union[List[dict], List[Document]]] = None,
embeddings: Optional[np.ndarray] = None,
index: Optional[str] = None,
headers: Optional[Dict[str, str]] = None)
Trains an IVF index on the provided Documents or embeddings if the index hasn't been trained yet.
The train vectors should come from the same distribution as your final vectors. You can pass either Documents (including embeddings) or just plain embeddings you want to train the index on.
Arguments:
documents
: Documents (including the embeddings) you want to train the index on.
embeddings
: Plain embeddings you want to train the index on.
index
: Name of the index to train. If None, the DocumentStore's default index (self.index) is used.
headers
: Custom HTTP headers to pass to the OpenSearch client (for example {'Authorization': 'Basic YWRtaW46cm9vdA=='}). For more information, see HTTP/REST clients and security.
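Whether train_index has to be called manually depends on the ivf_train_size constructor argument. The decision rule from that argument's description can be sketched as follows; `should_train_automatically` and `recommended_minimum` are hypothetical names, and the recommended minimum is passed in because its actual value depends on the index type and knn parameters.

```python
from typing import Optional


def should_train_automatically(num_indexed_embeddings: int,
                               ivf_train_size: Optional[int],
                               recommended_minimum: int) -> bool:
    """Return True once automatic IVF training would start."""
    if ivf_train_size == 0:
        # Training must be triggered manually via train_index.
        return False
    # None falls back to the recommended minimum number of training embeddings.
    threshold = recommended_minimum if ivf_train_size is None else ivf_train_size
    return num_indexed_embeddings > threshold


manual_only = should_train_automatically(10_000, 0, recommended_minimum=1_000)
auto_default = should_train_automatically(1_001, None, recommended_minimum=1_000)
```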
OpenSearchDocumentStore.delete_index
def delete_index(index: str)
Delete an existing search index. The index together with all data will be removed.
If the index is of type "ivf" or "ivf_pq", this method also deletes the corresponding IVF and PQ model.
Arguments:
index
: The name of the index to delete.
Returns:
None
OpenSearchDocumentStore.get_metadata_values_by_key
def get_metadata_values_by_key(key: str,
query: Optional[str] = None,
filters: Optional[FilterType] = None,
index: Optional[str] = None,
headers: Optional[Dict[str, str]] = None,
batch_size: int = 10) -> List[dict]
Get values associated with a metadata key. The output is in the format:
[{"value": "my-value-1", "count": 23}, {"value": "my-value-2", "count": 12}, ... ]
Arguments:
key
: The meta key name to get the values for.
query
: Narrow down the scope to documents matching the query string.
filters
: Narrow down the scope to documents matching the given filters. Filters are defined as nested dictionaries. The keys of the dictionaries can be a logical operator ("$and", "$or", "$not"), a comparison operator ("$eq", "$in", "$gt", "$gte", "$lt", "$lte") or a metadata field name. Logical operator keys take a dictionary of metadata field names and/or logical operators as value. Metadata field names take a dictionary of comparison operators as value. Comparison operator keys take a single value or (in case of "$in") a list of values as value. If no logical operator is provided, "$and" is used as default operation. If no comparison operator is provided, "$eq" (or "$in" if the comparison value is a list) is used as default operation.

__Example__:

```python
filters = {
    "$and": {
        "type": {"$eq": "article"},
        "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
        "rating": {"$gte": 3},
        "$or": {
            "genre": {"$in": ["economy", "politics"]},
            "publisher": {"$eq": "nytimes"}
        }
    }
}
```

index
: The search index to search for the meta values. If not supplied, self.index is used.
headers
: Custom HTTP headers to pass to the client (for example, {'Authorization': 'Basic YWRtaW46cm9vdA=='}). Check out Elasticsearch documentation for more information.
batch_size
: Maximum number of results for each request. Limited to 10 values by default. You can increase this limit to decrease retrieval time. To reduce the pressure on the cluster, you shouldn't set this higher than 1,000.
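The returned list of value/count dictionaries is easy to post-process. A small sketch, using the sample output shown in the method description; `counts_by_value` is a hypothetical helper written for this example.

```python
from typing import Dict, List


def counts_by_value(buckets: List[dict]) -> Dict[str, int]:
    """Turn [{"value": ..., "count": ...}, ...] into a value -> count mapping."""
    return {bucket["value"]: bucket["count"] for bucket in buckets}


# Sample output in the documented format.
buckets = [{"value": "my-value-1", "count": 23}, {"value": "my-value-2", "count": 12}]
counts = counts_by_value(buckets)
most_common = max(counts, key=counts.get)
```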
OpenSearchDocumentStore.get_documents_by_id
def get_documents_by_id(
ids: List[str],
index: Optional[str] = None,
batch_size: int = 10_000,
headers: Optional[Dict[str, str]] = None) -> List[Document]
Fetch documents by specifying a list of text ID strings.
Arguments:
ids
: List of document IDs. Be aware that passing a large number of IDs might lead to performance issues.
index
: The search index where the documents are stored. If not supplied, self.index is used.
batch_size
: Maximum number of results for each query. Limited to 10,000 documents by default. To reduce the pressure on the cluster, you can lower this limit at the expense of longer retrieval times.
headers
: Custom HTTP headers to pass to the client (for example, {'Authorization': 'Basic YWRtaW46cm9vdA=='}). Check out Elasticsearch documentation for more information.
OpenSearchDocumentStore.update_document_meta
def update_document_meta(id: str,
meta: Dict[str, str],
index: Optional[str] = None,
headers: Optional[Dict[str, str]] = None)
Update the metadata dictionary of a document by specifying its ID string.
OpenSearchDocumentStore.get_document_by_id
def get_document_by_id(
id: str,
index: Optional[str] = None,
headers: Optional[Dict[str, str]] = None) -> Optional[Document]
Fetch a document by specifying its text id string
OpenSearchDocumentStore.write_labels
def write_labels(labels: Union[List[Label], List[dict]],
index: Optional[str] = None,
headers: Optional[Dict[str, str]] = None,
batch_size: int = 10_000)
Write annotation labels into document store.
Arguments:
labels
: A list of Python dictionaries or a list of Haystack Label objects.
index
: Search index where the labels should be stored. If not supplied, self.label_index is used.
batch_size
: Number of labels that are passed to the bulk function at each round.
headers
: Custom HTTP headers to pass to the client (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='}). Check out https://www.elastic.co/guide/en/elasticsearch/reference/current/http-clients.html for more information.
OpenSearchDocumentStore.get_document_count
def get_document_count(filters: Optional[FilterType] = None,
index: Optional[str] = None,
only_documents_without_embedding: bool = False,
headers: Optional[Dict[str, str]] = None) -> int
Return the number of documents in the document store.
OpenSearchDocumentStore.get_label_count
def get_label_count(index: Optional[str] = None,
headers: Optional[Dict[str, str]] = None) -> int
Return the number of labels in the document store
OpenSearchDocumentStore.get_embedding_count
def get_embedding_count(index: Optional[str] = None,
filters: Optional[FilterType] = None,
headers: Optional[Dict[str, str]] = None) -> int
Return the count of embeddings in the document store.
OpenSearchDocumentStore.get_all_documents
def get_all_documents(
index: Optional[str] = None,
filters: Optional[FilterType] = None,
return_embedding: Optional[bool] = None,
batch_size: int = 10_000,
headers: Optional[Dict[str, str]] = None) -> List[Document]
Get documents from the document store.
Arguments:
index
: Name of the index to get the documents from. If None, the DocumentStore's default index (self.index) is used.
filters
: Optional filters to narrow down the documents to return. Filters are defined as nested dictionaries. The keys of the dictionaries can be a logical operator ("$and", "$or", "$not"), a comparison operator ("$eq", "$in", "$gt", "$gte", "$lt", "$lte") or a metadata field name. Logical operator keys take a dictionary of metadata field names and/or logical operators as value. Metadata field names take a dictionary of comparison operators as value. Comparison operator keys take a single value or (in case of "$in") a list of values as value. If no logical operator is provided, "$and" is used as default operation. If no comparison operator is provided, "$eq" (or "$in" if the comparison value is a list) is used as default operation.

__Example__:

```python
filters = {
    "$and": {
        "type": {"$eq": "article"},
        "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
        "rating": {"$gte": 3},
        "$or": {
            "genre": {"$in": ["economy", "politics"]},
            "publisher": {"$eq": "nytimes"}
        }
    }
}
```

return_embedding
: Whether to return the document embeddings.
batch_size
: When working with a large number of documents, batching can help reduce the memory footprint.
headers
: Custom HTTP headers to pass to the client (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='}). Check out https://www.elastic.co/guide/en/elasticsearch/reference/current/http-clients.html for more information.
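The filter semantics used throughout this reference (implicit "$and", implicit "$eq" or "$in") can be made concrete with a small in-memory evaluator. This is a sketch for understanding the syntax, not how the document store evaluates filters (it translates them into OpenSearch queries). `matches` and `_field_matches` are hypothetical helpers, and treating "$not" as the negation of the combined conditions is an assumption.

```python
from typing import Any, Union

LOGICAL = {"$and", "$or", "$not"}
COMPARISON = {
    "$eq": lambda actual, expected: actual == expected,
    "$in": lambda actual, expected: actual in expected,
    "$gt": lambda actual, expected: actual > expected,
    "$gte": lambda actual, expected: actual >= expected,
    "$lt": lambda actual, expected: actual < expected,
    "$lte": lambda actual, expected: actual <= expected,
}


def _field_matches(actual: Any, condition: Any) -> bool:
    if isinstance(condition, dict):  # explicit comparison operators
        return actual is not None and all(
            COMPARISON[op](actual, value) for op, value in condition.items())
    if isinstance(condition, list):  # a bare list defaults to "$in"
        return actual in condition
    return actual == condition       # a bare value defaults to "$eq"


def matches(meta: dict, filters: Union[dict, list], op: str = "$and") -> bool:
    """Check a metadata dict against a filter; "$and" is the default operator."""
    if isinstance(filters, list):
        results = [matches(meta, item) for item in filters]
    else:
        results = [
            matches(meta, value, op=key) if key in LOGICAL
            else _field_matches(meta.get(key), value)
            for key, value in filters.items()
        ]
    if op == "$or":
        return any(results)
    if op == "$not":  # assumption: negate the conjunction of the conditions
        return not all(results)
    return all(results)


explicit = {"$and": {"type": {"$eq": "article"}, "rating": {"$gte": 3},
                     "$or": {"genre": {"$in": ["economy", "politics"]},
                             "publisher": {"$eq": "nytimes"}}}}
simple = {"type": "article", "rating": {"$gte": 3},
          "$or": {"genre": ["economy", "politics"], "publisher": "nytimes"}}
meta = {"type": "article", "rating": 4, "genre": "economy", "publisher": "other"}
same_result = matches(meta, explicit) == matches(meta, simple)
```

Both spellings of the filter accept the sample metadata, which is the equivalence the default operators are meant to provide.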
OpenSearchDocumentStore.get_all_documents_generator
def get_all_documents_generator(
index: Optional[str] = None,
filters: Optional[FilterType] = None,
return_embedding: Optional[bool] = None,
batch_size: int = 10_000,
headers: Optional[Dict[str,
str]] = None) -> Generator[Document, None, None]
Get documents from the document store. Under the hood, documents are fetched in batches from the document store and yielded as individual documents. This method can be used to iteratively process a large number of documents without having to load all documents in memory.
Arguments:
index
: Name of the index to get the documents from. If None, the DocumentStore's default index (self.index) is used.
filters
: Optional filters to narrow down the documents to return. Filters are defined as nested dictionaries. The keys of the dictionaries can be a logical operator ("$and", "$or", "$not"), a comparison operator ("$eq", "$in", "$gt", "$gte", "$lt", "$lte") or a metadata field name. Logical operator keys take a dictionary of metadata field names and/or logical operators as value. Metadata field names take a dictionary of comparison operators as value. Comparison operator keys take a single value or (in case of "$in") a list of values as value. If no logical operator is provided, "$and" is used as default operation. If no comparison operator is provided, "$eq" (or "$in" if the comparison value is a list) is used as default operation.

__Example__:

```python
filters = {
    "$and": {
        "type": {"$eq": "article"},
        "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
        "rating": {"$gte": 3},
        "$or": {
            "genre": {"$in": ["economy", "politics"]},
            "publisher": {"$eq": "nytimes"}
        }
    }
}
```

return_embedding
: Whether to return the document embeddings.
batch_size
: When working with a large number of documents, batching can help reduce the memory footprint.
headers
: Custom HTTP headers to pass to the client (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='}). Check out https://www.elastic.co/guide/en/elasticsearch/reference/current/http-clients.html for more information.
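The fetch-in-batches, yield-one-by-one pattern behind this generator can be sketched with a stand-in data source. The example assumes simple offset paging against an in-memory list; `iterate_in_batches` and `fake_fetch` are hypothetical names, and the way the document store actually pages through the index is not shown here.

```python
from typing import Callable, Iterator, List


def iterate_in_batches(fetch_batch: Callable[[int, int], List[dict]],
                       batch_size: int) -> Iterator[dict]:
    """Fetch batch_size items at a time and yield them individually."""
    offset = 0
    while True:
        batch = fetch_batch(offset, batch_size)
        if not batch:
            return  # the source is exhausted
        yield from batch
        offset += len(batch)


corpus = [{"id": str(i)} for i in range(5)]


def fake_fetch(offset: int, size: int) -> List[dict]:
    """Stand-in for one request to the document store."""
    return corpus[offset:offset + size]


ids = [doc["id"] for doc in iterate_in_batches(fake_fetch, batch_size=2)]
```

Only one batch is held in memory at a time, which is what makes the generator variant suitable for large indices.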
OpenSearchDocumentStore.get_all_labels
def get_all_labels(index: Optional[str] = None,
filters: Optional[FilterType] = None,
headers: Optional[Dict[str, str]] = None,
batch_size: int = 10_000) -> List[Label]
Return all labels in the document store
OpenSearchDocumentStore.query_batch
def query_batch(queries: List[str],
filters: Optional[Union[FilterType,
List[Optional[FilterType]]]] = None,
top_k: int = 10,
custom_query: Optional[str] = None,
index: Optional[str] = None,
headers: Optional[Dict[str, str]] = None,
all_terms_must_match: bool = False,
scale_score: bool = True,
batch_size: Optional[int] = None) -> List[List[Document]]
Scan through documents in DocumentStore and return a small number of documents
that are most relevant to the provided queries as defined by keyword matching algorithms like BM25.
This method lets you find relevant documents for a list of query strings (output: List of Lists of Documents).
Arguments:
queries
: List of query strings.
filters
: Optional filters to narrow down the search space to documents whose metadata fulfill certain conditions. Can be a single filter that is applied to each query or a list of filters (one filter per query).
Filters are defined as nested dictionaries. The keys of the dictionaries can be a logical operator ("$and", "$or", "$not"), a comparison operator ("$eq", "$in", "$gt", "$gte", "$lt", "$lte") or a metadata field name. Logical operator keys take a dictionary of metadata field names and/or logical operators as value. Metadata field names take a dictionary of comparison operators as value. Comparison operator keys take a single value or (in case of "$in") a list of values as value. If no logical operator is provided, "$and" is used as default operation. If no comparison operator is provided, "$eq" (or "$in" if the comparison value is a list) is used as default operation.
__Example__:
```python
filters = {
"$and": {
"type": {"$eq": "article"},
"date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
"rating": {"$gte": 3},
"$or": {
"genre": {"$in": ["economy", "politics"]},
"publisher": {"$eq": "nytimes"}
}
}
}
# or simpler using default operators
filters = {
"type": "article",
"date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
"rating": {"$gte": 3},
"$or": {
"genre": ["economy", "politics"],
"publisher": "nytimes"
}
}
```
To use the same logical operator multiple times on the same level, logical operators take
optionally a list of dictionaries as value.
__Example__:
```python
filters = {
"$or": [
{
"$and": {
"Type": "News Paper",
"Date": {
"$lt": "2019-01-01"
}
}
},
{
"$and": {
"Type": "Blog Post",
"Date": {
"$gte": "2019-01-01"
}
}
}
]
}
```
- `top_k`: How many documents to return per query.
- `custom_query`: Custom query to be executed.
- `index`: The name of the index in the DocumentStore from which to retrieve documents.
- `headers`: Custom HTTP headers to pass to the document store client if supported (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='} for basic authentication).
- `all_terms_must_match`: Whether all terms of the query must match the document. If true, all query terms must be present in a document in order to be retrieved (i.e. the AND operator is used implicitly between query terms: "cozy fish restaurant" -> "cozy AND fish AND restaurant"). Otherwise, at least one query term must be present in a document in order to be retrieved (i.e. the OR operator is used implicitly between query terms: "cozy fish restaurant" -> "cozy OR fish OR restaurant"). Defaults to False.
- `scale_score`: Whether to scale the similarity score to the unit interval (range of [0,1]). If true (default), similarity scores (e.g. cosine or dot_product), which naturally have a different value range, are scaled to a range of [0,1], where 1 means extremely relevant. Otherwise, raw similarity scores (e.g. cosine or dot_product) are used.
- `batch_size`: Number of queries that are processed at once. If not specified, self.batch_size is used.
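The filter semantics described for the `filters` argument can be illustrated with a small pure-Python evaluator. This is only a sketch: `matches` and `_match_field` are hypothetical helpers, not part of the DocumentStore API, and the treatment of `"$not"` with several conditions (negating their conjunction) is an assumption.

```python
from typing import Any, Dict, List, Union

LOGICAL = {"$and", "$or", "$not"}

COMPARISONS = {
    "$eq": lambda value, target: value == target,
    "$in": lambda value, target: value in target,
    "$gt": lambda value, target: value > target,
    "$gte": lambda value, target: value >= target,
    "$lt": lambda value, target: value < target,
    "$lte": lambda value, target: value <= target,
}


def _match_field(actual: Any, condition: Any) -> bool:
    """Check one metadata value against its condition (hypothetical helper)."""
    if isinstance(condition, dict):
        # Explicit comparison operators: all of them must hold for the field.
        return actual is not None and all(
            COMPARISONS[op](actual, target) for op, target in condition.items()
        )
    if isinstance(condition, list):
        return actual in condition  # default "$in" for list values
    return actual == condition  # default "$eq" for single values


def matches(meta: Dict[str, Any], filters: Union[dict, List[dict]], op: str = "$and") -> bool:
    """Return True if `meta` satisfies `filters`; "$and" is the default operator."""
    # A list of dictionaries lets the same logical operator repeat on one level.
    clauses = filters if isinstance(filters, list) else [filters]
    results = []
    for clause in clauses:
        for key, value in clause.items():
            if key in LOGICAL:
                results.append(matches(meta, value, key))
            else:
                results.append(_match_field(meta.get(key), value))
    if op == "$or":
        return any(results)
    if op == "$not":
        return not all(results)  # assumption: negate the conjunction
    return all(results)


doc_meta = {"type": "article", "date": "2018-05-01", "rating": 4,
            "genre": "sports", "publisher": "nytimes"}
simple_filters = {
    "type": "article",
    "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
    "rating": {"$gte": 3},
    "$or": {"genre": ["economy", "politics"], "publisher": "nytimes"},
}
print(matches(doc_meta, simple_filters))
```

The explicit form (with `"$and"`, `"$eq"`, and `"$in"` spelled out) and the simplified form evaluate identically in this sketch, which mirrors the default-operator rules stated above.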
OpenSearchDocumentStore.update_embeddings
def update_embeddings(retriever: DenseRetriever,
index: Optional[str] = None,
filters: Optional[FilterType] = None,
update_existing_embeddings: bool = True,
batch_size: Optional[int] = None,
headers: Optional[Dict[str, str]] = None)
Updates the embeddings in the document store using the encoding model specified in the retriever.
This can be useful if you want to add or change the embeddings for your documents (e.g. after changing the retriever config).
Arguments:
- `retriever`: Retriever to use to update the embeddings.
- `index`: Index name to update.
- `update_existing_embeddings`: Whether to update existing embeddings of the documents. If set to False, only documents without embeddings are processed. This mode can be used for incremental updating of embeddings, wherein only newly indexed documents get processed.
- `filters`: Optional filters to narrow down the documents for which embeddings are to be updated. Filters are defined as nested dictionaries. The keys of the dictionaries can be a logical operator (`"$and"`, `"$or"`, `"$not"`), a comparison operator (`"$eq"`, `"$in"`, `"$gt"`, `"$gte"`, `"$lt"`, `"$lte"`) or a metadata field name. Logical operator keys take a dictionary of metadata field names and/or logical operators as value. Metadata field names take a dictionary of comparison operators as value. Comparison operator keys take a single value or (in case of `"$in"`) a list of values as value. If no logical operator is provided, `"$and"` is used as default operation. If no comparison operator is provided, `"$eq"` (or `"$in"` if the comparison value is a list) is used as default operation.
  __Example__:
  ```python
  filters = {
      "$and": {
          "type": {"$eq": "article"},
          "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
          "rating": {"$gte": 3},
          "$or": {
              "genre": {"$in": ["economy", "politics"]},
              "publisher": {"$eq": "nytimes"}
          }
      }
  }
  ```
- `batch_size`: When working with a large number of documents, batching can help reduce the memory footprint.
- `headers`: Custom HTTP headers to pass to the client (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='}). Check out https://www.elastic.co/guide/en/elasticsearch/reference/current/http-clients.html for more information.
Returns:
None
OpenSearchDocumentStore.delete_all_documents
def delete_all_documents(index: Optional[str] = None,
filters: Optional[FilterType] = None,
headers: Optional[Dict[str, str]] = None)
Delete documents in an index. All documents are deleted if no filters are passed.
Arguments:
- `index`: Index name to delete the documents from.
- `filters`: Optional filters to narrow down the documents to be deleted. Filters are defined as nested dictionaries. The keys of the dictionaries can be a logical operator (`"$and"`, `"$or"`, `"$not"`), a comparison operator (`"$eq"`, `"$in"`, `"$gt"`, `"$gte"`, `"$lt"`, `"$lte"`) or a metadata field name. Logical operator keys take a dictionary of metadata field names and/or logical operators as value. Metadata field names take a dictionary of comparison operators as value. Comparison operator keys take a single value or (in case of `"$in"`) a list of values as value. If no logical operator is provided, `"$and"` is used as default operation. If no comparison operator is provided, `"$eq"` (or `"$in"` if the comparison value is a list) is used as default operation.
  __Example__:
  ```python
  filters = {
      "$and": {
          "type": {"$eq": "article"},
          "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
          "rating": {"$gte": 3},
          "$or": {
              "genre": {"$in": ["economy", "politics"]},
              "publisher": {"$eq": "nytimes"}
          }
      }
  }
  ```
- `headers`: Custom HTTP headers to pass to the client (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='}). Check out https://www.elastic.co/guide/en/elasticsearch/reference/current/http-clients.html for more information.
Returns:
None
OpenSearchDocumentStore.delete_documents
def delete_documents(index: Optional[str] = None,
ids: Optional[List[str]] = None,
filters: Optional[FilterType] = None,
headers: Optional[Dict[str, str]] = None)
Delete documents in an index. All documents are deleted if no filters are passed.
Arguments:
- `index`: Index name to delete the documents from. If None, the DocumentStore's default index (self.index) will be used.
- `ids`: Optional list of IDs to narrow down the documents to be deleted.
- `filters`: Optional filters to narrow down the documents to be deleted. Filters are defined as nested dictionaries. The keys of the dictionaries can be a logical operator (`"$and"`, `"$or"`, `"$not"`), a comparison operator (`"$eq"`, `"$in"`, `"$gt"`, `"$gte"`, `"$lt"`, `"$lte"`) or a metadata field name. Logical operator keys take a dictionary of metadata field names and/or logical operators as value. Metadata field names take a dictionary of comparison operators as value. Comparison operator keys take a single value or (in case of `"$in"`) a list of values as value. If no logical operator is provided, `"$and"` is used as default operation. If no comparison operator is provided, `"$eq"` (or `"$in"` if the comparison value is a list) is used as default operation.
  __Example__:
  ```python
  filters = {
      "$and": {
          "type": {"$eq": "article"},
          "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
          "rating": {"$gte": 3},
          "$or": {
              "genre": {"$in": ["economy", "politics"]},
              "publisher": {"$eq": "nytimes"}
          }
      }
  }
  ```
  If filters are provided along with a list of IDs, this method deletes the intersection of the two query results (documents that match the filters and have their ID in the list).
- `headers`: Custom HTTP headers to pass to the client (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='}). Check out https://www.elastic.co/guide/en/elasticsearch/reference/current/http-clients.html for more information.
Returns:
None
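The way `ids` and `filters` combine can be pictured as a set intersection. The following is a minimal sketch under simplifying assumptions: documents live in a plain dictionary keyed by ID, and the filter is passed as a predicate function. `select_for_deletion` is a hypothetical helper, not a DocumentStore method.

```python
from typing import Callable, Dict, List, Optional


def select_for_deletion(
    docs: Dict[str, dict],
    ids: Optional[List[str]] = None,
    predicate: Optional[Callable[[dict], bool]] = None,
) -> List[str]:
    """Return the IDs that would be deleted for the given ids/filter combination."""
    selected = set(docs)  # with no ids and no filter, everything is selected
    if ids is not None:
        selected &= set(ids)
    if predicate is not None:
        selected &= {doc_id for doc_id, meta in docs.items() if predicate(meta)}
    return sorted(selected)


docs = {
    "a": {"type": "article"},
    "b": {"type": "blog"},
    "c": {"type": "article"},
}
# Only "a" is both in the ID list and matched by the filter.
print(select_for_deletion(docs, ids=["a", "b"], predicate=lambda m: m["type"] == "article"))
```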
OpenSearchDocumentStore.delete_labels
def delete_labels(index: Optional[str] = None,
ids: Optional[List[str]] = None,
filters: Optional[FilterType] = None,
headers: Optional[Dict[str, str]] = None)
Delete labels in an index. All labels are deleted if no filters are passed.
Arguments:
- `index`: Index name to delete the labels from. If None, the DocumentStore's default label index (self.label_index) will be used.
- `ids`: Optional list of IDs to narrow down the labels to be deleted.
- `filters`: Optional filters to narrow down the labels to be deleted. Filters are defined as nested dictionaries. The keys of the dictionaries can be a logical operator (`"$and"`, `"$or"`, `"$not"`), a comparison operator (`"$eq"`, `"$in"`, `"$gt"`, `"$gte"`, `"$lt"`, `"$lte"`) or a metadata field name. Logical operator keys take a dictionary of metadata field names and/or logical operators as value. Metadata field names take a dictionary of comparison operators as value. Comparison operator keys take a single value or (in case of `"$in"`) a list of values as value. If no logical operator is provided, `"$and"` is used as default operation. If no comparison operator is provided, `"$eq"` (or `"$in"` if the comparison value is a list) is used as default operation.
  __Example__:
  ```python
  filters = {
      "$and": {
          "type": {"$eq": "article"},
          "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
          "rating": {"$gte": 3},
          "$or": {
              "genre": {"$in": ["economy", "politics"]},
              "publisher": {"$eq": "nytimes"}
          }
      }
  }
  ```
- `headers`: Custom HTTP headers to pass to the client (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='}). Check out https://www.elastic.co/guide/en/elasticsearch/reference/current/http-clients.html for more information.
Returns:
None
OpenSearchDocumentStore.get_all_labels_aggregated
def get_all_labels_aggregated(
index: Optional[str] = None,
filters: Optional[FilterType] = None,
open_domain: bool = True,
drop_negative_labels: bool = False,
drop_no_answers: bool = False,
aggregate_by_meta: Optional[Union[str, list]] = None,
headers: Optional[Dict[str, str]] = None) -> List[MultiLabel]
Return all labels in the DocumentStore, aggregated into MultiLabel objects.
This aggregation step helps, for example, if you collected multiple possible answers for one question and now want all answers bundled together in one place for evaluation. How they are aggregated is defined by the open_domain and aggregate_by_meta parameters:
- If the questions are asked of a single document (i.e. SQuAD style), set open_domain=False to aggregate by question and document.
- If the questions are asked of your full collection of documents, set open_domain=True to aggregate just by question.
- If the questions are asked of a subslice of your document set (e.g. product review use cases), set open_domain=True and populate aggregate_by_meta with the names of Label meta fields to aggregate by question and your custom meta fields. For example, in a product review use case, you might set aggregate_by_meta=["product_id"] so that Labels with the same question but different answers from different documents are aggregated into one MultiLabel object, provided that they have the same product_id (to be found in Label.meta["product_id"]).
Arguments:
- `index`: Name of the index to get the labels from. If None, the DocumentStore's default index (self.index) will be used.
- `filters`: Optional filters to narrow down the search space to documents whose metadata fulfill certain conditions. Filters are defined as nested dictionaries. The keys of the dictionaries can be a logical operator (`"$and"`, `"$or"`, `"$not"`), a comparison operator (`"$eq"`, `"$in"`, `"$gt"`, `"$gte"`, `"$lt"`, `"$lte"`) or a metadata field name. Logical operator keys take a dictionary of metadata field names and/or logical operators as value. Metadata field names take a dictionary of comparison operators as value. Comparison operator keys take a single value or (in case of `"$in"`) a list of values as value. If no logical operator is provided, `"$and"` is used as default operation. If no comparison operator is provided, `"$eq"` (or `"$in"` if the comparison value is a list) is used as default operation.
  __Example__:
  ```python
  filters = {
      "$and": {
          "type": {"$eq": "article"},
          "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
          "rating": {"$gte": 3},
          "$or": {
              "genre": {"$in": ["economy", "politics"]},
              "publisher": {"$eq": "nytimes"}
          }
      }
  }
  ```
- `open_domain`: When True, labels are aggregated based on the question text alone. When False, labels are aggregated in a closed-domain fashion based on the question text and also the ID of the document that the label is tied to. In this setting, this function might return multiple MultiLabel objects with the same question string.
- `headers`: Custom HTTP headers to pass to the document store client if supported (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='} for basic authentication).
- `aggregate_by_meta`: The names of the Label meta fields by which to aggregate. For example: ["product_id"].
- `drop_negative_labels`: When True, labels with incorrect answers and documents are dropped.
- `drop_no_answers`: When True, labels with no answers are dropped.
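The three aggregation modes come down to which fields form the grouping key. The sketch below illustrates this with plain dictionaries standing in for Label objects; the field names `query`, `document_id`, `answer`, and `meta`, as well as the `group_labels` helper, are simplifying assumptions and not the library's implementation.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple, Union


def group_labels(
    labels: List[dict],
    open_domain: bool = True,
    aggregate_by_meta: Optional[Union[str, list]] = None,
) -> Dict[Tuple, List[str]]:
    """Group answers by question, plus document ID and/or meta fields."""
    if isinstance(aggregate_by_meta, str):
        meta_keys = [aggregate_by_meta]
    else:
        meta_keys = list(aggregate_by_meta or [])
    groups = defaultdict(list)
    for label in labels:
        key = [label["query"]]
        if not open_domain:
            key.append(label["document_id"])  # closed domain: question + document
        key.extend(label.get("meta", {}).get(name) for name in meta_keys)
        groups[tuple(key)].append(label["answer"])
    return dict(groups)


labels = [
    {"query": "Is it durable?", "document_id": "d1", "answer": "yes", "meta": {"product_id": "p1"}},
    {"query": "Is it durable?", "document_id": "d2", "answer": "very", "meta": {"product_id": "p1"}},
    {"query": "Is it durable?", "document_id": "d3", "answer": "no", "meta": {"product_id": "p2"}},
]
by_product = group_labels(labels, open_domain=True, aggregate_by_meta=["product_id"])
print(by_product)
```

In this example, the open-domain mode yields one group, the closed-domain mode yields three, and grouping by `product_id` yields two.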
OpenSearchDocumentStore.normalize_embedding
@staticmethod
def normalize_embedding(emb: np.ndarray) -> None
Performs L2 normalization of an embedding vector in place. Input can be a single vector (1D array) or a matrix (2D array).
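For illustration, in-place L2 normalization of a 1D or 2D array can be sketched with NumPy as follows. `l2_normalize_inplace` is a hypothetical stand-in, not the library's code, and leaving all-zero vectors unchanged is an assumption of this sketch.

```python
import numpy as np


def l2_normalize_inplace(emb: np.ndarray) -> None:
    """Scale each vector in `emb` to unit L2 norm, modifying `emb` itself.

    `emb` must have a floating-point dtype so the result can be written back.
    """
    # keepdims keeps the norm broadcastable for both 1D and 2D input.
    norm = np.linalg.norm(emb, axis=-1, keepdims=True)
    # Assumption: all-zero vectors are left untouched to avoid dividing by zero.
    np.divide(emb, norm, out=emb, where=norm != 0)


vector = np.array([3.0, 4.0])
l2_normalize_inplace(vector)
print(vector)

matrix = np.array([[3.0, 4.0], [0.0, 0.0], [1.0, 0.0]])
l2_normalize_inplace(matrix)
print(matrix)
```

The function returns None, like the method above, so callers read the result from the array they passed in.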
OpenSearchDocumentStore.add_eval_data
def add_eval_data(filename: str,
doc_index: str = "eval_document",
label_index: str = "label",
batch_size: Optional[int] = None,
preprocessor: Optional[PreProcessor] = None,
max_docs: Optional[Union[int, bool]] = None,
open_domain: bool = False,
headers: Optional[Dict[str, str]] = None)
Adds a SQuAD-formatted file to the DocumentStore in order to be able to perform evaluation on it.
If a jsonl file and a batch_size is passed to the function, documents are loaded batchwise from disk and also indexed batchwise to the DocumentStore in order to prevent out of memory errors.
Arguments:
- `filename`: Name of the file containing evaluation data (json or jsonl).
- `doc_index`: Elasticsearch index where evaluation documents should be stored.
- `label_index`: Elasticsearch index where labeled questions should be stored.
- `batch_size`: Optional number of documents that are loaded and processed at a time. When set to None (default), all documents are processed at once.
- `preprocessor`: Optional PreProcessor to preprocess evaluation documents. It can be used for splitting documents into passages (and assigning labels to corresponding passages). Currently the PreProcessor does not support split_by sentence, cleaning, or split_overlap != 0. When set to None (default), preprocessing is disabled.
- `max_docs`: Optional number of documents that will be loaded. When set to None (default), all available eval documents are used.
- `open_domain`: Set this to True if your file is an open domain dataset where two different answers to the same question might be found in different contexts.
- `headers`: Custom HTTP headers to pass to the document store client if supported (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='} for basic authentication).
OpenSearchDocumentStore.run
def run(documents: List[Union[dict, Document]],
index: Optional[str] = None,
headers: Optional[Dict[str, str]] = None,
id_hash_keys: Optional[List[str]] = None)
Run requests of document stores
Comment: We will gradually introduce the primitives. The document stores also accept dicts and parse them to documents. In the future, however, only documents themselves will be accepted. Parsing the dictionaries in the run function is therefore only an interim solution until the run function also accepts documents.
Arguments:
- `documents`: A list of dicts that are documents.
- `headers`: A list of headers.
- `index`: Optional name of index where the documents shall be written to. If None, the DocumentStore's default index (self.index) will be used.
- `id_hash_keys`: List of the fields that the hashes of the ids are generated from.
OpenSearchDocumentStore.describe_documents
def describe_documents(index=None)
Return a summary of the documents in the document store
Module memory
InMemoryDocumentStore
class InMemoryDocumentStore(KeywordDocumentStore)
In-memory document store
InMemoryDocumentStore.__init__
def __init__(index: str = "document",
label_index: str = "label",
embedding_field: Optional[str] = "embedding",
embedding_dim: int = 768,
return_embedding: bool = False,
similarity: str = "dot_product",
progress_bar: bool = True,
duplicate_documents: str = "overwrite",
use_gpu: bool = True,
scoring_batch_size: int = 500000,
devices: Optional[List[Union[str, "torch.device"]]] = None,
use_bm25: bool = False,
bm25_tokenization_regex: str = r"(?u)\b\w\w+\b",
bm25_algorithm: Literal["BM25Okapi", "BM25L",
"BM25Plus"] = "BM25Okapi",
bm25_parameters: Optional[Dict] = None)
Arguments:
- `index`: The documents are scoped to an index attribute that can be used when writing, querying, or deleting documents. This parameter sets the default value for the document index.
- `label_index`: The default value of the index attribute for the labels.
- `embedding_field`: Name of the field containing an embedding vector (only needed when using a dense retriever, e.g. DensePassageRetriever or EmbeddingRetriever, on top).
- `embedding_dim`: The size of the embedding vector.
- `return_embedding`: Whether to return the document embeddings.
- `similarity`: The similarity function used to compare document vectors. 'dot_product' is the default since it is more performant with DPR embeddings. 'cosine' is recommended if you are using a Sentence BERT model.
- `progress_bar`: Whether to show a tqdm progress bar or not. Can be helpful to disable in production deployments to keep the logs clean.
- `duplicate_documents`: How to handle duplicate documents. Parameter options: ('skip', 'overwrite', 'fail'). skip: Ignore the duplicate documents. overwrite: Update any existing documents with the same ID when adding documents. fail: An error is raised if the document ID of the document being added already exists.
- `use_gpu`: Whether to use a GPU or the CPU for calculating embedding similarity. Falls back to CPU if no GPU is available.
- `scoring_batch_size`: Batch size of documents to calculate similarity for. Very small batch sizes are inefficient. Very large batch sizes can overrun GPU memory. In general you want to make sure you have at least `embedding_dim` * `scoring_batch_size` * 4 bytes available in GPU memory. Since the data is originally stored in CPU memory, there is little risk of overrunning memory when running on CPU.
- `devices`: List of torch devices (e.g. cuda, cpu, mps) to limit inference to specific devices. A list containing torch device objects and/or strings is supported (for example [torch.device('cuda:0'), "mps", "cuda:1"]). When specifying `use_gpu=False`, the devices parameter is not used and a single cpu device is used for inference.
- `use_bm25`: Whether to build a sparse representation of documents based on BM25. `use_bm25=True` is required to connect `BM25Retriever` to this Document Store.
- `bm25_tokenization_regex`: The regular expression to use for tokenization of the text.
- `bm25_algorithm`: The specific BM25 implementation to adopt. Parameter options: ('BM25Okapi', 'BM25L', 'BM25Plus').
- `bm25_parameters`: Parameters for the BM25 implementation in a dictionary format. For example: {'k1':1.5, 'b':0.75, 'epsilon':0.25}. You can learn more about these parameters by visiting https://github.com/dorianbrown/rank_bm25. By default, no parameters are set.
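The GPU memory guideline for `scoring_batch_size` is a simple product of the embedding dimension, the batch size, and 4 bytes per value. A quick worked calculation with the defaults from the signature above (`embedding_dim=768`, `scoring_batch_size=500000`) looks like this; `scoring_batch_bytes` is a hypothetical helper used only for the arithmetic.

```python
def scoring_batch_bytes(embedding_dim: int, scoring_batch_size: int, bytes_per_value: int = 4) -> int:
    """Lower bound on the memory needed to hold one scoring batch of embeddings."""
    return embedding_dim * scoring_batch_size * bytes_per_value


default_bytes = scoring_batch_bytes(768, 500_000)
print(f"{default_bytes:,} bytes = {default_bytes / 1e9:.2f} GB")  # 1,536,000,000 bytes = 1.54 GB
```

If that is more than the GPU can spare, lowering `scoring_batch_size` reduces the requirement proportionally.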
InMemoryDocumentStore.write_documents
def write_documents(documents: Union[List[dict], List[Document]],
index: Optional[str] = None,
batch_size: int = 10_000,
duplicate_documents: Optional[str] = None,
headers: Optional[Dict[str, str]] = None)
Indexes documents for later queries.
Arguments:
- `documents`: A list of Python dictionaries or a list of Haystack Document objects. For documents as dictionaries, the format is {"content": ""}. Optionally, include metadata via {"content": "", "meta": {"name": "", "author": "somebody", ...}}. It can be used for filtering and is accessible in the responses of the Finder.
- `index`: Write documents to a custom namespace. For instance, documents for evaluation can be indexed in a separate index than the documents for search.
- `duplicate_documents`: How to handle duplicate documents. Parameter options: ('skip', 'overwrite', 'fail'). skip: Ignore the duplicate documents. overwrite: Update any existing documents with the same ID when adding documents. fail: An error is raised if the document ID of the document being added already exists.

Raises:
- `DuplicateDocumentError`: Exception triggered on a duplicate document.

Returns:
None
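The three `duplicate_documents` policies can be sketched with a plain dictionary acting as the store. This is an illustration only: `write_with_policy` is a hypothetical helper, the `DuplicateDocumentError` class below is a local stand-in for the library's exception, and documents are assumed to carry an explicit `id` key.

```python
from typing import Dict, List


class DuplicateDocumentError(ValueError):
    """Local stand-in for the exception raised in 'fail' mode."""


def write_with_policy(store: Dict[str, dict], docs: List[dict], duplicate_documents: str = "overwrite") -> None:
    """Write `docs` into `store`, handling ID collisions per the chosen policy."""
    if duplicate_documents not in ("skip", "overwrite", "fail"):
        raise ValueError(f"Unknown duplicate_documents option: {duplicate_documents}")
    for doc in docs:
        doc_id = doc["id"]
        if doc_id in store:
            if duplicate_documents == "skip":
                continue  # keep the existing document
            if duplicate_documents == "fail":
                raise DuplicateDocumentError(f"Document with id '{doc_id}' already exists")
        store[doc_id] = doc  # new document, or 'overwrite' of an existing one


store: Dict[str, dict] = {}
write_with_policy(store, [{"id": "1", "content": "first version"}])
write_with_policy(store, [{"id": "1", "content": "second version"}], "skip")
print(store["1"]["content"])  # first version
```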
InMemoryDocumentStore.update_bm25
def update_bm25(index: Optional[str] = None)
Updates the BM25 sparse representation in the document store.
Arguments:
- `index`: Index name for which the BM25 representation is to be updated. If set to None, the default self.index is used.
InMemoryDocumentStore.write_labels
def write_labels(labels: Union[List[dict], List[Label]],
index: Optional[str] = None,
headers: Optional[Dict[str, str]] = None)
Write annotation labels into document store.
InMemoryDocumentStore.get_document_by_id
def get_document_by_id(
id: str,
index: Optional[str] = None,
headers: Optional[Dict[str, str]] = None) -> Optional[Document]
Fetch a document by specifying its text id string.
InMemoryDocumentStore.get_documents_by_id
def get_documents_by_id(
ids: List[str],
index: Optional[str] = None,
batch_size: Optional[int] = None,
headers: Optional[Dict[str, str]] = None) -> List[Document]
Fetch documents by specifying a list of text id strings.
InMemoryDocumentStore.query_by_embedding
def query_by_embedding(query_emb: np.ndarray,
filters: Optional[FilterType] = None,
top_k: int = 10,
index: Optional[str] = None,
return_embedding: Optional[bool] = None,
headers: Optional[Dict[str, str]] = None,
scale_score: bool = True) -> List[Document]
Find the document that is most similar to the provided query_emb
by using a vector similarity metric.
Arguments:
- `query_emb`: Embedding of the query (e.g. gathered from DPR).
- `filters`: Optional filters to narrow down the search space to documents whose metadata fulfill certain conditions. Filters are defined as nested dictionaries. The keys of the dictionaries can be a logical operator (`"$and"`, `"$or"`, `"$not"`), a comparison operator (`"$eq"`, `"$in"`, `"$gt"`, `"$gte"`, `"$lt"`, `"$lte"`) or a metadata field name. Logical operator keys take a dictionary of metadata field names and/or logical operators as value. Metadata field names take a dictionary of comparison operators as value. Comparison operator keys take a single value or (in case of `"$in"`) a list of values as value. If no logical operator is provided, `"$and"` is used as default operation. If no comparison operator is provided, `"$eq"` (or `"$in"` if the comparison value is a list) is used as default operation.
  __Example__:
  ```python
  filters = {
      "$and": {
          "type": {"$eq": "article"},
          "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
          "rating": {"$gte": 3},
          "$or": {
              "genre": {"$in": ["economy", "politics"]},
              "publisher": {"$eq": "nytimes"}
          }
      }
  }
  # or simpler using default operators
  filters = {
      "type": "article",
      "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
      "rating": {"$gte": 3},
      "$or": {
          "genre": ["economy", "politics"],
          "publisher": "nytimes"
      }
  }
  ```
  To use the same logical operator multiple times on the same level, logical operators optionally take a list of dictionaries as value.
  __Example__:
  ```python
  filters = {
      "$or": [
          {
              "$and": {
                  "Type": "News Paper",
                  "Date": {"$lt": "2019-01-01"}
              }
          },
          {
              "$and": {
                  "Type": "Blog Post",
                  "Date": {"$gte": "2019-01-01"}
              }
          }
      ]
  }
  ```
- `top_k`: How many documents to return.
- `index`: Index name for storing the docs and metadata.
- `return_embedding`: Whether to return the document embeddings.
- `scale_score`: Whether to scale the similarity score to the unit interval (range of [0,1]). If true (default), similarity scores (e.g. cosine or dot_product), which naturally have a different value range, are scaled to a range of [0,1], where 1 means extremely relevant. Otherwise, raw similarity scores (e.g. cosine or dot_product) are used.
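To make the idea behind `scale_score` concrete, here is a hedged sketch of how raw scores with different natural ranges could be mapped onto [0,1]. The exact formulas are assumptions for illustration (a linear shift for cosine, a logistic squashing of `score / 100` for dot product); the text above does not specify them, and `scale_to_unit_interval` is a hypothetical helper.

```python
import math


def scale_to_unit_interval(score: float, similarity: str) -> float:
    """Map a raw similarity score to [0, 1] (illustrative formulas, see lead-in)."""
    if similarity == "cosine":
        # Cosine similarity lies in [-1, 1]; shift and halve it.
        return (score + 1) / 2
    # Assumed squashing for unbounded dot products: logistic function of score / 100.
    return 1 / (1 + math.exp(-score / 100))


print(scale_to_unit_interval(1.0, "cosine"))       # 1.0
print(scale_to_unit_interval(0.0, "dot_product"))  # 0.5
```

Either way, scaling is monotonic, so it changes the reported score values but not the ranking of the returned documents.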
InMemoryDocumentStore.update_embeddings
def update_embeddings(retriever: DenseRetriever,
index: Optional[str] = None,
filters: Optional[FilterType] = None,
update_existing_embeddings: bool = True,
batch_size: int = 10_000)
Updates the embeddings in the document store using the encoding model specified in the retriever.
This can be useful if you want to add or change the embeddings for your documents (e.g. after changing the retriever config).
Arguments:
- `retriever`: Retriever to use to get embeddings for text.
- `index`: Index name for which embeddings are to be updated. If set to None, the default self.index is used.
- `update_existing_embeddings`: Whether to update existing embeddings of the documents. If set to False, only documents without embeddings are processed. This mode can be used for incremental updating of embeddings, wherein only newly indexed documents get processed.
- `filters`: Narrow down the scope to documents that match the given filters. Filters are defined as nested dictionaries. The keys of the dictionaries can be a logical operator (`"$and"`, `"$or"`, `"$not"`), a comparison operator (`"$eq"`, `"$in"`, `"$gt"`, `"$gte"`, `"$lt"`, `"$lte"`) or a metadata field name. Logical operator keys take a dictionary of metadata field names and/or logical operators as value. Metadata field names take a dictionary of comparison operators as value. Comparison operator keys take a single value or (in case of `"$in"`) a list of values as value. If no logical operator is provided, `"$and"` is used as default operation. If no comparison operator is provided, `"$eq"` (or `"$in"` if the comparison value is a list) is used as default operation.
  __Example__:
  ```python
  filters = {
      "$and": {
          "type": {"$eq": "article"},
          "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
          "rating": {"$gte": 3},
          "$or": {
              "genre": {"$in": ["economy", "politics"]},
              "publisher": {"$eq": "nytimes"}
          }
      }
  }
  ```
- `batch_size`: When working with a large number of documents, batching can help reduce the memory footprint.
Returns:
None
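The interplay of `update_existing_embeddings` and `batch_size` can be sketched as follows. This is a simplified illustration with assumptions: documents are plain dictionaries, `embed` is a hypothetical callable standing in for the retriever's encoder, and, unlike the method above, the sketch returns the number of processed documents so the effect is easy to inspect.

```python
from typing import Callable, List


def update_embeddings_sketch(
    docs: List[dict],
    embed: Callable[[List[str]], List[List[float]]],
    update_existing_embeddings: bool = True,
    batch_size: int = 10_000,
) -> int:
    """Embed documents batch by batch; optionally skip those already embedded."""
    todo = [
        doc for doc in docs
        if update_existing_embeddings or doc.get("embedding") is None
    ]
    for start in range(0, len(todo), batch_size):
        batch = todo[start:start + batch_size]
        # Only one batch of texts and vectors is held at a time.
        vectors = embed([doc["content"] for doc in batch])
        for doc, vector in zip(batch, vectors):
            doc["embedding"] = vector
    return len(todo)


def toy_embed(texts: List[str]) -> List[List[float]]:
    """Toy encoder: a one-dimensional 'embedding' holding the text length."""
    return [[float(len(text))] for text in texts]


docs = [
    {"content": "a", "embedding": [0.0]},
    {"content": "bb", "embedding": None},
    {"content": "ccc"},
]
processed = update_embeddings_sketch(docs, toy_embed, update_existing_embeddings=False, batch_size=1)
print(processed)  # 2
```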
InMemoryDocumentStore.get_document_count
def get_document_count(filters: Optional[FilterType] = None,
index: Optional[str] = None,
only_documents_without_embedding: bool = False,
headers: Optional[Dict[str, str]] = None) -> int
Return the number of documents in the document store.
InMemoryDocumentStore.update_document_meta
def update_document_meta(id: str,
meta: Dict[str, Any],
index: Optional[str] = None)
Update the metadata dictionary of a document by specifying its string id.
Arguments:
- `id`: The ID of the Document whose metadata is being updated.
- `meta`: A dictionary with key-value pairs that should be added / changed for the provided Document ID.
- `index`: Name of the index the Document is located at.
InMemoryDocumentStore.get_embedding_count
def get_embedding_count(filters: Optional[FilterType] = None,
index: Optional[str] = None) -> int
Return the count of embeddings in the document store.
InMemoryDocumentStore.get_label_count
def get_label_count(index: Optional[str] = None,
headers: Optional[Dict[str, str]] = None) -> int
Return the number of labels in the document store.
InMemoryDocumentStore.get_all_documents
def get_all_documents(
index: Optional[str] = None,
filters: Optional[FilterType] = None,
return_embedding: Optional[bool] = None,
batch_size: int = 10_000,
headers: Optional[Dict[str, str]] = None) -> List[Document]
Get all documents from the document store as a list.
Arguments:
- `index`: Name of the index to get the documents from. If None, the DocumentStore's default index (self.index) will be used.
- `filters`: Narrow down the scope to documents that match the given filters. Filters are defined as nested dictionaries. The keys of the dictionaries can be a logical operator (`"$and"`, `"$or"`, `"$not"`), a comparison operator (`"$eq"`, `"$in"`, `"$gt"`, `"$gte"`, `"$lt"`, `"$lte"`) or a metadata field name. Logical operator keys take a dictionary of metadata field names and/or logical operators as value. Metadata field names take a dictionary of comparison operators as value. Comparison operator keys take a single value or (in case of `"$in"`) a list of values as value. If no logical operator is provided, `"$and"` is used as default operation. If no comparison operator is provided, `"$eq"` (or `"$in"` if the comparison value is a list) is used as default operation.
  __Example__:
  ```python
  filters = {
      "$and": {
          "type": {"$eq": "article"},
          "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
          "rating": {"$gte": 3},
          "$or": {
              "genre": {"$in": ["economy", "politics"]},
              "publisher": {"$eq": "nytimes"}
          }
      }
  }
  ```
- `return_embedding`: Whether to return the document embeddings.
InMemoryDocumentStore.get_all_documents_generator
def get_all_documents_generator(
index: Optional[str] = None,
filters: Optional[FilterType] = None,
return_embedding: Optional[bool] = None,
batch_size: int = 10_000,
headers: Optional[Dict[str,
str]] = None) -> Generator[Document, None, None]
Get all documents from the document store. The method returns a Python generator that yields individual
documents.
Arguments:
- `index`: Name of the index to get the documents from. If None, the DocumentStore's default index (self.index) will be used.
- `filters`: Narrow down the scope to documents that match the given filters. Filters are defined as nested dictionaries. The keys of the dictionaries can be a logical operator (`"$and"`, `"$or"`, `"$not"`), a comparison operator (`"$eq"`, `"$in"`, `"$gt"`, `"$gte"`, `"$lt"`, `"$lte"`) or a metadata field name. Logical operator keys take a dictionary of metadata field names and/or logical operators as value. Metadata field names take a dictionary of comparison operators as value. Comparison operator keys take a single value or (in case of `"$in"`) a list of values as value. If no logical operator is provided, `"$and"` is used as default operation. If no comparison operator is provided, `"$eq"` (or `"$in"` if the comparison value is a list) is used as default operation.
  __Example__:
  ```python
  filters = {
      "$and": {
          "type": {"$eq": "article"},
          "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
          "rating": {"$gte": 3},
          "$or": {
              "genre": {"$in": ["economy", "politics"]},
              "publisher": {"$eq": "nytimes"}
          }
      }
  }
  ```
- `return_embedding`: Whether to return the document embeddings.
InMemoryDocumentStore.get_all_labels
def get_all_labels(index: Optional[str] = None,
filters: Optional[FilterType] = None,
headers: Optional[Dict[str, str]] = None) -> List[Label]
Return all labels in the document store.
InMemoryDocumentStore.delete_all_documents
def delete_all_documents(index: Optional[str] = None,
filters: Optional[FilterType] = None,
headers: Optional[Dict[str, str]] = None)
Delete documents in an index. All documents are deleted if no filters are passed.
Arguments:
- `index`: Index name to delete the documents from.
- `filters`: Narrow down the scope to documents that match the given filters. Filters are defined as nested dictionaries. The keys of the dictionaries can be a logical operator (`"$and"`, `"$or"`, `"$not"`), a comparison operator (`"$eq"`, `"$in"`, `"$gt"`, `"$gte"`, `"$lt"`, `"$lte"`) or a metadata field name. Logical operator keys take a dictionary of metadata field names and/or logical operators as value. Metadata field names take a dictionary of comparison operators as value. Comparison operator keys take a single value or (in case of `"$in"`) a list of values as value. If no logical operator is provided, `"$and"` is used as default operation. If no comparison operator is provided, `"$eq"` (or `"$in"` if the comparison value is a list) is used as default operation.
  __Example__:
  ```python
  filters = {
      "$and": {
          "type": {"$eq": "article"},
          "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
          "rating": {"$gte": 3},
          "$or": {
              "genre": {"$in": ["economy", "politics"]},
              "publisher": {"$eq": "nytimes"}
          }
      }
  }
  ```
Returns:
None
InMemoryDocumentStore.delete_documents
def delete_documents(index: Optional[str] = None,
ids: Optional[List[str]] = None,
filters: Optional[FilterType] = None,
headers: Optional[Dict[str, str]] = None)
Delete documents in an index. All documents are deleted if no filters are passed.
Arguments:
- `index`: Index name to delete the documents from. If None, the DocumentStore's default index (self.index) will be used.
- `ids`: Optional list of IDs to narrow down the documents to be deleted.
- `filters`: Narrow down the scope to documents that match the given filters. Filters are defined as nested dictionaries. The keys of the dictionaries can be a logical operator (`"$and"`, `"$or"`, `"$not"`), a comparison operator (`"$eq"`, `"$in"`, `"$gt"`, `"$gte"`, `"$lt"`, `"$lte"`) or a metadata field name. Logical operator keys take a dictionary of metadata field names and/or logical operators as value. Metadata field names take a dictionary of comparison operators as value. Comparison operator keys take a single value or (in case of `"$in"`) a list of values as value. If no logical operator is provided, `"$and"` is used as default operation. If no comparison operator is provided, `"$eq"` (or `"$in"` if the comparison value is a list) is used as default operation.
  __Example__:
  ```python
  filters = {
      "$and": {
          "type": {"$eq": "article"},
          "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
          "rating": {"$gte": 3},
          "$or": {
              "genre": {"$in": ["economy", "politics"]},
              "publisher": {"$eq": "nytimes"}
          }
      }
  }
  ```
Returns:
None
InMemoryDocumentStore.delete_index
def delete_index(index: str)
Delete an existing index. The index including all data will be removed.
Arguments:
- `index`: The name of the index to delete.
Returns:
None
InMemoryDocumentStore.delete_labels
def delete_labels(index: Optional[str] = None,
ids: Optional[List[str]] = None,
filters: Optional[FilterType] = None,
headers: Optional[Dict[str, str]] = None)
Delete labels in an index. All labels are deleted if no filters are passed.
Arguments:
- `index`: Index name to delete the labels from. If None, the DocumentStore's default label index (self.label_index) will be used.
- `ids`: Optional list of IDs to narrow down the labels to be deleted.
- `filters`: Narrow down the scope to documents that match the given filters. Filters are defined as nested dictionaries. The keys of the dictionaries can be a logical operator (`"$and"`, `"$or"`, `"$not"`), a comparison operator (`"$eq"`, `"$in"`, `"$gt"`, `"$gte"`, `"$lt"`, `"$lte"`) or a metadata field name. Logical operator keys take a dictionary of metadata field names and/or logical operators as value. Metadata field names take a dictionary of comparison operators as value. Comparison operator keys take a single value or (in case of `"$in"`) a list of values as value. If no logical operator is provided, `"$and"` is used as default operation. If no comparison operator is provided, `"$eq"` (or `"$in"` if the comparison value is a list) is used as default operation.
  __Example__:
  ```python
  filters = {
      "$and": {
          "type": {"$eq": "article"},
          "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
          "rating": {"$gte": 3},
          "$or": {
              "genre": {"$in": ["economy", "politics"]},
              "publisher": {"$eq": "nytimes"}
          }
      }
  }
  ```
Returns:
None
InMemoryDocumentStore.query
def query(query: Optional[str],
filters: Optional[FilterType] = None,
top_k: int = 10,
custom_query: Optional[str] = None,
index: Optional[str] = None,
headers: Optional[Dict[str, str]] = None,
all_terms_must_match: bool = False,
scale_score: bool = True) -> List[Document]
Scan through documents in DocumentStore and return a small number of documents
that are most relevant to the query as defined by the BM25 algorithm.
Arguments:
query
: The query.
top_k
: How many documents to return per query.
index
: The name of the index in the DocumentStore from which to retrieve documents.
scale_score
: Whether to scale the similarity score to the unit interval (range of [0,1]).
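To make "most relevant as defined by the BM25 algorithm" concrete, here is a minimal standard-library sketch of Okapi BM25 ranking. It is an illustration only and not the code InMemoryDocumentStore runs: the whitespace tokenizer, the parameter values `k1=1.5` and `b=0.75`, and the function names `bm25_scores` and `top_k_documents` are all assumptions.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: str, docs: List[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score every document against the query with Okapi BM25."""
    if not docs:
        return []
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = (sum(len(tokens) for tokens in tokenized) / n_docs) or 1.0

    # Document frequency: in how many documents does each term occur?
    doc_freq: Counter = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / (term_freq[term] + length_norm)
        scores.append(score)
    return scores


def top_k_documents(query: str, docs: List[str], top_k: int = 10) -> List[str]:
    """Return the `top_k` documents with the highest BM25 score."""
    scores = bm25_scores(query, docs)
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return [docs[i] for i in ranked[:top_k]]


corpus = ["the cat sat on the mat", "dogs chase cats", "the stock market fell"]
best = top_k_documents("stock market", corpus, top_k=1)
```

Documents that share no term with the query score zero, so only the third document in `corpus` is a meaningful hit for "stock market".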
InMemoryDocumentStore.query_batch
def query_batch(queries: List[str],
filters: Optional[Union[FilterType,
List[Optional[FilterType]]]] = None,
top_k: int = 10,
custom_query: Optional[str] = None,
index: Optional[str] = None,
headers: Optional[Dict[str, str]] = None,
all_terms_must_match: bool = False,
scale_score: bool = True) -> List[List[Document]]
Scan through documents in DocumentStore and return a small number of documents
that are most relevant to the provided queries as defined by keyword matching algorithms like BM25. This method lets you find relevant documents for a list of query strings (output: List of Lists of Documents).
Arguments:
queries
: The queries.
top_k
: How many documents to return per query.
index
: The name of the index in the DocumentStore from which to retrieve documents.
scale_score
: Whether to scale the similarity score to the unit interval (range of [0,1]).
InMemoryDocumentStore.get_all_labels_aggregated
def get_all_labels_aggregated(
index: Optional[str] = None,
filters: Optional[FilterType] = None,
open_domain: bool = True,
drop_negative_labels: bool = False,
drop_no_answers: bool = False,
aggregate_by_meta: Optional[Union[str, list]] = None,
headers: Optional[Dict[str, str]] = None) -> List[MultiLabel]
Return all labels in the DocumentStore, aggregated into MultiLabel objects.
This aggregation step helps, for example, if you collected multiple possible answers for one question and now want all answers bundled together in one place for evaluation. How they are aggregated is defined by the open_domain and aggregate_by_meta parameters. If the questions are being asked of a single document (i.e. SQuAD style), set open_domain=False to aggregate by question and document. If the questions are being asked of your full collection of documents, set open_domain=True to aggregate just by question. If the questions are being asked of a subslice of your document set (e.g. product review use cases), set open_domain=True and populate aggregate_by_meta with the names of Label meta fields to aggregate by question and your custom meta fields. For example, in a product review use case, you might set aggregate_by_meta=["product_id"] so that Labels with the same question but different answers from different documents are aggregated into one MultiLabel object, provided that they have the same product_id (found in Label.meta["product_id"]).
Arguments:
index
: Name of the index to get the labels from. If None, the DocumentStore's default index (self.index) will be used.
filters
: Optional filters to narrow down the search space to documents whose metadata fulfill certain conditions. Filters are defined as nested dictionaries. The keys of the dictionaries can be a logical operator ("$and", "$or", "$not"), a comparison operator ("$eq", "$in", "$gt", "$gte", "$lt", "$lte") or a metadata field name. Logical operator keys take a dictionary of metadata field names and/or logical operators as value. Metadata field names take a dictionary of comparison operators as value. Comparison operator keys take a single value or (in case of "$in") a list of values as value. If no logical operator is provided, "$and" is used as default operation. If no comparison operator is provided, "$eq" (or "$in" if the comparison value is a list) is used as default operation. Example:
```python
filters = {
    "$and": {
        "type": {"$eq": "article"},
        "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
        "rating": {"$gte": 3},
        "$or": {
            "genre": {"$in": ["economy", "politics"]},
            "publisher": {"$eq": "nytimes"}
        }
    }
}
```
open_domain
: When True, labels are aggregated purely based on the question text alone. When False, labels are aggregated in a closed domain fashion based on the question text and also the id of the document that the label is tied to. In this setting, this function might return multiple MultiLabel objects with the same question string.
headers
: Custom HTTP headers to pass to document store client if supported (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='} for basic authentication)
aggregate_by_meta
: The names of the Label meta fields by which to aggregate. For example: ["product_id"]
drop_negative_labels
: When True, labels with incorrect answers and documents are dropped.
drop_no_answers
: When True, labels with no answers are dropped.
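The interplay of open_domain and aggregate_by_meta can be sketched as a grouping key. This is a simplified illustration, not the library's code: labels are assumed to be plain dictionaries with hypothetical "query", "document_id" and "meta" keys rather than Label objects, and `group_labels` is a made-up helper name.

```python
from collections import defaultdict
from typing import Any, Dict, List, Optional, Tuple, Union


def group_labels(
    labels: List[Dict[str, Any]],
    open_domain: bool = True,
    aggregate_by_meta: Optional[Union[str, List[str]]] = None,
) -> Dict[Tuple, List[Dict[str, Any]]]:
    """Group labels the way the parameters above describe."""
    if isinstance(aggregate_by_meta, str):
        aggregate_by_meta = [aggregate_by_meta]
    groups: Dict[Tuple, List[Dict[str, Any]]] = defaultdict(list)
    for label in labels:
        key = [label["query"]]
        if not open_domain:
            # Closed domain: the document the label is tied to is part of the key.
            key.append(label["document_id"])
        for field in aggregate_by_meta or []:
            key.append(label.get("meta", {}).get(field))
        groups[tuple(key)].append(label)
    return dict(groups)


review_labels = [
    {"query": "Is it durable?", "document_id": "d1", "meta": {"product_id": "p1"}},
    {"query": "Is it durable?", "document_id": "d2", "meta": {"product_id": "p1"}},
    {"query": "Is it durable?", "document_id": "d3", "meta": {"product_id": "p2"}},
]
by_question = group_labels(review_labels, open_domain=True)
by_question_and_doc = group_labels(review_labels, open_domain=False)
by_product = group_labels(review_labels, open_domain=True, aggregate_by_meta=["product_id"])
```

Here `by_question` holds a single group, `by_question_and_doc` holds three, and `by_product` holds two: one per product_id, each of which would become one MultiLabel.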
InMemoryDocumentStore.normalize_embedding
@staticmethod
def normalize_embedding(emb: np.ndarray) -> None
Performs L2 normalization of the embedding vector(s) in place. Input can be a single vector (1D array) or a matrix (2D array).
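As an illustration of what in-place L2 normalization means, the sketch below divides each vector by its Euclidean norm. It uses plain Python lists instead of np.ndarray so that it runs without dependencies; the function name `l2_normalize_in_place` and the choice to leave zero vectors unchanged are assumptions.

```python
import math
from typing import List, Union

Vector = List[float]


def l2_normalize_in_place(emb: Union[Vector, List[Vector]]) -> None:
    """Scale a vector, or each row of a matrix, to unit L2 norm. Returns None."""
    # Treat a single vector as a matrix with one row.
    rows = emb if emb and isinstance(emb[0], list) else [emb]
    for row in rows:
        norm = math.sqrt(sum(x * x for x in row))
        if norm == 0.0:
            continue  # Assumption: a zero vector is left unchanged.
        for i, x in enumerate(row):
            row[i] = x / norm


vector = [3.0, 4.0]
l2_normalize_in_place(vector)  # vector now has length 1

matrix = [[3.0, 4.0], [0.0, 2.0]]
l2_normalize_in_place(matrix)  # every row now has length 1
```

Because the input is modified in place, the function returns None, mirroring the documented signature.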
InMemoryDocumentStore.add_eval_data
def add_eval_data(filename: str,
doc_index: str = "eval_document",
label_index: str = "label",
batch_size: Optional[int] = None,
preprocessor: Optional[PreProcessor] = None,
max_docs: Optional[Union[int, bool]] = None,
open_domain: bool = False,
headers: Optional[Dict[str, str]] = None)
Adds a SQuAD-formatted file to the DocumentStore in order to be able to perform evaluation on it.
If a jsonl file and a batch_size is passed to the function, documents are loaded batchwise from disk and also indexed batchwise to the DocumentStore in order to prevent out of memory errors.
Arguments:
filename
: Name of the file containing evaluation data (json or jsonl)
doc_index
: Elasticsearch index where evaluation documents should be stored
label_index
: Elasticsearch index where labeled questions should be stored
batch_size
: Optional number of documents that are loaded and processed at a time. When set to None (default) all documents are processed at once.
preprocessor
: Optional PreProcessor to preprocess evaluation documents. It can be used for splitting documents into passages (and assigning labels to corresponding passages). Currently the PreProcessor does not support split_by sentence, cleaning nor split_overlap != 0. When set to None (default) preprocessing is disabled.
max_docs
: Optional number of documents that will be loaded. When set to None (default) all available eval documents are used.
open_domain
: Set this to True if your file is an open domain dataset where two different answers to the same question might be found in different contexts.
headers
: Custom HTTP headers to pass to document store client if supported (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='} for basic authentication)
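The batchwise loading of a jsonl file mentioned above can be pictured as follows. This is a sketch under assumptions, not the method's actual code: `read_jsonl_in_batches` is a hypothetical helper, and an in-memory `io.StringIO` stands in for a file on disk.

```python
import io
import json
from typing import Any, Dict, Iterator, List, Optional, TextIO


def read_jsonl_in_batches(
    lines: TextIO, batch_size: Optional[int] = None
) -> Iterator[List[Dict[str, Any]]]:
    """Yield parsed jsonl records in lists of at most `batch_size` items.

    With batch_size=None, all records are yielded as one batch.
    """
    batch: List[Dict[str, Any]] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        batch.append(json.loads(line))
        if batch_size is not None and len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


# Five one-line JSON records standing in for an evaluation file.
fake_file = io.StringIO("\n".join(json.dumps({"id": i}) for i in range(5)))
batch_sizes = [len(batch) for batch in read_jsonl_in_batches(fake_file, batch_size=2)]
```

Only one batch is held in memory at a time, which is the point of passing a batch_size for large evaluation files.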
InMemoryDocumentStore.run
def run(documents: List[Union[dict, Document]],
index: Optional[str] = None,
headers: Optional[Dict[str, str]] = None,
id_hash_keys: Optional[List[str]] = None)
Run requests of document stores.
Comment: We will gradually introduce the primitives. The document stores also accept dicts and parse them to documents. In the future, however, only documents themselves will be accepted. Parsing the dictionaries in the run function is therefore only an interim solution until the run function also accepts documents.
Arguments:
documents
: A list of dicts that are documents.
headers
: A dictionary of headers.
index
: Optional name of index where the documents shall be written to. If None, the DocumentStore's default index (self.index) will be used.
id_hash_keys
: List of the fields that the hashes of the ids are generated from.
InMemoryDocumentStore.describe_documents
def describe_documents(index=None)
Return a summary of the documents in the document store
Module mongodb_atlas
MongoDBAtlasDocumentStore
class MongoDBAtlasDocumentStore(BaseDocumentStore)
MongoDBAtlasDocumentStore.__init__
def __init__(mongo_connection_string: Optional[str] = None,
database_name: Optional[str] = None,
collection_name: Optional[str] = None,
vector_search_index: Optional[str] = None,
embedding_dim: int = 768,
return_embedding: bool = False,
similarity: str = "cosine",
embedding_field: str = "embedding",
progress_bar: bool = True,
duplicate_documents: str = "overwrite",
recreate_index: bool = False)
Document Store using MongoDB Atlas as a backend (https://www.mongodb.com/docs/atlas/getting-started/).
It is compatible with EmbeddingRetriever and filters.
Arguments:
mongo_connection_string
: MongoDB Atlas connection string in the format: "mongodb+srv://{mongo_atlas_username}:{mongo_atlas_password}@{mongo_atlas_host}/?{mongo_atlas_params_string}".
database_name
: Name of the database to use.
collection_name
: Name of the collection to use.
vector_search_index
: The name of the index to use for vector search. To use the search index it must have been created in the Atlas web UI before. None by default.
embedding_dim
: Dimensionality of embeddings, 768 by default.
return_embedding
: Whether to return document embeddings when returning documents.
similarity
: The similarity function to use for the embeddings. One of "euclidean", "cosine" or "dotProduct". "cosine" is the default.
embedding_field
: The name of the field in the document that contains the embedding.
progress_bar
: Whether to show a progress bar when writing documents.
duplicate_documents
: How to handle duplicate documents. One of "overwrite", "skip" or "fail". "overwrite" is the default.
recreate_index
: Whether to recreate the index when initializing the document store.
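A connection string in the documented format can be assembled with the standard library. The sketch below is illustrative: `build_connection_string` is a hypothetical helper, and the user name, password, host and parameters are placeholders. Percent-encoding the credentials keeps characters such as "@" or "/" from breaking the URI.

```python
from typing import Dict, Optional
from urllib.parse import quote_plus, urlencode


def build_connection_string(
    username: str, password: str, host: str, params: Optional[Dict[str, str]] = None
) -> str:
    """Fill the "mongodb+srv://{user}:{password}@{host}/?{params}" template."""
    query = urlencode(params or {})
    return f"mongodb+srv://{quote_plus(username)}:{quote_plus(password)}@{host}/?{query}"


# Placeholder values, not real credentials or a real cluster.
connection_string = build_connection_string(
    "app_user",
    "p@ss/word",
    "cluster0.example.mongodb.net",
    {"retryWrites": "true", "w": "majority"},
)
```

The resulting string would be passed as mongo_connection_string, together with database_name and collection_name.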
MongoDBAtlasDocumentStore.delete_documents
def delete_documents(index: Optional[str] = None,
ids: Optional[List[str]] = None,
filters: Optional[FilterType] = None,
headers: Optional[Dict[str, str]] = None)
Delete documents from the document store.
Arguments:
index
: Optional collection name. If None, the DocumentStore's default collection will be used.
ids
: Optional list of IDs to narrow down the documents to be deleted.
filters
: Optional filters (see get_all_documents for description). If filters are provided along with a list of IDs, this method deletes the intersection of the two query results (documents that match the filters and have their ID in the list).
headers
: MongoDBAtlasDocumentStore does not support headers.
Returns:
None
MongoDBAtlasDocumentStore.delete_index
def delete_index(index=None)
Deletes the collection named by index or the collection specified when the driver was initialized.
MongoDBAtlasDocumentStore.get_all_documents
def get_all_documents(index: Optional[str] = None,
filters: Optional[FilterType] = None,
return_embedding: Optional[bool] = False,
batch_size: int = DEFAULT_BATCH_SIZE,
headers: Optional[Dict[str, str]] = None)
Retrieves all documents in the index (collection).
Arguments:
index
: Optional collection name. If None, the DocumentStore's default collection will be used.
filters
: Optional filters to narrow down the documents that will be retrieved. Filters are defined as nested dictionaries. The keys of the dictionaries can be a logical operator ("$and", "$or", "$not"), a comparison operator ("$eq", "$in", "$gt", "$gte", "$lt", "$lte") or a metadata field name. Logical operator keys take a dictionary of metadata field names and/or logical operators as value. Metadata field names take a dictionary of comparison operators as value. Comparison operator keys take a single value or (in case of "$in") a list of values as value. If no logical operator is provided, "$and" is used as default operation. If no comparison operator is provided, "$eq" (or "$in" if the comparison value is a list) is used as default operation. Example:
```python
filters = {
    "$and": {
        "type": {"$eq": "article"},
        "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
        "rating": {"$gte": 3},
        "$or": {
            "genre": {"$in": ["economy", "politics"]},
            "publisher": {"$eq": "nytimes"}
        }
    }
}
```
Note that filters will be acting on the contents of the meta field of the documents in the collection.
return_embedding
: Optional flag to return the embedding of the document.
batch_size
: Number of documents to process at a time. When working with a large number of documents, batching can help reduce memory footprint.
headers
: MongoDBAtlasDocumentStore does not support headers.
MongoDBAtlasDocumentStore.get_document_count
def get_document_count(filters: Optional[FilterType] = None,
index: Optional[str] = None,
only_documents_without_embedding: bool = False,
headers: Optional[Dict[str, str]] = None) -> int
Return the number of documents.
Arguments:
filters
: Optional filters (see get_all_documents for description).
index
: Optional collection name. If None, the DocumentStore's default collection will be used.
only_documents_without_embedding
: If set to True, only documents without embeddings are counted.
headers
: MongoDBAtlasDocumentStore does not support headers.
MongoDBAtlasDocumentStore.get_embedding_count
def get_embedding_count(filters: Optional[FilterType] = None,
index: Optional[str] = None) -> int
Return the number of documents with embeddings.
Arguments:
filters
: Optional filters (see get_all_documents for description).
MongoDBAtlasDocumentStore.get_all_documents_generator
def get_all_documents_generator(
index: Optional[str] = None,
filters: Optional[FilterType] = None,
return_embedding: Optional[bool] = False,
batch_size: int = DEFAULT_BATCH_SIZE,
headers: Optional[Dict[str,
str]] = None) -> Generator[Document, None, None]
Retrieves all documents in the index (collection). Under-the-hood, documents are fetched in batches from the
document store and yielded as individual documents. This method can be used to iteratively process a large number of documents without having to load all documents in memory.
Arguments:
index
: Optional collection name. If None, the DocumentStore's default collection will be used.
filters
: Optional filters (see get_all_documents for description).
return_embedding
: Optional flag to return the embedding of the document.
batch_size
: Number of documents to process at a time. When working with a large number of documents, batching can help reduce memory footprint.
headers
: MongoDBAtlasDocumentStore does not support headers.
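The "fetch in batches, yield one by one" behaviour can be sketched with a plain generator. This is not the store's implementation: `fetch_batch` is a hypothetical callable standing in for a backend query that takes an offset and a limit.

```python
from typing import Any, Callable, Dict, Iterator, List, Tuple

Doc = Dict[str, Any]


def iter_documents(
    fetch_batch: Callable[[int, int], List[Doc]], batch_size: int = 10_000
) -> Iterator[Doc]:
    """Yield documents one at a time while fetching them in batches."""
    offset = 0
    while True:
        batch = fetch_batch(offset, batch_size)
        if not batch:
            return  # the backend has no more documents
        yield from batch
        offset += len(batch)


# A fake backend with 25 documents that records how it was queried.
backend = [{"id": str(i)} for i in range(25)]
calls: List[Tuple[int, int]] = []


def fake_fetch(offset: int, limit: int) -> List[Doc]:
    calls.append((offset, limit))
    return backend[offset : offset + limit]


fetched = list(iter_documents(fake_fetch, batch_size=10))
```

At most batch_size documents are held at once, so a caller can loop over a very large collection without loading it fully into memory.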
MongoDBAtlasDocumentStore.get_documents_by_id
def get_documents_by_id(
ids: List[str],
index: Optional[str] = None,
batch_size: int = DEFAULT_BATCH_SIZE,
headers: Optional[Dict[str, str]] = None,
return_embedding: Optional[bool] = None) -> List[Document]
Retrieves all documents matching ids.
Arguments:
ids
: List of IDs to retrieve.
index
: Optional collection name. If None, the DocumentStore's default collection will be used.
batch_size
: Number of documents to retrieve at a time. When working with a large number of documents, batching can help reduce memory footprint.
headers
: MongoDBAtlasDocumentStore does not support headers.
return_embedding
: Optional flag to return the embedding of the document.
MongoDBAtlasDocumentStore.get_document_by_id
def get_document_by_id(id: str,
index: Optional[str] = None,
headers: Optional[Dict[str, str]] = None,
return_embedding: Optional[bool] = None) -> Document
Retrieves the document matching id.
Arguments:
id
: The ID of the document to retrieve.
index
: Optional collection name. If None, the DocumentStore's default collection will be used.
headers
: MongoDBAtlasDocumentStore does not support headers.
return_embedding
: Optional flag to return the embedding of the document.
MongoDBAtlasDocumentStore.query_by_embedding
def query_by_embedding(query_emb: np.ndarray,
filters: Optional[FilterType] = None,
top_k: int = 10,
index: Optional[str] = None,
return_embedding: Optional[bool] = None,
headers: Optional[Dict[str, str]] = None,
scale_score: bool = True) -> List[Document]
Find the documents that are most similar to the provided query_emb
by using a vector similarity metric.
Arguments:
query_emb
: Embedding of the query.
filters
: Optional filters (see get_all_documents for description).
top_k
: How many documents to return.
index
: Optional collection name. If None, the DocumentStore's default collection will be used.
return_embedding
: Whether to return document embedding.
headers
: MongoDBAtlasDocumentStore does not support headers.
scale_score
: Whether to scale the similarity score to the unit interval (range of [0,1]). If true (default) similarity scores (e.g. cosine or dot_product) which naturally have a different value range will be scaled to a range of [0,1], where 1 means extremely relevant. Otherwise raw similarity scores (e.g. cosine or dot_product) will be used.
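One way to picture scale_score is a mapping from each metric's natural range onto [0,1]. The formulas below are assumptions chosen for illustration (a linear map for cosine, a sigmoid with an arbitrary divisor of 100 for unbounded scores); the formulas the store actually applies may differ.

```python
import math


def scale_to_unit_interval(score: float, similarity: str) -> float:
    """Map a raw similarity score to the range [0, 1].

    Assumed formulas, for illustration only.
    """
    if similarity == "cosine":
        # Cosine similarity lies in [-1, 1], so a linear map is enough.
        return (score + 1) / 2
    # Unbounded scores (e.g. dot product) are squashed with a sigmoid.
    return 1 / (1 + math.exp(-score / 100))


cosine_scaled = scale_to_unit_interval(0.5, "cosine")
dot_scaled = scale_to_unit_interval(0.0, "dot_product")
```

Both mappings are monotonic, so scaling changes the reported score values but not the ranking of the returned documents.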
MongoDBAtlasDocumentStore.update_document_meta
def update_document_meta(id: str,
meta: Dict[str, str],
index: Optional[str] = None)
Update the metadata dictionary of a document by specifying its string ID.
Arguments:
id
: ID of the Document to update.
meta
: Dictionary of new metadata.
index
: Optional collection name. If None, the DocumentStore's default collection will be used.
MongoDBAtlasDocumentStore.write_documents
def write_documents(documents: Union[List[dict], List[Document]],
index: Optional[str] = None,
batch_size: int = DEFAULT_BATCH_SIZE,
duplicate_documents: Optional[str] = None,
headers: Optional[Dict[str, str]] = None)
Arguments:
documents
: List of dicts or Documents.
index
: Search index name. It may contain letters, numbers, hyphens, or underscores.
duplicate_documents
: How to handle duplicate documents. Parameter options:
  - "overwrite": Update any existing documents with the same ID when adding documents.
  - "skip": Ignore the duplicate documents.
  - "fail": Raise an error if the document ID of the document being added already exists.

  "overwrite" is the default behaviour.
MongoDBAtlasDocumentStore.update_embeddings
def update_embeddings(retriever: DenseRetriever,
index: Optional[str] = None,
update_existing_embeddings: bool = True,
filters: Optional[FilterType] = None,
batch_size: int = DEFAULT_BATCH_SIZE)
Updates the embeddings in the document store using the encoding model specified in the retriever.
This can be useful if you want to add or change the embeddings for your documents (e.g. after changing the retriever config).
Arguments:
retriever
: Retriever to use to get embeddings for text.
index
: Optional collection name. If None, the DocumentStore's default collection will be used.
update_existing_embeddings
: Whether to update existing embeddings of the documents. If set to False, only documents without embeddings are processed. This mode can be used for incremental updating of embeddings, wherein only newly indexed documents get processed.
filters
: Optional filters (see get_all_documents for description).
batch_size
: Number of documents to process at a time. When working with a large number of documents, batching can help reduce memory footprint.
MongoDBAtlasDocumentStore.get_all_labels_aggregated
def get_all_labels_aggregated(
index: Optional[str] = None,
filters: Optional[FilterType] = None,
open_domain: bool = True,
drop_negative_labels: bool = False,
drop_no_answers: bool = False,
aggregate_by_meta: Optional[Union[str, list]] = None,
headers: Optional[Dict[str, str]] = None) -> List[MultiLabel]
Return all labels in the DocumentStore, aggregated into MultiLabel objects.
This aggregation step helps, for example, if you collected multiple possible answers for one question and now want all answers bundled together in one place for evaluation. How they are aggregated is defined by the open_domain and aggregate_by_meta parameters. If the questions are being asked of a single document (i.e. SQuAD style), set open_domain=False to aggregate by question and document. If the questions are being asked of your full collection of documents, set open_domain=True to aggregate just by question. If the questions are being asked of a subslice of your document set (e.g. product review use cases), set open_domain=True and populate aggregate_by_meta with the names of Label meta fields to aggregate by question and your custom meta fields. For example, in a product review use case, you might set aggregate_by_meta=["product_id"] so that Labels with the same question but different answers from different documents are aggregated into one MultiLabel object, provided that they have the same product_id (found in Label.meta["product_id"]).
Arguments:
index
: Name of the index to get the labels from. If None, the DocumentStore's default index (self.index) will be used.
filters
: Optional filters to narrow down the search space to documents whose metadata fulfill certain conditions. Filters are defined as nested dictionaries. The keys of the dictionaries can be a logical operator ("$and", "$or", "$not"), a comparison operator ("$eq", "$in", "$gt", "$gte", "$lt", "$lte") or a metadata field name. Logical operator keys take a dictionary of metadata field names and/or logical operators as value. Metadata field names take a dictionary of comparison operators as value. Comparison operator keys take a single value or (in case of "$in") a list of values as value. If no logical operator is provided, "$and" is used as default operation. If no comparison operator is provided, "$eq" (or "$in" if the comparison value is a list) is used as default operation. Example:
```python
filters = {
    "$and": {
        "type": {"$eq": "article"},
        "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
        "rating": {"$gte": 3},
        "$or": {
            "genre": {"$in": ["economy", "politics"]},
            "publisher": {"$eq": "nytimes"}
        }
    }
}
```
open_domain
: When True, labels are aggregated purely based on the question text alone. When False, labels are aggregated in a closed domain fashion based on the question text and also the id of the document that the label is tied to. In this setting, this function might return multiple MultiLabel objects with the same question string.
headers
: Custom HTTP headers to pass to document store client if supported (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='} for basic authentication)
aggregate_by_meta
: The names of the Label meta fields by which to aggregate. For example: ["product_id"]
drop_negative_labels
: When True, labels with incorrect answers and documents are dropped.
drop_no_answers
: When True, labels with no answers are dropped.
MongoDBAtlasDocumentStore.normalize_embedding
@staticmethod
def normalize_embedding(emb: np.ndarray) -> None
Performs L2 normalization of the embedding vector(s) in place. Input can be a single vector (1D array) or a matrix (2D array).
MongoDBAtlasDocumentStore.add_eval_data
def add_eval_data(filename: str,
doc_index: str = "eval_document",
label_index: str = "label",
batch_size: Optional[int] = None,
preprocessor: Optional[PreProcessor] = None,
max_docs: Optional[Union[int, bool]] = None,
open_domain: bool = False,
headers: Optional[Dict[str, str]] = None)
Adds a SQuAD-formatted file to the DocumentStore in order to be able to perform evaluation on it.
If a jsonl file and a batch_size is passed to the function, documents are loaded batchwise from disk and also indexed batchwise to the DocumentStore in order to prevent out of memory errors.
Arguments:
filename
: Name of the file containing evaluation data (json or jsonl)
doc_index
: Elasticsearch index where evaluation documents should be stored
label_index
: Elasticsearch index where labeled questions should be stored
batch_size
: Optional number of documents that are loaded and processed at a time. When set to None (default) all documents are processed at once.
preprocessor
: Optional PreProcessor to preprocess evaluation documents. It can be used for splitting documents into passages (and assigning labels to corresponding passages). Currently the PreProcessor does not support split_by sentence, cleaning nor split_overlap != 0. When set to None (default) preprocessing is disabled.
max_docs
: Optional number of documents that will be loaded. When set to None (default) all available eval documents are used.
open_domain
: Set this to True if your file is an open domain dataset where two different answers to the same question might be found in different contexts.
headers
: Custom HTTP headers to pass to document store client if supported (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='} for basic authentication)
MongoDBAtlasDocumentStore.run
def run(documents: List[Union[dict, Document]],
index: Optional[str] = None,
headers: Optional[Dict[str, str]] = None,
id_hash_keys: Optional[List[str]] = None)
Run requests of document stores.
Comment: We will gradually introduce the primitives. The document stores also accept dicts and parse them to documents. In the future, however, only documents themselves will be accepted. Parsing the dictionaries in the run function is therefore only an interim solution until the run function also accepts documents.
Arguments:
documents
: A list of dicts that are documents.
headers
: A dictionary of headers.
index
: Optional name of index where the documents shall be written to. If None, the DocumentStore's default index (self.index) will be used.
id_hash_keys
: List of the fields that the hashes of the ids are generated from.
MongoDBAtlasDocumentStore.describe_documents
def describe_documents(index=None)
Return a summary of the documents in the document store
MongoDBAtlasDocumentStoreError
class MongoDBAtlasDocumentStoreError(DocumentStoreError)
Exception for issues that occur in a MongoDBAtlas document store
ValidationError
class ValidationError(Exception)
Exception for validation errors
Module sql
SQLDocumentStore
class SQLDocumentStore(BaseDocumentStore)
SQLDocumentStore.__init__
def __init__(url: str = "sqlite://",
index: str = "document",
label_index: str = "label",
duplicate_documents: str = "overwrite",
check_same_thread: bool = False,
isolation_level: Optional[str] = None)
An SQL backed DocumentStore. Currently supports SQLite, PostgreSQL and MySQL backends.
Arguments:
url
: URL for SQL database as expected by SQLAlchemy. More info here: https://docs.sqlalchemy.org/en/13/core/engines.html#database-urls
index
: The documents are scoped to an index attribute that can be used when writing, querying, or deleting documents. This parameter sets the default value for document index.
label_index
: The default value of index attribute for the labels.
duplicate_documents
: How to handle duplicate documents. Parameter options: 'skip', 'overwrite', 'fail'.
  - skip: Ignore the duplicate documents.
  - overwrite: Update any existing documents with the same ID when adding documents.
  - fail: Raise an error if the document ID of the document being added already exists.
check_same_thread
: Set to False to mitigate multithreading issues in older SQLite versions (see https://docs.sqlalchemy.org/en/14/dialects/sqlite.html?highlight=check_same_thread#threading-pooling-behavior)
isolation_level
: See SQLAlchemy's isolation_level parameter for create_engine() (https://docs.sqlalchemy.org/en/14/core/engines.html#sqlalchemy.create_engine.params.isolation_level)
SQLDocumentStore.get_document_by_id
def get_document_by_id(
id: str,
index: Optional[str] = None,
headers: Optional[Dict[str, str]] = None) -> Optional[Document]
Fetch a document by specifying its text id string
SQLDocumentStore.get_documents_by_id
def get_documents_by_id(
ids: List[str],
index: Optional[str] = None,
batch_size: int = 10_000,
headers: Optional[Dict[str, str]] = None) -> List[Document]
Fetch documents by specifying a list of text id strings
SQLDocumentStore.get_documents_by_vector_ids
def get_documents_by_vector_ids(vector_ids: List[str],
index: Optional[str] = None,
batch_size: int = 10_000)
Fetch documents by specifying a list of text vector id strings
SQLDocumentStore.get_all_documents_generator
def get_all_documents_generator(
index: Optional[str] = None,
filters: Optional[FilterType] = None,
return_embedding: Optional[bool] = None,
batch_size: int = 10_000,
headers: Optional[Dict[str,
str]] = None) -> Generator[Document, None, None]
Get documents from the document store. Under-the-hood, documents are fetched in batches from the
document store and yielded as individual documents. This method can be used to iteratively process a large number of documents without having to load all documents in memory.
Arguments:
index
: Name of the index to get the documents from. If None, the DocumentStore's default index (self.index) will be used.
filters
: Optional filters to narrow down the documents to return. Example: {"name": ["some", "more"], "category": ["only_one"]}
return_embedding
: Whether to return the document embeddings.
batch_size
: When working with a large number of documents, batching can help reduce memory footprint.
SQLDocumentStore.get_all_labels
def get_all_labels(index=None,
filters: Optional[FilterType] = None,
headers: Optional[Dict[str, str]] = None)
Return all labels in the document store
SQLDocumentStore.write_documents
def write_documents(documents: Union[List[dict], List[Document]],
index: Optional[str] = None,
batch_size: int = 10_000,
duplicate_documents: Optional[str] = None,
headers: Optional[Dict[str, str]] = None) -> None
Indexes documents for later queries.
Arguments:
documents
: A list of Python dictionaries or a list of Haystack Document objects. For documents as dictionaries, the format is {"text": ""}. Optionally, include metadata via {"text": "", "meta": {"name": "", "author": "somebody", ...}}. It can be used for filtering and is accessible in the responses of the Finder.
index
: Add an optional index attribute to documents. It can be used later for filtering. For instance, documents for evaluation can be indexed in a separate index from the documents for search.
batch_size
: When working with a large number of documents, batching can help reduce memory footprint.
duplicate_documents
: How to handle duplicate documents. Parameter options: 'skip', 'overwrite', 'fail'.
  - skip: Ignore the duplicate documents.
  - overwrite: Update any existing documents with the same ID when adding documents. This is the default, but it is considerably slower.
  - fail: Raise an error if the document ID of the document being added already exists.
Returns:
None
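The three duplicate_documents options can be illustrated with a dictionary standing in for the database table. This is a behavioural sketch, not SQLDocumentStore code: `write_with_policy` and `DuplicateDocumentError` are hypothetical names, and documents are assumed to be dicts with an "id" key.

```python
from typing import Any, Dict, List


class DuplicateDocumentError(ValueError):
    """Raised by the "fail" policy when an ID already exists (hypothetical)."""


def write_with_policy(
    store: Dict[str, Dict[str, Any]],
    documents: List[Dict[str, Any]],
    duplicate_documents: str = "overwrite",
) -> None:
    """Write documents into `store`, keyed by ID, honouring the duplicate policy."""
    if duplicate_documents not in ("overwrite", "skip", "fail"):
        raise ValueError(f"Unknown duplicate_documents option: {duplicate_documents}")
    for doc in documents:
        doc_id = doc["id"]
        if doc_id in store:
            if duplicate_documents == "skip":
                continue  # keep the stored version
            if duplicate_documents == "fail":
                raise DuplicateDocumentError(f"Document with id '{doc_id}' already exists")
        store[doc_id] = doc  # new document, or "overwrite"


table: Dict[str, Dict[str, Any]] = {"1": {"id": "1", "text": "old"}}
write_with_policy(table, [{"id": "1", "text": "new"}], duplicate_documents="skip")
text_after_skip = table["1"]["text"]
write_with_policy(table, [{"id": "1", "text": "new"}], duplicate_documents="overwrite")
text_after_overwrite = table["1"]["text"]
```

After the "skip" call the stored text is still "old"; after the "overwrite" call it is "new". A third call with "fail" would raise the error instead of writing.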
SQLDocumentStore.write_labels
def write_labels(labels, index=None, headers: Optional[Dict[str, str]] = None)
Write annotation labels into document store.
SQLDocumentStore.update_vector_ids
def update_vector_ids(vector_id_map: Dict[str, str],
index: Optional[str] = None,
batch_size: int = 10_000)
Update vector_ids for given document_ids.
Arguments:
vector_id_map
: Dict containing mapping of document_id -> vector_id.
index
: Filter documents by the optional index attribute for documents in database.
batch_size
: When working with a large number of documents, batching can help reduce memory footprint.
SQLDocumentStore.reset_vector_ids
def reset_vector_ids(index: Optional[str] = None)
Set vector IDs for all documents as None
SQLDocumentStore.update_document_meta
def update_document_meta(id: str,
meta: Dict[str, str],
index: Optional[str] = None)
Update the metadata dictionary of a document by specifying its string id
SQLDocumentStore.get_document_count
def get_document_count(filters: Optional[FilterType] = None,
index: Optional[str] = None,
only_documents_without_embedding: bool = False,
headers: Optional[Dict[str, str]] = None) -> int
Return the number of documents in the document store.
SQLDocumentStore.get_label_count
def get_label_count(index: Optional[str] = None,
headers: Optional[Dict[str, str]] = None) -> int
Return the number of labels in the document store
SQLDocumentStore.delete_all_documents
def delete_all_documents(index: Optional[str] = None,
filters: Optional[FilterType] = None,
headers: Optional[Dict[str, str]] = None)
Delete documents in an index. All documents are deleted if no filters are passed.
Arguments:
index
: Index name to delete the documents from.
filters
: Optional filters to narrow down the documents to be deleted.
Returns:
None
SQLDocumentStore.delete_documents
def delete_documents(index: Optional[str] = None,
ids: Optional[List[str]] = None,
filters: Optional[FilterType] = None,
headers: Optional[Dict[str, str]] = None)
Delete documents in an index. All documents are deleted if no filters are passed.
Arguments:
index
: Index name to delete the documents from. If None, the DocumentStore's default index (self.index) will be used.
ids
: Optional list of IDs to narrow down the documents to be deleted.
filters
: Optional filters to narrow down the documents to be deleted. Example filters: {"name": ["some", "more"], "category": ["only_one"]}. If filters are provided along with a list of IDs, this method deletes the intersection of the two query results (documents that match the filters and have their ID in the list).
Returns:
None
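The intersection behavior described for ids and filters can be sketched as follows. This is an illustration over plain dictionaries, not the SQL query the store runs; select_for_deletion and the "id" and "meta" keys are hypothetical names.

```python
from typing import Dict, List, Optional


def select_for_deletion(docs: List[dict],
                        ids: Optional[List[str]] = None,
                        filters: Optional[Dict[str, List[str]]] = None) -> List[str]:
    """Return the IDs of documents that match both the ID list and the filters."""
    selected = []
    for doc in docs:
        if ids is not None and doc["id"] not in ids:
            continue  # not in the requested ID list
        if filters is not None and not all(
            doc["meta"].get(field) in allowed for field, allowed in filters.items()
        ):
            continue  # at least one filter field does not match
        selected.append(doc["id"])
    return selected
```

When neither ids nor filters is given, every document is selected, which mirrors the "all documents are deleted" default.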
SQLDocumentStore.delete_index
def delete_index(index: str)
Delete an existing index. The index including all data will be removed.
Arguments:
index
: The name of the index to delete.
Returns:
None
SQLDocumentStore.delete_labels
def delete_labels(index: Optional[str] = None,
ids: Optional[List[str]] = None,
filters: Optional[FilterType] = None,
headers: Optional[Dict[str, str]] = None)
Delete labels from the document store. All labels are deleted if no filters are passed.
Arguments:
index
: Index name to delete the labels from. If None, the DocumentStore's default label index (self.label_index) will be used.
ids
: Optional list of IDs to narrow down the labels to be deleted.
filters
: Optional filters to narrow down the labels to be deleted. Example filters: {"id": ["9a196e41-f7b5-45b4-bd19-5feb7501c159", "9a196e41-f7b5-45b4-bd19-5feb7501c159"]} or {"query": ["question2"]}
Returns:
None
SQLDocumentStore.get_all_labels_aggregated
def get_all_labels_aggregated(
index: Optional[str] = None,
filters: Optional[FilterType] = None,
open_domain: bool = True,
drop_negative_labels: bool = False,
drop_no_answers: bool = False,
aggregate_by_meta: Optional[Union[str, list]] = None,
headers: Optional[Dict[str, str]] = None) -> List[MultiLabel]
Return all labels in the DocumentStore, aggregated into MultiLabel objects.
This aggregation step helps, for example, if you collected multiple possible answers for one question and now want all answers bundled together in one place for evaluation. How they are aggregated is defined by the open_domain and aggregate_by_meta parameters:
- If the questions are being asked to a single document (i.e. SQuAD style), set open_domain=False to aggregate by question and document.
- If the questions are being asked to your full collection of documents, set open_domain=True to aggregate just by question.
- If the questions are being asked to a subslice of your document set (e.g. product review use cases), set open_domain=True and populate aggregate_by_meta with the names of Label meta fields to aggregate by question and your custom meta fields.
For example, in a product review use case, you might set aggregate_by_meta=["product_id"] so that Labels with the same question but different answers from different documents are aggregated into one MultiLabel object, provided that they have the same product_id (to be found in Label.meta["product_id"]).
Arguments:
index
: Name of the index to get the labels from. If None, the DocumentStore's default index (self.index) will be used.
filters
: Optional filters to narrow down the search space to documents whose metadata fulfill certain conditions. Filters are defined as nested dictionaries. The keys of the dictionaries can be a logical operator ("$and", "$or", "$not"), a comparison operator ("$eq", "$in", "$gt", "$gte", "$lt", "$lte") or a metadata field name. Logical operator keys take a dictionary of metadata field names and/or logical operators as value. Metadata field names take a dictionary of comparison operators as value. Comparison operator keys take a single value or (in case of "$in") a list of values as value. If no logical operator is provided, "$and" is used as default operation. If no comparison operator is provided, "$eq" (or "$in" if the comparison value is a list) is used as default operation.
Example:
```python
filters = {
    "$and": {
        "type": {"$eq": "article"},
        "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
        "rating": {"$gte": 3},
        "$or": {
            "genre": {"$in": ["economy", "politics"]},
            "publisher": {"$eq": "nytimes"}
        }
    }
}
```
open_domain
: When True, labels are aggregated purely based on the question text alone. When False, labels are aggregated in a closed domain fashion based on the question text and also the id of the document that the label is tied to. In this setting, this function might return multiple MultiLabel objects with the same question string.
headers
: Custom HTTP headers to pass to document store client if supported (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='} for basic authentication)
aggregate_by_meta
: The names of the Label meta fields by which to aggregate. For example: ["product_id"]
drop_negative_labels
: When True, labels with incorrect answers and documents are dropped.
drop_no_answers
: When True, labels with no answers are dropped.
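How open_domain and aggregate_by_meta shape the grouping can be sketched by building a grouping key per label. This is a simplified illustration over plain dictionaries; the function aggregate_labels and the "query", "document_id", and "meta" keys are assumptions, and the real method returns MultiLabel objects rather than lists.

```python
from collections import defaultdict
from typing import Dict, List, Optional


def aggregate_labels(labels: List[dict],
                     open_domain: bool = True,
                     aggregate_by_meta: Optional[List[str]] = None) -> Dict[tuple, List[dict]]:
    """Group labels by question, optionally also by document and meta fields."""
    groups: Dict[tuple, List[dict]] = defaultdict(list)
    for label in labels:
        key = [label["query"]]
        if not open_domain:
            # Closed domain: the document the label is tied to is part of the key.
            key.append(label["document_id"])
        for name in aggregate_by_meta or []:
            key.append(label.get("meta", {}).get(name))
        groups[tuple(key)].append(label)
    return dict(groups)
```

With open_domain=False the same question can appear in several groups, one per document, which matches the note that multiple MultiLabel objects may share a question string.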
SQLDocumentStore.normalize_embedding
@staticmethod
def normalize_embedding(emb: np.ndarray) -> None
Performs L2 normalization of an embedding vector in place. Input can be a single vector (1D array) or a matrix (2D array).
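As a rough illustration of in-place L2 normalization, the sketch below works on plain Python lists instead of np.ndarray so that it needs no third-party packages. The function name normalize_in_place is hypothetical, and leaving zero vectors unchanged is a choice made for this sketch, not a documented behavior.

```python
import math
from typing import List, Union

Vector = List[float]


def normalize_in_place(emb: Union[Vector, List[Vector]]) -> None:
    """Scale each vector to unit L2 norm, modifying the input lists directly."""
    # Treat a flat list as a single vector and a list of lists as a matrix.
    rows = emb if emb and isinstance(emb[0], list) else [emb]
    for row in rows:
        norm = math.sqrt(sum(x * x for x in row))
        if norm == 0.0:
            continue  # a zero vector has no direction; leave it as is
        for i, x in enumerate(row):
            row[i] = x / norm
```

Like the documented method, the function returns None and the caller keeps using the object it passed in.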
SQLDocumentStore.add_eval_data
def add_eval_data(filename: str,
doc_index: str = "eval_document",
label_index: str = "label",
batch_size: Optional[int] = None,
preprocessor: Optional[PreProcessor] = None,
max_docs: Optional[Union[int, bool]] = None,
open_domain: bool = False,
headers: Optional[Dict[str, str]] = None)
Adds a SQuAD-formatted file to the DocumentStore in order to be able to perform evaluation on it.
If a jsonl file and a batch_size are passed to the function, documents are loaded batchwise from disk and also indexed batchwise to the DocumentStore in order to prevent out-of-memory errors.
Arguments:
filename
: Name of the file containing evaluation data (json or jsonl).
doc_index
: Elasticsearch index where evaluation documents should be stored.
label_index
: Elasticsearch index where labeled questions should be stored.
batch_size
: Optional number of documents that are loaded and processed at a time. When set to None (default), all documents are processed at once.
preprocessor
: Optional PreProcessor to preprocess evaluation documents. It can be used for splitting documents into passages (and assigning labels to corresponding passages). Currently the PreProcessor does not support split_by sentence, cleaning, or split_overlap != 0. When set to None (default), preprocessing is disabled.
max_docs
: Optional number of documents that will be loaded. When set to None (default), all available eval documents are used.
open_domain
: Set this to True if your file is an open domain dataset where two different answers to the same question might be found in different contexts.
headers
: Custom HTTP headers to pass to document store client if supported (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='} for basic authentication).
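The batchwise loading of a jsonl file can be pictured with a small generator. This is a sketch of the general idea only; iter_batches is a hypothetical helper, and the actual method additionally converts SQuAD entries into documents and labels.

```python
import io
import json
from typing import Iterator, List, Optional, TextIO


def iter_batches(lines: TextIO, batch_size: Optional[int] = None) -> Iterator[List[dict]]:
    """Yield parsed jsonl records in batches; one single batch if batch_size is None."""
    batch: List[dict] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        batch.append(json.loads(line))
        if batch_size is not None and len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # remaining records that did not fill a whole batch


# Usage with an in-memory file standing in for a jsonl file on disk.
sample = io.StringIO('{"n": 1}\n{"n": 2}\n{"n": 3}\n')
batch_sizes = [len(b) for b in iter_batches(sample, batch_size=2)]
```

Only one batch is held in memory at a time, which is the reason batching helps with large evaluation files.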
SQLDocumentStore.run
def run(documents: List[Union[dict, Document]],
index: Optional[str] = None,
headers: Optional[Dict[str, str]] = None,
id_hash_keys: Optional[List[str]] = None)
Run requests of document stores
Comment: We will gradually introduce the primitives. The document stores also accept dicts and parse them to documents. In the future, however, only documents themselves will be accepted. Parsing the dictionaries in the run function is therefore only an interim solution until the run function also accepts documents.
Arguments:
documents
: A list of dicts that are documents.
headers
: Custom HTTP headers to pass to the document store client, if supported.
index
: Optional name of index where the documents shall be written to. If None, the DocumentStore's default index (self.index) will be used.
id_hash_keys
: List of the fields that the hashes of the ids are generated from.
SQLDocumentStore.describe_documents
def describe_documents(index=None)
Return a summary of the documents in the document store
Module faiss
FAISSDocumentStore
class FAISSDocumentStore(SQLDocumentStore)
A DocumentStore for very large-scale, embedding-based dense Retrievers, like DPR.
It uses the FAISS library to perform similarity search on vectors.
The document text and metadata (for filtering) are stored using the SQLDocumentStore, while the vector embeddings are indexed in a FAISS index.
When you initialize the FAISSDocumentStore, the faiss_document_store.db database file is created on your disk. For more information, see DocumentStore.
FAISSDocumentStore.__init__
def __init__(sql_url: str = "sqlite:///faiss_document_store.db",
vector_dim: Optional[int] = None,
embedding_dim: int = 768,
faiss_index_factory_str: str = "Flat",
faiss_index: Optional["faiss.swigfaiss.Index"] = None,
return_embedding: bool = False,
index: str = "document",
similarity: str = "dot_product",
embedding_field: str = "embedding",
progress_bar: bool = True,
duplicate_documents: str = "overwrite",
faiss_index_path: Optional[Union[str, Path]] = None,
faiss_config_path: Optional[Union[str, Path]] = None,
isolation_level: Optional[str] = None,
n_links: int = 64,
ef_search: int = 20,
ef_construction: int = 80,
validate_index_sync: bool = True,
batch_size: int = 10_000)
Arguments:
sql_url
: SQL connection URL for the database. The default value is "sqlite:///faiss_document_store.db", a local, file-based SQLite DB. For large-scale deployments, we recommend Postgres.
vector_dim
: Deprecated. Use embedding_dim instead.
embedding_dim
: The embedding vector size. Default: 768.
faiss_index_factory_str
: Creates a new FAISS index of the specified type. It determines the type based on the string you pass to it, following the conventions of the original FAISS index factory. Recommended options:
- "Flat" (default): Best accuracy (= exact). Becomes slow and RAM-intense for > 1 million docs.
- "HNSW": Graph-based heuristic. If you don't specify it further, we use the following configuration: HNSW64, efConstruction=80 and efSearch=20.
- "IVFx,Flat": Inverted index. Replace x with the number of centroids aka nlist. Rule of thumb: nlist = 10 * sqrt(num_docs) is a good starting point.
For more details see:
- Overview of indices
- Guideline for choosing an index
- FAISS Index factory
Benchmarks: XXX
faiss_index
: Loads an existing FAISS index. This can be an empty index you configured manually or an index with Documents you used in Haystack before and want to load again. You can use it to load a previously saved DocumentStore.
return_embedding
: Returns document embedding. Unlike other document stores, FAISS will return normalized embeddings.
index
: Specifies the name of the index in DocumentStore to use.
similarity
: Specifies the similarity function used to compare document vectors. 'dot_product' is the default because it's more performant with DPR embeddings. 'cosine' is recommended if you're using a Sentence-Transformer model. In both cases, the returned values in Document.score are normalized to be in range [0,1]. For dot_product: expit(np.asarray(raw_score / 100)). For cosine: (raw_score + 1) / 2.
embedding_field
: The name of the field containing an embedding vector.
progress_bar
: Shows a tqdm progress bar. You may want to disable it in production deployments to keep the logs clean.
duplicate_documents
: Handles duplicate documents based on parameter options. Parameter options: 'skip', 'overwrite', 'fail'. skip: Ignores the duplicate documents. overwrite: Updates any existing documents with the same ID when adding documents. fail: Raises an error if the document ID of the document being added already exists.
faiss_index_path
: The stored FAISS index file. Call save() to create this file. Use the same index file path you specified when calling save(). If you specify faiss_index_path, the only other parameter you can pass is faiss_config_path.
faiss_config_path
: Stored FAISS initial configuration. It contains all the parameters used to initialize the DocumentStore. Call save() to create it and then use the same configuration file path you specified when calling save(). Don't set it if you haven't specified config_path when calling save().
isolation_level
: See SQLAlchemy's isolation_level parameter for create_engine().
n_links
: Used only if index_factory == "HNSW".
ef_search
: Used only if index_factory == "HNSW".
ef_construction
: Used only if index_factory == "HNSW".
validate_index_sync
: Checks if the document count equals the embedding count at initialization time.
batch_size
: Number of Documents to index at once / Number of queries to execute at once. If you face memory issues, decrease the batch_size.
FAISSDocumentStore.write_documents
def write_documents(documents: Union[List[dict], List[Document]],
index: Optional[str] = None,
batch_size: Optional[int] = None,
duplicate_documents: Optional[str] = None,
headers: Optional[Dict[str, str]] = None) -> None
Add new documents to the DocumentStore.
Arguments:
documents
: List of Dicts or List of Documents. If they already contain the embeddings, we'll index them right away in FAISS. If not, you can later call update_embeddings() to create & index them.
index
: (SQL) index name for storing the docs and metadata.
batch_size
: When working with a large number of documents, batching can help reduce the memory footprint.
duplicate_documents
: Handles duplicate documents based on parameter options. Parameter options: 'skip', 'overwrite', 'fail'. skip: Ignore the duplicate documents. overwrite: Update any existing documents with the same ID when adding documents. fail: Raise an error if the document ID of the document being added already exists.
Raises:
DuplicateDocumentError
: Exception triggered on a duplicate document.
Returns:
None
FAISSDocumentStore.update_embeddings
def update_embeddings(retriever: DenseRetriever,
index: Optional[str] = None,
update_existing_embeddings: bool = True,
filters: Optional[FilterType] = None,
batch_size: Optional[int] = None)
Updates the embeddings in the document store using the encoding model specified in the retriever.
This can be useful if you want to add or change the embeddings for your documents (e.g. after changing the retriever config).
Arguments:
retriever
: Retriever to use to get embeddings for text.
index
: Index name for which embeddings are to be updated. If set to None, the default self.index is used.
update_existing_embeddings
: Whether to update existing embeddings of the documents. If set to False, only documents without embeddings are processed. This mode can be used for incremental updating of embeddings, wherein only newly indexed documents get processed.
filters
: Optional filters to narrow down the documents for which embeddings are to be updated. Example: {"name": ["some", "more"], "category": ["only_one"]}
batch_size
: When working with a large number of documents, batching can help reduce the memory footprint.
Returns:
None
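The effect of update_existing_embeddings can be sketched with plain dictionaries and a stand-in embedding function. The names update_embeddings_sketch and embed_fn, and the "content" and "embedding" keys, are assumptions for illustration; the real method uses the retriever's encoder and also writes the vectors to the FAISS index.

```python
from typing import Callable, List


def update_embeddings_sketch(docs: List[dict],
                             embed_fn: Callable[[List[str]], List[List[float]]],
                             update_existing_embeddings: bool = True,
                             batch_size: int = 2) -> int:
    """Embed documents batch by batch; return how many documents were processed."""
    # In incremental mode, only documents that have no embedding yet are selected.
    pending = [
        d for d in docs
        if update_existing_embeddings or d.get("embedding") is None
    ]
    for start in range(0, len(pending), batch_size):
        batch = pending[start:start + batch_size]
        vectors = embed_fn([d["content"] for d in batch])
        for doc, vector in zip(batch, vectors):
            doc["embedding"] = vector
    return len(pending)
```

Passing update_existing_embeddings=False after adding a few new documents processes only those documents, which is the incremental mode described above.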
FAISSDocumentStore.get_all_documents_generator
def get_all_documents_generator(
index: Optional[str] = None,
filters: Optional[FilterType] = None,
return_embedding: Optional[bool] = None,
batch_size: Optional[int] = None,
headers: Optional[Dict[str, str]] = None) -> Generator[Document, None, None]
Get all documents from the document store. Under the hood, documents are fetched in batches from the document store and yielded as individual documents. This method can be used to iteratively process a large number of documents without having to load all documents in memory.
Arguments:
index
: Name of the index to get the documents from. If None, the DocumentStore's default index (self.index) will be used.
filters
: Optional filters to narrow down the documents to return. Example: {"name": ["some", "more"], "category": ["only_one"]}
return_embedding
: Whether to return the document embeddings. Unlike other document stores, FAISS will return normalized embeddings.
batch_size
: When working with a large number of documents, batching can help reduce the memory footprint.
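The "fetch in batches, yield one by one" pattern can be sketched as a generator around a paging callback. Both iter_documents and the fetch_batch(offset, limit) callback are hypothetical names; the store itself pages through its SQL database.

```python
from typing import Callable, Iterator, List


def iter_documents(fetch_batch: Callable[[int, int], List[dict]],
                   batch_size: int = 10_000) -> Iterator[dict]:
    """Yield documents one at a time while fetching them in batches."""
    offset = 0
    while True:
        batch = fetch_batch(offset, batch_size)
        if not batch:
            return  # nothing left to fetch
        yield from batch
        offset += len(batch)
```

A caller can loop over the generator and stop early without the remaining batches ever being fetched.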
FAISSDocumentStore.get_embedding_count
def get_embedding_count(index: Optional[str] = None,
filters: Optional[FilterType] = None) -> int
Return the count of embeddings in the document store.
FAISSDocumentStore.train_index
def train_index(documents: Optional[Union[List[dict], List[Document]]] = None,
embeddings: Optional[np.ndarray] = None,
index: Optional[str] = None)
Some FAISS indices (e.g. IVF) require initial "training" on a sample of vectors before you can add your final vectors.
The train vectors should come from the same distribution as your final ones. You can pass either documents (incl. embeddings) or just the plain embeddings that the index shall be trained on.
Arguments:
documents
: Documents (incl. the embeddings).
embeddings
: Plain embeddings.
index
: Name of the index to train. If None, the DocumentStore's default index (self.index) will be used.
Returns:
None
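For IVF indices, the rule of thumb given for faiss_index_factory_str (nlist = 10 * sqrt(num_docs)) can be turned into a factory string with a few lines. The helper name ivf_factory_string is hypothetical, and rounding to the nearest integer is a choice made for this sketch.

```python
import math


def ivf_factory_string(num_docs: int) -> str:
    """Build an "IVFx,Flat" factory string using nlist = 10 * sqrt(num_docs)."""
    nlist = max(1, round(10 * math.sqrt(num_docs)))
    return f"IVF{nlist},Flat"


# One million documents gives 10 * 1000 = 10000 centroids.
factory_str = ivf_factory_string(1_000_000)
```

An index created from such a string is one of the types that needs train_index() to be called before vectors are added.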
FAISSDocumentStore.delete_all_documents
def delete_all_documents(index: Optional[str] = None,
filters: Optional[FilterType] = None,
headers: Optional[Dict[str, str]] = None)
Delete all documents from the document store.
FAISSDocumentStore.delete_documents
def delete_documents(index: Optional[str] = None,
ids: Optional[List[str]] = None,
filters: Optional[FilterType] = None,
headers: Optional[Dict[str, str]] = None)
Delete documents from the document store. All documents are deleted if no filters are passed.
Arguments:
index
: Index name to delete the documents from. If None, the DocumentStore's default index (self.index) will be used.
ids
: Optional list of IDs to narrow down the documents to be deleted.
filters
: Optional filters to narrow down the documents to be deleted. Example filters: {"name": ["some", "more"], "category": ["only_one"]}. If filters are provided along with a list of IDs, this method deletes the intersection of the two query results (documents that match the filters and have their ID in the list).
Returns:
None
FAISSDocumentStore.delete_index
def delete_index(index: str)
Delete an existing index. The index including all data will be removed.
Arguments:
index
: The name of the index to delete.
Returns:
None
FAISSDocumentStore.query_by_embedding
def query_by_embedding(query_emb: np.ndarray,
filters: Optional[FilterType] = None,
top_k: int = 10,
index: Optional[str] = None,
return_embedding: Optional[bool] = None,
headers: Optional[Dict[str, str]] = None,
scale_score: bool = True) -> List[Document]
Find the documents that are most similar to the provided query_emb by using a vector similarity metric.
Arguments:
query_emb
: Embedding of the query (e.g. gathered from DPR).
filters
: Optional filters to narrow down the search space. Example: {"name": ["some", "more"], "category": ["only_one"]}
top_k
: How many documents to return.
index
: Index name to query the documents from.
return_embedding
: Whether to return the document embedding. Unlike other document stores, FAISS will return normalized embeddings.
scale_score
: Whether to scale the similarity score to the unit interval (range of [0,1]). If true (default), similarity scores (e.g. cosine or dot_product), which naturally have a different value range, will be scaled to a range of [0,1], where 1 means extremely relevant. Otherwise, raw similarity scores (e.g. cosine or dot_product) will be used.
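The two scaling formulas listed for the similarity argument of __init__ (expit(raw_score / 100) for dot_product and (raw_score + 1) / 2 for cosine) can be reproduced with the standard library. The function names below are hypothetical, and expit is written out by hand here rather than imported from SciPy.

```python
import math


def expit(x: float) -> float:
    """Logistic sigmoid, written to avoid overflow for large negative inputs."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def scale_dot_product(raw_score: float) -> float:
    """Map an unbounded dot-product score into [0, 1]."""
    return expit(raw_score / 100)


def scale_cosine(raw_score: float) -> float:
    """Map a cosine similarity from [-1, 1] into [0, 1]."""
    return (raw_score + 1) / 2
```

A raw dot product of 0 maps to 0.5, and a cosine similarity of 1 maps to 1.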
FAISSDocumentStore.save
def save(index_path: Union[str, Path],
config_path: Optional[Union[str, Path]] = None)
Save FAISS Index to the specified file.
The FAISS DocumentStore contains a SQL database and a FAISS index. The database is saved to your disk when you initialize the DocumentStore. The FAISS index is not. You must explicitly save it by calling the save() method. You can then use the saved index to load a different DocumentStore.
Saving a FAISSDocumentStore creates two files on your disk: the index file and the configuration file. The configuration file contains all the parameters needed to initialize the DocumentStore. For more information, see DocumentStore.
Arguments:
index_path
: The path where you want to save the index.
config_path
: The path where you want to save the configuration file. This is the JSON file that contains all the parameters to initialize the DocumentStore. It defaults to the same as the index file path, except the extension (.json). This file contains all the parameters passed to FAISSDocumentStore() at creation time (for example the sql_url, embedding_dim, and so on), and will be used by the load() method to restore the index with the saved configuration.
Returns:
None
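The default for config_path (same path as the index file, with a .json extension) can be sketched with pathlib. The helper name default_config_path is hypothetical and only illustrates the naming rule stated above.

```python
from pathlib import Path
from typing import Optional, Union


def default_config_path(index_path: Union[str, Path],
                        config_path: Optional[Union[str, Path]] = None) -> Path:
    """Return config_path if given, else the index path with a .json extension."""
    if config_path is not None:
        return Path(config_path)
    return Path(index_path).with_suffix(".json")


# An index saved as "my_index.faiss" gets "my_index.json" as its configuration file.
derived = default_config_path("my_index.faiss")
```

Keeping both files next to each other means load() can be called with the index path alone.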
FAISSDocumentStore.load
@classmethod
def load(cls,
index_path: Union[str, Path],
config_path: Optional[Union[str, Path]] = None)
Load a saved FAISS index from a file and connect to the SQL database. load() is a class method, so you need to call it on the class itself rather than on an instance. For more information, see DocumentStore.
Note: To have a correct mapping from FAISS to SQL, make sure to use the same SQL DB that you used when calling save().
Arguments:
index_path
: The stored FAISS index file. Call save() to create this file. Use the same index file path you specified when calling save().
config_path
: Stored FAISS initial configuration parameters. Call save() to create it.
FAISSDocumentStore.get_document_by_id
def get_document_by_id(
id: str,
index: Optional[str] = None,
headers: Optional[Dict[str, str]] = None) -> Optional[Document]
Fetch a document by specifying its text id string
FAISSDocumentStore.get_documents_by_vector_ids
def get_documents_by_vector_ids(vector_ids: List[str],
index: Optional[str] = None,
batch_size: int = 10_000)
Fetch documents by specifying a list of text vector id strings
FAISSDocumentStore.get_all_labels
def get_all_labels(index=None,
filters: Optional[FilterType] = None,
headers: Optional[Dict[str, str]] = None)
Return all labels in the document store
FAISSDocumentStore.write_labels
def write_labels(labels, index=None, headers: Optional[Dict[str, str]] = None)
Write annotation labels into document store.
FAISSDocumentStore.update_vector_ids
def update_vector_ids(vector_id_map: Dict[str, str],
index: Optional[str] = None,
batch_size: int = 10_000)
Update vector_ids for given document_ids.
Arguments:
vector_id_map
: Dict containing a mapping of document_id -> vector_id.
index
: Filter documents by the optional index attribute for documents in the database.
batch_size
: When working with a large number of documents, batching can help reduce the memory footprint.
FAISSDocumentStore.reset_vector_ids
def reset_vector_ids(index: Optional[str] = None)
Set vector IDs for all documents as None
FAISSDocumentStore.update_document_meta
def update_document_meta(id: str,
meta: Dict[str, str],
index: Optional[str] = None)
Update the metadata dictionary of a document by specifying its string id
FAISSDocumentStore.get_document_count
def get_document_count(filters: Optional[FilterType] = None,
index: Optional[str] = None,
only_documents_without_embedding: bool = False,
headers: Optional[Dict[str, str]] = None) -> int
Return the number of documents in the document store.
FAISSDocumentStore.get_label_count
def get_label_count(index: Optional[str] = None,
headers: Optional[Dict[str, str]] = None) -> int
Return the number of labels in the document store
FAISSDocumentStore.delete_labels
def delete_labels(index: Optional[str] = None,
ids: Optional[List[str]] = None,
filters: Optional[FilterType] = None,
headers: Optional[Dict[str, str]] = None)
Delete labels from the document store. All labels are deleted if no filters are passed.
Arguments:
index
: Index name to delete the labels from. If None, the DocumentStore's default label index (self.label_index) will be used.
ids
: Optional list of IDs to narrow down the labels to be deleted.
filters
: Optional filters to narrow down the labels to be deleted. Example filters: {"id": ["9a196e41-f7b5-45b4-bd19-5feb7501c159", "9a196e41-f7b5-45b4-bd19-5feb7501c159"]} or {"query": ["question2"]}
Returns:
None
FAISSDocumentStore.get_all_labels_aggregated
def get_all_labels_aggregated(
index: Optional[str] = None,
filters: Optional[FilterType] = None,
open_domain: bool = True,
drop_negative_labels: bool = False,
drop_no_answers: bool = False,
aggregate_by_meta: Optional[Union[str, list]] = None,
headers: Optional[Dict[str, str]] = None) -> List[MultiLabel]
Return all labels in the DocumentStore, aggregated into MultiLabel objects.
This aggregation step helps, for example, if you collected multiple possible answers for one question and now want all answers bundled together in one place for evaluation. How they are aggregated is defined by the open_domain and aggregate_by_meta parameters:
- If the questions are being asked to a single document (i.e. SQuAD style), set open_domain=False to aggregate by question and document.
- If the questions are being asked to your full collection of documents, set open_domain=True to aggregate just by question.
- If the questions are being asked to a subslice of your document set (e.g. product review use cases), set open_domain=True and populate aggregate_by_meta with the names of Label meta fields to aggregate by question and your custom meta fields.
For example, in a product review use case, you might set aggregate_by_meta=["product_id"] so that Labels with the same question but different answers from different documents are aggregated into one MultiLabel object, provided that they have the same product_id (to be found in Label.meta["product_id"]).
Arguments:
index
: Name of the index to get the labels from. If None, the DocumentStore's default index (self.index) will be used.
filters
: Optional filters to narrow down the search space to documents whose metadata fulfill certain conditions. Filters are defined as nested dictionaries. The keys of the dictionaries can be a logical operator ("$and", "$or", "$not"), a comparison operator ("$eq", "$in", "$gt", "$gte", "$lt", "$lte") or a metadata field name. Logical operator keys take a dictionary of metadata field names and/or logical operators as value. Metadata field names take a dictionary of comparison operators as value. Comparison operator keys take a single value or (in case of "$in") a list of values as value. If no logical operator is provided, "$and" is used as default operation. If no comparison operator is provided, "$eq" (or "$in" if the comparison value is a list) is used as default operation.
Example:
```python
filters = {
    "$and": {
        "type": {"$eq": "article"},
        "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
        "rating": {"$gte": 3},
        "$or": {
            "genre": {"$in": ["economy", "politics"]},
            "publisher": {"$eq": "nytimes"}
        }
    }
}
```
open_domain
: When True, labels are aggregated purely based on the question text alone. When False, labels are aggregated in a closed domain fashion based on the question text and also the id of the document that the label is tied to. In this setting, this function might return multiple MultiLabel objects with the same question string.
headers
: Custom HTTP headers to pass to document store client if supported (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='} for basic authentication)
aggregate_by_meta
: The names of the Label meta fields by which to aggregate. For example: ["product_id"]
drop_negative_labels
: When True, labels with incorrect answers and documents are dropped.
drop_no_answers
: When True, labels with no answers are dropped.
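The nested filter semantics described for the filters argument can be made concrete with a small evaluator over a metadata dictionary. This is a sketch under assumptions: the function names are hypothetical, only the dictionary form of logical operators is handled, "$not" is read as "not all of the nested conditions hold", and a missing metadata field never matches. It is not the code the DocumentStore uses to translate filters into queries.

```python
import operator
from typing import Any, Dict

COMPARATORS = {
    "$eq": operator.eq,
    "$in": lambda value, allowed: value in allowed,
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
}


def matches(meta: Dict[str, Any], filters: Dict[str, Any]) -> bool:
    """Evaluate a filter dictionary; sibling entries are combined with "$and"."""
    return all(_match_entry(meta, key, value) for key, value in filters.items())


def _match_entry(meta: Dict[str, Any], key: str, value: Any) -> bool:
    if key == "$and":
        return matches(meta, value)
    if key == "$or":
        return any(_match_entry(meta, k, v) for k, v in value.items())
    if key == "$not":
        return not matches(meta, value)
    # Any other key is treated as a metadata field name.
    return _match_field(meta.get(key), value)


def _match_field(field_value: Any, condition: Any) -> bool:
    if isinstance(condition, dict):
        # Explicit comparison operators; all of them must hold.
        return all(
            field_value is not None and COMPARATORS[op](field_value, operand)
            for op, operand in condition.items()
        )
    if isinstance(condition, list):
        return field_value in condition  # default "$in" for list values
    return field_value == condition  # default "$eq" for single values
```

The shorthand form {"name": ["some", "more"]} used elsewhere in this reference is covered by the default "$in" branch.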
FAISSDocumentStore.normalize_embedding
@staticmethod
def normalize_embedding(emb: np.ndarray) -> None
Performs L2 normalization of an embedding vector in place. Input can be a single vector (1D array) or a matrix (2D array).
FAISSDocumentStore.add_eval_data
def add_eval_data(filename: str,
doc_index: str = "eval_document",
label_index: str = "label",
batch_size: Optional[int] = None,
preprocessor: Optional[PreProcessor] = None,
max_docs: Optional[Union[int, bool]] = None,
open_domain: bool = False,
headers: Optional[Dict[str, str]] = None)
Adds a SQuAD-formatted file to the DocumentStore in order to be able to perform evaluation on it.
If a jsonl file and a batch_size are passed to the function, documents are loaded batchwise from disk and also indexed batchwise to the DocumentStore in order to prevent out-of-memory errors.
Arguments:
filename
: Name of the file containing evaluation data (json or jsonl).
doc_index
: Elasticsearch index where evaluation documents should be stored.
label_index
: Elasticsearch index where labeled questions should be stored.
batch_size
: Optional number of documents that are loaded and processed at a time. When set to None (default), all documents are processed at once.
preprocessor
: Optional PreProcessor to preprocess evaluation documents. It can be used for splitting documents into passages (and assigning labels to corresponding passages). Currently the PreProcessor does not support split_by sentence, cleaning, or split_overlap != 0. When set to None (default), preprocessing is disabled.
max_docs
: Optional number of documents that will be loaded. When set to None (default), all available eval documents are used.
open_domain
: Set this to True if your file is an open domain dataset where two different answers to the same question might be found in different contexts.
headers
: Custom HTTP headers to pass to document store client if supported (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='} for basic authentication).
FAISSDocumentStore.run
def run(documents: List[Union[dict, Document]],
index: Optional[str] = None,
headers: Optional[Dict[str, str]] = None,
id_hash_keys: Optional[List[str]] = None)
Run requests of document stores
Comment: We will gradually introduce the primitives. The document stores also accept dicts and parse them to documents. In the future, however, only documents themselves will be accepted. Parsing the dictionaries in the run function is therefore only an interim solution until the run function also accepts documents.
Arguments:
documents
: A list of dicts that are documents.
headers
: Custom HTTP headers to pass to the document store client, if supported.
index
: Optional name of index where the documents shall be written to. If None, the DocumentStore's default index (self.index) will be used.
id_hash_keys
: List of the fields that the hashes of the ids are generated from.
FAISSDocumentStore.describe_documents
def describe_documents(index=None)
Return a summary of the documents in the document store
Module weaviate
WeaviateDocumentStore
class WeaviateDocumentStore(KeywordDocumentStore)
Weaviate is a cloud-native, modular, real-time vector search engine built to scale your machine learning models. (See https://weaviate.io/developers/weaviate/current/index.html#what-is-weaviate)
Some of the key differences in contrast to FAISS:
- Stores everything in one place: documents, meta data and vectors - so less network overhead when scaling this up
- Allows combination of vector search and scalar filtering, i.e. you can filter for a certain tag and do dense retrieval on that subset
- Has less variety of ANN algorithms, as of now only HNSW.
- Requires document ids to be in uuid-format. If wrongly formatted ids are provided at indexing time they will be replaced with uuids automatically.
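The last point, replacing wrongly formatted ids with uuids, can be pictured with the standard uuid module. How the replacement id is derived is not stated here, so the sketch assumes a deterministic uuid5 of the original id as one possible choice; sanitize_id is a hypothetical name.

```python
import uuid


def sanitize_id(doc_id: str) -> str:
    """Return doc_id unchanged if it parses as a UUID, else a derived UUID string."""
    try:
        return str(uuid.UUID(doc_id))
    except ValueError:
        # Assumption: derive a stable replacement so the same input maps to the same id.
        return str(uuid.uuid5(uuid.NAMESPACE_OID, doc_id))
```

A deterministic replacement keeps repeated writes of the same document pointing at the same id, which a random uuid would not.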
The Weaviate Python client is used to connect to the server. For more details, see https://weaviate.io/developers/weaviate/client-libraries/python
Usage:
- Start a Weaviate server (see https://weaviate.io/developers/weaviate/current/getting-started/installation.html)
- Init a WeaviateDocumentStore in Haystack
Connection Parameters Precedence: The selection and priority of connection parameters are as follows:
- If use_embedded is set to True, an embedded Weaviate instance will be used, and all other connection parameters will be ignored.
- If use_embedded is False or not provided and an api_key is provided, the api_key will be used to authenticate through AuthApiKey, assuming a connection to a Weaviate Cloud Service (WCS) instance.
- If neither use_embedded nor api_key is provided, but a username and password are provided, they will be used to authenticate through AuthClientPassword, assuming an OIDC Resource Owner Password flow.
- If none of the above conditions are met, no authentication method will be used and a connection will be attempted with the provided host and port values without any authentication.
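The precedence rules above can be sketched as a small decision function. This is a minimal illustration, not the library's implementation: the function name `select_auth_mode` and the returned labels are hypothetical, and only the ordering of the checks is taken from the text.

```python
from typing import Optional


def select_auth_mode(
    use_embedded: bool = False,
    api_key: Optional[str] = None,
    username: Optional[str] = None,
    password: Optional[str] = None,
) -> str:
    """Return a label for the connection mode chosen by the precedence rules."""
    if use_embedded:
        # An embedded instance wins; all other connection parameters are ignored.
        return "embedded"
    if api_key:
        # API key authentication (AuthApiKey), e.g. for a WCS instance.
        return "api_key"
    if username and password:
        # OIDC Resource Owner Password flow (AuthClientPassword).
        return "client_password"
    # No credentials: plain connection to the given host and port.
    return "anonymous"
```

For example, passing both an `api_key` and a `username`/`password` pair selects the API key, because that check comes first.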
Limitations: The current implementation does not support storing labels, so you cannot run any evaluation workflows.
WeaviateDocumentStore.__init__
def __init__(host: Union[str, List[str]] = "http://localhost",
port: Union[int, List[int]] = 8080,
timeout_config: tuple = (5, 15),
username: Optional[str] = None,
password: Optional[str] = None,
scope: Optional[str] = "offline_access",
api_key: Optional[str] = None,
use_embedded: bool = False,
embedded_options: Optional[dict] = None,
additional_headers: Optional[Dict[str, Any]] = None,
index: str = "Document",
embedding_dim: int = 768,
content_field: str = "content",
name_field: str = "name",
similarity: str = "cosine",
index_type: str = "hnsw",
custom_schema: Optional[dict] = None,
return_embedding: bool = False,
embedding_field: str = "embedding",
progress_bar: bool = True,
duplicate_documents: str = "overwrite",
recreate_index: bool = False,
replication_factor: int = 1,
batch_size: int = 10_000)
Arguments:
host
: Weaviate server connection URL for storing and processing documents and vectors. For more details, see Weaviate installation.
port
: The port of the Weaviate instance.
timeout_config
: The Weaviate timeout config as a tuple of (retries, time out seconds).
username
: The Weaviate username (standard authentication using http_auth).
password
: The Weaviate password (standard authentication using http_auth).
scope
: The scope of the credentials when using the OIDC Resource Owner Password or Client Credentials authentication flow.
api_key
: The Weaviate Cloud Services (WCS) API key (for WCS authentication).
use_embedded
: Whether to use an embedded Weaviate instance. Default: False.
embedded_options
: Custom options for the embedded Weaviate instance. Default: None.
additional_headers
: Additional headers to be included in the requests sent to Weaviate, for example the bearer token.
index
: Index name for document text, embedding, and metadata (in Weaviate terminology, this is a "Class" in the Weaviate schema).
embedding_dim
: The embedding vector size. Default: 768.
content_field
: Name of the field that might contain the answer and is passed to the Reader model (for example, "full_text"). If no Reader is used (for example, in FAQ-Style QA), the plain content of this field is returned.
name_field
: Name of the field that contains the title of the doc.
similarity
: The similarity function used to compare document vectors. Available options are 'cosine' (default), 'dot_product', and 'l2'. 'cosine' is recommended for Sentence Transformers.
index_type
: Index type of any vector object defined in the Weaviate schema. The vector index type is pluggable. Currently, only HNSW is supported. See also Weaviate documentation.
custom_schema
: Lets you create a custom schema in Weaviate. For more details, see Weaviate documentation.
module_name
: Vectorization module to convert data into vectors. Default is "text2vec-transformers". For more details, see Weaviate documentation.
return_embedding
: Whether to return the document embedding.
embedding_field
: Name of the field containing an embedding vector.
progress_bar
: Whether to show a tqdm progress bar or not. Can be helpful to disable in production deployments to keep the logs clean.
duplicate_documents
: How to handle duplicate documents. Options: 'skip' ignores duplicate documents; 'overwrite' updates any existing documents with the same ID when adding documents; 'fail' raises an error if the document ID of the document being added already exists.
recreate_index
: If set to True, deletes an existing Weaviate index and creates a new one using the config you are using for initialization. Note that all data in the old index is lost if you choose to recreate the index.
replication_factor
: Sets the Weaviate Class's replication factor in Weaviate at the time of Class creation. See also Weaviate documentation.
batch_size
: The number of documents to index at once.
WeaviateDocumentStore.get_document_by_id
def get_document_by_id(
id: str,
index: Optional[str] = None,
headers: Optional[Dict[str, str]] = None) -> Optional[Document]
Fetch a document by specifying its uuid string
WeaviateDocumentStore.get_documents_by_id
def get_documents_by_id(
ids: List[str],
index: Optional[str] = None,
batch_size: Optional[int] = None,
headers: Optional[Dict[str, str]] = None) -> List[Document]
Fetch documents by specifying a list of uuid strings.
WeaviateDocumentStore.write_documents
def write_documents(documents: Union[List[dict], List[Document]],
index: Optional[str] = None,
batch_size: Optional[int] = None,
duplicate_documents: Optional[str] = None,
headers: Optional[Dict[str, str]] = None)
Add new documents to the DocumentStore.
Arguments:
documents
: List of Dicts or List of Documents. A dummy embedding vector for each document is automatically generated if it is not provided. The document id needs to be in uuid format. Otherwise a correctly formatted uuid will be automatically generated based on the provided id.
index
: Index name for storing the docs and metadata.
batch_size
: When working with a large number of documents, batching can help reduce memory footprint. If no batch_size is provided, self.batch_size is used.
duplicate_documents
: How to handle duplicate documents. Options: 'skip' ignores duplicate documents; 'overwrite' updates any existing documents with the same ID when adding documents; 'fail' raises an error if the document ID of the document being added already exists.
Raises:
DuplicateDocumentError
: Exception triggered on a duplicate document.
Returns:
None
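The id handling and the three duplicate_documents options can be illustrated with an in-memory stand-in. This is a sketch under stated assumptions: the dict-based `store`, the helper `to_uuid`, and the local `DuplicateDocumentError` class are hypothetical, and deriving the uuid with `uuid.uuid5` is an assumption, since the text only says that a uuid is generated based on the provided id.

```python
import uuid
from typing import Dict, List


class DuplicateDocumentError(ValueError):
    """Local stand-in for the exception named above (hypothetical definition)."""


def to_uuid(doc_id: str) -> str:
    """Keep well-formed uuids; derive a deterministic uuid from anything else."""
    try:
        return str(uuid.UUID(doc_id))
    except ValueError:
        # Assumption: uuid5 over the raw id. The real derivation may differ.
        return str(uuid.uuid5(uuid.NAMESPACE_DNS, doc_id))


def write_documents(store: Dict[str, dict], documents: List[dict],
                    duplicate_documents: str = "overwrite") -> None:
    """Write dict documents into an in-memory store, honouring the duplicate policy."""
    for doc in documents:
        key = to_uuid(doc["id"])
        if key in store:
            if duplicate_documents == "skip":
                continue  # keep the existing document untouched
            if duplicate_documents == "fail":
                raise DuplicateDocumentError(f"Document {key} already exists")
        # "overwrite" (or a new id): store the document under its uuid
        store[key] = {**doc, "id": key}
```

Because the derived uuid is deterministic in this sketch, writing the same non-uuid id twice targets the same entry, which is what makes the duplicate policy meaningful.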
WeaviateDocumentStore.update_document_meta
def update_document_meta(id: str,
meta: Dict[str, Union[List, str, int, float, bool]],
index: Optional[str] = None)
Update the metadata dictionary of a document by specifying its string id. Overwrites only the specified fields, the unspecified ones remain unchanged.
WeaviateDocumentStore.get_embedding_count
def get_embedding_count(filters: Optional[FilterType] = None,
index: Optional[str] = None) -> int
Return the number of embeddings in the document store, which is the same as the number of documents since every document has a default embedding.
WeaviateDocumentStore.get_document_count
def get_document_count(filters: Optional[FilterType] = None,
index: Optional[str] = None,
only_documents_without_embedding: bool = False,
headers: Optional[Dict[str, str]] = None) -> int
Return the number of documents in the document store.
WeaviateDocumentStore.get_all_documents
def get_all_documents(
index: Optional[str] = None,
filters: Optional[FilterType] = None,
return_embedding: Optional[bool] = None,
batch_size: Optional[int] = None,
headers: Optional[Dict[str, str]] = None) -> List[Document]
Get documents from the document store.
Note this limitation from the changelog of Weaviate 1.8.0:
.. quote::
Due to the increasing cost of each page outlined above, there is a limit to
how many objects can be retrieved using pagination. By default setting the sum
of offset and limit to higher than 10,000 objects, will lead to an error.
If you must retrieve more than 10,000 objects, you can increase this limit by
setting the environment variable QUERY_MAXIMUM_RESULTS=<desired-value>.
Warning: Setting this to arbitrarily high values can make the memory consumption
of a single query explode and single queries can slow down the entire cluster.
We recommend setting this value to the lowest possible value that does not
interfere with your users' expectations.
(https://github.com/semi-technologies/weaviate/releases/tag/v1.8.0)
Arguments:
- index: Name of the index to get the documents from. If None, the DocumentStore's default index (self.index) will be used.
- filters: Optional filters to narrow down the search space to documents whose metadata fulfill certain conditions. Filters are defined as nested dictionaries. The keys of the dictionaries can be a logical operator ("$and", "$or", "$not"), a comparison operator ("$eq", "$in", "$gt", "$gte", "$lt", "$lte") or a metadata field name. Logical operator keys take a dictionary of metadata field names and/or logical operators as value. Metadata field names take a dictionary of comparison operators as value. Comparison operator keys take a single value or (in case of "$in") a list of values as value. If no logical operator is provided, "$and" is used as default operation. If no comparison operator is provided, "$eq" (or "$in" if the comparison value is a list) is used as default operation.
  Example:
  ```python
  filters = {
      "$and": {
          "type": {"$eq": "article"},
          "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
          "rating": {"$gte": 3},
          "$or": {
              "genre": {"$in": ["economy", "politics"]},
              "publisher": {"$eq": "nytimes"}
          }
      }
  }
  ```
- return_embedding: Whether to return the document embeddings.
- batch_size: When working with a large number of documents, batching can help reduce memory footprint. If no batch_size is provided, self.batch_size is used.
WeaviateDocumentStore.get_all_documents_generator
def get_all_documents_generator(
index: Optional[str] = None,
filters: Optional[FilterType] = None,
return_embedding: Optional[bool] = None,
batch_size: Optional[int] = None,
headers: Optional[Dict[str,
str]] = None) -> Generator[Document, None, None]
Get documents from the document store. Under the hood, documents are fetched in batches from the document store and yielded as individual documents. This method can be used to iteratively process a large number of documents without having to load all documents in memory.
Note this limitation from the changelog of Weaviate 1.8.0:
.. quote::
Due to the increasing cost of each page outlined above, there is a limit to
how many objects can be retrieved using pagination. By default setting the sum
of offset and limit to higher than 10,000 objects, will lead to an error.
If you must retrieve more than 10,000 objects, you can increase this limit by
setting the environment variable QUERY_MAXIMUM_RESULTS=<desired-value>.
Warning: Setting this to arbitrarily high values can make the memory consumption
of a single query explode and single queries can slow down the entire cluster.
We recommend setting this value to the lowest possible value that does not
interfere with your users' expectations.
(https://github.com/semi-technologies/weaviate/releases/tag/v1.8.0)
Arguments:
- index: Name of the index to get the documents from. If None, the DocumentStore's default index (self.index) will be used.
- filters: Optional filters to narrow down the search space to documents whose metadata fulfill certain conditions. Filters are defined as nested dictionaries. The keys of the dictionaries can be a logical operator ("$and", "$or", "$not"), a comparison operator ("$eq", "$in", "$gt", "$gte", "$lt", "$lte") or a metadata field name. Logical operator keys take a dictionary of metadata field names and/or logical operators as value. Metadata field names take a dictionary of comparison operators as value. Comparison operator keys take a single value or (in case of "$in") a list of values as value. If no logical operator is provided, "$and" is used as default operation. If no comparison operator is provided, "$eq" (or "$in" if the comparison value is a list) is used as default operation.
  Example:
  ```python
  filters = {
      "$and": {
          "type": {"$eq": "article"},
          "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
          "rating": {"$gte": 3},
          "$or": {
              "genre": {"$in": ["economy", "politics"]},
              "publisher": {"$eq": "nytimes"}
          }
      }
  }
  ```
- return_embedding: Whether to return the document embeddings.
- batch_size: When working with a large number of documents, batching can help reduce memory footprint. If no batch_size is provided, self.batch_size is used.
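The batch-wise fetching and the pagination limit quoted above can be sketched as a plain generator. This is an illustration only: `iter_documents` and the `fetch_page(offset, limit)` callable are hypothetical stand-ins for the store's internal paging, and the check mirrors the quoted rule that offset plus limit must not exceed 10,000 by default.

```python
from typing import Callable, Iterator, List

QUERY_MAXIMUM_RESULTS = 10_000  # default limit on offset + limit quoted above


def iter_documents(fetch_page: Callable[[int, int], List[dict]],
                   batch_size: int = 1_000,
                   max_results: int = QUERY_MAXIMUM_RESULTS) -> Iterator[dict]:
    """Yield documents one by one while fetching them page by page."""
    offset = 0
    while True:
        if offset + batch_size > max_results:
            # The server would reject this request, so fail early.
            raise RuntimeError(
                f"offset + limit would exceed {max_results}; "
                "raise QUERY_MAXIMUM_RESULTS on the server to read further"
            )
        page = fetch_page(offset, batch_size)
        yield from page
        if len(page) < batch_size:
            return  # a short (or empty) page means nothing is left
        offset += batch_size
```

Only one page is held in memory at a time, which is the point of the generator variant compared with loading the whole index into a list.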
WeaviateDocumentStore.query
def query(query: Optional[str],
filters: Optional[FilterType] = None,
top_k: int = 10,
custom_query: Optional[str] = None,
index: Optional[str] = None,
headers: Optional[Dict[str, str]] = None,
all_terms_must_match: bool = False,
scale_score: bool = True) -> List[Document]
Scan through documents in DocumentStore and return a small number of documents
that are most relevant to the query as defined by Weaviate semantic search.
Arguments:
- query: The query.
- filters: Optional filters to narrow down the search space to documents whose metadata fulfill certain conditions. Filters are defined as nested dictionaries. The keys of the dictionaries can be a logical operator ("$and", "$or", "$not"), a comparison operator ("$eq", "$in", "$gt", "$gte", "$lt", "$lte") or a metadata field name. Logical operator keys take a dictionary of metadata field names and/or logical operators as value. Metadata field names take a dictionary of comparison operators as value. Comparison operator keys take a single value or (in case of "$in") a list of values as value. If no logical operator is provided, "$and" is used as default operation. If no comparison operator is provided, "$eq" (or "$in" if the comparison value is a list) is used as default operation.
  Example:
  ```python
  filters = {
      "$and": {
          "type": {"$eq": "article"},
          "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
          "rating": {"$gte": 3},
          "$or": {
              "genre": {"$in": ["economy", "politics"]},
              "publisher": {"$eq": "nytimes"}
          }
      }
  }
  # or simpler using default operators
  filters = {
      "type": "article",
      "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
      "rating": {"$gte": 3},
      "$or": {
          "genre": ["economy", "politics"],
          "publisher": "nytimes"
      }
  }
  ```
  To use the same logical operator multiple times on the same level, logical operators optionally take a list of dictionaries as value.
  Example:
  ```python
  filters = {
      "$or": [
          {
              "$and": {
                  "Type": "News Paper",
                  "Date": {"$lt": "2019-01-01"}
              }
          },
          {
              "$and": {
                  "Type": "Blog Post",
                  "Date": {"$gte": "2019-01-01"}
              }
          }
      ]
  }
  ```
- top_k: How many documents to return per query.
- all_terms_must_match: Not used in Weaviate.
- custom_query: Custom query that will be executed using the query.raw method. For more details, refer to https://weaviate.io/developers/weaviate/current/graphql-references/filters.html
- index: The name of the index in the DocumentStore from which to retrieve documents.
- headers: Not used in Weaviate.
- scale_score: Whether to scale the similarity score to the unit interval (range of [0,1]). If true (default) similarity scores (e.g. cosine or dot_product) which naturally have a different value range will be scaled to a range of [0,1], where 1 means extremely relevant. Otherwise raw similarity scores (e.g. cosine or dot_product) will be used.
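To make the filter semantics described above concrete, here is a minimal in-memory evaluator for the nested filter dictionaries. It is a sketch, not the library's filter parser: the names `matches` and `_match_field` are hypothetical, a missing metadata field is assumed never to match, and "$not" is assumed to mean that none of the nested conditions may hold.

```python
from typing import Any, Dict

COMPARATORS = {
    "$eq": lambda field, value: field == value,
    "$in": lambda field, value: field in value,
    "$gt": lambda field, value: field > value,
    "$gte": lambda field, value: field >= value,
    "$lt": lambda field, value: field < value,
    "$lte": lambda field, value: field <= value,
}


def _match_field(field_value: Any, condition: Any) -> bool:
    """Check one metadata value against its comparison operators."""
    if not isinstance(condition, dict):
        # Default operator: "$in" for lists, "$eq" for single values.
        op = "$in" if isinstance(condition, list) else "$eq"
        condition = {op: condition}
    if field_value is None:
        return False  # assumption: a missing field never matches
    return all(COMPARATORS[op](field_value, value) for op, value in condition.items())


def matches(meta: Dict[str, Any], filters: Any, logic: str = "$and") -> bool:
    """Evaluate a nested filter dictionary against one document's metadata."""
    # A list of dicts lets the same logical operator appear several times.
    clauses = filters if isinstance(filters, list) else [filters]
    results = []
    for clause in clauses:
        for key, value in clause.items():
            if key in ("$and", "$or", "$not"):
                results.append(matches(meta, value, logic=key))
            else:
                results.append(_match_field(meta.get(key), value))
    if logic == "$or":
        return any(results)
    if logic == "$not":
        return not any(results)  # assumption: none of the conditions may hold
    return all(results)  # "$and" is the default logical operator
```

With this sketch, the explicit form and the shorthand form of a filter evaluate identically, which is the behaviour the default-operator rules describe.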
WeaviateDocumentStore.query_batch
def query_batch(queries: List[str],
filters: Optional[Union[FilterType,
List[Optional[FilterType]]]] = None,
top_k: int = 10,
custom_query: Optional[str] = None,
index: Optional[str] = None,
headers: Optional[Dict[str, str]] = None,
all_terms_must_match: bool = False,
scale_score: bool = True) -> List[List[Document]]
Scan through documents in DocumentStore and return a small number of documents
that are most relevant to the provided queries as defined by keyword matching algorithms like BM25.
This method lets you find relevant documents for a single query string (output: List of Documents), or a list of query strings (output: List of Lists of Documents).
Arguments:
- queries: Single query or list of queries.
- filters: Optional filters to narrow down the search space to documents whose metadata fulfill certain conditions. Filters are defined as nested dictionaries. The keys of the dictionaries can be a logical operator ("$and", "$or", "$not"), a comparison operator ("$eq", "$in", "$gt", "$gte", "$lt", "$lte") or a metadata field name. Logical operator keys take a dictionary of metadata field names and/or logical operators as value. Metadata field names take a dictionary of comparison operators as value. Comparison operator keys take a single value or (in case of "$in") a list of values as value. If no logical operator is provided, "$and" is used as default operation. If no comparison operator is provided, "$eq" (or "$in" if the comparison value is a list) is used as default operation.
  Example:
  ```python
  filters = {
      "$and": {
          "type": {"$eq": "article"},
          "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
          "rating": {"$gte": 3},
          "$or": {
              "genre": {"$in": ["economy", "politics"]},
              "publisher": {"$eq": "nytimes"}
          }
      }
  }
  # or simpler using default operators
  filters = {
      "type": "article",
      "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
      "rating": {"$gte": 3},
      "$or": {
          "genre": ["economy", "politics"],
          "publisher": "nytimes"
      }
  }
  ```
  To use the same logical operator multiple times on the same level, logical operators optionally take a list of dictionaries as value.
  Example:
  ```python
  filters = {
      "$or": [
          {
              "$and": {
                  "Type": "News Paper",
                  "Date": {"$lt": "2019-01-01"}
              }
          },
          {
              "$and": {
                  "Type": "Blog Post",
                  "Date": {"$gte": "2019-01-01"}
              }
          }
      ]
  }
  ```
- top_k: How many documents to return per query.
- custom_query: Custom query to be executed.
- index: The name of the index in the DocumentStore from which to retrieve documents.
- headers: Custom HTTP headers to pass to document store client if supported (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='} for basic authentication).
- all_terms_must_match: Whether all terms of the query must match the document. If true, all query terms must be present in a document in order to be retrieved (i.e. the AND operator is used implicitly between query terms: "cozy fish restaurant" -> "cozy AND fish AND restaurant"). Otherwise at least one query term must be present in a document in order to be retrieved (i.e. the OR operator is used implicitly between query terms: "cozy fish restaurant" -> "cozy OR fish OR restaurant"). Defaults to False.
- scale_score: Whether to scale the similarity score to the unit interval (range of [0,1]). If true (default) similarity scores (e.g. cosine or dot_product) which naturally have a different value range will be scaled to a range of [0,1], where 1 means extremely relevant. Otherwise raw similarity scores (e.g. cosine or dot_product) will be used.
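The AND/OR behaviour of all_terms_must_match can be shown with a toy matcher. This is a simplified sketch: `term_match` is a hypothetical helper, and splitting on whitespace stands in for the analyzers a real keyword engine would apply.

```python
def term_match(query: str, text: str, all_terms_must_match: bool = False) -> bool:
    """Return True if the text satisfies the query under AND or OR semantics."""
    terms = query.lower().split()
    tokens = set(text.lower().split())
    hits = [term in tokens for term in terms]
    # AND: every term must occur. OR: a single occurring term is enough.
    return all(hits) if all_terms_must_match else any(hits)
```

With the query "cozy fish restaurant", a text that mentions only "cozy" and "restaurant" is retrieved under the default OR semantics but rejected when all_terms_must_match is True.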
WeaviateDocumentStore.query_by_embedding
def query_by_embedding(query_emb: np.ndarray,
filters: Optional[FilterType] = None,
top_k: int = 10,
index: Optional[str] = None,
return_embedding: Optional[bool] = None,
headers: Optional[Dict[str, str]] = None,
scale_score: bool = True) -> List[Document]
Find the document that is most similar to the provided query_emb
by using a vector similarity metric.
Arguments:
- query_emb: Embedding of the query (e.g. gathered from DPR).
- filters: Optional filters to narrow down the search space to documents whose metadata fulfill certain conditions. Filters are defined as nested dictionaries. The keys of the dictionaries can be a logical operator ("$and", "$or", "$not"), a comparison operator ("$eq", "$in", "$gt", "$gte", "$lt", "$lte") or a metadata field name. Logical operator keys take a dictionary of metadata field names and/or logical operators as value. Metadata field names take a dictionary of comparison operators as value. Comparison operator keys take a single value or (in case of "$in") a list of values as value. If no logical operator is provided, "$and" is used as default operation. If no comparison operator is provided, "$eq" (or "$in" if the comparison value is a list) is used as default operation.
  Example:
  ```python
  filters = {
      "$and": {
          "type": {"$eq": "article"},
          "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
          "rating": {"$gte": 3},
          "$or": {
              "genre": {"$in": ["economy", "politics"]},
              "publisher": {"$eq": "nytimes"}
          }
      }
  }
  # or simpler using default operators
  filters = {
      "type": "article",
      "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
      "rating": {"$gte": 3},
      "$or": {
          "genre": ["economy", "politics"],
          "publisher": "nytimes"
      }
  }
  ```
  To use the same logical operator multiple times on the same level, logical operators optionally take a list of dictionaries as value.
  Example:
  ```python
  filters = {
      "$or": [
          {
              "$and": {
                  "Type": "News Paper",
                  "Date": {"$lt": "2019-01-01"}
              }
          },
          {
              "$and": {
                  "Type": "Blog Post",
                  "Date": {"$gte": "2019-01-01"}
              }
          }
      ]
  }
  ```
- top_k: How many documents to return.
- index: Name of the index to retrieve the documents from.
- return_embedding: Whether to return the document embedding.
- scale_score: Whether to scale the similarity score to the unit interval (range of [0,1]). If true (default) similarity scores (e.g. cosine or dot_product) which naturally have a different value range will be scaled to a range of [0,1], where 1 means extremely relevant. Otherwise raw similarity scores (e.g. cosine or dot_product) will be used.
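One way to map raw similarity scores onto the unit interval, as scale_score describes, is sketched below. The function name `scale_to_unit_interval` is hypothetical, and both formulas are assumptions for illustration: a linear shift for cosine (which lies in [-1, 1]) and a logistic squash with an arbitrary divisor of 100 for the unbounded dot product. The text does not state which formulas the store applies.

```python
import math


def scale_to_unit_interval(score: float, similarity: str) -> float:
    """Map a raw similarity score to [0, 1], where 1 means most relevant."""
    if similarity == "cosine":
        # Cosine similarity lies in [-1, 1]; shift and halve it.
        return (score + 1) / 2
    if similarity == "dot_product":
        # Unbounded score: squash with a logistic function.
        # The divisor 100 is an illustrative assumption.
        return 1 / (1 + math.exp(-score / 100))
    raise ValueError(f"No scaling defined in this sketch for {similarity!r}")
```

Scaled scores from different similarity functions become comparable in range, but not in meaning, so a threshold tuned for cosine should not be reused for dot_product.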
WeaviateDocumentStore.update_embeddings
def update_embeddings(retriever: DenseRetriever,
index: Optional[str] = None,
filters: Optional[FilterType] = None,
update_existing_embeddings: bool = True,
batch_size: Optional[int] = None)
Updates the embeddings in the document store using the encoding model specified in the retriever.
This can be useful if you want to change the embeddings for your documents (e.g. after changing the retriever config).
Arguments:
- retriever: Retriever to use to update the embeddings.
- index: Index name to update.
- update_existing_embeddings: Weaviate mandates an embedding while creating the document itself. This option must always be True for Weaviate, and the embeddings of all documents are updated.
- filters: Optional filters to narrow down the search space to documents whose metadata fulfill certain conditions. Filters are defined as nested dictionaries. The keys of the dictionaries can be a logical operator ("$and", "$or", "$not"), a comparison operator ("$eq", "$in", "$gt", "$gte", "$lt", "$lte") or a metadata field name. Logical operator keys take a dictionary of metadata field names and/or logical operators as value. Metadata field names take a dictionary of comparison operators as value. Comparison operator keys take a single value or (in case of "$in") a list of values as value. If no logical operator is provided, "$and" is used as default operation. If no comparison operator is provided, "$eq" (or "$in" if the comparison value is a list) is used as default operation.
  Example:
  ```python
  filters = {
      "$and": {
          "type": {"$eq": "article"},
          "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
          "rating": {"$gte": 3},
          "$or": {
              "genre": {"$in": ["economy", "politics"]},
              "publisher": {"$eq": "nytimes"}
          }
      }
  }
  ```
- batch_size: When working with a large number of documents, batching can help reduce memory footprint. If no batch_size is specified, self.batch_size is used.
Returns:
None
WeaviateDocumentStore.delete_all_documents
def delete_all_documents(index: Optional[str] = None,
filters: Optional[FilterType] = None,
headers: Optional[Dict[str, str]] = None)
Delete documents in an index. All documents are deleted if no filters are passed.
Arguments:
- index: Index name to delete the document from.
- filters: Optional filters to narrow down the search space to documents whose metadata fulfill certain conditions. Filters are defined as nested dictionaries. The keys of the dictionaries can be a logical operator ("$and", "$or", "$not"), a comparison operator ("$eq", "$in", "$gt", "$gte", "$lt", "$lte") or a metadata field name. Logical operator keys take a dictionary of metadata field names and/or logical operators as value. Metadata field names take a dictionary of comparison operators as value. Comparison operator keys take a single value or (in case of "$in") a list of values as value. If no logical operator is provided, "$and" is used as default operation. If no comparison operator is provided, "$eq" (or "$in" if the comparison value is a list) is used as default operation.
  Example:
  ```python
  filters = {
      "$and": {
          "type": {"$eq": "article"},
          "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
          "rating": {"$gte": 3},
          "$or": {
              "genre": {"$in": ["economy", "politics"]},
              "publisher": {"$eq": "nytimes"}
          }
      }
  }
  ```
Returns:
None
WeaviateDocumentStore.delete_documents
def delete_documents(index: Optional[str] = None,
ids: Optional[List[str]] = None,
filters: Optional[FilterType] = None,
headers: Optional[Dict[str, str]] = None)
Delete documents in an index. All documents are deleted if no filters are passed.
Arguments:
- index: Index name to delete the document from. If None, the DocumentStore's default index (self.index) will be used.
- ids: Optional list of IDs to narrow down the documents to be deleted.
- filters: Optional filters to narrow down the search space to documents whose metadata fulfill certain conditions. Filters are defined as nested dictionaries. The keys of the dictionaries can be a logical operator ("$and", "$or", "$not"), a comparison operator ("$eq", "$in", "$gt", "$gte", "$lt", "$lte") or a metadata field name. Logical operator keys take a dictionary of metadata field names and/or logical operators as value. Metadata field names take a dictionary of comparison operators as value. Comparison operator keys take a single value or (in case of "$in") a list of values as value. If no logical operator is provided, "$and" is used as default operation. If no comparison operator is provided, "$eq" (or "$in" if the comparison value is a list) is used as default operation.
  Example:
  ```python
  filters = {
      "$and": {
          "type": {"$eq": "article"},
          "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
          "rating": {"$gte": 3},
          "$or": {
              "genre": {"$in": ["economy", "politics"]},
              "publisher": {"$eq": "nytimes"}
          }
      }
  }
  ```
  If filters are provided along with a list of IDs, this method deletes the intersection of the two query results (documents that match the filters and have their ID in the list).
Returns:
None
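The intersection rule for ids and filters can be sketched against an in-memory dict. This is an illustration under assumptions: the `store` layout and the `predicate` callable (standing in for an evaluated filter dictionary) are hypothetical and do not reflect how the store talks to Weaviate.

```python
from typing import Callable, Dict, List, Optional


def delete_documents(store: Dict[str, dict],
                     ids: Optional[List[str]] = None,
                     predicate: Optional[Callable[[dict], bool]] = None) -> None:
    """Delete documents selected by ids, by a filter predicate, or by both."""
    # Start from the given ids, or from every document when no ids are passed.
    doomed = set(store) if ids is None else set(ids) & set(store)
    if predicate is not None:
        # Keep only candidates whose metadata also satisfies the filter.
        doomed = {doc_id for doc_id in doomed if predicate(store[doc_id]["meta"])}
    for doc_id in doomed:
        del store[doc_id]
```

Calling the sketch with neither ids nor a predicate empties the store, so callers that build these arguments dynamically should guard against passing nothing by accident.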
WeaviateDocumentStore.delete_index
def delete_index(index: str)
Delete an existing index. The index including all data will be removed.
Arguments:
index
: The name of the index to delete.
Returns:
None
WeaviateDocumentStore.delete_labels
def delete_labels(index: Optional[str] = None,
ids: Optional[List[str]] = None,
filters: Optional[FilterType] = None,
headers: Optional[Dict[str, str]] = None)
Implemented to respect BaseDocumentStore's contract.
Weaviate does not support labels (yet).
WeaviateDocumentStore.get_all_labels
def get_all_labels(index: Optional[str] = None,
filters: Optional[FilterType] = None,
headers: Optional[Dict[str, str]] = None) -> List[Label]
Implemented to respect BaseDocumentStore's contract.
Weaviate does not support labels (yet).
WeaviateDocumentStore.get_label_count
def get_label_count(index: Optional[str] = None,
headers: Optional[Dict[str, str]] = None) -> int
Implemented to respect BaseDocumentStore's contract.
Weaviate does not support labels (yet).
WeaviateDocumentStore.write_labels
def write_labels(labels: Union[List[Label], List[dict]],
index: Optional[str] = None,
headers: Optional[Dict[str, str]] = None)
Implemented to respect BaseDocumentStore's contract.
Weaviate does not support labels (yet).
WeaviateDocumentStore.get_all_labels_aggregated
def get_all_labels_aggregated(
index: Optional[str] = None,
filters: Optional[FilterType] = None,
open_domain: bool = True,
drop_negative_labels: bool = False,
drop_no_answers: bool = False,
aggregate_by_meta: Optional[Union[str, list]] = None,
headers: Optional[Dict[str, str]] = None) -> List[MultiLabel]
Return all labels in the DocumentStore, aggregated into MultiLabel objects.
This aggregation step helps, for example, if you collected multiple possible answers for one question and you want now all answers bundled together in one place for evaluation. How they are aggregated is defined by the open_domain and aggregate_by_meta parameters. If the questions are being asked to a single document (i.e. SQuAD style), you should set open_domain=False to aggregate by question and document. If the questions are being asked to your full collection of documents, you should set open_domain=True to aggregate just by question. If the questions are being asked to a subslice of your document set (e.g. product review use cases), you should set open_domain=True and populate aggregate_by_meta with the names of Label meta fields to aggregate by question and your custom meta fields. For example, in a product review use case, you might set aggregate_by_meta=["product_id"] so that Labels with the same question but different answers from different documents are aggregated into the one MultiLabel object, provided that they have the same product_id (to be found in Label.meta["product_id"])
Arguments:
- index: Name of the index to get the labels from. If None, the DocumentStore's default index (self.index) will be used.
- filters: Optional filters to narrow down the search space to documents whose metadata fulfill certain conditions. Filters are defined as nested dictionaries. The keys of the dictionaries can be a logical operator ("$and", "$or", "$not"), a comparison operator ("$eq", "$in", "$gt", "$gte", "$lt", "$lte") or a metadata field name. Logical operator keys take a dictionary of metadata field names and/or logical operators as value. Metadata field names take a dictionary of comparison operators as value. Comparison operator keys take a single value or (in case of "$in") a list of values as value. If no logical operator is provided, "$and" is used as default operation. If no comparison operator is provided, "$eq" (or "$in" if the comparison value is a list) is used as default operation.
  Example:
  ```python
  filters = {
      "$and": {
          "type": {"$eq": "article"},
          "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
          "rating": {"$gte": 3},
          "$or": {
              "genre": {"$in": ["economy", "politics"]},
              "publisher": {"$eq": "nytimes"}
          }
      }
  }
  ```
- open_domain: When True, labels are aggregated purely based on the question text alone. When False, labels are aggregated in a closed domain fashion based on the question text and also the id of the document that the label is tied to. In this setting, this function might return multiple MultiLabel objects with the same question string.
- headers: Custom HTTP headers to pass to document store client if supported (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='} for basic authentication).
- aggregate_by_meta: The names of the Label meta fields by which to aggregate. For example: ["product_id"]
- drop_negative_labels: When True, labels with incorrect answers and documents are dropped.
- drop_no_answers: When True, labels with no answers are dropped.
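The grouping controlled by open_domain and aggregate_by_meta can be sketched with plain dictionaries. This is a simplified illustration: `aggregate_labels` is a hypothetical helper, labels are modelled as dicts with "question", "document_id" and "meta" keys, and the result is a mapping from group key to labels rather than real MultiLabel objects.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


def aggregate_labels(labels: List[dict],
                     open_domain: bool = True,
                     aggregate_by_meta: Optional[List[str]] = None) -> Dict[Tuple, List[dict]]:
    """Group label dicts by the key that the parameters above describe."""
    groups: Dict[Tuple, List[dict]] = defaultdict(list)
    for label in labels:
        key = [label["question"]]
        if not open_domain:
            # Closed domain: the same question on another document is a new group.
            key.append(label["document_id"])
        for field in aggregate_by_meta or []:
            # Custom meta fields refine the grouping further.
            key.append(label.get("meta", {}).get(field))
        groups[tuple(key)].append(label)
    return dict(groups)
```

In the product review case, aggregate_by_meta=["product_id"] with open_domain=True keeps answers for the same question together per product, even when they come from different documents.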
WeaviateDocumentStore.normalize_embedding
@staticmethod
def normalize_embedding(emb: np.ndarray) -> None
Performs L2 normalization of embedding vectors in place. Input can be a single vector (1D array) or a matrix (2D array).
WeaviateDocumentStore.add_eval_data
def add_eval_data(filename: str,
doc_index: str = "eval_document",
label_index: str = "label",
batch_size: Optional[int] = None,
preprocessor: Optional[PreProcessor] = None,
max_docs: Optional[Union[int, bool]] = None,
open_domain: bool = False,
headers: Optional[Dict[str, str]] = None)
Adds a SQuAD-formatted file to the DocumentStore in order to be able to perform evaluation on it.
If a jsonl file and a batch_size are passed to the function, documents are loaded batchwise from disk and also indexed batchwise to the DocumentStore in order to prevent out of memory errors.
Arguments:
filename
: Name of the file containing evaluation data (json or jsonl)
doc_index
: Elasticsearch index where evaluation documents should be stored
label_index
: Elasticsearch index where labeled questions should be stored
batch_size
: Optional number of documents that are loaded and processed at a time. When set to None (default) all documents are processed at once.
preprocessor
: Optional PreProcessor to preprocess evaluation documents. It can be used for splitting documents into passages (and assigning labels to corresponding passages). Currently the PreProcessor does not support split_by sentence, cleaning nor split_overlap != 0. When set to None (default) preprocessing is disabled.
max_docs
: Optional number of documents that will be loaded. When set to None (default) all available eval documents are used.
open_domain
: Set this to True if your file is an open domain dataset where two different answers to the same question might be found in different contexts.
headers
: Custom HTTP headers to pass to document store client if supported (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='} for basic authentication)
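The batchwise loading described for jsonl files can be pictured with the following sketch. It only illustrates the idea of reading a line-delimited file in chunks of `batch_size`; the function name `read_jsonl_in_batches` and the sample records are assumptions for illustration, not part of the library.

```python
import io
import json
from typing import Dict, Iterator, List, TextIO


def read_jsonl_in_batches(stream: TextIO, batch_size: int) -> Iterator[List[Dict]]:
    """Yield lists of at most `batch_size` parsed JSON objects, one per line."""
    batch: List[Dict] = []
    for line in stream:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        batch.append(json.loads(line))
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # emit the final, possibly smaller batch


# An in-memory stand-in for a jsonl file on disk.
sample = io.StringIO('{"title": "a"}\n{"title": "b"}\n{"title": "c"}\n')
batch_sizes = [len(b) for b in read_jsonl_in_batches(sample, batch_size=2)]
print(batch_sizes)  # [2, 1]
```

Only one batch is held in memory at a time, which is what keeps memory use bounded for large evaluation files.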
WeaviateDocumentStore.run
def run(documents: List[Union[dict, Document]],
index: Optional[str] = None,
headers: Optional[Dict[str, str]] = None,
id_hash_keys: Optional[List[str]] = None)
Run requests of document stores
Comment: We will gradually introduce the primitives. The document stores also accept dicts and parse them to documents. In the future, however, only documents themselves will be accepted. Parsing the dictionaries in the run function is therefore only an interim solution until the run function also accepts documents.
Arguments:
documents
: A list of dicts that are documents.
headers
: A list of headers.
index
: Optional name of index where the documents shall be written to. If None, the DocumentStore's default index (self.index) will be used.
id_hash_keys
: List of the fields that the hashes of the ids are generated from.
WeaviateDocumentStore.describe_documents
def describe_documents(index=None)
Return a summary of the documents in the document store
Module deepsetcloud
disable_and_log
def disable_and_log(func)
Decorator that disables write operations. Instead of executing the operation, it shows a warning and logs the inputs.
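A decorator of this kind could look roughly like the sketch below. This is an assumption-laden illustration, not the library's source: the names `disable_and_log_sketch` and `ReadOnlyStore` are hypothetical, and the exact log message is invented.

```python
import functools
import logging

logger = logging.getLogger(__name__)


def disable_and_log_sketch(func):
    """Replace `func` with a no-op that logs a warning and the call inputs."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        logger.warning(
            "Write operation %s is disabled. Called with args=%r kwargs=%r",
            func.__name__, args, kwargs,
        )
        return None  # the wrapped function is never executed

    return wrapper


class ReadOnlyStore:
    """Hypothetical store whose write method is switched off."""

    def __init__(self):
        self.documents = []

    @disable_and_log_sketch
    def write_documents(self, documents):
        self.documents.extend(documents)


store = ReadOnlyStore()
result = store.write_documents([{"content": "hello"}])
print(result, store.documents)  # None []
```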
DeepsetCloudDocumentStore
class DeepsetCloudDocumentStore(KeywordDocumentStore)
DeepsetCloudDocumentStore.__init__
def __init__(api_key: Optional[str] = None,
workspace: str = "default",
index: Optional[str] = None,
duplicate_documents: str = "overwrite",
api_endpoint: Optional[str] = None,
similarity: str = "dot_product",
return_embedding: bool = False,
label_index: str = "default",
embedding_dim: int = 768,
use_prefiltering: bool = False,
search_fields: Union[str, list] = "content")
A DocumentStore facade enabling you to interact with the documents stored in deepset Cloud.
Thus you can run experiments like trying new nodes, pipelines, etc. without having to index your data again.
You can also use this DocumentStore to create new pipelines on deepset Cloud. To do that, take the following steps:
- create a new DeepsetCloudDocumentStore without an index (e.g. DeepsetCloudDocumentStore())
- create query and indexing pipelines using this DocumentStore
- call Pipeline.save_to_deepset_cloud() passing the pipelines and a pipeline_config_name
- call Pipeline.deploy_on_deepset_cloud() passing the pipeline_config_name
DeepsetCloudDocumentStore is not intended for use in production-like scenarios. See https://haystack.deepset.ai/components/document-store for more information.
Arguments:
api_key
: Secret value of the API key. If not specified, will be read from DEEPSET_CLOUD_API_KEY environment variable. See docs on how to generate an API key for your workspace: https://docs.cloud.deepset.ai/docs/connect-deepset-cloud-to-your-application
workspace
: Workspace name in deepset Cloud.
index
: Name of the index to access within the deepset Cloud workspace. This typically equals the name of your pipeline. You can run Pipeline.list_pipelines_on_deepset_cloud() to see all available ones. If you set index to None, this DocumentStore will always return empty results. This is especially useful if you want to create a new Pipeline within deepset Cloud (see Pipeline.save_to_deepset_cloud() and Pipeline.deploy_on_deepset_cloud()).
duplicate_documents
: Handle duplicate documents based on parameter options. Parameter options: 'skip': Ignore the duplicate documents. 'overwrite': Update any existing documents with the same ID when adding documents. 'fail': An error is raised if the document ID of the document being added already exists.
api_endpoint
: The URL of the deepset Cloud API. If not specified, will be read from DEEPSET_CLOUD_API_ENDPOINT environment variable. If DEEPSET_CLOUD_API_ENDPOINT environment variable is not specified either, defaults to "https://api.cloud.deepset.ai/api/v1".
similarity
: The similarity function used to compare document vectors. 'dot_product' is the default since it is more performant with DPR embeddings. 'cosine' is recommended if you are using a Sentence Transformer model.
label_index
: Index for the evaluation set interface.
return_embedding
: Whether to return document embeddings.
embedding_dim
: Specifies the dimensionality of the embedding vector (only needed when using a dense retriever, for example, DensePassageRetriever or EmbeddingRetriever, on top).
use_prefiltering
: By default, DeepsetCloudDocumentStore uses post-filtering when querying with filters. To use pre-filtering instead, set this parameter to True. Note that pre-filtering comes at the cost of higher latency.
search_fields
: Names of fields BM25Retriever uses to find matches to the incoming query in the documents, for example: ["content", "title"].
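The fallback order described for api_key and api_endpoint (explicit argument, then environment variable, then the documented default endpoint) can be sketched as follows. The helper `resolve_cloud_settings` is hypothetical and only mirrors the behaviour stated above; it is not how the class is implemented internally.

```python
import os
from typing import Mapping, Optional, Tuple

# Default taken from the api_endpoint description above.
DEFAULT_API_ENDPOINT = "https://api.cloud.deepset.ai/api/v1"


def resolve_cloud_settings(
    api_key: Optional[str] = None,
    api_endpoint: Optional[str] = None,
    env: Mapping[str, str] = os.environ,
) -> Tuple[Optional[str], str]:
    """Prefer explicit arguments, then environment variables, then the default."""
    key = api_key if api_key is not None else env.get("DEEPSET_CLOUD_API_KEY")
    endpoint = (
        api_endpoint
        if api_endpoint is not None
        else env.get("DEEPSET_CLOUD_API_ENDPOINT", DEFAULT_API_ENDPOINT)
    )
    return key, endpoint


# A fake environment keeps the example independent of the real process environment.
fake_env = {"DEEPSET_CLOUD_API_KEY": "secret-from-env"}
print(resolve_cloud_settings(env=fake_env))
# ('secret-from-env', 'https://api.cloud.deepset.ai/api/v1')
```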
DeepsetCloudDocumentStore.get_all_documents
def get_all_documents(
index: Optional[str] = None,
filters: Optional[FilterType] = None,
return_embedding: Optional[bool] = None,
batch_size: int = 10_000,
headers: Optional[Dict[str, str]] = None) -> List[Document]
Get documents from the document store.
Arguments:
-
index
: Name of the index to get the documents from. If None, the DocumentStore's default index (self.index) will be used. -
filters
: Optional filters to narrow down the search space to documents whose metadata fulfill certain conditions. Filters are defined as nested dictionaries. The keys of the dictionaries can be a logical operator ("$and", "$or", "$not"), a comparison operator ("$eq", "$in", "$gt", "$gte", "$lt", "$lte") or a metadata field name. Logical operator keys take a dictionary of metadata field names and/or logical operators as value. Metadata field names take a dictionary of comparison operators as value. Comparison operator keys take a single value or (in case of "$in") a list of values as value. If no logical operator is provided, "$and" is used as default operation. If no comparison operator is provided, "$eq" (or "$in" if the comparison value is a list) is used as default operation.

__Example__:

```python
filters = {
    "$and": {
        "type": {"$eq": "article"},
        "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
        "rating": {"$gte": 3},
        "$or": {
            "genre": {"$in": ["economy", "politics"]},
            "publisher": {"$eq": "nytimes"}
        }
    }
}
```
-
return_embedding
: Whether to return the document embeddings. -
batch_size
: Number of documents that are passed to bulk function at a time. -
headers
: Custom HTTP headers to pass to document store client if supported (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='} for basic authentication)
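To make the filter semantics described for the filters argument more concrete, the sketch below evaluates such a nested dictionary against one document's metadata in plain Python. It is an illustration under simplifying assumptions, not the document store's own logic: only "$and" and "$or" are handled as logical operators ("$not" and list-valued logical operators are left out), and the names `matches` and `_field_matches` are hypothetical.

```python
from typing import Any, Dict

COMPARATORS = {
    "$eq": lambda actual, expected: actual == expected,
    "$in": lambda actual, expected: actual in expected,
    "$gt": lambda actual, expected: actual > expected,
    "$gte": lambda actual, expected: actual >= expected,
    "$lt": lambda actual, expected: actual < expected,
    "$lte": lambda actual, expected: actual <= expected,
}


def _field_matches(actual: Any, condition: Any) -> bool:
    """Check one metadata value against its comparison operators."""
    if isinstance(condition, list):
        condition = {"$in": condition}  # default operator for list values
    elif not isinstance(condition, dict):
        condition = {"$eq": condition}  # default operator for single values
    if actual is None:
        return False  # a missing field never matches in this sketch
    return all(COMPARATORS[op](actual, expected) for op, expected in condition.items())


def matches(meta: Dict[str, Any], filters: Dict[str, Any], combine: str = "$and") -> bool:
    """Evaluate a nested filter dict; sibling keys are combined with `combine`."""
    results = []
    for key, value in filters.items():
        if key in ("$and", "$or"):
            results.append(matches(meta, value, combine=key))
        else:
            results.append(_field_matches(meta.get(key), value))
    return any(results) if combine == "$or" else all(results)


filters = {
    "$and": {
        "type": {"$eq": "article"},
        "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
        "rating": {"$gte": 3},
        "$or": {
            "genre": {"$in": ["economy", "politics"]},
            "publisher": {"$eq": "nytimes"},
        },
    }
}
meta = {"type": "article", "date": "2018-06-01", "rating": 4,
        "genre": "sports", "publisher": "nytimes"}

print(matches(meta, filters))  # True: the "$or" branch is satisfied by the publisher
print(matches(meta, {"type": "article", "genre": ["economy", "politics"]}))  # False
```

The second call relies on the default operators: sibling keys are joined with "$and", a bare value means "$eq", and a list means "$in".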
DeepsetCloudDocumentStore.get_all_documents_generator
def get_all_documents_generator(
index: Optional[str] = None,
filters: Optional[FilterType] = None,
return_embedding: Optional[bool] = None,
batch_size: int = 10_000,
headers: Optional[Dict[str,
str]] = None) -> Generator[Document, None, None]
Get documents from the document store. Under the hood, documents are fetched in batches from the document store and yielded as individual documents. This method can be used to iteratively process a large number of documents without having to load all documents in memory.
Arguments:
-
index
: Name of the index to get the documents from. If None, the DocumentStore's default index (self.index) will be used. -
filters
: Optional filters to narrow down the search space to documents whose metadata fulfill certain conditions. Filters are defined as nested dictionaries. The keys of the dictionaries can be a logical operator ("$and", "$or", "$not"), a comparison operator ("$eq", "$in", "$gt", "$gte", "$lt", "$lte") or a metadata field name. Logical operator keys take a dictionary of metadata field names and/or logical operators as value. Metadata field names take a dictionary of comparison operators as value. Comparison operator keys take a single value or (in case of "$in") a list of values as value. If no logical operator is provided, "$and" is used as default operation. If no comparison operator is provided, "$eq" (or "$in" if the comparison value is a list) is used as default operation.

__Example__:

```python
filters = {
    "$and": {
        "type": {"$eq": "article"},
        "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
        "rating": {"$gte": 3},
        "$or": {
            "genre": {"$in": ["economy", "politics"]},
            "publisher": {"$eq": "nytimes"}
        }
    }
}
```
-
return_embedding
: Whether to return the document embeddings. -
batch_size
: When working with large number of documents, batching can help reduce memory footprint. -
headers
: Custom HTTP headers to pass to document store client if supported (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='} for basic authentication)
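The "fetch in batches, yield one by one" behaviour of the generator variant can be pictured with this sketch. The paging callback `fake_fetch` and the function `iter_documents` are hypothetical stand-ins; how the real store pages through its backend is not specified here.

```python
from typing import Callable, Dict, Iterator, List


def iter_documents(
    fetch_batch: Callable[[int, int], List[Dict]], batch_size: int
) -> Iterator[Dict]:
    """Yield documents one at a time while requesting them in batches."""
    offset = 0
    while True:
        batch = fetch_batch(offset, batch_size)
        if not batch:
            return  # an empty batch signals that nothing is left
        yield from batch
        offset += len(batch)


# A tiny in-memory corpus plays the role of the remote document store.
CORPUS = [{"id": str(i)} for i in range(5)]
calls = []


def fake_fetch(offset: int, limit: int) -> List[Dict]:
    calls.append((offset, limit))
    return CORPUS[offset:offset + limit]


ids = [doc["id"] for doc in iter_documents(fake_fetch, batch_size=2)]
print(ids)    # ['0', '1', '2', '3', '4']
print(calls)  # [(0, 2), (2, 2), (4, 2), (5, 2)]
```

At most `batch_size` documents are materialized at once, which is the memory benefit the batch_size description refers to.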
DeepsetCloudDocumentStore.query_by_embedding
def query_by_embedding(query_emb: np.ndarray,
filters: Optional[FilterType] = None,
top_k: int = 10,
index: Optional[str] = None,
return_embedding: Optional[bool] = None,
headers: Optional[Dict[str, str]] = None,
scale_score: bool = True) -> List[Document]
Find the documents that are most similar to the provided query_emb by using a vector similarity metric.
Arguments:
-
query_emb
: Embedding of the query (e.g. gathered from DPR) -
filters
: Optional filters to narrow down the search space to documents whose metadata fulfill certain conditions. Filters are defined as nested dictionaries. The keys of the dictionaries can be a logical operator ("$and", "$or", "$not"), a comparison operator ("$eq", "$in", "$gt", "$gte", "$lt", "$lte") or a metadata field name. Logical operator keys take a dictionary of metadata field names and/or logical operators as value. Metadata field names take a dictionary of comparison operators as value. Comparison operator keys take a single value or (in case of "$in") a list of values as value. If no logical operator is provided, "$and" is used as default operation. If no comparison operator is provided, "$eq" (or "$in" if the comparison value is a list) is used as default operation.

__Example__:

```python
filters = {
    "$and": {
        "type": {"$eq": "article"},
        "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
        "rating": {"$gte": 3},
        "$or": {
            "genre": {"$in": ["economy", "politics"]},
            "publisher": {"$eq": "nytimes"}
        }
    }
}
# or simpler using default operators
filters = {
    "type": "article",
    "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
    "rating": {"$gte": 3},
    "$or": {
        "genre": ["economy", "politics"],
        "publisher": "nytimes"
    }
}
```

To use the same logical operator multiple times on the same level, logical operators optionally take a list of dictionaries as value.

__Example__:

```python
filters = {
    "$or": [
        {
            "$and": {
                "Type": "News Paper",
                "Date": {"$lt": "2019-01-01"}
            }
        },
        {
            "$and": {
                "Type": "Blog Post",
                "Date": {"$gte": "2019-01-01"}
            }
        }
    ]
}
```
-
top_k
: How many documents to return -
index
: Index name for storing the docs and metadata -
return_embedding
: To return document embedding -
headers
: Custom HTTP headers to pass to requests -
scale_score
: Whether to scale the similarity score to the unit interval (range of [0,1]). If true (default) similarity scores (e.g. cosine or dot_product) which naturally have a different value range will be scaled to a range of [0,1], where 1 means extremely relevant. Otherwise raw similarity scores (e.g. cosine or dot_product) will be used.
DeepsetCloudDocumentStore.query
def query(query: Optional[str],
filters: Optional[FilterType] = None,
top_k: int = 10,
custom_query: Optional[str] = None,
index: Optional[str] = None,
headers: Optional[Dict[str, str]] = None,
all_terms_must_match: bool = False,
scale_score: bool = True) -> List[Document]
Scan through documents in DocumentStore and return a small number of documents that are most relevant to the query as defined by the BM25 algorithm.
Arguments:
-
query
: The query -
filters
: Optional filters to narrow down the search space to documents whose metadata fulfill certain conditions. Filters are defined as nested dictionaries. The keys of the dictionaries can be a logical operator ("$and", "$or", "$not"), a comparison operator ("$eq", "$in", "$gt", "$gte", "$lt", "$lte") or a metadata field name. Logical operator keys take a dictionary of metadata field names and/or logical operators as value. Metadata field names take a dictionary of comparison operators as value. Comparison operator keys take a single value or (in case of "$in") a list of values as value. If no logical operator is provided, "$and" is used as default operation. If no comparison operator is provided, "$eq" (or "$in" if the comparison value is a list) is used as default operation.

__Example__:

```python
filters = {
    "$and": {
        "type": {"$eq": "article"},
        "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
        "rating": {"$gte": 3},
        "$or": {
            "genre": {"$in": ["economy", "politics"]},
            "publisher": {"$eq": "nytimes"}
        }
    }
}
# or simpler using default operators
filters = {
    "type": "article",
    "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
    "rating": {"$gte": 3},
    "$or": {
        "genre": ["economy", "politics"],
        "publisher": "nytimes"
    }
}
```

To use the same logical operator multiple times on the same level, logical operators optionally take a list of dictionaries as value.

__Example__:

```python
filters = {
    "$or": [
        {
            "$and": {
                "Type": "News Paper",
                "Date": {"$lt": "2019-01-01"}
            }
        },
        {
            "$and": {
                "Type": "Blog Post",
                "Date": {"$gte": "2019-01-01"}
            }
        }
    ]
}
```
-
top_k
: How many documents to return per query. -
custom_query
: Custom query to be executed. -
index
: The name of the index in the DocumentStore from which to retrieve documents -
headers
: Custom HTTP headers to pass to requests -
all_terms_must_match
: Whether all terms of the query must match the document. If True, all query terms must be present in a document in order to be retrieved (i.e. the AND operator is used implicitly between query terms: "cozy fish restaurant" -> "cozy AND fish AND restaurant"). Otherwise, at least one query term must be present in a document in order to be retrieved (i.e. the OR operator is used implicitly between query terms: "cozy fish restaurant" -> "cozy OR fish OR restaurant"). Defaults to False. -
scale_score
: Whether to scale the similarity score to the unit interval (range of [0,1]). If true (default) similarity scores (e.g. cosine or dot_product) which naturally have a different value range will be scaled to a range of [0,1], where 1 means extremely relevant. Otherwise raw similarity scores (e.g. cosine or dot_product) will be used.
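The effect of all_terms_must_match on which documents qualify at all can be illustrated with a toy matcher. This is only a sketch of the implicit AND/OR between query terms: it uses naive whitespace tokenization, performs no BM25 scoring, and the function name `term_match` is hypothetical.

```python
def term_match(query: str, document: str, all_terms_must_match: bool = False) -> bool:
    """Return whether a document qualifies under implicit AND / OR semantics."""
    query_terms = query.lower().split()
    doc_terms = set(document.lower().split())
    hits = [term in doc_terms for term in query_terms]
    # AND: every query term must occur; OR: a single occurrence is enough.
    return all(hits) if all_terms_must_match else any(hits)


doc = "a cozy restaurant by the harbour"
print(term_match("cozy fish restaurant", doc))                             # True
print(term_match("cozy fish restaurant", doc, all_terms_must_match=True))  # False
```

With the default (False) the document qualifies because "cozy" and "restaurant" occur; with True it is excluded because "fish" is missing.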
DeepsetCloudDocumentStore.write_documents
@disable_and_log
def write_documents(documents: Union[List[dict], List[Document]],
index: Optional[str] = None,
batch_size: int = 10_000,
duplicate_documents: Optional[str] = None,
headers: Optional[Dict[str, str]] = None)
Indexes documents for later queries.
Arguments:
documents
: A list of Python dictionaries or a list of Haystack Document objects. For documents as dictionaries, the format is {"text": ""}. Optionally, include meta data via {"text": "", "meta": {"name": "", "author": "somebody", ...}}. It can be used for filtering and is accessible in the responses of the Finder.
index
: Optional name of index where the documents shall be written to. If None, the DocumentStore's default index (self.index) will be used.
batch_size
: Number of documents that are passed to bulk function at a time.
duplicate_documents
: Handle duplicate documents based on parameter options. Parameter options: 'skip': Ignore the duplicate documents. 'overwrite': Update any existing documents with the same ID when adding documents. 'fail': An error is raised if the document ID of the document being added already exists.
headers
: Custom HTTP headers to pass to document store client if supported (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='} for basic authentication)
Returns:
None
DeepsetCloudDocumentStore.update_document_meta
@disable_and_log
def update_document_meta(id: str,
meta: Dict[str, Any],
index: Optional[str] = None)
Update the metadata dictionary of a document by specifying its string id.
Arguments:
id
: The ID of the Document whose metadata is being updated.
meta
: A dictionary with key-value pairs that should be added / changed for the provided Document ID.
index
: Name of the index the Document is located at.
DeepsetCloudDocumentStore.get_evaluation_sets
def get_evaluation_sets() -> List[dict]
Returns a list of the evaluation sets uploaded to deepset Cloud.
Returns:
A list of evaluation sets as dicts. Each dict contains the fields "name", "evaluation_set_id", "created_at", "matched_labels", and "total_labels".
DeepsetCloudDocumentStore.get_all_labels
def get_all_labels(index: Optional[str] = None,
filters: Optional[FilterType] = None,
headers: Optional[Dict[str, str]] = None) -> List[Label]
Returns a list of labels for the given index name.
Arguments:
index
: Optional name of evaluation set for which labels should be searched. If None, the DocumentStore's default label_index (self.label_index) will be used.
headers
: Not supported.
Returns:
list of Labels.
DeepsetCloudDocumentStore.get_label_count
def get_label_count(index: Optional[str] = None,
headers: Optional[Dict[str, str]] = None) -> int
Counts the number of labels for the given index and returns the value.
Arguments:
index
: Optional evaluation set name for which the labels should be counted. If None, the DocumentStore's default label_index (self.label_index) will be used.
headers
: Not supported.
Returns:
number of labels for the given index
DeepsetCloudDocumentStore.get_all_labels_aggregated
def get_all_labels_aggregated(
index: Optional[str] = None,
filters: Optional[FilterType] = None,
open_domain: bool = True,
drop_negative_labels: bool = False,
drop_no_answers: bool = False,
aggregate_by_meta: Optional[Union[str, list]] = None,
headers: Optional[Dict[str, str]] = None) -> List[MultiLabel]
Return all labels in the DocumentStore, aggregated into MultiLabel objects.
This aggregation step helps, for example, if you collected multiple possible answers for one question and now want all answers bundled together in one place for evaluation. How they are aggregated is defined by the open_domain and aggregate_by_meta parameters:
- If the questions are being asked to a single document (i.e. SQuAD style), set open_domain=False to aggregate by question and document.
- If the questions are being asked to your full collection of documents, set open_domain=True to aggregate just by question.
- If the questions are being asked to a subslice of your document set (e.g. product review use cases), set open_domain=True and populate aggregate_by_meta with the names of Label meta fields to aggregate by question and your custom meta fields.
For example, in a product review use case, you might set aggregate_by_meta=["product_id"] so that Labels with the same question but different answers from different documents are aggregated into one MultiLabel object, provided that they have the same product_id (to be found in Label.meta["product_id"]).
Arguments:
-
index
: Name of the index to get the labels from. If None, the DocumentStore's default index (self.index) will be used. -
filters
: Optional filters to narrow down the search space to documents whose metadata fulfill certain conditions. Filters are defined as nested dictionaries. The keys of the dictionaries can be a logical operator ("$and", "$or", "$not"), a comparison operator ("$eq", "$in", "$gt", "$gte", "$lt", "$lte") or a metadata field name. Logical operator keys take a dictionary of metadata field names and/or logical operators as value. Metadata field names take a dictionary of comparison operators as value. Comparison operator keys take a single value or (in case of "$in") a list of values as value. If no logical operator is provided, "$and" is used as default operation. If no comparison operator is provided, "$eq" (or "$in" if the comparison value is a list) is used as default operation.

__Example__:

```python
filters = {
    "$and": {
        "type": {"$eq": "article"},
        "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
        "rating": {"$gte": 3},
        "$or": {
            "genre": {"$in": ["economy", "politics"]},
            "publisher": {"$eq": "nytimes"}
        }
    }
}
```
-
open_domain
: When True, labels are aggregated purely based on the question text alone. When False, labels are aggregated in a closed domain fashion based on the question text and also the id of the document that the label is tied to. In this setting, this function might return multiple MultiLabel objects with the same question string. -
headers
: Custom HTTP headers to pass to document store client if supported (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='} for basic authentication) -
aggregate_by_meta
: The names of the Label meta fields by which to aggregate. For example: ["product_id"] -
drop_negative_labels
: When True, labels with incorrect answers and documents are dropped. -
drop_no_answers
: When True, labels with no answers are dropped.
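The grouping rules behind open_domain and aggregate_by_meta can be sketched with plain dictionaries. This is a simplified illustration, not the library's code: labels are modelled as dicts with assumed keys "query", "document_id", and "meta", the function `aggregate_labels` is hypothetical, and it returns raw groups rather than MultiLabel objects.

```python
from collections import defaultdict
from typing import Any, Dict, List, Optional, Tuple


def aggregate_labels(
    labels: List[Dict[str, Any]],
    open_domain: bool = True,
    aggregate_by_meta: Optional[List[str]] = None,
) -> Dict[Tuple, List[Dict[str, Any]]]:
    """Group labels by question, optionally also by document id and meta fields."""
    groups = defaultdict(list)
    for label in labels:
        key = [label["query"]]
        if not open_domain:
            key.append(label["document_id"])  # closed domain: question + document
        for field in aggregate_by_meta or []:
            key.append(label.get("meta", {}).get(field))
        groups[tuple(key)].append(label)
    return dict(groups)


labels = [
    {"query": "Is it durable?", "document_id": "d1", "answer": "yes", "meta": {"product_id": "p1"}},
    {"query": "Is it durable?", "document_id": "d2", "answer": "very", "meta": {"product_id": "p1"}},
    {"query": "Is it durable?", "document_id": "d3", "answer": "no", "meta": {"product_id": "p2"}},
]

print(len(aggregate_labels(labels, open_domain=True)))   # 1 group: same question
print(len(aggregate_labels(labels, open_domain=False)))  # 3 groups: one per document
print(len(aggregate_labels(labels, aggregate_by_meta=["product_id"])))  # 2 groups: one per product
```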
DeepsetCloudDocumentStore.normalize_embedding
@staticmethod
def normalize_embedding(emb: np.ndarray) -> None
Performs L2 normalization of the embedding vector(s) in place. Input can be a single vector (1D array) or a matrix (2D array).
DeepsetCloudDocumentStore.add_eval_data
def add_eval_data(filename: str,
doc_index: str = "eval_document",
label_index: str = "label",
batch_size: Optional[int] = None,
preprocessor: Optional[PreProcessor] = None,
max_docs: Optional[Union[int, bool]] = None,
open_domain: bool = False,
headers: Optional[Dict[str, str]] = None)
Adds a SQuAD-formatted file to the DocumentStore in order to be able to perform evaluation on it.
If a jsonl file and a batch_size are passed to the function, documents are loaded batchwise from disk and also indexed batchwise to the DocumentStore in order to prevent out of memory errors.
Arguments:
filename
: Name of the file containing evaluation data (json or jsonl)
doc_index
: Elasticsearch index where evaluation documents should be stored
label_index
: Elasticsearch index where labeled questions should be stored
batch_size
: Optional number of documents that are loaded and processed at a time. When set to None (default) all documents are processed at once.
preprocessor
: Optional PreProcessor to preprocess evaluation documents. It can be used for splitting documents into passages (and assigning labels to corresponding passages). Currently the PreProcessor does not support split_by sentence, cleaning nor split_overlap != 0. When set to None (default) preprocessing is disabled.
max_docs
: Optional number of documents that will be loaded. When set to None (default) all available eval documents are used.
open_domain
: Set this to True if your file is an open domain dataset where two different answers to the same question might be found in different contexts.
headers
: Custom HTTP headers to pass to document store client if supported (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='} for basic authentication)
DeepsetCloudDocumentStore.run
def run(documents: List[Union[dict, Document]],
index: Optional[str] = None,
headers: Optional[Dict[str, str]] = None,
id_hash_keys: Optional[List[str]] = None)
Run requests of document stores
Comment: We will gradually introduce the primitives. The document stores also accept dicts and parse them to documents. In the future, however, only documents themselves will be accepted. Parsing the dictionaries in the run function is therefore only an interim solution until the run function also accepts documents.
Arguments:
documents
: A list of dicts that are documents.
headers
: A list of headers.
index
: Optional name of index where the documents shall be written to. If None, the DocumentStore's default index (self.index) will be used.
id_hash_keys
: List of the fields that the hashes of the ids are generated from.
DeepsetCloudDocumentStore.describe_documents
def describe_documents(index=None)
Return a summary of the documents in the document store
Module pinecone
PineconeDocumentStore
class PineconeDocumentStore(BaseDocumentStore)
Document store for very large scale embedding-based dense retrievers like the DPR. This is a hosted document store: your vectors are not stored locally but in the cloud, and the similarity search runs in the cloud as well.
It implements the Pinecone vector database (https://www.pinecone.io) to perform similarity search on vectors. In order to use this document store, you need an API key that you can obtain by creating an account on the Pinecone website.
The document text is stored using the SQLDocumentStore, while the vector embeddings and metadata (for filtering) are indexed in a Pinecone Index.
PineconeDocumentStore.__init__
def __init__(api_key: str,
environment: str = "us-west1-gcp",
pinecone_index: Optional["pinecone.Index"] = None,
embedding_dim: int = 768,
pods: int = 1,
pod_type: str = "p1.x1",
return_embedding: bool = False,
index: str = "document",
similarity: str = "cosine",
replicas: int = 1,
shards: int = 1,
namespace: Optional[str] = None,
embedding_field: str = "embedding",
progress_bar: bool = True,
duplicate_documents: str = "overwrite",
recreate_index: bool = False,
metadata_config: Optional[Dict] = None,
validate_index_sync: bool = True,
pool_threads: int = DEFAULT_POOL_THREADS)
Arguments:
api_key
: Pinecone vector database API key (https://app.pinecone.io).
environment
: Pinecone cloud environment. Uses "us-west1-gcp" by default. Other GCP and AWS regions are supported; contact Pinecone if required.
pinecone_index
: pinecone-client Index object. An index will be initialized or loaded if not specified.
embedding_dim
: The embedding vector size.
pods
: The number of pods for the index to use, including replicas. Defaults to 1.
pod_type
: The type of pod to use. Defaults to "p1.x1".
return_embedding
: Whether to return document embeddings.
index
: Name of index in document store to use.
similarity
: The similarity function used to compare document vectors. "cosine" is the default and is recommended if you are using a Sentence-Transformer model. "dot_product" is more performant with DPR embeddings. In both cases, the returned values in Document.score are normalized to be in range [0,1]:
- For "dot_product": expit(np.asarray(raw_score / 100))
- For "cosine": (raw_score + 1) / 2
replicas
: The number of replicas. Replicas duplicate the index. They provide higher availability and throughput.
shards
: The number of shards to be used in the index. We recommend using 1 shard per 1GB of data.
namespace
: Optional namespace. If not specified, None is default.
embedding_field
: Name of field containing an embedding vector.
progress_bar
: Whether to show a tqdm progress bar or not. Can be helpful to disable in production deployments to keep the logs clean.
duplicate_documents
: Handle duplicate documents based on parameter options. Parameter options:
- "skip": Ignore the duplicate documents.
- "overwrite": Update any existing documents with the same ID when adding documents.
- "fail": An error is raised if the document ID of the document being added already exists.
recreate_index
: If set to True, an existing Pinecone index will be deleted and a new one will be created using the config you are using for initialization. Be aware that all data in the old index will be lost if you choose to recreate the index, and that both the document_index and the label_index will be recreated.
metadata_config
: Which metadata fields should be indexed, part of the selective metadata filtering feature. Should be in the format {"indexed": ["metadata-field-1", "metadata-field-2", "metadata-field-n"]}. By default, no fields are indexed.
pool_threads
: Number of threads to use for index upsert.
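The two score normalizations listed under similarity can be written out in plain Python as follows. This sketch only restates the formulas given above; `expit` is re-implemented locally as the logistic function 1 / (1 + e^(-x)) so the example needs no third-party packages, and the function name `scale_to_unit_interval` is hypothetical.

```python
import math


def expit(x: float) -> float:
    """Logistic sigmoid, the function that the name expit refers to."""
    return 1.0 / (1.0 + math.exp(-x))


def scale_to_unit_interval(raw_score: float, similarity: str) -> float:
    """Map a raw similarity score into [0, 1] using the documented formulas."""
    if similarity == "cosine":
        return (raw_score + 1) / 2      # cosine lies in [-1, 1]
    if similarity == "dot_product":
        return expit(raw_score / 100)   # squash an unbounded score
    raise ValueError(f"Unsupported similarity: {similarity!r}")


print(scale_to_unit_interval(1.0, "cosine"))       # 1.0
print(scale_to_unit_interval(-1.0, "cosine"))      # 0.0
print(scale_to_unit_interval(0.0, "dot_product"))  # 0.5
```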
PineconeDocumentStore.get_document_count
def get_document_count(filters: Optional[FilterType] = None,
index: Optional[str] = None,
only_documents_without_embedding: bool = False,
headers: Optional[Dict[str, str]] = None,
namespace: Optional[str] = None,
type_metadata: Optional[DocTypeMetadata] = None) -> int
Return the count of documents in the document store.
Arguments:
-
filters
: Optional filters to narrow down the documents which will be counted. Filters are defined as nested dictionaries. The keys of the dictionaries can be a logical operator ("$and", "$or", "$not"), a comparison operator ("$eq", "$in", "$gt", "$gte", "$lt", "$lte"), or a metadata field name. Logical operator keys take a dictionary of metadata field names or logical operators as value. Metadata field names take a dictionary of comparison operators as value. Comparison operator keys take a single value or (in case of "$in") a list of values as value. If no logical operator is provided, "$and" is used as default operation. If no comparison operator is provided, "$eq" (or "$in" if the comparison value is a list) is used as default operation.

Example:

```python
filters = {
    "$and": {
        "type": {"$eq": "article"},
        "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
        "rating": {"$gte": 3},
        "$or": {
            "genre": {"$in": ["economy", "politics"]},
            "publisher": {"$eq": "nytimes"}
        }
    }
}
```
-
index
: Optional index name to use for the query. If not provided, the default index name is used. -
only_documents_without_embedding
: If set to True, only documents without embeddings are counted. -
headers
: PineconeDocumentStore does not support headers. -
namespace
: Optional namespace to count documents from. If not specified, None is default. -
type_metadata
: Optional value for doc_type metadata to reference documents that need to be counted. Parameter options:
- "vector": Documents with embedding.
- "no-vector": Documents without embedding (dummy embedding only).
- "label": Labels.
PineconeDocumentStore.get_embedding_count
def get_embedding_count(filters: Optional[FilterType] = None,
index: Optional[str] = None,
namespace: Optional[str] = None) -> int
Return the count of embeddings in the document store.
Arguments:
-
index
: Optional index name to retrieve all documents from. -
filters
: Optional filters to narrow down the documents with embedding which will be counted. Filters are defined as nested dictionaries. The keys of the dictionaries can be a logical operator ("$and", "$or", "$not"), a comparison operator ("$eq", "$in", "$gt", "$gte", "$lt", "$lte"), or a metadata field name. Logical operator keys take a dictionary of metadata field names or logical operators as value. Metadata field names take a dictionary of comparison operators as value. Comparison operator keys take a single value or (in case of "$in") a list of values as value. If no logical operator is provided, "$and" is used as default operation. If no comparison operator is provided, "$eq" (or "$in" if the comparison value is a list) is used as default operation.

Example:

```python
filters = {
    "$and": {
        "type": {"$eq": "article"},
        "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
        "rating": {"$gte": 3},
        "$or": {
            "genre": {"$in": ["economy", "politics"]},
            "publisher": {"$eq": "nytimes"}
        }
    }
}
```
-
namespace
: Optional namespace to count embeddings from. If not specified, None is default.
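The filter semantics described above can be sketched in plain Python. This is a minimal, illustrative evaluator and not the library's implementation: the names `matches`, `_eval_logical`, and `_eval_field` are hypothetical, a missing metadata field is assumed never to match, and "$not" is assumed to negate the conjunction of its children.

```python
import operator
from typing import Any, Dict, List, Union

# Comparison operators named in the filter description.
COMPARISONS = {
    "$eq": operator.eq,
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
    "$in": lambda field, values: field in values,
}
LOGICAL = ("$and", "$or", "$not")

Conditions = Union[Dict[str, Any], List[Dict[str, Any]]]


def _eval_field(name: str, spec: Any, meta: Dict[str, Any]) -> bool:
    """Evaluate the comparison(s) attached to one metadata field."""
    if name not in meta:
        return False  # assumption: a missing field never matches
    if not isinstance(spec, dict):
        # No comparison operator given: "$in" for lists, "$eq" otherwise.
        spec = {"$in": spec} if isinstance(spec, list) else {"$eq": spec}
    return all(COMPARISONS[op](meta[name], operand) for op, operand in spec.items())


def _eval_logical(op: str, conditions: Conditions, meta: Dict[str, Any]) -> bool:
    """Combine child conditions with the given logical operator."""
    if isinstance(conditions, list):
        # A list of dictionaries: each item is an implicit "$and" group.
        results = [_eval_logical("$and", item, meta) for item in conditions]
    else:
        results = [
            _eval_logical(key, value, meta) if key in LOGICAL
            else _eval_field(key, value, meta)
            for key, value in conditions.items()
        ]
    if op == "$and":
        return all(results)
    if op == "$or":
        return any(results)
    return not all(results)  # "$not" (assumed semantics)


def matches(filters: Dict[str, Any], meta: Dict[str, Any]) -> bool:
    """Top level behaves like "$and" when no logical operator is given."""
    return _eval_logical("$and", filters, meta)


filters = {
    "type": "article",
    "rating": {"$gte": 3},
    "$or": {"genre": ["economy", "politics"], "publisher": "nytimes"},
}
doc_meta = {"type": "article", "rating": 4, "genre": "sports", "publisher": "nytimes"}
print(matches(filters, doc_meta))  # True
```

The actual document store translates filters into queries for its backend rather than evaluating them in Python; the sketch only illustrates how the default operators are resolved.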
PineconeDocumentStore.write_documents
def write_documents(documents: Union[List[dict], List[Document]],
index: Optional[str] = None,
batch_size: int = DEFAULT_BATCH_SIZE,
duplicate_documents: Optional[str] = None,
headers: Optional[Dict[str, str]] = None,
labels: Optional[bool] = False,
namespace: Optional[str] = None,
use_async: bool = False,
document_chunk_size: int = DEFAULT_DOCUMENT_CHUNK_SIZE)
Add new documents to the DocumentStore.
Arguments:
documents
: List of Dicts or list of Documents. If they already contain embeddings, we'll index them right away in Pinecone. If not, you can later call update_embeddings() to create & index them.
index
: Index name for storing the docs and metadata.
batch_size
: Number of documents to upsert at a time. When working with a large number of documents, batching can help to reduce the memory footprint.
duplicate_documents
: Handle duplicate documents based on parameter options. Parameter options:
"skip"
: Ignore the duplicate documents.
"overwrite"
: Update any existing documents with the same ID when adding documents.
"fail"
: An error is raised if the document ID of the document being added already exists.
headers
: PineconeDocumentStore does not support headers.
labels
: Tells us whether these records are labels or not. Defaults to False.
namespace
: Optional namespace to write documents to. If not specified, None is default.
use_async
: If set to True, Pinecone index will upsert documents in parallel.
document_chunk_size
: Number of documents to process at a time. If use_async is set to True, this works together with batch_size to speed up document upserts by running them in parallel.
Raises:
DuplicateDocumentError
: Exception triggered on duplicate document.
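The three duplicate_documents options can be illustrated with a small in-memory sketch. Everything here is hypothetical and for illustration only: a plain dict stands in for the index, `write_with_policy` is not a library function, and the local `DuplicateDocumentError` class merely mirrors the name of the exception listed above.

```python
from typing import Dict, List


class DuplicateDocumentError(ValueError):
    """Local stand-in for the exception named in the documentation."""


def write_with_policy(
    store: Dict[str, dict],
    documents: List[dict],
    duplicate_documents: str = "overwrite",
) -> int:
    """Write documents into `store` and return how many were written."""
    if duplicate_documents not in ("skip", "overwrite", "fail"):
        raise ValueError(f"Unknown option: {duplicate_documents}")
    if duplicate_documents == "fail":
        # Check up front so nothing is written when a clash exists.
        clashes = [doc["id"] for doc in documents if doc["id"] in store]
        if clashes:
            raise DuplicateDocumentError(f"IDs already exist: {clashes}")
    written = 0
    for doc in documents:
        if doc["id"] in store and duplicate_documents == "skip":
            continue  # keep the stored version untouched
        store[doc["id"]] = doc  # new document, or "overwrite"
        written += 1
    return written


store: Dict[str, dict] = {}
write_with_policy(store, [{"id": "1", "content": "old"}])
write_with_policy(store, [{"id": "1", "content": "new"}], duplicate_documents="skip")
print(store["1"]["content"])  # old
```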
PineconeDocumentStore.update_embeddings
def update_embeddings(retriever: DenseRetriever,
index: Optional[str] = None,
update_existing_embeddings: bool = True,
filters: Optional[FilterType] = None,
batch_size: int = DEFAULT_BATCH_SIZE,
namespace: Optional[str] = None,
use_async: bool = False,
document_chunk_size: int = DEFAULT_DOCUMENT_CHUNK_SIZE)
Updates the embeddings in the document store using the encoding model specified in the retriever.
This can be useful if you want to add or change the embeddings for your documents (e.g. after changing the retriever config).
Arguments:
-
retriever
: Retriever to use to get embeddings for text. -
index
: Index name for which embeddings are to be updated. If set to None, the default self.index is used. -
update_existing_embeddings
: Whether to update existing embeddings of the documents. If set to False, only documents without embeddings are processed. This mode can be used for incremental updating of embeddings, in which only newly indexed documents get processed. -
filters
: Optional filters to narrow down the documents for which embeddings are to be updated. Filters are defined as nested dictionaries. The keys of the dictionaries can be a logical operator ("$and", "$or", "$not"), a comparison operator ("$eq", "$in", "$gt", "$gte", "$lt", "$lte") or a metadata field name. Logical operator keys take a dictionary of metadata field names and/or logical operators as value. Metadata field names take a dictionary of comparison operators as value. Comparison operator keys take a single value or (in case of "$in") a list of values as value. If no logical operator is provided, "$and" is used as default operation. If no comparison operator is provided, "$eq" (or "$in" if the comparison value is a list) is used as default operation. Example:
```python
filters = {
    "$and": {
        "type": {"$eq": "article"},
        "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
        "rating": {"$gte": 3},
        "$or": {
            "genre": {"$in": ["economy", "politics"]},
            "publisher": {"$eq": "nytimes"}
        }
    }
}
```
-
batch_size
: Number of documents to process at a time. When working with large number of documents, batching can help reduce memory footprint. -
namespace
: Optional namespace to retrieve document from. If not specified, None is default. -
use_async
: If set to True, Pinecone index will update embeddings in parallel. -
document_chunk_size
: Number of documents to process at a time. If use_async is set to True, this works together with batch_size to speed up updating the embeddings by running the updates in parallel.
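The incremental mode described for update_existing_embeddings can be sketched as follows. This is only an illustration under stated assumptions: documents are plain dicts, `refresh_embeddings` is a hypothetical helper, and `toy_embed` is a stand-in for a retriever's encoding model.

```python
from typing import Callable, Dict, List


def refresh_embeddings(
    documents: List[Dict],
    embed: Callable[[List[str]], List[List[float]]],
    update_existing_embeddings: bool = True,
    batch_size: int = 2,
) -> int:
    """Embed documents in batches and return how many were processed."""
    # In incremental mode, only documents that have no embedding yet are selected.
    targets = [
        doc for doc in documents
        if update_existing_embeddings or doc.get("embedding") is None
    ]
    for start in range(0, len(targets), batch_size):
        batch = targets[start:start + batch_size]
        vectors = embed([doc["content"] for doc in batch])
        for doc, vector in zip(batch, vectors):
            doc["embedding"] = vector
    return len(targets)


def toy_embed(texts: List[str]) -> List[List[float]]:
    """Stand-in encoder: one-dimensional vector holding the text length."""
    return [[float(len(text))] for text in texts]


docs = [
    {"content": "already embedded", "embedding": [0.5]},
    {"content": "new"},
]
print(refresh_embeddings(docs, toy_embed, update_existing_embeddings=False))  # 1
```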
PineconeDocumentStore.get_all_documents
def get_all_documents(index: Optional[str] = None,
filters: Optional[FilterType] = None,
return_embedding: Optional[bool] = None,
batch_size: int = DEFAULT_BATCH_SIZE,
headers: Optional[Dict[str, str]] = None,
type_metadata: Optional[DocTypeMetadata] = None,
namespace: Optional[str] = None) -> List[Document]
Retrieves all documents in the index.
Arguments:
-
index
: Optional index name to retrieve all documents from. -
filters
: Optional filters to narrow down the documents that will be retrieved. Filters are defined as nested dictionaries. The keys of the dictionaries can be a logical operator ("$and", "$or", "$not"), a comparison operator ("$eq", "$in", "$gt", "$gte", "$lt", "$lte") or a metadata field name. Logical operator keys take a dictionary of metadata field names and/or logical operators as value. Metadata field names take a dictionary of comparison operators as value. Comparison operator keys take a single value or (in case of "$in") a list of values as value. If no logical operator is provided, "$and" is used as default operation. If no comparison operator is provided, "$eq" (or "$in" if the comparison value is a list) is used as default operation. Example:
```python
filters = {
    "$and": {
        "type": {"$eq": "article"},
        "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
        "rating": {"$gte": 3},
        "$or": {
            "genre": {"$in": ["economy", "politics"]},
            "publisher": {"$eq": "nytimes"}
        }
    }
}
```
-
return_embedding
: Optional flag to return the embedding of the document. -
batch_size
: Number of documents to process at a time. When working with large number of documents, batching can help reduce memory footprint. -
headers
: Pinecone does not support headers. -
type_metadata
: Value of doc_type metadata that indicates which documents need to be retrieved. -
namespace
: Optional namespace to retrieve documents from. If not specified, None is default.
PineconeDocumentStore.get_all_documents_generator
def get_all_documents_generator(
index: Optional[str] = None,
filters: Optional[FilterType] = None,
return_embedding: Optional[bool] = None,
batch_size: int = DEFAULT_BATCH_SIZE,
headers: Optional[Dict[str, str]] = None,
namespace: Optional[str] = None,
type_metadata: Optional[DocTypeMetadata] = None,
include_type_metadata: Optional[bool] = False
) -> Generator[Document, None, None]
Get all documents from the document store. Under the hood, documents are fetched in batches from the document store and yielded as individual documents. This method can be used to iteratively process a large number of documents without having to load all documents in memory.
Arguments:
-
index
: Name of the index to get the documents from. If None, the DocumentStore's default index (self.index) will be used. -
filters
: Optional filters to narrow down the documents that will be retrieved. Filters are defined as nested dictionaries. The keys of the dictionaries can be a logical operator ("$and", "$or", "$not"), a comparison operator ("$eq", "$in", "$gt", "$gte", "$lt", "$lte") or a metadata field name. Logical operator keys take a dictionary of metadata field names and/or logical operators as value. Metadata field names take a dictionary of comparison operators as value. Comparison operator keys take a single value or (in case of "$in") a list of values as value. If no logical operator is provided, "$and" is used as default operation. If no comparison operator is provided, "$eq" (or "$in" if the comparison value is a list) is used as default operation. Example:
```python
filters = {
    "$and": {
        "type": {"$eq": "article"},
        "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
        "rating": {"$gte": 3},
        "$or": {
            "genre": {"$in": ["economy", "politics"]},
            "publisher": {"$eq": "nytimes"}
        }
    }
}
```
-
return_embedding
: Whether to return the document embeddings. -
batch_size
: When working with large number of documents, batching can help reduce memory footprint. -
headers
: PineconeDocumentStore does not support headers. -
namespace
: Optional namespace to retrieve document from. If not specified, None is default. -
type_metadata
: Value of doc_type metadata that indicates which documents need to be retrieved. -
include_type_metadata
: Indicates if doc_type value will be included in document metadata or not. If not specified, doc_type field will be dropped from document metadata.
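The batch-then-yield behavior described for this generator can be sketched in a few lines. The helper `iter_documents` and the `fetch` callback are hypothetical; the callback stands in for whatever call retrieves a batch of documents by ID from the backend.

```python
from typing import Callable, Dict, Iterator, List


def iter_documents(
    ids: List[str],
    fetch: Callable[[List[str]], List[Dict]],
    batch_size: int = 2,
) -> Iterator[Dict]:
    """Fetch documents batch by batch, yielding them one at a time."""
    for start in range(0, len(ids), batch_size):
        # Only one batch is held in memory at any point.
        for doc in fetch(ids[start:start + batch_size]):
            yield doc


backend = {
    "a": {"content": "first"},
    "b": {"content": "second"},
    "c": {"content": "third"},
}
calls: List[List[str]] = []


def fetch(batch_ids: List[str]) -> List[Dict]:
    """Record each backend call so the batching is visible."""
    calls.append(list(batch_ids))
    return [backend[i] for i in batch_ids]


for doc in iter_documents(["a", "b", "c"], fetch, batch_size=2):
    print(doc["content"])
print(calls)  # [['a', 'b'], ['c']]
```

Because the function is a generator, the second batch is requested only once the consumer has moved past the first one.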
PineconeDocumentStore.get_documents_by_id
def get_documents_by_id(
ids: List[str],
index: Optional[str] = None,
batch_size: int = DEFAULT_BATCH_SIZE,
headers: Optional[Dict[str, str]] = None,
return_embedding: Optional[bool] = None,
namespace: Optional[str] = None,
include_type_metadata: Optional[bool] = False) -> List[Document]
Retrieves all documents in the index using their IDs.
Arguments:
ids
: List of IDs to retrieve.
index
: Optional index name to retrieve all documents from.
batch_size
: Number of documents to retrieve at a time. When working with a large number of documents, batching can help reduce memory footprint.
headers
: Pinecone does not support headers.
return_embedding
: Optional flag to return the embedding of the document.
namespace
: Optional namespace to retrieve document from. If not specified, None is default.
include_type_metadata
: Indicates if doc_type value will be included in document metadata or not. If not specified, doc_type field will be dropped from document metadata.
PineconeDocumentStore.get_document_by_id
def get_document_by_id(id: str,
index: Optional[str] = None,
headers: Optional[Dict[str, str]] = None,
return_embedding: Optional[bool] = None,
namespace: Optional[str] = None) -> Document
Returns a single Document retrieved using an ID.
Arguments:
id
: ID string to retrieve.
index
: Optional index name to retrieve all documents from.
headers
: Pinecone does not support headers.
return_embedding
: Optional flag to return the embedding of the document.
namespace
: Optional namespace to retrieve document from. If not specified, None is default.
PineconeDocumentStore.update_document_meta
def update_document_meta(id: str,
meta: Dict[str, str],
index: Optional[str] = None)
Update the metadata dictionary of a document by specifying its string ID.
Arguments:
id
: ID of the Document to update.
meta
: Dictionary of new metadata.
namespace
: Optional namespace to update documents from.
index
: Optional index name to update documents from.
PineconeDocumentStore.delete_documents
def delete_documents(index: Optional[str] = None,
ids: Optional[List[str]] = None,
filters: Optional[FilterType] = None,
headers: Optional[Dict[str, str]] = None,
drop_ids: Optional[bool] = True,
namespace: Optional[str] = None,
type_metadata: Optional[DocTypeMetadata] = None)
Delete documents from the document store.
Arguments:
-
index
: Index name to delete the documents from. If None, the DocumentStore's default index name (self.index) will be used. -
ids
: Optional list of IDs to narrow down the documents to be deleted. -
filters
: Optional filters to narrow down the documents to be deleted. Filters are defined as nested dictionaries. The keys of the dictionaries can be a logical operator ("$and", "$or", "$not"), a comparison operator ("$eq", "$in", "$gt", "$gte", "$lt", "$lte") or a metadata field name. Logical operator keys take a dictionary of metadata field names and/or logical operators as value. Metadata field names take a dictionary of comparison operators as value. Comparison operator keys take a single value or (in case of "$in") a list of values as value. If no logical operator is provided, "$and" is used as default operation. If no comparison operator is provided, "$eq" (or "$in" if the comparison value is a list) is used as default operation. Example:
```python
filters = {
    "$and": {
        "type": {"$eq": "article"},
        "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
        "rating": {"$gte": 3},
        "$or": {
            "genre": {"$in": ["economy", "politics"]},
            "publisher": {"$eq": "nytimes"}
        }
    }
}
```
-
headers
: PineconeDocumentStore does not support headers. -
drop_ids
: Specifies if the locally stored IDs should be deleted. The default is True. -
namespace
: Optional namespace. If not specified, None is default. -
type_metadata
: Optional value for doc_type metadata field as reference for documents to delete.
Returns:
None
PineconeDocumentStore.delete_index
def delete_index(index: Optional[str])
Delete an existing index. The index including all data will be removed.
Arguments:
index
: The name of the index to delete.
Returns:
None
PineconeDocumentStore.query_by_embedding
def query_by_embedding(
query_emb: np.ndarray,
filters: Optional[FilterType] = None,
top_k: int = 10,
index: Optional[str] = None,
return_embedding: Optional[bool] = None,
headers: Optional[Dict[str, str]] = None,
scale_score: bool = True,
namespace: Optional[str] = None,
type_metadata: Optional[DocTypeMetadata] = None) -> List[Document]
Find the documents that are most similar to the provided query_emb by using a vector similarity metric.
Arguments:
-
query_emb
: Embedding of the query (e.g. gathered from DPR). -
filters
: Optional filters to narrow down the search space to documents whose metadata fulfill certain conditions. Filters are defined as nested dictionaries. The keys of the dictionaries can be a logical operator ("$and", "$or", "$not"), a comparison operator ("$eq", "$in", "$gt", "$gte", "$lt", "$lte") or a metadata field name. Logical operator keys take a dictionary of metadata field names and/or logical operators as value. Metadata field names take a dictionary of comparison operators as value. Comparison operator keys take a single value or (in case of "$in") a list of values as value. If no logical operator is provided, "$and" is used as default operation. If no comparison operator is provided, "$eq" (or "$in" if the comparison value is a list) is used as default operation. Example:
```python
filters = {
    "$and": {
        "type": {"$eq": "article"},
        "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
        "rating": {"$gte": 3},
        "$or": {
            "genre": {"$in": ["economy", "politics"]},
            "publisher": {"$eq": "nytimes"}
        }
    }
}
# or simpler using default operators
filters = {
    "type": "article",
    "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
    "rating": {"$gte": 3},
    "$or": {
        "genre": ["economy", "politics"],
        "publisher": "nytimes"
    }
}
```
To use the same logical operator multiple times on the same level, logical operators optionally take a list of dictionaries as value. Example:
```python
filters = {
    "$or": [
        {
            "$and": {
                "Type": "News Paper",
                "Date": {"$lt": "2019-01-01"}
            }
        },
        {
            "$and": {
                "Type": "Blog Post",
                "Date": {"$gte": "2019-01-01"}
            }
        }
    ]
}
```
-
top_k
: How many documents to return. -
index
: The name of the index from which to retrieve documents. -
return_embedding
: Whether to return document embedding. -
headers
: PineconeDocumentStore does not support headers. -
scale_score
: Whether to scale the similarity score to the unit interval (range of [0,1]). If true (default) similarity scores (e.g. cosine or dot_product) which naturally have a different value range will be scaled to a range of [0,1], where 1 means extremely relevant. Otherwise raw similarity scores (e.g. cosine or dot_product) will be used. -
namespace
: Optional namespace to query document from. If not specified, None is default. -
type_metadata
: Value of doc_type metadata that indicates which documents need to be queried.
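To make the scale_score description concrete, here is a small sketch of one way raw similarity scores could be mapped to the unit interval. The formulas are assumptions chosen for illustration (a linear shift for cosine, a logistic squashing for dot_product) and the function name `scale_to_unit_interval` is hypothetical; the source does not state which mapping the document store applies.

```python
import math


def scale_to_unit_interval(score: float, similarity: str) -> float:
    """Map a raw similarity score to [0, 1] (illustrative formulas only)."""
    if similarity == "cosine":
        # Cosine similarity lies in [-1, 1]; shift and halve it.
        return (score + 1) / 2
    # Dot products are unbounded; squash them with a logistic function.
    # The divisor 100 is an assumed temperature, not a documented value.
    return 1 / (1 + math.exp(-score / 100))


print(scale_to_unit_interval(1.0, "cosine"))       # 1.0
print(scale_to_unit_interval(0.0, "dot_product"))  # 0.5
```

Either mapping is monotonic, so scaling changes the reported score but not the ranking of the returned documents.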
PineconeDocumentStore.load
@classmethod
def load(cls)
Default class method used for loading indexes. Not applicable to PineconeDocumentStore.
PineconeDocumentStore.delete_labels
def delete_labels(index: Optional[str] = None,
ids: Optional[List[str]] = None,
filters: Optional[FilterType] = None,
headers: Optional[Dict[str, str]] = None,
batch_size: int = DEFAULT_BATCH_SIZE,
namespace: Optional[str] = None)
Default class method used for deleting labels. Not supported by PineconeDocumentStore.
PineconeDocumentStore.get_all_labels
def get_all_labels(index=None,
filters: Optional[FilterType] = None,
headers: Optional[Dict[str, str]] = None,
namespace: Optional[str] = None)
Default class method used for getting all labels.
PineconeDocumentStore.get_label_count
def get_label_count(index: Optional[str] = None,
headers: Optional[Dict[str, str]] = None)
Default class method used for counting labels. Not supported by PineconeDocumentStore.
PineconeDocumentStore.write_labels
def write_labels(labels,
index=None,
headers: Optional[Dict[str, str]] = None,
namespace: Optional[str] = None)
Default class method used for writing labels.
PineconeDocumentStore.get_all_labels_aggregated
def get_all_labels_aggregated(
index: Optional[str] = None,
filters: Optional[FilterType] = None,
open_domain: bool = True,
drop_negative_labels: bool = False,
drop_no_answers: bool = False,
aggregate_by_meta: Optional[Union[str, list]] = None,
headers: Optional[Dict[str, str]] = None) -> List[MultiLabel]
Return all labels in the DocumentStore, aggregated into MultiLabel objects.
This aggregation step helps, for example, if you collected multiple possible answers for one question and now want all answers bundled together in one place for evaluation. How they are aggregated is defined by the open_domain and aggregate_by_meta parameters.
If the questions are being asked of a single document (i.e. SQuAD style), you should set open_domain=False to aggregate by question and document.
If the questions are being asked of your full collection of documents, you should set open_domain=True to aggregate just by question.
If the questions are being asked of a subslice of your document set (e.g. product review use cases), you should set open_domain=True and populate aggregate_by_meta with the names of Label meta fields to aggregate by question and your custom meta fields. For example, in a product review use case, you might set aggregate_by_meta=["product_id"] so that Labels with the same question but different answers from different documents are aggregated into one MultiLabel object, provided that they have the same product_id (to be found in Label.meta["product_id"]).
Arguments:
-
index
: Name of the index to get the labels from. If None, the DocumentStore's default index (self.index) will be used. -
filters
: Optional filters to narrow down the search space to documents whose metadata fulfill certain conditions. Filters are defined as nested dictionaries. The keys of the dictionaries can be a logical operator ("$and", "$or", "$not"), a comparison operator ("$eq", "$in", "$gt", "$gte", "$lt", "$lte") or a metadata field name. Logical operator keys take a dictionary of metadata field names and/or logical operators as value. Metadata field names take a dictionary of comparison operators as value. Comparison operator keys take a single value or (in case of "$in") a list of values as value. If no logical operator is provided, "$and" is used as default operation. If no comparison operator is provided, "$eq" (or "$in" if the comparison value is a list) is used as default operation. Example:
```python
filters = {
    "$and": {
        "type": {"$eq": "article"},
        "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
        "rating": {"$gte": 3},
        "$or": {
            "genre": {"$in": ["economy", "politics"]},
            "publisher": {"$eq": "nytimes"}
        }
    }
}
```
-
open_domain
: When True, labels are aggregated purely based on the question text alone. When False, labels are aggregated in a closed domain fashion based on the question text and also the id of the document that the label is tied to. In this setting, this function might return multiple MultiLabel objects with the same question string. -
headers
: Custom HTTP headers to pass to document store client if supported (e.g. {'Authorization': 'Basic YWRtaW46cm9vdA=='} for basic authentication) -
aggregate_by_meta
: The names of the Label meta fields by which to aggregate. For example: ["product_id"] -
drop_negative_labels
: When True, labels with incorrect answers and documents are dropped. -
drop_no_answers
: When True, labels with no answers are dropped.
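The grouping controlled by open_domain and aggregate_by_meta can be sketched with plain dictionaries. This is an illustration rather than the library's code: labels are represented as dicts with hypothetical keys (`question`, `answer`, `document_id`, `meta`), and `aggregate_labels` returns lists of answers instead of MultiLabel objects.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple, Union


def aggregate_labels(
    labels: List[dict],
    open_domain: bool = True,
    aggregate_by_meta: Optional[Union[str, List[str]]] = None,
) -> Dict[Tuple, List[str]]:
    """Group answers by question, optionally also by document and meta fields."""
    if isinstance(aggregate_by_meta, str):
        meta_fields = [aggregate_by_meta]
    else:
        meta_fields = list(aggregate_by_meta or [])
    groups: Dict[Tuple, List[str]] = defaultdict(list)
    for label in labels:
        key = [label["question"]]
        if not open_domain:
            # Closed domain: the document the label is tied to is part of the key.
            key.append(label["document_id"])
        key.extend(label.get("meta", {}).get(field) for field in meta_fields)
        groups[tuple(key)].append(label["answer"])
    return dict(groups)


labels = [
    {"question": "Is it loud?", "answer": "no", "document_id": "d1", "meta": {"product_id": "p1"}},
    {"question": "Is it loud?", "answer": "barely", "document_id": "d2", "meta": {"product_id": "p1"}},
    {"question": "Is it loud?", "answer": "yes", "document_id": "d3", "meta": {"product_id": "p2"}},
]
print(len(aggregate_labels(labels)))                                   # 1
print(len(aggregate_labels(labels, open_domain=False)))                # 3
print(len(aggregate_labels(labels, aggregate_by_meta="product_id")))   # 2
```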
PineconeDocumentStore.normalize_embedding
@staticmethod
def normalize_embedding(emb: np.ndarray) -> None
Performs L2 normalization of embeddings vector inplace. Input can be a single vector (1D array) or a matrix (2D array).
PineconeDocumentStore.add_eval_data
def add_eval_data(filename: str,
doc_index: str = "eval_document",
label_index: str = "label",
batch_size: Optional[int] = None,
preprocessor: Optional[PreProcessor] = None,
max_docs: Optional[Union[int, bool]] = None,
open_domain: bool = False,
headers: Optional[Dict[str, str]] = None)
Adds a SQuAD-formatted file to the DocumentStore in order to be able to perform evaluation on it.
If a jsonl file and a batch_size is passed to the function, documents are loaded batchwise from disk and also indexed batchwise to the DocumentStore in order to prevent out of memory errors.
Arguments: