Metadata Filtering
You can attach metadata to the documents you index into your DocumentStore. At query time, you can apply filters based on this metadata to limit the scope of your search and ensure your answers come from a specific slice of your data. This guide explains how to do it.
For example, if you have a set of annual reports from various companies, you may want to search only a specific year or a small selection of companies. This can reduce the workload of the Retriever and also ensure that you get more relevant results.
Basic Filters
Filters are applied via the filters argument of the Retriever class. When working with a pipeline, the filter is supplied to Pipeline.run(), which then routes it on to the Retriever class (see Arguments for an explanation). Basic filtering is supported by the ElasticsearchDocumentStore, OpenSearchDocumentStore, and WeaviateDocumentStore.
You can supply filters in the form of a dictionary where the keys are Document metadata fields and the values are lists of accepted values. In the example below, the filter ensures that any returned Document has a value of 2019 in the years metadata field and either BMW or Mercedes in the companies metadata field.
pipeline.run(
    query="Why did the revenue increase?",
    params={
        "filters": {
            "years": ["2019"],
            "companies": ["BMW", "Mercedes"]
        }
    },
)
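To make the semantics of a basic filter concrete, here is a minimal plain-Python sketch: every field in the filter must match, and within a field any of the listed values is accepted. The helper matches_basic_filter is hypothetical and not part of Haystack; in practice the DocumentStore does the filtering.

```python
def matches_basic_filter(meta: dict, filters: dict) -> bool:
    """Return True if `meta` satisfies every field in `filters`.

    Hypothetical helper for illustration only, not a Haystack API.
    """
    # Every field must match (AND); within a field, any listed value is accepted (OR).
    return all(meta.get(field) in accepted for field, accepted in filters.items())


filters = {"years": ["2019"], "companies": ["BMW", "Mercedes"]}

print(matches_basic_filter({"years": "2019", "companies": "BMW"}, filters))  # True
print(matches_basic_filter({"years": "2018", "companies": "BMW"}, filters))  # False
```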
Filtering Logic
Technically speaking, filters are defined as nested dictionaries. The keys of the dictionaries can be a logical operator ("$and", "$or", "$not"), a comparison operator ("$eq", "$in", "$gt", "$gte", "$lt", "$lte"), or a metadata field name.
Logical operator keys take a dictionary of metadata field names or logical operators as value. Metadata field names take a dictionary of comparison operators as value. Comparison operator keys take a single value or (in the case of "$in") a list of values as value.
If no logical operator is provided, "$and" is the default operator. If no comparison operator is provided, "$eq" (or "$in" if the comparison value is a list) is the default operator.
filters = {
    "$and": {
        "type": {"$eq": "article"},
        "date": {"$gte": "2015-01-01", "$lt": "2021-01-01"},
        "rating": {"$gte": 3},
        "$or": {
            "genre": {"$in": ["economy", "politics"]},
            "publisher": {"$eq": "nytimes"}
        }
    }
}
# You can also use default operators. This expression then looks like the one below.
# To filter by dates using the API endpoints, you must use explicit operators.
# So for the example above to work with default operators, you must delete the date filter.
filters = {
    "type": "article",
    "rating": {"$gte": 3},
    "$or": {
        "genre": ["economy", "politics"],
        "publisher": "nytimes"
    }
}
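The default-operator rules can be illustrated with a small evaluator that checks a metadata dictionary against a nested filter. This is only a sketch of the logic described above, not Haystack's implementation: the names matches and _matches_field are hypothetical, "$not" is left out for brevity, and a missing metadata field is assumed to never match.

```python
import operator

# Comparison operators from the text, mapped to plain Python functions.
COMPARISONS = {
    "$eq": operator.eq,
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
    "$in": lambda value, accepted: value in accepted,
}


def _matches_field(value, condition):
    # Default operators: a bare list means "$in", any other bare value means "$eq".
    if isinstance(condition, list):
        condition = {"$in": condition}
    elif not isinstance(condition, dict):
        condition = {"$eq": condition}
    if value is None:  # assumption: a missing metadata field never matches
        return False
    return all(COMPARISONS[op](value, operand) for op, operand in condition.items())


def matches(meta, filters, logic="$and"):
    """Hypothetical evaluator: does `meta` satisfy the nested `filters`?"""
    if isinstance(filters, list):
        # List form: each entry is its own clause with the default "$and".
        results = [matches(meta, clause) for clause in filters]
    else:
        results = []
        for key, value in filters.items():
            if key in ("$and", "$or"):
                results.append(matches(meta, value, logic=key))
            elif key.startswith("$"):
                raise ValueError(f"operator not covered by this sketch: {key}")
            else:
                results.append(_matches_field(meta.get(key), value))
    # "$and" is the default when no logical operator is given.
    return any(results) if logic == "$or" else all(results)


explicit = {"$and": {"type": {"$eq": "article"}, "rating": {"$gte": 3},
                     "$or": {"genre": {"$in": ["economy", "politics"]},
                             "publisher": {"$eq": "nytimes"}}}}
default = {"type": "article", "rating": {"$gte": 3},
           "$or": {"genre": ["economy", "politics"], "publisher": "nytimes"}}

meta = {"type": "article", "rating": 4, "genre": "economy", "publisher": "guardian"}
print(matches(meta, explicit), matches(meta, default))  # True True
```

Both spellings give the same result for the same metadata, which is the point of the default operators.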
Filtering by Dates
To filter by dates using the API endpoints, you must use explicit operators.
Logical Operators on the Same Level
It is not possible to use logical operators twice on the same level since dictionary keys have to be unique. For example, the following filter is not valid:
{
    "$or": {
        "$and": {
            "Type": "News Paper",
            "Date": {"$lt": "2019-01-01"},
        },
        "$and": {  # repeated key in dictionary
            "Type": "Blog post",
            "Date": {"$gte": "2019-01-01"}
        }
    }
}
To get around this, we allow logical operators to take a list of dictionaries as values. This is what the above filter would look like in this style.
{
    "$or": [
        {
            "$and": {
                "Type": "News Paper",
                "Date": {"$lt": "2019-01-01"}
            }
        },
        {
            "$and": {
                "Type": "Blog post",
                "Date": {"$gte": "2019-01-01"}
            }
        }
    ]
}
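The reason the dictionary form fails is a property of Python itself: a dictionary literal with a repeated key raises no error and silently keeps only the last value. The short check below shows the difference between the two styles; the variable names are illustrative only.

```python
# Dictionary form: the second "$and" silently overwrites the first.
dict_style = {
    "$or": {
        "$and": {"Type": "News Paper", "Date": {"$lt": "2019-01-01"}},
        "$and": {"Type": "Blog post", "Date": {"$gte": "2019-01-01"}},
    }
}

# List form: both "$and" clauses are preserved.
list_style = {
    "$or": [
        {"$and": {"Type": "News Paper", "Date": {"$lt": "2019-01-01"}}},
        {"$and": {"Type": "Blog post", "Date": {"$gte": "2019-01-01"}}},
    ]
}

print(len(dict_style["$or"]))              # 1
print(dict_style["$or"]["$and"]["Type"])   # Blog post
print(len(list_style["$or"]))              # 2
```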
ElasticsearchDocumentStore and OpenSearchDocumentStore support filtering with logical operators. In ElasticsearchDocumentStore, you must populate your documents with dates in ISO format (the format Elasticsearch uses by default). Otherwise, you may get parsing errors. You can also modify the Elasticsearch configuration to support your own date format.
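If your source data stores dates in another format, one option is to convert them to ISO strings before indexing. The sketch below uses only the Python standard library; the helper to_iso_date, the input format, and the Date field name are assumptions made for illustration.

```python
from datetime import datetime


def to_iso_date(raw: str, fmt: str = "%d.%m.%Y") -> str:
    """Parse `raw` using `fmt` and return an ISO 8601 date string (YYYY-MM-DD)."""
    return datetime.strptime(raw, fmt).date().isoformat()


# Hypothetical metadata for a document about to be indexed.
meta = {"Type": "Blog post", "Date": to_iso_date("31.01.2019")}
print(meta["Date"])  # 2019-01-31
```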