Query Classifier
The Query Classifiers in Haystack distinguish between three classes of queries: keywords, questions, and statements. Based on this classification, a Query Classifier can route the query to a specific branch of the Pipeline. By passing queries on to Nodes that are better suited to handle them, you get better search results.
For example, the Dense Passage Retriever is trained on full questions, so it works best if you pass it only questions. By also routing keyword queries to a BM25 Retriever, such as the ElasticsearchRetriever, you can reduce the load on the GPU-powered Dense Passage Retriever.
| | |
| --- | --- |
| Position in a Pipeline | At the beginning of a query Pipeline |
| Input | Query |
| Output | Query |
| Classes | TransformersQueryClassifier, SklearnQueryClassifier |
The Query Classifier populates the metadata fields of the Query with its classification and can also route the Query based on it.
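As a toy sketch of this routing convention (plain Python, not the actual Haystack API): a decision node returns its payload together with the name of the outgoing edge, and the pipeline then runs whichever branch is connected to that edge. The heuristic below is purely illustrative; the real classifiers use a trained model.

```python
# Toy illustration of the (payload, edge_name) convention used by decision nodes.
def naive_query_classifier(query):
    """Route question-like queries to output_1, keyword queries to output_2."""
    question_words = ("who", "what", "when", "where", "which", "why", "how")
    first_word = query.lower().split()[0]
    is_question = query.rstrip().endswith("?") or first_word in question_words
    edge = "output_1" if is_question else "output_2"
    return {"query": query}, edge

print(naive_query_classifier("who is the father of arya stark?"))
# ({'query': 'who is the father of arya stark?'}, 'output_1')
print(naive_query_classifier("arya stark father"))
# ({'query': 'arya stark father'}, 'output_2')
```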
Query Types
Keyword Queries
Such queries don't have sentence structure. They consist of keywords and the order of words does not matter:
- arya stark father
- jon snow country
- arya stark younger brothers
Questions
In such queries users ask a question in a complete, grammatical sentence. A Query Classifier should be able to classify a query regardless of whether it ends with a question mark or not.
- who is the father of arya stark?
- which country was jon snow filmed in
- who are the younger brothers of arya stark?
Statements
This type of query is a declarative sentence, such as:
- Arya Stark was a daughter of a lord.
- Show countries that Jon Snow was filmed in.
- List all brothers of Arya.
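To make the three classes concrete, here is a naive rule-of-thumb classifier (hand-written heuristics, not one of the trained models described below) that happens to label the example queries from this section correctly:

```python
def classify_query_type(query):
    """Very rough heuristic; the real Query Classifiers use a trained model."""
    q = query.strip()
    first_word = q.lower().split()[0]
    wh_words = {"who", "what", "when", "where", "which", "why", "how"}
    if q.endswith("?") or first_word in wh_words:
        return "question"   # full interrogative sentence
    if q.endswith(".") or first_word in {"show", "list"}:
        return "statement"  # declarative or imperative sentence
    return "keywords"       # bag of search terms, no sentence structure

print(classify_query_type("arya stark father"))                     # keywords
print(classify_query_type("which country was jon snow filmed in"))  # question
print(classify_query_type("List all brothers of Arya."))            # statement
```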
Usage
To use the Query Classifier as a stand-alone Node:
```python
from haystack.nodes import TransformersQueryClassifier

queries = ["Arya Stark father", "Jon Snow UK",
           "who is the father of arya stark?", "Which country was jon snow filmed in?"]

question_classifier = TransformersQueryClassifier(model_name_or_path="shahrukhx01/bert-mini-finetune-question-detection")
# Or Sklearn based: SklearnQueryClassifier

for query in queries:
    result = question_classifier.run(query=query)
    if result[1] == "output_1":
        category = "question"
    else:
        category = "keywords"
    print(f"Query: {query}, raw_output: {result}, class: {category}")

# Returns:
# Query: Arya Stark father, raw_output: ({'query': 'Arya Stark father'}, 'output_2'), class: keywords
# Query: Jon Snow UK, raw_output: ({'query': 'Jon Snow UK'}, 'output_2'), class: keywords
# Query: who is the father of arya stark?, raw_output: ({'query': 'who is the father of arya stark?'}, 'output_1'), class: question
# Query: Which country was jon snow filmed in?, raw_output: ({'query': 'Which country was jon snow filmed in?'}, 'output_1'), class: question
```
Note how the Node returns two objects: the query (e.g. 'Arya Stark father') and the name of the output edge (e.g. "output_2"). This information can be leveraged in a Pipeline for routing the query to the next Node.
You can use a Query Classifier within a Pipeline as a decision node. Depending on the output of the classifier, only one branch of the Pipeline is executed. For example, you can route keyword queries to an ElasticsearchRetriever and questions and statements to DPR.
Below, we define a Pipeline with a TransformersQueryClassifier that routes questions and statements to the Node's `output_1` and keyword queries to `output_2`. We leverage this structure in the Pipeline by connecting the DPRRetriever to `QueryClassifier.output_1` and the ESRetriever to `QueryClassifier.output_2`.
```python
from haystack import Pipeline
from haystack.nodes import TransformersQueryClassifier
from haystack.utils import print_answers

query_classifier = TransformersQueryClassifier(model_name_or_path="shahrukhx01/bert-mini-finetune-question-detection")

# dpr_retriever and es_retriever are assumed to be already initialized
pipe = Pipeline()
pipe.add_node(component=query_classifier, name="QueryClassifier", inputs=["Query"])
pipe.add_node(component=dpr_retriever, name="DPRRetriever", inputs=["QueryClassifier.output_1"])
pipe.add_node(component=es_retriever, name="ESRetriever", inputs=["QueryClassifier.output_2"])

# Pass a question -> run DPR
res_1 = pipe.run(query="Who is the father of Arya Stark?")

# Pass keywords -> run the ElasticsearchRetriever
res_2 = pipe.run(query="arya stark father")
```
An alternative setup is to route questions to a Question Answering branch and keywords to a Document Search branch:
```python
from haystack import Pipeline
from haystack.nodes import TransformersQueryClassifier
from haystack.utils import print_answers

query_classifier = TransformersQueryClassifier(model_name_or_path="shahrukhx01/question-vs-statement-classifier")

# dpr_retriever, es_retriever, and reader are assumed to be already initialized
pipe = Pipeline()
pipe.add_node(component=query_classifier, name="QueryClassifier", inputs=["Query"])
pipe.add_node(component=dpr_retriever, name="DPRRetriever", inputs=["QueryClassifier.output_1"])
pipe.add_node(component=es_retriever, name="ESRetriever", inputs=["QueryClassifier.output_2"])
pipe.add_node(component=reader, name="QAReader", inputs=["DPRRetriever"])

# Pass a question -> run DPR + QA -> return answers
res_1 = pipe.run(query="Who is the father of Arya Stark?")

# Pass keywords -> run only ElasticsearchRetriever -> return docs
res_2 = pipe.run(query="arya stark father")
```
Models
The TransformersQueryClassifier is more accurate than the SklearnQueryClassifier as it is sensitive to the syntax of a sentence. However, it requires more memory and a GPU in order to run quickly. You can mitigate these downsides by choosing a smaller transformer model. The default models that we trained use a mini BERT architecture which is about 50 MB in size and allows relatively fast inference on CPU.
Transformers
Pass your own Transformers binary classification model from file or use one of the following pretrained models hosted on Hugging Face:
Keywords vs. Questions/Statements (Default)
```python
TransformersQueryClassifier(model_name_or_path="shahrukhx01/bert-mini-finetune-question-detection")
# output_1 => question/statement
# output_2 => keyword query
```
Learn more about this model from its model card.
Questions vs. Statements
```python
TransformersQueryClassifier(model_name_or_path="shahrukhx01/question-vs-statement-classifier")
# output_1 => question
# output_2 => statement
```
Learn more about this model from its model card.
Sklearn
Pass your own Sklearn binary classification model or use one of the following pretrained gradient boosting models:
Keywords vs. Questions/Statements (Default)
```python
SklearnQueryClassifier(
    query_classifier="https://ext-models-haystack.s3.eu-central-1.amazonaws.com/gradboost_query_classifier/model.pickle",
    query_vectorizer="https://ext-models-haystack.s3.eu-central-1.amazonaws.com/gradboost_query_classifier/vectorizer.pickle")
# output_1 => question/statement
# output_2 => keyword query
```
Learn more about this model from its readme.
Questions vs. Statements
```python
SklearnQueryClassifier(
    query_classifier="https://ext-models-haystack.s3.eu-central-1.amazonaws.com/gradboost_query_classifier_statements/model.pickle",
    query_vectorizer="https://ext-models-haystack.s3.eu-central-1.amazonaws.com/gradboost_query_classifier_statements/vectorizer.pickle")
# output_1 => question
# output_2 => statement
```
Learn more about this model from its readme.