Distinguishes between keyword, question, and statement queries.
Module base
BaseQueryClassifier
class BaseQueryClassifier(BaseComponent)
Abstract class for Query Classifiers
Module sklearn
SklearnQueryClassifier
class SklearnQueryClassifier(BaseQueryClassifier)
A node to classify an incoming query into one of two categories using a lightweight sklearn model. Depending on the result, the query flows to a different branch in your pipeline, and the further processing can be customized. You can define this by connecting the further pipeline to either output_1 or output_2 from this node.
Example:
from haystack import Pipeline
from haystack.nodes import SklearnQueryClassifier

# bm25_retriever and dpr_retriever are assumed to be retrievers you initialized earlier
pipe = Pipeline()
pipe.add_node(component=SklearnQueryClassifier(), name="QueryClassifier", inputs=["Query"])
pipe.add_node(component=bm25_retriever, name="BM25Retriever", inputs=["QueryClassifier.output_2"])
pipe.add_node(component=dpr_retriever, name="DPRRetriever", inputs=["QueryClassifier.output_1"])
# Keyword queries will use the BM25Retriever
pipe.run(query="kubernetes aws")
# Semantic queries (questions, statements, sentences ...) will leverage the DPR retriever
pipe.run(query="How to manage kubernetes on aws")
Models:
Pass your own Sklearn binary classification model or use one of the following pretrained ones:
- Keywords vs. Questions/Statements (Default)
  query_classifier can be found here
  query_vectorizer can be found here
  output_1 => question/statement
  output_2 => keyword query
  Readme
- Questions vs. Statements
  query_classifier can be found here
  query_vectorizer can be found here
  output_1 => question
  output_2 => statement
  Readme
See also the tutorial on pipelines.
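The two-way routing described above can be illustrated with a minimal, dependency-free sketch. It is not the node's implementation: `route_query`, `toy_is_question_or_statement`, and the returned tuple shape are hypothetical names and conventions chosen for illustration, and the word-count heuristic merely stands in for the trained sklearn model.

```python
from typing import Callable, Dict, Tuple

# Hypothetical stand-in for the trained classifier: treats a query as a
# question/statement if it is longer than three words or starts with a question word.
QUESTION_WORDS = {"how", "what", "why", "when", "where", "who", "which"}


def toy_is_question_or_statement(query: str) -> bool:
    words = query.lower().split()
    return len(words) > 3 or (len(words) > 0 and words[0] in QUESTION_WORDS)


def route_query(query: str, predict: Callable[[str], bool]) -> Tuple[Dict, str]:
    """Map a binary prediction onto the edge names used in this section.

    output_1 => question/statement, output_2 => keyword query.
    The (payload, edge_name) tuple is an assumption made for this sketch.
    """
    if predict(query):
        return {}, "output_1"
    return {}, "output_2"


keyword_edge = route_query("kubernetes aws", toy_is_question_or_statement)[1]
question_edge = route_query("How to manage kubernetes on aws", toy_is_question_or_statement)[1]
```

With this convention, whatever is connected to QueryClassifier.output_2 receives the keyword query, and whatever is connected to QueryClassifier.output_1 receives the question.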
SklearnQueryClassifier.__init__
def __init__(
    model_name_or_path: Union[str, Any] = "https://ext-models-haystack.s3.eu-central-1.amazonaws.com/gradboost_query_classifier_2022/model.pickle",
    vectorizer_name_or_path: Union[str, Any] = "https://ext-models-haystack.s3.eu-central-1.amazonaws.com/gradboost_query_classifier_2022/vectorizer.pickle",
    batch_size: Optional[int] = None,
    progress_bar: bool = True)
Arguments:
model_name_or_path
: Gradient-boosting-based binary classifier that distinguishes either keyword queries from statement/question queries, or statement queries from question queries.
vectorizer_name_or_path
: An n-gram-based TF-IDF vectorizer for extracting features from the query.
batch_size
: Number of queries to process at a time.
progress_bar
: Whether to show a progress bar.
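As a rough illustration of what batch_size controls, the sketch below splits a list of queries into chunks of at most batch_size items. The helper name `iter_batches` is hypothetical, and treating batch_size=None as "process everything in one batch" is an assumption of this sketch, not documented behavior of the node.

```python
from typing import Iterator, List, Optional


def iter_batches(queries: List[str], batch_size: Optional[int] = None) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most batch_size queries."""
    if batch_size is None:
        # Assumption for this sketch: no batch size means a single batch.
        yield list(queries)
        return
    if batch_size <= 0:
        raise ValueError("batch_size must be a positive integer or None")
    for start in range(0, len(queries), batch_size):
        yield queries[start:start + batch_size]


queries = ["kubernetes aws", "How to manage kubernetes on aws", "bm25 vs dpr"]
batched = list(iter_batches(queries, batch_size=2))
unbatched = list(iter_batches(queries))
```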
Module transformers
TransformersQueryClassifier
class TransformersQueryClassifier(BaseQueryClassifier)
A node to classify an incoming query into categories using a transformer model.
Depending on the result, the query flows to a different branch in your pipeline, and the further processing can be customized. You can define this by connecting the further pipeline to output_1, output_2, ..., output_n from this node.
This node also supports zero-shot-classification.
Example:
from haystack import Pipeline
from haystack.nodes import TransformersQueryClassifier

# bm25_retriever and dpr_retriever are assumed to be retrievers you initialized earlier
pipe = Pipeline()
pipe.add_node(component=TransformersQueryClassifier(), name="QueryClassifier", inputs=["Query"])
pipe.add_node(component=bm25_retriever, name="BM25Retriever", inputs=["QueryClassifier.output_2"])
pipe.add_node(component=dpr_retriever, name="DPRRetriever", inputs=["QueryClassifier.output_1"])
# Keyword queries will use the BM25Retriever
pipe.run(query="kubernetes aws")
# Semantic queries (questions, statements, sentences ...) will leverage the DPR retriever
pipe.run(query="How to manage kubernetes on aws")
Models:
Pass your own Transformer classification/zero-shot-classification model from file/Hugging Face or use one of the following pretrained ones hosted on Hugging Face:
- Keywords vs. Questions/Statements (Default)
  model_name_or_path="shahrukhx01/bert-mini-finetune-question-detection"
  output_1 => question/statement
  output_2 => keyword query
  Readme
- Questions vs. Statements
  model_name_or_path="shahrukhx01/question-vs-statement-classifier"
  output_1 => question
  output_2 => statement
  Readme
See also the tutorial on pipelines.
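For the zero-shot case, routing to output_1 ... output_n can be pictured as picking the best-scoring candidate label and using its position in the candidate list as the edge number. The sketch below is a simplified illustration under that assumption; `zero_shot_edge` is a hypothetical helper and the scores are made-up numbers standing in for what a zero-shot model would produce.

```python
from typing import Dict, Sequence


def zero_shot_edge(candidate_labels: Sequence[str], scores: Dict[str, float]) -> str:
    """Return the edge name for the highest-scoring candidate label.

    The n-th candidate label corresponds to output_n (1-based).
    """
    best_label = max(candidate_labels, key=lambda label: scores[label])
    return f"output_{list(candidate_labels).index(best_label) + 1}"


candidate_labels = ["music", "cinema", "sports"]
# Made-up scores for illustration only.
example_scores = {"music": 0.1, "cinema": 0.7, "sports": 0.2}
edge = zero_shot_edge(candidate_labels, example_scores)
```

Here "cinema" scores highest and is the second candidate, so a node connected to output_2 would receive the query.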
TransformersQueryClassifier.__init__
def __init__(
    model_name_or_path: Union[Path, str] = "shahrukhx01/bert-mini-finetune-question-detection",
    model_version: Optional[str] = None,
    tokenizer: Optional[str] = None,
    use_gpu: bool = True,
    task: str = "text-classification",
    labels: Optional[List[str]] = None,
    batch_size: int = 16,
    progress_bar: bool = True,
    use_auth_token: Optional[Union[str, bool]] = None,
    devices: Optional[List[Union[str, torch.device]]] = None)
Arguments:
model_name_or_path
: Directory of a saved model or the name of a public model, for example 'shahrukhx01/bert-mini-finetune-question-detection'. See Hugging Face models for a full list of available models.
model_version
: The version of the model to use from the Hugging Face model hub. This can be a tag name, a branch name, or a commit hash.
tokenizer
: The name of the tokenizer (usually the same as the model).
use_gpu
: Whether to use a GPU (if available).
task
: Specifies the type of classification. Possible values: 'text-classification' or 'zero-shot-classification'.
labels
: If the task is 'text-classification' and an ordered list of labels is provided, the first label corresponds to output_1, the second label to output_2, and so on. The labels must match the model labels; only the order can differ. If the task is 'zero-shot-classification', these are the candidate labels.
batch_size
: The number of queries to be processed at a time.
progress_bar
: Whether to show a progress bar.
use_auth_token
: The API token used to download private models from Hugging Face. If this parameter is set to True, the token generated when running transformers-cli login (stored in ~/.huggingface) is used. For additional information, see https://huggingface.co/transformers/main_classes/model.html#transformers.PreTrainedModel.from_pretrained
devices
: List of torch devices (for example cuda, cpu, mps) to limit inference to specific devices. The list can contain torch device objects, strings, or both (for example [torch.device('cuda:0'), "mps", "cuda:1"]). If use_gpu=False, the devices parameter is ignored and a single CPU device is used for inference.
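The rule for labels under 'text-classification' (same labels as the model, order decides the output edge) can be sketched as follows. `build_label_routes` is a hypothetical helper written for this illustration, and falling back to the model's own label order when labels is None is an assumption of the sketch.

```python
from typing import Dict, List, Optional


def build_label_routes(model_labels: List[str], labels: Optional[List[str]] = None) -> Dict[str, str]:
    """Map each label to an edge name: first label -> output_1, second -> output_2, ..."""
    # Assumption for this sketch: without user labels, keep the model's order.
    ordered = list(labels) if labels is not None else list(model_labels)
    # The user-provided labels must be exactly the model labels; only the order may differ.
    if len(ordered) != len(set(ordered)) or set(ordered) != set(model_labels):
        raise ValueError(f"labels {ordered} must match the model labels {model_labels}")
    return {label: f"output_{index}" for index, label in enumerate(ordered, start=1)}


# Reordering swaps which branch each predicted label is routed to.
routes = build_label_routes(["LABEL_0", "LABEL_1"], labels=["LABEL_1", "LABEL_0"])
```

With this mapping, a query predicted as LABEL_1 would go to output_1 and a query predicted as LABEL_0 to output_2.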