
Distinguishes between keyword, question, and statement queries.

Module sklearn

SklearnQueryClassifier

class SklearnQueryClassifier(BaseQueryClassifier)

This component is now deprecated and will be removed in future versions. Use TransformersQueryClassifier instead of SklearnQueryClassifier.

A node to classify an incoming query into one of two categories using a lightweight sklearn model. Depending on the result, the query flows to a different branch in your pipeline, where further processing can be customized. You define the branches by connecting the downstream pipeline components to either output_1 or output_2 of this node.

Example:

from haystack import Pipeline
from haystack.nodes import SklearnQueryClassifier

# bm25_retriever and dpr_retriever are assumed to be initialized Retriever nodes
pipe = Pipeline()
pipe.add_node(component=SklearnQueryClassifier(), name="QueryClassifier", inputs=["Query"])
pipe.add_node(component=bm25_retriever, name="BM25Retriever", inputs=["QueryClassifier.output_2"])
pipe.add_node(component=dpr_retriever, name="DPRRetriever", inputs=["QueryClassifier.output_1"])

# Keyword queries will use the BM25Retriever
pipe.run("kubernetes aws")

# Semantic queries (questions, statements, sentences ...) will leverage the DPR retriever
pipe.run("How to manage kubernetes on aws")

Models:

Pass your own sklearn binary classification model (a sketch follows the list below) or use one of the following pretrained ones:

  1. Keywords vs. Questions/Statements (default)
     query_classifier: https://ext-models-haystack.s3.eu-central-1.amazonaws.com/gradboost_query_classifier_2022/model.pickle
     query_vectorizer: https://ext-models-haystack.s3.eu-central-1.amazonaws.com/gradboost_query_classifier_2022/vectorizer.pickle
     output_1 => question/statement
     output_2 => keyword query

  2. Questions vs. Statements
     query_classifier and query_vectorizer: see the model readme
     output_1 => question
     output_2 => statement
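
To use your own model, point the node at your classifier and vectorizer. A minimal sketch, assuming hypothetical local pickle paths in place of your own trained sklearn classifier and TF-IDF vectorizer:

# The paths below are hypothetical; replace them with your own
# pickled sklearn classifier and vectorizer.
classifier = SklearnQueryClassifier(
    model_name_or_path="my_models/query_classifier.pickle",
    vectorizer_name_or_path="my_models/query_vectorizer.pickle",
)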

See also the tutorial on pipelines.

SklearnQueryClassifier.__init__

def __init__(
        model_name_or_path: Union[str, Any] = "https://ext-models-haystack.s3.eu-central-1.amazonaws.com/gradboost_query_classifier_2022/model.pickle",
        vectorizer_name_or_path: Union[str, Any] = "https://ext-models-haystack.s3.eu-central-1.amazonaws.com/gradboost_query_classifier_2022/vectorizer.pickle",
        batch_size: Optional[int] = None,
        progress_bar: bool = True)

Arguments:

  • model_name_or_path: A gradient-boosting-based binary classifier that distinguishes keyword queries from statement/question queries, or statements from questions.
  • vectorizer_name_or_path: An n-gram-based TF-IDF vectorizer for extracting features from the query.
  • batch_size: Number of queries to process at a time.
  • progress_bar: Whether to show a progress bar.
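
Outside a pipeline, you can also invoke the node directly. A minimal sketch, assuming the Haystack 1.x node API in which run returns a tuple of results and the name of the output edge:

classifier = SklearnQueryClassifier()  # downloads the default model and vectorizer

# The edge name tells you which branch the query would take in a pipeline.
_, edge = classifier.run(query="How to manage kubernetes on aws")
print(edge)  # expected: "output_1" (question/statement)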

Module transformers

TransformersQueryClassifier

class TransformersQueryClassifier(BaseQueryClassifier)

A node to classify an incoming query into categories using a transformer model. Depending on the result, the query flows to a different branch in your pipeline, where further processing can be customized. You define the branches by connecting the downstream pipeline components to output_1, output_2, ..., output_n of this node. This node also supports zero-shot classification.

Example:

from haystack import Pipeline
from haystack.nodes import TransformersQueryClassifier

# bm25_retriever and dpr_retriever are assumed to be initialized Retriever nodes
pipe = Pipeline()
pipe.add_node(component=TransformersQueryClassifier(), name="QueryClassifier", inputs=["Query"])
pipe.add_node(component=bm25_retriever, name="BM25Retriever", inputs=["QueryClassifier.output_2"])
pipe.add_node(component=dpr_retriever, name="DPRRetriever", inputs=["QueryClassifier.output_1"])

# Keyword queries will use the BM25Retriever
pipe.run("kubernetes aws")

# Semantic queries (questions, statements, sentences ...) will leverage the DPR retriever
pipe.run("How to manage kubernetes on aws")
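
For zero-shot classification, you pass candidate labels and the node creates one output per label: the first label maps to output_1, the second to output_2, and so on. A minimal sketch, assuming an NLI model such as typeform/distilbert-base-uncased-mnli from the Hugging Face Hub:

classifier = TransformersQueryClassifier(
    model_name_or_path="typeform/distilbert-base-uncased-mnli",
    task="zero-shot-classification",
    labels=["music", "cinema", "food"],  # music => output_1, cinema => output_2, food => output_3
)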

Models:

Pass your own transformer classification or zero-shot-classification model, from a local file or from Hugging Face, or use one of the following pretrained models hosted on Hugging Face (a selection sketch follows the list):

  1. Keywords vs. Questions/Statements (default)
     model_name_or_path="shahrukhx01/bert-mini-finetune-question-detection"
     output_1 => question/statement
     output_2 => keyword query

  2. Questions vs. Statements
     model_name_or_path="shahrukhx01/question-vs-statement-classifier"
     output_1 => question
     output_2 => statement
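
For instance, to separate questions from statements, select the second pretrained model by name:

# Questions are routed to output_1, statements to output_2.
classifier = TransformersQueryClassifier(
    model_name_or_path="shahrukhx01/question-vs-statement-classifier"
)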

See also the tutorial on pipelines.

TransformersQueryClassifier.__init__

def __init__(model_name_or_path: Union[Path, str] = "shahrukhx01/bert-mini-finetune-question-detection",
             model_version: Optional[str] = None,
             tokenizer: Optional[str] = None,
             use_gpu: bool = True,
             task: str = "text-classification",
             labels: Optional[List[str]] = None,
             batch_size: int = 16,
             progress_bar: bool = True,
             use_auth_token: Optional[Union[str, bool]] = None,
             devices: Optional[List[Union[str, "torch.device"]]] = None)

Arguments:

  • model_name_or_path: Directory of a saved model or the name of a public model, for example 'shahrukhx01/bert-mini-finetune-question-detection'. See Hugging Face models for a full list of available models.
  • model_version: The version of the model to use from the Hugging Face model hub. This can be a tag name, a branch name, or a commit hash.
  • tokenizer: The name of the tokenizer (usually the same as the model).
  • use_gpu: Whether to use GPU (if available).
  • task: Specifies the type of classification. Possible values: 'text-classification' or 'zero-shot-classification'.
  • labels: If the task is 'text-classification' and an ordered list of labels is provided, the first label corresponds to output_1, the second label to output_2, and so on. The labels must match the model labels; only the order can differ. If the task is 'zero-shot-classification', these are the candidate labels.
  • batch_size: The number of queries to be processed at a time.
  • progress_bar: Whether to show a progress bar.
  • use_auth_token: The API token used to download private models from Hugging Face. If this parameter is set to True, the token generated when running transformers-cli login (stored in ~/.huggingface) is used. For more information, see https://huggingface.co/transformers/main_classes/model.html#transformers.PreTrainedModel.from_pretrained
  • devices: List of torch devices (for example, cuda, cpu, mps) to limit inference to specific devices. The list may contain torch device objects and/or strings, for example [torch.device('cuda:0'), "mps", "cuda:1"], as shown in the sketch below. If use_gpu=False, the devices parameter is not used and a single CPU device is used for inference.
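
As noted above, the devices list accepts torch device objects and strings interchangeably. A minimal sketch pinning inference to the first GPU:

import torch

classifier = TransformersQueryClassifier(
    devices=[torch.device("cuda:0")],  # equivalently: devices=["cuda:0"]
    batch_size=32,
)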