
TransformersZeroShotDocumentClassifier

Classifies documents into the labels you provide and adds the predicted label to their metadata.

Most common position in a pipeline: Before a MetadataRouter
Mandatory init variables: "model": The name or path of a Hugging Face model for zero-shot document classification
"labels": The set of possible class labels to classify each document into, for example, ["positive", "negative"]. The labels depend on the selected model.
Mandatory run variables: "documents": A list of documents to classify
Output variables: "documents": A list of processed documents with an added "classification" metadata field
API reference: Classifiers
GitHub link: https://github.com/deepset-ai/haystack/blob/main/haystack/components/classifiers/zero_shot_document_classifier.py

Overview

The TransformersZeroShotDocumentClassifier component performs zero-shot classification of documents based on the labels that you set and adds the predicted label to their metadata.

The component uses a Hugging Face pipeline for zero-shot classification.
To initialize the component, provide the model and the set of labels to be used for categorization.
You can additionally allow more than one label to apply to the same document by setting the multi_label boolean to True.
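
For example, a minimal sketch of enabling multi-label classification. The labels here are illustrative and not part of the original example:

from haystack.components.classifiers import TransformersZeroShotDocumentClassifier

# Illustrative labels; choose labels that fit your use case and the selected model.
document_classifier = TransformersZeroShotDocumentClassifier(
    model="cross-encoder/nli-deberta-v3-xsmall",
    labels=["urgent", "not urgent"],
    multi_label=True,  # each label is scored independently, so several can apply
)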

Classification is run on the document's content field by default. If you want it to run on another field, set the classification_field init parameter to one of the document's metadata fields.
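
A minimal sketch, assuming your documents carry a "title" metadata field that you want to classify instead of the content:

from haystack import Document
from haystack.components.classifiers import TransformersZeroShotDocumentClassifier

# "title" is an assumed metadata field used only for illustration.
document_classifier = TransformersZeroShotDocumentClassifier(
    model="cross-encoder/nli-deberta-v3-xsmall",
    labels=["animals", "food"],
    classification_field="title",
)

documents = [Document(content="Full article text...", meta={"title": "Cats don't get teeth cavities."})]
document_classifier.warm_up()
result = document_classifier.run(documents=documents)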

The classification results are stored in the classification dictionary within each document's metadata. If multi_label is set to True, you will find the scores for each label under the details key within the classification dictionary.
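
For instance, this sketch reads the prediction back from a classified document, using the field names described above:

from haystack import Document
from haystack.components.classifiers import TransformersZeroShotDocumentClassifier

document_classifier = TransformersZeroShotDocumentClassifier(
    model="cross-encoder/nli-deberta-v3-xsmall",
    labels=["animals", "food"],
)
document_classifier.warm_up()
result = document_classifier.run(documents=[Document(content="Cats don't get teeth cavities.")])

for doc in result["documents"]:
    classification = doc.meta["classification"]
    print(classification["label"])  # the predicted label
    # With multi_label=True, per-label scores are available under classification["details"].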

Models available for zero-shot classification include:
- valhalla/distilbart-mnli-12-3
- cross-encoder/nli-distilroberta-base
- cross-encoder/nli-deberta-v3-xsmall

Usage

On its own

from haystack import Document
from haystack.components.classifiers import TransformersZeroShotDocumentClassifier

documents = [Document(id="0", content="Cats don't get teeth cavities."),
             Document(id="1", content="Cucumbers can be grown in water.")]
             
document_classifier = TransformersZeroShotDocumentClassifier(
    model="cross-encoder/nli-deberta-v3-xsmall",
    labels=["animals", "food"],
)

document_classifier.warm_up()  # load the model before the first run
document_classifier.run(documents=documents)
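
The run() call returns a dictionary with a documents key; each returned document now carries the predicted label in the classification field of its metadata, as described above.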

In a pipeline

The following pipeline classifies documents retrieved by a BM25 search component, using a predefined set of classification labels:

from haystack import Document
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.core.pipeline import Pipeline
from haystack.components.classifiers import TransformersZeroShotDocumentClassifier

documents = [Document(id="0", content="Today was a nice day!"),
             Document(id="1", content="Yesterday was a bad day!")]

document_store = InMemoryDocumentStore()
retriever = InMemoryBM25Retriever(document_store=document_store)
document_classifier = TransformersZeroShotDocumentClassifier(
    model="cross-encoder/nli-deberta-v3-xsmall",
    labels=["positive", "negative"],
)

document_store.write_documents(documents)

pipeline = Pipeline()
pipeline.add_component(instance=retriever, name="retriever")
pipeline.add_component(instance=document_classifier, name="document_classifier")
pipeline.connect("retriever", "document_classifier")

queries = ["How was your day today?", "How was your day yesterday?"]
expected_predictions = ["positive", "negative"]

# Each query should retrieve its matching document, and the classifier should add the expected label.
for idx, query in enumerate(queries):
    result = pipeline.run({"retriever": {"query": query, "top_k": 1}})
    assert result["document_classifier"]["documents"][0].to_dict()["id"] == str(idx)
    assert (result["document_classifier"]["documents"][0].to_dict()["classification"]["label"]
            == expected_predictions[idx])
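
As noted above, this component is most often placed before a MetadataRouter. A hypothetical extension of the pipeline above (added before running it) could route documents by their predicted label. The rules below reuse the "positive" and "negative" labels and assume that filter rules can reference the nested meta.classification.label field:

from haystack.components.routers import MetadataRouter

# Illustrative routing rules, one output per label.
router = MetadataRouter(
    rules={
        "positive": {"field": "meta.classification.label", "operator": "==", "value": "positive"},
        "negative": {"field": "meta.classification.label", "operator": "==", "value": "negative"},
    }
)

pipeline.add_component(instance=router, name="router")
pipeline.connect("document_classifier.documents", "router.documents")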