DocumentLanguageClassifier
Use this node to classify Documents by language. You can then route them to different branches of your pipeline based on the Documents' language.
DocumentLanguageClassifier detects the language of the Documents you pass to it and attaches it to the Document's metadata like this:
'meta': {'name': 'document1.txt', 'language': 'en'}
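To make the metadata step concrete, here is a minimal sketch in plain Python. It does not use Haystack: the Document is modeled as a dictionary, and `detect_language` is a hypothetical stopword-counting stand-in for the real detector, so treat it as an illustration of the idea rather than the node's implementation.

```python
# Hypothetical stand-in for a real language detector (NOT the langdetect
# library or transformer model the node uses): it counts common stopwords.
STOPWORDS = {
    "en": {"the", "and", "is", "of"},
    "de": {"der", "die", "und", "ist"},
}


def detect_language(text):
    """Return the language code whose stopwords occur most often in the text."""
    words = text.lower().split()
    scores = {lang: sum(w in stop for w in words) for lang, stop in STOPWORDS.items()}
    return max(scores, key=scores.get)


def attach_language(doc):
    """Return a copy of the document with the detected language added to its meta."""
    meta = dict(doc.get("meta", {}))
    meta["language"] = detect_language(doc["content"])
    return {**doc, "meta": meta}


doc = {"content": "The cat is on the roof of the house", "meta": {"name": "document1.txt"}}
tagged = attach_language(doc)
print(tagged["meta"])  # {'name': 'document1.txt', 'language': 'en'}
```

The existing metadata is kept and only the `language` key is added, which mirrors the shape of the `meta` entry shown above.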
This node has multiple outgoing edges, one for each language you specify. Use the languages_to_route parameter to pass the list of languages you want DocumentLanguageClassifier to detect in your Documents. By default, the languages are: en (English), de (German), es (Spanish), cs (Czech), and nl (Dutch).
Make sure all your Documents are in one of the languages you specify. If even one Document is in another language, DocumentLanguageClassifier fails.
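The relationship between the language list and the outgoing edges can be sketched in plain Python. This is an illustration, not Haystack's code: `select_edge` is a hypothetical helper, and both the `output_N` edge naming (numbered by position in the list) and the use of `ValueError` for an unlisted language are assumptions made for the example.

```python
# Default languages as listed in this page.
DEFAULT_LANGUAGES = ["en", "de", "es", "cs", "nl"]


def select_edge(language, languages):
    """Map a detected language to an outgoing edge name.

    Assumption: edges are named output_1, output_2, ... in the order
    the languages appear in the list.
    """
    if language not in languages:
        # A language outside the configured list has no edge to go to.
        raise ValueError(
            f"Language '{language}' is not in the configured languages: {languages}"
        )
    return f"output_{languages.index(language) + 1}"


print(select_edge("de", DEFAULT_LANGUAGES))  # output_2

try:
    select_edge("fr", DEFAULT_LANGUAGES)
except ValueError as err:
    print(f"Routing failed: {err}")
```

Because every language needs its own edge, a Document in a language outside the list has nowhere to go, which is why the list should cover every language present in your data.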
Available Classes
There are two classes of DocumentLanguageClassifier. Here's how they differ:
- LangdetectDocumentLanguageClassifier: Uses the fast and lightweight langdetect library to detect the document language.
- TransformersDocumentLanguageClassifier: Uses a transformer-based model for language classification. You can choose the model to use with this classifier.
Usage
You can use the node in a pipeline or on its own.
Stand-Alone
To initialize the node, run:
from haystack.nodes import LangdetectDocumentLanguageClassifier
doc_classifier = LangdetectDocumentLanguageClassifier()
In a Pipeline
from haystack import Pipeline
# Assumes retriever and doc_classifier are already initialized
pipeline = Pipeline()
pipeline.add_node(component=retriever, name="Retriever", inputs=["Query"])
pipeline.add_node(component=doc_classifier, name="DocClassifier", inputs=["Retriever"])