
DocumentLanguageClassifier

Use this node to classify Documents by language. You can then route them to different branches of your pipeline based on the Documents' language.

Position in a Pipeline: After the PreProcessor in an indexing pipeline or after a Retriever in a query pipeline.
Input: Documents
Output: Documents
Classes: LangdetectDocumentLanguageClassifier, TransformersDocumentLanguageClassifier

DocumentLanguageClassifier detects the language of the Documents you pass to it and attaches it to the Document's metadata like this:

'meta': {'name': 'document1.txt', 'language': 'en'}

This node has one outgoing edge for each language you specify. Use the languages_to_route parameter to pass the list of languages you want DocumentLanguageClassifier to detect in your Documents. By default, the languages are: en (English), de (German), es (Spanish), cs (Czech), and nl (Dutch).
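To make the relationship between languages, metadata, and outgoing edges concrete, here is a minimal pure-Python sketch. It is not Haystack's implementation: the helpers `attach_language` and `edge_for_language` are hypothetical, Documents are modeled as plain dicts, and the `output_N` edge naming (numbered in the order of the language list) is an assumption.

```python
from typing import Dict, List, Optional

# Default languages listed above; the order is assumed to determine edge numbering.
DEFAULT_LANGUAGES: List[str] = ["en", "de", "es", "cs", "nl"]


def attach_language(document: Dict, language: str) -> Dict:
    """Return a copy of the document dict with the language stored in its metadata."""
    meta = {**document.get("meta", {}), "language": language}
    return {**document, "meta": meta}


def edge_for_language(language: str, languages: Optional[List[str]] = None) -> str:
    """Map a language code to an outgoing edge name such as 'output_1' (assumed naming)."""
    languages = DEFAULT_LANGUAGES if languages is None else languages
    if language not in languages:
        raise ValueError(f"Language '{language}' is not in the configured list {languages}.")
    return f"output_{languages.index(language) + 1}"


doc = attach_language({"content": "Hello world", "meta": {"name": "document1.txt"}}, "en")
print(doc["meta"])              # {'name': 'document1.txt', 'language': 'en'}
print(edge_for_language("de"))  # output_2
```

With five configured languages, this sketch yields five possible edges, `output_1` through `output_5`, one per language.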

Note: Make sure all your Documents are in one of the languages you specify. If even one Document is in another language, DocumentLanguageClassifier fails.
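Because a single Document in an unexpected language is enough to cause a failure, it can help to screen Documents before they reach the node. The sketch below assumes Documents are plain dicts and that their language is already known from some earlier step (for example, from the source system). The helper `split_by_support` is hypothetical and not part of Haystack.

```python
from typing import List, Optional, Tuple

# Default languages listed above.
SUPPORTED_LANGUAGES: List[str] = ["en", "de", "es", "cs", "nl"]


def split_by_support(
    documents: List[dict], languages: Optional[List[str]] = None
) -> Tuple[List[dict], List[dict]]:
    """Separate documents whose known language is in `languages` from the rest."""
    languages = SUPPORTED_LANGUAGES if languages is None else languages
    supported: List[dict] = []
    unsupported: List[dict] = []
    for doc in documents:
        language = doc.get("meta", {}).get("language")
        if language in languages:
            supported.append(doc)
        else:
            unsupported.append(doc)
    return supported, unsupported


docs = [
    {"content": "Hello world", "meta": {"language": "en"}},
    {"content": "Bonjour le monde", "meta": {"language": "fr"}},
    {"content": "Hallo Welt", "meta": {"language": "de"}},
]
supported, unsupported = split_by_support(docs)
print(len(supported), len(unsupported))  # 2 1
```

Only the `supported` list would then be passed on. The French Document is set aside because `fr` is not among the configured languages.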

Available Classes

DocumentLanguageClassifier comes in two classes. Here's how they differ:

  • LangdetectDocumentLanguageClassifier - Uses the fast, lightweight langdetect library to detect the Document's language.
  • TransformersDocumentLanguageClassifier - Uses a transformer-based model for language classification. You can choose the model to use with this classifier.

Usage

You can use the node in a pipeline or on its own.

Stand-Alone

To initialize the node, run:

from haystack.nodes import LangdetectDocumentLanguageClassifier

doc_classifier = LangdetectDocumentLanguageClassifier()
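Once the node is initialized, you can run it on a list of Documents. The following is a sketch under assumptions rather than a definitive recipe: it assumes Haystack 1.x, where the classifier is expected to expose a `predict()` method that returns the Documents with `meta["language"]` filled in. The wrapper `detect_languages` is a hypothetical helper introduced only for illustration.

```python
from typing import List, Optional


def detect_languages(texts: List[str]) -> List[Optional[str]]:
    """Return the detected language code for each text (None if none was set)."""
    # Imported here so this module still loads where Haystack is not installed.
    from haystack import Document
    from haystack.nodes import LangdetectDocumentLanguageClassifier

    doc_classifier = LangdetectDocumentLanguageClassifier()
    documents = [Document(content=text) for text in texts]
    # Assumption: predict() returns the Documents with meta["language"] set.
    classified = doc_classifier.predict(documents=documents)
    return [doc.meta.get("language") for doc in classified]
```

Calling `detect_languages(["This is an English sentence.", "Das ist ein deutscher Satz."])` should return one language code per input text, provided Haystack and langdetect are installed.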

In a Pipeline

from haystack import Pipeline

# `retriever` and `doc_classifier` must already be initialized.
pipeline = Pipeline()
pipeline.add_node(component=retriever, name="Retriever", inputs=["Query"])
pipeline.add_node(component=doc_classifier, name="DocClassifier", inputs=["Retriever"])