Detects the language of the Documents and adds it to the metadata.
Module document_language_classifier
DocumentLanguageClassifier
@component
class DocumentLanguageClassifier()
Classify the language of documents and add the detected language to their metadata.
A MetadataRouter
can then route them onto different output connections depending on their language.
The set of supported languages can be specified.
For routing plain text using the same logic, use the related TextLanguageRouter
component instead.
Usage example within an indexing pipeline, storing in a Document Store only documents written in English:
from haystack import Document, Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.classifiers import DocumentLanguageClassifier
from haystack.components.routers import MetadataRouter
from haystack.components.writers import DocumentWriter
docs = [Document(id="1", content="This is an English document"),
Document(id="2", content="Este es un documento en español")]
document_store = InMemoryDocumentStore()
p = Pipeline()
p.add_component(instance=DocumentLanguageClassifier(languages=["en"]), name="language_classifier")
p.add_component(instance=MetadataRouter(rules={"en": {"language": {"$eq": "en"}}}), name="router")
p.add_component(instance=DocumentWriter(document_store=document_store), name="writer")
p.connect("language_classifier.documents", "router.documents")
p.connect("router.en", "writer.documents")
p.run({"language_classifier": {"documents": docs}})
written_docs = document_store.filter_documents()
assert len(written_docs) == 1
assert written_docs[0] == Document(id="1", content="This is an English document", meta={"language": "en"})
DocumentLanguageClassifier.__init__
def __init__(languages: Optional[List[str]] = None)
Arguments:
languages
: A list of languages in ISO code, each corresponding to a different output connection. For supported languages, see thelangdetect
documentation. If not specified, the default is ["en"].
DocumentLanguageClassifier.run
@component.output_types(documents=List[Document])
def run(documents: List[Document])
This method classifies the documents' language and adds it to their metadata.
If a Document's text does not match any of the languages specified at initialization, the metadata value "unmatched" will be stored.
Arguments:
documents
: A list of documents to classify their language.
Raises:
TypeError
: if the input is not a list of Documents.
Returns:
A dictionary with the following key:
documents
: List of Documents with an added metadata field calledlanguage
.