Langdetect
haystack_integrations.components.classifiers.langdetect.document_language_classifier
DocumentLanguageClassifier
Classifies the language of each document and adds it to its metadata.
Provide a list of languages during initialization. If the document's text doesn't match any of the specified languages, the metadata value is set to "unmatched". To route documents based on their language, use the MetadataRouter component after DocumentLanguageClassifier. For routing plain text, use the TextLanguageRouter component instead.
Usage example
from haystack import Document, Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack_integrations.components.classifiers.langdetect import DocumentLanguageClassifier
from haystack.components.routers import MetadataRouter
from haystack.components.writers import DocumentWriter
docs = [
    Document(id="1", content="This is an English document"),
    Document(id="2", content="Este es un documento en español"),
]
document_store = InMemoryDocumentStore()
p = Pipeline()
p.add_component(instance=DocumentLanguageClassifier(languages=["en"]), name="language_classifier")
p.add_component(
    instance=MetadataRouter(
        rules={
            "en": {
                "field": "meta.language",
                "operator": "==",
                "value": "en",
            }
        }
    ),
    name="router",
)
p.add_component(instance=DocumentWriter(document_store=document_store), name="writer")
p.connect("language_classifier.documents", "router.documents")
p.connect("router.en", "writer.documents")
p.run({"language_classifier": {"documents": docs}})
written_docs = document_store.filter_documents()
assert len(written_docs) == 1
assert written_docs[0] == Document(id="1", content="This is an English document", meta={"language": "en"})
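The pipeline above routes only English documents. As a hedged sketch, the rules dictionary could be built with one entry per language value, plus an entry for the "unmatched" value that the classifier writes. This assumes the classifier is initialized with languages=["en", "es"]; the output names, in particular `other_language`, are hypothetical choices for illustration, and the rule syntax is copied from the example above.

```python
# Sketch: build MetadataRouter-style rules, one output per language value.
# Assumption: the classifier was created with languages=["en", "es"].
# Keys are router output names (hypothetical); values are the meta.language
# values written by the classifier.
OUTPUT_TO_LANGUAGE = {
    "en": "en",
    "es": "es",
    "other_language": "unmatched",
}

rules = {
    output: {"field": "meta.language", "operator": "==", "value": language}
    for output, language in OUTPUT_TO_LANGUAGE.items()
}
```

The resulting dictionary has the same shape as the `rules` argument passed to MetadataRouter in the usage example.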
init
Initializes the DocumentLanguageClassifier component.
Parameters:
- languages (list[str] | None) – A list of ISO language codes. See the supported languages in the langdetect documentation. If not specified, defaults to ["en"].
run
Classifies the language of each document and adds it to its metadata.
If the document's text doesn't match any of the languages specified at initialization, sets the metadata value to "unmatched".
Parameters:
- documents (list[Document]) – A list of documents for language classification.
Returns:
dict[str, list[Document]] – A dictionary with the following key:
- documents: A list of documents with an added language metadata field.
Raises:
TypeError – If the input is not a list of Documents.
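The labeling behavior described above can be sketched without any dependencies. This is a minimal illustration, not the component's implementation: `toy_detect` is a hypothetical stand-in for a real language detector, documents are plain dictionaries rather than Document objects, and the helper names `label_language` and `classify` are made up for this example.

```python
from typing import Callable


def toy_detect(text: str) -> str:
    """Hypothetical stand-in for a real language detector (not langdetect's API)."""
    # Crude heuristic for illustration only: Spanish-specific characters -> "es".
    return "es" if any(ch in "ñ¿¡" for ch in text) else "en"


def label_language(text: str, languages: list[str], detect: Callable[[str], str]) -> str:
    """Return the detected code if it is in `languages`, otherwise "unmatched"."""
    detected = detect(text)
    return detected if detected in languages else "unmatched"


def classify(documents: list[dict], languages: list[str]) -> dict[str, list[dict]]:
    """Mimic the documented return shape using plain dicts instead of Documents."""
    if not isinstance(documents, list):
        raise TypeError("expected a list of documents")
    for doc in documents:
        # Add the language label to each document's metadata.
        label = label_language(doc["content"], languages, toy_detect)
        doc.setdefault("meta", {})["language"] = label
    return {"documents": documents}


docs = [
    {"content": "This is an English document"},
    {"content": "Este es un documento en español"},
]
result = classify(docs, languages=["en"])
labels = [doc["meta"]["language"] for doc in result["documents"]]
# labels == ["en", "unmatched"]
```

The Spanish document is labeled "unmatched" because only "en" was requested, which mirrors the outcome of the usage example where a single document reaches the writer.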
haystack_integrations.components.routers.langdetect.text_language_router
TextLanguageRouter
Routes text strings to different output connections based on their language.
Provide a list of languages during initialization. If the text doesn't match any of the specified languages, it is routed to the "unmatched" output. To route documents based on their language, use the DocumentLanguageClassifier component followed by the MetadataRouter.
Usage example
from haystack import Pipeline, Document
from haystack_integrations.components.routers.langdetect import TextLanguageRouter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
document_store = InMemoryDocumentStore()
document_store.write_documents([Document(content="Elvis Presley was an American singer and actor.")])
p = Pipeline()
p.add_component(instance=TextLanguageRouter(languages=["en"]), name="text_language_router")
p.add_component(instance=InMemoryBM25Retriever(document_store=document_store), name="retriever")
p.connect("text_language_router.en", "retriever.query")
result = p.run({"text_language_router": {"text": "Who was Elvis Presley?"}})
assert result["retriever"]["documents"][0].content == "Elvis Presley was an American singer and actor."
result = p.run({"text_language_router": {"text": "ένα ελληνικό κείμενο"}})
assert result["text_language_router"]["unmatched"] == "ένα ελληνικό κείμενο"
init
Initializes the TextLanguageRouter component.
Parameters:
- languages (list[str] | None) – A list of ISO language codes. See the supported languages in the langdetect documentation. If not specified, defaults to ["en"].
run
Routes the text strings to different output connections based on their language.
If the text doesn't match any of the specified languages, it is routed to the "unmatched" output.
Parameters:
- text (str) – A text string to route.
Returns:
dict[str, str] – A dictionary in which the key is the language (or "unmatched"), and the value is the text.
Raises:
TypeError – If the input is not a string.
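The return shape described above, a single-key dictionary whose key is either a language code or "unmatched", can be sketched as follows. This is an illustration under stated assumptions rather than the router's actual code: `toy_detect` is a hypothetical detector that treats ASCII-only text as English, and `route_text` is a made-up helper name.

```python
from typing import Callable


def toy_detect(text: str) -> str:
    """Hypothetical stand-in for a real detector: ASCII-only text counts as "en"."""
    return "en" if text.isascii() else "xx"


def route_text(
    text: str,
    languages: list[str],
    detect: Callable[[str], str] = toy_detect,
) -> dict[str, str]:
    """Return {language: text}, or {"unmatched": text} if the language is not listed."""
    if not isinstance(text, str):
        raise TypeError("expected a string")
    detected = detect(text)
    key = detected if detected in languages else "unmatched"
    return {key: text}


english = route_text("Who was Elvis Presley?", ["en"])
greek = route_text("ένα ελληνικό κείμενο", ["en"])
# english == {"en": "Who was Elvis Presley?"}
# greek == {"unmatched": "ένα ελληνικό κείμενο"}
```

Because exactly one key is present in the result, a downstream component connected to a given output (such as `text_language_router.en` in the usage example) only receives input when the text matches that language.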