Detects the language of Documents.
Module langdetect
LangdetectDocumentLanguageClassifier
class LangdetectDocumentLanguageClassifier(BaseDocumentLanguageClassifier)
A node based on the lightweight and fast langdetect library for classifying the language of documents.
This node detects the language of Documents and adds the output to the Documents' metadata.
The meta field of the Document is a dictionary with the following format:
`'meta': {'name': '450_Baelor.txt', 'language': 'en'}`
- Using the document language classifier, you can directly get predictions with the `predict()` method.
- You can route the Documents to different branches depending on their language by setting the `route_by_language` parameter to `True` and specifying the `languages_to_route` parameter.

**Usage example**
```python
...
docs = [Document(content="The black dog runs across the meadow")]
doclangclassifier = LangdetectDocumentLanguageClassifier()
results = doclangclassifier.predict(documents=docs)
# print the predicted language
print(results[0].to_dict()["meta"]["language"])
```
**Usage example for routing**
```python
...
docs = [Document(content="My name is Ryan and I live in London"),
        Document(content="Mi chiamo Matteo e vivo a Roma")]
doclangclassifier = LangdetectDocumentLanguageClassifier(
    route_by_language=True,
    languages_to_route=["en", "it", "es"],
)
for doc in docs:
    doclangclassifier.run(documents=[doc])
```
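The routing behavior above can be sketched without the library: each Document is sent to the output edge matching its detected language. The following is a minimal stand-in with plain dictionaries, where the `output_1`/`output_2` edge naming and the pre-filled `"language"` keys are illustrative assumptions, not the library's internals:

```python
from typing import Dict, List, Tuple


def route_by_language(docs: List[dict],
                      languages_to_route: List[str]) -> Tuple[Dict[str, List[dict]], str]:
    # One output edge per routed language, in the order given.
    outputs: Dict[str, List[dict]] = {
        f"output_{i + 1}": [] for i in range(len(languages_to_route))
    }
    for doc in docs:
        # Stand-in for langdetect: each doc already carries a "language" key here.
        lang = doc["meta"]["language"]
        if lang not in languages_to_route:
            raise ValueError(f"Language {lang!r} is not in languages_to_route")
        outputs[f"output_{languages_to_route.index(lang) + 1}"].append(doc)
    return outputs, "split"


docs = [
    {"content": "My name is Ryan and I live in London", "meta": {"language": "en"}},
    {"content": "Mi chiamo Matteo e vivo a Roma", "meta": {"language": "it"}},
]
edges, _ = route_by_language(docs, ["en", "it", "es"])
print([d["content"] for d in edges["output_1"]])  # English docs land on the first edge
```

Documents whose language is not listed in `languages_to_route` raise an error in this sketch; the real node's behavior is governed by its own configuration.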
LangdetectDocumentLanguageClassifier.__init__
def __init__(route_by_language: bool = True,
languages_to_route: Optional[List[str]] = None)
Arguments:
- `route_by_language`: Sends Documents to a different output edge depending on their language.
- `languages_to_route`: A list of languages in ISO code, each corresponding to a different output edge (see the langdetect documentation).
LangdetectDocumentLanguageClassifier.predict
def predict(documents: List[Document],
batch_size: Optional[int] = None) -> List[Document]
Detect the language of Documents and add the output to the Documents' metadata.
Arguments:
- `documents`: A list of Documents whose language you want to detect.
Returns:
A list of Documents, where `Document.meta["language"]` contains the predicted language.
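The metadata contract of this method can be illustrated with plain dictionaries. The `detect_language` stub below is a hypothetical stand-in for langdetect's statistical detection; only the tagging pattern mirrors the documented behavior:

```python
from typing import List


def detect_language(text: str) -> str:
    # Hypothetical stub standing in for langdetect; real detection is statistical.
    return "it" if "ciao" in text.lower() else "en"


def predict(documents: List[dict]) -> List[dict]:
    # Mirror the documented contract: write the prediction into meta["language"].
    for doc in documents:
        doc.setdefault("meta", {})["language"] = detect_language(doc["content"])
    return documents


docs = predict([{"content": "The black dog runs across the meadow"}])
print(docs[0]["meta"]["language"])  # -> en
```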
LangdetectDocumentLanguageClassifier.predict_batch
def predict_batch(documents: List[List[Document]],
batch_size: Optional[int] = None) -> List[List[Document]]
Detect the language of Documents and add the output to the Documents' metadata.
Arguments:
- `documents`: A list of lists of Documents whose language you want to detect.
Returns:
A list of lists of Documents, where `Document.meta["language"]` contains the predicted language.
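The list-of-lists shape accepted by the batch methods can be pictured as simple nesting: each inner list keeps its grouping in the output. A sketch with plain dictionaries standing in for Documents (the constant `"en"` stub prediction is an assumption for illustration):

```python
from typing import List


def predict_batch(document_lists: List[List[dict]]) -> List[List[dict]]:
    # Tag every Document in every inner list, preserving the nested grouping.
    for documents in document_lists:
        for doc in documents:
            doc.setdefault("meta", {})["language"] = "en"  # stub prediction
    return document_lists


batches = predict_batch([[{"content": "Hello"}],
                         [{"content": "World"}, {"content": "Again"}]])
print([len(b) for b in batches])  # -> [1, 2]
```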
LangdetectDocumentLanguageClassifier.run
def run(documents: List[Document]) -> Tuple[Dict[str, List[Document]], str]
Run the document language classifier on a list of Documents.
Arguments:
- `documents`: A list of Documents whose language you want to detect.
LangdetectDocumentLanguageClassifier.run_batch
def run_batch(documents: List[List[Document]],
batch_size: Optional[int] = None) -> Tuple[Dict, str]
Run the document language classifier on batches of Documents.
Arguments:
- `documents`: A list of lists of Documents whose language you want to detect.
Module transformers
TransformersDocumentLanguageClassifier
class TransformersDocumentLanguageClassifier(BaseDocumentLanguageClassifier)
Transformer-based model for classifying the document language using Hugging Face's transformers framework.
While the underlying model can vary (BERT, RoBERTa, DistilBERT, ...), the interface remains the same.
This node detects the language of Documents and adds the output to the Documents' metadata.
The meta field of the Document is a dictionary with the following format:
`'meta': {'name': '450_Baelor.txt', 'language': 'en'}`
- Using the document language classifier, you can directly get predictions with the `predict()` method.
- You can route the Documents to different branches depending on their language by setting the `route_by_language` parameter to `True` and specifying the `languages_to_route` parameter.

**Usage example**
```python
...
docs = [Document(content="The black dog runs across the meadow")]
doclangclassifier = TransformersDocumentLanguageClassifier()
results = doclangclassifier.predict(documents=docs)
# print the predicted language
print(results[0].to_dict()["meta"]["language"])
```
**Usage example for routing**
```python
...
docs = [Document(content="My name is Ryan and I live in London"),
        Document(content="Mi chiamo Matteo e vivo a Roma")]
doclangclassifier = TransformersDocumentLanguageClassifier(
    route_by_language=True,
    languages_to_route=["en", "it", "es"],
)
for doc in docs:
    doclangclassifier.run(documents=[doc])
```
TransformersDocumentLanguageClassifier.__init__
def __init__(
route_by_language: bool = True,
languages_to_route: Optional[List[str]] = None,
labels_to_languages_mapping: Optional[Dict[str, str]] = None,
model_name_or_path: str = "papluca/xlm-roberta-base-language-detection",
model_version: Optional[str] = None,
tokenizer: Optional[str] = None,
use_gpu: bool = True,
batch_size: int = 16,
progress_bar: bool = True,
use_auth_token: Optional[Union[str, bool]] = None,
devices: Optional[List[Union[str, "torch.device"]]] = None)
Load a language detection model from Transformers.
For a full list of available models, see Hugging Face models. For language detection models, see Language Detection models on Hugging Face.
Arguments:
- `route_by_language`: Sends Documents to a different output edge depending on their language.
- `languages_to_route`: A list of languages, each corresponding to a different output edge (for the list of supported languages, see the model card of the chosen model).
- `labels_to_languages_mapping`: Some Transformers models return generic labels instead of language names. In this case, you can provide a mapping indicating a language for each label. For example: `{"LABEL_1": "ar", "LABEL_2": "bg", ...}`.
- `model_name_or_path`: Directory of a saved model or the name of a public model, for example 'papluca/xlm-roberta-base-language-detection'. See Hugging Face models for a full list of available models.
- `model_version`: The version of the model to use from the Hugging Face model hub. Can be a tag name, a branch name, or a commit hash.
- `tokenizer`: Name of the tokenizer (usually the same as the model).
- `use_gpu`: Whether to use GPU (if available).
- `batch_size`: Number of Documents to be processed at a time.
- `progress_bar`: Whether to show a progress bar while processing.
- `use_auth_token`: The API token used to download private models from Hugging Face. If set to `True`, the token generated when running `transformers-cli login` (stored in ~/.huggingface) is used. For more information, see the Hugging Face documentation.
- `devices`: List of torch devices (for example, cuda, cpu, mps) to limit inference to specific devices. A list containing torch device objects or strings is supported (for example `[torch.device('cuda:0'), "mps", "cuda:1"]`). When specifying `use_gpu=False`, the `devices` parameter is not used and a single cpu device is used for inference.
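The effect of `labels_to_languages_mapping` can be pictured as a post-processing dictionary lookup applied to the model's raw labels before they are written into `meta["language"]`. The label names and codes below are illustrative assumptions in the style of generic Transformers outputs:

```python
# Hypothetical mapping from generic classifier labels to ISO language codes.
labels_to_languages_mapping = {"LABEL_0": "ar", "LABEL_1": "bg", "LABEL_2": "de"}

# e.g. raw model outputs for two Documents
raw_predictions = ["LABEL_2", "LABEL_0"]
languages = [labels_to_languages_mapping[label] for label in raw_predictions]
print(languages)  # -> ['de', 'ar']
```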
TransformersDocumentLanguageClassifier.predict
def predict(documents: List[Document],
batch_size: Optional[int] = None) -> List[Document]
Detect the language of Documents and add the output to the Documents' metadata.
Arguments:
- `documents`: A list of Documents whose language you want to detect.
- `batch_size`: The number of Documents to classify at a time.
Returns:
A list of Documents, where `Document.meta["language"]` contains the predicted language.
TransformersDocumentLanguageClassifier.predict_batch
def predict_batch(documents: List[List[Document]],
batch_size: Optional[int] = None) -> List[List[Document]]
Detect the language of Documents and add the output to the Documents' metadata.
Arguments:
- `documents`: A list of lists of Documents whose language you want to detect.
Returns:
A list of lists of Documents, where `Document.meta["language"]` contains the predicted language.
TransformersDocumentLanguageClassifier.run
def run(documents: List[Document]) -> Tuple[Dict[str, List[Document]], str]
Run the document language classifier on a list of Documents.
Arguments:
- `documents`: A list of Documents whose language you want to detect.
TransformersDocumentLanguageClassifier.run_batch
def run_batch(documents: List[List[Document]],
batch_size: Optional[int] = None) -> Tuple[Dict, str]
Run the document language classifier on batches of Documents.
Arguments:
- `documents`: A list of lists of Documents whose language you want to detect.