
Detects the language of Documents.

Module langdetect

LangdetectDocumentLanguageClassifier

```python
class LangdetectDocumentLanguageClassifier(BaseDocumentLanguageClassifier)
```

A node based on the lightweight and fast langdetect library for classifying the language of Documents. This node detects the language of a Document and adds the result to the Document's metadata. The meta field of the Document is a dictionary with the following format: `'meta': {'name': '450_Baelor.txt', 'language': 'en'}`

  • Using the document language classifier, you can directly get predictions with the predict() method.
  • You can route Documents to different branches depending on their language by setting the route_by_language parameter to True and specifying the languages_to_route parameter.

**Usage example**
```python
from haystack import Document
from haystack.nodes import LangdetectDocumentLanguageClassifier

docs = [Document(content="The black dog runs across the meadow")]

doclangclassifier = LangdetectDocumentLanguageClassifier()
results = doclangclassifier.predict(documents=docs)

# print the predicted language
print(results[0].to_dict()["meta"]["language"])
```

**Usage example for routing**
```python
from haystack import Document
from haystack.nodes import LangdetectDocumentLanguageClassifier

docs = [Document(content="My name is Ryan and I live in London"),
        Document(content="Mi chiamo Matteo e vivo a Roma")]

doclangclassifier = LangdetectDocumentLanguageClassifier(
    route_by_language=True,
    languages_to_route=["en", "it", "es"],
)
# run expects a list of Documents; route each one to its language edge
for doc in docs:
    output, edge = doclangclassifier.run(documents=[doc])
```
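
When routing is enabled, each language in languages_to_route corresponds to one output edge of the node, so the classifier can fan Documents out to language-specific processing in a Pipeline. Below is a minimal sketch, assuming the Haystack v1 Pipeline API and the common convention that the n-th entry in languages_to_route maps to the edge output_n; the node names are illustrative:

```python
from haystack import Pipeline
from haystack.nodes import LangdetectDocumentLanguageClassifier, PreProcessor

classifier = LangdetectDocumentLanguageClassifier(
    route_by_language=True,
    languages_to_route=["en", "it"],
)

pipe = Pipeline()
pipe.add_node(component=classifier, name="LangClassifier", inputs=["File"])
# Assumption: output_1 corresponds to "en", output_2 to "it"
pipe.add_node(component=PreProcessor(), name="PreprocessEN",
              inputs=["LangClassifier.output_1"])
pipe.add_node(component=PreProcessor(), name="PreprocessIT",
              inputs=["LangClassifier.output_2"])
```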

LangdetectDocumentLanguageClassifier.__init__

```python
def __init__(route_by_language: bool = True,
             languages_to_route: Optional[List[str]] = None)
```

Arguments:

  • route_by_language: Sends Documents to a different output edge depending on their language.
  • languages_to_route: A list of languages in ISO code, each corresponding to a different output edge (see the `langdetect` documentation).

LangdetectDocumentLanguageClassifier.predict

```python
def predict(documents: List[Document],
            batch_size: Optional[int] = None) -> List[Document]
```

Detect the language of Documents and add the output to the Documents' metadata.

Arguments:

  • documents: A list of Documents whose language you want to detect.

Returns:

List of Documents, where Document.meta["language"] contains the predicted language.

LangdetectDocumentLanguageClassifier.predict_batch

```python
def predict_batch(documents: List[List[Document]],
                  batch_size: Optional[int] = None) -> List[List[Document]]
```

Detect the language of Documents and add the output to the Documents' metadata.

Arguments:

  • documents: A list of lists of Documents whose language you want to detect.

Returns:

List of lists of Documents, where Document.meta["language"] contains the predicted language.
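
Since predict_batch takes a list of lists, the output preserves the nesting of the input. A minimal sketch, assuming Haystack v1.x import paths:

```python
from haystack import Document
from haystack.nodes import LangdetectDocumentLanguageClassifier

classifier = LangdetectDocumentLanguageClassifier()

batches = [
    [Document(content="The black dog runs across the meadow")],
    [Document(content="Mi chiamo Matteo e vivo a Roma")],
]
results = classifier.predict_batch(documents=batches)

# The output mirrors the input nesting: one inner list per input batch
for batch in results:
    print([doc.meta["language"] for doc in batch])
```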

LangdetectDocumentLanguageClassifier.run

```python
def run(documents: List[Document]) -> Tuple[Dict[str, List[Document]], str]
```

Run the document language classifier on a list of Documents.

Arguments:

  • documents: A list of documents whose language you want to detect.
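
The first element of the returned tuple is the node's output dictionary and the second is the name of the output edge the Documents were sent to. A minimal sketch of unpacking it, assuming the output_n edge-naming convention of Haystack routing nodes:

```python
from haystack import Document
from haystack.nodes import LangdetectDocumentLanguageClassifier

classifier = LangdetectDocumentLanguageClassifier(
    route_by_language=True,
    languages_to_route=["en", "it"],
)

output, edge = classifier.run(
    documents=[Document(content="My name is Ryan and I live in London")]
)
print(edge)  # assumed to be "output_1", since "en" is listed first
print(output["documents"][0].meta["language"])  # "en"
```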

LangdetectDocumentLanguageClassifier.run_batch

```python
def run_batch(documents: List[List[Document]],
              batch_size: Optional[int] = None) -> Tuple[Dict, str]
```

Run the document language classifier on batches of Documents.

Arguments:

  • documents: A list of lists of documents whose language you want to detect.

Module transformers

TransformersDocumentLanguageClassifier

```python
class TransformersDocumentLanguageClassifier(BaseDocumentLanguageClassifier)
```

A Transformer-based node for classifying the document language, using Hugging Face's transformers framework. While the underlying model can vary (BERT, RoBERTa, DistilBERT, and so on), the interface remains the same. This node detects the language of a Document and adds the result to the Document's metadata. The meta field of the Document is a dictionary with the following format: `'meta': {'name': '450_Baelor.txt', 'language': 'en'}`

  • Using the document language classifier, you can directly get predictions with the predict() method.
  • You can route Documents to different branches depending on their language by setting the route_by_language parameter to True and specifying the languages_to_route parameter.

**Usage example**
```python
from haystack import Document
from haystack.nodes import TransformersDocumentLanguageClassifier

docs = [Document(content="The black dog runs across the meadow")]

doclangclassifier = TransformersDocumentLanguageClassifier()
results = doclangclassifier.predict(documents=docs)

# print the predicted language
print(results[0].to_dict()["meta"]["language"])
```

**Usage example for routing**
```python
from haystack import Document
from haystack.nodes import TransformersDocumentLanguageClassifier

docs = [Document(content="My name is Ryan and I live in London"),
        Document(content="Mi chiamo Matteo e vivo a Roma")]

doclangclassifier = TransformersDocumentLanguageClassifier(
    route_by_language=True,
    languages_to_route=["en", "it", "es"],
)
# run expects a list of Documents; route each one to its language edge
for doc in docs:
    output, edge = doclangclassifier.run(documents=[doc])
```

TransformersDocumentLanguageClassifier.__init__

```python
def __init__(
        route_by_language: bool = True,
        languages_to_route: Optional[List[str]] = None,
        labels_to_languages_mapping: Optional[Dict[str, str]] = None,
        model_name_or_path: str = "papluca/xlm-roberta-base-language-detection",
        model_version: Optional[str] = None,
        tokenizer: Optional[str] = None,
        use_gpu: bool = True,
        batch_size: int = 16,
        progress_bar: bool = True,
        use_auth_token: Optional[Union[str, bool]] = None,
        devices: Optional[List[Union[str, "torch.device"]]] = None)
```

Load a language detection model from Transformers.

For a full list of available models, see Hugging Face models. For language detection models, see Language Detection models on Hugging Face.

Arguments:

  • route_by_language: Sends Documents to a different output edge depending on their language.
  • languages_to_route: A list of languages, each corresponding to a different output edge (for the list of supported languages, see the model card of the chosen model).
  • labels_to_languages_mapping: Some Transformers models return generic labels instead of language names. In this case, you can provide a mapping indicating a language for each label. For example: {"LABEL_1": "ar", "LABEL_2": "bg", ...}.
  • model_name_or_path: Directory of a saved model or the name of a public model, for example 'papluca/xlm-roberta-base-language-detection'. See Hugging Face models for a full list of available models.
  • model_version: The version of the model to use from the Hugging Face model hub. Can be a tag name, a branch name, or a commit hash.
  • tokenizer: Name of the tokenizer (usually the same as the model).
  • use_gpu: Whether to use GPU (if available).
  • batch_size: Number of Documents to be processed at a time.
  • progress_bar: Whether to show a progress bar while processing.
  • use_auth_token: The API token used to download private models from Hugging Face. If set to True, the token generated when running transformers-cli login (stored in ~/.huggingface) is used. For more information, see Hugging Face documentation.
  • devices: List of torch devices (for example, cuda, cpu, mps) to limit inference to specific devices. A list containing torch device objects or strings is supported (for example [torch.device('cuda:0'), "mps", "cuda:1"]). When specifying use_gpu=False, the devices parameter is not used and a single cpu device is used for inference.
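
Some checkpoints on the Hub return generic labels such as LABEL_0 rather than language codes; labels_to_languages_mapping lets you translate them before routing. A minimal construction sketch, assuming Haystack v1.x import paths (the mapping values here are illustrative):

```python
from haystack.nodes import TransformersDocumentLanguageClassifier

classifier = TransformersDocumentLanguageClassifier(
    model_name_or_path="papluca/xlm-roberta-base-language-detection",
    # Illustrative mapping; only needed for models that emit generic labels
    labels_to_languages_mapping={"LABEL_0": "en", "LABEL_1": "it"},
    batch_size=32,
    use_gpu=True,
    devices=["cuda:0"],  # ignored when use_gpu=False
)
```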

TransformersDocumentLanguageClassifier.predict

```python
def predict(documents: List[Document],
            batch_size: Optional[int] = None) -> List[Document]
```

Detect the language of Documents and add the output to the Documents' metadata.

Arguments:

  • documents: A list of Documents whose language you want to detect.
  • batch_size: The number of Documents to classify at a time.

Returns:

A list of Documents, where Document.meta["language"] contains the predicted language.

TransformersDocumentLanguageClassifier.predict_batch

```python
def predict_batch(documents: List[List[Document]],
                  batch_size: Optional[int] = None) -> List[List[Document]]
```

Detect the language of Documents and add the output to the Documents' metadata.

Arguments:

  • documents: A list of lists of Documents whose language you want to detect.

Returns:

A list of lists of Documents where Document.meta["language"] contains the predicted language.
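
As with the langdetect variant, the nesting of the input is preserved. A minimal sketch, assuming that, as is usual for Haystack nodes, a batch_size passed at call time overrides the value given in __init__ for that call:

```python
from haystack import Document
from haystack.nodes import TransformersDocumentLanguageClassifier

classifier = TransformersDocumentLanguageClassifier()

batches = [
    [Document(content="The black dog runs across the meadow")],
    [Document(content="Le chien noir court dans le pré")],
]
# batch_size here is assumed to override the value from __init__
results = classifier.predict_batch(documents=batches, batch_size=8)
for batch in results:
    print([doc.meta["language"] for doc in batch])
```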

TransformersDocumentLanguageClassifier.run

```python
def run(documents: List[Document]) -> Tuple[Dict[str, List[Document]], str]
```

Run the document language classifier on a list of Documents.

Arguments:

  • documents: A list of documents whose language you want to detect.

TransformersDocumentLanguageClassifier.run_batch

```python
def run_batch(documents: List[List[Document]],
              batch_size: Optional[int] = None) -> Tuple[Dict, str]
```

Run the document language classifier on batches of Documents.

Arguments:

  • documents: A list of lists of documents whose language you want to detect.