Languages Other Than English

PreProcessor

The PreProcessor's sentence tokenization is language-specific. If you are using the PreProcessor on a language other than English, make sure to set the language argument when initializing it.

from haystack.nodes import PreProcessor
preprocessor = PreProcessor(language="sv")  # other arguments omitted

Here you will find the list of supported languages.
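To see why the language setting matters for sentence splitting, consider the toy splitter below. It is only a sketch under simplifying assumptions and not how the PreProcessor works internally: the split_sentences function and both abbreviation sets are hypothetical names made up for this illustration.

```python
# Toy illustration only: split_sentences and the abbreviation sets are
# hypothetical and far simpler than a real sentence tokenizer.
ENGLISH_ABBREVIATIONS = {"e.g.", "i.e.", "dr.", "mr."}
SWEDISH_ABBREVIATIONS = {"t.ex.", "bl.a.", "kl."}


def split_sentences(text, abbreviations):
    """Split after tokens ending in ., ! or ? unless the token is a known abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token.endswith((".", "!", "?")) and token.lower() not in abbreviations:
            sentences.append(" ".join(current))
            current = []
    if current:  # keep trailing text that lacks final punctuation
        sentences.append(" ".join(current))
    return sentences


swedish_text = "Vi säljer frukt, t.ex. äpplen. Butiken öppnar snart."

# English rules wrongly break the sentence after the Swedish abbreviation "t.ex."
english_split = split_sentences(swedish_text, ENGLISH_ABBREVIATIONS)
# Swedish rules keep the first sentence intact
swedish_split = split_sentences(swedish_text, SWEDISH_ABBREVIATIONS)

print(len(english_split))  # 3
print(len(swedish_split))  # 2
```

With the English abbreviation list the text is cut into three fragments, while the Swedish list yields the two sentences a reader would expect.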

Retrievers

The sparse retriever methods themselves (BM25, TF-IDF) are language-agnostic. Their only requirement is that the text be split into words. The ElasticsearchDocumentStore relies on an analyzer to impose word boundaries, but also to handle punctuation, casing and stop words.
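The language-agnostic nature of BM25 can be sketched in a few lines. The function below is a minimal, assumed implementation of Okapi BM25 with a commonly used smoothed IDF and typical default values for k1 and b. It is not the code Elasticsearch runs, and the name bm25_scores is hypothetical. Note that it only ever sees lists of tokens.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, tokenized_docs, k1=1.2, b=0.75):
    """Score each pre-tokenized document against the query with Okapi BM25."""
    n_docs = len(tokenized_docs)
    avg_len = sum(len(doc) for doc in tokenized_docs) / n_docs
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))

    scores = []
    for doc in tokenized_docs:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in term_freq:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = 1 - b + b * len(doc) / avg_len
            score += idf * term_freq[term] * (k1 + 1) / (term_freq[term] + k1 * length_norm)
        scores.append(score)
    return scores


# The scorer never inspects the language, so German tokens work like English ones.
docs = [
    "berlin ist die hauptstadt von deutschland".split(),
    "paris ist die hauptstadt von frankreich".split(),
]
scores = bm25_scores("hauptstadt deutschland".split(), docs)
```

The first document matches both query terms and therefore scores higher than the second. Everything language-dependent happens before this step, when the text is turned into tokens.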

The default analyzer is an English analyzer. While it can still work decently for a large range of languages, you will want to set it to your language's analyzer for best results. In some cases, such as with Thai, the default analyzer is completely incompatible. See Language Analyzers for the full list of language-specific analyzers.

from haystack.document_stores import ElasticsearchDocumentStore

document_store = ElasticsearchDocumentStore(analyzer="thai")
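Why word boundaries matter so much for a language like Thai can be shown with a toy comparison. Whitespace splitting is used here as a stand-in for a naive word-boundary rule. This is an assumption made for illustration and is simpler than what any Elasticsearch analyzer actually does.

```python
# Toy comparison: whitespace splitting stands in for a naive word-boundary rule.
english_text = "I like to eat rice"
thai_text = "ฉันชอบกินข้าว"  # roughly "I like to eat rice", written without spaces

english_tokens = english_text.split()
thai_tokens = thai_text.split()

print(len(english_tokens))  # 5
print(len(thai_tokens))     # 1: the whole sentence is treated as a single "word"
```

A retriever that receives the Thai sentence as one token can only match queries that repeat the entire sentence, which is why a dedicated analyzer is needed.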

The models used in dense retrievers are language-specific. Be sure to check the language of the model used in your EmbeddingRetriever. The default model that is loaded in the DensePassageRetriever is for English.

We have created a German DensePassageRetriever model and know of other teams working on models for further languages. If you have a language model and a question answering dataset in your own language, you can also train a DPR model using Haystack! Below is a simplified example. See the Training a Dense Passage Retrieval model tutorial and also the DensePassageRetriever.train() API for more details.

from haystack.nodes import DensePassageRetriever

# Assumes document_store has already been initialized, for example as shown above.
dense_passage_retriever = DensePassageRetriever(document_store)
# data_dir and train_filename are placeholders; point them at your own training data.
dense_passage_retriever.train(
    data_dir="path/to/your/data",
    train_filename="train.json",
    dev_filename=None,
    test_filename=None,
    batch_size=16,
    embed_title=True,
    num_hard_negatives=1,
    n_epochs=3,
)

The Sentence Transformers team has trained various multilingual models that can be loaded via the Haystack EmbeddingRetriever class. To see a comparison of the different available models, see their Pretrained Models page.
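Loading such a model might look like the sketch below. The model name is just one example from the Sentence Transformers model hub, and build_multilingual_retriever is a hypothetical helper. The keyword arguments reflect the Haystack 1.x EmbeddingRetriever as we understand it, so check the API reference for your installed version.

```python
# Assumption: this model name is one example from the Sentence Transformers
# model hub; consult their Pretrained Models page for one that suits your language.
MULTILINGUAL_MODEL = "sentence-transformers/paraphrase-multilingual-mpnet-base-v2"


def build_multilingual_retriever(document_store):
    """Hypothetical helper: wrap a multilingual model in an EmbeddingRetriever."""
    # Imported here so this module can be loaded without Haystack installed.
    from haystack.nodes import EmbeddingRetriever

    return EmbeddingRetriever(
        document_store=document_store,
        embedding_model=MULTILINGUAL_MODEL,
        model_format="sentence_transformers",
    )
```

Calling build_multilingual_retriever(document_store) with an initialized document store should return a retriever whose embeddings cover the languages the chosen model was trained on.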

Readers

Models tend to perform better on English thanks to the wealth of available English training data, but there are several QA models for other languages that are directly usable in Haystack.

Transformers

# German
from haystack.nodes import TransformersReader
reader = TransformersReader("deepset/gelectra-large-germanquad")

# French
from haystack.nodes import TransformersReader
reader = TransformersReader("etalab-ia/camembert-base-squadFR-fquad-piaf")

# Italian
from haystack.nodes import TransformersReader
reader = TransformersReader("anakin87/electra-italian-xxl-cased-squad-it")

# Chinese
from haystack.nodes import TransformersReader
reader = TransformersReader("uer/roberta-base-chinese-extractive-qa")
# or
reader = TransformersReader("wptoux/albert-chinese-large-qa")

# Zero-shot (multilingual)
from haystack.nodes import TransformersReader
reader = TransformersReader("deepset/xlm-roberta-large-squad2")

FARM

# German
from haystack.nodes import FARMReader
reader = FARMReader("deepset/gelectra-large-germanquad")

# French
from haystack.nodes import FARMReader
reader = FARMReader("etalab-ia/camembert-base-squadFR-fquad-piaf")

# Italian
from haystack.nodes import FARMReader
reader = FARMReader("anakin87/electra-italian-xxl-cased-squad-it")

# Chinese
from haystack.nodes import FARMReader
reader = FARMReader("uer/roberta-base-chinese-extractive-qa")
# or
reader = FARMReader("wptoux/albert-chinese-large-qa")

# Zero-shot (multilingual)
from haystack.nodes import FARMReader
reader = FARMReader("deepset/xlm-roberta-large-squad2")

We are the creators of the German model and you can find out more about it on the GermanQuAD page.

The French, Italian, Spanish, Portuguese and Chinese models are monolingual language models trained on versions of the SQuAD dataset in their respective languages, and their authors report decent results in their model cards. For example, have a look at the French model card and the Italian model card. Korean QA models also exist on the model hub, but their performance is not reported.

The zero-shot model shown above is a multilingual XLM-RoBERTa Large model trained on English SQuAD. Our evaluations show that the model transfers some of its English QA capabilities to other languages, but its performance still lags behind that of the monolingual models. Nonetheless, if there is not yet a monolingual model for your language and it is one of the 100 supported by XLM-RoBERTa, this zero-shot model may serve as a decent first baseline.
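The "monolingual model first, zero-shot model as fallback" advice can be expressed as a small lookup. This is a sketch, not part of Haystack: the model names are the ones listed in this section, while the language codes, the dictionary and the pick_reader_model helper are illustrative assumptions.

```python
# Model names come from this section; the language codes and the helper
# itself are illustrative assumptions, not part of Haystack.
MONOLINGUAL_READERS = {
    "de": "deepset/gelectra-large-germanquad",
    "fr": "etalab-ia/camembert-base-squadFR-fquad-piaf",
    "it": "anakin87/electra-italian-xxl-cased-squad-it",
    "zh": "uer/roberta-base-chinese-extractive-qa",
}
ZERO_SHOT_READER = "deepset/xlm-roberta-large-squad2"


def pick_reader_model(language_code):
    """Prefer a monolingual model; otherwise fall back to the zero-shot model."""
    return MONOLINGUAL_READERS.get(language_code.lower(), ZERO_SHOT_READER)


print(pick_reader_model("de"))  # deepset/gelectra-large-germanquad
print(pick_reader_model("sv"))  # deepset/xlm-roberta-large-squad2
```

The returned name can be passed to FARMReader or TransformersReader as in the snippets above. The fallback is only sensible for languages that XLM-RoBERTa supports.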