DocumentationAPI ReferenceπŸ““ TutorialsπŸ§‘β€πŸ³ Cookbook🀝 IntegrationsπŸ’œ Discord

PseudoLabelGenerator

Use the Pseudo Label Generator to create training data for dense retrievers without human annotation. The PseudoLabelGenerator uses Generative Pseudo Labelling (GPL), which is an unsupervised domain adaptation method for training dense retrievers.

GPL generates labels for your Haystack Documents. When combined with a QuestionGenerator and a Retriever, it returns questions, labels, and negative passages that you can then use to train your EmbeddingRetriever.

InputUnlabelled target corpus (usually via DocumentStore) and Retriever (to mine for negative passages)
OutputLabels

Usage

from haystack.nodes import PseudoLabelGenerator

document_store = DocumentStore(your_document_store)
retriever = Retriever(your_retriever)
qg = QuestionGenerator(model_name_or_path="doc2query/msmarco-t5-base-v1")
plg = PseudoLabelGenerator(qg, retriever)
train_examples = []

for idx, doc in enumerate(document_store):
    output_samples = plg.run(documents=[doc])
    for item in output_samples:
        train_examples.append(item)

You can also provide a list of questions-document dictionary pairs for which you want to generate pseudo labels:

question_doc_pairs: List[Dict[str, str]] = [{"question": "question text", "document": "document text"}]
plg = PseudoLabelGenerator(question_doc_pairs, retriever)
train_examples = []

for idx, doc in enumerate(document_store):
    output_samples = plg.run(documents=[doc])
    for item in output_samples:
        train_examples.append(item)

For more information, see GPL Guide.