PseudoLabelGenerator
Use the Pseudo Label Generator to create training data for dense retrievers without human annotation. The PseudoLabelGenerator uses Generative Pseudo Labelling (GPL), which is an unsupervised domain adaptation method for training dense retrievers.
GPL generates labels for your Haystack Documents. When combined with a QuestionGenerator and a Retriever, it returns questions, labels, and negative passages that you can then use to train your EmbeddingRetriever.
Input | Unlabelled target corpus (usually via DocumentStore) and Retriever (to mine for negative passages) |
Output | Labels |
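To make the labelling step concrete: in GPL, each pseudo label is a margin, that is, the difference between a cross-encoder's relevance score for the (question, positive passage) pair and its score for the (question, negative passage) pair. The following is a minimal sketch of that idea in plain Python, not Haystack's implementation. The `toy_relevance` function is a hypothetical stand-in for a real cross-encoder (it only measures word overlap), and the function names and dictionary keys are assumptions chosen for illustration.

```python
from typing import Dict, List, Union


def toy_relevance(question: str, passage: str) -> float:
    """Hypothetical stand-in for a cross-encoder.

    Returns the fraction of question words that also occur in the passage.
    """
    q_words = set(question.lower().split())
    p_words = set(passage.lower().split())
    if not q_words:
        return 0.0
    return len(q_words & p_words) / len(q_words)


def margin_labels(
    question: str, positive: str, negatives: List[str]
) -> List[Dict[str, Union[str, float]]]:
    """Build one training example per mined negative passage.

    The pseudo label ("score") is the margin between the positive
    and the negative relevance scores.
    """
    pos_score = toy_relevance(question, positive)
    labels = []
    for neg in negatives:
        margin = pos_score - toy_relevance(question, neg)
        labels.append(
            {"question": question, "pos_doc": positive, "neg_doc": neg, "score": margin}
        )
    return labels


examples = margin_labels(
    question="what is dense retrieval",
    positive="dense retrieval is a search method that encodes text as vectors",
    negatives=["sparse methods such as bm25 count exact word matches"],
)
print(examples[0]["score"])
```

A large positive margin tells the retriever being trained that the positive passage should rank well above the negative one; a margin near zero marks a negative that is almost as relevant as the positive.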
Usage
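from haystack.nodes import PseudoLabelGenerator, QuestionGenerator

# Placeholders: substitute your own DocumentStore and Retriever instances.
document_store = DocumentStore(your_document_store)
retriever = Retriever(your_retriever)

# The QuestionGenerator produces questions for each document.
qg = QuestionGenerator(model_name_or_path="doc2query/msmarco-t5-base-v1")
plg = PseudoLabelGenerator(qg, retriever)

# Collect the pseudo-labelled samples for every document.
train_examples = []
for idx, doc in enumerate(document_store):
    output_samples = plg.run(documents=[doc])
    for item in output_samples:
        train_examples.append(item)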
Instead of a QuestionGenerator, you can pass a list of question-document pairs, each one a dictionary, for which you want to generate pseudo labels:
from typing import Dict, List

# Each dictionary holds one question and the text of the document it belongs to.
question_doc_pairs: List[Dict[str, str]] = [{"question": "question text", "document": "document text"}]
plg = PseudoLabelGenerator(question_doc_pairs, retriever)

train_examples = []
for idx, doc in enumerate(document_store):
    output_samples = plg.run(documents=[doc])
    for item in output_samples:
        train_examples.append(item)
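Because the pairs are plain dictionaries, a malformed entry is easy to miss until label generation fails. As a hedged sketch, the pairs can be checked before they are handed to the PseudoLabelGenerator. The helper `validate_question_doc_pairs` below is hypothetical and not part of Haystack; it only assumes the `"question"` and `"document"` keys shown above.

```python
from typing import Dict, List


def validate_question_doc_pairs(pairs: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return the pairs unchanged if they are well formed, else raise ValueError.

    Hypothetical helper: checks that every item is a dictionary with
    non-empty string values under "question" and "document".
    """
    if not isinstance(pairs, list) or not pairs:
        raise ValueError("Expected a non-empty list of dictionaries.")
    for i, pair in enumerate(pairs):
        if not isinstance(pair, dict):
            raise ValueError(f"Item {i} is not a dictionary.")
        for key in ("question", "document"):
            value = pair.get(key)
            if not isinstance(value, str) or not value.strip():
                raise ValueError(f"Item {i} needs a non-empty string under '{key}'.")
    return pairs


question_doc_pairs = validate_question_doc_pairs(
    [{"question": "question text", "document": "document text"}]
)
```

Running the check up front surfaces a missing key or an empty string immediately, before any model is loaded.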
For more information, see the GPL Guide.