Creates training data for dense retrievers without human annotation.
Module pseudo_label_generator
PseudoLabelGenerator
class PseudoLabelGenerator(BaseComponent)
PseudoLabelGenerator is a component that creates Generative Pseudo Labeling (GPL) training data for dense retrievers.
GPL is an unsupervised domain adaptation method for training dense retrievers. It is based on question generation and pseudo labeling with powerful cross-encoders. To train a domain-adapted model, it needs access to an unlabeled target corpus, usually through a DocumentStore, and a Retriever to mine for negatives.
For more details, see GPL.
For example:
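from haystack.document_stores import ElasticsearchDocumentStore
from haystack.nodes import BM25Retriever, PseudoLabelGenerator, QuestionGenerator

document_store = ElasticsearchDocumentStore(...)
retriever = BM25Retriever(...)
qg = QuestionGenerator(model_name_or_path="doc2query/msmarco-t5-base-v1")
plg = PseudoLabelGenerator(qg, retriever)
# run() returns a tuple: the output dictionary and an output identifier
output, output_id = plg.run(documents=document_store.get_all_documents())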
Notes:
The default question generation and cross-encoder models were trained on English corpora. To apply GPL to documents in other languages, use question generation and cross-encoder models trained for the target language.
As of this writing, German question generation and cross-encoder models are available, as well as question generation and cross-encoder models trained on fourteen languages.
PseudoLabelGenerator.__init__
def __init__(question_producer: Union[QuestionGenerator, List[Dict[str, str]]],
             retriever,
             cross_encoder_model_name_or_path: str = "cross-encoder/ms-marco-MiniLM-L-6-v2",
             max_questions_per_document: int = 3,
             top_k: int = 50,
             batch_size: int = 16,
             progress_bar: bool = True,
             use_auth_token: Optional[Union[str, bool]] = None,
             use_gpu: bool = True,
             devices: Optional[List[Union[str, torch.device]]] = None)
Loads the cross-encoder model and prepares PseudoLabelGenerator.
Arguments:
- question_producer (Union[QuestionGenerator, List[Dict[str, str]]]): The question producer used to generate questions, or a list of already produced question/document pairs in the dictionary format {"question": "question text ...", "document": "document text ..."}.
- retriever (BaseRetriever): The Retriever used to query document stores.
- cross_encoder_model_name_or_path (str (optional)): The path to the cross-encoder model, defaults to cross-encoder/ms-marco-MiniLM-L-6-v2.
- max_questions_per_document (int): The maximum number of questions generated per document, defaults to 3.
- top_k (int (optional)): The number of documents retrieved for each question, defaults to 50.
- batch_size (int (optional)): The number of documents to process at a time.
- progress_bar (bool (optional)): Whether to show a progress bar, defaults to True.
- use_auth_token (Union[str, bool] (optional)): The API token used to download private models from Hugging Face. If this parameter is set to True, the token generated when running transformers-cli login (stored in ~/.huggingface) is used. For additional information, see https://huggingface.co/transformers/main_classes/model.html#transformers.PreTrainedModel.from_pretrained
- devices: A list of torch devices (for example cuda, cpu, mps) that limits CrossEncoder inference to specific devices. The list can contain torch device objects and/or strings (for example [torch.device('cuda:0'), "mps", "cuda:1"]). When use_gpu=False is specified, the devices parameter is not used and a single cpu device is used for inference.
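As an illustration of the second form of question_producer, the sketch below builds a small list of already produced question/document pairs in the documented dictionary format and checks its shape. The example texts and the validate_pairs helper are hypothetical and are not part of the library.

```python
from typing import Dict, List

# Hypothetical, hand-written pairs in the documented
# {"question": ..., "document": ...} format.
question_doc_pairs: List[Dict[str, str]] = [
    {
        "question": "What is GPL?",
        "document": "GPL is an unsupervised domain adaptation method.",
    },
    {
        "question": "Who mines the negatives?",
        "document": "A Retriever mines negative documents.",
    },
]


def validate_pairs(pairs: List[Dict[str, str]]) -> bool:
    """Return True if every entry has exactly the keys 'question' and
    'document', each mapped to a non-empty string (illustrative helper)."""
    return all(
        set(pair) == {"question", "document"}
        and all(isinstance(value, str) and value for value in pair.values())
        for pair in pairs
    )


pairs_ok = validate_pairs(question_doc_pairs)
```

Going by the signature above, a list like this would take the place of the QuestionGenerator as the first argument, for example PseudoLabelGenerator(question_doc_pairs, retriever).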
PseudoLabelGenerator.generate_questions
def generate_questions(
documents: List[Document],
batch_size: Optional[int] = None) -> List[Dict[str, str]]
It takes a list of documents and generates a list of question-document pairs.
Arguments:
- documents (List[Document]): A list of documents to generate questions from.
- batch_size (Optional[int]): The number of documents to process at a time.
Returns:
A list of question-document pairs.
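The shape of this step can be sketched in plain Python. The block below is a minimal illustration, not the library's implementation: documents are plain strings rather than Document objects, and toy_generate is a hypothetical stand-in for a question generation model. It shows how a max_questions_per_document cap limits the number of pairs produced for each document.

```python
from typing import Callable, Dict, List


def pair_questions_with_documents(
    documents: List[str],
    generate: Callable[[str], List[str]],
    max_questions_per_document: int = 3,
) -> List[Dict[str, str]]:
    """Build question/document pairs, keeping at most
    max_questions_per_document questions for each document."""
    pairs: List[Dict[str, str]] = []
    for text in documents:
        for question in generate(text)[:max_questions_per_document]:
            pairs.append({"question": question, "document": text})
    return pairs


def toy_generate(text: str) -> List[str]:
    """Hypothetical question generator: always returns four template questions."""
    topic = text.split()[0]
    return [
        f"What is {topic}?",
        f"Why {topic}?",
        f"How does {topic} work?",
        f"Who uses {topic}?",
    ]


pairs = pair_questions_with_documents(
    ["GPL adapts retrievers.", "BM25 ranks documents."], toy_generate
)
```

With two documents and the default cap of 3, the toy generator's fourth question is dropped for each document.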
PseudoLabelGenerator.mine_negatives
def mine_negatives(question_doc_pairs: List[Dict[str, str]],
batch_size: Optional[int] = None) -> List[Dict[str, str]]
Given a list of question and positive document pairs, this function returns a list of question/positive document/negative document dictionaries.
Arguments:
- question_doc_pairs (List[Dict[str, str]]): A list of question/positive document pairs.
- batch_size (int (optional)): The number of queries to run in a batch.
Returns:
A list of dictionaries, where each dictionary contains the question, positive document, and negative document.
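One plausible way to mine negatives is sketched below, under stated assumptions: the retriever is replaced by a hypothetical word-overlap ranker (toy_retrieve) over a three-sentence corpus, and the negative is picked at random from the retrieved candidates that differ from the positive document. The actual selection strategy used by the component may differ.

```python
import random
from typing import Callable, Dict, List

# Hypothetical toy corpus standing in for a document store.
CORPUS = [
    "GPL adapts dense retrievers to new domains.",
    "BM25 is a lexical ranking function.",
    "Cross-encoders score query and document pairs jointly.",
]


def toy_retrieve(query: str, top_k: int) -> List[str]:
    """Hypothetical retriever: rank corpus entries by word overlap with the query."""
    query_words = set(query.lower().split())
    ranked = sorted(
        CORPUS,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]


def mine_negatives_sketch(
    question_doc_pairs: List[Dict[str, str]],
    retrieve: Callable[[str, int], List[str]],
    top_k: int = 50,
    seed: int = 0,
) -> List[Dict[str, str]]:
    """For each pair, pick a retrieved document that is not the positive one."""
    rng = random.Random(seed)
    mined = []
    for pair in question_doc_pairs:
        candidates = [
            doc for doc in retrieve(pair["question"], top_k) if doc != pair["document"]
        ]
        if not candidates:
            continue  # nothing but the positive was retrieved; skip this pair
        mined.append(
            {
                "question": pair["question"],
                "pos_doc": pair["document"],
                "neg_doc": rng.choice(candidates),
            }
        )
    return mined


mined = mine_negatives_sketch(
    [{"question": "what adapts dense retrievers", "document": CORPUS[0]}],
    toy_retrieve,
    top_k=2,
)
```

Retrieved documents that rank high for the question but are not the positive make harder negatives than random corpus documents, which is why the candidates come from the retriever.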
PseudoLabelGenerator.generate_margin_scores
def generate_margin_scores(mined_negatives: List[Dict[str, str]],
batch_size: Optional[int] = None) -> List[Dict]
Given a list of mined negatives, this function predicts the score margin between the positive and negative document using the cross-encoder.
The function returns a list of examples, where each example is a dictionary with the following keys:
- question: The question string.
- pos_doc: Positive document string (the document containing the answer).
- neg_doc: Negative document string (the document that doesn't contain the answer).
- score: The margin between the score for question-positive document pair and the score for question-negative document pair.
Arguments:
- mined_negatives (List[Dict[str, str]]): The list of mined negatives.
- batch_size (int (optional)): The number of mined negative lists to run in a batch.
Returns:
A list of dictionaries, each of which has the following keys:
- question: The question string
- pos_doc: Positive document string
- neg_doc: Negative document string
- score: The score margin
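The margin itself is a simple difference: the score of the question-positive pair minus the score of the question-negative pair. The sketch below illustrates this arithmetic with a hypothetical word-overlap scorer (toy_score) standing in for the cross-encoder; a real cross-encoder would produce learned relevance scores instead.

```python
from typing import Callable, Dict, List


def toy_score(question: str, document: str) -> float:
    """Hypothetical relevance score: number of words shared by question and document."""
    return float(len(set(question.lower().split()) & set(document.lower().split())))


def add_margin_scores(
    mined_negatives: List[Dict[str, str]],
    score: Callable[[str, str], float],
) -> List[Dict]:
    """Attach score = score(question, pos_doc) - score(question, neg_doc) to each example."""
    return [
        {
            **example,
            "score": score(example["question"], example["pos_doc"])
            - score(example["question"], example["neg_doc"]),
        }
        for example in mined_negatives
    ]


scored = add_margin_scores(
    [
        {
            "question": "what is dense retrieval",
            "pos_doc": "dense retrieval is embedding search",
            "neg_doc": "bm25 is lexical",
        }
    ],
    toy_score,
)
# The positive shares 3 words with the question and the negative shares 1,
# so the margin is 3.0 - 1.0 = 2.0.
```

A large positive margin means the scorer clearly prefers the positive document; a margin near zero or below flags a negative that is nearly as relevant as, or more relevant than, the positive.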
PseudoLabelGenerator.generate_pseudo_labels
def generate_pseudo_labels(
documents: List[Document],
batch_size: Optional[int] = None) -> Tuple[dict, str]
Given a list of documents, this function generates a list of question-document pairs, mines for negatives, and scores a positive/negative margin with the cross-encoder. The output is the training data for the adaptation of dense retriever models.
Arguments:
- documents (List[Document]): The list of documents to mine negatives from.
- batch_size (Optional[int]): The number of documents to process in a batch.
Returns:
A tuple whose first element is a dictionary with a single key 'gpl_labels' representing a list of dictionaries, where each dictionary contains the following keys:
- question: The question string.
- pos_doc: Positive document for the given question.
- neg_doc: Negative document for the given question.
- score: The margin between the score for question-positive document pair and the score for question-negative document pair.
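To make the output shape concrete, the sketch below builds a hand-written dictionary with the documented 'gpl_labels' key and flattens it into (question, pos_doc, neg_doc, score) tuples. The example values and the to_training_tuples helper are hypothetical; how the labels are passed on to a retriever's training routine is not covered here.

```python
from typing import Dict, List, Tuple

# Hypothetical output in the documented shape; the values are made up.
output: Dict[str, List[Dict]] = {
    "gpl_labels": [
        {
            "question": "What is GPL?",
            "pos_doc": "GPL is a domain adaptation method.",
            "neg_doc": "BM25 is a ranking function.",
            "score": 7.5,
        },
        {
            "question": "What is BM25?",
            "pos_doc": "BM25 is a ranking function.",
            "neg_doc": "GPL is a domain adaptation method.",
            "score": 6.0,
        },
    ]
}


def to_training_tuples(labels: Dict[str, List[Dict]]) -> List[Tuple[str, str, str, float]]:
    """Flatten the 'gpl_labels' list into (question, pos_doc, neg_doc, score) tuples."""
    return [
        (label["question"], label["pos_doc"], label["neg_doc"], label["score"])
        for label in labels["gpl_labels"]
    ]


training_tuples = to_training_tuples(output)
```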