TopPSampler
Uses nucleus sampling to filter documents. Useful in combination with WebRetriever to choose documents that are diverse but relevant to the query.
What's nucleus (top_p) sampling?
Nucleus sampling is expressed in the top_p parameter used in generative question answering. It controls the level of randomness and diversity in the generated text.
When top_p is set to a high value, the model is more likely to generate diverse and creative outputs. When set to a low value, the model is more likely to generate predictable and less risky outputs.
Nucleus sampling is often used in combination with other parameters, such as temperature and top_k, to achieve a balance between creativity and coherence in the generated text.
See also Model Parameters.
While nucleus, or top p, sampling is usually mentioned in the context of the next token selection in generative NLP models, we can also use it to filter documents based on the cumulative probability of the similarity scores between the query and the documents.
In this context, top p sampling selects a diverse subset of the documents most relevant to the query while removing unrelated documents. The technique converts the similarity scores between the query and the documents into probabilities, ranks the documents from most to least probable, and keeps the top-ranked documents whose cumulative probability stays within the top_p threshold.
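The filtering step described here can be sketched in plain Python. This is a minimal illustration, not TopPSampler's actual code: the function name top_p_filter is hypothetical, the softmax conversion is an assumption about how scores become probabilities, and falling back to the single best document when nothing fits the threshold is also an assumption.

```python
import math


def top_p_filter(scores, top_p):
    """Return the indices of the items kept by top-p filtering, best first.

    Hypothetical helper for illustration; not part of any library.
    """
    if not scores:
        return []
    # Softmax turns raw similarity scores into probabilities
    # (the maximum is subtracted first for numerical stability).
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Rank the items from most to least probable.
    order = sorted(range(len(scores)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        cumulative += probs[i]
        if cumulative > top_p:
            break  # adding this item would exceed the threshold
        kept.append(i)
    # Assumption: keep the single best item rather than return nothing.
    return kept or [order[0]]
```

For example, if four documents end up with probabilities of 0.5, 0.3, 0.15, and 0.05, a top_p of 0.9 keeps the first two (cumulative probability 0.8), because adding the third would bring the total to 0.95.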
By default, TopPSampler uses the ms-marco-MiniLM-L-6-v2 model, but you can replace it with any other cross-encoder model. For a full list of models, see Hugging Face.
Usage
TopPSampler is used in combination with other nodes, such as WebRetriever, to limit the number of results they return. Here's an example of TopPSampler in a pipeline:
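from haystack import Pipeline
from haystack.nodes import TopPSampler
from haystack.nodes.retriever.web import WebRetriever

# Fetch web results for the query as preprocessed documents
retriever = WebRetriever(api_key="<your_api_key_here>", mode="preprocessed_documents")
# Keep only the top-ranked documents within a cumulative probability of 0.95
sampler = TopPSampler(top_p=0.95)
p = Pipeline()
p.add_node(component=retriever, name="Retriever", inputs=["Query"])
p.add_node(component=sampler, name="Sampler", inputs=["Retriever"])
print(p.run(query="What's the secret of the Universe?"))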