PineconeEmbeddingRetriever
An embedding-based Retriever compatible with the Pinecone Document Store.
Most common position in a pipeline | 1. After a Text Embedder and before a PromptBuilder in a RAG pipeline 2. As the last component in a semantic search pipeline 3. After a Text Embedder and before an ExtractiveReader in an extractive QA pipeline |
Mandatory init variables | "document_store": An instance of a PineconeDocumentStore |
Mandatory run variables | "query_embedding": A vector representing the query (a list of floats) |
Output variables | "documents": A list of documents |
API reference | Pinecone |
GitHub link | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/pinecone |
Overview
The PineconeEmbeddingRetriever is an embedding-based Retriever compatible with the PineconeDocumentStore. It compares the query embedding with the Document embeddings and fetches the Documents most similar to the query from the PineconeDocumentStore.
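To make the comparison step concrete, here is a minimal pure-Python sketch of ranking Documents by cosine similarity to a query embedding. It is only an illustration of the idea: the three-dimensional vectors and the `cosine_similarity` helper are made up for this example, and in practice the similarity search runs inside Pinecone rather than in your own code.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 3-dimensional embeddings, invented for illustration only
document_embeddings = {
    "languages": [0.9, 0.1, 0.0],
    "elephants": [0.1, 0.9, 0.1],
    "bioluminescence": [0.0, 0.2, 0.9],
}
query_embedding = [0.8, 0.2, 0.1]

# Sort document names from most to least similar to the query
ranked = sorted(
    document_embeddings,
    key=lambda name: cosine_similarity(query_embedding, document_embeddings[name]),
    reverse=True,
)

top_k = 2
print(ranked[:top_k])  # ['languages', 'elephants']
```

The Retriever returns the same kind of ranking, with each Document carrying its similarity score.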
When using the PineconeEmbeddingRetriever in your NLP system, make sure both the query and the Document embeddings are available. You can do so by adding a Document Embedder to your indexing Pipeline and a Text Embedder to your query Pipeline. Use the same embedding model in both so that the vectors are comparable.
In addition to the query_embedding, the PineconeEmbeddingRetriever accepts other optional parameters, including top_k (the maximum number of Documents to retrieve) and filters to narrow down the search space.
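As a rough sketch of how these optional parameters fit together, the snippet below builds the run inputs as a plain dictionary. The filter follows the Haystack 2.x filter syntax of field, operator, and value; the metadata fields `meta.topic` and `meta.year` are hypothetical and only make sense if your Documents carry such metadata.

```python
# Hypothetical metadata fields ("meta.topic", "meta.year"); adapt to your own Documents.
filters = {
    "operator": "AND",
    "conditions": [
        {"field": "meta.topic", "operator": "==", "value": "languages"},
        {"field": "meta.year", "operator": ">=", "value": 2020},
    ],
}

run_inputs = {
    "query_embedding": [0.1] * 768,  # dummy vector standing in for a real query embedding
    "filters": filters,              # only Documents matching both conditions are searched
    "top_k": 3,                      # return at most three Documents
}

# With a configured retriever you would then call:
# result = retriever.run(**run_inputs)
```

Keeping the inputs in one dictionary makes it easy to reuse the same filter in a Pipeline, where they would be passed under the Retriever's component name.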
Some parameters that affect embedding retrieval must be set when the corresponding PineconeDocumentStore is initialized. These include the dimension of the embeddings and the distance metric to use.
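Because the dimension is fixed at initialization, a query embedding of a different size cannot be compared against the index. The sketch below shows one way to catch a mismatch early. It assumes the distance metric is passed to the store under the keyword `metric`, and `check_dimension` is a hypothetical helper written for this example, not part of the integration.

```python
# Settings that would be passed to PineconeDocumentStore(...).
# Assumption: the keyword for the distance metric is `metric`.
store_settings = {
    "index": "my_index",
    "namespace": "my_namespace",
    "dimension": 768,
    "metric": "cosine",
}


def check_dimension(embedding, expected_dimension):
    """Hypothetical helper: fail early if the vector size does not match the index."""
    if len(embedding) != expected_dimension:
        raise ValueError(
            f"Embedding has {len(embedding)} dimensions, "
            f"but the index expects {expected_dimension}."
        )
    return embedding


# A 768-dimensional vector passes; a 384-dimensional one would raise ValueError
query_embedding = check_dimension([0.1] * 768, store_settings["dimension"])
```

A mismatch usually means the Document Embedder and the Text Embedder use a different model than the one the index was created for.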
Usage
On its own
This Retriever needs the PineconeDocumentStore and indexed Documents to run.
from haystack_integrations.components.retrievers.pinecone import PineconeEmbeddingRetriever
from haystack_integrations.document_stores.pinecone import PineconeDocumentStore
# Make sure you have the PINECONE_API_KEY environment variable set
document_store = PineconeDocumentStore(index="my_index_with_documents",
                                       namespace="my_namespace",
                                       dimension=768)
retriever = PineconeEmbeddingRetriever(document_store=document_store)
# Use a dummy vector to keep the example simple
result = retriever.run(query_embedding=[0.1] * 768)
print(result["documents"])
In a pipeline
Install the dependencies you’ll need:
pip install pinecone-haystack
pip install sentence-transformers
Use this Retriever in a query Pipeline like this:
from haystack.document_stores.types import DuplicatePolicy
from haystack import Document
from haystack import Pipeline
from haystack.components.embedders import SentenceTransformersTextEmbedder, SentenceTransformersDocumentEmbedder
from haystack_integrations.components.retrievers.pinecone import PineconeEmbeddingRetriever
from haystack_integrations.document_stores.pinecone import PineconeDocumentStore
# Make sure you have the PINECONE_API_KEY environment variable set
document_store = PineconeDocumentStore(index="my_index",
                                       namespace="my_namespace",
                                       dimension=768)
documents = [
    Document(content="There are over 7,000 languages spoken around the world today."),
    Document(content="Elephants have been observed to behave in a way that indicates a high level of self-awareness, such as recognizing themselves in mirrors."),
    Document(content="In certain parts of the world, like the Maldives, Puerto Rico, and San Diego, you can witness the phenomenon of bioluminescent waves."),
]
document_embedder = SentenceTransformersDocumentEmbedder()
document_embedder.warm_up()
documents_with_embeddings = document_embedder.run(documents)
document_store.write_documents(documents_with_embeddings.get("documents"), policy=DuplicatePolicy.OVERWRITE)
query_pipeline = Pipeline()
query_pipeline.add_component("text_embedder", SentenceTransformersTextEmbedder())
query_pipeline.add_component("retriever", PineconeEmbeddingRetriever(document_store=document_store))
query_pipeline.connect("text_embedder.embedding", "retriever.query_embedding")
query = "How many languages are there?"
result = query_pipeline.run({"text_embedder": {"text": query}})
print(result['retriever']['documents'][0])
The example output would be:
Document(id=cfe93bc1c274908801e6670440bf2bbba54fad792770d57421f85ffa2a4fcc94, content: 'There are over 7,000 languages spoken around the world today.', score: 0.87717235, embedding: vector of size 768)