Name	PgvectorEmbeddingRetriever
Path	https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/pgvector/
Most common Position in a Pipeline	1. After a Text Embedder and before a `PromptBuilder` in a RAG Pipeline; 2. The last component in a semantic search Pipeline; 3. After a Text Embedder and before an `ExtractiveReader` in an ExtractiveQA Pipeline
Mandatory Input variables	“query_embedding”: a vector representing the query (a list of floats)
Output variables	“documents”: a list of Documents

Overview

The PgvectorEmbeddingRetriever is an embedding-based Retriever compatible with the PgvectorDocumentStore. It compares the query embedding with the Document embeddings and fetches from the PgvectorDocumentStore the Documents whose embeddings are most similar to the query.
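To make the comparison step concrete, here is a minimal pure-Python sketch of embedding-based retrieval. It is not the integration's implementation: the real Retriever delegates the similarity search to pgvector inside PostgreSQL. The helper names (`cosine_similarity`, `retrieve`) and the toy three-dimensional vectors are assumptions made for illustration only.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query_embedding, docs, top_k=10):
    """Score every (content, embedding) pair against the query, best first."""
    scored = [
        (cosine_similarity(query_embedding, embedding), content)
        for content, embedding in docs
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]

# Toy 3-dimensional embeddings; real embedding models produce hundreds of dimensions.
toy_docs = [
    ("languages", [0.9, 0.1, 0.0]),
    ("elephants", [0.1, 0.9, 0.0]),
    ("bioluminescence", [0.0, 0.2, 0.9]),
]

top = retrieve([1.0, 0.0, 0.0], toy_docs, top_k=2)
for score, content in top:
    print(f"{content}: {score:.3f}")
```

The query vector points in almost the same direction as the "languages" embedding, so that entry ranks first, and `top_k=2` cuts the list after the second-best match.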

When using the PgvectorEmbeddingRetriever in your Pipeline, make sure it has the query and Document embeddings available. You can do so by adding a Document Embedder to your indexing Pipeline and a Text Embedder to your query Pipeline.

In addition to the query_embedding, the PgvectorEmbeddingRetriever accepts other optional parameters, including top_k (the maximum number of Documents to retrieve) and filters to narrow down the search space.
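As a sketch of how these optional parameters might be passed, the snippet below builds a filter in the Haystack 2.x dictionary syntax and bundles it with `top_k`. The metadata fields `meta.language` and `meta.year` are hypothetical, so substitute the fields your own Documents carry. The actual `retriever.run` call is left as a comment because it needs a running database.

```python
# Hypothetical metadata fields ("language", "year"); adjust to your own Documents.
filters = {
    "operator": "AND",
    "conditions": [
        {"field": "meta.language", "operator": "==", "value": "en"},
        {"field": "meta.year", "operator": ">=", "value": 2020},
    ],
}

run_kwargs = {
    "query_embedding": [0.1] * 768,  # fake vector, for illustration only
    "filters": filters,              # narrows the search space before ranking
    "top_k": 3,                      # return at most three Documents
}

# With a running database and an initialized retriever, the call would be:
# result = retriever.run(**run_kwargs)
print(sorted(run_kwargs))
```

Both parameters can also be set once when the Retriever is created. Values passed to `run` apply to that call only.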

Some parameters that affect embedding retrieval must be set when the corresponding PgvectorDocumentStore is initialized. These include the embedding dimension, the vector function, and several others related to the search strategy (exact nearest neighbor or HNSW).
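To illustrate what the choice of vector function means, the following pure-Python sketch computes the three measures that the `vector_function` values `cosine_similarity`, `inner_product`, and `l2_distance` refer to. This only shows the arithmetic on two toy vectors. In practice pgvector evaluates these inside PostgreSQL, and the function names below are local helpers, not part of the integration's API.

```python
import math

def inner_product(a, b):
    """Sum of element-wise products; higher means more similar."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Inner product of the vectors after scaling each to unit length."""
    return inner_product(a, b) / (
        math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    )

def l2_distance(a, b):
    """Euclidean distance; lower means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

query = [3.0, 4.0]
doc = [4.0, 3.0]

scores = {
    "inner_product": inner_product(query, doc),
    "cosine_similarity": cosine_similarity(query, doc),
    "l2_distance": l2_distance(query, doc),
}
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

Note the difference in direction: for `l2_distance` a smaller value indicates a closer match, while for the other two a larger value does. Cosine similarity ignores vector length, so it is a common choice when embeddings are not normalized.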

Installation

To quickly set up a PostgreSQL database with pgvector, you can use Docker:

docker run -d -p 5432:5432 -e POSTGRES_USER=postgres -e POSTGRES_PASSWORD=postgres -e POSTGRES_DB=postgres ankane/pgvector

For more information on installing pgvector, visit the pgvector GitHub repository.

To use pgvector with Haystack, install the pgvector-haystack integration:

pip install pgvector-haystack

Usage

On its own

This Retriever needs the PgvectorDocumentStore and indexed Documents to run.

import os
from haystack_integrations.document_stores.pgvector import PgvectorDocumentStore
from haystack_integrations.components.retrievers.pgvector import PgvectorEmbeddingRetriever

os.environ["PG_CONN_STR"] = "postgresql://postgres:postgres@localhost:5432/postgres"

document_store = PgvectorDocumentStore()
retriever = PgvectorEmbeddingRetriever(document_store=document_store)

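# Use a fake vector to keep the example simple; its length matches
# the store's default embedding dimension (768)
result = retriever.run(query_embedding=[0.1] * 768)
print(result["documents"])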

In a Pipeline

import os
from haystack.document_stores.types import DuplicatePolicy
from haystack import Document, Pipeline
from haystack.components.embedders import SentenceTransformersTextEmbedder, SentenceTransformersDocumentEmbedder

from haystack_integrations.document_stores.pgvector import PgvectorDocumentStore
from haystack_integrations.components.retrievers.pgvector import PgvectorEmbeddingRetriever

os.environ["PG_CONN_STR"] = "postgresql://postgres:postgres@localhost:5432/postgres"

document_store = PgvectorDocumentStore(
    embedding_dimension=768,
    vector_function="cosine_similarity",
    recreate_table=True,
)

documents = [
    Document(content="There are over 7,000 languages spoken around the world today."),
    Document(content="Elephants have been observed to behave in a way that indicates a high level of self-awareness, such as recognizing themselves in mirrors."),
    Document(content="In certain parts of the world, like the Maldives, Puerto Rico, and San Diego, you can witness the phenomenon of bioluminescent waves."),
]

document_embedder = SentenceTransformersDocumentEmbedder()
document_embedder.warm_up()
documents_with_embeddings = document_embedder.run(documents)

document_store.write_documents(documents_with_embeddings.get("documents"), policy=DuplicatePolicy.OVERWRITE)

query_pipeline = Pipeline()
query_pipeline.add_component("text_embedder", SentenceTransformersTextEmbedder())
query_pipeline.add_component("retriever", PgvectorEmbeddingRetriever(document_store=document_store))
query_pipeline.connect("text_embedder.embedding", "retriever.query_embedding")

query = "How many languages are there?"

result = query_pipeline.run({"text_embedder": {"text": query}})

print(result['retriever']['documents'][0])