
Pgvector integration for Haystack

Module haystack_integrations.components.retrievers.pgvector.embedding_retriever

PgvectorEmbeddingRetriever

Retrieves documents from the PgvectorDocumentStore, based on their dense embeddings.

Example usage:

from haystack.document_stores.types import DuplicatePolicy
from haystack import Document, Pipeline
from haystack.components.embedders import SentenceTransformersTextEmbedder, SentenceTransformersDocumentEmbedder

from haystack_integrations.document_stores.pgvector import PgvectorDocumentStore
from haystack_integrations.components.retrievers.pgvector import PgvectorEmbeddingRetriever

# Set an environment variable `PG_CONN_STR` with the connection string to your PostgreSQL database.
# e.g., "postgresql://USER:PASSWORD@HOST:PORT/DB_NAME"

document_store = PgvectorDocumentStore(
    embedding_dimension=768,
    vector_function="cosine_similarity",
    recreate_table=True,
)

documents = [Document(content="There are over 7,000 languages spoken around the world today."),
             Document(content="Elephants have been observed to behave in a way that indicates..."),
             Document(content="In certain places, you can witness the phenomenon of bioluminescent waves.")]

document_embedder = SentenceTransformersDocumentEmbedder()
document_embedder.warm_up()
documents_with_embeddings = document_embedder.run(documents)

document_store.write_documents(documents_with_embeddings.get("documents"), policy=DuplicatePolicy.OVERWRITE)

query_pipeline = Pipeline()
query_pipeline.add_component("text_embedder", SentenceTransformersTextEmbedder())
query_pipeline.add_component("retriever", PgvectorEmbeddingRetriever(document_store=document_store))
query_pipeline.connect("text_embedder.embedding", "retriever.query_embedding")

query = "How many languages are there?"

res = query_pipeline.run({"text_embedder": {"text": query}})

assert res['retriever']['documents'][0].content == "There are over 7,000 languages spoken around the world today."

PgvectorEmbeddingRetriever.__init__

def __init__(*,
             document_store: PgvectorDocumentStore,
             filters: Optional[Dict[str, Any]] = None,
             top_k: int = 10,
             vector_function: Optional[Literal["cosine_similarity",
                                               "inner_product",
                                               "l2_distance"]] = None)

Arguments:

  • document_store: An instance of `PgvectorDocumentStore`.
  • filters: Filters applied to the retrieved Documents.
  • top_k: Maximum number of Documents to return.
  • vector_function: The similarity function to use when searching for similar embeddings. Defaults to the one set in the document_store instance. "cosine_similarity" and "inner_product" are similarity functions and higher scores indicate greater similarity between the documents. "l2_distance" returns the straight-line distance between vectors, and the most similar documents are the ones with the smallest score. Important: if the document store is using the "hnsw" search strategy, the vector function should match the one utilized during index creation to take advantage of the index.

Raises:

  • ValueError: If document_store is not an instance of PgvectorDocumentStore or if vector_function is not one of the valid options.
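
To make the scoring conventions of vector_function concrete, the sketch below computes the three measures in plain Python for toy vectors and ranks candidates accordingly. It does not call pgvector or Haystack; the helper names (including rank) and the sample vectors are illustrative assumptions, not part of the integration.

```python
import math
from typing import Callable, Dict, List

Vector = List[float]


def inner_product(a: Vector, b: Vector) -> float:
    # Similarity: higher means more similar.
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a: Vector, b: Vector) -> float:
    # Similarity: inner product of the length-normalized vectors.
    norm_a = math.sqrt(inner_product(a, a))
    norm_b = math.sqrt(inner_product(b, b))
    return inner_product(a, b) / (norm_a * norm_b)


def l2_distance(a: Vector, b: Vector) -> float:
    # Distance: lower means more similar.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


SCORERS: Dict[str, Callable[[Vector, Vector], float]] = {
    "cosine_similarity": cosine_similarity,
    "inner_product": inner_product,
    "l2_distance": l2_distance,
}


def rank(query: Vector, candidates: List[Vector], vector_function: str) -> List[Vector]:
    """Order candidates from most to least similar (hypothetical helper)."""
    score = SCORERS[vector_function]
    # Similarities sort descending; the distance sorts ascending.
    descending = vector_function != "l2_distance"
    return sorted(candidates, key=lambda c: score(query, c), reverse=descending)


query = [1.0, 0.0]
close = [0.9, 0.1]
far = [0.0, 1.0]

for name in SCORERS:
    best = rank(query, [far, close], name)[0]
    print(name, "->", best)
```

For all three functions the vector named close ranks first, even though its "l2_distance" score is the smallest of the two while its similarity scores are the largest.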

PgvectorEmbeddingRetriever.to_dict

def to_dict() -> Dict[str, Any]

Serializes the component to a dictionary.

Returns:

Dictionary with serialized data.

PgvectorEmbeddingRetriever.from_dict

@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "PgvectorEmbeddingRetriever"

Deserializes the component from a dictionary.

Arguments:

  • data: Dictionary to deserialize from.

Returns:

Deserialized component.
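
The to_dict and from_dict methods are designed to be used as a pair. A minimal sketch of a round trip follows; round_trip is a hypothetical helper, and it assumes only that the object passed in exposes the to_dict method and from_dict classmethod described above.

```python
from typing import Any, Dict


def round_trip(component: Any) -> Any:
    """Serialize a component and rebuild an equivalent instance from the result."""
    data: Dict[str, Any] = component.to_dict()
    # from_dict is a classmethod, so it is called on the component's class.
    return type(component).from_dict(data)
```

Called with an initialized retriever, this would return a new PgvectorEmbeddingRetriever built from the serialized data.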

PgvectorEmbeddingRetriever.run

@component.output_types(documents=List[Document])
def run(query_embedding: List[float],
        filters: Optional[Dict[str, Any]] = None,
        top_k: Optional[int] = None,
        vector_function: Optional[Literal["cosine_similarity", "inner_product",
                                          "l2_distance"]] = None)

Retrieve documents from the PgvectorDocumentStore, based on their embeddings.

Arguments:

  • query_embedding: Embedding of the query.
  • filters: Filters applied to the retrieved Documents.
  • top_k: Maximum number of Documents to return.
  • vector_function: The similarity function to use when searching for similar embeddings.

Returns:

A dictionary whose "documents" key holds the list of Documents similar to query_embedding.
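
The retriever can also be called outside a pipeline, overriding top_k and filters for a single query. The sketch below assumes that retriever is an initialized PgvectorEmbeddingRetriever and that run returns a dictionary with a "documents" key, as the output_types decorator and the pipeline example suggest. The helper name retrieve_contents is hypothetical.

```python
from typing import Any, Dict, List, Optional


def retrieve_contents(retriever: Any,
                      query_embedding: List[float],
                      top_k: int = 3,
                      filters: Optional[Dict[str, Any]] = None) -> List[str]:
    """Run a single retrieval and return only the content of each hit."""
    result = retriever.run(
        query_embedding=query_embedding,
        filters=filters,   # None means no per-call filters are passed
        top_k=top_k,       # per-call override of the top_k given at init time
    )
    return [doc.content for doc in result["documents"]]
```

The query embedding must have the same dimension as the embeddings stored in the document store (768 in the example above).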

Module haystack_integrations.document_stores.pgvector.document_store

PgvectorDocumentStore

A Document Store using PostgreSQL with the pgvector extension installed.

PgvectorDocumentStore.__init__

def __init__(*,
             connection_string: Secret = Secret.from_env_var("PG_CONN_STR"),
             table_name: str = "haystack_documents",
             embedding_dimension: int = 768,
             vector_function: Literal["cosine_similarity", "inner_product",
                                      "l2_distance"] = "cosine_similarity",
             recreate_table: bool = False,
             search_strategy: Literal["exact_nearest_neighbor",
                                      "hnsw"] = "exact_nearest_neighbor",
             hnsw_recreate_index_if_exists: bool = False,
             hnsw_index_creation_kwargs: Optional[Dict[str, int]] = None,
             hnsw_ef_search: Optional[int] = None)

Creates a new PgvectorDocumentStore instance.

It is meant to be connected to a PostgreSQL database with the pgvector extension installed. A specific table to store Haystack documents will be created if it doesn't exist yet.

Arguments:

  • connection_string: The connection string to use to connect to the PostgreSQL database, defined as an environment variable, e.g.: PG_CONN_STR="postgresql://USER:PASSWORD@HOST:PORT/DB_NAME"
  • table_name: The name of the table to use to store Haystack documents.
  • embedding_dimension: The dimension of the embedding.
  • vector_function: The similarity function to use when searching for similar embeddings. "cosine_similarity" and "inner_product" are similarity functions and higher scores indicate greater similarity between the documents. "l2_distance" returns the straight-line distance between vectors, and the most similar documents are the ones with the smallest score. Important: when using the "hnsw" search strategy, an index will be created that depends on the vector_function passed here. Make sure subsequent queries will keep using the same vector similarity function in order to take advantage of the index.
  • recreate_table: Whether to recreate the table if it already exists.
  • search_strategy: The search strategy to use when searching for similar embeddings. "exact_nearest_neighbor" provides perfect recall but can be slow for large numbers of documents. "hnsw" is an approximate nearest neighbor search strategy, which trades off some accuracy for speed; it is recommended for large numbers of documents. Important: when using the "hnsw" search strategy, an index will be created that depends on the vector_function passed here. Make sure subsequent queries will keep using the same vector similarity function in order to take advantage of the index.
  • hnsw_recreate_index_if_exists: Whether to recreate the HNSW index if it already exists. Only used if search_strategy is set to "hnsw".
  • hnsw_index_creation_kwargs: Additional keyword arguments to pass to the HNSW index creation. Only used if search_strategy is set to "hnsw". You can find the list of valid arguments in the pgvector documentation.
  • hnsw_ef_search: The ef_search parameter to use at query time. Only used if search_strategy is set to "hnsw". You can find more information about this parameter in the pgvector documentation.
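
As an illustration of how the HNSW-related arguments fit together, the sketch below collects them in a dictionary of keyword arguments. The numeric values for m, ef_construction, and hnsw_ef_search are placeholders chosen to mirror what I understand to be pgvector's defaults; treat them as assumptions and tune them for your data. The name HNSW_STORE_KWARGS is hypothetical.

```python
from typing import Any, Dict

# Keyword arguments for a PgvectorDocumentStore that uses approximate search.
HNSW_STORE_KWARGS: Dict[str, Any] = {
    "table_name": "haystack_documents",
    "embedding_dimension": 768,
    # The HNSW index is built for this function, so queries should keep using it.
    "vector_function": "cosine_similarity",
    "search_strategy": "hnsw",
    "hnsw_recreate_index_if_exists": False,
    # Index build parameters (placeholder values, assumed pgvector defaults).
    "hnsw_index_creation_kwargs": {"m": 16, "ef_construction": 64},
    # Query-time parameter (placeholder value, assumed pgvector default).
    "hnsw_ef_search": 40,
}

# With the integration installed and PG_CONN_STR set, the store would be created as:
# from haystack_integrations.document_stores.pgvector import PgvectorDocumentStore
# document_store = PgvectorDocumentStore(**HNSW_STORE_KWARGS)
```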

PgvectorDocumentStore.to_dict

def to_dict() -> Dict[str, Any]

Serializes the component to a dictionary.

Returns:

Dictionary with serialized data.

PgvectorDocumentStore.from_dict

@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "PgvectorDocumentStore"

Deserializes the component from a dictionary.

Arguments:

  • data: Dictionary to deserialize from.

Returns:

Deserialized component.

PgvectorDocumentStore.delete_table

def delete_table()

Deletes the table used to store Haystack documents. The name of the table (table_name) is defined when initializing the PgvectorDocumentStore.

PgvectorDocumentStore.count_documents

def count_documents() -> int

Returns how many documents are present in the document store.

PgvectorDocumentStore.filter_documents

def filter_documents(
        filters: Optional[Dict[str, Any]] = None) -> List[Document]

Returns the documents that match the filters provided.

For a detailed specification of the filters, refer to the Haystack documentation on metadata filtering.

Arguments:

  • filters: The filters to apply to the document list.

Raises:

  • TypeError: If filters is not a dictionary.

Returns:

A list of Documents that match the given filters.
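
A filter is a nested dictionary of comparisons, optionally combined with logical operators. The sketch below builds one in the Haystack filter syntax and passes it to filter_documents. The metadata fields language and year, and both helper names, are hypothetical; document_store is assumed to be an initialized PgvectorDocumentStore.

```python
from typing import Any, Dict, List


def build_filters(language: str, min_year: int) -> Dict[str, Any]:
    """Match documents in a given language published in or after min_year."""
    return {
        "operator": "AND",
        "conditions": [
            # Metadata fields are addressed with the "meta." prefix.
            {"field": "meta.language", "operator": "==", "value": language},
            {"field": "meta.year", "operator": ">=", "value": min_year},
        ],
    }


def find_documents(document_store: Any, language: str, min_year: int) -> List[Any]:
    """Return the documents in the store that satisfy both conditions."""
    return document_store.filter_documents(filters=build_filters(language, min_year))
```

The same dictionary can be passed as the filters argument of PgvectorEmbeddingRetriever to restrict the documents considered during retrieval.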

PgvectorDocumentStore.write_documents

def write_documents(documents: List[Document],
                    policy: DuplicatePolicy = DuplicatePolicy.NONE) -> int

Writes documents to the document store.

Arguments:

  • documents: A list of Documents to write to the document store.
  • policy: The duplicate policy to use when writing documents.

Raises:

  • DuplicateDocumentError: If a document with the same id already exists in the document store and the policy is set to DuplicatePolicy.FAIL (or not specified).

Returns:

The number of documents written to the document store.
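
Because write_documents returns the number of documents written, the counts can be summed when a large list is written in chunks. The following is a minimal sketch of that pattern; write_in_batches and its batch_size parameter are hypothetical and not part of the integration. The caller supplies the policy, for example DuplicatePolicy.OVERWRITE as in the usage example above.

```python
from typing import Any, Sequence


def write_in_batches(document_store: Any,
                     documents: Sequence[Any],
                     policy: Any,
                     batch_size: int = 100) -> int:
    """Write documents in fixed-size chunks and return the total number written."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")

    written = 0
    for start in range(0, len(documents), batch_size):
        batch = list(documents[start:start + batch_size])
        # Each call reports how many documents it wrote.
        written += document_store.write_documents(batch, policy=policy)
    return written
```

If a batch raises DuplicateDocumentError, earlier batches may already have been written, so this pattern offers no all-or-nothing guarantee across batches.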

PgvectorDocumentStore.delete_documents

def delete_documents(document_ids: List[str]) -> None

Deletes documents that match the provided document_ids from the document store.

Arguments:

  • document_ids: The ids of the documents to delete.