Pgvector integration for Haystack

Module haystack_integrations.components.retrievers.pgvector.embedding_retriever

PgvectorEmbeddingRetriever

Retrieves documents from the PgvectorDocumentStore, based on their dense embeddings.

Example usage:

from haystack.document_stores.types import DuplicatePolicy
from haystack import Document, Pipeline
from haystack.components.embedders import SentenceTransformersTextEmbedder, SentenceTransformersDocumentEmbedder

from haystack_integrations.document_stores.pgvector import PgvectorDocumentStore
from haystack_integrations.components.retrievers.pgvector import PgvectorEmbeddingRetriever

# Set an environment variable `PG_CONN_STR` with the connection string to your PostgreSQL database.
# e.g., "postgresql://USER:PASSWORD@HOST:PORT/DB_NAME"

document_store = PgvectorDocumentStore(
    embedding_dimension=768,
    vector_function="cosine_similarity",
    recreate_table=True,
)

documents = [Document(content="There are over 7,000 languages spoken around the world today."),
             Document(content="Elephants have been observed to behave in a way that indicates..."),
             Document(content="In certain places, you can witness the phenomenon of bioluminescent waves.")]

document_embedder = SentenceTransformersDocumentEmbedder()
document_embedder.warm_up()
documents_with_embeddings = document_embedder.run(documents)

document_store.write_documents(documents_with_embeddings.get("documents"), policy=DuplicatePolicy.OVERWRITE)

query_pipeline = Pipeline()
query_pipeline.add_component("text_embedder", SentenceTransformersTextEmbedder())
query_pipeline.add_component("retriever", PgvectorEmbeddingRetriever(document_store=document_store))
query_pipeline.connect("text_embedder.embedding", "retriever.query_embedding")

query = "How many languages are there?"

res = query_pipeline.run({"text_embedder": {"text": query}})

assert res['retriever']['documents'][0].content == "There are over 7,000 languages spoken around the world today."

PgvectorEmbeddingRetriever.__init__

def __init__(*,
             document_store: PgvectorDocumentStore,
             filters: Optional[Dict[str, Any]] = None,
             top_k: int = 10,
             vector_function: Optional[Literal["cosine_similarity",
                                               "inner_product",
                                               "l2_distance"]] = None,
             filter_policy: Union[str, FilterPolicy] = FilterPolicy.REPLACE)

Arguments:

  • document_store: An instance of PgvectorDocumentStore.
  • filters: Filters applied to the retrieved Documents.
  • top_k: Maximum number of Documents to return.
  • vector_function: The similarity function to use when searching for similar embeddings. Defaults to the one set in the document_store instance. "cosine_similarity" and "inner_product" are similarity functions and higher scores indicate greater similarity between the documents. "l2_distance" returns the straight-line distance between vectors, and the most similar documents are the ones with the smallest score. Important: if the document store is using the "hnsw" search strategy, the vector function should match the one utilized during index creation to take advantage of the index.
  • filter_policy: Policy to determine how filters are applied.

Raises:

  • ValueError: If document_store is not an instance of PgvectorDocumentStore or if vector_function is not one of the valid options.
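
To make the score semantics of vector_function concrete, here is a minimal sketch that computes the three functions in plain Python. It is an illustration only, not how pgvector or the integration computes scores; the helper names and the toy vectors are assumptions made for this example.

```python
import math
from typing import List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between a and b; higher means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def inner_product(a: List[float], b: List[float]) -> float:
    """Dot product of a and b; higher means more similar."""
    return sum(x * y for x, y in zip(a, b))


def l2_distance(a: List[float], b: List[float]) -> float:
    """Straight-line distance between a and b; lower means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


query = [1.0, 0.0]
candidates = {
    "far_same_direction": [10.0, 0.0],
    "near_off_axis": [1.0, 0.5],
}

# Similarity functions rank in descending order of score.
by_cosine = sorted(
    candidates, key=lambda name: cosine_similarity(query, candidates[name]), reverse=True
)
by_inner_product = sorted(
    candidates, key=lambda name: inner_product(query, candidates[name]), reverse=True
)
# A distance function ranks in ascending order of score.
by_l2 = sorted(candidates, key=lambda name: l2_distance(query, candidates[name]))
```

With these toy vectors, the two similarity functions put far_same_direction first, while l2_distance puts near_off_axis first, which is why the choice of function can change the ranking.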

PgvectorEmbeddingRetriever.to_dict

def to_dict() -> Dict[str, Any]

Serializes the component to a dictionary.

Returns:

Dictionary with serialized data.

PgvectorEmbeddingRetriever.from_dict

@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "PgvectorEmbeddingRetriever"

Deserializes the component from a dictionary.

Arguments:

  • data: Dictionary to deserialize from.

Returns:

Deserialized component.

PgvectorEmbeddingRetriever.run

@component.output_types(documents=List[Document])
def run(query_embedding: List[float],
        filters: Optional[Dict[str, Any]] = None,
        top_k: Optional[int] = None,
        vector_function: Optional[Literal["cosine_similarity", "inner_product",
                                          "l2_distance"]] = None)

Retrieve documents from the PgvectorDocumentStore, based on their embeddings.

Arguments:

  • query_embedding: Embedding of the query.
  • filters: Filters applied to the retrieved Documents. The way runtime filters are applied depends on the filter_policy chosen at retriever initialization. See init method docstring for more details.
  • top_k: Maximum number of Documents to return.
  • vector_function: The similarity function to use when searching for similar embeddings.

Returns:

A dictionary with the following keys:

  • documents: List of Documents similar to query_embedding.
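
The interaction between initialization-time and runtime arguments can be sketched as follows. This is a simplified model under stated assumptions, not the integration's code: resolve_filters_replace and resolve_top_k are hypothetical helpers, the sketch covers only a "replace"-style policy (the default FilterPolicy.REPLACE in the signature above), and it assumes that a runtime top_k of None falls back to the value given at initialization.

```python
from typing import Any, Dict, Optional


def resolve_filters_replace(
    init_filters: Optional[Dict[str, Any]],
    runtime_filters: Optional[Dict[str, Any]],
) -> Optional[Dict[str, Any]]:
    """Hypothetical model of a 'replace' policy: runtime filters win when provided."""
    return runtime_filters if runtime_filters else init_filters


def resolve_top_k(init_top_k: int, runtime_top_k: Optional[int]) -> int:
    """Hypothetical model: use the runtime top_k when given, else the init value."""
    return init_top_k if runtime_top_k is None else runtime_top_k


init_filters = {"field": "meta.lang", "operator": "==", "value": "en"}
runtime_filters = {"field": "meta.lang", "operator": "==", "value": "de"}

# No runtime filters: the filters from initialization are used.
effective_default = resolve_filters_replace(init_filters, None)
# Runtime filters given: they replace the initialization filters entirely.
effective_override = resolve_filters_replace(init_filters, runtime_filters)

effective_top_k = resolve_top_k(init_top_k=10, runtime_top_k=None)
```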

Module haystack_integrations.components.retrievers.pgvector.keyword_retriever

PgvectorKeywordRetriever

Retrieve documents from the PgvectorDocumentStore, based on keywords.

To rank the documents, the ts_rank_cd function of PostgreSQL is used. It considers how often the query terms appear in the document, how close together the terms are in the document, and how important the part of the document where they occur is. For more details, see the PostgreSQL documentation.
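
For orientation, a ts_rank_cd query in PostgreSQL generally has the shape shown below. This is a hedged sketch and not necessarily the statement the retriever executes: the column names id and content are assumptions, the table name is the default haystack_documents, and the %s placeholders follow the psycopg parameter style.

```python
# Illustrative SQL only; the integration may build its query differently.
KEYWORD_QUERY_SQL = """
SELECT id, content,
       ts_rank_cd(to_tsvector('english', content), query) AS score
FROM haystack_documents, plainto_tsquery('english', %s) AS query
WHERE to_tsvector('english', content) @@ query
ORDER BY score DESC
LIMIT %s
"""

# Parameters that would be bound to the two placeholders: the query text and top_k.
params = ("languages", 10)
```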

Usage example:

from haystack.document_stores.types import DuplicatePolicy
from haystack import Document

from haystack_integrations.document_stores.pgvector import PgvectorDocumentStore
from haystack_integrations.components.retrievers.pgvector import PgvectorKeywordRetriever

# Set an environment variable `PG_CONN_STR` with the connection string to your PostgreSQL database.
# e.g., "postgresql://USER:PASSWORD@HOST:PORT/DB_NAME"

document_store = PgvectorDocumentStore(language="english", recreate_table=True)

documents = [Document(content="There are over 7,000 languages spoken around the world today."),
    Document(content="Elephants have been observed to behave in a way that indicates..."),
    Document(content="In certain places, you can witness the phenomenon of bioluminescent waves.")]

document_store.write_documents(documents, policy=DuplicatePolicy.OVERWRITE)

retriever = PgvectorKeywordRetriever(document_store=document_store)

result = retriever.run(query="languages")

assert result['documents'][0].content == "There are over 7,000 languages spoken around the world today."

PgvectorKeywordRetriever.__init__

def __init__(*,
             document_store: PgvectorDocumentStore,
             filters: Optional[Dict[str, Any]] = None,
             top_k: int = 10,
             filter_policy: Union[str, FilterPolicy] = FilterPolicy.REPLACE)

Arguments:

  • document_store: An instance of PgvectorDocumentStore.
  • filters: Filters applied to the retrieved Documents.
  • top_k: Maximum number of Documents to return.
  • filter_policy: Policy to determine how filters are applied.

Raises:

  • ValueError: If document_store is not an instance of PgvectorDocumentStore.

PgvectorKeywordRetriever.to_dict

def to_dict() -> Dict[str, Any]

Serializes the component to a dictionary.

Returns:

Dictionary with serialized data.

PgvectorKeywordRetriever.from_dict

@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "PgvectorKeywordRetriever"

Deserializes the component from a dictionary.

Arguments:

  • data: Dictionary to deserialize from.

Returns:

Deserialized component.

PgvectorKeywordRetriever.run

@component.output_types(documents=List[Document])
def run(query: str,
        filters: Optional[Dict[str, Any]] = None,
        top_k: Optional[int] = None)

Retrieve documents from the PgvectorDocumentStore, based on keywords.

Arguments:

  • query: String to search in Documents' content.
  • filters: Filters applied to the retrieved Documents. The way runtime filters are applied depends on the filter_policy chosen at retriever initialization. See init method docstring for more details.
  • top_k: Maximum number of Documents to return.

Returns:

A dictionary with the following keys:

  • documents: List of Documents that match the query.

Module haystack_integrations.document_stores.pgvector.document_store

PgvectorDocumentStore

A Document Store using PostgreSQL with the pgvector extension installed.

PgvectorDocumentStore.__init__

def __init__(*,
             connection_string: Secret = Secret.from_env_var("PG_CONN_STR"),
             table_name: str = "haystack_documents",
             language: str = "english",
             embedding_dimension: int = 768,
             vector_function: Literal["cosine_similarity", "inner_product",
                                      "l2_distance"] = "cosine_similarity",
             recreate_table: bool = False,
             search_strategy: Literal["exact_nearest_neighbor",
                                      "hnsw"] = "exact_nearest_neighbor",
             hnsw_recreate_index_if_exists: bool = False,
             hnsw_index_creation_kwargs: Optional[Dict[str, int]] = None,
             hnsw_index_name: str = "haystack_hnsw_index",
             hnsw_ef_search: Optional[int] = None,
             keyword_index_name: str = "haystack_keyword_index")

Creates a new PgvectorDocumentStore instance.

It is meant to be connected to a PostgreSQL database with the pgvector extension installed. A specific table to store Haystack documents will be created if it doesn't exist yet.
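
Because connection_string defaults to the PG_CONN_STR environment variable (see the signature above), one way to prepare the environment from Python is sketched below. The credentials, host, port, and database name are placeholders for a local setup, not values required by the integration; the urlparse call is only a sanity check on the URL's shape.

```python
import os
from urllib.parse import urlparse

# Placeholder values for a local database; replace them with your own.
os.environ["PG_CONN_STR"] = "postgresql://postgres:postgres@localhost:5432/postgres"

# Optional sanity check that the string has the expected parts.
parsed = urlparse(os.environ["PG_CONN_STR"])
db_name = parsed.path.lstrip("/")
print(parsed.scheme, parsed.hostname, parsed.port, db_name)
```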

Arguments:

  • connection_string: The connection string to use to connect to the PostgreSQL database, defined as an environment variable, e.g.: PG_CONN_STR="postgresql://USER:PASSWORD@HOST:PORT/DB_NAME"
  • table_name: The name of the table to use to store Haystack documents.
  • language: The language to be used to parse query and document content in keyword retrieval. To see the list of available languages, you can run the following SQL query in your PostgreSQL database: SELECT cfgname FROM pg_ts_config;. More information can be found in this StackOverflow answer.
  • embedding_dimension: The dimension of the embedding.
  • vector_function: The similarity function to use when searching for similar embeddings. "cosine_similarity" and "inner_product" are similarity functions and higher scores indicate greater similarity between the documents. "l2_distance" returns the straight-line distance between vectors, and the most similar documents are the ones with the smallest score. Important: when using the "hnsw" search strategy, an index will be created that depends on the vector_function passed here. Make sure subsequent queries will keep using the same vector similarity function in order to take advantage of the index.
  • recreate_table: Whether to recreate the table if it already exists.
  • search_strategy: The search strategy to use when searching for similar embeddings. "exact_nearest_neighbor" provides perfect recall but can be slow for large numbers of documents. "hnsw" is an approximate nearest neighbor search strategy, which trades off some accuracy for speed; it is recommended for large numbers of documents. Important: when using the "hnsw" search strategy, an index will be created that depends on the vector_function passed here. Make sure subsequent queries will keep using the same vector similarity function in order to take advantage of the index.
  • hnsw_recreate_index_if_exists: Whether to recreate the HNSW index if it already exists. Only used if search_strategy is set to "hnsw".
  • hnsw_index_creation_kwargs: Additional keyword arguments to pass to the HNSW index creation. Only used if search_strategy is set to "hnsw". You can find the list of valid arguments in the pgvector documentation.
  • hnsw_index_name: Index name for the HNSW index.
  • hnsw_ef_search: The ef_search parameter to use at query time. Only used if search_strategy is set to "hnsw". You can find more information about this parameter in the pgvector documentation.
  • keyword_index_name: Index name for the Keyword index.
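
The reason an HNSW index depends on vector_function is that pgvector builds each index for one operator class. The sketch below renders the kind of CREATE INDEX statement involved. It is a simplified illustration, not the integration's implementation: hnsw_index_sql is a hypothetical helper, the column name embedding is an assumption, and m and ef_construction are shown with pgvector's documented defaults.

```python
from typing import Dict, Optional

# pgvector operator classes corresponding to each vector function.
OPS_BY_VECTOR_FUNCTION = {
    "cosine_similarity": "vector_cosine_ops",
    "inner_product": "vector_ip_ops",
    "l2_distance": "vector_l2_ops",
}


def hnsw_index_sql(
    table_name: str,
    index_name: str,
    vector_function: str,
    creation_kwargs: Optional[Dict[str, int]] = None,
) -> str:
    """Hypothetical helper: render a CREATE INDEX statement for an HNSW index."""
    ops = OPS_BY_VECTOR_FUNCTION[vector_function]
    sql = f"CREATE INDEX {index_name} ON {table_name} USING hnsw (embedding {ops})"
    if creation_kwargs:
        options = ", ".join(f"{key} = {value}" for key, value in creation_kwargs.items())
        sql += f" WITH ({options})"
    return sql


sql = hnsw_index_sql(
    table_name="haystack_documents",
    index_name="haystack_hnsw_index",
    vector_function="cosine_similarity",
    creation_kwargs={"m": 16, "ef_construction": 64},
)
print(sql)
```

An index built with vector_cosine_ops only accelerates cosine queries, so a later query using l2_distance would fall back to an exact scan.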

PgvectorDocumentStore.to_dict

def to_dict() -> Dict[str, Any]

Serializes the component to a dictionary.

Returns:

Dictionary with serialized data.

PgvectorDocumentStore.from_dict

@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "PgvectorDocumentStore"

Deserializes the component from a dictionary.

Arguments:

  • data: Dictionary to deserialize from.

Returns:

Deserialized component.

PgvectorDocumentStore.delete_table

def delete_table()

Deletes the table used to store Haystack documents. The name of the table (table_name) is defined when initializing the PgvectorDocumentStore.

PgvectorDocumentStore.count_documents

def count_documents() -> int

Returns how many documents are present in the document store.

PgvectorDocumentStore.filter_documents

def filter_documents(
        filters: Optional[Dict[str, Any]] = None) -> List[Document]

Returns the documents that match the filters provided.

For a detailed specification of the filters, refer to the documentation.

Arguments:

  • filters: The filters to apply to the document list.

Raises:

  • TypeError: If filters is not a dictionary.

Returns:

A list of Documents that match the given filters.
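
As an illustration of the filter format, the sketch below evaluates a Haystack-style filter dictionary against plain metadata dictionaries. It is a toy model for reading filters, not the document store's implementation (which translates filters to SQL): the matches helper is hypothetical, it handles only a subset of comparison operators plus AND and OR, and the metadata fields lang and year are invented for the example.

```python
import operator
from typing import Any, Dict

COMPARATORS = {
    "==": operator.eq,
    "!=": operator.ne,
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
}


def matches(meta: Dict[str, Any], filters: Dict[str, Any]) -> bool:
    """Toy evaluator for a subset of the filter syntax (comparisons, AND, OR)."""
    if "conditions" in filters:
        results = [matches(meta, condition) for condition in filters["conditions"]]
        return all(results) if filters["operator"] == "AND" else any(results)
    # Strip the "meta." prefix used to address metadata fields.
    field = filters["field"].split("meta.", 1)[-1]
    if field not in meta:
        return False
    return COMPARATORS[filters["operator"]](meta[field], filters["value"])


filters = {
    "operator": "AND",
    "conditions": [
        {"field": "meta.lang", "operator": "==", "value": "en"},
        {"field": "meta.year", "operator": ">=", "value": 2020},
    ],
}

docs_meta = [
    {"lang": "en", "year": 2023},
    {"lang": "de", "year": 2023},
    {"lang": "en", "year": 2019},
]
matching = [meta for meta in docs_meta if matches(meta, filters)]
```

The same filters dictionary is the kind of value that would be passed as document_store.filter_documents(filters=filters).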

PgvectorDocumentStore.write_documents

def write_documents(documents: List[Document],
                    policy: DuplicatePolicy = DuplicatePolicy.NONE) -> int

Writes documents to the document store.

Arguments:

  • documents: A list of Documents to write to the document store.
  • policy: The duplicate policy to use when writing documents.

Raises:

  • DuplicateDocumentError: If a document with the same id already exists in the document store and the policy is set to DuplicatePolicy.FAIL (or not specified).

Returns:

The number of documents written to the document store.
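
The effect of the duplicate policy on the return value can be modelled with a small in-memory sketch. This is a toy model under stated assumptions, not the store's implementation: ToyPolicy and toy_write are hypothetical names, documents are reduced to (id, content) pairs, and the real store raises DuplicateDocumentError and may handle a failed batch differently from this loop.

```python
from enum import Enum
from typing import Dict, List, Tuple


class ToyPolicy(Enum):
    SKIP = "skip"
    OVERWRITE = "overwrite"
    FAIL = "fail"


def toy_write(store: Dict[str, str], docs: List[Tuple[str, str]], policy: ToyPolicy) -> int:
    """Write (id, content) pairs into store and return how many were written."""
    written = 0
    for doc_id, content in docs:
        if doc_id in store:
            if policy is ToyPolicy.FAIL:
                raise ValueError(f"Document with id '{doc_id}' already exists")
            if policy is ToyPolicy.SKIP:
                continue  # keep the existing document, do not count it
        store[doc_id] = content
        written += 1
    return written


incoming = [("1", "new"), ("2", "other")]

skip_store = {"1": "old"}
skipped_count = toy_write(skip_store, incoming, ToyPolicy.SKIP)

overwrite_store = {"1": "old"}
overwritten_count = toy_write(overwrite_store, incoming, ToyPolicy.OVERWRITE)
```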

PgvectorDocumentStore.delete_documents

def delete_documents(document_ids: List[str]) -> None

Deletes documents that match the provided document_ids from the document store.

Arguments:

  • document_ids: The IDs of the documents to delete.