# Pgvector integration for Haystack

## Module `haystack_integrations.components.retrievers.pgvector.embedding_retriever`

### `PgvectorEmbeddingRetriever`

Retrieves documents from the `PgvectorDocumentStore`, based on their dense embeddings.

Example usage:
```python
from haystack.document_stores.types import DuplicatePolicy
from haystack import Document, Pipeline
from haystack.components.embedders import SentenceTransformersTextEmbedder, SentenceTransformersDocumentEmbedder

from haystack_integrations.document_stores.pgvector import PgvectorDocumentStore
from haystack_integrations.components.retrievers.pgvector import PgvectorEmbeddingRetriever

# Set an environment variable `PG_CONN_STR` with the connection string to your PostgreSQL database.
# e.g., "postgresql://USER:PASSWORD@HOST:PORT/DB_NAME"

document_store = PgvectorDocumentStore(
    embedding_dimension=768,
    vector_function="cosine_similarity",
    recreate_table=True,
)

documents = [Document(content="There are over 7,000 languages spoken around the world today."),
             Document(content="Elephants have been observed to behave in a way that indicates..."),
             Document(content="In certain places, you can witness the phenomenon of bioluminescent waves.")]

document_embedder = SentenceTransformersDocumentEmbedder()
document_embedder.warm_up()
documents_with_embeddings = document_embedder.run(documents)

document_store.write_documents(documents_with_embeddings.get("documents"), policy=DuplicatePolicy.OVERWRITE)

query_pipeline = Pipeline()
query_pipeline.add_component("text_embedder", SentenceTransformersTextEmbedder())
query_pipeline.add_component("retriever", PgvectorEmbeddingRetriever(document_store=document_store))
query_pipeline.connect("text_embedder.embedding", "retriever.query_embedding")

query = "How many languages are there?"

res = query_pipeline.run({"text_embedder": {"text": query}})

assert res['retriever']['documents'][0].content == "There are over 7,000 languages spoken around the world today."
```
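The example above expects the `PG_CONN_STR` environment variable to be set before the document store is created. A minimal stdlib sketch of setting and sanity-checking it (the credentials below are hypothetical placeholders, not defaults of the integration):

```python
import os
from urllib.parse import urlparse

# Hypothetical local credentials -- replace with your own.
os.environ["PG_CONN_STR"] = "postgresql://postgres:postgres@localhost:5432/postgres"

# Sanity-check the connection string components before handing it to the store.
parts = urlparse(os.environ["PG_CONN_STR"])
print(parts.scheme, parts.hostname, parts.port, parts.path.lstrip("/"))
# -> postgresql localhost 5432 postgres
```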
#### `PgvectorEmbeddingRetriever.__init__`

```python
def __init__(*,
             document_store: PgvectorDocumentStore,
             filters: Optional[Dict[str, Any]] = None,
             top_k: int = 10,
             vector_function: Optional[Literal["cosine_similarity",
                                               "inner_product",
                                               "l2_distance"]] = None)
```
**Arguments**:

- `document_store`: An instance of `PgvectorDocumentStore`.
- `filters`: Filters applied to the retrieved Documents.
- `top_k`: Maximum number of Documents to return.
- `vector_function`: The similarity function to use when searching for similar embeddings. Defaults to the one set in the `document_store` instance. `"cosine_similarity"` and `"inner_product"` are similarity functions, so higher scores indicate greater similarity between documents. `"l2_distance"` returns the straight-line distance between vectors, so the most similar documents are the ones with the smallest score. Important: if the document store uses the `"hnsw"` search strategy, the vector function should match the one used during index creation to take advantage of the index.
**Raises**:

- `ValueError`: If `document_store` is not an instance of `PgvectorDocumentStore` or if `vector_function` is not one of the valid options.
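The score semantics differ between the vector functions: with `"cosine_similarity"` and `"inner_product"` a higher score means more similar, while with `"l2_distance"` a smaller score means more similar. A small stdlib sketch of both metrics (the vectors are made up for illustration):

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def l2_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

query = [1.0, 0.0]
close = [0.9, 0.1]   # nearly the same direction as the query
far = [0.0, 1.0]     # orthogonal to the query

# Cosine similarity: the closer vector gets the HIGHER score.
assert cosine_similarity(query, close) > cosine_similarity(query, far)
# L2 distance: the closer vector gets the SMALLER score.
assert l2_distance(query, close) < l2_distance(query, far)
```

This is why the ranking direction depends on the chosen `vector_function`, and why it must stay consistent with the function used to build an HNSW index.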
#### `PgvectorEmbeddingRetriever.to_dict`

```python
def to_dict() -> Dict[str, Any]
```

Serializes the component to a dictionary.

**Returns**:

Dictionary with serialized data.
#### `PgvectorEmbeddingRetriever.from_dict`

```python
@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "PgvectorEmbeddingRetriever"
```

Deserializes the component from a dictionary.

**Arguments**:

- `data`: Dictionary to deserialize from.

**Returns**:

Deserialized component.
#### `PgvectorEmbeddingRetriever.run`

```python
@component.output_types(documents=List[Document])
def run(query_embedding: List[float],
        filters: Optional[Dict[str, Any]] = None,
        top_k: Optional[int] = None,
        vector_function: Optional[Literal["cosine_similarity", "inner_product",
                                          "l2_distance"]] = None)
```

Retrieve documents from the `PgvectorDocumentStore`, based on their embeddings.

**Arguments**:

- `query_embedding`: Embedding of the query.
- `filters`: Filters applied to the retrieved Documents.
- `top_k`: Maximum number of Documents to return.
- `vector_function`: The similarity function to use when searching for similar embeddings.

**Returns**:

List of Documents similar to `query_embedding`.
## Module `haystack_integrations.document_stores.pgvector.document_store`

### `PgvectorDocumentStore`

A Document Store using PostgreSQL with the pgvector extension installed.
#### `PgvectorDocumentStore.__init__`

```python
def __init__(*,
             connection_string: Secret = Secret.from_env_var("PG_CONN_STR"),
             table_name: str = "haystack_documents",
             embedding_dimension: int = 768,
             vector_function: Literal["cosine_similarity", "inner_product",
                                      "l2_distance"] = "cosine_similarity",
             recreate_table: bool = False,
             search_strategy: Literal["exact_nearest_neighbor",
                                      "hnsw"] = "exact_nearest_neighbor",
             hnsw_recreate_index_if_exists: bool = False,
             hnsw_index_creation_kwargs: Optional[Dict[str, int]] = None,
             hnsw_ef_search: Optional[int] = None)
```

Creates a new PgvectorDocumentStore instance.

It is meant to be connected to a PostgreSQL database with the pgvector extension installed. A specific table to store Haystack documents will be created if it doesn't exist yet.
**Arguments**:

- `connection_string`: The connection string to use to connect to the PostgreSQL database, defined as an environment variable, e.g.: `PG_CONN_STR="postgresql://USER:PASSWORD@HOST:PORT/DB_NAME"`
- `table_name`: The name of the table used to store Haystack documents.
- `embedding_dimension`: The dimension of the embedding.
- `vector_function`: The similarity function to use when searching for similar embeddings. `"cosine_similarity"` and `"inner_product"` are similarity functions, so higher scores indicate greater similarity between documents. `"l2_distance"` returns the straight-line distance between vectors, so the most similar documents are the ones with the smallest score. Important: when using the `"hnsw"` search strategy, an index will be created that depends on the `vector_function` passed here. Make sure subsequent queries keep using the same vector similarity function in order to take advantage of the index.
- `recreate_table`: Whether to recreate the table if it already exists.
- `search_strategy`: The search strategy to use when searching for similar embeddings. `"exact_nearest_neighbor"` provides perfect recall but can be slow for large numbers of documents. `"hnsw"` is an approximate nearest neighbor search strategy, which trades off some accuracy for speed; it is recommended for large numbers of documents. Important: when using the `"hnsw"` search strategy, an index will be created that depends on the `vector_function` passed here. Make sure subsequent queries keep using the same vector similarity function in order to take advantage of the index.
- `hnsw_recreate_index_if_exists`: Whether to recreate the HNSW index if it already exists. Only used if `search_strategy` is set to `"hnsw"`.
- `hnsw_index_creation_kwargs`: Additional keyword arguments to pass to the HNSW index creation. Only used if `search_strategy` is set to `"hnsw"`. You can find the list of valid arguments in the pgvector documentation.
- `hnsw_ef_search`: The `ef_search` parameter to use at query time. Only used if `search_strategy` is set to `"hnsw"`. You can find more information about this parameter in the pgvector documentation.
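As a sketch, `hnsw_index_creation_kwargs` can carry pgvector's HNSW build parameters. `m` and `ef_construction` are the parameters pgvector documents for HNSW index creation; the specific values below are illustrative assumptions, not tuned recommendations:

```python
# Illustrative values only -- tune for your dataset and recall/latency targets.
hnsw_index_creation_kwargs = {
    "m": 16,                # maximum number of connections per graph node
    "ef_construction": 64,  # size of the candidate list used while building the index
}

# These would then be passed at store creation time, e.g.:
# PgvectorDocumentStore(search_strategy="hnsw",
#                       hnsw_index_creation_kwargs=hnsw_index_creation_kwargs)
print(sorted(hnsw_index_creation_kwargs))
# -> ['ef_construction', 'm']
```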
#### `PgvectorDocumentStore.to_dict`

```python
def to_dict() -> Dict[str, Any]
```

Serializes the component to a dictionary.

**Returns**:

Dictionary with serialized data.
#### `PgvectorDocumentStore.from_dict`

```python
@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "PgvectorDocumentStore"
```

Deserializes the component from a dictionary.

**Arguments**:

- `data`: Dictionary to deserialize from.

**Returns**:

Deserialized component.
#### `PgvectorDocumentStore.delete_table`

```python
def delete_table()
```

Deletes the table used to store Haystack documents.

The name of the table (`table_name`) is defined when initializing the `PgvectorDocumentStore`.
#### `PgvectorDocumentStore.count_documents`

```python
def count_documents() -> int
```

Returns how many documents are present in the document store.
#### `PgvectorDocumentStore.filter_documents`

```python
def filter_documents(
        filters: Optional[Dict[str, Any]] = None) -> List[Document]
```

Returns the documents that match the filters provided.

For a detailed specification of the filters, refer to the documentation.

**Arguments**:

- `filters`: The filters to apply to the document list.

**Raises**:

- `TypeError`: If `filters` is not a dictionary.

**Returns**:

A list of Documents that match the given filters.
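Filters are plain dictionaries in Haystack's comparison/logic format. A sketch of a compound filter (the `meta.*` field names and values are made up for illustration; check the exact schema against Haystack's filtering documentation):

```python
# Match documents whose metadata "type" is "article" AND whose "date" is on or
# after 2023-01-01, combined with a logical AND operator.
filters = {
    "operator": "AND",
    "conditions": [
        {"field": "meta.type", "operator": "==", "value": "article"},
        {"field": "meta.date", "operator": ">=", "value": "2023-01-01"},
    ],
}

# document_store.filter_documents(filters=filters) would then return only the
# Documents satisfying both conditions.
print(len(filters["conditions"]))
# -> 2
```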
#### `PgvectorDocumentStore.write_documents`

```python
def write_documents(documents: List[Document],
                    policy: DuplicatePolicy = DuplicatePolicy.NONE) -> int
```

Writes documents to the document store.

**Arguments**:

- `documents`: A list of Documents to write to the document store.
- `policy`: The duplicate policy to use when writing documents.

**Raises**:

- `DuplicateDocumentError`: If a document with the same ID already exists in the document store and the policy is set to `DuplicatePolicy.FAIL` (or not specified).

**Returns**:

The number of documents written to the document store.
#### `PgvectorDocumentStore.delete_documents`

```python
def delete_documents(document_ids: List[str]) -> None
```

Deletes documents that match the provided `document_ids` from the document store.

**Arguments**:

- `document_ids`: The document IDs to delete.