# Pgvector integration for Haystack

## Module haystack_integrations.components.retrievers.pgvector.embedding_retriever
### PgvectorEmbeddingRetriever

Retrieves documents from the `PgvectorDocumentStore`, based on their dense embeddings.
Example usage:

```python
from haystack.document_stores import DuplicatePolicy
from haystack import Document, Pipeline
from haystack.components.embedders import SentenceTransformersTextEmbedder, SentenceTransformersDocumentEmbedder
from haystack_integrations.document_stores.pgvector import PgvectorDocumentStore
from haystack_integrations.components.retrievers.pgvector import PgvectorEmbeddingRetriever

# Set an environment variable `PG_CONN_STR` with the connection string to your PostgreSQL database.
# e.g., "postgresql://USER:PASSWORD@HOST:PORT/DB_NAME"

document_store = PgvectorDocumentStore(
    embedding_dimension=768,
    vector_function="cosine_similarity",
    recreate_table=True,
)

documents = [Document(content="There are over 7,000 languages spoken around the world today."),
             Document(content="Elephants have been observed to behave in a way that indicates..."),
             Document(content="In certain places, you can witness the phenomenon of bioluminescent waves.")]

document_embedder = SentenceTransformersDocumentEmbedder()
document_embedder.warm_up()
documents_with_embeddings = document_embedder.run(documents)
document_store.write_documents(documents_with_embeddings.get("documents"), policy=DuplicatePolicy.OVERWRITE)

query_pipeline = Pipeline()
query_pipeline.add_component("text_embedder", SentenceTransformersTextEmbedder())
query_pipeline.add_component("retriever", PgvectorEmbeddingRetriever(document_store=document_store))
query_pipeline.connect("text_embedder.embedding", "retriever.query_embedding")

query = "How many languages are there?"
res = query_pipeline.run({"text_embedder": {"text": query}})
assert res['retriever']['documents'][0].content == "There are over 7,000 languages spoken around the world today."
```
#### PgvectorEmbeddingRetriever.\_\_init\_\_

```python
def __init__(*,
             document_store: PgvectorDocumentStore,
             filters: Optional[Dict[str, Any]] = None,
             top_k: int = 10,
             vector_function: Optional[Literal["cosine_similarity",
                                               "inner_product",
                                               "l2_distance"]] = None,
             filter_policy: Union[str, FilterPolicy] = FilterPolicy.REPLACE)
```
**Arguments**:

- `document_store`: An instance of `PgvectorDocumentStore`.
- `filters`: Filters applied to the retrieved Documents.
- `top_k`: Maximum number of Documents to return.
- `vector_function`: The similarity function to use when searching for similar embeddings. Defaults to the one set in the `document_store` instance. `"cosine_similarity"` and `"inner_product"` are similarity functions and higher scores indicate greater similarity between the documents. `"l2_distance"` returns the straight-line distance between vectors, and the most similar documents are the ones with the smallest score. Important: if the document store uses the `"hnsw"` search strategy, the vector function should match the one used during index creation to take advantage of the index.
- `filter_policy`: Policy to determine how filters are applied.
**Raises**:

- `ValueError`: If `document_store` is not an instance of `PgvectorDocumentStore` or if `vector_function` is not one of the valid options.
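The two filter policies can be sketched with plain dictionaries. The helper below is illustrative only and not the integration's actual implementation; the real behavior is provided by Haystack's `FilterPolicy` utilities, and the field names are made up for the example.

```python
from typing import Any, Dict, Optional

def apply_policy(policy: str,
                 init_filters: Optional[Dict[str, Any]],
                 runtime_filters: Optional[Dict[str, Any]]) -> Optional[Dict[str, Any]]:
    """Sketch of how init-time and runtime filters might combine (not the library code)."""
    if runtime_filters is None:
        return init_filters
    if policy == "replace":
        # REPLACE: runtime filters take precedence entirely
        return runtime_filters
    if init_filters is None:
        return runtime_filters
    # "merge" (sketch): AND both sets of conditions together
    return {"operator": "AND", "conditions": [init_filters, runtime_filters]}

init_f = {"field": "meta.genre", "operator": "==", "value": "news"}
run_f = {"field": "meta.year", "operator": ">=", "value": 2020}

assert apply_policy("replace", init_f, run_f) == run_f
assert apply_policy("merge", init_f, run_f)["operator"] == "AND"
```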
#### PgvectorEmbeddingRetriever.to\_dict

```python
def to_dict() -> Dict[str, Any]
```

Serializes the component to a dictionary.

**Returns**:

Dictionary with serialized data.
#### PgvectorEmbeddingRetriever.from\_dict

```python
@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "PgvectorEmbeddingRetriever"
```

Deserializes the component from a dictionary.

**Arguments**:

- `data`: Dictionary to deserialize from.

**Returns**:

Deserialized component.
#### PgvectorEmbeddingRetriever.run

```python
@component.output_types(documents=List[Document])
def run(query_embedding: List[float],
        filters: Optional[Dict[str, Any]] = None,
        top_k: Optional[int] = None,
        vector_function: Optional[Literal["cosine_similarity", "inner_product",
                                          "l2_distance"]] = None)
```

Retrieve documents from the `PgvectorDocumentStore`, based on their embeddings.

**Arguments**:

- `query_embedding`: Embedding of the query.
- `filters`: Filters applied to the retrieved Documents. The way runtime filters are applied depends on the `filter_policy` chosen at retriever initialization. See the `__init__` docstring for more details.
- `top_k`: Maximum number of Documents to return.
- `vector_function`: The similarity function to use when searching for similar embeddings.

**Returns**:

List of Documents similar to `query_embedding`.
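The practical difference between the vector functions is the direction of the score. The small self-contained sketch below uses our own helper functions (not part of the API) to show that cosine similarity scores higher for more similar vectors, while L2 distance scores lower:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: higher means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def l2_distance(a, b):
    """Straight-line distance between two vectors: lower means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

query = [1.0, 0.0]
close = [0.9, 0.1]   # nearly parallel to the query
far = [0.0, 1.0]     # orthogonal to the query

# cosine_similarity: the more similar vector gets the HIGHER score
assert cosine_similarity(query, close) > cosine_similarity(query, far)
# l2_distance: the more similar vector gets the LOWER score
assert l2_distance(query, close) < l2_distance(query, far)
```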
## Module haystack_integrations.components.retrievers.pgvector.keyword_retriever

### PgvectorKeywordRetriever

Retrieve documents from the `PgvectorDocumentStore`, based on keywords.

To rank the documents, the `ts_rank_cd` function of PostgreSQL is used.
It considers how often the query terms appear in the document, how close together the terms are in the document,
and how important the part of the document where they occur is.
For more details, see the Postgres documentation.
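As a loose intuition only (this is NOT how `ts_rank_cd` actually works; it additionally weighs term proximity and document regions, and operates on normalized lexemes), keyword ranking rewards documents containing more of the query terms:

```python
# Toy scoring function for intuition: count occurrences of query terms.
# Real ts_rank_cd also accounts for term proximity and document structure.
def naive_keyword_score(query: str, text: str) -> int:
    terms = query.lower().split()
    words = text.lower().split()
    return sum(words.count(term) for term in terms)

docs = [
    "There are over 7,000 languages spoken around the world today.",
    "Elephants have been observed to behave in a way that indicates...",
]
scores = [naive_keyword_score("languages spoken", d) for d in docs]
assert scores[0] > scores[1]  # the document about languages ranks first
```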
Usage example:

```python
from haystack.document_stores import DuplicatePolicy
from haystack import Document
from haystack_integrations.document_stores.pgvector import PgvectorDocumentStore
from haystack_integrations.components.retrievers.pgvector import PgvectorKeywordRetriever

# Set an environment variable `PG_CONN_STR` with the connection string to your PostgreSQL database.
# e.g., "postgresql://USER:PASSWORD@HOST:PORT/DB_NAME"

document_store = PgvectorDocumentStore(language="english", recreate_table=True)

documents = [Document(content="There are over 7,000 languages spoken around the world today."),
             Document(content="Elephants have been observed to behave in a way that indicates..."),
             Document(content="In certain places, you can witness the phenomenon of bioluminescent waves.")]

document_store.write_documents(documents, policy=DuplicatePolicy.OVERWRITE)

retriever = PgvectorKeywordRetriever(document_store=document_store)

result = retriever.run(query="languages")

assert result['documents'][0].content == "There are over 7,000 languages spoken around the world today."
```
#### PgvectorKeywordRetriever.\_\_init\_\_

```python
def __init__(*,
             document_store: PgvectorDocumentStore,
             filters: Optional[Dict[str, Any]] = None,
             top_k: int = 10,
             filter_policy: Union[str, FilterPolicy] = FilterPolicy.REPLACE)
```
**Arguments**:

- `document_store`: An instance of `PgvectorDocumentStore`.
- `filters`: Filters applied to the retrieved Documents.
- `top_k`: Maximum number of Documents to return.
- `filter_policy`: Policy to determine how filters are applied.
**Raises**:

- `ValueError`: If `document_store` is not an instance of `PgvectorDocumentStore`.
#### PgvectorKeywordRetriever.to\_dict

```python
def to_dict() -> Dict[str, Any]
```

Serializes the component to a dictionary.

**Returns**:

Dictionary with serialized data.
#### PgvectorKeywordRetriever.from\_dict

```python
@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "PgvectorKeywordRetriever"
```

Deserializes the component from a dictionary.

**Arguments**:

- `data`: Dictionary to deserialize from.

**Returns**:

Deserialized component.
#### PgvectorKeywordRetriever.run

```python
@component.output_types(documents=List[Document])
def run(query: str,
        filters: Optional[Dict[str, Any]] = None,
        top_k: Optional[int] = None)
```

Retrieve documents from the `PgvectorDocumentStore`, based on keywords.

**Arguments**:

- `query`: String to search in the content of the Documents.
- `filters`: Filters applied to the retrieved Documents. The way runtime filters are applied depends on the `filter_policy` chosen at retriever initialization. See the `__init__` docstring for more details.
- `top_k`: Maximum number of Documents to return.

**Returns**:

A dictionary with the following keys:
- `documents`: List of `Document`s that match the query.
## Module haystack_integrations.document_stores.pgvector.document_store

### PgvectorDocumentStore

A Document Store using PostgreSQL with the pgvector extension installed.
#### PgvectorDocumentStore.\_\_init\_\_

```python
def __init__(*,
             connection_string: Secret = Secret.from_env_var("PG_CONN_STR"),
             create_extension: bool = True,
             schema_name: str = "public",
             table_name: str = "haystack_documents",
             language: str = "english",
             embedding_dimension: int = 768,
             vector_function: Literal["cosine_similarity", "inner_product",
                                      "l2_distance"] = "cosine_similarity",
             recreate_table: bool = False,
             search_strategy: Literal["exact_nearest_neighbor",
                                      "hnsw"] = "exact_nearest_neighbor",
             hnsw_recreate_index_if_exists: bool = False,
             hnsw_index_creation_kwargs: Optional[Dict[str, int]] = None,
             hnsw_index_name: str = "haystack_hnsw_index",
             hnsw_ef_search: Optional[int] = None,
             keyword_index_name: str = "haystack_keyword_index")
```
Creates a new PgvectorDocumentStore instance.

It is meant to be connected to a PostgreSQL database with the pgvector extension installed. A specific table to store Haystack documents will be created if it doesn't exist yet.

**Arguments**:

- `connection_string`: The connection string to use to connect to the PostgreSQL database, defined as an environment variable. It can be provided in either URI format, e.g. `PG_CONN_STR="postgresql://USER:PASSWORD@HOST:PORT/DB_NAME"`, or keyword/value format, e.g. `PG_CONN_STR="host=HOST port=PORT dbname=DBNAME user=USER password=PASSWORD"`. See the PostgreSQL documentation for more details.
- `create_extension`: Whether to create the pgvector extension if it doesn't exist. Set this to `True` (default) to automatically create the extension if it is missing. Creating the extension may require superuser privileges. If set to `False`, ensure the extension is already installed; otherwise, an error will be raised.
- `schema_name`: The name of the schema the table is created in. The schema must already exist.
- `table_name`: The name of the table to use to store Haystack documents.
- `language`: The language to be used to parse query and document content in keyword retrieval. To see the list of available languages, you can run the following SQL query in your PostgreSQL database: `SELECT cfgname FROM pg_ts_config;`. More information can be found in this StackOverflow answer.
- `embedding_dimension`: The dimension of the embedding.
- `vector_function`: The similarity function to use when searching for similar embeddings. `"cosine_similarity"` and `"inner_product"` are similarity functions and higher scores indicate greater similarity between the documents. `"l2_distance"` returns the straight-line distance between vectors, and the most similar documents are the ones with the smallest score. Important: when using the `"hnsw"` search strategy, an index will be created that depends on the `vector_function` passed here. Make sure subsequent queries keep using the same vector similarity function in order to take advantage of the index.
- `recreate_table`: Whether to recreate the table if it already exists.
- `search_strategy`: The search strategy to use when searching for similar embeddings. `"exact_nearest_neighbor"` provides perfect recall but can be slow for large numbers of documents. `"hnsw"` is an approximate nearest neighbor search strategy, which trades off some accuracy for speed; it is recommended for large numbers of documents. Important: when using the `"hnsw"` search strategy, an index will be created that depends on the `vector_function` passed here. Make sure subsequent queries keep using the same vector similarity function in order to take advantage of the index.
- `hnsw_recreate_index_if_exists`: Whether to recreate the HNSW index if it already exists. Only used if `search_strategy` is set to `"hnsw"`.
- `hnsw_index_creation_kwargs`: Additional keyword arguments to pass to the HNSW index creation. Only used if `search_strategy` is set to `"hnsw"`. You can find the list of valid arguments in the pgvector documentation.
- `hnsw_index_name`: Index name for the HNSW index.
- `hnsw_ef_search`: The `ef_search` parameter to use at query time. Only used if `search_strategy` is set to `"hnsw"`. You can find more information about this parameter in the pgvector documentation.
- `keyword_index_name`: Index name for the keyword index.
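As a configuration sketch, an HNSW-backed store could be set up with keyword arguments like the following. The concrete values of `m`, `ef_construction`, and `ef_search` are example choices, not recommendations; valid parameters and defaults are defined by pgvector.

```python
# Example keyword arguments for an HNSW-backed PgvectorDocumentStore.
# The numeric values below are illustrative choices, not recommendations.
hnsw_store_kwargs = {
    "embedding_dimension": 768,
    "vector_function": "cosine_similarity",  # must match the function used at query time
    "search_strategy": "hnsw",
    "hnsw_recreate_index_if_exists": False,
    "hnsw_index_creation_kwargs": {"m": 16, "ef_construction": 64},  # pgvector HNSW build parameters
    "hnsw_ef_search": 40,  # query-time candidate list size
}

# With a running PostgreSQL and `PG_CONN_STR` set, you would then call:
# document_store = PgvectorDocumentStore(**hnsw_store_kwargs)
```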
#### PgvectorDocumentStore.to\_dict

```python
def to_dict() -> Dict[str, Any]
```

Serializes the component to a dictionary.

**Returns**:

Dictionary with serialized data.
#### PgvectorDocumentStore.from\_dict

```python
@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "PgvectorDocumentStore"
```

Deserializes the component from a dictionary.

**Arguments**:

- `data`: Dictionary to deserialize from.

**Returns**:

Deserialized component.
#### PgvectorDocumentStore.delete\_table

```python
def delete_table()
```

Deletes the table used to store Haystack documents.

The name of the schema (`schema_name`) and the name of the table (`table_name`)
are defined when initializing the `PgvectorDocumentStore`.
#### PgvectorDocumentStore.count\_documents

```python
def count_documents() -> int
```

Returns how many documents are present in the document store.
#### PgvectorDocumentStore.filter\_documents

```python
def filter_documents(
        filters: Optional[Dict[str, Any]] = None) -> List[Document]
```

Returns the documents that match the filters provided.

For a detailed specification of the filters, refer to the documentation.

**Arguments**:

- `filters`: The filters to apply to the document list.

**Raises**:

- `TypeError`: If `filters` is not a dictionary.

**Returns**:

A list of Documents that match the given filters.
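As a sketch of the filter shape in the Haystack 2.x style (comparison conditions, optionally nested under logical operators; the field names here are made up for illustration):

```python
# A comparison condition filters on a single field of the Document.
comparison = {"field": "meta.type", "operator": "==", "value": "article"}

# Conditions can be combined under a logical operator such as "AND".
filters = {
    "operator": "AND",
    "conditions": [
        comparison,
        {"field": "meta.date", "operator": ">=", "value": "2023-01-01"},
    ],
}

# With a running PostgreSQL, you would then call:
# docs = document_store.filter_documents(filters=filters)
```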
#### PgvectorDocumentStore.write\_documents

```python
def write_documents(documents: List[Document],
                    policy: DuplicatePolicy = DuplicatePolicy.NONE) -> int
```

Writes documents to the document store.

**Arguments**:

- `documents`: A list of Documents to write to the document store.
- `policy`: The duplicate policy to use when writing documents.

**Raises**:

- `DuplicateDocumentError`: If a document with the same id already exists in the document store and the policy is set to `DuplicatePolicy.FAIL` (or not specified).

**Returns**:

The number of documents written to the document store.
#### PgvectorDocumentStore.delete\_documents

```python
def delete_documents(document_ids: List[str]) -> None
```

Deletes documents that match the provided `document_ids` from the document store.

**Arguments**:

- `document_ids`: The document IDs to delete.