OpenSearch integration for Haystack
Module haystack_integrations.components.retrievers.opensearch.bm25_retriever
OpenSearchBM25Retriever
Fetches documents from OpenSearchDocumentStore using the keyword-based BM25 algorithm.
BM25 computes a weighted word overlap between the query string and a document to determine its similarity.
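To make the "weighted word overlap" idea concrete, here is a minimal sketch of the textbook BM25 formula in plain Python. The function name `bm25_scores`, the whitespace tokenizer, and the parameter values `k1=1.2` and `b=0.75` are assumptions for illustration. OpenSearch's own analysis and scoring differ in detail.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document in `docs` against `query` with textbook BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            tf = term_freq[term]
            if tf == 0:
                continue  # terms absent from the document contribute nothing
            # Rare terms get a higher weight than common ones.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency saturates and is normalized by document length.
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(tokens) / avg_len))
        scores.append(score)
    return scores


docs = [
    "revenue increased because sales grew",
    "the weather was sunny",
    "revenue and costs both increased",
]
scores = bm25_scores("why did revenue increase", docs)
```

Only the documents that share a term with the query ("revenue") receive a non-zero score. This sketch does no stemming, so "increase" does not match "increased".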
OpenSearchBM25Retriever.__init__
def __init__(*,
document_store: OpenSearchDocumentStore,
filters: Optional[Dict[str, Any]] = None,
fuzziness: Union[int, str] = "AUTO",
top_k: int = 10,
scale_score: bool = False,
all_terms_must_match: bool = False,
filter_policy: Union[str, FilterPolicy] = FilterPolicy.REPLACE,
custom_query: Optional[Dict[str, Any]] = None,
raise_on_failure: bool = True)
Creates the OpenSearchBM25Retriever component.
Arguments:
document_store
: An instance of OpenSearchDocumentStore to use with the Retriever.
filters
: Filters to narrow down the search for documents in the Document Store.
fuzziness
: Determines how approximate string matching is applied in full-text queries. This parameter sets the maximum number of character edits (insertions, deletions, or substitutions) allowed when matching one word against another. For example, the edit distance between "wined" and "wind" is 1 because a single edit turns one into the other.
Use "AUTO" (the default) for automatic adjustment based on term length, which suits most scenarios. For detailed guidance, refer to the OpenSearch fuzzy query documentation.
top_k
: Maximum number of documents to return.
scale_score
: If True, scales the score of retrieved documents to a range between 0 and 1. This is useful when comparing documents across different indexes.
all_terms_must_match
: If True, all terms in the query string must be present in the retrieved documents. This is useful when searching for short text where even one term can make a difference.
filter_policy
: Policy to determine how filters are applied. Possible options:
- replace: Runtime filters replace initialization filters. Use this policy to change the filtering scope for specific queries.
- merge: Runtime filters are merged with initialization filters.
custom_query
: The query containing a mandatory $query and an optional $filters placeholder. An example custom_query:
{
    "query": {
        "bool": {
            "should": [{"multi_match": {
                "query": "$query",  # mandatory query placeholder
                "type": "most_fields",
                "fields": ["content", "title"]}}],
            "filter": "$filters"  # optional filter placeholder
        }
    }
}
An example run() method for this custom_query:
retriever.run(
    query="Why did the revenue increase?",
    filters={
        "operator": "AND",
        "conditions": [
            {"field": "meta.years", "operator": "==", "value": "2019"},
            {"field": "meta.quarters", "operator": "in", "value": ["Q1", "Q2"]},
        ],
    },
)
raise_on_failure
: If True, raises an exception if the API call fails. If False, logs a warning and returns an empty list.
Raises:
ValueError
: If document_store is not an instance of OpenSearchDocumentStore.
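The edit counting behind the fuzziness argument can be sketched as a plain Levenshtein distance. The helper names `edit_distance` and `auto_fuzziness` are hypothetical, and the "AUTO" thresholds (0 edits below 3 characters, 1 edit for 3 to 5 characters, 2 edits above that) are assumed from the documented defaults of the fuzzy query. OpenSearch can also count a swap of two adjacent characters as a single edit, which this sketch does not model.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]


def auto_fuzziness(term: str) -> int:
    """Assumed "AUTO" behavior: allowed edits grow with term length."""
    if len(term) < 3:
        return 0
    if len(term) < 6:
        return 1
    return 2


distance = edit_distance("wined", "wind")
is_match = distance <= auto_fuzziness("wind")
```

Here "wined" is one deletion away from "wind", and a four-character term is allowed one edit under the assumed "AUTO" rule, so the two terms would match.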
OpenSearchBM25Retriever.to_dict
def to_dict() -> Dict[str, Any]
Serializes the component to a dictionary.
Returns:
Dictionary with serialized data.
OpenSearchBM25Retriever.from_dict
@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "OpenSearchBM25Retriever"
Deserializes the component from a dictionary.
Arguments:
data
: Dictionary to deserialize from.
Returns:
Deserialized component.
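As an illustration of how to_dict and from_dict are expected to round-trip, the sketch below uses a toy class instead of the real retriever. `ToyRetriever` is hypothetical, and the "type" / "init_parameters" layout is assumed from Haystack's usual serialization format. The real component additionally serializes its document store.

```python
import json
from typing import Any, Dict, Optional


class ToyRetriever:
    """Hypothetical stand-in for a retriever; not the real component."""

    def __init__(self, *, top_k: int = 10, fuzziness: Any = "AUTO",
                 filters: Optional[Dict[str, Any]] = None):
        self.top_k = top_k
        self.fuzziness = fuzziness
        self.filters = filters or {}

    def to_dict(self) -> Dict[str, Any]:
        # Store the class path plus everything needed to call __init__ again.
        return {
            "type": f"{type(self).__module__}.{type(self).__name__}",
            "init_parameters": {
                "top_k": self.top_k,
                "fuzziness": self.fuzziness,
                "filters": self.filters,
            },
        }

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "ToyRetriever":
        return cls(**data["init_parameters"])


original = ToyRetriever(top_k=5, fuzziness=1)
# A JSON round trip checks that the dictionary holds only plain data.
payload = json.loads(json.dumps(original.to_dict()))
restored = ToyRetriever.from_dict(payload)
```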
OpenSearchBM25Retriever.run
@component.output_types(documents=List[Document])
def run(query: str,
filters: Optional[Dict[str, Any]] = None,
all_terms_must_match: Optional[bool] = None,
top_k: Optional[int] = None,
fuzziness: Optional[Union[int, str]] = None,
scale_score: Optional[bool] = None,
custom_query: Optional[Dict[str, Any]] = None)
Retrieve documents using BM25 retrieval.
Arguments:
query
: The query string.
filters
: Filters applied to the retrieved documents. The way runtime filters are applied depends on the filter_policy specified at the Retriever's initialization.
all_terms_must_match
: If True, all terms in the query string must be present in the retrieved documents.
top_k
: Maximum number of documents to return.
fuzziness
: Fuzziness parameter for full-text queries to apply approximate string matching. For more information, see the OpenSearch fuzzy query documentation.
scale_score
: If True, scales the score of retrieved documents to a range between 0 and 1. This is useful when comparing documents across different indexes.
custom_query
: A custom OpenSearch query. It must include a $query placeholder and may optionally include a $filters placeholder.
An example custom_query:
```python
{
    "query": {
        "bool": {
            "should": [{"multi_match": {
                "query": "$query",  # mandatory query placeholder
                "type": "most_fields",
                "fields": ["content", "title"]}}],
            "filter": "$filters"  # optional filter placeholder
        }
    }
}
```
For this custom_query, a sample run() could be:
retriever.run(
    query="Why did the revenue increase?",
    filters={
        "operator": "AND",
        "conditions": [
            {"field": "meta.years", "operator": "==", "value": "2019"},
            {"field": "meta.quarters", "operator": "in", "value": ["Q1", "Q2"]},
        ],
    },
)
Returns:
A dictionary containing the retrieved documents with the following structure:
- documents: List of retrieved Documents.
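The $query and $filters placeholders in a custom_query can be pictured as values that are swapped in just before the request is sent. The sketch below shows one way such a substitution could work. `render_custom_query` is a hypothetical helper, not the library's internal function, and the filter passed in is assumed to be in OpenSearch syntax already.

```python
from typing import Any, Dict


def render_custom_query(template: Any, substitutions: Dict[str, Any]) -> Any:
    """Return a copy of `template` with placeholder strings replaced."""
    if isinstance(template, dict):
        return {key: render_custom_query(value, substitutions) for key, value in template.items()}
    if isinstance(template, list):
        return [render_custom_query(item, substitutions) for item in template]
    if isinstance(template, str) and template in substitutions:
        return substitutions[template]
    return template  # numbers, booleans, and ordinary strings stay as they are


custom_query = {
    "query": {
        "bool": {
            "should": [{"multi_match": {
                "query": "$query",
                "type": "most_fields",
                "fields": ["content", "title"]}}],
            "filter": "$filters",
        }
    }
}

body = render_custom_query(custom_query, {
    "$query": "Why did the revenue increase?",
    "$filters": {"term": {"meta.years": "2019"}},
})
```

Because the function builds new containers, the template itself is left untouched and can be reused for the next query.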
Module haystack_integrations.components.retrievers.opensearch.embedding_retriever
OpenSearchEmbeddingRetriever
Retrieves documents from the OpenSearchDocumentStore using a vector similarity metric.
Must be connected to the OpenSearchDocumentStore to run.
OpenSearchEmbeddingRetriever.__init__
def __init__(*,
document_store: OpenSearchDocumentStore,
filters: Optional[Dict[str, Any]] = None,
top_k: int = 10,
filter_policy: Union[str, FilterPolicy] = FilterPolicy.REPLACE,
custom_query: Optional[Dict[str, Any]] = None,
raise_on_failure: bool = True,
efficient_filtering: bool = False)
Creates the OpenSearchEmbeddingRetriever component.
Arguments:
document_store
: An instance of OpenSearchDocumentStore to use with the Retriever.
filters
: Filters applied when fetching documents from the Document Store. Filters are applied during the approximate kNN search to ensure the Retriever returns top_k matching documents.
top_k
: Maximum number of documents to return.
filter_policy
: Policy to determine how filters are applied. Possible options:
- merge: Runtime filters are merged with initialization filters.
- replace: Runtime filters replace initialization filters. Use this policy to change the filtering scope.
custom_query
: The custom OpenSearch query containing a mandatory $query_embedding and an optional $filters placeholder.
An example custom_query:
```python
{
    "query": {
        "bool": {
            "must": [
                {
                    "knn": {
                        "embedding": {
                            "vector": "$query_embedding",  # mandatory query placeholder
                            "k": 10000,
                        }
                    }
                }
            ],
            "filter": "$filters"  # optional filter placeholder
        }
    }
}
```
For this custom_query, an example run() could be:
retriever.run(
    query_embedding=embedding,
    filters={
        "operator": "AND",
        "conditions": [
            {"field": "meta.years", "operator": "==", "value": "2019"},
            {"field": "meta.quarters", "operator": "in", "value": ["Q1", "Q2"]},
        ],
    },
)
raise_on_failure
: If True, raises an exception if the API call fails. If False, logs a warning and returns an empty list.
efficient_filtering
: If True, the filter is applied during the approximate kNN search. This is only supported for the kNN engines "faiss" and "lucene" and does not work with the default "nmslib" engine.
Raises:
ValueError
: If document_store is not an instance of OpenSearchDocumentStore.
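The difference between the replace and merge filter policies can be sketched with a small helper. `apply_policy` is hypothetical, and combining the two filters under a single AND is an assumption about what "merged" means. The library's own merge logic may combine conditions differently.

```python
from typing import Any, Dict, Optional

Filters = Optional[Dict[str, Any]]


def apply_policy(policy: str, init_filters: Filters, runtime_filters: Filters) -> Filters:
    """Decide which filters a query uses, given a policy name."""
    if not runtime_filters:
        # Nothing passed at run time: fall back to the initialization filters.
        return init_filters
    if policy == "merge" and init_filters:
        # Both sets must hold, so wrap them in a logical AND.
        return {"operator": "AND", "conditions": [init_filters, runtime_filters]}
    # "replace": the runtime filters win outright.
    return runtime_filters


init_filters = {"field": "meta.years", "operator": "==", "value": "2019"}
runtime_filters = {"field": "meta.quarters", "operator": "in", "value": ["Q1", "Q2"]}

replaced = apply_policy("replace", init_filters, runtime_filters)
merged = apply_policy("merge", init_filters, runtime_filters)
fallback = apply_policy("replace", init_filters, None)
```

With replace, only the quarter filter is used. With merge, a document has to satisfy both the year and the quarter condition.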
OpenSearchEmbeddingRetriever.to_dict
def to_dict() -> Dict[str, Any]
Serializes the component to a dictionary.
Returns:
Dictionary with serialized data.
OpenSearchEmbeddingRetriever.from_dict
@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "OpenSearchEmbeddingRetriever"
Deserializes the component from a dictionary.
Arguments:
data
: Dictionary to deserialize from.
Returns:
Deserialized component.
OpenSearchEmbeddingRetriever.run
@component.output_types(documents=List[Document])
def run(query_embedding: List[float],
filters: Optional[Dict[str, Any]] = None,
top_k: Optional[int] = None,
custom_query: Optional[Dict[str, Any]] = None,
efficient_filtering: Optional[bool] = None)
Retrieve documents using a vector similarity metric.
Arguments:
query_embedding
: Embedding of the query.
filters
: Filters applied when fetching documents from the Document Store. Filters are applied during the approximate kNN search to ensure the Retriever returns top_k matching documents. The way runtime filters are applied depends on the filter_policy selected when initializing the Retriever.
top_k
: Maximum number of documents to return.
custom_query
: A custom OpenSearch query containing a mandatory $query_embedding and an optional $filters placeholder.
An example custom_query:
```python
{
    "query": {
        "bool": {
            "must": [
                {
                    "knn": {
                        "embedding": {
                            "vector": "$query_embedding",  # mandatory query placeholder
                            "k": 10000,
                        }
                    }
                }
            ],
            "filter": "$filters"  # optional filter placeholder
        }
    }
}
```
For this custom_query, an example run() could be:
retriever.run(
    query_embedding=embedding,
    filters={
        "operator": "AND",
        "conditions": [
            {"field": "meta.years", "operator": "==", "value": "2019"},
            {"field": "meta.quarters", "operator": "in", "value": ["Q1", "Q2"]},
        ],
    },
)
efficient_filtering
: If True, the filter is applied during the approximate kNN search. This is only supported for the kNN engines "faiss" and "lucene" and does not work with the default "nmslib" engine.
Returns:
Dictionary with key "documents" containing the retrieved Documents.
- documents: List of Documents similar to query_embedding.
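One way to picture what efficient_filtering changes is the position of the filter in the request body. In OpenSearch's query DSL, a filter can sit inside the knn clause, where it is applied while the search runs, or next to it in the enclosing bool query. The sketch below builds both shapes. `build_knn_body` is a hypothetical helper, the field name "embedding" is taken from the custom_query example, and the filter is assumed to be in OpenSearch syntax already. The library's actual request may differ.

```python
from typing import Any, Dict, List, Optional


def build_knn_body(query_embedding: List[float], top_k: int,
                   filters: Optional[Dict[str, Any]] = None,
                   efficient_filtering: bool = False) -> Dict[str, Any]:
    """Build a kNN search body with the filter in one of two positions."""
    knn_clause: Dict[str, Any] = {"vector": query_embedding, "k": top_k}
    body: Dict[str, Any] = {
        "query": {"bool": {"must": [{"knn": {"embedding": knn_clause}}]}},
        "size": top_k,
    }
    if filters:
        if efficient_filtering:
            # Filter travels with the kNN clause itself.
            knn_clause["filter"] = filters
        else:
            # Filter sits beside the kNN clause in the bool query.
            body["query"]["bool"]["filter"] = filters
    return body


year_filter = {"term": {"meta.years": "2019"}}
inside = build_knn_body([0.1, 0.2, 0.3], top_k=5, filters=year_filter, efficient_filtering=True)
beside = build_knn_body([0.1, 0.2, 0.3], top_k=5, filters=year_filter)
```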
Module haystack_integrations.document_stores.opensearch.document_store
OpenSearchDocumentStore
OpenSearchDocumentStore.__init__
def __init__(*,
hosts: Optional[Hosts] = None,
index: str = "default",
max_chunk_bytes: int = DEFAULT_MAX_CHUNK_BYTES,
embedding_dim: int = 768,
return_embedding: bool = False,
method: Optional[Dict[str, Any]] = None,
mappings: Optional[Dict[str, Any]] = None,
settings: Optional[Dict[str, Any]] = DEFAULT_SETTINGS,
create_index: bool = True,
http_auth: Any = None,
use_ssl: Optional[bool] = None,
verify_certs: Optional[bool] = None,
timeout: Optional[int] = None,
**kwargs)
Creates a new OpenSearchDocumentStore instance.
The embedding_dim, method, mappings, and settings arguments are only used if the index does not exist and needs to be created. If the index already exists, its current configuration is used.
For more information on connection parameters, see the official OpenSearch documentation.
Arguments:
hosts
: List of hosts running the OpenSearch client. Defaults to None.
index
: Name of the index in OpenSearch. If it doesn't exist, it is created. Defaults to "default".
max_chunk_bytes
: Maximum size of the requests in bytes. Defaults to 100MB.
embedding_dim
: Dimension of the embeddings. Defaults to 768.
return_embedding
: Whether to return the embedding of the retrieved Documents.
method
: The method definition of the underlying configuration of the approximate k-NN algorithm. Please see the official OpenSearch docs for more information. Defaults to None.
mappings
: The mapping of how the documents are stored and indexed. Please see the official OpenSearch docs for more information. If None, it uses the embedding_dim and method arguments to create default mappings. Defaults to None.
settings
: The settings of the index to be created. Please see the official OpenSearch docs for more information. Defaults to {"index.knn": True}.
create_index
: Whether to create the index if it doesn't exist. Defaults to True.
http_auth
: http_auth parameter passed to the underlying connection class. For basic authentication with the default connection class Urllib3HttpConnection, this can be:
- a tuple of (username, password)
- a list of [username, password]
- a string of "username:password"
For AWS authentication with Urllib3HttpConnection, pass an instance of AWSAuth. Defaults to None.
use_ssl
: Whether to use SSL. Defaults to None.
verify_certs
: Whether to verify certificates. Defaults to None.
timeout
: Timeout in seconds. Defaults to None.
**kwargs
: Optional arguments that OpenSearch takes. For the full list of supported kwargs, see the official OpenSearch reference.
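To show how embedding_dim, method, mappings, and settings could fit together when an index is created, here is a rough sketch of an index body. `build_index_body` is a hypothetical helper, and the field names "embedding" and "content" are assumptions borrowed from the custom_query examples. The default mappings the library actually generates may contain more than this.

```python
from typing import Any, Dict, Optional


def build_index_body(embedding_dim: int = 768,
                     method: Optional[Dict[str, Any]] = None,
                     mappings: Optional[Dict[str, Any]] = None,
                     settings: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Assemble a create-index body, deriving mappings when none are given."""
    if mappings is None:
        # knn_vector is the OpenSearch field type for k-NN embeddings.
        embedding_field: Dict[str, Any] = {"type": "knn_vector", "dimension": embedding_dim}
        if method:
            embedding_field["method"] = method
        mappings = {
            "properties": {
                "embedding": embedding_field,
                "content": {"type": "text"},
            }
        }
    if settings is None:
        settings = {"index.knn": True}  # the default stated for `settings`
    return {"mappings": mappings, "settings": settings}


default_body = build_index_body()
faiss_body = build_index_body(
    embedding_dim=384,
    method={"name": "hnsw", "engine": "faiss", "space_type": "l2"},
)
```

Explicit mappings take precedence in this sketch, which mirrors the statement that embedding_dim and method are only used when mappings is None.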
OpenSearchDocumentStore.create_index
def create_index(index: Optional[str] = None,
mappings: Optional[Dict[str, Any]] = None,
settings: Optional[Dict[str, Any]] = None) -> None
Creates an index in OpenSearch.
Note that this method ignores the create_index argument from the constructor.
Arguments:
index
: Name of the index to create. If None, the index name from the constructor is used.
mappings
: The mapping of how the documents are stored and indexed. Please see the official OpenSearch docs for more information. If None, the mappings from the constructor are used.
settings
: The settings of the index to be created. Please see the official OpenSearch docs for more information. If None, the settings from the constructor are used.
OpenSearchDocumentStore.to_dict
def to_dict() -> Dict[str, Any]
Serializes the component to a dictionary.
Returns:
Dictionary with serialized data.
OpenSearchDocumentStore.from_dict
@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "OpenSearchDocumentStore"
Deserializes the component from a dictionary.
Arguments:
data
: Dictionary to deserialize from.
Returns:
Deserialized component.
OpenSearchDocumentStore.count_documents
def count_documents() -> int
Returns how many documents are present in the document store.
OpenSearchDocumentStore.write_documents
def write_documents(documents: List[Document],
policy: DuplicatePolicy = DuplicatePolicy.NONE) -> int
Writes Documents to OpenSearch. If policy is not specified or set to DuplicatePolicy.NONE, it will raise an exception if a document with the same ID already exists in the document store.
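The duplicate handling described above can be sketched with an in-memory dictionary standing in for the index. Everything here is a local stand-in: the `DuplicatePolicy` enum mirrors Haystack's policy names, `DuplicateDocumentError` and this `write_documents` function are hypothetical, and the SKIP and OVERWRITE behavior is assumed from Haystack's general policy semantics, since this page only describes NONE.

```python
from enum import Enum
from typing import Dict, List, Tuple


class DuplicatePolicy(Enum):
    """Local stand-in that mirrors Haystack's policy names."""
    NONE = "none"
    SKIP = "skip"
    OVERWRITE = "overwrite"
    FAIL = "fail"


class DuplicateDocumentError(Exception):
    """Raised when a document ID already exists and the policy forbids it."""


def write_documents(store: Dict[str, str], documents: List[Tuple[str, str]],
                    policy: DuplicatePolicy = DuplicatePolicy.NONE) -> int:
    """Write (id, content) pairs into `store`; return how many were written."""
    if policy is DuplicatePolicy.NONE:
        policy = DuplicatePolicy.FAIL  # NONE behaves like FAIL, as described above
    written = 0
    for doc_id, content in documents:
        if doc_id in store:
            if policy is DuplicatePolicy.FAIL:
                raise DuplicateDocumentError(f"ID '{doc_id}' already exists")
            if policy is DuplicatePolicy.SKIP:
                continue  # keep the existing document
        store[doc_id] = content
        written += 1
    return written


store: Dict[str, str] = {}
first = write_documents(store, [("1", "old text")])
skipped = write_documents(store, [("1", "new text")], DuplicatePolicy.SKIP)
overwritten = write_documents(store, [("1", "new text")], DuplicatePolicy.OVERWRITE)
```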
OpenSearchDocumentStore.delete_documents
def delete_documents(document_ids: List[str]) -> None
Deletes all documents with matching document_ids from the document store.
Arguments:
document_ids
: The IDs of the documents to delete.
Module haystack_integrations.document_stores.opensearch.filters
normalize_filters
def normalize_filters(filters: Dict[str, Any]) -> Dict[str, Any]
Converts Haystack filters into OpenSearch-compatible filters.
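To give a feel for this kind of conversion, the sketch below maps a few Haystack filter operators onto OpenSearch query clauses. `to_opensearch_filter` is a hypothetical, simplified function and not the library's normalize_filters. It handles only the operators shown and leaves field names untouched, which the real function may not do.

```python
from typing import Any, Dict

LOGICAL = {"AND": "must", "OR": "should", "NOT": "must_not"}
RANGE = {">": "gt", ">=": "gte", "<": "lt", "<=": "lte"}


def to_opensearch_filter(condition: Dict[str, Any]) -> Dict[str, Any]:
    """Convert a Haystack-style filter into an OpenSearch query clause."""
    operator = condition["operator"]
    if operator in LOGICAL:
        # Logical operators wrap their converted children in a bool clause.
        clauses = [to_opensearch_filter(child) for child in condition["conditions"]]
        return {"bool": {LOGICAL[operator]: clauses}}
    field, value = condition["field"], condition["value"]
    if operator == "==":
        return {"term": {field: value}}
    if operator == "in":
        return {"terms": {field: value}}
    if operator in RANGE:
        return {"range": {field: {RANGE[operator]: value}}}
    raise ValueError(f"Unsupported operator: {operator}")


haystack_filters = {
    "operator": "AND",
    "conditions": [
        {"field": "meta.years", "operator": "==", "value": "2019"},
        {"field": "meta.quarters", "operator": "in", "value": ["Q1", "Q2"]},
    ],
}
opensearch_filters = to_opensearch_filter(haystack_filters)
```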