
OpenSearch integration for Haystack

Module haystack_integrations.components.retrievers.opensearch.bm25_retriever

OpenSearchBM25Retriever

OpenSearchBM25Retriever.__init__

def __init__(*,
             document_store: OpenSearchDocumentStore,
             filters: Optional[Dict[str, Any]] = None,
             fuzziness: str = "AUTO",
             top_k: int = 10,
             scale_score: bool = False,
             all_terms_must_match: bool = False,
             filter_policy: Union[str, FilterPolicy] = FilterPolicy.REPLACE,
             custom_query: Optional[Dict[str, Any]] = None,
             raise_on_failure: bool = True)

Create the OpenSearchBM25Retriever component.

Arguments:

  • document_store: An instance of OpenSearchDocumentStore.

  • filters: Filters applied to the retrieved Documents. Defaults to None.

  • fuzziness: Fuzziness parameter for full-text queries. Defaults to "AUTO".

  • top_k: Maximum number of Documents to return. Defaults to 10.

  • scale_score: Whether to scale the score of retrieved documents between 0 and 1. This is useful when comparing documents across different indexes. Defaults to False.

  • all_terms_must_match: If True, all terms in the query string must be present in the retrieved documents. This is useful when searching for short text where even one term can make a difference. Defaults to False.

  • filter_policy: Policy to determine how filters are applied.

  • custom_query: The query containing a mandatory $query and an optional $filters placeholder. An example custom_query:

    {
        "query": {
            "bool": {
                "should": [{"multi_match": {
                    "query": "$query",                 // mandatory query placeholder
                    "type": "most_fields",
                    "fields": ["content", "title"]}}],
                "filter": "$filters"                  // optional filter placeholder
            }
        }
    }
    

For this custom_query, a sample run() could be:

retriever.run(query="Why did the revenue increase?",
              filters={"years": ["2019"], "quarters": ["Q1", "Q2"]})

  • raise_on_failure: Whether to raise an exception if the API call fails. If False, a warning is logged and an empty list is returned.

Raises:

  • ValueError: If document_store is not an instance of OpenSearchDocumentStore.
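
The placeholder mechanism described for custom_query can be pictured with a small sketch. The helper `fill_placeholders` below is hypothetical and is not part of the integration; it only illustrates one way `$query` and `$filters` could be swapped for real values before a request is sent. The filter value is assumed to be already in OpenSearch syntax.

```python
from typing import Any, Dict


def fill_placeholders(template: Any, values: Dict[str, Any]) -> Any:
    """Recursively replace placeholder strings such as "$query" in a nested query.

    Hypothetical helper for illustration only; not the integration's own code.
    """
    if isinstance(template, dict):
        return {key: fill_placeholders(item, values) for key, item in template.items()}
    if isinstance(template, list):
        return [fill_placeholders(item, values) for item in template]
    if isinstance(template, str) and template in values:
        return values[template]
    return template


custom_query = {
    "query": {
        "bool": {
            "should": [
                {
                    "multi_match": {
                        "query": "$query",  # mandatory query placeholder
                        "type": "most_fields",
                        "fields": ["content", "title"],
                    }
                }
            ],
            "filter": "$filters",  # optional filter placeholder
        }
    }
}

body = fill_placeholders(
    custom_query,
    {
        "$query": "Why did the revenue increase?",
        # Assumption: filters were already converted to OpenSearch syntax.
        "$filters": {"terms": {"years": ["2019"]}},
    },
)
```

Because the helper builds new dictionaries instead of mutating the template, the same custom_query can be reused across calls.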

OpenSearchBM25Retriever.to_dict

def to_dict() -> Dict[str, Any]

Serializes the component to a dictionary.

Returns:

Dictionary with serialized data.

OpenSearchBM25Retriever.from_dict

@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "OpenSearchBM25Retriever"

Deserializes the component from a dictionary.

Arguments:

  • data: Dictionary to deserialize from.

Returns:

Deserialized component.

OpenSearchBM25Retriever.run

@component.output_types(documents=List[Document])
def run(query: str,
        filters: Optional[Dict[str, Any]] = None,
        all_terms_must_match: Optional[bool] = None,
        top_k: Optional[int] = None,
        fuzziness: Optional[str] = None,
        scale_score: Optional[bool] = None,
        custom_query: Optional[Dict[str, Any]] = None)

Retrieve documents using BM25 retrieval.

Arguments:

  • query: The query string.

  • filters: Filters applied to the retrieved Documents. The way runtime filters are applied depends on the filter_policy chosen at retriever initialization. See init method docstring for more details.

  • all_terms_must_match: If True, all terms in the query string must be present in the retrieved documents.

  • top_k: Maximum number of Documents to return.

  • fuzziness: Fuzziness parameter for full-text queries.

  • scale_score: Whether to scale the score of retrieved documents between 0 and 1. This is useful when comparing documents across different indexes.

  • custom_query: The query containing a mandatory $query and an optional $filters placeholder. An example custom_query:

    {
        "query": {
            "bool": {
                "should": [{"multi_match": {
                    "query": "$query",                 // mandatory query placeholder
                    "type": "most_fields",
                    "fields": ["content", "title"]}}],
                "filter": "$filters"                  // optional filter placeholder
            }
        }
    }
    

For this custom_query, a sample run() could be:

retriever.run(query="Why did the revenue increase?",
              filters={"years": ["2019"], "quarters": ["Q1", "Q2"]})

Returns:

A dictionary containing the retrieved documents with the following structure:

  • documents: List of retrieved Documents.
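
To show how the run() arguments could shape a full-text request, here is a minimal sketch that builds an OpenSearch `multi_match` body by hand. The function `build_bm25_body` is hypothetical, and the mapping of all_terms_must_match to the `operator` option is an assumption about how such a flag is typically expressed in the query DSL; the body the retriever actually sends may differ.

```python
from typing import Any, Dict


def build_bm25_body(
    query: str,
    top_k: int = 10,
    fuzziness: str = "AUTO",
    all_terms_must_match: bool = False,
) -> Dict[str, Any]:
    """Build a full-text search body (hypothetical helper, illustration only)."""
    # "AND" requires every query term to match; "OR" accepts any term.
    operator = "AND" if all_terms_must_match else "OR"
    return {
        "size": top_k,  # upper bound on the number of hits returned
        "query": {
            "bool": {
                "must": [
                    {
                        "multi_match": {
                            "query": query,
                            "fuzziness": fuzziness,
                            "type": "most_fields",
                            "operator": operator,
                        }
                    }
                ]
            }
        },
    }


bm25_body = build_bm25_body("revenue increase", top_k=3, all_terms_must_match=True)
```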

Module haystack_integrations.components.retrievers.opensearch.embedding_retriever

OpenSearchEmbeddingRetriever

Uses a vector similarity metric to retrieve documents from the OpenSearchDocumentStore.

Needs to be connected to the OpenSearchDocumentStore to run.

OpenSearchEmbeddingRetriever.__init__

def __init__(*,
             document_store: OpenSearchDocumentStore,
             filters: Optional[Dict[str, Any]] = None,
             top_k: int = 10,
             filter_policy: Union[str, FilterPolicy] = FilterPolicy.REPLACE,
             custom_query: Optional[Dict[str, Any]] = None,
             raise_on_failure: bool = True)

Create the OpenSearchEmbeddingRetriever component.

Arguments:

  • document_store: An instance of OpenSearchDocumentStore.

  • filters: Filters applied to the retrieved Documents. Defaults to None. Filters are applied during the approximate kNN search to ensure that top_k matching documents are returned.

  • top_k: Maximum number of Documents to return. Defaults to 10.

  • filter_policy: Policy to determine how filters are applied.

  • custom_query: The query containing a mandatory $query_embedding and an optional $filters placeholder. An example custom_query:

    {
        "query": {
            "bool": {
                "must": [
                    {
                        "knn": {
                            "embedding": {
                                "vector": "$query_embedding",   // mandatory query placeholder
                                "k": 10000,
                            }
                        }
                    }
                ],
                "filter": "$filters"                            // optional filter placeholder
            }
        }
    }
    

For this custom_query, a sample run() could be:

retriever.run(query_embedding=embedding,
              filters={"years": ["2019"], "quarters": ["Q1", "Q2"]})

  • raise_on_failure: Whether to raise an exception if the API call fails. If False, a warning is logged and an empty list is returned.

Raises:

  • ValueError: If document_store is not an instance of OpenSearchDocumentStore.
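
The note that filters are applied during the approximate kNN search can be illustrated with a hand-built request body. This is a sketch under assumptions: `build_knn_body` is a hypothetical helper, the vector field is assumed to be named `embedding` (as in the custom_query example above), and placing the filter inside the `knn` clause relies on OpenSearch's filtered k-NN support, which depends on the engine and server version.

```python
from typing import Any, Dict, List, Optional


def build_knn_body(
    query_embedding: List[float],
    top_k: int = 10,
    filters: Optional[Dict[str, Any]] = None,
    field: str = "embedding",
) -> Dict[str, Any]:
    """Build an approximate k-NN search body (hypothetical helper, illustration only)."""
    knn_clause: Dict[str, Any] = {"vector": list(query_embedding), "k": top_k}
    if filters:
        # Filtering inside the knn clause restricts candidates during the
        # search, so up to top_k matching documents can still be returned.
        knn_clause["filter"] = filters
    return {"size": top_k, "query": {"knn": {field: knn_clause}}}


knn_body = build_knn_body(
    [0.1, 0.2, 0.3],
    top_k=2,
    filters={"terms": {"years": ["2019"]}},  # assumed to be OpenSearch syntax already
)
```

Filtering after the search instead would risk returning fewer than top_k documents, because some of the nearest neighbours could be discarded.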

OpenSearchEmbeddingRetriever.to_dict

def to_dict() -> Dict[str, Any]

Serializes the component to a dictionary.

Returns:

Dictionary with serialized data.

OpenSearchEmbeddingRetriever.from_dict

@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "OpenSearchEmbeddingRetriever"

Deserializes the component from a dictionary.

Arguments:

  • data: Dictionary to deserialize from.

Returns:

Deserialized component.

OpenSearchEmbeddingRetriever.run

@component.output_types(documents=List[Document])
def run(query_embedding: List[float],
        filters: Optional[Dict[str, Any]] = None,
        top_k: Optional[int] = None,
        custom_query: Optional[Dict[str, Any]] = None)

Retrieve documents using a vector similarity metric.

Arguments:

  • query_embedding: Embedding of the query.

  • filters: Filters applied to the retrieved Documents. The way runtime filters are applied depends on the filter_policy chosen at retriever initialization. See init method docstring for more details.

  • top_k: Maximum number of Documents to return.

  • custom_query: The query containing a mandatory $query_embedding and an optional $filters placeholder. An example custom_query:

    {
        "query": {
            "bool": {
                "must": [
                    {
                        "knn": {
                            "embedding": {
                                "vector": "$query_embedding",   // mandatory query placeholder
                                "k": 10000,
                            }
                        }
                    }
                ],
                "filter": "$filters"                            // optional filter placeholder
            }
        }
    }
    

For this custom_query, a sample run() could be:

retriever.run(query_embedding=embedding,
              filters={"years": ["2019"], "quarters": ["Q1", "Q2"]})

Returns:

Dictionary with key "documents" containing the retrieved Documents.

  • documents: List of Documents most similar to query_embedding.
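
Since the effect of runtime filters depends on filter_policy, a simplified sketch of the decision may help. Everything here is illustrative: `SketchFilterPolicy` is a stand-in enum rather than the library's FilterPolicy, the existence of a merge-style policy is an assumption, and the merge rule shown (joining both filter sets with a logical AND) is a simplification rather than the library's exact behaviour.

```python
from enum import Enum
from typing import Any, Dict, Optional


class SketchFilterPolicy(Enum):
    """Stand-in for the library's FilterPolicy, for illustration only."""

    REPLACE = "replace"
    MERGE = "merge"  # assumption: a merge-style policy exists


def resolve_filters(
    policy: SketchFilterPolicy,
    init_filters: Optional[Dict[str, Any]],
    runtime_filters: Optional[Dict[str, Any]],
) -> Optional[Dict[str, Any]]:
    """Pick the filters to use for one run() call (hypothetical helper)."""
    if not runtime_filters:
        # Nothing passed at run time: fall back to the init-time filters.
        return init_filters
    if policy is SketchFilterPolicy.MERGE and init_filters:
        # Simplified merge: both filter sets must hold.
        return {"operator": "AND", "conditions": [init_filters, runtime_filters]}
    # REPLACE: runtime filters win over init-time filters.
    return runtime_filters


init_f = {"field": "meta.years", "operator": "==", "value": "2019"}
run_f = {"field": "meta.quarters", "operator": "in", "value": ["Q1", "Q2"]}

replaced = resolve_filters(SketchFilterPolicy.REPLACE, init_f, run_f)
merged = resolve_filters(SketchFilterPolicy.MERGE, init_f, run_f)
```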

Module haystack_integrations.document_stores.opensearch.document_store

OpenSearchDocumentStore

OpenSearchDocumentStore.__init__

def __init__(*,
             hosts: Optional[Hosts] = None,
             index: str = "default",
             max_chunk_bytes: int = DEFAULT_MAX_CHUNK_BYTES,
             embedding_dim: int = 768,
             return_embedding: bool = False,
             method: Optional[Dict[str, Any]] = None,
             mappings: Optional[Dict[str, Any]] = None,
             settings: Optional[Dict[str, Any]] = DEFAULT_SETTINGS,
             create_index: bool = True,
             **kwargs)

Creates a new OpenSearchDocumentStore instance.

The embedding_dim, method, mappings, and settings arguments are only used if the index does not exist and needs to be created. If the index already exists, its current configuration is used.

For more information on connection parameters, see the official OpenSearch documentation.

Arguments:

  • hosts: List of hosts running the OpenSearch client. Defaults to None.
  • index: Name of the index in OpenSearch. If it doesn't exist, it will be created. Defaults to "default".
  • max_chunk_bytes: Maximum size of the requests in bytes. Defaults to 100MB.
  • embedding_dim: Dimension of the embeddings. Defaults to 768.
  • return_embedding: Whether to return the embedding of the retrieved Documents. Defaults to False.
  • method: The method definition of the underlying configuration of the approximate k-NN algorithm. Please see the official OpenSearch docs for more information. Defaults to None.
  • mappings: The mapping of how the documents are stored and indexed. Please see the official OpenSearch docs for more information. If None, it uses the embedding_dim and method arguments to create default mappings. Defaults to None.
  • settings: The settings of the index to be created. Please see the official OpenSearch docs for more information. Defaults to {"index.knn": True}.
  • create_index: Whether to create the index if it doesn't exist. Defaults to True.
  • **kwargs: Optional arguments that OpenSearch takes. For the full list of supported kwargs, see the official OpenSearch reference.
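
The interplay of embedding_dim, method, mappings, and settings can be sketched as a function that assembles an index-creation body. This is a reduced illustration, not the store's real default mapping: `build_index_body` is a hypothetical helper, the vector field name `embedding` is assumed from the custom_query examples, and the real defaults likely contain additional fields.

```python
from typing import Any, Dict, Optional


def build_index_body(
    embedding_dim: int = 768,
    method: Optional[Dict[str, Any]] = None,
    mappings: Optional[Dict[str, Any]] = None,
    settings: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Assemble an index-creation body (hypothetical helper, illustration only)."""
    if mappings is None:
        # No explicit mappings: derive a vector field from embedding_dim and method.
        embedding_field: Dict[str, Any] = {"type": "knn_vector", "dimension": embedding_dim}
        if method is not None:
            embedding_field["method"] = method
        mappings = {"properties": {"embedding": embedding_field}}
    if settings is None:
        settings = {"index.knn": True}  # default stated in the argument list
    return {"mappings": mappings, "settings": settings}


index_body = build_index_body(
    embedding_dim=384,
    # Example method definition; the values are illustrative choices.
    method={"name": "hnsw", "space_type": "l2", "engine": "lucene"},
)
```

Passing explicit mappings bypasses embedding_dim and method entirely, which mirrors the rule that those two arguments only feed the default mappings.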

OpenSearchDocumentStore.create_index

def create_index(index: Optional[str] = None,
                 mappings: Optional[Dict[str, Any]] = None,
                 settings: Optional[Dict[str, Any]] = None) -> None

Creates an index in OpenSearch.

Note that this method ignores the create_index argument from the constructor.

Arguments:

  • index: Name of the index to create. If None, the index name from the constructor is used.
  • mappings: The mapping of how the documents are stored and indexed. Please see the official OpenSearch docs for more information. If None, the mappings from the constructor are used.
  • settings: The settings of the index to be created. Please see the official OpenSearch docs for more information. If None, the settings from the constructor are used.

OpenSearchDocumentStore.to_dict

def to_dict() -> Dict[str, Any]

Serializes the component to a dictionary.

Returns:

Dictionary with serialized data.

OpenSearchDocumentStore.from_dict

@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "OpenSearchDocumentStore"

Deserializes the component from a dictionary.

Arguments:

  • data: Dictionary to deserialize from.

Returns:

Deserialized component.

OpenSearchDocumentStore.count_documents

def count_documents() -> int

Returns how many documents are present in the document store.

OpenSearchDocumentStore.write_documents

def write_documents(documents: List[Document],
                    policy: DuplicatePolicy = DuplicatePolicy.NONE) -> int

Writes Documents to OpenSearch. If policy is not specified or is set to DuplicatePolicy.NONE, an exception is raised when a document with the same ID already exists in the document store.
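
The duplicate handling can be pictured with an in-memory sketch. The names are stand-ins: `SketchDuplicatePolicy` only mimics the library's DuplicatePolicy, the SKIP, OVERWRITE, and FAIL members are assumptions beyond what this page states, a plain dictionary plays the role of the index, and ValueError replaces whatever exception the store really raises.

```python
from enum import Enum
from typing import Dict, List, Tuple


class SketchDuplicatePolicy(Enum):
    """Stand-in for the library's DuplicatePolicy, for illustration only."""

    NONE = "none"
    SKIP = "skip"
    OVERWRITE = "overwrite"
    FAIL = "fail"


def write_documents(
    store: Dict[str, str],
    documents: List[Tuple[str, str]],
    policy: SketchDuplicatePolicy = SketchDuplicatePolicy.NONE,
) -> int:
    """Write (id, content) pairs into a dict and return how many were written."""
    if policy is SketchDuplicatePolicy.NONE:
        # As documented above, NONE behaves like failing on duplicates.
        policy = SketchDuplicatePolicy.FAIL
    written = 0
    for doc_id, content in documents:
        if doc_id in store:
            if policy is SketchDuplicatePolicy.FAIL:
                raise ValueError(f"Document with ID '{doc_id}' already exists")
            if policy is SketchDuplicatePolicy.SKIP:
                continue  # keep the existing document untouched
        store[doc_id] = content  # new document, or OVERWRITE of an existing one
        written += 1
    return written


store: Dict[str, str] = {}
first_write = write_documents(store, [("a", "first version")])
skipped = write_documents(store, [("a", "ignored")], SketchDuplicatePolicy.SKIP)
overwritten = write_documents(store, [("a", "second version")], SketchDuplicatePolicy.OVERWRITE)
```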

OpenSearchDocumentStore.delete_documents

def delete_documents(document_ids: List[str]) -> None

Deletes all documents whose IDs match the given document_ids from the document store.

Arguments:

  • document_ids: The IDs of the documents to delete.

Module haystack_integrations.document_stores.opensearch.filters

normalize_filters

def normalize_filters(filters: Dict[str, Any]) -> Dict[str, Any]

Converts Haystack filters into OpenSearch-compatible filters.
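
As a rough picture of what such a conversion involves, the sketch below translates a small subset of filter operators into OpenSearch query clauses. It is not the implementation of normalize_filters: `to_opensearch` is a hypothetical function, the input format with "field", "operator", and "value" keys is an assumption about the Haystack filter syntax, and the real function may produce a differently shaped result.

```python
from typing import Any, Dict

# Comparison operators mapped to OpenSearch range keywords.
RANGE_OPERATORS = {">": "gt", ">=": "gte", "<": "lt", "<=": "lte"}


def to_opensearch(filters: Dict[str, Any]) -> Dict[str, Any]:
    """Convert a nested filter dict into OpenSearch clauses (illustration only)."""
    operator = filters["operator"]
    if operator in ("AND", "OR"):
        # Logical operators wrap their converted conditions in a bool query.
        clause = "must" if operator == "AND" else "should"
        return {"bool": {clause: [to_opensearch(c) for c in filters["conditions"]]}}
    field, value = filters["field"], filters["value"]
    if operator == "==":
        return {"term": {field: value}}
    if operator == "in":
        return {"terms": {field: list(value)}}
    if operator in RANGE_OPERATORS:
        return {"range": {field: {RANGE_OPERATORS[operator]: value}}}
    raise ValueError(f"Unsupported operator in this sketch: {operator}")


converted = to_opensearch(
    {
        "operator": "AND",
        "conditions": [
            {"field": "years", "operator": "in", "value": ["2019"]},
            {"field": "revenue", "operator": ">=", "value": 100},
        ],
    }
)
```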