Haystack

DocumentStore

You can think of the DocumentStore as a database that stores your texts and metadata and provides them to the Retriever at query time. Learn how to choose the best DocumentStore for your use case and how to use it in a pipeline.

Use with Retrievers

By far the most common way to use a DocumentStore in Haystack is to fetch documents using a Retriever. A DocumentStore needs to be provided as an argument to the initialization of a Retriever.
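
The relationship can be sketched without any Haystack dependency. The classes below (ToyDocumentStore, ToyKeywordRetriever) are hypothetical stand-ins and not Haystack's API; they only illustrate that the retriever receives the store at initialization and reads from it at query time.

```python
class ToyDocumentStore:
    """Hypothetical stand-in for a DocumentStore: holds texts and metadata."""

    def __init__(self):
        self._docs = []

    def write_documents(self, docs):
        self._docs.extend(docs)

    def get_all_documents(self):
        return list(self._docs)


class ToyKeywordRetriever:
    """Hypothetical stand-in for a Retriever: it is given the store up front."""

    def __init__(self, document_store):
        self.document_store = document_store

    def retrieve(self, query, top_k=3):
        terms = set(query.lower().split())
        scored = []
        for doc in self.document_store.get_all_documents():
            overlap = len(terms & set(doc["content"].lower().split()))
            if overlap:
                scored.append((overlap, doc))
        # Highest overlap first; the sort is stable, so ties keep insertion order.
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [doc for _, doc in scored[:top_k]]


store = ToyDocumentStore()
store.write_documents([
    {"content": "Berlin is the capital of Germany", "meta": {"name": "berlin"}},
    {"content": "Paris is the capital of France", "meta": {"name": "paris"}},
])
retriever = ToyKeywordRetriever(document_store=store)
results = retriever.retrieve("capital of France", top_k=1)
```

In Haystack itself the same pattern applies: you create the DocumentStore first and hand it to the Retriever's constructor.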

Initialization

Initializing a new DocumentStore within Haystack is straightforward.

Elasticsearch

Install Elasticsearch and then start an instance.

If you have Docker set up, we recommend pulling the Docker image and running it.

docker pull docker.elastic.co/elasticsearch/elasticsearch:7.9.2
docker run -d -p 9200:9200 -e "discovery.type=single-node" docker.elastic.co/elasticsearch/elasticsearch:7.9.2

Next you can initialize the Haystack object that will connect to this instance.

from haystack.document_stores import ElasticsearchDocumentStore
document_store = ElasticsearchDocumentStore()

Open Distro for Elasticsearch

Learn how to get started here. If you have Docker set up, we recommend pulling the Docker image and running it.

docker pull amazon/opendistro-for-elasticsearch:1.13.2
docker run -p 9200:9200 -p 9600:9600 -e "discovery.type=single-node" amazon/opendistro-for-elasticsearch:1.13.2

Next you can initialize the Haystack object that will connect to this instance.

from haystack.document_stores import OpenDistroElasticsearchDocumentStore
document_store = OpenDistroElasticsearchDocumentStore()

OpenSearch

Learn how to get started here. If you have Docker set up, we recommend pulling the Docker image and running it.

docker pull opensearchproject/opensearch:1.0.1
docker run -p 9200:9200 -p 9600:9600 -e "discovery.type=single-node" opensearchproject/opensearch:1.0.1

Next you can initialize the Haystack object that will connect to this instance.

from haystack.document_stores import OpenSearchDocumentStore
document_store = OpenSearchDocumentStore()

Milvus

Follow the official documentation to start a Milvus instance via Docker. Note that we also have a utility function haystack.utils.launch_milvus that can start up a Milvus instance.

You can initialize the Haystack object that will connect to this instance as follows:

from haystack.document_stores import MilvusDocumentStore
document_store = MilvusDocumentStore()

FAISS

The FAISSDocumentStore requires no external setup. Start it by simply using this line:

from haystack.document_stores import FAISSDocumentStore
document_store = FAISSDocumentStore(faiss_index_factory_str="Flat")

Save & Load

FAISS document stores can be saved to disk and reloaded:

from haystack.document_stores import FAISSDocumentStore
document_store = FAISSDocumentStore(faiss_index_factory_str="Flat")
# Generates two files: my_faiss_index.faiss and my_faiss_index.json
document_store.save("my_faiss_index.faiss")
# Looks for the two files generated above
new_document_store = FAISSDocumentStore.load("my_faiss_index.faiss")
assert new_document_store.faiss_index_factory_str == "Flat"

While my_faiss_index.faiss contains the index, my_faiss_index.json contains the parameters used to initialize it (like faiss_index_factory_str). This configuration file is necessary for load() to work. It simply contains the initial parameters in JSON format.

For example, a hand-written configuration file for the above FAISS index could look like:

{"faiss_index_factory_str": "Flat"}
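
Since the configuration file is read as JSON, a hand-written version needs double-quoted keys and string values. The following is a minimal sketch that uses only Python's standard json module to produce and check such a file. It assumes the parameter name faiss_index_factory_str from the constructor call above, and it writes into a temporary directory purely for illustration.

```python
import json
import tempfile
from pathlib import Path

# Parameters that were passed to the FAISSDocumentStore constructor.
init_params = {"faiss_index_factory_str": "Flat"}

# Write the config where the index file would live (a temp dir in this sketch).
config_path = Path(tempfile.mkdtemp()) / "my_faiss_index.json"
config_path.write_text(json.dumps(init_params))

# Reading it back confirms that the file is valid JSON.
loaded_params = json.loads(config_path.read_text())

# Single quotes and unquoted keys are not valid JSON and fail to parse.
try:
    json.loads("{faiss_index_factory_str: 'Flat'}")
    hand_written_is_valid = True
except json.JSONDecodeError:
    hand_written_is_valid = False
```

Generating the file with json.dumps rather than typing it by hand avoids quoting mistakes.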

In Memory

The InMemoryDocumentStore requires no external setup. Start it by simply using this line:

from haystack.document_stores import InMemoryDocumentStore
document_store = InMemoryDocumentStore()

SQL

The SQLDocumentStore requires SQLite, PostgreSQL or MySQL to be installed and started. Note that SQLite already comes packaged with most operating systems.

from haystack.document_stores import SQLDocumentStore
document_store = SQLDocumentStore()

Weaviate

The WeaviateDocumentStore requires a running Weaviate Server version 1.8 or later. To start a basic instance, run:

docker run -d -p 8080:8080 --env AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED='true' --env PERSISTENCE_DATA_PATH='/var/lib/weaviate' semitechnologies/weaviate:1.12.0

See the Weaviate docs for more details.

Afterwards, you can use it in Haystack:

from haystack.document_stores import WeaviateDocumentStore
document_store = WeaviateDocumentStore()

Each DocumentStore constructor allows for arguments specifying how to connect to existing databases and the names of indexes. See API documentation for more info.

Input Format

Cast your data into Document objects before writing into a DocumentStore:

from haystack import Document
from haystack.document_stores import ElasticsearchDocumentStore

document_store = ElasticsearchDocumentStore()
documents = [
    Document(
        content=DOCUMENT_TEXT_HERE,
        meta={'name': DOCUMENT_NAME, ...}
    ),
    ...
]
document_store.write_documents(documents)

You can also cast it into dictionaries:

from haystack.document_stores import ElasticsearchDocumentStore

document_store = ElasticsearchDocumentStore()
dicts = [
    {
        'content': DOCUMENT_TEXT_HERE,
        'meta': {'name': DOCUMENT_NAME, ...}
    }, ...
]
document_store.write_documents(dicts)
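
If your raw data has a different shape, a small helper can map it to this dictionary format. The sketch below assumes hypothetical input records with title and text fields, and to_document_dicts is an illustrative name; only the content and meta keys come from the format shown above.

```python
def to_document_dicts(records):
    """Map raw records to dicts with 'content' and 'meta' keys."""
    dicts = []
    for record in records:
        text = record.get("text", "").strip()
        if not text:
            continue  # skip records without any text
        dicts.append({"content": text, "meta": {"name": record.get("title", "")}})
    return dicts


# Hypothetical raw input; the field names are assumptions for this sketch.
raw_records = [
    {"title": "doc1.txt", "text": "  Haystack stores documents.  "},
    {"title": "empty.txt", "text": ""},
]
dicts = to_document_dicts(raw_records)
```

The resulting list can then be passed to document_store.write_documents(dicts).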

Writing Documents (Sparse Retrievers)

Haystack lets you store documents in an optimised fashion so that query times can be kept low. For sparse, keyword-based retrievers such as BM25 and TF-IDF, you simply have to call DocumentStore.write_documents(). The creation of the inverted index, which optimises querying speed, is handled automatically.

document_store.write_documents(dicts)
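
To make the term concrete: an inverted index maps each term to the documents that contain it, so a keyword query only has to touch matching documents. The following is a toy, pure-Python sketch of that idea. It is not how Elasticsearch or any Haystack DocumentStore implements it, and build_inverted_index and lookup are hypothetical names.

```python
from collections import defaultdict


def build_inverted_index(dicts):
    """Map each lowercased token to the set of list positions of documents containing it."""
    index = defaultdict(set)
    for doc_id, doc in enumerate(dicts):
        for token in doc["content"].lower().split():
            index[token].add(doc_id)
    return index


def lookup(index, query):
    """Return the sorted ids of documents that contain every query token."""
    token_sets = [index.get(token, set()) for token in query.lower().split()]
    if not token_sets:
        return []
    return sorted(set.intersection(*token_sets))


dicts = [
    {"content": "the quick brown fox", "meta": {"name": "a"}},
    {"content": "the lazy dog", "meta": {"name": "b"}},
    {"content": "quick dog tricks", "meta": {"name": "c"}},
]
index = build_inverted_index(dicts)
matches = lookup(index, "quick dog")
```

Here only the third document contains both query terms, and the lookup never scans the text of the other two.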

Writing Documents (Dense Retrievers)

For dense, neural-network-based retrievers such as Dense Passage Retrieval or Embedding Retrieval, indexing involves computing the Document embeddings, which are later compared against the Query embedding.

The storing of the text is handled by DocumentStore.write_documents() and the computation of the embeddings is started by DocumentStore.update_embeddings().

document_store.write_documents(dicts)
document_store.update_embeddings(retriever)

This step is computationally intensive because it runs the transformer-based encoders. GPU acceleration speeds it up significantly.
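
Conceptually, comparing Document embeddings against the Query embedding is a vector similarity ranking. The sketch below uses tiny hand-made vectors in place of real encoder output and ranks documents by dot product. The vectors and the rank_by_similarity helper are assumptions for illustration; real embeddings come from the retriever's model and have far more dimensions.

```python
def dot(a, b):
    """Dot product of two equally long vectors."""
    return sum(x * y for x, y in zip(a, b))


def rank_by_similarity(query_embedding, doc_embeddings, top_k=2):
    """Return the names of the top_k documents with the highest dot product."""
    scored = sorted(
        doc_embeddings.items(),
        key=lambda item: dot(query_embedding, item[1]),
        reverse=True,
    )
    return [name for name, _ in scored[:top_k]]


# Hand-made 3-dimensional "embeddings", stand-ins for encoder output.
doc_embeddings = {
    "doc_a": [0.9, 0.1, 0.0],
    "doc_b": [0.1, 0.8, 0.1],
    "doc_c": [0.0, 0.2, 0.9],
}
query_embedding = [1.0, 0.2, 0.0]
top_docs = rank_by_similarity(query_embedding, doc_embeddings)
```

The expensive part in practice is producing the embeddings, which is why they are computed once at indexing time rather than for every query.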

Approximate Nearest Neighbors Search

Approximate nearest neighbors (ANN) search brings significant improvements in document retrieval speed. It does so by approximating embedding similarity calculations rather than computing them exactly, which in turn brings a slight tradeoff in retrieval accuracy. You can use ANN in cases where documents have embeddings and where the collection of documents is sufficiently large. In our benchmarks, we found significant speed improvements when working with document collections as small as 10 thousand documents.

ANN is available in:

  • FAISSDocumentStore
  • OpenSearchDocumentStore
  • MilvusDocumentStore
  • WeaviateDocumentStore
  • PineconeDocumentStore

For most of these DocumentStores, the class constructor has an index_type argument that you can set to turn ANN on or off. Each of these DocumentStores has support for the HNSW algorithm which you can learn more about in Hierarchical Navigable Small Worlds. See the API documentation of the specific DocumentStore you are using for more information on its ANN parameters.
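
The speed and accuracy tradeoff of ANN can be illustrated with a toy example. The sketch below uses coarse bucketing: vectors are grouped by their nearest centroid, and a query only searches the closest bucket. This is not HNSW and not the implementation used by any of the DocumentStores above; the function names, the vectors and the fixed centroids are all assumptions for illustration.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, candidates):
    """Return the key of the candidate vector closest to point."""
    return min(candidates, key=lambda key: squared_distance(point, candidates[key]))


def build_buckets(vectors, centroids):
    """Assign every vector to the bucket of its nearest centroid."""
    buckets = {name: {} for name in centroids}
    for doc_id, vector in vectors.items():
        buckets[nearest(vector, centroids)][doc_id] = vector
    return buckets


def approximate_search(query, buckets, centroids):
    """Search only the bucket whose centroid is closest to the query."""
    bucket = buckets[nearest(query, centroids)]
    return nearest(query, bucket) if bucket else None


vectors = {
    "d1": [0.1, 0.2],
    "d2": [0.2, 0.1],
    "d3": [0.9, 0.8],
    "d4": [0.8, 0.9],
}
# Fixed centroids are an assumption; real systems derive them from the data.
centroids = {"low": [0.0, 0.0], "high": [1.0, 1.0]}
buckets = build_buckets(vectors, centroids)

query = [0.85, 0.8]
exact_hit = nearest(query, vectors)  # scans every stored vector
approx_hit = approximate_search(query, buckets, centroids)  # scans centroids, then one bucket
```

With only four vectors there is nothing to gain, but on large collections scanning one bucket is much cheaper than scanning everything. The price is that a true nearest neighbor sitting just across a bucket boundary can be missed, which is the accuracy tradeoff described above.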

Choosing the Right Document Store

The Document Stores have different characteristics. You should choose one depending on the maturity of your project, the use case and technical environment:

Elasticsearch

  Pros:
  • Fast & accurate sparse retrieval with many tuning options.
  • Basic support for dense retrieval.
  • Production-ready.
  • Support also for Open Distro.
  Cons:
  • Slow for dense retrieval with more than ~1 million documents.

Open Distro for Elasticsearch

  Pros:
  • Fully open source (Apache 2.0 license).
  • Essentially the same features as Elasticsearch.
  Cons:
  • Slow for dense retrieval with more than ~1 million documents.

OpenSearch

  Pros:
  • Fully open source (Apache 2.0 license).
  • Essentially the same features as Elasticsearch.
  • Has more support for vector similarity comparisons and approximate nearest neighbours algorithms.
  Cons:
  • Not as optimized as dedicated vector similarity options like Milvus and FAISS.

Milvus

  Pros:
  • Scalable DocumentStore that excels at handling vectors (hence suited to dense retrieval methods like DPR).
  • Encapsulates multiple ANN libraries (e.g. FAISS and ANNOY) and provides added reliability.
  • Runs as a separate service (e.g. a Docker container).
  • Allows dynamic data management.
  Cons:
  • No efficient sparse retrieval.
  • Does not support filters for queries.

FAISS

  Pros:
  • Fast & accurate dense retrieval.
  • Highly scalable due to approximate nearest neighbour algorithms (ANN).
  • Many options to tune dense retrieval via different index types (more info here).
  Cons:
  • No efficient sparse retrieval.
  • Does not support filters for queries.

In Memory

  Pros:
  • Simple.
  • No extra services or dependencies.
  Cons:
  • Slow retrieval on larger datasets.
  • No Approximate Nearest Neighbours (ANN).
  • Not recommended for production.

SQL

  Pros:
  • Simple & fast to test.
  • No database requirements.
  • Supports MySQL, PostgreSQL and SQLite.
  Cons:
  • Not scalable.
  • Not persisting your data on disk.

Weaviate

  Pros:
  • Simple vector search.
  • Stores everything in one place: documents, metadata and vectors, so less network overhead when scaling this up.
  • Allows combination of vector search and scalar filtering, i.e. you can filter for a certain tag and do dense retrieval on that subset.
  Cons:
  • Fewer options for ANN algorithms than FAISS or Milvus.
  • No BM25 / TF-IDF retrieval.
  • Does not support dot product similarity.

Pinecone

  Pros:
  • A fully managed service for large-scale dense retrieval.
  • Low query latency at any scale.
  • Live index updates.
  Cons:
  • Stores embeddings and metadata separately from the document content which makes it easier to set up infrastructure and maintenance.

Our Recommendations

Restricted environment: Use the InMemoryDocumentStore if you are just giving Haystack a quick try on a small sample and are working in a restricted environment that complicates running Elasticsearch or other databases.

Allrounder: Use the ElasticsearchDocumentStore if you want to evaluate the performance of different retrieval options (dense vs. sparse) and are aiming for a smooth transition from PoC to production.

Vector Specialist: Use the MilvusDocumentStore if you want to focus on dense retrieval and possibly deal with larger datasets.

Working with Existing Databases

If you have an existing Elasticsearch or OpenSearch database with indexed documents, you can quickly create a Haystack-compliant version of it with the elasticsearch_index_to_document_store or open_search_index_to_document_store function.

from haystack.document_stores import elasticsearch_index_to_document_store

new_ds = elasticsearch_index_to_document_store(
    document_store=empty_document_store,
    original_content_field="content",
    original_index_name="document",
    original_name_field="title",
    preprocessor=preprocessor,
    port=9201,
    verify_certs=False,
    scheme="https",
    username="admin",
    password="admin"
)

from haystack.document_stores import open_search_index_to_document_store

new_ds = open_search_index_to_document_store(
    document_store=empty_document_store,
    original_content_field="content",
    original_index_name="document",
    original_name_field="title",
    preprocessor=preprocessor,
    port=9201,
    verify_certs=False,
    scheme="https",
    username="admin",
    password="admin"
)

