DocumentStore

You can think of the DocumentStore as a database that stores your texts and metadata and provides them to the Retriever at query time. Learn how to choose the best DocumentStore for your use case and how to use it in a pipeline.

Use with Retrievers

By far the most common way to use a DocumentStore in Haystack is to fetch documents using a Retriever. You provide a DocumentStore as an argument when you initialize a Retriever.
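
The relationship between the two components can be pictured with a small, self-contained sketch. It does not use Haystack: ToyDocument, ToyDocumentStore, and ToyKeywordRetriever are hypothetical stand-ins, and the word-overlap scoring is an assumption chosen only to keep the example short. What it shows is the pattern described above: the retriever receives the store when it is initialized and reads from it at query time.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDocument:
    # Hypothetical stand-in for a document: text plus metadata.
    content: str
    meta: dict = field(default_factory=dict)


class ToyDocumentStore:
    """Holds documents in a plain list; stands in for a real DocumentStore."""

    def __init__(self):
        self._docs = []

    def write_documents(self, docs):
        self._docs.extend(docs)

    def get_all_documents(self):
        return list(self._docs)


class ToyKeywordRetriever:
    """Receives the store at initialization and reads from it at query time."""

    def __init__(self, document_store):
        self.document_store = document_store

    def retrieve(self, query, top_k=3):
        terms = set(query.lower().split())
        scored = []
        for doc in self.document_store.get_all_documents():
            # Score = number of query words that also occur in the document.
            overlap = len(terms & set(doc.content.lower().split()))
            if overlap:
                scored.append((overlap, doc))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [doc for _, doc in scored[:top_k]]


store = ToyDocumentStore()
store.write_documents([
    ToyDocument("Berlin is the capital of Germany", {"name": "berlin.txt"}),
    ToyDocument("Paris is the capital of France", {"name": "paris.txt"}),
])
retriever = ToyKeywordRetriever(document_store=store)
results = retriever.retrieve("capital of France")
```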

Initialization

To use a DocumentStore in a pipeline, you must initialize it first. Initializing a new DocumentStore in Haystack is straightforward. Have a look at the instructions for different types of DocumentStores:

Elasticsearch

Install Elasticsearch and then start an instance.

If you have Docker set up, we recommend pulling the Docker image and running it.

docker pull docker.elastic.co/elasticsearch/elasticsearch:7.9.2
docker run -d -p 9200:9200 -e "discovery.type=single-node" docker.elastic.co/elasticsearch/elasticsearch:7.9.2

Next, you can initialize the Haystack object connecting to this instance:

from haystack.document_stores import ElasticsearchDocumentStore

document_store = ElasticsearchDocumentStore()

In Memory

The InMemoryDocumentStore requires no external setup. Start it with these two lines:

from haystack.document_stores import InMemoryDocumentStore
document_store = InMemoryDocumentStore()

The InMemoryDocumentStore works with the BM25Retriever.

Milvus

MilvusDocumentStore is an external integration for Haystack. You can find the source code of MilvusDocumentStore in the milvus-haystack GitHub repo.

To use the MilvusDocumentStore, follow the official Milvus documentation and start a Milvus instance using Docker.

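Install the milvus-haystack package. The -e flag installs in editable mode from a local directory, so run this from the folder that contains your clone of the repo: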

pip install -e milvus-haystack

Then, import and initialize the MilvusDocumentStore:

from haystack import Document
from milvus_documentstore import MilvusDocumentStore

ds = MilvusDocumentStore()
ds.write_documents([Document("Some Content")])
ds.get_all_documents()  # returns [<Document: {'content': 'Some Content', 'content_type': 'text', ...>]

MongoDB Atlas

MongoDB Atlas is a multi-cloud database service built by the people behind MongoDB. Using MongoDBAtlasDocumentStore, you can connect to databases deployed on MongoDB Atlas. To get started with MongoDB, check out their documentation.

First, install MongoDBAtlasDocumentStore using the Haystack mongodb optional dependency:

pip install "farm-haystack[mongodb]"

Then, import and initialize the MongoDBAtlasDocumentStore:

from haystack.document_stores.mongodb_atlas import MongoDBAtlasDocumentStore
from haystack import Document

ds = MongoDBAtlasDocumentStore(
    mongo_connection_string="mongodb+srv://{mongo_atlas_username}:{mongo_atlas_password}@{mongo_atlas_host}/?{mongo_atlas_params_string}",
    database_name="database_name",
    collection_name="collection_name",
)
ds.write_documents([Document("Some Content")])
ds.get_all_documents()

OpenSearch

Learn how to get started here. If you have Docker set up, we recommend pulling the Docker image and running it.

docker pull opensearchproject/opensearch:1.0.1
docker run -p 9200:9200 -p 9600:9600 -e "discovery.type=single-node" opensearchproject/opensearch:1.0.1

Next, you can initialize the Haystack object connecting to this instance:

from haystack.document_stores import OpenSearchDocumentStore
document_store = OpenSearchDocumentStore()

Pinecone

Pinecone is a fast and scalable vector database that supports filtered search, and PineconeDocumentStore connects Haystack to it. It's a managed document store, which means the vectors are stored in the cloud. To learn more, see the Pinecone documentation.

To initialize PineconeDocumentStore, you need to have an API key from an active Pinecone account.

from haystack.document_stores import PineconeDocumentStore

document_store = PineconeDocumentStore(
    api_key="YOUR_API_KEY",
    similarity="cosine",
    index="your_index_name",
    embedding_dim=768
)

Qdrant

QdrantDocumentStore is an external integration the Qdrant team maintains. It's optimized for high-dimensional vector search, and it supports various similarity metrics. You can find the QdrantDocumentStore in the qdrant-haystack GitHub repo.

QdrantDocumentStore supports all the configuration properties available in the Qdrant Python client. For more information, see the Qdrant documentation.

To use this document store, install it first:

pip install qdrant-haystack

And then, initialize it:

from qdrant_haystack.document_stores import QdrantDocumentStore

document_store = QdrantDocumentStore(
    ":memory:",
    index="Document",
    embedding_dim=512,
    recreate_index=True,
    hnsw_config={"m": 16, "ef_construct": 64}  # Optional
)

You can then write Documents to it using the write_documents() method.

To learn more about the document store, see our Blog post.

SQL

The SQLDocumentStore requires SQLite, PostgreSQL, or MySQL to be installed and started. Note that SQLite already comes packaged with most operating systems.

from haystack.document_stores import SQLDocumentStore
document_store = SQLDocumentStore()

Weaviate

The WeaviateDocumentStore requires a running Weaviate Server version 1.8 or later. To start a basic instance, run:

docker run -d -p 8080:8080 --env AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED='true' --env PERSISTENCE_DATA_PATH='/var/lib/weaviate' semitechnologies/weaviate:1.17.2

See the Weaviate docs for more details.

Afterwards, you can use it in Haystack:

from haystack.document_stores import WeaviateDocumentStore
document_store = WeaviateDocumentStore()

Each DocumentStore constructor allows for arguments specifying how to connect to existing databases and the names of indexes. See the API documentation for more info.

FAISS

The FAISSDocumentStore requires no external setup. Use this code to start it:

from haystack.document_stores import FAISSDocumentStore

document_store = FAISSDocumentStore()

Initializing the FAISSDocumentStore creates the faiss_document_store.db database file on your disk. If you later want to remove the DocumentStore, that's the file you need to delete.

Save Your DocumentStore

You can save FAISS document stores to disk and then reload them.

Under the hood, the FAISSDocumentStore contains a SQL database and a FAISS index. The database is saved to your disk when you initialize the FAISSDocumentStore. The FAISS index is not saved anywhere and stays in memory. To save it, call the save() method. Note that you must always initialize the DocumentStore before saving it.

from haystack.document_stores import FAISSDocumentStore

# First, initialize the document store:
document_store = FAISSDocumentStore(faiss_index_factory_str="Flat")

# Save the document store:
document_store.save(index_path="my_faiss_index.faiss")
# Saving the document store creates two files: my_faiss_index.faiss and my_faiss_index.json


Saving a FAISSDocumentStore creates two files on your disk:

  • my_faiss_index.faiss - This file contains the index.
  • my_faiss_index.json - This file contains the parameters used to initialize the DocumentStore, for example, faiss_index_factory_str="Flat".

You can change the name and the location of the files. Just pass the new names or paths in the index_path and config_path parameters:

from haystack.document_stores import FAISSDocumentStore

document_store = FAISSDocumentStore(faiss_index_factory_str="Flat")
document_store.save(index_path="data/my_index.faiss", config_path="data/my_config.json")
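
To make the two-file layout concrete, here is a minimal sketch that does not use Haystack or FAISS. ToyVectorStore is a hypothetical class, the index is written as JSON purely for illustration, and deriving the config file name by swapping the suffix to .json is an assumption that matches the file names listed above.

```python
import json
import tempfile
from pathlib import Path


class ToyVectorStore:
    """Toy stand-in: keeps vectors in memory until save() is called."""

    def __init__(self, index_factory_str="Flat"):
        self.index_factory_str = index_factory_str
        self.vectors = []

    def save(self, index_path, config_path=None):
        index_path = Path(index_path)
        # Assumption: with no config_path, reuse the index name with a .json suffix.
        config_path = Path(config_path) if config_path else index_path.with_suffix(".json")
        # File 1: the index itself (JSON here only to keep the sketch simple).
        index_path.write_text(json.dumps(self.vectors))
        # File 2: the parameters needed to re-create the store.
        config_path.write_text(json.dumps({"index_factory_str": self.index_factory_str}))
        return index_path, config_path


with tempfile.TemporaryDirectory() as tmp:
    store = ToyVectorStore(index_factory_str="Flat")
    store.vectors.append([0.1, 0.2, 0.3])
    index_file, config_file = store.save(Path(tmp) / "my_faiss_index.faiss")
    saved_names = sorted(p.name for p in Path(tmp).iterdir())
    saved_config = json.loads(config_file.read_text())
```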

Load a Saved DocumentStore

You can load a saved DocumentStore using the faiss_index_path parameter of FAISSDocumentStore or using the index_path parameter of the load() method. As a value, pass the index file path you indicated when saving the DocumentStore.

Here's how you load a saved DocumentStore:

from haystack.document_stores import FAISSDocumentStore

# Load the saved index into a new DocumentStore instance:
new_document_store = FAISSDocumentStore(faiss_index_path="my_faiss_index.faiss")

# or use the `load()` method to create a new DocumentStore instance: 
new_document_store = FAISSDocumentStore.load(index_path="my_faiss_index.faiss")

# Also, provide `config_path` parameter if you set it when calling the `save()` method: 
new_document_store = FAISSDocumentStore.load(index_path="data/my_index.faiss", config_path="data/my_config.json")

# Check if the DocumentStore is loaded correctly
assert new_document_store.faiss_index_factory_str == "Flat"

Remember that load() is a class method: call it on the FAISSDocumentStore class, not on an instance.
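
The following sketch illustrates what calling a class method as an alternative constructor looks like in plain Python. ToyStore and its single parameter are hypothetical, and the way the config is read is an assumption for illustration, not the library's implementation.

```python
import json
import tempfile
from pathlib import Path


class ToyStore:
    def __init__(self, index_factory_str="Flat"):
        self.index_factory_str = index_factory_str

    @classmethod
    def load(cls, config_path):
        # cls is the class itself, so load() can build a new instance
        # without one existing beforehand.
        init_params = json.loads(Path(config_path).read_text())
        return cls(**init_params)


with tempfile.TemporaryDirectory() as tmp:
    config_file = Path(tmp) / "my_config.json"
    config_file.write_text(json.dumps({"index_factory_str": "HNSW"}))
    # Called on the class, not on an existing object.
    restored = ToyStore.load(config_file)
```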

Input Format

Cast your data into Document objects before writing it to a DocumentStore:

from haystack import Document
from haystack.document_stores import ElasticsearchDocumentStore

document_store = ElasticsearchDocumentStore()
documents = [
    Document(
        content=DOCUMENT_TEXT_HERE,
        meta={'name': DOCUMENT_NAME, ...}
    ),
    ...
]
document_store.write_documents(documents)

You can also pass your data as dictionaries:

from haystack.document_stores import ElasticsearchDocumentStore

document_store = ElasticsearchDocumentStore()
dicts = [
    {
        'content': DOCUMENT_TEXT_HERE,
        'meta': {'name': DOCUMENT_NAME, ...}
    }, ...
]
document_store.write_documents(dicts)
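
As a rough picture of why both input shapes can be accepted, the sketch below normalizes a mixed list into one type. It does not import Haystack: ToyDocument, its from_dict() helper, and normalize() are hypothetical names used only for this illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDocument:
    # Hypothetical stand-in mirroring the 'content' and 'meta' keys shown above.
    content: str
    meta: dict = field(default_factory=dict)

    @classmethod
    def from_dict(cls, data):
        return cls(content=data["content"], meta=dict(data.get("meta", {})))


def normalize(items):
    """Turn a mix of ToyDocument objects and dictionaries into ToyDocument objects."""
    return [
        item if isinstance(item, ToyDocument) else ToyDocument.from_dict(item)
        for item in items
    ]


mixed = [
    ToyDocument(content="First text", meta={"name": "a.txt"}),
    {"content": "Second text", "meta": {"name": "b.txt"}},
]
normalized = normalize(mixed)
```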

Writing Documents (Sparse Retrievers)

Haystack lets you store documents in an optimized fashion so that query times stay low. For sparse, keyword-based retrievers such as BM25 and TF-IDF, you only have to call DocumentStore.write_documents(). The inverted index, which speeds up querying, is created automatically.

document_store.write_documents(dicts)
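
For readers unfamiliar with inverted indexes, here is a minimal sketch of the idea in plain Python. It is not how any particular DocumentStore builds its index: build_inverted_index() and lookup() are hypothetical helpers, tokenization is a simple whitespace split, and no BM25 or TF-IDF scoring is applied.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, doc in enumerate(docs):
        for token in doc["content"].lower().split():
            index[token].add(doc_id)
    return index


def lookup(index, query):
    """Return ids of documents containing every query term."""
    postings = [index.get(token, set()) for token in query.lower().split()]
    return sorted(set.intersection(*postings)) if postings else []


dicts = [
    {"content": "haystack stores documents", "meta": {"name": "a"}},
    {"content": "retrievers fetch documents", "meta": {"name": "b"}},
    {"content": "pipelines connect nodes", "meta": {"name": "c"}},
]
inverted = build_inverted_index(dicts)
hits = lookup(inverted, "documents")
```

A query only touches the posting sets of its own terms instead of scanning every document, which is where the speed-up comes from.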

Writing Documents (Dense Retrievers)

For dense, neural network-based retrievers such as Dense Passage Retrieval or Embedding Retrieval, indexing involves computing the Document embeddings, which are then compared against the Query embedding.

DocumentStore.write_documents() stores the text, and DocumentStore.update_embeddings() starts the computation of the embeddings.

document_store.write_documents(dicts)
document_store.update_embeddings(retriever)

This step is computationally intensive because it runs the transformer-based encoders over the documents. GPU acceleration speeds it up significantly.
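
The two-step flow can be sketched without any model. Everything below is hypothetical: toy_embed() is a character-hashing function standing in for a transformer encoder, and ToyDenseStore only mimics the split between storing text and filling in embeddings later.

```python
import math


def toy_embed(text, dim=8):
    """Hypothetical encoder: hashes letters into a fixed-size, unit-length vector."""
    vec = [0.0] * dim
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def cosine(a, b):
    # Vectors are already unit length, so the dot product is the cosine.
    return sum(x * y for x, y in zip(a, b))


class ToyDenseStore:
    def __init__(self):
        self.docs = []

    def write_documents(self, dicts):
        # Step 1: store the text only; no embedding yet.
        self.docs.extend({"content": d["content"], "embedding": None} for d in dicts)

    def update_embeddings(self, embed_fn):
        # Step 2: compute embeddings for documents that lack one.
        for doc in self.docs:
            if doc["embedding"] is None:
                doc["embedding"] = embed_fn(doc["content"])

    def query(self, text, embed_fn):
        q = embed_fn(text)
        return max(self.docs, key=lambda d: cosine(q, d["embedding"]))


dense_store = ToyDenseStore()
dense_store.write_documents([{"content": "cats purr"}, {"content": "dogs bark"}])
pending = sum(d["embedding"] is None for d in dense_store.docs)
dense_store.update_embeddings(toy_embed)
best = dense_store.query("dogs bark", toy_embed)
```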

Approximate Nearest Neighbors Search

Approximate nearest neighbors (ANN) search significantly speeds up document retrieval. It does so by approximating embedding similarity calculations, which costs a small amount of retrieval accuracy. You can use ANN when documents have embeddings and the collection of documents is sufficiently large. In our benchmarks, we found significant speed improvements when working with document collections as small as 10 thousand documents.

ANN is available in:

  • ElasticsearchDocumentStore
  • QdrantDocumentStore
  • OpenSearchDocumentStore
  • WeaviateDocumentStore
  • PineconeDocumentStore
  • MilvusDocumentStore
  • FAISSDocumentStore

For most of these DocumentStores, the class constructor has an index_type argument that you can set to turn ANN on or off. Each of these DocumentStores supports the HNSW algorithm, which you can learn more about in Hierarchical Navigable Small Worlds. See the API documentation of the specific DocumentStore you are using for more information on its ANN parameters.
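
The speed-for-accuracy tradeoff can be illustrated with a much simpler technique than HNSW. The sketch below uses random-hyperplane locality-sensitive hashing, which is a different ANN method chosen only because it fits in a few lines; ToyLSHIndex and all parameter values are hypothetical. Exact search compares the query against every vector, while the approximate search only looks inside one bucket.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def unit(vec):
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def exact_search(query, vectors):
    """Compare the query against every stored vector."""
    return max(range(len(vectors)), key=lambda i: dot(query, vectors[i]))


def signature(vec, planes):
    # One bit per hyperplane: which side of the plane the vector falls on.
    return tuple(dot(vec, p) >= 0 for p in planes)


class ToyLSHIndex:
    def __init__(self, dim, n_planes=4, seed=0):
        rng = random.Random(seed)
        self.planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_planes)]
        self.buckets = {}
        self.vectors = []

    def add(self, vec):
        self.vectors.append(vec)
        key = signature(vec, self.planes)
        self.buckets.setdefault(key, []).append(len(self.vectors) - 1)

    def search(self, query):
        # Only vectors sharing the query's signature are compared.
        candidates = self.buckets.get(signature(query, self.planes), [])
        if not candidates:
            return None, 0
        best = max(candidates, key=lambda i: dot(query, self.vectors[i]))
        return best, len(candidates)


rng = random.Random(42)
vectors = [unit([rng.gauss(0, 1) for _ in range(16)]) for _ in range(200)]
index = ToyLSHIndex(dim=16)
for v in vectors:
    index.add(v)

query = vectors[17]
exact_best = exact_search(query, vectors)
ann_best, n_compared = index.search(query)
```

Here the query is one of the stored vectors, so both searches find it. For an arbitrary query, the true nearest neighbor may sit in a different bucket and be missed, which is the accuracy cost mentioned above.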

Choosing the Right Document Store

The Document Stores have different characteristics. You should choose one depending on the maturity of your project, the use case, and the technical environment:

| Document Store | Main features | Part of Haystack core? |
|---|---|---|
| Elasticsearch | Sparse retrieval with many tuning options and basic support for dense retrieval. | Yes |
| In Memory | Simple document store, with no extra services or dependencies. Not recommended for production. | Yes |
| Milvus | Open-source. Dense retrieval for scalable similarity search. Approximate nearest neighbor (ANN) search algorithms. | No, you can find it in haystack-extras |
| OpenSearch | Open-source. Compatible with the Amazon OpenSearch Service. Essentially has the same features as Elasticsearch. Support for vector similarity comparisons and approximate nearest neighbors algorithms. | Yes |
| Pinecone | A fully managed service for large-scale dense retrieval. Metadata filters. Low query latency at any scale. Live index updates. | Yes |
| Qdrant | Open-source. Extended filtering support. | No, you can find it in Qdrant GitHub |
| SQL | Simple & fast, with no database requirements. Supports MySQL, PostgreSQL, and SQLite. | Yes |
| Weaviate | Open-source. Simple dense retrieval. Stores documents, metadata, and vectors in one place. Allows a combination of vector search and scalar filtering; you can filter for a certain tag and do dense retrieval on that subset. | Yes |
| FAISS | Open-source. Dense retrieval via different index types. | Yes |

Our Recommendations

Restricted environment: Use the InMemoryDocumentStore if you are just giving Haystack a quick try on a small sample and are working in a restricted environment that makes it hard to run Elasticsearch or other databases.

All-rounder: Use the ElasticsearchDocumentStore if you want to evaluate the performance of different retrieval options (dense vs. sparse) and are aiming for a smooth transition from PoC to production.

Working with Existing Databases

If you have an existing Elasticsearch or OpenSearch database with indexed documents, you can quickly create a Haystack-compliant version of it using the elasticsearch_index_to_document_store or open_search_index_to_document_store function.

from haystack.document_stores import elasticsearch_index_to_document_store

new_ds = elasticsearch_index_to_document_store(
    document_store=empty_document_store,
    original_content_field="content",
    original_index_name="document",
    original_name_field="title",
    preprocessor=preprocessor,
    port=9201,
    verify_certs=False,
    scheme="https",
    username="admin",
    password="admin"
)

The OpenSearch function is called the same way:

from haystack.document_stores import open_search_index_to_document_store

new_ds = open_search_index_to_document_store(
    document_store=empty_document_store,
    original_content_field="content",
    original_index_name="document",
    original_name_field="title",
    preprocessor=preprocessor,
    port=9201,
    verify_certs=False,
    scheme="https",
    username="admin",
    password="admin"
)
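
To give an idea of what the original_content_field and original_name_field arguments express, here is a plain-Python sketch of mapping one raw index record to the content/meta layout used throughout this page. convert_record() is a hypothetical helper, and the exact mapping rules (the name landing under meta["name"], all other fields copied into meta) are assumptions for illustration rather than the documented behavior of the functions above.

```python
def convert_record(record, original_content_field, original_name_field=None):
    """Map one raw index record to the {'content', 'meta'} layout."""
    # Assumption: every field other than content and name is carried over as metadata.
    meta = {
        key: value
        for key, value in record.items()
        if key not in (original_content_field, original_name_field)
    }
    # Assumption: the original name field ends up under meta['name'].
    if original_name_field is not None and original_name_field in record:
        meta["name"] = record[original_name_field]
    return {"content": record[original_content_field], "meta": meta}


raw = {"content": "Some text", "title": "doc1", "year": 2021}
converted = convert_record(raw, original_content_field="content", original_name_field="title")
```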