Document Store
You can think of the Document Store as a database that stores your data and provides it to the Retriever at query time. Learn how to use a Document Store in a pipeline or how to create your own.
A Document Store is an object that stores your Documents. In Haystack, a Document Store is different from a component, as it doesn’t have the run() method. You can think of it as an interface to your database: you put information there, or you look through it. This means that a Document Store is not a piece of a Pipeline, but rather a tool that the components of a pipeline have access to and can interact with.
Work with Retrievers
The most common way to use a Document Store in Haystack is to fetch documents using a Retriever. A Document Store will often have a corresponding Retriever to get the most out of specific technologies. See more information in our Retriever documentation.
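As a rough illustration of the relationship described above, the sketch below uses plain Python stand-ins rather than Haystack classes. The names ToyDocument, ToyDocumentStore, and ToyKeywordRetriever are hypothetical and exist only for this example. The point is that the store has no run() method, while the retriever has one and reads from the store it was given.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDocument:
    """Hypothetical stand-in for a Haystack Document."""
    id: str
    content: str


@dataclass
class ToyDocumentStore:
    """Hypothetical store: it holds documents and has no run() method."""
    documents: dict = field(default_factory=dict)

    def write_documents(self, docs):
        for doc in docs:
            self.documents[doc.id] = doc
        return len(docs)


class ToyKeywordRetriever:
    """Hypothetical component: it has run() and reads from the store it holds."""

    def __init__(self, document_store):
        self.document_store = document_store

    def run(self, query):
        # Naive matching: keep documents containing any word of the query.
        words = query.lower().split()
        hits = [
            doc for doc in self.document_store.documents.values()
            if any(word in doc.content.lower() for word in words)
        ]
        return {"documents": hits}


store = ToyDocumentStore()
store.write_documents([
    ToyDocument(id="1", content="My name is Jean and I live in Paris."),
    ToyDocument(id="2", content="My name is Mark and I live in Berlin."),
])
retriever = ToyKeywordRetriever(document_store=store)
result = retriever.run(query="Berlin")
```

In a real pipeline, you would pass an actual Document Store instance to the Retriever that corresponds to it; the keyword matching here is only a placeholder for whatever search the backend supports.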
How to choose a Document Store?
To learn about the different types of Document Stores and their strengths and weaknesses, head to the Choosing a Document Store page.
DocumentStore Protocol
Document Stores in Haystack are designed to use the following methods as part of their protocol:
- count_documents: returns the number of Documents stored in the given store as an integer.
- filter_documents: returns a list of Documents that match the provided filters.
- write_documents: writes or overwrites Documents into the given store and returns the number of Documents that were written as an integer.
- delete_documents: deletes all Documents with the given document_ids from the Document Store.
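The four methods above can be sketched as a structural interface. This is a simplified, dependency-free illustration, not the actual Haystack protocol definition: ToyDocumentStoreProtocol and DictBackedStore are hypothetical names, documents are plain dictionaries, and filters is assumed to be a simple exact-match mapping over metadata rather than Haystack's filter syntax.

```python
from typing import Any, Dict, List, Optional, Protocol, runtime_checkable


@runtime_checkable
class ToyDocumentStoreProtocol(Protocol):
    """Hypothetical, simplified mirror of the four methods listed above."""

    def count_documents(self) -> int: ...

    def filter_documents(self, filters: Optional[Dict[str, Any]] = None) -> List[Any]: ...

    def write_documents(self, documents: List[Any]) -> int: ...

    def delete_documents(self, document_ids: List[str]) -> None: ...


class DictBackedStore:
    """Hypothetical store that satisfies the protocol without inheriting from it."""

    def __init__(self) -> None:
        self._docs: Dict[str, Dict[str, Any]] = {}

    def count_documents(self) -> int:
        return len(self._docs)

    def filter_documents(self, filters: Optional[Dict[str, Any]] = None) -> List[Any]:
        # Assumption: filters is an exact-match mapping over the "meta" field.
        if not filters:
            return list(self._docs.values())
        return [
            doc for doc in self._docs.values()
            if all(doc.get("meta", {}).get(key) == value for key, value in filters.items())
        ]

    def write_documents(self, documents: List[Any]) -> int:
        # Writes or overwrites by ID and reports how many documents were written.
        for doc in documents:
            self._docs[doc["id"]] = doc
        return len(documents)

    def delete_documents(self, document_ids: List[str]) -> None:
        for doc_id in document_ids:
            self._docs.pop(doc_id, None)


store = DictBackedStore()
written = store.write_documents([
    {"id": "a", "content": "first", "meta": {"lang": "en"}},
    {"id": "b", "content": "second", "meta": {"lang": "fr"}},
])
english = store.filter_documents({"lang": "en"})
store.delete_documents(["a"])
remaining = store.count_documents()
conforms = isinstance(store, ToyDocumentStoreProtocol)
```

Because the interface is structural, any class with these four methods conforms; no base class is required.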
Initialization
To use a Document Store in a pipeline, you must initialize it first.
See the installation and initialization details for each Document Store in the "Document Stores" section in the navigation panel on your left.
Work with Documents
Convert your data into Document objects, along with their metadata and document IDs, before writing them into a Document Store.
The ID field is mandatory, so if you don’t choose a specific ID yourself, Haystack will do its best to come up with a unique ID based on the Document’s information and assign it automatically. However, since Haystack uses the Document’s contents to create an ID, two identical Documents might have identical IDs. Keep this in mind when you update your documents, as the ID will not be updated automatically.
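from haystack import Document
from haystack_integrations.document_stores.chroma import ChromaDocumentStore

document_store = ChromaDocumentStore()
documents = [
    Document(
        # DOCUMENT_NAME and the trailing ... are placeholders for your own metadata.
        meta={'name': DOCUMENT_NAME, ...},
        id="document_unique_id",
        content="this is content",
    ),
    ...  # placeholder for further Documents
]
document_store.write_documents(documents)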
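The note above about content-derived IDs can be illustrated with a small, dependency-free sketch. The hashing scheme and the ToyDocument class here are assumptions made for the example and are not Haystack's actual ID algorithm; they only show why identical content leads to identical IDs and why an ID assigned at creation goes stale when the content changes later.

```python
import hashlib
from dataclasses import dataclass


def content_based_id(content: str) -> str:
    """Hypothetical ID scheme: a SHA-256 digest of the content."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


@dataclass
class ToyDocument:
    """Hypothetical stand-in for a Document with an auto-assigned ID."""
    content: str
    id: str = ""

    def __post_init__(self):
        # Assign an ID only once, at creation, when none was supplied.
        if not self.id:
            self.id = content_based_id(self.content)


first = ToyDocument(content="this is content")
second = ToyDocument(content="this is content")
same_id = first.id == second.id  # identical content gives an identical ID

# Editing the content afterwards leaves the ID untouched.
original_id = first.id
first.content = "this is updated content"
stale = first.id == original_id and first.id != content_based_id(first.content)
```

If you need distinct IDs for documents with the same content, or stable IDs across edits, pass an explicit id when you create the Document.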
To write Documents into the InMemoryDocumentStore, simply call the .write_documents() function:
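from haystack import Document
from haystack.document_stores.in_memory import InMemoryDocumentStore

document_store = InMemoryDocumentStore()
document_store.write_documents([
    Document(content="My name is Jean and I live in Paris."),
    Document(content="My name is Mark and I live in Berlin."),
    Document(content="My name is Giorgio and I live in Rome.")
])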
DocumentWriter
See the DocumentWriter component docs to learn how to write your Documents into a Document Store in a pipeline.
DuplicatePolicy
The DuplicatePolicy is a class that defines the different options for handling documents with the same ID in a DocumentStore. It has three possible values:
- OVERWRITE: Indicates that if a document with the same ID already exists in the DocumentStore, it should be overwritten with the new document.
- SKIP: If a document with the same ID already exists, the new document will be skipped and not added to the DocumentStore.
- FAIL: Raises an error if a document with the same ID already exists in the DocumentStore. It prevents duplicate documents from being added.
Here is an example of how you could apply the policy so that a new document is skipped when one with the same ID already exists:
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.writers import DocumentWriter
from haystack.document_stores.types import DuplicatePolicy

document_store = InMemoryDocumentStore()
document_writer = DocumentWriter(document_store=document_store, policy=DuplicatePolicy.SKIP)
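To make the three policies described above concrete, here is a minimal sketch of their semantics on a plain dictionary. ToyDuplicatePolicy and write_with_policy are hypothetical names for this illustration; this is not how Haystack implements DuplicatePolicy, only a model of the behavior each value describes.

```python
from enum import Enum


class ToyDuplicatePolicy(Enum):
    """Hypothetical mirror of the three policy values."""
    OVERWRITE = "overwrite"
    SKIP = "skip"
    FAIL = "fail"


def write_with_policy(store: dict, documents: list, policy: ToyDuplicatePolicy) -> int:
    """Write {"id", "content"} dicts into store and return how many were written."""
    written = 0
    for doc in documents:
        if doc["id"] in store:
            if policy is ToyDuplicatePolicy.SKIP:
                continue  # keep the existing document, ignore the new one
            if policy is ToyDuplicatePolicy.FAIL:
                raise ValueError(f"Document with ID {doc['id']!r} already exists")
        # New ID, or OVERWRITE on an existing ID.
        store[doc["id"]] = doc
        written += 1
    return written


store = {}
write_with_policy(store, [{"id": "1", "content": "old"}], ToyDuplicatePolicy.FAIL)

skipped = write_with_policy(store, [{"id": "1", "content": "new"}], ToyDuplicatePolicy.SKIP)
content_after_skip = store["1"]["content"]

overwritten = write_with_policy(store, [{"id": "1", "content": "new"}], ToyDuplicatePolicy.OVERWRITE)
content_after_overwrite = store["1"]["content"]

try:
    write_with_policy(store, [{"id": "1", "content": "again"}], ToyDuplicatePolicy.FAIL)
    failed = False
except ValueError:
    failed = True
```

SKIP suits re-running an indexing job without touching existing entries, OVERWRITE suits refreshing content, and FAIL suits cases where a duplicate ID signals a bug upstream.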
Custom Document Store
All custom document stores must implement the protocol with four mandatory methods: count_documents, filter_documents, write_documents, and delete_documents.
The __init__ method should define all the specifics for the chosen database or vector store.
We also recommend having a custom corresponding Retriever to get the most out of a specific Document Store.
See the Creating Custom Document Stores page for more details.
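As a hedged sketch of the shape such a store might take, the example below backs the four methods with SQLite from the Python standard library. ToySQLiteDocumentStore, its db_path and table_name parameters, the dictionary-based documents, and the exact-match filters are all assumptions made for this illustration; a real custom store would work with Haystack Document objects, Haystack's filter syntax, and DuplicatePolicy.

```python
import json
import sqlite3
from typing import Any, Dict, List, Optional


class ToySQLiteDocumentStore:
    """Hypothetical custom store backed by SQLite; not part of Haystack."""

    def __init__(self, db_path: str = ":memory:", table_name: str = "documents") -> None:
        # Backend-specific settings (location, table name) are set in __init__.
        if not table_name.isidentifier():
            raise ValueError("table_name must be a plain identifier")
        self.table_name = table_name
        self._conn = sqlite3.connect(db_path)
        self._conn.execute(
            f"CREATE TABLE IF NOT EXISTS {table_name} "
            "(id TEXT PRIMARY KEY, content TEXT, meta TEXT)"
        )

    def count_documents(self) -> int:
        row = self._conn.execute(f"SELECT COUNT(*) FROM {self.table_name}").fetchone()
        return row[0]

    def filter_documents(self, filters: Optional[Dict[str, Any]] = None) -> List[Dict[str, Any]]:
        # Simplification: exact match on metadata keys, applied in Python.
        rows = self._conn.execute(
            f"SELECT id, content, meta FROM {self.table_name} ORDER BY id"
        ).fetchall()
        docs = [{"id": r[0], "content": r[1], "meta": json.loads(r[2])} for r in rows]
        if not filters:
            return docs
        return [d for d in docs if all(d["meta"].get(k) == v for k, v in filters.items())]

    def write_documents(self, documents: List[Dict[str, Any]]) -> int:
        # Simplification: always overwrites documents that share an ID.
        with self._conn:
            self._conn.executemany(
                f"INSERT OR REPLACE INTO {self.table_name} (id, content, meta) VALUES (?, ?, ?)",
                [(d["id"], d["content"], json.dumps(d.get("meta", {}))) for d in documents],
            )
        return len(documents)

    def delete_documents(self, document_ids: List[str]) -> None:
        with self._conn:
            self._conn.executemany(
                f"DELETE FROM {self.table_name} WHERE id = ?",
                [(doc_id,) for doc_id in document_ids],
            )


store = ToySQLiteDocumentStore()
store.write_documents([
    {"id": "1", "content": "Jean lives in Paris.", "meta": {"city": "Paris"}},
    {"id": "2", "content": "Mark lives in Berlin.", "meta": {"city": "Berlin"}},
])
store.write_documents([{"id": "2", "content": "Mark moved to Rome.", "meta": {"city": "Rome"}}])
count_after_writes = store.count_documents()
in_rome = store.filter_documents({"city": "Rome"})
store.delete_documents(["1"])
count_after_delete = store.count_documents()
```

Keeping connection details in __init__ means the rest of the class, and any Retriever built on top of it, only deals with the four protocol methods.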