DocumentSplitter
DocumentSplitter divides a list of text documents into a list of shorter text documents. This is useful for long texts that would otherwise exceed the maximum input length of a language model, and it can also speed up question answering.
| Name | DocumentSplitter |
| Folder path | /preprocessors/ |
| Most common position in a pipeline | In indexing pipelines after Converters and DocumentCleaner, before Classifiers |
| Mandatory input variables | "documents": A list of documents |
| Output variables | "documents": A list of documents |
Overview
DocumentSplitter expects a list of documents as input and returns a list of documents with split texts. It splits each input document into chunks of split_length units, where the unit is chosen with split_by, and consecutive chunks share split_overlap units. These additional parameters can be set when the component is initialized:
- split_by can be "word", "sentence", "passage" (paragraph), or "page".
- split_length is an integer indicating the chunk size, which is the number of words, sentences, or passages.
- split_overlap is an integer indicating the number of overlapping words, sentences, or passages between chunks.
- split_threshold is an integer indicating the minimum number of words, sentences, or passages that the document fragment should have. If the fragment is below the threshold, it will be attached to the previous one.
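To make the interaction of these parameters concrete, here is a minimal pure-Python sketch of sliding-window chunking. It is an illustration only, not DocumentSplitter's actual implementation: the function name `split_units` is hypothetical, and the exact way a short trailing fragment is merged is an assumption based on the description above.

```python
def split_units(units, split_length, split_overlap=0, split_threshold=0):
    """Group `units` (words, sentences, ...) into overlapping chunks.

    Hypothetical helper for illustration; not part of Haystack.
    """
    if split_length <= 0:
        raise ValueError("split_length must be positive")
    if not 0 <= split_overlap < split_length:
        raise ValueError("split_overlap must be in [0, split_length)")

    step = split_length - split_overlap  # how far the window advances
    chunks = []
    for start in range(0, len(units), step):
        chunk = list(units[start:start + split_length])
        if chunks and len(chunk) < split_threshold:
            # Fragment is too short: attach it to the previous chunk,
            # skipping the units the previous chunk already contains.
            chunks[-1].extend(chunk[split_overlap:])
            break
        chunks.append(chunk)
        if start + split_length >= len(units):
            break  # the window already reached the end of the input
    return chunks


words = "one two three four five six seven".split()
plain = split_units(words, split_length=3)
merged = split_units(words, split_length=3, split_threshold=2)
overlapping = split_units(words, split_length=3, split_overlap=1)
```

In this sketch, `plain` ends with a one-word chunk, `merged` folds that word into the second chunk because it is below the threshold, and `overlapping` repeats the last word of each chunk at the start of the next.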
A "source_id" field is added to each document's metadata to keep track of the original document that was split. Another metadata field, "page_number", is added to each document to keep track of the page it belonged to in the original document. All other metadata is copied from the original document.
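The metadata bookkeeping described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the library's code: `Doc` is a hypothetical stand-in for a document class, the chunk IDs are made up for the example, the chunks are assumed to be non-overlapping, and pages are assumed to be separated by a form feed character ("\f").

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a document; not Haystack's Document."""
    id: str
    content: str
    meta: dict = field(default_factory=dict)


def attach_split_meta(original, chunks):
    """Wrap text chunks in Doc objects that point back to `original`."""
    docs = []
    page = 1  # pages are counted from 1
    for index, text in enumerate(chunks):
        meta = dict(original.meta)         # copy the original metadata
        meta["source_id"] = original.id    # track the document that was split
        meta["page_number"] = page         # page on which this chunk starts
        docs.append(Doc(id=f"{original.id}-{index}", content=text, meta=meta))
        # Assumption: each form feed in a chunk marks the start of a new page.
        page += text.count("\f")
    return docs


original = Doc(id="doc-1", content="Page one.\fPage two.", meta={"author": "x"})
split_docs = attach_split_meta(original, ["Page one.\f", "Page two."])
```

Each resulting document carries the original "author" value plus the two tracking fields, while the original document's metadata is left untouched.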
The DocumentSplitter is compatible with the following DocumentStores:
- AstraDocumentStore
- ChromaDocumentStore – limited support, overlapping information is not stored.
- ElasticsearchDocumentStore
- OpenSearchDocumentStore
- PgvectorDocumentStore
- PineconeDocumentStore – limited support, overlapping information is not stored.
- QdrantDocumentStore
- WeaviateDocumentStore
Usage
On its own
You can use this component outside of a pipeline to shorten your documents like this:
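from haystack import Document
from haystack.components.preprocessors import DocumentSplitter
splitter = DocumentSplitter(split_by="passage", split_length=10, split_overlap=0)
result = splitter.run(documents=[Document(content="First passage.\n\nSecond passage.")])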
In a pipeline
Here's how you can use DocumentSplitter in an indexing pipeline:
from pathlib import Path
from haystack import Document
from haystack import Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.converters.txt import TextFileToDocument
from haystack.components.preprocessors import DocumentCleaner
from haystack.components.preprocessors import DocumentSplitter
from haystack.components.writers import DocumentWriter
document_store = InMemoryDocumentStore()
p = Pipeline()
p.add_component(instance=TextFileToDocument(), name="text_file_converter")
p.add_component(instance=DocumentCleaner(), name="cleaner")
p.add_component(instance=DocumentSplitter(split_by="sentence", split_length=1), name="splitter")
p.add_component(instance=DocumentWriter(document_store=document_store), name="writer")
p.connect("text_file_converter.documents", "cleaner.documents")
p.connect("cleaner.documents", "splitter.documents")
p.connect("splitter.documents", "writer.documents")
path = "path/to/your/files"
files = list(Path(path).glob("*.md"))
p.run({"text_file_converter": {"sources": files}})
