HierarchicalDocumentSplitter
Use this component to create a multi-level document structure based on parent-child relationships between text segments.
| | |
| --- | --- |
| Most common position in a pipeline | In indexing pipelines, after Converters and DocumentCleaner |
| Mandatory init variables | `block_sizes`: Set of block sizes to split the document into. Splitting is applied in descending order of block size. |
| Mandatory run variables | `documents`: A list of documents to split into hierarchical blocks |
| Output variables | `documents`: A list of hierarchical documents |
| API reference | PreProcessors |
| GitHub link | https://github.com/deepset-ai/haystack/blob/dae8c7babaf28d2ffab4f2a8dedecd63e2394fb4/haystack/components/preprocessors/hierarchical_document_splitter.py |
Overview

The HierarchicalDocumentSplitter divides documents into blocks of different sizes, creating a tree-like structure. In this hierarchy, the original document serves as the root node, while the smallest text blocks form the leaf nodes. All intermediate blocks are organized so that smaller blocks become children of their larger parent blocks, establishing parent-child relationships throughout the document structure.

The AutoMergingRetriever component then leverages this hierarchical structure to improve document retrieval.
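As a minimal sketch of that pairing, assuming the common setup in which leaf blocks are indexed for matching while parent blocks live in a second store that the retriever merges from (the block sizes and threshold below are illustrative, not requirements from this page):

```python
from haystack import Document
from haystack.components.preprocessors import HierarchicalDocumentSplitter
from haystack.components.retrievers import AutoMergingRetriever
from haystack.document_stores.in_memory import InMemoryDocumentStore

# Illustrative block sizes; pick values that suit your documents.
splitter = HierarchicalDocumentSplitter(block_sizes={10, 3}, split_by="word")
all_blocks = splitter.run([Document(content="some long text ...")])["documents"]

# Leaf blocks have no children; everything else is a parent block.
leaves = [d for d in all_blocks if not d.meta["children_ids"]]
parents = [d for d in all_blocks if d.meta["children_ids"]]

leaf_store, parent_store = InMemoryDocumentStore(), InMemoryDocumentStore()
leaf_store.write_documents(leaves)
parent_store.write_documents(parents)

# Merges matched leaves back into their parent once enough of them match;
# the 0.5 threshold is an example value.
retriever = AutoMergingRetriever(document_store=parent_store, threshold=0.5)
```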
These additional parameters can be set when the component is initialized:

- `split_by` can be `"word"` (default), `"sentence"`, `"passage"`, or `"page"`.
- `split_overlap` is an integer indicating the number of overlapping words, sentences, or passages between chunks; the default is 0 (see the sketch after this list).
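As a quick illustration of these parameters, the snippet below (with values chosen arbitrarily for the sketch) builds blocks of two sentences that each share one sentence of overlap with their neighbours:

```python
from haystack.components.preprocessors import HierarchicalDocumentSplitter

# Illustrative settings: two-sentence blocks, each overlapping its
# neighbour by one sentence.
splitter = HierarchicalDocumentSplitter(block_sizes={2}, split_overlap=1, split_by="sentence")
```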
Usage
On its own
```python
from haystack import Document
from haystack.components.preprocessors import HierarchicalDocumentSplitter

doc = Document(content="This is a simple test document")

splitter = HierarchicalDocumentSplitter(block_sizes={3, 2}, split_overlap=0, split_by="word")
splitter.run([doc])

>> {'documents': [Document(id=3f7..., content: 'This is a simple test document', meta: {'block_size': 0, 'parent_id': None, 'children_ids': ['5ff..', '8dc..'], 'level': 0}),
>> Document(id=5ff.., content: 'This is a ', meta: {'block_size': 3, 'parent_id': '3f7..', 'children_ids': ['f19..', '52c..'], 'level': 1, 'source_id': '3f7..', 'page_number': 1, 'split_id': 0, 'split_idx_start': 0}),
>> Document(id=8dc.., content: 'simple test document', meta: {'block_size': 3, 'parent_id': '3f7..', 'children_ids': ['39d..', 'e23..'], 'level': 1, 'source_id': '3f7..', 'page_number': 1, 'split_id': 1, 'split_idx_start': 10}),
>> Document(id=f19.., content: 'This is ', meta: {'block_size': 2, 'parent_id': '5ff..', 'children_ids': [], 'level': 2, 'source_id': '5ff..', 'page_number': 1, 'split_id': 0, 'split_idx_start': 0}),
>> Document(id=52c.., content: 'a ', meta: {'block_size': 2, 'parent_id': '5ff..', 'children_ids': [], 'level': 2, 'source_id': '5ff..', 'page_number': 1, 'split_id': 1, 'split_idx_start': 8}),
>> Document(id=39d.., content: 'simple test ', meta: {'block_size': 2, 'parent_id': '8dc..', 'children_ids': [], 'level': 2, 'source_id': '8dc..', 'page_number': 1, 'split_id': 0, 'split_idx_start': 0}),
>> Document(id=e23.., content: 'document', meta: {'block_size': 2, 'parent_id': '8dc..', 'children_ids': [], 'level': 2, 'source_id': '8dc..', 'page_number': 1, 'split_id': 1, 'split_idx_start': 12})]}
```
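Because every block records its `parent_id` and `children_ids`, the tree is easy to walk. Continuing the example above, here is a small sketch (the `print_tree` helper is our own, not part of Haystack):

```python
# Hypothetical helper: print the hierarchy by following children_ids
# from the root document (the one without a parent) downwards.
docs = splitter.run([doc])["documents"]
by_id = {d.id: d for d in docs}

def print_tree(doc_id: str, indent: int = 0) -> None:
    d = by_id[doc_id]
    print("  " * indent + repr(d.content))
    for child_id in d.meta["children_ids"]:
        print_tree(child_id, indent + 1)

root = next(d for d in docs if d.meta["parent_id"] is None)
print_tree(root.id)
```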
In a pipeline
This Haystack pipeline processes .md files by converting them to documents, cleaning the text, splitting it into hierarchical sentence-based blocks, and storing the results in an InMemoryDocumentStore.
```python
from pathlib import Path

from haystack import Pipeline
from haystack.components.converters.txt import TextFileToDocument
from haystack.components.preprocessors import DocumentCleaner, HierarchicalDocumentSplitter
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore

document_store = InMemoryDocumentStore()

# Use a lowercase variable name so the Pipeline class is not shadowed.
pipeline = Pipeline()
pipeline.add_component(instance=TextFileToDocument(), name="text_file_converter")
pipeline.add_component(instance=DocumentCleaner(), name="cleaner")
pipeline.add_component(
    instance=HierarchicalDocumentSplitter(block_sizes={10, 6, 3}, split_overlap=0, split_by="sentence"),
    name="splitter",
)
pipeline.add_component(instance=DocumentWriter(document_store=document_store), name="writer")

pipeline.connect("text_file_converter.documents", "cleaner.documents")
pipeline.connect("cleaner.documents", "splitter.documents")
pipeline.connect("splitter.documents", "writer.documents")

path = "path/to/your/files"
files = list(Path(path).glob("*.md"))
pipeline.run({"text_file_converter": {"sources": files}})
```
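After the run, the store holds blocks from every level of the hierarchy. As an illustrative follow-up (calling `filter_documents()` with no filters returns everything from an InMemoryDocumentStore), you can separate leaf blocks from parent blocks:

```python
# Illustrative inspection: leaf blocks are the ones with no children.
stored = document_store.filter_documents()
leaves = [d for d in stored if not d.meta.get("children_ids")]
print(f"{len(stored)} blocks stored, {len(leaves)} of them leaves")
```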