API Reference

Split documents into hierarchical chunks.

Module haystack_experimental.components.splitters.hierarchical_doc_splitter

HierarchicalDocumentSplitter

Splits a document into blocks of different sizes, building a hierarchical tree structure of blocks.

The root node of the tree is the original document, and the leaf nodes are the smallest blocks. The blocks in between are linked so that each smaller block is a child of the larger block it was split from.

Usage example

from haystack import Document
from haystack_experimental.components.splitters import HierarchicalDocumentSplitter

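# A single short document to split
doc = Document(content="This is a simple test document")
# Split into blocks of 3 words, then split each of those into blocks of 2 words
splitter = HierarchicalDocumentSplitter(block_sizes={3, 2}, split_overlap=0, split_by="word")
splitter.run([doc])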
>> {'documents': [Document(id=3f7..., content: 'This is a simple test document', meta: {'block_size': 0, 'parent_id': None, 'children_ids': ['5ff..', '8dc..'], 'level': 0}),
>> Document(id=5ff.., content: 'This is a ', meta: {'block_size': 3, 'parent_id': '3f7..', 'children_ids': ['f19..', '52c..'], 'level': 1, 'source_id': '3f7..', 'page_number': 1, 'split_id': 0, 'split_idx_start': 0}),
>> Document(id=8dc.., content: 'simple test document', meta: {'block_size': 3, 'parent_id': '3f7..', 'children_ids': ['39d..', 'e23..'], 'level': 1, 'source_id': '3f7..', 'page_number': 1, 'split_id': 1, 'split_idx_start': 10}),
>> Document(id=f19.., content: 'This is ', meta: {'block_size': 2, 'parent_id': '5ff..', 'children_ids': [], 'level': 2, 'source_id': '5ff..', 'page_number': 1, 'split_id': 0, 'split_idx_start': 0}),
>> Document(id=52c.., content: 'a ', meta: {'block_size': 2, 'parent_id': '5ff..', 'children_ids': [], 'level': 2, 'source_id': '5ff..', 'page_number': 1, 'split_id': 1, 'split_idx_start': 8}),
>> Document(id=39d.., content: 'simple test ', meta: {'block_size': 2, 'parent_id': '8dc..', 'children_ids': [], 'level': 2, 'source_id': '8dc..', 'page_number': 1, 'split_id': 0, 'split_idx_start': 0}),
>> Document(id=e23.., content: 'document', meta: {'block_size': 2, 'parent_id': '8dc..', 'children_ids': [], 'level': 2, 'source_id': '8dc..', 'page_number': 1, 'split_id': 1, 'split_idx_start': 12})]}
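The parent_id values in the output above make it possible to climb from a small block to the larger blocks that contain it. The following is a minimal sketch of that lookup. StubDocument is a hypothetical stand-in that only mimics the id, content, and meta attributes seen in the output, the ids are the truncated placeholders from the example, and the ancestors helper is an illustration rather than part of the library.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class StubDocument:
    """Hypothetical stand-in with the id/content/meta attributes shown in the output."""
    id: str
    content: str
    meta: Dict[str, Any] = field(default_factory=dict)


# Three blocks copied from the example output; ids are truncated placeholders.
docs = [
    StubDocument("3f7..", "This is a simple test document", {"parent_id": None, "level": 0}),
    StubDocument("5ff..", "This is a ", {"parent_id": "3f7..", "level": 1}),
    StubDocument("f19..", "This is ", {"parent_id": "5ff..", "level": 2}),
]

# Index the blocks by id so parents can be looked up directly.
by_id = {doc.id: doc for doc in docs}


def ancestors(doc: StubDocument, index: Dict[str, StubDocument]) -> List[StubDocument]:
    """Return the parent chain of `doc`, nearest parent first, ending at the root."""
    chain: List[StubDocument] = []
    parent_id: Optional[str] = doc.meta["parent_id"]
    while parent_id is not None:
        parent = index[parent_id]
        chain.append(parent)
        parent_id = parent.meta["parent_id"]
    return chain


leaf = by_id["f19.."]
chain = ancestors(leaf, by_id)
print([parent.content for parent in chain])
```

With real output, the same loop would presumably work on the Document objects returned under the documents key, since they expose the same attributes.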

HierarchicalDocumentSplitter.__init__

def __init__(block_sizes: Set[int],
             split_overlap: int = 0,
             split_by: Literal["word", "sentence", "page",
                               "passage"] = "word")

Initialize HierarchicalDocumentSplitter.

Arguments:

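  • block_sizes: Set of block sizes to split the document into. The sizes are applied in descending order, from the largest blocks to the smallest.
  • split_overlap: The number of units, as defined by split_by, that consecutive splits share.
  • split_by: The unit for splitting your documents. One of "word", "sentence", "page", or "passage".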
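To make the effect of block_sizes concrete, here is a simplified sketch of descending, level-by-level splitting by word. It is not the component's implementation: split_words and build_levels are hypothetical helpers, overlap is ignored, and whitespace is normalized, so the blocks lack the trailing spaces seen in the real output.

```python
from typing import Dict, List, Set


def split_words(text: str, size: int) -> List[str]:
    """Split text into consecutive chunks of at most `size` words (no overlap)."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def build_levels(text: str, block_sizes: Set[int]) -> Dict[int, List[str]]:
    """Map each tree level to its blocks; level 0 holds the whole text."""
    levels: Dict[int, List[str]] = {0: [text]}
    parents = [text]
    # Largest block size first, so each level subdivides the previous one.
    for level, size in enumerate(sorted(block_sizes, reverse=True), start=1):
        children: List[str] = []
        for parent in parents:
            children.extend(split_words(parent, size))
        levels[level] = children
        parents = children
    return levels


levels = build_levels("This is a simple test document", {3, 2})
for level, blocks in levels.items():
    print(level, blocks)
```

For the six-word example document, level 1 contains the two 3-word blocks and level 2 contains the four blocks obtained by splitting each of those with size 2, which matches the block boundaries in the usage example.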

HierarchicalDocumentSplitter.run

@component.output_types(documents=List[Document])
def run(documents: List[Document])

Builds a hierarchical document structure for each document in a list of documents.

Arguments:

  • documents: List of Documents to split into hierarchical blocks.

Returns:

A dictionary with a "documents" key containing the list of Documents that make up the hierarchy.
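A common follow-up step is to pick out the leaf blocks or to group blocks by depth. The sketch below does both on a hand-written result shaped like the usage example output. StubDocument is a hypothetical stand-in for the real Document class, and only the level and children_ids metadata fields from that example are assumed.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class StubDocument:
    """Hypothetical stand-in exposing the content/meta attributes of a Document."""
    content: str
    meta: Dict[str, Any] = field(default_factory=dict)


# Hand-written data mirroring the usage example; child ids are truncated placeholders.
result = {
    "documents": [
        StubDocument("This is a simple test document", {"level": 0, "children_ids": ["5ff..", "8dc.."]}),
        StubDocument("This is a ", {"level": 1, "children_ids": ["f19..", "52c.."]}),
        StubDocument("simple test document", {"level": 1, "children_ids": ["39d..", "e23.."]}),
        StubDocument("This is ", {"level": 2, "children_ids": []}),
        StubDocument("a ", {"level": 2, "children_ids": []}),
        StubDocument("simple test ", {"level": 2, "children_ids": []}),
        StubDocument("document", {"level": 2, "children_ids": []}),
    ]
}

# Leaf blocks are the ones that have no children.
leaves = [doc for doc in result["documents"] if not doc.meta["children_ids"]]

# Group block contents by their depth in the tree.
by_level: Dict[int, List[str]] = defaultdict(list)
for doc in result["documents"]:
    by_level[doc.meta["level"]].append(doc.content)

print([doc.content for doc in leaves])
print(dict(by_level))
```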

HierarchicalDocumentSplitter.build_hierarchy_from_doc

def build_hierarchy_from_doc(document: Document) -> List[Document]

Build a hierarchical tree document structure from a single document.

Given a document, this function splits it into hierarchical blocks of different sizes. Each block is returned as a Document whose metadata records its position in the hierarchy.

Arguments:

  • document: Document to split into hierarchical blocks.

Returns:

List of Document objects that make up the hierarchy.

HierarchicalDocumentSplitter.to_dict

def to_dict() -> Dict[str, Any]

Returns a dictionary representation of the component.

Returns:

Serialized dictionary representation of the component.

HierarchicalDocumentSplitter.from_dict

@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "HierarchicalDocumentSplitter"

Deserialize this component from a dictionary.

Arguments:

  • data: The dictionary to deserialize the component from.

Returns:

The deserialized component.
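The to_dict and from_dict pair is meant to round-trip a component through a dictionary. The sketch below illustrates the idea with a hypothetical StubSplitter class that only shares the documented constructor arguments. The dictionary layout (a type name plus init_parameters) and the conversion of block_sizes to a sorted list are assumptions made for illustration, not a statement of what the real component emits.

```python
import json
from typing import Any, Dict, Set


class StubSplitter:
    """Hypothetical stand-in with the same constructor arguments as the documented component."""

    def __init__(self, block_sizes: Set[int], split_overlap: int = 0, split_by: str = "word"):
        self.block_sizes = block_sizes
        self.split_overlap = split_overlap
        self.split_by = split_by

    def to_dict(self) -> Dict[str, Any]:
        # Assumed layout: a type name plus the constructor arguments.
        # The set is stored as a sorted list because JSON has no set type.
        return {
            "type": f"{type(self).__module__}.{type(self).__name__}",
            "init_parameters": {
                "block_sizes": sorted(self.block_sizes),
                "split_overlap": self.split_overlap,
                "split_by": self.split_by,
            },
        }

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "StubSplitter":
        params = dict(data["init_parameters"])
        # Restore the set that to_dict flattened into a list.
        params["block_sizes"] = set(params["block_sizes"])
        return cls(**params)


original = StubSplitter(block_sizes={3, 2})
payload = json.dumps(original.to_dict())  # the dictionary survives a JSON round trip
restored = StubSplitter.from_dict(json.loads(payload))
print(restored.block_sizes, restored.split_overlap, restored.split_by)
```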