Hierarchical Document Splitter for Haystack.
Module haystack_experimental.components.splitters.hierarchical_doc_splitter
HierarchicalDocumentSplitter
Splits a document into blocks of different sizes, building a hierarchical tree structure from those blocks.
The root node of the tree is the original document and the leaf nodes are the smallest blocks. The blocks in between are connected so that each smaller block is a child of the larger block it was split from.
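The tree described above can be sketched in plain Python. This is only an illustration of the idea, not the component's implementation: the Block class and the split_words, build_tree, and leaves helpers are hypothetical names, and the sketch joins words with single spaces, whereas the real splitter keeps the original whitespace.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Block:
    """Hypothetical stand-in for one node of the tree; not a Haystack class."""
    content: str
    level: int
    block_size: int
    parent: Optional["Block"] = field(default=None, repr=False, compare=False)
    children: List["Block"] = field(default_factory=list)


def split_words(text: str, size: int) -> List[str]:
    """Group whitespace-separated words into chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def build_tree(text: str, block_sizes: List[int]) -> Block:
    """Split `text` level by level, largest block size first."""
    sizes = sorted(set(block_sizes), reverse=True)
    root = Block(content=text, level=0, block_size=0)
    frontier = [root]
    for level, size in enumerate(sizes, start=1):
        next_frontier = []
        for parent in frontier:
            # Every chunk of the parent becomes one of its children.
            for chunk in split_words(parent.content, size):
                child = Block(chunk, level, size, parent)
                parent.children.append(child)
                next_frontier.append(child)
        frontier = next_frontier
    return root


def leaves(node: Block) -> List[Block]:
    """Collect the smallest blocks (nodes without children), left to right."""
    if not node.children:
        return [node]
    return [leaf for child in node.children for leaf in leaves(child)]


root = build_tree("This is a simple test document", [3, 2])
print([c.content for c in root.children])   # ['This is a', 'simple test document']
print([b.content for b in leaves(root)])    # ['This is', 'a', 'simple test', 'document']
```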
Usage example
from haystack import Document
from haystack_experimental.components.splitters import HierarchicalDocumentSplitter
doc = Document(content="This is a simple test document")
splitter = HierarchicalDocumentSplitter(block_sizes={3, 2}, split_overlap=0, split_by="word")
splitter.run([doc])
>> {'documents': [Document(id=3f7..., content: 'This is a simple test document', meta: {'block_size': 0, 'parent_id': None, 'children_ids': ['5ff..', '8dc..'], 'level': 0}),
>> Document(id=5ff.., content: 'This is a ', meta: {'block_size': 3, 'parent_id': '3f7..', 'children_ids': ['f19..', '52c..'], 'level': 1, 'source_id': '3f7..', 'page_number': 1, 'split_id': 0, 'split_idx_start': 0}),
>> Document(id=8dc.., content: 'simple test document', meta: {'block_size': 3, 'parent_id': '3f7..', 'children_ids': ['39d..', 'e23..'], 'level': 1, 'source_id': '3f7..', 'page_number': 1, 'split_id': 1, 'split_idx_start': 10}),
>> Document(id=f19.., content: 'This is ', meta: {'block_size': 2, 'parent_id': '5ff..', 'children_ids': [], 'level': 2, 'source_id': '5ff..', 'page_number': 1, 'split_id': 0, 'split_idx_start': 0}),
>> Document(id=52c.., content: 'a ', meta: {'block_size': 2, 'parent_id': '5ff..', 'children_ids': [], 'level': 2, 'source_id': '5ff..', 'page_number': 1, 'split_id': 1, 'split_idx_start': 8}),
>> Document(id=39d.., content: 'simple test ', meta: {'block_size': 2, 'parent_id': '8dc..', 'children_ids': [], 'level': 2, 'source_id': '8dc..', 'page_number': 1, 'split_id': 0, 'split_idx_start': 0}),
>> Document(id=e23.., content: 'document', meta: {'block_size': 2, 'parent_id': '8dc..', 'children_ids': [], 'level': 2, 'source_id': '8dc..', 'page_number': 1, 'split_id': 1, 'split_idx_start': 12})]}
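The parent_id and children_ids metadata shown in the output above is enough to navigate the hierarchy. The following is a minimal sketch under that assumption: leaf_documents and parent_of are hypothetical helpers, not part of the library, and the sample data uses stand-in objects with placeholder ids rather than real Document instances.

```python
from types import SimpleNamespace
from typing import Any, List, Optional


def leaf_documents(docs: List[Any]) -> List[Any]:
    """Return the documents whose `children_ids` meta entry is empty or missing."""
    return [d for d in docs if not d.meta.get("children_ids")]


def parent_of(doc: Any, docs: List[Any]) -> Optional[Any]:
    """Look up a document's parent via `parent_id`; returns None for the root."""
    by_id = {d.id: d for d in docs}
    return by_id.get(doc.meta.get("parent_id"))


# Stand-in objects with placeholder ids, shaped like part of the output above.
docs = [
    SimpleNamespace(id="root", content="This is a simple test document",
                    meta={"parent_id": None, "children_ids": ["b1", "b2"], "level": 0}),
    SimpleNamespace(id="b1", content="This is a ",
                    meta={"parent_id": "root", "children_ids": [], "level": 1}),
    SimpleNamespace(id="b2", content="simple test document",
                    meta={"parent_id": "root", "children_ids": [], "level": 1}),
]

print([d.content for d in leaf_documents(docs)])  # ['This is a ', 'simple test document']
print(parent_of(docs[1], docs).id)                # root
```

The helpers only rely on the id and meta attributes, so they should also work on the list returned under the 'documents' key of splitter.run.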
HierarchicalDocumentSplitter.__init__
def __init__(block_sizes: Set[int],
             split_overlap: int = 0,
             split_by: Literal["word", "sentence", "page", "passage"] = "word")
Initialize HierarchicalDocumentSplitter.
Arguments:
block_sizes
: Set of block sizes to split the document into. The blocks are split in descending order.
split_overlap
: The number of overlapping units for each split.
split_by
: The unit for splitting your documents.
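To illustrate how a block size and an overlap interact, here is a small sketch for word units. It assumes that an overlap of n means each block repeats the last n units of the previous block; the word_windows helper is hypothetical, and the library's handling of short trailing blocks may differ.

```python
from typing import List


def word_windows(text: str, size: int, overlap: int = 0) -> List[str]:
    """Split `text` into blocks of `size` words sharing `overlap` words."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap  # how far each new block advances
    windows = []
    start = 0
    while start < len(words):
        windows.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this block already reached the end of the text
        start += step
    return windows


text = "This is a simple test document"
print(word_windows(text, size=3, overlap=0))  # ['This is a', 'simple test document']
print(word_windows(text, size=3, overlap=1))  # ['This is a', 'a simple test', 'test document']
```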
HierarchicalDocumentSplitter.run
@component.output_types(documents=List[Document])
def run(documents: List[Document])
Builds a hierarchical document structure for each document in a list of documents.
Arguments:
documents
: List of Documents to split into hierarchical blocks.
Returns:
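A dictionary with a 'documents' key holding the list of Documents that make up the hierarchy: each original document followed by its blocks.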
HierarchicalDocumentSplitter.build_hierarchy_from_doc
def build_hierarchy_from_doc(document: Document) -> List[Document]
Build a hierarchical tree document structure from a single document.
Given a document, this function splits it into hierarchical blocks of different sizes. Each block is a Document whose metadata (block_size, parent_id, children_ids, level) records its position in the tree.
Arguments:
document
: Document to split into hierarchical blocks.
Returns:
List of Documents forming the hierarchy, starting with the root document.
HierarchicalDocumentSplitter.to_dict
def to_dict() -> Dict[str, Any]
Returns a dictionary representation of the component.
Returns:
Serialized dictionary representation of the component.
HierarchicalDocumentSplitter.from_dict
@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "HierarchicalDocumentSplitter"
Deserialize this component from a dictionary.
Arguments:
data
: The dictionary to deserialize the component from.
Returns:
The deserialized component.
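The to_dict and from_dict pair is meant to round-trip a component's configuration. The sketch below shows that pattern with a hypothetical stand-in class, SplitterSettings, so it runs without Haystack installed. The "type" plus "init_parameters" layout follows the convention Haystack components generally use, but the exact dictionary produced by HierarchicalDocumentSplitter is an assumption here.

```python
import json
from typing import Any, Dict, Iterable, List


class SplitterSettings:
    """Hypothetical stand-in for the component; not the Haystack class."""

    def __init__(self, block_sizes: Iterable[int], split_overlap: int = 0,
                 split_by: str = "word"):
        # Stored as a descending list so the value is JSON-serializable.
        self.block_sizes: List[int] = sorted(set(block_sizes), reverse=True)
        self.split_overlap = split_overlap
        self.split_by = split_by

    def to_dict(self) -> Dict[str, Any]:
        """Serialize the init parameters together with the class path."""
        return {
            "type": f"{type(self).__module__}.{type(self).__name__}",
            "init_parameters": {
                "block_sizes": self.block_sizes,
                "split_overlap": self.split_overlap,
                "split_by": self.split_by,
            },
        }

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "SplitterSettings":
        """Rebuild an instance from the output of `to_dict`."""
        return cls(**data["init_parameters"])


original = SplitterSettings(block_sizes={2, 3})
# Pass through JSON to mimic saving and loading a pipeline definition.
restored = SplitterSettings.from_dict(json.loads(json.dumps(original.to_dict())))
print(restored.block_sizes)  # [3, 2]
```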