ChonkieRecursiveDocumentSplitter
ChonkieRecursiveDocumentSplitter splits documents using a hierarchy of splitting rules via Chonkie's RecursiveChunker.
It applies progressively finer-grained splits until all chunks satisfy the configured size constraints, making it effective for structured text like Markdown or code.
| Most common position in a pipeline | In indexing pipelines after Converters, before Embedders |
| Mandatory run variables | documents: A list of documents to split |
| Output variables | documents: A list of split documents |
| API reference | Chonkie |
| GitHub link | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/chonkie |
Overview
ChonkieRecursiveDocumentSplitter wraps Chonkie's RecursiveChunker to split documents by applying splitting rules level by level.
If a chunk produced at one level still exceeds chunk_size, the next level's rules are applied to it.
This continues recursively until all chunks are within the size limit.
You can customize the splitting behavior by providing RecursiveRules from Chonkie.
See the Chonkie documentation for details on defining custom rules.
Each output document includes the original document's metadata plus:
- source_id: ID of the original document
- page_number: Page number of the chunk within the original document
- split_id: Index of the chunk within the document
- split_idx_start / split_idx_end: Character offsets of the chunk in the original text
- token_count: Number of tokens in the chunk
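For illustration, here is a minimal sketch (using the default settings) that reads these fields from a chunk's meta:

from haystack import Document
from haystack_integrations.components.preprocessors.chonkie import (
    ChonkieRecursiveDocumentSplitter,
)

splitter = ChonkieRecursiveDocumentSplitter()
result = splitter.run(documents=[Document(content="First paragraph.\n\nSecond paragraph.")])

# Each chunk keeps a reference to its source document and its position within it.
first_chunk = result["documents"][0]
print(first_chunk.meta["source_id"], first_chunk.meta["split_id"], first_chunk.meta["token_count"])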
Installation
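Install the Chonkie integration with pip (the package name below assumes the standard naming used by haystack-core-integrations packages):

pip install chonkie-haystack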
Configuration
| Parameter | Default | Description |
|---|---|---|
tokenizer | "character" | Tokenizer to use. Common options: "character", "gpt2", "cl100k_base". See Chonkie docs for all options. |
chunk_size | 2048 | Maximum number of tokens per chunk. |
min_characters_per_chunk | 24 | Minimum number of characters a chunk must contain. |
rules | None | Custom RecursiveRules defining the splitting hierarchy. If None, Chonkie's default rules are used. |
skip_empty_documents | True | Whether to skip documents with empty content. |
page_break_character | "\f" | Character used to detect page breaks when tracking page numbers. |
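As an illustration (values chosen arbitrarily, not as recommendations), a token-based configuration could look like this:

from haystack_integrations.components.preprocessors.chonkie import (
    ChonkieRecursiveDocumentSplitter,
)

splitter = ChonkieRecursiveDocumentSplitter(
    tokenizer="cl100k_base",      # count chunk_size in cl100k_base tokens instead of characters
    chunk_size=512,               # maximum tokens per chunk
    min_characters_per_chunk=50,  # minimum characters a chunk must contain
)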
Usage
On its own
from haystack import Document
from haystack_integrations.components.preprocessors.chonkie import (
    ChonkieRecursiveDocumentSplitter,
)

chunker = ChonkieRecursiveDocumentSplitter(chunk_size=512)

documents = [
    Document(
        content="# Introduction\n\nHaystack is a framework.\n\n## Features\n\nIt supports RAG pipelines.",
    ),
]

result = chunker.run(documents=documents)
print(result["documents"])
With custom rules
from chonkie.types.recursive import RecursiveLevel, RecursiveRules
from haystack import Document
from haystack_integrations.components.preprocessors.chonkie import (
    ChonkieRecursiveDocumentSplitter,
)

rules = RecursiveRules(
    levels=[
        RecursiveLevel(delimiters=["\n\n"]),
        RecursiveLevel(delimiters=["\n"]),
        RecursiveLevel(delimiters=[". ", "! ", "? "]),
    ],
)
chunker = ChonkieRecursiveDocumentSplitter(chunk_size=256, rules=rules)
documents = [Document(content="First paragraph.\n\nSecond paragraph with more detail.")]
result = chunker.run(documents=documents)
print(result["documents"])
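The levels are applied top to bottom: the splitter first tries paragraph breaks, then single line breaks, and only falls back to sentence delimiters for chunks that still exceed chunk_size.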
In a pipeline
from pathlib import Path
from haystack import Pipeline
from haystack.components.converters import TextFileToDocument
from haystack.components.preprocessors import DocumentCleaner
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack_integrations.components.preprocessors.chonkie import (
ChonkieRecursiveDocumentSplitter,
)
document_store = InMemoryDocumentStore()
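# Indexing pipeline: convert text files, clean them, split them into chunks, and write the chunks to the store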
p = Pipeline()
p.add_component("converter", TextFileToDocument())
p.add_component("cleaner", DocumentCleaner())
p.add_component("splitter", ChonkieRecursiveDocumentSplitter(chunk_size=512))
p.add_component("writer", DocumentWriter(document_store=document_store))
p.connect("converter.documents", "cleaner.documents")
p.connect("cleaner.documents", "splitter.documents")
p.connect("splitter.documents", "writer.documents")
files = list(Path("path/to/your/files").glob("*.md"))
p.run({"converter": {"sources": files}})
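In a complete indexing pipeline, you would typically add a Document Embedder between the splitter and the writer, as noted in the table at the top of this page.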