ChonkieRecursiveDocumentSplitter

ChonkieRecursiveDocumentSplitter splits documents using a hierarchy of splitting rules via Chonkie's RecursiveChunker. It applies progressively finer-grained splits until all chunks satisfy the configured size constraints, making it effective for structured text like Markdown or code.

  • Most common position in a pipeline: in indexing pipelines, after Converters and before Embedders
  • Mandatory run variables: documents, a list of documents
  • Output variables: documents, a list of documents
  • API reference: Chonkie
  • GitHub link: https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/chonkie

Overview

ChonkieRecursiveDocumentSplitter wraps Chonkie's RecursiveChunker to split documents by applying splitting rules level by level. If a chunk produced at one level still exceeds chunk_size, the next level's rules are applied to it. This continues recursively until all chunks are within the size limit.
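
To see the recursion in action, shrink chunk_size so a whole paragraph no longer fits. This is a minimal sketch; the exact chunk boundaries depend on Chonkie's default rules:

python
from haystack import Document
from haystack_integrations.components.preprocessors.chonkie import (
    ChonkieRecursiveDocumentSplitter,
)

# With the default "character" tokenizer, chunk_size is a character budget.
# A tiny budget forces the chunker past the paragraph level ("\n\n") and
# down to finer-grained rules until every chunk fits.
splitter = ChonkieRecursiveDocumentSplitter(chunk_size=32)
doc = Document(content="First paragraph. It has two sentences.\n\nSecond paragraph.")
for chunk in splitter.run(documents=[doc])["documents"]:
    print(repr(chunk.content), chunk.meta["token_count"])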

You can customize the splitting behavior by providing RecursiveRules from Chonkie, as shown in the With custom rules example below. See the Chonkie documentation for details on defining custom rules.

Each output document includes the original document's metadata plus:

  • source_id: ID of the original document
  • page_number: Page number of the chunk within the original document
  • split_id: Index of the chunk within the document
  • split_idx_start / split_idx_end: Character offsets of the chunk in the original text
  • token_count: Number of tokens in the chunk
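
For example, you can read these fields straight off the returned documents (a minimal sketch):

python
from haystack import Document
from haystack_integrations.components.preprocessors.chonkie import (
    ChonkieRecursiveDocumentSplitter,
)

splitter = ChonkieRecursiveDocumentSplitter(chunk_size=64)
result = splitter.run(
    documents=[
        Document(content="One paragraph.\n\nAnother, much longer paragraph that will not fit into a single chunk."),
    ],
)
for chunk in result["documents"]:
    m = chunk.meta
    # Provenance of each chunk within its source document.
    print(m["source_id"], m["split_id"], m["split_idx_start"], m["split_idx_end"], m["token_count"])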

Installation

bash
pip install chonkie-haystack

Configuration

  • tokenizer (default: "character"): Tokenizer to use. Common options: "character", "gpt2", "cl100k_base". See the Chonkie docs for all options.
  • chunk_size (default: 2048): Maximum number of tokens per chunk.
  • min_characters_per_chunk (default: 24): Minimum number of characters a chunk must contain.
  • rules (default: None): Custom RecursiveRules defining the splitting hierarchy. If None, Chonkie's default rules are used.
  • skip_empty_documents (default: True): Whether to skip documents with empty content.
  • page_break_character (default: "\f"): Character used to detect page breaks when tracking page numbers.
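
For example, a splitter that counts GPT-2 tokens rather than characters. Note that non-default tokenizers may require extra tokenizer dependencies, depending on your Chonkie installation:

python
from haystack_integrations.components.preprocessors.chonkie import (
    ChonkieRecursiveDocumentSplitter,
)

# All parameters below come from the table above; with "gpt2",
# chunk_size means a budget of 512 GPT-2 tokens per chunk.
splitter = ChonkieRecursiveDocumentSplitter(
    tokenizer="gpt2",
    chunk_size=512,
    min_characters_per_chunk=24,
    skip_empty_documents=True,
)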

Usage

On its own

python
from haystack import Document
from haystack_integrations.components.preprocessors.chonkie import (
    ChonkieRecursiveDocumentSplitter,
)

chunker = ChonkieRecursiveDocumentSplitter(chunk_size=512)
documents = [
    Document(
        content="# Introduction\n\nHaystack is a framework.\n\n## Features\n\nIt supports RAG pipelines.",
    ),
]
result = chunker.run(documents=documents)
print(result["documents"])

With custom rules

python
from chonkie.types.recursive import RecursiveLevel, RecursiveRules
from haystack import Document
from haystack_integrations.components.preprocessors.chonkie import (
    ChonkieRecursiveDocumentSplitter,
)

rules = RecursiveRules(
    levels=[
        RecursiveLevel(delimiters=["\n\n"]),
        RecursiveLevel(delimiters=["\n"]),
        RecursiveLevel(delimiters=[". ", "! ", "? "]),
    ],
)

chunker = ChonkieRecursiveDocumentSplitter(chunk_size=256, rules=rules)
documents = [Document(content="First paragraph.\n\nSecond paragraph with more detail.")]
result = chunker.run(documents=documents)
print(result["documents"])

In a pipeline

python
from pathlib import Path

from haystack import Pipeline
from haystack.components.converters import TextFileToDocument
from haystack.components.preprocessors import DocumentCleaner
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack_integrations.components.preprocessors.chonkie import (
    ChonkieRecursiveDocumentSplitter,
)

document_store = InMemoryDocumentStore()

p = Pipeline()
p.add_component("converter", TextFileToDocument())
p.add_component("cleaner", DocumentCleaner())
p.add_component("splitter", ChonkieRecursiveDocumentSplitter(chunk_size=512))
p.add_component("writer", DocumentWriter(document_store=document_store))

p.connect("converter.documents", "cleaner.documents")
p.connect("cleaner.documents", "splitter.documents")
p.connect("splitter.documents", "writer.documents")

files = list(Path("path/to/your/files").glob("*.md"))
p.run({"converter": {"sources": files}})