# ChonkieTokenDocumentSplitter
ChonkieTokenDocumentSplitter splits documents into fixed-size token-based chunks using Chonkie's TokenChunker.
It supports multiple tokenizers and is well-suited for splitting long documents before indexing.
| | |
| --- | --- |
| Most common position in a pipeline | In indexing pipelines, after Converters and DocumentCleaner and before Embedders |
| Mandatory run variables | "documents": A list of documents |
| Output variables | "documents": A list of documents |
| API reference | Chonkie |
| GitHub link | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/chonkie |
## Overview
ChonkieTokenDocumentSplitter wraps Chonkie's TokenChunker to split each input document into smaller chunks based on token count.
You can configure the tokenizer, chunk size, and overlap between chunks.
Each output document includes the original document's metadata plus:
- `source_id`: ID of the original document
- `page_number`: Page number of the chunk within the original document
- `split_id`: Index of the chunk within the document
- `split_idx_start` / `split_idx_end`: Character offsets of the chunk in the original text
- `token_count`: Number of tokens in the chunk
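As a rough illustration of the fields listed above, a chunk's metadata might look like the sketch below. All values are made up for the example and depend on your documents and splitter settings.

```python
# Illustrative only: field values depend on your documents and splitter settings.
chunk_meta = {
    "source_id": "id-of-the-original-document",
    "page_number": 1,        # page on which the chunk starts
    "split_id": 0,           # index of the chunk within its source document
    "split_idx_start": 0,    # character offsets of the chunk in the original text
    "split_idx_end": 512,
    "token_count": 128,      # number of tokens in the chunk
}
```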
## Installation
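Install the Chonkie integration package. The package name below is an assumption based on the `<integration>-haystack` naming convention used across haystack-core-integrations; check the GitHub repository linked above for the exact name.

```shell
pip install chonkie-haystack
```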
## Configuration
| Parameter | Default | Description |
| --- | --- | --- |
| `tokenizer` | `"character"` | Tokenizer to use. Common options: `"character"`, `"gpt2"`, `"cl100k_base"`. See the Chonkie docs for all options. |
| `chunk_size` | `2048` | Maximum number of tokens per chunk. |
| `chunk_overlap` | `0` | Number of overlapping tokens between consecutive chunks. |
| `skip_empty_documents` | `True` | Whether to skip documents with empty content. |
| `page_break_character` | `"\f"` | Character used to detect page breaks when tracking page numbers. |
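As a sketch of how these parameters fit together, the snippet below uses arbitrary values to produce small, slightly overlapping chunks and relies on the default `"\f"` page-break character to track page numbers:

```python
from haystack import Document
from haystack_integrations.components.preprocessors.chonkie import (
    ChonkieTokenDocumentSplitter,
)

# Arbitrary example values: small chunks, slight overlap, "\f" marks page breaks.
splitter = ChonkieTokenDocumentSplitter(
    tokenizer="character",
    chunk_size=100,
    chunk_overlap=10,
    page_break_character="\f",
)

doc = Document(content="Text on the first page.\fText on the second page.")
chunks = splitter.run(documents=[doc])["documents"]
for chunk in chunks:
    print(chunk.meta["page_number"], chunk.meta["token_count"])
```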
## Usage
### On its own
```python
from haystack import Document
from haystack_integrations.components.preprocessors.chonkie import (
ChonkieTokenDocumentSplitter,
)
# Split each document into chunks of up to 512 GPT-2 tokens, overlapping by 50 tokens.
chunker = ChonkieTokenDocumentSplitter(
tokenizer="gpt2",
chunk_size=512,
chunk_overlap=50,
)
documents = [
Document(
content="Haystack is an open-source framework for building LLM applications.",
),
]
result = chunker.run(documents=documents)
print(result["documents"])
```
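`run` returns a dictionary with a `documents` key containing the resulting chunks. Each chunk carries the metadata fields described in the Overview, such as `split_id` and `token_count`.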
### In a pipeline
```python
from pathlib import Path
from haystack import Pipeline
from haystack.components.converters import TextFileToDocument
from haystack.components.preprocessors import DocumentCleaner
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack_integrations.components.preprocessors.chonkie import (
ChonkieTokenDocumentSplitter,
)
document_store = InMemoryDocumentStore()

# Indexing pipeline: convert text files, clean them, split them into token-based chunks, and store them.
p = Pipeline()
p.add_component("converter", TextFileToDocument())
p.add_component("cleaner", DocumentCleaner())
p.add_component(
"splitter",
ChonkieTokenDocumentSplitter(tokenizer="gpt2", chunk_size=512),
)
p.add_component("writer", DocumentWriter(document_store=document_store))
p.connect("converter.documents", "cleaner.documents")
p.connect("cleaner.documents", "splitter.documents")
p.connect("splitter.documents", "writer.documents")
files = list(Path("path/to/your/files").glob("*.txt"))
p.run({"converter": {"sources": files}})
```
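After the pipeline runs, the token-based chunks are stored in the InMemoryDocumentStore, ready to be embedded or retrieved in a downstream pipeline.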