
ChonkieSemanticDocumentSplitter

ChonkieSemanticDocumentSplitter splits documents at semantically meaningful boundaries using Chonkie's SemanticChunker. Rather than splitting by a fixed token count, it uses an embedding model to detect topic shifts and keeps related sentences together.

  • Most common position in a pipeline: in indexing pipelines, after Converters and before Embedders
  • Mandatory run variables: "documents", a list of documents
  • Output variables: "documents", a list of documents
  • API reference: Chonkie
  • GitHub link: https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/chonkie

Overview

ChonkieSemanticDocumentSplitter wraps Chonkie's SemanticChunker to produce context-aware chunks by grouping sentences with similar semantic content. It computes embeddings for sentences and uses cosine similarity to find natural topic boundaries.
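
As a rough illustration of the idea only (not Chonkie's actual implementation, which also applies similarity windows, smoothing filters, and token budgets), consecutive sentences are grouped until the cosine similarity between neighbours drops below a threshold. The embed() function below is a hypothetical stand-in for the embedding model:

python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def split_on_topic_shift(sentences, embed, threshold=0.8):
    """Group consecutive sentences; start a new chunk when similarity drops."""
    chunks, current = [], [sentences[0]]
    for prev, curr in zip(sentences, sentences[1:]):
        if cosine(embed(prev), embed(curr)) < threshold:
            chunks.append(" ".join(current))  # topic shift detected
            current = [curr]
        else:
            current.append(curr)
    chunks.append(" ".join(current))
    return chunks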

The embedding model is loaded lazily — warm_up() is called automatically the first time run() is invoked, whether inside a pipeline or standalone.
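
If you prefer to pay the model-loading cost up front, for example at application start-up, you can also call warm_up() yourself before the first run(). A minimal sketch:

python
from haystack_integrations.components.preprocessors.chonkie import (
    ChonkieSemanticDocumentSplitter,
)

splitter = ChonkieSemanticDocumentSplitter()
splitter.warm_up()  # loads the embedding model now instead of on the first run()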

Each output document includes the original document's metadata plus:

  • source_id: ID of the original document
  • page_number: Page number of the chunk within the original document
  • split_id: Index of the chunk within the document
  • split_idx_start / split_idx_end: Character offsets of the chunk in the original text
  • token_count: Number of tokens in the chunk

Installation

bash
pip install chonkie-haystack

Configuration

  • embedding_model (default: "minishlab/potion-base-32M"): The embedding model used to compute sentence similarity. See the Chonkie docs for supported models.
  • threshold (default: 0.8): Cosine similarity threshold below which a sentence boundary becomes a split point.
  • chunk_size (default: 2048): Maximum number of tokens per chunk, based on the embedding model's tokenizer.
  • similarity_window (default: 3): Number of surrounding sentences to include when computing similarity.
  • min_sentences_per_chunk (default: 1): Minimum number of sentences that must be included in each chunk.
  • min_characters_per_sentence (default: 24): Minimum number of characters for a sentence to be considered valid.
  • delim (default: None): Custom sentence delimiters. If None, Chonkie's default delimiters are used.
  • include_delim (default: "prev"): Whether to attach the delimiter to the previous ("prev") or next ("next") chunk.
  • skip_window (default: 0): Number of sentences to skip when computing similarity scores.
  • filter_window (default: 5): Window size for the Savitzky-Golay smoothing filter applied to similarity scores.
  • filter_polyorder (default: 3): Polynomial order for the Savitzky-Golay filter.
  • filter_tolerance (default: 0.2): Tolerance used when filtering similarity scores.
  • skip_empty_documents (default: True): Whether to skip documents with empty content.
  • page_break_character (default: "\f"): Character used to detect page breaks when tracking page numbers.
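
For instance, a more customized configuration might look like this (the parameter names are from the table above; the values are purely illustrative):

python
from haystack_integrations.components.preprocessors.chonkie import (
    ChonkieSemanticDocumentSplitter,
)

splitter = ChonkieSemanticDocumentSplitter(
    embedding_model="minishlab/potion-base-32M",  # default model
    threshold=0.6,                # split more eagerly on weaker topic shifts
    chunk_size=1024,              # cap chunks at 1024 tokens
    min_sentences_per_chunk=2,    # avoid single-sentence chunks
    similarity_window=3,
)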

Usage

On its own

python
from haystack import Document
from haystack_integrations.components.preprocessors.chonkie import (
    ChonkieSemanticDocumentSplitter,
)

chunker = ChonkieSemanticDocumentSplitter(chunk_size=512, threshold=0.5)

documents = [
    Document(
        content="Haystack is an open-source framework for LLM applications. "
        "It makes building RAG pipelines easy. "
        "The Eiffel Tower is located in Paris. "
        "Paris is the capital of France.",
    ),
]
result = chunker.run(documents=documents)
print(result["documents"])

In a pipeline

python
from pathlib import Path

from haystack import Pipeline
from haystack.components.converters import TextFileToDocument
from haystack.components.preprocessors import DocumentCleaner
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack_integrations.components.preprocessors.chonkie import (
    ChonkieSemanticDocumentSplitter,
)

document_store = InMemoryDocumentStore()

p = Pipeline()
p.add_component("converter", TextFileToDocument())
p.add_component("cleaner", DocumentCleaner())
p.add_component(
    "splitter",
    ChonkieSemanticDocumentSplitter(chunk_size=512, threshold=0.5),
)
p.add_component("writer", DocumentWriter(document_store=document_store))

p.connect("converter.documents", "cleaner.documents")
p.connect("cleaner.documents", "splitter.documents")
p.connect("splitter.documents", "writer.documents")

files = list(Path("path/to/your/files").glob("*.txt"))
p.run({"converter": {"sources": files}})