
ChonkieSemanticDocumentSplitter

ChonkieSemanticDocumentSplitter splits documents at semantically meaningful boundaries using Chonkie's SemanticChunker. Rather than splitting by a fixed token count, it uses an embedding model to detect topic shifts and keeps related sentences together.

  • Most common position in a pipeline: in indexing pipelines, after Converters and before Embedders
  • Mandatory run variables: "documents", a list of documents
  • Output variables: "documents", a list of documents
  • API reference: Chonkie
  • GitHub link: https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/chonkie

Overview

ChonkieSemanticDocumentSplitter wraps Chonkie's SemanticChunker to produce context-aware chunks by grouping sentences with similar semantic content. It computes embeddings for sentences and uses cosine similarity to find natural topic boundaries.
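
As a rough illustration of the idea only (not Chonkie's actual implementation, which also applies similarity windows, smoothing filters, and token budgets), consecutive sentences are grouped until the cosine similarity between neighbours drops below a threshold. The embed() function below is a hypothetical stand-in for the embedding model:

python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def split_on_topic_shift(sentences, embed, threshold=0.8):
    """Group consecutive sentences; start a new chunk when similarity drops."""
    chunks, current = [], [sentences[0]]
    for prev, curr in zip(sentences, sentences[1:]):
        if cosine(embed(prev), embed(curr)) < threshold:
            chunks.append(" ".join(current))  # topic shift detected
            current = [curr]
        else:
            current.append(curr)
    chunks.append(" ".join(current))
    return chunks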

The embedding model is loaded lazily — warm_up() is called automatically the first time run() is invoked, whether inside a pipeline or standalone.
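
If you prefer to pay the model-loading cost up front, for example at application start-up, you can also call warm_up() yourself before the first run(). A minimal sketch:

python
from haystack_integrations.components.preprocessors.chonkie import (
    ChonkieSemanticDocumentSplitter,
)

splitter = ChonkieSemanticDocumentSplitter()
splitter.warm_up()  # loads the embedding model now instead of on the first run()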

Each output document includes the original document's metadata plus:

  • source_id: ID of the original document
  • page_number: Page number of the chunk within the original document
  • split_id: Index of the chunk within the document
  • split_idx_start / split_idx_end: Character offsets of the chunk in the original text
  • token_count: Number of tokens in the chunk

Installation

bash
pip install chonkie-haystack

Configuration

  • embedding_model (default: "minishlab/potion-base-32M"): The embedding model used to compute sentence similarity. See the Chonkie docs for supported models.
  • threshold (default: 0.8): Cosine similarity threshold below which a sentence boundary becomes a split point.
  • chunk_size (default: 2048): Maximum number of tokens per chunk, based on the embedding model's tokenizer.
  • similarity_window (default: 3): Number of surrounding sentences to include when computing similarity.
  • min_sentences_per_chunk (default: 1): Minimum number of sentences that must be included in each chunk.
  • min_characters_per_sentence (default: 24): Minimum number of characters for a sentence to be considered valid.
  • delim (default: None): Custom sentence delimiters. If None, Chonkie's default delimiters are used.
  • include_delim (default: "prev"): Whether to attach the delimiter to the previous ("prev") or next ("next") chunk.
  • skip_window (default: 0): Number of sentences to skip when computing similarity scores.
  • filter_window (default: 5): Window size for the Savitzky-Golay smoothing filter applied to similarity scores.
  • filter_polyorder (default: 3): Polynomial order for the Savitzky-Golay filter.
  • filter_tolerance (default: 0.2): Tolerance used when filtering similarity scores.
  • skip_empty_documents (default: True): Whether to skip documents with empty content.
  • page_break_character (default: "\f"): Character used to detect page breaks when tracking page numbers.
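
For instance, a more customized configuration might look like this (the parameter names are from the table above; the values are purely illustrative):

python
from haystack_integrations.components.preprocessors.chonkie import (
    ChonkieSemanticDocumentSplitter,
)

splitter = ChonkieSemanticDocumentSplitter(
    embedding_model="minishlab/potion-base-32M",  # default model
    threshold=0.6,                # split more eagerly on weaker topic shifts
    chunk_size=1024,              # cap chunks at 1024 tokens
    min_sentences_per_chunk=2,    # avoid single-sentence chunks
    similarity_window=3,
)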

Usage

On its own

python
from haystack import Document
from haystack_integrations.components.preprocessors.chonkie import (
    ChonkieSemanticDocumentSplitter,
)

chunker = ChonkieSemanticDocumentSplitter(chunk_size=512, threshold=0.5)

documents = [
    Document(
        content="Haystack is an open-source framework for LLM applications. "
        "It makes building RAG pipelines easy. "
        "The Eiffel Tower is located in Paris. "
        "Paris is the capital of France.",
    ),
]
result = chunker.run(documents=documents)
print(result["documents"])

In a pipeline

python
from pathlib import Path

from haystack import Pipeline
from haystack.components.converters import TextFileToDocument
from haystack.components.preprocessors import DocumentCleaner
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack_integrations.components.preprocessors.chonkie import (
    ChonkieSemanticDocumentSplitter,
)

document_store = InMemoryDocumentStore()

p = Pipeline()
p.add_component("converter", TextFileToDocument())
p.add_component("cleaner", DocumentCleaner())
p.add_component(
    "splitter",
    ChonkieSemanticDocumentSplitter(chunk_size=512, threshold=0.5),
)
p.add_component("writer", DocumentWriter(document_store=document_store))

p.connect("converter.documents", "cleaner.documents")
p.connect("cleaner.documents", "splitter.documents")
p.connect("splitter.documents", "writer.documents")

files = list(Path("path/to/your/files").glob("*.txt"))
p.run({"converter": {"sources": files}})