ChonkieSemanticDocumentSplitter
ChonkieSemanticDocumentSplitter splits documents at semantically meaningful boundaries using Chonkie's SemanticChunker.
Rather than splitting by a fixed token count, it uses an embedding model to detect topic shifts and keeps related sentences together.
| | |
|---|---|
| Most common position in a pipeline | In indexing pipelines, after Converters and before Embedders |
| Mandatory run variables | documents: A list of documents |
| Output variables | documents: A list of split documents |
| API reference | Chonkie |
| GitHub link | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/chonkie |
Overview
ChonkieSemanticDocumentSplitter wraps Chonkie's SemanticChunker to produce context-aware chunks by grouping sentences with similar semantic content.
It computes embeddings for sentences and uses cosine similarity to find natural topic boundaries.
The embedding model is loaded lazily — warm_up() is called automatically the first time run() is invoked, whether inside a pipeline or standalone.
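If you prefer to load the model ahead of time, for example before serving traffic, you can call warm_up() yourself. A minimal sketch:

```python
from haystack_integrations.components.preprocessors.chonkie import (
    ChonkieSemanticDocumentSplitter,
)

splitter = ChonkieSemanticDocumentSplitter()
splitter.warm_up()  # loads the embedding model now rather than on the first run() call
```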
Each output document includes the original document's metadata plus:
- source_id: ID of the original document
- page_number: Page number of the chunk within the original document
- split_id: Index of the chunk within the document
- split_idx_start/split_idx_end: Character offsets of the chunk in the original text
- token_count: Number of tokens in the chunk
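For example, these fields can be read from each chunk's meta. A minimal sketch with placeholder text:

```python
from haystack import Document
from haystack_integrations.components.preprocessors.chonkie import (
    ChonkieSemanticDocumentSplitter,
)

splitter = ChonkieSemanticDocumentSplitter()
result = splitter.run(documents=[Document(content="A short placeholder document about one topic.")])

for chunk in result["documents"]:
    # Fields added by the splitter, alongside the original document's metadata
    print(chunk.meta["source_id"], chunk.meta["split_id"], chunk.meta["token_count"])
```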
Installation
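Install the Chonkie integration. The package name below follows the usual Haystack integration naming on PyPI and is an assumption; check the GitHub link above if it differs:

```shell
pip install chonkie-haystack
```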
Configuration
| Parameter | Default | Description |
|---|---|---|
| embedding_model | "minishlab/potion-base-32M" | The embedding model used to compute sentence similarity. See Chonkie docs for supported models. |
| threshold | 0.8 | Cosine similarity threshold below which a sentence boundary becomes a split point. |
| chunk_size | 2048 | Maximum number of tokens per chunk (based on the embedding model's tokenizer). |
| similarity_window | 3 | Number of surrounding sentences to include when computing similarity. |
| min_sentences_per_chunk | 1 | Minimum number of sentences that must be included in each chunk. |
| min_characters_per_sentence | 24 | Minimum number of characters for a sentence to be considered valid. |
| delim | None | Custom sentence delimiters. If None, Chonkie's default delimiters are used. |
| include_delim | "prev" | Whether to attach the delimiter to the previous ("prev") or next ("next") chunk. |
| skip_window | 0 | Number of sentences to skip when computing similarity scores. |
| filter_window | 5 | Window size for the Savitzky-Golay smoothing filter applied to similarity scores. |
| filter_polyorder | 3 | Polynomial order for the Savitzky-Golay filter. |
| filter_tolerance | 0.2 | Tolerance used when filtering similarity scores. |
| skip_empty_documents | True | Whether to skip documents with empty content. |
| page_break_character | "\f" | Character used to detect page breaks when tracking page numbers. |
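As an illustration, several of these parameters can be set at construction time. The values below are arbitrary examples, not recommendations:

```python
from haystack_integrations.components.preprocessors.chonkie import (
    ChonkieSemanticDocumentSplitter,
)

# Arbitrary example values; see the table above for the defaults
splitter = ChonkieSemanticDocumentSplitter(
    embedding_model="minishlab/potion-base-32M",
    threshold=0.6,              # split at a lower similarity than the default 0.8
    chunk_size=1024,            # cap chunks at 1024 tokens
    similarity_window=2,
    min_sentences_per_chunk=2,
    delim=[".", "!", "?", "\n"],  # assumes delimiters are passed as a list of strings
)
```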
Usage
On its own
```python
from haystack import Document
from haystack_integrations.components.preprocessors.chonkie import (
ChonkieSemanticDocumentSplitter,
)
chunker = ChonkieSemanticDocumentSplitter(chunk_size=512, threshold=0.5)
documents = [
Document(
content="Haystack is an open-source framework for LLM applications. "
"It makes building RAG pipelines easy. "
"The Eiffel Tower is located in Paris. "
"Paris is the capital of France.",
),
]
result = chunker.run(documents=documents)
print(result["documents"])
```
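With threshold=0.5, this example typically splits into two chunks, one covering the Haystack sentences and one covering the Paris sentences, though the exact boundary depends on the embedding model.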
In a pipeline
```python
from pathlib import Path
from haystack import Pipeline
from haystack.components.converters import TextFileToDocument
from haystack.components.preprocessors import DocumentCleaner
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack_integrations.components.preprocessors.chonkie import (
ChonkieSemanticDocumentSplitter,
)
document_store = InMemoryDocumentStore()
p = Pipeline()
p.add_component("converter", TextFileToDocument())
p.add_component("cleaner", DocumentCleaner())
p.add_component(
"splitter",
ChonkieSemanticDocumentSplitter(chunk_size=512, threshold=0.5),
)
p.add_component("writer", DocumentWriter(document_store=document_store))
p.connect("converter.documents", "cleaner.documents")
p.connect("cleaner.documents", "splitter.documents")
p.connect("splitter.documents", "writer.documents")
files = list(Path("path/to/your/files").glob("*.txt"))
p.run({"converter": {"sources": files}})
```
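After the pipeline runs, the semantically split chunks are stored in document_store; you can verify this with document_store.count_documents().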