
EmbeddingBasedDocumentSplitter

Use this component to split documents into semantically coherent chunks, based on the cosine distance between embeddings of sequential sentence groups.

Most common position in a pipeline: In indexing pipelines, after Converters and DocumentCleaner
Mandatory run variables: "documents": A list of documents, each of which is split into smaller documents based on embedding similarity
Output variables: "documents": A list of split documents
API reference: PreProcessors
GitHub link: https://github.com/deepset-ai/haystack/blob/main/haystack/components/preprocessors/embedding_based_document_splitter.py

Overview

This component splits documents based on embedding similarity, using the cosine distance between embeddings of sequential sentence groups.

It first splits text into sentences, optionally groups them, calculates embeddings for each group, and then uses cosine distance between sequential embeddings to determine split points. Any distance above the specified percentile is treated as a break point. The component also tracks page numbers based on form feed characters (\f) in the original document.
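
The sketch below illustrates the percentile-based break-point logic described above. It is a simplified, standalone approximation rather than the component's actual implementation; the find_break_points function name and the NumPy-based cosine-distance calculation are illustrative assumptions, and the group embeddings are expected as a precomputed array.

python

import numpy as np

def find_break_points(group_embeddings: np.ndarray, percentile: float = 0.95) -> list[int]:
    """Illustrative only: return indices of sentence groups after which a split occurs."""
    # Normalize each group embedding so the dot product equals cosine similarity
    normed = group_embeddings / np.linalg.norm(group_embeddings, axis=1, keepdims=True)
    # Cosine distance between each pair of sequential group embeddings
    distances = 1.0 - np.sum(normed[:-1] * normed[1:], axis=1)
    # Any distance above the given percentile is treated as a break point
    threshold = np.percentile(distances, percentile * 100)
    return [i for i, distance in enumerate(distances) if distance > threshold]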

This component is inspired by 5 Levels of Text Splitting by Greg Kamradt.

Usage

On its own

python

from haystack import Document
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.components.preprocessors import EmbeddingBasedDocumentSplitter

# Create a document with content that has a clear topic shift
doc = Document(
    content="This is a first sentence. This is a second sentence. This is a third sentence. "
    "Completely different topic. The same completely different topic."
)

# Initialize the embedder to calculate semantic similarities
embedder = SentenceTransformersDocumentEmbedder()

# Configure the splitter with parameters that control splitting behavior
splitter = EmbeddingBasedDocumentSplitter(
    document_embedder=embedder,
    sentences_per_group=2,  # Group 2 sentences before calculating embeddings
    percentile=0.95,        # Split when cosine distance exceeds the 95th percentile
    min_length=50,          # Merge splits shorter than 50 characters
    max_length=1000,        # Further split chunks longer than 1000 characters
)
splitter.warm_up()
result = splitter.run(documents=[doc])

# The result contains a list of Document objects, each representing a semantic chunk
# Each split document includes metadata: source_id, split_id, and page_number
print(f"Original document split into {len(result['documents'])} chunks")
for i, split_doc in enumerate(result['documents']):
    print(f"Chunk {i}: {split_doc.content[:50]}...")

In a pipeline

python
from pathlib import Path

from haystack import Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.converters.txt import TextFileToDocument
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.components.preprocessors import DocumentCleaner
from haystack.components.preprocessors import EmbeddingBasedDocumentSplitter
from haystack.components.writers import DocumentWriter

document_store = InMemoryDocumentStore()

# The splitter needs a document embedder to calculate semantic similarities
embedder = SentenceTransformersDocumentEmbedder()

pipeline = Pipeline()
pipeline.add_component(instance=TextFileToDocument(), name="text_file_converter")
pipeline.add_component(instance=DocumentCleaner(), name="cleaner")
pipeline.add_component(
    instance=EmbeddingBasedDocumentSplitter(
        document_embedder=embedder,
        sentences_per_group=2,
        percentile=0.95,
        min_length=50,
        max_length=1000,
    ),
    name="splitter",
)
pipeline.add_component(instance=DocumentWriter(document_store=document_store), name="writer")
Pipeline.connect("text_file_converter.documents", "cleaner.documents")
Pipeline.connect("cleaner.documents", "splitter.documents")
Pipeline.connect("splitter.documents", "writer.documents")

path = "path/to/your/files"
files = list(Path(path).glob("*.md"))
pipeline.run({"text_file_converter": {"sources": files}})
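
After the run completes, the split documents are stored in the document store. A quick way to verify this, assuming the file paths above exist, is to count the stored documents:

python

# Each semantic chunk is written to the store as a separate document
print(document_store.count_documents())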