API Reference

Preprocessors

Pipelines wrapped as components.

Module haystack_experimental.components.preprocessors.embedding_based_document_splitter

EmbeddingBasedDocumentSplitter

Splits documents based on embedding similarity using cosine distances between sequential sentence groups.

This component first splits text into sentences, optionally groups them, calculates an embedding for each group, and then uses the cosine distance between sequential embeddings to determine split points. Any distance above the specified percentile is treated as a break point. The component also tracks page numbers based on form feed characters (\f) in the original document.
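The break-point logic described above can be sketched as follows. This is an illustrative, simplified version, not the component's actual internals; the function name find_break_points and the toy 2-dimensional vectors are assumptions for the example.

```python
import math

def cosine_distance(a, b):
    # 1 - cosine similarity between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

def find_break_points(embeddings, percentile=0.95):
    # Distance between each pair of sequential sentence-group embeddings.
    distances = [
        cosine_distance(embeddings[i], embeddings[i + 1])
        for i in range(len(embeddings) - 1)
    ]
    # Any distance above the given percentile becomes a break point.
    threshold = sorted(distances)[int(percentile * (len(distances) - 1))]
    return [i + 1 for i, d in enumerate(distances) if d > threshold]

# Three similar groups, then a topic shift at the fourth group:
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.95, 0.05], [0.0, 1.0], [0.1, 0.9]]
print(find_break_points(embeddings))  # prints [3]: a break before the fourth group
```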

This component is inspired by 5 Levels of Text Splitting by Greg Kamradt.

Usage example

from haystack import Document
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack_experimental.components.preprocessors import EmbeddingBasedDocumentSplitter

doc = Document(
    content="This is a first sentence. This is a second sentence. This is a third sentence. "
    "Completely different topic. The same completely different topic."
)

embedder = SentenceTransformersDocumentEmbedder()

splitter = EmbeddingBasedDocumentSplitter(
    document_embedder=embedder,
    sentences_per_group=2,
    percentile=0.95,
    min_length=50,
    max_length=1000
)
splitter.warm_up()
result = splitter.run(documents=[doc])

EmbeddingBasedDocumentSplitter.__init__

def __init__(*,
             document_embedder: DocumentEmbedder,
             sentences_per_group: int = 3,
             percentile: float = 0.95,
             min_length: int = 50,
             max_length: int = 1000,
             language: Language = "en",
             use_split_rules: bool = True,
             extend_abbreviations: bool = True)

Initialize EmbeddingBasedDocumentSplitter.

Arguments:

  • document_embedder: The DocumentEmbedder to use for calculating embeddings.
  • sentences_per_group: Number of sentences to group together before embedding.
  • percentile: Percentile threshold for cosine distance. Distances above this percentile are treated as break points.
  • min_length: Minimum length of splits in characters. Splits below this length will be merged.
  • max_length: Maximum length of splits in characters. Splits above this length will be recursively split.
  • language: Language for sentence tokenization.
  • use_split_rules: Whether to use additional split rules for sentence tokenization.
  • extend_abbreviations: Whether to extend NLTK abbreviations.
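The min_length post-processing of the resulting splits can be sketched roughly as follows. This is a simplified, illustrative version under assumed behavior (merge a too-short split into the next one); the real component also recursively re-splits chunks that exceed max_length, which is omitted here.

```python
def merge_short_splits(splits, min_length):
    # Merge any split shorter than min_length into the split that follows it
    # (illustrative sketch; not the component's actual merging strategy).
    merged = []
    for split in splits:
        if merged and len(merged[-1]) < min_length:
            merged[-1] = merged[-1] + " " + split
        else:
            merged.append(split)
    return merged

splits = ["Hi.", "This is a much longer sentence about something."]
print(merge_short_splits(splits, min_length=10))
```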

EmbeddingBasedDocumentSplitter.warm_up

def warm_up() -> None

Warm up the component by initializing the sentence splitter.

EmbeddingBasedDocumentSplitter.run

@component.output_types(documents=List[Document])
def run(documents: List[Document]) -> Dict[str, List[Document]]

Split documents based on embedding similarity.

Arguments:

  • documents: The documents to split.

Raises:

  • RuntimeError: If the component wasn't warmed up.
  • TypeError: If the input is not a list of Documents.
  • ValueError: If the document content is None or empty.

Returns:

A dictionary with the following key:

  • documents: List of documents with the split texts. Each document includes:
      • A metadata field source_id to track the original document.
      • A metadata field split_id to track the split number.
      • A metadata field page_number to track the original page number.
      • All other metadata copied from the original document.
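How this metadata propagation works can be sketched with plain dictionaries. This is an assumption-laden illustration (the helper build_split_docs is hypothetical and stands in for internal logic; real output documents are Haystack Document objects, not dicts):

```python
def build_split_docs(splits, source_meta, source_id):
    # For each split, copy the original metadata and attach
    # source_id and split_id (hypothetical helper for illustration).
    docs = []
    for split_id, text in enumerate(splits):
        meta = dict(source_meta)  # all other metadata copied from the original
        meta.update({"source_id": source_id, "split_id": split_id})
        docs.append({"content": text, "meta": meta})
    return docs

docs = build_split_docs(["First chunk.", "Second chunk."], {"author": "x"}, "doc-1")
print(docs[1]["meta"])
```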

EmbeddingBasedDocumentSplitter.to_dict

def to_dict() -> Dict[str, Any]

Serializes the component to a dictionary.

EmbeddingBasedDocumentSplitter.from_dict

@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "EmbeddingBasedDocumentSplitter"

Deserializes the component from a dictionary.