
Chonkie

haystack_integrations.components.preprocessors.chonkie.recursive_splitter

ChonkieRecursiveDocumentSplitter

A Document Splitter that uses Chonkie's RecursiveChunker to split documents.

Usage example

python
from haystack import Document
from haystack_integrations.components.preprocessors.chonkie import ChonkieRecursiveDocumentSplitter

chunker = ChonkieRecursiveDocumentSplitter(chunk_size=512)
documents = [Document(content="Hello world. This is a test.")]
result = chunker.run(documents=documents)
print(result["documents"])

init

python
__init__(
    *,
    tokenizer: str = "character",
    chunk_size: int = 2048,
    min_characters_per_chunk: int = 24,
    rules: RecursiveRules | dict[str, Any] | None = None,
    skip_empty_documents: bool = True,
    page_break_character: str = "\x0c"
) -> None

Initializes the ChonkieRecursiveDocumentSplitter.

Parameters:

  • tokenizer (str) – The tokenizer to use for chunking. Defaults to "character". Common options include "character", "gpt2", and "cl100k_base". See the Chonkie documentation for more information on available tokenizers.
  • chunk_size (int) – The maximum number of tokens per chunk. The actual length depends on the chosen tokenizer.
  • min_characters_per_chunk (int) – The minimum number of characters per chunk.
  • rules (RecursiveRules | dict[str, Any] | None) – Custom rules for recursive chunking. If None, default rules are used. See the Chonkie documentation for more information.
  • skip_empty_documents (bool) – Whether to skip empty documents.
  • page_break_character (str) – The character to use for page breaks.
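
As a sketch using only the parameters documented above (the values are illustrative, not recommendations), the splitter could be configured to count GPT-2 tokens instead of characters:

python
from haystack_integrations.components.preprocessors.chonkie import ChonkieRecursiveDocumentSplitter

# Illustrative configuration: count GPT-2 tokens, cap chunks at 256 tokens,
# and require at least 24 characters per chunk.
splitter = ChonkieRecursiveDocumentSplitter(
    tokenizer="gpt2",
    chunk_size=256,
    min_characters_per_chunk=24,
)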

run

python
run(documents: list[Document]) -> dict[str, list[Document]]

Splits a list of documents into smaller chunks.

Parameters:

  • documents (list[Document]) – The list of documents to split.

Returns:

  • dict[str, list[Document]] – A dictionary with the "documents" key containing the list of chunks.

to_dict

python
to_dict() -> dict[str, Any]

Serializes the component to a dictionary.

Returns:

  • dict[str, Any] – Dictionary with serialized data.

from_dict

python
from_dict(data: dict[str, Any]) -> ChonkieRecursiveDocumentSplitter

Deserializes the component from a dictionary.

Parameters:

  • data (dict[str, Any]) – Dictionary to deserialize from.

Returns:

  • ChonkieRecursiveDocumentSplitter – Deserialized component.
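
A minimal sketch of a serialization round trip with the to_dict and from_dict methods documented above, as used when a component's configuration is saved and restored:

python
from haystack_integrations.components.preprocessors.chonkie import ChonkieRecursiveDocumentSplitter

splitter = ChonkieRecursiveDocumentSplitter(chunk_size=512)

# Serialize the configuration to a plain dictionary, then rebuild an
# equivalent component from that dictionary.
data = splitter.to_dict()
restored = ChonkieRecursiveDocumentSplitter.from_dict(data)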

haystack_integrations.components.preprocessors.chonkie.semantic_splitter

ChonkieSemanticDocumentSplitter

A Document Splitter that uses Chonkie's SemanticChunker to split documents.

Usage example

python
from haystack import Document
from haystack_integrations.components.preprocessors.chonkie import ChonkieSemanticDocumentSplitter

chunker = ChonkieSemanticDocumentSplitter(chunk_size=512)
chunker.warm_up()  # load the embedding model before running the component standalone
documents = [Document(content="Hello world. This is a test.")]
result = chunker.run(documents=documents)
print(result["documents"])

init

python
__init__(
    *,
    embedding_model: Any = "minishlab/potion-base-32M",
    threshold: float = 0.8,
    chunk_size: int = 2048,
    similarity_window: int = 3,
    min_sentences_per_chunk: int = 1,
    min_characters_per_sentence: int = 24,
    delim: Any = None,
    include_delim: str = "prev",
    skip_window: int = 0,
    filter_window: int = 5,
    filter_polyorder: int = 3,
    filter_tolerance: float = 0.2,
    skip_empty_documents: bool = True,
    page_break_character: str = "\x0c"
) -> None

Initializes the ChonkieSemanticDocumentSplitter.

Parameters:

  • embedding_model (Any) – The embedding model to use for semantic similarity. See the Chonkie documentation for more information on supported models.
  • threshold (float) – The semantic similarity threshold used to decide whether adjacent sentences belong in the same chunk.
  • chunk_size (int) – The maximum number of tokens per chunk. The actual length depends on the embedding model's tokenizer.
  • similarity_window (int) – The window size for similarity calculations.
  • min_sentences_per_chunk (int) – The minimum number of sentences per chunk.
  • min_characters_per_sentence (int) – The minimum number of characters per sentence.
  • delim (Any) – Delimiters to use for splitting. If None, default delimiters are used.
  • include_delim (str) – Whether to attach the delimiter to the preceding text ("prev") or the following text ("next").
  • skip_window (int) – The skip window for similarity calculations.
  • filter_window (int) – The filter window for similarity calculations.
  • filter_polyorder (int) – The polynomial order for similarity filtering.
  • filter_tolerance (float) – The tolerance for similarity filtering.
  • skip_empty_documents (bool) – Whether to skip empty documents.
  • page_break_character (str) – The character to use for page breaks.
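
As a sketch, assuming the default embedding model is kept, the semantic parameters above might be adjusted like this (the values are illustrative only):

python
from haystack_integrations.components.preprocessors.chonkie import ChonkieSemanticDocumentSplitter

# Illustrative configuration: a stricter similarity threshold, a wider
# similarity window, and at least two sentences per chunk.
splitter = ChonkieSemanticDocumentSplitter(
    threshold=0.9,
    chunk_size=1024,
    similarity_window=5,
    min_sentences_per_chunk=2,
)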

warm_up

python
warm_up() -> None

Initializes the component by loading the embedding model.

run

python
run(documents: list[Document]) -> dict[str, list[Document]]

Splits a list of documents into smaller semantic chunks.

Parameters:

  • documents (list[Document]) – The list of documents to split.

Returns:

  • dict[str, list[Document]] – A dictionary with the "documents" key containing the list of chunks.

to_dict

python
to_dict() -> dict[str, Any]

Serializes the component to a dictionary.

Returns:

  • dict[str, Any] – Dictionary with serialized data.

from_dict

python
from_dict(data: dict[str, Any]) -> ChonkieSemanticDocumentSplitter

Deserializes the component from a dictionary.

Parameters:

  • data (dict[str, Any]) – Dictionary to deserialize from.

Returns:

  • ChonkieSemanticDocumentSplitter – Deserialized component.

haystack_integrations.components.preprocessors.chonkie.sentence_splitter

ChonkieSentenceDocumentSplitter

A Document Splitter that uses Chonkie's SentenceChunker to split documents.

Usage example

python
from haystack import Document
from haystack_integrations.components.preprocessors.chonkie import ChonkieSentenceDocumentSplitter

chunker = ChonkieSentenceDocumentSplitter(chunk_size=512)
documents = [Document(content="Hello world. This is a test.")]
result = chunker.run(documents=documents)
print(result["documents"])

init

python
__init__(
    *,
    tokenizer: str = "character",
    chunk_size: int = 2048,
    chunk_overlap: int = 0,
    min_sentences_per_chunk: int = 1,
    min_characters_per_sentence: int = 12,
    approximate: bool = False,
    delim: Any = None,
    include_delim: str = "prev",
    skip_empty_documents: bool = True,
    page_break_character: str = "\x0c"
) -> None

Initializes the ChonkieSentenceDocumentSplitter.

Parameters:

  • tokenizer (str) – The tokenizer to use for chunking. Defaults to "character". Common options include "character", "gpt2", and "cl100k_base". See the Chonkie documentation for more information on available tokenizers.
  • chunk_size (int) – The maximum number of tokens per chunk. The actual length depends on the chosen tokenizer.
  • chunk_overlap (int) – The overlap between consecutive chunks.
  • min_sentences_per_chunk (int) – The minimum number of sentences per chunk.
  • min_characters_per_sentence (int) – The minimum number of characters per sentence.
  • approximate (bool) – Whether to use approximate token counting when chunking.
  • delim (Any) – Delimiters to use for splitting. If None, default delimiters are used.
  • include_delim (str) – Whether to attach the delimiter to the preceding text ("prev") or the following text ("next").
  • skip_empty_documents (bool) – Whether to skip empty documents.
  • page_break_character (str) – The character to use for page breaks.
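
A sketch of a sentence-based configuration using the parameters above; the overlap and delimiter values are illustrative:

python
from haystack_integrations.components.preprocessors.chonkie import ChonkieSentenceDocumentSplitter

# Illustrative configuration: GPT-2 token counting, a 64-token overlap
# between consecutive chunks, and explicit sentence delimiters attached
# to the preceding text.
splitter = ChonkieSentenceDocumentSplitter(
    tokenizer="gpt2",
    chunk_size=512,
    chunk_overlap=64,
    delim=[".", "!", "?", "\n"],
    include_delim="prev",
)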

run

python
run(documents: list[Document]) -> dict[str, list[Document]]

Splits a list of documents into smaller sentence-based chunks.

Parameters:

  • documents (list[Document]) – The list of documents to split.

Returns:

  • dict[str, list[Document]] – A dictionary with the "documents" key containing the list of chunks.

to_dict

python
to_dict() -> dict[str, Any]

Serializes the component to a dictionary.

Returns:

  • dict[str, Any] – Dictionary with serialized data.

from_dict

python
from_dict(data: dict[str, Any]) -> ChonkieSentenceDocumentSplitter

Deserializes the component from a dictionary.

Parameters:

  • data (dict[str, Any]) – Dictionary to deserialize from.

Returns:

  • ChonkieSentenceDocumentSplitter – Deserialized component.

haystack_integrations.components.preprocessors.chonkie.token_splitter

ChonkieTokenDocumentSplitter

A Document Splitter that uses Chonkie's TokenChunker to split documents.

Usage example

python
from haystack import Document
from haystack_integrations.components.preprocessors.chonkie import ChonkieTokenDocumentSplitter

chunker = ChonkieTokenDocumentSplitter(chunk_size=512, chunk_overlap=50)
documents = [Document(content="Hello world. This is a test.")]
result = chunker.run(documents=documents)
print(result["documents"])

init

python
__init__(
    *,
    tokenizer: str = "character",
    chunk_size: int = 2048,
    chunk_overlap: int = 0,
    skip_empty_documents: bool = True,
    page_break_character: str = "\x0c"
) -> None

Initializes the ChonkieTokenDocumentSplitter.

Parameters:

  • tokenizer (str) – The tokenizer to use for chunking. Defaults to "character". Common options include "character", "gpt2", and "cl100k_base". See the Chonkie documentation for more information on available tokenizers.
  • chunk_size (int) – The maximum number of tokens per chunk. The actual length depends on the chosen tokenizer.
  • chunk_overlap (int) – The overlap between consecutive chunks.
  • skip_empty_documents (bool) – Whether to skip empty documents.
  • page_break_character (str) – The character to use for page breaks.
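
The splitter can also sit at the start of an indexing pipeline. The following sketch assumes the standard Haystack 2.x Pipeline, DocumentWriter, and InMemoryDocumentStore APIs; adapt it to your own document store:

python
from haystack import Document, Pipeline
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack_integrations.components.preprocessors.chonkie import ChonkieTokenDocumentSplitter

document_store = InMemoryDocumentStore()

# Split incoming documents into 512-token chunks with a 50-token overlap,
# then write the resulting chunks to the document store.
indexing = Pipeline()
indexing.add_component("splitter", ChonkieTokenDocumentSplitter(chunk_size=512, chunk_overlap=50))
indexing.add_component("writer", DocumentWriter(document_store=document_store))
indexing.connect("splitter.documents", "writer.documents")

indexing.run({"splitter": {"documents": [Document(content="Hello world. This is a test.")]}})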

run

python
run(documents: list[Document]) -> dict[str, list[Document]]

Splits a list of documents into smaller token-based chunks.

Parameters:

  • documents (list[Document]) – The list of documents to split.

Returns:

  • dict[str, list[Document]] – A dictionary with the "documents" key containing the list of chunks.

to_dict

python
to_dict() -> dict[str, Any]

Serializes the component to a dictionary.

Returns:

  • dict[str, Any] – Dictionary with serialized data.

from_dict

python
from_dict(data: dict[str, Any]) -> ChonkieTokenDocumentSplitter

Deserializes the component from a dictionary.

Parameters:

  • data (dict[str, Any]) – Dictionary to deserialize from.

Returns:

  • ChonkieTokenDocumentSplitter – Deserialized component.