Chonkie
haystack_integrations.components.preprocessors.chonkie.recursive_splitter
ChonkieRecursiveDocumentSplitter
A Document Splitter that uses Chonkie's RecursiveChunker to split documents.
Usage example
from haystack import Document
from haystack_integrations.components.preprocessors.chonkie import ChonkieRecursiveDocumentSplitter
chunker = ChonkieRecursiveDocumentSplitter(chunk_size=512)
documents = [Document(content="Hello world. This is a test.")]
result = chunker.run(documents=documents)
print(result["documents"])
init
__init__(
*,
tokenizer: str = "character",
chunk_size: int = 2048,
min_characters_per_chunk: int = 24,
rules: RecursiveRules | dict[str, Any] | None = None,
skip_empty_documents: bool = True,
page_break_character: str = "\x0c"
) -> None
Initializes the ChonkieRecursiveDocumentSplitter.
Parameters:
- tokenizer (str) – The tokenizer to use for chunking. Defaults to "character". Common options include "character", "gpt2", and "cl100k_base". See the Chonkie documentation for more information on available tokenizers.
- chunk_size (int) – The maximum number of tokens per chunk. The actual length depends on the chosen tokenizer.
- min_characters_per_chunk (int) – The minimum number of characters per chunk.
- rules (RecursiveRules | dict[str, Any] | None) – Custom rules for recursive chunking. If None, default rules are used. See the Chonkie documentation for more information; a configuration sketch follows this list.
- skip_empty_documents (bool) – Whether to skip empty documents.
- page_break_character (str) – The character to use for page breaks.
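For example, a minimal sketch of passing custom rules. This assumes chonkie exports RecursiveRules and RecursiveLevel at the package root and that RecursiveLevel accepts delimiters and whitespace fields; check your installed chonkie version.
from chonkie import RecursiveLevel, RecursiveRules
from haystack_integrations.components.preprocessors.chonkie import ChonkieRecursiveDocumentSplitter

# Split on paragraphs first, then sentence endings, then fall back to whitespace.
rules = RecursiveRules(
    levels=[
        RecursiveLevel(delimiters=["\n\n"]),
        RecursiveLevel(delimiters=[". ", "! ", "? "]),
        RecursiveLevel(whitespace=True),
    ]
)
chunker = ChonkieRecursiveDocumentSplitter(chunk_size=1024, rules=rules)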
run
Splits a list of documents into smaller chunks.
Parameters:
- documents (list[Document]) – The list of documents to split.
Returns:
dict[str, list[Document]] – A dictionary with the "documents" key containing the list of chunks.
to_dict
Serializes the component to a dictionary.
Returns:
dict[str, Any] – Dictionary with serialized data.
from_dict
Deserializes the component from a dictionary.
Parameters:
- data (dict[str, Any]) – Dictionary to deserialize from.
Returns:
ChonkieRecursiveDocumentSplitter – Deserialized component.
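A short sketch of the serialization round trip (the parameter values are illustrative; this is also how the component is serialized as part of a pipeline):
from haystack_integrations.components.preprocessors.chonkie import ChonkieRecursiveDocumentSplitter

chunker = ChonkieRecursiveDocumentSplitter(chunk_size=512, min_characters_per_chunk=10)
data = chunker.to_dict()  # plain dict with the component type and init parameters
restored = ChonkieRecursiveDocumentSplitter.from_dict(data)
assert restored.to_dict() == data  # the restored component carries the same configuration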
haystack_integrations.components.preprocessors.chonkie.semantic_splitter
ChonkieSemanticDocumentSplitter
A Document Splitter that uses Chonkie's SemanticChunker to split documents.
Usage example
from haystack import Document
from haystack_integrations.components.preprocessors.chonkie import ChonkieSemanticDocumentSplitter
chunker = ChonkieSemanticDocumentSplitter(chunk_size=512)
documents = [Document(content="Hello world. This is a test.")]
result = chunker.run(documents=documents)
print(result["documents"])
init
__init__(
*,
embedding_model: Any = "minishlab/potion-base-32M",
threshold: float = 0.8,
chunk_size: int = 2048,
similarity_window: int = 3,
min_sentences_per_chunk: int = 1,
min_characters_per_sentence: int = 24,
delim: Any = None,
include_delim: str = "prev",
skip_window: int = 0,
filter_window: int = 5,
filter_polyorder: int = 3,
filter_tolerance: float = 0.2,
skip_empty_documents: bool = True,
page_break_character: str = "\x0c"
) -> None
Initializes the ChonkieSemanticDocumentSplitter.
Parameters:
- embedding_model (Any) – The embedding model to use for semantic similarity. See the Chonkie documentation for more information on supported models; a configuration sketch follows this list.
- threshold (float) – The semantic similarity threshold.
- chunk_size (int) – The maximum number of tokens per chunk. The actual length depends on the embedding model's tokenizer.
- similarity_window (int) – The window size for similarity calculations.
- min_sentences_per_chunk (int) – The minimum number of sentences per chunk.
- min_characters_per_sentence (int) – The minimum number of characters per sentence.
- delim (Any) – Delimiters to use for splitting. If None, default delimiters are used.
- include_delim (str) – Whether to include the delimiter in the previous or next chunk ("prev" or "next").
- skip_window (int) – The skip window for similarity calculations.
- filter_window (int) – The filter window for similarity calculations.
- filter_polyorder (int) – The polynomial order for similarity filtering.
- filter_tolerance (float) – The tolerance for similarity filtering.
- skip_empty_documents (bool) – Whether to skip empty documents.
- page_break_character (str) – The character to use for page breaks.
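For instance, a configuration sketch (the parameter values below are purely illustrative, not recommended defaults):
from haystack_integrations.components.preprocessors.chonkie import ChonkieSemanticDocumentSplitter

# Group consecutive sentences into a chunk while their embedding similarity stays above the threshold.
chunker = ChonkieSemanticDocumentSplitter(
    embedding_model="minishlab/potion-base-32M",
    threshold=0.7,
    chunk_size=1024,
    similarity_window=2,
    min_sentences_per_chunk=2,
    delim=[". ", "! ", "? ", "\n"],
)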
warm_up
Initializes the component by loading the embedding model.
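When the splitter is used on its own rather than inside a pipeline (which typically warms components up for you), call warm_up() once before run(), roughly as follows:
from haystack import Document
from haystack_integrations.components.preprocessors.chonkie import ChonkieSemanticDocumentSplitter

chunker = ChonkieSemanticDocumentSplitter(chunk_size=512)
chunker.warm_up()  # loads the embedding model
result = chunker.run(documents=[Document(content="Hello world. This is a test.")])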
run
Splits a list of documents into smaller semantic chunks.
Parameters:
- documents (list[Document]) – The list of documents to split.
Returns:
dict[str, list[Document]] – A dictionary with the "documents" key containing the list of chunks.
to_dict
Serializes the component to a dictionary.
Returns:
dict[str, Any] – Dictionary with serialized data.
from_dict
Deserializes the component from a dictionary.
Parameters:
- data (dict[str, Any]) – Dictionary to deserialize from.
Returns:
ChonkieSemanticDocumentSplitter – Deserialized component.
haystack_integrations.components.preprocessors.chonkie.sentence_splitter
ChonkieSentenceDocumentSplitter
A Document Splitter that uses Chonkie's SentenceChunker to split documents.
Usage example
from haystack import Document
from haystack_integrations.components.preprocessors.chonkie import ChonkieSentenceDocumentSplitter
chunker = ChonkieSentenceDocumentSplitter(chunk_size=512)
documents = [Document(content="Hello world. This is a test.")]
result = chunker.run(documents=documents)
print(result["documents"])
init
__init__(
*,
tokenizer: str = "character",
chunk_size: int = 2048,
chunk_overlap: int = 0,
min_sentences_per_chunk: int = 1,
min_characters_per_sentence: int = 12,
approximate: bool = False,
delim: Any = None,
include_delim: str = "prev",
skip_empty_documents: bool = True,
page_break_character: str = "\x0c"
) -> None
Initializes the ChonkieSentenceDocumentSplitter.
Parameters:
- tokenizer (str) – The tokenizer to use for chunking. Defaults to "character". Common options include "character", "gpt2", and "cl100k_base". See the Chonkie documentation for more information on available tokenizers.
- chunk_size (int) – The maximum number of tokens per chunk. The actual length depends on the chosen tokenizer.
- chunk_overlap (int) – The overlap between consecutive chunks; a configuration sketch follows this list.
- min_sentences_per_chunk (int) – The minimum number of sentences per chunk.
- min_characters_per_sentence (int) – The minimum number of characters per sentence.
- approximate (bool) – Whether to use approximate chunking.
- delim (Any) – Delimiters to use for splitting. If None, default delimiters are used.
- include_delim (str) – Whether to include the delimiter in the previous or next chunk ("prev" or "next").
- skip_empty_documents (bool) – Whether to skip empty documents.
- page_break_character (str) – The character to use for page breaks.
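A configuration sketch with overlapping chunks and explicit sentence delimiters (the values are illustrative):
from haystack_integrations.components.preprocessors.chonkie import ChonkieSentenceDocumentSplitter

chunker = ChonkieSentenceDocumentSplitter(
    tokenizer="character",
    chunk_size=1024,
    chunk_overlap=128,  # repeat the tail of each chunk at the start of the next one
    min_sentences_per_chunk=2,
    delim=[". ", "! ", "? ", "\n"],
    include_delim="prev",  # keep each delimiter attached to the sentence it ends
)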
run
Splits a list of documents into smaller sentence-based chunks.
Parameters:
- documents (list[Document]) – The list of documents to split.
Returns:
dict[str, list[Document]] – A dictionary with the "documents" key containing the list of chunks.
to_dict
Serializes the component to a dictionary.
Returns:
dict[str, Any] – Dictionary with serialized data.
from_dict
Deserializes the component from a dictionary.
Parameters:
- data (dict[str, Any]) – Dictionary to deserialize from.
Returns:
ChonkieSentenceDocumentSplitter – Deserialized component.
haystack_integrations.components.preprocessors.chonkie.token_splitter
ChonkieTokenDocumentSplitter
A Document Splitter that uses Chonkie's TokenChunker to split documents.
Usage example
from haystack import Document
from haystack_integrations.components.preprocessors.chonkie import ChonkieTokenDocumentSplitter
chunker = ChonkieTokenDocumentSplitter(chunk_size=512, chunk_overlap=50)
documents = [Document(content="Hello world. This is a test.")]
result = chunker.run(documents=documents)
print(result["documents"])
init
__init__(
*,
tokenizer: str = "character",
chunk_size: int = 2048,
chunk_overlap: int = 0,
skip_empty_documents: bool = True,
page_break_character: str = "\x0c"
) -> None
Initializes the ChonkieTokenDocumentSplitter.
Parameters:
- tokenizer (str) – The tokenizer to use for chunking. Defaults to "character". Common options include "character", "gpt2", and "cl100k_base". See the Chonkie documentation for more information on available tokenizers.
- chunk_size (int) – The maximum number of tokens per chunk. The actual length depends on the chosen tokenizer.
- chunk_overlap (int) – The overlap between consecutive chunks.
- skip_empty_documents (bool) – Whether to skip empty documents.
- page_break_character (str) – The character to use for page breaks.
run
Splits a list of documents into smaller token-based chunks.
Parameters:
- documents (list[Document]) – The list of documents to split.
Returns:
dict[str, list[Document]] – A dictionary with the "documents" key containing the list of chunks.
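In an indexing pipeline, the split documents are typically passed straight to a writer. A sketch of this wiring (DocumentWriter and InMemoryDocumentStore are standard Haystack components; the setup below is illustrative):
from haystack import Document, Pipeline
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack_integrations.components.preprocessors.chonkie import ChonkieTokenDocumentSplitter

document_store = InMemoryDocumentStore()
pipeline = Pipeline()
pipeline.add_component("splitter", ChonkieTokenDocumentSplitter(chunk_size=512, chunk_overlap=50))
pipeline.add_component("writer", DocumentWriter(document_store=document_store))
pipeline.connect("splitter.documents", "writer.documents")

pipeline.run({"splitter": {"documents": [Document(content="Hello world. This is a test.")]}})
print(document_store.count_documents())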
to_dict
Serializes the component to a dictionary.
Returns:
dict[str, Any] – Dictionary with serialized data.
from_dict
Deserializes the component from a dictionary.
Parameters:
- data (dict[str, Any]) – Dictionary to deserialize from.
Returns:
ChonkieTokenDocumentSplitter – Deserialized component.