Chonkie
haystack_integrations.components.preprocessors.chonkie.recursive_splitter
ChonkieRecursiveDocumentSplitter
A Document Splitter that uses Chonkie's RecursiveChunker to split documents.
Usage example
from haystack import Document
from haystack_integrations.components.preprocessors.chonkie import ChonkieRecursiveDocumentSplitter
chunker = ChonkieRecursiveDocumentSplitter(chunk_size=512)
documents = [Document(content="Hello world. This is a test.")]
result = chunker.run(documents=documents)
print(result["documents"])
init
__init__(
*,
tokenizer: str = "character",
chunk_size: int = 2048,
min_characters_per_chunk: int = 24,
rules: RecursiveRules | dict[str, Any] | None = None,
skip_empty_documents: bool = True,
page_break_character: str = "\x0c"
) -> None
Initializes the ChonkieRecursiveDocumentSplitter.
Parameters:
- tokenizer (str) – The tokenizer to use for chunking. Defaults to "character". Common options include "character", "gpt2", and "cl100k_base". See the Chonkie documentation for more information on available tokenizers.
- chunk_size (int) – The maximum number of tokens per chunk. The actual length depends on the chosen tokenizer.
- min_characters_per_chunk (int) – The minimum number of characters per chunk.
- rules (RecursiveRules | dict[str, Any] | None) – Custom rules for recursive chunking. If None, default rules are used. See the Chonkie documentation for more information; a configuration sketch follows this list.
- skip_empty_documents (bool) – Whether to skip empty documents.
- page_break_character (str) – The character to use for page breaks.
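For example, a minimal sketch of passing custom rules. This assumes chonkie exports RecursiveRules and RecursiveLevel at the package root and that RecursiveLevel accepts delimiters and whitespace fields; check your installed chonkie version.
from chonkie import RecursiveLevel, RecursiveRules
from haystack_integrations.components.preprocessors.chonkie import ChonkieRecursiveDocumentSplitter

# Split on paragraphs first, then sentence endings, then fall back to whitespace.
rules = RecursiveRules(
    levels=[
        RecursiveLevel(delimiters=["\n\n"]),
        RecursiveLevel(delimiters=[". ", "! ", "? "]),
        RecursiveLevel(whitespace=True),
    ]
)
chunker = ChonkieRecursiveDocumentSplitter(chunk_size=1024, rules=rules)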
run
Splits a list of documents into smaller chunks.
Parameters:
- documents (list[Document]) – The list of documents to split.
Returns:
dict[str, list[Document]] – A dictionary with the "documents" key containing the list of chunks.
to_dict
Serializes the component to a dictionary.
Returns:
dict[str, Any] – Dictionary with serialized data.
from_dict
Deserializes the component from a dictionary.
Parameters:
- data (dict[str, Any]) – Dictionary to deserialize from.
Returns:
ChonkieRecursiveDocumentSplitter – Deserialized component.
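A short sketch of the serialization round trip (the parameter values are illustrative; this is also how the component is serialized as part of a pipeline):
from haystack_integrations.components.preprocessors.chonkie import ChonkieRecursiveDocumentSplitter

chunker = ChonkieRecursiveDocumentSplitter(chunk_size=512, min_characters_per_chunk=10)
data = chunker.to_dict()  # plain dict with the component type and init parameters
restored = ChonkieRecursiveDocumentSplitter.from_dict(data)
assert restored.to_dict() == data  # the restored component carries the same configuration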
haystack_integrations.components.preprocessors.chonkie.semantic_splitter
ChonkieSemanticDocumentSplitter
A Document Splitter that uses Chonkie's SemanticChunker to split documents.
Usage example
from haystack import Document
from haystack_integrations.components.preprocessors.chonkie import ChonkieSemanticDocumentSplitter
chunker = ChonkieSemanticDocumentSplitter(chunk_size=512)
documents = [Document(content="Hello world. This is a test.")]
result = chunker.run(documents=documents)
print(result["documents"])
init
__init__(
*,
embedding_model: Any = "minishlab/potion-base-32M",
threshold: float = 0.8,
chunk_size: int = 2048,
similarity_window: int = 3,
min_sentences_per_chunk: int = 1,
min_characters_per_sentence: int = 24,
delim: Any = None,
include_delim: str = "prev",
skip_window: int = 0,
filter_window: int = 5,
filter_polyorder: int = 3,
filter_tolerance: float = 0.2,
skip_empty_documents: bool = True,
page_break_character: str = "\x0c"
) -> None
Initializes the ChonkieSemanticDocumentSplitter.
Parameters:
- embedding_model (Any) – The embedding model to use for semantic similarity. See the Chonkie documentation for more information on supported models; a configuration sketch follows this list.
- threshold (float) – The semantic similarity threshold.
- chunk_size (int) – The maximum number of tokens per chunk. The actual length depends on the embedding model's tokenizer.
- similarity_window (int) – The window size for similarity calculations.
- min_sentences_per_chunk (int) – The minimum number of sentences per chunk.
- min_characters_per_sentence (int) – The minimum number of characters per sentence.
- delim (Any) – Delimiters to use for splitting. If None, default delimiters are used.
- include_delim (str) – Whether to include the delimiter in the previous or next chunk ("prev" or "next").
- skip_window (int) – The skip window for similarity calculations.
- filter_window (int) – The filter window for similarity calculations.
- filter_polyorder (int) – The polynomial order for similarity filtering.
- filter_tolerance (float) – The tolerance for similarity filtering.
- skip_empty_documents (bool) – Whether to skip empty documents.
- page_break_character (str) – The character to use for page breaks.
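For instance, a configuration sketch (the parameter values below are purely illustrative, not recommended defaults):
from haystack_integrations.components.preprocessors.chonkie import ChonkieSemanticDocumentSplitter

# Group consecutive sentences into a chunk while their embedding similarity stays above the threshold.
chunker = ChonkieSemanticDocumentSplitter(
    embedding_model="minishlab/potion-base-32M",
    threshold=0.7,
    chunk_size=1024,
    similarity_window=2,
    min_sentences_per_chunk=2,
    delim=[". ", "! ", "? ", "\n"],
)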
warm_up
Initializes the component by loading the embedding model.
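When the splitter is used on its own rather than inside a pipeline (which typically warms components up for you), call warm_up() once before run(), roughly as follows:
from haystack import Document
from haystack_integrations.components.preprocessors.chonkie import ChonkieSemanticDocumentSplitter

chunker = ChonkieSemanticDocumentSplitter(chunk_size=512)
chunker.warm_up()  # loads the embedding model
result = chunker.run(documents=[Document(content="Hello world. This is a test.")])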
run
Splits a list of documents into smaller semantic chunks.
Parameters:
- documents (list[Document]) – The list of documents to split.
Returns:
dict[str, list[Document]] – A dictionary with the "documents" key containing the list of chunks.
to_dict
Serializes the component to a dictionary.
Returns:
dict[str, Any] – Dictionary with serialized data.
from_dict
Deserializes the component from a dictionary.
Parameters:
- data (dict[str, Any]) – Dictionary to deserialize from.
Returns:
ChonkieSemanticDocumentSplitter – Deserialized component.
haystack_integrations.components.preprocessors.chonkie.sentence_splitter
ChonkieSentenceDocumentSplitter
A Document Splitter that uses Chonkie's SentenceChunker to split documents.
Usage example
from haystack import Document
from haystack_integrations.components.preprocessors.chonkie import ChonkieSentenceDocumentSplitter
chunker = ChonkieSentenceDocumentSplitter(chunk_size=512)
documents = [Document(content="Hello world. This is a test.")]
result = chunker.run(documents=documents)
print(result["documents"])
init
__init__(
*,
tokenizer: str = "character",
chunk_size: int = 2048,
chunk_overlap: int = 0,
min_sentences_per_chunk: int = 1,
min_characters_per_sentence: int = 12,
approximate: bool = False,
delim: Any = None,
include_delim: str = "prev",
skip_empty_documents: bool = True,
page_break_character: str = "\x0c"
) -> None
Initializes the ChonkieSentenceDocumentSplitter.
Parameters:
- tokenizer (str) – The tokenizer to use for chunking. Defaults to "character". Common options include "character", "gpt2", and "cl100k_base". See the Chonkie documentation for more information on available tokenizers.
- chunk_size (int) – The maximum number of tokens per chunk. The actual length depends on the chosen tokenizer.
- chunk_overlap (int) – The overlap between consecutive chunks; a configuration sketch follows this list.
- min_sentences_per_chunk (int) – The minimum number of sentences per chunk.
- min_characters_per_sentence (int) – The minimum number of characters per sentence.
- approximate (bool) – Whether to use approximate chunking.
- delim (Any) – Delimiters to use for splitting. If None, default delimiters are used.
- include_delim (str) – Whether to include the delimiter in the previous or next chunk ("prev" or "next").
- skip_empty_documents (bool) – Whether to skip empty documents.
- page_break_character (str) – The character to use for page breaks.
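A configuration sketch with overlapping chunks and explicit sentence delimiters (the values are illustrative):
from haystack_integrations.components.preprocessors.chonkie import ChonkieSentenceDocumentSplitter

chunker = ChonkieSentenceDocumentSplitter(
    tokenizer="character",
    chunk_size=1024,
    chunk_overlap=128,  # repeat the tail of each chunk at the start of the next one
    min_sentences_per_chunk=2,
    delim=[". ", "! ", "? ", "\n"],
    include_delim="prev",  # keep each delimiter attached to the sentence it ends
)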
run
Splits a list of documents into smaller sentence-based chunks.
Parameters:
- documents (list[Document]) – The list of documents to split.
Returns:
dict[str, list[Document]] – A dictionary with the "documents" key containing the list of chunks.
to_dict
Serializes the component to a dictionary.
Returns:
dict[str, Any] – Dictionary with serialized data.
from_dict
Deserializes the component from a dictionary.
Parameters:
- data (dict[str, Any]) – Dictionary to deserialize from.
Returns:
ChonkieSentenceDocumentSplitter – Deserialized component.
haystack_integrations.components.preprocessors.chonkie.token_splitter
ChonkieTokenDocumentSplitter
A Document Splitter that uses Chonkie's TokenChunker to split documents.
Usage example
from haystack import Document
from haystack_integrations.components.preprocessors.chonkie import ChonkieTokenDocumentSplitter
chunker = ChonkieTokenDocumentSplitter(chunk_size=512, chunk_overlap=50)
documents = [Document(content="Hello world. This is a test.")]
result = chunker.run(documents=documents)
print(result["documents"])
init
__init__(
*,
tokenizer: str = "character",
chunk_size: int = 2048,
chunk_overlap: int = 0,
skip_empty_documents: bool = True,
page_break_character: str = "\x0c"
) -> None
Initializes the ChonkieTokenDocumentSplitter.
Parameters:
- tokenizer (str) – The tokenizer to use for chunking. Defaults to "character". Common options include "character", "gpt2", and "cl100k_base". See the Chonkie documentation for more information on available tokenizers.
- chunk_size (int) – The maximum number of tokens per chunk. The actual length depends on the chosen tokenizer.
- chunk_overlap (int) – The overlap between consecutive chunks.
- skip_empty_documents (bool) – Whether to skip empty documents.
- page_break_character (str) – The character to use for page breaks.
run
Splits a list of documents into smaller token-based chunks.
Parameters:
- documents (list[Document]) – The list of documents to split.
Returns:
dict[str, list[Document]] – A dictionary with the "documents" key containing the list of chunks.
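In an indexing pipeline, the split documents are typically passed straight to a writer. A sketch of this wiring (DocumentWriter and InMemoryDocumentStore are standard Haystack components; the setup below is illustrative):
from haystack import Document, Pipeline
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack_integrations.components.preprocessors.chonkie import ChonkieTokenDocumentSplitter

document_store = InMemoryDocumentStore()
pipeline = Pipeline()
pipeline.add_component("splitter", ChonkieTokenDocumentSplitter(chunk_size=512, chunk_overlap=50))
pipeline.add_component("writer", DocumentWriter(document_store=document_store))
pipeline.connect("splitter.documents", "writer.documents")

pipeline.run({"splitter": {"documents": [Document(content="Hello world. This is a test.")]}})
print(document_store.count_documents())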
to_dict
Serializes the component to a dictionary.
Returns:
dict[str, Any] – Dictionary with serialized data.
from_dict
Deserializes the component from a dictionary.
Parameters:
- data (dict[str, Any]) – Dictionary to deserialize from.
Returns:
ChonkieTokenDocumentSplitter – Deserialized component.