# RecursiveSplitter
This component recursively breaks down text into smaller chunks by applying a given list of separators to the text.
| | |
| --- | --- |
| Most common position in a pipeline | In indexing pipelines, after Converters and DocumentCleaner, before Classifiers |
| Mandatory run variables | `"documents"`: A list of documents |
| Output variables | `"documents"`: A list of documents |
| API reference | PreProcessors |
| GitHub link | https://github.com/deepset-ai/haystack/blob/main/haystack/components/preprocessors/recursive_splitter.py |
## Overview
The `RecursiveDocumentSplitter` expects a list of documents as input and returns a list of documents with split texts. You can set the following parameters when initializing the component:

- `split_length`: The maximum length of each chunk, in words by default. See the `split_unit` parameter to change the unit.
- `split_overlap`: The number of characters or words that overlap between consecutive chunks.
- `split_unit`: The unit of the `split_length` parameter. Can be either `"word"` or `"char"`.
- `separators`: An optional list of separator strings to use for splitting the text. If you don't provide any separators, the default ones are `["\n\n", "sentence", "\n", " "]`. String separators are treated as regular expressions. If the separator is `"sentence"`, the text is split into sentences using a custom sentence tokenizer based on NLTK. See the SentenceSplitter code for more information.
- `sentence_splitter_params`: Optional parameters to pass to the SentenceSplitter.
The separators are applied in the same order as they are defined in the list. The first separator is applied to the text; any resulting chunk that is within the specified `split_length` is kept. For chunks that exceed `split_length`, the next separator in the list is applied. If all separators have been applied and a chunk still exceeds `split_length`, a hard split occurs at exactly `split_length`, counted in words or characters depending on `split_unit`. This process repeats until all chunks are within the `split_length` limit.
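As a small illustration of this order (the text, the `split_length` value, and the variable names below are made up for demonstration), the first paragraph in the document fits within the character budget and is kept as-is, while the second is too long and falls through to the later separators:

```python
from haystack import Document
from haystack.components.preprocessors import RecursiveDocumentSplitter

# Illustrative settings: a small character budget makes the fallback easy to observe.
splitter = RecursiveDocumentSplitter(
    split_length=40,
    split_overlap=0,
    split_unit="char",
    separators=["\n\n", "\n", " "],  # no "sentence" separator, so no NLTK tokenizer is involved
)
splitter.warm_up()  # only strictly required when "sentence" is among the separators

doc = Document(
    content="Short paragraph.\n\nA much longer paragraph that does not fit into forty characters on its own."
)
chunks = splitter.run(documents=[doc])["documents"]

# The first chunk is the short paragraph; the longer one is split further on whitespace.
for chunk in chunks:
    print(repr(chunk.content))
```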
## Usage
```python
from haystack.components.preprocessors import RecursiveDocumentSplitter

splitter = RecursiveDocumentSplitter(
    split_length=400,
    split_overlap=0,
    split_unit="char",
    separators=["\n\n", "\n", "sentence", " "],
    sentence_splitter_params={
        "language": "en",
        "use_split_rules": True,
        "keep_white_spaces": False,
    },
)
splitter.warm_up()  # since "sentence" is among the separators, warm_up() is required
```
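To see what the configured splitter produces, you can then call `run()` with a list of documents and inspect the returned chunks. The sample text below is only a placeholder:

```python
from haystack import Document

doc = Document(
    content="Haystack is an open-source framework for building LLM applications.\n\n"
            "It provides components for indexing, retrieval, and generation."
)
result = splitter.run(documents=[doc])

print(f"Produced {len(result['documents'])} chunks")
for chunk in result["documents"]:
    print(repr(chunk.content))
```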
### In a pipeline
Here's how you can use `RecursiveDocumentSplitter` in an indexing pipeline:
```python
from pathlib import Path

from haystack import Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.converters.txt import TextFileToDocument
from haystack.components.preprocessors import DocumentCleaner
from haystack.components.preprocessors import RecursiveDocumentSplitter
from haystack.components.writers import DocumentWriter

document_store = InMemoryDocumentStore()

p = Pipeline()
p.add_component(instance=TextFileToDocument(), name="text_file_converter")
p.add_component(instance=DocumentCleaner(), name="cleaner")
p.add_component(
    instance=RecursiveDocumentSplitter(
        split_length=400,
        split_overlap=0,
        split_unit="char",
        separators=["\n\n", "\n", "sentence", " "],
        sentence_splitter_params={
            "language": "en",
            "use_split_rules": True,
            "keep_white_spaces": False,
        },
    ),
    name="splitter",
)
p.add_component(instance=DocumentWriter(document_store=document_store), name="writer")

p.connect("text_file_converter.documents", "cleaner.documents")
p.connect("cleaner.documents", "splitter.documents")
p.connect("splitter.documents", "writer.documents")

path = "path/to/your/files"
files = list(Path(path).glob("*.md"))
p.run({"text_file_converter": {"sources": files}})
```
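As a quick sanity check after indexing (assuming the path above points to actual Markdown files), you can query the document store for the chunks that were written. `count_documents()` and `filter_documents()` are standard `InMemoryDocumentStore` methods:

```python
print(f"Indexed {document_store.count_documents()} chunks")

# Peek at the first few chunks to verify the splitting behaved as expected.
for doc in document_store.filter_documents()[:3]:
    print(repr(doc.content[:80]))
```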