NLTKDocumentSplitter
Divides a list of text documents into a list of shorter text documents.
NLTKDocumentSplitter is a more specialized version of DocumentSplitter. It provides more control over sentence boundaries and language handling, while DocumentSplitter is a simpler, more general-purpose solution for basic splitting needs.
Most common position in a pipeline | In indexing pipelines after Converters and DocumentCleaner, before Classifiers |
Mandatory run variables | "documents": A list of documents |
Output variables | "documents": A list of documents |
API reference | PreProcessors |
GitHub link | https://github.com/deepset-ai/haystack/blob/main/haystack/components/preprocessors/nltk_document_splitter.py |
Overview
NLTKDocumentSplitter expects a list of documents as input and returns a list of documents with split texts. It splits each input document into chunks of split_length units, where split_by sets the unit, and consecutive chunks share split_overlap units. These and the following additional parameters can be set when the component is initialized:
split_by can be "word", "sentence", "passage" (paragraph), or "page".
split_length is an integer indicating the chunk size, which is the number of words, sentences, or passages.
split_overlap is an integer indicating the number of overlapping words, sentences, or passages between chunks.
split_threshold is an integer indicating the minimum number of words, sentences, or passages that the document fragment should have. If the fragment is below the threshold, it will be attached to the previous one.
respect_sentence_boundary is a boolean. If True, ensures that splits occur only between sentences when split_by is "word". This uses NLTK's sentence detection to maintain sentence boundaries.
language is a string. It selects the language for the NLTK tokenizer, with "en" (English) as the default.
use_split_rules is a boolean. If True, applies additional split rules when split_by is "sentence".
extend_abbreviations is a boolean. If True, extends NLTK's PunktTokenizer with a list of curated abbreviations, currently supported for "en" (English) and "de" (German).
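To make the interaction between split_length, split_overlap, and split_threshold concrete, here is a minimal sketch in plain Python. It is not the component's implementation: the split_units function is a hypothetical helper written for illustration, and how it merges a short trailing fragment (appending only the units the previous chunk does not already contain) is an assumption.

```python
def split_units(units, split_length, split_overlap=0, split_threshold=0):
    """Group units (words, sentences, or passages) into overlapping chunks.

    Illustrative helper only; not part of Haystack.
    """
    if split_length <= 0:
        raise ValueError("split_length must be positive")
    if not 0 <= split_overlap < split_length:
        raise ValueError("split_overlap must be in [0, split_length)")

    step = split_length - split_overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(units):
        chunk = units[start:start + split_length]
        if chunks and len(chunk) < split_threshold:
            # Fragment is too short to stand alone: attach it to the previous
            # chunk, skipping the units that chunk already contains.
            chunks[-1].extend(chunk[split_overlap:])
        else:
            chunks.append(chunk)
        if start + split_length >= len(units):
            break  # the window has reached the end of the input
        start += step
    return chunks


words = "a b c d e f g h i j".split()

# Chunks of 4 words, each sharing 1 word with the previous chunk.
overlapping = split_units(words, split_length=4, split_overlap=1)

# Without overlap the last fragment has only 2 words; a threshold of 3
# attaches it to the previous chunk instead of emitting it on its own.
merged = split_units(words, split_length=4, split_threshold=3)
```

With these inputs, overlapping holds three chunks of four words, and merged holds two chunks, the second of which has absorbed the two leftover words.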
A "source_id" field is added to each document's metadata to keep track of the original document that was split. Another meta field, "page_number", is added to each document to keep track of the page it belonged to in the original document. All other metadata is copied from the original document.
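The metadata handling described above can be pictured with a small sketch that uses plain dictionaries in place of Haystack documents. The chunk_metadata function is a hypothetical helper, and it assumes that pages in the original text are separated by form feed characters ("\f") and that the chunks do not overlap; neither detail is stated on this page.

```python
def chunk_metadata(original, chunks):
    """Attach source_id and page_number to each chunk of one original document.

    Illustrative helper only; documents are modeled as plain dicts.
    """
    records = []
    page = 1  # page on which the current chunk starts
    for text in chunks:
        meta = dict(original["meta"])       # copy the original metadata
        meta["source_id"] = original["id"]  # link back to the document that was split
        meta["page_number"] = page
        records.append({"content": text, "meta": meta})
        page += text.count("\f")            # assumed page separator
    return records


original = {
    "id": "doc-1",
    "content": "Intro.\fBody one. Body two.\fEnd.",
    "meta": {"file_name": "report.txt"},
}
chunks = ["Intro.\f", "Body one. ", "Body two.\f", "End."]
records = chunk_metadata(original, chunks)
pages = [r["meta"]["page_number"] for r in records]
```

Each record keeps the original file_name, carries the same source_id, and records the page on which its text begins, so chunks can later be grouped by source document or filtered by page.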
NLTKDocumentSplitter can be used as a replacement for DocumentSplitter whenever you need more fine-grained control over sentence splitting.
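As an illustration of what sentence-aware word splitting means, the sketch below packs whole sentences into chunks under a word budget. It is a simplified stand-in, not the component's logic: split_sentences uses a naive regular expression where the component relies on NLTK, and both function names are hypothetical.

```python
import re


def split_sentences(text):
    # Naive stand-in for a real sentence tokenizer: split after '.', '!' or '?'
    # when followed by whitespace. It would wrongly split after "Dr.", which is
    # the kind of case abbreviation lists are meant to handle.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def pack_sentences(text, split_length):
    """Build chunks of at most split_length words without cutting a sentence.

    A single sentence longer than split_length still becomes its own chunk.
    """
    chunks, current, count = [], [], 0
    for sentence in split_sentences(text):
        n_words = len(sentence.split())
        if current and count + n_words > split_length:
            # Adding this sentence would exceed the budget: close the chunk.
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks


text = "The cat sat down. It was warm. The dog barked loudly at the door."
chunks = pack_sentences(text, split_length=8)
```

Here the first chunk ends after "It was warm." because the third sentence would push it past eight words. A plain eight-word window would instead cut the third sentence after its first word.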
Usage
On its own
You can use this component outside of a pipeline to shorten your documents like this:
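from haystack import Document
from haystack.components.preprocessors import NLTKDocumentSplitter
splitter = NLTKDocumentSplitter(split_by="passage", split_length=10, split_overlap=0)
result = splitter.run(documents=[Document(content="First passage.\n\nSecond passage.")])  # result["documents"] holds the split documents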
In a pipeline
Here's how you can use NLTKDocumentSplitter in an indexing pipeline:
from pathlib import Path
from haystack import Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.converters.txt import TextFileToDocument
from haystack.components.preprocessors import DocumentCleaner
from haystack.components.preprocessors import NLTKDocumentSplitter
from haystack.components.writers import DocumentWriter
document_store = InMemoryDocumentStore()
p = Pipeline()
p.add_component(instance=TextFileToDocument(), name="text_file_converter")
p.add_component(instance=DocumentCleaner(), name="cleaner")
p.add_component(instance=NLTKDocumentSplitter(split_by="sentence", split_length=1), name="splitter")
p.add_component(instance=DocumentWriter(document_store=document_store), name="writer")
p.connect("text_file_converter.documents", "cleaner.documents")
p.connect("cleaner.documents", "splitter.documents")
p.connect("splitter.documents", "writer.documents")
path = "path/to/your/files"
files = list(Path(path).glob("*.md"))
p.run({"text_file_converter": {"sources": files}})