Preprocess your Documents and texts. Clean, split, and more.
Module document_cleaner
DocumentCleaner
Cleans the text in the documents.
It removes extra whitespace, empty lines, specified substrings, text matching a regex, and repeated page headers and footers (in this order).
Usage example:
from haystack import Document
from haystack.components.preprocessors import DocumentCleaner
doc = Document(content="This is a document to clean\n\n\nsubstring to remove")
cleaner = DocumentCleaner(remove_substrings=["substring to remove"])
result = cleaner.run(documents=[doc])
assert result["documents"][0].content == "This is a document to clean "
DocumentCleaner.__init__
def __init__(remove_empty_lines: bool = True,
remove_extra_whitespaces: bool = True,
remove_repeated_substrings: bool = False,
keep_id: bool = False,
remove_substrings: Optional[List[str]] = None,
remove_regex: Optional[str] = None,
unicode_normalization: Optional[Literal["NFC", "NFKC", "NFD", "NFKD"]] = None,
ascii_only: bool = False)
Initialize DocumentCleaner.
Arguments:
remove_empty_lines: If True, removes empty lines.
remove_extra_whitespaces: If True, removes extra whitespaces.
remove_repeated_substrings: If True, removes repeated substrings (headers and footers) from pages. Pages must be separated by a form feed character "\f", which is supported by TextFileToDocument and AzureOCRDocumentConverter.
remove_substrings: List of substrings to remove from the text.
remove_regex: Regex to match and replace substrings by "".
keep_id: If True, keeps the IDs of the original documents.
unicode_normalization: Unicode normalization form to apply to the text. Note: This will run before any other steps.
ascii_only: Whether to convert the text to ASCII only. Will remove accents from characters and replace them with ASCII characters. Other non-ASCII characters will be removed. Note: This will run before any pattern matching or removal.
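For instance, a cleaner tuned for noisy converter output might enable Unicode normalization and repeated header/footer removal. A minimal sketch (the configuration values are illustrative, not the defaults):
from haystack.components.preprocessors import DocumentCleaner

# Illustrative configuration: normalize Unicode first, strip accents to ASCII,
# remove substrings repeated across pages (headers/footers), and keep the original IDs.
cleaner = DocumentCleaner(
    unicode_normalization="NFKC",
    ascii_only=True,
    remove_repeated_substrings=True,
    keep_id=True,
)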
DocumentCleaner.run
@component.output_types(documents=List[Document])
def run(documents: List[Document])
Cleans up the documents.
Arguments:
documents: List of Documents to clean.
Raises:
TypeError: if documents is not a list of Documents.
Returns:
A dictionary with the following key:
documents: List of cleaned Documents.
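A minimal sketch of calling run and reading the returned dictionary (the sample content is illustrative):
from haystack import Document
from haystack.components.preprocessors import DocumentCleaner

cleaner = DocumentCleaner()
result = cleaner.run(documents=[Document(content="Line one\n\n\nLine   two")])
for doc in result["documents"]:
    # Each entry is a cleaned Document object.
    print(doc.content)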
Module document_splitter
DocumentSplitter
Splits long documents into smaller chunks.
This is a common preprocessing step during indexing. It helps Embedders create meaningful semantic representations and prevents exceeding language model context limits.
Usage example:
from haystack import Document
from haystack.components.preprocessors import DocumentSplitter
doc = Document(content="Moonlight shimmered softly, wolves howled nearby, night enveloped everything.")
splitter = DocumentSplitter(split_by="word", split_length=3, split_overlap=0)
result = splitter.run(documents=[doc])
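Continuing the example above, the resulting chunks can be inspected through the returned documents (a sketch; the exact chunk boundaries and trailing whitespace depend on how the splitter handles the space separator):
for chunk in result["documents"]:
    # With split_by="word" and split_length=3, each chunk holds at most three words.
    print(repr(chunk.content))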
DocumentSplitter.__init__
def __init__(split_by: Literal["word", "sentence", "page", "passage"] = "word",
split_length: int = 200,
split_overlap: int = 0,
split_threshold: int = 0)
Initialize DocumentSplitter.
Arguments:
split_by: The unit for splitting your documents. Choose from "word" for splitting by spaces (" "), "sentence" for splitting by periods ("."), "page" for splitting by form feed ("\f"), or "passage" for splitting by double line breaks ("\n\n").
split_length: The maximum number of units in each split.
split_overlap: The number of overlapping units for each split.
split_threshold: The minimum number of units per split. If a split has fewer units than the threshold, it's attached to the previous split.
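For example, a sentence-based configuration with a small overlap might look like this (the values are illustrative, not recommendations):
from haystack.components.preprocessors import DocumentSplitter

# Illustrative configuration: chunks of up to 5 sentences, overlapping by 1 sentence;
# a trailing chunk with fewer than 2 sentences is attached to the previous chunk.
splitter = DocumentSplitter(
    split_by="sentence",
    split_length=5,
    split_overlap=1,
    split_threshold=2,
)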
DocumentSplitter.run
@component.output_types(documents=List[Document])
def run(documents: List[Document])
Split documents into smaller parts.
Splits documents by the unit expressed in split_by, with a length of split_length
and an overlap of split_overlap.
Arguments:
documents: The documents to split.
Raises:
TypeError: if the input is not a list of Documents.
ValueError: if the content of a document is None.
Returns:
A dictionary with the following key:
documents: List of documents with the split texts. Each document includes:
- A metadata field source_id to track the original document.
- A metadata field page_number to track the original page number.
- All other metadata copied from the original document.
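A minimal sketch of reading that metadata from the split documents, assuming the standard Document.meta dictionary:
from haystack import Document
from haystack.components.preprocessors import DocumentSplitter

splitter = DocumentSplitter(split_by="page", split_length=1)
original = Document(content="Page one.\fPage two.")
splits = splitter.run(documents=[original])["documents"]
for split in splits:
    # source_id points back to the original document; page_number tracks the page.
    print(split.meta.get("source_id"), split.meta.get("page_number"))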
Module text_cleaner
TextCleaner
A PreProcessor component to clean text data.
It can remove substrings matching a list of regular expressions, convert text to lowercase, remove punctuation, and remove numbers.
This is useful to clean up text data before evaluation.
Usage example:
from haystack.components.preprocessors import TextCleaner
text_to_clean = "1Moonlight shimmered softly, 300 Wolves howled nearby, Night enveloped everything."
cleaner = TextCleaner(convert_to_lowercase=True, remove_punctuation=False, remove_numbers=True)
result = cleaner.run(texts=[text_to_clean])
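The cleaned strings are returned under the texts key. A short sketch of inspecting them (the exact whitespace left after removing digits may vary):
for cleaned in result["texts"]:
    # Digits are removed and the text is lowercased; punctuation is kept.
    print(cleaned)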
TextCleaner.__init__
def __init__(remove_regexps: Optional[List[str]] = None,
convert_to_lowercase: bool = False,
remove_punctuation: bool = False,
remove_numbers: bool = False)
Initialize the TextCleaner component.
Arguments:
remove_regexps: A list of regular expressions. If provided, it removes substrings matching these regular expressions from the text.
convert_to_lowercase: If True, converts all characters to lowercase.
remove_punctuation: If True, removes punctuation from the text.
remove_numbers: If True, removes numerical digits from the text.
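For example, regular expressions can be combined with the other flags. A minimal sketch (the pattern is illustrative):
from haystack.components.preprocessors import TextCleaner

# Illustrative configuration: strip bracketed footnote markers such as "[1]",
# then lowercase the text and remove punctuation.
cleaner = TextCleaner(
    remove_regexps=[r"\[\d+\]"],
    convert_to_lowercase=True,
    remove_punctuation=True,
)
result = cleaner.run(texts=["Moonlight shimmered softly.[1]"])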
TextCleaner.run
@component.output_types(texts=List[str])
def run(texts: List[str]) -> Dict[str, Any]
Cleans up the given list of strings.
Arguments:
texts: List of strings to clean.
Returns:
A dictionary with the following key:
texts: the cleaned list of strings.
