
Preprocess your Documents and texts. Clean, split, and more.

Module document_cleaner

DocumentCleaner

Cleans the text in the documents.

Cleans up text documents by removing extra whitespace, empty lines, specified substrings, regexes, and page headers and footers (in that order).

Usage example:

from haystack import Document
from haystack.components.preprocessors import DocumentCleaner

doc = Document(content="This   is  a  document  to  clean\n\n\nsubstring to remove")

cleaner = DocumentCleaner(remove_substrings=["substring to remove"])
result = cleaner.run(documents=[doc])

assert result["documents"][0].content == "This is a document to clean "

DocumentCleaner.__init__

def __init__(remove_empty_lines: bool = True,
             remove_extra_whitespaces: bool = True,
             remove_repeated_substrings: bool = False,
             keep_id: bool = False,
             remove_substrings: Optional[List[str]] = None,
             remove_regex: Optional[str] = None)

Initialize the DocumentCleaner.

Arguments:

  • remove_empty_lines: Whether to remove empty lines.
  • remove_extra_whitespaces: Whether to remove extra whitespaces.
  • remove_repeated_substrings: Whether to remove repeated substrings (headers/footers) from pages. Pages in the text need to be separated by form feed character "\f", which is supported by TextFileToDocument and AzureOCRDocumentConverter.
  • remove_substrings: List of substrings to remove from the text.
  • remove_regex: A regex pattern; substrings matching it are replaced with an empty string (see the sketch after this list).
  • keep_id: If True, keeps the IDs of the original documents.
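
For instance, a minimal sketch combining remove_regex with keep_id (the document content and regex are illustrative, and the assertion assumes keep_id preserves the original Document id as described above):

from haystack import Document
from haystack.components.preprocessors import DocumentCleaner

# Illustrative document containing page markers to strip with remove_regex.
doc = Document(content="Report body text (page 1) more text (page 2)")

cleaner = DocumentCleaner(
    remove_regex=r"\(page \d+\)",  # matching substrings are replaced with ""
    keep_id=True,                  # assumption: the cleaned document keeps the original id
)
result = cleaner.run(documents=[doc])

print(result["documents"][0].content)       # the "(page N)" markers should be gone
assert result["documents"][0].id == doc.id  # holds if keep_id preserves the original id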

DocumentCleaner.run

@component.output_types(documents=List[Document])
def run(documents: List[Document])

Cleans up the documents.

Arguments:

  • documents: List of Documents to clean.

Raises:

  • TypeError: if documents is not a list of Documents.

Returns:

A dictionary with the following key:

  • documents: List of cleaned Documents.
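
DocumentCleaner is usually run as part of an indexing pipeline; a minimal sketch, assuming the standard Pipeline API and illustrative component names:

from haystack import Document, Pipeline
from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter

# Clean documents first, then split them into chunks for indexing.
pipeline = Pipeline()
pipeline.add_component("cleaner", DocumentCleaner())
pipeline.add_component("splitter", DocumentSplitter(split_by="word", split_length=200))
pipeline.connect("cleaner.documents", "splitter.documents")

docs = [Document(content="Some   raw  text\n\n\nwith extra whitespace and empty lines.")]
result = pipeline.run({"cleaner": {"documents": docs}})
print(result["splitter"]["documents"])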

Module document_splitter

DocumentSplitter

Splits a list of text documents into a list of shorter text documents.

Splitting documents with long texts is a common preprocessing step during indexing. It allows Embedders to create meaningful semantic representations and prevents exceeding the maximum context length of language models.

Usage example:

from haystack import Document
from haystack.components.preprocessors import DocumentSplitter

doc = Document(content="Moonlight shimmered softly, wolves howled nearby, night enveloped everything.")

splitter = DocumentSplitter(split_by="word", split_length=3, split_overlap=0)
result = splitter.run(documents=[doc])
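
The document above contains nine words, so these settings are expected to yield three three-word chunks (the exact whitespace in each chunk depends on the splitter's handling of separators):

for split_doc in result["documents"]:
    print(repr(split_doc.content))
# e.g. 'Moonlight shimmered softly, ', 'wolves howled nearby, ', 'night enveloped everything.'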

DocumentSplitter.__init__

def __init__(split_by: Literal["word", "sentence", "page", "passage"] = "word",
             split_length: int = 200,
             split_overlap: int = 0,
             split_threshold: int = 0)

Initialize the DocumentSplitter.

Arguments:

  • split_by: The unit by which the document should be split. Choose "word" to split by spaces (" "), "sentence" to split by periods ("."), "page" to split by form feeds ("\f"), or "passage" to split by double line breaks ("\n\n").
  • split_length: The maximum number of units in each split.
  • split_overlap: The number of units that each split should overlap.
  • split_threshold: The minimum number of units that each split should have. If a split has fewer units than the threshold, it is attached to the previous split (see the sketch after this list).
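
As a rough sketch of how split_threshold interacts with split_length (the content is illustrative, and the exact whitespace in each chunk depends on the implementation):

from haystack import Document
from haystack.components.preprocessors import DocumentSplitter

doc = Document(content="one two three four five six seven")  # seven words

# With split_length=3 and split_threshold=2, the one-word remainder ("seven")
# falls below the threshold and is expected to be attached to the previous
# chunk instead of becoming its own document. Setting split_overlap > 0 would
# additionally make consecutive chunks share that many words.
splitter = DocumentSplitter(split_by="word", split_length=3, split_overlap=0, split_threshold=2)
result = splitter.run(documents=[doc])

for d in result["documents"]:
    print(repr(d.content))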

DocumentSplitter.run

@component.output_types(documents=List[Document])
def run(documents: List[Document])

Split documents into smaller parts.

Splits documents by the unit expressed in split_by, with a length of split_length and an overlap of split_overlap.

Arguments:

  • documents: The documents to split.

Raises:

  • TypeError: if the input is not a list of Documents.
  • ValueError: if the content of a document is None.

Returns:

A dictionary with the following key:

  • documents: List of documents with the split texts. A metadata field "source_id" is added to each document to keep track of the original document it was split from. Another metadata field "page_number" is added to each document to keep track of the page it belonged to in the original document. Other metadata are copied from the original document (see the sketch below).
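
For example, a small sketch showing the added metadata on page-based splits (file name and content are illustrative):

from haystack import Document
from haystack.components.preprocessors import DocumentSplitter

# Two pages separated by the form feed character "\f".
doc = Document(content="First page text.\fSecond page text.", meta={"file_name": "report.txt"})

splitter = DocumentSplitter(split_by="page", split_length=1)
result = splitter.run(documents=[doc])

for d in result["documents"]:
    # "source_id" points back to the original document, "page_number" to the page,
    # and the original metadata (here "file_name") is copied over.
    print(d.meta["source_id"] == doc.id, d.meta["page_number"], d.meta["file_name"], repr(d.content))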

Module text_cleaner

TextCleaner

A PreProcessor component to clean text data.

It can remove substrings matching a list of regular expressions, convert text to lowercase, remove punctuation, and remove numbers.

This is useful to clean up text data before evaluation.

Usage example:

from haystack.components.preprocessors import TextCleaner

text_to_clean = "1Moonlight shimmered softly, 300 Wolves howled nearby, Night enveloped everything."

cleaner = TextCleaner(convert_to_lowercase=True, remove_punctuation=False, remove_numbers=True)
result = cleaner.run(texts=[text_to_clean])

TextCleaner.__init__

def __init__(remove_regexps: Optional[List[str]] = None,
             convert_to_lowercase: bool = False,
             remove_punctuation: bool = False,
             remove_numbers: bool = False)

Initialize the TextCleaner component.

Arguments:

  • remove_regexps: A list of regular expressions. If provided, substrings matching these regular expressions are removed from the text (see the sketch after this list).
  • convert_to_lowercase: If True, converts all characters to lowercase.
  • remove_punctuation: If True, removes punctuation from the text.
  • remove_numbers: If True, removes numerical digits from the text.
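
A minimal sketch combining several options (the regular expression and input string are illustrative):

from haystack.components.preprocessors import TextCleaner

cleaner = TextCleaner(
    remove_regexps=[r"\[\d+\]"],   # strip footnote markers such as "[12]"
    convert_to_lowercase=True,
    remove_punctuation=True,
    remove_numbers=False,
)
result = cleaner.run(texts=["The Answer[1] is 42."])
print(result["texts"])  # expected: a lowercased string without the footnote marker or punctuation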

TextCleaner.run

@component.output_types(texts=List[str])
def run(texts: List[str]) -> Dict[str, Any]

Cleans up the given list of strings.

Arguments:

  • texts: List of strings to clean.

Returns:

A dictionary with the following key:

  • texts: The cleaned list of strings.
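
Because the cleaned strings come back under the "texts" key, a typical pre-evaluation sketch (the prediction and gold answer are illustrative) looks like this:

from haystack.components.preprocessors import TextCleaner

# Normalize a model prediction before comparing it to a gold answer.
cleaner = TextCleaner(convert_to_lowercase=True, remove_punctuation=True)

prediction = "Paris, France."
gold_answer = "paris france"

cleaned = cleaner.run(texts=[prediction])["texts"][0]
print(cleaned == gold_answer)  # expected: True once case and punctuation are normalized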