API Reference

Preprocess your Documents and texts. Clean, split, and more.

Module document_cleaner


class DocumentCleaner()

Cleans up text documents by removing extra whitespaces, empty lines, specified substrings, regexes, page headers and footers (in this order).

Usage example:

from haystack import Document
from haystack.components.preprocessors import DocumentCleaner

# The input has repeated spaces, empty lines, and a substring to strip out.
doc = Document(content="This   is  a  document  to  clean\n\n\nsubstring to remove")

cleaner = DocumentCleaner(remove_substrings=["substring to remove"])
result = cleaner.run(documents=[doc])

# The cleaned content keeps a trailing space where the removed substring began.
assert result["documents"][0].content == "This is a document to clean "


def __init__(remove_empty_lines: bool = True,
             remove_extra_whitespaces: bool = True,
             remove_repeated_substrings: bool = False,
             remove_substrings: Optional[List[str]] = None,
             remove_regex: Optional[str] = None)


Arguments:

  • remove_empty_lines: Whether to remove empty lines.
  • remove_extra_whitespaces: Whether to remove extra whitespaces.
  • remove_repeated_substrings: Whether to remove repeated substrings (headers and footers) from pages. Pages in the text must be separated by the form feed character "\f", which TextFileToDocument and AzureOCRDocumentConverter support.
  • remove_substrings: List of substrings to remove from the text.
  • remove_regex: Regex whose matches are replaced with "".
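
The cleaning order described for DocumentCleaner can be sketched in plain Python, without Haystack. This is only an illustration: the function name clean_text_sketch is hypothetical, the whitespace rule (collapsing any run of two or more whitespace characters into one space) is an assumption rather than the library's confirmed implementation, and header/footer removal is left out.

```python
import re
from typing import List, Optional


def clean_text_sketch(
    text: str,
    remove_substrings: Optional[List[str]] = None,
    remove_regex: Optional[str] = None,
) -> str:
    """Hypothetical stand-in that mimics the documented cleaning order."""
    # 1. Extra whitespace: collapse runs of 2+ whitespace characters
    #    (assumed rule; newlines count as whitespace here).
    text = re.sub(r"\s\s+", " ", text).strip()
    # 2. Empty lines: drop lines that contain only whitespace.
    text = "\n".join(line for line in text.split("\n") if line.strip())
    # 3. Substrings: remove each literal substring.
    for substring in remove_substrings or []:
        text = text.replace(substring, "")
    # 4. Regex: replace every match with an empty string.
    if remove_regex:
        text = re.sub(remove_regex, "", text)
    return text


cleaned = clean_text_sketch(
    "This   is  a  document  to  clean\n\n\nsubstring to remove",
    remove_substrings=["substring to remove"],
)
print(repr(cleaned))  # 'This is a document to clean '
```

Under these assumptions the three newlines collapse into a single space before the substring is removed, which is why the result ends with a space.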


def run(documents: List[Document])

Cleans up the documents.


Arguments:

  • documents: List of Documents to clean.


Raises:

  • TypeError: If documents is not a list of Documents.


Returns:

A dictionary with the following key:

  • documents: List of cleaned Documents.

Module document_splitter


class DocumentSplitter()

Splits a list of text documents into a list of text documents with shorter texts.

Splitting documents with long texts is a common preprocessing step during indexing. It lets Embedders create meaningful semantic representations and avoids exceeding the maximum context length of language models.


def __init__(split_by: Literal["word", "sentence", "page", "passage"] = "word",
             split_length: int = 200,
             split_overlap: int = 0)


Arguments:

  • split_by: The unit by which the document should be split. Choose from "word" for splitting by " ", "sentence" for splitting by ".", "page" for splitting by "\f", or "passage" for splitting by "\n\n".
  • split_length: The maximum number of units in each split.
  • split_overlap: The number of units by which consecutive splits overlap.
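
The interaction between split_by, split_length, and split_overlap can be illustrated with a sliding window over the units. The sketch below is plain Python and does not call Haystack; the names SEPARATORS and split_text_sketch are hypothetical, and splitting on the raw separator and re-joining with it is an assumption about behavior, not the library's implementation.

```python
from typing import Dict, List

# Separators as listed in the split_by description.
SEPARATORS: Dict[str, str] = {
    "word": " ",
    "sentence": ".",
    "page": "\f",
    "passage": "\n\n",
}


def split_text_sketch(
    text: str,
    split_by: str = "word",
    split_length: int = 200,
    split_overlap: int = 0,
) -> List[str]:
    """Hypothetical sliding-window splitter over separator-delimited units."""
    if split_length <= 0:
        raise ValueError("split_length must be greater than 0")
    if not 0 <= split_overlap < split_length:
        raise ValueError("split_overlap must be in [0, split_length)")

    separator = SEPARATORS[split_by]
    units = text.split(separator)
    # Each window starts this many units after the previous one.
    step = split_length - split_overlap

    chunks: List[str] = []
    for start in range(0, len(units), step):
        chunks.append(separator.join(units[start:start + split_length]))
        # Stop once a window has reached the end of the text, so the
        # overlap does not produce a redundant trailing chunk.
        if start + split_length >= len(units):
            break
    return chunks


print(split_text_sketch("a b c d e", split_by="word", split_length=3, split_overlap=1))
# ['a b c', 'c d e']
```

With split_length=3 and split_overlap=1, each chunk starts two units after the previous one, so the last unit of one chunk is repeated as the first unit of the next.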


def run(documents: List[Document])

Splits documents by the unit expressed in split_by, with a length of split_length and an overlap of split_overlap.


Arguments:

  • documents: The documents to split.


Raises:

  • TypeError: If the input is not a list of Documents.
  • ValueError: If the content of a document is None.


Returns:

A dictionary with the following key:

  • documents: List of documents with the split texts. A metadata field "source_id" is added to each document to keep track of the original document that was split. Other metadata are copied from the original document.
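
The metadata handling described for the returned documents can be sketched as follows. The Doc dataclass and the attach_source_sketch function are hypothetical stand-ins used only for illustration; they are not Haystack's Document class or its splitting code.

```python
import uuid
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Doc:
    """Hypothetical minimal document: text, metadata, and a generated id."""
    content: str
    meta: Dict[str, Any] = field(default_factory=dict)
    id: str = field(default_factory=lambda: uuid.uuid4().hex)


def attach_source_sketch(original: Doc, pieces: List[str]) -> List[Doc]:
    """Wrap each text piece in a new Doc that points back to its source."""
    split_docs: List[Doc] = []
    for piece in pieces:
        # Copy the original metadata so the source document is not mutated.
        meta = dict(original.meta)
        # Record which document this piece came from.
        meta["source_id"] = original.id
        split_docs.append(Doc(content=piece, meta=meta))
    return split_docs


original = Doc(content="a b c d", meta={"lang": "en"})
parts = attach_source_sketch(original, ["a b", "c d"])
print([p.meta["source_id"] == original.id for p in parts])  # [True, True]
```

Because every split carries the id of its source, splits can later be grouped back by original document, for example when displaying retrieval results.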

Module text_cleaner


class TextCleaner()

A preprocessor component to clean text data. It can remove substrings matching a list of regular expressions, convert text to lowercase, remove punctuation, and remove numbers.

This is useful for cleaning up text data before evaluation.


def __init__(remove_regexps: Optional[List[str]] = None,
             convert_to_lowercase: bool = False,
             remove_punctuation: bool = False,
             remove_numbers: bool = False)


Arguments:

  • remove_regexps: A list of regular expressions. If provided, it removes substrings matching these regular expressions from the text.
  • convert_to_lowercase: If True, converts all characters to lowercase.
  • remove_punctuation: If True, removes punctuation from the text.
  • remove_numbers: If True, removes numerical digits from the text.
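
The four TextCleaner options can be approximated with the standard library. The sketch below is an illustration under stated assumptions: the function name clean_strings_sketch is hypothetical, the steps run in the order the class description lists them, and "punctuation" and "digits" are taken to mean the ASCII sets in string.punctuation and string.digits, which may differ from the library's definition.

```python
import re
import string
from typing import List, Optional


def clean_strings_sketch(
    texts: List[str],
    remove_regexps: Optional[List[str]] = None,
    convert_to_lowercase: bool = False,
    remove_punctuation: bool = False,
    remove_numbers: bool = False,
) -> List[str]:
    """Hypothetical stand-in that applies the documented options to each string."""
    # Build translation tables once; they delete the listed characters.
    punctuation_table = str.maketrans("", "", string.punctuation)
    digit_table = str.maketrans("", "", string.digits)

    cleaned: List[str] = []
    for text in texts:
        # Remove every match of each regular expression.
        for pattern in remove_regexps or []:
            text = re.sub(pattern, "", text)
        if convert_to_lowercase:
            text = text.lower()
        if remove_punctuation:
            text = text.translate(punctuation_table)
        if remove_numbers:
            text = text.translate(digit_table)
        cleaned.append(text)
    return cleaned


print(clean_strings_sketch(
    ["Hello, World 42!"],
    convert_to_lowercase=True,
    remove_punctuation=True,
    remove_numbers=True,
))  # ['hello world ']
```

Surrounding whitespace is left untouched in this sketch, so removing a trailing number or punctuation mark can leave a trailing space.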


def run(texts: List[str]) -> Dict[str, Any]

Cleans up the given list of strings.


Arguments:

  • texts: List of strings to clean.


Returns:

A dictionary with the following key:

  • texts: The cleaned list of strings.