Name	DocumentCleaner
Folder path	/preprocessors/
Most common position in a pipeline	In indexing pipelines after Converters, before `DocumentSplitter`
Mandatory input variables	"documents": A list of documents
Output variables	"documents": A list of documents

Overview

DocumentCleaner expects a list of documents as input and returns a list of documents with cleaned texts. The selectable cleaning steps for each input document are remove_empty_lines, remove_extra_whitespaces, and remove_repeated_substrings. These three parameters are booleans that you set when initializing the component.

remove_empty_lines removes empty lines from the document.
remove_extra_whitespaces removes extra whitespaces from the document.
remove_repeated_substrings removes repeated substrings (such as headers and footers) from pages in the document. Pages in the text need to be separated by the form feed character "\f", which TextFileToDocument and AzureOCRDocumentConverter support.
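To make the role of the form feed separator concrete, here is a minimal pure-Python sketch of the idea behind header and footer removal. It is a simplified stand-in, not the library's actual implementation: the function name strip_repeated_lines is hypothetical, and it only drops whole lines that occur on every page.

```python
from collections import Counter


def strip_repeated_lines(text: str) -> str:
    """Drop lines that occur on every page of a form-feed-separated text.

    Hypothetical helper for illustration only; not part of Haystack.
    """
    pages = text.split("\f")
    if len(pages) < 2:
        return text  # a single page has nothing to compare against

    # Count on how many pages each non-empty line occurs (at most once per page).
    counts = Counter(
        line
        for page in pages
        for line in set(page.splitlines())
        if line.strip()
    )
    repeated = {line for line, n in counts.items() if n == len(pages)}

    # Rebuild each page without the repeated lines, keeping the page breaks.
    cleaned_pages = [
        "\n".join(line for line in page.splitlines() if line not in repeated)
        for page in pages
    ]
    return "\f".join(cleaned_pages)


sample = "ACME Report\nIntro text\nPage footer\fACME Report\nMore text\nPage footer"
result = strip_repeated_lines(sample)
```

Without the "\f" separators the text would be treated as a single page, so there would be nothing to compare and nothing would be removed.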

In addition, you can specify a list of strings that should be removed from all documents as part of the cleaning with the parameter remove_substrings. You can also specify a regular expression with the parameter remove_regex, and any matches will be removed.
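The effect of these two options can be approximated on a plain string. The sketch below assumes literal matching for substrings and re.sub for the regular expression; the helper functions and the sample text are illustrative and are not taken from the library.

```python
import re


def remove_substrings(text: str, substrings: list[str]) -> str:
    # Remove every literal occurrence of each substring.
    for substring in substrings:
        text = text.replace(substring, "")
    return text


def remove_regex(text: str, pattern: str) -> str:
    # Remove every match of the regular expression.
    return re.sub(pattern, "", text)


text = "Confidential - Invoice 2024-001 issued to ACME"
without_label = remove_substrings(text, ["Confidential - "])
cleaned = remove_regex(without_label, r"\d{4}-\d{3} ")
```

Literal substrings are the safer choice for fixed boilerplate, while a regular expression suits patterns that vary from document to document, such as the invoice number here.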

The cleaning steps are executed in the following order:

remove_extra_whitespaces
remove_empty_lines
remove_substrings
remove_regex
remove_repeated_substrings
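One likely consequence of this order is that the strings you pass for removal are matched against text whose whitespace has already been normalized. The sketch below illustrates that with simplified stand-in functions; they are assumptions made for the example and do not reproduce the library's exact whitespace handling.

```python
import re


def collapse_whitespace(text: str) -> str:
    # Collapse runs of spaces and tabs into one space and trim each line.
    return "\n".join(
        re.sub(r"[ \t]+", " ", line).strip() for line in text.split("\n")
    )


def drop_empty_lines(text: str) -> str:
    # Keep only lines that contain non-whitespace characters.
    return "\n".join(line for line in text.split("\n") if line.strip())


def clean(text: str, substrings: tuple[str, ...] = ()) -> str:
    # Apply the steps in the documented order: whitespace, empty lines, substrings.
    text = collapse_whitespace(text)
    text = drop_empty_lines(text)
    for substring in substrings:
        text = text.replace(substring, "")
    return text


raw = "Draft   copy\n\n\nFirst   paragraph\n"
unmatched = clean(raw, ("Draft   copy",))  # spelled as in the raw text
matched = clean(raw, ("Draft copy",))      # spelled as in the normalized text
```

In this sketch the first call leaves "Draft copy" in place because the three-space spelling no longer exists once whitespace is collapsed, so it can help to write removal strings as they appear after normalization.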

Usage

On its own

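You can use it outside of a pipeline to clean up your documents. The sample document below is only an illustration:

from haystack import Document
from haystack.components.preprocessors import DocumentCleaner

doc = Document(content="This   is  a  document  to  clean\n\n\nwith  empty  lines")

cleaner = DocumentCleaner(
    remove_empty_lines=True,
    remove_extra_whitespaces=True,
    remove_repeated_substrings=False)

# run() returns a dictionary with the cleaned documents under "documents"
cleaned_docs = cleaner.run(documents=[doc])["documents"]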

In a pipeline

from haystack import Document
from haystack import Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.converters import TextFileToDocument
from haystack.components.preprocessors import DocumentCleaner
from haystack.components.preprocessors import DocumentSplitter
from haystack.components.writers import DocumentWriter

document_store = InMemoryDocumentStore()
p = Pipeline()
p.add_component(instance=TextFileToDocument(), name="text_file_converter")
p.add_component(instance=DocumentCleaner(), name="cleaner")
p.add_component(instance=DocumentSplitter(split_by="sentence", split_length=1), name="splitter")
p.add_component(instance=DocumentWriter(document_store=document_store), name="writer")
p.connect("text_file_converter.documents", "cleaner.documents")
p.connect("cleaner.documents", "splitter.documents")
p.connect("splitter.documents", "writer.documents")

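# your_files is a placeholder for a list of paths to text files.
# The input goes to the first component, since the cleaner receives its documents from the converter.
p.run({"text_file_converter": {"sources": your_files}})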