DocumentPreprocessor
Splits a list of text documents into a list of shorter documents and then cleans them to make them more readable.
| | |
| --- | --- |
| Most common position in a pipeline | In indexing pipelines, after Converters |
| Mandatory run variables | "documents": A list of documents |
| Output variables | "documents": A list of split and cleaned documents |
| API reference | PreProcessors |
| GitHub link | https://github.com/deepset-ai/haystack/blob/main/haystack/components/preprocessors/document_preprocessor.py |
Overview
DocumentPreprocessor first splits and then cleans documents. It is a SuperComponent that combines a DocumentSplitter and a DocumentCleaner into a single component.
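Conceptually, the SuperComponent wraps a two-step pipeline: splitting first, then cleaning. The sketch below shows a roughly equivalent explicit setup with default parameters; the actual wiring lives in the source file linked above.

```python
from haystack import Document, Pipeline
from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter

# Roughly what DocumentPreprocessor does internally: split, then clean.
pipeline = Pipeline()
pipeline.add_component("splitter", DocumentSplitter())
pipeline.add_component("cleaner", DocumentCleaner())
pipeline.connect("splitter.documents", "cleaner.documents")

result = pipeline.run(data={"documents": [Document(content="I love pizza!")]})
print(result["cleaner"]["documents"])
```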
Parameters
The DocumentPreprocessor exposes all initialization parameters of the underlying DocumentSplitter and DocumentCleaner, and they are all optional. See the DocumentSplitter and DocumentCleaner documentation pages for a detailed description of these parameters.
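For example, you can tune splitting and cleaning in one place. The parameter names below follow the underlying components; treat this as an illustrative configuration rather than recommended settings.

```python
from haystack.components.preprocessors import DocumentPreprocessor

# Splitter parameters control chunking; cleaner parameters control cleanup.
preprocessor = DocumentPreprocessor(
    split_by="word",               # DocumentSplitter: unit to split on
    split_length=150,              # DocumentSplitter: max units per chunk
    split_overlap=30,              # DocumentSplitter: units shared between adjacent chunks
    remove_empty_lines=True,       # DocumentCleaner: drop blank lines
    remove_extra_whitespaces=True, # DocumentCleaner: collapse repeated whitespace
)
```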
Usage
On its own
```python
from haystack import Document
from haystack.components.preprocessors import DocumentPreprocessor

doc = Document(content="I love pizza!")

preprocessor = DocumentPreprocessor()
result = preprocessor.run(documents=[doc])
print(result["documents"])
```
In a pipeline
You can use the DocumentPreprocessor in your indexing pipeline. The example below requires installing additional dependencies for the MultiFileConverter:
```shell
pip install pypdf markdown-it-py mdit_plain trafilatura python-pptx python-docx jq openpyxl tabulate pandas
```
```python
from haystack import Pipeline
from haystack.components.converters import MultiFileConverter
from haystack.components.preprocessors import DocumentPreprocessor
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore

document_store = InMemoryDocumentStore()

pipeline = Pipeline()
pipeline.add_component("converter", MultiFileConverter())
pipeline.add_component("preprocessor", DocumentPreprocessor())
pipeline.add_component("writer", DocumentWriter(document_store=document_store))
pipeline.connect("converter", "preprocessor")
pipeline.connect("preprocessor", "writer")

result = pipeline.run(data={"sources": ["test.txt", "test.pdf"]})
print(result)
# {'writer': {'documents_written': 3}}
```
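To check what ended up in the store, you can use the InMemoryDocumentStore helpers shown below; the exact counts and metadata depend on your files and split settings.

```python
# Count and inspect the stored chunks.
print(document_store.count_documents())
for doc in document_store.filter_documents():
    print(doc.meta.get("source_id"), repr(doc.content[:60]))
```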