DocumentPreprocessor
Splits a list of text documents into a list of shorter documents and then cleans them to make them more readable.
| | |
| --- | --- |
| Most common position in a pipeline | In indexing pipelines, after Converters |
| Mandatory run variables | "documents": A list of documents |
| Output variables | "documents": A list of split and cleaned documents |
| API reference | PreProcessors |
| GitHub link | https://github.com/deepset-ai/haystack/blob/main/haystack/components/preprocessors/document_preprocessor.py |
Overview
DocumentPreprocessor first splits and then cleans documents. It is a SuperComponent that combines a DocumentSplitter and a DocumentCleaner into a single component.
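Conceptually, the SuperComponent wraps a two-step pipeline: splitting first, then cleaning. The sketch below shows a roughly equivalent explicit setup with default parameters; the actual wiring lives in the source file linked above.

```python
from haystack import Document, Pipeline
from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter

# Roughly what DocumentPreprocessor does internally: split, then clean.
pipeline = Pipeline()
pipeline.add_component("splitter", DocumentSplitter())
pipeline.add_component("cleaner", DocumentCleaner())
pipeline.connect("splitter.documents", "cleaner.documents")

result = pipeline.run(data={"documents": [Document(content="I love pizza!")]})
print(result["cleaner"]["documents"])
```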
Parameters
The DocumentPreprocessor exposes all initialization parameters of the underlying DocumentSplitter and DocumentCleaner, and they are all optional. See the DocumentSplitter and DocumentCleaner documentation pages for a detailed description of these parameters.
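For example, you can tune splitting and cleaning in one place. The parameter names below follow the underlying components; treat this as an illustrative configuration rather than recommended settings.

```python
from haystack.components.preprocessors import DocumentPreprocessor

# Splitter parameters control chunking; cleaner parameters control cleanup.
preprocessor = DocumentPreprocessor(
    split_by="word",               # DocumentSplitter: unit to split on
    split_length=150,              # DocumentSplitter: max units per chunk
    split_overlap=30,              # DocumentSplitter: units shared between adjacent chunks
    remove_empty_lines=True,       # DocumentCleaner: drop blank lines
    remove_extra_whitespaces=True, # DocumentCleaner: collapse repeated whitespace
)
```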
Usage
On its own
```python
from haystack import Document
from haystack.components.preprocessors import DocumentPreprocessor

doc = Document(content="I love pizza!")

preprocessor = DocumentPreprocessor()
result = preprocessor.run(documents=[doc])
print(result["documents"])
```
In a pipeline
You can use the DocumentPreprocessor in your indexing pipeline. The example below requires installing additional dependencies for the MultiFileConverter:
```shell
pip install pypdf markdown-it-py mdit_plain trafilatura python-pptx python-docx jq openpyxl tabulate pandas
```
```python
from haystack import Pipeline
from haystack.components.converters import MultiFileConverter
from haystack.components.preprocessors import DocumentPreprocessor
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore

document_store = InMemoryDocumentStore()

pipeline = Pipeline()
pipeline.add_component("converter", MultiFileConverter())
pipeline.add_component("preprocessor", DocumentPreprocessor())
pipeline.add_component("writer", DocumentWriter(document_store=document_store))
pipeline.connect("converter", "preprocessor")
pipeline.connect("preprocessor", "writer")

result = pipeline.run(data={"sources": ["test.txt", "test.pdf"]})
print(result)
# {'writer': {'documents_written': 3}}
```
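To check what ended up in the store, you can use the InMemoryDocumentStore helpers shown below; the exact counts and metadata depend on your files and split settings.

```python
# Count and inspect the stored chunks.
print(document_store.count_documents())
for doc in document_store.filter_documents():
    print(doc.meta.get("source_id"), repr(doc.content[:60]))
```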