
PreProcessor

Use the PreProcessor to normalize whitespace, remove headers and footers, clean up empty lines in your Documents, or split them into smaller pieces. The PreProcessor is useful in an indexing pipeline to prepare your files for search.

Splitting is generally recommended for long Documents as it makes the Retriever's job easier and speeds up Question Answering. For suggestions on how best to split your documents, see Optimization.

Position in a Pipeline: As early in an indexing Pipeline as possible, but after File Converters and Crawlers
Input: Documents
Output: Documents
Classes: PreProcessor

πŸ‘ Tutorial

To start working with code examples, have a look at the How to Preprocess Documents tutorial. For ideas on what you can do at indexing time, see the DocumentClassifier at Index Time tutorial.

Usage

To initialize PreProcessor, run:

from haystack.nodes import PreProcessor

processor = PreProcessor(
    clean_empty_lines=True,
    clean_whitespace=True,
    clean_header_footer=True,
    remove_substrings=None,
    split_by="word",
    split_length=200,
    split_respect_sentence_boundary=True,
    split_overlap=0,
    max_chars_check=10_000,
)
Argument | Type | Description
clean_empty_lines | bool | Normalizes three or more consecutive empty lines to just two empty lines.
clean_whitespace | bool | Removes any whitespace at the beginning or end of each line in the text.
clean_header_footer | bool | Removes any long header or footer texts that are repeated on each page.
remove_substrings | list | Removes the specified substrings from the text. If no value is provided, an empty list is used by default.
split_by | string | Determines the unit the document is split by. Choose from 'word', 'sentence', or 'passage'.
split_length | int | Sets the maximum number of 'word', 'sentence', or 'passage' units per output document.
split_respect_sentence_boundary | bool | Ensures that splits do not fall in the middle of sentences.
split_overlap | int | Sets the amount of overlap between two adjacent documents after a split. Setting this to a positive number essentially enables the sliding window approach.
max_chars_check | int | Sets the maximum length of a document. If a document exceeds this limit, the PreProcessor generates a warning and splits the document at the maximum character value. The resulting fragments are cut again if they still exceed the limit.
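
For example, setting split_overlap to a positive value makes each chunk repeat the tail of the previous one. Here is a minimal sketch of that sliding window; the short word counts and the sample text are illustrative, and the exact chunk boundaries can vary between Haystack versions:

from haystack import Document
from haystack.nodes import PreProcessor

# Sliding window: each step advances by split_length - split_overlap = 3 words.
processor = PreProcessor(
    split_by="word",
    split_length=5,
    split_overlap=2,
    split_respect_sentence_boundary=False,  # keeps the window boundaries predictable
)
doc = Document(content="one two three four five six seven eight nine ten")
chunks = processor.process([doc])
print([c.content for c in chunks])
# Roughly: ['one two three four five', 'four five six seven eight', ...]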

You can also set additional parameters to customize preprocessing:

Argument | Type | Description
tokenizer_model_folder | path | Path to the folder containing the NLTK PunktSentenceTokenizer models, if loading a model from a local path. Leave empty otherwise.
language | str | The language used by nltk.tokenize.sent_tokenize, in ISO 639 format.
id_hash_keys | list | Generates the document ID from a custom list of strings that refer to the document's attributes.
progress_bar | bool | Enables or disables the progress bar.
add_page_number | bool | Adds the number of the page a paragraph occurs on to the Document's meta field "page". Page boundaries are determined by the "\f" character, which PDFToTextConverter, TikaConverter, ParsrConverter, and AzureConverter insert between pages.
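
For instance, a PreProcessor set up for German text that also records page numbers might look like this (a sketch; the parameter values are illustrative):

from haystack.nodes import PreProcessor

processor = PreProcessor(
    language="de",         # passed to nltk.tokenize.sent_tokenize for sentence splitting
    add_page_number=True,  # stores the page number in each Document's meta["page"]
    progress_bar=False,    # silences the progress bar in scripts
)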

πŸ“˜ See the complete list of parameters with descriptions in the PreProcessor API documentation.

To use the PreProcessor by itself, run:

from haystack.nodes import TextConverter

converter = TextConverter()  # or any other file converter
docs = converter.convert(file_path=file, meta=None)  # returns a list of Documents
docs = processor.process(docs)
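
The output is a list of new, smaller Document objects. A quick check of the result (the meta key shown is an assumption and can differ between Haystack versions):

print(f"{len(docs)} documents after preprocessing")
print(docs[0].meta)  # split metadata, for example a '_split_id' key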

To use the PreProcessor in a pipeline, run:

from haystack.pipelines import Pipeline
from haystack.nodes import PreProcessor, TextConverter, EmbeddingRetriever
from haystack.document_stores import DeepsetCloudDocumentStore

# Assumes text_converter, preprocessor, retriever, and document_store
# have been initialized from the classes imported above.
pipeline = Pipeline()
pipeline.add_node(component=text_converter, name="TextConverter", inputs=["File"])
pipeline.add_node(component=preprocessor, name="PreProcessor", inputs=["TextConverter"])
pipeline.add_node(component=retriever, name="EmbeddingRetriever", inputs=["PreProcessor"])
pipeline.add_node(component=document_store, name="DeepsetCloudDocumentStore", inputs=["EmbeddingRetriever"])
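
Once the nodes are connected, you can index files through the pipeline (a sketch; the file path is illustrative):

pipeline.run(file_paths=["data/my_file.txt"])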

Document Format

When you are not using an indexing Pipeline, the PreProcessor can take either Document objects (recommended) or plain dictionaries as input. To learn more about the Document class, see Documents, Answers, and Labels.

from haystack import Document

# Option 1: Native Haystack Documents
docs = [
    Document(
        content='DOCUMENT_TEXT_HERE',
        meta={'name': DOCUMENT_NAME, ...},
        ...
    ), ...
]

# Option 2: Plain dictionary
docs = [
    {
        'content': 'DOCUMENT_TEXT_HERE',
        'meta': {'name': DOCUMENT_NAME, ...}
    }, ...
]
