

Splitting is generally recommended for long Documents as it makes the Retriever's job easier and speeds up Question Answering. For suggestions on how best to split your documents, see [Optimization](🔗).

| | |
|---|---|
| **Position in a Pipeline** | As early in an indexing Pipeline as possible, but after File Converters and Crawlers |
| **Input** | [Documents](🔗) |
| **Output** | [Documents](🔗) |
| **Classes** | PreProcessor |

**Tutorial:** To start working with code examples, have a look at the [preprocessing tutorial](🔗). For ideas on what you can do at indexing time, see the [advanced indexing tutorial](🔗).

## Usage

To initialize `PreProcessor`, run:
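
A minimal initialization sketch, assuming Haystack 1.x, where `PreProcessor` can be imported from `haystack.nodes`. The argument values are illustrative choices rather than recommended defaults, and `build_preprocessor` is a hypothetical helper name. The import sits inside the helper so the settings can be inspected even where Haystack is not installed.

```python
# Illustrative settings; each keyword is one of the PreProcessor arguments.
PREPROCESSOR_SETTINGS = {
    "clean_empty_lines": True,
    "clean_whitespace": True,
    "clean_header_footer": False,
    "split_by": "word",
    "split_length": 200,
    "split_respect_sentence_boundary": True,
    "split_overlap": 0,
}


def build_preprocessor():
    """Create a PreProcessor from the settings above (hypothetical helper)."""
    # Assumption: Haystack 1.x is installed (`pip install farm-haystack`).
    from haystack.nodes import PreProcessor

    return PreProcessor(**PREPROCESSOR_SETTINGS)
```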


| Argument | Type | Description |
|---|---|---|
| `clean_empty_lines` | bool | Normalizes three or more consecutive empty lines to just two empty lines. |
| `clean_whitespace` | bool | Removes any whitespace at the beginning or end of each line in the text. |
| `clean_header_footer` | bool | Removes any long header or footer texts that are repeated on each page. |
| `split_by` | string | Determines the unit the document is split by. Choose from `'word'`, `'sentence'`, or `'passage'`. |
| `split_length` | int | Sets the maximum number of units (words, sentences, or passages, depending on `split_by`) per output document. |
| `split_respect_sentence_boundary` | bool | Ensures that document boundaries do not fall in the middle of sentences. |
| `split_overlap` | int | Sets the amount of overlap between two adjacent documents after a split. Setting this to a positive number enables the sliding window approach. |
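
To make the interplay of `split_length` and `split_overlap` concrete, here is a plain-Python sketch of word-based splitting with a sliding window. It is not Haystack's implementation: `split_words` is a hypothetical function, it splits on whitespace only, and it ignores sentence boundaries.

```python
from typing import List


def split_words(text: str, split_length: int, split_overlap: int = 0) -> List[str]:
    """Split text into chunks of at most `split_length` words.

    Adjacent chunks share `split_overlap` words (a sliding window).
    """
    if split_length <= 0:
        raise ValueError("split_length must be positive")
    if not 0 <= split_overlap < split_length:
        raise ValueError("split_overlap must be >= 0 and smaller than split_length")

    words = text.split()
    step = split_length - split_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        if start + split_length >= len(words):
            break  # the window has reached the end of the text
    return chunks


text = "one two three four five six seven eight nine ten"
print(split_words(text, split_length=4, split_overlap=0))
# ['one two three four', 'five six seven eight', 'nine ten']
print(split_words(text, split_length=4, split_overlap=2))
# ['one two three four', 'three four five six', 'five six seven eight', 'seven eight nine ten']
```

With an overlap of 2, each chunk repeats the last two words of the previous one, so a passage that straddles a boundary still appears whole in at least one chunk, at the cost of producing more output documents.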

To run the `PreProcessor` by itself, run:
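
A minimal standalone sketch, assuming Haystack 1.x, where `Document` is importable from `haystack` and `PreProcessor.process()` accepts a list of Documents. The helper name `preprocess_text` and the argument values are illustrative assumptions.

```python
def preprocess_text(text: str):
    """Wrap raw text in a Document and run it through a PreProcessor."""
    # Assumption: Haystack 1.x is installed (`pip install farm-haystack`).
    from haystack import Document
    from haystack.nodes import PreProcessor

    processor = PreProcessor(
        clean_empty_lines=True,
        clean_whitespace=True,
        clean_header_footer=False,
        split_by="word",
        split_length=100,
        split_respect_sentence_boundary=True,
        split_overlap=0,
    )
    # process() returns a list of Documents, one per split.
    return processor.process([Document(content=text)])
```

Calling `preprocess_text(long_text)` on a text longer than 100 words should return several shorter Documents.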



To use `PreProcessor` in a pipeline, run:
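
A minimal indexing-pipeline sketch, assuming Haystack 1.x module paths (`haystack.pipelines`, `haystack.nodes`, `haystack.document_stores`). The choice of `TextConverter` and `InMemoryDocumentStore`, the node names, and the helper `build_indexing_pipeline` are assumptions for illustration. The `PreProcessor` is placed directly after the file converter.

```python
def build_indexing_pipeline():
    """Wire a converter, a PreProcessor, and a document store together."""
    # Assumption: Haystack 1.x is installed (`pip install farm-haystack`).
    from haystack.document_stores import InMemoryDocumentStore
    from haystack.nodes import PreProcessor, TextConverter
    from haystack.pipelines import Pipeline

    text_converter = TextConverter()
    preprocessor = PreProcessor(split_by="word", split_length=100)
    document_store = InMemoryDocumentStore()

    pipeline = Pipeline()
    # "File" is the root node of an indexing pipeline.
    pipeline.add_node(component=text_converter, name="TextConverter", inputs=["File"])
    pipeline.add_node(component=preprocessor, name="PreProcessor", inputs=["TextConverter"])
    pipeline.add_node(component=document_store, name="DocumentStore", inputs=["PreProcessor"])
    return pipeline
```

The pipeline could then be run with something like `build_indexing_pipeline().run(file_paths=["my_file.txt"])`, where `my_file.txt` is a hypothetical path.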



## Document Format

When you are not using an indexing Pipeline, the `PreProcessor` can take either `Document` objects (recommended) or plain dictionaries as input. To learn more about the `Document` class, see [Documents, Answers, and Labels](🔗).
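
As a rough illustration of the two input formats, the sketch below shows a plain dictionary and its conversion into a `Document`. It assumes the Haystack 1.x field names (`content`, `meta`) and the `Document.from_dict` class method. The sample text, the metadata, and the helper name `to_document` are hypothetical.

```python
# A document expressed as a plain dictionary (assumed Haystack 1.x field names).
doc_as_dict = {
    "content": "Haystack splits long documents into smaller units.",
    "meta": {"name": "example.txt"},  # hypothetical metadata
}


def to_document(doc_dict: dict):
    """Convert a plain dictionary into a Document object (the recommended input)."""
    # Assumption: Haystack 1.x is installed (`pip install farm-haystack`).
    from haystack import Document

    return Document.from_dict(doc_dict)
```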