Splitting is generally recommended for long Documents as it makes the Retriever's job easier and speeds up Question Answering. For suggestions on how best to split your documents, see [Optimization](🔗).
| | |
|---|---|
|**Position in a Pipeline**|As early in an indexing Pipeline as possible, but after File Converters and Crawlers|
To start working with code examples, have a look at the [preprocessing tutorial](🔗). For ideas on what you can do at indexing time, see the [advanced indexing tutorial](🔗).
To initialize `PreProcessor`, set the following arguments:
|Argument|Type|Description|
|---|---|---|
|`clean_empty_lines`|bool|Normalizes three or more consecutive empty lines to just two empty lines.|
|`clean_whitespace`|bool|Removes any whitespace at the beginning or end of each line in the text.|
|`clean_header_footer`|bool|Removes any long header or footer texts that are repeated on each page.|
|`split_by`|string|Determines the unit the document is split by, such as `word`, `sentence`, or `passage`.|
|`split_length`|int|Sets the maximum number of `split_by` units in each output document.|
|`split_respect_sentence_boundary`|bool|Ensures that document boundaries do not fall in the middle of sentences.|
|`split_overlap`|int|Sets the amount of overlap between two adjacent documents after a split. A positive value enables a sliding window approach.|
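
To make the interaction between `split_length` and `split_overlap` concrete, here is a simplified, standard-library-only sketch of word-based sliding-window splitting. It illustrates the idea and is not Haystack's implementation: the function name `split_by_words` is hypothetical, and the sketch ignores cleaning and sentence boundaries.

```python
from typing import List


def split_by_words(text: str, split_length: int, split_overlap: int = 0) -> List[str]:
    """Split text into chunks of at most `split_length` words.

    Adjacent chunks share `split_overlap` words (a sliding window).
    """
    if split_length <= 0:
        raise ValueError("split_length must be positive")
    if not 0 <= split_overlap < split_length:
        raise ValueError("split_overlap must be >= 0 and < split_length")

    words = text.split()
    step = split_length - split_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        # Stop once the current window has reached the end of the text,
        # so no chunk consists only of already-seen overlap words.
        if start + split_length >= len(words):
            break
    return chunks


chunks = split_by_words("a b c d e f g", split_length=3, split_overlap=1)
print(chunks)  # ['a b c', 'c d e', 'e f g']
```

With `split_length=3` and `split_overlap=1`, the window advances two words at a time, so the last word of each chunk is repeated as the first word of the next.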
To run the `PreProcessor` by itself, run:
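
A minimal sketch of a standalone run, assuming Haystack 1.x (the `farm-haystack` package), where `PreProcessor` is importable from `haystack.nodes` and `Document` from `haystack`. The setting values and the helper name `preprocess_texts` are illustrative choices, not recommendations:

```python
# Assumption: Haystack 1.x (pip install farm-haystack).
# Illustrative settings only; tune them for your own data.
PREPROCESSOR_SETTINGS = {
    "clean_empty_lines": True,
    "clean_whitespace": True,
    "clean_header_footer": False,
    "split_by": "word",
    "split_length": 100,
    "split_respect_sentence_boundary": True,
    "split_overlap": 0,
}


def preprocess_texts(texts):
    """Clean and split raw strings, returning a list of Haystack Documents."""
    # Imported lazily so the settings above can be inspected
    # even where Haystack is not installed.
    from haystack import Document
    from haystack.nodes import PreProcessor

    processor = PreProcessor(**PREPROCESSOR_SETTINGS)
    documents = [Document(content=text) for text in texts]
    return processor.process(documents)
```

Calling `preprocess_texts(["some long text"])` should return the cleaned and split Documents, each holding at most `split_length` words.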
To use `PreProcessor` in a pipeline, run:
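
A sketch of an indexing pipeline, again assuming Haystack 1.x. The choice of `TextConverter` and `InMemoryDocumentStore`, the node names, and the helper `build_indexing_pipeline` are assumptions made for illustration; any File Converter and Document Store could take their place:

```python
# (node name, inputs) in the order the nodes are added.
# The PreProcessor sits directly after the File Converter.
NODE_WIRING = [
    ("TextConverter", ["File"]),
    ("PreProcessor", ["TextConverter"]),
    ("DocumentStore", ["PreProcessor"]),
]


def build_indexing_pipeline():
    """Build File -> TextConverter -> PreProcessor -> DocumentStore."""
    # Assumption: Haystack 1.x (pip install farm-haystack).
    from haystack import Pipeline
    from haystack.document_stores import InMemoryDocumentStore
    from haystack.nodes import PreProcessor, TextConverter

    components = {
        "TextConverter": TextConverter(),
        "PreProcessor": PreProcessor(split_by="word", split_length=100),
        "DocumentStore": InMemoryDocumentStore(),
    }
    pipeline = Pipeline()
    for name, inputs in NODE_WIRING:
        pipeline.add_node(component=components[name], name=name, inputs=inputs)
    return pipeline
```

You could then index a file with `build_indexing_pipeline().run(file_paths=["filename.txt"])`, where `filename.txt` is a placeholder for your own file.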
## Document Format
When you are not using an indexing Pipeline, the PreProcessor can take either `Document` objects (recommended) or plain dictionaries as input. To learn more about the `Document` class, see [Documents, Answers, and Labels](🔗).
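
The two input forms can be sketched as follows. This assumes Haystack 1.x, where a `Document` carries its text in `content` and its metadata in `meta`; the matching dictionary keys and both helper names are assumptions made for illustration:

```python
def as_dict_input(text, meta=None):
    """Plain-dictionary form (assumed keys: "content" and "meta")."""
    return {"content": text, "meta": dict(meta or {})}


def as_document_input(text, meta=None):
    """Recommended form: a Haystack Document object."""
    # Assumption: Haystack 1.x (pip install farm-haystack).
    from haystack import Document

    return Document(content=text, meta=dict(meta or {}))


print(as_dict_input("Some text.", {"name": "example.txt"}))
# {'content': 'Some text.', 'meta': {'name': 'example.txt'}}
```

Either form, or a list of them, can then be passed to the PreProcessor's `process()` method.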