PreProcessors

Use PreProcessors to prepare your data: normalize whitespace, remove headers and footers, clean empty lines in your Documents, or split them into smaller pieces. PreProcessors are useful in an indexing pipeline, where they prepare your files for search.

| PreProcessor | Description |
| --- | --- |
| DocumentCleaner | Removes extra whitespaces, empty lines, specified substrings, regexes, page headers, and footers from documents. |
| DocumentSplitter | Splits a list of text documents into a list of text documents with shorter texts. |
| NLTKDocumentSplitter | A more specialized version of DocumentSplitter that provides more control over sentence boundaries and language handling. |
| TextCleaner | Removes regexes, punctuation, and numbers, as well as converts text to lowercase. Useful to clean up text data before evaluation. |
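
To make the cleaning and splitting steps concrete, here is a minimal, library-free sketch of the two ideas in plain Python. It is not Haystack's implementation and does not use its API: the helpers `clean_text` and `split_by_words`, along with the parameters `split_length` and `split_overlap`, are hypothetical names chosen for illustration, and the real components may handle edge cases differently.

```python
import re
from typing import List


def clean_text(text: str) -> str:
    """Collapse runs of spaces/tabs and drop empty lines (illustrative only)."""
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)


def split_by_words(text: str, split_length: int, split_overlap: int = 0) -> List[str]:
    """Split text into chunks of at most `split_length` words.

    Consecutive chunks share `split_overlap` words. Both parameter names
    are assumptions made for this sketch.
    """
    if split_length <= 0 or not 0 <= split_overlap < split_length:
        raise ValueError("need split_length > 0 and 0 <= split_overlap < split_length")
    words = text.split()
    step = split_length - split_overlap
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        # Stop once the current chunk already reaches the end of the text.
        if start + split_length >= len(words):
            break
    return chunks


raw = "Annual  Report\n\n\nRevenue   grew in   2023.\n\nCosts stayed flat.  "
cleaned = clean_text(raw)
chunks = split_by_words(cleaned, split_length=4, split_overlap=1)
print(cleaned)
print(chunks)
```

In an indexing pipeline, cleaning typically runs before splitting, so that chunk boundaries are not skewed by stray whitespace or empty lines.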
