
Normalizes white spaces, gets rid of headers and footers, cleans empty lines in your Documents, and splits them into smaller pieces.

Module preprocessor

PreProcessor

class PreProcessor(BasePreProcessor)

PreProcessor.__init__

def __init__(clean_whitespace: bool = True,
             clean_header_footer: bool = False,
             clean_empty_lines: bool = True,
             remove_substrings: Optional[List[str]] = None,
             split_by: Optional[Literal["token", "word", "sentence", "passage",
                                        "page"]] = "word",
             split_length: int = 200,
             split_overlap: int = 0,
             split_respect_sentence_boundary: bool = True,
             tokenizer_model_folder: Optional[Union[str, Path]] = None,
             tokenizer: Optional[Union[str,
                                       PreTrainedTokenizerBase]] = "tiktoken",
             language: str = "en",
             id_hash_keys: Optional[List[str]] = None,
             progress_bar: bool = True,
             add_page_number: bool = False,
             max_chars_check: int = 10_000)

Arguments:

  • clean_header_footer: Use a heuristic to remove footers and headers across different pages by searching for the longest common string. This heuristic uses exact matches and therefore works well for footers like "Copyright 2019 by XXX", but won't detect "Page 3 of 4" or similar.
  • clean_whitespace: Strip whitespace before and after each line in the text.
  • clean_empty_lines: Remove more than two empty lines in the text.
  • remove_substrings: Remove specified substrings from the text. If no value is provided, an empty list is created by default.
  • split_by: Unit for splitting the document. Can be "token", "word", "sentence", "passage", or "page". Set to None to disable splitting.
  • split_length: Maximum number of the above split unit (e.g. words) allowed in one document. For instance, if split_length -> 10 & split_by -> "sentence", then each output document will have 10 sentences.
  • split_overlap: Word overlap between two adjacent documents after a split. Setting this to a positive number enables the sliding window approach. For example, if split_by -> "word", split_length -> 5 & split_overlap -> 2, then the splits would be: [w1 w2 w3 w4 w5, w4 w5 w6 w7 w8, w7 w8 w9 w10 w11]. Set the value to 0 to ensure there is no overlap among the documents after splitting.
  • split_respect_sentence_boundary: Whether to avoid splitting in the middle of sentences if split_by -> "word". If set to True, each individual split will always contain complete sentences & the number of words will be <= split_length.
  • tokenizer: Specifies the tokenizer to use if split_by="token". Supported options are "tiktoken" (for OpenAI's GPT-3.5 and GPT-4) and any HuggingFace tokenizer (e.g. 'bert-base-uncased'). HuggingFace tokenizers can also be passed directly as a PreTrainedTokenizerBase object.
  • language: The language used by "nltk.tokenize.sent_tokenize", in ISO 639 format. Available options: "ru", "sl", "es", "sv", "tr", "cs", "da", "nl", "en", "et", "fi", "fr", "de", "el", "it", "no", "pl", "pt", "ml"
  • tokenizer_model_folder: Path to the folder containing the NLTK PunktSentenceTokenizer models, if loading a model from a local path. Leave empty otherwise.
  • id_hash_keys: Generate the document id from a custom list of strings that refer to the document's attributes. If you want to ensure you don't have duplicate documents in your DocumentStore but texts are not unique, you can modify the metadata and pass e.g. "meta" to this field (e.g. ["content", "meta"]). In this case the id will be generated by using the content and the defined metadata.
  • progress_bar: Whether to show a progress bar.
  • add_page_number: Add the number of the page a paragraph occurs in to the Document's meta field "page". Page boundaries are determined by the "\f" character, which is added in between pages by PDFToTextConverter, TikaConverter, ParsrConverter and AzureConverter.
  • max_chars_check: The maximum length a document is expected to have. Each document that is longer than max_chars_check in characters after pre-processing will raise a warning and will be split at the max_chars_check-th character, regardless of any other constraint. If the resulting documents are still too long, they'll be cut again until all fragments are below the maximum allowed length.
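
For illustration, a minimal instantiation sketch follows. It assumes Haystack 1.x import paths (haystack.nodes) and simply mirrors the default values described above; adapt both to your setup.

from haystack.nodes import PreProcessor  # import path assumes Haystack 1.x

preprocessor = PreProcessor(
    clean_whitespace=True,                 # strip whitespace around each line
    clean_header_footer=False,             # skip the longest-common-string header/footer heuristic
    clean_empty_lines=True,                # collapse runs of more than two empty lines
    split_by="word",                       # split unit: "token", "word", "sentence", "passage" or "page"
    split_length=200,                      # max number of split units per output document
    split_overlap=0,                       # 0 = no sliding window between adjacent splits
    split_respect_sentence_boundary=True,  # never cut in the middle of a sentence
)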

PreProcessor.process

def process(documents: Union[dict, Document, List[Union[dict, Document]]],
            clean_whitespace: Optional[bool] = None,
            clean_header_footer: Optional[bool] = None,
            clean_empty_lines: Optional[bool] = None,
            remove_substrings: Optional[List[str]] = None,
            split_by: Optional[Literal["token", "word", "sentence", "passage",
                                       "page"]] = None,
            split_length: Optional[int] = None,
            split_overlap: Optional[int] = None,
            split_respect_sentence_boundary: Optional[bool] = None,
            tokenizer: Optional[Union[str, PreTrainedTokenizerBase]] = None,
            id_hash_keys: Optional[List[str]] = None) -> List[Document]

Perform document cleaning and splitting. Can take a single document or a list of documents as input and returns a list of documents.
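
A hedged usage sketch, assuming Haystack 1.x import paths (haystack and haystack.nodes) and placeholder texts:

from haystack import Document
from haystack.nodes import PreProcessor

docs = [
    Document(content="First long raw text ..."),   # placeholder content
    Document(content="Second long raw text ..."),  # placeholder content
]

preprocessor = PreProcessor(split_by="word", split_length=200, split_overlap=0)
split_docs = preprocessor.process(documents=docs)
print(len(split_docs))  # typically more documents than were passed in

The optional keyword arguments of process() mirror those of __init__() and, when provided, take precedence over the values set there for that call.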

PreProcessor.clean

def clean(document: Union[dict, Document],
          clean_whitespace: bool,
          clean_header_footer: bool,
          clean_empty_lines: bool,
          remove_substrings: Optional[List[str]] = None,
          id_hash_keys: Optional[List[str]] = None) -> Document

Perform document cleaning on a single document and return a single document. This method will deal with whitespace, headers, footers and empty lines. Its exact functionality is defined by the parameters passed into PreProcessor.__init__().
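
As a sketch (Haystack 1.x import paths assumed), cleaning a single Document without splitting it might look like this:

from haystack import Document
from haystack.nodes import PreProcessor

preprocessor = PreProcessor()
doc = Document(content="  Text with stray   whitespace\n\n\n\nand empty lines.  ")
cleaned = preprocessor.clean(
    document=doc,
    clean_whitespace=True,      # strip whitespace around each line
    clean_header_footer=False,  # skip the header/footer heuristic
    clean_empty_lines=True,     # collapse runs of empty lines
)
print(cleaned.content)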

PreProcessor.split

def split(document: Union[dict, Document],
          split_by: Optional[Literal["token", "word", "sentence", "passage",
                                     "page"]],
          split_length: int,
          split_overlap: int,
          split_respect_sentence_boundary: bool,
          tokenizer: Optional[Union[str, PreTrainedTokenizerBase]] = None,
          id_hash_keys: Optional[List[str]] = None) -> List[Document]

Perform document splitting on a single document. This method can split on different units, at different lengths, with different strides. It can also respect sentence boundaries. Its exact functionality is defined by the parameters passed into PreProcessor.__init__(). Takes a single document as input and returns a list of documents.
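
A hedged sketch of splitting a single Document into overlapping word-based chunks (Haystack 1.x import paths assumed, placeholder content):

from haystack import Document
from haystack.nodes import PreProcessor

preprocessor = PreProcessor()
doc = Document(content="A reasonably long sentence that repeats itself. " * 100)
splits = preprocessor.split(
    document=doc,
    split_by="word",                        # split on words
    split_length=50,                        # at most 50 words per output document
    split_overlap=10,                       # adjacent splits share 10 words (sliding window)
    split_respect_sentence_boundary=False,  # allow cuts in the middle of a sentence
)
print(len(splits))  # number of resulting documents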