Preprocess your Documents and texts. Clean, split, and more.
Module document_cleaner
DocumentCleaner
Cleans the text in the documents.
Cleans up text documents by removing extra whitespaces, empty lines, specified substrings, regexes, page headers and footers (in this order).
Usage example:
from haystack import Document
from haystack.components.preprocessors import DocumentCleaner

# The three consecutive newlines count as extra whitespace and collapse to one space.
doc = Document(content="This is a document to clean\n\n\nsubstring to remove")
cleaner = DocumentCleaner(remove_substrings=["substring to remove"])
result = cleaner.run(documents=[doc])

# Substrings are removed after the whitespace cleanup, so the trailing space remains.
assert result["documents"][0].content == "This is a document to clean "
DocumentCleaner.__init__
def __init__(remove_empty_lines: bool = True,
             remove_extra_whitespaces: bool = True,
             remove_repeated_substrings: bool = False,
             keep_id: bool = False,
             remove_substrings: Optional[List[str]] = None,
             remove_regex: Optional[str] = None)
Initialize the DocumentCleaner.
Arguments:
remove_empty_lines
: Whether to remove empty lines.
remove_extra_whitespaces
: Whether to remove extra whitespaces.
remove_repeated_substrings
: Whether to remove repeated substrings (headers/footers) from pages. Pages in the text need to be separated by the form feed character "\f", which is supported by TextFileToDocument and AzureOCRDocumentConverter.
remove_substrings
: List of substrings to remove from the text.
remove_regex
: Regex to match and replace substrings by "".
keep_id
: Whether to keep the IDs of the original documents.
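The cleaning steps and their order can be illustrated with a small standalone sketch. This is not the library's implementation: `clean_text_sketch` is a hypothetical helper written only with the standard library, and the exact whitespace rule (collapsing runs of two or more whitespace characters into one space) is an assumption chosen to match the usage example above. Header/footer removal is left out.

```python
import re
from typing import List, Optional


def clean_text_sketch(
    text: str,
    remove_extra_whitespaces: bool = True,
    remove_empty_lines: bool = True,
    remove_substrings: Optional[List[str]] = None,
    remove_regex: Optional[str] = None,
) -> str:
    """Hypothetical helper mirroring the documented order of cleaning steps."""
    if remove_extra_whitespaces:
        # Assumption: runs of 2+ whitespace characters become a single space.
        text = re.sub(r"\s\s+", " ", text).strip()
    if remove_empty_lines:
        # Keep only lines that contain something other than whitespace.
        text = "\n".join(line for line in text.split("\n") if line.strip())
    for substring in remove_substrings or []:
        text = text.replace(substring, "")
    if remove_regex:
        # Every regex match is replaced by the empty string.
        text = re.sub(remove_regex, "", text)
    return text


cleaned = clean_text_sketch(
    "This is a document to clean\n\n\nsubstring to remove",
    remove_substrings=["substring to remove"],
)
```

Because substrings are removed after the whitespace step, the space that replaced the newlines is still present at the end of `cleaned`.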
DocumentCleaner.run
@component.output_types(documents=List[Document])
def run(documents: List[Document])
Cleans up the documents.
Arguments:
documents
: List of Documents to clean.
Raises:
TypeError
: if documents is not a list of Documents.
Returns:
A dictionary with the following key:
documents
: List of cleaned Documents.
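The remove_repeated_substrings option of the cleaner depends on pages being separated by "\f". The sketch below shows why the separator matters, using a deliberately simplified rule: a first or last line is dropped only if it is identical on every page. The helper name `strip_repeated_edges` and this rule are assumptions for illustration; the component's actual header/footer detection may work differently.

```python
def strip_repeated_edges(text: str) -> str:
    """Drop a first or last line that repeats on every form-feed-separated page."""
    pages = [page.split("\n") for page in text.split("\f")]
    if len(pages) < 2:
        return text  # a single page has nothing to repeat against

    first_lines = {page[0] for page in pages}
    last_lines = {page[-1] for page in pages}
    # A header/footer candidate must be non-empty and identical on all pages.
    drop_header = len(first_lines) == 1 and first_lines != {""}
    drop_footer = len(last_lines) == 1 and last_lines != {""}

    cleaned_pages = []
    for lines in pages:
        if drop_header:
            lines = lines[1:]
        if drop_footer and lines:
            lines = lines[:-1]
        cleaned_pages.append("\n".join(lines))
    # Keep the page separators so later steps can still see page boundaries.
    return "\f".join(cleaned_pages)


paged = "ACME Report\nintro text\nConfidential\fACME Report\nmore text\nConfidential"
without_edges = strip_repeated_edges(paged)
```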
Module document_splitter
DocumentSplitter
Splits a list of text documents into a list of text documents with shorter texts.
Splitting documents with long texts is a common preprocessing step during indexing. It allows Embedders to create meaningful semantic representations and avoids exceeding the maximum context length of language models.
DocumentSplitter.__init__
def __init__(split_by: Literal["word", "sentence", "page", "passage"] = "word",
             split_length: int = 200,
             split_overlap: int = 0,
             split_threshold: int = 0)
Initialize the DocumentSplitter.
Arguments:
split_by
: The unit by which the document should be split. Choose from "word" for splitting by " ", "sentence" for splitting by ".", "page" for splitting by "\f" or "passage" for splitting by "\n\n".
split_length
: The maximum number of units in each split.
split_overlap
: The number of units that each split should overlap.
split_threshold
: The minimum number of units that the split should have. If the split has fewer units than the threshold, it will be attached to the previous split.
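How split_length, split_overlap, and split_threshold interact is easiest to see on a short list of units. The following is a minimal sketch under stated assumptions, not the component's code: `split_units`, `split_text`, and `SEPARATORS` are hypothetical names, the overlap is assumed to be smaller than the split length, and the separator is simply re-inserted when units are joined back into text.

```python
from typing import List

# Separators taken from the split_by description above.
SEPARATORS = {"word": " ", "sentence": ".", "page": "\f", "passage": "\n\n"}


def split_units(
    units: List[str],
    split_length: int,
    split_overlap: int = 0,
    split_threshold: int = 0,
) -> List[List[str]]:
    """Group units into windows of split_length that share split_overlap units."""
    if split_length <= 0:
        raise ValueError("split_length must be greater than 0")
    if not 0 <= split_overlap < split_length:
        raise ValueError("split_overlap must be >= 0 and < split_length")

    step = split_length - split_overlap
    chunks: List[List[str]] = []
    covered = 0  # index just past the last unit already placed in a chunk
    for start in range(0, len(units), step):
        end = min(start + split_length, len(units))
        if chunks and end - start < split_threshold:
            # Too short: attach only the not-yet-covered units to the previous chunk.
            chunks[-1].extend(units[covered:end])
        else:
            chunks.append(units[start:end])
        covered = end
        if end == len(units):
            break
    return chunks


def split_text(
    text: str,
    split_by: str = "word",
    split_length: int = 200,
    split_overlap: int = 0,
    split_threshold: int = 0,
) -> List[str]:
    """Split text on the separator for split_by and join each window back together."""
    separator = SEPARATORS[split_by]
    windows = split_units(text.split(separator), split_length, split_overlap, split_threshold)
    return [separator.join(window) for window in windows]


words = "one two three four five six seven".split()
plain = split_units(words, 3)
overlapping = split_units(words, 3, split_overlap=1)
merged_tail = split_units(words, 3, split_threshold=2)
```

With a threshold of 2, the single leftover word in `merged_tail` is attached to the previous split instead of forming its own.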
DocumentSplitter.run
@component.output_types(documents=List[Document])
def run(documents: List[Document])
Split documents into smaller parts.
Splits documents by the unit expressed in split_by, with a length of split_length and an overlap of split_overlap.
Arguments:
documents
: The documents to split.
Raises:
TypeError
: if the input is not a list of Documents.
ValueError
: if the content of a document is None.
Returns:
A dictionary with the following key:
documents
: List of documents with the split texts. A metadata field "source_id" is added to each document to keep track of the original document that was split. Another metadata field "page_number" is added to each document to keep track of the page it belonged to in the original document. Other metadata are copied from the original document.
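The metadata described above can be pictured with plain dictionaries. This sketch assumes the splits are consecutive, non-overlapping pieces of the original text that still contain their "\f" page breaks, and that pages are numbered from 1. `build_split_records` is a hypothetical helper, and the way the component actually derives page_number may differ.

```python
from typing import Any, Dict, List


def build_split_records(
    source_id: str,
    source_meta: Dict[str, Any],
    splits: List[str],
) -> List[Dict[str, Any]]:
    """Attach source_id and page_number to each split, copying the other metadata."""
    records = []
    page_number = 1  # assumption: pages are numbered starting at 1
    for split in splits:
        records.append(
            {
                "content": split,
                "meta": {**source_meta, "source_id": source_id, "page_number": page_number},
            }
        )
        # Every form feed inside this split moves later splits to a later page.
        page_number += split.count("\f")
    return records


records = build_split_records(
    "doc-1",
    {"author": "unknown"},
    ["page one text ", "continues\fpage two ", "ends here"],
)
page_numbers = [record["meta"]["page_number"] for record in records]
```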
Module text_cleaner
TextCleaner
A PreProcessor component to clean text data.
It can remove substrings matching a list of regular expressions, convert text to lowercase, remove punctuation, and remove numbers.
This is useful to clean up text data before evaluation.
TextCleaner.__init__
def __init__(remove_regexps: Optional[List[str]] = None,
             convert_to_lowercase: bool = False,
             remove_punctuation: bool = False,
             remove_numbers: bool = False)
Initialize the TextCleaner component.
Arguments:
remove_regexps
: A list of regular expressions. If provided, it removes substrings matching these regular expressions from the text.
convert_to_lowercase
: If True, converts all characters to lowercase.
remove_punctuation
: If True, removes punctuation from the text.
remove_numbers
: If True, removes numerical digits from the text.
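The four options can be approximated with the standard library, as in the sketch below. `clean_texts_sketch` is a hypothetical stand-in rather than the component itself. The order in which the steps are applied, and the use of `string.punctuation` and `string.digits` (ASCII only) to define punctuation and numbers, are assumptions made for the example.

```python
import re
import string
from typing import Dict, List, Optional


def clean_texts_sketch(
    texts: List[str],
    remove_regexps: Optional[List[str]] = None,
    convert_to_lowercase: bool = False,
    remove_punctuation: bool = False,
    remove_numbers: bool = False,
) -> Dict[str, List[str]]:
    """Apply the selected cleaning steps to every string and return them under "texts"."""
    patterns = [re.compile(regexp) for regexp in remove_regexps or []]
    # Translation tables that delete ASCII punctuation and ASCII digits.
    punctuation_table = str.maketrans("", "", string.punctuation)
    digit_table = str.maketrans("", "", string.digits)

    cleaned = []
    for text in texts:
        for pattern in patterns:
            text = pattern.sub("", text)
        if convert_to_lowercase:
            text = text.lower()
        if remove_punctuation:
            text = text.translate(punctuation_table)
        if remove_numbers:
            text = text.translate(digit_table)
        cleaned.append(text)
    return {"texts": cleaned}


normalized = clean_texts_sketch(
    ["Hello, World! 123"],
    convert_to_lowercase=True,
    remove_punctuation=True,
    remove_numbers=True,
)
```

Note that removing characters does not collapse the whitespace around them, so the space before the removed digits is still present in the result.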
TextCleaner.run
@component.output_types(texts=List[str])
def run(texts: List[str]) -> Dict[str, Any]
Cleans up the given list of strings.
Arguments:
texts
: List of strings to clean.
Returns:
A dictionary with the following key:
texts
: the cleaned list of strings.