Preprocess your Documents and texts. Clean, split, and more.
# Module document_cleaner

## DocumentCleaner

```python
@component
class DocumentCleaner()
```
Cleans up text documents by removing extra whitespaces, empty lines, specified substrings, regexes, page headers and footers (in this order).
Usage example:

```python
from haystack import Document
from haystack.components.preprocessors import DocumentCleaner

doc = Document(content="This is a document to clean\n\n\nsubstring to remove")

cleaner = DocumentCleaner(remove_substrings=["substring to remove"])
result = cleaner.run(documents=[doc])

assert result["documents"][0].content == "This is a document to clean "
```
#### `DocumentCleaner.__init__`

```python
def __init__(remove_empty_lines: bool = True,
             remove_extra_whitespaces: bool = True,
             remove_repeated_substrings: bool = False,
             remove_substrings: Optional[List[str]] = None,
             remove_regex: Optional[str] = None)
```
**Arguments**:

- `remove_empty_lines`: Whether to remove empty lines.
- `remove_extra_whitespaces`: Whether to remove extra whitespaces.
- `remove_repeated_substrings`: Whether to remove repeated substrings (headers and footers) from pages. Pages in the text need to be separated by the form feed character `"\f"`, which is supported by `TextFileToDocument` and `AzureOCRDocumentConverter`.
- `remove_substrings`: List of substrings to remove from the text.
- `remove_regex`: Regex to match and replace substrings with `""` (see the sketch below).
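A minimal sketch of `remove_regex`, following the same cleaning behavior as the usage example above; the document content and the regex are illustrative, not part of the API:

```python
from haystack import Document
from haystack.components.preprocessors import DocumentCleaner

# Illustrative document: the "Confidential <year>" footer matches the regex below.
doc = Document(content="The report body\n\n\nConfidential 2024")

cleaner = DocumentCleaner(remove_regex=r"Confidential \d{4}")
result = cleaner.run(documents=[doc])

# Whitespace is cleaned before the regex is applied (see the cleaning order above),
# so removing the match leaves a trailing space, as in the usage example.
assert result["documents"][0].content == "The report body "
```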
#### `DocumentCleaner.run`

```python
@component.output_types(documents=List[Document])
def run(documents: List[Document])
```
Cleans up the documents.
**Arguments**:

- `documents`: List of Documents to clean.

**Raises**:

- `TypeError`: If `documents` is not a list of Documents.

**Returns**:

A dictionary with the following key:
- `documents`: List of cleaned Documents.
# Module document_splitter

## DocumentSplitter

```python
@component
class DocumentSplitter()
```
Splits a list of text documents into a list of text documents with shorter texts.

Splitting documents with long texts is a common preprocessing step during indexing. It allows Embedders to create meaningful semantic representations and prevents exceeding the maximum context length of language models.
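Usage example, as a minimal sketch: the input text is illustrative, and the exact chunk contents in the assertion assume that splitting by `"word"` keeps the `" "` delimiter at the end of each unit and that overlapping windows advance by `split_length - split_overlap` units, which may vary across versions:

```python
from haystack import Document
from haystack.components.preprocessors import DocumentSplitter

doc = Document(content="one two three four five six")

splitter = DocumentSplitter(split_by="word", split_length=3, split_overlap=1)
result = splitter.run(documents=[doc])

# Three-word chunks, each sharing one word with the previous chunk.
assert [d.content for d in result["documents"]] == [
    "one two three ",
    "three four five ",
    "five six",
]
```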
#### `DocumentSplitter.__init__`

```python
def __init__(split_by: Literal["word", "sentence", "page", "passage"] = "word",
             split_length: int = 200,
             split_overlap: int = 0)
```
**Arguments**:

- `split_by`: The unit by which the document should be split. Choose `"word"` for splitting by `" "`, `"sentence"` for splitting by `"."`, `"page"` for splitting by `"\f"`, or `"passage"` for splitting by `"\n\n"`.
- `split_length`: The maximum number of units in each split.
- `split_overlap`: The number of units that each split should overlap.
#### `DocumentSplitter.run`

```python
@component.output_types(documents=List[Document])
def run(documents: List[Document])
```

Splits documents by the unit expressed in `split_by`, with a length of `split_length` and an overlap of `split_overlap`.
**Arguments**:

- `documents`: The documents to split.

**Raises**:

- `TypeError`: If the input is not a list of Documents.
- `ValueError`: If the content of a document is `None`.

**Returns**:

A dictionary with the following key:
- `documents`: List of documents with the split texts. A metadata field `"source_id"` is added to each document to keep track of the original document that was split (see the sketch below). Other metadata are copied from the original document.
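A short sketch of tracing splits back to their source via `source_id`; the document content is illustrative:

```python
from haystack import Document
from haystack.components.preprocessors import DocumentSplitter

doc = Document(content="First passage.\n\nSecond passage.")

splitter = DocumentSplitter(split_by="passage", split_length=1)
result = splitter.run(documents=[doc])

# Every split carries the id of the document it was cut from.
assert all(split.meta["source_id"] == doc.id for split in result["documents"])
```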
# Module text_cleaner

## TextCleaner

```python
@component
class TextCleaner()
```
A preprocessor component to clean text data. It can remove substrings matching a list of regular expressions, convert text to lowercase, remove punctuation, and remove numbers.

This is useful to clean up text data before evaluation.
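Usage example, as a minimal sketch; the input string is illustrative:

```python
from haystack.components.preprocessors import TextCleaner

cleaner = TextCleaner(convert_to_lowercase=True, remove_punctuation=True)
result = cleaner.run(texts=["Hello, World!"])

assert result["texts"] == ["hello world"]
```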
#### `TextCleaner.__init__`

```python
def __init__(remove_regexps: Optional[List[str]] = None,
             convert_to_lowercase: bool = False,
             remove_punctuation: bool = False,
             remove_numbers: bool = False)
```
**Arguments**:

- `remove_regexps`: A list of regular expressions. If provided, it removes substrings matching these regular expressions from the text.
- `convert_to_lowercase`: If True, converts all characters to lowercase.
- `remove_punctuation`: If True, removes punctuation from the text.
- `remove_numbers`: If True, removes numerical digits from the text.
#### `TextCleaner.run`

```python
@component.output_types(texts=List[str])
def run(texts: List[str]) -> Dict[str, Any]
```
Cleans up the given list of strings.
**Arguments**:

- `texts`: List of strings to clean.

**Returns**:

A dictionary with the following key:
- `texts`: The cleaned list of strings.