Preprocess your Documents and texts. Clean, split, and more.
Module document_cleaner
DocumentCleaner
Cleans the text in the documents.
It removes extra whitespaces, empty lines, specified substrings, regexes, page headers and footers (in this order).
Usage example:
from haystack import Document
from haystack.components.preprocessors import DocumentCleaner
doc = Document(content="This is a document to clean\n\n\nsubstring to remove")
cleaner = DocumentCleaner(remove_substrings=["substring to remove"])
result = cleaner.run(documents=[doc])
assert result["documents"][0].content == "This is a document to clean "
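The trailing space in the assertion above follows from the order of the cleaning steps. The sketch below is a simplified, standalone approximation of that order in plain Python. It is not the library's implementation: the helper name `clean_text` and the exact whitespace rule are assumptions made for illustration.

```python
import re
from typing import List, Optional


def clean_text(
    text: str,
    remove_substrings: Optional[List[str]] = None,
    remove_regex: Optional[str] = None,
) -> str:
    """Hypothetical helper that mimics the documented order of cleaning steps."""
    # 1. Extra whitespaces: collapse each run of two or more whitespace
    #    characters into one space (assumed rule), then trim both ends.
    text = re.sub(r"\s\s+", " ", text).strip()
    # 2. Empty lines: keep only lines that contain non-whitespace characters.
    text = "\n".join(line for line in text.split("\n") if line.strip())
    # 3. Specified substrings: delete every occurrence.
    for substring in remove_substrings or []:
        text = text.replace(substring, "")
    # 4. Regex: delete every match.
    if remove_regex:
        text = re.sub(remove_regex, "", text)
    return text


cleaned = clean_text(
    "This is a document to clean\n\n\nsubstring to remove",
    remove_substrings=["substring to remove"],
)
print(repr(cleaned))
```

Because the substring is removed only after the whitespace has been collapsed and trimmed, the single space that preceded it survives in the result.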
DocumentCleaner.__init__
def __init__(remove_empty_lines: bool = True,
remove_extra_whitespaces: bool = True,
remove_repeated_substrings: bool = False,
keep_id: bool = False,
remove_substrings: Optional[List[str]] = None,
remove_regex: Optional[str] = None,
unicode_normalization: Optional[Literal["NFC", "NFKC", "NFD",
"NFKD"]] = None,
ascii_only: bool = False)
Initialize DocumentCleaner.
Arguments:
remove_empty_lines: If True, removes empty lines.
remove_extra_whitespaces: If True, removes extra whitespaces.
remove_repeated_substrings: If True, removes repeated substrings (headers and footers) from pages. Pages must be separated by a form feed character "\f", which is supported by TextFileToDocument and AzureOCRDocumentConverter.
remove_substrings: List of substrings to remove from the text.
remove_regex: Regex to match and replace substrings by "".
keep_id: If True, keeps the IDs of the original documents.
unicode_normalization: Unicode normalization form to apply to the text. Note: This will run before any other steps.
ascii_only: Whether to convert the text to ASCII only. Will remove accents from characters and replace them with ASCII characters. Other non-ASCII characters will be removed. Note: This will run before any pattern matching or removal.
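To make the unicode_normalization and ascii_only options more concrete, here is a minimal sketch built on Python's standard unicodedata module. The helper names `normalize_unicode` and `to_ascii_only` are hypothetical, and the code only approximates the behavior described above; it is not taken from the component itself.

```python
import unicodedata


def normalize_unicode(text: str, form: str) -> str:
    # form is one of "NFC", "NFKC", "NFD", "NFKD".
    return unicodedata.normalize(form, text)


def to_ascii_only(text: str) -> str:
    # Decompose accented characters into a base character plus combining marks,
    # then drop every code point that cannot be encoded as ASCII.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


# The ligature U+FB01 is expanded to the two letters "f" and "i" under NFKC.
print(normalize_unicode("\ufb01ne", "NFKC"))
# Accents are stripped; characters with no ASCII counterpart are removed.
print(to_ascii_only("Crème brûlée"))
```

Running normalization before substring or regex removal means patterns only need to match one canonical spelling of each character.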
DocumentCleaner.run
@component.output_types(documents=List[Document])
def run(documents: List[Document])
Cleans up the documents.
Arguments:
documents: List of Documents to clean.
Raises:
TypeError: if documents is not a list of Documents.
Returns:
A dictionary with the following key:
documents: List of cleaned Documents.
Module document_splitter
DocumentSplitter
Splits long documents into smaller chunks.
This is a common preprocessing step during indexing. It helps Embedders create meaningful semantic representations and prevents exceeding language model context limits.
The DocumentSplitter is compatible with the following DocumentStores:
- [Astra](https://docs.haystack.deepset.ai/docs/astradocumentstore)
- [Chroma](https://docs.haystack.deepset.ai/docs/chromadocumentstore) (limited support: overlapping information is not stored)
- [Elasticsearch](https://docs.haystack.deepset.ai/docs/elasticsearch-document-store)
- [OpenSearch](https://docs.haystack.deepset.ai/docs/opensearch-document-store)
- [Pgvector](https://docs.haystack.deepset.ai/docs/pgvectordocumentstore)
- [Pinecone](https://docs.haystack.deepset.ai/docs/pinecone-document-store) (limited support: overlapping information is not stored)
- [Qdrant](https://docs.haystack.deepset.ai/docs/qdrant-document-store)
- [Weaviate](https://docs.haystack.deepset.ai/docs/weaviatedocumentstore)
Usage example:
from haystack import Document
from haystack.components.preprocessors import DocumentSplitter
doc = Document(content="Moonlight shimmered softly, wolves howled nearby, night enveloped everything.")
splitter = DocumentSplitter(split_by="word", split_length=3, split_overlap=0)
result = splitter.run(documents=[doc])
DocumentSplitter.__init__
def __init__(split_by: Literal["word", "sentence", "page", "passage"] = "word",
split_length: int = 200,
split_overlap: int = 0,
split_threshold: int = 0)
Initialize DocumentSplitter.
Arguments:
split_by: The unit for splitting your documents. Choose from "word" for splitting by spaces (" "), "sentence" for splitting by periods ("."), "page" for splitting by form feed ("\f"), or "passage" for splitting by double line breaks ("\n\n").
split_length: The maximum number of units in each split.
split_overlap: The number of overlapping units for each split.
split_threshold: The minimum number of units per split. If a split has fewer units than the threshold, it's attached to the previous split.
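The interaction between split_length, split_overlap, and split_threshold can be sketched with a sliding window over a list of units. The function `split_units` below is a hypothetical, standalone illustration in plain Python. It assumes that consecutive windows advance by split_length - split_overlap units and that a too-short final window is merged into the previous one; the component's actual handling of separators and edge cases may differ.

```python
from typing import List


def split_units(
    units: List[str],
    split_length: int,
    split_overlap: int = 0,
    split_threshold: int = 0,
) -> List[List[str]]:
    """Hypothetical sliding-window splitter over already separated units."""
    if split_length <= 0:
        raise ValueError("split_length must be greater than 0")
    if not 0 <= split_overlap < split_length:
        raise ValueError("split_overlap must be >= 0 and smaller than split_length")

    step = split_length - split_overlap
    chunks: List[List[str]] = []
    start = 0
    while start < len(units):
        chunks.append(units[start:start + split_length])
        # Stop once the current window reaches the end of the input.
        if start + split_length >= len(units):
            break
        start += step

    # Attach a final chunk below the threshold to the previous chunk,
    # skipping the units the two chunks already share.
    if len(chunks) > 1 and len(chunks[-1]) < split_threshold:
        short = chunks.pop()
        chunks[-1] = chunks[-1] + short[split_overlap:]
    return chunks


text = "Moonlight shimmered softly, wolves howled nearby, night enveloped everything."
for chunk in split_units(text.split(" "), split_length=3):
    print(" ".join(chunk))
```

With split_length=3 and no overlap, the nine words of the sample sentence fall into three chunks of three words each.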
DocumentSplitter.run
@component.output_types(documents=List[Document])
def run(documents: List[Document])
Split documents into smaller parts.
Splits documents by the unit expressed in split_by, with a length of split_length and an overlap of split_overlap.
Arguments:
documents: The documents to split.
Raises:
TypeError: if the input is not a list of Documents.
ValueError: if the content of a document is None.
Returns:
A dictionary with the following key:
documents: List of documents with the split texts. Each document includes:
- A metadata field source_id to track the original document.
- A metadata field page_number to track the original page number.
- All other metadata copied from the original document.
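The metadata attached to each split can be illustrated without the library. The sketch below uses plain dictionaries in place of Document objects, and the helper `split_with_metadata` is hypothetical. It assumes that page numbers start at 1 and increase by one for every form feed ("\f") found in the preceding text, which is one plausible way to derive page_number, not necessarily how the component computes it.

```python
from typing import Any, Dict, List


def split_with_metadata(
    doc_id: str, content: str, meta: Dict[str, Any], split_length: int
) -> List[Dict[str, Any]]:
    """Hypothetical word splitter that records source_id and page_number."""
    words = content.split(" ")
    chunks: List[Dict[str, Any]] = []
    page_number = 1
    for start in range(0, len(words), split_length):
        text = " ".join(words[start:start + split_length])
        chunks.append(
            {
                "content": text,
                # Copy the original metadata, then add the tracking fields.
                "meta": {**meta, "source_id": doc_id, "page_number": page_number},
            }
        )
        # Every form feed inside this chunk moves later chunks one page further.
        page_number += text.count("\f")
    return chunks


chunks = split_with_metadata(
    "doc-1", "one two\fthree four five six", {"author": "someone"}, split_length=2
)
for chunk in chunks:
    print(chunk["meta"])
```

Each resulting chunk keeps the original author field and gains source_id and page_number, so a retrieved chunk can be traced back to its document and page.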
Module text_cleaner
TextCleaner
Cleans text strings.
It can remove substrings matching a list of regular expressions, convert text to lowercase, remove punctuation, and remove numbers. Use it to clean up text data before evaluation.
Usage example:
from haystack.components.preprocessors import TextCleaner
text_to_clean = "1Moonlight shimmered softly, 300 Wolves howled nearby, Night enveloped everything."
cleaner = TextCleaner(convert_to_lowercase=True, remove_punctuation=False, remove_numbers=True)
result = cleaner.run(texts=[text_to_clean])
TextCleaner.__init__
def __init__(remove_regexps: Optional[List[str]] = None,
convert_to_lowercase: bool = False,
remove_punctuation: bool = False,
remove_numbers: bool = False)
Initializes the TextCleaner component.
Arguments:
remove_regexps: A list of regex patterns to remove matching substrings from the text.
convert_to_lowercase: If True, converts all characters to lowercase.
remove_punctuation: If True, removes punctuation from the text.
remove_numbers: If True, removes numerical digits from the text.
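As a rough illustration of what these four options do, the following standalone sketch applies them with Python's re and string modules. The function `clean_texts` is hypothetical, and both the order of the steps and the use of string.punctuation and string.digits are assumptions, so the component's own output may differ in details.

```python
import re
import string
from typing import List, Optional


def clean_texts(
    texts: List[str],
    remove_regexps: Optional[List[str]] = None,
    convert_to_lowercase: bool = False,
    remove_punctuation: bool = False,
    remove_numbers: bool = False,
) -> List[str]:
    """Hypothetical stand-in for the cleaning options described above."""
    patterns = [re.compile(pattern) for pattern in remove_regexps or []]
    punctuation_table = str.maketrans("", "", string.punctuation)
    digit_table = str.maketrans("", "", string.digits)

    cleaned = []
    for text in texts:
        for pattern in patterns:
            text = pattern.sub("", text)  # delete every regex match
        if convert_to_lowercase:
            text = text.lower()
        if remove_punctuation:
            text = text.translate(punctuation_table)  # drop ASCII punctuation
        if remove_numbers:
            text = text.translate(digit_table)  # drop the digits 0-9
        cleaned.append(text)
    return cleaned


sample = "1Moonlight shimmered softly, 300 Wolves howled nearby, Night enveloped everything."
result = clean_texts([sample], convert_to_lowercase=True, remove_numbers=True)
print(result[0])
```

In this sketch, removing "300" leaves the spaces on both sides of it in place, so the output contains a double space after "softly,".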
TextCleaner.run
@component.output_types(texts=List[str])
def run(texts: List[str]) -> Dict[str, Any]
Cleans up the given list of strings.
Arguments:
texts: List of strings to clean.
Returns:
A dictionary with the following key:
texts: The cleaned list of strings.