TextCleaner

Use TextCleaner to make text data more readable. It removes text that matches regular expressions, punctuation, and numbers, and it converts text to lowercase. This is especially useful for cleaning up text data before evaluation.

Name	TextCleaner
Folder Path	/preprocessors/
Position in a Pipeline	Between a Generator and an Evaluator
Inputs	"texts": a list of strings to be cleaned
Outputs	"texts": a list of cleaned texts

Overview

TextCleaner expects a list of strings as input and returns a list of cleaned strings. The selectable cleaning steps are convert_to_lowercase, remove_punctuation, and remove_numbers. These three parameters are booleans that you set when you initialize the component.

convert_to_lowercase converts all characters in texts to lowercase.
remove_punctuation removes all punctuation from the text.
remove_numbers removes all numerical digits from the text.

In addition, you can pass a list of regular expressions to the remove_regexps parameter, and any text that matches them is removed.
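
To make the effect of each option concrete, here is a minimal plain-Python sketch of the cleaning steps described above. It is an illustration only, not the component's actual implementation: the function name clean_texts, the order in which the steps run, and the use of string.punctuation and string.digits to define "punctuation" and "numbers" are all assumptions.

```python
import re
import string
from typing import List, Optional


def clean_texts(
    texts: List[str],
    remove_regexps: Optional[List[str]] = None,
    convert_to_lowercase: bool = False,
    remove_punctuation: bool = False,
    remove_numbers: bool = False,
) -> List[str]:
    """Hypothetical stand-in that mimics the cleaning steps described above."""
    # Join all patterns into one alternation so a single pass removes every match.
    pattern = (
        re.compile("|".join(f"(?:{p})" for p in remove_regexps))
        if remove_regexps
        else None
    )
    # Translation tables that delete ASCII punctuation and digits (an assumption).
    punctuation_table = str.maketrans("", "", string.punctuation)
    digit_table = str.maketrans("", "", string.digits)

    cleaned = []
    for text in texts:
        if pattern:
            text = pattern.sub("", text)
        if convert_to_lowercase:
            text = text.lower()
        if remove_punctuation:
            text = text.translate(punctuation_table)
        if remove_numbers:
            text = text.translate(digit_table)
        cleaned.append(text)
    return cleaned


print(
    clean_texts(
        ["Hello, World! 123"],
        convert_to_lowercase=True,
        remove_punctuation=True,
        remove_numbers=True,
    )
)
# → ['hello world ']
```

Note that this sketch does not trim whitespace, so the space that preceded the removed digits stays in the output.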

Usage

On its own

You can use it outside of a Pipeline to clean up any texts:

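from haystack.components.preprocessors import TextCleaner

# Configure which cleaning steps to apply
cleaner = TextCleaner(
    convert_to_lowercase=True,
    remove_punctuation=True,
    remove_numbers=False,
)

# run() takes a list of strings and returns a dict with the cleaned "texts"
result = cleaner.run(texts=["Hello, World!"])
print(result["texts"])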

In a Pipeline

In this example, TextCleaner runs after an ExtractiveReader and an OutputAdapter to remove punctuation from the extracted answers. Then, our custom-made ExactMatchEvaluator component checks whether the ground truth answer appears among the cleaned answers.

from typing import List
from haystack import component, Document, Pipeline
from haystack.components.converters import OutputAdapter
from haystack.components.preprocessors import TextCleaner
from haystack.components.readers import ExtractiveReader
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.document_stores.in_memory import InMemoryDocumentStore

document_store = InMemoryDocumentStore()
documents = [
    Document(content="There are over 7,000 languages spoken around the world today."),
    Document(content="Elephants have been observed to behave in a way that indicates a high level of self-awareness, such as recognizing themselves in mirrors."),
    Document(content="In certain parts of the world, like the Maldives, Puerto Rico, and San Diego, you can witness the phenomenon of bioluminescent waves."),
]
document_store.write_documents(documents=documents)

@component
class ExactMatchEvaluator:
    @component.output_types(score=int)
    def run(self, expected: str, provided: List[str]):
        return {"score": int(expected in provided)}

adapter = OutputAdapter(
    template="{{answers | extract_data}}",
    output_type=List[str],
    custom_filters={"extract_data": lambda data: [answer.data for answer in data if answer.data]}
)

p = Pipeline()
p.add_component("retriever", InMemoryBM25Retriever(document_store=document_store))
p.add_component("reader", ExtractiveReader())
p.add_component("adapter", adapter)
p.add_component("cleaner", TextCleaner(remove_punctuation=True))
p.add_component("evaluator", ExactMatchEvaluator())

p.connect("retriever", "reader")
p.connect("reader", "adapter")
p.connect("adapter", "cleaner.texts")
p.connect("cleaner", "evaluator.provided")

question = "What behavior indicates a high level of self-awareness of elephants?"
ground_truth_answer = "recognizing themselves in mirrors"

result = p.run({
    "retriever": {"query": question},
    "reader": {"query": question},
    "evaluator": {"expected": ground_truth_answer},
})
print(result)
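
The reason for cleaning before an exact-match check can be shown without running any model. The sketch below reuses the same membership test as ExactMatchEvaluator on a made-up answer list; raw_answers is illustrative input rather than real reader output, and strip_punctuation is a hypothetical helper that only approximates remove_punctuation=True.

```python
import string
from typing import List


def exact_match(expected: str, provided: List[str]) -> int:
    # Same membership check as ExactMatchEvaluator in the pipeline example.
    return int(expected in provided)


def strip_punctuation(texts: List[str]) -> List[str]:
    # Hypothetical helper: deletes ASCII punctuation from each string.
    table = str.maketrans("", "", string.punctuation)
    return [text.translate(table) for text in texts]


ground_truth = "recognizing themselves in mirrors"
# Made-up answer with a trailing period, standing in for reader output.
raw_answers = ["recognizing themselves in mirrors."]

print(exact_match(ground_truth, raw_answers))                     # → 0
print(exact_match(ground_truth, strip_punctuation(raw_answers)))  # → 1
```

A single trailing period is enough to make the strict comparison fail, which is why the cleaner sits between the adapter and the evaluator.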
