TextCleaner
Use TextCleaner to make text data more readable. It removes regex matches, punctuation, and numbers, and can convert text to lowercase. This is especially useful for cleaning up text data before evaluation.
Most common position in a pipeline | Between a Generator and an Evaluator |
Mandatory run variables | "texts": A list of strings to be cleaned |
Output variables | "texts": A list of cleaned texts |
API reference | PreProcessors |
GitHub link | https://github.com/deepset-ai/haystack/blob/main/haystack/components/preprocessors/text_cleaner.py |
Overview
TextCleaner expects a list of strings as input and returns a list of cleaned strings. The selectable cleaning steps are convert_to_lowercase, remove_punctuation, and remove_numbers. These three parameters are booleans that are set when the component is initialized.
convert_to_lowercase converts all characters in texts to lowercase.
remove_punctuation removes all punctuation from the text.
remove_numbers removes all numerical digits from the text.
In addition, you can specify a regular expression with the parameter remove_regexps, and any matches will be removed.
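The four options above can be approximated with the Python standard library. The sketch below is not the component's source code: the function name clean_texts is hypothetical, accepting remove_regexps as a list of patterns is an assumption, and so are the order of the steps and the use of the ASCII sets string.punctuation and string.digits.

```python
import re
import string
from typing import List, Optional


def clean_texts(
    texts: List[str],
    remove_regexps: Optional[List[str]] = None,
    convert_to_lowercase: bool = False,
    remove_punctuation: bool = False,
    remove_numbers: bool = False,
) -> List[str]:
    """Hypothetical stand-in that mimics the cleaning steps described above."""
    cleaned = list(texts)
    # Assumption: regex matches are removed first, one pattern at a time.
    for pattern in remove_regexps or []:
        cleaned = [re.sub(pattern, "", text) for text in cleaned]
    if convert_to_lowercase:
        cleaned = [text.lower() for text in cleaned]
    if remove_punctuation:
        # Delete every ASCII punctuation character.
        table = str.maketrans("", "", string.punctuation)
        cleaned = [text.translate(table) for text in cleaned]
    if remove_numbers:
        # Delete every ASCII digit.
        table = str.maketrans("", "", string.digits)
        cleaned = [text.translate(table) for text in cleaned]
    return cleaned


example = clean_texts(
    ["Answer #1: Elephants, 100%!"],
    remove_regexps=[r"#\d+"],
    convert_to_lowercase=True,
    remove_punctuation=True,
    remove_numbers=True,
)
print(example)
```

This sketch does not collapse whitespace, so removed characters can leave extra spaces behind in the result.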
Usage
On its own
You can use it outside of a pipeline to clean up any texts:
from haystack.components.preprocessors import TextCleaner
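cleaner = TextCleaner(
    convert_to_lowercase=True,
    remove_punctuation=True,
    remove_numbers=False,
)
# Pass the strings to clean as "texts"; the cleaned strings are returned under the same key.
result = cleaner.run(texts=["Hello, World!"])
print(result["texts"])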
In a pipeline
In this example, TextCleaner runs after an ExtractiveReader and an OutputAdapter to remove punctuation from the extracted answers. Then, our custom-made ExactMatchEvaluator component compares the cleaned answers to the ground truth answer.
from typing import List

from haystack import component, Document, Pipeline
from haystack.components.converters import OutputAdapter
from haystack.components.preprocessors import TextCleaner
from haystack.components.readers import ExtractiveReader
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.document_stores.in_memory import InMemoryDocumentStore

# Index a few documents for the retriever to search.
document_store = InMemoryDocumentStore()
documents = [
    Document(content="There are over 7,000 languages spoken around the world today."),
    Document(content="Elephants have been observed to behave in a way that indicates a high level of self-awareness, such as recognizing themselves in mirrors."),
    Document(content="In certain parts of the world, like the Maldives, Puerto Rico, and San Diego, you can witness the phenomenon of bioluminescent waves."),
]
document_store.write_documents(documents=documents)


@component
class ExactMatchEvaluator:
    @component.output_types(score=int)
    def run(self, expected: str, provided: List[str]):
        # Score 1 if the expected answer appears verbatim among the provided answers, else 0.
        return {"score": int(expected in provided)}


# Convert the reader's answers into a plain list of answer strings.
adapter = OutputAdapter(
    template="{{answers | extract_data}}",
    output_type=List[str],
    custom_filters={"extract_data": lambda data: [answer.data for answer in data if answer.data]},
)

p = Pipeline()
p.add_component("retriever", InMemoryBM25Retriever(document_store=document_store))
p.add_component("reader", ExtractiveReader())
p.add_component("adapter", adapter)
p.add_component("cleaner", TextCleaner(remove_punctuation=True))
p.add_component("evaluator", ExactMatchEvaluator())

p.connect("retriever", "reader")
p.connect("reader", "adapter")
p.connect("adapter", "cleaner.texts")
p.connect("cleaner", "evaluator.provided")

question = "What behavior indicates a high level of self-awareness of elephants?"
ground_truth_answer = "recognizing themselves in mirrors"

result = p.run(
    {
        "retriever": {"query": question},
        "reader": {"query": question},
        "evaluator": {"expected": ground_truth_answer},
    }
)
print(result)
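Why the cleaner sits in front of the evaluator can be illustrated without Haystack. The plain-Python sketch below uses hypothetical helper names and a made-up answer string with a trailing period. It mirrors the `expected in provided` check of ExactMatchEvaluator, but it is only an illustration and not part of the library.

```python
import string
from typing import List


def strip_punctuation(texts: List[str]) -> List[str]:
    """Hypothetical helper: delete ASCII punctuation from every string."""
    table = str.maketrans("", "", string.punctuation)
    return [text.translate(table) for text in texts]


def exact_match(expected: str, provided: List[str]) -> int:
    """Same membership check as ExactMatchEvaluator.run in the example above."""
    return int(expected in provided)


ground_truth = "recognizing themselves in mirrors"
# Made-up reader output with a trailing period, for illustration only.
raw_answers = ["recognizing themselves in mirrors."]

score_raw = exact_match(ground_truth, raw_answers)
score_clean = exact_match(ground_truth, strip_punctuation(raw_answers))
print(score_raw, score_clean)  # prints: 0 1
```

Because the evaluator tests list membership, a single stray character is enough to turn a correct answer into a miss, which is the kind of mismatch the cleaning step is meant to prevent.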