Presidio
haystack_integrations.components.extractors.presidio.presidio_entity_extractor
PresidioEntityExtractor
Detects PII entities in Haystack Documents using Microsoft Presidio Analyzer.
See Presidio Analyzer for details.
Accepts a list of Documents and returns new Documents with detected PII entities stored
in each Document's metadata under the key "entities". Each entry in the list contains
the entity type, start/end character offsets, and the confidence score.
Original Documents are not mutated. Documents without text content are passed through unchanged.
The analyzer engine is loaded on the first call to run(),
or by calling warm_up() explicitly beforehand.
Usage example
from haystack import Document
from haystack_integrations.components.extractors.presidio import PresidioEntityExtractor
extractor = PresidioEntityExtractor()
result = extractor.run(documents=[Document(content="Contact Alice at alice@example.com")])
print(result["documents"][0].meta["entities"])
# [{"entity_type": "PERSON", "start": 8, "end": 13, "score": 0.85},
# {"entity_type": "EMAIL_ADDRESS", "start": 17, "end": 34, "score": 1.0}]
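The start/end values can be used to recover the matched text from the Document's content. The following is a minimal sketch, assuming the offsets are half-open character indices as the example output above suggests; the data is hand-written to mirror that output, and matched_spans is a hypothetical helper, not part of the integration.

```python
# Illustrative data mirroring the shape of the example output above;
# these dicts are hand-written here, not produced by Presidio.
content = "Contact Alice at alice@example.com"
entities = [
    {"entity_type": "PERSON", "start": 8, "end": 13, "score": 0.85},
    {"entity_type": "EMAIL_ADDRESS", "start": 17, "end": 34, "score": 1.0},
]


def matched_spans(text, found):
    """Pair each entity type with the substring its offsets point at."""
    # Offsets are treated as half-open [start, end) character indices.
    return [(e["entity_type"], text[e["start"]:e["end"]]) for e in found]


spans = matched_spans(content, entities)
print(spans)
# [('PERSON', 'Alice'), ('EMAIL_ADDRESS', 'alice@example.com')]
```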
__init__
__init__(
*,
language: str = "en",
entities: list[str] | None = None,
score_threshold: float = 0.35
) -> None
Initializes the PresidioEntityExtractor.
Parameters:
- language (str) – Language code for PII detection. Defaults to "en". See Presidio supported languages.
- entities (list[str] | None) – List of PII entity types to detect (e.g. ["PERSON", "EMAIL_ADDRESS"]). If None, all supported entity types are detected. See Presidio supported entities.
- score_threshold (float) – Minimum confidence score (0-1) for a detected entity to be included. Defaults to 0.35. See Presidio analyzer documentation.
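score_threshold is applied when entities are detected. If some downstream step needs a stricter cutoff, the stored scores can also be filtered afterwards without re-running detection. The following is a rough sketch, assuming entries shaped like those in the usage example; keep_confident is a hypothetical helper, and its inclusive comparison is a choice made here, not a statement about how the component compares scores.

```python
def keep_confident(entities, min_score):
    """Hypothetical helper: keep entries whose score is at least min_score."""
    return [e for e in entities if e["score"] >= min_score]


# Hand-written sample in the shape stored under meta["entities"].
sample = [
    {"entity_type": "PERSON", "start": 8, "end": 13, "score": 0.85},
    {"entity_type": "EMAIL_ADDRESS", "start": 17, "end": 34, "score": 1.0},
]

strict = keep_confident(sample, 0.9)
print([e["entity_type"] for e in strict])
# ['EMAIL_ADDRESS']
```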
warm_up
Initializes the Presidio analyzer engine.
This method loads the underlying NLP models. In a Haystack Pipeline, this is called automatically before the first run.
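The load-on-first-use behavior described here follows a common lazy-initialization pattern. The sketch below illustrates that pattern with a stand-in class; LazyComponent, its load_count attribute, and the no-op behavior on repeated warm_up() calls are assumptions made for illustration and are not taken from the integration's source.

```python
class LazyComponent:
    """Hypothetical component that defers an expensive load until needed."""

    def __init__(self):
        self._engine = None  # nothing heavy is loaded at construction time
        self.load_count = 0

    def warm_up(self):
        # Assumed idempotent: repeated calls do not reload the engine.
        if self._engine is None:
            self._engine = object()  # stand-in for loading NLP models
            self.load_count += 1

    def run(self, texts):
        self.warm_up()  # the first run() triggers the load if warm_up() was skipped
        return {"texts": list(texts)}


component = LazyComponent()
component.run(["a"])
component.run(["b"])
print(component.load_count)
# 1
```

Calling warm_up() ahead of time moves the loading cost out of the first run() call, which can matter when that first call is latency-sensitive.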
run
Detects PII entities in the provided Documents.
Parameters:
- documents (list[Document]) – List of Documents to analyze for PII entities.
Returns:
dict[str, list[Document]] – A dictionary with key "documents" containing Documents with detected entities stored in metadata under the key "entities".
haystack_integrations.components.preprocessors.presidio.presidio_document_cleaner
PresidioDocumentCleaner
Anonymizes PII in Haystack Documents using Microsoft Presidio.
Accepts a list of Documents, detects personally identifiable information (PII) in their
text content, and returns new Documents with PII replaced by entity type placeholders
(e.g. <PERSON>, <EMAIL_ADDRESS>). Original Documents are not mutated.
Documents without text content are passed through unchanged.
The analyzer and anonymizer engines are loaded on the first call to run(),
or by calling warm_up() explicitly beforehand.
Usage example
from haystack import Document
from haystack_integrations.components.preprocessors.presidio import PresidioDocumentCleaner
cleaner = PresidioDocumentCleaner()
result = cleaner.run(documents=[Document(content="My name is John and my email is john@example.com")])
print(result["documents"][0].content)
# My name is <PERSON> and my email is <EMAIL_ADDRESS>
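To make the placeholder substitution concrete, the sketch below replaces character spans with their entity type in angle brackets. It is an illustration of the idea only, not Presidio's anonymizer: the spans are hand-written rather than detected, they are assumed not to overlap, and replace_with_placeholders is a hypothetical helper.

```python
def replace_with_placeholders(text, entities):
    """Replace each span with <ENTITY_TYPE>, working right to left."""
    # Going from the last span to the first keeps the earlier offsets valid
    # even though each replacement changes the string length.
    for e in sorted(entities, key=lambda e: e["start"], reverse=True):
        text = text[:e["start"]] + f"<{e['entity_type']}>" + text[e["end"]:]
    return text


content = "My name is John and my email is john@example.com"

# Hand-written spans for illustration; offsets are computed from the string.
name_at = content.index("John")
mail_at = content.index("john@example.com")
spans = [
    {"entity_type": "PERSON", "start": name_at, "end": name_at + len("John")},
    {
        "entity_type": "EMAIL_ADDRESS",
        "start": mail_at,
        "end": mail_at + len("john@example.com"),
    },
]

cleaned = replace_with_placeholders(content, spans)
print(cleaned)
# My name is <PERSON> and my email is <EMAIL_ADDRESS>
```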
__init__
__init__(
*,
language: str = "en",
entities: list[str] | None = None,
score_threshold: float = 0.35
) -> None
Initializes the PresidioDocumentCleaner.
Parameters:
- language (str) – Language code for PII detection. Defaults to "en". See Presidio supported languages.
- entities (list[str] | None) – List of PII entity types to detect and anonymize (e.g. ["PERSON", "EMAIL_ADDRESS"]). If None, all supported entity types are used. See Presidio supported entities.
- score_threshold (float) – Minimum confidence score (0-1) for a detected entity to be anonymized. Defaults to 0.35. See Presidio analyzer documentation.
warm_up
Initializes the Presidio analyzer and anonymizer engines.
This method loads the underlying NLP models. In a Haystack Pipeline, this is called automatically before the first run.
run
Anonymizes PII in the provided Documents.
Parameters:
- documents (list[Document]) – List of Documents whose text content will be anonymized.
Returns:
dict[str, list[Document]] – A dictionary with key "documents" containing the cleaned Documents.
haystack_integrations.components.preprocessors.presidio.presidio_text_cleaner
PresidioTextCleaner
Anonymizes PII in plain strings using Microsoft Presidio.
Accepts a list of strings, detects personally identifiable information (PII), and returns
a new list of strings with PII replaced by entity type placeholders (e.g. <PERSON>).
Useful for sanitizing user queries before they are sent to an LLM.
The analyzer and anonymizer engines are loaded on the first call to run(),
or by calling warm_up() explicitly beforehand.
Usage example
from haystack_integrations.components.preprocessors.presidio import PresidioTextCleaner
cleaner = PresidioTextCleaner()
result = cleaner.run(texts=["Hi, I am John Smith, call me at 212-555-1234"])
print(result["texts"][0])
# Hi, I am <PERSON>, call me at <PHONE_NUMBER>
__init__
__init__(
*,
language: str = "en",
entities: list[str] | None = None,
score_threshold: float = 0.35
) -> None
Initializes the PresidioTextCleaner.
Parameters:
- language (str) – Language code for PII detection. Defaults to "en". See Presidio supported languages.
- entities (list[str] | None) – List of PII entity types to detect and anonymize (e.g. ["PERSON", "PHONE_NUMBER"]). If None, all supported entity types are used. See Presidio supported entities.
- score_threshold (float) – Minimum confidence score (0-1) for a detected entity to be anonymized. Defaults to 0.35. See Presidio analyzer documentation.
warm_up
Initializes the Presidio analyzer and anonymizer engines.
This method loads the underlying NLP models. In a Haystack Pipeline, this is called automatically before the first run.
run
Anonymizes PII in the provided strings.
Parameters:
- texts (list[str]) – List of strings to anonymize.
Returns:
dict[str, list[str]] – A dictionary with key "texts" containing the cleaned strings.