Presidio
haystack_integrations.components.extractors.presidio.presidio_entity_extractor
PresidioEntityExtractor
Detects PII entities in Haystack Documents using Microsoft Presidio Analyzer.
See Presidio Analyzer for details.
Accepts a list of Documents and returns new Documents with detected PII entities stored
in each Document's metadata under the key "entities". Each entry in the list contains
the entity type, start/end character offsets, and the confidence score.
Original Documents are not mutated. Documents without text content are passed through unchanged.
The analyzer engine is loaded on the first call to run(),
or by calling warm_up() explicitly beforehand.
Usage example
from haystack import Document
from haystack_integrations.components.extractors.presidio import PresidioEntityExtractor
extractor = PresidioEntityExtractor()
result = extractor.run(documents=[Document(content="Contact Alice at alice@example.com")])
print(result["documents"][0].meta["entities"])
# [{"entity_type": "PERSON", "start": 8, "end": 13, "score": 0.85},
# {"entity_type": "EMAIL_ADDRESS", "start": 17, "end": 34, "score": 1.0}]
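Because each entry carries character offsets, the matched text can be recovered by slicing the Document content. The following is a minimal sketch, assuming entries have the shape shown in the usage example above. The entities list is copied from that example rather than produced by a live analyzer, and matched_spans is a hypothetical helper, not part of the integration.

```python
# Content and entities are copied from the usage example; no analyzer runs here.
content = "Contact Alice at alice@example.com"
entities = [
    {"entity_type": "PERSON", "start": 8, "end": 13, "score": 0.85},
    {"entity_type": "EMAIL_ADDRESS", "start": 17, "end": 34, "score": 1.0},
]


def matched_spans(text: str, found: list[dict]) -> list[tuple[str, str]]:
    """Hypothetical helper: pair each entity type with the text it covers."""
    # Offsets are treated as half-open [start, end) character indices.
    return [(e["entity_type"], text[e["start"]:e["end"]]) for e in found]


spans = matched_spans(content, entities)
print(spans)
```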
SPACY_DEFAULT_MODELS
Mapping from ISO 639-1 language code to the largest available spaCy model for that language.
Used to automatically select an NLP model when models is not specified.
See spaCy documentation for the full list of available spaCy models.
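The selection rule described above can be sketched as follows. This is an illustration only: EXAMPLE_DEFAULT_MODELS is a made-up excerpt (the real SPACY_DEFAULT_MODELS may contain different languages or model names), resolve_models is a hypothetical helper, and raising ValueError for an unmapped language is an assumption, since the reference does not state what the component does in that case.

```python
# Illustrative excerpt only; not the integration's actual mapping.
EXAMPLE_DEFAULT_MODELS = {
    "en": "en_core_web_lg",
    "de": "de_core_news_lg",
    "fr": "fr_core_news_lg",
}


def resolve_models(language, models=None):
    """Hypothetical helper mirroring the documented selection rule."""
    if models is not None:
        # An explicit override takes precedence over the built-in mapping.
        return models
    if language not in EXAMPLE_DEFAULT_MODELS:
        # Assumed behaviour: the reference does not describe this case.
        raise ValueError(
            f"No default spaCy model for {language!r}; pass models explicitly."
        )
    return [{"lang_code": language, "model_name": EXAMPLE_DEFAULT_MODELS[language]}]


print(resolve_models("de"))
```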
__init__
__init__(
*,
language: str = "en",
entities: list[str] | None = None,
score_threshold: float = 0.35,
models: list[dict[str, str]] | None = None
) -> None
Initializes the PresidioEntityExtractor.
Parameters:
- language (str) – ISO 639-1 language code for PII detection. Defaults to "en". For languages in the built-in mapping (e.g. "de", "fr", "es"), the appropriate spaCy model is loaded automatically at warm-up time, so there is no need to set models. For unsupported languages, use the models parameter to configure a custom model. See Presidio supported languages.
- entities (list[str] | None) – List of PII entity types to detect (e.g. ["PERSON", "EMAIL_ADDRESS"]). If None, all supported entity types are detected. See Presidio supported entities.
- score_threshold (float) – Minimum confidence score (0-1) for a detected entity to be included. Defaults to 0.35. See Presidio analyzer documentation.
- models (list[dict[str, str]] | None) – Advanced override: list of spaCy model configurations. Each entry must contain "lang_code" and "model_name" keys, e.g. [{"lang_code": "fr", "model_name": "fr_core_news_md"}]. Use this only when you need a specific model variant or a language not covered by the built-in mapping. If None, the model is selected automatically from SPACY_DEFAULT_MODELS based on language.
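The combined effect of entities and score_threshold on the results can be pictured as a filter over detected entries. This is a sketch of the effect only, not of how the component or Presidio applies these settings internally. The filter_entities helper and the sample detections are hypothetical, and the inclusive comparison (score >= threshold) is an assumption.

```python
def filter_entities(found, entities=None, score_threshold=0.35):
    """Hypothetical helper: keep entries that pass both settings."""
    # entities=None means "all entity types", as in the parameter description.
    allowed = None if entities is None else set(entities)
    return [
        e
        for e in found
        if e["score"] >= score_threshold  # assumed to be inclusive
        and (allowed is None or e["entity_type"] in allowed)
    ]


# Made-up detections for illustration.
sample = [
    {"entity_type": "PERSON", "start": 8, "end": 13, "score": 0.85},
    {"entity_type": "EMAIL_ADDRESS", "start": 17, "end": 34, "score": 1.0},
    {"entity_type": "URL", "start": 23, "end": 34, "score": 0.2},
]

kept_default = [e["entity_type"] for e in filter_entities(sample)]
kept_person = [e["entity_type"] for e in filter_entities(sample, entities=["PERSON"])]
print(kept_default, kept_person)
```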
warm_up
Initializes the Presidio analyzer engine.
This method loads the underlying NLP models. In a Haystack Pipeline, this is called automatically before the first run.
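The load-on-first-use behaviour can be sketched with a small stand-in class. LazyComponent is hypothetical and loads nothing real. It only shows the pattern in which run() triggers warm_up() when needed, and it assumes that repeated warm_up() calls are no-ops once the engine exists.

```python
class LazyComponent:
    """Hypothetical stand-in; not the integration's actual class."""

    def __init__(self):
        self._engine = None  # nothing expensive happens at construction time
        self.load_count = 0

    def warm_up(self):
        # Load only once; later calls return immediately (assumed behaviour).
        if self._engine is None:
            self._engine = object()  # placeholder for an expensive model load
            self.load_count += 1

    def run(self, texts):
        self.warm_up()  # first call to run() performs the load if needed
        return {"texts": list(texts)}


component = LazyComponent()
component.run(["first"])
component.run(["second"])
component.warm_up()
print(component.load_count)
```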
run
Detects PII entities in the provided Documents.
Parameters:
- documents (list[Document]) – List of Documents to analyze for PII entities.
Returns:
dict[str, list[Document]] – A dictionary with key documents containing Documents with detected entities stored in metadata under the key "entities".
haystack_integrations.components.preprocessors.presidio.presidio_document_cleaner
PresidioDocumentCleaner
Anonymizes PII in Haystack Documents using Microsoft Presidio.
Accepts a list of Documents, detects personally identifiable information (PII) in their
text content, and returns new Documents with PII replaced by entity type placeholders
(e.g. <PERSON>, <EMAIL_ADDRESS>). Original Documents are not mutated.
Documents without text content are passed through unchanged.
The analyzer and anonymizer engines are loaded on the first call to run(),
or by calling warm_up() explicitly beforehand.
Usage example
from haystack import Document
from haystack_integrations.components.preprocessors.presidio import PresidioDocumentCleaner
cleaner = PresidioDocumentCleaner()
result = cleaner.run(documents=[Document(content="My name is John and my email is john@example.com")])
print(result["documents"][0].content)
# My name is <PERSON> and my email is <EMAIL_ADDRESS>
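The placeholder substitution shown above can be sketched in plain Python, given a set of detected spans. This is not the Presidio anonymizer itself. The replace_with_placeholders helper is hypothetical, the spans are located with str.index rather than by an analyzer, and the sketch assumes the spans do not overlap.

```python
def replace_with_placeholders(text, found):
    """Hypothetical helper: substitute each span with <ENTITY_TYPE>."""
    # Work from the rightmost span leftwards so earlier offsets stay valid.
    # Assumes spans do not overlap.
    for e in sorted(found, key=lambda e: e["start"], reverse=True):
        text = text[: e["start"]] + f"<{e['entity_type']}>" + text[e["end"]:]
    return text


text = "My name is John and my email is john@example.com"
name_start = text.index("John")
email_start = text.index("john@example.com")
found = [
    {"entity_type": "PERSON", "start": name_start, "end": name_start + len("John")},
    {
        "entity_type": "EMAIL_ADDRESS",
        "start": email_start,
        "end": email_start + len("john@example.com"),
    },
]

cleaned = replace_with_placeholders(text, found)
print(cleaned)
```

Replacing right to left matters because each placeholder usually differs in length from the text it replaces, which would shift the offsets of any span further to the right.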
SPACY_DEFAULT_MODELS
Mapping from ISO 639-1 language code to the largest available spaCy model for that language.
Used to automatically select an NLP model when models is not specified.
See spaCy documentation for the full list of available spaCy models.
__init__
__init__(
*,
language: str = "en",
entities: list[str] | None = None,
score_threshold: float = 0.35,
models: list[dict[str, str]] | None = None
) -> None
Initializes the PresidioDocumentCleaner.
Parameters:
- language (str) – ISO 639-1 language code for PII detection. Defaults to "en". For languages in the built-in mapping (e.g. "de", "fr", "es"), the appropriate spaCy model is loaded automatically at warm-up time, so there is no need to set models. For unsupported languages, use the models parameter to configure a custom model. See Presidio supported languages.
- entities (list[str] | None) – List of PII entity types to detect and anonymize (e.g. ["PERSON", "EMAIL_ADDRESS"]). If None, all supported entity types are used. See Presidio supported entities.
- score_threshold (float) – Minimum confidence score (0-1) for a detected entity to be anonymized. Defaults to 0.35. See Presidio analyzer documentation.
- models (list[dict[str, str]] | None) – Advanced override: list of spaCy model configurations. Each entry must contain "lang_code" and "model_name" keys, e.g. [{"lang_code": "fr", "model_name": "fr_core_news_md"}]. Use this only when you need a specific model variant or a language not covered by the built-in mapping. If None, the model is selected automatically from SPACY_DEFAULT_MODELS based on language.
warm_up
Initializes the Presidio analyzer and anonymizer engines.
This method loads the underlying NLP models. In a Haystack Pipeline, this is called automatically before the first run.
run
Anonymizes PII in the provided Documents.
Parameters:
- documents (list[Document]) – List of Documents whose text content will be anonymized.
Returns:
dict[str, list[Document]] – A dictionary with key documents containing the cleaned Documents.
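The contract described for this component (new Documents are returned, originals are left untouched, and Documents without text pass through) can be sketched with a stand-in type. Doc and clean_documents are hypothetical. The sketch treats None or empty content as "no text" and assumes a passed-through Document is returned as the same object, neither of which the reference spells out.

```python
from dataclasses import dataclass, replace
from typing import Callable, Optional


@dataclass
class Doc:
    """Hypothetical stand-in for a Haystack Document, reduced to one field."""

    content: Optional[str] = None


def clean_documents(documents: list, clean_text: Callable[[str], str]) -> dict:
    """Hypothetical helper: return cleaned copies without touching the inputs."""
    out = []
    for doc in documents:
        if not doc.content:
            out.append(doc)  # no text content: passed through unchanged
        else:
            # dataclasses.replace builds a new object; the original is not modified.
            out.append(replace(doc, content=clean_text(doc.content)))
    return {"documents": out}


originals = [Doc("John"), Doc(None)]
result = clean_documents(originals, lambda s: "<PERSON>")
print(originals[0].content, result["documents"][0].content)
```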
haystack_integrations.components.preprocessors.presidio.presidio_text_cleaner
PresidioTextCleaner
Anonymizes PII in plain strings using Microsoft Presidio.
Accepts a list of strings, detects personally identifiable information (PII), and returns
a new list of strings with PII replaced by entity type placeholders (e.g. <PERSON>).
Useful for sanitizing user queries before they are sent to an LLM.
The analyzer and anonymizer engines are loaded on the first call to run(),
or by calling warm_up() explicitly beforehand.
Usage example
from haystack_integrations.components.preprocessors.presidio import PresidioTextCleaner
cleaner = PresidioTextCleaner()
result = cleaner.run(texts=["Hi, I am John Smith, call me at 212-555-1234"])
print(result["texts"][0])
# Hi, I am <PERSON>, call me at <PHONE_NUMBER>
SPACY_DEFAULT_MODELS
Mapping from ISO 639-1 language code to the largest available spaCy model for that language.
Used to automatically select an NLP model when models is not specified.
See spaCy documentation for the full list of available spaCy models.
__init__
__init__(
*,
language: str = "en",
entities: list[str] | None = None,
score_threshold: float = 0.35,
models: list[dict[str, str]] | None = None
) -> None
Initializes the PresidioTextCleaner.
Parameters:
- language (str) – ISO 639-1 language code for PII detection. Defaults to "en". For languages in the built-in mapping (e.g. "de", "fr", "es"), the appropriate spaCy model is loaded automatically at warm-up time, so there is no need to set models. For unsupported languages, use the models parameter to configure a custom model. See Presidio supported languages.
- entities (list[str] | None) – List of PII entity types to detect and anonymize (e.g. ["PERSON", "PHONE_NUMBER"]). If None, all supported entity types are used. See Presidio supported entities.
- score_threshold (float) – Minimum confidence score (0-1) for a detected entity to be anonymized. Defaults to 0.35. See Presidio analyzer documentation.
- models (list[dict[str, str]] | None) – Advanced override: list of spaCy model configurations. Each entry must contain "lang_code" and "model_name" keys, e.g. [{"lang_code": "fr", "model_name": "fr_core_news_md"}]. Use this only when you need a specific model variant or a language not covered by the built-in mapping. If None, the model is selected automatically from SPACY_DEFAULT_MODELS based on language.
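Since each models entry must contain both "lang_code" and "model_name", a small pre-flight check can catch malformed configuration before any model is loaded. The check_models helper below is hypothetical. The component's own validation, if it performs any, may behave differently.

```python
REQUIRED_KEYS = {"lang_code", "model_name"}


def check_models(models):
    """Hypothetical pre-flight check for the documented entry shape."""
    for i, entry in enumerate(models):
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            raise ValueError(f"models[{i}] is missing keys: {sorted(missing)}")
    return models


valid = check_models([{"lang_code": "fr", "model_name": "fr_core_news_md"}])
print(valid)
```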
warm_up
Initializes the Presidio analyzer and anonymizer engines.
This method loads the underlying NLP models. In a Haystack Pipeline, this is called automatically before the first run.
run
Anonymizes PII in the provided strings.
Parameters:
- texts (list[str]) – List of strings to anonymize.
Returns:
dict[str, list[str]] – A dictionary with key texts containing the cleaned strings.