PresidioEntityExtractor
PresidioEntityExtractor detects personally identifiable information (PII) in Documents and stores the detected entities as structured metadata under the "entities" key, without modifying the document text. Each entry contains the entity type, character offsets, and confidence score.
| | |
|---|---|
| Most common position in a pipeline | In an indexing pipeline, before writing Documents to a Document Store |
| Mandatory run variables | documents: A list of Document objects |
| Output variables | documents: A list of Document objects with PII metadata added |
| API reference | Presidio |
| GitHub link | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/presidio |
Overview
Microsoft Presidio is an open-source framework for PII detection and anonymization. PresidioEntityExtractor uses Presidio's Analyzer Engine to scan document text and identify entities such as names, email addresses, phone numbers, and more.
The extractor does not modify the document text. Instead, it adds the detected entities as structured metadata, letting you inspect or act on PII findings without altering the original content. This is useful when you want to audit what PII is present before deciding how to handle it, for example by routing documents to a review queue, logging PII findings, or conditionally applying anonymization.
If you want to replace PII directly rather than annotate it, see PresidioDocumentCleaner for Documents or PresidioTextCleaner for plain strings.
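The audit-then-decide workflow described above can be sketched as a small routing helper. This is a minimal sketch, not part of the integration: `split_by_pii` is a hypothetical name, and `SimpleNamespace` objects stand in for Haystack Documents that have already passed through the extractor, so the snippet runs without Presidio installed. The metadata is hand-written for illustration.

```python
from types import SimpleNamespace


def split_by_pii(documents):
    """Split documents into (clean, needs_review) based on meta["entities"]."""
    clean, needs_review = [], []
    for doc in documents:
        # A missing or empty "entities" list is treated as "no PII found".
        if doc.meta.get("entities"):
            needs_review.append(doc)
        else:
            clean.append(doc)
    return clean, needs_review


# Stand-ins for Documents with extractor metadata (illustrative values).
docs = [
    SimpleNamespace(
        content="Contact Alice",
        meta={"entities": [{"entity_type": "PERSON", "start": 8, "end": 13, "score": 0.85}]},
    ),
    SimpleNamespace(content="The build passed.", meta={"entities": []}),
]

clean, needs_review = split_by_pii(docs)
print(len(clean), len(needs_review))
```

Because the helper only reads `doc.meta`, the same function should work on real Document objects returned by the extractor.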
Configuration
| Parameter | Default | Description |
|---|---|---|
| language | "en" | ISO 639-1 language code for PII detection. The appropriate spaCy model is selected automatically for supported languages. See Presidio supported languages. |
| entities | None | List of PII entity types to detect (e.g. ["PERSON", "EMAIL_ADDRESS"]). If None, all supported types are detected. See supported entities. |
| score_threshold | 0.35 | Minimum confidence score (0 to 1) for a detected entity to be included. |
| models | None | Advanced override: explicit list of spaCy model configs, e.g. [{"lang_code": "fr", "model_name": "fr_core_news_md"}]. Use this only when you need a specific model variant or a language not in the built-in mapping. If None, the model is selected automatically based on language. |
Usage
Install the presidio-haystack package to use the PresidioEntityExtractor.
On its own
from haystack import Document
from haystack_integrations.components.extractors.presidio import PresidioEntityExtractor
extractor = PresidioEntityExtractor()
result = extractor.run(
    documents=[Document(content="Contact Alice at alice@example.com")],
)
print(result["documents"][0].meta["entities"])
# [{"entity_type": "PERSON", "start": 8, "end": 13, "score": 0.85},
# {"entity_type": "EMAIL_ADDRESS", "start": 17, "end": 34, "score": 1.0}]
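Since the document text is left untouched, the stored offsets can be used to recover the matched strings. The sketch below reuses the text and entity list from the example above as plain Python data, so it runs without the extractor. `entity_spans` is a hypothetical helper, and it assumes `end` is an exclusive offset, which is what the example output suggests.

```python
content = "Contact Alice at alice@example.com"
entities = [
    {"entity_type": "PERSON", "start": 8, "end": 13, "score": 0.85},
    {"entity_type": "EMAIL_ADDRESS", "start": 17, "end": 34, "score": 1.0},
]


def entity_spans(text, found):
    """Return (entity_type, matched_text) pairs using the character offsets."""
    return [(e["entity_type"], text[e["start"]:e["end"]]) for e in found]


spans = entity_spans(content, entities)
print(spans)
# [('PERSON', 'Alice'), ('EMAIL_ADDRESS', 'alice@example.com')]
```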
In a pipeline
from haystack import Document, Pipeline
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack_integrations.components.extractors.presidio import PresidioEntityExtractor
document_store = InMemoryDocumentStore()
indexing_pipeline = Pipeline()
indexing_pipeline.add_component("extractor", PresidioEntityExtractor())
indexing_pipeline.add_component("writer", DocumentWriter(document_store=document_store))
indexing_pipeline.connect("extractor", "writer")
indexing_pipeline.run(
    {
        "extractor": {
            "documents": [
                Document(content="Alice Smith's email is alice@example.com"),
                Document(content="Call Bob at 212-555-9876"),
            ],
        },
    },
)
# Documents are stored with detected PII in doc.meta["entities"]
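Once documents carry `meta["entities"]`, one way to log PII findings is to count entity types across a batch. The following is a rough sketch: `count_entity_types` is a hypothetical helper, and the sample entity lists are hand-written stand-ins (offsets and scores omitted), not actual extractor output for the documents above.

```python
from collections import Counter


def count_entity_types(entity_lists):
    """Count how often each entity type appears across many documents."""
    counts = Counter()
    for entities in entity_lists:
        counts.update(e["entity_type"] for e in entities)
    return counts


# One list per document, shaped like doc.meta["entities"] (illustrative only).
sample_entity_lists = [
    [{"entity_type": "PERSON"}, {"entity_type": "EMAIL_ADDRESS"}],
    [{"entity_type": "PERSON"}, {"entity_type": "PHONE_NUMBER"}],
]

counts = count_entity_types(sample_entity_lists)
print(counts.most_common())
```

With real data, the input could be built as `[doc.meta.get("entities", []) for doc in documents]`.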
Using custom parameters
Use entities to limit detection to the PII types you actually care about. This reduces false positives and improves performance by skipping recognizers you don't need.
Use score_threshold to tune the precision-recall tradeoff. The default 0.35 casts a wide net and may include some false positives. Raise it (e.g. 0.7) when you need high confidence in each detected entity; lower it when missing any PII is the bigger risk.
from haystack_integrations.components.extractors.presidio import PresidioEntityExtractor
extractor = PresidioEntityExtractor(
    language="de",
    entities=["PERSON", "EMAIL_ADDRESS"],  # only detect names and emails
    score_threshold=0.7,  # higher precision, fewer false positives
)
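To get a feel for the precision-recall tradeoff before committing to a threshold, one option is to extract with a low `score_threshold` and filter the stored entities afterwards. The sketch below works on plain entity dicts. `filter_by_score` is a hypothetical helper, the `>=` comparison is an assumption about how the threshold is applied, and the low-scoring LOCATION entry is made up to show an entity being dropped.

```python
def filter_by_score(entities, threshold):
    """Keep only entities whose confidence score meets the threshold."""
    return [e for e in entities if e["score"] >= threshold]


candidates = [
    {"entity_type": "PERSON", "start": 8, "end": 13, "score": 0.85},
    {"entity_type": "EMAIL_ADDRESS", "start": 17, "end": 34, "score": 1.0},
    # Made-up low-confidence hit, included only for illustration.
    {"entity_type": "LOCATION", "start": 0, "end": 7, "score": 0.4},
]

wide_net = filter_by_score(candidates, 0.35)
strict = filter_by_score(candidates, 0.7)
print(len(wide_net), len(strict))
# 3 2
```

Filtering after the fact can only tighten the threshold: entities below the value used at extraction time were never stored.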
Non-English languages
For any language in the built-in mapping, just set language; the right spaCy model is selected and loaded automatically at warm-up time.
from haystack import Document
from haystack_integrations.components.extractors.presidio import PresidioEntityExtractor
# No `models` parameter needed: de_core_news_lg is selected automatically
extractor = PresidioEntityExtractor(language="de")
result = extractor.run(
    documents=[Document(content="Kontaktieren Sie Hans Müller unter hans@example.com")],
)
Supported languages and their default models are listed in PresidioEntityExtractor.SPACY_DEFAULT_MODELS. Using a language not in that mapping without providing models raises a ValueError at warm-up time with a list of the supported language codes.
To use a non-default model variant, or a language outside the built-in mapping, pass models explicitly:
extractor = PresidioEntityExtractor(
    language="fr",
    models=[{"lang_code": "fr", "model_name": "fr_core_news_md"}],
)
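The model selection rules described in this section can be summarized in a few lines of plain Python. This is only an illustration of the documented behavior, not the component's actual code: `resolve_models` and `DEFAULT_MODELS` are hypothetical names, and apart from the `de` entry mentioned above, the mapping contents are an assumption. The real mapping is `PresidioEntityExtractor.SPACY_DEFAULT_MODELS`.

```python
# Hypothetical stand-in for PresidioEntityExtractor.SPACY_DEFAULT_MODELS.
# Only the "de" entry comes from this page; "en" is an assumption.
DEFAULT_MODELS = {"en": "en_core_web_lg", "de": "de_core_news_lg"}


def resolve_models(language, models=None):
    """Mimic the documented rules: explicit models win, else use the mapping."""
    if models is not None:
        return models
    if language not in DEFAULT_MODELS:
        supported = ", ".join(sorted(DEFAULT_MODELS))
        raise ValueError(
            f"No default model for language '{language}'. "
            f"Supported: {supported}. Pass `models` explicitly."
        )
    return [{"lang_code": language, "model_name": DEFAULT_MODELS[language]}]


print(resolve_models("de"))
print(resolve_models("fr", models=[{"lang_code": "fr", "model_name": "fr_core_news_md"}]))

try:
    resolve_models("xx")
except ValueError as err:
    print(f"ValueError: {err}")
```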