Version: 2.29-unstable

PresidioEntityExtractor

PresidioEntityExtractor detects personally identifiable information (PII) in Documents and stores the detected entities as structured metadata under the "entities" key, without modifying the document text. Each entry contains the entity type, character offsets, and confidence score.

Most common position in a pipeline | In an indexing pipeline, before writing Documents to a Document Store
Mandatory run variables | documents: A list of Document objects
Output variables | documents: A list of Document objects with PII metadata added
API reference | Presidio
GitHub link | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/presidio

Overview

Microsoft Presidio is an open-source framework for PII detection and anonymization. PresidioEntityExtractor uses Presidio's Analyzer Engine to scan document text and identify entities such as names, email addresses, phone numbers, and more.

The extractor does not modify the document text. Instead, it adds the detected entities as structured metadata, letting you inspect or act on PII findings without altering the original content. This is useful when you want to audit what PII is present before deciding how to handle it, for example by routing documents to a review queue, logging PII findings, or conditionally applying anonymization.

If you want to replace PII directly rather than annotate it, see PresidioDocumentCleaner for Documents or PresidioTextCleaner for plain strings.
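One way to act on the findings without touching the text is to branch on whether the "entities" list is empty. The following is a minimal sketch that uses plain dictionaries as stand-ins for Document objects. The `split_for_review` helper and the sample records are hypothetical; only the "entities" key and the fields of each entry come from the extractor's documented output.

```python
from typing import Any


def split_for_review(
    docs: list[dict[str, Any]],
) -> tuple[list[dict[str, Any]], list[dict[str, Any]]]:
    """Partition records into (needs_review, clean) based on detected entities."""
    needs_review, clean = [], []
    for doc in docs:
        # A missing "meta" or "entities" key is treated the same as "no PII found".
        entities = doc.get("meta", {}).get("entities", [])
        (needs_review if entities else clean).append(doc)
    return needs_review, clean


# Hand-written sample records, not real extractor output.
docs = [
    {
        "content": "Contact Alice at alice@example.com",
        "meta": {
            "entities": [
                {"entity_type": "PERSON", "start": 8, "end": 13, "score": 0.85},
            ],
        },
    },
    {"content": "The meeting is at noon.", "meta": {"entities": []}},
]

needs_review, clean = split_for_review(docs)
print(len(needs_review), len(clean))
```

With real Documents, the same check would read `doc.meta.get("entities", [])` instead of indexing into a dictionary.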

Configuration

Parameter | Default | Description
language | "en" | ISO 639-1 language code for PII detection. The appropriate spaCy model is selected automatically for supported languages. See Presidio supported languages.
entities | None | List of PII entity types to detect (e.g. ["PERSON", "EMAIL_ADDRESS"]). If None, all supported types are detected. See supported entities.
score_threshold | 0.35 | Minimum confidence score (0–1) for a detected entity to be included.
models | None | Advanced override: explicit list of spaCy model configs, e.g. [{"lang_code": "fr", "model_name": "fr_core_news_md"}]. Use this only when you need a specific model variant or a language not in the built-in mapping. If None, the model is selected automatically based on language.

Usage

Install the presidio-haystack package to use the PresidioEntityExtractor.

bash
pip install presidio-haystack

On its own

python
from haystack import Document
from haystack_integrations.components.extractors.presidio import PresidioEntityExtractor

extractor = PresidioEntityExtractor()
result = extractor.run(
    documents=[Document(content="Contact Alice at alice@example.com")],
)
print(result["documents"][0].meta["entities"])
# [{"entity_type": "PERSON", "start": 8, "end": 13, "score": 0.85},
# {"entity_type": "EMAIL_ADDRESS", "start": 17, "end": 34, "score": 1.0}]
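Because each entry carries character offsets, the matched text can be recovered by slicing the original content. The sketch below reuses the sample output shown above and assumes that "start" and "end" behave like Python slice indices (end exclusive), which is consistent with that sample. The `matched_spans` helper is a hypothetical name.

```python
content = "Contact Alice at alice@example.com"
entities = [
    {"entity_type": "PERSON", "start": 8, "end": 13, "score": 0.85},
    {"entity_type": "EMAIL_ADDRESS", "start": 17, "end": 34, "score": 1.0},
]


def matched_spans(text: str, entities: list[dict]) -> list[tuple[str, str]]:
    """Return (entity_type, matched_text) pairs by slicing text with each entity's offsets."""
    return [(e["entity_type"], text[e["start"]:e["end"]]) for e in entities]


spans = matched_spans(content, entities)
print(spans)
# [('PERSON', 'Alice'), ('EMAIL_ADDRESS', 'alice@example.com')]
```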

In a pipeline

python
from haystack import Document, Pipeline
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack_integrations.components.extractors.presidio import PresidioEntityExtractor

document_store = InMemoryDocumentStore()

indexing_pipeline = Pipeline()
indexing_pipeline.add_component("extractor", PresidioEntityExtractor())
indexing_pipeline.add_component("writer", DocumentWriter(document_store=document_store))
indexing_pipeline.connect("extractor", "writer")

indexing_pipeline.run(
    {
        "extractor": {
            "documents": [
                Document(content="Alice Smith's email is alice@example.com"),
                Document(content="Call Bob at 212-555-9876"),
            ],
        },
    },
)
# Documents are stored with detected PII in doc.meta["entities"]
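Once documents carry "entities" metadata, a simple audit is to count how often each entity type appears across the indexed set. The sketch below works on plain metadata dictionaries so that it runs without a Document Store. The `count_entity_types` helper is hypothetical, and the sample metadata is hand-written for illustration rather than real extractor output.

```python
from collections import Counter


def count_entity_types(metas: list[dict]) -> Counter:
    """Count detected entity types across a collection of document metadata dicts."""
    counts: Counter = Counter()
    for meta in metas:
        for entity in meta.get("entities", []):
            counts[entity["entity_type"]] += 1
    return counts


# Hand-written sample metadata (offsets and scores are illustrative only).
metas = [
    {
        "entities": [
            {"entity_type": "PERSON", "start": 0, "end": 11, "score": 0.85},
            {"entity_type": "EMAIL_ADDRESS", "start": 23, "end": 40, "score": 1.0},
        ],
    },
    {
        "entities": [
            {"entity_type": "PERSON", "start": 5, "end": 8, "score": 0.85},
            {"entity_type": "PHONE_NUMBER", "start": 12, "end": 24, "score": 0.75},
        ],
    },
]

counts = count_entity_types(metas)
print(counts.most_common())
```

With stored Documents, the input list could be built as `[doc.meta for doc in docs]` from whatever the Document Store returns.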

Using custom parameters

Use entities to limit detection to the PII types you actually care about. This reduces false positives and improves performance by skipping recognizers you don't need.

Use score_threshold to tune the precision-recall tradeoff. The default 0.35 casts a wide net and may include some false positives. Raise it (e.g. 0.7) when you need high confidence in each detected entity; lower it when missing any PII is the bigger risk.

python
from haystack_integrations.components.extractors.presidio import PresidioEntityExtractor

extractor = PresidioEntityExtractor(
    language="de",
    entities=["PERSON", "EMAIL_ADDRESS"],  # only detect names and emails
    score_threshold=0.7,  # higher precision, fewer false positives
)
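To get a feel for the precision-recall tradeoff, it can help to apply different thresholds to an already extracted entity list. The sketch below only illustrates the effect of the cutoff and is not the component's internal logic. The `filter_by_score` helper is hypothetical, the low-confidence LOCATION entry is invented for the example, and treating the boundary as inclusive (`>=`) is an assumption.

```python
def filter_by_score(entities: list[dict], threshold: float) -> list[dict]:
    """Keep only entities whose score is at least `threshold` (inclusive, by assumption)."""
    return [e for e in entities if e["score"] >= threshold]


entities = [
    {"entity_type": "PERSON", "start": 8, "end": 13, "score": 0.85},
    {"entity_type": "EMAIL_ADDRESS", "start": 17, "end": 34, "score": 1.0},
    # Hypothetical low-confidence hit, added to show what a stricter threshold drops.
    {"entity_type": "LOCATION", "start": 0, "end": 7, "score": 0.4},
]

wide_net = filter_by_score(entities, 0.35)  # the default threshold keeps all three
strict = filter_by_score(entities, 0.7)     # a stricter threshold drops the 0.4 hit

print([e["entity_type"] for e in wide_net])
print([e["entity_type"] for e in strict])
```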

Non-English languages

For any language in the built-in mapping, just set language: the right spaCy model is selected and loaded automatically at warm-up time.

python
from haystack import Document
from haystack_integrations.components.extractors.presidio import PresidioEntityExtractor

# No `models` parameter needed: de_core_news_lg is selected automatically
extractor = PresidioEntityExtractor(language="de")
result = extractor.run(
    documents=[Document(content="Kontaktieren Sie Hans Müller unter hans@example.com")],
)

Supported languages and their default models are listed in PresidioEntityExtractor.SPACY_DEFAULT_MODELS. Using a language not in that mapping without providing models raises a ValueError at warm-up time with a list of the supported language codes.
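The lookup described above can be pictured roughly as follows. This is a sketch under stated assumptions, not the integration's actual code: `DEFAULT_MODELS` is a stand-in for PresidioEntityExtractor.SPACY_DEFAULT_MODELS, its shape as a language-to-model-name dictionary is assumed, only the "de" entry is taken from this page, and `resolve_models` is a hypothetical name.

```python
from typing import Optional

# Stand-in mapping; the real one lives on PresidioEntityExtractor.SPACY_DEFAULT_MODELS.
# Its dict shape is an assumption, and only the "de" entry is taken from this page.
DEFAULT_MODELS = {"de": "de_core_news_lg"}


def resolve_models(
    language: str,
    models: Optional[list[dict]] = None,
    default_models: dict[str, str] = DEFAULT_MODELS,
) -> list[dict]:
    """Return the spaCy model configs to load for `language`."""
    if models is not None:
        # An explicit override always wins, even for languages outside the mapping.
        return models
    if language not in default_models:
        supported = ", ".join(sorted(default_models))
        raise ValueError(
            f"Unsupported language {language!r}. Supported codes: {supported}. "
            "Pass `models` explicitly to use another language."
        )
    return [{"lang_code": language, "model_name": default_models[language]}]


print(resolve_models("de"))
print(resolve_models("fr", models=[{"lang_code": "fr", "model_name": "fr_core_news_md"}]))
```

Checking the language code against the mapping before building a pipeline surfaces a configuration mistake earlier than waiting for the error at warm-up.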

To use a non-default model variant, or a language outside the built-in mapping, pass models explicitly:

python
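from haystack_integrations.components.extractors.presidio import PresidioEntityExtractor

# Explicit override: load fr_core_news_md for French instead of the default model
extractor = PresidioEntityExtractor(
    language="fr",
    models=[{"lang_code": "fr", "model_name": "fr_core_news_md"}],
)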