Version: 3.1-unstable

PresidioEntityExtractor

PresidioEntityExtractor detects personally identifiable information (PII) in Documents and stores the detected entities as structured metadata under the "entities" key, without modifying the document text. Each entry contains the entity type, character offsets, and confidence score.


Most common position in a pipeline	In an indexing pipeline, before writing Documents to a Document Store
Mandatory run variables	`documents`: A list of Document objects
Output variables	`documents`: A list of Document objects with PII metadata added
API reference	Presidio
GitHub link	https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/presidio
Package name	`presidio-haystack`

Overview

Microsoft Presidio is an open-source framework for PII detection and anonymization. PresidioEntityExtractor uses Presidio's Analyzer Engine to scan document text and identify entities such as names, email addresses, phone numbers, and more.

The extractor does not modify the document text. Instead, it adds the detected entities as structured metadata, letting you inspect or act on PII findings without altering the original content. This is useful when you want to audit what PII is present before deciding how to handle it — for example, routing documents to a review queue, logging PII findings, or conditionally applying anonymization.
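
As a rough illustration of the routing idea, the sketch below splits annotated documents into a review queue and a pass-through list, depending on whether `meta["entities"]` is non-empty. It uses plain dictionaries as stand-ins for Document objects so that it runs without any dependencies. The function name `split_by_pii` and the sample data are hypothetical and not part of the integration.

```python
def split_by_pii(docs: list[dict]) -> tuple[list[dict], list[dict]]:
    """Return (needs_review, clean) based on the "entities" metadata key."""
    needs_review, clean = [], []
    for doc in docs:
        # A missing or empty "entities" list is treated as "no PII found".
        entities = doc.get("meta", {}).get("entities", [])
        (needs_review if entities else clean).append(doc)
    return needs_review, clean


# Stand-in data shaped like the extractor's output (values are illustrative).
docs = [
    {
        "content": "Contact Alice at alice@example.com",
        "meta": {"entities": [{"entity_type": "PERSON", "start": 8, "end": 13, "score": 0.85}]},
    },
    {"content": "The meeting is at noon.", "meta": {"entities": []}},
]

needs_review, clean = split_by_pii(docs)
print(len(needs_review), len(clean))
```

With real Document objects, the same check would presumably read `doc.meta.get("entities", [])` instead of the nested dictionary lookup.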

If you want to replace PII directly rather than annotate it, see PresidioDocumentCleaner for Documents or PresidioTextCleaner for plain strings.

Configuration

Parameter	Default	Description
`language`	`"en"`	ISO 639-1 language code for PII detection. The appropriate spaCy model is selected automatically for supported languages. See Presidio supported languages.
`entities`	`None`	List of PII entity types to detect (e.g. `["PERSON", "EMAIL_ADDRESS"]`). If `None`, all supported types are detected. See supported entities.
`score_threshold`	`0.35`	Minimum confidence score (0–1) for a detected entity to be included.
`models`	`None`	Advanced override: explicit list of spaCy model configs, e.g. `[{"lang_code": "fr", "model_name": "fr_core_news_md"}]`. Use this only when you need a specific model variant or a language not in the built-in mapping. If `None`, the model is selected automatically based on `language`.

Usage

Install the `presidio-haystack` package to use PresidioEntityExtractor.

bash

pip install presidio-haystack

On its own

python

from haystack import Document
from haystack_integrations.components.extractors.presidio import PresidioEntityExtractor

extractor = PresidioEntityExtractor()
result = extractor.run(
    documents=[Document(content="Contact Alice at alice@example.com")],
)
print(result["documents"][0].meta["entities"])
# [{"entity_type": "PERSON", "start": 8, "end": 13, "score": 0.85},
#  {"entity_type": "EMAIL_ADDRESS", "start": 17, "end": 34, "score": 1.0}]
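
The offsets in each entry can be used to recover the matched text from the untouched document content. The snippet below is a minimal sketch that reuses the example output shown above and assumes `end` is exclusive, which is what the sample offsets imply.

```python
content = "Contact Alice at alice@example.com"
entities = [
    {"entity_type": "PERSON", "start": 8, "end": 13, "score": 0.85},
    {"entity_type": "EMAIL_ADDRESS", "start": 17, "end": 34, "score": 1.0},
]

# Slice the original text with each entity's character offsets.
matches = [(e["entity_type"], content[e["start"]:e["end"]]) for e in entities]
print(matches)
# [('PERSON', 'Alice'), ('EMAIL_ADDRESS', 'alice@example.com')]
```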

In a pipeline

python

from haystack import Document, Pipeline
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack_integrations.components.extractors.presidio import PresidioEntityExtractor

document_store = InMemoryDocumentStore()

indexing_pipeline = Pipeline()
indexing_pipeline.add_component("extractor", PresidioEntityExtractor())
indexing_pipeline.add_component("writer", DocumentWriter(document_store=document_store))
indexing_pipeline.connect("extractor", "writer")

indexing_pipeline.run(
    {
        "extractor": {
            "documents": [
                Document(content="Alice Smith's email is alice@example.com"),
                Document(content="Call Bob at 212-555-9876"),
            ],
        },
    },
)
# Documents are stored with detected PII in doc.meta["entities"]
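
Once the metadata is stored, it can be aggregated for logging or auditing. The following sketch counts entity types across a set of documents. It uses plain dictionaries in place of Documents read back from the store, and the entity lists are illustrative guesses (trimmed to the `entity_type` field) rather than captured output.

```python
from collections import Counter

# Stand-ins for stored documents; the entity lists are assumed, not real output.
stored_docs = [
    {
        "content": "Alice Smith's email is alice@example.com",
        "meta": {"entities": [{"entity_type": "PERSON"}, {"entity_type": "EMAIL_ADDRESS"}]},
    },
    {
        "content": "Call Bob at 212-555-9876",
        "meta": {"entities": [{"entity_type": "PERSON"}, {"entity_type": "PHONE_NUMBER"}]},
    },
]

# Count how often each PII type appears across all documents.
type_counts = Counter(
    entity["entity_type"]
    for doc in stored_docs
    for entity in doc["meta"].get("entities", [])
)
print(dict(type_counts))
```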

Using Custom Parameters

Use `entities` to limit detection to the PII types you actually care about. This reduces false positives and improves performance by skipping recognizers you don't need.

Use `score_threshold` to tune the precision-recall tradeoff. The default of `0.35` casts a wide net and may include some false positives. Raise it (e.g. `0.7`) when you need high confidence in each detected entity; lower it when missing any PII is the bigger risk.

python

from haystack_integrations.components.extractors.presidio import PresidioEntityExtractor

extractor = PresidioEntityExtractor(
    language="de",
    entities=["PERSON", "EMAIL_ADDRESS"],  # only detect names and emails
    score_threshold=0.7,  # higher precision, fewer false positives
)
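
To make the effect of the threshold concrete, the sketch below applies the same kind of cutoff to a list of entity entries after the fact. The scores are made up for illustration, `keep_confident` is a hypothetical helper, and the inclusive comparison is an assumption based on the "minimum confidence score" wording in the parameter table.

```python
def keep_confident(entities: list[dict], threshold: float) -> list[dict]:
    """Keep entries whose score is at least `threshold` (assumed inclusive)."""
    return [e for e in entities if e["score"] >= threshold]


# Illustrative entries; real scores depend on the recognizers involved.
entities = [
    {"entity_type": "PERSON", "start": 0, "end": 5, "score": 0.85},
    {"entity_type": "LOCATION", "start": 15, "end": 20, "score": 0.4},
    {"entity_type": "EMAIL_ADDRESS", "start": 24, "end": 41, "score": 1.0},
]

wide_net = keep_confident(entities, 0.35)  # keeps all three entries
strict = keep_confident(entities, 0.7)     # drops the 0.4 LOCATION entry
print([e["entity_type"] for e in strict])
# ['PERSON', 'EMAIL_ADDRESS']
```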

Non-English languages

For any language in the built-in mapping, just set `language`: the right spaCy model is selected and loaded automatically at warm-up time.

python

from haystack import Document
from haystack_integrations.components.extractors.presidio import PresidioEntityExtractor

# No `models` parameter needed — de_core_news_lg is selected automatically
extractor = PresidioEntityExtractor(language="de")
result = extractor.run(
    documents=[Document(content="Kontaktieren Sie Hans Müller unter hans@example.com")],
)

Supported languages and their default models are listed in `PresidioEntityExtractor.SPACY_DEFAULT_MODELS`. Using a language that is not in that mapping without providing `models` raises a `ValueError` at warm-up time, along with a list of the supported language codes.

To use a non-default model variant, or a language outside the built-in mapping, pass `models` explicitly:

python

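from haystack_integrations.components.extractors.presidio import PresidioEntityExtractor

# Explicit model config overrides the automatic selection for `language`
extractor = PresidioEntityExtractor(
    language="fr",
    models=[{"lang_code": "fr", "model_name": "fr_core_news_md"}],
)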
