Version: 2.31-unstable

SpacyNamedEntityExtractor

This component extracts predefined entities from text and writes them into each document's meta field.

| | |
| --- | --- |
| Most common position in a pipeline | After the PreProcessor in an indexing pipeline or after a Retriever in a query pipeline |
| Mandatory init variables | model: Name or path of the spaCy model to use |
| Mandatory run variables | documents: A list of documents |
| Output variables | documents: A list of documents |
| API reference | Spacy |
| GitHub link | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/spacy |
| Package name | spacy-haystack |

Overview

SpacyNamedEntityExtractor looks for entities, which are spans in the text. The extractor automatically recognizes and groups them depending on their class, such as people's names, organizations, locations, and other types. The exact classes are determined by the model that you initialize the component with.

SpacyNamedEntityExtractor takes a list of documents as input and returns the same documents with their metadata enriched with NamedEntityAnnotations. A NamedEntityAnnotation consists of the type of the entity, the start and end of the span in the text, and a score field that can be None, for example: NamedEntityAnnotation(entity='PERSON', start=11, end=16, score=None).
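Since start and end are character offsets into the document's content, slicing the content with them recovers the entity text. The sketch below assumes that end is exclusive, which is consistent with the sample annotations for the first sentence of the usage example on this page. `Span` is a hypothetical stand-in that mirrors the fields shown above so that the snippet runs without spaCy installed; it is not the library's class.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    """Hypothetical stand-in mirroring the fields of NamedEntityAnnotation."""

    entity: str
    start: int
    end: int
    score: Optional[float] = None


def entity_text(content: str, span: Span) -> str:
    """Return the substring covered by the span (end treated as exclusive)."""
    return content[span.start:span.end]


content = "My name is Clara and I live in Berkeley, California."
spans = [
    Span(entity="PERSON", start=11, end=16),
    Span(entity="GPE", start=31, end=39),
    Span(entity="GPE", start=41, end=51),
]

for span in spans:
    print(span.entity, repr(entity_text(content, span)))
```

With real documents, the same slice would be applied to each document's content using the offsets read from its stored annotations.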

When you initialize SpacyNamedEntityExtractor, you must set a model. Optionally, you can set pipeline_kwargs, which are passed on to the spaCy pipeline, and the device on which the component runs.

Usage

Install the spacy-haystack package to use the SpacyNamedEntityExtractor:

```shell
pip install spacy-haystack
```

The component works with any spaCy model that contains an NER component.

SpacyNamedEntityExtractor accepts a list of Documents as its input. The extractor annotates the raw text of each document and stores the annotations in that document's meta dictionary under the named_entities key.

```python
from haystack.dataclasses import Document
from haystack_integrations.components.extractors.spacy import (
    SpacyNamedEntityExtractor,
)

extractor = SpacyNamedEntityExtractor(model="en_core_web_sm")

documents = [
    Document(content="My name is Clara and I live in Berkeley, California."),
    Document(content="I'm Merlin, the happy pig!"),
    Document(content="New York State is home to the Empire State Building."),
]

result = extractor.run(documents)
print(result["documents"])
```

Here is the example result:

```python
[Document(id=aec840d1b6c85609f4f16c3e222a5a25fd8c4c53bd981a40c1268ab9c72cee10, content: 'My name is Clara and I live in Berkeley, California.', meta: {'named_entities': [NamedEntityAnnotation(entity='PERSON', start=11, end=16, score=None), NamedEntityAnnotation(entity='GPE', start=31, end=39, score=None), NamedEntityAnnotation(entity='GPE', start=41, end=51, score=None)]}),
Document(id=98f1dc5d0ccd9d9950cd191d1076db0f7af40c401dd7608f11c90cb3fc38c0c2, content: 'I'm Merlin, the happy pig!', meta: {'named_entities': [NamedEntityAnnotation(entity='PERSON', start=4, end=10, score=None)]}),
Document(id=44948ea0eec018b33aceaaedde4616eb9e93ce075e0090ec1613fc145f84b4a9, content: 'New York State is home to the Empire State Building.', meta: {'named_entities': [NamedEntityAnnotation(entity='GPE', start=0, end=14, score=None), NamedEntityAnnotation(entity='ORG', start=26, end=51, score=None)]})]
```
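A common next step is to collect the extracted entity text per entity type. The following is a minimal sketch under stated assumptions: the sample data is copied by hand from the example result above and represented as plain `(entity, start, end)` tuples rather than real NamedEntityAnnotation objects, and `group_by_type` is a hypothetical helper, not part of the package.

```python
from collections import defaultdict

# Sample data copied from the example result; tuples are (entity, start, end).
docs = [
    (
        "My name is Clara and I live in Berkeley, California.",
        [("PERSON", 11, 16), ("GPE", 31, 39), ("GPE", 41, 51)],
    ),
    ("I'm Merlin, the happy pig!", [("PERSON", 4, 10)]),
    (
        "New York State is home to the Empire State Building.",
        [("GPE", 0, 14), ("ORG", 26, 51)],
    ),
]


def group_by_type(docs):
    """Map each entity type to the list of text spans found for it."""
    grouped = defaultdict(list)
    for content, annotations in docs:
        for entity, start, end in annotations:
            grouped[entity].append(content[start:end])
    return dict(grouped)


grouped = group_by_type(docs)
print(grouped)
```

Note that the ORG span in the sample starts at offset 26, so the sliced text includes the leading article ("the Empire State Building").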

Get stored annotations

This component includes the get_stored_annotations helper class method, which retrieves the annotations stored in a Document so that you do not have to read the meta dictionary yourself:

```python
from haystack.dataclasses import Document
from haystack_integrations.components.extractors.spacy import (
    SpacyNamedEntityExtractor,
)

extractor = SpacyNamedEntityExtractor(model="en_core_web_sm")

documents = [
    Document(content="My name is Clara and I live in Berkeley, California."),
    Document(content="I'm Merlin, the happy pig!"),
    Document(content="New York State is home to the Empire State Building."),
]

result = extractor.run(documents)

annotations = [
    SpacyNamedEntityExtractor.get_stored_annotations(doc) for doc in result["documents"]
]
print(annotations)

# If a Document doesn't contain any annotations, this returns None.
new_doc = Document(content="In one of many possible worlds...")
assert SpacyNamedEntityExtractor.get_stored_annotations(new_doc) is None
```
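Because the helper returns None for documents without annotations, code that loops over the result usually needs a fallback. The sketch below illustrates one way to handle this. `stored_annotations` is a hypothetical stand-in that only mimics the None-returning behavior described above on plain meta dictionaries; it is not the library's implementation.

```python
from typing import Any, Dict, List, Optional


def stored_annotations(meta: Dict[str, Any]) -> Optional[List[Any]]:
    """Hypothetical stand-in: return the stored annotations, or None if absent."""
    return meta.get("named_entities")


def count_entities(metas: List[Dict[str, Any]]) -> int:
    """Count annotations across documents, treating None as no annotations."""
    total = 0
    for meta in metas:
        annotations = stored_annotations(meta) or []  # None becomes an empty list
        total += len(annotations)
    return total


metas = [
    {"named_entities": [("PERSON", 4, 10)]},  # annotated document
    {},  # document that never passed through the extractor
]
print(count_entities(metas))
```

The `or []` fallback keeps the loop free of explicit None checks; an explicit `is None` test works equally well if you need to distinguish unannotated documents from documents with zero entities.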