Spacy
haystack_integrations.components.extractors.spacy.named_entity_extractor
NamedEntityAnnotation
Describes a single NER annotation.
Parameters:
- entity (str) – Entity label.
- start (int) – Start index of the entity in the document.
- end (int) – End index of the entity in the document.
- score (float | None) – Score calculated by the model.
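The start and end fields can be used to recover the entity text from the document content. The following is a minimal sketch under stated assumptions: AnnotationSketch is a hypothetical stand-in that only mirrors the fields documented above (it is not the library class), the labels are written by hand rather than produced by a model, and end is assumed to be an exclusive character offset, as in spaCy's own character spans.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AnnotationSketch:
    """Hypothetical stand-in mirroring the documented fields of NamedEntityAnnotation."""

    entity: str
    start: int
    end: int
    score: Optional[float] = None


def annotate(text: str, surface: str, label: str) -> AnnotationSketch:
    """Build an annotation by locating `surface` in `text` (illustrative helper)."""
    start = text.index(surface)
    return AnnotationSketch(entity=label, start=start, end=start + len(surface))


text = "My name is Clara and I live in Berkeley, California."

# Hand-written annotations; a real model would supply these.
annotations = [
    annotate(text, "Clara", "PERSON"),
    annotate(text, "Berkeley", "GPE"),
]

# Assumption: `end` is exclusive, so a plain slice recovers the entity text.
spans = [(a.entity, text[a.start:a.end]) for a in annotations]
print(spans)
```

If the offsets turn out to be inclusive in your version, the slice would need `a.end + 1` instead; checking one known entity against its text is a quick way to confirm.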
SpacyNamedEntityExtractor
Annotates named entities in a collection of documents.
The component can be used with any spaCy model that contains an NER component. Annotations are stored as metadata in the documents.
Usage example:
```python
from haystack import Document
from haystack_integrations.components.extractors.spacy import SpacyNamedEntityExtractor

documents = [
    Document(content="I'm Merlin, the happy pig!"),
    Document(content="My name is Clara and I live in Berkeley, California."),
]

# Assumes the en_core_web_sm model is installed
# (for example via `python -m spacy download en_core_web_sm`).
extractor = SpacyNamedEntityExtractor(model="en_core_web_sm")
results = extractor.run(documents=documents)["documents"]

# Read the annotations back from each document's metadata.
annotations = [SpacyNamedEntityExtractor.get_stored_annotations(doc) for doc in results]
print(annotations)
```
__init__
```python
__init__(
    *,
    model: str,
    pipeline_kwargs: dict[str, Any] | None = None,
    device: ComponentDevice | None = None
) -> None
```
Create a Named Entity extractor component.
Parameters:
- model (str) – Name of the spaCy model or a path to the model on the local disk.
- pipeline_kwargs (dict[str, Any] | None) – Keyword arguments passed to the pipeline. The pipeline can override these arguments.
- device (ComponentDevice | None) – The device on which the model is loaded. If None, the default device is automatically selected.
Raises:
ValueError – If the device represents multiple devices, which the spaCy backend does not support.
warm_up
Initialize the component.
Raises:
ComponentError – If the component fails to initialize successfully.
run
Annotate named entities in each document and store the annotations in the document's metadata.
Parameters:
- documents (list[Document]) – Documents to process.
- batch_size (int) – Batch size used for processing the documents.
Returns:
dict[str, Any] – Dictionary containing the processed documents under the "documents" key.
Raises:
ComponentError – If the model fails to process a document.
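To make the role of batch_size concrete, the sketch below groups texts into fixed-size batches using only the standard library. It illustrates batching in general and is not the component's internal implementation; the helper name iter_batches and the sample texts are hypothetical.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def iter_batches(items: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive lists of at most `batch_size` items (illustrative helper)."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


texts = [
    "I'm Merlin, the happy pig!",
    "My name is Clara and I live in Berkeley, California.",
    "Third document.",
    "Fourth document.",
    "Fifth document.",
]

# Five texts with a batch size of 2 give batches of 2, 2, and 1 items.
batches = list(iter_batches(texts, batch_size=2))
print([len(batch) for batch in batches])
```

A larger batch size generally means fewer passes through the model at the cost of more memory per pass, so the best value depends on document length and the available device.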
to_dict
Serializes the component to a dictionary.
Returns:
dict[str, Any] – Dictionary with serialized data.
from_dict
Deserializes the component from a dictionary.
Parameters:
- data (dict[str, Any]) – Dictionary to deserialize from.
Returns:
SpacyNamedEntityExtractor – Deserialized component.
initialized
Returns whether the extractor is ready to annotate text.
get_stored_annotations
```python
get_stored_annotations(
    document: Document,
) -> list[NamedEntityAnnotation] | None
```
Returns the document's named entity annotations stored in its metadata, if any.
Parameters:
- document (Document) – Document whose annotations are to be fetched.
Returns:
list[NamedEntityAnnotation] | None – The stored annotations, or None if the document has none.
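Because the return value may be None, downstream code should handle that case before iterating. The sketch below groups entity texts by label and treats None as "no annotations". It rests on assumptions: AnnotationSketch is a hypothetical stand-in mirroring the documented NamedEntityAnnotation fields, group_by_label is an illustrative helper, the sample annotations are hand-written, and end is assumed to be an exclusive character offset. With the real component, the text would be the document's content and the annotations would come from get_stored_annotations.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class AnnotationSketch:
    """Hypothetical stand-in mirroring the documented fields of NamedEntityAnnotation."""

    entity: str
    start: int
    end: int
    score: Optional[float] = None


def group_by_label(
    text: str, annotations: Optional[List[AnnotationSketch]]
) -> Dict[str, List[str]]:
    """Map each entity label to the entity texts found under it."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    # `annotations or []` covers both None and an empty list.
    for ann in annotations or []:
        # Assumption: `end` is an exclusive character offset.
        grouped[ann.entity].append(text[ann.start:ann.end])
    return dict(grouped)


text = "Clara lives in Berkeley, California."
stored = [
    AnnotationSketch(entity="PERSON", start=0, end=5),
    AnnotationSketch(entity="GPE", start=15, end=23),
    AnnotationSketch(entity="GPE", start=25, end=35),
]

by_label = group_by_label(text, stored)
nothing_stored = group_by_label(text, None)
print(by_label)
print(nothing_stored)
```

Returning an empty dictionary for the None case keeps callers free of special-casing; a caller that needs to distinguish "never processed" from "processed, no entities" should check for None explicitly instead.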