EntityExtractor
The EntityExtractor node extracts predefined entities out of a piece of text. Its most common use case is named entity extraction (NER) but you can also use it, for example, to label each word by part of speech.
Usage
To initialize EntityExtractor, run the following code. You can provide any token classification model from the Hugging Face model hub as a model name.
from haystack.nodes import EntityExtractor
entity_extractor = EntityExtractor(model_name_or_path="dslim/bert-base-NER")
To run EntityExtractor as a stand-alone node, use the extract() method:
entities = entity_extractor.extract(
text="Berlin is the capital and largest city of Germany by both area and population. "
"Its 3.8 million inhabitants make it the European Union's most populous city, "
"according to population within city limits."
)
The return format is a list of dictionaries, each containing the entity string, the group that the entity belongs to, the start and end character indices, and the model's confidence score. Note that multi-token entities such as European Union are merged into a single entry even though the model makes its predictions at the token level.
[{'end': 6,
'entity_group': 'LOC',
'score': 0.9997844,
'start': 0,
'word': 'Berlin'},
{'end': 49,
'entity_group': 'LOC',
'score': 0.999785,
'start': 42,
'word': 'Germany'},
{'end': 133,
'entity_group': 'ORG',
'score': 0.9991541,
'start': 119,
'word': 'European Union'}]
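As a quick check of this format, the following sketch works directly on the returned list. The text and the entities are copied from the example above so that the snippet runs without loading a model. The grouping step is only an illustration and is not part of Haystack.

```python
from collections import defaultdict

text = (
    "Berlin is the capital and largest city of Germany by both area and population. "
    "Its 3.8 million inhabitants make it the European Union's most populous city, "
    "according to population within city limits."
)

# Output of extract() as shown above, copied here so no model is needed.
entities = [
    {"end": 6, "entity_group": "LOC", "score": 0.9997844, "start": 0, "word": "Berlin"},
    {"end": 49, "entity_group": "LOC", "score": 0.999785, "start": 42, "word": "Germany"},
    {"end": 133, "entity_group": "ORG", "score": 0.9991541, "start": 119, "word": "European Union"},
]

# "start" and "end" are character offsets into the input text,
# so slicing the text with them recovers each entity string.
spans = [text[e["start"]:e["end"]] for e in entities]

# Collect the entity strings under their predicted group.
by_group = defaultdict(list)
for e in entities:
    by_group[e["entity_group"]].append(e["word"])

print(spans)
print(dict(by_group))
```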
To run EntityExtractor in a query pipeline, run the following code. Entity extraction runs only on the retrieved Documents, and the extracted entities populate Document.meta.
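from haystack.pipelines import Pipeline

# es_retriever and reader are assumed to be initialized elsewhere;
# entity_extractor is the EntityExtractor created above.
pipeline = Pipeline()
pipeline.add_node(component=es_retriever, name="ESRetriever", inputs=["Query"])
pipeline.add_node(component=entity_extractor, name="NER", inputs=["ESRetriever"])
pipeline.add_node(component=reader, name="Reader", inputs=["NER"])
result = pipeline.run(query="What is Berlin?")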
To run EntityExtractor in an indexing pipeline, run the following code. All Documents in the DocumentStore have extracted entities as part of their metadata.
Tip
Ensure that the EntityExtractor comes after the PreProcessor, because the PreProcessor splits Documents, which breaks the alignment with the start and end character indices that the EntityExtractor returns.
pipeline = Pipeline()
pipeline.add_node(text_converter, "TextConverter", ["File"])
pipeline.add_node(preprocessor, "PreProcessor", ["TextConverter"])
pipeline.add_node(entity_extractor, "EntityExtractor", ["PreProcessor"])
pipeline.add_node(document_store, "DocumentStore", ["EntityExtractor"])
pipeline.run(file_paths=file_paths)
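The ordering advice in the tip can be illustrated without any model. In this minimal sketch, a plain string search stands in for the extractor and a hand-written split at the sentence boundary stands in for the PreProcessor. Both are assumptions made purely for illustration.

```python
full_text = "Berlin is the capital of Germany. The European Union has 27 member states."
target = "European Union"

# Offsets computed on the unsplit text (stand-in for extracting before splitting).
start = full_text.find(target)
end = start + len(target)

# Stand-in for the PreProcessor: split into two chunks at the sentence boundary.
boundary = full_text.find(". ") + 1
chunks = [full_text[:boundary], full_text[boundary + 1:]]

# The old offsets no longer point at the entity inside the second chunk.
stale_span = chunks[1][start:end]

# Offsets computed after splitting line up with the chunk they belong to.
new_start = chunks[1].find(target)
fresh_span = chunks[1][new_start:new_start + len(target)]

print(repr(stale_span), repr(fresh_span))
```

Running extraction after the split, as in the pipeline above, means the stored indices always refer to the text of the Document that is actually written to the DocumentStore.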
Use Cases
Entity Relation
You can use EntityExtractor in conjunction with question answering as a kind of entity relation extraction. After extracting all entities within a text, you can form a question that represents a relation using an entity and a template. For example, after extracting all person names from a set of Documents, you can then ask the question, "Who is the father of [PERS]?". In the most naive implementation, any other entities that occur in the answer to this question can be considered part of the relational triple.
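The template step can be sketched as follows. The entity list is hand-written sample data in the format returned by extract(), and the PER label is an assumption about the model's tag set (the template above writes the placeholder as [PERS]). The resulting questions would then be passed to a question answering pipeline.

```python
# Hand-written sample entities in the extract() output format (not real model output).
entities = [
    {"entity_group": "PER", "word": "Marie Curie", "start": 0, "end": 11, "score": 0.99},
    {"entity_group": "LOC", "word": "Paris", "start": 25, "end": 30, "score": 0.99},
    {"entity_group": "PER", "word": "Pierre Curie", "start": 40, "end": 52, "score": 0.98},
    {"entity_group": "PER", "word": "Marie Curie", "start": 60, "end": 71, "score": 0.99},
]

template = "Who is the father of {person}?"

# Keep person names only, dropping duplicates while preserving first-seen order.
people = list(dict.fromkeys(e["word"] for e in entities if e["entity_group"] == "PER"))

# One question per distinct person.
questions = [template.format(person=name) for name in people]

for question in questions:
    print(question)
```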
Tip
If you are using the EntityExtractor in a question answering pipeline, check out our utility function haystack.extractor.entities.simplify_ner_for_qa, which prints out the predicted QA answers together with the NER entities they contain.
Metadata Filtering
Performing entity extraction can be a powerful way to generate metadata by which your Documents can be filtered.
Note
Make sure to set the option flatten_entities_in_meta_data to True so that all entity information predicted for a Document is converted into five flattened lists, accessible by the keys entity_groups, entity_scores, entity_words, entity_starts, and entity_ends in the Document's metadata.
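To make the flattened layout concrete, here is a sketch of how a list of entity dictionaries maps onto the five keys named in the note, and how such metadata could be used to select Documents. The flatten_entities helper, the document IDs, and the sample metadata are hypothetical. They mirror the described shape and do not reproduce Haystack's internal code.

```python
def flatten_entities(entities):
    """Turn a list of entity dicts into five parallel lists (illustrative helper)."""
    return {
        "entity_groups": [e["entity_group"] for e in entities],
        "entity_scores": [e["score"] for e in entities],
        "entity_words": [e["word"] for e in entities],
        "entity_starts": [e["start"] for e in entities],
        "entity_ends": [e["end"] for e in entities],
    }

# Two entities taken from the extract() example earlier on this page.
entities = [
    {"end": 6, "entity_group": "LOC", "score": 0.9997844, "start": 0, "word": "Berlin"},
    {"end": 49, "entity_group": "LOC", "score": 0.999785, "start": 42, "word": "Germany"},
]

# Hypothetical metadata for two documents, keyed by made-up document IDs.
docs_meta = {
    "doc_a": flatten_entities(entities),
    "doc_b": flatten_entities([]),
}

# Select the documents whose metadata mentions a given entity string.
mentions_germany = [
    doc_id for doc_id, meta in docs_meta.items() if "Germany" in meta["entity_words"]
]

print(docs_meta["doc_a"]["entity_words"], mentions_germany)
```

Because the five lists are parallel, the i-th element of each list describes the same entity.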