
Extracts predefined entities out of a piece of text.

Module entity

Acknowledgements: Many of the postprocessing parts here come from the great transformers repository: https://github.com/huggingface/transformers. Thanks for the great work!

EntityExtractor

class EntityExtractor(BaseComponent)

This node is used to extract entities out of documents.

The most common use case for this would be as a named entity extractor. The default model used is elastic/distilbert-base-cased-finetuned-conll03-english. This node can be placed in a querying pipeline to perform entity extraction on retrieved documents only, or it can be placed in an indexing pipeline so that all documents in the document store have extracted entities. This node automatically splits up long Documents based on the maximum token length of the underlying model and aggregates the predictions of each split to predict the final set of entities for each Document. The entities extracted by this node populate Document.meta.entities.

Arguments:

  • model_name_or_path: The name of the model to use for entity extraction.
  • model_version: The version of the model to use for entity extraction.
  • use_gpu: Whether to use the GPU or not.
  • progress_bar: Whether to show a progress bar or not.
  • batch_size: The batch size to use for entity extraction.
  • use_auth_token: The API token used to download private models from Hugging Face. If this parameter is set to True, the token generated when running transformers-cli login (stored in ~/.huggingface) is used. Additional information can be found here: https://huggingface.co/transformers/main_classes/model.html#transformers.PreTrainedModel.from_pretrained
  • devices: List of torch devices (e.g. cuda, cpu, mps) to limit inference to specific devices. A list containing torch device objects and/or strings is supported (for example [torch.device('cuda:0'), "mps", "cuda:1"]). When use_gpu=False is specified, the devices parameter is not used and a single CPU device is used for inference.
  • aggregation_strategy: The strategy used to fuse (or not) tokens based on the model prediction.
      • None: Does not do any aggregation and simply returns the raw results from the model.
      • "simple": Attempts to group entities following the default schema: (A, B-TAG), (B, I-TAG), (C, I-TAG), (D, B-TAG2), (E, B-TAG2) ends up as [{"word": "ABC", "entity": "TAG"}, {"word": "D", "entity": "TAG2"}, {"word": "E", "entity": "TAG2"}]. Note that two consecutive B tags end up as separate entities. On word-based languages, this can split words undesirably: imagine Microsoft being tagged as [{"word": "Micro", "entity": "ENTERPRISE"}, {"word": "soft", "entity": "NAME"}]. Look at the "first", "max", and "average" options for ways to mitigate this and disambiguate words (on languages that support that notion, which is essentially tokens separated by a space). These mitigations only work on real words; "New york" might still be tagged with two different entities.
      • "first": Uses the "simple" strategy, except that words cannot end up with different tags. When there is ambiguity, a word takes the tag of its first token.
      • "average": Uses the "simple" strategy, except that words cannot end up with different tags. Scores are averaged across tokens, and the label with the maximum score is chosen.
      • "max": Uses the "simple" strategy, except that words cannot end up with different tags. The word's entity is taken from the token with the maximum score.
  • add_prefix_space: Set this to True if you do not want the first word to be treated differently. This is relevant for model types such as "bloom", "gpt2", and "roberta". Explained in more detail here: https://huggingface.co/docs/transformers/model_doc/roberta#transformers.RobertaTokenizer
  • num_workers: Number of workers to be used in the PyTorch DataLoader.
  • flatten_entities_in_meta_data: If True, converts all entities predicted for a document from a list of dictionaries into a single list for each key in the dictionary.
  • max_seq_len: Maximum sequence length of one input text for the model. If not provided, the maximum length is automatically determined from the model_max_length variable of the tokenizer.
  • pre_split_text: If True, splits the text of a Document into words before passing it to the model. This is common practice for models trained for named entity recognition and is recommended when using architectures that do not use word-level tokenizers.
  • ignore_labels: Optionally specify a list of labels to ignore. If None is specified, it defaults to ["O"].
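The grouping that the "simple" strategy performs can be pictured with a short sketch. This is an illustration of the BIO-tag grouping behavior described above, not Haystack's actual implementation; the function name and data shapes are hypothetical:

```python
def simple_aggregate(tagged_tokens):
    """Illustrative 'simple' BIO aggregation over (word, tag) pairs.

    tagged_tokens: list of (word, tag) pairs, with tags like 'B-PER', 'I-PER', 'O'.
    """
    entities = []
    for word, tag in tagged_tokens:
        if tag == "O":
            continue  # 'O' marks tokens outside any entity
        prefix, _, label = tag.partition("-")
        # An I- tag continues the previous entity if the label matches;
        # a B- tag always starts a new entity (so two consecutive B tags
        # become two separate entities, as noted above).
        if prefix == "I" and entities and entities[-1]["entity"] == label:
            entities[-1]["word"] += word
        else:
            entities.append({"word": word, "entity": label})
    return entities

tokens = [("A", "B-TAG"), ("B", "I-TAG"), ("C", "I-TAG"), ("D", "B-TAG2"), ("E", "B-TAG2")]
print(simple_aggregate(tokens))
# [{'word': 'ABC', 'entity': 'TAG'}, {'word': 'D', 'entity': 'TAG2'}, {'word': 'E', 'entity': 'TAG2'}]
```

This also shows why "New york" can end up as two entities under "simple": if both words carry B- tags, no merge happens.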

EntityExtractor.run

def run(documents: Optional[Union[List[Document], List[dict]]] = None) -> Tuple[Dict, str]

This is the method called when this node is used in a pipeline.

EntityExtractor.preprocess

def preprocess(sentence: List[str])

Preprocessing step to tokenize the provided text.

Arguments:

  • sentence: List of texts to tokenize.

EntityExtractor.forward

def forward(model_inputs: Dict[str, Any]) -> Dict[str, Any]

Forward step that passes the preprocessed model inputs through the model.

Arguments:

  • model_inputs: Dictionary of inputs to be given to the model.

EntityExtractor.postprocess

def postprocess(model_outputs_grouped_by_doc: List[Dict[str, Any]]) -> List[List[Dict]]

Postprocess the model outputs grouped by document to collect all entities detected for each document.

Arguments:

  • model_outputs_grouped_by_doc: Model outputs grouped by Document.

EntityExtractor.extract

def extract(text: Union[str, List[str]], batch_size: int = 1)

This function can be called to perform entity extraction when using the node in isolation.

Arguments:

  • text: Text to extract entities from. Can be a str or a List of str.
  • batch_size: Number of texts to make predictions on at a time.

EntityExtractor.extract_batch

def extract_batch(texts: Union[List[str], List[List[str]]], batch_size: int = 1) -> List[List[Dict]]

This function allows the extraction of entities out of a list of strings or a list of lists of strings.

The only difference between this function and self.extract is that it has additional logic to handle a list of lists of strings.

Arguments:

  • texts: List of str or list of lists of str to extract entities from.
  • batch_size: Number of texts to make predictions on at a time.
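The list-of-lists handling can be pictured as flattening the nested input, predicting on the flat list, and regrouping the results per original sub-list. The sketch below mimics the behavior described above with a stand-in extract function; it is not Haystack's actual code, and all names in it are hypothetical:

```python
def extract_batch_sketch(texts, extract_fn, batch_size=1):
    """Illustrative handling of a list of lists of strings.

    extract_fn: stand-in for self.extract; takes a flat list of texts and
    returns one list of entity dicts per text.
    """
    if texts and isinstance(texts[0], list):
        lengths = [len(sub) for sub in texts]
        flat = [t for sub in texts for t in sub]          # flatten the nesting
        results = extract_fn(flat, batch_size=batch_size)
        # Regroup the flat results to mirror the input nesting.
        grouped, start = [], 0
        for n in lengths:
            grouped.append(results[start:start + n])
            start += n
        return grouped
    return extract_fn(texts, batch_size=batch_size)       # flat input: no extra logic

# A fake extract function, just to show the regrouping:
fake_extract = lambda texts, batch_size=1: [[{"word": t}] for t in texts]
print(extract_batch_sketch([["a", "b"], ["c"]], fake_extract, batch_size=2))
# [[[{'word': 'a'}], [{'word': 'b'}]], [[{'word': 'c'}]]]
```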

simplify_ner_for_qa

def simplify_ner_for_qa(output)

Returns a simplified version of the output dictionary with the following structure:

[
    {
        answer: { ... }
        entities: [ { ... }, {} ]
    }
]

The entities included are only the ones that overlap with the answer itself.

Arguments:

  • output: Output from a query pipeline
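The overlap check can be sketched as a character-span intersection test. This is an illustrative sketch of the described filtering, not the actual simplify_ner_for_qa logic; the field names ("start", "end") are assumed here:

```python
def entities_overlapping_answer(entities, answer_start, answer_end):
    """Keep only entities whose character span intersects the answer span.

    Two half-open spans [a, b) and [c, d) overlap when a < d and b > c.
    """
    return [
        e for e in entities
        if e["start"] < answer_end and e["end"] > answer_start
    ]

entities = [
    {"word": "Berlin", "start": 0, "end": 6},
    {"word": "Germany", "start": 25, "end": 32},
]
# An answer covering characters 20-32 overlaps only "Germany".
print(entities_overlapping_answer(entities, 20, 32))
# [{'word': 'Germany', 'start': 25, 'end': 32}]
```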

_EntityPostProcessor

class _EntityPostProcessor()

This class is used to conveniently collect all functions related to the postprocessing of entity extraction.

Arguments:

  • model: The token classification model whose outputs are postprocessed.
  • tokenizer: The tokenizer corresponding to the model.

_EntityPostProcessor.postprocess

def postprocess(model_outputs: Dict[str, Any], aggregation_strategy: Literal[None, "simple", "first", "average", "max"], ignore_labels: Optional[List[str]] = None) -> List[Dict[str, Any]]

Postprocess the model outputs for a single Document.

Arguments:

  • model_outputs: Model outputs for a single Document.
  • aggregation_strategy: The strategy used to fuse (or not) tokens based on the model prediction.
      • None: Does not do any aggregation and simply returns the raw results from the model.
      • "simple": Attempts to group entities following the default schema: (A, B-TAG), (B, I-TAG), (C, I-TAG), (D, B-TAG2), (E, B-TAG2) ends up as [{"word": "ABC", "entity": "TAG"}, {"word": "D", "entity": "TAG2"}, {"word": "E", "entity": "TAG2"}]. Note that two consecutive B tags end up as separate entities. On word-based languages, this can split words undesirably: imagine Microsoft being tagged as [{"word": "Micro", "entity": "ENTERPRISE"}, {"word": "soft", "entity": "NAME"}]. Look at the "first", "max", and "average" options for ways to mitigate this and disambiguate words (on languages that support that notion, which is essentially tokens separated by a space). These mitigations only work on real words; "New york" might still be tagged with two different entities.
      • "first": Uses the "simple" strategy, except that words cannot end up with different tags. When there is ambiguity, a word takes the tag of its first token.
      • "average": Uses the "simple" strategy, except that words cannot end up with different tags. Scores are averaged across tokens, and the label with the maximum score is chosen.
      • "max": Uses the "simple" strategy, except that words cannot end up with different tags. The word's entity is taken from the token with the maximum score.
  • ignore_labels: Optionally specify a list of labels to ignore. If None is specified it defaults to ["O"].

_EntityPostProcessor.aggregate

def aggregate(pre_entities: List[Dict[str, Any]], aggregation_strategy: Literal[None, "simple", "first", "average", "max"], word_offset_mapping: Optional[List[Tuple]] = None) -> List[Dict[str, Any]]

Aggregate the pre_entities depending on the aggregation_strategy.

Arguments:

  • pre_entities: List of entity predictions for each token in a text.
  • aggregation_strategy: The strategy to fuse (or not) tokens based on the model prediction.
  • word_offset_mapping: List of (word, (char_start, char_end)) tuples for each word in a text.

_EntityPostProcessor.update_character_spans

@staticmethod
def update_character_spans(word_entities: List[Dict[str, Any]], word_offset_mapping: List[Tuple]) -> List[Dict[str, Any]]

Update the character spans of each word in word_entities to match the character spans provided in word_offset_mapping.

Arguments:

  • word_entities: List of entity predictions for each word in the text.
  • word_offset_mapping: List of (word, (char_start, char_end)) tuples for each word in a text.
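The update can be pictured as copying the (char_start, char_end) span from word_offset_mapping onto each word entity. This sketch assumes one entity prediction per word, in order; the function name and entity dict shape are hypothetical, and this is not the actual implementation:

```python
def update_character_spans_sketch(word_entities, word_offset_mapping):
    """Replace each word entity's character span with the one recorded
    in word_offset_mapping (a list of (word, (char_start, char_end)) tuples).

    Assumes word_entities and word_offset_mapping are aligned one-to-one.
    """
    updated = []
    for entity, (word, (char_start, char_end)) in zip(word_entities, word_offset_mapping):
        # Keep the original prediction fields; overwrite only the span.
        updated.append({**entity, "start": char_start, "end": char_end})
    return updated

print(update_character_spans_sketch(
    [{"word": "Berlin", "entity": "LOC"}],
    [("Berlin", (0, 6))],
))
# [{'word': 'Berlin', 'entity': 'LOC', 'start': 0, 'end': 6}]
```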

_EntityPostProcessor.gather_pre_entities

def gather_pre_entities(sentence: Union[str, List[str]], input_ids: np.ndarray, scores: np.ndarray, offset_mapping: np.ndarray, special_tokens_mask: np.ndarray, word_ids: List) -> List[Dict[str, Any]]

Gather the pre-entities from the model outputs.

Arguments:

  • sentence: The original text. Will be a list of words if self.pre_split_text is set to True.
  • input_ids: Array of token ids.
  • scores: Array of confidence scores of the model for the classification of each token.
  • offset_mapping: Array of (char_start, char_end) tuples for each token.
  • special_tokens_mask: Special tokens mask used to identify which tokens are special.
  • word_ids: List of integers or None types that provides the token index to word id mapping. None types correspond to special tokens.

_EntityPostProcessor.aggregate_word

def aggregate_word(entities: List[Dict[str, Any]], aggregation_strategy: Literal["first", "average", "max"]) -> Dict[str, Any]

Aggregate token entities into a single word entity.

Arguments:

  • entities: List of token entities to be combined.
  • aggregation_strategy: The strategy to fuse the tokens based on the model prediction.

_EntityPostProcessor.aggregate_words

def aggregate_words(entities: List[Dict[str, Any]], aggregation_strategy: Literal[None, "simple", "first", "average", "max"]) -> List[Dict[str, Any]]

Override tokens from a given word that disagree to force agreement on word boundaries.

Example: the tokens micro|soft| com|pany| tagged B-ENT I-NAME I-ENT I-ENT will be rewritten with the "first" strategy as microsoft| company| tagged B-ENT I-ENT.

Arguments:

  • entities: List of predicted entities for each token in the text.
  • aggregation_strategy: The strategy to fuse (or not) tokens based on the model prediction.
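The example above can be sketched for the "first" strategy: tokens that belong to the same word are merged, and the whole word takes the tag of its first token. This is an illustration only, with hypothetical field names ("word_id", "token", "entity"), not the actual aggregate_words code:

```python
def aggregate_words_first(token_entities):
    """Illustrative 'first' aggregation: each word takes its first token's tag.

    token_entities: list of dicts with 'word_id' (which word a token belongs
    to), 'token' (the token text, WordPiece-style), and 'entity' (a BIO tag).
    """
    words = {}
    for tok in token_entities:
        words.setdefault(tok["word_id"], []).append(tok)
    aggregated = []
    for word_id in sorted(words):
        toks = words[word_id]
        aggregated.append({
            # Join sub-tokens, stripping WordPiece '##' continuation markers.
            "word": "".join(t["token"].lstrip("#") for t in toks),
            # 'first' strategy: the tag of the word's first token wins.
            "entity": toks[0]["entity"],
        })
    return aggregated

tokens = [
    {"word_id": 0, "token": "micro", "entity": "B-ENT"},
    {"word_id": 0, "token": "##soft", "entity": "I-NAME"},
    {"word_id": 1, "token": "com", "entity": "I-ENT"},
    {"word_id": 1, "token": "##pany", "entity": "I-ENT"},
]
print(aggregate_words_first(tokens))
# [{'word': 'microsoft', 'entity': 'B-ENT'}, {'word': 'company', 'entity': 'I-ENT'}]
```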

_EntityPostProcessor.group_sub_entities

def group_sub_entities(entities: List[Dict[str, Any]]) -> Dict[str, Any]

Group together the adjacent tokens with the same entity predicted.

Arguments:

  • entities: The entities predicted by the pipeline.

_EntityPostProcessor.get_tag

@staticmethod
def get_tag(entity_name: str) -> Tuple[str, str]

Get the entity tag and its prefix.

Arguments:

  • entity_name: The name of the entity, for example "B-PER".
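Splitting a BIO-style label into tag and prefix can be sketched as below. The return order and the fallback for labels without a B-/I- prefix are assumptions for illustration, not the exact get_tag implementation:

```python
def get_tag_sketch(entity_name):
    """Split a BIO-style label such as 'B-PER' into (tag, prefix)."""
    if entity_name.startswith("B-") or entity_name.startswith("I-"):
        prefix, tag = entity_name.split("-", 1)
    else:
        # Assumed fallback: labels without a B-/I- prefix are treated
        # as continuations of an entity.
        prefix, tag = "I", entity_name
    return tag, prefix

print(get_tag_sketch("B-PER"))
# ('PER', 'B')
```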

_EntityPostProcessor.group_entities

def group_entities(entities: List[Dict[str, Any]]) -> List[Dict[str, Any]]

Find and group together the adjacent tokens (or words) with the same entity predicted.

Arguments:

  • entities: List of predicted entities.

TokenClassificationDataset

class TokenClassificationDataset(Dataset)

Token Classification Dataset

This is a wrapper class to create a PyTorch dataset object from the data attribute of a transformers.tokenization_utils_base.BatchEncoding object.

Arguments:

  • model_inputs: The data attribute of the output from a Hugging Face tokenizer, which is needed to evaluate the forward pass of a token classification model.
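The wrapper can be pictured as an indexable view over the tokenizer's output fields: `__len__` is the number of encoded texts and `__getitem__` returns the idx-th slice of every field. The class below is a plain-Python stand-in (in practice it would subclass torch.utils.data.Dataset), not the actual implementation:

```python
class TokenClassificationDatasetSketch:
    """Illustrative stand-in for a PyTorch Dataset wrapping tokenizer output.

    model_inputs: a dict-like object such as BatchEncoding.data, mapping
    field names (e.g. 'input_ids', 'attention_mask') to per-text sequences.
    """

    def __init__(self, model_inputs):
        self.model_inputs = model_inputs

    def __len__(self):
        # One example per encoded text.
        return len(self.model_inputs["input_ids"])

    def __getitem__(self, idx):
        # Return one example: the idx-th slice of every tokenizer field.
        return {key: values[idx] for key, values in self.model_inputs.items()}

dataset = TokenClassificationDatasetSketch({
    "input_ids": [[101, 7632, 102], [101, 2088, 102]],
    "attention_mask": [[1, 1, 1], [1, 1, 0]],
})
print(len(dataset))      # 2
print(dataset[1])        # {'input_ids': [101, 2088, 102], 'attention_mask': [1, 1, 0]}
```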