Extracts predefined entities out of a piece of text.
Module entity
Acknowledgements: Many of the postprocessing parts here come from the great transformers repository: https://github.com/huggingface/transformers. Thanks for the great work!
EntityExtractor
class EntityExtractor(BaseComponent)
This node is used to extract entities out of documents.
The most common use case for this node is as a named entity extractor. The default model is elastic/distilbert-base-cased-finetuned-conll03-english. The node can be placed in a query pipeline to perform entity extraction on retrieved documents only, or in an indexing pipeline so that all documents in the document store have extracted entities. It automatically splits long Documents based on the max token length of the underlying model and aggregates the predictions of each split to produce the final set of entities for each Document. The entities extracted by this node populate Document.meta.entities.
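A minimal sketch of placing the node in an indexing pipeline, assuming the Haystack v1 package layout (haystack.nodes.EntityExtractor, haystack.document_stores.InMemoryDocumentStore) and the default model named above:

from haystack import Document, Pipeline
from haystack.document_stores import InMemoryDocumentStore
from haystack.nodes import EntityExtractor

document_store = InMemoryDocumentStore()
entity_extractor = EntityExtractor(
    model_name_or_path="elastic/distilbert-base-cased-finetuned-conll03-english"
)

# Indexing pipeline: every document written to the store carries
# extracted entities in its meta data.
indexing_pipeline = Pipeline()
indexing_pipeline.add_node(component=entity_extractor, name="EntityExtractor", inputs=["File"])
indexing_pipeline.add_node(component=document_store, name="DocumentStore", inputs=["EntityExtractor"])

indexing_pipeline.run(documents=[Document(content="Angela Merkel was born in Hamburg.")])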
Arguments:
model_name_or_path: The name of the model to use for entity extraction.
model_version: The version of the model to use for entity extraction.
use_gpu: Whether to use the GPU or not.
progress_bar: Whether to show a progress bar or not.
batch_size: The batch size to use for entity extraction.
use_auth_token: The API token used to download private models from Hugging Face. If this parameter is set to True, the token generated when running transformers-cli login (stored in ~/.huggingface) is used. Additional information can be found here: https://huggingface.co/transformers/main_classes/model.html#transformers.PreTrainedModel.from_pretrained
devices: List of torch devices (e.g. cuda, cpu, mps) to limit inference to specific devices. A list containing torch device objects and/or strings is supported (for example [torch.device('cuda:0'), "mps", "cuda:1"]). When use_gpu=False is specified, the devices parameter is not used and a single cpu device is used for inference.
aggregation_strategy: The strategy to fuse (or not) tokens based on the model prediction (see the instantiation sketch after this list).
None: Does no aggregation and simply returns the raw results from the model.
"simple": Attempts to group entities following the default schema: (A, B-TAG), (B, I-TAG), (C, I-TAG), (D, B-TAG2), (E, B-TAG2) ends up as [{"word": "ABC", "entity": "TAG"}, {"word": "D", "entity": "TAG2"}, {"word": "E", "entity": "TAG2"}]. Notice that two consecutive B tags end up as different entities. On word-based languages, words might be split undesirably: imagine "Microsoft" being tagged as [{"word": "Micro", "entity": "ENTERPRISE"}, {"word": "soft", "entity": "NAME"}]. Look at the options "first", "max", and "average" for ways to mitigate this and disambiguate words (on languages that support that meaning, which is basically tokens separated by a space). These mitigations only work on real words; "New york" might still be tagged with two different entities.
"first": Uses the "simple" strategy, except that words cannot end up with different tags. When there is ambiguity, a word simply takes the tag of its first token.
"average": Uses the "simple" strategy, except that words cannot end up with different tags. Scores are averaged across tokens, and then the label with the maximum score is chosen.
"max": Uses the "simple" strategy, except that words cannot end up with different tags. The word entity is simply the token with the maximum score.
add_prefix_space: Set this if you do not want the first word to be treated differently. This is relevant for model types such as "bloom", "gpt2", and "roberta". Explained in more detail here: https://huggingface.co/docs/transformers/model_doc/roberta#transformers.RobertaTokenizer
num_workers: Number of workers to be used in the PyTorch DataLoader.
flatten_entities_in_meta_data: If True, this converts all entities predicted for a document from a list of dictionaries into a single list for each key in the dictionary.
max_seq_len: Max sequence length of one input text for the model. If not provided, the max length is automatically determined by the model_max_length variable of the tokenizer.
pre_split_text: If True, split the text of a Document into words before passing it to the model. This is common practice for models trained for named entity recognition and is recommended when using architectures that do not use word-level tokenizers.
ignore_labels: Optionally specify a list of labels to ignore. If None is specified, it defaults to ["O"].
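A hedged instantiation sketch using several of the arguments above; the parameter names come from this reference, and the values are purely illustrative:

from haystack.nodes import EntityExtractor

extractor = EntityExtractor(
    model_name_or_path="elastic/distilbert-base-cased-finetuned-conll03-english",
    use_gpu=False,
    batch_size=16,
    aggregation_strategy="first",  # whole words get one tag, taken from the first token
    flatten_entities_in_meta_data=True,
    ignore_labels=["O"],
)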
EntityExtractor.run
def run(documents: Optional[Union[List[Document], List[dict]]] = None) -> Tuple[Dict, str]
This is the method called when the node is used in a pipeline.
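The node is normally triggered by Pipeline.run(); calling run directly looks roughly like this (a sketch, assuming the standard BaseComponent contract of returning an output dictionary plus the name of the outgoing edge):

from haystack import Document

result, edge_name = extractor.run(documents=[Document(content="Apple was founded by Steve Jobs.")])
print(result["documents"][0].meta["entities"])  # entities attached by the node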
EntityExtractor.preprocess
def preprocess(sentence: List[str])
Preprocessing step to tokenize the provided text.
Arguments:
sentence: List of texts to tokenize.
EntityExtractor.forward
def forward(model_inputs: Dict[str, Any]) -> Dict[str, Any]
Forward step.
Arguments:
model_inputs: Dictionary of inputs to be given to the model.
EntityExtractor.postprocess
def postprocess(model_outputs_grouped_by_doc: List[Dict[str, Any]]) -> List[List[Dict]]
Postprocess the model outputs grouped by document to collect all entities detected for each document.
Arguments:
model_outputs_grouped_by_doc: Model outputs grouped by Document.
EntityExtractor.extract
def extract(text: Union[str, List[str]], batch_size: int = 1)
This function can be called to perform entity extraction when using the node in isolation.
Arguments:
text: Text to extract entities from. Can be a str or a List of str.
batch_size: Number of texts to make predictions on at a time.
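A sketch of standalone usage with the extractor instance from above; the exact keys in each returned entity dictionary (for example word, entity_group, start, end, score) follow the Hugging Face token-classification output and vary with aggregation_strategy:

entities = extractor.extract("Angela Merkel was born in Hamburg.", batch_size=1)
for entity in entities:
    print(entity)  # e.g. {"word": "Angela Merkel", "entity_group": "PER", ...}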
EntityExtractor.extract_batch
def extract_batch(texts: Union[List[str], List[List[str]]], batch_size: int = 1) -> List[List[Dict]]
This function allows the extraction of entities out of a list of strings or a list of lists of strings. The only difference between this function and self.extract is that it has additional logic to handle a list of lists of strings.
Arguments:
texts: List of str or list of lists of str to extract entities from.
batch_size: Number of texts to make predictions on at a time.
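A sketch, again assuming the extractor instance defined earlier; each inner list is treated as one group of texts:

texts = [
    ["Berlin is the capital of Germany."],
    ["Steve Jobs founded Apple.", "Bill Gates founded Microsoft."],
]
all_entities = extractor.extract_batch(texts, batch_size=2)
# One list of entity dictionaries per input string.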
simplify_ner_for_qa
def simplify_ner_for_qa(output)
Returns a simplified version of the output dictionary with the following structure:

[
    {
        answer: { ... },
        entities: [ { ... }, { ... } ]
    }
]
The entities included are only the ones that overlap with the answer itself.
Arguments:
output: Output from a query pipeline.
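A sketch of trimming a query-pipeline result; it assumes simplify_ner_for_qa is imported from the module that defines EntityExtractor (haystack.nodes.extractor.entity in the Haystack v1 layout) and that query_pipeline is a prebuilt QA pipeline whose retrieved documents pass through an EntityExtractor:

from haystack.nodes.extractor.entity import simplify_ner_for_qa

output = query_pipeline.run(query="Who founded Apple?")
simplified = simplify_ner_for_qa(output)
# Only the entities that overlap with each answer are kept.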
TokenClassificationDataset
class TokenClassificationDataset(Dataset)
Token Classification Dataset
This is a wrapper class to create a PyTorch dataset object from the data attribute of a transformers.tokenization_utils_base.BatchEncoding object.
Arguments:
model_inputs: The data attribute of the output from a Hugging Face tokenizer, which is needed to evaluate the forward pass of a token classification model.
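A sketch of building such a dataset from tokenizer output and feeding it to a DataLoader, assuming TokenClassificationDataset is importable alongside EntityExtractor:

from torch.utils.data import DataLoader
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("elastic/distilbert-base-cased-finetuned-conll03-english")
encoding = tokenizer(["Berlin is in Germany."], padding=True, truncation=True, return_tensors="pt")

dataset = TokenClassificationDataset(encoding.data)  # wraps BatchEncoding.data
loader = DataLoader(dataset, batch_size=1)  # batches feed the model's forward pass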