The Summarizer gives a short overview of a long Document.
Module base
BaseSummarizer
class BaseSummarizer(BaseComponent)
Abstract base class for Summarizers.
BaseSummarizer.predict
@abstractmethod
def predict(documents: List[Document]) -> List[Document]
Abstract method for creating a summary.
Arguments:
documents
: Related documents (e.g. coming from a retriever) that the answer shall be conditioned on.
Returns:
List of Documents, where Document.meta["summary"] contains the summarization
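The contract above can be illustrated with a minimal sketch: a subclass implements predict, writes each summary into Document.meta["summary"], and returns the same list. The Document stand-in and the trivial first-sentence "summarizer" below are illustrative assumptions, not Haystack's actual classes.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Document:
    # Stand-in for haystack's Document: text content plus a metadata dict.
    content: str
    meta: dict = field(default_factory=dict)


class FirstSentenceSummarizer:
    """Toy implementation illustrating the BaseSummarizer.predict contract."""

    def predict(self, documents: List[Document]) -> List[Document]:
        for doc in documents:
            # A real summarizer would run a model here; we just take the
            # first sentence as a placeholder "summary".
            doc.meta["summary"] = doc.content.split(". ")[0]
        return documents


docs = [Document(content="PG&E scheduled blackouts. The aim is to reduce wildfire risk.")]
result = FirstSentenceSummarizer().predict(docs)
print(result[0].meta["summary"])  # -> PG&E scheduled blackouts
```

Note that the original content is preserved; only the meta dict gains a "summary" key.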
Module transformers
TransformersSummarizer
class TransformersSummarizer(BaseSummarizer)
Transformer-based model to summarize documents using Hugging Face's transformers framework.
You can use any model that has been fine-tuned on a summarization task. For example:
'bart-large-cnn', 't5-small', 't5-base', 't5-large', 't5-3b', 't5-11b'.
See the up-to-date list of available models on huggingface.co/models: https://huggingface.co/models?filter=summarization
Example
summarizer = TransformersSummarizer(model_name_or_path="google/pegasus-xsum")

docs = [Document(content="PG&E stated it scheduled the blackouts in response to forecasts for high winds amid dry conditions. "
                         "The aim is to reduce the risk of wildfires. Nearly 800 thousand customers were scheduled to be affected by "
                         "the shutoffs which were expected to last through at least midday tomorrow.")]

# Summarize
summary = summarizer.predict(documents=docs)

# Show results (List of Documents, containing summary and original content)
print(summary)
# Show results (List of Documents, containing summary and original content)
print(summary)
[
    {
        "content": "PG&E stated it scheduled the blackouts in response to forecasts for high winds amid dry conditions. ...",
        ...
        "meta": {
            "summary": "California's largest electricity provider has turned off power to hundreds of thousands of customers.",
            ...
        },
        ...
    },
]
TransformersSummarizer.__init__
def __init__(model_name_or_path: str = "google/pegasus-xsum", model_version: Optional[str] = None, tokenizer: Optional[str] = None, max_length: int = 200, min_length: int = 5, use_gpu: bool = True, clean_up_tokenization_spaces: bool = True, batch_size: int = 16, progress_bar: bool = True, use_auth_token: Optional[Union[str, bool]] = None, devices: Optional[List[Union[str, torch.device]]] = None)
Load a Summarization model from Transformers.
See the up-to-date list of available models at https://huggingface.co/models?filter=summarization
Arguments:
model_name_or_path
: Directory of a saved model or the name of a public model e.g. 'facebook/rag-token-nq', 'facebook/rag-sequence-nq'. See https://huggingface.co/models?filter=summarization for full list of available models.model_version
: The version of model to use from the HuggingFace model hub. Can be tag name, branch name, or commit hash.tokenizer
: Name of the tokenizer (usually the same as model)max_length
: Maximum length of summarized textmin_length
: Minimum length of summarized textuse_gpu
: Whether to use GPU (if available).clean_up_tokenization_spaces
: Whether or not to clean up the potential extra spaces in the text outputbatch_size
: Number of documents to process at a time.progress_bar
: Whether to show a progress bar.use_auth_token
: The API token used to download private models from Huggingface. If this parameter is set toTrue
, then the token generated when runningtransformers-cli login
(stored in ~/.huggingface) will be used. Additional information can be found here https://huggingface.co/transformers/main_classes/model.html#transformers.PreTrainedModel.from_pretraineddevices
: List of torch devices (e.g. cuda, cpu, mps) to limit inference to specific devices. A list containing torch device objects and/or strings is supported (For example [torch.device('cuda:0'), "mps", "cuda:1"]). When specifyinguse_gpu=False
the devices parameter is not used and a single cpu device is used for inference.
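The interplay of use_gpu and devices described above can be sketched as follows. resolve_devices is a hypothetical helper, not part of the Haystack API, and devices are represented as plain strings to avoid a torch dependency.

```python
from typing import List, Optional


def resolve_devices(use_gpu: bool, devices: Optional[List[str]] = None) -> List[str]:
    """Illustrative device selection mirroring the documented behavior."""
    if not use_gpu:
        # As documented: devices is ignored when use_gpu=False;
        # a single cpu device is used for inference.
        return ["cpu"]
    if devices:
        # An explicit list limits inference to those devices.
        return devices
    # Default: a single GPU device (a real implementation would
    # query torch.cuda for availability).
    return ["cuda:0"]


print(resolve_devices(use_gpu=False, devices=["cuda:0", "cuda:1"]))  # -> ['cpu']
print(resolve_devices(use_gpu=True, devices=["cuda:0", "mps"]))      # -> ['cuda:0', 'mps']
```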
TransformersSummarizer.predict
def predict(documents: List[Document]) -> List[Document]
Produce the summarization from the supplied documents.
These documents can, for example, be retrieved via the Retriever.
Arguments:
documents
: Related documents (e.g. coming from a retriever) that the answer shall be conditioned on.
Returns:
List of Documents, where Document.meta["summary"] contains the summarization
TransformersSummarizer.predict_batch
def predict_batch(documents: Union[List[Document], List[List[Document]]], batch_size: Optional[int] = None) -> Union[List[Document], List[List[Document]]]
Produce the summarization from the supplied documents.
These documents can for example be retrieved via the Retriever.
Arguments:
documents
: Single list of related documents or list of lists of related documents (e.g. coming from a retriever) that the answer shall be conditioned on.
batch_size
: Number of Documents to process at a time.
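The flat-list versus list-of-lists behavior of predict_batch can be sketched like this. The one-line "summary" and the helper names are illustrative assumptions, not the actual implementation, and documents are modeled as plain strings.

```python
from typing import List, Union


def summarize_one(text: str) -> str:
    # Placeholder for a model call: take the first sentence.
    return text.split(". ")[0]


def predict_batch(documents: Union[List[str], List[List[str]]],
                  batch_size: int = 16) -> Union[List[str], List[List[str]]]:
    """Sketch of predict_batch: a flat list yields a flat list of summaries;
    a list of lists preserves the nested grouping in the output."""
    if documents and isinstance(documents[0], list):
        # List of lists: flatten, summarize in batches, then regroup.
        flat = [d for group in documents for d in group]
        summaries = predict_batch(flat, batch_size)
        out, i = [], 0
        for group in documents:
            out.append(summaries[i:i + len(group)])
            i += len(group)
        return out
    # Flat list: process batch_size documents at a time.
    results = []
    for start in range(0, len(documents), batch_size):
        batch = documents[start:start + batch_size]
        results.extend(summarize_one(d) for d in batch)
    return results


print(predict_batch(["A first. More text.", "B first. More."], batch_size=1))
# -> ['A first', 'B first']
```

Batching here only controls how many documents are handed to the model at once; it does not change the shape of the result.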