The Summarizer gives a short overview of a long Document.
Module base
BaseSummarizer
class BaseSummarizer(BaseComponent)
Abstract base class for Summarizers.
BaseSummarizer.predict
@abstractmethod
def predict(documents: List[Document]) -> List[Document]
Abstract method for creating a summary.
Arguments:
documents
: Related documents (for example, coming from a Retriever) that the summary is conditioned on.
Returns:
List of Documents, where Document.meta["summary"] contains the summarization.
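As a rough illustration of this contract, the sketch below implements predict in a toy subclass. It does not import Haystack: Document and BaseSummarizer here are simplified stand-ins defined in the snippet, and FirstSentenceSummarizer is a hypothetical name. The "summary" is simply the first sentence of each document, which is only meant to show where the result is stored.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Document:
    """Simplified stand-in for Haystack's Document (assumption: only content and meta)."""
    content: str
    meta: Dict[str, Any] = field(default_factory=dict)


class BaseSummarizer(ABC):
    """Stand-in for the abstract base class described above."""

    @abstractmethod
    def predict(self, documents: List[Document]) -> List[Document]:
        ...


class FirstSentenceSummarizer(BaseSummarizer):
    """Hypothetical toy summarizer: uses the first sentence as the summary."""

    def predict(self, documents: List[Document]) -> List[Document]:
        results = []
        for doc in documents:
            first_sentence = doc.content.split(". ")[0].rstrip(".") + "."
            # Copy meta so the input Document is not modified.
            meta = {**doc.meta, "summary": first_sentence}
            results.append(Document(content=doc.content, meta=meta))
        return results


docs = [Document(content="The grid operator scheduled blackouts. High winds were forecast.")]
summaries = FirstSentenceSummarizer().predict(docs)
print(summaries[0].meta["summary"])
```

Each returned Document keeps its original content, and the summary sits under meta["summary"], matching the return value described above.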
Module transformers
TransformersSummarizer
class TransformersSummarizer(BaseSummarizer)
Summarizes documents using Hugging Face's transformers framework.
You can use any model fine-tuned on a summarization task. For example: 'bart-large-cnn', 't5-small', 't5-base', 't5-large', 't5-3b', 't5-11b'.
See the up-to-date list of available models in the Hugging Face documentation.
Example
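from haystack import Document
from haystack.nodes import TransformersSummarizer
# Load a summarization model (this is the default model_name_or_path)
summarizer = TransformersSummarizer(model_name_or_path="google/pegasus-xsum")
# Adjacent string literals are concatenated, so each part ends with a space
docs = [Document(content="PG&E stated it scheduled the blackouts in response to forecasts for high winds amid dry conditions. "
    "The aim is to reduce the risk of wildfires. Nearly 800 thousand customers were scheduled to be affected by "
    "the shutoffs which were expected to last through at least midday tomorrow.")]
# Summarize
summary = summarizer.predict(documents=docs)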
# Show results (List of Documents, containing summary and original content)
print(summary)
[
{
"content": "PG&E stated it scheduled the blackouts in response to forecasts for high winds amid dry conditions. ...",
...
"meta": {
"summary": "California's largest electricity provider has turned off power to hundreds of thousands of customers.",
...
},
...
}
]
TransformersSummarizer.__init__
def __init__(model_name_or_path: str = "google/pegasus-xsum",
model_version: Optional[str] = None,
tokenizer: Optional[str] = None,
max_length: int = 200,
min_length: int = 5,
use_gpu: bool = True,
clean_up_tokenization_spaces: bool = True,
batch_size: int = 16,
progress_bar: bool = True,
use_auth_token: Optional[Union[str, bool]] = None,
devices: Optional[List[Union[str, torch.device]]] = None)
Load a summarization model from transformers.
See the up-to-date list of available models at Hugging Face.
Arguments:
model_name_or_path
: The path to the locally saved model or the name of a public model, for example 'google/pegasus-xsum' or 't5-base'. See Hugging Face for a full list of available models.
model_version
: The version of the model to use from the Hugging Face model hub. Can be a tag name, a branch name, or a commit hash.
tokenizer
: Name of the tokenizer (usually the same as model).
max_length
: Maximum length of the summarized text.
min_length
: Minimum length of the summarized text.
use_gpu
: Whether to use GPU (if available).
clean_up_tokenization_spaces
: Whether or not to clean up the potential extra spaces in the text output.
batch_size
: Number of documents to process at a time.
progress_bar
: Whether to show a progress bar.
use_auth_token
: The API token used to download private models from Hugging Face. If set to True, the token generated when running transformers-cli login (stored in ~/.huggingface) is used. More information at Hugging Face.
devices
: List of torch devices (for example, cuda, cpu, mps) to limit inference to specific devices. A list containing torch device objects or strings is supported (for example [torch.device('cuda:0'), "mps", "cuda:1"]). If you specify use_gpu=False, the devices parameter is not used and a single CPU device is used for inference.
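The interaction between use_gpu and devices can be sketched in plain Python. This is not Haystack's code: resolve_devices and its gpu_available flag are hypothetical, device names are plain strings rather than torch.device objects, and the fallback used when no devices list is given is an assumption that the description above does not cover.

```python
from typing import List, Optional


def resolve_devices(
    use_gpu: bool,
    devices: Optional[List[str]] = None,
    gpu_available: bool = False,
) -> List[str]:
    """Hypothetical helper mirroring the rule described above, with device names as strings."""
    if not use_gpu:
        # Stated above: devices is ignored and a single CPU device is used.
        return ["cpu"]
    if devices:
        # Inference is limited to the devices that were passed in.
        return list(devices)
    # Assumption (not stated above): without an explicit list,
    # use one GPU if one is present, otherwise fall back to CPU.
    return ["cuda:0"] if gpu_available else ["cpu"]


print(resolve_devices(use_gpu=False, devices=["cuda:0", "cuda:1"]))
print(resolve_devices(use_gpu=True, devices=["cuda:0", "mps"]))
```

The first call shows use_gpu=False overriding an explicit devices list; the second shows the list being used as given.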
TransformersSummarizer.predict
def predict(documents: List[Document]) -> List[Document]
Produce the summarization from the supplied documents.
The documents can come from the Retriever.
Arguments:
documents
: A list of Documents (for example, coming from a Retriever) to summarize individually.
Returns:
List of Documents, where Document.meta["summary"] contains the summarization.
TransformersSummarizer.predict_batch
def predict_batch(
documents: Union[List[Document], List[List[Document]]],
batch_size: Optional[int] = None
) -> Union[List[Document], List[List[Document]]]
Summarize supplied documents in batches.
These documents can come from the Retriever.
Arguments:
documents
: A single list of documents or a list of lists of documents (for example, coming from a Retriever) to summarize.
batch_size
: Number of Documents to process at a time.
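One way to picture batched summarization over a list of lists is to flatten the input, process it in chunks of batch_size, and then restore the original grouping. The sketch below is an illustration only, not Haystack's implementation: chunk and summarize_nested are hypothetical helpers, plain strings stand in for Documents, and first_word is a toy stand-in for a real model call.

```python
from typing import Callable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def chunk(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def summarize_nested(
    doc_lists: List[List[str]],
    summarize_batch: Callable[[List[str]], List[str]],
    batch_size: int,
) -> List[List[str]]:
    """Summarize a list of lists in batches and keep the input grouping."""
    # Flatten so batches can span the boundaries between inner lists.
    flat = [text for docs in doc_lists for text in docs]
    summaries: List[str] = []
    for batch in chunk(flat, batch_size):
        summaries.extend(summarize_batch(batch))
    # Regroup the flat summaries to match the shape of the input.
    grouped: List[List[str]] = []
    position = 0
    for docs in doc_lists:
        grouped.append(summaries[position:position + len(docs)])
        position += len(docs)
    return grouped


def first_word(batch: List[str]) -> List[str]:
    """Toy stand-in for a model: the 'summary' is the first word of each text."""
    return [text.split()[0] for text in batch]


result = summarize_nested([["alpha one", "beta two"], ["gamma three"]], first_word, batch_size=2)
print(result)
```

The output has the same nesting as the input, which corresponds to the Union return type in the signature above.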