Haystack

The Summarizer gives a short overview of a long Document.

Module base

BaseSummarizer

class BaseSummarizer(BaseComponent)

Abstract base class for Summarizers.

BaseSummarizer.predict

@abstractmethod
def predict(documents: List[Document], generate_single_summary: Optional[bool] = None) -> List[Document]

Abstract method for creating a summary.

Arguments:

  • documents: Related Documents (e.g. coming from a retriever) to summarize.
  • generate_single_summary: This parameter is deprecated and will be removed in Haystack 1.12

Returns:

List of Documents, where Document.meta["summary"] contains the summarization
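The predict contract above (Documents in, Documents out with the summary stored in meta["summary"]) can be sketched with a toy subclass. Note this is a self-contained illustration using stand-in classes, not the actual Haystack BaseSummarizer or Document:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import List


@dataclass
class Document:
    """Minimal stand-in for haystack.Document (illustration only)."""
    content: str
    meta: dict = field(default_factory=dict)


class BaseSummarizer(ABC):
    @abstractmethod
    def predict(self, documents: List[Document]) -> List[Document]:
        ...


class TruncatingSummarizer(BaseSummarizer):
    """Toy implementation: the 'summary' is simply the first sentence."""

    def predict(self, documents: List[Document]) -> List[Document]:
        for doc in documents:
            doc.meta["summary"] = doc.content.split(". ")[0]
        return documents


docs = [Document(content="First sentence. Second sentence.")]
result = TruncatingSummarizer().predict(docs)
print(result[0].meta["summary"])  # First sentence
```

Real implementations such as TransformersSummarizer replace the truncation with a model call, but keep the same input/output shape.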

Module transformers

TransformersSummarizer

class TransformersSummarizer(BaseSummarizer)

Transformer-based model to summarize documents, using Hugging Face's transformers framework.

You can use any model that has been fine-tuned on a summarization task. For example:
'bart-large-cnn', 't5-small', 't5-base', 't5-large', 't5-3b', 't5-11b'.
See the up-to-date list of available models at
https://huggingface.co/models?filter=summarization

Example

    # Initialize the summarizer (uses the default model, google/pegasus-xsum)
    summarizer = TransformersSummarizer()

    docs = [Document(content="PG&E stated it scheduled the blackouts in response to forecasts for high winds amid dry conditions. "
                             "The aim is to reduce the risk of wildfires. Nearly 800 thousand customers were scheduled to be affected by "
                             "the shutoffs which were expected to last through at least midday tomorrow.")]

    # Summarize
    summary = summarizer.predict(documents=docs)

    # Show results (List of Documents, containing summary and original content)
    print(summary)

    [
      {
        "content": "PG&E stated it scheduled the blackouts in response to forecasts for high winds amid dry conditions. ...",
        ...
        "meta": {
          "summary": "California's largest electricity provider has turned off power to hundreds of thousands of customers.",
          ...
        },
        ...
      }
    ]

TransformersSummarizer.__init__

def __init__(model_name_or_path: str = "google/pegasus-xsum", model_version: Optional[str] = None, tokenizer: Optional[str] = None, max_length: int = 200, min_length: int = 5, use_gpu: bool = True, clean_up_tokenization_spaces: bool = True, separator_for_single_summary: str = " ", generate_single_summary: bool = False, batch_size: int = 16, progress_bar: bool = True, use_auth_token: Optional[Union[str, bool]] = None, devices: Optional[List[Union[str, torch.device]]] = None)

Load a Summarization model from Transformers.

See the up-to-date list of available models at
https://huggingface.co/models?filter=summarization

Arguments:

  • model_name_or_path: Directory of a saved model or the name of a public summarization model, e.g.
    'facebook/bart-large-cnn' or 'google/pegasus-xsum'.
    See https://huggingface.co/models?filter=summarization for a full list of available models.
  • model_version: The version of model to use from the HuggingFace model hub. Can be tag name, branch name, or commit hash.
  • tokenizer: Name of the tokenizer (usually the same as the model name).
  • max_length: Maximum length of the summarized text (in tokens).
  • min_length: Minimum length of the summarized text (in tokens).
  • use_gpu: Whether to use a GPU (if available).
  • clean_up_tokenization_spaces: Whether to clean up potential extra spaces in the text output.
  • separator_for_single_summary: This parameter is deprecated and will be removed in Haystack 1.12
  • generate_single_summary: This parameter is deprecated and will be removed in Haystack 1.12.
    To obtain single summaries from multiple documents, consider using the DocumentMerger.
  • batch_size: Number of documents to process at a time.
  • progress_bar: Whether to show a progress bar.
  • use_auth_token: The API token used to download private models from Hugging Face.
    If this parameter is set to True, the token generated when running
    transformers-cli login (stored in ~/.huggingface) is used.
    Additional information can be found here:
    https://huggingface.co/transformers/main_classes/model.html#transformers.PreTrainedModel.from_pretrained
  • devices: List of torch devices (e.g. cuda, cpu, mps) to limit inference to specific devices.
    A list containing torch device objects and/or strings is supported (For example
    [torch.device('cuda:0'), "mps", "cuda:1"]). When specifying use_gpu=False the devices
    parameter is not used and a single cpu device is used for inference.
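To illustrate what the batch_size parameter governs, here is a pure-Python sketch of splitting documents into batches of at most batch_size for inference. This is an illustration of the chunking pattern, not Haystack's actual implementation:

```python
from typing import Iterator, List


def chunk(documents: List[str], batch_size: int = 16) -> Iterator[List[str]]:
    """Yield successive batches of at most batch_size documents."""
    for i in range(0, len(documents), batch_size):
        yield documents[i:i + batch_size]


docs = [f"doc {i}" for i in range(40)]
batches = list(chunk(docs, batch_size=16))
print([len(b) for b in batches])  # [16, 16, 8]
```

Larger batches generally improve GPU utilization at the cost of memory; the trailing batch may be smaller than batch_size.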

TransformersSummarizer.predict

def predict(documents: List[Document], generate_single_summary: Optional[bool] = None) -> List[Document]

Produce the summarization from the supplied documents.

These documents can, for example, be retrieved via the Retriever.

Arguments:

  • documents: Related Documents (e.g. coming from a retriever) to summarize.
  • generate_single_summary: This parameter is deprecated and will be removed in Haystack 1.12.
    To obtain single summaries from multiple documents, consider using the DocumentMerger.

Returns:

List of Documents, where Document.meta["summary"] contains the summarization

TransformersSummarizer.predict_batch

def predict_batch(documents: Union[List[Document], List[List[Document]]], generate_single_summary: Optional[bool] = None, batch_size: Optional[int] = None) -> Union[List[Document], List[List[Document]]]

Produce the summarization from the supplied documents.

These documents can for example be retrieved via the Retriever.

Arguments:

  • documents: Single list of related Documents, or list of lists of related Documents
    (e.g. coming from a retriever), to summarize.
  • generate_single_summary: This parameter is deprecated and will be removed in Haystack 1.12.
    To obtain single summaries from multiple documents, consider using the DocumentMerger.
  • batch_size: Number of Documents to process at a time.
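Because predict_batch accepts either a flat list of Documents or a list of lists, the output shape mirrors the input shape. One way to sketch that dispatch (an illustration in plain Python, not Haystack's source; summarize_one is a hypothetical placeholder for the model call):

```python
from typing import List, Union

Doc = str  # stand-in for haystack.Document


def summarize_one(doc: Doc) -> Doc:
    """Placeholder for the model call: here, just keep the first word."""
    return doc.split()[0]


def predict_batch(
    documents: Union[List[Doc], List[List[Doc]]]
) -> Union[List[Doc], List[List[Doc]]]:
    # Detect the nested form and preserve the input shape in the output.
    if documents and isinstance(documents[0], list):
        return [[summarize_one(d) for d in group] for group in documents]
    return [summarize_one(d) for d in documents]


print(predict_batch(["alpha beta", "gamma delta"]))      # ['alpha', 'gamma']
print(predict_batch([["alpha beta"], ["gamma delta"]]))  # [['alpha'], ['gamma']]
```

Callers can therefore pass one list of lists per query and get the grouped summaries back in the same structure.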