Components that summarize texts into concise versions.
Module haystack_experimental.components.summarizers.llm_summarizer
LLMSummarizer
Summarizes text using a language model.
It is inspired by the OpenAI Cookbook example on summarizing long documents: https://cookbook.openai.com/examples/summarizing_long_documents
Example
from haystack_experimental.components.summarizers.llm_summarizer import LLMSummarizer
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack import Document
text = ("Machine learning is a subset of artificial intelligence that provides systems "
"the ability to automatically learn and improve from experience without being "
"explicitly programmed. The process of learning begins with observations or data. "
"Supervised learning algorithms build a mathematical model of sample data, known as "
"training data, in order to make predictions or decisions. Unsupervised learning "
"algorithms take a set of data that contains only inputs and find structure in the data. "
"Reinforcement learning is an area of machine learning where an agent learns to behave "
"in an environment by performing actions and seeing the results. Deep learning uses "
"artificial neural networks to model complex patterns in data. Neural networks consist "
"of layers of connected nodes, each performing a simple computation.")
doc = Document(content=text)
chat_generator = OpenAIChatGenerator(model="gpt-4")
summarizer = LLMSummarizer(chat_generator=chat_generator)
summarizer.warm_up()
summarizer.run(documents=[doc])
LLMSummarizer.__init__
def __init__(
chat_generator: ChatGenerator,
system_prompt: Optional[str] = "Rewrite this text in summarized form.",
summary_detail: float = 0,
minimum_chunk_size: Optional[int] = 500,
chunk_delimiter: str = ".",
summarize_recursively: bool = False,
split_overlap: int = 0)
Initialize the LLMSummarizer component.
:param chat_generator: A ChatGenerator instance to use for summarization.
:param system_prompt: The prompt instructing the LLM to summarize text. If not given, defaults to: "Rewrite this text in summarized form."
:param summary_detail: The level of detail for the summary (0-1), defaults to 0. This parameter controls the trade-off between conciseness and completeness by adjusting how many chunks the text is divided into. At detail=0, the text is processed as a single chunk (or very few chunks), producing the most concise summary. At detail=1, the text is split into the maximum number of chunks allowed by minimum_chunk_size, enabling more granular analysis and detailed summaries. The formula uses linear interpolation: num_chunks = 1 + detail * (max_chunks - 1), where max_chunks is determined by dividing the document length by minimum_chunk_size.
:param minimum_chunk_size: The minimum token count per chunk, defaults to 500.
:param chunk_delimiter: The character used to determine separator priority. "." uses sentence-based splitting, " " uses paragraph-based splitting, defaults to ".".
:param summarize_recursively: Whether to use previous summaries as context, defaults to False.
:param split_overlap: Number of tokens to overlap between consecutive chunks, defaults to 0.
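The interpolation described above can be sketched as a small helper. This is an illustrative reconstruction of the documented formula, not the component's internal code, and `estimate_num_chunks` is a hypothetical name; the real component counts tokens with its splitter's tokenizer rather than taking a pre-computed token count.

```python
def estimate_num_chunks(document_tokens: int, detail: float, minimum_chunk_size: int = 500) -> int:
    """Interpolate between one chunk (detail=0) and the maximum allowed (detail=1)."""
    if not 0 <= detail <= 1:
        raise ValueError("detail must be between 0 and 1")
    # max_chunks comes from dividing the document length by the minimum chunk size.
    max_chunks = max(1, document_tokens // minimum_chunk_size)
    # Linear interpolation: num_chunks = 1 + detail * (max_chunks - 1).
    return int(1 + detail * (max_chunks - 1))
```

For a 5000-token document with the default `minimum_chunk_size=500`, `detail=0` yields a single chunk while `detail=1` yields the full ten chunks.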
LLMSummarizer.warm_up
def warm_up()
Warm up the chat generator and document splitter components.
LLMSummarizer.to_dict
def to_dict() -> dict[str, Any]
Serializes the component to a dictionary.
Returns:
Dictionary with serialized data.
LLMSummarizer.from_dict
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "LLMSummarizer"
Deserializes the component from a dictionary.
Arguments:
data
: Dictionary with serialized data.
Returns:
An instance of the component.
LLMSummarizer.num_tokens
def num_tokens(text: str) -> int
Estimates the token count for a given text.
Uses the RecursiveDocumentSplitter's tokenization logic for consistency.
Arguments:
text
: The text to tokenize
Returns:
The estimated token count
LLMSummarizer.summarize
def summarize(text: str,
detail: float,
minimum_chunk_size: int,
summarize_recursively: bool = False) -> str
Summarizes text by splitting it into optimally-sized chunks and processing each with an LLM.
Arguments:
text
: Text to summarize
detail
: Detail level (0-1) where 0 is most concise and 1 is most detailed
minimum_chunk_size
: Minimum token count per chunk
summarize_recursively
: Whether to use previous summaries as context
Raises:
ValueError
: If detail is not between 0 and 1
Returns:
The textual content summarized by the LLM.
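The effect of `summarize_recursively` can be sketched as follows. This is a stand-in for the component's internals: `summarize_chunks` and `summarize_chunk` are hypothetical names, and the stub callable replaces the actual LLM call. It only illustrates the documented behavior that previous summaries are fed back as context for later chunks.

```python
from typing import Callable

def summarize_chunks(chunks: list[str],
                     summarize_chunk: Callable[[str], str],
                     summarize_recursively: bool = False) -> str:
    """Summarize each chunk; optionally prepend accumulated summaries as context."""
    summaries: list[str] = []
    for chunk in chunks:
        if summarize_recursively and summaries:
            # Earlier summaries become context for summarizing the next chunk.
            prompt_input = "\n".join(summaries) + "\n\n" + chunk
        else:
            # Each chunk is summarized independently.
            prompt_input = chunk
        summaries.append(summarize_chunk(prompt_input))
    return "\n".join(summaries)
```

With `summarize_recursively=False` each chunk is summarized in isolation; with `True`, later chunks are summarized with the running summary prepended, which helps preserve cross-chunk coherence at the cost of longer prompts.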
LLMSummarizer.run
@component.output_types(summary=list[Document])
def run(*,
documents: list[Document],
detail: Optional[float] = None,
minimum_chunk_size: Optional[int] = None,
summarize_recursively: Optional[bool] = None,
system_prompt: Optional[str] = None) -> dict[str, list[Document]]
Run the summarizer on a list of documents.
Arguments:
documents
: List of documents to summarize
detail
: The level of detail for the summary (0-1); if given, it overrides the component's default.
minimum_chunk_size
: The minimum token count per chunk; if given, it overrides the component's default.
system_prompt
: If given, it overrides the prompt given at init time or the default one.
summarize_recursively
: Whether to use previous summaries as context; if given, it overrides the component's default.
Raises:
RuntimeError
: If the component wasn't warmed up.
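The override semantics of `run()` (a `None` argument falls back to the value set at init time) can be sketched with a minimal resolver. This is illustrative only; `SummarizerDefaults` and `resolve` are hypothetical names, not part of the component's API.

```python
from typing import Optional

class SummarizerDefaults:
    """Holds init-time defaults; None at run time means "use the init value"."""

    def __init__(self, summary_detail: float = 0.0, minimum_chunk_size: int = 500):
        self.summary_detail = summary_detail
        self.minimum_chunk_size = minimum_chunk_size

    def resolve(self, detail: Optional[float] = None,
                minimum_chunk_size: Optional[int] = None) -> tuple[float, int]:
        """Return the effective settings for one run() call."""
        return (
            detail if detail is not None else self.summary_detail,
            minimum_chunk_size if minimum_chunk_size is not None else self.minimum_chunk_size,
        )
```

Note that an explicit `detail=0.0` is a real override rather than a fallback, which is why the checks compare against `None` instead of truthiness.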