Enables text generation using LLMs.
Module azure
AzureOpenAIGenerator
Generates text using OpenAI's large language models (LLMs).
It works with gpt-4-type models and supports streaming responses
from the OpenAI API.
You can customize how the text is generated by passing parameters to the
OpenAI API. Use the **generation_kwargs argument when you initialize
the component or when you run it. Any parameter that works with
openai.ChatCompletion.create will work here too.
For details on OpenAI API parameters, see
OpenAI documentation.
Usage example
from haystack.components.generators import AzureOpenAIGenerator
from haystack.utils import Secret

client = AzureOpenAIGenerator(
    azure_endpoint="<Your Azure endpoint, e.g. https://your-company.azure.openai.com/>",
    api_key=Secret.from_token("<your-api-key>"),
    azure_deployment="<the model name, e.g. gpt-4o-mini>")
response = client.run("What's Natural Language Processing? Be brief.")
print(response)

>> {'replies': ['Natural Language Processing (NLP) is a branch of artificial intelligence that focuses on
>> the interaction between computers and human language. It involves enabling computers to understand, interpret,
>> and respond to natural human language in a way that is both meaningful and useful.'], 'meta': [{'model':
>> 'gpt-4o-mini', 'index': 0, 'finish_reason': 'stop', 'usage': {'prompt_tokens': 16,
>> 'completion_tokens': 49, 'total_tokens': 65}}]}
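You can combine both levels of generation_kwargs: init-time values act as defaults, and run-time values override them per call. A minimal sketch (the parameter values are illustrative):

from haystack.components.generators import AzureOpenAIGenerator
from haystack.utils import Secret

client = AzureOpenAIGenerator(
    azure_endpoint="<your-endpoint>",
    api_key=Secret.from_token("<your-api-key>"),
    azure_deployment="gpt-4o-mini",
    generation_kwargs={"temperature": 0.2})  # default for every call
# Run-time kwargs take precedence over the init-time ones for this call.
response = client.run("Summarize NLP in one sentence.",
                      generation_kwargs={"temperature": 0.9, "max_completion_tokens": 50})
print(response["replies"][0])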
AzureOpenAIGenerator.__init__
def __init__(azure_endpoint: Optional[str] = None,
api_version: Optional[str] = "2023-05-15",
azure_deployment: Optional[str] = "gpt-4o-mini",
api_key: Optional[Secret] = Secret.from_env_var(
"AZURE_OPENAI_API_KEY", strict=False),
azure_ad_token: Optional[Secret] = Secret.from_env_var(
"AZURE_OPENAI_AD_TOKEN", strict=False),
organization: Optional[str] = None,
streaming_callback: Optional[StreamingCallbackT] = None,
system_prompt: Optional[str] = None,
timeout: Optional[float] = None,
max_retries: Optional[int] = None,
http_client_kwargs: Optional[dict[str, Any]] = None,
generation_kwargs: Optional[dict[str, Any]] = None,
default_headers: Optional[dict[str, str]] = None,
*,
azure_ad_token_provider: Optional[AzureADTokenProvider] = None)

Initialize the Azure OpenAI Generator.
Arguments:
- azure_endpoint: The endpoint of the deployed model, for example https://example-resource.azure.openai.com/.
- api_version: The version of the API to use. Defaults to 2023-05-15.
- azure_deployment: The deployment of the model, usually the model name.
- api_key: The API key to use for authentication.
- azure_ad_token: Azure Active Directory token.
- organization: Your organization ID, defaults to None. For help, see Setting up your organization.
- streaming_callback: A callback function called when a new token is received from the stream.
  It accepts StreamingChunk as an argument.
- system_prompt: The system prompt to use for text generation. If not provided, the system prompt is
  omitted, and the default system prompt of the model is used.
- timeout: Timeout for the AzureOpenAI client. If not set, it is inferred from the
  OPENAI_TIMEOUT environment variable or set to 30.
- max_retries: Maximum retries to establish contact with AzureOpenAI if it returns an internal error.
  If not set, it is inferred from the OPENAI_MAX_RETRIES environment variable or set to 5.
- http_client_kwargs: A dictionary of keyword arguments to configure a custom httpx.Client or httpx.AsyncClient.
  For more information, see the HTTPX documentation.
- generation_kwargs: Other parameters to use for the model, sent directly to the OpenAI endpoint.
  See OpenAI documentation for more details. Some of the supported parameters:
  - max_completion_tokens: An upper bound for the number of tokens that can be generated for a completion,
    including visible output tokens and reasoning tokens.
  - temperature: The sampling temperature to use. Higher values mean the model takes more risks.
    Try 0.9 for more creative applications and 0 (argmax sampling) for ones with a well-defined answer.
  - top_p: An alternative to sampling with temperature, called nucleus sampling, where the model
    considers the results of the tokens with top_p probability mass. For example, 0.1 means only the tokens
    comprising the top 10% probability mass are considered.
  - n: The number of completions to generate for each prompt. For example, with 3 prompts and n=2,
    the LLM generates two completions per prompt, resulting in 6 completions total.
  - stop: One or more sequences after which the LLM should stop generating tokens.
  - presence_penalty: The penalty applied if a token is already present.
    Higher values make the model less likely to repeat the token.
  - frequency_penalty: The penalty applied if a token has already been generated.
    Higher values make the model less likely to repeat the token.
  - logit_bias: Adds a logit bias to specific tokens. The keys of the dictionary are tokens, and the
    values are the bias to add to that token.
- default_headers: Default headers to use for the AzureOpenAI client.
- azure_ad_token_provider: A function that returns an Azure Active Directory token, invoked on every request.
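For streaming, a hedged sketch using the built-in print_streaming_chunk helper (assuming it is available in haystack.components.generators.utils, as in recent Haystack releases):

from haystack.components.generators import AzureOpenAIGenerator
from haystack.components.generators.utils import print_streaming_chunk
from haystack.utils import Secret

client = AzureOpenAIGenerator(
    azure_endpoint="<your-endpoint>",
    api_key=Secret.from_token("<your-api-key>"),
    azure_deployment="gpt-4o-mini",
    streaming_callback=print_streaming_chunk)  # prints each StreamingChunk as it arrives
client.run("What's Natural Language Processing? Be brief.")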
AzureOpenAIGenerator.to_dict
def to_dict() -> dict[str, Any]

Serialize this component to a dictionary.
Returns:
The serialized component as a dictionary.
AzureOpenAIGenerator.from_dict
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "AzureOpenAIGenerator"

Deserialize this component from a dictionary.
Arguments:
data: The dictionary representation of this component.
Returns:
The deserialized component instance.
AzureOpenAIGenerator.run
@component.output_types(replies=list[str], meta=list[dict[str, Any]])
def run(prompt: str,
system_prompt: Optional[str] = None,
streaming_callback: Optional[StreamingCallbackT] = None,
generation_kwargs: Optional[dict[str, Any]] = None)

Invoke the text generation inference based on the provided messages and generation parameters.
Arguments:
- prompt: The string prompt to use for text generation.
- system_prompt: The system prompt to use for text generation. If this run-time system prompt is omitted,
  the system prompt defined at initialization time, if any, is used.
- streaming_callback: A callback function that is called when a new token is received from the stream.
- generation_kwargs: Additional keyword arguments for text generation. These parameters will potentially
  override the parameters passed in the __init__ method. For more details on the parameters supported by
  the OpenAI API, refer to the OpenAI documentation.
Returns:
A list of strings containing the generated responses and a list of dictionaries containing the metadata
for each response.
Module hugging_face_local
HuggingFaceLocalGenerator
Generates text using models from Hugging Face that run locally.
LLMs running locally may need powerful hardware.
Usage example
from haystack.components.generators import HuggingFaceLocalGenerator
generator = HuggingFaceLocalGenerator(
model="google/flan-t5-large",
task="text2text-generation",
generation_kwargs={"max_new_tokens": 100, "temperature": 0.9})
generator.warm_up()
print(generator.run("Who is the best American actor?"))
# {'replies': ['John Cusack']}

HuggingFaceLocalGenerator.__init__
def __init__(model: str = "google/flan-t5-base",
task: Optional[Literal["text-generation",
"text2text-generation"]] = None,
device: Optional[ComponentDevice] = None,
token: Optional[Secret] = Secret.from_env_var(
["HF_API_TOKEN", "HF_TOKEN"], strict=False),
generation_kwargs: Optional[dict[str, Any]] = None,
huggingface_pipeline_kwargs: Optional[dict[str, Any]] = None,
stop_words: Optional[list[str]] = None,
streaming_callback: Optional[StreamingCallbackT] = None)

Creates an instance of a HuggingFaceLocalGenerator.
Arguments:
- model: The Hugging Face text generation model name or path.
- task: The task for the Hugging Face pipeline. Possible options:
  - text-generation: Supported by decoder models, like GPT.
  - text2text-generation: Supported by encoder-decoder models, like T5.
  If the task is specified in huggingface_pipeline_kwargs, this parameter is ignored.
  If not specified, the component calls the Hugging Face API to infer the task from the model name.
- device: The device for loading the model. If None, automatically selects the default device.
  If a device or device map is specified in huggingface_pipeline_kwargs, it overrides this parameter.
- token: The token to use as HTTP bearer authorization for remote files.
  If the token is specified in huggingface_pipeline_kwargs, this parameter is ignored.
- generation_kwargs: A dictionary with keyword arguments to customize text generation.
  Some examples: max_length, max_new_tokens, temperature, top_k, top_p.
  See Hugging Face's documentation for more information:
  - customize-text-generation
  - transformers.GenerationConfig
- huggingface_pipeline_kwargs: Dictionary with keyword arguments to initialize the Hugging Face pipeline
  for text generation. These keyword arguments provide fine-grained control over the Hugging Face pipeline.
  In case of duplication, these kwargs override the model, task, device, and token init parameters.
  For available kwargs, see the Hugging Face documentation.
  In this dictionary, you can also include model_kwargs to specify the kwargs for model initialization:
  transformers.PreTrainedModel.from_pretrained.
- stop_words: If the model generates a stop word, the generation stops.
  If you provide this parameter, don't specify stopping_criteria in generation_kwargs.
  For some chat models, the output includes both the new text and the original prompt.
  In these cases, make sure your prompt has no stop words.
- streaming_callback: An optional callable for handling streaming responses.
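To make the override rules concrete, a sketch that drives the pipeline through huggingface_pipeline_kwargs (the kwargs shown, such as device_map, are standard transformers.pipeline options; the values are illustrative):

from haystack.components.generators import HuggingFaceLocalGenerator

generator = HuggingFaceLocalGenerator(
    model="google/flan-t5-large",
    task="text2text-generation",
    huggingface_pipeline_kwargs={"device_map": "auto"},  # takes precedence over the device parameter
    stop_words=["Answer:"],  # generation stops if this word is produced
    generation_kwargs={"max_new_tokens": 64})
generator.warm_up()  # loads the pipeline; call before run()
print(generator.run("Briefly, what is tokenization?")["replies"][0])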
HuggingFaceLocalGenerator.warm_up
def warm_up()

Initializes the component.
HuggingFaceLocalGenerator.to_dict
def to_dict() -> dict[str, Any]

Serializes the component to a dictionary.
Returns:
Dictionary with serialized data.
HuggingFaceLocalGenerator.from_dict
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "HuggingFaceLocalGenerator"

Deserializes the component from a dictionary.
Arguments:
data: The dictionary to deserialize from.
Returns:
The deserialized component.
HuggingFaceLocalGenerator.run
@component.output_types(replies=list[str])
def run(prompt: str,
streaming_callback: Optional[StreamingCallbackT] = None,
generation_kwargs: Optional[dict[str, Any]] = None)

Run the text generation model on the given prompt.
Arguments:
- prompt: A string representing the prompt.
- streaming_callback: A callback function that is called when a new token is received from the stream.
- generation_kwargs: Additional keyword arguments for text generation.
Returns:
A dictionary containing the generated replies.
- replies: A list of strings representing the generated replies.
Module hugging_face_api
HuggingFaceAPIGenerator
Generates text using Hugging Face APIs.
Use it with the following Hugging Face APIs:
- Serverless Inference API
- Paid Inference Endpoints
- Self-hosted Text Generation Inference
Note: As of July 2025, the Hugging Face Inference API no longer offers generative models through the
text_generation endpoint. Generative models are now only available through providers supporting the
chat_completion endpoint. As a result, this component might no longer work with the Hugging Face Inference API.
Use the HuggingFaceAPIChatGenerator component, which supports the chat_completion endpoint.
Usage examples

With Hugging Face Inference Endpoints

from haystack.components.generators import HuggingFaceAPIGenerator
from haystack.utils import Secret

generator = HuggingFaceAPIGenerator(api_type="inference_endpoints",
                                    api_params={"url": "<your-inference-endpoint-url>"},
                                    token=Secret.from_token("<your-api-key>"))

result = generator.run(prompt="What's Natural Language Processing?")
print(result)

With self-hosted text generation inference

from haystack.components.generators import HuggingFaceAPIGenerator

generator = HuggingFaceAPIGenerator(api_type="text_generation_inference",
                                    api_params={"url": "http://localhost:8080"})

result = generator.run(prompt="What's Natural Language Processing?")
print(result)

With the free serverless inference API

Be aware that this example might not work as the Hugging Face Inference API no longer offers models that
support the text_generation endpoint. Use the HuggingFaceAPIChatGenerator for generative models through the
chat_completion endpoint.

from haystack.components.generators import HuggingFaceAPIGenerator
from haystack.utils import Secret

generator = HuggingFaceAPIGenerator(api_type="serverless_inference_api",
                                    api_params={"model": "HuggingFaceH4/zephyr-7b-beta"},
                                    token=Secret.from_token("<your-api-key>"))

result = generator.run(prompt="What's Natural Language Processing?")
print(result)

HuggingFaceAPIGenerator.__init__
def __init__(api_type: Union[HFGenerationAPIType, str],
api_params: dict[str, str],
token: Optional[Secret] = Secret.from_env_var(
["HF_API_TOKEN", "HF_TOKEN"], strict=False),
generation_kwargs: Optional[dict[str, Any]] = None,
stop_words: Optional[list[str]] = None,
streaming_callback: Optional[StreamingCallbackT] = None)

Initialize the HuggingFaceAPIGenerator instance.
Arguments:
- api_type: The type of Hugging Face API to use. Available types:
  - text_generation_inference: See TGI.
  - inference_endpoints: See Inference Endpoints.
  - serverless_inference_api: See Serverless Inference API.
    This might no longer work due to changes in the models offered in the Hugging Face Inference API.
    Please use the HuggingFaceAPIChatGenerator component instead.
- api_params: A dictionary with the following keys:
  - model: Hugging Face model ID. Required when api_type is SERVERLESS_INFERENCE_API.
  - url: URL of the inference endpoint. Required when api_type is INFERENCE_ENDPOINTS or TEXT_GENERATION_INFERENCE.
  - Other parameters specific to the chosen API type, such as timeout, headers, provider, etc.
- token: The Hugging Face token to use as HTTP bearer authorization.
  Check your HF token in your account settings.
- generation_kwargs: A dictionary with keyword arguments to customize text generation.
  Some examples: max_new_tokens, temperature, top_k, top_p.
  For details, see the Hugging Face documentation.
- stop_words: An optional list of strings representing the stop words.
- streaming_callback: An optional callable for handling streaming responses.
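For example, a sketch combining generation_kwargs and stop_words against a self-hosted TGI server (the URL and values are illustrative):

from haystack.components.generators import HuggingFaceAPIGenerator

generator = HuggingFaceAPIGenerator(
    api_type="text_generation_inference",
    api_params={"url": "http://localhost:8080"},
    generation_kwargs={"max_new_tokens": 120, "temperature": 0.7},
    stop_words=["###"])  # generation stops when this sequence appears
result = generator.run(prompt="What's Natural Language Processing?")
print(result["replies"][0])
print(result["meta"][0])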
HuggingFaceAPIGenerator.to_dict
def to_dict() -> dict[str, Any]

Serialize this component to a dictionary.
Returns:
A dictionary containing the serialized component.
HuggingFaceAPIGenerator.from_dict
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "HuggingFaceAPIGenerator"

Deserialize this component from a dictionary.
HuggingFaceAPIGenerator.run
@component.output_types(replies=list[str], meta=list[dict[str, Any]])
def run(prompt: str,
streaming_callback: Optional[StreamingCallbackT] = None,
generation_kwargs: Optional[dict[str, Any]] = None)

Invoke the text generation inference for the given prompt and generation parameters.
Arguments:
- prompt: A string representing the prompt.
- streaming_callback: A callback function that is called when a new token is received from the stream.
- generation_kwargs: Additional keyword arguments for text generation.
Returns:
A dictionary with the generated replies and metadata. Both are lists of length n.
- replies: A list of strings representing the generated replies.
Module openai
OpenAIGenerator
Generates text using OpenAI's large language models (LLMs).
It works with the gpt-4 and o-series models and supports streaming responses
from the OpenAI API. It uses strings as input and output.
You can customize how the text is generated by passing parameters to the
OpenAI API. Use the **generation_kwargs argument when you initialize
the component or when you run it. Any parameter that works with
openai.ChatCompletion.create will work here too.
For details on OpenAI API parameters, see
OpenAI documentation.
Usage example
from haystack.components.generators import OpenAIGenerator
client = OpenAIGenerator()
response = client.run("What's Natural Language Processing? Be brief.")
print(response)
>> {'replies': ['Natural Language Processing (NLP) is a branch of artificial intelligence that focuses on
>> the interaction between computers and human language. It involves enabling computers to understand, interpret,
>> and respond to natural human language in a way that is both meaningful and useful.'], 'meta': [{'model':
>> 'gpt-4o-mini', 'index': 0, 'finish_reason': 'stop', 'usage': {'prompt_tokens': 16,
>> 'completion_tokens': 49, 'total_tokens': 65}}]}
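A sketch of setting a system prompt at initialization and overriding it per call (assumes the OPENAI_API_KEY environment variable is set):

from haystack.components.generators import OpenAIGenerator

client = OpenAIGenerator(
    model="gpt-4o-mini",
    system_prompt="You answer in exactly one sentence.")
# A run-time system prompt takes precedence over the init-time one.
response = client.run("What's Natural Language Processing?",
                      system_prompt="You answer briefly, in plain words.")
print(response["replies"][0])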
OpenAIGenerator.__init__

def __init__(api_key: Secret = Secret.from_env_var("OPENAI_API_KEY"),
model: str = "gpt-4o-mini",
streaming_callback: Optional[StreamingCallbackT] = None,
api_base_url: Optional[str] = None,
organization: Optional[str] = None,
system_prompt: Optional[str] = None,
generation_kwargs: Optional[dict[str, Any]] = None,
timeout: Optional[float] = None,
max_retries: Optional[int] = None,
http_client_kwargs: Optional[dict[str, Any]] = None)

Creates an instance of OpenAIGenerator. Unless specified otherwise in model, uses OpenAI's gpt-4o-mini.
By setting the OPENAI_TIMEOUT and OPENAI_MAX_RETRIES environment variables, you can change the timeout
and max_retries parameters in the OpenAI client.
Arguments:
- api_key: The OpenAI API key to connect to OpenAI.
- model: The name of the model to use.
- streaming_callback: A callback function that is called when a new token is received from the stream.
  The callback function accepts StreamingChunk as an argument.
- api_base_url: An optional base URL.
- organization: The Organization ID, defaults to None.
- system_prompt: The system prompt to use for text generation. If not provided, the system prompt is
  omitted, and the default system prompt of the model is used.
- generation_kwargs: Other parameters to use for the model. These parameters are all sent directly to
  the OpenAI endpoint. See OpenAI documentation for more details.
  Some of the supported parameters:
  - max_completion_tokens: An upper bound for the number of tokens that can be generated for a completion,
    including visible output tokens and reasoning tokens.
  - temperature: What sampling temperature to use. Higher values mean the model will take more risks.
    Try 0.9 for more creative applications and 0 (argmax sampling) for ones with a well-defined answer.
  - top_p: An alternative to sampling with temperature, called nucleus sampling, where the model
    considers the results of the tokens with top_p probability mass. So, 0.1 means only the tokens
    comprising the top 10% probability mass are considered.
  - n: How many completions to generate for each prompt. For example, if the LLM gets 3 prompts and n is 2,
    it will generate two completions for each of the three prompts, ending up with 6 completions in total.
  - stop: One or more sequences after which the LLM should stop generating tokens.
  - presence_penalty: The penalty to apply if a token is already present. Bigger values mean
    the model will be less likely to repeat the same token in the text.
  - frequency_penalty: The penalty to apply if a token has already been generated in the text.
    Bigger values mean the model will be less likely to repeat the same token in the text.
  - logit_bias: Adds a logit bias to specific tokens. The keys of the dictionary are tokens, and the
    values are the bias to add to that token.
- timeout: Timeout for OpenAI client calls. If not set, it is inferred from the OPENAI_TIMEOUT
  environment variable or set to 30.
- max_retries: Maximum retries to establish contact with OpenAI if it returns an internal error.
  If not set, it is inferred from the OPENAI_MAX_RETRIES environment variable or set to 5.
- http_client_kwargs: A dictionary of keyword arguments to configure a custom httpx.Client or httpx.AsyncClient.
  For more information, see the HTTPX documentation.
OpenAIGenerator.to_dict
def to_dict() -> dict[str, Any]

Serialize this component to a dictionary.
Returns:
The serialized component as a dictionary.
OpenAIGenerator.from_dict
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "OpenAIGenerator"

Deserialize this component from a dictionary.
Arguments:
data: The dictionary representation of this component.
Returns:
The deserialized component instance.
OpenAIGenerator.run
@component.output_types(replies=list[str], meta=list[dict[str, Any]])
def run(prompt: str,
system_prompt: Optional[str] = None,
streaming_callback: Optional[StreamingCallbackT] = None,
generation_kwargs: Optional[dict[str, Any]] = None)

Invoke the text generation inference based on the provided messages and generation parameters.
Arguments:
- prompt: The string prompt to use for text generation.
- system_prompt: The system prompt to use for text generation. If this run-time system prompt is omitted,
  the system prompt defined at initialization time, if any, is used.
- streaming_callback: A callback function that is called when a new token is received from the stream.
- generation_kwargs: Additional keyword arguments for text generation. These parameters will potentially
  override the parameters passed in the __init__ method. For more details on the parameters supported by
  the OpenAI API, refer to the OpenAI documentation.
Returns:
A list of strings containing the generated responses and a list of dictionaries containing the metadata
for each response.
Module openai_dalle
DALLEImageGenerator
Generates images using OpenAI's DALL-E model.
For details on OpenAI API parameters, see
OpenAI documentation.
Usage example
from haystack.components.generators import DALLEImageGenerator
image_generator = DALLEImageGenerator()
response = image_generator.run("Show me a picture of a black cat.")
print(response)

DALLEImageGenerator.__init__
def __init__(model: str = "dall-e-3",
quality: Literal["standard", "hd"] = "standard",
size: Literal["256x256", "512x512", "1024x1024", "1792x1024",
"1024x1792"] = "1024x1024",
response_format: Literal["url", "b64_json"] = "url",
api_key: Secret = Secret.from_env_var("OPENAI_API_KEY"),
api_base_url: Optional[str] = None,
organization: Optional[str] = None,
timeout: Optional[float] = None,
max_retries: Optional[int] = None,
http_client_kwargs: Optional[dict[str, Any]] = None)

Creates an instance of DALLEImageGenerator. Unless specified otherwise in model, uses OpenAI's dall-e-3.
Arguments:
- model: The model to use for image generation. Can be "dall-e-2" or "dall-e-3".
- quality: The quality of the generated image. Can be "standard" or "hd".
- size: The size of the generated images.
  Must be one of 256x256, 512x512, or 1024x1024 for dall-e-2.
  Must be one of 1024x1024, 1792x1024, or 1024x1792 for dall-e-3 models.
- response_format: The format of the response. Can be "url" or "b64_json".
- api_key: The OpenAI API key to connect to OpenAI.
- api_base_url: An optional base URL.
- organization: The Organization ID, defaults to None.
- timeout: Timeout for OpenAI client calls. If not set, it is inferred from the OPENAI_TIMEOUT
  environment variable or set to 30.
- max_retries: Maximum retries to establish contact with OpenAI if it returns an internal error.
  If not set, it is inferred from the OPENAI_MAX_RETRIES environment variable or set to 5.
- http_client_kwargs: A dictionary of keyword arguments to configure a custom httpx.Client or httpx.AsyncClient.
  For more information, see the HTTPX documentation.
DALLEImageGenerator.warm_up
def warm_up() -> None

Warm up the OpenAI client.
DALLEImageGenerator.run
@component.output_types(images=list[str], revised_prompt=str)
def run(prompt: str,
size: Optional[Literal["256x256", "512x512", "1024x1024", "1792x1024",
"1024x1792"]] = None,
quality: Optional[Literal["standard", "hd"]] = None,
            response_format: Optional[Literal["url", "b64_json"]] = None)

Invokes the image generation inference based on the provided prompt and generation parameters.
Arguments:
- prompt: The prompt to generate the image.
- size: If provided, overrides the size provided during initialization.
- quality: If provided, overrides the quality provided during initialization.
- response_format: If provided, overrides the response format provided during initialization.
Returns:
A dictionary containing the generated list of images and the revised prompt.
Depending on the response_format parameter, the list of images can be URLs or base64 encoded JSON strings.
The revised prompt is the prompt that was used to generate the image, if there was any revision
to the prompt made by OpenAI.
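For instance, a sketch of overriding the init-time image settings per call (assumes OPENAI_API_KEY is set; the prompt and sizes are illustrative):

from haystack.components.generators import DALLEImageGenerator

image_generator = DALLEImageGenerator(size="1024x1024", quality="standard")
image_generator.warm_up()  # initializes the OpenAI client
response = image_generator.run(
    "A watercolor painting of a lighthouse at dawn",
    size="1792x1024",        # overrides the init-time size for this call
    quality="hd",            # overrides the init-time quality
    response_format="url")   # images are returned as URLs
print(response["images"][0])
print(response["revised_prompt"])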
DALLEImageGenerator.to_dict
def to_dict() -> dict[str, Any]

Serialize this component to a dictionary.
Returns:
The serialized component as a dictionary.
DALLEImageGenerator.from_dict
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "DALLEImageGenerator"

Deserialize this component from a dictionary.
Arguments:
data: The dictionary representation of this component.
Returns:
The deserialized component instance.
Module chat/azure
AzureOpenAIChatGenerator
Generates text using OpenAI's models on Azure.
It works with gpt-4-type models and supports streaming responses
from the OpenAI API. It uses the ChatMessage
format for input and output.
You can customize how the text is generated by passing parameters to the
OpenAI API. Use the **generation_kwargs argument when you initialize
the component or when you run it. Any parameter that works with
openai.ChatCompletion.create will work here too.
For details on OpenAI API parameters, see
OpenAI documentation.
Usage example
from haystack.components.generators.chat import AzureOpenAIChatGenerator
from haystack.dataclasses import ChatMessage
from haystack.utils import Secret
messages = [ChatMessage.from_user("What's Natural Language Processing?")]
client = AzureOpenAIChatGenerator(
    azure_endpoint="<Your Azure endpoint, e.g. https://your-company.azure.openai.com/>",
    api_key=Secret.from_token("<your-api-key>"),
    azure_deployment="<the model name, e.g. gpt-4o-mini>")
response = client.run(messages)
print(response)

{'replies':
[ChatMessage(_role=<ChatRole.ASSISTANT: 'assistant'>, _content=[TextContent(text=
"Natural Language Processing (NLP) is a branch of artificial intelligence that focuses on
enabling computers to understand, interpret, and generate human language in a way that is useful.")],
_name=None,
_meta={'model': 'gpt-4o-mini', 'index': 0, 'finish_reason': 'stop',
'usage': {'prompt_tokens': 15, 'completion_tokens': 36, 'total_tokens': 51}})]
}
AzureOpenAIChatGenerator.__init__
def __init__(azure_endpoint: Optional[str] = None,
api_version: Optional[str] = "2023-05-15",
azure_deployment: Optional[str] = "gpt-4o-mini",
api_key: Optional[Secret] = Secret.from_env_var(
"AZURE_OPENAI_API_KEY", strict=False),
azure_ad_token: Optional[Secret] = Secret.from_env_var(
"AZURE_OPENAI_AD_TOKEN", strict=False),
organization: Optional[str] = None,
streaming_callback: Optional[StreamingCallbackT] = None,
timeout: Optional[float] = None,
max_retries: Optional[int] = None,
generation_kwargs: Optional[dict[str, Any]] = None,
default_headers: Optional[dict[str, str]] = None,
tools: Optional[ToolsType] = None,
tools_strict: bool = False,
*,
azure_ad_token_provider: Optional[Union[
AzureADTokenProvider, AsyncAzureADTokenProvider]] = None,
http_client_kwargs: Optional[dict[str, Any]] = None)

Initialize the Azure OpenAI Chat Generator component.
Arguments:
- azure_endpoint: The endpoint of the deployed model, for example "https://example-resource.azure.openai.com/".
- api_version: The version of the API to use. Defaults to 2023-05-15.
- azure_deployment: The deployment of the model, usually the model name.
- api_key: The API key to use for authentication.
- azure_ad_token: Azure Active Directory token.
- organization: Your organization ID, defaults to None. For help, see Setting up your organization.
- streaming_callback: A callback function called when a new token is received from the stream.
  It accepts StreamingChunk as an argument.
- timeout: Timeout for OpenAI client calls. If not set, it defaults to either the OPENAI_TIMEOUT
  environment variable or 30 seconds.
- max_retries: Maximum number of retries to contact OpenAI after an internal error.
  If not set, it defaults to either the OPENAI_MAX_RETRIES environment variable or 5.
- generation_kwargs: Other parameters to use for the model. These parameters are sent directly to
  the OpenAI endpoint. For details, see OpenAI documentation.
  Some of the supported parameters:
  - max_completion_tokens: An upper bound for the number of tokens that can be generated for a completion,
    including visible output tokens and reasoning tokens.
  - temperature: The sampling temperature to use. Higher values mean the model takes more risks.
    Try 0.9 for more creative applications and 0 (argmax sampling) for ones with a well-defined answer.
  - top_p: Nucleus sampling, an alternative to sampling with temperature, where the model considers
    tokens with top_p probability mass. For example, 0.1 means only the tokens comprising
    the top 10% probability mass are considered.
  - n: The number of completions to generate for each prompt. For example, with 3 prompts and n=2,
    the LLM generates two completions per prompt, resulting in 6 completions total.
  - stop: One or more sequences after which the LLM should stop generating tokens.
  - presence_penalty: The penalty applied if a token is already present.
    Higher values make the model less likely to repeat the token.
  - frequency_penalty: The penalty applied if a token has already been generated.
    Higher values make the model less likely to repeat the token.
  - logit_bias: Adds a logit bias to specific tokens. The keys of the dictionary are tokens, and the
    values are the bias to add to that token.
  - response_format: A JSON schema or a Pydantic model that enforces the structure of the model's response.
    If provided, the output is always validated against this format (unless the model returns a tool call).
    For details, see the OpenAI Structured Outputs documentation.
    Notes:
    - This parameter accepts Pydantic models and JSON schemas for the latest models, starting from GPT-4o.
      Older models only support a basic version of structured outputs through {"type": "json_object"}.
      For detailed information on JSON mode, see the OpenAI Structured Outputs documentation.
    - For structured outputs with streaming, the response_format must be a JSON schema and not a Pydantic model.
- default_headers: Default headers to use for the AzureOpenAI client.
- tools: A list of Tool and/or Toolset objects, or a single Toolset, for which the model can prepare calls.
- tools_strict: Whether to enable strict schema adherence for tool calls. If set to True, the model will follow
  exactly the schema provided in the parameters field of the tool definition, but this may increase latency.
- azure_ad_token_provider: A function that returns an Azure Active Directory token, invoked on every request.
- http_client_kwargs: A dictionary of keyword arguments to configure a custom httpx.Client or httpx.AsyncClient.
  For more information, see the HTTPX documentation.
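As an illustration of response_format with a Pydantic model, a hedged sketch (it assumes a deployment that supports structured outputs; the schema is invented for the example):

from pydantic import BaseModel
from haystack.components.generators.chat import AzureOpenAIChatGenerator
from haystack.dataclasses import ChatMessage
from haystack.utils import Secret

class CityInfo(BaseModel):  # illustrative schema
    city: str
    country: str

client = AzureOpenAIChatGenerator(
    azure_endpoint="<your-endpoint>",
    api_key=Secret.from_token("<your-api-key>"),
    azure_deployment="gpt-4o-mini",
    generation_kwargs={"response_format": CityInfo})
reply = client.run([ChatMessage.from_user("Name one city in France.")])["replies"][0]
print(reply.text)  # JSON text conforming to the CityInfo schema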
AzureOpenAIChatGenerator.warm_up
def warm_up()

Warm up the Azure OpenAI chat generator.
This will warm up the tools registered in the chat generator.
This method is idempotent and will only warm up the tools once.
AzureOpenAIChatGenerator.to_dict
def to_dict() -> dict[str, Any]

Serialize this component to a dictionary.
Returns:
The serialized component as a dictionary.
AzureOpenAIChatGenerator.from_dict
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "AzureOpenAIChatGenerator"

Deserialize this component from a dictionary.
Arguments:
data: The dictionary representation of this component.
Returns:
The deserialized component instance.
AzureOpenAIChatGenerator.run
@component.output_types(replies=list[ChatMessage])
def run(messages: list[ChatMessage],
streaming_callback: Optional[StreamingCallbackT] = None,
generation_kwargs: Optional[dict[str, Any]] = None,
*,
tools: Optional[ToolsType] = None,
tools_strict: Optional[bool] = None)

Invokes chat completion based on the provided messages and generation parameters.
Arguments:
- messages: A list of ChatMessage instances representing the input messages.
- streaming_callback: A callback function that is called when a new token is received from the stream.
- generation_kwargs: Additional keyword arguments for text generation. These parameters will
  override the parameters passed during component initialization.
  For details on OpenAI API parameters, see OpenAI documentation.
- tools: A list of Tool and/or Toolset objects, or a single Toolset, for which the model can prepare calls.
  If set, it overrides the tools parameter provided during initialization.
- tools_strict: Whether to enable strict schema adherence for tool calls. If set to True, the model will follow
  exactly the schema provided in the parameters field of the tool definition, but this may increase latency.
  If set, it overrides the tools_strict parameter set during component initialization.
Returns:
A dictionary with the following key:
replies: A list containing the generated responses as ChatMessage instances.
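To ground the tools parameter, a hedged sketch with a single Tool (the weather function and its JSON schema are invented for the example; Tool is assumed importable from haystack.tools, as in recent Haystack releases):

from haystack.components.generators.chat import AzureOpenAIChatGenerator
from haystack.dataclasses import ChatMessage
from haystack.tools import Tool
from haystack.utils import Secret

def get_weather(city: str) -> str:  # illustrative tool function
    return f"Sunny in {city}"

weather_tool = Tool(
    name="get_weather",
    description="Get the weather for a city.",
    parameters={"type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"]},
    function=get_weather)

client = AzureOpenAIChatGenerator(
    azure_endpoint="<your-endpoint>",
    api_key=Secret.from_token("<your-api-key>"),
    azure_deployment="gpt-4o-mini",
    tools=[weather_tool])
reply = client.run([ChatMessage.from_user("What's the weather in Paris?")])["replies"][0]
print(reply.tool_calls)  # the prepared tool call(s), if any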
AzureOpenAIChatGenerator.run_async
@component.output_types(replies=list[ChatMessage])
async def run_async(messages: list[ChatMessage],
streaming_callback: Optional[StreamingCallbackT] = None,
generation_kwargs: Optional[dict[str, Any]] = None,
*,
tools: Optional[ToolsType] = None,
tools_strict: Optional[bool] = None)

Asynchronously invokes chat completion based on the provided messages and generation parameters.
This is the asynchronous version of the run method. It has the same parameters and return values
but can be used with await in async code.
Arguments:
- messages: A list of ChatMessage instances representing the input messages.
- streaming_callback: A callback function that is called when a new token is received from the stream.
  Must be a coroutine.
- generation_kwargs: Additional keyword arguments for text generation. These parameters will
  override the parameters passed during component initialization.
  For details on OpenAI API parameters, see OpenAI documentation.
- tools: A list of Tool and/or Toolset objects, or a single Toolset, for which the model can prepare calls.
  If set, it overrides the tools parameter provided during initialization.
- tools_strict: Whether to enable strict schema adherence for tool calls. If set to True, the model will follow
  exactly the schema provided in the parameters field of the tool definition, but this may increase latency.
  If set, it overrides the tools_strict parameter set during component initialization.
Returns:
A dictionary with the following key:
replies: A list containing the generated responses as ChatMessage instances.
Module chat/azure_responses
AzureOpenAIResponsesChatGenerator
Completes chats using OpenAI's Responses API on Azure.
It works with the gpt-5 and o-series models and supports streaming responses
from the OpenAI API. It uses the ChatMessage
format for input and output.
You can customize how the text is generated by passing parameters to the
OpenAI API. Use the **generation_kwargs argument when you initialize
the component or when you run it. Any parameter that works with
openai.Responses.create will work here too.
For details on OpenAI API parameters, see
OpenAI documentation.
Usage example
from haystack.components.generators.chat import AzureOpenAIResponsesChatGenerator
from haystack.dataclasses import ChatMessage
messages = [ChatMessage.from_user("What's Natural Language Processing?")]
client = AzureOpenAIResponsesChatGenerator(
azure_endpoint="https://example-resource.azure.openai.com/",
generation_kwargs={"reasoning": {"effort": "low", "summary": "auto"}}
)
response = client.run(messages)
print(response)

AzureOpenAIResponsesChatGenerator.__init__
def __init__(*,
api_key: Union[Secret, Callable[[], str],
Callable[[],
Awaitable[str]]] = Secret.from_env_var(
"AZURE_OPENAI_API_KEY", strict=False),
azure_endpoint: Optional[str] = None,
azure_deployment: str = "gpt-5-mini",
streaming_callback: Optional[StreamingCallbackT] = None,
organization: Optional[str] = None,
generation_kwargs: Optional[dict[str, Any]] = None,
timeout: Optional[float] = None,
max_retries: Optional[int] = None,
tools: Optional[ToolsType] = None,
tools_strict: bool = False,
http_client_kwargs: Optional[dict[str, Any]] = None)

Initialize the AzureOpenAIResponsesChatGenerator component.
Arguments:
- api_key: The API key to use for authentication. Can be:
  - A Secret object containing the API key.
  - A Secret object containing the Azure Active Directory token.
  - A function that returns an Azure Active Directory token.
- azure_endpoint: The endpoint of the deployed model, for example "https://example-resource.azure.openai.com/".
- azure_deployment: The deployment of the model, usually the model name.
- organization: Your organization ID, defaults to None. For help, see Setting up your organization.
- streaming_callback: A callback function called when a new token is received from the stream.
  It accepts StreamingChunk as an argument.
- timeout: Timeout for OpenAI client calls. If not set, it defaults to either the OPENAI_TIMEOUT
  environment variable or 30 seconds.
- max_retries: Maximum number of retries to contact OpenAI after an internal error.
  If not set, it defaults to either the OPENAI_MAX_RETRIES environment variable or 5.
- generation_kwargs: Other parameters to use for the model. These parameters are sent directly to
  the OpenAI endpoint. See OpenAI documentation for more details.
  Some of the supported parameters:
  - temperature: What sampling temperature to use. Higher values like 0.8 make the output more random,
    while lower values like 0.2 make it more focused and deterministic.
  - top_p: An alternative to sampling with temperature, called nucleus sampling, where the model
    considers the results of the tokens with top_p probability mass. For example, 0.1 means only the tokens
    comprising the top 10% probability mass are considered.
  - previous_response_id: The ID of the previous response. Use this to create multi-turn conversations.
  - text_format: A JSON schema or a Pydantic model that enforces the structure of the model's response.
    If provided, the output is always validated against this format (unless the model returns a tool call).
    For details, see the OpenAI Structured Outputs documentation.
    Notes:
    - This parameter accepts Pydantic models and JSON schemas for the latest models, starting from GPT-4o.
      Older models only support a basic version of structured outputs through {"type": "json_object"}.
      For detailed information on JSON mode, see the OpenAI Structured Outputs documentation.
    - For structured outputs with streaming, the text_format must be a JSON schema and not a Pydantic model.
  - reasoning: A dictionary of parameters for reasoning. For example:
    - summary: The summary of the reasoning.
    - effort: The level of effort to put into the reasoning. Can be low, medium, or high.
    - generate_summary: Whether to generate a summary of the reasoning.
    Note: OpenAI does not return the reasoning tokens, but you can view the summary if it is enabled.
    For details, see the OpenAI Reasoning documentation.
- tools: A list of Tool and/or Toolset objects, or a single Toolset, for which the model can prepare calls.
- tools_strict: Whether to enable strict schema adherence for tool calls. If set to True, the model will follow
  exactly the schema provided in the parameters field of the tool definition, but this may increase latency.
- http_client_kwargs: A dictionary of keyword arguments to configure a custom httpx.Client or httpx.AsyncClient.
  For more information, see the HTTPX documentation.
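For example, a sketch of overriding generation parameters per call; run-time generation_kwargs take precedence over the init-time ones (the values are illustrative):

from haystack.components.generators.chat import AzureOpenAIResponsesChatGenerator
from haystack.dataclasses import ChatMessage

client = AzureOpenAIResponsesChatGenerator(
    azure_endpoint="https://example-resource.azure.openai.com/")
response = client.run(
    [ChatMessage.from_user("Outline the steps of tokenization.")],
    generation_kwargs={"reasoning": {"effort": "high", "summary": "auto"}})
print(response["replies"][0].text)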
AzureOpenAIResponsesChatGenerator.to_dict
def to_dict() -> dict[str, Any]

Serialize this component to a dictionary.
Returns:
The serialized component as a dictionary.
AzureOpenAIResponsesChatGenerator.from_dict
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "AzureOpenAIResponsesChatGenerator"

Deserialize this component from a dictionary.
Arguments:
data: The dictionary representation of this component.
Returns:
The deserialized component instance.
AzureOpenAIResponsesChatGenerator.warm_up
def warm_up()

Warm up the OpenAI responses chat generator.
This will warm up the tools registered in the chat generator.
This method is idempotent and will only warm up the tools once.
AzureOpenAIResponsesChatGenerator.run
@component.output_types(replies=list[ChatMessage])
def run(messages: list[ChatMessage],
*,
streaming_callback: Optional[StreamingCallbackT] = None,
generation_kwargs: Optional[dict[str, Any]] = None,
tools: Optional[Union[ToolsType, list[dict]]] = None,
tools_strict: Optional[bool] = None)

Invokes response generation based on the provided messages and generation parameters.
Arguments:
- messages: A list of ChatMessage instances representing the input messages.
- streaming_callback: A callback function that is called when a new token is received from the stream.
- generation_kwargs: Additional keyword arguments for text generation. These parameters will
  override the parameters passed during component initialization.
  For details on OpenAI API parameters, see OpenAI documentation.
- tools: The tools that the model can use to prepare calls. If set, it overrides the tools parameter
  set during component initialization. This parameter can accept either a mixed list of Haystack Tool
  objects and Haystack Toolsets, or a dictionary of OpenAI/MCP tool definitions.
  Note: You cannot pass OpenAI/MCP tools and Haystack tools together.
  For details on tool support, see OpenAI documentation.
- tools_strict: Whether to enable strict schema adherence for tool calls. If set to False, the model may not
  exactly follow the schema provided in the parameters field of the tool definition. In the Responses API,
  tool calls are strict by default.
  If set, it overrides the tools_strict parameter set during component initialization.
Returns:
A dictionary with the following key:
replies: A list containing the generated responses as ChatMessage instances.
AzureOpenAIResponsesChatGenerator.run_async
@component.output_types(replies=list[ChatMessage])
async def run_async(messages: list[ChatMessage],
*,
streaming_callback: Optional[StreamingCallbackT] = None,
generation_kwargs: Optional[dict[str, Any]] = None,
tools: Optional[Union[ToolsType, list[dict]]] = None,
tools_strict: Optional[bool] = None)

Asynchronously invokes response generation based on the provided messages and generation parameters.
This is the asynchronous version of the run method. It has the same parameters and return values
but can be used with await in async code.
Arguments:
- messages: A list of ChatMessage instances representing the input messages.
- streaming_callback: A callback function that is called when a new token is received from the stream.
  Must be a coroutine.
- generation_kwargs: Additional keyword arguments for text generation. These parameters will
  override the parameters passed during component initialization.
  For details on OpenAI API parameters, see OpenAI documentation.
- tools: A list of tools or a Toolset for which the model can prepare calls. If set, it overrides the
  tools parameter set during component initialization. This parameter can accept either a mixed list of
  Haystack Tool objects and Haystack Toolsets, or a dictionary of OpenAI/MCP tool definitions.
  Note: You cannot pass OpenAI/MCP tools and Haystack tools together.
- tools_strict: Whether to enable strict schema adherence for tool calls. If set to True, the model will follow
  exactly the schema provided in the parameters field of the tool definition, but this may increase latency.
  If set, it overrides the tools_strict parameter set during component initialization.
Returns:
A dictionary with the following key:
replies: A list containing the generated responses as ChatMessage instances.
Module chat/hugging_face_local
default_tool_parser
def default_tool_parser(text: str) -> Optional[list[ToolCall]]

Default implementation for parsing tool calls from model output text.
Uses DEFAULT_TOOL_PATTERN to extract tool calls.
Arguments:
text: The text to parse for tool calls.
Returns:
A list containing a single ToolCall if a valid tool call is found, None otherwise.
HuggingFaceLocalChatGenerator
Generates chat responses using models from Hugging Face that run locally.
Use this component with chat-based models,
such as HuggingFaceH4/zephyr-7b-beta or meta-llama/Llama-2-7b-chat-hf.
LLMs running locally may need powerful hardware.
Usage example
from haystack.components.generators.chat import HuggingFaceLocalChatGenerator
from haystack.dataclasses import ChatMessage
generator = HuggingFaceLocalChatGenerator(model="HuggingFaceH4/zephyr-7b-beta")
generator.warm_up()
messages = [ChatMessage.from_user("What's Natural Language Processing? Be brief.")]
print(generator.run(messages))

{'replies':
[ChatMessage(_role=<ChatRole.ASSISTANT: 'assistant'>, _content=[TextContent(text=
"Natural Language Processing (NLP) is a subfield of artificial intelligence that deals
with the interaction between computers and human language. It enables computers to understand, interpret, and
generate human language in a valuable way. NLP involves various techniques such as speech recognition, text
analysis, sentiment analysis, and machine translation. The ultimate goal is to make it easier for computers to
process and derive meaning from human language, improving communication between humans and machines.")],
_name=None,
_meta={'finish_reason': 'stop', 'index': 0, 'model':
'mistralai/Mistral-7B-Instruct-v0.2',
'usage': {'completion_tokens': 90, 'prompt_tokens': 19, 'total_tokens': 109}})
]
}
HuggingFaceLocalChatGenerator.__init__
def __init__(model: str = "HuggingFaceH4/zephyr-7b-beta",
task: Optional[Literal["text-generation",
"text2text-generation"]] = None,
device: Optional[ComponentDevice] = None,
token: Optional[Secret] = Secret.from_env_var(
["HF_API_TOKEN", "HF_TOKEN"], strict=False),
chat_template: Optional[str] = None,
generation_kwargs: Optional[dict[str, Any]] = None,
huggingface_pipeline_kwargs: Optional[dict[str, Any]] = None,
stop_words: Optional[list[str]] = None,
streaming_callback: Optional[StreamingCallbackT] = None,
tools: Optional[ToolsType] = None,
tool_parsing_function: Optional[Callable[
[str], Optional[list[ToolCall]]]] = None,
async_executor: Optional[ThreadPoolExecutor] = None) -> None

Initializes the HuggingFaceLocalChatGenerator component.
Arguments:
- model: The Hugging Face text generation model name or path,
  for example, mistralai/Mistral-7B-Instruct-v0.2 or TheBloke/OpenHermes-2.5-Mistral-7B-16k-AWQ.
  The model must be a chat model supporting the ChatML messaging format.
  If the model is specified in huggingface_pipeline_kwargs, this parameter is ignored.
- task: The task for the Hugging Face pipeline. Possible options:
  - text-generation: Supported by decoder models, like GPT.
  - text2text-generation: Supported by encoder-decoder models, like T5.
  If the task is specified in huggingface_pipeline_kwargs, this parameter is ignored.
  If not specified, the component calls the Hugging Face API to infer the task from the model name.
- device: The device for loading the model. If None, automatically selects the default device.
  If a device or device map is specified in huggingface_pipeline_kwargs, it overrides this parameter.
- token: The token to use as HTTP bearer authorization for remote files.
  If the token is specified in huggingface_pipeline_kwargs, this parameter is ignored.
- chat_template: Specifies an optional Jinja template for formatting chat messages. Most high-quality
  chat models have their own templates, but for models without this feature or if you prefer a custom
  template, use this parameter.
- generation_kwargs: A dictionary with keyword arguments to customize text generation.
  Some examples: max_length, max_new_tokens, temperature, top_k, top_p.
  See Hugging Face's GenerationConfig documentation for more information.
  The only generation_kwargs set by default is max_new_tokens, which is set to 512 tokens.
- huggingface_pipeline_kwargs: Dictionary with keyword arguments to initialize the Hugging Face pipeline
  for text generation. These keyword arguments provide fine-grained control over the Hugging Face pipeline.
  In case of duplication, these kwargs override the model, task, device, and token init parameters.
  For kwargs, see the Hugging Face documentation.
  In this dictionary, you can also include model_kwargs to specify the kwargs for model initialization.
- stop_words: A list of stop words. If the model generates a stop word, the generation stops.
  If you provide this parameter, don't specify stopping_criteria in generation_kwargs.
  For some chat models, the output includes both the new text and the original prompt.
  In these cases, make sure your prompt has no stop words.
- streaming_callback: An optional callable for handling streaming responses.
- tools: A list of Tool and/or Toolset objects, or a single Toolset, for which the model can prepare calls.
- tool_parsing_function: A callable that takes a string and returns a list of ToolCall objects or None.
  If None, the default_tool_parser is used, which extracts tool calls using a predefined pattern.
- async_executor: Optional ThreadPoolExecutor to use for async calls. If not provided, a single-threaded
  executor is initialized and used.
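A hedged streaming sketch, assuming the built-in print_streaming_chunk helper from haystack.components.generators.utils:

from haystack.components.generators.chat import HuggingFaceLocalChatGenerator
from haystack.components.generators.utils import print_streaming_chunk
from haystack.dataclasses import ChatMessage

generator = HuggingFaceLocalChatGenerator(
    model="HuggingFaceH4/zephyr-7b-beta",
    generation_kwargs={"max_new_tokens": 128},
    streaming_callback=print_streaming_chunk)  # prints tokens as they are generated
generator.warm_up()  # loads the model and tokenizer
generator.run([ChatMessage.from_user("What's Natural Language Processing? Be brief.")])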
HuggingFaceLocalChatGenerator.__del__
def __del__() -> None

Cleanup when the instance is being destroyed.
HuggingFaceLocalChatGenerator.shutdown
def shutdown() -> None

Explicitly shutdown the executor if we own it.
HuggingFaceLocalChatGenerator.warm_up
def warm_up() -> None

Initializes the component and warms up tools if provided.
HuggingFaceLocalChatGenerator.to_dict
def to_dict() -> dict[str, Any]

Serializes the component to a dictionary.
Returns:
Dictionary with serialized data.
HuggingFaceLocalChatGenerator.from_dict
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "HuggingFaceLocalChatGenerator"

Deserializes the component from a dictionary.
Arguments:
data: The dictionary to deserialize from.
Returns:
The deserialized component.
HuggingFaceLocalChatGenerator.run
@component.output_types(replies=list[ChatMessage])
def run(messages: list[ChatMessage],
generation_kwargs: Optional[dict[str, Any]] = None,
streaming_callback: Optional[StreamingCallbackT] = None,
tools: Optional[ToolsType] = None) -> dict[str, list[ChatMessage]]

Invoke text generation inference based on the provided messages and generation parameters.
Arguments:
- messages: A list of ChatMessage objects representing the input messages.
- generation_kwargs: Additional keyword arguments for text generation.
- streaming_callback: An optional callable for handling streaming responses.
- tools: A list of Tool and/or Toolset objects, or a single Toolset, for which the model can prepare calls.
  If set, it overrides the tools parameter provided during initialization.
Returns:
A dictionary with the following keys:
replies: A list containing the generated responses as ChatMessage instances.
HuggingFaceLocalChatGenerator.create_message
def create_message(text: str,
index: int,
tokenizer: Union["PreTrainedTokenizer",
"PreTrainedTokenizerFast"],
prompt: str,
generation_kwargs: dict[str, Any],
parse_tool_calls: bool = False) -> ChatMessage

Create a ChatMessage instance from the provided text, populated with metadata.
Arguments:
- text: The generated text.
- index: The index of the generated text.
- tokenizer: The tokenizer used for generation.
- prompt: The prompt used for generation.
- generation_kwargs: The generation parameters.
- parse_tool_calls: Whether to attempt parsing tool calls from the text.
Returns:
A ChatMessage instance.
HuggingFaceLocalChatGenerator.run_async
@component.output_types(replies=list[ChatMessage])
async def run_async(
messages: list[ChatMessage],
generation_kwargs: Optional[dict[str, Any]] = None,
streaming_callback: Optional[StreamingCallbackT] = None,
tools: Optional[ToolsType] = None) -> dict[str, list[ChatMessage]]

Asynchronously invokes text generation inference based on the provided messages and generation parameters.
This is the asynchronous version of the run method. It has the same parameters
and return values but can be used with await in async code.
Arguments:
- messages: A list of ChatMessage objects representing the input messages.
- generation_kwargs: Additional keyword arguments for text generation.
- streaming_callback: An optional callable for handling streaming responses.
- tools: A list of Tool and/or Toolset objects, or a single Toolset, for which the model can prepare calls.
  If set, it overrides the tools parameter provided during initialization.
Returns:
A dictionary with the following keys:
replies: A list containing the generated responses as ChatMessage instances.
Module chat/hugging_face_api
HuggingFaceAPIChatGenerator
Completes chats using Hugging Face APIs.
HuggingFaceAPIChatGenerator uses the ChatMessage
format for input and output. Use it to generate text with Hugging Face APIs:
- Serverless Inference API (Inference Providers)
- Paid Inference Endpoints
- Self-hosted Text Generation Inference
Usage examples
With the serverless inference API (Inference Providers) - free tier available
from haystack.components.generators.chat import HuggingFaceAPIChatGenerator
from haystack.dataclasses import ChatMessage
from haystack.utils import Secret
from haystack.utils.hf import HFGenerationAPIType
messages = [ChatMessage.from_system("\nYou are a helpful, respectful and honest assistant"),
ChatMessage.from_user("What's Natural Language Processing?")]
# the api_type can be expressed using the HFGenerationAPIType enum or as a string
api_type = HFGenerationAPIType.SERVERLESS_INFERENCE_API
api_type = "serverless_inference_api" # this is equivalent to the above
generator = HuggingFaceAPIChatGenerator(api_type=api_type,
api_params={"model": "Qwen/Qwen2.5-7B-Instruct",
"provider": "together"},
token=Secret.from_token("<your-api-key>"))
result = generator.run(messages)
print(result)

With the serverless inference API (Inference Providers) and text+image input
from haystack.components.generators.chat import HuggingFaceAPIChatGenerator
from haystack.dataclasses import ChatMessage, ImageContent
from haystack.utils import Secret
from haystack.utils.hf import HFGenerationAPIType
# Create an image from file path, URL, or base64
image = ImageContent.from_file_path("path/to/your/image.jpg")
# Create a multimodal message with both text and image
messages = [ChatMessage.from_user(content_parts=["Describe this image in detail", image])]
generator = HuggingFaceAPIChatGenerator(
api_type=HFGenerationAPIType.SERVERLESS_INFERENCE_API,
api_params={
"model": "Qwen/Qwen2.5-VL-7B-Instruct", # Vision Language Model
"provider": "hyperbolic"
},
token=Secret.from_token("<your-api-key>")
)
result = generator.run(messages)
print(result)

With paid inference endpoints
from haystack.components.generators.chat import HuggingFaceAPIChatGenerator
from haystack.dataclasses import ChatMessage
from haystack.utils import Secret
messages = [ChatMessage.from_system("\nYou are a helpful, respectful and honest assistant"),
ChatMessage.from_user("What's Natural Language Processing?")]
generator = HuggingFaceAPIChatGenerator(api_type="inference_endpoints",
api_params={"url": "<your-inference-endpoint-url>"},
token=Secret.from_token("<your-api-key>"))
result = generator.run(messages)
print(result)
With self-hosted text generation inference
from haystack.components.generators.chat import HuggingFaceAPIChatGenerator
from haystack.dataclasses import ChatMessage
messages = [ChatMessage.from_system("\nYou are a helpful, respectful and honest assistant"),
ChatMessage.from_user("What's Natural Language Processing?")]
generator = HuggingFaceAPIChatGenerator(api_type="text_generation_inference",
api_params={"url": "http://localhost:8080"})
result = generator.run(messages)
print(result)

HuggingFaceAPIChatGenerator.__init__
def __init__(api_type: Union[HFGenerationAPIType, str],
api_params: dict[str, str],
token: Optional[Secret] = Secret.from_env_var(
["HF_API_TOKEN", "HF_TOKEN"], strict=False),
generation_kwargs: Optional[dict[str, Any]] = None,
stop_words: Optional[list[str]] = None,
streaming_callback: Optional[StreamingCallbackT] = None,
tools: Optional[ToolsType] = None)

Initialize the HuggingFaceAPIChatGenerator instance.
Arguments:

- `api_type`: The type of Hugging Face API to use. Available types:
  - `text_generation_inference`: See TGI.
  - `inference_endpoints`: See Inference Endpoints.
  - `serverless_inference_api`: See Serverless Inference API - Inference Providers.
- `api_params`: A dictionary with the following keys:
  - `model`: Hugging Face model ID. Required when `api_type` is `SERVERLESS_INFERENCE_API`.
  - `provider`: Provider name. Recommended when `api_type` is `SERVERLESS_INFERENCE_API`.
  - `url`: URL of the inference endpoint. Required when `api_type` is `INFERENCE_ENDPOINTS` or `TEXT_GENERATION_INFERENCE`.
  - Other parameters specific to the chosen API type, such as `timeout`, `headers`, etc.
- `token`: The Hugging Face token to use as HTTP bearer authorization. Check your HF token in your account settings.
- `generation_kwargs`: A dictionary with keyword arguments to customize text generation. Some examples: `max_tokens`, `temperature`, `top_p`. For details, see Hugging Face chat_completion documentation (a short initialization sketch follows this list).
- `stop_words`: An optional list of strings representing the stop words.
- `streaming_callback`: An optional callable for handling streaming responses.
- `tools`: A list of Tool and/or Toolset objects, or a single Toolset for which the model can prepare calls. The chosen model should support tool/function calling, according to the model card. Support for tools in the Hugging Face API and TGI is not yet fully refined and you may experience unexpected behavior.
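A minimal initialization sketch for the serverless API, using the documented parameters above; the model ID and stop word are illustrative assumptions, not defaults:

```python
from haystack.components.generators.chat import HuggingFaceAPIChatGenerator
from haystack.utils import Secret
from haystack.utils.hf import HFGenerationAPIType

# Serverless setup with custom generation defaults and a stop word.
# The model ID is illustrative; pick any chat-capable model.
generator = HuggingFaceAPIChatGenerator(
    api_type=HFGenerationAPIType.SERVERLESS_INFERENCE_API,
    api_params={"model": "HuggingFaceH4/zephyr-7b-beta"},
    token=Secret.from_env_var(["HF_API_TOKEN", "HF_TOKEN"], strict=False),
    generation_kwargs={"max_tokens": 256, "temperature": 0.7},
    stop_words=["###"],
)
```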
HuggingFaceAPIChatGenerator.warm_up
```python
def warm_up()
```

Warm up the Hugging Face API chat generator.
This will warm up the tools registered in the chat generator.
This method is idempotent and will only warm up the tools once.
HuggingFaceAPIChatGenerator.to_dict
```python
def to_dict() -> dict[str, Any]
```

Serialize this component to a dictionary.
Returns:
A dictionary containing the serialized component.
HuggingFaceAPIChatGenerator.from_dict
```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "HuggingFaceAPIChatGenerator"
```

Deserialize this component from a dictionary.
HuggingFaceAPIChatGenerator.run
```python
@component.output_types(replies=list[ChatMessage])
def run(messages: list[ChatMessage],
        generation_kwargs: Optional[dict[str, Any]] = None,
        tools: Optional[ToolsType] = None,
        streaming_callback: Optional[StreamingCallbackT] = None)
```

Invoke the text generation inference based on the provided messages and generation parameters.
Arguments:

- `messages`: A list of ChatMessage objects representing the input messages.
- `generation_kwargs`: Additional keyword arguments for text generation (see the sketch after this section).
- `tools`: A list of tools or a Toolset for which the model can prepare calls. If set, it will override the `tools` parameter set during component initialization. This parameter can accept either a list of `Tool` objects or a `Toolset` instance.
- `streaming_callback`: An optional callable for handling streaming responses. If set, it will override the `streaming_callback` parameter set during component initialization.

Returns:

A dictionary with the following keys:

- `replies`: A list containing the generated responses as ChatMessage objects.
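A sketch of per-call tuning: run-time generation_kwargs can differ from those set at initialization (assuming a generator configured as in the sketch above):

```python
from haystack.dataclasses import ChatMessage

messages = [ChatMessage.from_user("Summarize NLP in one sentence.")]
# Generation parameters for this request only.
result = generator.run(messages, generation_kwargs={"max_tokens": 64, "temperature": 0.2})
print(result["replies"][0].text)
```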
HuggingFaceAPIChatGenerator.run_async
```python
@component.output_types(replies=list[ChatMessage])
async def run_async(messages: list[ChatMessage],
                    generation_kwargs: Optional[dict[str, Any]] = None,
                    tools: Optional[ToolsType] = None,
                    streaming_callback: Optional[StreamingCallbackT] = None)
```

Asynchronously invokes the text generation inference based on the provided messages and generation parameters.
This is the asynchronous version of the run method. It has the same parameters
and return values but can be used with await in async code.
Arguments:

- `messages`: A list of ChatMessage objects representing the input messages.
- `generation_kwargs`: Additional keyword arguments for text generation.
- `tools`: A list of tools or a Toolset for which the model can prepare calls. If set, it will override the `tools` parameter set during component initialization. This parameter can accept either a list of `Tool` objects or a `Toolset` instance.
- `streaming_callback`: An optional callable for handling streaming responses. If set, it will override the `streaming_callback` parameter set during component initialization.

Returns:

A dictionary with the following keys:

- `replies`: A list containing the generated responses as ChatMessage objects.
Module chat/openai
OpenAIChatGenerator
Completes chats using OpenAI's large language models (LLMs).
It works with the gpt-4 and o-series models and supports streaming responses
from OpenAI API. It uses ChatMessage
format in input and output.
You can customize how the text is generated by passing parameters to the
OpenAI API. Use the **generation_kwargs argument when you initialize
the component or when you run it. Any parameter that works with
openai.ChatCompletion.create will work here too.
For details on OpenAI API parameters, see
OpenAI documentation.
Usage example

```python
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.dataclasses import ChatMessage
messages = [ChatMessage.from_user("What's Natural Language Processing?")]
client = OpenAIChatGenerator()
response = client.run(messages)
print(response)
```

Output:

```
{'replies':
[ChatMessage(_role=<ChatRole.ASSISTANT: 'assistant'>, _content=
[TextContent(text="Natural Language Processing (NLP) is a branch of artificial intelligence
that focuses on enabling computers to understand, interpret, and generate human language in
a way that is meaningful and useful.")],
_name=None,
_meta={'model': 'gpt-4o-mini', 'index': 0, 'finish_reason': 'stop',
'usage': {'prompt_tokens': 15, 'completion_tokens': 36, 'total_tokens': 51}})
]
}
```
OpenAIChatGenerator.__init__
```python
def __init__(api_key: Secret = Secret.from_env_var("OPENAI_API_KEY"),
             model: str = "gpt-4o-mini",
             streaming_callback: Optional[StreamingCallbackT] = None,
             api_base_url: Optional[str] = None,
             organization: Optional[str] = None,
             generation_kwargs: Optional[dict[str, Any]] = None,
             timeout: Optional[float] = None,
             max_retries: Optional[int] = None,
             tools: Optional[ToolsType] = None,
             tools_strict: bool = False,
             http_client_kwargs: Optional[dict[str, Any]] = None)
```

Creates an instance of OpenAIChatGenerator. Unless specified otherwise in `model`, uses OpenAI's gpt-4o-mini.
Before initializing the component, you can set the 'OPENAI_TIMEOUT' and 'OPENAI_MAX_RETRIES'
environment variables to override the timeout and max_retries parameters respectively
in the OpenAI client.
Arguments:

- `api_key`: The OpenAI API key. You can set it with an environment variable `OPENAI_API_KEY`, or pass it with this parameter during initialization.
- `model`: The name of the model to use.
- `streaming_callback`: A callback function that is called when a new token is received from the stream. The callback function accepts StreamingChunk as an argument (see the streaming sketch after this list).
- `api_base_url`: An optional base URL.
- `organization`: Your organization ID, defaults to `None`. See production best practices.
- `generation_kwargs`: Other parameters to use for the model. These parameters are sent directly to the OpenAI endpoint. See OpenAI documentation for more details. Some of the supported parameters:
  - `max_completion_tokens`: An upper bound for the number of tokens that can be generated for a completion, including visible output tokens and reasoning tokens.
  - `temperature`: What sampling temperature to use. Higher values mean the model will take more risks. Try 0.9 for more creative applications and 0 (argmax sampling) for ones with a well-defined answer.
  - `top_p`: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. For example, 0.1 means only the tokens comprising the top 10% probability mass are considered.
  - `n`: How many completions to generate for each prompt. For example, if the LLM gets 3 prompts and n is 2, it will generate two completions for each of the three prompts, ending up with 6 completions in total.
  - `stop`: One or more sequences after which the LLM should stop generating tokens.
  - `presence_penalty`: What penalty to apply if a token is already present at all. Bigger values mean the model will be less likely to repeat the same token in the text.
  - `frequency_penalty`: What penalty to apply if a token has already been generated in the text. Bigger values mean the model will be less likely to repeat the same token in the text.
  - `logit_bias`: Add a logit bias to specific tokens. The keys of the dictionary are tokens, and the values are the bias to add to that token.
  - `response_format`: A JSON schema or a Pydantic model that enforces the structure of the model's response. If provided, the output will always be validated against this format (unless the model returns a tool call). For details, see the OpenAI Structured Outputs documentation.
    Notes:
    - This parameter accepts Pydantic models and JSON schemas for the latest models starting from GPT-4o. Older models only support a basic version of structured outputs through `{"type": "json_object"}`. For detailed information on JSON mode, see the OpenAI Structured Outputs documentation.
    - For structured outputs with streaming, the `response_format` must be a JSON schema and not a Pydantic model.
- `timeout`: Timeout for OpenAI client calls. If not set, it defaults to either the `OPENAI_TIMEOUT` environment variable, or 30 seconds.
- `max_retries`: Maximum number of retries to contact OpenAI after an internal error. If not set, it defaults to either the `OPENAI_MAX_RETRIES` environment variable, or 5.
- `tools`: A list of Tool and/or Toolset objects, or a single Toolset for which the model can prepare calls.
- `tools_strict`: Whether to enable strict schema adherence for tool calls. If set to `True`, the model will follow exactly the schema provided in the `parameters` field of the tool definition, but this may increase latency.
- `http_client_kwargs`: A dictionary of keyword arguments to configure a custom `httpx.Client` or `httpx.AsyncClient`. For more information, see the HTTPX documentation.
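A short streaming sketch, assuming the print_streaming_chunk helper is available at this path in your installed version; any callable accepting a StreamingChunk works as well:

```python
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.components.generators.utils import print_streaming_chunk
from haystack.dataclasses import ChatMessage

# Tokens are printed as they arrive instead of waiting for the full reply.
client = OpenAIChatGenerator(streaming_callback=print_streaming_chunk)
client.run([ChatMessage.from_user("What's Natural Language Processing?")])
```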
OpenAIChatGenerator.warm_up
```python
def warm_up()
```

Warm up the OpenAI chat generator.
This will warm up the tools registered in the chat generator.
This method is idempotent and will only warm up the tools once.
OpenAIChatGenerator.to_dict
```python
def to_dict() -> dict[str, Any]
```

Serialize this component to a dictionary.
Returns:
The serialized component as a dictionary.
OpenAIChatGenerator.from_dict
```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "OpenAIChatGenerator"
```

Deserialize this component from a dictionary.
Arguments:

- `data`: The dictionary representation of this component.

Returns:

The deserialized component instance.
OpenAIChatGenerator.run
```python
@component.output_types(replies=list[ChatMessage])
def run(messages: list[ChatMessage],
        streaming_callback: Optional[StreamingCallbackT] = None,
        generation_kwargs: Optional[dict[str, Any]] = None,
        *,
        tools: Optional[ToolsType] = None,
        tools_strict: Optional[bool] = None)
```

Invokes chat completion based on the provided messages and generation parameters.
Arguments:

- `messages`: A list of ChatMessage instances representing the input messages.
- `streaming_callback`: A callback function that is called when a new token is received from the stream.
- `generation_kwargs`: Additional keyword arguments for text generation. These parameters will override the parameters passed during component initialization. For details on OpenAI API parameters, see OpenAI documentation.
- `tools`: A list of Tool and/or Toolset objects, or a single Toolset for which the model can prepare calls. If set, it will override the `tools` parameter provided during initialization (see the tool-calling sketch after this section).
- `tools_strict`: Whether to enable strict schema adherence for tool calls. If set to `True`, the model will follow exactly the schema provided in the `parameters` field of the tool definition, but this may increase latency. If set, it will override the `tools_strict` parameter set during component initialization.

Returns:

A dictionary with the following key:

- `replies`: A list containing the generated responses as ChatMessage instances.
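To illustrate the tools parameter, a minimal sketch with a single Haystack Tool; the get_weather function and its schema are purely illustrative:

```python
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.dataclasses import ChatMessage
from haystack.tools import Tool

def get_weather(city: str) -> str:
    # Hypothetical helper; a real tool would call a weather service.
    return f"Sunny in {city}"

weather_tool = Tool(
    name="get_weather",
    description="Get the current weather for a city.",
    parameters={
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
    function=get_weather,
)

client = OpenAIChatGenerator(tools=[weather_tool], tools_strict=True)
result = client.run([ChatMessage.from_user("What's the weather in Paris?")])
# If the model decided to call the tool, the reply carries tool calls
# instead of plain text.
print(result["replies"][0].tool_calls)
```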
OpenAIChatGenerator.run_async
```python
@component.output_types(replies=list[ChatMessage])
async def run_async(messages: list[ChatMessage],
                    streaming_callback: Optional[StreamingCallbackT] = None,
                    generation_kwargs: Optional[dict[str, Any]] = None,
                    *,
                    tools: Optional[ToolsType] = None,
                    tools_strict: Optional[bool] = None)
```

Asynchronously invokes chat completion based on the provided messages and generation parameters.
This is the asynchronous version of the run method. It has the same parameters and return values
but can be used with await in async code.
Arguments:

- `messages`: A list of ChatMessage instances representing the input messages.
- `streaming_callback`: A callback function that is called when a new token is received from the stream. Must be a coroutine.
- `generation_kwargs`: Additional keyword arguments for text generation. These parameters will override the parameters passed during component initialization. For details on OpenAI API parameters, see OpenAI documentation.
- `tools`: A list of Tool and/or Toolset objects, or a single Toolset for which the model can prepare calls. If set, it will override the `tools` parameter provided during initialization.
- `tools_strict`: Whether to enable strict schema adherence for tool calls. If set to `True`, the model will follow exactly the schema provided in the `parameters` field of the tool definition, but this may increase latency. If set, it will override the `tools_strict` parameter set during component initialization.

Returns:

A dictionary with the following key:

- `replies`: A list containing the generated responses as ChatMessage instances.
Module chat/openai_responses
OpenAIResponsesChatGenerator
Completes chats using OpenAI's Responses API.
It works with the gpt-4 and o-series models and supports streaming responses
from OpenAI API. It uses ChatMessage
format in input and output.
You can customize how the text is generated by passing parameters to the
OpenAI API. Use the **generation_kwargs argument when you initialize
the component or when you run it. Any parameter that works with
openai.Responses.create will work here too.
For details on OpenAI API parameters, see
OpenAI documentation.
Usage example

```python
from haystack.components.generators.chat import OpenAIResponsesChatGenerator
from haystack.dataclasses import ChatMessage
messages = [ChatMessage.from_user("What's Natural Language Processing?")]
client = OpenAIResponsesChatGenerator(generation_kwargs={"reasoning": {"effort": "low", "summary": "auto"}})
response = client.run(messages)
print(response)
```

OpenAIResponsesChatGenerator.__init__
```python
def __init__(*,
             api_key: Secret = Secret.from_env_var("OPENAI_API_KEY"),
             model: str = "gpt-5-mini",
             streaming_callback: Optional[StreamingCallbackT] = None,
             api_base_url: Optional[str] = None,
             organization: Optional[str] = None,
             generation_kwargs: Optional[dict[str, Any]] = None,
             timeout: Optional[float] = None,
             max_retries: Optional[int] = None,
             tools: Optional[Union[ToolsType, list[dict]]] = None,
             tools_strict: bool = False,
             http_client_kwargs: Optional[dict[str, Any]] = None)
```

Creates an instance of OpenAIResponsesChatGenerator. Uses OpenAI's gpt-5-mini by default.
Before initializing the component, you can set the 'OPENAI_TIMEOUT' and 'OPENAI_MAX_RETRIES'
environment variables to override the timeout and max_retries parameters respectively
in the OpenAI client.
Arguments:

- `api_key`: The OpenAI API key. You can set it with an environment variable `OPENAI_API_KEY`, or pass it with this parameter during initialization.
- `model`: The name of the model to use.
- `streaming_callback`: A callback function that is called when a new token is received from the stream. The callback function accepts StreamingChunk as an argument.
- `api_base_url`: An optional base URL.
- `organization`: Your organization ID, defaults to `None`. See production best practices.
- `generation_kwargs`: Other parameters to use for the model. These parameters are sent directly to the OpenAI endpoint. See OpenAI documentation for more details. Some of the supported parameters:
  - `temperature`: What sampling temperature to use. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic.
  - `top_p`: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. For example, 0.1 means only the tokens comprising the top 10% probability mass are considered.
  - `previous_response_id`: The ID of the previous response. Use this to create multi-turn conversations.
  - `text_format`: A JSON schema or a Pydantic model that enforces the structure of the model's response. If provided, the output will always be validated against this format (unless the model returns a tool call). For details, see the OpenAI Structured Outputs documentation (see the structured-output sketch after this list).
    Notes:
    - This parameter accepts Pydantic models and JSON schemas for the latest models starting from GPT-4o. Older models only support a basic version of structured outputs through `{"type": "json_object"}`. For detailed information on JSON mode, see the OpenAI Structured Outputs documentation.
    - For structured outputs with streaming, the `text_format` must be a JSON schema and not a Pydantic model.
  - `reasoning`: A dictionary of parameters for reasoning. For example:
    - `summary`: The summary of the reasoning.
    - `effort`: The level of effort to put into the reasoning. Can be `low`, `medium`, or `high`.
    - `generate_summary`: Whether to generate a summary of the reasoning.
    Note: OpenAI does not return the reasoning tokens, but you can view the summary if it's enabled. For details, see the OpenAI Reasoning documentation.
- `timeout`: Timeout for OpenAI client calls. If not set, it defaults to either the `OPENAI_TIMEOUT` environment variable, or 30 seconds.
- `max_retries`: Maximum number of retries to contact OpenAI after an internal error. If not set, it defaults to either the `OPENAI_MAX_RETRIES` environment variable, or 5.
- `tools`: The tools that the model can use to prepare calls. This parameter can accept either a mixed list of Haystack `Tool` objects and Haystack `Toolset` objects, or a list of dictionaries with OpenAI/MCP tool definitions. Note: You cannot pass OpenAI/MCP tools and Haystack tools together. For details on tool support, see OpenAI documentation.
- `tools_strict`: Whether to enable strict schema adherence for tool calls. If set to `False`, the model may not exactly follow the schema provided in the `parameters` field of the tool definition. In the Responses API, tool calls are strict by default.
- `http_client_kwargs`: A dictionary of keyword arguments to configure a custom `httpx.Client` or `httpx.AsyncClient`. For more information, see the HTTPX documentation.
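A structured-output sketch passing a Pydantic model through text_format, as documented above; the CityInfo model is illustrative:

```python
from pydantic import BaseModel

from haystack.components.generators.chat import OpenAIResponsesChatGenerator
from haystack.dataclasses import ChatMessage

class CityInfo(BaseModel):
    # Illustrative schema the response must conform to.
    name: str
    country: str

client = OpenAIResponsesChatGenerator(
    generation_kwargs={"text_format": CityInfo},
)
response = client.run([ChatMessage.from_user("Name one large European city.")])
print(response["replies"][0].text)
```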
OpenAIResponsesChatGenerator.warm_up
```python
def warm_up()
```

Warm up the OpenAI responses chat generator.
This will warm up the tools registered in the chat generator.
This method is idempotent and will only warm up the tools once.
OpenAIResponsesChatGenerator.to_dict
```python
def to_dict() -> dict[str, Any]
```

Serialize this component to a dictionary.
Returns:
The serialized component as a dictionary.
OpenAIResponsesChatGenerator.from_dict
```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "OpenAIResponsesChatGenerator"
```

Deserialize this component from a dictionary.
Arguments:

- `data`: The dictionary representation of this component.

Returns:

The deserialized component instance.
OpenAIResponsesChatGenerator.run
```python
@component.output_types(replies=list[ChatMessage])
def run(messages: list[ChatMessage],
        *,
        streaming_callback: Optional[StreamingCallbackT] = None,
        generation_kwargs: Optional[dict[str, Any]] = None,
        tools: Optional[Union[ToolsType, list[dict]]] = None,
        tools_strict: Optional[bool] = None)
```

Invokes response generation based on the provided messages and generation parameters.
Arguments:

- `messages`: A list of ChatMessage instances representing the input messages.
- `streaming_callback`: A callback function that is called when a new token is received from the stream.
- `generation_kwargs`: Additional keyword arguments for text generation. These parameters will override the parameters passed during component initialization. For details on OpenAI API parameters, see OpenAI documentation.
- `tools`: The tools that the model can use to prepare calls. If set, it will override the `tools` parameter set during component initialization. This parameter can accept either a mixed list of Haystack `Tool` objects and Haystack `Toolset` objects, or a list of dictionaries with OpenAI/MCP tool definitions. Note: You cannot pass OpenAI/MCP tools and Haystack tools together. For details on tool support, see OpenAI documentation.
- `tools_strict`: Whether to enable strict schema adherence for tool calls. If set to `False`, the model may not exactly follow the schema provided in the `parameters` field of the tool definition. In the Responses API, tool calls are strict by default. If set, it will override the `tools_strict` parameter set during component initialization.

Returns:

A dictionary with the following key:

- `replies`: A list containing the generated responses as ChatMessage instances.
OpenAIResponsesChatGenerator.run_async
```python
@component.output_types(replies=list[ChatMessage])
async def run_async(messages: list[ChatMessage],
                    *,
                    streaming_callback: Optional[StreamingCallbackT] = None,
                    generation_kwargs: Optional[dict[str, Any]] = None,
                    tools: Optional[Union[ToolsType, list[dict]]] = None,
                    tools_strict: Optional[bool] = None)
```

Asynchronously invokes response generation based on the provided messages and generation parameters.
This is the asynchronous version of the run method. It has the same parameters and return values
but can be used with await in async code.
Arguments:

- `messages`: A list of ChatMessage instances representing the input messages.
- `streaming_callback`: A callback function that is called when a new token is received from the stream. Must be a coroutine.
- `generation_kwargs`: Additional keyword arguments for text generation. These parameters will override the parameters passed during component initialization. For details on OpenAI API parameters, see OpenAI documentation.
- `tools`: A list of tools or a Toolset for which the model can prepare calls. If set, it will override the `tools` parameter set during component initialization. This parameter can accept either a mixed list of Haystack `Tool` objects and Haystack `Toolset` objects, or a list of dictionaries with OpenAI/MCP tool definitions. Note: You cannot pass OpenAI/MCP tools and Haystack tools together.
- `tools_strict`: Whether to enable strict schema adherence for tool calls. If set to `True`, the model will follow exactly the schema provided in the `parameters` field of the tool definition, but this may increase latency. If set, it will override the `tools_strict` parameter set during component initialization.

Returns:

A dictionary with the following key:

- `replies`: A list containing the generated responses as ChatMessage instances.
Module chat/fallback
FallbackChatGenerator
A chat generator wrapper that tries multiple chat generators sequentially.
It forwards all parameters transparently to the underlying chat generators and returns the first successful result.
Calls chat generators sequentially until one succeeds. Falls back on any exception raised by a generator.
If all chat generators fail, it raises a RuntimeError with details.
Timeout enforcement is fully delegated to the underlying chat generators. The fallback mechanism will only
work correctly if the underlying chat generators implement proper timeout handling and raise exceptions
when timeouts occur. For predictable latency guarantees, ensure your chat generators:
- Support a `timeout` parameter in their initialization
- Implement timeout as total wall-clock time (a shared deadline for both streaming and non-streaming)
- Raise timeout exceptions (e.g., TimeoutError, asyncio.TimeoutError, httpx.TimeoutException) when exceeded
Note: Most well-implemented chat generators (OpenAI, Anthropic, Cohere, etc.) support timeout parameters
with consistent semantics. For HTTP-based LLM providers, a single timeout value (e.g., timeout=30)
typically applies to all connection phases: connection setup, read, write, and pool. For streaming
responses, read timeout is the maximum gap between chunks. For non-streaming, it's the time limit for
receiving the complete response.
Failover is automatically triggered when a generator raises any exception, including:
- Timeout errors (if the generator implements and raises them)
- Rate limit errors (429)
- Authentication errors (401)
- Context length errors (400)
- Server errors (500+)
- Any other exception
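A usage sketch, assuming FallbackChatGenerator is exported from haystack.components.generators.chat like its siblings and that the relevant API keys are set in the environment:

```python
from haystack.components.generators.chat import FallbackChatGenerator, OpenAIChatGenerator
from haystack.dataclasses import ChatMessage

# Try the primary generator first; fall back to the backup on any exception.
# An explicit timeout keeps each attempt's latency bounded, as discussed above.
primary = OpenAIChatGenerator(model="gpt-4o-mini", timeout=30)
backup = OpenAIChatGenerator(model="gpt-4o", timeout=30)
generator = FallbackChatGenerator(chat_generators=[primary, backup])

result = generator.run([ChatMessage.from_user("What's Natural Language Processing?")])
print(result["replies"][0].text)
# meta records which generator succeeded and how many attempts were made.
print(result["meta"]["successful_chat_generator_class"], result["meta"]["total_attempts"])
```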
FallbackChatGenerator.__init__
```python
def __init__(chat_generators: list[ChatGenerator])
```

Creates an instance of FallbackChatGenerator.
Arguments:
- `chat_generators`: A non-empty list of chat generator components to try in order.
FallbackChatGenerator.to_dict
```python
def to_dict() -> dict[str, Any]
```

Serialize the component, including nested chat generators when they support serialization.
FallbackChatGenerator.from_dict
```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "FallbackChatGenerator"
```

Rebuild the component from a serialized representation, restoring nested chat generators.
FallbackChatGenerator.warm_up
```python
def warm_up() -> None
```

Warm up all underlying chat generators.
This method calls warm_up() on each underlying generator that supports it.
FallbackChatGenerator.run
```python
@component.output_types(replies=list[ChatMessage], meta=dict[str, Any])
def run(messages: list[ChatMessage],
        generation_kwargs: Union[dict[str, Any], None] = None,
        tools: Optional[ToolsType] = None,
        streaming_callback: Union[StreamingCallbackT, None] = None) -> dict[str, Any]
```

Execute chat generators sequentially until one succeeds.
Arguments:

- `messages`: The conversation history as a list of ChatMessage instances.
- `generation_kwargs`: Optional parameters for the chat generator (e.g., temperature, max_tokens).
- `tools`: A list of Tool and/or Toolset objects, or a single Toolset for function calling capabilities.
- `streaming_callback`: Optional callable for handling streaming responses.
Raises:
RuntimeError: If all chat generators fail.
Returns:
A dictionary with:
- "replies": Generated ChatMessage instances from the first successful generator.
- "meta": Execution metadata including successful_chat_generator_index, successful_chat_generator_class,
total_attempts, failed_chat_generators, plus any metadata from the successful generator.
FallbackChatGenerator.run_async
```python
@component.output_types(replies=list[ChatMessage], meta=dict[str, Any])
async def run_async(messages: list[ChatMessage],
                    generation_kwargs: Union[dict[str, Any], None] = None,
                    tools: Optional[ToolsType] = None,
                    streaming_callback: Union[StreamingCallbackT, None] = None) -> dict[str, Any]
```

Asynchronously execute chat generators sequentially until one succeeds.
Arguments:

- `messages`: The conversation history as a list of ChatMessage instances.
- `generation_kwargs`: Optional parameters for the chat generator (e.g., temperature, max_tokens).
- `tools`: A list of Tool and/or Toolset objects, or a single Toolset for function calling capabilities.
- `streaming_callback`: Optional callable for handling streaming responses.
Raises:
RuntimeError: If all chat generators fail.
Returns:
A dictionary with:
- "replies": Generated ChatMessage instances from the first successful generator.
- "meta": Execution metadata including successful_chat_generator_index, successful_chat_generator_class,
total_attempts, failed_chat_generators, plus any metadata from the successful generator.
