DocumentationAPI Reference📓 Tutorials🧑‍🍳 Cookbook🤝 Integrations💜 Discord🎨 Studio (Waitlist)
API Reference

Enables text generation using LLMs.

Module azure

AzureOpenAIGenerator

Generates text using OpenAI's large language models (LLMs).

It works with the gpt-4 and gpt-3.5-turbo family of models. You can customize how the text is generated by passing parameters to the OpenAI API. Use the **generation_kwargs argument when you initialize the component or when you run it. Any parameter that works with openai.ChatCompletion.create will work here too.

For details on OpenAI API parameters, see OpenAI documentation.

Usage example

from haystack.components.generators import AzureOpenAIGenerator
from haystack.utils import Secret
client = AzureOpenAIGenerator(
    azure_endpoint="<Your Azure endpoint e.g. `https://your-company.azure.openai.com/>",
    api_key=Secret.from_token("<your-api-key>"),
    azure_deployment="<this a model name, e.g. gpt-35-turbo>")
response = client.run("What's Natural Language Processing? Be brief.")
print(response)
>> {'replies': ['Natural Language Processing (NLP) is a branch of artificial intelligence that focuses on
>> the interaction between computers and human language. It involves enabling computers to understand, interpret,
>> and respond to natural human language in a way that is both meaningful and useful.'], 'meta': [{'model':
>> 'gpt-3.5-turbo-0613', 'index': 0, 'finish_reason': 'stop', 'usage': {'prompt_tokens': 16,
>> 'completion_tokens': 49, 'total_tokens': 65}}]}

AzureOpenAIGenerator.__init__

def __init__(
        azure_endpoint: Optional[str] = None,
        api_version: Optional[str] = "2023-05-15",
        azure_deployment: Optional[str] = "gpt-35-turbo",
        api_key: Optional[Secret] = Secret.from_env_var("AZURE_OPENAI_API_KEY",
                                                        strict=False),
        azure_ad_token: Optional[Secret] = Secret.from_env_var(
            "AZURE_OPENAI_AD_TOKEN", strict=False),
        organization: Optional[str] = None,
        streaming_callback: Optional[Callable[[StreamingChunk], None]] = None,
        system_prompt: Optional[str] = None,
        timeout: Optional[float] = None,
        max_retries: Optional[int] = None,
        generation_kwargs: Optional[Dict[str, Any]] = None)

Initialize the Azure OpenAI Generator.

Arguments:

  • azure_endpoint: The endpoint of the deployed model, for example https://example-resource.azure.openai.com/.
  • api_version: The version of the API to use. Defaults to 2023-05-15.
  • azure_deployment: The deployment of the model, usually the model name.
  • api_key: The API key to use for authentication.
  • azure_ad_token: Azure Active Directory token.
  • organization: Your organization ID, defaults to None. For help, see Setting up your organization.
  • streaming_callback: A callback function called when a new token is received from the stream. It accepts StreamingChunk as an argument.
  • system_prompt: The system prompt to use for text generation. If not provided, the Generator omits the system prompt and uses the default system prompt.
  • timeout: Timeout for AzureOpenAI client. If not set, it is inferred from the OPENAI_TIMEOUT environment variable or set to 30.
  • max_retries: Maximum retries to establish contact with AzureOpenAI if it returns an internal error. If not set, it is inferred from the OPENAI_MAX_RETRIES environment variable or set to 5.
  • generation_kwargs: Other parameters to use for the model, sent directly to the OpenAI endpoint. See OpenAI documentation for more details. Some of the supported parameters:
  • max_tokens: The maximum number of tokens the output text can have.
  • temperature: The sampling temperature to use. Higher values mean the model takes more risks. Try 0.9 for more creative applications and 0 (argmax sampling) for ones with a well-defined answer.
  • top_p: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. For example, 0.1 means only the tokens comprising the top 10% probability mass are considered.
  • n: The number of completions to generate for each prompt. For example, with 3 prompts and n=2, the LLM will generate two completions per prompt, resulting in 6 completions total.
  • stop: One or more sequences after which the LLM should stop generating tokens.
  • presence_penalty: The penalty applied if a token is already present. Higher values make the model less likely to repeat the token.
  • frequency_penalty: Penalty applied if a token has already been generated. Higher values make the model less likely to repeat the token.
  • logit_bias: Adds a logit bias to specific tokens. The keys of the dictionary are tokens, and the values are the bias to add to that token.

AzureOpenAIGenerator.to_dict

def to_dict() -> Dict[str, Any]

Serialize this component to a dictionary.

Returns:

The serialized component as a dictionary.

AzureOpenAIGenerator.from_dict

@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "AzureOpenAIGenerator"

Deserialize this component from a dictionary.

Arguments:

  • data: The dictionary representation of this component.

Returns:

The deserialized component instance.

AzureOpenAIGenerator.run

@component.output_types(replies=List[str], meta=List[Dict[str, Any]])
def run(prompt: str,
        streaming_callback: Optional[Callable[[StreamingChunk], None]] = None,
        generation_kwargs: Optional[Dict[str, Any]] = None)

Invoke the text generation inference based on the provided messages and generation parameters.

Arguments:

  • prompt: The string prompt to use for text generation.
  • streaming_callback: A callback function that is called when a new token is received from the stream.
  • generation_kwargs: Additional keyword arguments for text generation. These parameters will potentially override the parameters passed in the __init__ method. For more details on the parameters supported by the OpenAI API, refer to the OpenAI documentation.

Returns:

A list of strings containing the generated responses and a list of dictionaries containing the metadata for each response.

Module hugging_face_local

HuggingFaceLocalGenerator

Generates text using models from Hugging Face that run locally.

LLMs running locally may need powerful hardware.

Usage example

from haystack.components.generators import HuggingFaceLocalGenerator

generator = HuggingFaceLocalGenerator(
    model="google/flan-t5-large",
    task="text2text-generation",
    generation_kwargs={"max_new_tokens": 100, "temperature": 0.9})

generator.warm_up()

print(generator.run("Who is the best American actor?"))
# {'replies': ['John Cusack']}

HuggingFaceLocalGenerator.__init__

def __init__(model: str = "google/flan-t5-base",
             task: Optional[Literal["text-generation",
                                    "text2text-generation"]] = None,
             device: Optional[ComponentDevice] = None,
             token: Optional[Secret] = Secret.from_env_var(
                 ["HF_API_TOKEN", "HF_TOKEN"], strict=False),
             generation_kwargs: Optional[Dict[str, Any]] = None,
             huggingface_pipeline_kwargs: Optional[Dict[str, Any]] = None,
             stop_words: Optional[List[str]] = None,
             streaming_callback: Optional[Callable[[StreamingChunk],
                                                   None]] = None)

Creates an instance of a HuggingFaceLocalGenerator.

Arguments:

  • model: The Hugging Face text generation model name or path.
  • task: The task for the Hugging Face pipeline. Possible options:
  • text-generation: Supported by decoder models, like GPT.
  • text2text-generation: Supported by encoder-decoder models, like T5. If the task is specified in huggingface_pipeline_kwargs, this parameter is ignored. If not specified, the component calls the Hugging Face API to infer the task from the model name.
  • device: The device for loading the model. If None, automatically selects the default device. If a device or device map is specified in huggingface_pipeline_kwargs, it overrides this parameter.
  • token: The token to use as HTTP bearer authorization for remote files. If the token is specified in huggingface_pipeline_kwargs, this parameter is ignored.
  • generation_kwargs: A dictionary with keyword arguments to customize text generation. Some examples: max_length, max_new_tokens, temperature, top_k, top_p. See Hugging Face's documentation for more information:
  • customize-text-generation
  • transformers.GenerationConfig
  • huggingface_pipeline_kwargs: Dictionary with keyword arguments to initialize the Hugging Face pipeline for text generation. These keyword arguments provide fine-grained control over the Hugging Face pipeline. In case of duplication, these kwargs override model, task, device, and token init parameters. For available kwargs, see Hugging Face documentation. In this dictionary, you can also include model_kwargs to specify the kwargs for model initialization: transformers.PreTrainedModel.from_pretrained
  • stop_words: If the model generates a stop word, the generation stops. If you provide this parameter, don't specify the stopping_criteria in generation_kwargs. For some chat models, the output includes both the new text and the original prompt. In these cases, make sure your prompt has no stop words.
  • streaming_callback: An optional callable for handling streaming responses.

HuggingFaceLocalGenerator.warm_up

def warm_up()

Initializes the component.

HuggingFaceLocalGenerator.to_dict

def to_dict() -> Dict[str, Any]

Serializes the component to a dictionary.

Returns:

Dictionary with serialized data.

HuggingFaceLocalGenerator.from_dict

@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "HuggingFaceLocalGenerator"

Deserializes the component from a dictionary.

Arguments:

  • data: The dictionary to deserialize from.

Returns:

The deserialized component.

HuggingFaceLocalGenerator.run

@component.output_types(replies=List[str])
def run(prompt: str, generation_kwargs: Optional[Dict[str, Any]] = None)

Run the text generation model on the given prompt.

Arguments:

  • prompt: A string representing the prompt.
  • generation_kwargs: Additional keyword arguments for text generation.

Returns:

A dictionary containing the generated replies.

  • replies: A list of strings representing the generated replies.

Module hugging_face_api

HuggingFaceAPIGenerator

Generates text using Hugging Face APIs.

Use it with the following Hugging Face APIs:

Usage examples

With the free serverless inference API

from haystack.components.generators import HuggingFaceAPIGenerator
from haystack.utils import Secret

generator = HuggingFaceAPIGenerator(api_type="serverless_inference_api",
                                    api_params={"model": "HuggingFaceH4/zephyr-7b-beta"},
                                    token=Secret.from_token("<your-api-key>"))

result = generator.run(prompt="What's Natural Language Processing?")
print(result)

With paid inference endpoints

from haystack.components.generators import HuggingFaceAPIGenerator
from haystack.utils import Secret

generator = HuggingFaceAPIGenerator(api_type="inference_endpoints",
                                    api_params={"url": "<your-inference-endpoint-url>"},
                                    token=Secret.from_token("<your-api-key>"))

result = generator.run(prompt="What's Natural Language Processing?")
print(result)

#### With self-hosted text generation inference
```python
from haystack.components.generators import HuggingFaceAPIGenerator

generator = HuggingFaceAPIGenerator(api_type="text_generation_inference",
                                    api_params={"url": "http://localhost:8080"})

result = generator.run(prompt="What's Natural Language Processing?")
print(result)

HuggingFaceAPIGenerator.__init__

def __init__(api_type: Union[HFGenerationAPIType, str],
             api_params: Dict[str, str],
             token: Optional[Secret] = Secret.from_env_var(
                 ["HF_API_TOKEN", "HF_TOKEN"], strict=False),
             generation_kwargs: Optional[Dict[str, Any]] = None,
             stop_words: Optional[List[str]] = None,
             streaming_callback: Optional[Callable[[StreamingChunk],
                                                   None]] = None)

Initialize the HuggingFaceAPIGenerator instance.

Arguments:

  • api_type: The type of Hugging Face API to use. Available types:
  • text_generation_inference: See TGI.
  • inference_endpoints: See Inference Endpoints.
  • serverless_inference_api: See Serverless Inference API.
  • api_params: A dictionary with the following keys:
  • model: Hugging Face model ID. Required when api_type is SERVERLESS_INFERENCE_API.
  • url: URL of the inference endpoint. Required when api_type is INFERENCE_ENDPOINTS or TEXT_GENERATION_INFERENCE.
  • token: The Hugging Face token to use as HTTP bearer authorization. Check your HF token in your account settings.
  • generation_kwargs: A dictionary with keyword arguments to customize text generation. Some examples: max_new_tokens, temperature, top_k, top_p. For details, see Hugging Face documentation for more information.
  • stop_words: An optional list of strings representing the stop words.
  • streaming_callback: An optional callable for handling streaming responses.

HuggingFaceAPIGenerator.to_dict

def to_dict() -> Dict[str, Any]

Serialize this component to a dictionary.

Returns:

A dictionary containing the serialized component.

HuggingFaceAPIGenerator.from_dict

@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "HuggingFaceAPIGenerator"

Deserialize this component from a dictionary.

HuggingFaceAPIGenerator.run

@component.output_types(replies=List[str], meta=List[Dict[str, Any]])
def run(prompt: str, generation_kwargs: Optional[Dict[str, Any]] = None)

Invoke the text generation inference for the given prompt and generation parameters.

Arguments:

  • prompt: A string representing the prompt.
  • generation_kwargs: Additional keyword arguments for text generation.

Returns:

A dictionary with the generated replies and metadata. Both are lists of length n.

  • replies: A list of strings representing the generated replies.

Module openai

OpenAIGenerator

Generates text using OpenAI's large language models (LLMs).

It works with the gpt-4 and gpt-3.5-turbo models and supports streaming responses from OpenAI API. It uses strings as input and output.

You can customize how the text is generated by passing parameters to the OpenAI API. Use the **generation_kwargs argument when you initialize the component or when you run it. Any parameter that works with openai.ChatCompletion.create will work here too.

For details on OpenAI API parameters, see OpenAI documentation.

Usage example

from haystack.components.generators import OpenAIGenerator
client = OpenAIGenerator()
response = client.run("What's Natural Language Processing? Be brief.")
print(response)

>> {'replies': ['Natural Language Processing (NLP) is a branch of artificial intelligence that focuses on
>> the interaction between computers and human language. It involves enabling computers to understand, interpret,
>> and respond to natural human language in a way that is both meaningful and useful.'], 'meta': [{'model':
>> 'gpt-3.5-turbo-0613', 'index': 0, 'finish_reason': 'stop', 'usage': {'prompt_tokens': 16,
>> 'completion_tokens': 49, 'total_tokens': 65}}]}

OpenAIGenerator.__init__

def __init__(api_key: Secret = Secret.from_env_var("OPENAI_API_KEY"),
             model: str = "gpt-3.5-turbo",
             streaming_callback: Optional[Callable[[StreamingChunk],
                                                   None]] = None,
             api_base_url: Optional[str] = None,
             organization: Optional[str] = None,
             system_prompt: Optional[str] = None,
             generation_kwargs: Optional[Dict[str, Any]] = None,
             timeout: Optional[float] = None,
             max_retries: Optional[int] = None)

Creates an instance of OpenAIGenerator. Unless specified otherwise in model, uses OpenAI's GPT-3.5.

By setting the 'OPENAI_TIMEOUT' and 'OPENAI_MAX_RETRIES' you can change the timeout and max_retries parameters in the OpenAI client.

Arguments:

  • api_key: The OpenAI API key to connect to OpenAI.
  • model: The name of the model to use.
  • streaming_callback: A callback function that is called when a new token is received from the stream. The callback function accepts StreamingChunk as an argument.
  • api_base_url: An optional base URL.
  • organization: The Organization ID, defaults to None.
  • system_prompt: The system prompt to use for text generation. If not provided, the system prompt is omitted, and the default system prompt of the model is used.
  • generation_kwargs: Other parameters to use for the model. These parameters are all sent directly to the OpenAI endpoint. See OpenAI documentation for more details. Some of the supported parameters:
  • max_tokens: The maximum number of tokens the output text can have.
  • temperature: What sampling temperature to use. Higher values mean the model will take more risks. Try 0.9 for more creative applications and 0 (argmax sampling) for ones with a well-defined answer.
  • top_p: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So, 0.1 means only the tokens comprising the top 10% probability mass are considered.
  • n: How many completions to generate for each prompt. For example, if the LLM gets 3 prompts and n is 2, it will generate two completions for each of the three prompts, ending up with 6 completions in total.
  • stop: One or more sequences after which the LLM should stop generating tokens.
  • presence_penalty: What penalty to apply if a token is already present at all. Bigger values mean the model will be less likely to repeat the same token in the text.
  • frequency_penalty: What penalty to apply if a token has already been generated in the text. Bigger values mean the model will be less likely to repeat the same token in the text.
  • logit_bias: Add a logit bias to specific tokens. The keys of the dictionary are tokens, and the values are the bias to add to that token.
  • timeout: Timeout for OpenAI Client calls, if not set it is inferred from the OPENAI_TIMEOUT environment variable or set to 30.
  • max_retries: Maximum retries to establish contact with OpenAI if it returns an internal error, if not set it is inferred from the OPENAI_MAX_RETRIES environment variable or set to 5.

OpenAIGenerator.to_dict

def to_dict() -> Dict[str, Any]

Serialize this component to a dictionary.

Returns:

The serialized component as a dictionary.

OpenAIGenerator.from_dict

@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "OpenAIGenerator"

Deserialize this component from a dictionary.

Arguments:

  • data: The dictionary representation of this component.

Returns:

The deserialized component instance.

OpenAIGenerator.run

@component.output_types(replies=List[str], meta=List[Dict[str, Any]])
def run(prompt: str,
        streaming_callback: Optional[Callable[[StreamingChunk], None]] = None,
        generation_kwargs: Optional[Dict[str, Any]] = None)

Invoke the text generation inference based on the provided messages and generation parameters.

Arguments:

  • prompt: The string prompt to use for text generation.
  • streaming_callback: A callback function that is called when a new token is received from the stream.
  • generation_kwargs: Additional keyword arguments for text generation. These parameters will potentially override the parameters passed in the __init__ method. For more details on the parameters supported by the OpenAI API, refer to the OpenAI documentation.

Returns:

A list of strings containing the generated responses and a list of dictionaries containing the metadata for each response.

Module chat/azure

AzureOpenAIChatGenerator

Generates text using OpenAI's models on Azure.

It works with the gpt-4 and gpt-3.5-turbo - type models and supports streaming responses from OpenAI API. It uses ChatMessage format in input and output.

You can customize how the text is generated by passing parameters to the OpenAI API. Use the **generation_kwargs argument when you initialize the component or when you run it. Any parameter that works with openai.ChatCompletion.create will work here too.

For details on OpenAI API parameters, see OpenAI documentation.

Usage example

from haystack.components.generators.chat import AzureOpenAIGenerator
from haystack.dataclasses import ChatMessage
from haystack.utils import Secret

messages = [ChatMessage.from_user("What's Natural Language Processing?")]

client = AzureOpenAIGenerator(
    azure_endpoint="<Your Azure endpoint e.g. `https://your-company.azure.openai.com/>",
    api_key=Secret.from_token("<your-api-key>"),
    azure_deployment="<this a model name, e.g. gpt-35-turbo>")
response = client.run(messages)
print(response)
{'replies':
    [ChatMessage(content='Natural Language Processing (NLP) is a branch of artificial intelligence that focuses on
     enabling computers to understand, interpret, and generate human language in a way that is useful.',
     role=<ChatRole.ASSISTANT: 'assistant'>, name=None,
     meta={'model': 'gpt-3.5-turbo-0613', 'index': 0, 'finish_reason': 'stop',
     'usage': {'prompt_tokens': 15, 'completion_tokens': 36, 'total_tokens': 51}})]
}

AzureOpenAIChatGenerator.__init__

def __init__(
        azure_endpoint: Optional[str] = None,
        api_version: Optional[str] = "2023-05-15",
        azure_deployment: Optional[str] = "gpt-35-turbo",
        api_key: Optional[Secret] = Secret.from_env_var("AZURE_OPENAI_API_KEY",
                                                        strict=False),
        azure_ad_token: Optional[Secret] = Secret.from_env_var(
            "AZURE_OPENAI_AD_TOKEN", strict=False),
        organization: Optional[str] = None,
        streaming_callback: Optional[Callable[[StreamingChunk], None]] = None,
        timeout: Optional[float] = None,
        max_retries: Optional[int] = None,
        generation_kwargs: Optional[Dict[str, Any]] = None)

Initialize the Azure OpenAI Chat Generator component.

Arguments:

  • azure_endpoint: The endpoint of the deployed model, for example "https://example-resource.azure.openai.com/".
  • api_version: The version of the API to use. Defaults to 2023-05-15.
  • azure_deployment: The deployment of the model, usually the model name.
  • api_key: The API key to use for authentication.
  • azure_ad_token: Azure Active Directory token.
  • organization: Your organization ID, defaults to None. For help, see Setting up your organization.
  • streaming_callback: A callback function called when a new token is received from the stream. It accepts StreamingChunk as an argument.
  • timeout: Timeout for OpenAI client calls. If not set, it defaults to either the OPENAI_TIMEOUT environment variable, or 30 seconds.
  • max_retries: Maximum number of retries to contact OpenAI after an internal error. If not set, it defaults to either the OPENAI_MAX_RETRIES environment variable, or set to 5.
  • generation_kwargs: Other parameters to use for the model. These parameters are sent directly to the OpenAI endpoint. For details, see OpenAI documentation. Some of the supported parameters:
  • max_tokens: The maximum number of tokens the output text can have.
  • temperature: The sampling temperature to use. Higher values mean the model takes more risks. Try 0.9 for more creative applications and 0 (argmax sampling) for ones with a well-defined answer.
  • top_p: Nucleus sampling is an alternative to sampling with temperature, where the model considers tokens with a top_p probability mass. For example, 0.1 means only the tokens comprising the top 10% probability mass are considered.
  • n: The number of completions to generate for each prompt. For example, with 3 prompts and n=2, the LLM will generate two completions per prompt, resulting in 6 completions total.
  • stop: One or more sequences after which the LLM should stop generating tokens.
  • presence_penalty: The penalty applied if a token is already present. Higher values make the model less likely to repeat the token.
  • frequency_penalty: Penalty applied if a token has already been generated. Higher values make the model less likely to repeat the token.
  • logit_bias: Adds a logit bias to specific tokens. The keys of the dictionary are tokens, and the values are the bias to add to that token.

AzureOpenAIChatGenerator.to_dict

def to_dict() -> Dict[str, Any]

Serialize this component to a dictionary.

Returns:

The serialized component as a dictionary.

AzureOpenAIChatGenerator.from_dict

@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "AzureOpenAIChatGenerator"

Deserialize this component from a dictionary.

Arguments:

  • data: The dictionary representation of this component.

Returns:

The deserialized component instance.

AzureOpenAIChatGenerator.run

@component.output_types(replies=List[ChatMessage])
def run(messages: List[ChatMessage],
        streaming_callback: Optional[Callable[[StreamingChunk], None]] = None,
        generation_kwargs: Optional[Dict[str, Any]] = None)

Invokes chat completion based on the provided messages and generation parameters.

Arguments:

  • messages: A list of ChatMessage instances representing the input messages.
  • streaming_callback: A callback function that is called when a new token is received from the stream.
  • generation_kwargs: Additional keyword arguments for text generation. These parameters will override the parameters passed during component initialization. For details on OpenAI API parameters, see OpenAI documentation.

Returns:

A list containing the generated responses as ChatMessage instances.

Module chat/hugging_face_local

HuggingFaceLocalChatGenerator

Generates chat responses using models from Hugging Face that run locally.

Use this component with chat-based models, such as HuggingFaceH4/zephyr-7b-beta or meta-llama/Llama-2-7b-chat-hf. LLMs running locally may need powerful hardware.

Usage example

from haystack.components.generators.chat import HuggingFaceLocalChatGenerator
from haystack.dataclasses import ChatMessage

generator = HuggingFaceLocalChatGenerator(model="HuggingFaceH4/zephyr-7b-beta")
generator.warm_up()
messages = [ChatMessage.from_user("What's Natural Language Processing? Be brief.")]
print(generator.run(messages))
{'replies':
    [ChatMessage(content=' Natural Language Processing (NLP) is a subfield of artificial intelligence that deals
    with the interaction between computers and human language. It enables computers to understand, interpret, and
    generate human language in a valuable way. NLP involves various techniques such as speech recognition, text
    analysis, sentiment analysis, and machine translation. The ultimate goal is to make it easier for computers to
    process and derive meaning from human language, improving communication between humans and machines.',
    role=<ChatRole.ASSISTANT: 'assistant'>,
    name=None,
    meta={'finish_reason': 'stop', 'index': 0, 'model':
          'mistralai/Mistral-7B-Instruct-v0.2',
          'usage': {'completion_tokens': 90, 'prompt_tokens': 19, 'total_tokens': 109}})
          ]
}

HuggingFaceLocalChatGenerator.__init__

def __init__(model: str = "HuggingFaceH4/zephyr-7b-beta",
             task: Optional[Literal["text-generation",
                                    "text2text-generation"]] = None,
             device: Optional[ComponentDevice] = None,
             token: Optional[Secret] = Secret.from_env_var(
                 ["HF_API_TOKEN", "HF_TOKEN"], strict=False),
             chat_template: Optional[str] = None,
             generation_kwargs: Optional[Dict[str, Any]] = None,
             huggingface_pipeline_kwargs: Optional[Dict[str, Any]] = None,
             stop_words: Optional[List[str]] = None,
             streaming_callback: Optional[Callable[[StreamingChunk],
                                                   None]] = None)

Initializes the HuggingFaceLocalChatGenerator component.

Arguments:

  • model: The Hugging Face text generation model name or path, for example, mistralai/Mistral-7B-Instruct-v0.2 or TheBloke/OpenHermes-2.5-Mistral-7B-16k-AWQ. The model must be a chat model supporting the ChatML messaging format. If the model is specified in huggingface_pipeline_kwargs, this parameter is ignored.
  • task: The task for the Hugging Face pipeline. Possible options:
  • text-generation: Supported by decoder models, like GPT.
  • text2text-generation: Supported by encoder-decoder models, like T5. If the task is specified in huggingface_pipeline_kwargs, this parameter is ignored. If not specified, the component calls the Hugging Face API to infer the task from the model name.
  • device: The device for loading the model. If None, automatically selects the default device. If a device or device map is specified in huggingface_pipeline_kwargs, it overrides this parameter.
  • token: The token to use as HTTP bearer authorization for remote files. If the token is specified in huggingface_pipeline_kwargs, this parameter is ignored.
  • chat_template: Specifies an optional Jinja template for formatting chat messages. Most high-quality chat models have their own templates, but for models without this feature or if you prefer a custom template, use this parameter.
  • generation_kwargs: A dictionary with keyword arguments to customize text generation. Some examples: max_length, max_new_tokens, temperature, top_k, top_p. See Hugging Face's documentation for more information:
    • GenerationConfig The only generation_kwargs set by default is max_new_tokens, which is set to 512 tokens.
  • huggingface_pipeline_kwargs: Dictionary with keyword arguments to initialize the Hugging Face pipeline for text generation. These keyword arguments provide fine-grained control over the Hugging Face pipeline. In case of duplication, these kwargs override model, task, device, and token init parameters. For kwargs, see Hugging Face documentation. In this dictionary, you can also include model_kwargs to specify the kwargs for model initialization
  • stop_words: A list of stop words. If the model generates a stop word, the generation stops. If you provide this parameter, don't specify the stopping_criteria in generation_kwargs. For some chat models, the output includes both the new text and the original prompt. In these cases, make sure your prompt has no stop words.
  • streaming_callback: An optional callable for handling streaming responses.

HuggingFaceLocalChatGenerator.warm_up

def warm_up()

Initializes the component.

HuggingFaceLocalChatGenerator.to_dict

def to_dict() -> Dict[str, Any]

Serializes the component to a dictionary.

Returns:

Dictionary with serialized data.

HuggingFaceLocalChatGenerator.from_dict

@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "HuggingFaceLocalChatGenerator"

Deserializes the component from a dictionary.

Arguments:

  • data: The dictionary to deserialize from.

Returns:

The deserialized component.

HuggingFaceLocalChatGenerator.run

@component.output_types(replies=List[ChatMessage])
def run(messages: List[ChatMessage],
        generation_kwargs: Optional[Dict[str, Any]] = None)

Invoke text generation inference based on the provided messages and generation parameters.

Arguments:

  • messages: A list of ChatMessage objects representing the input messages.
  • generation_kwargs: Additional keyword arguments for text generation.

Returns:

A list containing the generated responses as ChatMessage instances.

HuggingFaceLocalChatGenerator.create_message

def create_message(text: str, index: int,
                   tokenizer: Union["PreTrainedTokenizer",
                                    "PreTrainedTokenizerFast"], prompt: str,
                   generation_kwargs: Dict[str, Any]) -> ChatMessage

Create a ChatMessage instance from the provided text, populated with metadata.

Arguments:

  • text: The generated text.
  • index: The index of the generated text.
  • tokenizer: The tokenizer used for generation.
  • prompt: The prompt used for generation.
  • generation_kwargs: The generation parameters.

Returns:

A ChatMessage instance.

Module chat/hugging_face_api

HuggingFaceAPIChatGenerator

Completes chats using Hugging Face APIs.

HuggingFaceAPIChatGenerator uses the ChatMessage format for input and output. Use it to generate text with Hugging Face APIs:

Usage examples

With the free serverless inference API

from haystack.components.generators.chat import HuggingFaceAPIChatGenerator
from haystack.dataclasses import ChatMessage
from haystack.utils import Secret
from haystack.utils.hf import HFGenerationAPIType

messages = [ChatMessage.from_system("\nYou are a helpful, respectful and honest assistant"),
            ChatMessage.from_user("What's Natural Language Processing?")]

# the api_type can be expressed using the HFGenerationAPIType enum or as a string
api_type = HFGenerationAPIType.SERVERLESS_INFERENCE_API
api_type = "serverless_inference_api" # this is equivalent to the above

generator = HuggingFaceAPIChatGenerator(api_type=api_type,
                                        api_params={"model": "HuggingFaceH4/zephyr-7b-beta"},
                                        token=Secret.from_token("<your-api-key>"))

result = generator.run(messages)
print(result)

With paid inference endpoints

from haystack.components.generators.chat import HuggingFaceAPIChatGenerator
from haystack.dataclasses import ChatMessage
from haystack.utils import Secret

messages = [ChatMessage.from_system("\nYou are a helpful, respectful and honest assistant"),
            ChatMessage.from_user("What's Natural Language Processing?")]

generator = HuggingFaceAPIChatGenerator(api_type="inference_endpoints",
                                        api_params={"url": "<your-inference-endpoint-url>"},
                                        token=Secret.from_token("<your-api-key>"))

result = generator.run(messages)
print(result)

#### With self-hosted text generation inference

```python
from haystack.components.generators.chat import HuggingFaceAPIChatGenerator
from haystack.dataclasses import ChatMessage

messages = [ChatMessage.from_system("\nYou are a helpful, respectful and honest assistant"),
            ChatMessage.from_user("What's Natural Language Processing?")]

generator = HuggingFaceAPIChatGenerator(api_type="text_generation_inference",
                                        api_params={"url": "http://localhost:8080"})

result = generator.run(messages)
print(result)

HuggingFaceAPIChatGenerator.__init__

def __init__(api_type: Union[HFGenerationAPIType, str],
             api_params: Dict[str, str],
             token: Optional[Secret] = Secret.from_env_var(
                 ["HF_API_TOKEN", "HF_TOKEN"], strict=False),
             generation_kwargs: Optional[Dict[str, Any]] = None,
             stop_words: Optional[List[str]] = None,
             streaming_callback: Optional[Callable[[StreamingChunk],
                                                   None]] = None)

Initialize the HuggingFaceAPIChatGenerator instance.

Arguments:

  • api_type: The type of Hugging Face API to use. Available types:
  • text_generation_inference: See TGI.
  • inference_endpoints: See Inference Endpoints.
  • serverless_inference_api: See Serverless Inference API.
  • api_params: A dictionary with the following keys:
  • model: Hugging Face model ID. Required when api_type is SERVERLESS_INFERENCE_API.
  • url: URL of the inference endpoint. Required when api_type is INFERENCE_ENDPOINTS or TEXT_GENERATION_INFERENCE.
  • token: The Hugging Face token to use as HTTP bearer authorization. Check your HF token in your account settings.
  • generation_kwargs: A dictionary with keyword arguments to customize text generation. Some examples: max_tokens, temperature, top_p. For details, see Hugging Face chat_completion documentation.
  • stop_words: An optional list of strings representing the stop words.
  • streaming_callback: An optional callable for handling streaming responses.

HuggingFaceAPIChatGenerator.to_dict

def to_dict() -> Dict[str, Any]

Serialize this component to a dictionary.

Returns:

A dictionary containing the serialized component.

HuggingFaceAPIChatGenerator.from_dict

@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "HuggingFaceAPIChatGenerator"

Deserialize this component from a dictionary.

HuggingFaceAPIChatGenerator.run

@component.output_types(replies=List[ChatMessage])
def run(messages: List[ChatMessage],
        generation_kwargs: Optional[Dict[str, Any]] = None)

Invoke the text generation inference based on the provided messages and generation parameters.

Arguments:

  • messages: A list of ChatMessage objects representing the input messages.
  • generation_kwargs: Additional keyword arguments for text generation.

Returns:

A dictionary with the following keys:

  • replies: A list containing the generated responses as ChatMessage objects.

Module chat/openai

OpenAIChatGenerator

Completes chats using OpenAI's large language models (LLMs).

It works with the gpt-4 and gpt-3.5-turbo models and supports streaming responses from OpenAI API. It uses ChatMessage format in input and output.

You can customize how the text is generated by passing parameters to the OpenAI API. Use the **generation_kwargs argument when you initialize the component or when you run it. Any parameter that works with openai.ChatCompletion.create will work here too.

For details on OpenAI API parameters, see OpenAI documentation.

Usage example

from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.dataclasses import ChatMessage

messages = [ChatMessage.from_user("What's Natural Language Processing?")]

client = OpenAIChatGenerator()
response = client.run(messages)
print(response)

Output:

{'replies':
    [ChatMessage(content='Natural Language Processing (NLP) is a branch of artificial intelligence
        that focuses on enabling computers to understand, interpret, and generate human language in
        a way that is meaningful and useful.',
     role=<ChatRole.ASSISTANT: 'assistant'>, name=None,
     meta={'model': 'gpt-3.5-turbo-0613', 'index': 0, 'finish_reason': 'stop',
     'usage': {'prompt_tokens': 15, 'completion_tokens': 36, 'total_tokens': 51}})
    ]
}

OpenAIChatGenerator.__init__

def __init__(api_key: Secret = Secret.from_env_var("OPENAI_API_KEY"),
             model: str = "gpt-3.5-turbo",
             streaming_callback: Optional[Callable[[StreamingChunk],
                                                   None]] = None,
             api_base_url: Optional[str] = None,
             organization: Optional[str] = None,
             generation_kwargs: Optional[Dict[str, Any]] = None,
             timeout: Optional[float] = None,
             max_retries: Optional[int] = None)

Creates an instance of OpenAIChatGenerator. Unless specified otherwise in model, uses OpenAI's GPT-3.5.

Before initializing the component, you can set the 'OPENAI_TIMEOUT' and 'OPENAI_MAX_RETRIES' environment variables to override the timeout and max_retries parameters respectively in the OpenAI client.

Arguments:

  • api_key: The OpenAI API key. You can set it with an environment variable OPENAI_API_KEY, or pass with this parameter during initialization.
  • model: The name of the model to use.
  • streaming_callback: A callback function that is called when a new token is received from the stream. The callback function accepts StreamingChunk as an argument.
  • api_base_url: An optional base URL.
  • organization: Your organization ID, defaults to None. See production best practices.
  • generation_kwargs: Other parameters to use for the model. These parameters are sent directly to the OpenAI endpoint. See OpenAI documentation for more details. Some of the supported parameters:
  • max_tokens: The maximum number of tokens the output text can have.
  • temperature: What sampling temperature to use. Higher values mean the model will take more risks. Try 0.9 for more creative applications and 0 (argmax sampling) for ones with a well-defined answer.
  • top_p: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. For example, 0.1 means only the tokens comprising the top 10% probability mass are considered.
  • n: How many completions to generate for each prompt. For example, if the LLM gets 3 prompts and n is 2, it will generate two completions for each of the three prompts, ending up with 6 completions in total.
  • stop: One or more sequences after which the LLM should stop generating tokens.
  • presence_penalty: What penalty to apply if a token is already present at all. Bigger values mean the model will be less likely to repeat the same token in the text.
  • frequency_penalty: What penalty to apply if a token has already been generated in the text. Bigger values mean the model will be less likely to repeat the same token in the text.
  • logit_bias: Add a logit bias to specific tokens. The keys of the dictionary are tokens, and the values are the bias to add to that token.
  • timeout: Timeout for OpenAI client calls. If not set, it defaults to either the OPENAI_TIMEOUT environment variable, or 30 seconds.
  • max_retries: Maximum number of retries to contact OpenAI after an internal error. If not set, it defaults to either the OPENAI_MAX_RETRIES environment variable, or set to 5.

OpenAIChatGenerator.to_dict

def to_dict() -> Dict[str, Any]

Serialize this component to a dictionary.

Returns:

The serialized component as a dictionary.

OpenAIChatGenerator.from_dict

@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "OpenAIChatGenerator"

Deserialize this component from a dictionary.

Arguments:

  • data: The dictionary representation of this component.

Returns:

The deserialized component instance.

OpenAIChatGenerator.run

@component.output_types(replies=List[ChatMessage])
def run(messages: List[ChatMessage],
        streaming_callback: Optional[Callable[[StreamingChunk], None]] = None,
        generation_kwargs: Optional[Dict[str, Any]] = None)

Invokes chat completion based on the provided messages and generation parameters.

Arguments:

  • messages: A list of ChatMessage instances representing the input messages.
  • streaming_callback: A callback function that is called when a new token is received from the stream.
  • generation_kwargs: Additional keyword arguments for text generation. These parameters will override the parameters passed during component initialization. For details on OpenAI API parameters, see OpenAI documentation.

Returns:

A list containing the generated responses as ChatMessage instances.