DocumentationAPI Reference📓 Tutorials🧑‍🍳 Cookbook🤝 Integrations💜 Discord🎨 Studio (Waitlist)
API Reference

Enables text generation using LLMs.

Module azure

AzureOpenAIGenerator

class AzureOpenAIGenerator(OpenAIGenerator)

Enables text generation using OpenAI's large language models (LLMs) on Azure. It supports gpt-4 and gpt-3.5-turbo family of models.

Users can pass any text generation parameters valid for the openai.ChatCompletion.create method directly to this component via the **generation_kwargs parameter in init or the **generation_kwargs parameter in run method.

For more details on OpenAI models deployed on Azure, refer to the Microsoft documentation.

Usage example:

from haystack.components.generators import AzureOpenAIGenerator
from haystack.utils import Secret
client = AzureOpenAIGenerator(
    azure_endpoint="<Your Azure endpoint e.g. `https://your-company.azure.openai.com/>",
    api_key=Secret.from_token("<your-api-key>"),
    azure_deployment="<this a model name, e.g. gpt-35-turbo>")
response = client.run("What's Natural Language Processing? Be brief.")
print(response)
>> {'replies': ['Natural Language Processing (NLP) is a branch of artificial intelligence that focuses on
>> the interaction between computers and human language. It involves enabling computers to understand, interpret,
>> and respond to natural human language in a way that is both meaningful and useful.'], 'meta': [{'model':
>> 'gpt-3.5-turbo-0613', 'index': 0, 'finish_reason': 'stop', 'usage': {'prompt_tokens': 16,
>> 'completion_tokens': 49, 'total_tokens': 65}}]}

AzureOpenAIGenerator.__init__

def __init__(
        azure_endpoint: Optional[str] = None,
        api_version: Optional[str] = "2023-05-15",
        azure_deployment: Optional[str] = "gpt-35-turbo",
        api_key: Optional[Secret] = Secret.from_env_var("AZURE_OPENAI_API_KEY",
                                                        strict=False),
        azure_ad_token: Optional[Secret] = Secret.from_env_var(
            "AZURE_OPENAI_AD_TOKEN", strict=False),
        organization: Optional[str] = None,
        streaming_callback: Optional[Callable[[StreamingChunk], None]] = None,
        system_prompt: Optional[str] = None,
        generation_kwargs: Optional[Dict[str, Any]] = None)

Arguments:

  • azure_endpoint: The endpoint of the deployed model, e.g. https://example-resource.azure.openai.com/
  • api_version: The version of the API to use. Defaults to 2023-05-15
  • azure_deployment: The deployment of the model, usually the model name.
  • api_key: The API key to use for authentication.
  • azure_ad_token: Azure Active Directory token
  • organization: The Organization ID, defaults to None. See production best practices.
  • streaming_callback: A callback function that is called when a new token is received from the stream. The callback function accepts StreamingChunk as an argument.
  • system_prompt: The prompt to use for the system. If not provided, the system prompt will be
  • generation_kwargs: Other parameters to use for the model. These parameters are all sent directly to the OpenAI endpoint. See OpenAI documentation for more details. Some of the supported parameters:
  • max_tokens: The maximum number of tokens the output text can have.
  • temperature: What sampling temperature to use. Higher values mean the model will take more risks. Try 0.9 for more creative applications and 0 (argmax sampling) for ones with a well-defined answer.
  • top_p: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered.
  • n: How many completions to generate for each prompt. For example, if the LLM gets 3 prompts and n is 2, it will generate two completions for each of the three prompts, ending up with 6 completions in total.
  • stop: One or more sequences after which the LLM should stop generating tokens.
  • presence_penalty: What penalty to apply if a token is already present at all. Bigger values mean the model will be less likely to repeat the same token in the text.
  • frequency_penalty: What penalty to apply if a token has already been generated in the text. Bigger values mean the model will be less likely to repeat the same token in the text.
  • logit_bias: Add a logit bias to specific tokens. The keys of the dictionary are tokens, and the values are the bias to add to that token.

AzureOpenAIGenerator.to_dict

def to_dict() -> Dict[str, Any]

Serialize this component to a dictionary.

Returns:

The serialized component as a dictionary.

AzureOpenAIGenerator.from_dict

@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "AzureOpenAIGenerator"

Deserialize this component from a dictionary.

Arguments:

  • data: The dictionary representation of this component.

Returns:

The deserialized component instance.

AzureOpenAIGenerator.run

@component.output_types(replies=List[str], meta=List[Dict[str, Any]])
def run(prompt: str, generation_kwargs: Optional[Dict[str, Any]] = None)

Invoke the text generation inference based on the provided messages and generation parameters.

Arguments:

  • prompt: The string prompt to use for text generation.
  • generation_kwargs: Additional keyword arguments for text generation. These parameters will potentially override the parameters passed in the __init__ method. For more details on the parameters supported by the OpenAI API, refer to the OpenAI documentation.

Returns:

A list of strings containing the generated responses and a list of dictionaries containing the metadata for each response.

Module hugging_face_local

HuggingFaceLocalGenerator

@component
class HuggingFaceLocalGenerator()

Generator based on a Hugging Face model.

This component provides an interface to generate text using a Hugging Face model that runs locally.

Usage example:

from haystack.components.generators import HuggingFaceLocalGenerator

generator = HuggingFaceLocalGenerator(
    model="google/flan-t5-large",
    task="text2text-generation",
    generation_kwargs={"max_new_tokens": 100, "temperature": 0.9})

generator.warm_up()

print(generator.run("Who is the best American actor?"))
# {'replies': ['John Cusack']}

HuggingFaceLocalGenerator.__init__

def __init__(model: str = "google/flan-t5-base",
             task: Optional[Literal["text-generation",
                                    "text2text-generation"]] = None,
             device: Optional[ComponentDevice] = None,
             token: Optional[Secret] = Secret.from_env_var("HF_API_TOKEN",
                                                           strict=False),
             generation_kwargs: Optional[Dict[str, Any]] = None,
             huggingface_pipeline_kwargs: Optional[Dict[str, Any]] = None,
             stop_words: Optional[List[str]] = None)

Creates an instance of a HuggingFaceLocalGenerator.

Arguments:

  • model: The name or path of a Hugging Face model for text generation,
  • task: The task for the Hugging Face pipeline. Possible values are "text-generation" and "text2text-generation". Generally, decoder-only models like GPT support "text-generation", while encoder-decoder models like T5 support "text2text-generation". If the task is also specified in the huggingface_pipeline_kwargs, this parameter will be ignored. If not specified, the component will attempt to infer the task from the model name, calling the Hugging Face Hub API.
  • device: The device on which the model is loaded. If None, the default device is automatically selected. If a device/device map is specified in huggingface_pipeline_kwargs, it overrides this parameter.
  • token: The token to use as HTTP bearer authorization for remote files. If the token is also specified in the huggingface_pipeline_kwargs, this parameter will be ignored.
  • generation_kwargs: A dictionary containing keyword arguments to customize text generation. Some examples: max_length, max_new_tokens, temperature, top_k, top_p,... See Hugging Face's documentation for more information:
  • customize-text-generation
  • transformers.GenerationConfig
  • huggingface_pipeline_kwargs: Dictionary containing keyword arguments used to initialize the Hugging Face pipeline for text generation. These keyword arguments provide fine-grained control over the Hugging Face pipeline. In case of duplication, these kwargs override model, task, device, and token init parameters. See Hugging Face's documentation for more information on the available kwargs. In this dictionary, you can also include model_kwargs to specify the kwargs for model initialization: transformers.PreTrainedModel.from_pretrained
  • stop_words: A list of stop words. If any one of the stop words is generated, the generation is stopped. If you provide this parameter, you should not specify the stopping_criteria in generation_kwargs. For some chat models, the output includes both the new text and the original prompt. In these cases, it's important to make sure your prompt has no stop words.

HuggingFaceLocalGenerator.warm_up

def warm_up()

Initializes the component.

HuggingFaceLocalGenerator.to_dict

def to_dict() -> Dict[str, Any]

Serializes the component to a dictionary.

Returns:

Dictionary with serialized data.

HuggingFaceLocalGenerator.from_dict

@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "HuggingFaceLocalGenerator"

Deserializes the component from a dictionary.

Arguments:

  • data: The dictionary to deserialize from.

Returns:

The deserialized component.

HuggingFaceLocalGenerator.run

@component.output_types(replies=List[str])
def run(prompt: str, generation_kwargs: Optional[Dict[str, Any]] = None)

Run the text generation model on the given prompt.

Arguments:

  • prompt: A string representing the prompt.
  • generation_kwargs: Additional keyword arguments for text generation.

Returns:

A dictionary containing the generated replies.

  • replies: A list of strings representing the generated replies.

Module hugging_face_tgi

HuggingFaceTGIGenerator

@component
class HuggingFaceTGIGenerator()

Enables text generation using HuggingFace Hub hosted non-chat LLMs.

This component is designed to seamlessly inference models deployed on the Text Generation Inference (TGI) backend. You can use this component for LLMs hosted on Hugging Face inference endpoints, the rate-limited Inference API tier.

Key Features and Compatibility:

  • Primary Compatibility: designed to work seamlessly with any non-based model deployed using the TGI framework. For more information on TGI, visit text-generation-inference

  • Hugging Face Inference Endpoints: Supports inference of TGI chat LLMs deployed on Hugging Face inference endpoints. For more details, refer to inference-endpoints

  • Inference API Support: supports inference of TGI LLMs hosted on the rate-limited Inference API tier. Learn more about the Inference API at inference-api. Discover available chat models using the following command: wget -qO- https://api-inference.huggingface.co/framework/text-generation-inference | grep chat and simply use the model ID as the model parameter for this component. You'll also need to provide a valid Hugging Face API token as the token parameter.

  • Custom TGI Endpoints: supports inference of TGI chat LLMs deployed on custom TGI endpoints. Anyone can deploy their own TGI endpoint using the TGI framework. For more details, refer to inference-endpoints

Input and Output Format:

  • String Format: This component uses the str format for structuring both input and output, ensuring coherent and contextually relevant responses in text generation scenarios.
from haystack.components.generators import HuggingFaceTGIGenerator
from haystack.utils import Secret

client = HuggingFaceTGIGenerator(model="mistralai/Mistral-7B-v0.1", token=Secret.from_token("<your-api-key>"))
client.warm_up()
response = client.run("What's Natural Language Processing?", max_new_tokens=120)
print(response)

Or for LLMs hosted on paid https://huggingface.co/inference-endpoints endpoint, and/or your own custom TGI endpoint. In these two cases, you'll need to provide the URL of the endpoint as well as a valid token:

from haystack.components.generators import HuggingFaceTGIGenerator
client = HuggingFaceTGIGenerator(model="mistralai/Mistral-7B-v0.1",
                                 url="<your-tgi-endpoint-url>",
                                 token=Secret.from_token("<your-api-key>"))
client.warm_up()
response = client.run("What's Natural Language Processing?")
print(response)

HuggingFaceTGIGenerator.__init__

def __init__(model: str = "mistralai/Mistral-7B-v0.1",
             url: Optional[str] = None,
             token: Optional[Secret] = Secret.from_env_var("HF_API_TOKEN",
                                                           strict=False),
             generation_kwargs: Optional[Dict[str, Any]] = None,
             stop_words: Optional[List[str]] = None,
             streaming_callback: Optional[Callable[[StreamingChunk],
                                                   None]] = None)

Initialize the HuggingFaceTGIGenerator instance.

Arguments:

  • model: A string representing the model id on HF Hub. Default is "mistralai/Mistral-7B-v0.1".
  • url: An optional string representing the URL of the TGI endpoint. If the url is not provided, check if the model is deployed on the free tier of the HF inference API.
  • token: The HuggingFace token to use as HTTP bearer authorization You can find your HF token in your account settings
  • generation_kwargs: A dictionary containing keyword arguments to customize text generation. Some examples: max_new_tokens, temperature, top_k, top_p,... See Hugging Face's documentation for more information at: [TextGenerationParameters](https://huggingface.co/docs/huggingface_hub/v0.18.0.rc0/en/package_reference/inference_client#huggingface_hub.inference._text_generation.TextGenerationParameters
  • stop_words: An optional list of strings representing the stop words.
  • streaming_callback: An optional callable for handling streaming responses.

HuggingFaceTGIGenerator.warm_up

def warm_up() -> None

Initializes the component.

HuggingFaceTGIGenerator.to_dict

def to_dict() -> Dict[str, Any]

Serialize this component to a dictionary.

Returns:

A dictionary containing the serialized component.

HuggingFaceTGIGenerator.from_dict

@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "HuggingFaceTGIGenerator"

Deserialize this component from a dictionary.

HuggingFaceTGIGenerator.run

@component.output_types(replies=List[str], meta=List[Dict[str, Any]])
def run(prompt: str, generation_kwargs: Optional[Dict[str, Any]] = None)

Invoke the text generation inference for the given prompt and generation parameters.

Arguments:

  • prompt: A string representing the prompt.
  • generation_kwargs: Additional keyword arguments for text generation.

Returns:

A dictionary containing the generated replies and metadata. Both are lists of length n.

  • replies: A list of strings representing the generated replies.

Module openai

OpenAIGenerator

@component
class OpenAIGenerator()

Enables text generation using OpenAI's large language models (LLMs). It supports gpt-4 and gpt-3.5-turbo family of models.

Users can pass any text generation parameters valid for the openai.ChatCompletion.create method directly to this component via the **generation_kwargs parameter in init or the **generation_kwargs parameter in run method.

For more details on the parameters supported by the OpenAI API, refer to the OpenAI documentation.

Key Features and Compatibility:

  • Primary Compatibility: Designed to work seamlessly with gpt-4, gpt-3.5-turbo family of models.
  • Streaming Support: Supports streaming responses from the OpenAI API.
  • Customizability: Supports all parameters supported by the OpenAI API.

Input and Output Format:

  • String Format: This component uses the strings for both input and output.
from haystack.components.generators import OpenAIGenerator
client = OpenAIGenerator()
response = client.run("What's Natural Language Processing? Be brief.")
print(response)

>> {'replies': ['Natural Language Processing (NLP) is a branch of artificial intelligence that focuses on
>> the interaction between computers and human language. It involves enabling computers to understand, interpret,
>> and respond to natural human language in a way that is both meaningful and useful.'], 'meta': [{'model':
>> 'gpt-3.5-turbo-0613', 'index': 0, 'finish_reason': 'stop', 'usage': {'prompt_tokens': 16,
>> 'completion_tokens': 49, 'total_tokens': 65}}]}

OpenAIGenerator.__init__

def __init__(api_key: Secret = Secret.from_env_var("OPENAI_API_KEY"),
             model: str = "gpt-3.5-turbo",
             streaming_callback: Optional[Callable[[StreamingChunk],
                                                   None]] = None,
             api_base_url: Optional[str] = None,
             organization: Optional[str] = None,
             system_prompt: Optional[str] = None,
             generation_kwargs: Optional[Dict[str, Any]] = None)

Creates an instance of OpenAIGenerator. Unless specified otherwise in the model, this is for OpenAI's GPT-3.5 model.

Arguments:

  • api_key: The OpenAI API key.
  • model: The name of the model to use.
  • streaming_callback: A callback function that is called when a new token is received from the stream. The callback function accepts StreamingChunk as an argument.
  • api_base_url: An optional base URL.
  • organization: The Organization ID, defaults to None. See production best practices.
  • system_prompt: The system prompt to use for text generation. If not provided, the system prompt is omitted, and the default system prompt of the model is used.
  • generation_kwargs: Other parameters to use for the model. These parameters are all sent directly to the OpenAI endpoint. See OpenAI documentation for more details. Some of the supported parameters:
  • max_tokens: The maximum number of tokens the output text can have.
  • temperature: What sampling temperature to use. Higher values mean the model will take more risks. Try 0.9 for more creative applications and 0 (argmax sampling) for ones with a well-defined answer.
  • top_p: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So, 0.1 means only the tokens comprising the top 10% probability mass are considered.
  • n: How many completions to generate for each prompt. For example, if the LLM gets 3 prompts and n is 2, it will generate two completions for each of the three prompts, ending up with 6 completions in total.
  • stop: One or more sequences after which the LLM should stop generating tokens.
  • presence_penalty: What penalty to apply if a token is already present at all. Bigger values mean the model will be less likely to repeat the same token in the text.
  • frequency_penalty: What penalty to apply if a token has already been generated in the text. Bigger values mean the model will be less likely to repeat the same token in the text.
  • logit_bias: Add a logit bias to specific tokens. The keys of the dictionary are tokens, and the values are the bias to add to that token.

OpenAIGenerator.to_dict

def to_dict() -> Dict[str, Any]

Serialize this component to a dictionary.

Returns:

The serialized component as a dictionary.

OpenAIGenerator.from_dict

@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "OpenAIGenerator"

Deserialize this component from a dictionary.

Arguments:

  • data: The dictionary representation of this component.

Returns:

The deserialized component instance.

OpenAIGenerator.run

@component.output_types(replies=List[str], meta=List[Dict[str, Any]])
def run(prompt: str, generation_kwargs: Optional[Dict[str, Any]] = None)

Invoke the text generation inference based on the provided messages and generation parameters.

Arguments:

  • prompt: The string prompt to use for text generation.
  • generation_kwargs: Additional keyword arguments for text generation. These parameters will potentially override the parameters passed in the __init__ method. For more details on the parameters supported by the OpenAI API, refer to the OpenAI documentation.

Returns:

A list of strings containing the generated responses and a list of dictionaries containing the metadata for each response.

Module chat/azure

AzureOpenAIChatGenerator

class AzureOpenAIChatGenerator(OpenAIChatGenerator)

Enables text generation using OpenAI's large language models (LLMs) on Azure. It supports gpt-4 and gpt-3.5-turbo family of models accessed through the chat completions API endpoint.

Users can pass any text generation parameters valid for the openai.ChatCompletion.create method directly to this component via the generation_kwargs parameter in __init__ or the generation_kwargs parameter in run method.

For more details on OpenAI models deployed on Azure, refer to the Microsoft documentation.

Key Features and Compatibility:

  • Primary Compatibility: Designed to work seamlessly with the OpenAI API Chat Completion endpoint.
  • Streaming Support: Supports streaming responses from the OpenAI API Chat Completion endpoint.
  • Customizability: Supports all parameters supported by the OpenAI API Chat Completion endpoint.

Input and Output Format:

  • ChatMessage Format: This component uses the ChatMessage format for structuring both input and output, ensuring coherent and contextually relevant responses in chat-based text generation scenarios.
  • Details on the ChatMessage format can be found here.

Usage example:

from haystack.components.generators.chat import AzureOpenAIGenerator
from haystack.dataclasses import ChatMessage
from haystack.utils import Secret

messages = [ChatMessage.from_user("What's Natural Language Processing?")]

client = AzureOpenAIGenerator(
    azure_endpoint="<Your Azure endpoint e.g. `https://your-company.azure.openai.com/>",
    api_key=Secret.from_token("<your-api-key>"),
    azure_deployment="<this a model name, e.g. gpt-35-turbo>")
response = client.run(messages)
print(response)
{'replies':
    [ChatMessage(content='Natural Language Processing (NLP) is a branch of artificial intelligence that focuses on
     enabling computers to understand, interpret, and generate human language in a way that is meaningful and useful.',
     role=<ChatRole.ASSISTANT: 'assistant'>, name=None,
     meta={'model': 'gpt-3.5-turbo-0613', 'index': 0, 'finish_reason': 'stop',
     'usage': {'prompt_tokens': 15, 'completion_tokens': 36, 'total_tokens': 51}})]
}

AzureOpenAIChatGenerator.__init__

def __init__(
        azure_endpoint: Optional[str] = None,
        api_version: Optional[str] = "2023-05-15",
        azure_deployment: Optional[str] = "gpt-35-turbo",
        api_key: Optional[Secret] = Secret.from_env_var("AZURE_OPENAI_API_KEY",
                                                        strict=False),
        azure_ad_token: Optional[Secret] = Secret.from_env_var(
            "AZURE_OPENAI_AD_TOKEN", strict=False),
        organization: Optional[str] = None,
        streaming_callback: Optional[Callable[[StreamingChunk], None]] = None,
        generation_kwargs: Optional[Dict[str, Any]] = None)

Arguments:

  • azure_endpoint: The endpoint of the deployed model, e.g. "https://example-resource.azure.openai.com/"
  • api_version: The version of the API to use. Defaults to 2023-05-15
  • azure_deployment: The deployment of the model, usually the model name.
  • api_key: The API key to use for authentication.
  • azure_ad_token: Azure Active Directory token
  • organization: The Organization ID, defaults to None. See production best practices.
  • streaming_callback: A callback function that is called when a new token is received from the stream. The callback function accepts StreamingChunk as an argument.
  • generation_kwargs: Other parameters to use for the model. These parameters are all sent directly to the OpenAI endpoint. See OpenAI documentation for more details. Some of the supported parameters:
  • max_tokens: The maximum number of tokens the output text can have.
  • temperature: What sampling temperature to use. Higher values mean the model will take more risks. Try 0.9 for more creative applications and 0 (argmax sampling) for ones with a well-defined answer.
  • top_p: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered.
  • n: How many completions to generate for each prompt. For example, if the LLM gets 3 prompts and n is 2, it will generate two completions for each of the three prompts, ending up with 6 completions in total.
  • stop: One or more sequences after which the LLM should stop generating tokens.
  • presence_penalty: What penalty to apply if a token is already present at all. Bigger values mean the model will be less likely to repeat the same token in the text.
  • frequency_penalty: What penalty to apply if a token has already been generated in the text. Bigger values mean the model will be less likely to repeat the same token in the text.
  • logit_bias: Add a logit bias to specific tokens. The keys of the dictionary are tokens, and the values are the bias to add to that token.

AzureOpenAIChatGenerator.to_dict

def to_dict() -> Dict[str, Any]

Serialize this component to a dictionary.

Returns:

The serialized component as a dictionary.

AzureOpenAIChatGenerator.from_dict

@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "AzureOpenAIChatGenerator"

Deserialize this component from a dictionary.

Arguments:

  • data: The dictionary representation of this component.

Returns:

The deserialized component instance.

AzureOpenAIChatGenerator.run

@component.output_types(replies=List[ChatMessage])
def run(messages: List[ChatMessage],
        generation_kwargs: Optional[Dict[str, Any]] = None)

Invoke the text generation inference based on the provided messages and generation parameters.

Arguments:

  • messages: A list of ChatMessage instances representing the input messages.
  • generation_kwargs: Additional keyword arguments for text generation. These parameters will potentially override the parameters passed in the __init__ method. For more details on the parameters supported by the OpenAI API, refer to the OpenAI documentation.

Returns:

A list containing the generated responses as ChatMessage instances.

Module chat/hugging_face_local

HuggingFaceLocalChatGenerator

@component
class HuggingFaceLocalChatGenerator()

The HuggingFaceLocalChatGenerator class is a component designed for generating chat responses using models from Hugging Face's model hub. It is tailored for local runtime text generation tasks and provides a convenient interface for working with chat-based models, such as HuggingFaceH4/zephyr-7b-beta or meta-llama/Llama-2-7b-chat-hf etc.

Usage example:

from haystack.components.generators.chat import HuggingFaceLocalChatGenerator
from haystack.dataclasses import ChatMessage

generator = HuggingFaceLocalChatGenerator(model="HuggingFaceH4/zephyr-7b-beta")
generator.warm_up()
messages = [ChatMessage.from_user("What's Natural Language Processing? Be brief.")]
print(generator.run(messages))
{'replies':
    [ChatMessage(content=' Natural Language Processing (NLP) is a subfield of artificial intelligence that deals
    with the interaction between computers and human language. It enables computers to understand, interpret, and
    generate human language in a valuable way. NLP involves various techniques such as speech recognition, text
    analysis, sentiment analysis, and machine translation. The ultimate goal is to make it easier for computers to
    process and derive meaning from human language, improving communication between humans and machines.',
    role=<ChatRole.ASSISTANT: 'assistant'>,
    name=None,
    meta={'finish_reason': 'stop', 'index': 0, 'model':
          'mistralai/Mistral-7B-Instruct-v0.2',
          'usage': {'completion_tokens': 90, 'prompt_tokens': 19, 'total_tokens': 109}})
          ]
}

HuggingFaceLocalChatGenerator.__init__

def __init__(model: str = "HuggingFaceH4/zephyr-7b-beta",
             task: Optional[Literal["text-generation",
                                    "text2text-generation"]] = None,
             device: Optional[ComponentDevice] = None,
             token: Optional[Secret] = Secret.from_env_var("HF_API_TOKEN",
                                                           strict=False),
             chat_template: Optional[str] = None,
             generation_kwargs: Optional[Dict[str, Any]] = None,
             huggingface_pipeline_kwargs: Optional[Dict[str, Any]] = None,
             stop_words: Optional[List[str]] = None,
             streaming_callback: Optional[Callable[[StreamingChunk],
                                                   None]] = None)

Arguments:

  • model: The name or path of a Hugging Face model for text generation, for example, mistralai/Mistral-7B-Instruct-v0.2, TheBloke/OpenHermes-2.5-Mistral-7B-16k-AWQ, etc. The important aspect of the model is that it should be a chat model and that it supports ChatML messaging format. If the model is also specified in the huggingface_pipeline_kwargs, this parameter will be ignored.
  • task: The task for the Hugging Face pipeline. Possible values are "text-generation" and "text2text-generation". Generally, decoder-only models like GPT support "text-generation", while encoder-decoder models like T5 support "text2text-generation". If the task is also specified in the huggingface_pipeline_kwargs, this parameter will be ignored. If not specified, the component will attempt to infer the task from the model name, calling the Hugging Face Hub API.
  • device: The device on which the model is loaded. If None, the default device is automatically selected. If a device/device map is specified in huggingface_pipeline_kwargs, it overrides this parameter.
  • token: The token to use as HTTP bearer authorization for remote files. If the token is also specified in the huggingface_pipeline_kwargs, this parameter will be ignored.
  • chat_template: This optional parameter allows you to specify a Jinja template for formatting chat messages. While high-quality and well-supported chat models typically include their own chat templates accessible through their tokenizer, there are models that do not offer this feature. For such scenarios, or if you wish to use a custom template instead of the model's default, you can use this parameter to set your preferred chat template.
  • generation_kwargs: A dictionary containing keyword arguments to customize text generation. Some examples: max_length, max_new_tokens, temperature, top_k, top_p, etc. See Hugging Face's documentation for more information:
  • The only generation_kwargs we set by default is max_new_tokens, which is set to 512 tokens.
  • huggingface_pipeline_kwargs: Dictionary containing keyword arguments used to initialize the Hugging Face pipeline for text generation. These keyword arguments provide fine-grained control over the Hugging Face pipeline. In case of duplication, these kwargs override model, task, device, and token init parameters. See Hugging Face's documentation for more information on the available kwargs. In this dictionary, you can also include model_kwargs to specify the kwargs for model initialization
  • stop_words: A list of stop words. If any one of the stop words is generated, the generation is stopped. If you provide this parameter, you should not specify the stopping_criteria in generation_kwargs. For some chat models, the output includes both the new text and the original prompt. In these cases, it's important to make sure your prompt has no stop words.
  • streaming_callback: An optional callable for handling streaming responses.

HuggingFaceLocalChatGenerator.warm_up

def warm_up()

Initializes the component.

HuggingFaceLocalChatGenerator.to_dict

def to_dict() -> Dict[str, Any]

Serializes the component to a dictionary.

Returns:

Dictionary with serialized data.

HuggingFaceLocalChatGenerator.from_dict

@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "HuggingFaceLocalChatGenerator"

Deserializes the component from a dictionary.

Arguments:

  • data: The dictionary to deserialize from.

Returns:

The deserialized component.

HuggingFaceLocalChatGenerator.run

@component.output_types(replies=List[ChatMessage])
def run(messages: List[ChatMessage],
        generation_kwargs: Optional[Dict[str, Any]] = None)

Invoke text generation inference based on the provided messages and generation parameters.

Arguments:

  • messages: A list of ChatMessage instances representing the input messages.
  • generation_kwargs: Additional keyword arguments for text generation.

Returns:

A list containing the generated responses as ChatMessage instances.

HuggingFaceLocalChatGenerator.create_message

def create_message(text: str, index: int,
                   tokenizer: Union["PreTrainedTokenizer",
                                    "PreTrainedTokenizerFast"], prompt: str,
                   generation_kwargs: Dict[str, Any]) -> ChatMessage

Create a ChatMessage instance from the provided text, populated with metadata.

Arguments:

  • text: The generated text.
  • index: The index of the generated text.
  • tokenizer: The tokenizer used for generation.
  • prompt: The prompt used for generation.
  • generation_kwargs: The generation parameters.

Returns:

A ChatMessage instance.

Module chat/hugging_face_tgi

HuggingFaceTGIChatGenerator

class HuggingFaceTGIChatGenerator()

Enables text generation using HuggingFace Hub hosted chat-based LLMs. This component is designed to seamlessly inference chat-based models deployed on the Text Generation Inference (TGI) backend.

You can use this component for chat LLMs hosted on Hugging Face inference endpoints, the rate-limited Inference API tier.

Key Features and Compatibility:

  • Primary Compatibility: designed to work seamlessly with any chat-based model deployed using the TGI framework. For more information on TGI, visit text-generation-inference

  • Hugging Face Inference Endpoints: Supports inference of TGI chat LLMs deployed on Hugging Face inference endpoints. For more details, refer to inference-endpoints

  • Inference API Support: supports inference of TGI chat LLMs hosted on the rate-limited Inference API tier. Learn more about the Inference API at inference-api. Discover available chat models using the following command: wget -qO- https://api-inference.huggingface.co/framework/text-generation-inference | grep chat and simply use the model ID as the model parameter for this component. You'll also need to provide a valid Hugging Face API token as the token parameter.

  • Custom TGI Endpoints: supports inference of TGI chat LLMs deployed on custom TGI endpoints. Anyone can deploy their own TGI endpoint using the TGI framework. For more details, refer to inference-endpoints

Input and Output Format:

  • ChatMessage Format: This component uses the ChatMessage format to structure both input and output, ensuring coherent and contextually relevant responses in chat-based text generation scenarios. Details on the ChatMessage format can be found here.
from haystack.components.generators.chat import HuggingFaceTGIChatGenerator
from haystack.dataclasses import ChatMessage
from haystack.utils import Secret

messages = [ChatMessage.from_system("\nYou are a helpful, respectful and honest assistant"),
            ChatMessage.from_user("What's Natural Language Processing?")]


client = HuggingFaceTGIChatGenerator(model="HuggingFaceH4/zephyr-7b-beta", token=Secret.from_token("<your-api-key>"))
client.warm_up()
response = client.run(messages, generation_kwargs={"max_new_tokens": 120})
print(response)

For chat LLMs hosted on paid https://huggingface.co/inference-endpoints endpoint and/or your own custom TGI endpoint, you'll need to provide the URL of the endpoint as well as a valid token:

from haystack.components.generators.chat import HuggingFaceTGIChatGenerator
from haystack.dataclasses import ChatMessage

messages = [ChatMessage.from_system("\nYou are a helpful, respectful and honest assistant"),
            ChatMessage.from_user("What's Natural Language Processing?")]

client = HuggingFaceTGIChatGenerator(model="HuggingFaceH4/zephyr-7b-beta",
                                     url="<your-tgi-endpoint-url>",
                                     token=Secret.from_token("<your-api-key>"))
client.warm_up()
response = client.run(messages, generation_kwargs={"max_new_tokens": 120})
print(response)

HuggingFaceTGIChatGenerator.__init__

def __init__(model: str = "HuggingFaceH4/zephyr-7b-beta",
             url: Optional[str] = None,
             token: Optional[Secret] = Secret.from_env_var("HF_API_TOKEN",
                                                           strict=False),
             chat_template: Optional[str] = None,
             generation_kwargs: Optional[Dict[str, Any]] = None,
             stop_words: Optional[List[str]] = None,
             streaming_callback: Optional[Callable[[StreamingChunk],
                                                   None]] = None)

Initialize the HuggingFaceTGIChatGenerator instance.

Arguments:

  • model: A string representing the model path or URL. Default is "HuggingFaceH4/zephyr-7b-beta".
  • url: An optional string representing the URL of the TGI endpoint.
  • chat_template: This optional parameter allows you to specify a Jinja template for formatting chat messages. While high-quality and well-supported chat models typically include their own chat templates accessible through their tokenizer, there are models that do not offer this feature. For such scenarios, or if you wish to use a custom template instead of the model's default, you can use this parameter to set your preferred chat template.
  • token: The Hugging Face token for HTTP bearer authorization. You can find your HF token at https://huggingface.co/settings/tokens.
  • generation_kwargs: A dictionary containing keyword arguments to customize text generation. Some examples: max_new_tokens, temperature, top_k, top_p,... See Hugging Face's documentation for more information.
  • stop_words: An optional list of strings representing the stop words.
  • streaming_callback: An optional callable for handling streaming responses.

HuggingFaceTGIChatGenerator.warm_up

def warm_up() -> None

If the url is not provided, check if the model is deployed on the free tier of the HF inference API. Load the tokenizer

HuggingFaceTGIChatGenerator.to_dict

def to_dict() -> Dict[str, Any]

Serialize this component to a dictionary.

Returns:

A dictionary containing the serialized component.

HuggingFaceTGIChatGenerator.from_dict

@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "HuggingFaceTGIChatGenerator"

Deserialize this component from a dictionary.

HuggingFaceTGIChatGenerator.run

@component.output_types(replies=List[ChatMessage])
def run(messages: List[ChatMessage],
        generation_kwargs: Optional[Dict[str, Any]] = None)

Invoke the text generation inference based on the provided messages and generation parameters.

Arguments:

  • messages: A list of ChatMessage instances representing the input messages.
  • generation_kwargs: Additional keyword arguments for text generation.

Returns:

A list containing the generated responses as ChatMessage instances.

Module chat/openai

OpenAIChatGenerator

@component
class OpenAIChatGenerator()

Enables text generation using OpenAI's large language models (LLMs). It supports gpt-4 and gpt-3.5-turbo family of models accessed through the chat completions API endpoint.

Users can pass any text generation parameters valid for the openai.ChatCompletion.create method directly to this component via the generation_kwargs parameter in __init__ or the generation_kwargs parameter in run method.

For more details on the parameters supported by the OpenAI API, refer to the OpenAI documentation.

from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.dataclasses import ChatMessage

messages = [ChatMessage.from_user("What's Natural Language Processing?")]

client = OpenAIChatGenerator()
response = client.run(messages)
print(response)

Output:

{'replies':
    [ChatMessage(content='Natural Language Processing (NLP) is a branch of artificial intelligence
                          that focuses on enabling computers to understand, interpret, and generate human language in
                          a way that is meaningful and useful.',
     role=<ChatRole.ASSISTANT: 'assistant'>, name=None,
     meta={'model': 'gpt-3.5-turbo-0613', 'index': 0, 'finish_reason': 'stop',
     'usage': {'prompt_tokens': 15, 'completion_tokens': 36, 'total_tokens': 51}})
    ]
}

Key Features and Compatibility:

  • Primary Compatibility: designed to work seamlessly with the OpenAI API Chat Completion endpoint and gpt-4 and gpt-3.5-turbo family of models.
  • Streaming Support: supports streaming responses from the OpenAI API Chat Completion endpoint.
  • Customizability: supports all parameters supported by the OpenAI API Chat Completion endpoint.

Input and Output Format:

  • ChatMessage Format: this component uses the ChatMessage format for structuring both input and output, ensuring coherent and contextually relevant responses in chat-based text generation scenarios. Details on the ChatMessage format can be found at here.

OpenAIChatGenerator.__init__

def __init__(api_key: Secret = Secret.from_env_var("OPENAI_API_KEY"),
             model: str = "gpt-3.5-turbo",
             streaming_callback: Optional[Callable[[StreamingChunk],
                                                   None]] = None,
             api_base_url: Optional[str] = None,
             organization: Optional[str] = None,
             generation_kwargs: Optional[Dict[str, Any]] = None)

Creates an instance of OpenAIChatGenerator. Unless specified otherwise in the model, this is for OpenAI's

GPT-3.5 model.

Arguments:

  • api_key: The OpenAI API key.
  • model: The name of the model to use.
  • streaming_callback: A callback function that is called when a new token is received from the stream. The callback function accepts StreamingChunk as an argument.
  • api_base_url: An optional base URL.
  • organization: The Organization ID, defaults to None. See production best practices.
  • generation_kwargs: Other parameters to use for the model. These parameters are all sent directly to the OpenAI endpoint. See OpenAI documentation for more details. Some of the supported parameters:
  • max_tokens: The maximum number of tokens the output text can have.
  • temperature: What sampling temperature to use. Higher values mean the model will take more risks. Try 0.9 for more creative applications and 0 (argmax sampling) for ones with a well-defined answer.
  • top_p: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered.
  • n: How many completions to generate for each prompt. For example, if the LLM gets 3 prompts and n is 2, it will generate two completions for each of the three prompts, ending up with 6 completions in total.
  • stop: One or more sequences after which the LLM should stop generating tokens.
  • presence_penalty: What penalty to apply if a token is already present at all. Bigger values mean the model will be less likely to repeat the same token in the text.
  • frequency_penalty: What penalty to apply if a token has already been generated in the text. Bigger values mean the model will be less likely to repeat the same token in the text.
  • logit_bias: Add a logit bias to specific tokens. The keys of the dictionary are tokens, and the values are the bias to add to that token.

OpenAIChatGenerator.to_dict

def to_dict() -> Dict[str, Any]

Serialize this component to a dictionary.

Returns:

The serialized component as a dictionary.

OpenAIChatGenerator.from_dict

@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "OpenAIChatGenerator"

Deserialize this component from a dictionary.

Arguments:

  • data: The dictionary representation of this component.

Returns:

The deserialized component instance.

OpenAIChatGenerator.run

@component.output_types(replies=List[ChatMessage])
def run(messages: List[ChatMessage],
        generation_kwargs: Optional[Dict[str, Any]] = None)

Invoke the text generation inference based on the provided messages and generation parameters.

Arguments:

  • messages: A list of ChatMessage instances representing the input messages.
  • generation_kwargs: Additional keyword arguments for text generation. These parameters will potentially override the parameters passed in the __init__ method. For more details on the parameters supported by the OpenAI API, refer to the OpenAI documentation.

Returns:

A list containing the generated responses as ChatMessage instances.