API Reference

Enables text generation using LLMs.

Module haystack_experimental.components.generators.chat.openai

OpenAIChatGenerator

Completes chats using OpenAI's large language models (LLMs).

It works with the gpt-4 and gpt-3.5-turbo models and supports streaming responses from the OpenAI API. It uses the ChatMessage format for input and output.

You can customize how the text is generated by passing parameters to the OpenAI API. Use the generation_kwargs argument when you initialize the component or when you run it. Any parameter that works with openai.ChatCompletion.create will work here too.

For details on OpenAI API parameters, see OpenAI documentation.

Usage example

from haystack_experimental.components.generators.chat import OpenAIChatGenerator
from haystack_experimental.dataclasses import ChatMessage

messages = [ChatMessage.from_user("What's Natural Language Processing?")]

client = OpenAIChatGenerator()
response = client.run(messages)
print(response)

Output:

{'replies': [
    ChatMessage(_role=<ChatRole.ASSISTANT: 'assistant'>,
                _content=[TextContent(text='Natural Language Processing (NLP) is a field of artificial ...')],
                _meta={'model': 'gpt-4o-mini', 'index': 0, 'finish_reason': 'stop',
                    'usage': {'completion_tokens': 71, 'prompt_tokens': 13, 'total_tokens': 84}}
                )
            ]
}
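As noted above, generation parameters can also be overridden per call. A minimal sketch (parameter values are illustrative):

```python
from haystack_experimental.components.generators.chat import OpenAIChatGenerator
from haystack_experimental.dataclasses import ChatMessage

# Defaults set at initialization can be overridden at run time.
client = OpenAIChatGenerator(generation_kwargs={"temperature": 0.2})

messages = [ChatMessage.from_user("Summarize Natural Language Processing in one sentence.")]

response = client.run(messages, generation_kwargs={"temperature": 0.9, "max_tokens": 60})
print(response["replies"][0])
```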

OpenAIChatGenerator.__init__

def __init__(api_key: Secret = Secret.from_env_var("OPENAI_API_KEY"),
             model: str = "gpt-4o-mini",
             streaming_callback: Optional[Callable[[StreamingChunk],
                                                   None]] = None,
             api_base_url: Optional[str] = None,
             organization: Optional[str] = None,
             generation_kwargs: Optional[Dict[str, Any]] = None,
             timeout: Optional[float] = None,
             max_retries: Optional[int] = None,
             tools: Optional[List[Tool]] = None,
             tools_strict: bool = False)

Creates an instance of OpenAIChatGenerator. Unless specified otherwise in model, uses OpenAI's gpt-4o-mini model.

Before initializing the component, you can set the OPENAI_TIMEOUT and OPENAI_MAX_RETRIES environment variables to override the timeout and max_retries parameters of the OpenAI client.
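For example, a short sketch with illustrative values:

```python
import os

# Read by the OpenAI client when timeout/max_retries are not passed explicitly.
os.environ["OPENAI_TIMEOUT"] = "60"       # seconds
os.environ["OPENAI_MAX_RETRIES"] = "3"

from haystack_experimental.components.generators.chat import OpenAIChatGenerator

client = OpenAIChatGenerator()
```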

Arguments:

  • api_key: The OpenAI API key. You can set it with an environment variable OPENAI_API_KEY, or pass with this parameter during initialization.
  • model: The name of the model to use.
  • streaming_callback: A callback function that is called when a new token is received from the stream. The callback function accepts StreamingChunk as an argument.
  • api_base_url: An optional base URL.
  • organization: Your organization ID, defaults to None. See production best practices.
  • generation_kwargs: Other parameters to use for the model. These parameters are sent directly to the OpenAI endpoint. See OpenAI documentation for more details. Some of the supported parameters:
    • max_tokens: The maximum number of tokens the output text can have.
    • temperature: What sampling temperature to use. Higher values mean the model will take more risks. Try 0.9 for more creative applications and 0 (argmax sampling) for ones with a well-defined answer.
    • top_p: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. For example, 0.1 means only the tokens comprising the top 10% probability mass are considered.
    • n: How many completions to generate for each prompt. For example, if the LLM gets 3 prompts and n is 2, it will generate two completions for each of the three prompts, ending up with 6 completions in total.
    • stop: One or more sequences after which the LLM should stop generating tokens.
    • presence_penalty: What penalty to apply if a token is already present at all. Bigger values mean the model will be less likely to repeat the same token in the text.
    • frequency_penalty: What penalty to apply if a token has already been generated in the text. Bigger values mean the model will be less likely to repeat the same token in the text.
    • logit_bias: Add a logit bias to specific tokens. The keys of the dictionary are tokens, and the values are the bias to add to that token.
  • timeout: Timeout for OpenAI client calls. If not set, it defaults to the OPENAI_TIMEOUT environment variable or 30 seconds.
  • max_retries: Maximum number of retries to contact OpenAI after an internal error. If not set, it defaults to the OPENAI_MAX_RETRIES environment variable or 5.
  • tools: A list of tools for which the model can prepare calls.
  • tools_strict: Whether to enable strict schema adherence for tool calls. If set to True, the model will follow exactly the schema provided in the parameters field of the tool definition, but this may increase latency.
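
A minimal initialization sketch combining a few of these arguments; the callback simply prints each streamed chunk and is only an illustration:

```python
from haystack_experimental.components.generators.chat import OpenAIChatGenerator
from haystack.dataclasses import StreamingChunk

def print_chunk(chunk: StreamingChunk) -> None:
    # Called for every streamed chunk; prints the partial text as it arrives.
    print(chunk.content, end="", flush=True)

client = OpenAIChatGenerator(
    model="gpt-4o-mini",
    streaming_callback=print_chunk,
    generation_kwargs={"temperature": 0.5, "max_tokens": 256},
)
```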

OpenAIChatGenerator.to_dict

def to_dict() -> Dict[str, Any]

Serialize this component to a dictionary.

Returns:

The serialized component as a dictionary.

OpenAIChatGenerator.from_dict

@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "OpenAIChatGenerator"

Deserialize this component from a dictionary.

Arguments:

  • data: The dictionary representation of this component.

Returns:

The deserialized component instance.
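
A short round-trip sketch showing how to_dict and from_dict are typically used together:

```python
from haystack_experimental.components.generators.chat import OpenAIChatGenerator

generator = OpenAIChatGenerator(model="gpt-4o-mini")

# Serialize the component configuration to a dictionary and rebuild
# an equivalent component from it.
data = generator.to_dict()
restored = OpenAIChatGenerator.from_dict(data)
```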

OpenAIChatGenerator.run

@component.output_types(replies=List[ChatMessage])
def run(messages: List[ChatMessage],
        streaming_callback: Optional[Callable[[StreamingChunk], None]] = None,
        generation_kwargs: Optional[Dict[str, Any]] = None,
        tools: Optional[List[Tool]] = None,
        tools_strict: Optional[bool] = None)

Invokes chat completion based on the provided messages and generation parameters.

Arguments:

  • messages: A list of ChatMessage instances representing the input messages.
  • streaming_callback: A callback function that is called when a new token is received from the stream.
  • generation_kwargs: Additional keyword arguments for text generation. These parameters will override the parameters passed during component initialization. For details on OpenAI API parameters, see OpenAI documentation.
  • tools: A list of tools for which the model can prepare calls. If set, it will override the tools parameter set during component initialization.
  • tools_strict: Whether to enable strict schema adherence for tool calls. If set to True, the model will follow exactly the schema provided in the parameters field of the tool definition, but this may increase latency. If set, it will override the tools_strict parameter set during component initialization.

Returns:

A dictionary with the following keys:

  • replies: A list containing the generated responses as ChatMessage instances.
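
A hedged sketch of overriding tools at run time, assuming the experimental Tool dataclass (name, description, JSON-schema parameters, and a Python callable); the get_weather helper is hypothetical:

```python
from haystack_experimental.components.generators.chat import OpenAIChatGenerator
from haystack_experimental.dataclasses import ChatMessage, Tool

def get_weather(city: str) -> str:
    # Hypothetical helper used only for illustration.
    return f"Sunny in {city}"

weather_tool = Tool(
    name="get_weather",
    description="Get the current weather for a city.",
    parameters={
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
        "additionalProperties": False,
    },
    function=get_weather,
)

client = OpenAIChatGenerator()
result = client.run(
    messages=[ChatMessage.from_user("What's the weather in Berlin?")],
    tools=[weather_tool],   # overrides any tools set at initialization
    tools_strict=True,
)
print(result["replies"])
```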

Module haystack_experimental.components.generators.chat.hugging_face_api

HuggingFaceAPIChatGenerator

Completes chats using Hugging Face APIs.

HuggingFaceAPIChatGenerator uses the ChatMessage format for input and output. Use it to generate text with Hugging Face APIs:

Usage examples

With the free serverless inference API

from haystack.components.generators.chat import HuggingFaceAPIChatGenerator
from haystack.dataclasses import ChatMessage
from haystack.utils import Secret
from haystack.utils.hf import HFGenerationAPIType

messages = [ChatMessage.from_system("\nYou are a helpful, respectful and honest assistant"),
            ChatMessage.from_user("What's Natural Language Processing?")]

# the api_type can be expressed using the HFGenerationAPIType enum or as a string
api_type = HFGenerationAPIType.SERVERLESS_INFERENCE_API
api_type = "serverless_inference_api" # this is equivalent to the above

generator = HuggingFaceAPIChatGenerator(api_type=api_type,
                                        api_params={"model": "HuggingFaceH4/zephyr-7b-beta"},
                                        token=Secret.from_token("<your-api-key>"))

result = generator.run(messages)
print(result)

With paid inference endpoints

from haystack.components.generators.chat import HuggingFaceAPIChatGenerator
from haystack.dataclasses import ChatMessage
from haystack.utils import Secret

messages = [ChatMessage.from_system("\nYou are a helpful, respectful and honest assistant"),
            ChatMessage.from_user("What's Natural Language Processing?")]

generator = HuggingFaceAPIChatGenerator(api_type="inference_endpoints",
                                        api_params={"url": "<your-inference-endpoint-url>"},
                                        token=Secret.from_token("<your-api-key>"))

result = generator.run(messages)
print(result)

With self-hosted text generation inference

from haystack.components.generators.chat import HuggingFaceAPIChatGenerator
from haystack.dataclasses import ChatMessage

messages = [ChatMessage.from_system("\nYou are a helpful, respectful and honest assistant"),
            ChatMessage.from_user("What's Natural Language Processing?")]

generator = HuggingFaceAPIChatGenerator(api_type="text_generation_inference",
                                        api_params={"url": "http://localhost:8080"})

result = generator.run(messages)
print(result)

HuggingFaceAPIChatGenerator.__init__

def __init__(api_type: Union[HFGenerationAPIType, str],
             api_params: Dict[str, str],
             token: Optional[Secret] = Secret.from_env_var(
                 ["HF_API_TOKEN", "HF_TOKEN"], strict=False),
             generation_kwargs: Optional[Dict[str, Any]] = None,
             stop_words: Optional[List[str]] = None,
             streaming_callback: Optional[Callable[[StreamingChunk],
                                                   None]] = None,
             tools: Optional[List[Tool]] = None)

Initialize the HuggingFaceAPIChatGenerator instance.

Arguments:

  • api_type: The type of Hugging Face API to use. Available types:
    • text_generation_inference: See TGI.
    • inference_endpoints: See Inference Endpoints.
    • serverless_inference_api: See Serverless Inference API.
  • api_params: A dictionary with the following keys:
    • model: Hugging Face model ID. Required when api_type is SERVERLESS_INFERENCE_API.
    • url: URL of the inference endpoint. Required when api_type is INFERENCE_ENDPOINTS or TEXT_GENERATION_INFERENCE.
  • token: The Hugging Face token to use as HTTP bearer authorization. Check your HF token in your account settings.
  • generation_kwargs: A dictionary with keyword arguments to customize text generation. Some examples: max_tokens, temperature, top_p. For details, see Hugging Face chat_completion documentation.
  • stop_words: An optional list of strings representing the stop words.
  • streaming_callback: An optional callable for handling streaming responses.
  • tools: A list of tools for which the model can prepare calls. The chosen model should support tool/function calling, according to the model card. Support for tools in the Hugging Face API and TGI is not yet fully refined and you may experience unexpected behavior.
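
A minimal initialization sketch for the serverless API combining several of these arguments (model name and parameter values are illustrative):

```python
from haystack.components.generators.chat import HuggingFaceAPIChatGenerator
from haystack.utils import Secret

generator = HuggingFaceAPIChatGenerator(
    api_type="serverless_inference_api",
    api_params={"model": "HuggingFaceH4/zephyr-7b-beta"},
    token=Secret.from_env_var(["HF_API_TOKEN", "HF_TOKEN"], strict=False),
    generation_kwargs={"max_tokens": 256, "temperature": 0.7, "top_p": 0.9},
    stop_words=["</s>"],
)
```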

HuggingFaceAPIChatGenerator.to_dict

def to_dict() -> Dict[str, Any]

Serialize this component to a dictionary.

Returns:

A dictionary containing the serialized component.

HuggingFaceAPIChatGenerator.from_dict

@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "HuggingFaceAPIChatGenerator"

Deserialize this component from a dictionary.

HuggingFaceAPIChatGenerator.run

@component.output_types(replies=List[ChatMessage])
def run(messages: List[ChatMessage],
        generation_kwargs: Optional[Dict[str, Any]] = None,
        tools: Optional[List[Tool]] = None)

Invoke the text generation inference based on the provided messages and generation parameters.

Arguments:

  • messages: A list of ChatMessage objects representing the input messages.
  • generation_kwargs: Additional keyword arguments for text generation.
  • tools: A list of tools for which the model can prepare calls. If set, it will override the tools parameter set during component initialization.

Returns:

A dictionary with the following keys:

  • replies: A list containing the generated responses as ChatMessage objects.
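
Continuing from the usage examples above, a short sketch of consuming the returned dictionary:

```python
result = generator.run(messages)

# "replies" holds the generated ChatMessage objects.
for reply in result["replies"]:
    print(reply)
```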

Module haystack_experimental.components.generators.ollama.chat.chat_generator

OllamaChatGenerator

Supports models running on Ollama.

Find the full list of supported models [here](https://ollama.ai/library).

Usage example:
```python
from haystack_experimental.components.generators.ollama import OllamaChatGenerator
from haystack_experimental.dataclasses import ChatMessage

generator = OllamaChatGenerator(model="zephyr",
                                url="http://localhost:11434",
                                generation_kwargs={
                                    "num_predict": 100,
                                    "temperature": 0.9,
                                })

messages = [ChatMessage.from_system("You are a helpful, respectful and honest assistant"),
            ChatMessage.from_user("What's Natural Language Processing?")]

print(generator.run(messages=messages))
```

OllamaChatGenerator.__init__

def __init__(model: str = "orca-mini",
             url: str = "http://localhost:11434",
             generation_kwargs: Optional[Dict[str, Any]] = None,
             timeout: int = 120,
             keep_alive: Optional[Union[float, str]] = None,
             streaming_callback: Optional[Callable[[StreamingChunk],
                                                   None]] = None,
             tools: Optional[List[Tool]] = None)

Creates an instance of OllamaChatGenerator.

Arguments:

  • model: The name of the model to use. The model should be available in the running Ollama instance.
  • url: The URL of a running Ollama instance.
  • generation_kwargs: Optional arguments to pass to the Ollama generation endpoint, such as temperature, top_p, and others. See the available arguments in Ollama docs.
  • timeout: The number of seconds before throwing a timeout error from the Ollama API.
  • keep_alive: The option that controls how long the model stays loaded in memory following the request. If not set, the Ollama default (5 minutes) is used. The value can be set to:
    • a duration string (such as "10m" or "24h")
    • a number in seconds (such as 3600)
    • any negative number, which keeps the model loaded in memory (e.g. -1 or "-1m")
    • '0', which unloads the model immediately after generating a response.
  • streaming_callback: A callback function that is called when a new token is received from the stream. The callback function accepts StreamingChunk as an argument.
  • tools: A list of tools for which the model can prepare calls. Not all models support tools. For a list of models compatible with tools, see the models page.
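
A short initialization sketch using keep_alive and a streaming callback (the model name and values are illustrative; the model must already be available in the running Ollama instance):

```python
from haystack_experimental.components.generators.ollama import OllamaChatGenerator

generator = OllamaChatGenerator(
    model="zephyr",                 # must be pulled in the Ollama instance
    url="http://localhost:11434",
    generation_kwargs={"temperature": 0.7, "num_predict": 128},
    timeout=60,
    keep_alive="10m",               # keep the model loaded for 10 minutes after the request
    streaming_callback=lambda chunk: print(chunk.content, end="", flush=True),
)
```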

OllamaChatGenerator.to_dict

def to_dict() -> Dict[str, Any]

Serialize this component to a dictionary.

Returns:

The serialized component as a dictionary.

OllamaChatGenerator.from_dict

@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "OllamaChatGenerator"

Deserialize this component from a dictionary.

Arguments:

  • data: The dictionary representation of this component.

Returns:

The deserialized component instance.

OllamaChatGenerator.run

@component.output_types(replies=List[ChatMessage])
def run(messages: List[ChatMessage],
        generation_kwargs: Optional[Dict[str, Any]] = None,
        tools: Optional[List[Tool]] = None)

Runs an Ollama Model on a given chat history.

Arguments:

  • messages: A list of ChatMessage instances representing the input messages.
  • generation_kwargs: Optional arguments to pass to the Ollama generation endpoint, such as temperature, top_p, etc. See the Ollama docs.
  • tools: A list of tools for which the model can prepare calls. If set, it will override the tools parameter set during component initialization.

Returns:

A dictionary with the following keys:

  • replies: The responses from the model.