Enables text generation using LLMs.
Module azure
AzureOpenAIGenerator
Generates text using OpenAI's large language models (LLMs).
It works with the gpt-4 and gpt-3.5-turbo family of models.
You can customize how the text is generated by passing parameters to the
OpenAI API. Use the generation_kwargs argument when you initialize
the component or when you run it. Any parameter that works with
openai.ChatCompletion.create will work here too.
For details on OpenAI API parameters, see OpenAI documentation.
Usage example
from haystack.components.generators import AzureOpenAIGenerator
from haystack.utils import Secret
client = AzureOpenAIGenerator(
    azure_endpoint="<Your Azure endpoint, e.g. https://your-company.azure.openai.com/>",
    api_key=Secret.from_token("<your-api-key>"),
    azure_deployment="<this is a model name, e.g. gpt-4o-mini>")
response = client.run("What's Natural Language Processing? Be brief.")
print(response)
>> {'replies': ['Natural Language Processing (NLP) is a branch of artificial intelligence that focuses on
>> the interaction between computers and human language. It involves enabling computers to understand, interpret,
>> and respond to natural human language in a way that is both meaningful and useful.'], 'meta': [{'model':
>> 'gpt-4o-mini', 'index': 0, 'finish_reason': 'stop', 'usage': {'prompt_tokens': 16,
>> 'completion_tokens': 49, 'total_tokens': 65}}]}
AzureOpenAIGenerator.__init__
def __init__(
azure_endpoint: Optional[str] = None,
api_version: Optional[str] = "2023-05-15",
azure_deployment: Optional[str] = "gpt-4o-mini",
api_key: Optional[Secret] = Secret.from_env_var("AZURE_OPENAI_API_KEY",
strict=False),
azure_ad_token: Optional[Secret] = Secret.from_env_var(
"AZURE_OPENAI_AD_TOKEN", strict=False),
organization: Optional[str] = None,
streaming_callback: Optional[Callable[[StreamingChunk], None]] = None,
system_prompt: Optional[str] = None,
timeout: Optional[float] = None,
max_retries: Optional[int] = None,
generation_kwargs: Optional[Dict[str, Any]] = None,
default_headers: Optional[Dict[str, str]] = None)
Initialize the Azure OpenAI Generator.
Arguments:
azure_endpoint
: The endpoint of the deployed model, for example https://example-resource.azure.openai.com/.
api_version
: The version of the API to use. Defaults to 2023-05-15.
azure_deployment
: The deployment of the model, usually the model name.
api_key
: The API key to use for authentication.
azure_ad_token
: Azure Active Directory token.
organization
: Your organization ID, defaults to None. For help, see Setting up your organization.
streaming_callback
: A callback function called when a new token is received from the stream. It accepts StreamingChunk as an argument.
system_prompt
: The system prompt to use for text generation. If not provided, the Generator omits the system prompt and uses the default system prompt.
timeout
: Timeout for AzureOpenAI client. If not set, it is inferred from the OPENAI_TIMEOUT environment variable or set to 30.
max_retries
: Maximum retries to establish contact with AzureOpenAI if it returns an internal error. If not set, it is inferred from the OPENAI_MAX_RETRIES environment variable or set to 5.
generation_kwargs
: Other parameters to use for the model, sent directly to the OpenAI endpoint. See OpenAI documentation for more details. Some of the supported parameters:
- max_tokens: The maximum number of tokens the output text can have.
- temperature: The sampling temperature to use. Higher values mean the model takes more risks. Try 0.9 for more creative applications and 0 (argmax sampling) for ones with a well-defined answer.
- top_p: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. For example, 0.1 means only the tokens comprising the top 10% probability mass are considered.
- n: The number of completions to generate for each prompt. For example, with 3 prompts and n=2, the LLM will generate two completions per prompt, resulting in 6 completions total.
- stop: One or more sequences after which the LLM should stop generating tokens.
- presence_penalty: The penalty applied if a token is already present. Higher values make the model less likely to repeat the token.
- frequency_penalty: Penalty applied if a token has already been generated. Higher values make the model less likely to repeat the token.
- logit_bias: Adds a logit bias to specific tokens. The keys of the dictionary are tokens, and the values are the bias to add to that token.
default_headers
: Default headers to use for the AzureOpenAI client.
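The fallback order described for timeout and max_retries (explicit argument first, then the environment variable, then a built-in default) can be sketched as follows. This is a minimal illustration of that rule, not the component's actual code; the helper names resolve_timeout and resolve_max_retries are hypothetical.

```python
import os
from typing import Optional


def resolve_timeout(timeout: Optional[float] = None) -> float:
    """Explicit value first, then OPENAI_TIMEOUT, then 30 seconds."""
    if timeout is not None:
        return timeout
    return float(os.environ.get("OPENAI_TIMEOUT", "30.0"))


def resolve_max_retries(max_retries: Optional[int] = None) -> int:
    """Explicit value first, then OPENAI_MAX_RETRIES, then 5 retries."""
    if max_retries is not None:
        return max_retries
    return int(os.environ.get("OPENAI_MAX_RETRIES", "5"))
```

An explicit argument always wins in this sketch, so the environment variables only matter when the parameter is left as None.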
AzureOpenAIGenerator.to_dict
def to_dict() -> Dict[str, Any]
Serialize this component to a dictionary.
Returns:
The serialized component as a dictionary.
AzureOpenAIGenerator.from_dict
@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "AzureOpenAIGenerator"
Deserialize this component from a dictionary.
Arguments:
data
: The dictionary representation of this component.
Returns:
The deserialized component instance.
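A serialization round trip with to_dict and from_dict can be pictured with the toy class below. ToyGenerator is a hypothetical stand-in, not a Haystack class, and the dictionary layout (a type name plus init_parameters) is an assumption made for illustration.

```python
from typing import Any, Dict, Optional


class ToyGenerator:
    """Hypothetical stand-in component; not a Haystack class."""

    def __init__(self, model: str = "gpt-4o-mini",
                 generation_kwargs: Optional[Dict[str, Any]] = None):
        self.model = model
        self.generation_kwargs = generation_kwargs or {}

    def to_dict(self) -> Dict[str, Any]:
        # Assumed layout: the class name plus the init parameters needed to rebuild it.
        return {
            "type": type(self).__name__,
            "init_parameters": {
                "model": self.model,
                "generation_kwargs": dict(self.generation_kwargs),
            },
        }

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "ToyGenerator":
        # Rebuild the instance by passing the stored parameters back to __init__.
        return cls(**data["init_parameters"])


original = ToyGenerator(generation_kwargs={"temperature": 0.2})
restored = ToyGenerator.from_dict(original.to_dict())
print(restored.model, restored.generation_kwargs)
```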
AzureOpenAIGenerator.run
@component.output_types(replies=List[str], meta=List[Dict[str, Any]])
def run(prompt: str,
system_prompt: Optional[str] = None,
streaming_callback: Optional[Callable[[StreamingChunk], None]] = None,
generation_kwargs: Optional[Dict[str, Any]] = None)
Invoke the text generation inference based on the provided messages and generation parameters.
Arguments:
prompt
: The string prompt to use for text generation.
system_prompt
: The system prompt to use for text generation. If this run time system prompt is omitted, the system prompt, if defined at initialisation time, is used.
streaming_callback
: A callback function that is called when a new token is received from the stream.
generation_kwargs
: Additional keyword arguments for text generation. These parameters will potentially override the parameters passed in the __init__ method. For more details on the parameters supported by the OpenAI API, refer to the OpenAI documentation.
Returns:
A dictionary with two keys: replies, a list of strings containing the generated responses, and meta, a list of dictionaries containing the metadata for each response.
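One way to picture how generation_kwargs passed to run can override those passed to __init__ is a key-by-key dictionary merge. The sketch below assumes that behavior; merge_generation_kwargs is a hypothetical helper, not part of the component.

```python
from typing import Any, Dict, Optional


def merge_generation_kwargs(init_kwargs: Optional[Dict[str, Any]],
                            run_kwargs: Optional[Dict[str, Any]]) -> Dict[str, Any]:
    # Later entries win, so run-time values replace init-time values key by key.
    return {**(init_kwargs or {}), **(run_kwargs or {})}


init_kwargs = {"max_tokens": 100, "temperature": 0.9}
merged = merge_generation_kwargs(init_kwargs, {"temperature": 0.0})
print(merged)  # {'max_tokens': 100, 'temperature': 0.0}
```

Keys that appear only at initialization (max_tokens here) are kept, and only the overlapping key changes.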
Module hugging_face_local
HuggingFaceLocalGenerator
Generates text using models from Hugging Face that run locally.
LLMs running locally may need powerful hardware.
Usage example
from haystack.components.generators import HuggingFaceLocalGenerator
generator = HuggingFaceLocalGenerator(
model="google/flan-t5-large",
task="text2text-generation",
generation_kwargs={"max_new_tokens": 100, "temperature": 0.9})
generator.warm_up()
print(generator.run("Who is the best American actor?"))
# {'replies': ['John Cusack']}
HuggingFaceLocalGenerator.__init__
def __init__(model: str = "google/flan-t5-base",
task: Optional[Literal["text-generation",
"text2text-generation"]] = None,
device: Optional[ComponentDevice] = None,
token: Optional[Secret] = Secret.from_env_var(
["HF_API_TOKEN", "HF_TOKEN"], strict=False),
generation_kwargs: Optional[Dict[str, Any]] = None,
huggingface_pipeline_kwargs: Optional[Dict[str, Any]] = None,
stop_words: Optional[List[str]] = None,
streaming_callback: Optional[Callable[[StreamingChunk],
None]] = None)
Creates an instance of a HuggingFaceLocalGenerator.
Arguments:
model
: The Hugging Face text generation model name or path.
task
: The task for the Hugging Face pipeline. Possible options:
- text-generation: Supported by decoder models, like GPT.
- text2text-generation: Supported by encoder-decoder models, like T5.
If the task is specified in huggingface_pipeline_kwargs, this parameter is ignored. If not specified, the component calls the Hugging Face API to infer the task from the model name.
device
: The device for loading the model. If None, automatically selects the default device. If a device or device map is specified in huggingface_pipeline_kwargs, it overrides this parameter.
token
: The token to use as HTTP bearer authorization for remote files. If the token is specified in huggingface_pipeline_kwargs, this parameter is ignored.
generation_kwargs
: A dictionary with keyword arguments to customize text generation. Some examples: max_length, max_new_tokens, temperature, top_k, top_p. See Hugging Face's documentation for more information:
- customize-text-generation
- transformers.GenerationConfig
huggingface_pipeline_kwargs
: Dictionary with keyword arguments to initialize the Hugging Face pipeline for text generation. These keyword arguments provide fine-grained control over the Hugging Face pipeline. In case of duplication, these kwargs override model, task, device, and token init parameters. For available kwargs, see Hugging Face documentation. In this dictionary, you can also include model_kwargs to specify the kwargs for model initialization: transformers.PreTrainedModel.from_pretrained
stop_words
: If the model generates a stop word, the generation stops. If you provide this parameter, don't specify the stopping_criteria in generation_kwargs. For some chat models, the output includes both the new text and the original prompt. In these cases, make sure your prompt has no stop words.
streaming_callback
: An optional callable for handling streaming responses.
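The precedence rule for huggingface_pipeline_kwargs (values given there win over the model and task init parameters) can be sketched with setdefault. This is only an illustration of the rule; resolve_pipeline_kwargs is a hypothetical helper and the device and token handling is left out.

```python
from typing import Any, Dict, Optional


def resolve_pipeline_kwargs(model: str,
                            task: Optional[str],
                            huggingface_pipeline_kwargs: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    # Copy so the caller's dictionary is not modified.
    resolved = dict(huggingface_pipeline_kwargs or {})
    # setdefault only fills keys the caller did not already provide.
    resolved.setdefault("model", model)
    if task is not None:
        resolved.setdefault("task", task)
    return resolved


resolved = resolve_pipeline_kwargs(
    model="google/flan-t5-base",
    task="text2text-generation",
    huggingface_pipeline_kwargs={"model": "google/flan-t5-large"},
)
print(resolved)
```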
HuggingFaceLocalGenerator.warm_up
def warm_up()
Initializes the component.
HuggingFaceLocalGenerator.to_dict
def to_dict() -> Dict[str, Any]
Serializes the component to a dictionary.
Returns:
Dictionary with serialized data.
HuggingFaceLocalGenerator.from_dict
@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "HuggingFaceLocalGenerator"
Deserializes the component from a dictionary.
Arguments:
data
: The dictionary to deserialize from.
Returns:
The deserialized component.
HuggingFaceLocalGenerator.run
@component.output_types(replies=List[str])
def run(prompt: str,
streaming_callback: Optional[Callable[[StreamingChunk], None]] = None,
generation_kwargs: Optional[Dict[str, Any]] = None)
Run the text generation model on the given prompt.
Arguments:
prompt
: A string representing the prompt.
streaming_callback
: A callback function that is called when a new token is received from the stream.
generation_kwargs
: Additional keyword arguments for text generation.
Returns:
A dictionary containing the generated replies.
- replies: A list of strings representing the generated replies.
Module hugging_face_api
HuggingFaceAPIGenerator
Generates text using Hugging Face APIs.
Use it with the following Hugging Face APIs:
- [Free Serverless Inference API](https://huggingface.co/inference-api)
- Paid Inference Endpoints
- Self-hosted Text Generation Inference
Usage examples
With the free serverless inference API
from haystack.components.generators import HuggingFaceAPIGenerator
from haystack.utils import Secret
generator = HuggingFaceAPIGenerator(api_type="serverless_inference_api",
api_params={"model": "HuggingFaceH4/zephyr-7b-beta"},
token=Secret.from_token("<your-api-key>"))
result = generator.run(prompt="What's Natural Language Processing?")
print(result)
With paid inference endpoints
from haystack.components.generators import HuggingFaceAPIGenerator
from haystack.utils import Secret
generator = HuggingFaceAPIGenerator(api_type="inference_endpoints",
api_params={"url": "<your-inference-endpoint-url>"},
token=Secret.from_token("<your-api-key>"))
result = generator.run(prompt="What's Natural Language Processing?")
print(result)
With self-hosted text generation inference
from haystack.components.generators import HuggingFaceAPIGenerator
generator = HuggingFaceAPIGenerator(api_type="text_generation_inference",
api_params={"url": "http://localhost:8080"})
result = generator.run(prompt="What's Natural Language Processing?")
print(result)
HuggingFaceAPIGenerator.__init__
def __init__(api_type: Union[HFGenerationAPIType, str],
api_params: Dict[str, str],
token: Optional[Secret] = Secret.from_env_var(
["HF_API_TOKEN", "HF_TOKEN"], strict=False),
generation_kwargs: Optional[Dict[str, Any]] = None,
stop_words: Optional[List[str]] = None,
streaming_callback: Optional[Callable[[StreamingChunk],
None]] = None)
Initialize the HuggingFaceAPIGenerator instance.
Arguments:
api_type
: The type of Hugging Face API to use. Available types:
- text_generation_inference: See TGI.
- inference_endpoints: See Inference Endpoints.
- serverless_inference_api: See Serverless Inference API.
api_params
: A dictionary with the following keys:
- model: Hugging Face model ID. Required when api_type is SERVERLESS_INFERENCE_API.
- url: URL of the inference endpoint. Required when api_type is INFERENCE_ENDPOINTS or TEXT_GENERATION_INFERENCE.
token
: The Hugging Face token to use as HTTP bearer authorization. Check your HF token in your account settings.
generation_kwargs
: A dictionary with keyword arguments to customize text generation. Some examples: max_new_tokens, temperature, top_k, top_p. For details, see Hugging Face documentation.
stop_words
: An optional list of strings representing the stop words.
streaming_callback
: An optional callable for handling streaming responses.
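The relationship between api_type and the required api_params key can be sketched as a small validation step. The ApiType enum below is a hypothetical stand-in for HFGenerationAPIType, and check_api_params is an illustrative helper, not the component's own validation code.

```python
from enum import Enum
from typing import Dict, Union


class ApiType(str, Enum):
    """Hypothetical stand-in for HFGenerationAPIType."""
    TEXT_GENERATION_INFERENCE = "text_generation_inference"
    INFERENCE_ENDPOINTS = "inference_endpoints"
    SERVERLESS_INFERENCE_API = "serverless_inference_api"


def check_api_params(api_type: Union[ApiType, str], api_params: Dict[str, str]) -> ApiType:
    # Accepts an enum member or its string value; unknown strings raise ValueError.
    api_type = ApiType(api_type)
    # The serverless API needs a model ID; the other two need an endpoint URL.
    required = "model" if api_type is ApiType.SERVERLESS_INFERENCE_API else "url"
    if required not in api_params:
        raise ValueError(f"api_params must contain '{required}' when api_type is {api_type.value}")
    return api_type


print(check_api_params("serverless_inference_api", {"model": "HuggingFaceH4/zephyr-7b-beta"}))
print(check_api_params(ApiType.TEXT_GENERATION_INFERENCE, {"url": "http://localhost:8080"}))
```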
HuggingFaceAPIGenerator.to_dict
def to_dict() -> Dict[str, Any]
Serialize this component to a dictionary.
Returns:
A dictionary containing the serialized component.
HuggingFaceAPIGenerator.from_dict
@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "HuggingFaceAPIGenerator"
Deserialize this component from a dictionary.
HuggingFaceAPIGenerator.run
@component.output_types(replies=List[str], meta=List[Dict[str, Any]])
def run(prompt: str,
streaming_callback: Optional[Callable[[StreamingChunk], None]] = None,
generation_kwargs: Optional[Dict[str, Any]] = None)
Invoke the text generation inference for the given prompt and generation parameters.
Arguments:
prompt
: A string representing the prompt.
streaming_callback
: A callback function that is called when a new token is received from the stream.
generation_kwargs
: Additional keyword arguments for text generation.
Returns:
A dictionary with the generated replies and metadata. Both are lists of length n.
- replies: A list of strings representing the generated replies.
- meta: A list of dictionaries with the metadata for each reply.
Module openai
OpenAIGenerator
Generates text using OpenAI's large language models (LLMs).
It works with the gpt-4 and gpt-3.5-turbo models and supports streaming responses from OpenAI API. It uses strings as input and output.
You can customize how the text is generated by passing parameters to the
OpenAI API. Use the generation_kwargs argument when you initialize
the component or when you run it. Any parameter that works with
openai.ChatCompletion.create will work here too.
For details on OpenAI API parameters, see OpenAI documentation.
Usage example
from haystack.components.generators import OpenAIGenerator
client = OpenAIGenerator()
response = client.run("What's Natural Language Processing? Be brief.")
print(response)
>> {'replies': ['Natural Language Processing (NLP) is a branch of artificial intelligence that focuses on
>> the interaction between computers and human language. It involves enabling computers to understand, interpret,
>> and respond to natural human language in a way that is both meaningful and useful.'], 'meta': [{'model':
>> 'gpt-4o-mini', 'index': 0, 'finish_reason': 'stop', 'usage': {'prompt_tokens': 16,
>> 'completion_tokens': 49, 'total_tokens': 65}}]}
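The printed result pairs each reply with its metadata by position. The sketch below uses a hand-written dictionary shaped like the output above (with the reply text shortened), so it runs without an API key; it is not a live response.

```python
# Sample result shaped like the output above (reply text shortened for brevity).
response = {
    "replies": ["Natural Language Processing (NLP) is a branch of artificial intelligence."],
    "meta": [{"model": "gpt-4o-mini", "index": 0, "finish_reason": "stop",
              "usage": {"prompt_tokens": 16, "completion_tokens": 49, "total_tokens": 65}}],
}

# replies and meta are parallel lists: meta[i] describes replies[i].
for reply, meta in zip(response["replies"], response["meta"]):
    print(f"[{meta['finish_reason']}] {reply}")
    print("tokens used:", meta["usage"]["total_tokens"])
```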
OpenAIGenerator.__init__
def __init__(api_key: Secret = Secret.from_env_var("OPENAI_API_KEY"),
model: str = "gpt-4o-mini",
streaming_callback: Optional[Callable[[StreamingChunk],
None]] = None,
api_base_url: Optional[str] = None,
organization: Optional[str] = None,
system_prompt: Optional[str] = None,
generation_kwargs: Optional[Dict[str, Any]] = None,
timeout: Optional[float] = None,
max_retries: Optional[int] = None)
Creates an instance of OpenAIGenerator. Unless specified otherwise in model, uses OpenAI's gpt-4o-mini.
By setting the OPENAI_TIMEOUT and OPENAI_MAX_RETRIES environment variables, you can change the timeout and max_retries parameters in the OpenAI client.
Arguments:
api_key
: The OpenAI API key to connect to OpenAI.
model
: The name of the model to use.
streaming_callback
: A callback function that is called when a new token is received from the stream. The callback function accepts StreamingChunk as an argument.
api_base_url
: An optional base URL.
organization
: The Organization ID, defaults to None.
system_prompt
: The system prompt to use for text generation. If not provided, the system prompt is omitted, and the default system prompt of the model is used.
generation_kwargs
: Other parameters to use for the model. These parameters are all sent directly to the OpenAI endpoint. See OpenAI documentation for more details. Some of the supported parameters:
- max_tokens: The maximum number of tokens the output text can have.
- temperature: The sampling temperature to use. Higher values mean the model will take more risks. Try 0.9 for more creative applications and 0 (argmax sampling) for ones with a well-defined answer.
- top_p: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So, 0.1 means only the tokens comprising the top 10% probability mass are considered.
- n: How many completions to generate for each prompt. For example, if the LLM gets 3 prompts and n is 2, it will generate two completions for each of the three prompts, ending up with 6 completions in total.
- stop: One or more sequences after which the LLM should stop generating tokens.
- presence_penalty: The penalty to apply if a token is already present at all. Bigger values mean the model will be less likely to repeat the same token in the text.
- frequency_penalty: The penalty to apply if a token has already been generated in the text. Bigger values mean the model will be less likely to repeat the same token in the text.
- logit_bias: Add a logit bias to specific tokens. The keys of the dictionary are tokens, and the values are the bias to add to that token.
timeout
: Timeout for OpenAI client calls. If not set, it is inferred from the OPENAI_TIMEOUT environment variable or set to 30.
max_retries
: Maximum retries to establish contact with OpenAI if it returns an internal error. If not set, it is inferred from the OPENAI_MAX_RETRIES environment variable or set to 5.
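The top_p description can be made concrete with a toy distribution. The sketch below shows one common formulation of nucleus filtering (keep the most likely tokens until their cumulative probability reaches top_p). The token names and probabilities are made up, and nothing here claims to mirror how the OpenAI service implements sampling.

```python
from typing import Dict, List


def nucleus_candidates(token_probs: Dict[str, float], top_p: float) -> List[str]:
    """Return the smallest set of most-likely tokens whose total probability reaches top_p."""
    kept: List[str] = []
    cumulative = 0.0
    # Walk tokens from most to least likely.
    for token, prob in sorted(token_probs.items(), key=lambda item: item[1], reverse=True):
        kept.append(token)
        cumulative += prob
        if cumulative >= top_p:
            break
    return kept


probs = {"cat": 0.5, "dog": 0.3, "fish": 0.15, "bird": 0.05}
print(nucleus_candidates(probs, top_p=0.1))  # ['cat']
print(nucleus_candidates(probs, top_p=0.7))  # ['cat', 'dog']
```

With top_p=0.1 the single most likely token already covers the required mass, which is why low values make the output close to deterministic.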
OpenAIGenerator.to_dict
def to_dict() -> Dict[str, Any]
Serialize this component to a dictionary.
Returns:
The serialized component as a dictionary.
OpenAIGenerator.from_dict
@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "OpenAIGenerator"
Deserialize this component from a dictionary.
Arguments:
data
: The dictionary representation of this component.
Returns:
The deserialized component instance.
OpenAIGenerator.run
@component.output_types(replies=List[str], meta=List[Dict[str, Any]])
def run(prompt: str,
system_prompt: Optional[str] = None,
streaming_callback: Optional[Callable[[StreamingChunk], None]] = None,
generation_kwargs: Optional[Dict[str, Any]] = None)
Invoke the text generation inference based on the provided messages and generation parameters.
Arguments:
prompt
: The string prompt to use for text generation.
system_prompt
: The system prompt to use for text generation. If this run time system prompt is omitted, the system prompt, if defined at initialisation time, is used.
streaming_callback
: A callback function that is called when a new token is received from the stream.
generation_kwargs
: Additional keyword arguments for text generation. These parameters will potentially override the parameters passed in the __init__ method. For more details on the parameters supported by the OpenAI API, refer to the OpenAI documentation.
Returns:
A dictionary with two keys: replies, a list of strings containing the generated responses, and meta, a list of dictionaries containing the metadata for each response.
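A streaming_callback is a function that receives one chunk at a time. The sketch below imitates that flow without calling any API: FakeChunk is a hypothetical stand-in for StreamingChunk (assuming the chunk exposes its text as a content attribute), and fake_stream plays the role of the generator invoking the callback.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class FakeChunk:
    """Hypothetical stand-in for StreamingChunk; only the text payload matters here."""
    content: str
    meta: Dict[str, Any] = field(default_factory=dict)


collected: List[str] = []


def collect_chunk(chunk: FakeChunk) -> None:
    # Called once per streamed chunk; here it just stores the text.
    collected.append(chunk.content)


def fake_stream(callback: Callable[[FakeChunk], None]) -> None:
    # Stands in for the generator, which would invoke the callback as tokens arrive.
    for piece in ["Natural ", "Language ", "Processing"]:
        callback(FakeChunk(content=piece))


fake_stream(collect_chunk)
print("".join(collected))  # Natural Language Processing
```

In real use, a callback with the same shape would be passed as streaming_callback at initialization or to run.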
Module chat/azure
AzureOpenAIChatGenerator
Generates text using OpenAI's models on Azure.
It works with the gpt-4 and gpt-3.5-turbo type models and supports streaming responses from OpenAI API. It uses ChatMessage format in input and output.
You can customize how the text is generated by passing parameters to the
OpenAI API. Use the generation_kwargs argument when you initialize
the component or when you run it. Any parameter that works with
openai.ChatCompletion.create will work here too.
For details on OpenAI API parameters, see OpenAI documentation.
Usage example
from haystack.components.generators.chat import AzureOpenAIChatGenerator
from haystack.dataclasses import ChatMessage
from haystack.utils import Secret
messages = [ChatMessage.from_user("What's Natural Language Processing?")]
client = AzureOpenAIChatGenerator(
    azure_endpoint="<Your Azure endpoint, e.g. https://your-company.azure.openai.com/>",
    api_key=Secret.from_token("<your-api-key>"),
    azure_deployment="<this is a model name, e.g. gpt-4o-mini>")
response = client.run(messages)
print(response)
{'replies':
[ChatMessage(content='Natural Language Processing (NLP) is a branch of artificial intelligence that focuses on
enabling computers to understand, interpret, and generate human language in a way that is useful.',
role=<ChatRole.ASSISTANT: 'assistant'>, name=None,
meta={'model': 'gpt-4o-mini', 'index': 0, 'finish_reason': 'stop',
'usage': {'prompt_tokens': 15, 'completion_tokens': 36, 'total_tokens': 51}})]
}
AzureOpenAIChatGenerator.__init__
def __init__(
azure_endpoint: Optional[str] = None,
api_version: Optional[str] = "2023-05-15",
azure_deployment: Optional[str] = "gpt-4o-mini",
api_key: Optional[Secret] = Secret.from_env_var("AZURE_OPENAI_API_KEY",
strict=False),
azure_ad_token: Optional[Secret] = Secret.from_env_var(
"AZURE_OPENAI_AD_TOKEN", strict=False),
organization: Optional[str] = None,
streaming_callback: Optional[Callable[[StreamingChunk], None]] = None,
timeout: Optional[float] = None,
max_retries: Optional[int] = None,
generation_kwargs: Optional[Dict[str, Any]] = None,
default_headers: Optional[Dict[str, str]] = None)
Initialize the Azure OpenAI Chat Generator component.
Arguments:
azure_endpoint
: The endpoint of the deployed model, for example "https://example-resource.azure.openai.com/".
api_version
: The version of the API to use. Defaults to 2023-05-15.
azure_deployment
: The deployment of the model, usually the model name.
api_key
: The API key to use for authentication.
azure_ad_token
: Azure Active Directory token.
organization
: Your organization ID, defaults to None. For help, see Setting up your organization.
streaming_callback
: A callback function called when a new token is received from the stream. It accepts StreamingChunk as an argument.
timeout
: Timeout for OpenAI client calls. If not set, it defaults to either the OPENAI_TIMEOUT environment variable, or 30 seconds.
max_retries
: Maximum number of retries to contact OpenAI after an internal error. If not set, it defaults to either the OPENAI_MAX_RETRIES environment variable, or 5.
generation_kwargs
: Other parameters to use for the model. These parameters are sent directly to the OpenAI endpoint. For details, see OpenAI documentation. Some of the supported parameters:
- max_tokens: The maximum number of tokens the output text can have.
- temperature: The sampling temperature to use. Higher values mean the model takes more risks. Try 0.9 for more creative applications and 0 (argmax sampling) for ones with a well-defined answer.
- top_p: Nucleus sampling is an alternative to sampling with temperature, where the model considers tokens with a top_p probability mass. For example, 0.1 means only the tokens comprising the top 10% probability mass are considered.
- n: The number of completions to generate for each prompt. For example, with 3 prompts and n=2, the LLM will generate two completions per prompt, resulting in 6 completions total.
- stop: One or more sequences after which the LLM should stop generating tokens.
- presence_penalty: The penalty applied if a token is already present. Higher values make the model less likely to repeat the token.
- frequency_penalty: Penalty applied if a token has already been generated. Higher values make the model less likely to repeat the token.
- logit_bias: Adds a logit bias to specific tokens. The keys of the dictionary are tokens, and the values are the bias to add to that token.
default_headers
: Default headers to use for the AzureOpenAI client.
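The effect of temperature can be illustrated with a temperature-scaled softmax over a few made-up logits. This is a textbook sketch of the idea, with 0 treated as argmax sampling as described above; it is not a description of the service's internals, and the logits are arbitrary.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    if temperature <= 0:
        # Treat 0 as argmax sampling: all probability on the highest logit.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]
for t in (0, 0.5, 0.9, 2.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```

Lower temperatures concentrate probability on the top token, while higher ones flatten the distribution, which is the sense in which the model "takes more risks".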
AzureOpenAIChatGenerator.to_dict
def to_dict() -> Dict[str, Any]
Serialize this component to a dictionary.
Returns:
The serialized component as a dictionary.
AzureOpenAIChatGenerator.from_dict
@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "AzureOpenAIChatGenerator"
Deserialize this component from a dictionary.
Arguments:
data
: The dictionary representation of this component.
Returns:
The deserialized component instance.
AzureOpenAIChatGenerator.run
@component.output_types(replies=List[ChatMessage])
def run(messages: List[ChatMessage],
streaming_callback: Optional[Callable[[StreamingChunk], None]] = None,
generation_kwargs: Optional[Dict[str, Any]] = None)
Invokes chat completion based on the provided messages and generation parameters.
Arguments:
messages
: A list of ChatMessage instances representing the input messages.
streaming_callback
: A callback function that is called when a new token is received from the stream.
generation_kwargs
: Additional keyword arguments for text generation. These parameters will override the parameters passed during component initialization. For details on OpenAI API parameters, see OpenAI documentation.
Returns:
A dictionary with the following keys:
- replies: A list containing the generated responses as ChatMessage instances.
Module chat/hugging_face_local
HuggingFaceLocalChatGenerator
Generates chat responses using models from Hugging Face that run locally.
Use this component with chat-based models, such as HuggingFaceH4/zephyr-7b-beta or meta-llama/Llama-2-7b-chat-hf.
LLMs running locally may need powerful hardware.
Usage example
from haystack.components.generators.chat import HuggingFaceLocalChatGenerator
from haystack.dataclasses import ChatMessage
generator = HuggingFaceLocalChatGenerator(model="HuggingFaceH4/zephyr-7b-beta")
generator.warm_up()
messages = [ChatMessage.from_user("What's Natural Language Processing? Be brief.")]
print(generator.run(messages))
{'replies':
[ChatMessage(content=' Natural Language Processing (NLP) is a subfield of artificial intelligence that deals
with the interaction between computers and human language. It enables computers to understand, interpret, and
generate human language in a valuable way. NLP involves various techniques such as speech recognition, text
analysis, sentiment analysis, and machine translation. The ultimate goal is to make it easier for computers to
process and derive meaning from human language, improving communication between humans and machines.',
role=<ChatRole.ASSISTANT: 'assistant'>,
name=None,
meta={'finish_reason': 'stop', 'index': 0, 'model':
'mistralai/Mistral-7B-Instruct-v0.2',
'usage': {'completion_tokens': 90, 'prompt_tokens': 19, 'total_tokens': 109}})
]
}
HuggingFaceLocalChatGenerator.__init__
def __init__(model: str = "HuggingFaceH4/zephyr-7b-beta",
task: Optional[Literal["text-generation",
"text2text-generation"]] = None,
device: Optional[ComponentDevice] = None,
token: Optional[Secret] = Secret.from_env_var(
["HF_API_TOKEN", "HF_TOKEN"], strict=False),
chat_template: Optional[str] = None,
generation_kwargs: Optional[Dict[str, Any]] = None,
huggingface_pipeline_kwargs: Optional[Dict[str, Any]] = None,
stop_words: Optional[List[str]] = None,
streaming_callback: Optional[Callable[[StreamingChunk],
None]] = None)
Initializes the HuggingFaceLocalChatGenerator component.
Arguments:
model
: The Hugging Face text generation model name or path, for example, mistralai/Mistral-7B-Instruct-v0.2 or TheBloke/OpenHermes-2.5-Mistral-7B-16k-AWQ. The model must be a chat model supporting the ChatML messaging format. If the model is specified in huggingface_pipeline_kwargs, this parameter is ignored.
task
: The task for the Hugging Face pipeline. Possible options:
- text-generation: Supported by decoder models, like GPT.
- text2text-generation: Supported by encoder-decoder models, like T5.
If the task is specified in huggingface_pipeline_kwargs, this parameter is ignored. If not specified, the component calls the Hugging Face API to infer the task from the model name.
device
: The device for loading the model. If None, automatically selects the default device. If a device or device map is specified in huggingface_pipeline_kwargs, it overrides this parameter.
token
: The token to use as HTTP bearer authorization for remote files. If the token is specified in huggingface_pipeline_kwargs, this parameter is ignored.
chat_template
: Specifies an optional Jinja template for formatting chat messages. Most high-quality chat models have their own templates, but for models without this feature or if you prefer a custom template, use this parameter.
generation_kwargs
: A dictionary with keyword arguments to customize text generation. Some examples: max_length, max_new_tokens, temperature, top_k, top_p. See Hugging Face's documentation for more information:
- GenerationConfig
The only generation_kwargs set by default is max_new_tokens, which is set to 512 tokens.
huggingface_pipeline_kwargs
: Dictionary with keyword arguments to initialize the Hugging Face pipeline for text generation. These keyword arguments provide fine-grained control over the Hugging Face pipeline. In case of duplication, these kwargs override model, task, device, and token init parameters. For kwargs, see Hugging Face documentation. In this dictionary, you can also include model_kwargs to specify the kwargs for model initialization.
stop_words
: A list of stop words. If the model generates a stop word, the generation stops. If you provide this parameter, don't specify the stopping_criteria in generation_kwargs. For some chat models, the output includes both the new text and the original prompt. In these cases, make sure your prompt has no stop words.
streaming_callback
: An optional callable for handling streaming responses.
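The warning about stop words in the prompt can be illustrated with a simple text-level cut. The sketch below is only an analogy for stopping at a stop word (the component's actual stopping mechanism is not shown in this reference), and truncate_at_stop_word is a hypothetical helper.

```python
from typing import List


def truncate_at_stop_word(text: str, stop_words: List[str]) -> str:
    """Cut the text at the earliest occurrence of any stop word."""
    cut = len(text)
    for word in stop_words:
        position = text.find(word)
        if position != -1:
            cut = min(cut, position)
    return text[:cut]


# The stop word appears in the generated part, so only the tail is dropped.
print(truncate_at_stop_word("Paris is the capital.</s> Extra text", ["</s>", "###"]))

# If the output echoes a prompt that contains a stop word, the cut lands too early.
print(truncate_at_stop_word("Question ### Answer: Paris", ["###"]))
```

The second call shows why a prompt containing a stop word is a problem when the model's output includes the prompt: the answer is lost.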
HuggingFaceLocalChatGenerator.warm_up
def warm_up()
Initializes the component.
HuggingFaceLocalChatGenerator.to_dict
def to_dict() -> Dict[str, Any]
Serializes the component to a dictionary.
Returns:
Dictionary with serialized data.
HuggingFaceLocalChatGenerator.from_dict
@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "HuggingFaceLocalChatGenerator"
Deserializes the component from a dictionary.
Arguments:
data
: The dictionary to deserialize from.
Returns:
The deserialized component.
HuggingFaceLocalChatGenerator.run
@component.output_types(replies=List[ChatMessage])
def run(messages: List[ChatMessage],
generation_kwargs: Optional[Dict[str, Any]] = None)
Invoke text generation inference based on the provided messages and generation parameters.
Arguments:
messages
: A list of ChatMessage objects representing the input messages.
generation_kwargs
: Additional keyword arguments for text generation.
Returns:
A dictionary with the following keys:
- replies: A list containing the generated responses as ChatMessage instances.
HuggingFaceLocalChatGenerator.create_message
def create_message(text: str, index: int,
tokenizer: Union["PreTrainedTokenizer",
"PreTrainedTokenizerFast"], prompt: str,
generation_kwargs: Dict[str, Any]) -> ChatMessage
Create a ChatMessage instance from the provided text, populated with metadata.
Arguments:
text
: The generated text.
index
: The index of the generated text.
tokenizer
: The tokenizer used for generation.
prompt
: The prompt used for generation.
generation_kwargs
: The generation parameters.
Returns:
A ChatMessage instance.
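The metadata attached to a reply in the usage example has a fixed shape: finish_reason, index, model, and a usage block of token counts. The sketch below builds a dictionary of that shape from already-known counts. build_meta is a hypothetical helper; how the real method counts tokens or decides finish_reason is not covered here.

```python
from typing import Any, Dict


def build_meta(model: str, index: int, prompt_tokens: int, completion_tokens: int,
               finish_reason: str = "stop") -> Dict[str, Any]:
    # Mirrors the keys shown in the example output; total is the sum of both counts.
    return {
        "finish_reason": finish_reason,
        "index": index,
        "model": model,
        "usage": {
            "completion_tokens": completion_tokens,
            "prompt_tokens": prompt_tokens,
            "total_tokens": prompt_tokens + completion_tokens,
        },
    }


meta = build_meta("HuggingFaceH4/zephyr-7b-beta", index=0, prompt_tokens=19, completion_tokens=90)
print(meta["usage"]["total_tokens"])  # 109
```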
Module chat/hugging_face_api
HuggingFaceAPIChatGenerator
Completes chats using Hugging Face APIs.
HuggingFaceAPIChatGenerator uses the ChatMessage format for input and output. Use it to generate text with Hugging Face APIs:
- Free Serverless Inference API
- Paid Inference Endpoints
- Self-hosted Text Generation Inference
Usage examples
With the free serverless inference API
from haystack.components.generators.chat import HuggingFaceAPIChatGenerator
from haystack.dataclasses import ChatMessage
from haystack.utils import Secret
from haystack.utils.hf import HFGenerationAPIType
messages = [ChatMessage.from_system("\nYou are a helpful, respectful and honest assistant"),
ChatMessage.from_user("What's Natural Language Processing?")]
# the api_type can be expressed using the HFGenerationAPIType enum or as a string
api_type = HFGenerationAPIType.SERVERLESS_INFERENCE_API
api_type = "serverless_inference_api" # this is equivalent to the above
generator = HuggingFaceAPIChatGenerator(api_type=api_type,
api_params={"model": "HuggingFaceH4/zephyr-7b-beta"},
token=Secret.from_token("<your-api-key>"))
result = generator.run(messages)
print(result)
With paid inference endpoints
from haystack.components.generators.chat import HuggingFaceAPIChatGenerator
from haystack.dataclasses import ChatMessage
from haystack.utils import Secret
messages = [ChatMessage.from_system("\nYou are a helpful, respectful and honest assistant"),
ChatMessage.from_user("What's Natural Language Processing?")]
generator = HuggingFaceAPIChatGenerator(api_type="inference_endpoints",
api_params={"url": "<your-inference-endpoint-url>"},
token=Secret.from_token("<your-api-key>"))
result = generator.run(messages)
print(result)
With self-hosted text generation inference
from haystack.components.generators.chat import HuggingFaceAPIChatGenerator
from haystack.dataclasses import ChatMessage
messages = [ChatMessage.from_system("\nYou are a helpful, respectful and honest assistant"),
ChatMessage.from_user("What's Natural Language Processing?")]
generator = HuggingFaceAPIChatGenerator(api_type="text_generation_inference",
api_params={"url": "http://localhost:8080"})
result = generator.run(messages)
print(result)
HuggingFaceAPIChatGenerator.__init__
def __init__(api_type: Union[HFGenerationAPIType, str],
api_params: Dict[str, str],
token: Optional[Secret] = Secret.from_env_var(
["HF_API_TOKEN", "HF_TOKEN"], strict=False),
generation_kwargs: Optional[Dict[str, Any]] = None,
stop_words: Optional[List[str]] = None,
streaming_callback: Optional[Callable[[StreamingChunk],
None]] = None)
Initialize the HuggingFaceAPIChatGenerator instance.
Arguments:
api_type
: The type of Hugging Face API to use. Available types:
  text_generation_inference
  : See TGI.
  inference_endpoints
  : See Inference Endpoints.
  serverless_inference_api
  : See Serverless Inference API.
api_params
: A dictionary with the following keys:
  model
  : Hugging Face model ID. Required when api_type is SERVERLESS_INFERENCE_API.
  url
  : URL of the inference endpoint. Required when api_type is INFERENCE_ENDPOINTS or TEXT_GENERATION_INFERENCE.
token
: The Hugging Face token to use as HTTP bearer authorization. Check your HF token in your account settings.
generation_kwargs
: A dictionary with keyword arguments to customize text generation. Some examples: max_tokens, temperature, top_p. For details, see Hugging Face chat_completion documentation.
stop_words
: An optional list of strings representing the stop words.
streaming_callback
: An optional callable for handling streaming responses.
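The pairing between api_type and api_params can be sketched as a small standalone check. This is an illustration of the documented rule only, not the component's own validation code: ApiType is a stand-in for HFGenerationAPIType, and check_api_params is a hypothetical helper.

```python
from enum import Enum
from typing import Dict, Union


class ApiType(str, Enum):
    # Stand-in for HFGenerationAPIType; values mirror the strings listed above.
    TEXT_GENERATION_INFERENCE = "text_generation_inference"
    INFERENCE_ENDPOINTS = "inference_endpoints"
    SERVERLESS_INFERENCE_API = "serverless_inference_api"


def check_api_params(api_type: Union[ApiType, str],
                     api_params: Dict[str, str]) -> ApiType:
    """Hypothetical helper: require 'model' or 'url' depending on the API type."""
    api_type = ApiType(api_type)  # accepts an enum member or its string value
    if api_type is ApiType.SERVERLESS_INFERENCE_API:
        required = "model"
    else:
        required = "url"
    if not api_params.get(required):
        raise ValueError(
            f"api_params must contain '{required}' when api_type is '{api_type.value}'"
        )
    return api_type
```

An unknown api_type string raises ValueError from the enum lookup, so both kinds of misconfiguration surface before any request is made.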
HuggingFaceAPIChatGenerator.to_dict
def to_dict() -> Dict[str, Any]
Serialize this component to a dictionary.
Returns:
A dictionary containing the serialized component.
HuggingFaceAPIChatGenerator.from_dict
@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "HuggingFaceAPIChatGenerator"
Deserialize this component from a dictionary.
HuggingFaceAPIChatGenerator.run
@component.output_types(replies=List[ChatMessage])
def run(messages: List[ChatMessage],
generation_kwargs: Optional[Dict[str, Any]] = None)
Invoke the text generation inference based on the provided messages and generation parameters.
Arguments:
messages
: A list of ChatMessage objects representing the input messages.
generation_kwargs
: Additional keyword arguments for text generation.
Returns:
A dictionary with the following keys:
replies
: A list containing the generated responses as ChatMessage objects.
Module chat/openai
OpenAIChatGenerator
Completes chats using OpenAI's large language models (LLMs).
It works with the gpt-4 and gpt-3.5-turbo models and supports streaming responses from the OpenAI API. It uses the ChatMessage format for input and output.
You can customize how the text is generated by passing parameters to the OpenAI API. Use the generation_kwargs argument when you initialize the component or when you run it. Any parameter that works with openai.ChatCompletion.create will work here too.
For details on OpenAI API parameters, see OpenAI documentation.
Usage example
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.dataclasses import ChatMessage
messages = [ChatMessage.from_user("What's Natural Language Processing?")]
client = OpenAIChatGenerator()
response = client.run(messages)
print(response)
Output:
{'replies':
[ChatMessage(content='Natural Language Processing (NLP) is a branch of artificial intelligence
that focuses on enabling computers to understand, interpret, and generate human language in
a way that is meaningful and useful.',
role=<ChatRole.ASSISTANT: 'assistant'>, name=None,
meta={'model': 'gpt-4o-mini', 'index': 0, 'finish_reason': 'stop',
'usage': {'prompt_tokens': 15, 'completion_tokens': 36, 'total_tokens': 51}})
]
}
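A response shaped like the output above can be inspected with ordinary dictionary and attribute access. The sketch below is a rough illustration that uses a stand-in Reply dataclass instead of ChatMessage, and the total_tokens helper is hypothetical. It assumes each reply carries a meta dictionary with a usage entry, as in the printed output.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Reply:
    # Stand-in for ChatMessage, keeping only two fields visible in the output above.
    content: str
    meta: Dict[str, Any] = field(default_factory=dict)


def total_tokens(replies: List[Reply]) -> int:
    """Sum the 'total_tokens' counters, treating missing usage data as zero."""
    return sum(reply.meta.get("usage", {}).get("total_tokens", 0) for reply in replies)


response = {
    "replies": [
        Reply(
            content="NLP is a branch of artificial intelligence.",
            meta={
                "model": "gpt-4o-mini",
                "finish_reason": "stop",
                "usage": {"prompt_tokens": 15, "completion_tokens": 36, "total_tokens": 51},
            },
        )
    ]
}

first = response["replies"][0]
print(first.content)
print(first.meta["finish_reason"], total_tokens(response["replies"]))
```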
OpenAIChatGenerator.__init__
def __init__(api_key: Secret = Secret.from_env_var("OPENAI_API_KEY"),
model: str = "gpt-4o-mini",
streaming_callback: Optional[Callable[[StreamingChunk],
None]] = None,
api_base_url: Optional[str] = None,
organization: Optional[str] = None,
generation_kwargs: Optional[Dict[str, Any]] = None,
timeout: Optional[float] = None,
max_retries: Optional[int] = None)
Creates an instance of OpenAIChatGenerator. Unless specified otherwise in model, uses OpenAI's gpt-4o-mini.
Before initializing the component, you can set the OPENAI_TIMEOUT and OPENAI_MAX_RETRIES environment variables to provide the default timeout and max_retries values, respectively, for the OpenAI client.
Arguments:
api_key
: The OpenAI API key. You can set it with the OPENAI_API_KEY environment variable or pass it with this parameter during initialization.
model
: The name of the model to use.
streaming_callback
: A callback function that is called when a new token is received from the stream. The callback function accepts StreamingChunk as an argument.
api_base_url
: An optional base URL.
organization
: Your organization ID, defaults to None. See production best practices.
generation_kwargs
: Other parameters to use for the model. These parameters are sent directly to the OpenAI endpoint. See OpenAI documentation for more details. Some of the supported parameters:
  max_tokens
  : The maximum number of tokens the output text can have.
  temperature
  : What sampling temperature to use. Higher values mean the model will take more risks. Try 0.9 for more creative applications and 0 (argmax sampling) for ones with a well-defined answer.
  top_p
  : An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. For example, 0.1 means only the tokens comprising the top 10% probability mass are considered.
  n
  : How many completions to generate for each prompt. For example, if the LLM gets 3 prompts and n is 2, it will generate two completions for each of the three prompts, ending up with 6 completions in total.
  stop
  : One or more sequences after which the LLM should stop generating tokens.
  presence_penalty
  : What penalty to apply if a token is already present at all. Bigger values mean the model will be less likely to repeat the same token in the text.
  frequency_penalty
  : What penalty to apply if a token has already been generated in the text. Bigger values mean the model will be less likely to repeat the same token in the text.
  logit_bias
  : Add a logit bias to specific tokens. The keys of the dictionary are tokens, and the values are the bias to add to that token.
timeout
: Timeout for OpenAI client calls. If not set, it defaults to either the OPENAI_TIMEOUT environment variable, or 30 seconds.
max_retries
: Maximum number of retries to contact OpenAI after an internal error. If not set, it defaults to either the OPENAI_MAX_RETRIES environment variable, or 5.
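The fallback order described for timeout and max_retries (explicit argument first, then environment variable, then built-in default) can be sketched as follows. The helper name resolve_client_settings is hypothetical and the code only illustrates the documented order. It is not taken from the component's source.

```python
import os
from typing import Optional, Tuple


def resolve_client_settings(timeout: Optional[float] = None,
                            max_retries: Optional[int] = None) -> Tuple[float, int]:
    """Hypothetical helper mirroring the documented fallback order."""
    if timeout is None:
        # Fall back to OPENAI_TIMEOUT, then to 30 seconds.
        timeout = float(os.environ.get("OPENAI_TIMEOUT", 30.0))
    if max_retries is None:
        # Fall back to OPENAI_MAX_RETRIES, then to 5 retries.
        max_retries = int(os.environ.get("OPENAI_MAX_RETRIES", 5))
    return timeout, max_retries
```

Under this order, an explicitly passed value always wins, so the environment variables act as deployment-wide defaults rather than hard overrides.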
OpenAIChatGenerator.to_dict
def to_dict() -> Dict[str, Any]
Serialize this component to a dictionary.
Returns:
The serialized component as a dictionary.
OpenAIChatGenerator.from_dict
@classmethod
def from_dict(cls, data: Dict[str, Any]) -> "OpenAIChatGenerator"
Deserialize this component from a dictionary.
Arguments:
data
: The dictionary representation of this component.
Returns:
The deserialized component instance.
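The to_dict and from_dict pair is meant to round-trip a component through a plain dictionary. The sketch below shows the idea with a hypothetical stand-in class, SketchChatGenerator. The dictionary layout (a type tag plus init_parameters) is an assumption made for illustration and may differ from what the real component emits.

```python
from typing import Any, Dict, Optional


class SketchChatGenerator:
    """Hypothetical stand-in component; not the Haystack class."""

    def __init__(self, model: str = "gpt-4o-mini",
                 generation_kwargs: Optional[Dict[str, Any]] = None):
        self.model = model
        self.generation_kwargs = generation_kwargs or {}

    def to_dict(self) -> Dict[str, Any]:
        # Assumed layout: a type tag plus the arguments needed to rebuild the object.
        return {
            "type": f"{type(self).__module__}.{type(self).__name__}",
            "init_parameters": {
                "model": self.model,
                "generation_kwargs": dict(self.generation_kwargs),
            },
        }

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "SketchChatGenerator":
        # Rebuild the instance from the stored constructor arguments.
        return cls(**data["init_parameters"])


original = SketchChatGenerator(generation_kwargs={"temperature": 0.2})
restored = SketchChatGenerator.from_dict(original.to_dict())
```

Storing only constructor arguments keeps the dictionary free of live objects such as HTTP clients, which are recreated when the component is rebuilt.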
OpenAIChatGenerator.run
@component.output_types(replies=List[ChatMessage])
def run(messages: List[ChatMessage],
streaming_callback: Optional[Callable[[StreamingChunk], None]] = None,
generation_kwargs: Optional[Dict[str, Any]] = None)
Invokes chat completion based on the provided messages and generation parameters.
Arguments:
messages
: A list of ChatMessage instances representing the input messages.
streaming_callback
: A callback function that is called when a new token is received from the stream.
generation_kwargs
: Additional keyword arguments for text generation. These parameters will override the parameters passed during component initialization. For details on OpenAI API parameters, see OpenAI documentation.
Returns:
A dictionary with the following keys:
replies
: A list containing the generated responses as ChatMessage instances.
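Two behaviors of run can be sketched without calling any API: run-time generation_kwargs taking precedence over those given at initialization, and a streaming callback being invoked once per chunk. Chunk is a stand-in for StreamingChunk, and merge_generation_kwargs and stream are hypothetical helpers. The sketch illustrates the documented behavior under those assumptions, not the component's internals.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional


@dataclass
class Chunk:
    # Stand-in for StreamingChunk; only a text payload is modeled here.
    content: str


def merge_generation_kwargs(init_kwargs: Optional[Dict[str, Any]],
                            run_kwargs: Optional[Dict[str, Any]]) -> Dict[str, Any]:
    # Later keys win, so values passed at run time replace those given at init time.
    return {**(init_kwargs or {}), **(run_kwargs or {})}


def stream(tokens: List[str], callback: Optional[Callable[[Chunk], None]]) -> str:
    """Hypothetical driver: call the callback once per chunk, return the full text."""
    for token in tokens:
        if callback is not None:
            callback(Chunk(content=token))
    return "".join(tokens)


merged = merge_generation_kwargs({"temperature": 0.9, "max_tokens": 100},
                                 {"temperature": 0.0})
received: List[str] = []
text = stream(["Natural ", "Language ", "Processing"],
              lambda chunk: received.append(chunk.content))
```

Here merged keeps max_tokens from initialization while temperature comes from the run-time call, and received collects the chunks in arrival order.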