HuggingFaceAPIChatGenerator
This generator enables chat completion using various Hugging Face APIs.
Most common position in a pipeline | After a ChatPromptBuilder |
Mandatory init variables | "api_type": The type of Hugging Face API to use. "api_params": A dictionary with one of the following keys: "model" (Hugging Face model ID, required when api_type is SERVERLESS_INFERENCE_API) or "url" (URL of the inference endpoint, required when api_type is INFERENCE_ENDPOINTS or TEXT_GENERATION_INFERENCE). "token": The Hugging Face API token. Can be set with the HF_API_TOKEN or HF_TOKEN env var. |
Mandatory run variables | "messages": A list of ChatMessage objects representing the chat |
Output variables | "replies": A list of replies of the LLM to the input chat |
API reference | Generators |
GitHub link | https://github.com/deepset-ai/haystack/blob/main/haystack/components/generators/chat/hugging_face_api.py |
Overview
HuggingFaceAPIChatGenerator can be used to generate chat completions using different Hugging Face APIs:
- Free Serverless Inference API
- Paid Inference Endpoints
- Self-hosted Text Generation Inference (TGI)
This component's main input is a list of ChatMessage objects. ChatMessage is a data class that contains a message, a role (who generated the message, such as user, assistant, system, or function), and optional metadata. For more information, check out our ChatMessage docs.
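As a small illustration (the prompt text is a placeholder), a ChatMessage can be built with its class-method constructors and then inspected for its role and metadata:
from haystack.dataclasses import ChatMessage

# each ChatMessage carries the text, the role of its author, and an optional metadata dict
msg = ChatMessage.from_user("What's Natural Language Processing?")
print(msg.role)   # the role this message was created with (user)
print(msg.meta)   # optional metadata, empty by default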
This component is designed for chat completion, so it expects a list of messages, not a single string. If you want to use Hugging Face APIs for simple text generation (such as translation or summarization tasks) or don't want to use the ChatMessage object, use HuggingFaceAPIGenerator instead.
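For comparison, here is a minimal sketch of the non-chat HuggingFaceAPIGenerator, which takes a plain string prompt instead of ChatMessage objects (the model name is only an example):
from haystack.components.generators import HuggingFaceAPIGenerator
from haystack.utils import Secret

generator = HuggingFaceAPIGenerator(api_type="serverless_inference_api",
                                    api_params={"model": "HuggingFaceH4/zephyr-7b-beta"},
                                    token=Secret.from_env_var("HF_API_TOKEN"))

# the non-chat generator works on a single string prompt and returns plain string replies
result = generator.run(prompt="Summarize: Natural Language Processing is a field of AI ...")
print(result["replies"])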
The component uses the HF_API_TOKEN environment variable by default. Otherwise, you can pass a Hugging Face API token at initialization with token – see the code examples below.
The token is needed:
- If you use the Serverless Inference API, or
- If you use the Inference Endpoints.
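As an illustrative sketch, these are the two ways to supply the token at initialization (the model name and the MY_HF_TOKEN variable are placeholders):
import os

from haystack.components.generators.chat import HuggingFaceAPIChatGenerator
from haystack.utils import Secret

# option 1: resolve the token from an environment variable (recommended)
generator = HuggingFaceAPIChatGenerator(
    api_type="serverless_inference_api",
    api_params={"model": "HuggingFaceH4/zephyr-7b-beta"},
    token=Secret.from_env_var("HF_API_TOKEN"),
)

# option 2: wrap a raw token string; such secrets are not meant to be serialized with the pipeline
generator = HuggingFaceAPIChatGenerator(
    api_type="serverless_inference_api",
    api_params={"model": "HuggingFaceH4/zephyr-7b-beta"},
    token=Secret.from_token(os.environ["MY_HF_TOKEN"]),
)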
Streaming
This Generator supports streaming the tokens from the LLM directly in output. To do so, pass a function to the streaming_callback init parameter.
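For example, here is a minimal sketch that prints tokens to stdout as they arrive, using the print_streaming_chunk utility shipped with Haystack (the model name is only an example):
from haystack.components.generators.chat import HuggingFaceAPIChatGenerator
from haystack.components.generators.utils import print_streaming_chunk
from haystack.dataclasses import ChatMessage

# the callback is invoked with each StreamingChunk as the model produces it
generator = HuggingFaceAPIChatGenerator(
    api_type="serverless_inference_api",
    api_params={"model": "HuggingFaceH4/zephyr-7b-beta"},
    streaming_callback=print_streaming_chunk,
)

result = generator.run([ChatMessage.from_user("What's Natural Language Processing?")])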
Usage
On its own
Using Free Serverless Inference API
Formerly known as the (free) Hugging Face Inference API, this API allows you to quickly experiment with many models hosted on the Hugging Face Hub, offloading the inference to Hugging Face servers. It's rate-limited and not meant for production.
To use this API, you need a free Hugging Face token.
The Generator expects the model in api_params.
from haystack.components.generators.chat import HuggingFaceAPIChatGenerator
from haystack.dataclasses import ChatMessage
from haystack.utils import Secret
from haystack.utils.hf import HFGenerationAPIType
messages = [ChatMessage.from_system("\nYou are a helpful, respectful and honest assistant"),
            ChatMessage.from_user("What's Natural Language Processing?")]

# the api_type can be expressed using the HFGenerationAPIType enum or as a string
api_type = HFGenerationAPIType.SERVERLESS_INFERENCE_API
api_type = "serverless_inference_api" # this is equivalent to the above

generator = HuggingFaceAPIChatGenerator(api_type=api_type,
                                        api_params={"model": "HuggingFaceH4/zephyr-7b-beta"},
                                        token=Secret.from_env_var("HF_API_TOKEN"))
result = generator.run(messages)
print(result)
Using Paid Inference Endpoints
In this case, a private instance of the model is deployed by Hugging Face, and you typically pay per hour.
To understand how to spin up an Inference Endpoint, visit the Hugging Face documentation.
In this case, you also need to provide your Hugging Face token.
The Generator expects the url of your endpoint in api_params.
from haystack.components.generators.chat import HuggingFaceAPIChatGenerator
from haystack.dataclasses import ChatMessage
from haystack.utils import Secret
messages = [ChatMessage.from_system("\nYou are a helpful, respectful and honest assistant"),
            ChatMessage.from_user("What's Natural Language Processing?")]

generator = HuggingFaceAPIChatGenerator(api_type="inference_endpoints",
                                        api_params={"url": "<your-inference-endpoint-url>"},
                                        token=Secret.from_env_var("HF_API_TOKEN"))
result = generator.run(messages)
print(result)
Using Self-Hosted Text Generation Inference (TGI)
Hugging Face Text Generation Inference is a toolkit for efficiently deploying and serving LLMs.
While it powers the most recent versions of the Serverless Inference API and Inference Endpoints, it can also easily be run on-premise through Docker.
For example, you can run a TGI container as follows:
model=HuggingFaceH4/zephyr-7b-beta
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:1.4 --model-id $model
For more information, refer to the official TGI repository.
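Optionally, before wiring the TGI instance into Haystack, you can check that it is reachable. This is a minimal sketch using only the Python standard library against TGI's /generate endpoint; adjust the URL and payload if your TGI version differs:
import json
import urllib.request

# quick reachability check against the TGI container started above (prompt text is a placeholder)
payload = {"inputs": "What is Deep Learning?", "parameters": {"max_new_tokens": 20}}
request = urllib.request.Request(
    "http://localhost:8080/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(request) as response:
    print(json.loads(response.read()))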
The Generator expects the url of your TGI instance in api_params.
from haystack.components.generators.chat import HuggingFaceAPIChatGenerator
from haystack.dataclasses import ChatMessage
messages = [ChatMessage.from_system("\nYou are a helpful, respectful and honest assistant"),
            ChatMessage.from_user("What's Natural Language Processing?")]

generator = HuggingFaceAPIChatGenerator(api_type="text_generation_inference",
                                        api_params={"url": "http://localhost:8080"})
result = generator.run(messages)
print(result)
In a pipeline
from haystack.components.builders import ChatPromptBuilder
from haystack.components.generators.chat import HuggingFaceAPIChatGenerator
from haystack.dataclasses import ChatMessage
from haystack import Pipeline
from haystack.utils import Secret
from haystack.utils.hf import HFGenerationAPIType
# no parameter init, we don't use any runtime template variables
prompt_builder = ChatPromptBuilder()
llm = HuggingFaceAPIChatGenerator(api_type=HFGenerationAPIType.SERVERLESS_INFERENCE_API,
                                  api_params={"model": "HuggingFaceH4/zephyr-7b-beta"},
                                  token=Secret.from_env_var("HF_API_TOKEN"))
pipe = Pipeline()
pipe.add_component("prompt_builder", prompt_builder)
pipe.add_component("llm", llm)
pipe.connect("prompt_builder.prompt", "llm.messages")
location = "Berlin"
messages = [ChatMessage.from_system("Always respond in German even if some input data is in other languages."),
            ChatMessage.from_user("Tell me about {{location}}")]
result = pipe.run(data={"prompt_builder": {"template_variables":{"location": location}, "template": messages}})
print(result)
>> {'llm': {'replies': [ChatMessage(content='Berlin ist die Hauptstadt Deutschlands und die größte Stadt des Landes.
>> Es ist eine lebhafte Metropole, die für ihre Geschichte, Kultur und einzigartigen Sehenswürdigkeiten bekannt ist.
>> Berlin bietet eine vielfältige Kulturszene, beeindruckende architektonische Meisterwerke wie den Berliner Dom
>> und das Brandenburger Tor, sowie weltberühmte Museen wie das Pergamonmuseum. Die Stadt hat auch eine pulsierende
>> Clubszene und ist für ihr aufregendes Nachtleben berühmt. Berlin ist ein Schmelztiegel verschiedener Kulturen und
>> zieht jedes Jahr Millionen von Touristen an.', role=<ChatRole.ASSISTANT: 'assistant'>, name=None)]}}
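To work with the answer programmatically rather than printing the whole result dictionary, pull the first ChatMessage out of the replies list. Note that the attribute holding the text differs between Haystack versions (content in older 2.x releases, text in newer ones), so treat this sketch as something to adapt to your version:
# the generator's output is nested under the component name used in the pipeline ("llm" here)
reply = result["llm"]["replies"][0]
print(reply.role)
print(reply.content)  # on newer Haystack 2.x versions, use reply.text instead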
Additional References
🧑🍳 Cookbook: Build with Google Gemma: chat and RAG