Most common position in a pipeline	After a `ChatPromptBuilder`
Mandatory init variables	"api_type": The type of Hugging Face API to use "api_params": A dictionary with one of the following keys: - `model`: Hugging Face model ID. Required when `api_type` is `SERVERLESS_INFERENCE_API`.OR - `url`: URL of the inference endpoint. Required when `api_type` is `INFERENCE_ENDPOINTS` or `TEXT_EMBEDDINGS_INFERENCE`."token": The Hugging Face API token. Can be set with `HF_API_TOKEN` or `HF_TOKEN` env var.
Mandatory run variables	“messages”: A list of `ChatMessage` objects representing the chat
Output variables	“replies”: A list of replies of the LLM to the input chat
API reference	Generators
GitHub link	https://github.com/deepset-ai/haystack/blob/main/haystack/components/generators/chat/hugging_face_api.py

Overview

HuggingFaceAPIChatGenerator can be used to generate chat completions using different Hugging Face APIs:

This component’s main input is a list of ChatMessage objects. ChatMessage is a data class that contains a message, a role (who generated the message, such as user, assistant, system, function), and optional metadata. For more information, check out our ChatMessage docs.

📘
This component is designed for chat completion, so it expects a list of messages, not a single string. If you want to use Hugging Face APIs for simple text generation (such as translation or summarization tasks) or don’t want to use the ChatMessage object, use HuggingFaceAPIGenerator instead.

The component uses a HF_API_TOKEN environment variable by default. Otherwise, you can pass a Hugging Face API token at initialization with token – see code examples below.
The token is needed:

If you use the Serverless Inference API, or
If you use the Inference Endpoints.

Streaming

This Generator supports streaming the tokens from the LLM directly in output. To do so, pass a function to the streaming_callback init parameter.

Usage

On its own

Using Free Serverless Inference API

Formerly known as (free) Hugging Face Inference API, this API allows you to quickly experiment with many models hosted on the Hugging Face Hub, offloading the inference to Hugging Face servers. It’s rate-limited and not meant for production.

To use this API, you need a free Hugging Face token.
The Generator expects the model in api_params.

from haystack.components.generators.chat import HuggingFaceAPIChatGenerator
from haystack.dataclasses import ChatMessage
from haystack.utils import Secret
from haystack.utils.hf import HFGenerationAPIType

messages = [ChatMessage.from_system("\\nYou are a helpful, respectful and honest assistant"),
            ChatMessage.from_user("What's Natural Language Processing?")]

# the api_type can be expressed using the HFGenerationAPIType enum or as a string
api_type = HFGenerationAPIType.SERVERLESS_INFERENCE_API
api_type = "serverless_inference_api" # this is equivalent to the above

generator = HuggingFaceAPIChatGenerator(api_type=api_type,
                                        api_params={"model": "HuggingFaceH4/zephyr-7b-beta"},
                                        token=Secret.from_env_var("HF_API_TOKEN"))

result = generator.run(messages)
print(result)

Using Paid Inference Endpoints

In this case, a private instance of the model is deployed by Hugging Face, and you typically pay per hour.

To understand how to spin up an Inference Endpoint, visit Hugging Face documentation.

Additionally, in this case, you need to provide your Hugging Face token.
The Generator expects the url of your endpoint in api_params.

from haystack.components.generators.chat import HuggingFaceAPIChatGenerator
from haystack.dataclasses import ChatMessage
from haystack.utils import Secret

messages = [ChatMessage.from_system("\\nYou are a helpful, respectful and honest assistant"),
            ChatMessage.from_user("What's Natural Language Processing?")]

generator = HuggingFaceAPIChatGenerator(api_type="inference_endpoints",
                                        api_params={"url": "<your-inference-endpoint-url>"},
                                        token=Secret.from_env_var("HF_API_TOKEN"))

result = generator.run(messages)
print(result)

Using Self-Hosted Text Generation Inference (TGI)

Hugging Face Text Generation Inference is a toolkit for efficiently deploying and serving LLMs.

While it powers the most recent versions of Serverless Inference API and Inference Endpoints, it can be used easily on-premise through Docker.

For example, you can run a TGI container as follows:

model=HuggingFaceH4/zephyr-7b-beta
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:1.4 --model-id $model

For more information, refer to the official TGI repository.

The Generator expects the url of your TGI instance in api_params.

from haystack.components.generators.chat import HuggingFaceAPIChatGenerator
from haystack.dataclasses import ChatMessage

messages = [ChatMessage.from_system("\\nYou are a helpful, respectful and honest assistant"),
            ChatMessage.from_user("What's Natural Language Processing?")]

generator = HuggingFaceAPIChatGenerator(api_type="text_generation_inference",
                                        api_params={"url": "http://localhost:8080"})

result = generator.run(messages)
print(result)

In a pipeline

from haystack.components.builders import ChatPromptBuilder
from haystack.components.generators.chat import HuggingFaceAPIChatGenerator
from haystack.dataclasses import ChatMessage
from haystack import Pipeline
from haystack.utils import Secret
from haystack.utils.hf import HFGenerationAPIType

# no parameter init, we don't use any runtime template variables
prompt_builder = ChatPromptBuilder()
llm = HuggingFaceAPIChatGenerator(api_type=HFGenerationAPIType.SERVERLESS_INFERENCE_API,
                                  api_params={"model": "HuggingFaceH4/zephyr-7b-beta"},
                                  token=Secret.from_env_var("HF_API_TOKEN"))
                                        
pipe = Pipeline()
pipe.add_component("prompt_builder", prompt_builder)
pipe.add_component("llm", llm)
pipe.connect("prompt_builder.prompt", "llm.messages")
location = "Berlin"
messages = [ChatMessage.from_system("Always respond in German even if some input data is in other languages."),
ChatMessage.from_user("Tell me about {{location}}")]
result = pipe.run(data={"prompt_builder": {"template_variables":{"location": location}, "template": messages}})

print(result)
>> {'llm': {'replies': [ChatMessage(content='Berlin ist die Hauptstadt Deutschlands und die größte Stadt des Landes.
>> Es ist eine lebhafte Metropole, die für ihre Geschichte, Kultur und einzigartigen Sehenswürdigkeiten bekannt ist.
>> Berlin bietet eine vielfältige Kulturszene, beeindruckende architektonische Meisterwerke wie den Berliner Dom
>> und das Brandenburger Tor, sowie weltberühmte Museen wie das Pergamonmuseum. Die Stadt hat auch eine pulsierende
>> Clubszene und ist für ihr aufregendes Nachtleben berühmt. Berlin ist ein Schmelztiegel verschiedener Kulturen und
>> zieht jedes Jahr Millionen von Touristen an.', role=<ChatRole.ASSISTANT: 'assistant'>, name=None}}

Additional References

🧑‍🍳 Cookbook: Build with Google Gemma: chat and RAG