Hugging Face API
haystack_integrations.components.embedders.huggingface_api.document_embedder
HuggingFaceAPIDocumentEmbedder
Embeds documents using Hugging Face APIs.
Use it with the following Hugging Face APIs:
- Free Serverless Inference API
- Paid Inference Endpoints
- Self-hosted Text Embeddings Inference
Usage examples
With free serverless inference API
from haystack_integrations.components.embedders.huggingface_api import HuggingFaceAPIDocumentEmbedder
from haystack.utils import Secret
from haystack.dataclasses import Document
doc = Document(content="I love pizza!")
doc_embedder = HuggingFaceAPIDocumentEmbedder(
    api_type="serverless_inference_api",
    api_params={"model": "BAAI/bge-small-en-v1.5"},
    token=Secret.from_token("<your-api-key>"),
)
result = doc_embedder.run([doc])
print(result["documents"][0].embedding)
# [0.017020374536514282, -0.023255806416273117, ...]
With paid inference endpoints
from haystack_integrations.components.embedders.huggingface_api import HuggingFaceAPIDocumentEmbedder
from haystack.utils import Secret
from haystack.dataclasses import Document
doc = Document(content="I love pizza!")
doc_embedder = HuggingFaceAPIDocumentEmbedder(
    api_type="inference_endpoints",
    api_params={"url": "<your-inference-endpoint-url>"},
    token=Secret.from_token("<your-api-key>"),
)
result = doc_embedder.run([doc])
print(result["documents"][0].embedding)
# [0.017020374536514282, -0.023255806416273117, ...]
With self-hosted text embeddings inference
from haystack_integrations.components.embedders.huggingface_api import HuggingFaceAPIDocumentEmbedder
from haystack.dataclasses import Document
doc = Document(content="I love pizza!")
doc_embedder = HuggingFaceAPIDocumentEmbedder(
    api_type="text_embeddings_inference",
    api_params={"url": "http://localhost:8080"},
)
result = doc_embedder.run([doc])
print(result["documents"][0].embedding)
# [0.017020374536514282, -0.023255806416273117, ...]
init
__init__(
api_type: HFEmbeddingAPIType | str,
api_params: dict[str, str],
token: Secret | None = Secret.from_env_var(
["HF_API_TOKEN", "HF_TOKEN"], strict=False
),
prefix: str = "",
suffix: str = "",
truncate: bool | None = True,
normalize: bool | None = False,
batch_size: int = 32,
progress_bar: bool = True,
meta_fields_to_embed: list[str] | None = None,
embedding_separator: str = "\n",
concurrency_limit: int = 4,
) -> None
Creates a HuggingFaceAPIDocumentEmbedder component.
Parameters:
- api_type (HFEmbeddingAPIType | str) – The type of Hugging Face API to use.
- api_params (dict[str, str]) – A dictionary with the following keys:
  - model: Hugging Face model ID. Required when api_type is SERVERLESS_INFERENCE_API.
  - url: URL of the inference endpoint. Required when api_type is INFERENCE_ENDPOINTS or TEXT_EMBEDDINGS_INFERENCE.
- token (Secret | None) – The Hugging Face token to use as HTTP bearer authorization. Check your HF token in your account settings.
- prefix (str) – A string to add at the beginning of each text.
- suffix (str) – A string to add at the end of each text.
- truncate (bool | None) – Truncates the input text to the maximum length supported by the model. Applicable when api_type is TEXT_EMBEDDINGS_INFERENCE, or INFERENCE_ENDPOINTS if the backend uses Text Embeddings Inference. If api_type is SERVERLESS_INFERENCE_API, this parameter is ignored.
- normalize (bool | None) – Normalizes the embeddings to unit length. Applicable when api_type is TEXT_EMBEDDINGS_INFERENCE, or INFERENCE_ENDPOINTS if the backend uses Text Embeddings Inference. If api_type is SERVERLESS_INFERENCE_API, this parameter is ignored.
- batch_size (int) – Number of documents to process at once.
- progress_bar (bool) – If True, shows a progress bar when running.
- meta_fields_to_embed (list[str] | None) – List of metadata fields to embed along with the document text.
- embedding_separator (str) – Separator used to concatenate the metadata fields to the document text.
- concurrency_limit (int) – The maximum number of requests that should be allowed to run concurrently. This parameter is only used in the run_async method.
Raises:
ValueError – If the required model or url is missing from api_params, the url is invalid, or the api_type is unknown.
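The interplay of prefix, suffix, meta_fields_to_embed, embedding_separator, and batch_size can be sketched in plain Python. This is a minimal illustration, not the component's code: the Doc class, prepare_text, and batched are hypothetical stand-ins, and the exact order in which the real component joins the pieces is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a Haystack Document (not the real class)."""
    content: str
    meta: dict = field(default_factory=dict)


def prepare_text(doc, prefix="", suffix="", meta_fields_to_embed=None, embedding_separator="\n"):
    """Join selected metadata values and the content, then wrap with prefix and suffix.

    Assumption: metadata values come first, followed by the document content.
    """
    fields = meta_fields_to_embed or []
    # Skip metadata fields that are missing or None on this document.
    meta_values = [str(doc.meta[key]) for key in fields if doc.meta.get(key) is not None]
    return prefix + embedding_separator.join(meta_values + [doc.content]) + suffix


def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


docs = [Doc("I love pizza!", {"title": "Food"}), Doc("Pasta is great."), Doc("Sushi too.")]
texts = [prepare_text(d, prefix="passage: ", meta_fields_to_embed=["title"]) for d in docs]
batches = list(batched(texts, batch_size=2))
print(repr(texts[0]))
print([len(batch) for batch in batches])
```

With batch_size=2, the three prepared texts are split into one batch of two and one batch of one, so each request carries at most batch_size texts.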
to_dict
Serializes the component to a dictionary.
Returns:
dict[str, Any] – Dictionary with serialized data.
from_dict
Deserializes the component from a dictionary.
Parameters:
- data (dict[str, Any]) – Dictionary to deserialize from.
Returns:
HuggingFaceAPIDocumentEmbedder – Deserialized component.
run
Embeds a list of documents.
Parameters:
- documents (list[Document]) – Documents to embed.
Returns:
dict[str, list[Document]] – A dictionary with the following keys:
- documents: A list of documents with embeddings.
Raises:
TypeError – If documents is not a list of Documents.
ValueError – If the embeddings returned by the API have an unexpected shape.
run_async
Embeds a list of documents asynchronously.
Parameters:
- documents (list[Document]) – Documents to embed.
Returns:
dict[str, list[Document]] – A dictionary with the following keys:
- documents: A list of documents with embeddings.
Raises:
TypeError – If documents is not a list of Documents.
ValueError – If the embeddings returned by the API have an unexpected shape.
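The concurrency_limit parameter caps how many requests run_async keeps in flight at once. One common way to get that behavior is an asyncio.Semaphore, sketched below. The fake_embed coroutine and the embed_all helper are hypothetical; whether the component uses a semaphore internally is an assumption.

```python
import asyncio


async def fake_embed(batch):
    """Hypothetical stand-in for one embedding request to the API."""
    await asyncio.sleep(0.01)
    return [[float(len(text))] for text in batch]


async def embed_all(batches, concurrency_limit=4):
    """Embed all batches while keeping at most concurrency_limit requests in flight."""
    semaphore = asyncio.Semaphore(concurrency_limit)
    in_flight = 0
    peak = 0

    async def worker(batch):
        nonlocal in_flight, peak
        async with semaphore:
            # Track how many requests are running to observe the limit.
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await fake_embed(batch)
            finally:
                in_flight -= 1

    # gather returns results in the same order as the input batches.
    results = await asyncio.gather(*(worker(batch) for batch in batches))
    return results, peak


batches = [[f"text {i}"] for i in range(10)]
results, peak = asyncio.run(embed_all(batches, concurrency_limit=3))
print(len(results), peak)
```

The peak counter never exceeds the limit, while the results still come back in input order.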
haystack_integrations.components.embedders.huggingface_api.text_embedder
HuggingFaceAPITextEmbedder
Embeds strings using Hugging Face APIs.
Use it with the following Hugging Face APIs:
- Free Serverless Inference API
- Paid Inference Endpoints
- Self-hosted Text Embeddings Inference
Usage examples
With free serverless inference API
from haystack_integrations.components.embedders.huggingface_api import HuggingFaceAPITextEmbedder
from haystack.utils import Secret
text_embedder = HuggingFaceAPITextEmbedder(
    api_type="serverless_inference_api",
    api_params={"model": "BAAI/bge-small-en-v1.5"},
    token=Secret.from_token("<your-api-key>"),
)
print(text_embedder.run("I love pizza!"))
# {'embedding': [0.017020374536514282, -0.023255806416273117, ...]}
With paid inference endpoints
from haystack_integrations.components.embedders.huggingface_api import HuggingFaceAPITextEmbedder
from haystack.utils import Secret
text_embedder = HuggingFaceAPITextEmbedder(
    api_type="inference_endpoints",
    # inference_endpoints requires "url" rather than "model"
    api_params={"url": "<your-inference-endpoint-url>"},
    token=Secret.from_token("<your-api-key>"),
)
print(text_embedder.run("I love pizza!"))
# {'embedding': [0.017020374536514282, -0.023255806416273117, ...]}
With self-hosted text embeddings inference
from haystack_integrations.components.embedders.huggingface_api import HuggingFaceAPITextEmbedder
text_embedder = HuggingFaceAPITextEmbedder(
    api_type="text_embeddings_inference",
    api_params={"url": "http://localhost:8080"},
)
print(text_embedder.run("I love pizza!"))
# {'embedding': [0.017020374536514282, -0.023255806416273117, ...]}
init
__init__(
api_type: HFEmbeddingAPIType | str,
api_params: dict[str, str],
token: Secret | None = Secret.from_env_var(
["HF_API_TOKEN", "HF_TOKEN"], strict=False
),
prefix: str = "",
suffix: str = "",
truncate: bool | None = True,
normalize: bool | None = False,
) -> None
Creates a HuggingFaceAPITextEmbedder component.
Parameters:
- api_type (HFEmbeddingAPIType | str) – The type of Hugging Face API to use.
- api_params (dict[str, str]) – A dictionary with the following keys:
  - model: Hugging Face model ID. Required when api_type is SERVERLESS_INFERENCE_API.
  - url: URL of the inference endpoint. Required when api_type is INFERENCE_ENDPOINTS or TEXT_EMBEDDINGS_INFERENCE.
- token (Secret | None) – The Hugging Face token to use as HTTP bearer authorization. Check your HF token in your account settings.
- prefix (str) – A string to add at the beginning of each text.
- suffix (str) – A string to add at the end of each text.
- truncate (bool | None) – Truncates the input text to the maximum length supported by the model. Applicable when api_type is TEXT_EMBEDDINGS_INFERENCE, or INFERENCE_ENDPOINTS if the backend uses Text Embeddings Inference. If api_type is SERVERLESS_INFERENCE_API, this parameter is ignored.
- normalize (bool | None) – Normalizes the embeddings to unit length. Applicable when api_type is TEXT_EMBEDDINGS_INFERENCE, or INFERENCE_ENDPOINTS if the backend uses Text Embeddings Inference. If api_type is SERVERLESS_INFERENCE_API, this parameter is ignored.
Raises:
ValueError – If the required model or url is missing from api_params, the url is invalid, or the api_type is unknown.
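For readers unsure what "normalizes the embeddings to unit length" implies, the following sketch divides a vector by its L2 norm. It assumes that "unit length" refers to the L2 norm; the normalize_embedding helper is hypothetical, and in practice the backend performs this step.

```python
import math


def normalize_embedding(embedding):
    """Scale a vector so that its L2 norm is 1 (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in embedding))
    if norm == 0.0:
        return list(embedding)
    return [x / norm for x in embedding]


vector = normalize_embedding([3.0, 4.0])
length = math.sqrt(sum(x * x for x in vector))
print(vector, length)
```

With unit-length vectors, the dot product of two embeddings equals their cosine similarity, which is why this option matters for similarity search.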
to_dict
Serializes the component to a dictionary.
Returns:
dict[str, Any] – Dictionary with serialized data.
from_dict
Deserializes the component from a dictionary.
Parameters:
- data (dict[str, Any]) – Dictionary to deserialize from.
Returns:
HuggingFaceAPITextEmbedder – Deserialized component.
run
Embeds a single string.
Parameters:
- text (str) – Text to embed.
Returns:
dict[str, Any] – A dictionary with the following keys:
- embedding: The embedding of the input text.
Raises:
TypeError – If text is not a string.
ValueError – If the embedding returned by the API has an unexpected shape.
run_async
Embeds a single string asynchronously.
Parameters:
- text (str) – Text to embed.
Returns:
dict[str, Any] – A dictionary with the following keys:
- embedding: The embedding of the input text.
Raises:
TypeError – If text is not a string.
ValueError – If the embedding returned by the API has an unexpected shape.
haystack_integrations.components.generators.huggingface_api.chat.chat_generator
HuggingFaceAPIChatGenerator
Completes chats using Hugging Face APIs.
HuggingFaceAPIChatGenerator uses the ChatMessage format for input and output. Use it to generate text with Hugging Face APIs:
- Serverless Inference API (Inference Providers)
- Paid Inference Endpoints
- Self-hosted Text Generation Inference
Usage examples
With the serverless inference API (Inference Providers) - free tier available
from haystack_integrations.components.generators.huggingface_api import HuggingFaceAPIChatGenerator
from haystack.dataclasses import ChatMessage
from haystack.utils import Secret
from haystack_integrations.components.common.huggingface_api.utils import HFGenerationAPIType
messages = [
    ChatMessage.from_system("\nYou are a helpful, respectful and honest assistant"),
    ChatMessage.from_user("What's Natural Language Processing?"),
]
# the api_type can be expressed using the HFGenerationAPIType enum or as a string
api_type = HFGenerationAPIType.SERVERLESS_INFERENCE_API
api_type = "serverless_inference_api"  # this is equivalent to the above
generator = HuggingFaceAPIChatGenerator(
    api_type=api_type,
    api_params={"model": "Qwen/Qwen2.5-7B-Instruct", "provider": "together"},
    token=Secret.from_token("<your-api-key>"),
)
result = generator.run(messages)
print(result)
With the serverless inference API (Inference Providers) and text+image input
from haystack_integrations.components.generators.huggingface_api import HuggingFaceAPIChatGenerator
from haystack.dataclasses import ChatMessage, ImageContent
from haystack.utils import Secret
from haystack_integrations.components.common.huggingface_api.utils import HFGenerationAPIType
# Create an image from file path, URL, or base64
image = ImageContent.from_file_path("path/to/your/image.jpg")
# Create a multimodal message with both text and image
messages = [ChatMessage.from_user(content_parts=["Describe this image in detail", image])]
generator = HuggingFaceAPIChatGenerator(
    api_type=HFGenerationAPIType.SERVERLESS_INFERENCE_API,
    api_params={
        "model": "Qwen/Qwen2.5-VL-7B-Instruct",  # Vision Language Model
        "provider": "hyperbolic",
    },
    token=Secret.from_token("<your-api-key>"),
)
result = generator.run(messages)
print(result)
With paid inference endpoints
from haystack_integrations.components.generators.huggingface_api import HuggingFaceAPIChatGenerator
from haystack.dataclasses import ChatMessage
from haystack.utils import Secret
messages = [
    ChatMessage.from_system("\nYou are a helpful, respectful and honest assistant"),
    ChatMessage.from_user("What's Natural Language Processing?"),
]
generator = HuggingFaceAPIChatGenerator(
    api_type="inference_endpoints",
    api_params={"url": "<your-inference-endpoint-url>"},
    token=Secret.from_token("<your-api-key>"),
)
result = generator.run(messages)
print(result)
With self-hosted text generation inference
from haystack_integrations.components.generators.huggingface_api import HuggingFaceAPIChatGenerator
from haystack.dataclasses import ChatMessage
messages = [
    ChatMessage.from_system("\nYou are a helpful, respectful and honest assistant"),
    ChatMessage.from_user("What's Natural Language Processing?"),
]
generator = HuggingFaceAPIChatGenerator(
    api_type="text_generation_inference",
    api_params={"url": "http://localhost:8080"},
)
result = generator.run(messages)
print(result)
init
__init__(
api_type: HFGenerationAPIType | str,
api_params: dict[str, str],
token: Secret | None = Secret.from_env_var(
["HF_API_TOKEN", "HF_TOKEN"], strict=False
),
generation_kwargs: dict[str, Any] | None = None,
stop_words: list[str] | None = None,
streaming_callback: StreamingCallbackT | None = None,
tools: ToolsType | None = None,
) -> None
Initialize the HuggingFaceAPIChatGenerator instance.
Parameters:
- api_type (HFGenerationAPIType | str) – The type of Hugging Face API to use. Available types:
  - text_generation_inference: See TGI.
  - inference_endpoints: See Inference Endpoints.
  - serverless_inference_api: See Serverless Inference API - Inference Providers.
- api_params (dict[str, str]) – A dictionary with the following keys:
  - model: Hugging Face model ID. Required when api_type is SERVERLESS_INFERENCE_API.
  - provider: Provider name. Recommended when api_type is SERVERLESS_INFERENCE_API.
  - url: URL of the inference endpoint. Required when api_type is INFERENCE_ENDPOINTS or TEXT_GENERATION_INFERENCE.
  - Other parameters specific to the chosen API type, such as timeout, headers, etc.
- token (Secret | None) – The Hugging Face token to use as HTTP bearer authorization. Check your HF token in your account settings.
- generation_kwargs (dict[str, Any] | None) – A dictionary with keyword arguments to customize text generation. Some examples: max_tokens, temperature, top_p. For details, see Hugging Face chat_completion documentation.
- stop_words (list[str] | None) – An optional list of strings representing the stop words.
- streaming_callback (StreamingCallbackT | None) – An optional callable for handling streaming responses.
- tools (ToolsType | None) – A list of Tool and/or Toolset objects, or a single Toolset for which the model can prepare calls. The chosen model should support tool/function calling, according to the model card. Support for tools in the Hugging Face API and TGI is not yet fully refined and you may experience unexpected behavior.
Raises:
ValueError – If the required model or url is missing from api_params, the url is invalid, the api_type is unknown, tools and streaming_callback are used together, or duplicate tool names are provided.
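The api_params requirements listed above can be summarized as a small validation routine. This is only a sketch of the documented rules using the standard library: validate_api_params is a hypothetical helper, and the URL check shown here (http or https scheme plus a host) is an assumption about what counts as a valid url.

```python
from urllib.parse import urlparse

# API types that require a "url" entry, per the parameter description above.
URL_BASED_TYPES = {"inference_endpoints", "text_generation_inference"}


def validate_api_params(api_type, api_params):
    """Raise ValueError when api_params does not match the chosen api_type."""
    if api_type == "serverless_inference_api":
        if not api_params.get("model"):
            raise ValueError("api_params must contain 'model' for serverless_inference_api")
    elif api_type in URL_BASED_TYPES:
        url = api_params.get("url")
        if not url:
            raise ValueError(f"api_params must contain 'url' for {api_type}")
        parsed = urlparse(url)
        # Assumed validity rule: an http(s) scheme and a non-empty host.
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError(f"invalid url: {url!r}")
    else:
        raise ValueError(f"unknown api_type: {api_type!r}")


validate_api_params("text_generation_inference", {"url": "http://localhost:8080"})
validate_api_params("serverless_inference_api", {"model": "Qwen/Qwen2.5-7B-Instruct"})

try:
    validate_api_params("inference_endpoints", {"model": "Qwen/Qwen2.5-7B-Instruct"})
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)
```

The last call fails because inference_endpoints needs url, not model, which is the most common mix-up between the API types.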
warm_up
Warm up the Hugging Face API chat generator.
This will warm up the tools registered in the chat generator. This method is idempotent and will only warm up the tools once.
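Idempotent here means that calling warm_up repeatedly has the same effect as calling it once. A guard flag is one simple way to achieve this, as in the sketch below. The SketchTool and SketchGenerator classes and the _is_warmed_up attribute are hypothetical names, not the component's internals.

```python
class SketchTool:
    """Hypothetical tool that counts how often it is warmed up."""

    def __init__(self, name):
        self.name = name
        self.warm_up_calls = 0

    def warm_up(self):
        self.warm_up_calls += 1


class SketchGenerator:
    """Hypothetical generator showing an idempotent warm_up."""

    def __init__(self, tools):
        self.tools = tools
        self._is_warmed_up = False

    def warm_up(self):
        # Return early on repeated calls so tools are warmed up only once.
        if self._is_warmed_up:
            return
        for tool in self.tools:
            tool.warm_up()
        self._is_warmed_up = True


tool = SketchTool("search")
generator = SketchGenerator([tool])
generator.warm_up()
generator.warm_up()
print(tool.warm_up_calls)
```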
to_dict
Serialize this component to a dictionary.
Returns:
dict[str, Any] – A dictionary containing the serialized component.
from_dict
Deserialize this component from a dictionary.
run
run(
messages: list[ChatMessage] | str,
generation_kwargs: dict[str, Any] | None = None,
tools: ToolsType | None = None,
streaming_callback: StreamingCallbackT | None = None,
) -> dict[str, list[ChatMessage]]
Invoke the text generation inference based on the provided messages and generation parameters.
Parameters:
- messages (list[ChatMessage] | str) – A list of ChatMessage objects representing the input messages. If a string is provided, it is converted to a list containing a ChatMessage with user role.
- generation_kwargs (dict[str, Any] | None) – Additional keyword arguments for text generation.
- tools (ToolsType | None) – A list of tools or a Toolset for which the model can prepare calls. If set, it will override the tools parameter set during component initialization. This parameter can accept either a list of Tool objects or a Toolset instance.
- streaming_callback (StreamingCallbackT | None) – An optional callable for handling streaming responses. If set, it will override the streaming_callback parameter set during component initialization.
Returns:
dict[str, list[ChatMessage]] – A dictionary with the following keys:
- replies: A list containing the generated responses as ChatMessage objects.
Raises:
ValueError – If tools and a streaming callback are used together, or if duplicate tool names are provided.
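Two of the rules above lend themselves to a short illustration: a plain string becomes a single user message, and tools cannot be combined with a streaming callback or contain duplicate names. The sketch uses a hypothetical Message dataclass and helper functions in place of the real ChatMessage and Tool classes, so it shows the documented behavior rather than the actual implementation.

```python
from dataclasses import dataclass


@dataclass
class Message:
    """Hypothetical stand-in for ChatMessage."""
    role: str
    text: str


def normalize_messages(messages):
    """Wrap a plain string in a one-element list holding a user message."""
    if isinstance(messages, str):
        return [Message(role="user", text=messages)]
    return list(messages)


def check_tools(tool_names, streaming_callback=None):
    """Raise ValueError for the two tool-related conditions described above."""
    if tool_names and streaming_callback is not None:
        raise ValueError("tools and a streaming callback cannot be used together")
    duplicates = sorted({name for name in tool_names if tool_names.count(name) > 1})
    if duplicates:
        raise ValueError(f"duplicate tool names: {duplicates}")


normalized = normalize_messages("What's Natural Language Processing?")
print(normalized)

check_tools(["search", "calculator"])  # passes silently
try:
    check_tools(["search", "search"])
    duplicate_error = None
except ValueError as exc:
    duplicate_error = str(exc)
print(duplicate_error)
```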
run_async
run_async(
messages: list[ChatMessage] | str,
generation_kwargs: dict[str, Any] | None = None,
tools: ToolsType | None = None,
streaming_callback: StreamingCallbackT | None = None,
) -> dict[str, list[ChatMessage]]
Asynchronously invokes the text generation inference based on the provided messages and generation parameters.
This is the asynchronous version of the run method. It has the same parameters
and return values but can be used with await in async code.
Parameters:
- messages (list[ChatMessage] | str) – A list of ChatMessage objects representing the input messages. If a string is provided, it is converted to a list containing a ChatMessage with user role.
- generation_kwargs (dict[str, Any] | None) – Additional keyword arguments for text generation.
- tools (ToolsType | None) – A list of tools or a Toolset for which the model can prepare calls. If set, it will override the tools parameter set during component initialization. This parameter can accept either a list of Tool objects or a Toolset instance.
- streaming_callback (StreamingCallbackT | None) – An optional callable for handling streaming responses. If set, it will override the streaming_callback parameter set during component initialization.
Returns:
dict[str, list[ChatMessage]] – A dictionary with the following keys:
- replies: A list containing the generated responses as ChatMessage objects.
Raises:
ValueError – If tools and a streaming callback are used together, or if duplicate tool names are provided.
haystack_integrations.components.rankers.huggingface_api.ranker
TruncationDirection
Bases: str, Enum
Defines the direction to truncate text when input length exceeds the model's limit.
Attributes:
- LEFT: Truncate text from the left side (start of text).
- RIGHT: Truncate text from the right side (end of text).
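The difference between the two directions is easiest to see on a token list. The sketch below defines its own Direction enum with the same str, Enum bases; the member values "left" and "right" and the truncate_tokens helper are assumptions for illustration, and the real truncation happens on the TEI server.

```python
from enum import Enum


class Direction(str, Enum):
    """Hypothetical mirror of TruncationDirection; the string values are assumed."""
    LEFT = "left"
    RIGHT = "right"


def truncate_tokens(tokens, max_len, direction):
    """Drop tokens from the start (LEFT) or the end (RIGHT) until max_len remain."""
    if len(tokens) <= max_len:
        return list(tokens)
    if direction is Direction.LEFT:
        # Remove tokens from the start of the text and keep the tail.
        return list(tokens[len(tokens) - max_len:])
    # Remove tokens from the end of the text and keep the head.
    return list(tokens[:max_len])


tokens = ["the", "capital", "of", "France", "is", "Paris"]
kept_left = truncate_tokens(tokens, 3, Direction.LEFT)
kept_right = truncate_tokens(tokens, 3, Direction.RIGHT)
print(kept_left)
print(kept_right)
```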
HuggingFaceTEIRanker
Ranks documents based on their semantic similarity to the query.
It can be used with a Text Embeddings Inference (TEI) API endpoint.
Usage example:
from haystack import Document
from haystack.utils import Secret
from haystack_integrations.components.rankers.huggingface_api import HuggingFaceTEIRanker
reranker = HuggingFaceTEIRanker(
    url="http://localhost:8080",
    top_k=5,
    timeout=30,
    token=Secret.from_token("my_api_token"),
)
docs = [Document(content="The capital of France is Paris"), Document(content="The capital of Germany is Berlin")]
result = reranker.run(query="What is the capital of France?", documents=docs)
ranked_docs = result["documents"]
print(ranked_docs)
# >> [Document(id=..., content: 'The capital of France is Paris', score: 0.9979767),
# >>  Document(id=..., content: 'The capital of Germany is Berlin', score: 0.13982213)]
init
__init__(
*,
url: str,
top_k: int = 10,
raw_scores: bool = False,
timeout: int | None = 30,
max_retries: int = 3,
retry_status_codes: list[int] | None = None,
token: Secret | None = Secret.from_env_var(
["HF_API_TOKEN", "HF_TOKEN"], strict=False
)
) -> None
Initializes the TEI reranker component.
Parameters:
- url (str) – Base URL of the TEI reranking service (for example, "https://api.example.com").
- top_k (int) – Maximum number of top documents to return.
- raw_scores (bool) – If True, include raw relevance scores in the API payload.
- timeout (int | None) – Request timeout in seconds.
- max_retries (int) – Maximum number of retry attempts for failed requests.
- retry_status_codes (list[int] | None) – List of HTTP status codes that will trigger a retry. When None, HTTP 408, 418, 429 and 503 will be retried (default: None).
- token (Secret | None) – The Hugging Face token to use as HTTP bearer authorization. Not always required depending on your TEI server configuration. Check your HF token in your account settings.
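How max_retries and retry_status_codes combine can be sketched as a single decision function. The default status codes come from the description above; the should_retry helper and the convention that attempt counts retries already made are assumptions.

```python
# Status codes retried when retry_status_codes is None, per the description above.
DEFAULT_RETRY_STATUS_CODES = [408, 418, 429, 503]


def should_retry(status_code, attempt, max_retries=3, retry_status_codes=None):
    """Return True if a failed request should be retried.

    attempt is the number of retries already made (assumed convention).
    """
    codes = DEFAULT_RETRY_STATUS_CODES if retry_status_codes is None else retry_status_codes
    return attempt < max_retries and status_code in codes


print(should_retry(429, attempt=0))
print(should_retry(500, attempt=0))
print(should_retry(500, attempt=0, retry_status_codes=[500, 502]))
print(should_retry(429, attempt=3))
```

Passing an explicit list replaces the defaults rather than extending them in this sketch, so a custom list should include every status code you want retried.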
to_dict
Serializes the component to a dictionary.
Returns:
dict[str, Any] – Dictionary with serialized data.
from_dict
Deserializes the component from a dictionary.
Parameters:
- data (dict[str, Any]) – Dictionary to deserialize from.
Returns:
HuggingFaceTEIRanker – Deserialized component.
run
run(
query: str,
documents: list[Document],
top_k: int | None = None,
truncation_direction: TruncationDirection | None = None,
) -> dict[str, list[Document]]
Reranks the provided documents by relevance to the query using the TEI API.
Before ranking, documents are deduplicated by their id, retaining only the document with the highest score if a score is present.
Parameters:
- query (str) – The user query string to guide reranking.
- documents (list[Document]) – List of Document objects to rerank.
- top_k (int | None) – Optional override for the maximum number of documents to return.
- truncation_direction (TruncationDirection | None) – If set, enables text truncation in the specified direction.
Returns:
dict[str, list[Document]] – A dictionary with the following keys:
- documents: A list of reranked documents.
Raises:
RuntimeError – If the API request fails.
RuntimeError – If the API returns an error response.
TypeError – If the API response is not in the expected list format.
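The deduplication step described for run (one document per id, preferring the highest score when scores are present) can be sketched as follows. The Doc dataclass and the deduplicate helper are hypothetical; how the real component handles ties or documents without scores is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Doc:
    """Hypothetical stand-in for a Haystack Document."""
    id: str
    content: str
    score: Optional[float] = None


def deduplicate(documents):
    """Keep one document per id, preferring the one with the highest score."""
    best = {}
    for doc in documents:
        kept = best.get(doc.id)
        if kept is None:
            best[doc.id] = doc
        elif doc.score is not None and (kept.score is None or doc.score > kept.score):
            # Replace the kept document only when the new one has a higher score.
            best[doc.id] = doc
    # dict preserves insertion order, so ids appear in first-seen order.
    return list(best.values())


docs = [
    Doc("a", "The capital of France is Paris", score=0.2),
    Doc("b", "The capital of Germany is Berlin"),
    Doc("a", "The capital of France is Paris", score=0.9),
]
unique_docs = deduplicate(docs)
print([(doc.id, doc.score) for doc in unique_docs])
```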
run_async
run_async(
query: str,
documents: list[Document],
top_k: int | None = None,
truncation_direction: TruncationDirection | None = None,
) -> dict[str, list[Document]]
Asynchronously reranks the provided documents by relevance to the query using the TEI API.
Before ranking, documents are deduplicated by their id, retaining only the document with the highest score if a score is present.
Parameters:
- query (str) – The user query string to guide reranking.
- documents (list[Document]) – List of Document objects to rerank.
- top_k (int | None) – Optional override for the maximum number of documents to return.
- truncation_direction (TruncationDirection | None) – If set, enables text truncation in the specified direction.
Returns:
dict[str, list[Document]] – A dictionary with the following keys:
- documents: A list of reranked documents.
Raises:
httpx.RequestError – If the API request fails.
RuntimeError – If the API returns an error response.
TypeError – If the API response is not in the expected list format.