Choosing the Right Generator
This page provides information on choosing the right Generator for interacting with Generative Language Models in Haystack. It explains the distinction between Generators and ChatGenerators, discusses using proprietary and open models from various providers, and explores options for using open models on-premise.
In Haystack, Generators are the main interface for interacting with Generative Language Models. This guide aims to simplify the process of choosing the right Generator based on your preferences and computing resources. It does not focus on selecting a specific model, but rather a model type and a Haystack Generator: as you will see, in several cases you have different options for using the same model.
Generators vs ChatGenerators
The first distinction to understand is between Generators and ChatGenerators: for example, OpenAIGenerator and OpenAIChatGenerator, HuggingFaceAPIGenerator and HuggingFaceAPIChatGenerator, and so on.
- Generators are components that expect a prompt (a string) and return the generated text in “replies”.
- ChatGenerators support the ChatMessage data class out of the box. They expect a list of Chat Messages and return a Chat Message in “replies”.
ChatGenerators are considerably more powerful than Generators: for example, they support Function Calling and multimodal inputs.
We recommend using ChatGenerators. Generators might be removed in a future major version of Haystack.
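To make the difference concrete, here is a minimal sketch using the OpenAI components; the model name and prompt are illustrative, and any supported model works:

from haystack.components.generators import OpenAIGenerator
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.dataclasses import ChatMessage

# Generator: takes a prompt string, returns generated strings in "replies"
generator = OpenAIGenerator(model="gpt-4o-mini")
print(generator.run(prompt="Name one French cheese.")["replies"][0])

# ChatGenerator: takes a list of ChatMessages, returns ChatMessages in "replies"
chat_generator = OpenAIChatGenerator(model="gpt-4o-mini")
reply = chat_generator.run(messages=[ChatMessage.from_user("Name one French cheese.")])["replies"][0]
print(reply.text)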
Streaming Support
Streaming refers to outputting LLM responses word by word, as they are generated, rather than waiting for the entire response and outputting it all at once.
You can check which Generators have streaming support on the Generators overview page.
When you enable streaming, the generator calls your streaming_callback for every StreamingChunk. Each chunk represents exactly one of the following:
- Tool calls: The model is building a tool/function call. Read chunk.tool_calls.
- Tool result: A tool finished and returned output. Read chunk.tool_call_result.
- Text tokens: Normal assistant text. Read chunk.content.
- Reasoning tokens: Extended thinking output (for models that support it). Read chunk.reasoning.
Only one of these fields appears per chunk. Use chunk.start and chunk.finish_reason to detect boundaries. Use chunk.index and chunk.component_info for tracing.
For providers that support multiple candidates, set n=1 to stream.
Check out the parameter details in our API Reference for StreamingChunk.
The simplest way to consume streamed output is the built-in print_streaming_chunk function. It handles all chunk types and prints formatted output to stdout. Any streaming-capable generator works; OpenAIChatGenerator is used here as an example:

from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.components.generators.utils import print_streaming_chunk
from haystack.dataclasses import ChatMessage

generator = OpenAIChatGenerator(streaming_callback=print_streaming_chunk)
# For ChatGenerators, pass a list[ChatMessage]; for text Generators, pass a prompt string.
generator.run(messages=[ChatMessage.from_user("What is the best French cheese?")])
Custom Callback
If you need custom rendering, write your own callback. Handle the four chunk types in order:
from haystack.dataclasses import StreamingChunk

def my_streaming_callback(chunk: StreamingChunk) -> None:
    if chunk.start and chunk.index and chunk.index > 0:
        print("\n\n", flush=True, end="")

    # Tool Call streaming
    if chunk.tool_calls:
        for tool_call in chunk.tool_calls:
            if chunk.start:
                if chunk.index and tool_call.index > chunk.index:
                    print("\n\n", flush=True, end="")
                print(
                    f">>> Tool Call: {tool_call.tool_name}\n>>> Arguments: ",
                    flush=True,
                    end="",
                )
            if tool_call.arguments:
                print(tool_call.arguments, flush=True, end="")

    # Tool Result streaming
    if chunk.tool_call_result:
        print(f">>> Tool Result\n{chunk.tool_call_result.result}", flush=True, end="")

    # Text streaming
    if chunk.content:
        if chunk.start:
            print(">>> Assistant\n", flush=True, end="")
        print(chunk.content, flush=True, end="")

    # Reasoning streaming
    if chunk.reasoning:
        if chunk.start:
            print(">>> Reasoning\n", flush=True, end="")
        print(chunk.reasoning.reasoning_text, flush=True, end="")

    if chunk.finish_reason is not None:
        print("\n\n", flush=True, end="")
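You can then attach the custom callback exactly like the built-in one; OpenAIChatGenerator is again just an example:

from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.dataclasses import ChatMessage

generator = OpenAIChatGenerator(streaming_callback=my_streaming_callback)
generator.run(messages=[ChatMessage.from_user("What is the best French cheese?")])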
Agents and Tools
Agents and ToolInvoker forward your streaming_callback. They also emit a final tool-result chunk with a finish_reason so UIs can close the “tool phase” cleanly before assistant text resumes. The default print_streaming_chunk formats this for you.
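As a sketch, assuming you already have a list of Tool objects (my_tools below is a placeholder), the callback can be passed to an Agent like this:

from haystack.components.agents import Agent
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.components.generators.utils import print_streaming_chunk
from haystack.dataclasses import ChatMessage

# my_tools stands in for your own list of haystack.tools.Tool objects
agent = Agent(
    chat_generator=OpenAIChatGenerator(),
    tools=my_tools,
    streaming_callback=print_streaming_chunk,  # forwarded to the ChatGenerator and ToolInvoker
)
agent.warm_up()
agent.run(messages=[ChatMessage.from_user("What is the weather in Paris?")])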
Proprietary vs Open-weights Models
Before choosing a Generator, it helps to know which type of model you want to use.
Proprietary Models
Using proprietary models is a quick way to get started with Generative Language Models. The typical approach involves calling these hosted models using an API key. You pay based on the number of tokens, both sent and generated. You don’t need significant resources on your local machine, as the computation is executed on the provider’s infrastructure. When using these models, your data leaves your machine and is transmitted to the model provider.
Open-weights Models
When discussing open (weights) models, we're referring to models with public weights that anyone can deploy on their infrastructure. The datasets used for training are shared less frequently. One might choose an open model for several reasons, including greater transparency and control over the model.
Not all open models are suitable for commercial use. We advise thoroughly reviewing the license, typically available on Hugging Face, before considering their adoption.
Even if the model is open, you might still want to rely on model providers to use it, mostly because you want someone else to host the model and take care of the infrastructural aspects. In these scenarios, your data still leaves your machine and is transmitted to the provider hosting the model.
Where the Model Runs
Where the model runs is a separate decision from the proprietary-vs-open one: a proprietary model is always provider-hosted, but an open-weights model can be served in any of the ways described below.
The Generator you pick is mostly determined by where the model runs and which API you call. The costs associated with these solutions vary: depending on the solution you choose, you pay either for the tokens consumed, both sent and generated, or for hosting the model, often billed per hour.
Provider-hosted APIs
With provider-hosted APIs, you leverage an instance of the model shared with other users, with payment typically based on consumed tokens, both sent and generated.
Single-vendor APIs
These providers host their own models behind a dedicated API. Haystack supports the models offered by a variety of providers: OpenAI, Azure, Google, Cohere, and Mistral, with more being added constantly.
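As an example, here is a minimal sketch using the Cohere integration; it assumes the cohere-haystack package is installed and a COHERE_API_KEY environment variable is set:

from haystack_integrations.components.generators.cohere import CohereChatGenerator
from haystack.dataclasses import ChatMessage

generator = CohereChatGenerator()  # reads the API key from the environment
generator.run(messages=[ChatMessage.from_user("What is the best French cheese?")])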
Multi-model Gateways
Several providers expose many models through a single API, so one Generator lets you switch between models from different vendors. Some of these providers focus on open-weights models, while others also include proprietary ones:
- Amazon Bedrock provides access to proprietary models from the Amazon Titan family, AI21 Labs, Anthropic, and Cohere, plus several open models, such as Llama from Meta.
- Hugging Face Inference Providers, available through the Hugging Face API Generators, give access to hundreds of LLMs from different providers through a unified interface.
- AIMLAPI, Comet API, NVIDIA, OpenRouter, STACKIT, Together AI, and WatsonX each have a dedicated Haystack integration.
- DeepInfra, Fireworks, and other cloud providers offer OpenAI-compatible interfaces and can be used through the OpenAI Generators.
Here is an example using DeepInfra and OpenAIChatGenerator:
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.dataclasses import ChatMessage
from haystack.utils import Secret

generator = OpenAIChatGenerator(
    api_key=Secret.from_env_var("ENVVAR_WITH_API_KEY"),
    api_base_url="https://api.deepinfra.com/v1/openai",
    model="Qwen/Qwen3-30B-A3B",
)
Dedicated Cloud Instances
In this case, a private instance of the model is deployed by the provider, and you typically pay per hour.
Here are the components that support this in Haystack:
- Amazon SagemakerGenerator
- Hugging Face API Generators, when used to query Hugging Face Inference Endpoints.
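As a sketch, here is a HuggingFaceAPIChatGenerator pointed at a dedicated Inference Endpoint; the endpoint URL is a placeholder for your own deployment:

from haystack.components.generators.chat import HuggingFaceAPIChatGenerator
from haystack.dataclasses import ChatMessage
from haystack.utils import Secret

generator = HuggingFaceAPIChatGenerator(
    api_type="inference_endpoints",
    api_params={"url": "https://<your-endpoint>.endpoints.huggingface.cloud"},
    token=Secret.from_env_var("HF_API_TOKEN"),
)
generator.run(messages=[ChatMessage.from_user("What is the best French cheese?")])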
Provider-hosted API vs Dedicated Cloud Instance
Why choose a provider-hosted API:
- Cost Savings: You pay only for the tokens you consume, which suits varying usage patterns and limited budgets.
- Ease of Use: The provider manages the infrastructure and updates, so there is little to set up or maintain.
Why choose a dedicated cloud instance:
- Dedicated Resources: Ensure consistent performance with dedicated resources for your instance and avoid any impact from other users.
- Scalability: Scale resources based on requirements while ensuring optimal performance during peak times and cost savings during off-peak hours.
- Predictable Costs: Billing per hour leads to more predictable costs, especially when there is a clear understanding of usage patterns.
Self-hosted / On-premise
Self-hosted or on-premise means that you host open models on your own machine or infrastructure. This is ideal for local experimentation and, provided you have sufficient computational resources, also suitable for production scenarios where data privacy concerns prevent sending data to external providers.
Local Experimentation
- GPU: HuggingFaceLocalChatGenerator is based on the Hugging Face Transformers library. It is good for experimentation when you have some GPU resources (for example, in Colab). If GPU resources are limited, quantization options such as bitsandbytes, GPTQ, and AWQ are supported. For more performant solutions in production use cases, refer to the options below.
- CPU (+ GPU if available): LlamaCppChatGenerator uses the Llama.cpp library, a project written in C/C++ for efficient inference of LLMs. In particular, it employs the quantized GGUF format, suitable for running these models on standard machines (even without GPUs). If GPU resources are available, some model layers can be offloaded to GPU for enhanced speed.
- CPU (+ GPU if available): OllamaChatGenerator is based on the Ollama project, which acts like Docker for LLMs. It provides a simple way to package and deploy these models. Internally based on the Llama.cpp library, it offers a more streamlined process for running on various platforms. See the sketch after this list.
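As a sketch of the local flow, here is OllamaChatGenerator querying a local Ollama server; it assumes the ollama-haystack package is installed and a model has already been pulled (llama3.2 is just an example):

from haystack_integrations.components.generators.ollama import OllamaChatGenerator
from haystack.dataclasses import ChatMessage

# llama3.2 is illustrative; use any model you have pulled with `ollama pull`
generator = OllamaChatGenerator(model="llama3.2", url="http://localhost:11434")
generator.run(messages=[ChatMessage.from_user("What is the best French cheese?")])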
Serving LLMs in Production
The following solutions are suitable if you want to run Language Models in production and have GPU resources available. They use innovative techniques for fast inference and efficient handling of numerous concurrent requests.
- vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs. Haystack supports vLLM through vLLMChatGenerator.
- SGLang is a similar high-performance LLM serving framework. Haystack supports it through the OpenAI Generators, as sketched after this list.
- Hugging Face API Generators, when used to query a TGI instance deployed on-premise. Hugging Face Text Generation Inference is a toolkit for efficiently deploying and serving LLMs. This project is now in maintenance mode.
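Both vLLM and SGLang expose OpenAI-compatible endpoints, so a self-hosted server can also be queried with OpenAIChatGenerator. In this sketch, the port, model name, and dummy API key are illustrative:

from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.dataclasses import ChatMessage
from haystack.utils import Secret

generator = OpenAIChatGenerator(
    api_key=Secret.from_token("unused"),  # self-hosted servers typically don't check the key
    api_base_url="http://localhost:8000/v1",
    model="meta-llama/Llama-3.2-1B-Instruct",  # must match the model the server is running
)
generator.run(messages=[ChatMessage.from_user("What is the best French cheese?")])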