vLLM
haystack_integrations.components.embedders.vllm.document_embedder
VLLMDocumentEmbedder
A component for computing Document embeddings using models served with vLLM.
The embedding of each Document is stored in the embedding field of the Document.
It expects a vLLM server to be running and reachable at the URL given in the api_base_url parameter, and it uses the OpenAI-compatible Embeddings API exposed by vLLM.
Starting the vLLM server
Before using this component, start a vLLM server with an embedding model:
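For example, reusing the model from the usage example below (a sketch; the exact flags can vary with your vLLM version):

```shell
# Serve an embedding model with the OpenAI-compatible API on port 8000.
# --task embed selects the embedding runner; recent vLLM versions can
# often infer this from the model, so the flag may be optional.
vllm serve google/embeddinggemma-300m --task embed
```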
For details on server options, see the vLLM CLI docs.
Usage example
from haystack import Document
from haystack_integrations.components.embedders.vllm import VLLMDocumentEmbedder
doc = Document(content="I love pizza!")
document_embedder = VLLMDocumentEmbedder(model="google/embeddinggemma-300m")
result = document_embedder.run([doc])
print(result["documents"][0].embedding)
Usage example with vLLM-specific parameters
Pass vLLM-specific parameters via the extra_parameters dictionary. They are forwarded as extra_body
to the OpenAI-compatible endpoint.
document_embedder = VLLMDocumentEmbedder(
model="google/embeddinggemma-300m",
extra_parameters={"truncate_prompt_tokens": 256, "truncation_side": "right"},
)
init
__init__(
*,
model: str,
api_key: Secret | None = Secret.from_env_var("VLLM_API_KEY", strict=False),
api_base_url: str = "http://localhost:8000/v1",
prefix: str = "",
suffix: str = "",
dimensions: int | None = None,
batch_size: int = 32,
progress_bar: bool = True,
meta_fields_to_embed: list[str] | None = None,
embedding_separator: str = "\n",
timeout: float | None = None,
max_retries: int | None = None,
http_client_kwargs: dict[str, Any] | None = None,
raise_on_failure: bool = False,
extra_parameters: dict[str, Any] | None = None
) -> None
Creates an instance of VLLMDocumentEmbedder.
Parameters:
- model (str) – The name of the model served by vLLM. Check the vLLM documentation for more information.
- api_key (Secret | None) – The vLLM API key. Defaults to the VLLM_API_KEY environment variable. Only required if the vLLM server was started with --api-key.
- api_base_url (str) – The base URL of the vLLM server.
- prefix (str) – A string to add at the beginning of each text.
- suffix (str) – A string to add at the end of each text.
- dimensions (int | None) – The number of dimensions of the resulting embedding. Only models trained with Matryoshka Representation Learning support this parameter. See the vLLM documentation for more information.
- batch_size (int) – Number of documents to encode at once.
- progress_bar (bool) – Whether to show a progress bar.
- meta_fields_to_embed (list[str] | None) – List of meta fields to embed along with the document text.
- embedding_separator (str) – Separator used to concatenate the meta fields to the document text.
- timeout (float | None) – Timeout in seconds for vLLM client calls. If not set, the OpenAI client default applies.
- max_retries (int | None) – Maximum number of retries for failed requests. If not set, the OpenAI client default applies.
- http_client_kwargs (dict[str, Any] | None) – A dictionary of keyword arguments to configure a custom httpx.Client or httpx.AsyncClient. For more information, see the HTTPX documentation.
- raise_on_failure (bool) – Whether to raise an exception if the embedding request fails. If False, the component logs the error and continues processing the remaining documents.
- extra_parameters (dict[str, Any] | None) – Additional parameters forwarded as extra_body to the vLLM embeddings endpoint. Use this to pass parameters not part of the standard OpenAI Embeddings API, such as truncate_prompt_tokens and truncation_side. See the vLLM Embeddings API docs.
warm_up
Create the OpenAI clients.
run
Embed a list of Documents.
Parameters:
- documents (list[Document]) – Documents to embed.

Returns:
- dict[str, list[Document] | dict[str, Any]] – A dictionary with:
  - documents: The input documents with their embedding field populated.
  - meta: Information about the usage of the model.
run_async
run_async(
documents: list[Document],
) -> dict[str, list[Document] | dict[str, Any]]
Asynchronously embed a list of Documents.
Parameters:
- documents (list[Document]) – Documents to embed.

Returns:
- dict[str, list[Document] | dict[str, Any]] – A dictionary with:
  - documents: The input documents with their embedding field populated.
  - meta: Information about the usage of the model.
haystack_integrations.components.embedders.vllm.text_embedder
VLLMTextEmbedder
A component for embedding strings using models served with vLLM.
It expects a vLLM server to be running and reachable at the URL given in the api_base_url parameter, and it uses the OpenAI-compatible Embeddings API exposed by vLLM.
Starting the vLLM server
Before using this component, start a vLLM server with an embedding model:
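For example, reusing the model from the usage example below (a sketch; the exact flags can vary with your vLLM version):

```shell
# Serve an embedding model with the OpenAI-compatible API on port 8000.
# --task embed selects the embedding runner; recent vLLM versions can
# often infer this from the model, so the flag may be optional.
vllm serve google/embeddinggemma-300m --task embed
```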
For details on server options, see the vLLM CLI docs.
Usage example
from haystack_integrations.components.embedders.vllm import VLLMTextEmbedder
text_embedder = VLLMTextEmbedder(model="google/embeddinggemma-300m")
print(text_embedder.run("I love pizza!"))
Usage example with vLLM-specific parameters
Pass vLLM-specific parameters via the extra_parameters dictionary. They are forwarded as extra_body
to the OpenAI-compatible endpoint.
text_embedder = VLLMTextEmbedder(
model="google/embeddinggemma-300m",
extra_parameters={"truncate_prompt_tokens": 256, "truncation_side": "right"},
)
init
__init__(
*,
model: str,
api_key: Secret | None = Secret.from_env_var("VLLM_API_KEY", strict=False),
api_base_url: str = "http://localhost:8000/v1",
prefix: str = "",
suffix: str = "",
dimensions: int | None = None,
timeout: float | None = None,
max_retries: int | None = None,
http_client_kwargs: dict[str, Any] | None = None,
extra_parameters: dict[str, Any] | None = None
) -> None
Creates an instance of VLLMTextEmbedder.
Parameters:
- model (str) – The name of the model served by vLLM (e.g., "intfloat/e5-mistral-7b-instruct").
- api_key (Secret | None) – The vLLM API key. Defaults to the VLLM_API_KEY environment variable. Only required if the vLLM server was started with --api-key.
- api_base_url (str) – The base URL of the vLLM server.
- prefix (str) – A string to add at the beginning of each text to embed.
- suffix (str) – A string to add at the end of each text to embed.
- dimensions (int | None) – The number of dimensions of the resulting embedding. Only models trained with Matryoshka Representation Learning support this parameter. See the vLLM documentation for more information.
- timeout (float | None) – Timeout in seconds for vLLM client calls. If not set, the OpenAI client default applies.
- max_retries (int | None) – Maximum number of retries for failed requests. If not set, the OpenAI client default applies.
- http_client_kwargs (dict[str, Any] | None) – A dictionary of keyword arguments to configure a custom httpx.Client or httpx.AsyncClient. For more information, see the HTTPX documentation.
- extra_parameters (dict[str, Any] | None) – Additional parameters forwarded as extra_body to the vLLM embeddings endpoint. Use this to pass parameters not part of the standard OpenAI Embeddings API, such as truncate_prompt_tokens, truncation_side, additional_data, and use_activation. See the vLLM Embeddings API docs.
warm_up
Create the OpenAI clients.
run
Embed a single string.
Parameters:
- text (str) – Text to embed.

Returns:
- dict[str, list[float] | dict[str, Any]] – A dictionary with:
  - embedding: The embedding of the input text.
  - meta: Information about the usage of the model.
run_async
Asynchronously embed a single string.
Parameters:
- text (str) – Text to embed.

Returns:
- dict[str, list[float] | dict[str, Any]] – A dictionary with:
  - embedding: The embedding of the input text.
  - meta: Information about the usage of the model.
haystack_integrations.components.generators.vllm.chat.chat_generator
VLLMChatGenerator
A component for generating chat completions using models served with vLLM.
It expects a vLLM server to be running and reachable at the URL given in the api_base_url parameter.
Starting the vLLM server
Before using this component, start a vLLM server:
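For example, reusing the model from the usage examples below:

```shell
# Serve a chat model with the OpenAI-compatible API
# (default base URL: http://localhost:8000/v1).
vllm serve Qwen/Qwen3-0.6B
```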
For reasoning models, start the server with the appropriate reasoning parser:
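A sketch; the parser name depends on the model family, and qwen3 is shown here as an assumption matching the Qwen3 model used in the examples below:

```shell
# --reasoning-parser separates the model's reasoning content from its answer.
vllm serve Qwen/Qwen3-0.6B --reasoning-parser qwen3
```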
For tool calling, the server must be started with --enable-auto-tool-choice and --tool-call-parser:
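A sketch; hermes is the parser commonly used for Qwen models, but check the vLLM tool calling docs for the parser matching your model:

```shell
# Enable automatic tool choice and pick a tool-call parser for the model.
vllm serve Qwen/Qwen3-0.6B --enable-auto-tool-choice --tool-call-parser hermes
```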
The available tool call parsers depend on the model. See the vLLM tool calling docs for the full list.
For details on server options, see the vLLM CLI docs.
Usage example
from haystack.dataclasses import ChatMessage
from haystack_integrations.components.generators.vllm import VLLMChatGenerator
generator = VLLMChatGenerator(
model="Qwen/Qwen3-0.6B",
generation_kwargs={"max_tokens": 512, "temperature": 0.7},
)
messages = [ChatMessage.from_user("What's Natural Language Processing?")]
response = generator.run(messages=messages)
print(response["replies"][0].text)
Usage example with vLLM-specific parameters
Pass the vLLM-specific parameters inside the generation_kwargs["extra_body"] dictionary.
from haystack_integrations.components.generators.vllm import VLLMChatGenerator
generator = VLLMChatGenerator(
model="Qwen/Qwen3-0.6B",
generation_kwargs={
"max_tokens": 512,
"extra_body": {
"top_k": 50,
"min_tokens": 10,
"repetition_penalty": 1.1,
},
},
)
Usage example with tool calling
To use tool calling, start the vLLM server with --enable-auto-tool-choice and --tool-call-parser.
from haystack.dataclasses import ChatMessage
from haystack.tools import tool
from haystack_integrations.components.generators.vllm import VLLMChatGenerator
@tool
def weather(city: str) -> str:
"""Get the weather in a given city."""
return f"The weather in {city} is sunny"
generator = VLLMChatGenerator(model="Qwen/Qwen3-0.6B", tools=[weather])
messages = [ChatMessage.from_user("What is the weather in Paris?")]
response = generator.run(messages=messages)
print(response["replies"][0].tool_calls)
Usage example with reasoning models
To use reasoning models, start the vLLM server with --reasoning-parser.
from haystack.dataclasses import ChatMessage
from haystack_integrations.components.generators.vllm import VLLMChatGenerator
generator = VLLMChatGenerator(model="Qwen/Qwen3-0.6B")
messages = [ChatMessage.from_user("Solve step by step: what is 15 * 37?")]
response = generator.run(messages=messages)
reply = response["replies"][0]
if reply.reasoning:
print("Reasoning:", reply.reasoning.reasoning_text)
print("Answer:", reply.text)
init
__init__(
*,
model: str,
api_key: Secret | None = Secret.from_env_var("VLLM_API_KEY", strict=False),
streaming_callback: StreamingCallbackT | None = None,
api_base_url: str = "http://localhost:8000/v1",
generation_kwargs: dict[str, Any] | None = None,
timeout: float | None = None,
max_retries: int | None = None,
tools: ToolsType | None = None,
http_client_kwargs: dict[str, Any] | None = None
) -> None
Creates an instance of VLLMChatGenerator.
Parameters:
- model (str) – The name of the model served by vLLM (e.g., "Qwen/Qwen3-0.6B").
- api_key (Secret | None) – The vLLM API key. Defaults to the VLLM_API_KEY environment variable. Only required if the vLLM server was started with --api-key.
- streaming_callback (StreamingCallbackT | None) – A callback function that is called when a new token is received from the stream. The callback function accepts StreamingChunk as an argument.
- api_base_url (str) – The base URL of the vLLM server.
- generation_kwargs (dict[str, Any] | None) – Additional parameters for text generation. These parameters are sent directly to the vLLM OpenAI-compatible endpoint. See the vLLM documentation for more details. Some of the supported parameters:
  - max_tokens: Maximum number of tokens to generate.
  - temperature: Sampling temperature.
  - top_p: Nucleus sampling parameter.
  - n: Number of completions to generate for each prompt.
  - stop: One or more sequences after which the model should stop generating tokens.
  - response_format: A JSON schema or a Pydantic model that enforces the structure of the response.
  - extra_body: A dictionary of vLLM-specific parameters not part of the standard OpenAI API (e.g., top_k, min_tokens, repetition_penalty).
- timeout (float | None) – Timeout in seconds for vLLM client calls. If not set, the OpenAI client default applies.
- max_retries (int | None) – Maximum number of retries for failed requests. If not set, the OpenAI client default applies.
- tools (ToolsType | None) – A list of Tool and/or Toolset objects, or a single Toolset, for which the model can prepare calls. Each tool should have a unique name. Not all models support tools.
- http_client_kwargs (dict[str, Any] | None) – A dictionary of keyword arguments to configure a custom httpx.Client or httpx.AsyncClient. For more information, see the HTTPX documentation.
warm_up
Create the OpenAI clients and warm up tools.
to_dict
Serialize this component to a dictionary.
Returns:
- dict[str, Any] – The serialized component as a dictionary.
from_dict
Deserialize this component from a dictionary.
Parameters:
- data (dict[str, Any]) – The dictionary representation of this component.

Returns:
- VLLMChatGenerator – The deserialized component instance.
run
run(
messages: list[ChatMessage],
streaming_callback: StreamingCallbackT | None = None,
generation_kwargs: dict[str, Any] | None = None,
*,
tools: ToolsType | None = None
) -> dict[str, list[ChatMessage]]
Run the VLLM chat generator on the given input data.
Parameters:
- messages (list[ChatMessage]) – A list of ChatMessage instances representing the input messages.
- streaming_callback (StreamingCallbackT | None) – A callback function that is called when a new token is received from the stream.
- generation_kwargs (dict[str, Any] | None) – Additional keyword arguments for text generation. These parameters override the parameters passed during component initialization. For details on vLLM API parameters, see the vLLM documentation.
- tools (ToolsType | None) – A list of Tool and/or Toolset objects, or a single Toolset, for which the model can prepare calls. If set, it overrides the tools parameter provided during initialization.

Returns:
- dict[str, list[ChatMessage]] – A dictionary with the following key:
  - replies: A list containing the generated responses as ChatMessage instances.
run_async
run_async(
messages: list[ChatMessage],
streaming_callback: StreamingCallbackT | None = None,
generation_kwargs: dict[str, Any] | None = None,
*,
tools: ToolsType | None = None
) -> dict[str, list[ChatMessage]]
Run the VLLM chat generator on the given input data asynchronously.
Parameters:
- messages (list[ChatMessage]) – A list of ChatMessage instances representing the input messages.
- streaming_callback (StreamingCallbackT | None) – A callback function that is called when a new token is received from the stream. Must be a coroutine.
- generation_kwargs (dict[str, Any] | None) – Additional keyword arguments for text generation. These parameters override the parameters passed during component initialization. For details on vLLM API parameters, see the vLLM documentation.
- tools (ToolsType | None) – A list of Tool and/or Toolset objects, or a single Toolset, for which the model can prepare calls. If set, it overrides the tools parameter provided during initialization.

Returns:
- dict[str, list[ChatMessage]] – A dictionary with the following key:
  - replies: A list containing the generated responses as ChatMessage instances.
haystack_integrations.components.rankers.vllm.ranker
VLLMRanker
Ranks Documents based on their similarity to a query using models served with vLLM.
It expects a vLLM server to be running and reachable at the URL given in the api_base_url parameter, and it uses the /rerank endpoint exposed by vLLM.
Starting the vLLM server
Before using this component, start a vLLM server with a reranker model:
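For example, reusing the model from the usage example below (a sketch; the exact flags can vary with your vLLM version):

```shell
# Serve a reranker model; vLLM exposes it on the /rerank endpoint.
# --task score selects the scoring/reranking runner; recent vLLM versions
# can often infer this from the model.
vllm serve BAAI/bge-reranker-base --task score
```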
For details on server options, see the vLLM CLI docs.
Usage example
from haystack import Document
from haystack_integrations.components.rankers.vllm import VLLMRanker
ranker = VLLMRanker(model="BAAI/bge-reranker-base")
docs = [
Document(content="The capital of Brazil is Brasilia."),
Document(content="The capital of France is Paris."),
]
result = ranker.run(query="What is the capital of France?", documents=docs)
print(result["documents"][0].content)
Usage example with vLLM-specific parameters
Pass vLLM-specific parameters via the extra_parameters dictionary. They are merged into the
request body sent to the /rerank endpoint.
ranker = VLLMRanker(
model="BAAI/bge-reranker-base",
extra_parameters={"truncate_prompt_tokens": 256},
)
init
__init__(
*,
model: str,
api_key: Secret | None = Secret.from_env_var("VLLM_API_KEY", strict=False),
api_base_url: str = "http://localhost:8000/v1",
top_k: int | None = None,
score_threshold: float | None = None,
meta_fields_to_embed: list[str] | None = None,
meta_data_separator: str = "\n",
http_client_kwargs: dict[str, Any] | None = None,
extra_parameters: dict[str, Any] | None = None
) -> None
Creates an instance of VLLMRanker.
Parameters:
- model (str) – The name of the reranker model served by vLLM. Check the vLLM documentation for information on supported models.
- api_key (Secret | None) – The vLLM API key. Defaults to the VLLM_API_KEY environment variable. Only required if the vLLM server was started with --api-key.
- api_base_url (str) – The base URL of the vLLM server.
- top_k (int | None) – The maximum number of Documents to return. If None, all documents are returned.
- score_threshold (float | None) – If set, documents with a relevance score below this value are dropped. Applied after top_k, so the output may contain fewer than top_k documents.
- meta_fields_to_embed (list[str] | None) – List of meta fields that should be concatenated with the document content before reranking.
- meta_data_separator (str) – Separator used to concatenate the meta fields to the document content.
- http_client_kwargs (dict[str, Any] | None) – A dictionary of keyword arguments to configure a custom httpx.Client or httpx.AsyncClient. For more information, see the HTTPX documentation.
- extra_parameters (dict[str, Any] | None) – Additional parameters merged into the request body sent to the vLLM /rerank endpoint. Use this to pass parameters not part of the standard rerank API, such as truncate_prompt_tokens. See the vLLM docs for more information.

Raises:
- ValueError – If top_k is not greater than 0.
warm_up
Create the httpx clients.
run
run(
query: str,
documents: list[Document],
top_k: int | None = None,
score_threshold: float | None = None,
) -> dict[str, list[Document] | dict[str, Any]]
Returns a list of Documents ranked by their similarity to the given query.
Parameters:
- query (str) – Query string.
- documents (list[Document]) – List of Documents to rank.
- top_k (int | None) – The maximum number of Documents to return. Overrides the value set at initialization.
- score_threshold (float | None) – Minimum relevance score required for a document to be returned. Overrides the value set at initialization.

Returns:
- dict[str, list[Document] | dict[str, Any]] – A dictionary with:
  - documents: Documents sorted from most to least relevant.
  - meta: Information about the model and usage.

Raises:
- ValueError – If top_k is not greater than 0.
run_async
run_async(
query: str,
documents: list[Document],
top_k: int | None = None,
score_threshold: float | None = None,
) -> dict[str, list[Document] | dict[str, Any]]
Asynchronously returns a list of Documents ranked by their similarity to the given query.
Parameters:
- query (str) – Query string.
- documents (list[Document]) – List of Documents to rank.
- top_k (int | None) – The maximum number of Documents to return. Overrides the value set at initialization.
- score_threshold (float | None) – Minimum relevance score required for a document to be returned. Overrides the value set at initialization.

Returns:
- dict[str, list[Document] | dict[str, Any]] – A dictionary with:
  - documents: Documents sorted from most to least relevant.
  - meta: Information about the model and usage.

Raises:
- ValueError – If top_k is not greater than 0.