Version: 2.24

LlamaCppChatGenerator

LlamaCppChatGenerator enables chat completion using an LLM running on Llama.cpp.


Most common position in a pipeline	After a `ChatPromptBuilder`
Mandatory init variables	`model`: The path of the model to use
Mandatory run variables	`messages`: A list of `ChatMessage` instances representing the input messages
Output variables	`replies`: A list of `ChatMessage` instances with all the replies generated by the LLM
API reference	Llama.cpp
GitHub link	https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/llama_cpp

Overview

Llama.cpp is a library written in C/C++ for efficient inference of Large Language Models. It leverages the efficient quantized GGUF format, dramatically reducing memory requirements and accelerating inference. This means it is possible to run LLMs efficiently on standard machines (even without GPUs).

Llama.cpp loads the quantized binary file of the LLM in GGUF format, which can be downloaded from Hugging Face. LlamaCppChatGenerator supports models running on Llama.cpp by taking the path to the locally saved GGUF file as the model parameter at initialization.
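
Because the generator only receives a file path, a quick sanity check before initialization can catch a wrong or incomplete download early. The sketch below is a minimal illustration, not part of the integration: it relies on the fact that GGUF files start with the four ASCII bytes GGUF, and the helper name looks_like_gguf is hypothetical.

```python
from pathlib import Path

# GGUF files begin with these four magic bytes.
GGUF_MAGIC = b"GGUF"


def looks_like_gguf(path: str) -> bool:
    """Return True if `path` is an existing file that starts with the GGUF magic bytes.

    Hypothetical helper for illustration; it does not validate the rest of the file.
    """
    file_path = Path(path)
    if not file_path.is_file():
        return False
    with file_path.open("rb") as handle:
        return handle.read(4) == GGUF_MAGIC
```

A call such as looks_like_gguf("/path/to/model.gguf") could then guard the LlamaCppChatGenerator initialization. It only inspects the header, so a truncated download with an intact header would still pass.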

Tool Support

LlamaCppChatGenerator supports function calling through the tools parameter, which accepts flexible tool configurations:

A list of Tool objects: Pass individual tools as a list
A single Toolset: Pass an entire Toolset directly
Mixed Tools and Toolsets: Combine multiple Toolsets with standalone tools in a single list

This allows you to organize related tools into logical groups while also including standalone tools as needed.

python

from haystack.tools import Tool, Toolset
from haystack_integrations.components.generators.llama_cpp import LlamaCppChatGenerator

# Create individual tools
weather_tool = Tool(name="weather", description="Get weather info", ...)
news_tool = Tool(name="news", description="Get latest news", ...)

# Group related tools into a toolset
math_toolset = Toolset([add_tool, subtract_tool, multiply_tool])

# Pass mixed tools and toolsets to the generator
generator = LlamaCppChatGenerator(
    model="/path/to/model.gguf",
    tools=[math_toolset, weather_tool, news_tool]  # Mix of Toolset and Tool objects
)
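
The snippet above elides the remaining Tool arguments. As a rough guide, a Tool pairs a Python callable with a JSON Schema that describes its arguments. The sketch below shows what those two pieces could look like for a hypothetical add tool, using only the standard library. The names add, add_parameters, and call_with_arguments are illustrative assumptions and not part of Haystack.

```python
def add(a: float, b: float) -> float:
    """Add two numbers; this is the callable a tool would invoke."""
    return a + b


# JSON Schema describing the arguments the model may pass to `add`.
add_parameters = {
    "type": "object",
    "properties": {
        "a": {"type": "number", "description": "First addend"},
        "b": {"type": "number", "description": "Second addend"},
    },
    "required": ["a", "b"],
}


def call_with_arguments(function, parameters, arguments):
    """Hypothetical helper: check required keys, then call the function."""
    missing = [name for name in parameters["required"] if name not in arguments]
    if missing:
        raise ValueError(f"Missing required arguments: {missing}")
    return function(**arguments)


# With Haystack installed, the two pieces would typically be combined as:
# add_tool = Tool(
#     name="add",
#     description="Add two numbers",
#     parameters=add_parameters,
#     function=add,
# )
```

Keeping the schema next to the function makes it easier to notice when the two drift apart.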

For more details on working with tools, see the Tool and Toolset documentation.

Installation

Install the llama-cpp-haystack package to use this integration:

shell

pip install llama-cpp-haystack

Using a different compute backend

By default, the installation builds llama.cpp for CPU on Linux and Windows and uses Metal on macOS. To use a different compute backend:

Follow instructions on the llama.cpp installation page to install llama-cpp-python for your preferred compute backend.
Install llama-cpp-haystack using the command above.

For example, to use llama-cpp-haystack with the CUDA backend, run the following commands:

shell

export GGML_CUDA=1
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python
pip install llama-cpp-haystack

Usage

Download the GGUF version of the desired LLM. The GGUF versions of popular models can be downloaded from Hugging Face.
Initialize LlamaCppChatGenerator with the path to the GGUF file and specify the required model and text generation parameters:

python

from haystack.dataclasses import ChatMessage
from haystack_integrations.components.generators.llama_cpp import LlamaCppChatGenerator

generator = LlamaCppChatGenerator(
    model="/content/openchat-3.5-1210.Q3_K_S.gguf",
    n_ctx=512,
    n_batch=128,
    model_kwargs={"n_gpu_layers": -1},
    generation_kwargs={"max_tokens": 128, "temperature": 0.1},
)
generator.warm_up()
messages = [ChatMessage.from_user("Who is the best American actor?")]
result = generator.run(messages)

Passing additional model parameters

The model, n_ctx, and n_batch arguments are exposed for convenience and can be passed directly to the generator as keyword arguments during initialization. Note that model translates to llama.cpp's model_path parameter.

The model_kwargs parameter can pass additional arguments when initializing the model. In case of duplication, these parameters override the model, n_ctx, and n_batch initialization parameters.
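
The precedence described above can be pictured with a small dictionary merge. This is only a sketch of the documented behavior, not the integration's actual code, and resolve_model_kwargs is a hypothetical name.

```python
from typing import Any, Dict, Optional


def resolve_model_kwargs(
    model: str,
    n_ctx: int,
    n_batch: int,
    model_kwargs: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Illustrate the precedence: keys already in model_kwargs win."""
    resolved = dict(model_kwargs or {})  # copy so the caller's dict is untouched
    resolved.setdefault("model_path", model)  # `model` maps to llama.cpp's model_path
    resolved.setdefault("n_ctx", n_ctx)
    resolved.setdefault("n_batch", n_batch)
    return resolved
```

For instance, passing n_ctx=512 together with model_kwargs={"n_ctx": 2048} would resolve to a context size of 2048 under this sketch.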

See Llama.cpp's LLM documentation for more information on the available model arguments.

Note: Llama.cpp automatically extracts the chat_template from the model metadata for applying formatting to ChatMessages. You can override the chat_template used by passing in a custom chat_handler or chat_format as a model parameter.

For example, to offload the model to GPU during initialization:

python

from haystack_integrations.components.generators.llama_cpp import LlamaCppChatGenerator
from haystack.dataclasses import ChatMessage

generator = LlamaCppChatGenerator(
    model="/content/openchat-3.5-1210.Q3_K_S.gguf",
    n_ctx=512,
    n_batch=128,
    model_kwargs={"n_gpu_layers": -1},
)
generator.warm_up()
messages = [ChatMessage.from_user("Who is the best American actor?")]
result = generator.run(messages, generation_kwargs={"max_tokens": 128})
generated_reply = result["replies"][0].text
print(generated_reply)

Passing text generation parameters

The generation_kwargs parameter can pass additional generation arguments like max_tokens, temperature, top_k, top_p, and others to the model during inference.

See Llama.cpp's Chat Completion API documentation for more information on the available generation arguments.

Note: JSON mode, Function Calling, and Tools are all supported as generation_kwargs. Please see the llama-cpp-python GitHub README for more information on how to use them.

For example, to set the max_tokens and temperature:

python

from haystack_integrations.components.generators.llama_cpp import LlamaCppChatGenerator
from haystack.dataclasses import ChatMessage

generator = LlamaCppChatGenerator(
    model="/content/openchat-3.5-1210.Q3_K_S.gguf",
    n_ctx=512,
    n_batch=128,
    generation_kwargs={"max_tokens": 128, "temperature": 0.1},
)
generator.warm_up()
messages = [ChatMessage.from_user("Who is the best American actor?")]
result = generator.run(messages)

With multimodal (image + text) inputs

python

from haystack.dataclasses import ChatMessage, ImageContent
from haystack_integrations.components.generators.llama_cpp import LlamaCppChatGenerator

# Initialize with multimodal support
llm = LlamaCppChatGenerator(
    model="llava-v1.5-7b-q4_0.gguf",
    chat_handler_name="Llava15ChatHandler",  # Use llava-1-5 handler
    model_clip_path="mmproj-model-f16.gguf",  # CLIP model
    n_ctx=4096,  # Larger context for image processing
)
llm.warm_up()

image = ImageContent.from_file_path("apple.jpg")
user_message = ChatMessage.from_user(
    content_parts=["What does the image show? Max 5 words.", image],
)

response = llm.run([user_message])["replies"][0].text
print(response)

# Red apple on straw.

Instead of setting generation_kwargs at initialization, you can also pass them directly to the generator's run method:

python

from haystack_integrations.components.generators.llama_cpp import LlamaCppChatGenerator
from haystack.dataclasses import ChatMessage

generator = LlamaCppChatGenerator(
    model="/content/openchat-3.5-1210.Q3_K_S.gguf",
    n_ctx=512,
    n_batch=128,
)
generator.warm_up()
messages = [ChatMessage.from_user("Who is the best American actor?")]
result = generator.run(
    messages,
    generation_kwargs={"max_tokens": 128, "temperature": 0.1},
)
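
When generation_kwargs are given both at initialization and at run time, a reasonable mental model is a dictionary merge in which the run-time values win for overlapping keys. The sketch below assumes that precedence, which is the usual convention in Haystack generators but should be confirmed against the integration source. merge_generation_kwargs is a hypothetical name.

```python
from typing import Any, Dict, Optional


def merge_generation_kwargs(
    init_kwargs: Dict[str, Any],
    run_kwargs: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Combine init-time and run-time settings; run-time keys override (assumed)."""
    return {**init_kwargs, **(run_kwargs or {})}
```

Under this assumption, an init-time temperature of 0.1 combined with a run-time temperature of 0.7 yields 0.7, while an init-time max_tokens is kept if the run call does not set it.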

In a pipeline

We use the LlamaCppChatGenerator in a Retrieval Augmented Generation pipeline on the Simple Wikipedia Dataset from Hugging Face and generate answers using the OpenChat-3.5 LLM.

Load the dataset:

python

# Install Hugging Face Datasets using "pip install datasets"
from datasets import load_dataset
from haystack import Document, Pipeline
from haystack.components.builders.answer_builder import AnswerBuilder
from haystack.components.builders import ChatPromptBuilder
from haystack.components.embedders import (
    SentenceTransformersDocumentEmbedder,
    SentenceTransformersTextEmbedder,
)
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.dataclasses import ChatMessage

# Import LlamaCppChatGenerator
from haystack_integrations.components.generators.llama_cpp import LlamaCppChatGenerator

# Load the first 100 rows of the Simple Wikipedia Dataset from Hugging Face
dataset = load_dataset("pszemraj/simple_wikipedia", split="validation[:100]")

docs = [
    Document(
        content=doc["text"],
        meta={
            "title": doc["title"],
            "url": doc["url"],
        },
    )
    for doc in dataset
]

Index the documents to the InMemoryDocumentStore using the SentenceTransformersDocumentEmbedder and DocumentWriter:

python

doc_store = InMemoryDocumentStore(embedding_similarity_function="cosine")
# Install sentence transformers using "pip install sentence-transformers"
doc_embedder = SentenceTransformersDocumentEmbedder(
    model="sentence-transformers/all-MiniLM-L6-v2",
)

# Indexing pipeline
indexing_pipeline = Pipeline()
indexing_pipeline.add_component(instance=doc_embedder, name="DocEmbedder")
indexing_pipeline.add_component(
    instance=DocumentWriter(document_store=doc_store),
    name="DocWriter",
)
indexing_pipeline.connect("DocEmbedder", "DocWriter")

indexing_pipeline.run({"DocEmbedder": {"documents": docs}})
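
For intuition about what the cosine setting of the document store implies at query time, here is a minimal pure-Python sketch of cosine-similarity ranking with a top_k cut-off. It is not Haystack's implementation, and the function names are hypothetical. It only illustrates the scoring idea on plain lists of floats.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_by_cosine(
    query: Sequence[float],
    doc_embeddings: Sequence[Sequence[float]],
    top_k: int = 3,
) -> List[Tuple[int, float]]:
    """Return (document index, score) pairs for the top_k most similar embeddings."""
    scored = [
        (index, cosine_similarity(query, embedding))
        for index, embedding in enumerate(doc_embeddings)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

In the pipeline, the query embedding comes from the text embedder and the document embeddings from the indexing step, so both must be produced by the same embedding model for the scores to be meaningful.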

Create the RAG pipeline and add the LlamaCppChatGenerator to it:

python

system_message = ChatMessage.from_system(
    """
    Answer the question using the provided context.
    Context:
    {% for doc in documents %}
        {{ doc.content }}
    {% endfor %}
    """,
)
user_message = ChatMessage.from_user("Question: {{question}}")
assistant_message = ChatMessage.from_assistant("Answer: ")

chat_template = [system_message, user_message, assistant_message]

rag_pipeline = Pipeline()

text_embedder = SentenceTransformersTextEmbedder(
    model="sentence-transformers/all-MiniLM-L6-v2",
)

# Load the LLM using LlamaCppChatGenerator
model_path = "openchat-3.5-1210.Q3_K_S.gguf"
generator = LlamaCppChatGenerator(model=model_path, n_ctx=4096, n_batch=128)

rag_pipeline.add_component(
    instance=text_embedder,
    name="text_embedder",
)
rag_pipeline.add_component(
    instance=InMemoryEmbeddingRetriever(document_store=doc_store, top_k=3),
    name="retriever",
)
rag_pipeline.add_component(
    instance=ChatPromptBuilder(template=chat_template),
    name="prompt_builder",
)
rag_pipeline.add_component(instance=generator, name="llm")
rag_pipeline.add_component(instance=AnswerBuilder(), name="answer_builder")

rag_pipeline.connect("text_embedder", "retriever")
rag_pipeline.connect("retriever", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder", "llm")
rag_pipeline.connect("llm", "answer_builder")
rag_pipeline.connect("retriever", "answer_builder.documents")

Run the pipeline:

python

question = "Which year did the Joker movie release?"
result = rag_pipeline.run(
    {
        "text_embedder": {"text": question},
        "prompt_builder": {"question": question},
        "llm": {"generation_kwargs": {"max_tokens": 128, "temperature": 0.1}},
        "answer_builder": {"query": question},
    },
)

generated_answer = result["answer_builder"]["answers"][0]
print(generated_answer.data)
# The Joker movie was released on October 4, 2019.
