LlamaCppChatGenerator
LlamaCppChatGenerator enables chat completion using an LLM running on Llama.cpp.
Name | LlamaCppChatGenerator |
Source | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/llama_cpp |
Most common position in a pipeline | After a ChatPromptBuilder |
Mandatory input variables | “messages”: A list of ChatMessage instances representing the input messages |
Output variables | “replies”: A list of ChatMessage instances with all the replies generated by the LLM |
Overview
Llama.cpp is a library written in C/C++ for efficient inference of Large Language Models. It leverages the efficient quantized GGUF format, dramatically reducing memory requirements and accelerating inference. This means it is possible to run LLMs efficiently on standard machines (even without GPUs).
Llama.cpp uses the quantized binary file of the LLM in GGUF format, which can be downloaded from Hugging Face. LlamaCppChatGenerator supports models running on Llama.cpp by taking the path to the locally saved GGUF file as the model parameter at initialization.
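Before passing a path as model, it can help to confirm that the file really is in GGUF format. The sketch below is not part of the integration: looks_like_gguf is a hypothetical helper that relies only on the fact that GGUF files begin with the four ASCII bytes "GGUF".

```python
from pathlib import Path

GGUF_MAGIC = b"GGUF"  # the first four bytes of a GGUF file


def looks_like_gguf(path: str) -> bool:
    """Return True if `path` is an existing file that starts with the GGUF magic bytes.

    Hypothetical helper for illustration; it is not provided by llama-cpp-haystack.
    """
    file_path = Path(path)
    if not file_path.is_file():
        return False
    with file_path.open("rb") as f:
        return f.read(len(GGUF_MAGIC)) == GGUF_MAGIC


# Path taken from the examples on this page; adjust it to your own download location.
model_path = "/content/openchat-3.5-1210.Q3_K_S.gguf"
if not looks_like_gguf(model_path):
    print(f"{model_path} is missing or is not a GGUF file")
```

The check only inspects the file header, so it catches wrong paths and non-GGUF downloads but says nothing about whether the model fits in memory.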
Installation
Install the llama-cpp-haystack package to use this integration:
pip install llama-cpp-haystack
Using a different compute backend
The default installation behavior is to build llama.cpp for CPU on Linux and Windows and to use Metal on macOS. To use other compute backends:
- Follow instructions on the llama.cpp installation page to install llama-cpp-python for your preferred compute backend.
- Install llama-cpp-haystack using the command above.
For example, to use llama-cpp-haystack with the cuBLAS backend, run the following commands:
export LLAMA_CUBLAS=1
CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python
pip install llama-cpp-haystack
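After installing, a quick sanity check is to confirm that both packages ended up in the active Python environment. The snippet below is a minimal sketch using only the standard library; installed_version is a hypothetical helper name, and the check does not tell you which compute backend llama-cpp-python was compiled with.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Distribution names as used in the pip commands above.
for name in ("llama-cpp-python", "llama-cpp-haystack"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```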
Usage
- Download the GGUF version of the desired LLM. The GGUF versions of popular models can be downloaded from Hugging Face.
- Initialize LlamaCppChatGenerator with the path to the GGUF file and specify the required model and text generation parameters:
from haystack.dataclasses import ChatMessage
from haystack_integrations.components.generators.llama_cpp import LlamaCppChatGenerator

generator = LlamaCppChatGenerator(
    model="/content/openchat-3.5-1210.Q3_K_S.gguf",
    n_ctx=512,
    n_batch=128,
    model_kwargs={"n_gpu_layers": -1},
    generation_kwargs={"max_tokens": 128, "temperature": 0.1},
)
generator.warm_up()
messages = [ChatMessage.from_user("Who is the best American actor?")]
result = generator.run(messages)
Passing additional model parameters
The model, n_ctx, and n_batch arguments have been exposed for convenience and can be directly passed to the Generator during initialization as keyword arguments. Note that model translates to llama.cpp's model_path parameter.
The model_kwargs parameter can pass additional arguments when initializing the model. In case of duplication, these parameters override the model, n_ctx, and n_batch initialization parameters.
See Llama.cpp's LLM documentation for more information on the available model arguments.
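The precedence rule described above can be pictured with a small stand-alone sketch. It is not the integration's actual code: resolve_model_kwargs is a hypothetical function that only illustrates the stated behavior, where model maps to model_path and any duplicate key in model_kwargs wins over the convenience arguments.

```python
from typing import Any, Dict, Optional


def resolve_model_kwargs(
    model: str,
    n_ctx: int,
    n_batch: int,
    model_kwargs: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Illustrative merge: values in model_kwargs override model, n_ctx, and n_batch."""
    resolved = dict(model_kwargs or {})
    # setdefault only fills a key when model_kwargs did not already provide it.
    resolved.setdefault("model_path", model)
    resolved.setdefault("n_ctx", n_ctx)
    resolved.setdefault("n_batch", n_batch)
    return resolved


resolved = resolve_model_kwargs(
    model="/content/openchat-3.5-1210.Q3_K_S.gguf",
    n_ctx=512,
    n_batch=128,
    model_kwargs={"n_ctx": 2048, "n_gpu_layers": -1},
)
print(resolved["n_ctx"])  # 2048, because model_kwargs takes precedence
print(resolved["n_batch"])  # 128, taken from the convenience argument
```

In practice this means a stale n_ctx left inside model_kwargs silently replaces the value passed as a keyword argument, so it is worth keeping each setting in one place only.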
Note: Llama.cpp automatically extracts the chat_template from the model metadata to format ChatMessages. You can override the chat_template used by passing a custom chat_handler or chat_format as a model parameter.
For example, to offload the model to GPU during initialization:
from haystack_integrations.components.generators.llama_cpp import LlamaCppChatGenerator
from haystack.dataclasses import ChatMessage
generator = LlamaCppChatGenerator(
    model="/content/openchat-3.5-1210.Q3_K_S.gguf",
    n_ctx=512,
    n_batch=128,
    model_kwargs={"n_gpu_layers": -1},
)
generator.warm_up()
messages = [ChatMessage.from_user("Who is the best American actor?")]
result = generator.run(messages, generation_kwargs={"max_tokens": 128})
generated_reply = result["replies"][0].content
print(generated_reply)
Passing text generation parameters
The generation_kwargs parameter can pass additional generation arguments like max_tokens, temperature, top_k, top_p, and others to the model during inference.
See Llama.cpp's Chat Completion API documentation for more information on the available generation arguments.
Note: JSON mode, Function Calling, and Tools are all supported as generation_kwargs. See the llama-cpp-python GitHub README for more information on how to use them.
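As a rough illustration of JSON mode, the sketch below builds a generation_kwargs dictionary with the response_format key described in the llama-cpp-python README, and adds a defensive parser for the reply text. It assumes the generator forwards the dictionary unchanged to llama-cpp-python's chat completion call; parse_json_reply is a hypothetical helper, not part of either library.

```python
import json
from typing import Any, Dict, Optional

# Generation arguments requesting JSON mode; "response_format" follows the
# llama-cpp-python README and is assumed to be forwarded as-is by the generator.
json_mode_kwargs: Dict[str, Any] = {
    "max_tokens": 128,
    "temperature": 0.1,
    "response_format": {"type": "json_object"},
}


def parse_json_reply(reply_text: str) -> Optional[Dict[str, Any]]:
    """Parse a reply as a JSON object; return None if it is not valid JSON or not an object."""
    try:
        parsed = json.loads(reply_text)
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None
```

You would pass json_mode_kwargs as generation_kwargs, either at initialization or to run, and feed the text of the first reply to parse_json_reply. Parsing defensively is useful because a reply cut off by max_tokens may not be complete JSON.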
For example, to set the max_tokens and temperature:
from haystack_integrations.components.generators.llama_cpp import LlamaCppChatGenerator
from haystack.dataclasses import ChatMessage
generator = LlamaCppChatGenerator(
    model="/content/openchat-3.5-1210.Q3_K_S.gguf",
    n_ctx=512,
    n_batch=128,
    generation_kwargs={"max_tokens": 128, "temperature": 0.1},
)
generator.warm_up()
messages = [ChatMessage.from_user("Who is the best American actor?")]
result = generator.run(messages)
The generation_kwargs can also be passed to the run method of the generator directly:
from haystack_integrations.components.generators.llama_cpp import LlamaCppChatGenerator
from haystack.dataclasses import ChatMessage
generator = LlamaCppChatGenerator(
    model="/content/openchat-3.5-1210.Q3_K_S.gguf",
    n_ctx=512,
    n_batch=128,
)
generator.warm_up()
messages = [ChatMessage.from_user("Who is the best American actor?")]
result = generator.run(
    messages,
    generation_kwargs={"max_tokens": 128, "temperature": 0.1},
)
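This page does not state which value wins when the same generation argument is set both at initialization and in run. The sketch below shows one plausible merge, under the explicit assumption that run-time values take precedence; merge_generation_kwargs is a hypothetical helper, so verify the behavior against your installed version before relying on it.

```python
from typing import Any, Dict, Optional


def merge_generation_kwargs(
    init_kwargs: Optional[Dict[str, Any]] = None,
    run_kwargs: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Combine both sources; ASSUMPTION: keys passed to run() override init-time keys."""
    return {**(init_kwargs or {}), **(run_kwargs or {})}


merged = merge_generation_kwargs(
    init_kwargs={"max_tokens": 128, "temperature": 0.1},
    run_kwargs={"temperature": 0.7},
)
print(merged)  # {'max_tokens': 128, 'temperature': 0.7}
```

Under this assumption, initialization-time values act as defaults for every call, and run-time values adjust a single request without rebuilding the generator.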
In a pipeline
This example uses LlamaCppChatGenerator in a Retrieval-Augmented Generation (RAG) pipeline over the Simple Wikipedia dataset from Hugging Face and generates answers with the OpenChat-3.5 LLM.
Load the dataset:
# Install HuggingFace Datasets using "pip install datasets"
from datasets import load_dataset
from haystack import Document, Pipeline
from haystack.components.builders.answer_builder import AnswerBuilder
from haystack.components.builders import ChatPromptBuilder
from haystack.components.embedders import SentenceTransformersDocumentEmbedder, SentenceTransformersTextEmbedder
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.dataclasses import ChatMessage
# Import LlamaCppChatGenerator
from haystack_integrations.components.generators.llama_cpp import LlamaCppChatGenerator
# Load the first 100 rows of the Simple Wikipedia dataset from Hugging Face
dataset = load_dataset("pszemraj/simple_wikipedia", split="validation[:100]")
docs = [
    Document(
        content=doc["text"],
        meta={
            "title": doc["title"],
            "url": doc["url"],
        },
    )
    for doc in dataset
]
Index the documents to the InMemoryDocumentStore using the SentenceTransformersDocumentEmbedder and DocumentWriter:
doc_store = InMemoryDocumentStore(embedding_similarity_function="cosine")
# Install sentence transformers using "pip install sentence-transformers"
doc_embedder = SentenceTransformersDocumentEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")
# Indexing Pipeline
indexing_pipeline = Pipeline()
indexing_pipeline.add_component(instance=doc_embedder, name="DocEmbedder")
indexing_pipeline.add_component(instance=DocumentWriter(document_store=doc_store), name="DocWriter")
indexing_pipeline.connect("DocEmbedder", "DocWriter")
indexing_pipeline.run({"DocEmbedder": {"documents": docs}})
Create the RAG pipeline and add the LlamaCppChatGenerator to it:
system_message = ChatMessage.from_system(
    """
    Answer the question using the provided context.
    Context:
    {% for doc in documents %}
        {{ doc.content }}
    {% endfor %}
    """
)
user_message = ChatMessage.from_user("Question: {{question}}")
assistant_message = ChatMessage.from_assistant("Answer: ")
chat_template = [system_message, user_message, assistant_message]
rag_pipeline = Pipeline()
text_embedder = SentenceTransformersTextEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")
# Load the LLM using LlamaCppChatGenerator
model_path = "openchat-3.5-1210.Q3_K_S.gguf"
generator = LlamaCppChatGenerator(model=model_path, n_ctx=4096, n_batch=128)
rag_pipeline.add_component(instance=text_embedder, name="text_embedder")
rag_pipeline.add_component(instance=InMemoryEmbeddingRetriever(document_store=doc_store, top_k=3), name="retriever")
rag_pipeline.add_component(instance=ChatPromptBuilder(template=chat_template), name="prompt_builder")
rag_pipeline.add_component(instance=generator, name="llm")
rag_pipeline.add_component(instance=AnswerBuilder(), name="answer_builder")
rag_pipeline.connect("text_embedder", "retriever")
rag_pipeline.connect("retriever", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder", "llm")
rag_pipeline.connect("llm", "answer_builder")
rag_pipeline.connect("retriever", "answer_builder.documents")
Run the pipeline:
question = "Which year did the Joker movie release?"
result = rag_pipeline.run(
    {
        "text_embedder": {"text": question},
        "prompt_builder": {"question": question},
        "llm": {"generation_kwargs": {"max_tokens": 128, "temperature": 0.1}},
        "answer_builder": {"query": question},
    }
)
generated_answer = result["answer_builder"]["answers"][0]
print(generated_answer.data)
# The Joker movie was released on October 4, 2019.