Secret Management
This page covers secret management in Haystack components and introduces the Secret type for structured secret handling. It explains the drawbacks of hard-coding secrets in code and recommends using environment variables instead.
Many Haystack components interact with third-party frameworks and service providers such as Azure, Google Vertex AI, and OpenAI. Their libraries often require the user to authenticate themselves to ensure they receive access to the underlying product. The authentication process usually works with a secret value that acts as an opaque identifier to the third-party backend.
This page describes the two main types of secrets, token-based and environment variable-based, and how to handle them when using Haystack.
You can find additional details for the Secret class in our API reference.
Example Use Case - Problem Statement
Let’s consider an example RAG pipeline that embeds a query, uses a Retriever component to locate documents relevant to the query, and then leverages an LLM to generate an answer based on the retrieved documents.
The OpenAIGenerator component used in the pipeline below expects an API key to authenticate with OpenAI’s servers and perform the generation. Let’s assume that the component accepts a str value for it:
generator = OpenAIGenerator(model="gpt-4", api_key="sk-xxxxxxxxxxxxxxxxxx")
pipeline.add_component("generator", generator)
This works in a pinch, but it is bad practice: secrets like this should not be hard-coded in the codebase. An alternative is to store the key in an environment variable, read it in Python, and pass it to the component:
import os
api_key = os.environ.get("OPENAI_API_KEY")
generator = OpenAIGenerator(model="gpt-4", api_key=api_key)
pipeline.add_component("generator", generator)
This is better – the pipeline works as intended, and we aren’t hard-coding any secrets in the code.
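One caveat with this approach is that os.environ.get returns None when the variable is unset, so a missing key only surfaces later, when the first request fails. The following is a minimal sketch of a fail-fast lookup using only the standard library. The helper name read_required_env and the variable name DEMO_API_KEY are hypothetical and not part of Haystack:

```python
import os


def read_required_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name!r} is not set")
    return value


# Demonstration with a throwaway variable name (hypothetical).
os.environ["DEMO_API_KEY"] = "sk-demo"
api_key = read_required_env("DEMO_API_KEY")

# Once the variable is removed, the helper raises instead of returning None.
del os.environ["DEMO_API_KEY"]
try:
    read_required_env("DEMO_API_KEY")
    missing_detected = False
except RuntimeError:
    missing_detected = True
```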
Remember that pipelines are serializable. Since the API key is a secret, we should definitely avoid saving it to disk. Let’s modify the component’s to_dict method to exclude the key:
def to_dict(self) -> Dict[str, Any]:
    # Do not pass the `api_key` init parameter.
    return default_to_dict(self, model=self.model)
But what happens when the pipeline is loaded from disk? In the best-case scenario, the component’s backend will automatically try to read the key from a hard-coded environment variable, and that key is the same as the one that was passed to the component before it was serialized. But in a worse case, the backend doesn’t look up the key in a hard-coded environment variable and fails when it gets called inside a pipeline.run() invocation.
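The failure mode described above can be reproduced without any Haystack code. In this sketch, ToyGenerator is a hypothetical stand-in for a component whose to_dict omits the key. It is not a Haystack class, and the dictionary layout is only loosely modeled on the examples on this page:

```python
from typing import Any, Dict, Optional


class ToyGenerator:
    """Hypothetical stand-in for a component that takes a plain-string API key."""

    def __init__(self, model: str, api_key: Optional[str] = None):
        self.model = model
        self.api_key = api_key

    def to_dict(self) -> Dict[str, Any]:
        # The key is deliberately left out so it never reaches disk.
        return {"init_parameters": {"model": self.model}}

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "ToyGenerator":
        return cls(**data["init_parameters"])


original = ToyGenerator(model="gpt-4", api_key="sk-demo")
restored = ToyGenerator.from_dict(original.to_dict())

# The model survives the round trip, but the key is gone.
print(restored.model, restored.api_key)
```

Nothing in the serialized data records where the key should come from, which is the gap the Secret type is meant to fill.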
Import
To use Haystack secrets in your code, first import the Secret class:
from haystack.utils import Secret
Token-Based Secrets
You can pass a token directly as a string using the from_token method:
llm = OpenAIGenerator(api_key=Secret.from_token("sk-randomAPIkeyasdsa32ekasd32e"))
Note that a component configured this way cannot be serialized, meaning you can't convert the above component to a dictionary or save a pipeline containing it to a YAML file. This is a security feature to prevent accidental exposure of sensitive data.
Environment Variable-Based Secrets
Environment variable-based secrets are more flexible. They allow you to specify one or more environment variables that may contain your secret.
Existing Haystack components that require an API key (like OpenAIGenerator) default to Secret.from_env_var with a predefined variable name (in this case, OPENAI_API_KEY). This means that the OpenAIGenerator will look for the value of the environment variable OPENAI_API_KEY (if it exists) and use it for authentication. When pipelines are serialized to YAML, only the name of the environment variable is saved to the YAML file, never its value. Because this avoids leaking the secret, this method is strongly recommended.
# First, export an environment variable named `OPENAI_API_KEY` with its value
export OPENAI_API_KEY=sk-randomAPIkeyasdsa32ekasd32e
# or alternatively, using Python
# import os
# os.environ["OPENAI_API_KEY"] = "sk-randomAPIkeyasdsa32ekasd32e"
llm_generator = OpenAIGenerator() # Uses the default value from the env var for the component
Alternatively, in components where a Secret is expected, you can customize the name of the environment variable from which the API Key is to be read.
# Export an environment variable with custom name and its value
llm_generator = OpenAIGenerator(api_key=Secret.from_env_var("YOUR_ENV_VAR"))
When OpenAIGenerator is serialized within a pipeline, this is what the YAML code will look like, using the custom variable name:
components:
  llm:
    init_parameters:
      api_base_url: null
      api_key:
        env_vars:
        - YOUR_ENV_VAR
        strict: true
        type: env_var
      generation_kwargs: {}
      model: gpt-4o-mini
      organization: null
      streaming_callback: null
      system_prompt: null
    type: haystack.components.generators.openai.OpenAIGenerator
...
Serialization
While token-based secrets cannot be serialized, environment variable-based secrets can be converted to and from dictionaries:
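# An environment variable-based secret to work with
env_secret = Secret.from_env_var("OPENAI_API_KEY")
# Convert to dictionary
env_secret_dict = env_secret.to_dict()
# Create from dictionary
new_env_secret = Secret.from_dict(env_secret_dict)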
Resolving Secrets
Both types of secrets can be resolved to their actual values using the resolve_value method. This method returns the token or the value of the environment variable.
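# One secret of each type
api_key_secret = Secret.from_token("sk-randomAPIkeyasdsa32ekasd32e")
env_secret = Secret.from_env_var("OPENAI_API_KEY")
# Resolve the token-based secret
token_value = api_key_secret.resolve_value()
# Resolve the environment variable-based secret
env_value = env_secret.resolve_value()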
Custom Component Example
Here is a complete example that shows how to create a component that uses the Secret class in Haystack, highlighting the differences between token-based and environment variable-based authentication, and showing that token-based secrets cannot be serialized:
from typing import Optional

from haystack import component, default_from_dict, default_to_dict
from haystack.utils import Secret, deserialize_secrets_inplace


@component
class MyComponent:
    def __init__(self, api_key: Optional[Secret] = None, **kwargs):
        self.api_key = api_key
        self.backend = None

    def warm_up(self):
        # Call resolve_value to yield a single result. The semantics of the result are policy-dependent.
        # Currently, all supported policies will return a single string token.
        # SomeBackend is a placeholder for the third-party client; `...` stands for its other arguments.
        self.backend = SomeBackend(api_key=self.api_key.resolve_value() if self.api_key else None, ...)

    def to_dict(self):
        # Serialize the policy like any other (custom) data. If the policy is token-based, it will
        # raise an error.
        return default_to_dict(self, api_key=self.api_key.to_dict() if self.api_key else None, ...)

    @classmethod
    def from_dict(cls, data):
        # Deserialize the policy data before passing it to the generic from_dict function.
        api_key_data = data["init_parameters"]["api_key"]
        api_key = Secret.from_dict(api_key_data) if api_key_data is not None else None
        data["init_parameters"]["api_key"] = api_key
        # Alternatively, use the helper function.
        # deserialize_secrets_inplace(data["init_parameters"], keys=["api_key"])
        return default_from_dict(cls, data)


# No authentication.
component = MyComponent(api_key=None)

# Token-based authentication
component = MyComponent(api_key=Secret.from_token("sk-randomAPIkeyasdsa32ekasd32e"))
component.to_dict()  # Error! Can't serialize authentication tokens

# Environment variable-based authentication
component = MyComponent(api_key=Secret.from_env_var("OPENAI_API_KEY"))
component.to_dict()  # This is fine