Secret Management
This page covers secret management in Haystack components and introduces the Secret type for structured secret handling. It explains the drawbacks of hard-coding secrets in code and suggests using environment variables instead.
Many Haystack components interact with third-party frameworks and service providers such as Azure, Google Vertex AI, and Hugging Face. Their libraries often require users to authenticate before they can access the underlying product. Authentication usually relies on a secret value that acts as an opaque identifier to the third-party backend. This secret must be handled with caution.
Example Use Case
The Problem
Let’s consider an example RAG Pipeline that embeds a query, uses a Retriever component to locate Documents relevant to the query, and then leverages an LLM to generate an answer based on the retrieved Documents.
The OpenAIGenerator component used in the above Pipeline expects an API key to authenticate with OpenAI’s servers and perform the generation. Let’s assume that the component accepts a str value for it:
generator = OpenAIGenerator(model="gpt-4", api_key="sk-xxxxxxxxxxxxxxxxxx")
pipeline.add_component("generator", generator)
This works in a pinch, but it is bad practice: secrets shouldn’t be hard-coded in the codebase. An alternative is to store the key in an environment variable, read it in Python, and pass it to the component:
import os
api_key = os.environ.get("OPENAI_API_KEY")
generator = OpenAIGenerator(model="gpt-4", api_key=api_key)
pipeline.add_component("generator", generator)
This is better – the Pipeline works as intended, and we aren’t hard-coding any secrets in the code.
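One caveat with the snippet above: os.environ.get returns None when the variable is unset, so a missing key only surfaces later, when the component tries to authenticate. A minimal sketch of a fail-fast lookup follows. The helper read_api_key and its env parameter are hypothetical names used for illustration; they are not part of Haystack.

```python
import os
from typing import Mapping, Optional


def read_api_key(name: str = "OPENAI_API_KEY", env: Optional[Mapping[str, str]] = None) -> str:
    """Return the secret stored under `name`, failing early if it is missing or empty.

    `read_api_key` is a hypothetical helper, not a Haystack function.
    `env` defaults to the process environment; passing a dict makes testing easier.
    """
    source = os.environ if env is None else env
    value = source.get(name)
    if not value:
        # Raise at startup instead of passing None to the component.
        raise RuntimeError(f"Environment variable {name} is not set")
    return value
```

The returned string can then be passed as api_key in the same way as in the snippet above.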
Remember that Pipelines are serializable. Since the API key is a secret, we should definitely avoid saving it to disk. Let’s modify the component’s to_dict method to exclude the key:
def to_dict(self) -> Dict[str, Any]:
    # Do not pass the `api_key` init parameter.
    return default_to_dict(self, model=self.model)
But what happens when the Pipeline is loaded from disk? In the best-case scenario, the component’s backend will automatically try to read the key from a hard-coded environment variable, and that key is the same as the one that was passed to the component before it was serialized. But in a worse case, the backend doesn’t look up the key in a hard-coded environment variable and fails when it gets called inside a pipeline.run() invocation.
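The round trip described above can be sketched with the standard library alone. ToyGenerator is a hypothetical stand-in for a component, not a Haystack class, and the dictionary layout is assumed for illustration only. The sketch shows that once the key is left out of to_dict, the restored object has no way to recover it.

```python
import json
from typing import Any, Dict, Optional


class ToyGenerator:
    """Hypothetical stand-in for a component; not a Haystack class."""

    def __init__(self, model: str, api_key: Optional[str] = None):
        self.model = model
        self.api_key = api_key

    def to_dict(self) -> Dict[str, Any]:
        # The key is deliberately left out so it never reaches the disk.
        return {"init_parameters": {"model": self.model}}

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "ToyGenerator":
        return cls(**data["init_parameters"])


original = ToyGenerator(model="gpt-4", api_key="sk-example")
serialized = json.dumps(original.to_dict())
restored = ToyGenerator.from_dict(json.loads(serialized))

print(serialized)        # the serialized form contains no key
print(restored.api_key)  # None
```

The restored object keeps its model but not its key, so whether it still works depends entirely on what the backend does when it receives no key.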
The Solution
Structured Secret Handling
To avoid such problematic situations, Haystack introduces a simple Secret type. This type is used by components that require authentication, and it provides a consistent API that ensures that secrets do not get unwittingly serialized to disk. It also ensures that the user explicitly chooses how the secret is passed to the component.
The Secret class has two class methods:
- from_token: Accepts a bare string token that acts as the secret.
- from_env: Accepts the names of one or more environment variables that can contain the secret.
Both of those methods return an instance of the Secret type. The instance has an additional resolve_value method that either returns a str that represents the secret or None if the resolution fails. The return value of the above method can be passed to the backend expecting the secret.
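To make the resolution behavior concrete, here is a simplified stand-in written with the standard library only. ToySecret is a hypothetical class that mirrors the method names described above; it is not Haystack's implementation. Returning the first environment variable that is set is an assumption made for this sketch.

```python
import os
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ToySecret:
    """Simplified stand-in for illustration; not Haystack's Secret implementation."""

    token: Optional[str] = None
    env_vars: Tuple[str, ...] = ()

    @classmethod
    def from_token(cls, token: str) -> "ToySecret":
        return cls(token=token)

    @classmethod
    def from_env(cls, *names: str) -> "ToySecret":
        return cls(env_vars=names)

    def resolve_value(self) -> Optional[str]:
        if self.token is not None:
            return self.token
        # Assumption: return the first variable that is set, or None if none are.
        for name in self.env_vars:
            value = os.environ.get(name)
            if value is not None:
                return value
        return None


# Usage: the variable names below are made up for the example.
os.environ.pop("TOY_UNSET_KEY", None)
os.environ["TOY_PRIMARY_KEY"] = "from-env"
print(ToySecret.from_token("abc").resolve_value())                             # abc
print(ToySecret.from_env("TOY_UNSET_KEY", "TOY_PRIMARY_KEY").resolve_value())  # from-env
print(ToySecret.from_env("TOY_UNSET_KEY").resolve_value())                     # None
```

Because the caller has to pick either from_token or from_env, the way the secret reaches the component is always an explicit choice.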
Components that accept one or more Secrets as init parameters must serialize them in their to_dict method and deserialize them in their from_dict class method. Token-based Secret instances cannot be serialized; attempting to do so raises an exception. Environment-based Secret instances can be serialized by default.
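The serialization rule can be sketched in the same spirit. SerializableToySecret is a hypothetical class, and the dictionary layout and the choice of ValueError are assumptions made for illustration, not Haystack's actual format. The point is that only the names of the environment variables are written out, never a secret value.

```python
from typing import Any, Dict, Optional, Tuple


class SerializableToySecret:
    """Hypothetical stand-in; the dict layout below is assumed, not Haystack's."""

    def __init__(self, token: Optional[str] = None, env_vars: Tuple[str, ...] = ()):
        self.token = token
        self.env_vars = env_vars

    def to_dict(self) -> Dict[str, Any]:
        if self.token is not None:
            # A bare token is the secret itself, so refuse to write it out.
            raise ValueError("Token-based secrets cannot be serialized")
        # Only variable names are stored; the value is looked up again after loading.
        return {"type": "env_var", "env_vars": list(self.env_vars)}

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "SerializableToySecret":
        return cls(env_vars=tuple(data["env_vars"]))


env_secret = SerializableToySecret(env_vars=("OPENAI_API_KEY",))
print(env_secret.to_dict())  # contains the variable name only

try:
    SerializableToySecret(token="sk-example").to_dict()
except ValueError as err:
    print(f"refused: {err}")
```

A pipeline saved this way stays safe to commit or share, and the secret is resolved from the environment of whichever machine loads it.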
Example
from typing import Optional

from haystack import component, default_from_dict, default_to_dict
from haystack.utils import Secret, deserialize_secrets_inplace


@component
class MyComponent:
    def __init__(self, api_key: Optional[Secret] = None, **kwargs):
        self.api_key = api_key
        self.backend = None

    def warm_up(self):
        # Call resolve_value to yield a single result. The semantics of the result are policy-dependent.
        # Currently, all supported policies will return a single string token.
        self.backend = SomeBackend(api_key=self.api_key.resolve_value() if self.api_key else None, ...)

    def to_dict(self):
        # Serialize the policy like any other (custom) data. If the policy is token-based, it will
        # raise an error.
        return default_to_dict(self, api_key=self.api_key.to_dict() if self.api_key else None, ...)

    @classmethod
    def from_dict(cls, data):
        # Deserialize the policy data before passing it to the generic from_dict function.
        api_key_data = data["init_parameters"]["api_key"]
        api_key = Secret.from_dict(api_key_data) if api_key_data is not None else None
        data["init_parameters"]["api_key"] = api_key
        # Alternatively, use the helper function.
        # deserialize_secrets_inplace(data["init_parameters"], keys=["api_key"])
        return default_from_dict(cls, data)


# No authentication.
component = MyComponent(api_key=None)

# Token based authentication
component = MyComponent(api_key=Secret.from_token("sk-randomAPIkeyasdsa32ekasd32e"))
component.to_dict()  # Error! Can't serialize authentication tokens

# Environment variable based authentication
component = MyComponent(api_key=Secret.from_env("OPENAI_API_KEY"))
component.to_dict()  # This is fine